Re: [OMPI devel] PMIX vs Solaris

2015-09-29 Thread Gilles Gouaillardet
Paul, the latest master nightly snapshot does include the fix, and I made PRs for v2.x and v1.10 Cheers, Gilles On 9/28/2015 6:29 PM, Gilles Gouaillardet wrote: Thanks Brice, I will do the PR for the various ompi branches from tomorrow Cheers, Gilles Brice Goglin wrote: Sorry, I didn't

Re: [OMPI devel] PMIX vs Solaris

2015-09-28 Thread Gilles Gouaillardet
Thanks Brice, I will do the PR for the various ompi branches from tomorrow Cheers, Gilles Brice Goglin wrote: >Sorry, I didn't see this report before the pull request. > >I applied Gilles' "simple but arguable" fix to master and stable branches up >to v1.9. It could be too imperfect if somebo

Re: [OMPI devel] PMIX vs Solaris

2015-09-28 Thread Brice Goglin
Sorry, I didn't see this report before the pull request. I applied Gilles' "simple but arguable" fix to master and stable branches up to v1.9. It could be too imperfect if somebody ever changes the permissions of /devices/pci* but I guess that's not going to happen in practice. Finding the right de

Re: [OMPI devel] PMIX vs Solaris

2015-09-28 Thread Gilles Gouaillardet
Paul and Brice, the error message is displayed by libpciaccess when hwloc invokes pci_system_init on Solaris : crw--- 1 root sys 182, 253 Sep 28 10:55 /devices/pci@0,0:reg from libpciaccess snprintf(nexus_path, sizeof(nexus_path), "/devices%s", nexus_name); if ((fd = op
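The snippet above is cut off, and the thread does not show the fix that was eventually applied. As a rough illustration only, one way to avoid the "Permission denied" message is to probe the nexus device quietly before calling into libpciaccess. The helper name nexus_is_readable and the buffer size are hypothetical; only the "/devices%s" path construction is taken from the quoted code.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not part of hwloc or libpciaccess.
 * Builds "/devices<nexus_name>" the same way the quoted snippet does and
 * checks whether the current user can open it read-only.
 * Returns 1 if the device opens, 0 otherwise (errno says why).
 * Nothing is printed, so a non-root run stays silent.
 */
static int nexus_is_readable(const char *nexus_name)
{
    char path[1024]; /* assumed size, chosen for illustration */
    int n = snprintf(path, sizeof(path), "/devices%s", nexus_name);

    if (n < 0 || (size_t)n >= sizeof(path)) {
        errno = ENAMETOOLONG;
        return 0;
    }

    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return 0; /* e.g. EACCES for a non-root user, ENOENT elsewhere */
    }
    close(fd);
    return 1;
}
```

A caller could skip PCI discovery when nexus_is_readable("/pci@0,0:reg") returns 0, at the cost of one extra open() per run.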

Re: [OMPI devel] PMIX vs Solaris

2015-09-25 Thread Paul Hargrove
FYI: Things look fine today with last night's master tarball. I hope Brice has a way to eliminate the hwloc warning, since I am sure I am not the only one with scripts that will notice "Error" in the output. -Paul On Wed, Sep 23, 2015 at 6:08 PM, Ralph Castain wrote: > Aha! Thanks - just what

Re: [OMPI devel] PMIX vs Solaris

2015-09-23 Thread Ralph Castain
Aha! Thanks - just what the doctor ordered! > On Sep 23, 2015, at 5:45 PM, Gilles Gouaillardet wrote: > > Ralph, > > the root cause is > getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...) > fails with errno ENOPROTOOPT on Solaris 11.2 > > the attached patch is a proof of concept and works for me :

Re: [OMPI devel] PMIX vs Solaris

2015-09-23 Thread Gilles Gouaillardet
Ralph, the root cause is getsockopt(..., SOL_SOCKET, SO_RCVTIMEO,...) fails with errno ENOPROTOOPT on Solaris 11.2 the attached patch is a proof of concept and works for me : /* if ENOPROTOOPT, do not try to set and restore SO_RCVTIMEO */ Cheers, Gilles On 9/21/2015 2:16 PM, Paul Hargrove wro
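The attached patch is not reproduced in this listing. The idea in its comment (skip setting and restoring SO_RCVTIMEO when the option is unsupported) could look roughly like the sketch below. The type rcvtimeo_status and the functions set_temp_rcv_timeout and restore_rcv_timeout are hypothetical names for illustration, not PMIx code.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical status type: what happened when we tried to set a timeout. */
typedef enum {
    RCVTIMEO_SAVED,        /* old value saved, temporary value installed */
    RCVTIMEO_UNSUPPORTED,  /* ENOPROTOOPT: leave the socket untouched */
    RCVTIMEO_ERROR         /* any other failure, errno is set */
} rcvtimeo_status;

/*
 * Save the current receive timeout in *saved and install a temporary one.
 * If the platform reports ENOPROTOOPT (as described for Solaris 11.2 in
 * this thread), report "unsupported" instead of treating it as fatal.
 */
static rcvtimeo_status set_temp_rcv_timeout(int sd, long seconds,
                                            struct timeval *saved)
{
    socklen_t len = sizeof(*saved);
    struct timeval tv;

    if (getsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, &len) != 0) {
        return (errno == ENOPROTOOPT) ? RCVTIMEO_UNSUPPORTED : RCVTIMEO_ERROR;
    }

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    if (setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0) {
        return (errno == ENOPROTOOPT) ? RCVTIMEO_UNSUPPORTED : RCVTIMEO_ERROR;
    }
    return RCVTIMEO_SAVED;
}

/* Restore the saved timeout only if one was actually saved earlier. */
static int restore_rcv_timeout(int sd, rcvtimeo_status st,
                               const struct timeval *saved)
{
    if (st != RCVTIMEO_SAVED) {
        return 0; /* nothing was changed, so nothing to restore */
    }
    return setsockopt(sd, SOL_SOCKET, SO_RCVTIMEO, saved, sizeof(*saved));
}
```

With this shape, a caller would proceed with the connection handshake when the status is RCVTIMEO_UNSUPPORTED and only abort on RCVTIMEO_ERROR.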

Re: [OMPI devel] PMIX vs Solaris

2015-09-21 Thread Paul Hargrove
Ralph, Just as you say: The first 64s pause was before the hwloc error message appeared. The second was after the second server_setup_fork appears, and before whatever line came after that. I don't know if stdio buffering may be "distorting" the placement of the pause relative to the lines of outp

Re: [OMPI devel] PMIX vs Solaris

2015-09-21 Thread Ralph Castain
?? Just so this old fossilized brain gets this right: you are saying there was a 64s pause before the hwloc error appeared, and then another 64s pause after the second server_setup_fork message appeared? If that’s true, then I’m chasing the wrong problem - it sounds like something is messed up

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Paul Hargrove
Ralph, Still failing with that patch, but with the addition of a fairly long pause (64s) before the first error message appears, and again after the second "server setup_fork" (64s again) New output is attached. -Paul On Sun, Sep 20, 2015 at 2:15 PM, Ralph Castain wrote: > Argh - found a typo

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Ralph Castain
Argh - found a typo in the output line. Could you please try the attached patch and do it again? This might fix it, but if not it will provide me with some idea of the returned error. Thanks Ralph paul.diff Description: Binary data On Sep 20, 2015, at 12:40 PM, Paul Hargrove wro

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Paul Hargrove
Yes, it is definitely at 10. Another attempt is attached. -Paul On Sun, Sep 20, 2015 at 8:19 AM, Ralph Castain wrote: > Paul - can you please confirm that you gave mpirun a level of 10 for the > pmix_base_verbose param? This output isn’t what I would have expected from > that level - it looks mo

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Ralph Castain
Paul - can you please confirm that you gave mpirun a level of 10 for the pmix_base_verbose param? This output isn’t what I would have expected from that level - it looks more like the verbosity was set to 5, and so the error number isn’t printed. Thanks Ralph > On Sep 20, 2015, at 3:42 AM, Gi

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Gilles Gouaillardet
Paul, I do not remember it like that ... at that time, the issue in ompi was that the global errno was used instead of the per-thread errno. Though the man page says -mt should be used for multithreaded apps, you tried -D_REENTRANT on all your platforms, and it was enough to get the expected re

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Paul Hargrove
Gilles, Yes every $CC invocation in opal/mca/pmix/pmix1xx includes "-D_REENTRANT". However, they don't include "-mt". I believe we concluded (when we had problems previously) that "-mt" was the proper flag (at compile and link) for multi-threaded with the Studio compilers. -Paul On Sat, Sep 19,

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Gilles Gouaillardet
Paul, Can you please double check pmix1xx is compiled with -D_REENTRANT ? We ran into similar issues in the past, and they only occurred with Solaris Cheers, Gilles On Sunday, September 20, 2015, Paul Hargrove wrote: > Ralph, > The output from the requested run is attached. > -Paul > > On Sat

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Paul Hargrove
Ralph, The output from the requested run is attached. -Paul On Sat, Sep 19, 2015 at 9:46 PM, Ralph Castain wrote: > Ah, okay - that makes more sense. I’ll have to let Brice see if he can > figure out how to silence the hwloc error message as I can’t find where it > came from. The other errors ar

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Ralph Castain
Ah, okay - that makes more sense. I’ll have to let Brice see if he can figure out how to silence the hwloc error message as I can’t find where it came from. The other errors are real and are the reason why the job was terminated. The problem is that we are trying to establish a communication bet

Re: [OMPI devel] PMIX vs Solaris

2015-09-20 Thread Paul Hargrove
Ralph, No it did not run. The complete output (which I really should have included in the first place) is below. -Paul $ mpirun -mca btl sm,self -np 2 examples/ring_c' Error opening /devices/pci@0,0:reg: Permission denied [pcp-d-3:26054] PMIX ERROR: ERROR in file /export/home/phargrov/OMPI/openm

Re: [OMPI devel] PMIX vs Solaris

2015-09-19 Thread Ralph Castain
Paul, can you clarify something for me? The error in this case indicates that the client wasn’t able to reach the daemon - this should have resulted in termination of the job. Did the job actually run? > On Sep 18, 2015, at 2:50 AM, Ralph Castain wrote: > > I'm on travel right now, but it sho

Re: [OMPI devel] PMIX vs Solaris

2015-09-18 Thread Ralph Castain
I'm on travel right now, but it should be an easy fix when I return. Sorry for the annoyance On Thu, Sep 17, 2015 at 11:13 PM, Paul Hargrove wrote: > Any suggestion how I (as a non-root user) can avoid seeing this hwloc > error message on every run? > > -Paul > > On Thu, Sep 17, 2015 at 11:00 P

Re: [OMPI devel] PMIX vs Solaris

2015-09-18 Thread Paul Hargrove
Any suggestion how I (as a non-root user) can avoid seeing this hwloc error message on every run? -Paul On Thu, Sep 17, 2015 at 11:00 PM, Gilles Gouaillardet wrote: > Paul, > > IIRC, the "Permission denied" is coming from hwloc that cannot collect all > the info it would like. > > Cheers, > > G

Re: [OMPI devel] PMIX vs Solaris

2015-09-18 Thread Gilles Gouaillardet
Paul, IIRC, the "Permission denied" is coming from hwloc that cannot collect all the info it would like. Cheers, Gilles On 9/18/2015 2:34 PM, Paul Hargrove wrote: Tried tonight's master tarball on Solaris 11.2 on x86-64 with the Studio Compilers (default ILP32 output) and saw the following

[OMPI devel] PMIX vs Solaris

2015-09-18 Thread Paul Hargrove
Tried tonight's master tarball on Solaris 11.2 on x86-64 with the Studio Compilers (default ILP32 output) and saw the following result $ mpirun -mca btl sm,self -np 2 examples/ring_c' Error opening /devices/pci@0,0:reg: Permission denied [pcp-d-4:00492] PMIX ERROR: ERROR in file /export/home/phar