[OMPI devel] [1.8.4rc3+patches] Solaris status summary

2014-12-12 Thread Paul Hargrove
It appears that with Ralph's oob_tcp patches (paul.diff) everything is now OK on Solaris-11/x86-64. On Solaris-10/SPARC I needed to fix guess_strlen() (or change "%u" to "%d" to avoid the issue) or else I didn't get very far at all (SEGV in orterun). However, with that issue resolved things are st

Re: [OMPI devel] [OMPI commits] Git: open-mpi/ompi branch master updated. dev-509-g38d6627

2014-12-12 Thread Ralph Castain
Nathan - does this need to come to 1.8.4? Or do you want to go with Paul’s suggested fix? > On Dec 12, 2014, at 8:09 AM, git...@crest.iu.edu wrote: > > This is an automated email from the git hooks/post-receive script. It was > generated because a ref change was pushed to the repository containi

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
I suspect we’ll just remove it, but I want to give the other developers a chance to chime in before doing so. > On Dec 12, 2014, at 6:07 PM, Paul Hargrove wrote: > > Ralph, > > If preserved at all, the existing code should probably be made to act more > intelligently when it encounters an unk

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove
Ralph, The Solaris-11/x86-64 system is now "all good" with those changes. Works with "-mca oob_tcp_if_include bge0", "-mca oob_tcp_if_exclude bge0" and with neither. I will next check whether this fixes the interrupted select warnings seen on Solaris-10/SPARC. -Paul On Fri, Dec 12, 2014 at 5:17 PM, Ralph

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
Ralph, If preserved at all, the existing code should probably be made to act more intelligently when it encounters an unknown escape code. I would suggest advancing the length by some value (say 128?) that should be "big enough" and printing a prominent warning. So, the next time this bug surfac

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Looking at the comments in the code, it appears that the rationale when written was to provide support for REALLY ancient systems that didn’t have some of these functions. Since that time, we added a configure check for vsnprintf, so I’m adding Paul/Larry’s suggested code, protected by that conf

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker
On 12 Dec 2014, at 5:22 PM, Paul Hargrove wrote: > HOWEVER, while the patch catches the "%u" case, there are plenty of potential > ways to hit the same problem if, for instance, one uses "%zu" for size_t. > Additionally, I've already noted that the code for "%ld", "%lx", "%lX", "%lf" > are all

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Larry Baker
Or, slightly modified using a defensive coding style: > return 1 + vsnprintf(dummy, sizeof( dummy ), fmt, ap); if you like sizeof() [which I prefer]. if you like sizeof: > return 1 + vsnprintf(dummy, sizeof dummy, fmt, ap); > Larry Baker US Geological Survey 650-329-5608 ba...@usgs.gov
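The vsnprintf() approach suggested above can be sketched as follows. This is a minimal illustration, not the Open MPI code: the helper name formatted_size() is hypothetical, and it assumes a C99-conforming vsnprintf() that returns the length the full output would have had even when the buffer is too small.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not part of Open MPI): returns the number of bytes,
 * including the terminating NUL, needed to hold the formatted output,
 * or -1 if vsnprintf() reports an error. */
static int formatted_size(const char *fmt, ...)
{
    char dummy[1];   /* tiny scratch buffer; the output is truncated on purpose */
    va_list ap;
    int n;

    assert(fmt != NULL);
    va_start(ap, fmt);
    /* Assumes C99 semantics: the return value is the untruncated length. */
    n = vsnprintf(dummy, sizeof dummy, fmt, ap);
    va_end(ap);
    return (n < 0) ? -1 : n + 1;
}
```

Because the C library parses the format string itself, conversions such as "%u" or "%zu" need no special handling in the caller; a buffer of the returned size can then be allocated and filled with a second vsnprintf() call.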

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
It’s a fair question - that code is ancient, however, so I’m surprised it has only surfaced now as a problem. I can take a look at making the change > On Dec 12, 2014, at 5:22 PM, Paul Hargrove wrote: > > OK, applying my attached patch (based on Gilles's observation) resolved the > problem! >

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
OK, applying my attached patch (based on Gilles's observation) resolved the problem! So I fully expect Ralph's plan to use "%d" to also resolve this. HOWEVER, while the patch catches the "%u" case, there are plenty of potential ways to hit the same problem if, for instance, one uses "%zu" for size

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain
No need for autogen - simple change to a couple of files (attachment: paul.diff). On Dec 12, 2014, at 4:38 PM, Paul Hargrove wrote: Ralph, Patches to code are fine, but I am not equipped to autogen. -Paul On Fri, Dec 12, 2014 at 4:37 PM, Ralph Castain

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove
Ralph, Patches to *code* are fine, but I am not equipped to autogen. -Paul On Fri, Dec 12, 2014 at 4:37 PM, Ralph Castain wrote: > Would you be open to a patch you can test instead of me rolling an rc? I'd > be happy to send one in a while > > On Dec 12, 2014, at 4:34 PM, Ralph Castain wrote:

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain
Would you be open to a patch you can test instead of me rolling an rc? I’d be happy to send one in a while > On Dec 12, 2014, at 4:34 PM, Ralph Castain wrote: > > I’m hoping it will fix it. The timeout code was the only change from 1.8.3 > besides the loopback warning, so it should restore the

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain
I’m hoping it will fix it. The timeout code was the only change from 1.8.3 besides the loopback warning, so it should restore the prior behavior. > On Dec 12, 2014, at 4:32 PM, Paul Hargrove wrote: > > > On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain > wrote: > All

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove
On Fri, Dec 12, 2014 at 4:29 PM, Ralph Castain wrote: > All right - I'll surrender and remove the timeout. Will release rc4 later > tonight. > > Sorry for putting you thru this Paul - for some reason, these problems > aren't showing up elsewhere. > Even at a 300s timeout I don't get a connection

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Crud - sorry for delayed response. I was out for a bit. I’ll just change it to %d as there is nothing magic about it being unsigned. How bizarre. > On Dec 12, 2014, at 3:21 PM, Paul Hargrove wrote: > > NOTE: > > The existing code for "%l." in guess_strlen() is garbage. > The va_arg() macro c

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain
All right - I’ll surrender and remove the timeout. Will release rc4 later tonight. Sorry for putting you thru this Paul - for some reason, these problems aren’t showing up elsewhere. > On Dec 12, 2014, at 3:37 PM, Paul Hargrove wrote: > > > > On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove
On Fri, Dec 12, 2014 at 2:58 PM, Ralph Castain wrote: > Aha! You are the first to fall thru the timeout. How interesting. > When it comes to the release candidates, I seem to own a lot of "firsts". It is not as fun as one might imagine :-). Can you please try adding "-mca oob_tcp_connect_timeou

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
NOTE: The existing code for "%l." in guess_strlen() is garbage. The va_arg() macro calls all have "int" for the type!! I am *only* testing a fix for the missing "%u" at the moment. -Paul On Fri, Dec 12, 2014 at 3:14 PM, Paul Hargrove wrote: > Thanks, Gilles! > > I was looking at that same cod

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
Thanks, Gilles! I was looking at that same code just now and completely missed the lack of a case for '%u' (and '%lu'). I will add one now and see if that resolves the problem -Paul On Fri, Dec 12, 2014 at 3:10 PM, Gilles Gouaillardet < gilles.gouaillar...@gmail.com> wrote: > Ralph, > > I

Re: [OMPI devel] OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Gilles Gouaillardet
Ralph, I cannot find a case for the %u format in guess_strlen. And since the default does not invoke va_arg(), it seems strlen is invoked on nnuma instead of arch. Makes sense? Cheers, Gilles Ralph Castain wrote: >Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address >do
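The failure mode described above (a conversion that is skipped without consuming its argument, so a later "%s" reads the wrong value) can be illustrated with a small sketch. This is not the Open MPI guess_strlen(); the function estimate_len() and its size bounds are hypothetical, and the bounds assume a 32-bit int and a long of at most 64 bits. The point is only that every conversion must call va_arg() with the matching promoted type, and that an unknown conversion should be reported rather than silently skipped.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: upper-bound estimate of the formatted length for a
 * small subset of conversions. Returns -1 for anything unsupported instead
 * of guessing, because skipping an argument desynchronizes the va_list. */
static int estimate_len(const char *fmt, ...)
{
    va_list ap;
    int len = 0;
    const char *p;

    assert(fmt != NULL);
    va_start(ap, fmt);
    for (p = fmt; *p != '\0'; ++p) {
        if (*p != '%') { ++len; continue; }
        ++p;
        if (*p == 'l') {
            ++p;
            /* "long" conversions must be fetched as long, not int */
            if (*p == 'd')      { (void) va_arg(ap, long); len += 21; }
            else if (*p == 'u') { (void) va_arg(ap, unsigned long); len += 21; }
            else                { len = -1; break; }
        } else if (*p == 'd') {
            (void) va_arg(ap, int); len += 12;
        } else if (*p == 'u') {
            /* consuming the argument keeps a following "%s" aligned */
            (void) va_arg(ap, unsigned int); len += 12;
        } else if (*p == 's') {
            const char *s = va_arg(ap, char *);
            len += (int) strlen(s != NULL ? s : "(null)");
        } else if (*p == '%') {
            ++len;
        } else {
            len = -1;   /* unknown conversion (also a trailing '%') */
            break;
        }
    }
    va_end(ap);
    return len;
}
```

With a format such as "%u:%s", the "%u" branch consumes the unsigned argument before "%s" fetches the string; without that branch, the string pointer would be read from the slot holding the integer.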

Re: [OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Ralph Castain
Aha! You are the first to fall thru the timeout. How interesting. Can you please try adding “-mca oob_tcp_connect_timeout 5:0”? On Dec 12, 2014, at 8:53 AM, Paul Hargrove wrote: > > > First, I want to ask what became of the issue discussed in this thread? >http://www.open-mpi.org/community

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Afraid I’m drawing a blank, Paul - I can’t see how we got to a bad address down there. This is at the beginning of orte_init, so there are no threads running nor has anything much happened. Do you have any suggestions? > On Dec 12, 2014, at 9:02 AM, Paul Hargrove wrote: > > Ralph, > > The "

Re: [OMPI devel] Trunk warnings

2014-12-12 Thread Edgar Gabriel
I'll take care of the one ompio warning. Edgar On 12/12/2014 12:01 PM, Nathan Hjelm wrote: The osc warnings will go away after the btl modifications are applied. I made significant changes to the component. -Nathan On Fri, Dec 12, 2014 at 09:49:47AM -0800, Ralph Castain wrote: While buil

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Nathan Hjelm
As it is already given the commit is specified. Been thinking about trying to bring it and a handful of other fixes to master before the rest of the commits. -Nathan On Fri, Dec 12, 2014 at 11:08:46AM -0700, Howard Pritchard wrote: >Nathan, >Please make sure the fix for this problem is c

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Ralph Castain
I just checked it with --enable-memchecker --with-valgrind and found that many of these are legitimate leaks. We can take a look at them, though, as I said, they may have to wait for 1.8.5, as I wouldn’t hold up 1.8.4 for it. > On Dec 12, 2014, at 9:26 AM, Eric Chamberland > wrote: > > On 12/12/2014

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Howard Pritchard
Nathan, Please make sure the fix for this problem is contained in its own commit. Howard 2014-12-12 9:38 GMT-07:00 Nathan Hjelm : > > > Yeah, that code is completely wrong. I have a fix in my btl > modifications branch. > > > https://github.com/hjelmn/ompi/commit/38e961193074d382983d000e68adb72

Re: [OMPI devel] Trunk warnings

2014-12-12 Thread Nathan Hjelm
The osc warnings will go away after the btl modifications are applied. I made significant changes to the component. -Nathan On Fri, Dec 12, 2014 at 09:49:47AM -0800, Ralph Castain wrote: >While building optimized on Linux: >bcol_ptpcoll_allreduce.c: In function >'bcol_ptpcoll_allredu

[OMPI devel] Trunk warnings

2014-12-12 Thread Ralph Castain
While building optimized on Linux: bcol_ptpcoll_allreduce.c: In function 'bcol_ptpcoll_allreduce_narraying_init': bcol_ptpcoll_allreduce.c:236: warning: unused variable 'dtype' bcol_ptpcoll_allreduce.c:235: warning: unused variable 'count' io_ompio_file_set_view.c: In function 'mca_io_ompio_final

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Eric Chamberland
On 12/12/2014 11:38 AM, Jeff Squyres (jsquyres) wrote: Did you configure OMPI with --enable-memchecker? No, only "--prefix=" Eric

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Paul Hargrove
Ralph, The "arch" variable looks fine: Current function is opal_hwloc_base_get_topo_signature 2134 nnuma, nsocket, nl3, nl2, nl1, ncore, nhwt, arch); (dbx) print arch arch = 0x1001700a0 "sun4v" And so is "fmt": Current function is opal_asprintf 194 length = opal_vaspr

[OMPI devel] [1.8.4rc3] REGRESSION: connection problem on (multi-homed) Solaris host

2014-12-12 Thread Paul Hargrove
First, I want to ask what became of the issue discussed in this thread? http://www.open-mpi.org/community/lists/devel/2014/11/16160.php I thought we had concluded that one just needed -D_REENTRANT. I mention that only for completeness, because I think my current problem is different. The followi

Re: [OMPI devel] OpenIB has some borked code

2014-12-12 Thread Nathan Hjelm
Yeah, that code is completely wrong. I have a fix in my btl modifications branch. https://github.com/hjelmn/ompi/commit/38e961193074d382983d000e68adb721aaf3df7d -Nathan On Fri, Dec 12, 2014 at 08:26:34AM -0800, Ralph Castain wrote: >Hey folks >I've been looking into this warning: >b

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Jeff Squyres (jsquyres)
Did you configure OMPI with --enable-memchecker? On Dec 12, 2014, at 8:35 AM, Ralph Castain wrote: > We have made more of an effort to get valgrind clean on the master - haven’t > brought all of it across due to the desire to minimize change in 1.8 > > I’ll see what can be done, probably more

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Ralph Castain
We have made more of an effort to get valgrind clean on the master - haven’t brought all of it across due to the desire to minimize change in 1.8 I’ll see what can be done, probably more for 1.8.5 at this point. Most of these look like legitimate leaks that should be addressed as opposed to supp

Re: [OMPI devel] [1.8.4rc3] dangling symlinks

2014-12-12 Thread Ralph Castain
Fixed in master, setup for 1.8.4 - thanks Paul! > On Dec 11, 2014, at 11:47 PM, Paul Hargrove wrote: > > On a Linux system configured without java support I see the following two > dangling symlinks installed in ${prefix}/bin: > > lrwxrwxrwx 1 phhargrove phhargrove 8 Dec 11 23:52 oshjavac ->

[OMPI devel] OpenIB has some borked code

2014-12-12 Thread Ralph Castain
Hey folks I’ve been looking into this warning: btl_openib_component.c: In function 'init_one_device': btl_openib_component.c:2019:54: warning: comparison between 'enum <anonymous>' and 'mca_base_var_source_t' [-Wenum-compare] else if (BTL_OPENIB_RQ_SOURCE_DEVICE_INI ==

Re: [OMPI devel] 1.8.4rc2 now available for testing

2014-12-12 Thread Eric Chamberland
On 12/11/2014 05:45 AM, Ralph Castain wrote: ... by the reporters. Still, I would appreciate a fairly thorough testing as this is expected to be the last 1.8 series release for some time. Is it relevant to report valgrind leaks? Maybe they are "normal" or not, I don't know. If they are norm

Re: [OMPI devel] [1.8.4rc2] build broken by default on SGI UV

2014-12-12 Thread Nathan Hjelm
Hmm, I thought we already cleaned that up in 1.8. I will take a look today. BTW, can you send me the sn/xpmem.h file from your machine? I might have an idea what is going wrong. Can't seem to find the link to SGI's tarball on their oss site. -Nathan On Thu, Dec 11, 2014 at 06:53:00PM -0800, Pa

Re: [OMPI devel] OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Gilles Gouaillardet
Ralph, I can do that starting Monday. Cheers, Gilles Ralph Castain wrote: >Thanks Brice! > >Our 1.8 branch probably has another 2 or so years in it, but I think we can >lock it down fairly soon. Since we’ve shaken a lot of the bugs out of 1.8, we >are now seeing the “adoption wave” that

Re: [OMPI devel] hwloc out-of-order topology discovery with SLURM 14.11.0 and openmpi 1.6

2014-12-12 Thread Pim Schellart
Dear All, we have now recompiled both openmpi (1.8.3) and SLURM against an externally compiled and installed hwloc (1.10.0). With these changes the out-of-order topology discovery warning disappears. By now we also believe the problem was probably somewhere in SLURM rather than in openmpi but w

Re: [OMPI devel] [1.8.4rc2] orterun SEGVs on Solaris-10/SPARC

2014-12-12 Thread Ralph Castain
Hmmm….this is really odd. I actually do have a protection for that arch value being NULL, and you are in the code section when it isn’t. Do you still have the core file around? If so, can you print out the value of the “arch” variable? It would be in the opal_hwloc_base_get_topo_signature level

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Ralph Castain
Thanks Brice! Our 1.8 branch probably has another 2 or so years in it, but I think we can lock it down fairly soon. Since we’ve shaken a lot of the bugs out of 1.8, we are now seeing the “adoption wave” that is causing bug reports. Once we get thru this, I expect things will settle down again.

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Ralph Castain
Kewl - thanks to both of you for the explanation. I’ll make the adjustment. > On Dec 11, 2014, at 9:10 PM, Paul Hargrove wrote: > > Ralph, > > The "understanding" Gilles just expresses matches my own. > > The issue that the OP observed on an ARM/Linux system (and I was able to > reproduce on

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Brice Goglin
Le 12/12/2014 07:36, Gilles Gouaillardet a écrit : > Brice, > > ompi master is based on hwloc 1.9.1, isn't it ? Yes sorry, I am often confused by all these OMPI vs hwloc branch numbers. > > if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then > could you please update the hwloc

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread Pascal Deveze
George, My initial problem is that when MPI is compiled with “--enable-mpi-thread-multiple”, the variable enable_mpi_threads is set to 1 even if MPI_Init() is called in place of MPI_Init_thread(). I saw also that opal_using_threads() exists and was used by other BTLs. Maybe the solution is to

[OMPI devel] [1.8.4rc3] dangling symlinks

2014-12-12 Thread Paul Hargrove
On a Linux system configured without java support I see the following two dangling symlinks installed in ${prefix}/bin: lrwxrwxrwx 1 phhargrove phhargrove 8 Dec 11 23:52 oshjavac -> mpijavac lrwxrwxrwx 1 phhargrove phhargrove 8 Dec 11 23:52 shmemjavac -> mpijavac It seems there is some logic mi

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread George Bosilca
On Thu, Dec 11, 2014 at 9:41 PM, Gilles Gouaillardet < gilles.gouaillar...@iferc.org> wrote: > George, > > please allow me to jump in with naive comments ... > > currently (master) both openib and usnic btl invokes opal_using_threads in > component_init() : > > btl_openib_component_init(int *num_

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Gilles Gouaillardet
Brice, ompi master is based on hwloc 1.9.1, isn't it ? if some backport is required for hwloc 1.7.2 (used by ompi v1.8), then could you please update the hwloc v1.7 branch ? Cheers, Gilles On 2014/12/12 15:16, Brice Goglin wrote: > Yes. > > In theory, everything that's in hwloc/v1.8 should go

Re: [OMPI devel] [OMPI users] OpenMPI 1.8.4 and hwloc in Fedora 14 using a beta gcc 5.0 compiler.

2014-12-12 Thread Brice Goglin
Yes. In theory, everything that's in hwloc/v1.8 should go to OMPI/master. And most of it should go to v1.8 too, but that may require some backporting rework. I can update hwloc/v1.7 if that helps. Brice Le 12/12/2014 03:10, Gilles Gouaillardet a écrit : > Brice, > > should this fix be backpor

Re: [OMPI devel] Patch proposed: opal_set_using_threads(true) in ompi/runtime/ompi_mpi_init.c is called to late

2014-12-12 Thread George Bosilca
On Thu, Dec 11, 2014 at 8:30 PM, Ralph Castain wrote: > Just to help me understand: I don’t think this change actually changed any > behavior. However, it certainly *allows* a different behavior. Isn’t that > true? > It depends how you look at this. To be extremely clear it prevents the modules

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Paul Hargrove
Ralph, The "understanding" Gilles just expressed matches my own. The issue that the OP observed on an ARM/Linux system (and I was able to reproduce on Linux w/ any arch) is that when the LO interface is missing Linux is unable to pass loopback messages sent on ANY interface. The oob_tcp code was

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Paul Hargrove
Gilles, You are correct that mpirun is executed on a node other than n15 or n16. So, your count to 5 makes sense. It does seem a bit excessive, but it should only occur when there is a problem. I have no MCA params file nor any MCA-related environment variables. So, there are no oob_tcp_if_{include

Re: [OMPI devel] [1.8.4rc3] false report of no loopback interface + segv at exit

2014-12-12 Thread Gilles Gouaillardet
Ralph, here is my understanding of what happens on Linux : lo: 127.0.0.1/8 eth0: 192.168.122.101/24 mpirun --mca orte_oob_tcp_if_include eth0 ... so the mpi task tries to contact orted/mpirun on 192.168.0.1/24 that works just fine if the loopback interface is active, and that hangs if there is