Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Mon, 22 Oct 2007, Jeff Squyres wrote:

> There is in the openib BTL.

Bug #1025 has in one of its answers the following phrase: "It looks like this will affect many threading issues with the pathscale compiler -- the openib BTL is simply the first place we tripped it." which, along with the rest of the data (failure dependency on TLS usage), led me to wonder about threading issues.

> To be honest, I removed the pathscale suite from my regular regression
> testing

So, is anyone else testing PathScale 3.0 with stable versions of Open MPI? Or with development versions?

> I just recompiled the OMPI 1.2 branch with pathscale 3.0 on RHEL4U4 and I
> do not see the problems that you are seeing. :-\ Is Debian etch a
> supported pathscale platform? Seems like it's not...

And indeed the older RHEL4 is a supported platform, which might explain the different results. But... I made some progress: if I configure with "--without-memory-manager" (along with all other options that I mentioned before), then it works. This was inspired by the fact that the segmentation fault occurred in ptmalloc2. I have previously tried to remove the MX support without any effect; with ptmalloc2 out of the picture I have had test runs over MX and TCP without problems.

Should I file a bug report? Is there something else that you'd like me to try?

ompi_info output:

Open MPI: 1.2.3
Open MPI SVN revision: r15136
Open RTE: 1.2.3
Open RTE SVN revision: r15136
OPAL: 1.2.3
OPAL SVN revision: r15136
Prefix: /home/thor1/costescu/openmpi-1.2.3-ps30
Configured architecture: x86_64-unknown-linux-gnu
Configured by: costescu
Configured on: Tue Oct 23 11:09:33 CEST 2007
Configure host: helics4
Built by: costescu
Built on: Tue Oct 23 11:30:31 CEST 2007
Built host: helics4
C bindings: yes
C++ bindings: yes
Fortran77 bindings: yes (all)
Fortran90 bindings: yes
Fortran90 bindings size: small
C compiler: pathcc
C compiler absolute: /opt_local/pathscale/bin/pathcc
C++ compiler: pathCC
C++ compiler absolute: /opt_local/pathscale/bin/pathCC
Fortran77 compiler: pathf90
Fortran77 compiler abs: /opt_local/pathscale/bin/pathf90
Fortran90 compiler: pathf90
Fortran90 compiler abs: /opt_local/pathscale/bin/pathf90
C profiling: yes
C++ profiling: yes
Fortran77 profiling: yes
Fortran90 profiling: yes
C++ exceptions: no
Thread support: posix (mpi: no, progress: no)
Internal debug support: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
libltdl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.3)
MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.3)
MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.3)
MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.3)
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote:

>> There is in the openib BTL.
>
> Bug #1025 has in one of its answers the following phrase: "It looks like
> this will affect many threading issues with the pathscale compiler -- the
> openib BTL is simply the first place we tripped it." which, along with the
> rest of the data (failure dependency on TLS usage), led me to wonder about
> threading issues.

FWIW, these problems even affect non-threaded builds, so I'm not entirely sure what the problem is. All indications point to a problem in the Pathscale compiler, but who knows -- perhaps we're doing something stupid that doesn't show up in any other compiler.

>> To be honest, I removed the pathscale suite from my regular regression
>> testing
>
> So, is anyone else testing PathScale 3.0 with stable versions of Open MPI?
> Or with development versions?

I don't know; Cisco is not. I removed it from my normal testing set because all IB testing would fail -- so it wasn't worth testing.

>> I just recompiled the OMPI 1.2 branch with pathscale 3.0 on RHEL4U4 and I
>> do not see the problems that you are seeing. :-\ Is Debian etch a
>> supported pathscale platform? Seems like it's not...
>
> And indeed the older RHEL4 is a supported platform, which might explain
> the different results.

You might want to ask them if Debian etch is supported.

> I made some progress: if I configure with "--without-memory-manager"
> (along with all other options that I mentioned before), then it works.
> This was inspired by the fact that the segmentation fault occurred in
> ptmalloc2. I have previously tried to remove the MX support without any
> effect; with ptmalloc2 out of the picture I have had test runs over MX and
> TCP without problems.

This is ringing a [very] distant bell in my memory, but I don't remember the details. Brian: do you remember any specific issues about the memory manager and pathscale compiler?

--
Jeff Squyres
Cisco Systems
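[For readers unfamiliar with the TLS dependence mentioned above: thread-local storage in C typically looks like the sketch below. This is only an illustration of the language feature the bug report refers to -- it is not the openib BTL's actual code, and the variable name is made up.]

#include <stdio.h>

/* __thread gives each thread its own copy of the variable (a GCC-style
 * extension also accepted by PathScale).  Illustrative only. */
static __thread int per_thread_counter = 0;

int main(void)
{
    per_thread_counter++;
    printf("per_thread_counter = %d\n", per_thread_counter);
    return 0;
}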
Re: [OMPI devel] collective problems
So this problem goes WAY back..

The problem here is that the PML marks MPI completion just prior to calling btl_send and then returns to the user. This wouldn't be a problem if the BTL then did something, but in the case of OpenIB this fragment may not actually be on the wire (the joys of user-level flow control).

One solution that we proposed was to allow btl_send to return either OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to not mark MPI completion of the fragment, and then MPI_WAITALL and others will do their job properly.

- Galen

On 10/11/07 11:26 AM, "Gleb Natapov" wrote:

> On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
>> David --
>>
>> Gleb and I just actively re-looked at this problem yesterday; we think
>> it's related to https://svn.open-mpi.org/trac/ompi/ticket/1015. We
>> previously thought this ticket was a different problem, but our analysis
>> yesterday shows that it could be a real problem in the openib BTL or ob1
>> PML (kinda think it's the openib btl because it doesn't seem to happen on
>> other networks, but who knows...).
>>
>> Gleb is investigating.
>
> Here is the result of the investigation. The problem is different from
> ticket #1015. What we have here is that one rank calls isend() of a small
> message and wait_all() in a loop, and another one calls irecv(). The
> problem is that isend() usually doesn't call opal_progress() anywhere, and
> wait_all() doesn't call progress if all requests are already completed, so
> messages are never progressed. We may force opal_progress() to be called
> by setting btl_openib_free_list_max to 1000. Then wait_all() will call
> progress because not every request will be immediately completed by OB1.
> Or we can limit the number of uncompleted requests that OB1 can allocate
> by setting pml_ob1_free_list_max to 1000. Then opal_progress() will be
> called from free_list_wait() when the max is reached. The second option
> works much faster for me.
>
>> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
>>
>>> Hi Folks,
>>>
>>> I have been seeing some nasty behaviour in collectives, particularly
>>> bcast and reduce. Attached is a reproducer (for bcast).
>>>
>>> The code will rapidly slow to a crawl (usually interpreted as a hang in
>>> real applications) and sometimes gets killed with sigbus or sigterm.
>>>
>>> I see this with
>>>
>>>   openmpi-1.2.3 or openmpi-1.2.4
>>>   ofed 1.2
>>>   linux 2.6.19 + patches
>>>   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>>>   4 socket, dual core opterons
>>>
>>> run as
>>>
>>>   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
>>>
>>> To my now uneducated eye it looks as if the root process is rushing
>>> ahead and not progressing earlier bcasts.
>>>
>>> Anyone else seeing similar? Any ideas for workarounds?
>>>
>>> As a point of reference, mvapich2 0.9.8 works fine.
>>>
>>> Thanks, David
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
> --
> Gleb.
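[To make the pattern Gleb describes concrete, a minimal reproducer along these lines (run with at least two ranks) might look like the sketch below. The message size, tag, and loop count are made up for illustration -- this is not David's attached reproducer.]

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, i, buf = 0;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        if (rank == 0) {
            /* Small eager send: OB1 may mark it MPI-complete before the
             * fragment is actually on the wire. */
            MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* If the request already looks complete, this wait can return
             * without ever entering the progress engine. */
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            MPI_Wait(&req, MPI_STATUS_IGNORE);
        }
    }

    MPI_Finalize();
    return 0;
}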
Re: [OMPI devel] collective problems
On Tue, Oct 23, 2007 at 09:40:45AM -0400, Shipman, Galen M. wrote:
> So this problem goes WAY back..
>
> The problem here is that the PML marks MPI completion just prior to
> calling btl_send and then returns to the user. This wouldn't be a problem
> if the BTL then did something, but in the case of OpenIB this fragment may
> not actually be on the wire (the joys of user-level flow control).
>
> One solution that we proposed was to allow btl_send to return either
> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to
> not mark MPI completion of the fragment, and then MPI_WAITALL and others
> will do their job properly.

I even implemented this once, but there is a problem. Currently we mark the request as completed at the MPI level and then do btl_send(). Whenever the IB completion happens, the request is marked as complete at the PML level and freed. The fix requires changing the order: call btl_send(), check the return value from the BTL, and mark the request complete as necessary. The problem is that, because we allow the BTL to call opal_progress() internally, the request may already be completed at the MPI and PML levels and freed before the call to btl_send() returns.

I did a code review to see how hard it would be to get rid of recursion in Open MPI, and I think this is doable. We have to disallow calling progress() (or other functions that may call progress() internally) from the BTL and from ULP callbacks that are called by the BTL. There are not many places that break this rule. The main offenders are calls to FREE_LIST_WAIT(), but those never actually call progress if the free list can grow without limit, and this is the most common use of FREE_LIST_WAIT(), so they may safely be changed to FREE_LIST_GET(). Once the recursion problem is solved, the fix to the original problem will be a couple of lines of code.
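[A self-contained sketch of the reordering Gleb describes is below. Every name here is a stand-in -- the real OB1/BTL interfaces are more involved -- so treat this as an illustration of the ordering argument, not as Open MPI code.]

#include <stdbool.h>
#include <stdio.h>

/* Stand-ins for the return codes under discussion: SKETCH_NOT_ON_WIRE plays
 * the role of the proposed OMPI_NOT_ON_WIRE. */
#define SKETCH_SUCCESS     0
#define SKETCH_NOT_ON_WIRE 1

struct sketch_request {
    bool mpi_complete;
};

/* Pretend BTL send: returns NOT_ON_WIRE when user-level flow control only
 * queued the fragment instead of putting it on the wire. */
static int sketch_btl_send(bool queued_by_flow_control)
{
    return queued_by_flow_control ? SKETCH_NOT_ON_WIRE : SKETCH_SUCCESS;
}

/* Proposed ordering: send first, then mark MPI completion only if the
 * fragment really went out.  As Gleb notes, this is only safe if the BTL
 * never calls the progress engine internally; otherwise the request could
 * already be completed and freed by the time sketch_btl_send() returns. */
static void send_and_maybe_complete(struct sketch_request *req, bool queued)
{
    int rc = sketch_btl_send(queued);
    if (rc == SKETCH_SUCCESS) {
        req->mpi_complete = true;
    }
    /* Otherwise leave the request pending; MPI completion is deferred until
     * the network-level (e.g. IB) send completion arrives. */
}

int main(void)
{
    struct sketch_request req = { false };
    send_and_maybe_complete(&req, true);
    printf("mpi_complete after a queued send: %s\n",
           req.mpi_complete ? "yes" : "no");
    return 0;
}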
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote:

> I made some progress: if I configure with "--without-memory-manager"
> (along with all other options that I mentioned before), then it works.
> This was inspired by the fact that the segmentation fault occurred in
> ptmalloc2. I have previously tried to remove the MX support without any
> effect; with ptmalloc2 out of the picture I have had test runs over MX and
> TCP without problems.
>
> Should I file a bug report? Is there something else that you'd like me to
> try?

Bogdan,

Which version of MX are you using? Are you enabling the MX registration cache (regcache)? Can you try two runs, one exporting MX_RCACHE=1 and one exporting MX_RCACHE=0 to all processes? This will rule out any interaction between the OMPI memory manager and MX's regcache (either caused by or simply exposed by the Pathscale compiler).

Thanks,

Scott

--
Scott Atchley
Myricom Inc.
http://www.myri.com
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Tue, 23 Oct 2007, Scott Atchley wrote:

> Which version of MX are you using? Are you enabling the MX registration
> cache (regcache)? Can you try two runs, one exporting MX_RCACHE=1 and one
> exporting MX_RCACHE=0 to all processes?

I don't get to that point... I am not even able to use the wrapper compilers (e.g. mpif90) to obtain an executable to run. The segmentation fault happens when Open MPI utilities are being run, even ompi_info.

> This will rule out any interaction between the OMPI memory manager and
> MX's regcache (either caused by or simply exposed by the Pathscale
> compiler).

As I wrote in my previous e-mail, I tried configuring with and without the MX libs, but this made no difference. It's only when I disabled the memory manager, while still enabling MX, that I was able to get a working build.

--
Bogdan Costescu
IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 10:36 AM, Bogdan Costescu wrote:

> I don't get to that point... I am not even able to use the wrapper
> compilers (e.g. mpif90) to obtain an executable to run. The segmentation
> fault happens when Open MPI utilities are being run, even ompi_info.

Ahh, I thought you were getting a segfault when it ran, not when compiling.

>> This will rule out any interaction between the OMPI memory manager and
>> MX's regcache (either caused by or simply exposed by the Pathscale
>> compiler).
>
> As I wrote in my previous e-mail, I tried configuring with and without
> the MX libs, but this made no difference. It's only when I disabled the
> memory manager, while still enabling MX, that I was able to get a working
> build.

Sorry for the distraction. :-)

Thanks,

Scott
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
Hi Bogdan,

Bogdan Costescu wrote:
> I made some progress: if I configure with "--without-memory-manager"
> (along with all other options that I mentioned before), then it works.
> This was inspired by the fact that the segmentation fault occurred in
> ptmalloc2. I have previously tried to remove the MX support without any
> effect; with ptmalloc2 out of the picture I have had test runs over MX and
> TCP without problems.

We have had portability problems using ptmalloc2 in MPICH-GM, especially with respect to threads. In MX, we chose to use dlmalloc instead. It is not as optimized and its thread-safety has a coarser grain, but it is much more portable.

Disabling the memory manager in Open MPI is not a bad thing for MX, as its own dlmalloc-based registration cache will operate transparently with MX_RCACHE=1 (the default).

Patrick
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 10:58 AM, Patrick Geoffray wrote:

> Bogdan Costescu wrote:
>> I made some progress: if I configure with "--without-memory-manager"
>> (along with all other options that I mentioned before), then it works.
>> This was inspired by the fact that the segmentation fault occurred in
>> ptmalloc2. I have previously tried to remove the MX support without any
>> effect; with ptmalloc2 out of the picture I have had test runs over MX
>> and TCP without problems.
>
> We have had portability problems using ptmalloc2 in MPICH-GM, especially
> with respect to threads. In MX, we chose to use dlmalloc instead. It is
> not as optimized and its thread-safety has a coarser grain, but it is much
> more portable.
>
> Disabling the memory manager in Open MPI is not a bad thing for MX, as its
> own dlmalloc-based registration cache will operate transparently with
> MX_RCACHE=1 (the default).

If you're not packaging Open MPI with MX support, I'd configure Open MPI with the extra parameters:

  --without-memory-manager --enable-mca-static=btl-mx,mtl-mx

This will provide the least possibility of something getting in the way of MX doing its thing with its memory hooks. It causes libmpi.so to depend on libmyriexpress.so, which is both a good and a bad thing. Good because the malloc hooks in libmyriexpress aren't "seen" when we dlopen the OMPI MX drivers to suck in libmyriexpress, but they would be with this configuration. Bad in that libmpi.so now depends on libmyriexpress, so packaging for multiple machines could be more difficult.

Brian
[OMPI devel] FD leak in Altix timer code?
I was just looking at the Altix timer code in the OMPI trunk and, unless I am missing something, the fd opened in opal_timer_altix_open() is never closed. It can be safely closed right after the mmap().

-Paul

--
Paul H. Hargrove                          phhargr...@lbl.gov
Future Technologies Group
HPC Research Department                   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
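[For context, the pattern Paul suggests is safe because, per POSIX, closing a file descriptor does not invalidate an existing mmap() of it. The sketch below illustrates the idea; the device path and mapping size are placeholders, not the actual Altix (MMTIMER) component code.]

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* Stand-in for the timer device that the Altix component opens. */
    int fd = open("/dev/zero", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    void *addr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }

    /* Closing the fd here does not tear down the mapping, so no descriptor
     * is leaked for the lifetime of the process. */
    close(fd);

    /* ... use the mapped region ... */

    munmap(addr, 4096);
    return 0;
}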