Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Bogdan Costescu

On Mon, 22 Oct 2007, Jeff Squyres wrote:


There is in the openib BTL.


Bug #1025 has, in one of the answers, the following phrase:

"It looks like this will affect many threading issues with the 
pathscale compiler -- the openib BTL is simply the first place we 
tripped it."


which, along with the rest of the data (the failure's dependence on TLS 
usage), led me to wonder about threading issues.


To be honest, I removed the pathscale suite from my regular 
regression testing


So, is anyone else testing PathScale 3.0 with stable versions of Open 
MPI? Or with development versions?


I just recompiled the OMPI 1.2 branch with pathscale 3.0 on RHEL4U4 
and I do not see the problems that you are seeing.  :-\  Is Debian 
etch a supported pathscale platform?


Seems like it's not... And indeed the older RHEL4 is a supported 
platform, which might explain the different results.


But...

I made some progress: if I configure with "--without-memory-manager" 
(along with all the other options that I mentioned before), then it works. 
This was inspired by the fact that the segmentation fault occurred in 
ptmalloc2. I had previously tried to remove MX support without 
any effect; with ptmalloc2 out of the picture I have had test runs 
over MX and TCP without problems.


Should I file a bug report? Is there something else that you'd like 
me to try?


ompi_info output:
                Open MPI: 1.2.3
   Open MPI SVN revision: r15136
                Open RTE: 1.2.3
   Open RTE SVN revision: r15136
                    OPAL: 1.2.3
       OPAL SVN revision: r15136
                  Prefix: /home/thor1/costescu/openmpi-1.2.3-ps30
 Configured architecture: x86_64-unknown-linux-gnu
           Configured by: costescu
           Configured on: Tue Oct 23 11:09:33 CEST 2007
          Configure host: helics4
                Built by: costescu
                Built on: Tue Oct 23 11:30:31 CEST 2007
              Built host: helics4
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: pathcc
     C compiler absolute: /opt_local/pathscale/bin/pathcc
            C++ compiler: pathCC
   C++ compiler absolute: /opt_local/pathscale/bin/pathCC
      Fortran77 compiler: pathf90
  Fortran77 compiler abs: /opt_local/pathscale/bin/pathf90
      Fortran90 compiler: pathf90
  Fortran90 compiler abs: /opt_local/pathscale/bin/pathf90
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (mpi: no, progress: no)
  Internal debug support: yes
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: yes
 mpirun default --prefix: no
           MCA backtrace: execinfo (MCA v1.0, API v1.0, Component v1.2.3)
           MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.2.3)
           MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.2.3)
               MCA timer: linux (MCA v1.0, API v1.0, Component v1.2.3)
         MCA installdirs: env (MCA v1.0, API v1.0, Component v1.2.3)
         MCA installdirs: config (MCA v1.0, API v1.0, Component v1.2.3)
           MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
           MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
                MCA coll: basic (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: self (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: sm (MCA v1.0, API v1.0, Component v1.2.3)
                MCA coll: tuned (MCA v1.0, API v1.0, Component v1.2.3)
               MCA mpool: rdma (MCA v1.0, API v1.0, Component v1.2.3)
               MCA mpool: sm (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA pml: cm (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA bml: r2 (MCA v1.0, API v1.0, Component v1.2.3)
              MCA rcache: vma (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA btl: self (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: sm (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: mx (MCA v1.0, API v1.0.1, Component v1.2.3)
                 MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.0)
                 MCA mtl: mx (MCA v1.0, API v1.0, Component v1.2.3)
                MCA topo: unity (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA osc: pt2pt (MCA v1.0, API v1.0, Component v1.2.3)
              MCA errmgr: hnp (MCA v1.0, API v1.3, Component v1.2.3)
              MCA errmgr: orted (MCA v1.0, API v1.3, Component v1.2.3)
              MCA errmgr: proxy (MCA v1.0, API v1.3, Component v1.2.3)
                 MCA gpr: null (MCA v1.0, API v1.0, Component v1.2.3)
                 MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.2.3)

Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Jeff Squyres

On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote:


There is in the openib BTL.


Bug #1025 has, in one of the answers, the following phrase:

"It looks like this will affect many threading issues with the
pathscale compiler -- the openib BTL is simply the first place we
tripped it."

which, along with the rest of the data (the failure's dependence on TLS
usage), led me to wonder about threading issues.


FWIW, these problems even affect non-threaded builds, so I'm not  
entirely sure what the problem is.  All indications point to a  
problem in the Pathscale compiler, but who knows -- perhaps we're  
doing something stupid that doesn't show up in any other compiler.



To be honest, I removed the pathscale suite from my regular
regression testing


So, is anyone else testing PathScale 3.0 with stable versions of Open
MPI? Or with development versions?


I don't know; Cisco is not.  I removed it from my normal testing set  
because all IB testing would fail -- so it wasn't worth testing.



I just recompiled the OMPI 1.2 branch with pathscale 3.0 on RHEL4U4
and I do not see the problems that you are seeing.  :-\  Is Debian
etch a supported pathscale platform?


Seems like it's not... And indeed the older RHEL4 is a supported
platform, which might explain the different results.


You might want to ask them if Debian etch is supported.


I made some progress: if I configure with "--without-memory-manager"
(along with all the other options that I mentioned before), then it works.
This was inspired by the fact that the segmentation fault occurred in
ptmalloc2. I had previously tried to remove MX support without
any effect; with ptmalloc2 out of the picture I have had test runs
over MX and TCP without problems.


This is ringing a [very] distant bell in my memory, but I don't  
remember the details.  Brian: do you remember any specific issues  
about the memory manager and pathscale compiler?


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] collective problems

2007-10-23 Thread Shipman, Galen M.
So this problem goes WAY back...

The problem here is that the PML marks MPI completion just prior to calling
btl_send and then returns to the user. This wouldn't be a problem if the BTL
then actually sent the fragment, but in the case of OpenIB the fragment may
not actually be on the wire yet (the joys of user-level flow control).

One solution that we proposed was to allow btl_send to return either
OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to
not mark MPI completion of the fragment, so that MPI_WAITALL and others will
do their job properly.

- Galen 
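
For reference, a minimal sketch of the communication pattern described in
Gleb's analysis quoted below -- a hypothetical reproducer (not the test case
attached to David's original report), assuming an eager-sized message and at
least two processes:

    /* Hypothetical reproducer sketch: rank 0 isend()s small messages and
     * wait_all()s in a loop, rank 1 only irecv()s -- so, as described below,
     * nothing else ever drives opal_progress() on the sender. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, i, buf = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        for (i = 0; i < 100000; i++) {
            if (rank == 0) {
                /* small eager send: marked MPI-complete before it is on the wire */
                MPI_Isend(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
                MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
            } else if (rank == 1) {
                MPI_Irecv(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
                MPI_Waitall(1, &req, MPI_STATUSES_IGNORE);
            }
        }

        if (rank == 0) printf("done\n");
        MPI_Finalize();
        return 0;
    }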



On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:

> On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
>> David --
>> 
>> Gleb and I just actively re-looked at this problem yesterday; we
>> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
>> 1015.  We previously thought this ticket was a different problem, but
>> our analysis yesterday shows that it could be a real problem in the
>> openib BTL or ob1 PML (kinda think it's the openib btl because it
>> doesn't seem to happen on other networks, but who knows...).
>> 
>> Gleb is investigating.
> Here is the result of the investigation. The problem is different from
> ticket #1015. What we have here is that one rank calls isend() of a small
> message and wait_all() in a loop, and another one calls irecv(). The
> problem is that isend() usually doesn't call opal_progress() anywhere,
> and wait_all() doesn't call progress if all requests are already completed,
> so messages are never progressed. We may force opal_progress() to be called
> by setting btl_openib_free_list_max to 1000; then wait_all() will call
> progress because not every request will be immediately completed by OB1.
> Or we can limit the number of uncompleted requests that OB1 can allocate
> by setting pml_ob1_free_list_max to 1000; then opal_progress() will be
> called from free_list_wait() when the max is reached. The second option
> works much faster for me.
> 
>> 
>> 
>> 
>> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
>> 
>>> Hi Folks,
>>> 
>>> I have been seeing some nasty behaviour in collectives,
>>> particularly bcast and reduce.  Attached is a reproducer (for bcast).
>>> 
>>> The code will rapidly slow to a crawl (usually interpreted as a
>>> hang in real applications) and sometimes gets killed with sigbus or
>>> sigterm.
>>> 
>>> I see this with
>>> 
>>>   openmpi-1.2.3 or openmpi-1.2.4
>>>   ofed 1.2
>>>   linux 2.6.19 + patches
>>>   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
>>>   4 socket, dual core opterons
>>> 
>>> run as
>>> 
>>>   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
>>> 
>>> To my now uneducated eye it looks as if the root process is rushing
>>> ahead and not progressing earlier bcasts.
>>> 
>>> Anyone else seeing similar?  Any ideas for workarounds?
>>> 
>>> As a point of reference, mvapich2 0.9.8 works fine.
>>> 
>>> Thanks, David
>>> 
>>> 
>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> Cisco Systems
>> 
> 
> --
> Gleb.



Re: [OMPI devel] collective problems

2007-10-23 Thread Gleb Natapov
On Tue, Oct 23, 2007 at 09:40:45AM -0400, Shipman, Galen M. wrote:
> So this problem goes WAY back...
> 
> The problem here is that the PML marks MPI completion just prior to calling
> btl_send and then returns to the user. This wouldn't be a problem if the BTL
> then actually sent the fragment, but in the case of OpenIB the fragment may
> not actually be on the wire yet (the joys of user-level flow control).
> 
> One solution that we proposed was to allow btl_send to return either
> OMPI_SUCCESS or OMPI_NOT_ON_WIRE. OMPI_NOT_ON_WIRE would allow the PML to
> not mark MPI completion of the fragment, so that MPI_WAITALL and others
> will do their job properly.
I even implemented this once, but there is a problem. Currently we mark the
request as completed at the MPI level and then do btl_send(). When the IB
completion happens, the request is marked as complete at the PML level and
freed. The fix requires changing the order like this: call btl_send(), check
the return value from the BTL, and mark the request complete as necessary.
The problem is that, because we allow the BTL to call opal_progress()
internally, the request may already be completed at the MPI and PML levels
and freed before the call to btl_send() returns.

I did a code review to see how hard it would be to get rid of recursion in
Open MPI, and I think it is doable. We have to disallow calling progress()
(or other functions that may call progress() internally) from the BTL and
from ULP callbacks that are called by the BTL. There are not many places
that break this rule. The main offenders are calls to FREE_LIST_WAIT(), but
those never actually call progress if the free list can grow without limit,
and since that is the most common use of FREE_LIST_WAIT() they may be safely
changed to FREE_LIST_GET(). Once we solve the recursion problem, the fix to
the original problem will be a couple of lines of code.
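
As a rough, self-contained toy model of that reordering (hypothetical names
throughout -- only the OMPI_SUCCESS/OMPI_NOT_ON_WIRE idea comes from this
thread, and the real fix must also cope with the recursion issue described
above):

    /* Toy model, not OMPI code: send first, then decide about MPI completion
     * from the send's return value instead of marking completion up front. */
    #include <stdio.h>

    enum { TOY_SUCCESS, TOY_NOT_ON_WIRE };

    struct toy_request { int mpi_complete; };

    /* pretend BTL: user-level flow control queues every other fragment */
    static int toy_btl_send(int frag_id)
    {
        return (frag_id % 2 == 0) ? TOY_SUCCESS : TOY_NOT_ON_WIRE;
    }

    static void toy_pml_send(struct toy_request *req, int frag_id)
    {
        int rc = toy_btl_send(frag_id);
        if (rc == TOY_SUCCESS) {
            req->mpi_complete = 1;  /* fragment really went out */
        } else {
            req->mpi_complete = 0;  /* deferred: would be completed later by
                                       the BTL's send-done callback */
        }
    }

    int main(void)
    {
        struct toy_request req;
        int i;
        for (i = 0; i < 4; i++) {
            toy_pml_send(&req, i);
            printf("frag %d: mpi_complete=%d\n", i, req.mpi_complete);
        }
        return 0;
    }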

> 
> - Galen 
> 
> 
> 
> On 10/11/07 11:26 AM, "Gleb Natapov"  wrote:
> 
> > On Fri, Oct 05, 2007 at 09:43:44AM +0200, Jeff Squyres wrote:
> >> David --
> >> 
> >> Gleb and I just actively re-looked at this problem yesterday; we
> >> think it's related to https://svn.open-mpi.org/trac/ompi/ticket/
> >> 1015.  We previously thought this ticket was a different problem, but
> >> our analysis yesterday shows that it could be a real problem in the
> >> openib BTL or ob1 PML (kinda think it's the openib btl because it
> >> doesn't seem to happen on other networks, but who knows...).
> >> 
> >> Gleb is investigating.
> > Here is the result of the investigation. The problem is different from
> > ticket #1015. What we have here is that one rank calls isend() of a small
> > message and wait_all() in a loop, and another one calls irecv(). The
> > problem is that isend() usually doesn't call opal_progress() anywhere,
> > and wait_all() doesn't call progress if all requests are already completed,
> > so messages are never progressed. We may force opal_progress() to be called
> > by setting btl_openib_free_list_max to 1000; then wait_all() will call
> > progress because not every request will be immediately completed by OB1.
> > Or we can limit the number of uncompleted requests that OB1 can allocate
> > by setting pml_ob1_free_list_max to 1000; then opal_progress() will be
> > called from free_list_wait() when the max is reached. The second option
> > works much faster for me.
> > 
> >> 
> >> 
> >> 
> >> On Oct 5, 2007, at 12:59 AM, David Daniel wrote:
> >> 
> >>> Hi Folks,
> >>> 
> >>> I have been seeing some nasty behaviour in collectives,
> >>> particularly bcast and reduce.  Attached is a reproducer (for bcast).
> >>> 
> >>> The code will rapidly slow to a crawl (usually interpreted as a
> >>> hang in real applications) and sometimes gets killed with sigbus or
> >>> sigterm.
> >>> 
> >>> I see this with
> >>> 
> >>>   openmpi-1.2.3 or openmpi-1.2.4
> >>>   ofed 1.2
> >>>   linux 2.6.19 + patches
> >>>   gcc (GCC) 3.4.5 20051201 (Red Hat 3.4.5-2)
> >>>   4 socket, dual core opterons
> >>> 
> >>> run as
> >>> 
> >>>   mpirun --mca btl self,openib --npernode 1 --np 4 bcast-hang
> >>> 
> >>> To my now uneducated eye it looks as if the root process is rushing
> >>> ahead and not progressing earlier bcasts.
> >>> 
> >>> Anyone else seeing similar?  Any ideas for workarounds?
> >>> 
> >>> As a point of reference, mvapich2 0.9.8 works fine.
> >>> 
> >>> Thanks, David
> >>> 
> >>> 
> >>> 
> >> 
> >> 
> >> -- 
> >> Jeff Squyres
> >> Cisco Systems
> >> 
> > 
> > --
> > Gleb.
> 

Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Scott Atchley

On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote:


I made some progress: if I configure with "--without-memory-manager"
(along with all the other options that I mentioned before), then it works.
This was inspired by the fact that the segmentation fault occurred in
ptmalloc2. I had previously tried to remove MX support without
any effect; with ptmalloc2 out of the picture I have had test runs
over MX and TCP without problems.

Should I file a bug report? Is there something else that you'd like
me to try?


Bogdan,

Which version of MX are you using? Are you enabling the MX  
registration cache (regcache)?


Can you try two runs, one exporting MX_RCACHE=1 and one exporting  
MX_RCACHE=0 to all processes?


This will rule out any interaction between the OMPI memory manager  
and MX's regcache (either caused by or simply exposed by the  
Pathscale compiler).


Thanks,

Scott


--
Scott Atchley
Myricom Inc.
http://www.myri.com




Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Bogdan Costescu

On Tue, 23 Oct 2007, Scott Atchley wrote:


Which version of MX are you using? Are you enabling the MX
registration cache (regcache)?

Can you try two runs, one exporting MX_RCACHE=1 and one exporting
MX_RCACHE=0 to all processes?


I don't get to that point... I am not even able to use the wrapper 
compilers (e.g. mpif90) to obtain an executable to run. The 
segmentation fault happens when the Open MPI utilities are run, even 
ompi_info.



This will rule out any interaction between the OMPI memory manager
and MX's regcache (either caused by or simply exposed by the
Pathscale compiler).


As I wrote in my previous e-mail, I tried configuring with and without 
the MX libs, but this made no difference. It's only when I disabled 
the memory manager, while still enabling MX, that I was able to get a 
working build.


--
Bogdan Costescu

IWR - Interdisziplinaeres Zentrum fuer Wissenschaftliches Rechnen
Universitaet Heidelberg, INF 368, D-69120 Heidelberg, GERMANY
Telephone: +49 6221 54 8869, Telefax: +49 6221 54 8868
E-mail: bogdan.coste...@iwr.uni-heidelberg.de


Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Scott Atchley

On Oct 23, 2007, at 10:36 AM, Bogdan Costescu wrote:


I don't get to that point... I am not even able to use the wrapper
compilers (e.g. mpif90) to obtain an executable to run. The
segmentation fault happens when the Open MPI utilities are run, even
ompi_info.


Ahh, I thought you were getting a segfault when it ran, not when compiling.


This will rule out any interaction between the OMPI memory manager
and MX's regcache (either caused by or simply exposed by the
Pathscale compiler).


As I wrote in my previous e-mail, I tried configuring with and without
the MX libs, but this made no difference. It's only when I disabled
the memory manager, while still enabling MX, that I was able to get a
working build.


Sorry for the distraction. :-)

Thanks,

Scott


Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Patrick Geoffray

Hi Bogdan,

Bogdan Costescu wrote:
I made some progress: if I configure with "--without-memory-manager" 
(along with all the other options that I mentioned before), then it works. 
This was inspired by the fact that the segmentation fault occurred in 
ptmalloc2. I had previously tried to remove MX support without 
any effect; with ptmalloc2 out of the picture I have had test runs 
over MX and TCP without problems.


We have had portability problems using ptmalloc2 in MPICH-GM, especially 
with respect to threads. In MX, we chose to use dlmalloc instead. It is not 
as optimized and its thread safety is coarser grained, but it is much 
more portable.


Disabling the memory manager in Open MPI is not a bad thing for MX, as 
MX's own dlmalloc-based registration cache will operate transparently 
with MX_RCACHE=1 (the default).


Patrick


Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Brian Barrett

On Oct 23, 2007, at 10:58 AM, Patrick Geoffray wrote:


Bogdan Costescu wrote:

I made some progress: if I configure with "--without-memory-manager"
(along with all the other options that I mentioned before), then it works.
This was inspired by the fact that the segmentation fault occurred in
ptmalloc2. I had previously tried to remove MX support without
any effect; with ptmalloc2 out of the picture I have had test runs
over MX and TCP without problems.


We have had portability problems using ptmalloc2 in MPICH-GM, especially
with respect to threads. In MX, we chose to use dlmalloc instead. It is not
as optimized and its thread safety is coarser grained, but it is much
more portable.

Disabling the memory manager in Open MPI is not a bad thing for MX, as
MX's own dlmalloc-based registration cache will operate transparently
with MX_RCACHE=1 (the default).


If you're not packaging Open MPI with MX support, I'd configure Open 
MPI with the extra parameters:

  --without-memory-manager --enable-mca-static=btl-mx,mtl-mx

This will provide the least possibility of something getting in the way 
of MX doing its thing with its memory hooks. It causes libmpi.so to 
depend on libmyriexpress.so, which is both a good and a bad thing. Good, 
because normally the malloc hooks in libmyriexpress aren't "seen" when we 
dlopen the OMPI MX drivers to suck in libmyriexpress, but with this 
configuration they would be. Bad, in that libmpi.so now depends on 
libmyriexpress, so packaging for multiple machines could be more 
difficult.


Brian




[OMPI devel] FD leak in Altix timer code?

2007-10-23 Thread Paul H. Hargrove
I was just looking at the Altix timer code in the OMPI trunk and, unless 
I am missing something, the fd opened in opal_timer_altix_open() is 
never closed.  It can be safely closed right after the mmap().
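
For illustration, a minimal sketch of that pattern -- not the actual
opal_timer_altix_open() code, and with /dev/zero standing in for the timer
device: the descriptor can be closed immediately after mmap() because the
mapping keeps its own reference to the underlying object.

    /* sketch only: close the fd right after mmap(); the mapping stays valid */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/zero", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);                      /* safe: mapping is independent of fd */
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        printf("first byte: %d\n", p[0]);   /* prints 0 */
        munmap(p, 4096);
        return 0;
    }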


-Paul

--
Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group
HPC Research Department   Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900