[OMPI devel] 1.4.3rc1 over MX

2010-08-26 Thread Scott Atchley
Hi all,

I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not 
like the memory manager and MX. Compiling using --without-memory-manager works 
fine. The output below is from the default configure (i.e. 
--with-memory-manager).

Note, I still see unusual latencies with the BTL for some tests such as 
reduce-scatter, allgather, etc. I do not see them with the MTL. An example of 
BTL latencies from reduce-scatter is:

        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
           256         1000         7.01         7.01         7.01
           512         1000         7.56         7.56         7.56
          1024         1000         8.58         8.58         8.58
          2048         1000        10.36        10.36        10.36
          4096         1000        14.49        14.49        14.49
          8192         1000      5180.16      5180.57      5180.36
         16384         1000        94.96        94.97        94.96
         32768         1000      4676.30      4676.68      4676.49
         65536          640      4625.85      4626.23      4626.04
        131072          320       243.43       243.46       243.45
        262144          160       425.56       425.66       425.61

Scott

% mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22509] *** Process received signal ***
[rain16:22509] Signal: Segmentation fault (11)
[rain16:22509] Signal code: Address not mapped (1)
[rain16:22509] Failing at address: 0x2c0
[rain15:24145] *** Process received signal ***
[rain15:24145] Signal: Segmentation fault (11)
[rain15:24145] Signal code: Address not mapped (1)
[rain15:24145] Failing at address: 0x25a0
--
mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on 
signal 11 (Segmentation fault).
--

gdb shows:

#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
(gdb) bt
#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3  0x2af68e7a47de in opal_backtrace_buffer ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#4  0x2af68e7a24ce in show_stackframe ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#5  <signal handler called>
#6  0x02c0 in ?? ()
#7  0x2af690520640 in mca_mpool_fake_release_memory ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
#8  0x2af68e2f49ce in mca_mpool_base_mem_cb ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#9  0x2af68e78347b in opal_mem_hooks_release_hook ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#10 0x2af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#11 0x2af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#12 0x003d060727c1 in free () from /lib64/libc.so.6
#13 0x2af69197aaad in mx__rl_fini (rl=0xab5f928)
at ../../../libmyriexpress/userspace/../mx__request.c:102
#14 0x2af69196924d in mx_close_endpoint (endpoint=0xab5f820)
at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
#15 0x2af69155e3dc in ompi_mtl_mx_finalize ()
   from 
/nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
#16 0x2af68e2f87e0 in mca_pml_base_select ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#17 0x2af68e2bcf40 in ompi_mpi_init ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#18 0x2af68e2da2b1 in PMPI_Init_thread ()
   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#19 0x00403359 in main ()


If I tell it to use BTLs only it changes to:

% mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22552] *** Process received signal ***
[rain15:24195] *** Process received signal ***
[rain15:24195] Signal: Segmentation fault (11)
[rain15:24195] Signal code: Address not mapped (1)
[rain15:24195] Failing at address: 0x290
[rain16:22552] Signal: Segmentation fault (11)
[rain16:22552] Signal code: Address not mapped (1)
[rain16:22552] Failing at address: 0x290
--
mpirun noticed that process rank 1 with PID 22552 on node rain16 exited on 
signal 11 (Segmentation fault).
--

gdb shows:

#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3  0x2b831

[OMPI devel] 1.5rc5 over MX

2010-08-26 Thread Scott Atchley
Hi all,

Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX 1.2.12).

This version also dies during init due to the memory manager if I do not 
specify which pml to use. If I specify pml ob1 or pml cm, the tests start but 
die with segfaults:

   131072  320   166.86   749.15
[rain15:14939] *** Process received signal ***
[rain15:14939] Signal: Segmentation fault (11)
[rain15:14939] Signal code: Address not mapped (1)
[rain15:14939] Failing at address: 0x3b20

Again, configuring without the memory manager or setting 
OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun works.

Similar latency issues with the BTL and not with the MTL.

Scott


Re: [OMPI devel] 1.5rc5 over MX

2010-08-27 Thread Scott Atchley
Jeff,

Sure, I need to register to file the tickets.

I have not had a chance yet. I will try to look at them first thing next week.

Scott

On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:

> Scott --
> 
> Can you file tickets for this against 1.4 and 1.5?  These should probably be 
> blockers.
> 
> Have you been able to track these down any further, perchance?
> 
> 
> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
> 
>> Hi all,
>> 
>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX 
>> 1.2.12).
>> 
>> This version also dies during init due to the memory manager if I do not 
>> specify which pml to use. If I specify pml ob1 or pml cm, the tests start 
>> but die with segfaults:
>> 
>>  131072  320   166.86   749.15
>> [rain15:14939] *** Process received signal ***
>> [rain15:14939] Signal: Segmentation fault (11)
>> [rain15:14939] Signal code: Address not mapped (1)
>> [rain15:14939] Failing at address: 0x3b20
>> 
>> Again, configuring without the memory manager or setting 
>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun works.
>> 
>> Similar latency issues with the BTL and not with the MTL.
>> 
>> Scott
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] 1.5rc5 over MX

2010-09-01 Thread Scott Atchley
Jeff,

I posted a patch on the ticket.

Scott

On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote:

> Jeff,
> 
> Sure, I need to register to file the tickets.
> 
> I have not had a chance yet. I will try to look at them first thing next week.
> 
> Scott
> 
> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:
> 
>> Scott --
>> 
>> Can you file tickets for this against 1.4 and 1.5?  These should probably be 
>> blockers.
>> 
>> Have you been able to track these down any further, perchance?
>> 
>> 
>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
>> 
>>> Hi all,
>>> 
>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX 
>>> 1.2.12).
>>> 
>>> This version also dies during init due to the memory manager if I do not 
>>> specify which pml to use. If I specify pml ob1 or pml cm, the tests start 
>>> but die with segfaults:
>>> 
>>> 131072  320   166.86   749.15
>>> [rain15:14939] *** Process received signal ***
>>> [rain15:14939] Signal: Segmentation fault (11)
>>> [rain15:14939] Signal code: Address not mapped (1)
>>> [rain15:14939] Failing at address: 0x3b20
>>> 
>>> Again, configuring without the memory manager or setting 
>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun works.
>>> 
>>> Similar latency issues with the BTL and not with the MTL.
>>> 
>>> Scott
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> -- 
>> Jeff Squyres
>> jsquy...@cisco.com
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] 1.4.3rc1 over MX

2010-09-01 Thread Scott Atchley
Jeff,

I posted a patch for this on the ticket.

Scott

On Aug 26, 2010, at 10:10 AM, Scott Atchley wrote:

> Hi all,
> 
> I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not 
> like the memory manager and MX. Compiling using --without-memory-manager 
> works fine. The output below is from the default configure (i.e. 
> --with-memory-manager).
> 
> Note, I still see unusual latencies with the BTL for some tests such as 
> reduce-scatter, allgather, etc. I do not see them with the MTL. An example of 
> BTL latencies from reduce-scatter is:
> 
>        #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>           256         1000         7.01         7.01         7.01
>           512         1000         7.56         7.56         7.56
>          1024         1000         8.58         8.58         8.58
>          2048         1000        10.36        10.36        10.36
>          4096         1000        14.49        14.49        14.49
>          8192         1000      5180.16      5180.57      5180.36
>         16384         1000        94.96        94.97        94.96
>         32768         1000      4676.30      4676.68      4676.49
>         65536          640      4625.85      4626.23      4626.04
>        131072          320       243.43       243.46       243.45
>        262144          160       425.56       425.66       425.61
> 
> Scott
> 
> % mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22509] *** Process received signal ***
> [rain16:22509] Signal: Segmentation fault (11)
> [rain16:22509] Signal code: Address not mapped (1)
> [rain16:22509] Failing at address: 0x2c0
> [rain15:24145] *** Process received signal ***
> [rain15:24145] Signal: Segmentation fault (11)
> [rain15:24145] Signal code: Address not mapped (1)
> [rain15:24145] Failing at address: 0x25a0
> --
> mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on 
> signal 11 (Segmentation fault).
> --
> 
> gdb shows:
> 
> #0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> (gdb) bt
> #0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3  0x2af68e7a47de in opal_backtrace_buffer ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4  0x2af68e7a24ce in show_stackframe ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5  <signal handler called>
> #6  0x02c0 in ?? ()
> #7  0x2af690520640 in mca_mpool_fake_release_memory ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8  0x2af68e2f49ce in mca_mpool_base_mem_cb ()
>   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9  0x2af68e78347b in opal_mem_hooks_release_hook ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x2af68e7a791f in opal_mem_free_ptmalloc2_munmap ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x2af68e7ac2b1 in opal_memory_ptmalloc2_free_hook ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x003d060727c1 in free () from /lib64/libc.so.6
> #13 0x2af69197aaad in mx__rl_fini (rl=0xab5f928)
>at ../../../libmyriexpress/userspace/../mx__request.c:102
> #14 0x2af69196924d in mx_close_endpoint (endpoint=0xab5f820)
>at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
> #15 0x2af69155e3dc in ompi_mtl_mx_finalize ()
>   from 
> /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
> #16 0x2af68e2f87e0 in mca_pml_base_select ()
>   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #17 0x2af68e2bcf40 in ompi_mpi_init ()
>   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #18 0x2af68e2da2b1 in PMPI_Init_thread ()
>   from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x00403359 in main ()
> 
> 
> If I tell it to use BTLs only it changes to:
> 
> % mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22552] *** Process received signal ***
> [rain15:24195] *** Process received signal ***
> [rain15:24195] Signal: Segmentation fault (11)
> [rain15:24195] Signal code: Address not mapped (1)
> [rain15:24195] Failing at address: 0x290
> [rain16:22552] Signal: Segmentation fault (11)

Re: [OMPI devel] 1.4.3rc1 over MX

2010-09-03 Thread Scott Atchley
On Sep 3, 2010, at 8:19 AM, Jeff Squyres wrote:

> On Sep 1, 2010, at 9:10 AM, Scott Atchley wrote:
> 
>> I posted a patch for this on the ticket.
> 
> Will someone be committing this to SVN?
> 
> I re-opened the ticket because just posting a patch to the ticket doesn't 
> actually fix anything.  :-)

We should probably set me up with commit privileges.

Scott


Re: [OMPI devel] 1.5rc5 over MX

2010-09-03 Thread Scott Atchley
Shouldn't the regression be a separate ticket since it is unrelated?

Scott

On Sep 3, 2010, at 8:20 AM, Jeff Squyres wrote:

> Ditto for the v1.5 patch -- it wasn't committed anywhere and no CMR was 
> filed, so I re-opened the ticket.
> 
> Plus you mentioned a 2us (!) latency increase.  Doesn't that need attention, 
> too?
> 
> 
> On Sep 1, 2010, at 9:09 AM, Scott Atchley wrote:
> 
>> Jeff,
>> 
>> I posted a patch on the ticket.
>> 
>> Scott
>> 
>> On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote:
>> 
>>> Jeff,
>>> 
>>> Sure, I need to register to file the tickets.
>>> 
>>> I have not had a chance yet. I will try to look at them first thing next 
>>> week.
>>> 
>>> Scott
>>> 
>>> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:
>>> 
>>>> Scott --
>>>> 
>>>> Can you file tickets for this against 1.4 and 1.5?  These should probably 
>>>> be blockers.
>>>> 
>>>> Have you been able to track these down any further, perchance?
>>>> 
>>>> 
>>>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX 
>>>>> 1.2.12).
>>>>> 
>>>>> This version also dies during init due to the memory manager if I do not 
>>>>> specify which pml to use. If I specify pml ob1 or pml cm, the tests start 
>>>>> but die with segfaults:
>>>>> 
>>>>>   131072  320   166.86   749.15
>>>>> [rain15:14939] *** Process received signal ***
>>>>> [rain15:14939] Signal: Segmentation fault (11)
>>>>> [rain15:14939] Signal code: Address not mapped (1)
>>>>> [rain15:14939] Failing at address: 0x3b20
>>>>> 
>>>>> Again, configuring without the memory manager or setting 
>>>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun works.
>>>>> 
>>>>> Similar latency issues with the BTL and not with the MTL.
>>>>> 
>>>>> Scott
>>>>> ___
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>>> 
>>>> 
>>>> -- 
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> 
>>>> ___
>>>> devel mailing list
>>>> de...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>>> 
>>> 
>>> ___
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
> 
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 




Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Scott Atchley

On Aug 13, 2007, at 4:06 AM, Pavel Shamis (Pasha) wrote:

>>> Any objections?  We can discuss what approaches we want to take
>>> (there's going to be some complications because of the PML driver,
>>> etc.); perhaps in the Tuesday Mellanox teleconf...?
>>
>> My main objection is that the only reason you propose to do this is
>> some bogus benchmark?
>
> Pallas, Presta (as I know) also use static rank. So let's start to fix
> all "bogus" benchmarks :-) ?
>
> Pasha.


Why not:

    for (i = 0; i < ITERATIONS; i++) {
        tag = i % MPI_TAG_UB;
        ...
    }

On a related note, we have often discussed the fact that benchmarks  
only give an upper-bound on performance. I would expect that some  
users would want to also know the lower-bound. For example, set a  
flag that causes the benchmark to use a different buffer each time in  
order to cause the registration cache to miss. I am sure we could  
come up with some other cases.
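
For example, a minimal sketch of such a lower-bound ping-pong (assuming a
two-rank job; the buffer count, message size, and iteration count here are
arbitrary) could rotate through several buffers so a registration cache cannot
keep hitting the same pages, and cycle the tag by querying the bound through
the MPI_TAG_UB attribute:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define ITERATIONS 1000
    #define NBUFS      8                  /* distinct buffers to rotate through */
    #define MSG_SIZE   (128 * 1024)

    int main(int argc, char **argv)
    {
        int rank, peer, i, flag, tag_ub = 32767, *tag_ub_ptr;
        char *bufs[NBUFS];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = rank ^ 1;                  /* assumes exactly two ranks */

        /* MPI_TAG_UB is an attribute key, not a value; query the actual bound */
        MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub_ptr, &flag);
        if (flag) tag_ub = *tag_ub_ptr;

        for (i = 0; i < NBUFS; i++) {
            bufs[i] = malloc(MSG_SIZE);
            memset(bufs[i], 0, MSG_SIZE);
        }

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        for (i = 0; i < ITERATIONS; i++) {
            int   tag = i % tag_ub;       /* do not reuse one static tag      */
            char *buf = bufs[i % NBUFS];  /* new buffer: defeat the reg cache */
            if (rank == 0) {
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else {
                MPI_Recv(buf, MSG_SIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, MSG_SIZE, MPI_CHAR, peer, tag, MPI_COMM_WORLD);
            }
        }
        t1 = MPI_Wtime();
        if (rank == 0)
            printf("avg round trip: %.2f usec\n", (t1 - t0) * 1e6 / ITERATIONS);

        for (i = 0; i < NBUFS; i++) free(bufs[i]);
        MPI_Finalize();
        return 0;
    }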


Scott


Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Scott Atchley

Terry,

Are you testing on Linux? If so, which kernel?

See the patch to iperf to handle kernel 2.6.21 and the issue that  
they had with usleep(0):


http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt

Scott

On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote:


> Ok, I have an update to this issue.  I believe there is an
> implementation difference of sched_yield between Linux and Solaris.  If
> I change the sched_yield in opal_progress to be a usleep(500) then my
> program completes quite quickly.  I have sent a few questions to a
> Solaris engineer and hopefully will get some useful information.
>
> That being said, CT-6's implementation also used yield calls (note this
> actually is what sched_yield reduces down to in Solaris) and we did not
> see the same degradation issue as with Open MPI.  I believe the reason
> is because CT-6's SM implementation is not calling CT-6's version of
> progress recursively and forcing all the unexpected to be read in before
> continuing.  CT-6 also has a natural flow control in its implementation
> (ie it has a fixed set fifo for eager messages).
>
> I believe both of these characteristics lend CT-6 to not being
> completely killed by the yield differences.
>
> --td
>
> Li-Ta Lo wrote:
>> On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:
>>> Li-Ta Lo wrote:
>>>> On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:
>>>>> Li-Ta Lo wrote:
>>>>>> On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:
>>>>>>> hmmm, interesting since my version doesn't abort at all.
>>>>>>
>>>>>> Some problem with fortran compiler/language binding? My C
>>>>>> translation doesn't have any problem.
>>>>>>
>>>>>> [ollie@exponential ~]$ mpirun -np 4 a.out 10
>>>>>> Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
>>>>>> 198.684707
>>>>>
>>>>> Did you oversubscribe?  I found np=10 on a 8 core system clogged
>>>>> things up sufficiently.
>>>>
>>>> Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).
>>>
>>> Is this using Linux?
>>
>> Yes.
>>
>> Ollie
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Scott Atchley

On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote:

> I made some progress: if I configure with "--without-memory-manager"
> (along with all other options that I mentioned before), then it works.
> This was inspired by the fact that the segmentation fault occurred in
> ptmalloc2. I have previously tried to remove the MX support without
> any effect; with ptmalloc2 out of the picture I have had test runs
> over MX and TCP without problems.
>
> Should I file a bug report ? Is there something else that you'd like
> me to try ?

Bogdan,

Which version of MX are you using? Are you enabling the MX
registration cache (regcache)?

Can you try two runs, one exporting MX_RCACHE=1 and one exporting
MX_RCACHE=0 to all processes?

This will rule out any interaction between the OMPI memory manager
and MX's regcache (either caused by or simply exposed by the
Pathscale compiler).


Thanks,

Scott


--
Scott Atchley
Myricom Inc.
http://www.myri.com




Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]

2007-10-23 Thread Scott Atchley

On Oct 23, 2007, at 10:36 AM, Bogdan Costescu wrote:

> I don't get to that point... I am not even able to use the wrapper
> compilers (f.e. mpif90) to obtain an executable to run. The
> segmentation fault happens when Open MPI utilities are being run, even
> ompi_info.

Ahh, I thought you were getting a segfault when it ran, not when compiling.

>> This will rule out any interaction between the OMPI memory manager
>> and MX's regcache (either caused by or simply exposed by the
>> Pathscale compiler).
>
> As I wrote in my previous e-mail, I tried configuring with and without
> the MX libs, but this made no difference. It's only when I disabled
> the memory manager, while still enabling MX, that I was able to get a
> working build.


Sorry for the distraction. :-)

Thanks,

Scott


Re: [OMPI devel] Still troubles with 1.3 and MX

2009-01-22 Thread Scott Atchley

On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:

> I'm still having some troubles using the newly released 1.3 with
> Myricom's MX. I've meant to send a message earlier, but the release
> candidates went so fast that I didn't have time to catch up and test.
>
> General details:
> Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
> Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
> Torque 2.1.10 (but this shouldn't make a difference)
> MX 1.2.7 with a tiny patch from Myricom
> OpenMPI 1.3
> IMB 3.1
>
> OpenMPI was configured with '--enable-shared --enable-static
> --with-mx=... --with-tm=...'
> In all cases, there were no options specified at runtime (either in
> files or on the command line) except for the PML and BTL selection.
>
> Problem 1:
>
> I still see hangs of collective functions when running on a large
> number of nodes (or maybe ranks) with the default OB1+BTL. F.e. with
> 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, the IMB
> hangs in Gather.


Bogdan, this sounds similar to the issue you experienced in December,
which had been fixed. I do not remember whether that was tied to the
default collective or to free list management.

Can you try a run with:

  -mca btl_mx_free_list_max 100

added to the command line?

After that, try additional runs without the above but with:

  --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N

where N is 0, 1, 2, then 3 (one run for each value).


> Problem 2:
>
> When using the CM+MTL with 128 ranks, it finishes fine when running
> on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors
> that I haven't seen before:
>
> Max retransmit retries reached (1000) for message
> Max retransmit retries reached (1000) for message
>    type (2): send_medium
>    state (0x14): buffered dead
>    requeued: 1000 (timeout=51ms)
>    dest: 00:60:dd:47:89:40 (opt029:0)
>    partner: peer_index=146, endpoint=3, seqnum=0x2944
>    type (2): send_medium
>    state (0x14): buffered dead
>    requeued: 1000 (timeout=51ms)
>    dest: 00:60:dd:47:89:40 (opt029:0)
>    partner: peer_index=146, endpoint=3, seqnum=0x2f9a
>    matched_val: 0x0068002a_fff2
>    slength=32768, xfer_length=32768
>    matched_val: 0x0068002b_fff2
>    slength=32768, xfer_length=32768
>    seg: 0x2aaacc30f010,32768
>    caller: 0x5b


These are two overlapping messages from the MX library. It is unable
to send to opt029 (i.e., opt029 is not consuming messages).


> From the MX experts out there, I would also need some help to
> understand what is the source of these messages - I can only see
> opt029 mentioned,


Anyone, does 1.3 support rank labeling of stdout? If so, Bogdan should  
rerun it with --display-map and the option to support labeling.


> so does it try to communicate intra-node? (IOW the equivalent of
> "self" BTL in OpenMPI) This would be somehow consistent with running
> more ranks per node (4) than the successful job (with 2 ranks per
> node).


I am under the impression that the MTLs pass all messages to the  
interconnect. If so, then MX is handling self, shared memory (shmem),  
and host-to-host. Self, by the way, is a single rank (process)  
communicating with itself. In your case, you are using shmem.


> At this point, the job hangs in Alltoallv. The strace output is the
> same as for OB1+BTL above.
>
> Can anyone suggest some ways forward? I'd be happy to help in
> debugging if given some instructions.


I would suggest the same test as above with:

  -mca btl_mx_free_list_max 100

Additionally, try the following tuned collectives for alltoallv:

  --mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N


where N is 0, 1, then 2 (one run for each value).

Scott


Re: [OMPI devel] RFC: PML/CM priority

2009-08-11 Thread Scott Atchley

George,

When asked about MTL versus BTL, we always suggest that users try both  
with their application and determine which is best. I have had  
customers report the BTL is better on Solaris (memory registration is  
expensive and the BTL can overlap registration and communication when  
it fragments a large message) and sometimes better on Linux, but not  
always.
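
As a rough sketch of that overlap (illustrative only; reg_mem, post_send, and
wait_send below are hypothetical stand-ins, not Open MPI or MX calls), the
idea is to fragment the message and register chunk i+1 while chunk i is on
the wire, so the registration cost is hidden behind the transfer:

    #include <stddef.h>

    #define CHUNK (256 * 1024)            /* fragment size; arbitrary here */

    typedef struct { void *addr; size_t len; } reg_t;

    /* Stubs standing in for a real interconnect API. */
    static reg_t reg_mem(void *addr, size_t len) { reg_t r = { addr, len }; return r; }
    static void  unreg_mem(reg_t r)  { (void)r; }
    static void  post_send(reg_t r)  { (void)r; /* hand the fragment to the NIC */ }
    static void  wait_send(void)     { /* block until the posted fragment completes */ }

    void pipelined_send(char *buf, size_t len)
    {
        size_t off = 0;
        size_t n   = (len < CHUNK) ? len : CHUNK;
        reg_t  cur;

        if (len == 0)
            return;
        cur = reg_mem(buf, n);            /* register the first fragment */

        while (off < len) {
            reg_t  next     = { NULL, 0 };
            size_t next_off = off + n;

            post_send(cur);               /* fragment i goes on the wire */

            if (next_off < len) {         /* overlap: register fragment i+1 now */
                size_t m = (len - next_off < CHUNK) ? len - next_off : CHUNK;
                next = reg_mem(buf + next_off, m);
            }

            wait_send();                  /* local completion of fragment i */
            unreg_mem(cur);

            cur = next;
            off = next_off;
            n   = cur.len;
        }
    }

With pinning done per fragment, each chunk's registration is paid for while
the previous chunk is in flight, which is the effect described above.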


The most common issue lately is that users see a failure on high core  
count machines (8 or 16) due to the fact that both the MTL and BTL are  
opening endpoints. They run into the max number of allowable endpoints  
and OMPI aborts. I would suggest that OMPI clearly selects one CM and  
only open endpoints for that CM, if possible.


Scott

On Aug 11, 2009, at 3:29 PM, George Bosilca wrote:

> Here is an alternative solution. If instead of setting a hard coded
> value for the priority of CM, we make it use the priority of the MTL
> that gets selected, we can solve this problem on a case by case
> basis by carefully setting the MTL's priority (bump up the
> portals and PSM ones and decrease the MX MTL). As a result we can
> remove all the extra selection logic and priority management from
> the pml_cm_component.c, and still have a satisfactory solution for
> everybody.
>
>  george.
>
> On Aug 11, 2009, at 15:23 , Brian W. Barrett wrote:
>
>> On Tue, 11 Aug 2009, Rainer Keller wrote:
>>
>>> When compiling on systems with MX or Portals, we offer MTLs and BTLs.
>>> If MTLs are used, the PML/CM is loaded as well as the PML/OB1.
>>>
>>> Question 1: Is favoring OB1 over CM required for any MTL (MX,
>>> Portals, PSM)?
>>
>> George has in the past had strong feelings on this issue, believing
>> that for MX, OB1 is preferred over CM.  For Portals, it's probably
>> in the noise, but the BTL had been better tested than the MTL, so
>> it was left as the default.  Obviously, PSM is a much better choice
>> on InfiniPath than straight OFED, hence the odd priority bump.
>>
>> At this point, I would have no objection to making CM's priority
>> higher for Portals.
>>
>>> Question 2: If it is, I would like to reflect this in the default
>>> priorities, aka have CM have a priority lower than OB1 and in the
>>> case of PSM raising it.
>>
>> I don't have strong feelings on this one.
>>
>> Brian
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel




Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Scott Atchley

On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:

> Brice,
>
> Because MX doesn't provide a real RMA protocol, we created a fake
> one on top of point-to-point. The two peers have to agree on a
> unique tag, then the receiver posts it before the sender starts the
> send. However, as this is integrated with the real RMA protocol,
> where only one side knows about the completion of the RMA operation,
> we still exchange the ACK at the end. Therefore, the receiver
> doesn't need to know when the receive is completed, as it will get
> an ACK from the sender. At least this was the original idea.
>
> But I can see how this might fail if the short ACK from the sender
> manages to pass the RMA operation on the wire. I was under the
> impression (based on the fact that MX respects ordering) that
> mx_send will trigger the completion only when all data is on the
> wire/NIC memory, so I supposed there is _absolutely_ no way for the
> ACK to bypass the last RMA fragments and reach the receiver
> before the recv is really completed. If my supposition is not
> correct, then we should remove the mx_forget and make sure that
> before we mark a fragment as completed we got both completions (the
> one from mx_recv and the remote one).

George,

When is the ACK sent? After the "PUT" completion returns (via
mx_test(), etc.) or simply after calling mx_isend() for the "PUT" but
before the completion?

If the former, the ACK cannot pass the data. If the latter, it is
easily possible, especially if there is a lot of contention (and thus a
lot of route dispersion).

MX only guarantees order of matching (two identical tags will match in
order), not order of completion.
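
A minimal sketch of the "both completions" bookkeeping mentioned above
(frag_t, on_local_completion, and on_remote_ack are hypothetical names, not
OMPI symbols): the fragment is only marked complete once both the local
mx_recv completion and the sender's ACK have arrived, so the relative order
of the two no longer matters.

    #include <stdbool.h>

    typedef struct {
        bool local_done;      /* local mx_recv (via mx_test/mx_wait) completed */
        bool ack_received;    /* the sender's ACK arrived                      */
    } frag_t;

    static void frag_complete(frag_t *frag)
    {
        /* placeholder: release buffers, signal the upper layer, etc. */
        (void)frag;
    }

    /* Called from the local completion path (mx_test/mx_wait reported the recv). */
    void on_local_completion(frag_t *frag)
    {
        frag->local_done = true;
        if (frag->ack_received)
            frag_complete(frag);
    }

    /* Called when the sender's ACK for this fragment is matched. */
    void on_remote_ack(frag_t *frag)
    {
        frag->ack_received = true;
        if (frag->local_done)
            frag_complete(frag);
    }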


Scott


Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?

2009-10-21 Thread Scott Atchley

On Oct 21, 2009, at 3:32 PM, Brice Goglin wrote:

> George Bosilca wrote:
>> On Oct 21, 2009, at 13:42 , Scott Atchley wrote:
>>> On Oct 21, 2009, at 1:25 PM, George Bosilca wrote:
>>>> Because MX doesn't provide a real RMA protocol, we created a fake
>>>> one on top of point-to-point. The two peers have to agree on a
>>>> unique tag, then the receiver posts it before the sender starts the
>>>> send. However, as this is integrated with the real RMA protocol,
>>>> where only one side knows about the completion of the RMA operation,
>>>> we still exchange the ACK at the end. Therefore, the receiver
>>>> doesn't need to know when the receive is completed, as it will get
>>>> an ACK from the sender. At least this was the original idea.
>>>>
>>>> But I can see how this might fail if the short ACK from the sender
>>>> manages to pass the RMA operation on the wire. I was under the
>>>> impression (based on the fact that MX respects ordering) that
>>>> mx_send will trigger the completion only when all data is on the
>>>> wire/NIC memory, so I supposed there is _absolutely_ no way for the
>>>> ACK to bypass the last RMA fragments and reach the receiver
>>>> before the recv is really completed. If my supposition is not
>>>> correct, then we should remove the mx_forget and make sure that
>>>> before we mark a fragment as completed we got both completions (the
>>>> one from mx_recv and the remote one).
>>>
>>> When is the ACK sent? After the "PUT" completion returns (via
>>> mx_test(), etc.) or simply after calling mx_isend() for the "PUT" but
>>> before the completion?
>>
>> The ACK is sent by the PML layer. If I'm not mistaken, it is sent when
>> the completion callback is triggered, which should happen only when
>> the MX BTL detects the completion of the mx_isend (using mx_test).
>> Therefore, I think the ACK is sent in response to the completion of
>> the mx_isend.
>
> Before or after mx_test() doesn't actually matter if it's a
> small/medium. Even if the send(PUT) completes in mx_test(), the data
> could still be on the wire in case of packet loss or so: if it's a
> tiny/small/medium message (it was a medium in my crash), the MX lib
> opportunistically completes the request on the sender before it's
> actually acked by the receiver. Matching is in order, request completion
> is not. There's no strong delivery guarantee here.
>
> Brice


Yes, I was thinking of the rendezvous case (>32 kB) only.

Scott