[OMPI devel] 1.4.3rc1 over MX
Hi all,

I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not like the combination of the memory manager and MX. Compiling using --without-memory-manager works fine. The output below is from the default configure (i.e. --with-memory-manager).

Note, I still see unusual latencies for some tests when using the BTL, such as reduce-scatter, allgather, etc. I do not see them with the MTL. An example of BTL latencies from reduce-scatter is:

 #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
    256         1000         7.01         7.01         7.01
    512         1000         7.56         7.56         7.56
   1024         1000         8.58         8.58         8.58
   2048         1000        10.36        10.36        10.36
   4096         1000        14.49        14.49        14.49
   8192         1000      5180.16      5180.57      5180.36
  16384         1000        94.96        94.97        94.96
  32768         1000      4676.30      4676.68      4676.49
  65536          640      4625.85      4626.23      4626.04
 131072          320       243.43       243.46       243.45
 262144          160       425.56       425.66       425.61

Scott

% mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22509] *** Process received signal ***
[rain16:22509] Signal: Segmentation fault (11)
[rain16:22509] Signal code: Address not mapped (1)
[rain16:22509] Failing at address: 0x2c0
[rain15:24145] *** Process received signal ***
[rain15:24145] Signal: Segmentation fault (11)
[rain15:24145] Signal code: Address not mapped (1)
[rain15:24145] Failing at address: 0x25a0
--
mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on signal 11 (Segmentation fault).
--

gdb shows:

#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
(gdb) bt
#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3  0x2af68e7a47de in opal_backtrace_buffer () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#4  0x2af68e7a24ce in show_stackframe () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#5  <signal handler called>
#6  0x02c0 in ?? ()
#7  0x2af690520640 in mca_mpool_fake_release_memory () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
#8  0x2af68e2f49ce in mca_mpool_base_mem_cb () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#9  0x2af68e78347b in opal_mem_hooks_release_hook () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#10 0x2af68e7a791f in opal_mem_free_ptmalloc2_munmap () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#11 0x2af68e7ac2b1 in opal_memory_ptmalloc2_free_hook () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
#12 0x003d060727c1 in free () from /lib64/libc.so.6
#13 0x2af69197aaad in mx__rl_fini (rl=0xab5f928) at ../../../libmyriexpress/userspace/../mx__request.c:102
#14 0x2af69196924d in mx_close_endpoint (endpoint=0xab5f820) at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
#15 0x2af69155e3dc in ompi_mtl_mx_finalize () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
#16 0x2af68e2f87e0 in mca_pml_base_select () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#17 0x2af68e2bcf40 in ompi_mpi_init () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#18 0x2af68e2da2b1 in PMPI_Init_thread () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
#19 0x00403359 in main ()

If I tell it to use BTLs only, it changes to:

% mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
[rain16:22552] *** Process received signal ***
[rain15:24195] *** Process received signal ***
[rain15:24195] Signal: Segmentation fault (11)
[rain15:24195] Signal code: Address not mapped (1)
[rain15:24195] Failing at address: 0x290
[rain16:22552] Signal: Segmentation fault (11)
[rain16:22552] Signal code: Address not mapped (1)
[rain16:22552] Failing at address: 0x290
--
mpirun noticed that process rank 1 with PID 22552 on node rain16 exited on signal 11 (Segmentation fault).
--

gdb shows:

#0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
#1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
#2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
#3  0x2b831
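For readers tracing the backtrace above: the faulting path is a free() inside MX's endpoint teardown re-entering Open MPI's memory hooks, which hand the address to the mpool release callback. The sketch below illustrates only that chain; the function names cited in the trace (frames #7-#12) are real, but every type, signature, and the stale module pointer are assumptions made purely for illustration.

/* Simplified, hypothetical sketch of the release-hook chain implied by
 * frames #7-#12 above; this is not the actual Open MPI implementation. */
#include <stddef.h>
#include <stdio.h>

typedef void (*release_cb_t)(void *base, size_t size, void *cbdata);

static release_cb_t release_cb;   /* plays the role of mca_mpool_base_mem_cb */
static void *release_cbdata;      /* plays the role of the mpool module      */

/* Plays the role of opal_mem_hooks_release_hook(): called from the
 * intercepted free()/munmap() before memory is returned to the OS. */
static void mem_hooks_release(void *base, size_t size)
{
    if (release_cb != NULL)
        release_cb(base, size, release_cbdata);
}

/* Plays the role of mca_mpool_fake_release_memory(): if cbdata is a
 * stale or uninitialized module pointer, touching a member at a small
 * offset would produce fault addresses like 0x2c0 or 0x290. */
static void fake_release_memory(void *base, size_t size, void *cbdata)
{
    (void)cbdata;
    printf("release %zu bytes at %p\n", size, base);
}

int main(void)
{
    char buf[64];
    release_cb = fake_release_memory;
    mem_hooks_release(buf, sizeof(buf));  /* what the hooked free() ends up triggering */
    return 0;
}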
[OMPI devel] 1.5rc5 over MX
Hi all,

Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX 1.2.12).

This version also dies during init due to the memory manager if I do not specify which pml to use. If I specify pml ob1 or pml cm, the tests start but die with segfaults:

  131072  320  166.86  749.15
[rain15:14939] *** Process received signal ***
[rain15:14939] Signal: Segmentation fault (11)
[rain15:14939] Signal code: Address not mapped (1)
[rain15:14939] Failing at address: 0x3b20

Again, configuring without the memory manager or setting OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun works.

Similar latency issues with the BTL and not with the MTL.

Scott
Re: [OMPI devel] 1.5rc5 over MX
Jeff, Sure, I need to register to file the tickets. I have not had a chance yet. I will try to look at them first thing next week. Scott On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote: > Scott -- > > Can you file tickets for this against 1.4 and 1.5? These should probably be > blockers. > > Have you been able to track these down any further, perchance? > > > On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote: > >> Hi all, >> >> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX >> 1.2.12). >> >> This version also dies during init due to the memory manager if I do not >> specify which pml to use. If I specify pml ob1 or pml cm, the tests start >> but die with segfaults: >> >> 131072 320 166.86 749.15 >> [rain15:14939] *** Process received signal *** >> [rain15:14939] Signal: Segmentation fault (11) >> [rain15:14939] Signal code: Address not mapped (1) >> [rain15:14939] Failing at address: 0x3b20 >> >> Again, configuring without the memory manager or setting >> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work. >> >> Similar latency issues with the BTl and not with the MTL. >> >> Scott >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] 1.5rc5 over MX
Jeff, I posted a patch on the ticket. Scott On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote: > Jeff, > > Sure, I need to register to file the tickets. > > I have not had a chance yet. I will try to look at them first thing next week. > > Scott > > On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote: > >> Scott -- >> >> Can you file tickets for this against 1.4 and 1.5? These should probably be >> blockers. >> >> Have you been able to track these down any further, perchance? >> >> >> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote: >> >>> Hi all, >>> >>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX >>> 1.2.12). >>> >>> This version also dies during init due to the memory manager if I do not >>> specify which pml to use. If I specify pml ob1 or pml cm, the tests start >>> but die with segfaults: >>> >>> 131072 320 166.86 749.15 >>> [rain15:14939] *** Process received signal *** >>> [rain15:14939] Signal: Segmentation fault (11) >>> [rain15:14939] Signal code: Address not mapped (1) >>> [rain15:14939] Failing at address: 0x3b20 >>> >>> Again, configuring without the memory manager or setting >>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work. >>> >>> Similar latency issues with the BTl and not with the MTL. >>> >>> Scott >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> >> >> -- >> Jeff Squyres >> jsquy...@cisco.com >> For corporate legal information go to: >> http://www.cisco.com/web/about/doing_business/legal/cri/ >> >> >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] 1.4.3rc1 over MX
Jeff,

I posted a patch for this on the ticket.

Scott

On Aug 26, 2010, at 10:10 AM, Scott Atchley wrote:

> Hi all,
>
> I compiled 1.4.3rc1 with MX 1.2.12 on RHEL 5.4 (2.6.18-164.el5). It does not like the memory manager and MX. Compiling using --without-memory-manager works fine. The output below is from the default configure (i.e. --with-memory-manager).
>
> Note, I still see unusual latencies for some tests when using the BTL such as reduce-scatter, allgather, etc. I do not see them with the MTL. An example of BTL latencies from reduce-scatter is:
>
>     256  1000     7.01     7.01     7.01
>     512  1000     7.56     7.56     7.56
>    1024  1000     8.58     8.58     8.58
>    2048  1000    10.36    10.36    10.36
>    4096  1000    14.49    14.49    14.49
>    8192  1000  5180.16  5180.57  5180.36
>   16384  1000    94.96    94.97    94.96
>   32768  1000  4676.30  4676.68  4676.49
>   65536   640  4625.85  4626.23  4626.04
>  131072   320   243.43   243.46   243.45
>  262144   160   425.56   425.66   425.61
>
> Scott
>
> % mpirun -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22509] *** Process received signal ***
> [rain16:22509] Signal: Segmentation fault (11)
> [rain16:22509] Signal code: Address not mapped (1)
> [rain16:22509] Failing at address: 0x2c0
> [rain15:24145] *** Process received signal ***
> [rain15:24145] Signal: Segmentation fault (11)
> [rain15:24145] Signal code: Address not mapped (1)
> [rain15:24145] Failing at address: 0x25a0
> --
> mpirun noticed that process rank 1 with PID 22509 on node rain16 exited on signal 11 (Segmentation fault).
> --
>
> gdb shows:
>
> #0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> (gdb) bt
> #0  0x003d084075c8 in ?? () from /lib64/libgcc_s.so.1
> #1  0x003d0840882b in _Unwind_Backtrace () from /lib64/libgcc_s.so.1
> #2  0x003d060e5eb8 in backtrace () from /lib64/libc.so.6
> #3  0x2af68e7a47de in opal_backtrace_buffer () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #4  0x2af68e7a24ce in show_stackframe () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #5  <signal handler called>
> #6  0x02c0 in ?? ()
> #7  0x2af690520640 in mca_mpool_fake_release_memory () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mpool_fake.so
> #8  0x2af68e2f49ce in mca_mpool_base_mem_cb () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #9  0x2af68e78347b in opal_mem_hooks_release_hook () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #10 0x2af68e7a791f in opal_mem_free_ptmalloc2_munmap () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #11 0x2af68e7ac2b1 in opal_memory_ptmalloc2_free_hook () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libopen-pal.so.0
> #12 0x003d060727c1 in free () from /lib64/libc.so.6
> #13 0x2af69197aaad in mx__rl_fini (rl=0xab5f928) at ../../../libmyriexpress/userspace/../mx__request.c:102
> #14 0x2af69196924d in mx_close_endpoint (endpoint=0xab5f820) at ../../../libmyriexpress/userspace/../mx_close_endpoint.c:124
> #15 0x2af69155e3dc in ompi_mtl_mx_finalize () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/openmpi/mca_mtl_mx.so
> #16 0x2af68e2f87e0 in mca_pml_base_select () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #17 0x2af68e2bcf40 in ompi_mpi_init () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #18 0x2af68e2da2b1 in PMPI_Init_thread () from /nfs/home/atchley/projects/openmpi-1.4.3rc1/build/rain/lib/libmpi.so.0
> #19 0x00403359 in main ()
>
> If I tell it to use BTLs only it changes to:
>
> % mpirun -mca pml ob1 -hostfile hosts -np 2 ./IMB-MPI1.ompi-1.4.3rc1 pingpong
> [rain16:22552] *** Process received signal ***
> [rain15:24195] *** Process received signal ***
> [rain15:24195] Signal: Segmentation fault (11)
> [rain15:24195] Signal code: Address not mapped (1)
> [rain15:24195] Failing at address: 0x290
> [rain16:22552] Signal: Segmentation fault (11)
Re: [OMPI devel] 1.4.3rc1 over MX
On Sep 3, 2010, at 8:19 AM, Jeff Squyres wrote: > On Sep 1, 2010, at 9:10 AM, Scott Atchley wrote: > >> I posted a patch for this on the ticket. > > Will someone be committing this to SVN? > > I re-opened the ticket because just posting a patch to the ticket doesn't > actually fix anything. :-) We should probably set me up with commit privileges. Scott
Re: [OMPI devel] 1.5rc5 over MX
Shouldn't the regression be a separate ticket since it is unrelated? Scott On Sep 3, 2010, at 8:20 AM, Jeff Squyres wrote: > Ditto for the v1.5 patch -- it wasn't committed anywhere and no CMR was > filed, so I re-opened the ticket. > > Plus you mentioned a 2us (!) latency increase. Doesn't that need attention, > too? > > > On Sep 1, 2010, at 9:09 AM, Scott Atchley wrote: > >> Jeff, >> >> I posted a patch on the ticket. >> >> Scott >> >> On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote: >> >>> Jeff, >>> >>> Sure, I need to register to file the tickets. >>> >>> I have not had a chance yet. I will try to look at them first thing next >>> week. >>> >>> Scott >>> >>> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote: >>> >>>> Scott -- >>>> >>>> Can you file tickets for this against 1.4 and 1.5? These should probably >>>> be blockers. >>>> >>>> Have you been able to track these down any further, perchance? >>>> >>>> >>>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote: >>>> >>>>> Hi all, >>>>> >>>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and MX >>>>> 1.2.12). >>>>> >>>>> This version also dies during init due to the memory manager if I do not >>>>> specify which pml to use. If I specify pml ob1 or pml cm, the tests start >>>>> but die with segfaults: >>>>> >>>>> 131072 320 166.86 749.15 >>>>> [rain15:14939] *** Process received signal *** >>>>> [rain15:14939] Signal: Segmentation fault (11) >>>>> [rain15:14939] Signal code: Address not mapped (1) >>>>> [rain15:14939] Failing at address: 0x3b20 >>>>> >>>>> Again, configuring without the memory manager or setting >>>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work. >>>>> >>>>> Similar latency issues with the BTl and not with the MTL. >>>>> >>>>> Scott >>>>> ___ >>>>> devel mailing list >>>>> de...@open-mpi.org >>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>>> >>>> >>>> -- >>>> Jeff Squyres >>>> jsquy...@cisco.com >>>> For corporate legal information go to: >>>> http://www.cisco.com/web/about/doing_business/legal/cri/ >>>> >>>> >>>> ___ >>>> devel mailing list >>>> de...@open-mpi.org >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >>> >>> >>> ___ >>> devel mailing list >>> de...@open-mpi.org >>> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > > > -- > Jeff Squyres > jsquy...@cisco.com > For corporate legal information go to: > http://www.cisco.com/web/about/doing_business/legal/cri/ > >
Re: [OMPI devel] openib btl header caching
On Aug 13, 2007, at 4:06 AM, Pavel Shamis (Pasha) wrote:

>> Any objections? We can discuss what approaches we want to take (there's going to be some complications because of the PML driver, etc.); perhaps in the Tuesday Mellanox teleconf...?
>
> My main objection is that the only reason you propose to do this is some bogus benchmark?
> Pallas, Presta (as i know) also use static rank. So lets start to fix all "bogus" benchmarks :-) ?
>
> Pasha.

Why not:

for (i = 0; i < ITERATIONS; i++) {
    tag = i % MPI_TAG_UB;
    ...
}

On a related note, we have often discussed the fact that benchmarks only give an upper-bound on performance. I would expect that some users would want to also know the lower-bound. For example, set a flag that causes the benchmark to use a different buffer each time in order to cause the registration cache to miss. I am sure we could come up with some other cases.

Scott
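To make the lower-bound idea concrete, here is a rough sketch of a ping-pong that rotates both the tag and the buffer so a registration cache never gets a hit. The buffer count, message size, and tag bound are arbitrary assumptions, not values taken from any existing benchmark.

/* Lower-bound ping-pong sketch: a different buffer (and tag) every
 * iteration defeats registration caching on both sides. */
#include <mpi.h>
#include <stdlib.h>

#define NBUF  64            /* assumed to be more buffers than the cache holds */
#define LEN   (1 << 20)     /* 1 MB messages */
#define ITERS 100

int main(int argc, char **argv)
{
    int rank;
    char *buf[NBUF];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < NBUF; i++)
        buf[i] = malloc(LEN);

    for (int i = 0; i < ITERS; i++) {
        char *b  = buf[i % NBUF];   /* rotate buffers to miss the regcache */
        int  tag = i % 32767;       /* rotate tags within the minimum MPI tag bound */
        if (rank == 0) {
            MPI_Send(b, LEN, MPI_CHAR, 1, tag, MPI_COMM_WORLD);
            MPI_Recv(b, LEN, MPI_CHAR, 1, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(b, LEN, MPI_CHAR, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(b, LEN, MPI_CHAR, 0, tag, MPI_COMM_WORLD);
        }
    }

    for (int i = 0; i < NBUF; i++)
        free(buf[i]);
    MPI_Finalize();
    return 0;
}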
Re: [OMPI devel] SM BTL hang issue
Terry, Are you testing on Linux? If so, which kernel? See the patch to iperf to handle kernel 2.6.21 and the issue that they had with usleep(0): http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt Scott On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote: Ok, I have an update to this issue. I believe there is an implementation difference of sched_yield between Linux and Solaris. If I change the sched_yield in opal_progress to be a usleep(500) then my program completes quite quickly. I have sent a few questions to a Solaris engineer and hopefully will get some useful information. That being said, CT-6's implementation also used yield calls (note this actually is what sched_yield reduces down to in Solaris) and we did not see the same degradation issue as with Open MPI. I believe the reason is because CT-6's SM implementation is not calling CT-6's version of progress recursively and forcing all the unexpected to be read in before continuing. CT-6 also has a natural flow control in it's implementation (ie it has a fixed set fifo for eager messages. I believe both of these characteristics lend CT-6 to not being completely killed by the yield differences. --td Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote: hmmm, interesting since my version doesn't abort at all. Some problem with fortran compiler/language binding? My C translation doesn't have any problem. [ollie@exponential ~]$ mpirun -np 4 a.out 10 Target duration (seconds): 10.00, #of msgs: 50331, usec per msg: 198.684707 Did you oversubscribe? I found np=10 on a 8 core system clogged things up sufficiently. Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads). Is this using Linux? Yes. Ollie ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
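As a toy illustration of the experiment Terry describes (this is not the real opal_progress() code; try_progress() and the 500 microsecond value are placeholders):

/* Toy back-off loop: yield vs. a short sleep while waiting for progress. */
#include <sched.h>
#include <unistd.h>

#define USE_USLEEP 1

static int pending = 200;                      /* stand-in for real pending work */
static int try_progress(void) { return --pending <= 0; }

int main(void)
{
    while (!try_progress()) {
#if USE_USLEEP
        usleep(500);      /* actually blocks, so the peer process gets CPU time
                             even when the scheduler declines to switch on a
                             bare yield (behavior differs by OS and kernel)    */
#else
        sched_yield();    /* may return immediately; see the 2.6.21 iperf
                             usleep(0) patch above for a related pitfall       */
#endif
    }
    return 0;
}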
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 6:33 AM, Bogdan Costescu wrote: I made some progress: if I configure with "--without-memory-manager" (along with all other options that I mentioned before), then it works. This was inspired by the fact that the segmentation fault occured in ptmalloc2. I have previously tried to remove the MX support without any effect; with ptmalloc2 out of the picture I have had test runs over MX and TCP without problems. Should I file a bug report ? Is there something else that you'd like me to try ? Bogdan, Which version of MX are you using? Are you enabling the MX registration cache (regcache)? Can you try two runs, one exporting MX_RCACHE=1 and one exporting MX_RCACHE=0 to all processes? This will rule out any interaction between the OMPI memory manager and MX's regcache (either caused by or simply exposed by the Pathscale compiler). Thanks, Scott -- Scott Atchley Myricom Inc. http://www.myri.com
Re: [OMPI devel] PathScale 3.0 problems with Open MPI 1.2.[34]
On Oct 23, 2007, at 10:36 AM, Bogdan Costescu wrote: I don't get to that point... I am not even able to use the wrapper compilers (f.e. mpif90) to obtain an executable to run. The segmentation fault happens when Open MPI utilities are being run, even ompi_info. Ahh, I thought you were getting a segfault when it ran, not compiling. This will rule out any interaction between the OMPI memory manager and MX's regcache (either caused by or simply exposed by the Pathscale compiler). As I wrote in my previous e-mail, I tried configuring with and without the MX libs, but this made no difference. It's only when I disabled the memory manager, while still enabling MX, that I was able to get a working build. Sorry for the distraction. :-) Thanks, Scott
Re: [OMPI devel] Still troubles with 1.3 and MX
On Jan 22, 2009, at 9:18 AM, Bogdan Costescu wrote:

I'm still having some troubles using the newly released 1.3 with Myricom's MX. I've meant to send a message earlier, but the release candidates went so fast that I didn't have time to catch up and test.

General details:
Nodes with dual CPU, dual core Opteron 2220, 8 GB RAM
Debian etch x86_64, self-compiled kernel 2.6.22.18, gcc-4.1
Torque 2.1.10 (but this shouldn't make a difference)
MX 1.2.7 with a tiny patch from Myricom
OpenMPI 1.3
IMB 3.1

OpenMPI was configured with '--enable-shared --enable-static --with-mx=... --with-tm=...' In all cases, there were no options specified at runtime (either in files or on the command line) except for the PML and BTL selection.

Problem 1: I still see hangs of collective functions when running on large number of nodes (or maybe ranks) with the default OB1+BTL. F.e. with 128 ranks distributed as nodes=32:ppn=4 or nodes=64:ppn=2, the IMB hangs in Gather.

Bogdan, this sounds like a similar issue to what you experienced in December and that had been fixed. I do not remember if this was tied to the default collective or to free list management. Can you try a run with:

-mca btl_mx_free_list_max 100

added to the command line? After that, try additional runs without the above but with:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_gather_algorithm N

where N is 0, 1, 2, then 3 (one run for each value).

Problem 2: When using the CM+MTL with 128 ranks, it finishes fine when running on nodes=64:ppn=2, but on nodes=32:ppn=4 I get a stream of errors that I haven't seen before:

Max retransmit retries reached (1000) for message Max retransmit retries reached (1000) for message type (2): send_medium state (0x14): buffered dead requeued: 1000 (timeout=51ms) dest: 00:60:dd:47:89:40 (opt029:0) partner: peer_index=146, endpoint=3, seqnum=0x2944 type (2): send_medium state (0x14): buffered dead requeued: 1000 (timeout=51ms) dest: 00:60:dd:47:89:40 (opt029:0) partner: peer_index=146, endpoint=3, seqnum=0x2f9a matched_val: 0x0068002a_fff2 slength=32768, xfer_length=32768 matched_val: 0x0068002b_fff2 slength=32768, xfer_length=32768 seg: 0x2aaacc30f010,32768 caller: 0x5b

These are two overlapped messages from the MX library. It is unable to send to opt029 (i.e. opt029 is not consuming messages).

From the MX experts out there, I would also need some help to understand what is the source of these messages - I can only see opt029 mentioned,

Anyone, does 1.3 support rank labeling of stdout? If so, Bogdan should rerun it with --display-map and the option to support labeling.

so does it try to communicate intra-node ? (IOW the equivalent of "self" BTL in OpenMPI) This would be somehow consistent with running more ranks per node (4) than the successful job (with 2 ranks per node).

I am under the impression that the MTLs pass all messages to the interconnect. If so, then MX is handling self, shared memory (shmem), and host-to-host. Self, by the way, is a single rank (process) communicating with itself. In your case, you are using shmem.

At this point, the job hangs in Alltoallv. The strace output is the same as for OB1+BTL above. Can anyone suggest some ways forward ? I'd be happy to help in debugging if given some instructions.

I would suggest the same test as above with:

-mca btl_mx_free_list_max 100

Additionally, try the following tuned collectives for alltoallv:

--mca coll_tuned_use_dynamic_rules 1 --mca coll_tuned_alltoallv_algorithm N

where N is 0, 1, then 2 (one run for each value).

Scott
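For background on the btl_mx_free_list_max suggestion: capping the free list acts as crude flow control, since a sender that runs out of fragments must progress the library instead of piling more unexpected messages onto a slow receiver. Below is a minimal sketch of that bounded-list behavior, with invented names rather than the real ompi_free_list_t API.

/* Bounded free list: grows on demand up to 'max', then forces the
 * caller to wait/progress instead of allocating more fragments. */
#include <stdio.h>
#include <stdlib.h>

struct frag { struct frag *next; char payload[4096]; };

struct free_list {
    struct frag *head;      /* returned fragments ready for reuse */
    int allocated;          /* fragments created so far           */
    int max;                /* e.g. btl_mx_free_list_max = 100    */
};

static struct frag *free_list_get(struct free_list *fl)
{
    if (fl->head != NULL) {             /* reuse a returned fragment */
        struct frag *f = fl->head;
        fl->head = f->next;
        return f;
    }
    if (fl->allocated < fl->max) {      /* grow, but only up to max  */
        fl->allocated++;
        return malloc(sizeof(struct frag));
    }
    return NULL;   /* exhausted: caller must progress and retry */
}

int main(void)
{
    struct free_list fl = { NULL, 0, 100 };
    int n = 0;
    while (free_list_get(&fl) != NULL)
        n++;
    printf("ran out after %d fragments\n", n);   /* prints 100 */
    return 0;
}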
Re: [OMPI devel] RFC: PML/CM priority
George, When asked about MTL versus BTL, we always suggest that users try both with their application and determine which is best. I have had customers report the BTL is better on Solaris (memory registration is expensive and the BTL can overlap registration and communication when it fragments a large message) and sometimes better on Linux, but not always. The most common issue lately is that users see a failure on high core count machines (8 or 16) due to the fact that both the MTL and BTL are opening endpoints. They run into the max number of allowable endpoints and OMPI aborts. I would suggest that OMPI clearly selects one CM and only open endpoints for that CM, if possible. Scott On Aug 11, 2009, at 3:29 PM, George Bosilca wrote: Here is an alternative solution. If instead of setting a hard coded value for the priority of CM, we make it use the priority of the MTL that get selected, we can solve this problem on a case by case approach by carefully setting the MTL's priority (bump up the portals and PSM one and decrease the MX MTL). As a result we can remove all the extra selection logic and priority management from the pml_cm_component.c, and still have a satisfactory solution for everybody. george. On Aug 11, 2009, at 15:23 , Brian W. Barrett wrote: On Tue, 11 Aug 2009, Rainer Keller wrote: When compiling on systems with MX or Portals, we offer MTLs and BTLs. If MTLs are used, the PML/CM is loaded as well as the PML/OB1. Question 1: Is favoring OB1 over CM required for any MTL (MX, Portals, PSM)? George has in the past had srtong feelings on this issue, believing that for MX, OB1 is prefered over CM. For Portals, it's probably in the noise, but the BTL had been better tested than the MTL, so it was left as the default. Obviously, PSM is a much better choice on InfiniPath than straight OFED, hence the odd priority bump. At this point, I would have no objection to making CM's priority higher for Portals. Question 2: If it is, I would like to reflect this in the default priorities, aka have CM have a priority lower than OB1 and in the case of PSM raising it. I don't have strong feelings on this one. Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
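A sketch of what George's alternative could look like at selection time; the struct and function names below are invented for illustration, and the real logic would live in pml_cm_component.c and the MTL components.

/* Let the CM PML inherit the priority of whichever MTL was selected,
 * so per-network preferences live in the MTL priorities alone. */
struct mtl_component {
    const char *name;
    int priority;      /* e.g. PSM/Portals bumped up, the MX MTL left low */
};

static int cm_component_priority(const struct mtl_component *selected_mtl,
                                 int fallback_priority)
{
    if (selected_mtl == NULL)            /* no MTL usable: let OB1 win */
        return fallback_priority;
    return selected_mtl->priority;       /* no extra logic in pml_cm   */
}

int main(void)
{
    struct mtl_component psm = { "psm", 40 };   /* made-up priority values */
    return cm_component_priority(&psm, 10) == 40 ? 0 : 1;
}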
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
On Oct 21, 2009, at 1:25 PM, George Bosilca wrote: Brice, Because MX doesn't provide a real RMA protocol, we created a fake one on top of point-to-point. The two peers have to agree on a unique tag, then the receiver posts it before the sender starts the send. However, as this is integrated with the real RMA protocol, where only one side knows about the completion of the RMA operation, we still exchange the ACK at the end. Therefore, the receiver doesn't need to know when the receive is completed, as it will get an ACK from the sender. At least this was the original idea. But I can see how this might fails if the short ACK from the sender manage to pass the RMA operation on the wire. I was under the impression (based on the fact that MX respect the ordering) that the mx_send will trigger the completion only when all data is on the wire/nic memory so I supposed there is _absolutely_ no way for the ACK to bypass the last RMA fragments and to reach the receiver before the recv is really completed. If my supposition is not correct, then we should remove the mx_forget and make sure the that before we mark a fragment as completed we got both completions (the one from mx_recv and the remote one). George, When is the ACK sent? After the "PUT" completion returns (via mx_test(), etc) or simply after calling mx_isend() for the "PUT" but before the completion? If the former, the ACK cannot pass the data. If the latter, it is easily possible especially if there is a lot of contention (and thus a lot of route dispersion). MX only guarantees order of matching (two identical tags will match in order), not order of completion. Scott
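Restating the ordering question in schematic form, with every MX/PML call reduced to a stub (none of these function names are real; only the sequence matters):

/* Schematic of the emulated RDMA PUT over MX point-to-point. */
#include <stdio.h>

static void post_recv(int tag)          { printf("target: post recv, tag %d\n", tag); }
static void forget_recv(int tag)        { printf("target: mx_forget the recv, tag %d\n", tag); }
static void send_data(int tag)          { printf("origin: send PUT data, tag %d\n", tag); }
static void wait_local_send_done(void)  { printf("origin: local send completion (mx_test)\n"); }
static void send_pml_ack(void)          { printf("origin: send PML-level ACK\n"); }

int main(void)
{
    int tag = 42;              /* the tag both peers agreed on */

    /* Target: pre-post the landing zone, then forget it and rely on
     * the sender's ACK for completion (the mx_forget in question). */
    post_recv(tag);
    forget_recv(tag);

    /* Origin: local send completion is not delivery.  For small and
     * medium messages MX can complete the send before the data is in
     * the target's buffer, so the ACK below may "pass" the data.     */
    send_data(tag);
    wait_local_send_done();
    send_pml_ack();
    return 0;
}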
Re: [OMPI devel] why mx_forget in mca_btl_mx_prepare_dst?
On Oct 21, 2009, at 3:32 PM, Brice Goglin wrote: George Bosilca wrote: On Oct 21, 2009, at 13:42 , Scott Atchley wrote: On Oct 21, 2009, at 1:25 PM, George Bosilca wrote: Because MX doesn't provide a real RMA protocol, we created a fake one on top of point-to-point. The two peers have to agree on a unique tag, then the receiver posts it before the sender starts the send. However, as this is integrated with the real RMA protocol, where only one side knows about the completion of the RMA operation, we still exchange the ACK at the end. Therefore, the receiver doesn't need to know when the receive is completed, as it will get an ACK from the sender. At least this was the original idea. But I can see how this might fails if the short ACK from the sender manage to pass the RMA operation on the wire. I was under the impression (based on the fact that MX respect the ordering) that the mx_send will trigger the completion only when all data is on the wire/nic memory so I supposed there is _absolutely_ no way for the ACK to bypass the last RMA fragments and to reach the receiver before the recv is really completed. If my supposition is not correct, then we should remove the mx_forget and make sure the that before we mark a fragment as completed we got both completions (the one from mx_recv and the remote one). When is the ACK sent? After the "PUT" completion returns (via mx_test(), etc) or simply after calling mx_isend() for the "PUT" but before the completion? The ACK is sent by the PML layer. If I'm not mistaken, it is sent when the completion callback is triggered, which should happen only when the MX BTL detect the completion of the mx_isend (using the mx_test). Therefore, I think the ACK is sent in response to the completion of the mx_isend. Before or after mx_test() doesn't actually matter if it's a small/medium. Even if the send(PUT) completes in mx_test(), the data could still be on the wire in case of packet loss or so: if it's a tiny/small/medium message (it's was a medium in my crash), the MX lib opportunistically completes the request on the sender before it's actually acked by the receiver. Matching is in order, request completion is not. There's no strong delivery guarantee here. Brice Yes, I was thinking of the rendezvous case (>32 kB) only. Scott
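The remedy George mentioned earlier in the thread (complete the fragment only after both the local receive completion and the sender's ACK) reduces to a two-flag test on the receive side, roughly like this illustration with invented field names:

/* Two-event completion for the emulated PUT: the fragment is done only
 * when the local data receive has completed AND the sender's PML ACK
 * has arrived, regardless of which event lands first. */
#include <stdio.h>

struct put_frag {
    int recv_done;   /* set when mx_test() reports the data recv complete  */
    int ack_seen;    /* set when the PML-level ACK for this fragment lands */
};

static int frag_try_complete(const struct put_frag *f)
{
    return f->recv_done && f->ack_seen;   /* both, in either order */
}

int main(void)
{
    struct put_frag f = { 0, 0 };

    f.ack_seen = 1;                                     /* ACK passed the data...   */
    printf("complete? %d\n", frag_try_complete(&f));    /* 0: keep polling progress */

    f.recv_done = 1;                                    /* ...data finally arrives  */
    printf("complete? %d\n", frag_try_complete(&f));    /* 1: safe to finish        */
    return 0;
}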