[OMPI devel] [PATCH] fix mx btl_bandwidth
For some reason, the MX btl sets btl_bandwidth in megabits/s instead of
megabytes/s. So we get crazy btl_weights in case of heterogeneous
multirail. And --mca btl_mx_bandwidth cannot work around the
problem (it probably doesn't help because it's overridden by the runtime
link width detection anyway?).

Signed-off-by: Brice Goglin

Index: ompi/mca/btl/mx/btl_mx_component.c
===
--- ompi/mca/btl/mx/btl_mx_component.c (revision 23711)
+++ ompi/mca/btl/mx/btl_mx_component.c (working copy)
@@ -159,7 +159,7 @@
     MCA_BTL_FLAGS_PUT |
     MCA_BTL_FLAGS_SEND |
     MCA_BTL_FLAGS_RDMA_MATCHED);
-    mca_btl_mx_module.super.btl_bandwidth = 2000;
+    mca_btl_mx_module.super.btl_bandwidth = 250;
     mca_btl_mx_module.super.btl_latency = 5;
     mca_btl_base_param_register(&mca_btl_mx_component.super.btl_version,
     &mca_btl_mx_module.super);
@@ -357,7 +357,7 @@
     mx_btl->mx_endpoint = mx_endpoint;
     mx_btl->mx_endpoint_addr = mx_endpoint_addr;
 
-    mx_btl->super.btl_bandwidth = 2000; /* whatever */
+    mx_btl->super.btl_bandwidth = 250; /* whatever */
     mx_btl->super.btl_latency = 10;
 #if defined(MX_HAS_NET_TYPE)
     {
@@ -370,11 +370,11 @@
     } else {
         if( MX_SPEED_2G == value ) {
             mx_unique_network_id |= 0xaa00;
-            mx_btl->super.btl_bandwidth = 2000;
+            mx_btl->super.btl_bandwidth = 250;
             mx_btl->super.btl_latency = 5;
         } else if( MX_SPEED_10G == value ) {
             mx_unique_network_id |= 0xbb00;
-            mx_btl->super.btl_bandwidth = 1;
+            mx_btl->super.btl_bandwidth = 1250;
             mx_btl->super.btl_latency = 3;
         } else {
             mx_unique_network_id |= 0xcc00;
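
A minimal sketch of the arithmetic behind this patch, for readers following the unit discussion. The conversion is a division by 8 (2000 megabits/s is 250 megabytes/s). The share calculation assumes that each rail receives traffic in proportion to its advertised btl_bandwidth; that proportional rule is a simplification chosen for illustration and is not quoted from the Open MPI multi-rail code. The helper names mbit_to_mbyte and rail_share are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: convert megabits/s to megabytes/s (8 bits per byte). */
unsigned mbit_to_mbyte(unsigned mbits)
{
    return mbits / 8;
}

/* Hypothetical helper: fraction of traffic given to rail i, ASSUMING shares
 * are proportional to the advertised bandwidth values. The values in bw[]
 * only produce sensible shares if every rail uses the same unit. */
double rail_share(const unsigned *bw, size_t n, size_t i)
{
    double total = 0.0;
    size_t k;

    for (k = 0; k < n; k++) {
        total += bw[k];
    }
    return total > 0.0 ? bw[i] / total : 0.0;
}
```

Under that assumption, a rail advertising 2000 next to a rail advertising 800 gets about 71% of the traffic; if the first rail advertises 250 while the second still advertises 800, its share drops to about 24%. Either unit gives sane shares as long as all BTLs agree on it.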
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
Thanks; committed in r23712.

Can you file CMRs for 1.4 and 1.5? Thanks.

On Sep 3, 2010, at 3:53 AM, Brice Goglin wrote:

> For some reason, the MX btl sets btl_bandwidth in megabits/s instead of
> megabytes/s. So we get crazy btl_weights in case of heterogeneous
> multirail. And --mca btl_mx_bandwidth cannot work around the
> problem (it probably doesn't help because it's overridden by the runtime
> link width detection anyway?).
>
> Signed-off-by: Brice Goglin
>
> Index: ompi/mca/btl/mx/btl_mx_component.c
> ===
> --- ompi/mca/btl/mx/btl_mx_component.c (revision 23711)
> +++ ompi/mca/btl/mx/btl_mx_component.c (working copy)
> @@ -159,7 +159,7 @@
>      MCA_BTL_FLAGS_PUT |
>      MCA_BTL_FLAGS_SEND |
>      MCA_BTL_FLAGS_RDMA_MATCHED);
> -    mca_btl_mx_module.super.btl_bandwidth = 2000;
> +    mca_btl_mx_module.super.btl_bandwidth = 250;
>      mca_btl_mx_module.super.btl_latency = 5;
>      mca_btl_base_param_register(&mca_btl_mx_component.super.btl_version,
>      &mca_btl_mx_module.super);
> @@ -357,7 +357,7 @@
>      mx_btl->mx_endpoint = mx_endpoint;
>      mx_btl->mx_endpoint_addr = mx_endpoint_addr;
>
> -    mx_btl->super.btl_bandwidth = 2000; /* whatever */
> +    mx_btl->super.btl_bandwidth = 250; /* whatever */
>      mx_btl->super.btl_latency = 10;
>  #if defined(MX_HAS_NET_TYPE)
>      {
> @@ -370,11 +370,11 @@
>      } else {
>          if( MX_SPEED_2G == value ) {
>              mx_unique_network_id |= 0xaa00;
> -            mx_btl->super.btl_bandwidth = 2000;
> +            mx_btl->super.btl_bandwidth = 250;
>              mx_btl->super.btl_latency = 5;
>          } else if( MX_SPEED_10G == value ) {
>              mx_unique_network_id |= 0xbb00;
> -            mx_btl->super.btl_bandwidth = 1;
> +            mx_btl->super.btl_bandwidth = 1250;
>              mx_btl->super.btl_latency = 3;
>          } else {
>              mx_unique_network_id |= 0xcc00;
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] OMPI 1.5 twitter notification plugin probably broken by switch to OAUTH
On Sep 1, 2010, at 7:15 AM, Chris Samuel wrote:

> Looking at the code for the Twitter notifier in OMPI 1.5
> and seeing its use of HTTP basic authentication I would
> suggest that it may be non-functional due to Twitter's
> switch to purely OAuth-based authentication for their API.

Oy; I got that notice from Twitter, too.

I'm afraid I don't know much about OAuth -- would anyone be interested in submitting a patch to make the Twitter notifier use OAuth?

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] openib btl - fatal errors don't abort the job
On Sep 1, 2010, at 4:47 PM, Steve Wise wrote:

> I was wondering what the logic is behind allowing an MPI job to continue in
> the presence of a fatal qp error?

It's a feature...?

> Note the "will try to continue" sentence:
>
> --
> The OpenFabrics stack has reported a network error event. Open MPI
> will try to continue, but your job may end up failing.
>
>   Local host:        escher
>   MPI process PID:   19136
>   Error number:      1 (IBV_EVENT_QP_FATAL)
>
> This error may indicate connectivity problems within the fabric;
> please contact your system administrator.
> --
>
> Due to other bugs I'm chasing, I get these sorts of errors, and I notice the
> job just hangs even though the connections have been disconnected, the qps
> flushed, and all pending WRs completed with status == FLUSH.

Would it be better to make it a fatal error? (I'm thinking probably "yes")

Feel free to submit a patch...

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] 1/4/3rc1 over MX
On Sep 1, 2010, at 9:10 AM, Scott Atchley wrote:

> I posted a patch for this on the ticket.

Will someone be committing this to SVN?

I re-opened the ticket because just posting a patch to the ticket doesn't actually fix anything. :-)

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] 1.5rc5 over MX
Ditto for the v1.5 patch -- it wasn't committed anywhere and no CMR was filed, so I re-opened the ticket.

Plus you mentioned a 2us (!) latency increase. Doesn't that need attention, too?

On Sep 1, 2010, at 9:09 AM, Scott Atchley wrote:

> Jeff,
>
> I posted a patch on the ticket.
>
> Scott
>
> On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote:
>
>> Jeff,
>>
>> Sure, I need to register to file the tickets.
>>
>> I have not had a chance yet. I will try to look at them first thing next
>> week.
>>
>> Scott
>>
>> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:
>>
>>> Scott --
>>>
>>> Can you file tickets for this against 1.4 and 1.5? These should probably
>>> be blockers.
>>>
>>> Have you been able to track these down any further, perchance?
>>>
>>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
>>>
>>>> Hi all,
>>>>
>>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and
>>>> MX 1.2.12).
>>>>
>>>> This version also dies during init due to the memory manager if I do
>>>> not specify which pml to use. If I specify pml ob1 or pml cm, the
>>>> tests start but die with segfaults:
>>>>
>>>> 131072 320 166.86 749.15
>>>> [rain15:14939] *** Process received signal ***
>>>> [rain15:14939] Signal: Segmentation fault (11)
>>>> [rain15:14939] Signal code: Address not mapped (1)
>>>> [rain15:14939] Failing at address: 0x3b20
>>>>
>>>> Again, configuring without the memory manager or setting
>>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work.
>>>>
>>>> Similar latency issues with the BTL and not with the MTL.
>>>>
>>>> Scott

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] 1/4/3rc1 over MX
On Sep 3, 2010, at 8:19 AM, Jeff Squyres wrote:

> On Sep 1, 2010, at 9:10 AM, Scott Atchley wrote:
>
>> I posted a patch for this on the ticket.
>
> Will someone be committing this to SVN?
>
> I re-opened the ticket because just posting a patch to the ticket doesn't
> actually fix anything. :-)

We should probably set me up with commit privileges.

Scott
Re: [OMPI devel] 1.5rc5 over MX
Shouldn't the regression be a separate ticket since it is unrelated?

Scott

On Sep 3, 2010, at 8:20 AM, Jeff Squyres wrote:

> Ditto for the v1.5 patch -- it wasn't committed anywhere and no CMR was
> filed, so I re-opened the ticket.
>
> Plus you mentioned a 2us (!) latency increase. Doesn't that need attention,
> too?
>
> On Sep 1, 2010, at 9:09 AM, Scott Atchley wrote:
>
>> Jeff,
>>
>> I posted a patch on the ticket.
>>
>> Scott
>>
>> On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote:
>>
>>> Jeff,
>>>
>>> Sure, I need to register to file the tickets.
>>>
>>> I have not had a chance yet. I will try to look at them first thing next
>>> week.
>>>
>>> Scott
>>>
>>> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:
>>>
>>>> Scott --
>>>>
>>>> Can you file tickets for this against 1.4 and 1.5? These should
>>>> probably be blockers.
>>>>
>>>> Have you been able to track these down any further, perchance?
>>>>
>>>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4 and
>>>>> MX 1.2.12).
>>>>>
>>>>> This version also dies during init due to the memory manager if I do
>>>>> not specify which pml to use. If I specify pml ob1 or pml cm, the
>>>>> tests start but die with segfaults:
>>>>>
>>>>> 131072 320 166.86 749.15
>>>>> [rain15:14939] *** Process received signal ***
>>>>> [rain15:14939] Signal: Segmentation fault (11)
>>>>> [rain15:14939] Signal code: Address not mapped (1)
>>>>> [rain15:14939] Failing at address: 0x3b20
>>>>>
>>>>> Again, configuring without the memory manager or setting
>>>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work.
>>>>>
>>>>> Similar latency issues with the BTL and not with the MTL.
>>>>>
>>>>> Scott
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] 1/4/3rc1 over MX
On Fri, 3 Sep 2010, Jeff Squyres wrote:

> On Sep 1, 2010, at 9:10 AM, Scott Atchley wrote:
>
>> I posted a patch for this on the ticket.
>
> Will someone be committing this to SVN?

Done. Filed the CMRs to get this moved to 1.4.3 and 1.5.

> I re-opened the ticket because just posting a patch to the ticket doesn't
> actually fix anything. :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
Jeff,

I think you will have to revert this patch as the btl_bandwidth __IS__ supposed to be in Mbs and not MBs. We usually talk about networks in Mbs (there is a pattern in Ethernet 1G/10G, Myricom 10G). In addition the original design of the multi-rail was based on this assumption, and the multi-rail handling code deals with these values (at that level I don't think it really matters, but at least it needs consistent values from all BTLs).

However, going over the existing BTLs I can see that some BTLs do not correctly set this value:

BTL      Bandwidth        Auto-detect  Status
Elan     2000             NO           Correct
GM       250              NO           Doubtful
MX       2000/1           YES (Mbs)    Correct (before the patch)
OFUD     800              NO           Doubtful
OpenIB   2000/4000/8000   YES (Mbs)    Correct (multiplied by the active_width)
Portals  1000             NO           Doubtful
SCTP     100              NO           Conservative value (correct)
Self     100              XXX          Correct (doesn't matter anyway)
SM       9000             NO           Correct
TCP      100              NO           Conservative value (correct)
UDAPL    225              NO           Incorrect

Some of these BTL values do not make sense in either Mbs or MBs. Here is a list of such BTLs: OFUD, Portals, UDAPL. If the corresponding developers can provide the default bandwidth (in Mbs) I will update their values.

For SCTP and TCP I don't know how to detect it reliably in a portable way, so I expect to leave them set to this very conservative value. Moreover, the BTL TCP is only used for multi-rail if the available high performance network allows it, so it doesn't really matter.

george.

On Sep 3, 2010, at 08:03 , Jeff Squyres wrote:

> Thanks; committed in r23712.
>
> Can you file CMRs for 1.4 and 1.5? Thanks.
>
> On Sep 3, 2010, at 3:53 AM, Brice Goglin wrote:
>
>> For some reason, the MX btl sets btl_bandwidth in megabits/s instead of
>> megabytes/s. So we get crazy btl_weights in case of heterogeneous
>> multirail. And --mca btl_mx_bandwidth cannot work around the
>> problem (it probably doesn't help because it's overridden by the runtime
>> link width detection anyway?).
>>
>> Signed-off-by: Brice Goglin
>>
>> Index: ompi/mca/btl/mx/btl_mx_component.c
>> ===
>> --- ompi/mca/btl/mx/btl_mx_component.c (revision 23711)
>> +++ ompi/mca/btl/mx/btl_mx_component.c (working copy)
>> @@ -159,7 +159,7 @@
>>      MCA_BTL_FLAGS_PUT |
>>      MCA_BTL_FLAGS_SEND |
>>      MCA_BTL_FLAGS_RDMA_MATCHED);
>> -    mca_btl_mx_module.super.btl_bandwidth = 2000;
>> +    mca_btl_mx_module.super.btl_bandwidth = 250;
>>      mca_btl_mx_module.super.btl_latency = 5;
>>      mca_btl_base_param_register(&mca_btl_mx_component.super.btl_version,
>>      &mca_btl_mx_module.super);
>> @@ -357,7 +357,7 @@
>>      mx_btl->mx_endpoint = mx_endpoint;
>>      mx_btl->mx_endpoint_addr = mx_endpoint_addr;
>>
>> -    mx_btl->super.btl_bandwidth = 2000; /* whatever */
>> +    mx_btl->super.btl_bandwidth = 250; /* whatever */
>>      mx_btl->super.btl_latency = 10;
>>  #if defined(MX_HAS_NET_TYPE)
>>      {
>> @@ -370,11 +370,11 @@
>>      } else {
>>          if( MX_SPEED_2G == value ) {
>>              mx_unique_network_id |= 0xaa00;
>> -            mx_btl->super.btl_bandwidth = 2000;
>> +            mx_btl->super.btl_bandwidth = 250;
>>              mx_btl->super.btl_latency = 5;
>>          } else if( MX_SPEED_10G == value ) {
>>              mx_unique_network_id |= 0xbb00;
>> -            mx_btl->super.btl_bandwidth = 1;
>> +            mx_btl->super.btl_bandwidth = 1250;
>>              mx_btl->super.btl_latency = 3;
>>          } else {
>>              mx_unique_network_id |= 0xcc00;
>
> --
> Jeff Squyres
> jsquy...@cisco.com
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
On Sep 3, 2010, at 9:38 AM, George Bosilca wrote:

> I think you will have to revert this patch as the btl_bandwidth __IS__
> supposed to be in Mbs and not MBs. We usually talk about networks in Mbs
> (there is a pattern in Ethernet 1G/10G, Myricom 10G).

This is why I shouldn't commit patches for others, and why I'm glad I pushed Scott to commit the other fixes himself...

I'll revert; you, Scott, and Brice figure out what you want to do.

> In addition the original design of the multi-rail was based on this
> assumption, and the multi-rail handling code deal with these values (at that
> level I don't think it really matters, but at least it needs consistent
> values from all BTLs).
>
> However, going over the existing BTLs I can see that some BTLs do not
> correctly set this value:
>
> BTL      Bandwidth        Auto-detect  Status
> Elan     2000             NO           Correct
> GM       250              NO           Doubtful
> MX       2000/1           YES (Mbs)    Correct (before the patch)
> OFUD     800              NO           Doubtful
> OpenIB   2000/4000/8000   YES (Mbs)    Correct (multiplied by the active_width)
> Portals  1000             NO           Doubtful
> SCTP     100              NO           Conservative value (correct)
> Self     100              XXX          Correct (doesn't matter anyway)
> SM       9000             NO           Correct
> TCP      100              NO           Conservative value (correct)
> UDAPL    225              NO           Incorrect
>
> Some of these BTL values do not make sense, neither in Mbs or MBs. Here is a
> list of such BTLs: OFUD, Portals, UDAPL. If the corresponding developers can
> provide the default bandwidth (in Mbs) I will update their values.

OFUD should be just like OpenFabrics. But I doubt anyone cares. Should we remove it?

UDAPL intentionally hides that kind of stuff; I don't know if it's possible to get it. Rolf/Terry?

> For SCTP, TCP I don't know how to detect it reliably in a portable way, so I
> expect to let them set to this very conservative value. Moreover, the BTL TCP
> is only used for multi-rail if the available high performance network allows
> it, so it doesn't really matter.

Some servers have both 1Gb and 10Gb TCP, though... It might be worth having even a Linux-specific way to auto-detect, just for this use case (which is becoming more common -- a 1Gb LOM and a 10Gb non-iWARP NIC).

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
On 03/09/2010 15:38, George Bosilca wrote:
> Jeff,
>
> I think you will have to revert this patch as the btl_bandwidth __IS__
> supposed to be in Mbs and not MBs. We usually talk about networks in Mbs
> (there is a pattern in Ethernet 1G/10G, Myricom 10G). In addition the
> original design of the multi-rail was based on this assumption, and the
> multi-rail handling code deal with these values (at that level I don't think
> it really matters, but at least it needs consistent values from all BTLs).
>
> However, going over the existing BTLs I can see that some BTLs do not
> correctly set this value:
>
> BTL      Bandwidth        Auto-detect  Status
> Elan     2000             NO           Correct

2000 looks strange to me. Last time I played with Elan4, bandwidth was 900MB/s or so.

> GM       250              NO           Doubtful
> MX       2000/1           YES (Mbs)    Correct (before the patch)
> OFUD     800              NO           Doubtful
> OpenIB   2000/4000/8000   YES (Mbs)    Correct (multiplied by the active_width)

I found the problem when using both MX and OpenIB at the same time, so they can't be both wrong or both correct. IB was reporting 800, not 2000/4000/8000. Maybe because auto-detect didn't work and the default is wrong:

btl_openib_mca.c:527:mca_btl_openib_module.super.btl_bandwidth = 800;

Brice
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
On Sep 3, 2010, at 09:50 , Brice Goglin wrote:

> On 03/09/2010 15:38, George Bosilca wrote:
>> Jeff,
>>
>> I think you will have to revert this patch as the btl_bandwidth __IS__
>> supposed to be in Mbs and not MBs. We usually talk about networks in Mbs
>> (there is a pattern in Ethernet 1G/10G, Myricom 10G). In addition the
>> original design of the multi-rail was based on this assumption, and the
>> multi-rail handling code deal with these values (at that level I don't think
>> it really matters, but at least it needs consistent values from all BTLs).
>>
>> However, going over the existing BTLs I can see that some BTLs do not
>> correctly set this value:
>>
>> BTL      Bandwidth        Auto-detect  Status
>> Elan     2000             NO           Correct
>
> 2000 looks strange to me. Last time I played with Elan4, bandwidth was
> 900MB/s or so.

Lucky you ;) The 2000 was the bandwidth of the last Elan device we had.

>> GM       250              NO           Doubtful
>> MX       2000/1           YES (Mbs)    Correct (before the patch)
>> OFUD     800              NO           Doubtful
>> OpenIB   2000/4000/8000   YES (Mbs)    Correct (multiplied by the active_width)
>
> I found the problem when using both MX and OpenIB at the same time, so
> they can't be both wrong or both correct. IB was reporting 800, not
> 2000/4000/8000. Maybe because auto-detect didn't work and the default is
> wrong:
> btl_openib_mca.c:527:mca_btl_openib_module.super.btl_bandwidth = 800;

It appears that Open IB only auto-detects the bandwidth if the value is explicitly set to zero via the MCA parameters.

As a last resort, as with the other devices, you can set it manually. Use something like btl_openib_bandwidth_%dev_name% to set the bandwidth per device.

george.

> Brice
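
The behaviour George describes (an explicit value wins, zero means auto-detect) can be sketched as follows. This is a hedged illustration only: effective_bandwidth, lane_mbps, and active_width are hypothetical names, and the rule that the detected rate is multiplied by the active link width is taken from the table earlier in this thread, not from the openib BTL source.

```c
/* Hypothetical sketch: choose the bandwidth value a BTL would advertise.
 *
 * configured   - value from the MCA parameter (0 is ASSUMED to request
 *                auto-detection, per the discussion in this thread)
 * lane_mbps    - detected per-lane rate in Mb/s (e.g. 2000, 4000, 8000)
 * active_width - detected number of active lanes
 */
unsigned effective_bandwidth(unsigned configured,
                             unsigned lane_mbps,
                             unsigned active_width)
{
    if (configured != 0) {
        /* A non-zero setting is used as-is; detection is skipped. */
        return configured;
    }
    return lane_mbps * active_width;
}
```

With these assumptions, a non-zero default of 800 would be reported unchanged whatever the link is capable of, which would be consistent with what Brice observed; a setting of 0 on a four-lane link at 2000 Mb/s per lane would yield 8000.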
Re: [OMPI devel] 1.5rc5 over MX
Yes, probably so.

On Sep 3, 2010, at 8:53 AM, Scott Atchley wrote:

> Shouldn't the regression be a separate ticket since it is unrelated?
>
> Scott
>
> On Sep 3, 2010, at 8:20 AM, Jeff Squyres wrote:
>
>> Ditto for the v1.5 patch -- it wasn't committed anywhere and no CMR was
>> filed, so I re-opened the ticket.
>>
>> Plus you mentioned a 2us (!) latency increase. Doesn't that need attention,
>> too?
>>
>> On Sep 1, 2010, at 9:09 AM, Scott Atchley wrote:
>>
>>> Jeff,
>>>
>>> I posted a patch on the ticket.
>>>
>>> Scott
>>>
>>> On Aug 27, 2010, at 3:08 PM, Scott Atchley wrote:
>>>
>>>> Jeff,
>>>>
>>>> Sure, I need to register to file the tickets.
>>>>
>>>> I have not had a chance yet. I will try to look at them first thing
>>>> next week.
>>>>
>>>> Scott
>>>>
>>>> On Aug 27, 2010, at 2:41 PM, Jeff Squyres wrote:
>>>>
>>>>> Scott --
>>>>>
>>>>> Can you file tickets for this against 1.4 and 1.5? These should
>>>>> probably be blockers.
>>>>>
>>>>> Have you been able to track these down any further, perchance?
>>>>>
>>>>> On Aug 26, 2010, at 10:38 AM, Scott Atchley wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> Testing 1.5rc5 over MX with the same setup as 1.4.3rc1 (RHEL 5.4
>>>>>> and MX 1.2.12).
>>>>>>
>>>>>> This version also dies during init due to the memory manager if I
>>>>>> do not specify which pml to use. If I specify pml ob1 or pml cm,
>>>>>> the tests start but die with segfaults:
>>>>>>
>>>>>> 131072 320 166.86 749.15
>>>>>> [rain15:14939] *** Process received signal ***
>>>>>> [rain15:14939] Signal: Segmentation fault (11)
>>>>>> [rain15:14939] Signal code: Address not mapped (1)
>>>>>> [rain15:14939] Failing at address: 0x3b20
>>>>>>
>>>>>> Again, configuring without the memory manager or setting
>>>>>> OMPI_MCA_memory_ptmalloc2_disable=1 before calling mpirun work.
>>>>>>
>>>>>> Similar latency issues with the BTL and not with the MTL.
>>>>>>
>>>>>> Scott
>>
>> --
>> Jeff Squyres
>> jsquy...@cisco.com

--
Jeff Squyres
jsquy...@cisco.com
Re: [OMPI devel] 1.5rc5 has been posted
Using MPI-2 (Gropp et al.) says MPI_SIZEOF() only supports numeric intrinsic data types. So, I patched OpenMPI 1.4.2 to remove the declarations of the undefined Logical and Character specific procedures in ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh:

    output_197 MPI_Sizeof ${rank} CH "character${dim}"
    output_197 MPI_Sizeof ${rank} L "logical${dim}"

I also changed all the dummy array declarations in the INTERFACE declarations to use assumed-shape arrays, which is the correct Fortran 90 method to declare the rank and extent of any actual array arguments.

I simplified both ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh and ompi/mpi/f90/scripts/mpi_sizeof.f90.sh. In mpi-f90-interfaces.h.sh, I defined an array, array_dims, with the DIMENSION declarations, then replaced all the copies of dim= throughout the code with a reference to array_dims by rank:

    array_dims[0]=''
    array_dims[1]=', dimension(:)'
    array_dims[2]=', dimension(:,:)'
    array_dims[3]=', dimension(:,:,:)'
    array_dims[4]=', dimension(:,:,:,:)'
    array_dims[5]=', dimension(:,:,:,:,:)'
    array_dims[6]=', dimension(:,:,:,:,:,:)'
    array_dims[7]=', dimension(:,:,:,:,:,:,:)'

    for rank in $allranks
    do
      dim=${array_dims[${rank}]}

In mpi_sizeof.f90.sh, I copied the method to enumerate rank 0 with all the other ranks from the code in mpi-f90-interfaces.h.sh:

    allranks="0 $ranks"

    for rank in $allranks
    do
      case "$rank" in 0) dim='' ; esac
      case "$rank" in 1) dim=', dimension(:)' ; esac
      case "$rank" in 2) dim=', dimension(:,:)' ; esac
      case "$rank" in 3) dim=', dimension(:,:,:)' ; esac
      case "$rank" in 4) dim=', dimension(:,:,:,:)' ; esac
      case "$rank" in 5) dim=', dimension(:,:,:,:,:)' ; esac
      case "$rank" in 6) dim=', dimension(:,:,:,:,:,:)' ; esac
      case "$rank" in 7) dim=', dimension(:,:,:,:,:,:,:)' ; esac

Here's the patch I used for OpenMPI 1.4.2:

    # Remove declarations of Logical and Character specific procedures from
    # Generic Subroutine MPI_SIZEOF and fix dummy arrays to be assumed-shape
    mv openmpi-1.4.2/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh{,.original}
    sed -e $'34{p; s/^.*$/array_dims[0]=\'\'/;p; s/^.*$/array_dims[1]=\', dimension(:)\'/;p; s/^.*$/array_dims[2]=\', dimension(:,:)\'/;p; s/^.*$/array_dims[3]=\', dimension(:,:,:)\'/;p; s/^.*$/array_dims[4]=\', dimension(:,:,:,:)\'/;p; s/^.*$/array_dims[5]=\', dimension(:,:,:,:,:)\'/;p; s/^.*$/array_dims[6]=\', dimension(:,:,:,:,:,:)\'/;p; s/^.*$/array_dims[7]=\', dimension(:,:,:,:,:,:,:)\'/;p; s/^.*$//;}' \
        -e '/case "$rank" in [0-6]) dim=/d' \
        -e '/case "$rank" in 7) dim=.*$/s//dim=${array_dims[${rank}]}/' \
        -e '7129,7130d' \
        openmpi-1.4.2/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh.original \
        >openmpi-1.4.2/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh
    chmod +x openmpi-1.4.2/ompi/mpi/f90/scripts/mpi-f90-interfaces.h.sh
    mv openmpi-1.4.2/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh{,.original}
    sed -e '25,84d' \
        -e '85s/^.*$/allranks="0 $ranks"/' \
        -e '87s/\$ranks/$allranks/' \
        -e $'88{p;s/^.*$/ case "$rank" in 0) dim=\'\' ; esac/;}' \
        -e $'89,95{s/dim=\'/dim=\', dimension(/;s/1,/:,/g;s/\*\'/:) \'/;}' \
        -e '97,110d' \
        -e '118s/, dimension(\${dim})/${dim}/' \
        -e '133s/, dimension(\${dim})/${dim}/' \
        -e '148s/, dimension(\${dim})/${dim}/' \
        openmpi-1.4.2/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh.original \
        >openmpi-1.4.2/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh
    chmod +x openmpi-1.4.2/ompi/mpi/f90/scripts/mpi_sizeof.f90.sh

Larry Baker
US Geological Survey
650-329-5608
ba...@usgs.gov

On Sep 1, 2010, at 5:09 PM, Larry Baker wrote:

> OpenMPI 1.4.x and 1.5x fail to link a program that calls Subroutine MPI_SIZEOF using the PGI 10.3 compilers:
>
> $ cat junk.f90
>       Use MPI
>       Implicit None
>       Integer var, size, err
>       Call MPI_SIZEOF( var, size, err )
>       Write (*,*) 'Size of Integer var is ', size, ' bytes.'
>       Stop
>       End
> $ /opt/pgi/linux86-64/current/openmpi/bin/mpif90 -o junk junk.f90
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof1dl_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof4dch_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof3dl_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof4dl_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof2dch_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof2dl_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof3dch_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof1dch_'
> /opt/pgi/linux86-64/10.3/openmpi/lib/libmpi_f90.so: undefined reference to `mpi_sizeof0dl_'
> /opt/pgi/linux86-64/10.
Re: [OMPI devel] [PATCH] fix mx btl_bandwidth
On Fri, Sep 3, 2010 at 3:47 PM, Jeff Squyres wrote:

> It might be worth having even a Linux-specific way to auto-detect, just for
> this use case (which is becoming more common -- 1GB LOM and a 10GB non-iWARP
> NIC).

The file:

/sys/class/net/ethX/speed

should contain the current speed and is readable by any user; if it contains 65535 there is no link, so the speed is not defined. The information should also be available through ethtool, but for root only, which is not so useful in this case.

The file might not always exist, e.g. when /sys is not mounted, when using an older kernel, when the driver doesn't expose this info, etc., but from what I understand this is just a best effort to locate a realistic value.

Cheers,
Bogdan
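
A best-effort reader for the sysfs file Bogdan mentions might look like the sketch below. It is an illustration under stated assumptions, not code from the TCP BTL: the function names parse_link_speed and read_link_speed_mbps are hypothetical, the value is assumed to be in Mb/s, and treating non-positive values as "unknown" (in addition to the 65535 case from the message above) is an extra assumption.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: interpret the text of a "speed" file.
 * Returns the speed (assumed Mb/s), or -1 if it is unknown. */
long parse_link_speed(const char *text)
{
    char *end;
    long value = strtol(text, &end, 10);

    if (end == text) {
        return -1;              /* no digits at all */
    }
    /* 65535 means "no link" per the thread; non-positive values are
     * also treated as unknown here (an assumption of this sketch). */
    if (value <= 0 || value == 65535) {
        return -1;
    }
    return value;
}

/* Hypothetical helper: read /sys/class/net/<ifname>/speed.
 * Returns -1 when the file is missing, unreadable, or reports no link,
 * so the caller can fall back to a conservative default. */
long read_link_speed_mbps(const char *ifname)
{
    char path[256];
    char buf[32];
    FILE *fp;
    long speed = -1;

    snprintf(path, sizeof(path), "/sys/class/net/%s/speed", ifname);
    fp = fopen(path, "r");
    if (fp == NULL) {
        return -1;              /* /sys not mounted, old kernel, ... */
    }
    if (fgets(buf, sizeof(buf), fp) != NULL) {
        speed = parse_link_speed(buf);
    }
    fclose(fp);
    return speed;
}
```

A caller would keep the existing conservative default whenever the function returns -1, which matches the "best effort" intent described above.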