[OMPI devel] trunk hangs since r19010
Hi,

I am experiencing hangs in tests (latency) since r19010.

Best Regards,
Lenny.
Re: [OMPI devel] trunk hangs since r19010
Is this related to r1378?

On Jul 28, 2008, at 7:13 AM, Lenny Verkhovsky wrote:
> I am experiencing hangs in tests (latency) since r19010.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 7:51 AM, Jeff Squyres wrote:
> Is this related to r1378?

Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
I believe it is.

On 7/28/08, Jeff Squyres wrote:
> Gah -- I meant #1378, meaning the "PML ob1 deadlock" ticket.
Re: [OMPI devel] trunk hangs since r19010
It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications. Hadn't seen that before, and at least I haven't explored it further - planned to do so today.

On Jul 28, 2008, at 6:01 AM, Lenny Verkhovsky wrote:
> I believe it is.
[OMPI devel] Funny warning message
Just got this warning today while trying to test IB connections. Last I checked, 32 was indeed smaller than 192...

--------------------------------------------------------------------------
WARNING: rd_win specification is non optimal. For maximum performance it is
advisable to configure rd_win smaller then (rd_num - rd_low), but currently
rd_win = 32 and (rd_num - rd_low) = 192.
--------------------------------------------------------------------------

Ralph
[OMPI devel] 1.3 build failing on MacOSX
I'm getting the following when I try and build 1.3 from SVN:

gcc -DHAVE_CONFIG_H -I. -I../../adio/include -DOMPI_BUILDING=1 -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../.. -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/../../../../../opal/include -I../../../../../../../opal/include -I../../../../../../../ompi/include -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/include -I/Users/greg/Documents/workspaces/ptp_head/ompi/ompi/mca/io/romio/romio/adio/include -D_REENTRANT -g -Wall -Wundef -Wno-long-long -Wsign-compare -Wmissing-prototypes -Wstrict-prototypes -Wcomment -pedantic -Wno-long-double -Werror-implicit-function-declaration -finline-functions -fno-strict-aliasing -DHAVE_ROMIOCONF_H -DHAVE_ROMIOCONF_H -I../../include -MT ad_write_nolock.lo -MD -MP -MF .deps/ad_write_nolock.Tpo -c ad_write_nolock.c -fno-common -DPIC -o .libs/ad_write_nolock.o
ad_write_nolock.c: In function ‘ADIOI_NOLOCK_WriteStrided’:
ad_write_nolock.c:92: error: implicit declaration of function ‘lseek64’
make[5]: *** [ad_write_nolock.lo] Error 1
make[4]: *** [all-recursive] Error 1
make[3]: *** [all-recursive] Error 1
make[2]: *** [all-recursive] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1

Configured with:

./configure --with-platform=contrib/platform/lanl/macosx-dynamic

Any ideas?

Greg
Re: [OMPI devel] 1.3 build failing on MacOSX
Blast. Looks like a problem with the new ROMIO I brought in last week. I'll fix shortly; thanks for the heads-up.

On Jul 28, 2008, at 9:36 AM, Greg Watson wrote:
> I'm getting the following when I try and build 1.3 from SVN:
> ...
> ad_write_nolock.c:92: error: implicit declaration of function ‘lseek64’

--
Jeff Squyres
Cisco Systems
[OMPI devel] MCA base changes
With the update on #1400, I think we're ready to push the MCA base changes to the SVN trunk. Speak now if you object, or forever hold your peace.

The most notable parts of this commit:

- add "register" function to mca_base_component_t
- converted coll:basic and paffinity:linux and paffinity:solaris to use this function
  --> we'll convert the rest over time (I'll file a ticket once all this is committed)
- add 32 bytes of "reserved" space to the end of mca_base_component_t and mca_base_component_data_2_0_0_t to make future upgrades [slightly] easier
- new mca_base_component_t size: 196 bytes
- new mca_base_component_data_2_0_0_t size: 36 bytes
- MCA base version bumped to v2.0
- **We now refuse to load components that are not MCA v2.0.x**
- all MCA frameworks versions bumped to v2.0
- be a little more explicit about version numbers in the MCA base
- add big comment in mca.h about versioning philosophy

It's a pretty big commit because it touches a lot of files (although most are just changing the version number); I'll commit it this evening.

--
Jeff Squyres
Cisco Systems
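To illustrate the shape of the change being described (a rough sketch only; the type and field names below are hypothetical, not the actual Open MPI declarations from the commit), a component structure that gains a "register" hook and reserved padding might look roughly like this:

    /* Hypothetical sketch -- illustrative names only, not the real
     * mca_base_component_t definition. */
    typedef int (*example_register_fn_t)(void);

    struct example_component_2_0_0 {
        int mca_major_version;                 /* MCA base version, now 2 */
        int mca_minor_version;                 /* now 0 */
        /* ... the existing open/close/query function pointers ... */
        example_register_fn_t register_params; /* the new "register" hook */
        char reserved[32];                     /* padding so later additions
                                                  need not change the layout */
    };

Presumably the recorded version fields are also what lets the base refuse components that are not MCA v2.0.x at load time, as noted above.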
Re: [OMPI devel] trunk hangs since r19010
I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.

On Jul 28, 2008, at 6:41 AM, Ralph Castain wrote:
> It could also be something new. Brad and I noted on Fri that IB was locking up as soon as we tried any cross-node communications.
Re: [OMPI devel] Funny warning message
It seems that the error slipped into the help file.

Index: ompi/mca/btl/openib/help-mpi-btl-openib.txt
===================================================================
--- ompi/mca/btl/openib/help-mpi-btl-openib.txt (revision 19054)
+++ ompi/mca/btl/openib/help-mpi-btl-openib.txt (working copy)
@@ -497,7 +497,7 @@
 #
 [non optimal rd_win]
 WARNING: rd_win specification is non optimal. For maximum performance it is
-advisable to configure rd_win smaller then (rd_num - rd_low), but currently
+advisable to configure rd_win bigger then (rd_num - rd_low), but currently
 rd_win = %d and (rd_num - rd_low) = %d.
 #
 [apm without lmc]

Best regards,
Lenny

On 7/28/08, Ralph Castain wrote:
> Just got this warning today while trying to test IB connections. Last I checked, 32 was indeed smaller than 192...
Re: [OMPI devel] trunk hangs since r19010
I fail to run either on different nodes or on the same node via self,openib.

On 7/28/08, Ralph Castain wrote:
> I checked this out some more and I believe it is ticket #1378 related. We lock up if SM is included in the BTL's, which is what I had done on my test. If I ^sm, I can run fine.
Re: [OMPI devel] Funny warning message
On Mon, Jul 28, 2008 at 05:14:29PM +0300, Lenny Verkhovsky wrote:

> -advisable to configure rd_win smaller then (rd_num - rd_low), but currently
> +advisable to configure rd_win bigger then (rd_num - rd_low), but currently
                                          ^
                                          a

(i.e., "then" should be "than")

--
Cluster and Metacomputing Working Group
Friedrich-Schiller-Universität Jena, Germany
private: http://adi.thur.de
[OMPI devel] RFC: MCA DSO filename
WHAT: Rename MCA DSO filenames from "mca_<framework>_<component>.so" to "libmca_<framework>_<component>.so" (backwards compatibility can be preserved if we want it; see below)

WHY: Allows simplifying component Makefile.am's

WHEN: No real rush; just wanted to get the idea out there (does *not* need to be before v1.3; more explanation below)

WHERE: autogen.sh, some stuff in opal/mca/base, and every component's Makefile.am

TIMEOUT: Fri, 8 Aug 2008

In reviewing some old SVN/HG trees that I had hanging around, I discovered one about significantly simplifying (and slightly optimizing) component Makefile.am's. I believe that these ideas came from Brian, Ralf, and possibly others. Here's a "simple" current Makefile.am (the TCP BTL):

https://svn.open-mpi.org/trac/ompi/browser/trunk/ompi/mca/btl/tcp/Makefile.am

At the end of this mail, I include what the meat of the TCP BTL Makefile.am can be reduced to. However, to do this, we need to use the same output filename for both the static and dynamic builds (i.e., as a standalone DSO and as a convenience LT library). Libtool will complain if we build a convenience library with a filename that does not begin with "lib".

Note that there are two parts involved:

1. touching each Makefile.am and converting to the simpler format.
2. converting the MCA base to look for "libmca_<framework>_<component>" filenames.

NOTE: we can optionally have the MCA base *also* look for the old-style name "mca_<framework>_<component>" if backwards compatibility is desired. Because of the backwards compatibility possibility, there is no need to do this before v1.3 -- it could be done for v1.3.x or even v1.4 (there's no real rush). It's just an idea that has been around for a while, so I thought I'd turn it into an RFC. If the community agrees, I'll likely file a ticket about this and we'll get to it someday.

Below is what the TCP BTL Makefile.am can be reduced to (compare the end of this file to the end of the current TCP BTL Makefile.am). Note that the whole "if" logic at the end could possibly be hidden in autogen -- I haven't thought that through, but it's a possibility (we can't hide that stuff in autogen until we unify the output filename; we can't do it in today's build system, for example).

-----

libmca_btl_tcp_la_SOURCES = \
        btl_tcp.c \
        btl_tcp.h \
        btl_tcp_addr.h \
        btl_tcp_component.c \
        btl_tcp_endpoint.c \
        btl_tcp_endpoint.h \
        btl_tcp_frag.c \
        btl_tcp_frag.h \
        btl_tcp_hdr.h \
        btl_tcp_proc.c \
        btl_tcp_proc.h \
        btl_tcp_ft.c \
        btl_tcp_ft.h
libmca_btl_tcp_la_LDFLAGS = -module -avoid-version

if OMPI_BUILD_btl_tcp_DSO
mcacomponentdir = $(pkglibdir)
mcacomponent_LTLIBRARIES = libmca_btl_tcp.la
else
noinst_LTLIBRARIES = libmca_btl_tcp.la
endif

-----

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Funny warning message
I think Lenny is pointing out that "smaller" got changed to "bigger", too. :-)

Looking at the test in the code (btl_openib_component.c):

    if ((rd_num - rd_low) > rd_win) {
        orte_show_help("help-mpi-btl-openib.txt", "non optimal rd_win",
                       true, rd_win, rd_num - rd_low);
    }

So the change in the help message is correct -- it is better when rd_win is bigger than (rd_num - rd_low).

Ralph -- were you running with a non-default btl_openib_receive_queues?

--
Jeff Squyres
Cisco Systems
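Plugging in the numbers from Ralph's run makes the logic concrete: rd_win = 32 and (rd_num - rd_low) = 192, so the test (192 > 32) is true and the warning fires. In other words, the code warns precisely when rd_win is smaller than (rd_num - rd_low), which is why only the wording of the help text needed to change, not the test itself.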
Re: [OMPI devel] Funny warning message
On Jul 28, 2008, at 8:22 AM, Jeff Squyres wrote:
> So the change in the help message is correct -- it is better when rd_win is bigger than (rd_num - rd_low).
>
> Ralph -- were you running with a non-default btl_openib_receive_queues?

Yep...was using a queue layout from Brad that is pretty complex. I was just pointing out that the warning's stated condition was met, so either the warning text is wrong or the test that generates it is wrong.
Re: [OMPI devel] trunk hangs since r19010
My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems, and tests using --mca btl openib,self hang in all cases.

--brad

2008/7/28 Lenny Verkhovsky:
> I fail to run either on different nodes or on the same node via self,openib.
Re: [OMPI devel] trunk hangs since r19010
Interesting - you are quite correct and I should have been more precise. I ran with -mca btl openib and it worked. So having just openib seems to be okay.

On Jul 28, 2008, at 8:37 AM, Brad Benton wrote:
> My experience is the same as Lenny's. I've tested on x86_64 and ppc64 systems, and tests using --mca btl openib,self hang in all cases.
Re: [OMPI devel] trunk hangs since r19010
FWIW, all my MTT runs are hanging as well.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
Only openib works for me too, but Glebs said to me once that it's illegal and I always need to use the self BTL.

On 7/28/08, Jeff Squyres wrote:
> FWIW, all my MTT runs are hanging as well.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 8:52 AM, Lenny Verkhovsky wrote:
> Only openib works for me too, but Glebs said to me once that it's illegal and I always need to use the self BTL.

Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
Re: [OMPI devel] 1.3 build failing on MacOSX
Looking into it a bit more, the situation is a little convoluted. I've filed https://svn.open-mpi.org/trac/ompi/ticket/1419; followups will occur there.

On Jul 28, 2008, at 9:42 AM, Jeff Squyres wrote:
> Blast. Looks like a problem with the new ROMIO I brought in last week. I'll fix shortly; thanks for the heads-up.

--
Jeff Squyres
Cisco Systems
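For background on the error itself: Mac OS X has no separate lseek64() because its off_t is already 64 bits, so code that calls lseek64() unconditionally fails to compile there. A portability shim along the following lines is one common workaround (a sketch only, assuming a configure-provided HAVE_LSEEK64 macro; it is not necessarily the fix that went in on ticket #1419):

    #include <sys/types.h>
    #include <unistd.h>

    /* If the platform has no explicit 64-bit seek call, plain lseek()
     * already takes a 64-bit off_t, so map the name onto it. */
    #ifndef HAVE_LSEEK64
    #define lseek64(fd, offset, whence) lseek((fd), (offset), (whence))
    #endif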
[OMPI devel] Change in slot_list specification
Just an FYI for those of you working with slot_lists.

Lenny, Jeff and I have changed the MCA param associated with how you specify the slot list you want the rank_file mapper to use. This was done to avoid the possibility of ORTE processes such as mpirun and orted accidentally binding themselves to cores. The prior param was identical to the one used to tell MPI procs their core bindings - so if someone ever modified the paffinity system to detect the param and automatically perform the binding, mpirun and orted could both bind themselves to the specified cores...which isn't what we would want.

The new param is "rmaps_base_slot_list". To make life easier, we also added a new orterun cmd line option --slot-list which acts as a shorthand for the new MCA param.

Ralph
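For reference, usage with the renamed param would look something like the following (the slot-list value "0:0-1" is only an illustrative placeholder; the exact slot-list syntax accepted by the rank_file mapper is assumed here, not taken from this message):

    # via the new MCA parameter
    mpirun -np 4 --mca rmaps_base_slot_list 0:0-1 ./a.out

    # or via the new orterun shorthand
    mpirun -np 4 --slot-list 0:0-1 ./a.out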
[OMPI devel] Change in hostfile behavior
Per an earlier telecon, I have modified the hostfile behavior slightly to allow hostfiles to subdivide allocations.

Briefly: given an allocation, we allow users to specify --hostfile on a per-app_context basis. In this mode, the hostfile info is used to filter the nodes that will be used for that app_context. However, the prior implementation only filtered the nodes themselves - i.e., it was a binary filter that allowed you to include or exclude an entire node.

The change now allows you to include a specified #slots for a given node as opposed to -all- slots from that node. You are limited to the #slots included in the original allocation. I just realized that I hadn't output a warning if you attempt to violate this condition - will do so shortly. Rather than just abort if this happens, I set the allocation to that of the original - please let me know if you would prefer it to abort.

If you have interest in this behavior, please check it out and let me know if this meets needs.

Ralph
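As an illustration of the filtering described above (the hostnames and slot counts are made up for the example), suppose the original allocation provides 8 slots on each of nodeA and nodeB; an app_context could then be restricted to a subset of those slots with a hostfile such as:

    # myhosts - use only 4 of nodeA's slots and 2 of nodeB's
    nodeA slots=4
    nodeB slots=2

and launched with:

    mpirun --hostfile myhosts -np 6 ./a.out

Requesting more slots on a node than the original allocation contains would, per the message above, fall back to the original allocation (with a warning to be added).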
Re: [OMPI devel] trunk hangs since r19010
I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words, that adding self to the BTL list leads to deadlocks?

  george.

PS: Btw, it is not supposed to work at all, except in the case where openib handles internal messages (where the source and destination is the same process).

On Jul 28, 2008, at 5:05 PM, Ralph Castain wrote:
> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?
Re: [OMPI devel] trunk hangs since r19010
I just re-tested to confirm, and that is correct.

-mca btl openib         works
-mca btl openib,self    hangs
-mca btl openib,sm      works

On Jul 28, 2008, at 9:49 AM, George Bosilca wrote:
> I'm a little bit lost here. You're stating that openib,self doesn't work while openib does? In other words, that adding self to the BTL list leads to deadlocks?
Re: [OMPI devel] trunk hangs since r19010
Interesting. The self BTL is only used for local communications. I didn't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test; I will take a look this evening.

Thanks,
  george.

On Jul 28, 2008, at 5:56 PM, Ralph Castain wrote:
> I just re-tested to confirm, and that is correct.
>
> -mca btl openib         works
> -mca btl openib,self    hangs
> -mca btl openib,sm      works
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 12:03 PM, George Bosilca wrote:
> Interesting. The self BTL is only used for local communications. I didn't expect that any benchmark executes such communications, but apparently I was wrong. Please let me know the failing test; I will take a look this evening.

FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.

But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
My test wasn't a benchmark - I was just testing with a little program that calls mpi_init, mpi_barrier, and mpi_finalize. A test with just mpi_init/finalize works fine, so it looks like we simply hang when trying to communicate. This also only happens on multi-node operations.

On Jul 28, 2008, at 10:16 AM, Jeff Squyres wrote:
> There is no self communication in osu_latency, so something else must be going on.
Re: [OMPI devel] trunk hangs since r19010
On Jul 28, 2008, at 11:05 AM, Ralph Castain wrote:
>> Only openib works for me too, but Glebs said to me once that it's illegal and I always need to use the self BTL.
>
> Don't know - could be true. But if that is true, then we should check to see if that condition is met and error out - with an appropriate message - if so. Otherwise, how is a user supposed to know this condition?

This used to be true, but I think we changed it a while ago (Pasha: do you remember?) because Mellanox HCAs are capable of send-to-self (process) and there were no code changes necessary to enable it. So it allowed a slightly simpler command line. This was quite a while ago, IIRC.

All current iWARP adapters do not allow loopback communication at all (i.e., communication to either the same proc or other procs on the same host), so we added the following test in openib's add_procs:

    if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
        0 != (ompi_proc->proc_flags && OMPI_PROC_FLAG_LOCAL)) {
        continue;
    }

(meaning: skip this proc if it's on the same host; let btl self handle it, etc.)

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Change in hostfile behavior
My only concern is how will this interact with PLPA. Say two Open MPI jobs each use "half" the cores (slots) on a particular node... how would they be able to bind themselves to a disjoint set of cores?

I'm not asking you to solve this Ralph, I'm just pointing it out so we can maybe warn users that if both jobs sharing a node try to use processor affinity, we don't make that magically work well, and that we would expect it to do quite poorly. I could see disabling paffinity and/or warning if it was enabled for one of these "fractional" nodes.

On Mon, Jul 28, 2008 at 11:43 AM, Ralph Castain wrote:
> Per an earlier telecon, I have modified the hostfile behavior slightly to allow hostfiles to subdivide allocations.

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] trunk hangs since r19010
Jeff Squyres wrote:
> FWIW, my manual tests of a simplistic "ring" program work for all combinations (openib, openib+self, openib+self+sm). Shrug.
>
> But for OSU latency, I found that openib, openib+sm work, but openib+sm+self hangs (same results whether the 2 procs are on the same node or different nodes). There is no self communication in osu_latency, so something else must be going on.

Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.

--td
Re: [OMPI devel] Change in hostfile behavior
Actually, this is true today regardless of this change. If two separate mpirun invocations share a node and attempt to use paffinity, they will conflict with each other.

The problem isn't caused by the hostfile sub-allocation. The problem is that the two mpiruns have no knowledge of each other's actions, and hence assign node ranks to each process independently. Thus, we would have two procs that think they are node rank=0 and should therefore bind to the 0 processor, and so on up the line.

Obviously, if you run within one mpirun and have two app_contexts, the hostfile sub-allocation is fine - mpirun will track node rank across the app_contexts. It is only the use of multiple mpiruns that share nodes that causes the problem.

Several of us have discussed this problem and have a proposed solution for 1.4. Once we get past 1.3 (someday!), we'll bring it to the group.

On Jul 28, 2008, at 10:44 AM, Tim Mattox wrote:
> My only concern is how will this interact with PLPA. Say two Open MPI jobs each use "half" the cores (slots) on a particular node... how would they be able to bind themselves to a disjoint set of cores?
[OMPI devel] parallel debugger attach
I think I fixed the parallel debugger attach stuff in an hg -- can interested parties test it out at their own sites before I bring it back to the SVN trunk? It should be working for both Allinea DDT and TotalView.

HG: http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/debugger-stuff/
Ticket: https://svn.open-mpi.org/trac/ompi/ticket/1361

Thanks.

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] trunk hangs since r19010
On Mon, Jul 28, 2008 at 12:08 PM, Terry Dontje wrote:
> Is it something to do with the MPI_Barrier call? osu_latency uses MPI_Barrier and from rhc's email it sounds like his code does too.

I don't think it's an issue with MPI_Barrier(). I'm running into this problem with srtest.c (one of the example programs from the mpich distribution). It's a ring-type test with no barriers until the end, yet it hangs on the very first Send/Recv pair from rank0 to rank1.

In my case, openib and openib+sm work, but openib+self and openib+sm+self hang.

--brad
[OMPI devel] MCA_BTL_BASE_VERSION_1_0_1 and MCA_BTL_BASE_VERSION_1_0_0
Since the trunk has now been bumped to MCA v2.0, and all frameworks have also been bumped to v2.0, are these two #defines relevant anymore:

    MCA_BTL_BASE_VERSION_1_0_1
    MCA_BTL_BASE_VERSION_1_0_0

I know there was at least one BTL being developed at an organization that may not have kept up with the trunk. Do we need to put in backwards compatibility for that BTL, or should we delete these #defines?

--
Jeff Squyres
Cisco Systems