Re: [OMPI devel] openmpi-1.3rc4 build failure with qsnet4.30
I can confirm that both 1.3rc6 and 1.2.9rc2 now build fine for me. -Paul

George Bosilca wrote: Paul, Thanks for noticing the Elan problem. It appears we missed one patch in the 1.3 branch (https://svn.open-mpi.org/trac/ompi/changeset/20122). I'll file a CMR ASAP. Thanks, george.

On Jan 13, 2009, at 16:31, Paul H. Hargrove wrote:

Since it looks like you guys are very close to release, I just grabbed the 1.3rc4 tarball to give it a spin. Unfortunately, the elan BTL is not building:

$ ../configure --prefix=<...> CC=<...> CXX=<path to g++-4.3.2> FC=<...> ...
$ make
...
Making all in mca/btl/elan
make[2]: Entering directory `/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi/mca/btl/elan'
depbase=`echo btl_elan.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`;\
/bin/sh ../../../../libtool --tag=CC --mode=compile /usr/local/pkg/gcc-4.3.2/bin/gcc -DHAVE_CONFIG_H -I. -I../../../../../ompi/mca/btl/elan -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../../.. -I../../../.. -I../../../../../opal/include -I../../../../../orte/include -I../../../../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -MT btl_elan.lo -MD -MP -MF $depbase.Tpo -c -o btl_elan.lo ../../../../../ompi/mca/btl/elan/btl_elan.c &&\
mv -f $depbase.Tpo $depbase.Plo
libtool: compile: /usr/local/pkg/gcc-4.3.2/bin/gcc -DHAVE_CONFIG_H -I. -I../../../../../ompi/mca/btl/elan -I../../../../opal/include -I../../../../orte/include -I../../../../ompi/include -I../../../../opal/mca/paffinity/linux/plpa/src/libplpa -I../../../../.. -I../../../.. -I../../../../../opal/include -I../../../../../orte/include -I../../../../../ompi/include -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -pthread -fvisibility=hidden -MT btl_elan.lo -MD -MP -MF .deps/btl_elan.Tpo -c ../../../../../ompi/mca/btl/elan/btl_elan.c -fPIC -DPIC -o .libs/btl_elan.o
In file included from /usr/include/qsnet/fence.h:116,
                 from /usr/include/elan3/elan3.h:42,
                 from ../../../../../ompi/mca/btl/elan/btl_elan.h:34,
                 from ../../../../../ompi/mca/btl/elan/btl_elan.c:18:
/usr/include/asm/bitops.h:333:2: warning: #warning This includefile is not available on all architectures.
/usr/include/asm/bitops.h:334:2: warning: #warning Using kernel headers in userspace.
../../../../../ompi/mca/btl/elan/btl_elan.c: In function 'mca_btl_elan_add_procs':
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: 'ELAN_TPORT_USERCOPY_DISABLE' undeclared (first use in this function)
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: (Each undeclared identifier is reported only once
../../../../../ompi/mca/btl/elan/btl_elan.c:167: error: for each function it appears in.)
../../../../../ompi/mca/btl/elan/btl_elan.c: In function 'mca_btl_elan_get':
../../../../../ompi/mca/btl/elan/btl_elan.c:551: warning: cast to pointer from integer of different size
make[2]: *** [btl_elan.lo] Error 1
make[2]: Leaving directory `/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi/mca/btl/elan'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/home/pcp1/phargrov/OpenMPI/openmpi-1.3rc4/BLD/ompi'
make: *** [all-recursive] Error 1

$ rpm -qif /usr/include/qsnet
Name        : qsnet-headers                  Relocations: (not relocateable)
Version     : 4.30qsnet                      Vendor: (none)
Release     : 0                              Build Date: Mon 31 Jan 2005 07:36:45 AM PST
Install date: Mon 13 Mar 2006 04:37:36 PM PST  Build Host: pingu
Group       : Development/System             Source RPM: qsnet-headers-4.30qsnet-0.src.rpm
Size        : 608924                         License: GPL
Signature   : (none)
Summary     : The QsNet header files for the qsnet Linux kernel.
Description : The headers package contains the QsNet kernel headers which are required by library programmers to use the QsNet hardware.

I couldn't find any info in the README about the minimum supported version of qsnet. However, I did notice a cut-and-paste error in the following text in README ("InfiniPath" should be "Elan"):

--with-elan= Specify the directory where the Quadrics Elan library and header files are located. This option is generally only necessary if the InfiniPath headers and libraries are not in default compiler/linker search paths.

Sorry not to have done any testing earlier than today. -Paul

-- Paul H. Hargrove  phhargr...@lbl.gov
Future Technologies Group  Tel: +1-510-495-2352
HPC Research Department    Fax: +1-510-486-6900
Lawrence Berkeley National Laboratory
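For readers hitting the same 'undeclared' error against older qsnet headers, the usual compatibility-shim shape is sketched below. This is an illustration only, not the actual changeset 20122 patch, and whether a zero value is semantically safe for this particular Elan flag is an assumption:

/* Hypothetical sketch only -- the real fix is in changeset 20122.
 * Older qsnet/libelan headers do not define ELAN_TPORT_USERCOPY_DISABLE,
 * so code that ORs it into a flags word fails to compile against them.
 * Defining the missing flag to 0 makes that OR a no-op on old headers. */
#ifndef ELAN_TPORT_USERCOPY_DISABLE
#define ELAN_TPORT_USERCOPY_DISABLE 0
#endif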
[OMPI devel] Open MPI v1.3rc6 has been posted
Hi All, The sixth (yes, 6!) release candidate of Open MPI v1.3 is now available: http://www.open-mpi.org/software/ompi/v1.3/ Please run it through its paces as best you can. Anticipated release of 1.3 is tomorrow morning. This only has a fix for a segfault in coll_hierarch_component.c with respect to rc5 (ticket #1751), so if you have already started testing with rc5 and are not explicitly enabling coll_hierarch, there is no need to start your tests over. -- Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] reduce_scatter bug with hierarch
r20275 looks good. I suggest that we CMR that into 1.3 and get rc6 rolled and tested. (Actually, Jeff just did the CMR... so off to rc6.) --brad

On Wed, Jan 14, 2009 at 1:16 PM, Edgar Gabriel wrote: So I am not entirely sure why the bug only happened on the trunk; it could in theory also appear on v1.3 (is there a difference in how pointer_arrays are handled between the two versions?) Anyway, it passes now on both with changeset 20275. We should probably move that over to 1.3 as well; whether for 1.3.0 or 1.3.1 I leave that up to others to decide... Thanks, Edgar

Edgar Gabriel wrote: I'm already debugging it. The good news is that it only seems to appear with the trunk; with 1.3 (after copying the new tuned module over), all the tests pass. Now if somebody can tell me a trick on how to tell mpirun not to kill the debugger under my feet, then I could even see where the problem occurs :-) Thanks, Edgar

George Bosilca wrote: All these errors are in MPI_Finalize; it should not be that hard to find. I'll take a look later this afternoon. george.

On Jan 14, 2009, at 06:41, Tim Mattox wrote: Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923 This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922 So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote: Here we go, by the book :) https://svn.open-mpi.org/trac/ompi/ticket/1749 george.

On Jan 13, 2009, at 23:40, Jeff Squyres wrote: Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote: Unfortunately, this pinpoints the fact that we didn't test the collective module mixing enough. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing... and I have to admit tuned was the only faulty one. This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In its current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to put a broken toy out there. How about pushing r20267 into 1.3? george.

On Jan 13, 2009, at 20:13, Jeff Squyres wrote: Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1. I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote: I just debugged the Reduce_scatter bug mentioned previously.
The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
    ...
    err = comm->c_coll.coll_reduce (..., module);
    ...
}

but it should be

{
    ...
    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
    ...
}

The problem as it is right now is that, when using hierarch, only a subset of the functions are set, e.g. reduce, allreduce, bcast and barrier. Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv).
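Edgar's snippet can be restated as a self-contained demonstration. The types below are simplified stand-ins, not the real OMPI structures; only the module-passing pattern is the point:

#include <stdio.h>

/* Minimal stand-ins for the OMPI structures involved (illustrative only). */
typedef struct { const char *name; } mca_coll_base_module_t;

typedef struct {
    int (*coll_reduce)(int x, mca_coll_base_module_t *module);
    mca_coll_base_module_t *coll_reduce_module;  /* module that owns coll_reduce */
} c_coll_t;

typedef struct { c_coll_t c_coll; } comm_t;

/* A reduce provided by, say, hierarch: it expects ITS OWN module. */
static int hierarch_reduce(int x, mca_coll_base_module_t *module)
{
    printf("reduce called with module '%s'\n", module->name);
    return x;
}

/* The buggy tuned pattern: forwards the caller's own module. */
static int reduce_scatter_buggy(comm_t *comm, int x, mca_coll_base_module_t *module)
{
    return comm->c_coll.coll_reduce(x, module);  /* wrong module */
}

/* The fix: pass the module that registered coll_reduce. */
static int reduce_scatter_fixed(comm_t *comm, int x, mca_coll_base_module_t *module)
{
    (void)module;
    return comm->c_coll.coll_reduce(x, comm->c_coll.coll_reduce_module);
}

int main(void)
{
    mca_coll_base_module_t tuned = { "tuned" }, hierarch = { "hierarch" };
    comm_t comm = { { hierarch_reduce, &hierarch } };
    reduce_scatter_buggy(&comm, 1, &tuned);  /* prints 'tuned'    -- wrong */
    reduce_scatter_fixed(&comm, 1, &tuned);  /* prints 'hierarch' -- right */
    return 0;
}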
Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed
Sorry, I have searched the whole day for a solution to that problem, but unfortunately I'm clueless :-( I cannot say which flag causes the compile error. Furthermore, I'm also unable to reproduce this error on several different platforms. The coding style in the source file concerned also doesn't look special... My suggestion is to use the workaround (configure flag) for the 1.3 release.

On Wed, 2009-01-14 at 07:57 -0500, Jeff Squyres wrote: Is there some code that can be fixed instead? I.e., is this feature totally incompatible with whatever RPM compiler flags are used, or is it just some coding style that these particular flags don't like?

On Jan 14, 2009, at 5:05 AM, Matthias Jurenz wrote: Another workaround would be to disable the I/O tracing feature of VT by adding the configure option '--with-contrib-vt-flags=--disable-iotrace'. That will have the effect that the upcoming OMPI rpm's have no support for I/O tracing, but in our opinion that is not so bad... Furthermore, we could add the configure option in 'ompi/contrib/vt/configure.m4' to retain feature-consistency between the rpm's and the source packages. Matthias

On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote: I don't want to move changes (the default value of the flag), since there are important people for whom it works :) I also think that this is a VT issue, but I guess we are the only ones who experience the errors. We can now overwrite these params from the environment as a workaround; Mike committed a buildrpm.sh script to the trunk in r20253 that allows overwriting params from the environment. We observed the problem on CentOS 5.2 with bundled gcc and RedHat 5.2 with bundled gcc.

#uname -a
Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

#lsb_release -a
LSB Version: :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: CentOS
Description: CentOS release 5.2 (Final)
Release: 5.2
Codename: Final

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Best regards, Lenny.

On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres wrote: I'm still guessing that this is a distro / compiler issue -- I can build with the default flags just fine...? Can you specify what distro / compiler you were using? Also, if you want to move the changes that have been made to buildrpm.sh to the v1.3 branch, just file a CMR. That file is not included in release tarballs, so Tim can move it over at any time.

On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote: It seems that setting use_default_rpm_opt_flags to 0 solves the problem. Maybe the VT developers should take a look at it. Lenny.

On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres wrote: This sounds like a distro/compiler version issue. Can you narrow down the issue at all?

On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote: It doesn't happen if I do autogen, configure and make install; only when I try to make an rpm from the tar file.

On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres wrote: This doesn't happen in a normal build of the same tree?
I ask because 1.3r20226 builds fine for me both ways (i.e., ./configure; make manually, and via buildrpm.sh).

On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote: Hi, I am trying to build an rpm from the nightly snapshots of 1.3 with the downloaded buildrpm.sh and ompi.spec file from http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/ and I am getting this error:

...
Making all in vtlib
make[5]: Entering directory `/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/OMPI/BUILD/openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64
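Collecting the thread's two workarounds as commands. The configure flag is quoted directly from Matthias's mail; the buildrpm.sh invocation details (tarball argument, calling it from the script's directory) are assumptions, since the mail only says the script honors environment overrides as of r20253:

# Workaround 1 (Matthias): build OMPI without VT's I/O tracing,
# which sidesteps the failing compile entirely.
./configure --with-contrib-vt-flags=--disable-iotrace ...

# Workaround 2 (Lenny): keep VT but drop the distro's default RPM
# optimization flags when rolling the RPM (invocation is illustrative).
use_default_rpm_opt_flags=0 ./buildrpm.sh ...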
Re: [OMPI devel] reduce_scatter bug with hierarch
So I am not entirely sure why the bug only happened on the trunk; it could in theory also appear on v1.3 (is there a difference in how pointer_arrays are handled between the two versions?)

Anyway, it passes now on both with changeset 20275. We should probably move that over to 1.3 as well; whether for 1.3.0 or 1.3.1 I leave that up to others to decide... Thanks, Edgar

Edgar Gabriel wrote: I'm already debugging it. The good news is that it only seems to appear with the trunk; with 1.3 (after copying the new tuned module over), all the tests pass. Now if somebody can tell me a trick on how to tell mpirun not to kill the debugger under my feet, then I could even see where the problem occurs :-) Thanks, Edgar

George Bosilca wrote: All these errors are in MPI_Finalize; it should not be that hard to find. I'll take a look later this afternoon. george.

On Jan 14, 2009, at 06:41, Tim Mattox wrote: Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923 This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922 So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote: Here we go, by the book :) https://svn.open-mpi.org/trac/ompi/ticket/1749 george.

On Jan 13, 2009, at 23:40, Jeff Squyres wrote: Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote: Unfortunately, this pinpoints the fact that we didn't test the collective module mixing enough. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing... and I have to admit tuned was the only faulty one. This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In its current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to put a broken toy out there. How about pushing r20267 into 1.3? george.

On Jan 13, 2009, at 20:13, Jeff Squyres wrote: Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1. I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote: I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
    ...
    err = comm->c_coll.coll_reduce (..., module);
    ...
}

but it should be

{
    ...
    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
    ...
}

The problem as it is right now is that, when using hierarch, only a subset of the functions are set, e.g. reduce, allreduce, bcast and barrier.
Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to do that mostly correctly.

Thanks, Edgar

-- Edgar Gabriel, Assistant Professor
Parallel Software Technologies Lab, http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
Re: [OMPI devel] reduce_scatter bug with hierarch
So, if it looks okay on 1.3... then there should not be anything holding up the release, right? Otherwise, George, we need to decide whether or not this is a blocker, or whether we go ahead and release with this as a known issue and schedule the fix for 1.3.1. My vote is to go ahead and release, but if you (or others) think otherwise, let's talk about how best to move forward. --brad

On Wed, Jan 14, 2009 at 12:04 PM, Edgar Gabriel wrote: I'm already debugging it. The good news is that it only seems to appear with the trunk; with 1.3 (after copying the new tuned module over), all the tests pass. Now if somebody can tell me a trick on how to tell mpirun not to kill the debugger under my feet, then I could even see where the problem occurs :-) Thanks, Edgar

George Bosilca wrote: All these errors are in MPI_Finalize; it should not be that hard to find. I'll take a look later this afternoon. george.

On Jan 14, 2009, at 06:41, Tim Mattox wrote: Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923 This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922 So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote: Here we go, by the book :) https://svn.open-mpi.org/trac/ompi/ticket/1749 george.

On Jan 13, 2009, at 23:40, Jeff Squyres wrote: Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote: Unfortunately, this pinpoints the fact that we didn't test the collective module mixing enough. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing... and I have to admit tuned was the only faulty one. This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In its current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to put a broken toy out there. How about pushing r20267 into 1.3? george.

On Jan 13, 2009, at 20:13, Jeff Squyres wrote: Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1. I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote: I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.
Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
    ...
    err = comm->c_coll.coll_reduce (..., module);
    ...
}

but it should be

{
    ...
    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
    ...
}

The problem as it is right now is that, when using hierarch, only a subset of the functions are set, e.g. reduce, allreduce, bcast and barrier. Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to do that mostly correctly.

Thanks, Edgar

-- Edgar Gabriel, Assistant Professor
Parallel Software Technologies Lab
Re: [OMPI devel] crcpw verbosity
The crcpw component is in the PML framework. The following should be the MCA parameter you are looking for: pml_crcpw_verbose=20

You can use the 'ompi_info' command to find out more information about available MCA parameters. For example, to find this one you can use the following: ompi_info --param pml crcpw

Cheers, Josh

On Jan 14, 2009, at 12:54 PM, Caciano Machado wrote: Hi, What variable should I set to increase the verbosity of the crcpw component? I've tried "ompi_crcpw_verbose=20" and "crcpw_base_verbose=20". How can I figure out the name of the variable? Regards, Caciano
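As a usage sketch (the parameter name and the ompi_info query are from Josh's mail; the mpirun arguments and ./a.out are placeholders):

# Query the crcpw PML component's available parameters:
ompi_info --param pml crcpw

# Set the verbosity for a run on the command line...
mpirun --mca pml_crcpw_verbose 20 -np 4 ./a.out

# ...or via Open MPI's standard environment-variable prefix:
export OMPI_MCA_pml_crcpw_verbose=20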
Re: [OMPI devel] autosizing the shared memory backing file
I think you'd like to know more than just how many procs are local. E.g., if the chunk or eager limits are changed much, that would impact how much memory you'd like to allocate.

A phone chat is all right for me, though so far all I've heard is that no one understands the code! But maybe we can nip this one in the bud. How about the following proposal.

First, what's happening is:
*) the sm BTL (which knows how big the file should be) calls
*) mca_mpool_base_module_create(), which calls
*) mca_mpool_sm_init() (which creates the file)

There is no explicit calling argument to transmit an mpool size through these function calls, but there is a "resources" argument. This resources argument appears to be opaque to the intervening function, but it seems to be understood by both the sm BTL caller and the sm mpool component callee. Other components appear to have other definitions of the resources data structure. So, I propose:

*) In mca/mpool/sm/mpool_sm.h, there is a definition of mca_mpool_base_resources_t. It has a single field (int32_t mem_node). How about I add another field here: size_t size.
*) In the sm BTL, in sm_btl_first_time_init(), I can set the size of the mmap file in my "resources" data structure.
*) In mca_mpool_sm_init(), when I determine the mmap file size, I just look up the resources->size value and use that.

Yes? A clean and proper solution that does not break other BTLs?

Ralph Castain wrote: I also know little about that part of the code, but agree that it does seem weird. Seeing as we know how many local procs there are before we get to this point, I would think we could be smart about our memory pool size. You might not need to dive into the sm BTL to get the info you need - if all you need is how many procs are local, that can be obtained fairly easily. I'd be happy to contribute to the chat, if it would be helpful.

On Jan 14, 2009, at 7:43 AM, Jeff Squyres wrote: Would it be useful to get on the phone and discuss this stuff?

On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote: Thanks for the reply. I kind of understand, but it's rather weird. The BTL calls mca_mpool_base_module_create() to create a pool of memory, but the BTL has no say in how big a pool to create? E.g., I see that there is a "resources" argument (mca_mpool_base_resources_t). Maybe that structure should be expanded to include a "size" field?

On Jan 13, 2009, at 19:22, Eugene Loh wrote: With the sm BTL, there is a file that each process mmaps in for shared memory. I'm trying to get mpool_sm to size the file appropriately. mpool_sm creates and mmaps the file, but the size depends on parameters like eager limit and max frag size that are known by the btl_sm.
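A minimal sketch of the proposal, assuming the struct shape Eugene describes (the real header may differ; compute_backing_file_size() is a hypothetical helper standing in for whatever sizing logic the sm BTL uses):

/* Sketch against ompi/mca/mpool/sm/mpool_sm.h, per the proposal above. */
struct mca_mpool_base_resources_t {
    int32_t mem_node;   /* existing field (per the mail) */
    size_t  size;       /* proposed: requested size of the sm backing file */
};
typedef struct mca_mpool_base_resources_t mca_mpool_base_resources_t;

/* In the sm BTL (sm_btl_first_time_init), roughly:
 *
 *   mca_mpool_base_resources_t res;
 *   res.mem_node = my_mem_node;
 *   res.size     = compute_backing_file_size(n_local_procs,
 *                                            eager_limit, max_frag_size);
 *   mpool = mca_mpool_base_module_create("sm", btl, &res);
 *
 * mca_mpool_sm_init() then reads res.size instead of guessing; the
 * resources argument stays opaque to the intervening base function.
 */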
[OMPI devel] Open MPI v1.3rc5 has been posted
Hi All, The fifth release candidate of Open MPI v1.3 is now available: http://www.open-mpi.org/software/ompi/v1.3/ Please run it through its paces as best you can. Anticipated release of 1.3 is tonight/tomorrow. (Again.) -- Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/ tmat...@gmail.com || timat...@open-mpi.org I'm a bright... http://www.the-brights.net/
Re: [OMPI devel] reduce_scatter bug with hierarch
I'm already debugging it. The good news is that it only seems to appear with the trunk; with 1.3 (after copying the new tuned module over), all the tests pass.

Now if somebody can tell me a trick on how to tell mpirun not to kill the debugger under my feet, then I could even see where the problem occurs :-)

Thanks, Edgar

George Bosilca wrote: All these errors are in MPI_Finalize; it should not be that hard to find. I'll take a look later this afternoon. george.

On Jan 14, 2009, at 06:41, Tim Mattox wrote: Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923 This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922 So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote: Here we go, by the book :) https://svn.open-mpi.org/trac/ompi/ticket/1749 george.

On Jan 13, 2009, at 23:40, Jeff Squyres wrote: Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote: Unfortunately, this pinpoints the fact that we didn't test the collective module mixing enough. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing... and I have to admit tuned was the only faulty one. This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In its current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to put a broken toy out there. How about pushing r20267 into 1.3? george.

On Jan 13, 2009, at 20:13, Jeff Squyres wrote: Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1. I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote: I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
    ...
    err = comm->c_coll.coll_reduce (..., module);
    ...
}

but it should be

{
    ...
    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
    ...
}

The problem as it is right now is that, when using hierarch, only a subset of the functions are set, e.g. reduce, allreduce, bcast and barrier. Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to do that mostly correctly.
Thanks, Edgar

-- Edgar Gabriel, Assistant Professor
Parallel Software Technologies Lab, http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
Re: [OMPI devel] reduce_scatter bug with hierarch
All these errors are in MPI_Finalize; it should not be that hard to find. I'll take a look later this afternoon. george.

On Jan 14, 2009, at 06:41, Tim Mattox wrote: Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923 This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922 So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote: Here we go, by the book :) https://svn.open-mpi.org/trac/ompi/ticket/1749 george.

On Jan 13, 2009, at 23:40, Jeff Squyres wrote: Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote: Unfortunately, this pinpoints the fact that we didn't test the collective module mixing enough. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing... and I have to admit tuned was the only faulty one. This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In its current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to put a broken toy out there. How about pushing r20267 into 1.3? george.

On Jan 13, 2009, at 20:13, Jeff Squyres wrote: Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1. I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote: I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
    ...
    err = comm->c_coll.coll_reduce (..., module);
    ...
}

but it should be

{
    ...
    err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
    ...
}

The problem as it is right now is that, when using hierarch, only a subset of the functions are set, e.g. reduce, allreduce, bcast and barrier. Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to do that mostly correctly.

Thanks, Edgar

-- Edgar Gabriel, Assistant Professor
Parallel Software Technologies Lab, http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G.
Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335
[OMPI devel] crcpw verbosity
Hi, What variable should I set to increase the verbosity of the crcpw component? I've tried "ompi_crcpw_verbose=20" and "crcpw_base_verbose=20". How can I figure out the name of the variable? Regards, Caciano
Re: [OMPI devel] OpenMPI question
To follow up for the web archives -- we discussed this more off-list. AFAIK, compiling Open MPI -- including its memory registration cache -- works fine in 32-bit mode, even on 64-bit platforms (there was some confusion between virtual and physical memory addresses and who uses what [OMPI *only* sees virtual memory addresses because it's user-space code]).

On Jan 13, 2009, at 2:48 PM, Jeff Squyres wrote: On Jan 13, 2009, at 7:37 AM, Alex A. Granovsky wrote: Am I correct in assuming that the OpenMPI memory registration/cache module is completely broken by design on any 32-bit system allowing a physical address space larger than 4 GB, and especially when compiled for 32-bit under a 64-bit OS (e.g., Linux)? I'm not sure what you mean -- OMPI 32-bit builds on a 64-bit system should be OK...? Have you found a problem?

-- Jeff Squyres, Cisco Systems
Re: [OMPI devel] -display-map
We -may- be able to do a more formal XML output at some point. The problem will be the natural interleaving of stdout/err from the various procs due to the async behavior of MPI. mpirun receives fragmented output in the forwarding system, limited by the buffer sizes and the amount of data we can read at any one "bite" from the pipes connecting us to the procs. So even though the user -thinks- they output a single large line of stuff, it may show up at mpirun as a series of fragments. Hence, it gets tricky to know how to put appropriate XML brackets around it.

Given this input about when you actually want resolved name info, I can at least do something about that area. It won't be in 1.3.0, but should make 1.3.1.

As for XML-tagged stdout/err: the OMPI community asked me not to turn that feature "on" for 1.3.0, as they felt it hasn't been adequately tested yet. The code is present, but cannot be activated in 1.3.0. However, I believe it is activated on the trunk when you do --xml --tagged-output, so perhaps some testing will help us debug and validate it adequately for 1.3.1?

Thanks, Ralph

On Jan 14, 2009, at 7:02 AM, Greg Watson wrote: Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave it as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout), as this would vastly simplify the parsing we need to do. Regards, Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm... well, I can't do either for 1.3.0, as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in XML. Any thoughts? Ralph

PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast.

On Jan 13, 2009, at 12:17 PM, Greg Watson wrote: Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: [XML example stripped by the archive] the XML should be: [XML example stripped by the archive] or: [XML example stripped by the archive] Would either of these be possible? Thanks, Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote: Ok, thanks. I'll test from the trunk in future. Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote: Working its way around the CMR process now. It might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff. Thanks, Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote: Hi Ralph, This is now in 1.3rc2, thanks. However, there are a couple of problems.
Here is what I see:

[Jarrah.watson.ibm.com:58957] <noderesolve ... resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also, the end tag should be "/>" not ">". Thanks, Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote: Great, thanks. I'll take a look once it comes over to 1.3. Cheers, Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote: Yo Greg, This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param, "orte_show_resolved_nodenames", so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes. Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote: Ralph, I guess the issue for us is that we will have to run two commands to get the information we need: one to get the configuration information, such as
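For reference, the invocation shapes implied by the thread, as a sketch (the MCA parameter name is quoted from Ralph's mail; the exact option spellings may differ by release, and -np 4 ./a.out are placeholders):

# Emit noderesolve entries alongside the displayed map:
mpirun --display-map --mca orte_show_resolved_nodenames 1 -np 4 ./a.out

# On the trunk, XML-tagged output for tools such as Eclipse/PTP:
mpirun --xml --tagged-output --display-map -np 4 ./a.out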
Re: [OMPI devel] autosizing the shared memory backing file
I also know little about that part of the code, but agree that it does seem weird. Seeing as we know how many local procs there are before we get to this point, I would think we could be smart about our memory pool size. You might not need to dive into the sm BTL to get the info you need - if all you need is how many procs are local, that can be obtained fairly easily. I'd be happy to contribute to the chat, if it would be helpful.

On Jan 14, 2009, at 7:43 AM, Jeff Squyres wrote: Would it be useful to get on the phone and discuss this stuff?

On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote: Thanks for the reply. I kind of understand, but it's rather weird. The BTL calls mca_mpool_base_module_create() to create a pool of memory, but the BTL has no say in how big a pool to create? E.g., I see that there is a "resources" argument (mca_mpool_base_resources_t). Maybe that structure should be expanded to include a "size" field?

On Jan 13, 2009, at 19:22, Eugene Loh wrote: With the sm BTL, there is a file that each process mmaps in for shared memory. I'm trying to get mpool_sm to size the file appropriately. mpool_sm creates and mmaps the file, but the size depends on parameters like eager limit and max frag size that are known by the btl_sm.
Re: [OMPI devel] RFC: Fragmented sm Allocations
Whoa, this analysis rocks. :-) I'm going through trying to grok it all... Just wanted to say: kudos for this.

On Jan 14, 2009, at 1:14 AM, Eugene Loh wrote:

RFC: Fragmented sm Allocations

WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular buffers (CB) during MPI_Init(). Also:
• Improve handling of error codes.
• Automate the sizing of the mmap file.

WHY: To reduce consumption of shared memory, making job startup more robust, and possibly improving the scalability of startup.

WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to ompi_fifo_init(). This is where CBs are initialized one at a time, components of a CB allocated individually. Changes can be seen in ssh://www.open-mpi.org/~eugene/hg/sm-allocation.

WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.

WHY (details)

The sm BTL establishes a FIFO for each non-self, on-node connection. Each FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs can be added later in program execution if a FIFO runs out of room.) A CB has different components that are used in different ways:
• The "wrapper" is read by both sender and receiver, but is rarely written.
• The "queue" (FIFO entries) is accessed by both the sender and receiver.
• The "head" is accessed by the sender.
• The "tail" is accessed by the receiver.

For performance reasons, a CB is not allocated as one large data structure. Rather, these components are laid out separately in memory and the wrapper has pointers to the various locations. Performance considerations include:
• false sharing: a component used by one process should not share a cacheline with another component that is modified by another process
• NUMA: certain components should perhaps be mapped preferentially to memory pages that are close to the processes that access these components

Currently, the sm BTL handles these issues by allocating each component of each CB its own page. (Actually, it couples tails and queues together.) As the number of on-node processes grows, however, the shared-memory allocation skyrockets. E.g., let's say there are n processes on-node. There are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head, and tail/queue). The shared-memory allocation for CBs becomes 3n^2 pages. For large n, this dominates the shared-memory consumption, even though most of the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full memory page!

Not only is the 3n^2-page allocation large, but it is also not tunable via any MCA parameters. Large shared-memory consumption has led to some number of start-up and other user problems. E.g., the e-mail thread at http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)

Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment

The first set of changes reduces pagesize to cacheline alignment. Though mapping to pages is motivated by NUMA locality, note:
• The code already has NUMA locality optimizations (maffinity and mpools) anyhow.
• There is no data that I'm aware of substantiating the benefits of locality optimizations in this context. More to the point, I've tried some such experiments myself. I had two processes communicating via shared memory on a large SMP that had a large difference between remote and local memory access times. I timed the roundtrip latency for pingpongs between the processes.
That time was correlated to the relative separation between the two processes, and not at all to the placement of the physical memory backing the shared variables. It did not matter if the memory was local to the sender or receiver or remote from both! (E.g., colocal processes showed fast timings even if the shared memory was remote to both processes.) My results do not prove that all NUMA platforms behave in the same way. My point is only that, though I understand the logic behind locality optimizations for FIFO placement, the only data I am aware of does not substantiate that logic.

The changes are:
• File: ompi/mca/mpool/sm/mpool_sm_module.c; Function: mca_mpool_sm_alloc(). Use the alignment requested by the caller rather than adding additional pagesize alignment as well.
• File: ompi/class/ompi_fifo.h; Functions: ompi_fifo_init() and ompi_fifo_write_to_head(). Align the ompi_cb_fifo_wrapper_t structure on a cacheline rather than a page.
• File: ompi/class/ompi_circular_buffer_fifo.h; Function: ompi_cb_fifo_init(). Align the two calls to mpool_alloc on a cacheline rather than a page.

2. Aggregated Allocation

Another option is to lay out all the CBs at once and aggregate their allocations. This may have the added benefit of reducing lock contention during MPI_Init(). On the one hand, the 3n^2 CB allocations during MPI_Init() contend for a single mca_common_sm_mmap->map_seg->seg_lock
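To put rough numbers on the RFC's "3n^2 pages" point, here is a small self-contained program. It counts only the alignment floor of the 3n(n-1) per-CB allocations (real CB components, especially queues, are larger than one cacheline), so it illustrates scaling rather than exact footprints; the 4 KiB page and 64 B cacheline sizes are common assumptions, not universals:

#include <stdio.h>

int main(void)
{
    const long page = 4096, cacheline = 64;  /* assumed sizes */
    for (long n = 32; n <= 512; n *= 2) {
        /* wrapper, head, and tail/queue per FIFO; n(n-1) FIFOs on-node */
        long allocs = 3 * n * (n - 1);
        printf("n=%4ld  page-aligned: %9.1f MiB   cacheline-aligned: %7.2f MiB\n",
               n,
               allocs * (double)page / 1048576.0,
               allocs * (double)cacheline / 1048576.0);
    }
    return 0;
}

At n = 512 the page-aligned floor is roughly 3 GiB versus under 50 MiB for cacheline alignment, which matches the startup failures the RFC cites.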
Re: [OMPI devel] autosizing the shared memory backing file
Ya, that does seem weird to me, but I never fully grokked the whole mpool / allocator scheme (I haven't had to interact with that part of the code much). Would it be useful to get on the phone and discuss this stuff?

On Jan 14, 2009, at 1:11 AM, Eugene Loh wrote: Thanks for the reply. I kind of understand, but it's rather weird. The BTL calls mca_mpool_base_module_create() to create a pool of memory, but the BTL has no say in how big a pool to create? Could you imagine having a memory allocation routine ("malloc" or something) that didn't allow you to control the size of the allocation? Instead, the allocation routine determines the size. That's weird. I must be missing something about how this is supposed to work. E.g., I see that there is a "resources" argument (mca_mpool_base_resources_t). Maybe that structure should be expanded to include a "size" field? Or maybe I should bypass mca_mpool_base_module_create()/mca_mpool_sm_init() and just call mca_common_sm_mmap_init() directly, the way mca/coll/sm does things. That would allow me to specify the size of the file.

George Bosilca wrote: The simple answer is: you can't. The mpool is loaded before the BTLs, and on Linux the loader uses the RTLD_NOW flag (i.e., all symbols have to be defined or the dlopen call will fail). Moreover, there is no way in Open MPI to exchange information between components except a global variable or something in mca/common. In other words, there is no way for you to call a function from the sm BTL from within the mpool.

On Jan 13, 2009, at 19:22, Eugene Loh wrote: With the sm BTL, there is a file that each process mmaps in for shared memory. I'm trying to get mpool_sm to size the file appropriately. So, I would like mpool_sm to call some mca_btl_sm function that provides a good guess of the size. (mpool_sm creates and mmaps the file, but the size depends on parameters like eager limit and max frag size that are known by the btl_sm.)

-- Jeff Squyres, Cisco Systems
Re: [OMPI devel] -display-map
Ralph, The only time we use the resolved names is when we get a map, so we consider them part of the map output. If quasi-XML is all that will ever be possible with 1.3, then you may as well leave it as-is and we will attempt to clean it up in Eclipse. It would be nice if a future version of ompi could output correct XML (including stdout), as this would vastly simplify the parsing we need to do. Regards, Greg

On Jan 13, 2009, at 3:30 PM, Ralph Castain wrote: Hmmm... well, I can't do either for 1.3.0, as it is departing this afternoon. The first option would be very hard to do. I would have to expose the display-map option across the code base and check it prior to printing anything about resolving node names. I guess I should ask: do you only want noderesolve statements when we are displaying the map? Right now, I will output them regardless. The second option could be done. I could check if any "display" option has been specified, and output the root at that time (likewise for the end). Anything we output in-between would be encapsulated between the two, but that would include any user output to stdout and/or stderr - which for 1.3.0 is not in XML. Any thoughts? Ralph

PS. Guess I should clarify that I was not striving for true XML interaction here, but rather a quasi-XML format that would help you to filter the output. I have no problem trying to get to something more formally correct, but it could be tricky in some places to achieve it due to the inherent async nature of the beast.

On Jan 13, 2009, at 12:17 PM, Greg Watson wrote: Ralph, The XML is looking better now, but there is still one problem. To be valid, there needs to be only one root element, but currently you don't have any (or many). So rather than: [XML example stripped by the archive] the XML should be: [XML example stripped by the archive] or: [XML example stripped by the archive] Would either of these be possible? Thanks, Greg

On Dec 8, 2008, at 2:18 PM, Greg Watson wrote: Ok, thanks. I'll test from the trunk in future. Greg

On Dec 8, 2008, at 2:05 PM, Ralph Castain wrote: Working its way around the CMR process now. It might be easier in the future if we could test/debug this in the trunk, though. Otherwise, the CMR procedure will fall behind and a fix might miss a release window. Anyway, hopefully this one will make the 1.3.0 release cutoff. Thanks, Ralph

On Dec 8, 2008, at 9:56 AM, Greg Watson wrote: Hi Ralph, This is now in 1.3rc2, thanks. However, there are a couple of problems. Here is what I see:

[Jarrah.watson.ibm.com:58957] <noderesolve ... resolved="Jarrah.watson.ibm.com">

For some reason each line is prefixed with "[...]", any idea why this is? Also, the end tag should be "/>" not ">". Thanks, Greg

On Nov 24, 2008, at 3:06 PM, Greg Watson wrote: Great, thanks. I'll take a look once it comes over to 1.3. Cheers, Greg

On Nov 24, 2008, at 2:59 PM, Ralph Castain wrote: Yo Greg, This is in the trunk as of r20032. I'll bring it over to 1.3 in a few days. I implemented it as another MCA param, "orte_show_resolved_nodenames", so you can actually get the info as you execute the job, if you want. The xml tag is "noderesolve" - let me know if you need any changes. Ralph

On Oct 22, 2008, at 11:55 AM, Greg Watson wrote: Ralph, I guess the issue for us is that we will have to run two commands to get the information we need: one to get the configuration information, such as version and MCA parameters, and one to get the host information, whereas it would seem more logical that this should all be available via some kind of "configuration discovery" command.
I understand the issue with supplying the hostfile, though, so maybe this just points at the need for us to separate configuration information from the host information. In any case, we'll work with what you think is best. Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote: Hmmm... just to be sure we are all clear on this: the reason we proposed to use mpirun is that "hostfile" has no meaning outside of mpirun. That's why ompi_info can't do anything in this regard. We have no idea what hostfile the user may specify until we actually get the mpirun cmd line. They may have specified a default-hostfile, but they could also specify hostfiles for the individual app_contexts. These may or may not include the node upon which mpirun is executing. So the only way to provide you with a separate command to get a hostfile<->nodename mapping would require you to provide us with the default-hostfile and/or hostfile cmd line options just as if you were issuing
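The archive stripped Greg's literal XML from the quoted message above. The shape of the request, with hypothetical tag names except for noderesolve (which Ralph names in the thread), is a sketch like:

<!-- Invalid: multiple top-level elements, no single root -->
<map>...</map>
<noderesolve ... />
<map>...</map>

<!-- Valid: one root element wrapping everything mpirun emits -->
<mpirun>
  <map>...</map>
  <noderesolve ... />
  <stdout>...</stdout>
</mpirun>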
Re: [OMPI devel] OpenMPI Performance Problem with Open|SpeedShop
If your timer is actually generating an interrupt to the process, then that could be the source of the problem. I believe the event library also treats interrupts as events, and assigns them the highest priority. So every one of your interrupts would cause the event library to stop what it was doing and go into its interrupt handling routine. I'm no expert on the event library, though - just speculating that this could be the source of the problem. Ralph

On Jan 13, 2009, at 8:18 PM, William Hachfeld wrote:

Jeff & George,

> Hum; interesting. I can't think of any reason why that would be a problem offhand. The mca_btl_sm_component_progress() function is the shared memory progression function. opal_progress() and mca_bml_r2_progress() are likely mainly dispatching off to this function.
>
> Does OSS interfere with shared memory between processes in any way? (I'm not enough of a kernel guy to know what the ramifications of ptrace and whatnot are.)

Open|SS shouldn't interfere with shared memory. We use the pthread library to access some TLS, but no shared memory...

> There might be one reason for the application to slow down quite a bit. If your use of a timer interacts with libevent (the library we're using to internally manage any kind of events), then we might end up in a situation where we call poll on every iteration of the event library. And that is really expensive.

I did contemplate the notion that maybe we were getting into the "progress monitoring" part of OpenMPI every time the timer interrupted the process (1000s of times per second). Can either of you see any mechanism by which that might happen?

> A quick way to figure out if this is the case is to run Open MPI without support for shared memory (--mca btl ^sm). That way we will call poll on a regular basis anyway, and if there is no difference between a normal run and an OSS one, we know at least where to start looking...

I ran SMG2000 on an 8-CPU Yellowrail node in the two configurations and recorded the wall/cpu clock times as reported by SMG2000 itself:

"mpirun -np 8 smg2000 -n 32 64 64"
  Struct Interface, wall clock time = 0.042348 seconds
  Struct Interface, cpu clock time  = 0.04 seconds
  SMG Setup, wall clock time = 0.732441 seconds
  SMG Setup, cpu clock time  = 0.73 seconds
  SMG Solve, wall clock time = 6.881814 seconds
  SMG Solve, cpu clock time  = 6.88 seconds

"mpirun --mca btl ^sm -np 8 smg2000 -n 64 64 64"
  Struct Interface, wall clock time = 0.059137 seconds
  Struct Interface, cpu clock time  = 0.06 seconds
  SMG Setup, wall clock time = 0.931437 seconds
  SMG Setup, cpu clock time  = 0.93 seconds
  SMG Solve, wall clock time = 9.107343 seconds
  SMG Solve, cpu clock time  = 9.11 seconds

But running the application with the "--mca btl ^sm" option inside Open|SS also results in an extreme slowdown. I.e., it doesn't make any difference whether the shared memory transport is enabled or not. Open|SS reports time spent as follows (in case this helps pinpoint what is going on inside OpenMPI):

Exclusive CPU time in seconds   Function (defining location)
364.05   btl_openib_component_progress (libmpi.so.0)
165.89   mthca_poll_cq (libmthca-rdmav2.so)
122.09   pthread_spin_lock (libpthread.so.0)
 90.79   opal_progress (libopen-pal.so.0)
 48.23   mca_bml_r2_progress (libmpi.so.0)
 30.88   ompi_request_wait_all (libmpi.so.0)
  9.78   pthread_spin_unlock (libpthread.so.0)
  4.91   mthca_free_srq_wqe (libmthca-rdmav2.so)
  4.91   mthca_unlock_cqs (libmthca-rdmav2.so)
  4.73   mthca_lock_cqs (libmthca-rdmav2.so)
  0.89   __poll (libc.so.6)
...
Does this help at all? -- Bill Hachfeld, The Open|SpeedShop Project ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
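The interrupt mechanism under discussion is easy to reproduce in isolation. Below is a minimal, self-contained sketch (illustration only -- not Open|SS code, and the 1 kHz rate is an assumption) of the kind of per-process interval timer a sampling profiler arms; every expiry delivers a signal that the process, and potentially its event library, must field:

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t samples = 0;

static void on_prof(int sig)
{
    (void)sig;
    samples++;                 /* a real profiler records a PC sample here */
}

int main(void)
{
    struct sigaction sa;
    struct itimerval it;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_prof;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGPROF, &sa, NULL);

    memset(&it, 0, sizeof(it));
    it.it_interval.tv_usec = 1000;      /* re-arm every 1 ms: ~1000 Hz */
    it.it_value.tv_usec = 1000;
    setitimer(ITIMER_PROF, &it, NULL);

    for (volatile long i = 0; i < 200000000L; i++)  /* burn CPU so ITIMER_PROF advances */
        ;

    printf("handler ran %d times\n", (int)samples);
    return 0;
}

At 1000 Hz this produces on the order of a thousand signal deliveries per second per process, consistent with the "1000s of times per second" figure above.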
Re: [OMPI devel] RFC: Fragmented sm Allocations
I haven't reviewed the code either, but I really do appreciate someone taking the time for such a thorough analysis of the problems we have all observed for some time! Thanks Eugene!!

On Jan 14, 2009, at 5:05 AM, Tim Mattox wrote:

Great analysis and suggested changes! I've not had a chance yet to look at your hg branch, so this isn't a code review... Barring a bad code review, I'd say these changes should all go in the trunk for inclusion in 1.4.

2009/1/14 Eugene Loh:

RFC: Fragmented sm Allocations

WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular buffers (CB) during MPI_Init().

Also:
- Improve handling of error codes.
- Automate the sizing of the mmap file.

WHY: To reduce consumption of shared memory, making job startup more robust and possibly improving the scalability of startup.

WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to ompi_fifo_init(). This is where CBs are initialized one at a time, with the components of each CB allocated individually. Changes can be seen in ssh://www.open-mpi.org/~eugene/hg/sm-allocation.

WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.

WHY (details)

The sm BTL establishes a FIFO for each non-self, on-node connection. Each FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs can be added later in program execution if a FIFO runs out of room.)

A CB has different components that are used in different ways:
- The "wrapper" is read by both sender and receiver, but is rarely written.
- The "queue" (FIFO entries) is accessed by both the sender and receiver.
- The "head" is accessed by the sender.
- The "tail" is accessed by the receiver.

For performance reasons, a CB is not allocated as one large data structure. Rather, these components are laid out separately in memory and the wrapper has pointers to the various locations. Performance considerations include:
- false sharing: a component used by one process should not share a cacheline with a component that is modified by another process
- NUMA: certain components should perhaps be mapped preferentially to memory pages that are close to the processes that access them

Currently, the sm BTL handles these issues by allocating each component of each CB its own page. (Actually, it couples tails and queues together.)

As the number of on-node processes grows, however, the shared-memory allocation skyrockets. E.g., let's say there are n processes on-node. There are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head, and tail/queue). The shared-memory allocation for CBs becomes 3n^2 pages. For large n, this dominates the shared-memory consumption, even though most of the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full memory page!

Not only is the 3n^2-page allocation large, but it is also not tunable via any MCA parameters.

Large shared-memory consumption has led to some number of start-up and other user problems. E.g., the e-mail thread at http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)

Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment

The first set of changes reduces pagesize alignment to cacheline alignment. Though mapping to pages is motivated by NUMA locality, note:
- The code already has NUMA locality optimizations (maffinity and mpools) anyhow.
- There is no data that I'm aware of substantiating the benefits of locality optimizations in this context.

More to the point, I've tried some such experiments myself.
I had two processes communicating via shared memory on a large SMP that had a large difference between remote and local memory access times. I timed the roundtrip latency for pingpongs between the processes. That time was correlated to the relative separation between the two processes, and not at all to the placement of the physical memory backing the shared variables. It did not matter if the memory was local to the sender or receiver or remote from both! (E.g., colocal processes showed fast timings even if the shared memory were remote to both processes.)

My results do not prove that all NUMA platforms behave in the same way. My point is only that, though I understand the logic behind locality optimizations for FIFO placement, the only data I am aware of does not substantiate that logic.

The changes are:

File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()
Use the alignment requested by the caller rather than adding additional pagesize alignment as well.

File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init() and ompi_fifo_write_to_head()
Align the ompi_cb_fifo_wrapper_t structure on a cacheline rather than a page.

File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Align the two calls to mpool_alloc on a cacheline rather than a page.

2. Aggregated Allocation

Another option is
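To make the proposed alignment change concrete, here is a small self-contained sketch (not the actual mpool_sm code; the CACHELINE and PAGESIZE constants are illustrative assumptions) of carving component allocations out of a shared segment with cacheline rather than pagesize alignment:

#include <stdio.h>
#include <stddef.h>

#define CACHELINE 64u     /* assumed cacheline size */
#define PAGESIZE  4096u   /* assumed page size */

/* Round offset up to the next multiple of align (align is a power of 2). */
static size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}

/* Reserve size bytes at the given alignment; returns the start offset
 * and advances the cursor. */
static size_t place(size_t *cursor, size_t size, size_t align)
{
    size_t start = align_up(*cursor, align);
    *cursor = start + size;
    return start;
}

int main(void)
{
    size_t cl = 0, pg = 0;

    place(&cl, 12, CACHELINE);   /* a 12-byte FIFO "head", cacheline-aligned */
    place(&pg, 12, PAGESIZE);    /* the same head, page-aligned */

    printf("cacheline policy: %zu bytes; pagesize policy: %zu bytes\n",
           align_up(cl, CACHELINE), align_up(pg, PAGESIZE));
    return 0;
}

Running this prints 64 bytes versus 4096 bytes for a 12-byte "head", which is the per-component saving the RFC describes.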
Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed
Is there some code that can be fixed instead? I.e., is this feature totally incompatible with whatever RPM compiler flags are used, or is it just some coding style that these particular flags don't like?

On Jan 14, 2009, at 5:05 AM, Matthias Jurenz wrote:

Another workaround would be to disable the I/O tracing feature of VT by adding the configure option '--with-contrib-vt-flags=--disable-iotrace'. That will have the effect that the upcoming OMPI RPMs have no support for I/O tracing, but in our opinion that is not so bad... Furthermore, we could add the configure option in 'ompi/contrib/vt/configure.m4' to retain feature-consistency between the RPMs and the source packages. Matthias

On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote:

I don't want to move the changes (the default value of the flag), since there are important people for whom it works :) I also think that this is a VT issue, but I guess we are the only ones who experience the errors.

We can now overwrite these params from the environment as a workaround; Mike committed the buildrpm.sh script to the trunk in r20253, which allows overwriting params from the environment.

We observed the problem on CentOS 5.2 with bundled gcc and RedHat 5.2 with bundled gcc.

#uname -a
Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

#lsb_release -a
LSB Version:    :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: CentOS
Description:    CentOS release 5.2 (Final)
Release:        5.2
Codename:       Final

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Best regards, Lenny.

On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres wrote:

I'm still guessing that this is a distro / compiler issue -- I can build with the default flags just fine...? Can you specify what distro / compiler you were using?

Also, if you want to move the changes that have been made to buildrpm.sh to the v1.3 branch, just file a CMR. That file is not included in release tarballs, so Tim can move it over at any time.

On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote:

It seems that setting use_default_rpm_opt_flags to 0 solves the problem. Maybe the VT developers should take a look at it. Lenny.

On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres wrote:

This sounds like a distro/compiler version issue. Can you narrow down the issue at all?

On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote:

It doesn't happen if I do autogen, configure, and make install -- only when I try to make an rpm from the tar file.

On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres wrote:

This doesn't happen in a normal build of the same tree? I ask because 1.3r20226 builds fine for me both ways (i.e., ./configure; make, and buildrpm.sh).

On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote:

Hi, I am trying to build an rpm from the nightly snapshots of 1.3 with the downloaded buildrpm.sh and ompi.spec file from http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/

I am getting this error:
.
Making all in vtlib
make[5]: Entering directory `/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/OMPI/BUILD/openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
gcc -DHAVE_CONFIG_H -I. -I..
-I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_comp_gnu.o -MD -MP -MF .deps/vt_comp_gnu.Tpo -c -o vt_comp_gnu.o vt_comp_gnu.c

gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_memhook.o -MD -MP -MF .deps/vt_memhook.Tpo -c -o vt_memhook.o vt_memhook.c

gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_memreg.o -MD -MP -MF .deps/vt_memreg.Tpo -c -o vt_memreg.o vt_memreg.c

gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g
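For context on what '--disable-iotrace' turns off: VT's I/O tracing is based on intercepting libc I/O calls. The following is a generic sketch of that interposition technique (not VampirTrace's actual code). One plausible source of friction is that hardened flags such as -Wp,-D_FORTIFY_SOURCE=2 can redirect some libc calls to fortified variants that a plain wrapper never sees, though the truncated log above does not show the actual error:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <sys/types.h>
#include <unistd.h>

/* LD_PRELOAD-able wrapper around write(2), sketching the interposition
 * technique that I/O tracers use.  Illustration only. */
ssize_t write(int fd, const void *buf, size_t count)
{
    static ssize_t (*real_write)(int, const void *, size_t);

    if (real_write == NULL)
        real_write = (ssize_t (*)(int, const void *, size_t))
                     dlsym(RTLD_NEXT, "write");   /* find libc's write */

    /* ... a tracer would record an event for (fd, count) here ... */

    return real_write(fd, buf, count);
}

A wrapper like this would typically be built with 'gcc -shared -fPIC iowrap.c -o iowrap.so -ldl' and activated via LD_PRELOAD.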
Re: [OMPI devel] RFC: Fragmented sm Allocations
Great analysis and suggested changes! I've not had a chance yet to look at your hg branch, so this isn't a code review... Barring a bad code review, I'd say these changes should all go in the trunk for inclusion in 1.4.

2009/1/14 Eugene Loh:
>
> RFC: Fragmented sm Allocations
>
> WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular
> buffers (CB) during MPI_Init().
>
> Also:
>
> Improve handling of error codes.
> Automate the sizing of the mmap file.
>
> WHY: To reduce consumption of shared memory, making job startup more robust,
> and possibly improving the scalability of startup.
>
> WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to
> ompi_fifo_init(). This is where CBs are initialized one at a time,
> components of a CB allocated individually. Changes can be seen in
> ssh://www.open-mpi.org/~eugene/hg/sm-allocation.
>
> WHEN: Upon acceptance.
>
> TIMEOUT: January 30, 2009.
>
> WHY (details)
>
> The sm BTL establishes a FIFO for each non-self, on-node connection. Each
> FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs
> can be added later in program execution if a FIFO runs out of room.)
>
> A CB has different components that are used in different ways:
>
> The "wrapper" is read by both sender and receiver, but is rarely written.
> The "queue" (FIFO entries) is accessed by both the sender and receiver.
> The "head" is accessed by the sender.
> The "tail" is accessed by the receiver.
>
> For performance reasons, a CB is not allocated as one large data structure.
> Rather, these components are laid out separately in memory and the wrapper
> has pointers to the various locations. Performance considerations include:
>
> false sharing: a component used by one process should not share a cacheline
> with another component that is modified by another process
> NUMA: certain components should perhaps be mapped preferentially to memory
> pages that are close to the processes that access these components
>
> Currently, the sm BTL handles these issues by allocating each component of
> each CB its own page. (Actually, it couples tails and queues together.)
>
> As the number of on-node processes grows, however, the shared-memory
> allocation skyrockets. E.g., let's say there are n processes on-node. There
> are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head,
> and tail/queue). The shared-memory allocation for CBs becomes 3n^2 pages. For
> large n, this dominates the shared-memory consumption, even though most of
> the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full
> memory page!
>
> Not only is the 3n^2-page allocation large, but it is also not tunable via
> any MCA parameters.
>
> Large shared-memory consumption has led to some number of start-up and other
> user problems. E.g., the e-mail thread at
> http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.
>
> WHAT (details)
>
> Several actions are recommended here.
>
> 1. Cacheline Rather than Pagesize Alignment
>
> The first set of changes reduces pagesize to cacheline alignment. Though
> mapping to pages is motivated by NUMA locality, note:
>
> The code already has NUMA locality optimizations (maffinity and mpools)
> anyhow.
> There is no data that I'm aware of substantiating the benefits of locality
> optimizations in this context.
>
> More to the point, I've tried some such experiments myself.
> I had two processes communicating via shared memory on a large SMP that had
> a large difference between remote and local memory access times. I timed the
> roundtrip latency for pingpongs between the processes. That time was
> correlated to the relative separation between the two processes, and not at
> all to the placement of the physical memory backing the shared variables. It
> did not matter if the memory was local to the sender or receiver or remote
> from both! (E.g., colocal processes showed fast timings even if the shared
> memory were remote to both processes.)
>
> My results do not prove that all NUMA platforms behave in the same way. My
> point is only that, though I understand the logic behind locality
> optimizations for FIFO placement, the only data I am aware of does not
> substantiate that logic.
>
> The changes are:
>
> File: ompi/mca/mpool/sm/mpool_sm_module.c
> Function: mca_mpool_sm_alloc()
>
> Use the alignment requested by the caller rather than adding additional
> pagesize alignment as well.
>
> File: ompi/class/ompi_fifo.h
> Function: ompi_fifo_init() and ompi_fifo_write_to_head()
>
> Align ompi_cb_fifo_wrapper_t structure on cacheline rather than page.
>
> File: ompi/class/ompi_circular_buffer_fifo.h
> Function: ompi_cb_fifo_init()
>
> Align the two calls to mpool_alloc on cacheline rather than page.
>
> 2. Aggregated Allocation
>
> Another option is to lay out all the CBs at once and aggregate their
> allocations.
>
> This may have the added benefit of reducing lock
Re: [OMPI devel] reduce_scatter bug with hierarch
Unfortunately, although this fixed some problems when enabling hierarch coll, there is still a segfault in two of IU's tests that only shows up when we set -mca coll_hierarch_priority 100. See this MTT summary to see how the failures improved on the trunk, but that there are still two that segfault even at 1.4a1r20267: http://www.open-mpi.org/mtt/index.php?do_redir=923

This link just has the remaining failures: http://www.open-mpi.org/mtt/index.php?do_redir=922

So, I'll vote for applying the CMR for 1.3 since it clearly improved things, but there is still more to be done to get coll_hierarch ready for regular use.

On Wed, Jan 14, 2009 at 12:15 AM, George Bosilca wrote:
> Here we go by the book :)
>
> https://svn.open-mpi.org/trac/ompi/ticket/1749
>
> george.
>
> On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:
>
>> Let's debate tomorrow when people are around, but first you have to file a
>> CMR... :-)
>>
>> On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:
>>
>>> Unfortunately, this pinpoints the fact that we didn't sufficiently test the
>>> mixing of collective modules. I went over the tuned collective functions
>>> and changed all instances to use the correct module information. It is now
>>> on the trunk, revision 20267. Simultaneously, I checked that all other
>>> collective components do the right thing ... and I have to admit tuned was
>>> the only faulty one.
>>>
>>> This is clearly a bug in tuned, and correcting it will allow people
>>> to use hierarch. In the current incarnation, 1.3 will mostly/always
>>> segfault when hierarch is active. I would prefer not to give a broken toy
>>> out there. How about pushing r20267 into 1.3?
>>>
>>> george.
>>>
>>> On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:
>>>
>>>> Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1.
>>>>
>>>> I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and
>>>> since hierarch isn't currently selected by default (you must specifically
>>>> elevate hierarch's priority to get it to run), there's no danger that users
>>>> will run into this problem in default runs. But clearly the problem needs
>>>> to be fixed, and therefore we need a bug to track it.
>>>>
>>>> On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:
>>>>
>>>>> I just debugged the Reduce_scatter bug mentioned previously. The bug is
>>>>> unfortunately not in hierarch, but in tuned.
>>>>>
>>>>> Here is the code snippet causing the problems:
>>>>>
>>>>> int reduce_scatter (..., mca_coll_base_module_t *module)
>>>>> {
>>>>> ...
>>>>> err = comm->c_coll.coll_reduce (..., module)
>>>>> ...
>>>>> }
>>>>>
>>>>> but it should be
>>>>>
>>>>> {
>>>>> ...
>>>>> err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
>>>>> ...
>>>>> }
>>>>>
>>>>> The problem as it is right now is that when using hierarch, only a
>>>>> subset of the functions is set, e.g. reduce, allreduce, bcast and barrier.
>>>>> Thus, reduce_scatter is from tuned in most scenarios, and calls the
>>>>> subsequent functions with the wrong module. Hierarch of course does not
>>>>> like that :-)
>>>>>
>>>>> Anyway, a quick glance through the tuned code reveals a significant
>>>>> number of instances where this appears (reduce_scatter, allreduce,
>>>>> allgather, allgatherv). Basic, hierarch and inter seem to do that mostly
>>>>> correctly.
>>>>>
>>>>> Thanks
>>>>> Edgar
>>>>> --
>>>>> Edgar Gabriel
>>>>> Assistant Professor
>>>>> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
>>>>> Department of Computer Science          University of Houston
>>>>> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
>>>>> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/
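Edgar's fix is easier to see with the data structures spelled out. Below is a simplified sketch (types and names abbreviated from the real ompi structures, not the actual code) of the module-passing rule: each collective function pointer in a communicator's c_coll carries the module that supplied it, and a collective implemented in terms of another must pass that module, not its own.

#include <stddef.h>

typedef struct module module_t;          /* opaque stand-in for mca_coll_base_module_t */
typedef int (*reduce_fn_t)(void *buf, module_t *module);

struct c_coll {
    reduce_fn_t coll_reduce;
    module_t   *coll_reduce_module;      /* the module that supplied coll_reduce */
};

struct comm {
    struct c_coll c_coll;
};

/* A reduce_scatter built on top of reduce, as in the tuned component. */
int reduce_scatter(struct comm *comm, void *buf, module_t *my_module)
{
    (void)my_module;
    /* WRONG (the pre-r20267 pattern): comm->c_coll.coll_reduce(buf, my_module);
     * hands tuned's own module to whichever component (e.g. hierarch)
     * actually provides coll_reduce. */

    /* RIGHT: pass the module that owns the inner function. */
    return comm->c_coll.coll_reduce(buf, comm->c_coll.coll_reduce_module);
}

static int dummy_reduce(void *buf, module_t *module)
{
    (void)buf; (void)module;
    return 0;
}

int main(void)
{
    struct comm c = { { dummy_reduce, NULL } };
    return reduce_scatter(&c, NULL, NULL);
}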
Re: [OMPI devel] OpenMPI rpm build 1.3rc3r20226 build failed
Another workaround would be to disable the I/O tracing feature of VT by adding the configure option '--with-contrib-vt-flags=--disable-iotrace'. That will have the effect that the upcoming OMPI RPMs have no support for I/O tracing, but in our opinion that is not so bad... Furthermore, we could add the configure option in 'ompi/contrib/vt/configure.m4' to retain feature-consistency between the RPMs and the source packages.

Matthias

On Tue, 2009-01-13 at 17:13 +0200, Lenny Verkhovsky wrote:

I don't want to move the changes (the default value of the flag), since there are important people for whom it works :) I also think that this is a VT issue, but I guess we are the only ones who experience the errors.

We can now overwrite these params from the environment as a workaround; Mike committed the buildrpm.sh script to the trunk in r20253, which allows overwriting params from the environment.

We observed the problem on CentOS 5.2 with bundled gcc and RedHat 5.2 with bundled gcc.

#uname -a
Linux elfit1 2.6.18-92.el5 #1 SMP Tue Jun 10 18:51:06 EDT 2008 x86_64 x86_64 x86_64 GNU/Linux

#lsb_release -a
LSB Version:    :core-3.1-amd64:core-3.1-ia32:core-3.1-noarch:graphics-3.1-amd64:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: CentOS
Description:    CentOS release 5.2 (Final)
Release:        5.2
Codename:       Final

gcc version 4.1.2 20071124 (Red Hat 4.1.2-42)

Best regards, Lenny.

On Tue, Jan 13, 2009 at 4:40 PM, Jeff Squyres wrote:

I'm still guessing that this is a distro / compiler issue -- I can build with the default flags just fine...? Can you specify what distro / compiler you were using?

Also, if you want to move the changes that have been made to buildrpm.sh to the v1.3 branch, just file a CMR. That file is not included in release tarballs, so Tim can move it over at any time.

On Jan 13, 2009, at 6:35 AM, Lenny Verkhovsky wrote:

It seems that setting use_default_rpm_opt_flags to 0 solves the problem. Maybe the VT developers should take a look at it. Lenny.

On Sun, Jan 11, 2009 at 2:40 PM, Jeff Squyres wrote:

This sounds like a distro/compiler version issue. Can you narrow down the issue at all?

On Jan 11, 2009, at 3:23 AM, Lenny Verkhovsky wrote:

It doesn't happen if I do autogen, configure, and make install -- only when I try to make an rpm from the tar file.

On Thu, Jan 8, 2009 at 9:43 PM, Jeff Squyres wrote:

This doesn't happen in a normal build of the same tree? I ask because 1.3r20226 builds fine for me both ways (i.e., ./configure; make, and buildrpm.sh).

On Jan 8, 2009, at 8:15 AM, Lenny Verkhovsky wrote:

Hi, I am trying to build an rpm from the nightly snapshots of 1.3 with the downloaded buildrpm.sh and ompi.spec file from http://svn.open-mpi.org/svn/ompi/branches/v1.3/contrib/dist/linux/

I am getting this error:
.
Making all in vtlib
make[5]: Entering directory `/hpc/home/USERS/lennyb/work/svn/release/scripts/dist-1.3--1/OMPI/BUILD/openmpi-1.3rc3r20226/ompi/contrib/vt/vt/vtlib'
gcc -DHAVE_CONFIG_H -I. -I..
-I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_comp_gnu.o -MD -MP -MF .deps/vt_comp_gnu.Tpo -c -o vt_comp_gnu.o vt_comp_gnu.c

gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_memhook.o -MD -MP -MF .deps/vt_memhook.Tpo -c -o vt_memhook.o vt_memhook.c

gcc -DHAVE_CONFIG_H -I. -I.. -I../tools/opari/lib -I../extlib/otf/otflib -I../extlib/otf/otflib -D_GNU_SOURCE -DBINDIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/bin\" -DDATADIR=\"/opt/openmpi/1.3rc3r20226-V00/gcc/share\" -DRFG -DVT_MEMHOOK -DVT_IOWRAP -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -MT vt_memreg.o -MD -MP -MF
[OMPI devel] RFC: Fragmented sm Allocations
RFC: Fragmented sm Allocations

WHAT: Dealing with the fragmented allocations of sm BTL FIFO circular buffers (CB) during MPI_Init().

Also:
- Improve handling of error codes.
- Automate the sizing of the mmap file.

WHY: To reduce consumption of shared memory, making job startup more robust and possibly improving the scalability of startup.

WHERE: In mca_btl_sm_add_procs(), there is a loop over calls to ompi_fifo_init(). This is where CBs are initialized one at a time, with the components of each CB allocated individually. Changes can be seen in ssh://www.open-mpi.org/~eugene/hg/sm-allocation.

WHEN: Upon acceptance.

TIMEOUT: January 30, 2009.

WHY (details)

The sm BTL establishes a FIFO for each non-self, on-node connection. Each FIFO is initialized during MPI_Init() with a circular buffer (CB). (More CBs can be added later in program execution if a FIFO runs out of room.)

A CB has different components that are used in different ways:
- The "wrapper" is read by both sender and receiver, but is rarely written.
- The "queue" (FIFO entries) is accessed by both the sender and receiver.
- The "head" is accessed by the sender.
- The "tail" is accessed by the receiver.

For performance reasons, a CB is not allocated as one large data structure. Rather, these components are laid out separately in memory and the wrapper has pointers to the various locations. Performance considerations include:
- false sharing: a component used by one process should not share a cacheline with a component that is modified by another process
- NUMA: certain components should perhaps be mapped preferentially to memory pages that are close to the processes that access them

Currently, the sm BTL handles these issues by allocating each component of each CB its own page. (Actually, it couples tails and queues together.)

As the number of on-node processes grows, however, the shared-memory allocation skyrockets. E.g., let's say there are n processes on-node. There are therefore n(n-1) = O(n^2) FIFOs, each with 3 allocations (wrapper, head, and tail/queue). The shared-memory allocation for CBs becomes 3n^2 pages. For large n, this dominates the shared-memory consumption, even though most of the CB allocation is unused. E.g., a 12-byte "head" ends up consuming a full memory page!

Not only is the 3n^2-page allocation large, but it is also not tunable via any MCA parameters.

Large shared-memory consumption has led to some number of start-up and other user problems. E.g., the e-mail thread at http://www.open-mpi.org/community/lists/devel/2008/11/4882.php.

WHAT (details)

Several actions are recommended here.

1. Cacheline Rather than Pagesize Alignment

The first set of changes reduces pagesize alignment to cacheline alignment. Though mapping to pages is motivated by NUMA locality, note:
- The code already has NUMA locality optimizations (maffinity and mpools) anyhow.
- There is no data that I'm aware of substantiating the benefits of locality optimizations in this context.

More to the point, I've tried some such experiments myself. I had two processes communicating via shared memory on a large SMP that had a large difference between remote and local memory access times. I timed the roundtrip latency for pingpongs between the processes. That time was correlated to the relative separation between the two processes, and not at all to the placement of the physical memory backing the shared variables. It did not matter if the memory was local to the sender or receiver or remote from both!
(E.g., colocal processes showed fast timings even if the shared memory were remote to both processes.)

My results do not prove that all NUMA platforms behave in the same way. My point is only that, though I understand the logic behind locality optimizations for FIFO placement, the only data I am aware of does not substantiate that logic.

The changes are:

File: ompi/mca/mpool/sm/mpool_sm_module.c
Function: mca_mpool_sm_alloc()
Use the alignment requested by the caller rather than adding additional pagesize alignment as well.

File: ompi/class/ompi_fifo.h
Function: ompi_fifo_init() and ompi_fifo_write_to_head()
Align the ompi_cb_fifo_wrapper_t structure on a cacheline rather than a page.

File: ompi/class/ompi_circular_buffer_fifo.h
Function: ompi_cb_fifo_init()
Align the two calls to mpool_alloc on a cacheline rather than a page.

2. Aggregated Allocation

Another option is to lay out all the CBs at once and aggregate their allocations. This may have the added benefit of reducing lock contention during MPI_Init(). On the one hand, the 3n^2 CB allocations during MPI_Init() contend for a single mca_common_sm_mmap->map_seg->seg_lock lock. On the other hand, I know so far of no data
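A quick back-of-envelope program makes the growth described in the RFC concrete (it assumes 4 KiB pages; the factor 3 is the wrapper/head/tail+queue allocations per FIFO):

#include <stdio.h>

int main(void)
{
    const long page = 4096;                /* assumed page size */
    for (long n = 8; n <= 128; n *= 2) {
        long fifos = n * (n - 1);          /* one FIFO per ordered on-node pair */
        long pages = 3 * fifos;            /* wrapper + head + tail/queue, one page each */
        printf("n=%3ld: %6ld pages = %6.1f MiB\n",
               n, pages, pages * page / (1024.0 * 1024.0));
    }
    return 0;
}

For n = 128 on-node processes this comes to 48,768 pages, roughly 190 MiB of shared memory consumed by CB allocations alone.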
Re: [OMPI devel] autosizing the shared memory backing file
Thanks for the reply. I kind of understand, but it's rather weird. The BTL calls mca_mpool_base_module_create() to create a pool of memory, but the BTL has no say in how big a pool to create? Could you imagine having a memory allocation routine ("malloc" or something) that didn't allow you to control the size of the allocation? Instead, the allocation routine determines the size. That's weird. I must be missing something about how this is supposed to work.

E.g., I see that there is a "resources" argument (mca_mpool_base_resources_t). Maybe that structure should be expanded to include a "size" field? Or maybe I should bypass mca_mpool_base_module_create()/mca_mpool_sm_init() and just call mca_common_sm_mmap_init() directly, the way mca/coll/sm does things. That would allow me to specify the size of the file.

George Bosilca wrote:

The simple answer is you can't. The mpool is loaded before the BTLs, and on Linux the loader uses the RTLD_NOW flag (i.e., all symbols have to be defined or the dlopen call will fail). Moreover, there is no way in Open MPI to exchange information between components except a global variable or something in mca/common. In other words, there is no way for you to call from the mpool a function from the sm BTL.

On Jan 13, 2009, at 19:22 , Eugene Loh wrote:

With the sm BTL, there is a file that each process mmaps in for shared memory. I'm trying to get mpool_sm to size the file appropriately. So, I would like mpool_sm to call some mca_btl_sm function that provides a good guess of the size. (mpool_sm creates and mmaps the file, but the size depends on parameters like eager limit and max frag size that are known by the btl_sm.)
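To illustrate the "size field" idea floated above, here is a hypothetical sketch of that direction. Note that the "size" field does not exist in the real mca_mpool_base_resources_t, and the sizing formula is purely illustrative:

#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified resources structure: the BTL fills in the
 * size it wants before calling the mpool create function. */
typedef struct {
    char   mpool_file[256];   /* backing file name (simplified) */
    size_t size;              /* requested size of the mmap file */
} mpool_base_resources_t;

/* Illustrative only: how btl_sm might estimate the size it needs from
 * parameters it already knows (eager limit, frags per peer, nprocs). */
static size_t sm_backing_file_size(size_t nprocs, size_t eager_limit,
                                   size_t frags_per_peer)
{
    return nprocs * nprocs * frags_per_peer * eager_limit + (1u << 20);
}

int main(void)
{
    mpool_base_resources_t res = { "sm_segment", 0 };
    res.size = sm_backing_file_size(8, 4096, 128);
    printf("%s: request %zu bytes\n", res.mpool_file, res.size);
    return 0;
}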
Re: [OMPI devel] reduce_scatter bug with hierarch
Here we go by the book :)

https://svn.open-mpi.org/trac/ompi/ticket/1749

george.

On Jan 13, 2009, at 23:40 , Jeff Squyres wrote:

Let's debate tomorrow when people are around, but first you have to file a CMR... :-)

On Jan 13, 2009, at 10:28 PM, George Bosilca wrote:

Unfortunately, this pinpoints the fact that we didn't sufficiently test the mixing of collective modules. I went over the tuned collective functions and changed all instances to use the correct module information. It is now on the trunk, revision 20267. Simultaneously, I checked that all other collective components do the right thing ... and I have to admit tuned was the only faulty one.

This is clearly a bug in tuned, and correcting it will allow people to use hierarch. In the current incarnation, 1.3 will mostly/always segfault when hierarch is active. I would prefer not to give a broken toy out there. How about pushing r20267 into 1.3?

george.

On Jan 13, 2009, at 20:13 , Jeff Squyres wrote:

Thanks for digging into this. Can you file a bug? Let's mark it for v1.3.1.

I say 1.3.1 instead of 1.3.0 because this *only* affects hierarch, and since hierarch isn't currently selected by default (you must specifically elevate hierarch's priority to get it to run), there's no danger that users will run into this problem in default runs. But clearly the problem needs to be fixed, and therefore we need a bug to track it.

On Jan 13, 2009, at 2:09 PM, Edgar Gabriel wrote:

I just debugged the Reduce_scatter bug mentioned previously. The bug is unfortunately not in hierarch, but in tuned.

Here is the code snippet causing the problems:

int reduce_scatter (..., mca_coll_base_module_t *module)
{
...
err = comm->c_coll.coll_reduce (..., module)
...
}

but it should be

{
...
err = comm->c_coll.coll_reduce (..., comm->c_coll.coll_reduce_module);
...
}

The problem as it is right now is that when using hierarch, only a subset of the functions is set, e.g. reduce, allreduce, bcast and barrier. Thus, reduce_scatter is from tuned in most scenarios, and calls the subsequent functions with the wrong module. Hierarch of course does not like that :-)

Anyway, a quick glance through the tuned code reveals a significant number of instances where this appears (reduce_scatter, allreduce, allgather, allgatherv). Basic, hierarch and inter seem to do that mostly correctly.

Thanks
Edgar
--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel