Re: [OMPI devel] regression with derived datatypes
Since I have a system that has the scif libraries installed, I will try to
reproduce this and see if I can come up with a fix. It will probably be
sometime next week at the earliest.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet
[gilles.gouaillar...@iferc.org]
Sent: Wednesday, May 07, 2014 9:03 PM
To: de...@open-mpi.org
Subject: Re: [OMPI devel] regression with derived datatypes

On 2014/05/08 2:15, Ralph Castain wrote:
> I wonder if that might also explain the issue reported by Gilles regarding
> the scif BTL? In his example, the problem only occurred if the message was
> split across scif and vader. If so, then it might be that splitting messages
> in general is broken.

i am afraid there is a misunderstanding:
the problem always occurs with scif,vader,self (regardless of the ompi v1.8 version)
the problem occurs with scif,self only if r31496 is applied to ompi v1.8

In my previous email
http://www.open-mpi.org/community/lists/devel/2014/05/14699.php
i reported the following interesting fact:

with ompi v1.8 (latest r31678), the following command produces incorrect results:
mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

but with ompi v1.8 r31309, the very same command produces correct results.

Elena pointed out that r31496 is a suspect, so i took the latest v1.8
(r31678), reverted r31496, and ...

mpirun -host localhost -np 2 --mca btl scif,self ./test_scif

works again!

note that the "default"
mpirun -host localhost -np 2 --mca btl scif,vader,self ./test_scif
still produces incorrect results.

in order to reproduce the issue, a MIC is *not* needed; you only need to
install the software stack, load the mic kernel module, and make sure you
can read/write /dev/mic/*

bottom line, there are two issues here:
1) r31496 broke something: mpirun -np 2 -host localhost --mca btl scif,self ./test_scif
2) something else never worked: mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif

Gilles
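Gilles's prerequisites can be checked quickly before running the reproducer.
A hedged sketch of such a pre-flight check (the device paths and library name
follow his description; exact names may vary with the MPSS version installed):

# is the mic kernel module loaded?
lsmod | grep mic
# are the scif devices present, and are they readable/writable?
ls -l /dev/mic/*
# is the scif library visible to the dynamic linker?
ldconfig -p | grep libscif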
Re: [OMPI devel] regression with derived datatypes
Nathan, or anybody with access to the target hardware,

If you can provide a minimal output of the application with and without the
above-mentioned patch, and with mpi_ddt_unpack_debug, mpi_ddt_pack_debug,
and mpi_ddt_position_debug set to 1, I would try to help.

George.

On Thu, May 8, 2014 at 2:50 AM, Hjelm, Nathan T wrote:
> Since I have a system that has the scif libraries installed I will try to
> reproduce and see if I can come up with a fix. It will probably be
> sometime next week at the earliest.
>
> -Nathan
>
> From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet
> [gilles.gouaillar...@iferc.org]
> Sent: Wednesday, May 07, 2014 9:03 PM
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] regression with derived datatypes
>
> [...]
Re: [OMPI devel] regression with derived datatypes
George,

you do not need any hardware, just download MPSS from Intel and install it.
make sure the mic kernel module is loaded *and* you can read/write to the
newly created /dev/mic/* devices.

/* i am now running this on a virtual machine with no MIC whatsoever */

i was able to improve things a bit for the new attached test case
/* send MPI_PACKED / recv newtype */
with the attached unpack.patch.

it has to be applied on r31678 (aka the latest checkout of the v1.8 branch)

with this patch (zero regression testing so far; it might solve one problem
but break something else!)

mpirun -np 2 -host localhost --mca btl scif,self ./test_scif2

works fine :-)

but

mpirun -np 2 -host localhost --mca btl scif,vader ./test_scif2

still crashes (and it did not crash before r31496)

i will provide the output you requested shortly

Cheers,

Gilles

/*
 * This test is an oversimplified version of collective/bcast_struct
 * that comes with the ibm test suite.
 * it must be run on two tasks on a single host where the MIC software stack
 * is present (e.g. libscif.so is present, the mic driver is loaded, and
 * /dev/mic/* are accessible) and the scif btl is available.
 *
 * mpirun -np 2 -host localhost --mca btl scif,vader,self ./test_scif
 * will produce incorrect results with trunk and v1.8
 *
 * mpirun -np 2 --mca btl ^scif -host localhost ./test_scif
 * will work with trunk and v1.8
 *
 * mpirun -np 2 --mca btl scif,self -host localhost ./test_scif
 * will produce correct results with v1.8 r31309 (but eventually crash in
 * MPI_Finalize) and produce incorrect results with v1.8 r31671 and trunk r31667
 *
 * Copyright (c) 2011      Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2014      Research Organization for Information Science
 *                         and Technology (RIST). All rights reserved.
 */
/*
   MESSAGE PASSING INTERFACE TEST CASE SUITE

   Copyright IBM Corp. 1995

   IBM Corp. hereby grants a non-exclusive license to use, copy, modify, and
   distribute this software for any purpose and without fee provided that the
   above copyright notice and the following paragraphs appear in all copies.

   IBM Corp. makes no representation that the test cases comprising this
   suite are correct or are an accurate representation of any standard.

   In no event shall IBM be liable to any party for direct, indirect, special
   incidental, or consequential damage arising out of the use of this software
   even if IBM Corp. has been advised of the possibility of such damage.

   IBM CORP. SPECIFICALLY DISCLAIMS ANY WARRANTIES INCLUDING, BUT NOT LIMITED
   TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
   PURPOSE. THE SOFTWARE PROVIDED HEREUNDER IS ON AN "AS IS" BASIS AND IBM
   CORP. HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES,
   ENHANCEMENTS, OR MODIFICATIONS.

   These test cases reflect an interpretation of the MPI Standard. They are,
   in most cases, unit tests of specific MPI behaviors. If a user of any test
   case from this set believes that the MPI Standard requires behavior
   different than that implied by the test case we would appreciate feedback.

   Comments may be sent to:
      Richard Treumann
      treum...@kgn.ibm.com
*/
#include <stdio.h>
#include <stdlib.h>
#include <poll.h>
#include "mpi.h"

#define ompitest_error(file,line,...) \
    { fprintf(stderr, "FUCK at %s:%d root=%d size=%d (i,j)=(%d,%d)\n", \
              file, line, root, i0, i, j); \
      MPI_Abort(MPI_COMM_WORLD, 1); }

const int SIZE = 1000;

int main(int argc, char **argv)
{
    int myself;
    double a[2], t_stop;
    int ii, size;
    int len[2];
    MPI_Aint disp[2];
    MPI_Datatype type[2], newtype, t1, t2;
    struct foo_t {
        int i[3];
        double d[3];
    } foo, *bar;
    struct pfoo_t {
        int i[2];
        double d[2];
    } pfoo, *pbar;
    int i0, i, j, root, nseconds = 600, done_flag;
    int _dbg = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myself);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    // _dbg = (0 == myself);
    while (_dbg) poll(NULL, 0, 1);

    if (argc > 1)
        nseconds = atoi(argv[1]);
    t_stop = MPI_Wtime() + nseconds;

    /*---------------------------------------------------------------*/
    /* Build a datatype that is guaranteed to have holes;
       send/recv large numbers of them */
    MPI_Type_vector(2, 1, 2, MPI_INT, &t1);
    MPI_Type_commit(&t1);
    MPI_Type_vector(2, 1, 2, MPI_DOUBLE, &t2);
    MPI_Type_commit(&t2);
    len[0] = len[1] = 1;
    MPI_Address(&foo.i[0], &disp[0]);
    MPI_Address(&foo.d[0], &disp[1]);
    printf("%d: %lx %lx\n", myself, (long) disp[0], (long) disp[1]);
    disp[0] -= (MPI_Aint) &foo;
    disp[1] -= (MPI_Aint) &foo;
    printf("%d: %ld %ld\n", myself, (long) disp[0], (long) disp[1]);
    type[0] = t1;
    type[1] = t2;
    MPI_Type_struct(2, len, disp, type, &newtype);
    MPI_Type_commit(&newtype);
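For anyone rebuilding this test, a short hedged addition (not part of
Gilles's original program; it reuses the program's own newtype and myself
variables) can confirm the constructed struct type really contains holes:
with holes, the extent exceeds the packed size.

    MPI_Aint lb, extent;
    int pack_size;
    /* extent covers the holes; pack_size counts only the real data */
    MPI_Type_get_extent(newtype, &lb, &extent);
    MPI_Pack_size(1, newtype, MPI_COMM_WORLD, &pack_size);
    printf("%d: newtype lb=%ld extent=%ld packed=%d bytes\n",
           myself, (long) lb, (long) extent, pack_size);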
Re: [OMPI devel] regression with derived datatypes
If you can get me the backtrace from one of the crash core files I would
like to see what is going on there.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of Gilles Gouaillardet
[gilles.gouaillar...@iferc.org]
Sent: Thursday, May 08, 2014 1:32 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] regression with derived datatypes

[...]
Re: [OMPI devel] regression with derived datatypes
Nathan and George,

here are the output files of the original test_scif.c
the command line was

mpirun -np 2 -host localhost --mca btl scif,vader,self --mca
mpi_ddt_unpack_debug 1 --mca mpi_ddt_pack_debug 1 --mca
mpi_ddt_position_debug 1 a.out

this is a silent failure and there is no core file
the test itself detects it did not receive the expected value
/* grep "expected" in the output */

Gilles

On 2014/05/08 16:43, Hjelm, Nathan T wrote:
> If you can get me the backtrace from one of the crash core files I would
> like to see what is going on there.
Re: [OMPI devel] regression with derived datatypes
Hi,

My reproducer failed even with only one port enabled
(-mca btl_openib_if_include mlx4_0:1). I tried with trunk as well; the
same issue.

Best,
Elena

On Thu, May 8, 2014 at 11:49 AM, Gilles Gouaillardet
<gilles.gouaillar...@iferc.org> wrote:
> Nathan and George,
>
> here are the output files of the original test_scif.c
>
> [...]
Re: [OMPI devel] regression with derived datatypes
Nathan and George,

here are the (compressed) traces.

Gilles

On 2014/05/08 16:43, Hjelm, Nathan T wrote:
> If you can get me the backtrace from one of the crash core files I would
> like to see what is going on there.
>
> -Nathan
>
> [...]

r31678.log.bz2
Description: Binary data

r31678withoutr31496.log.bz2
Description: Binary data
[OMPI devel] RFC: Remove autogen.sh sym link
WHAT: Remove the backwards-compatibility autogen.sh sym link

WHY: Because it's time

WHERE: svn rm autogen.sh

TIMEOUT: Teleconf next Tuesday, 13 May 2014

MORE DETAIL:

We converted from autogen.sh to autogen.pl nearly 4 years ago (2010-09-17).
The autogen.sh->autogen.pl sym link was put in shortly thereafter as a
stopgap measure to give people time to update their automated scripts from
autogen.sh to autogen.pl (or better yet, test and see which name they
should invoke).

Every time I type "./au<tab>", it stops at "./autogen.", which is just
annoying.

It's been nearly 4 years. I think it's time to cut the cord: remove the
autogen.sh sym link and move on.

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Remove autogen.sh sym link
+1

On Thu, May 8, 2014 at 6:08 AM, Jeff Squyres (jsquyres) wrote:
> WHAT: Remove the backwards-compatibility autogen.sh sym link
>
> WHY: Because it's time
>
> WHERE: svn rm autogen.sh
>
> TIMEOUT: Teleconf next Tuesday, 13 May 2014
>
> [...]
Re: [OMPI devel] RFC: Remove autogen.sh sym link
+1

Best Steve...

On 5/8/14, 6:08 AM, Jeff Squyres (jsquyres) wrote:
> WHAT: Remove the backwards-compatibility autogen.sh sym link
>
> WHY: Because it's time
>
> WHERE: svn rm autogen.sh
>
> TIMEOUT: Teleconf next Tuesday, 13 May 2014
>
> [...]
Re: [OMPI devel] RFC: Remove autogen.sh sym link
This will break my build but it's an easy fix so don't let that stop you.

Ashley.

On 8 May 2014, at 11:08, Jeff Squyres (jsquyres) wrote:
> WHAT: Remove the backwards-compatibility autogen.sh sym link
>
> WHY: Because it's time
>
> WHERE: svn rm autogen.sh
>
> TIMEOUT: Teleconf next Tuesday, 13 May 2014
>
> [...]
[OMPI devel] VPATH builds broken?
I started getting build failures against trunk on the 29th, most likely as a
result of this commit:

https://github.com/open-mpi/ompi-svn-mirror/commit/3f42cbf50670c5b311cc4414dbb3f4ccf762e455

It looks like there was another commit almost immediately afterwards which
fixed the first problem (include file errors); however, I'm still seeing
build failures with the following error. I don't know if this is still a
side effect of the previous VPATH problem or something else.

Making all in mpi
make[10]: Entering directory `/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/extlib/otf/tools/otfmerge/mpi'
ln -s ../../../../../../../../../../source/ompi/contrib/vt/vt/extlib/otf/tools/otfmerge/handler.c handler.c
  CC       otfmerge_mpi-handler.o
ln -s ../../../../../../../../../../source/ompi/contrib/vt/vt/extlib/otf/tools/otfmerge/otfmerge.c otfmerge.c
  CC       otfmerge_mpi-otfmerge.o
  CCLD     otfmerge-mpi
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_dstore_peer'
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_value_load'
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_value_unload'
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_dstore_nonpeer'
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_dstore_internal'
/space/jenkins/workspace/open-mpi/build/ompi/contrib/vt/vt/../../../.libs/libmpi.so: undefined reference to `opal_dstore'
collect2: error: ld returned 1 exit status
make[10]: *** [otfmerge-mpi] Error 1
...
make: *** [all-recursive] Error 1

The build script I'm using is fairly simple; it works from a clean checkout
each time but does a "VPATH" or out-of-tree build:

cd source
./autogen.sh
cd ..
[ -d build ] && rm -rf build
[ -d build ] && rm -rf install
mkdir build
cd build
../source/configure --enable-mpirun-prefix-by-default --prefix $WORKSPACE/install
make
make install

Ashley,
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Hi, Adam

We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
push upstream. However, to use it, it will require linking in a proprietary
Mellanox library that accelerates the collective operations (available in
MOFED versions 2.1 and higher.) Similar in spirit to the MXM MTL or FCA
COLL components in OMPI.

Best,

Josh

On Wed, May 7, 2014 at 11:45 AM, Moody, Adam T. wrote:
> Hi Josh,
> Are your changes to OMPI or SLURM's PMI2 implementation? Do you plan to
> push those changes back upstream?
> -Adam
>
> From: devel [devel-boun...@open-mpi.org] on behalf of Joshua Ladd
> [jladd.m...@gmail.com]
> Sent: Wednesday, May 07, 2014 7:56 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is
> specifically requested
>
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think
> it would be good to maybe add a "--with-pmi2" flag to configure, since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM
> or hack the installation.
>
> Josh

On Wed, May 7, 2014 at 10:45 AM, Ralph Castain wrote:
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did
>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't
>> keep getting asked about PMI-2 bugs in Slurm. Once the Slurm
>> implementation stabilized, then we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd wrote:
>>
>> Just saw this thread, and I second Chris' observations: at scale we
>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>> provide exact numbers, but let's say the difference is in the ballpark
>> of a full order-of-magnitude on 20K ranks versus PMI1. PMI1 is
>> completely unacceptable/unusable at scale. Certainly PMI2 still has
>> scaling issues, but there is no contest between PMI1 and PMI2. We
>> (MLNX) are actively working to resolve some of the scalability issues
>> in PMI2.
>>
>> Josh
>>
>> Joshua S. Ladd
>> Staff Engineer, HPC Software
>> Mellanox Technologies
>>
>> Email: josh...@mellanox.com
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain wrote:
>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>> becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel wrote:
>>>
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you posted, Chris -
>>> >> those include time for MPI wireup. So what you are seeing is that
>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>> >> requires that everything be encoded into strings and sent in little
>>> >> pieces.
>>> >>
>>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>>> >> operation) much faster, and MPI_Init completes faster. Rest of the
>>> >> computation should be the same, so long compute apps will see the
>>> >> difference narrow considerably.
>>> >
>>> > Unfortunately it looks like I had an enthusiastic cleanup at some
>>> > point and so I cannot find the out files from those runs at the
>>> > moment, but I did find some comparisons from around that time.
>>> >
>>> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
>>> > run with mpirun and srun successively from inside the same Slurm job.
>>> >
>>> > mpirun namd2 macpf.conf
>>> > srun --mpi=pmi2 namd2 macpf.conf
>>> >
>>> > Firstly the mpirun output (grep'ing the interesting bits):
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB memory
>>> > Info
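Ralph's point about blobs versus strings is concrete in the PMI-1 KVS
interface: every piece of endpoint data has to fit into a bounded key/value
string pair. A minimal hedged sketch (the key naming and the pre-encoded
value are illustrative assumptions, not OMPI's actual modex code):

#include <stdio.h>
#include <pmi.h>

/* Publish one rank's endpoint info through PMI-1. Binary data must first
   be encoded into a printable string (assumed done by the caller), then
   stored under a per-rank key; peers read it back after the barrier. */
static void publish_endpoint(int rank, const char *encoded_blob)
{
    char kvsname[256], key[64];

    PMI_KVS_Get_my_name(kvsname, sizeof(kvsname));
    snprintf(key, sizeof(key), "ep-%d", rank);
    PMI_KVS_Put(kvsname, key, encoded_blob);  /* one bounded string at a time */
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();                            /* sync before peers can Get */
}

By contrast, mpirun's out-of-band channel can ship each process's modex
contribution as a single opaque blob, which is consistent with MPI_Init
completing faster under mpirun in Chris's numbers.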
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:

> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
> push upstream. However, to use it, it will require linking in a proprietary
> Mellanox library that accelerates the collective operations (available in
> MOFED versions 2.1 and higher.)

What about those of us who cannot run Mellanox OFED?

All the best,
Chris
--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/        http://twitter.com/vlsci
Re: [OMPI devel] RFC: Remove autogen.sh sym link
On May 8, 2014, at 8:59 AM, Ashley Pittman wrote:

> This will break my build but it's an easy fix so don't let that stop you.

Something like this should do ya:

--- bogus       2014-05-08 06:26:19.759259593 -0700
+++ bogus-new   2014-05-08 06:26:22.567481480 -0700
@@ -14,7 +14,11 @@
 
 
 
-./autogen.sh
+if test -x autogen.sh; then
+    ./autogen.sh
+else
+    ./autogen.pl
+fi
 
 
 

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Remove autogen.sh sym link
I was thinking of something even easier than that ;) I try to keep an eye
on the message queue functionality, so it's not often that I need to build
code over four years old from source.

Ashley.

On 8 May 2014, at 14:27, Jeff Squyres (jsquyres) wrote:

> On May 8, 2014, at 8:59 AM, Ashley Pittman wrote:
>
>> This will break my build but it's an easy fix so don't let that stop you.
>
> Something like this should do ya:
>
> [...]
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On May 8, 2014, at 6:23 AM, Chris Samuel wrote:

> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
>
>> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
>> push upstream. However, to use it, it will require linking in a proprietary
>> Mellanox library that accelerates the collective operations (available in
>> MOFED versions 2.1 and higher.)
>
> What about those of us who cannot run Mellanox OFED?

Artem and I are working on a new PMIx plugin that will resolve it for
non-Mellanox cases.

> All the best,
> Chris
Re: [OMPI devel] VPATH builds broken?
I'm unable to reproduce your error, even with a git clone of the mirror.
Perhaps you need to "git clean -df"?

On May 8, 2014, at 9:09 AM, Ashley Pittman wrote:

> I started getting build failures against trunk on the 29th, most likely
> as a result of this commit:
>
> https://github.com/open-mpi/ompi-svn-mirror/commit/3f42cbf50670c5b311cc4414dbb3f4ccf762e455
>
> [...]

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
Chris,

The necessary packages will be supported and available in community OFED.

Josh

On Thu, May 8, 2014 at 9:23 AM, Chris Samuel wrote:
> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
>
>> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
>> push upstream. However, to use it, it will require linking in a proprietary
>> Mellanox library that accelerates the collective operations (available in
>> MOFED versions 2.1 and higher.)
>
> What about those of us who cannot run Mellanox OFED?
Re: [OMPI devel] VPATH builds broken?
Ah, it was something on my end. I had a bug in my build script: it wasn't
wiping the install directory before doing the build. This might be an
indication that something in the build is picking up the install directory
in preference to the build directory, but I don't think that would
represent a real problem; frankly, I'm surprised this worked as long as
it did.

Ashley,

On 8 May 2014, at 14:52, Jeff Squyres (jsquyres) wrote:

> I'm unable to reproduce your error, even with a git clone of the mirror.
> Perhaps you need to "git clean -df"?
>
> [...]
Re: [OMPI devel] RFC: continue cleanup of build system abstractions
This RFC is now complete - the renaming exercise is done. My apologies to
all for the churn, and my deepest thanks for your patience. I know it will
take a while to get used to using the revised names and to avoid breaking
the abstractions going forward.

We have a "canary" for most of the abstraction breaks, so we can deal with
them rather quickly when they occur. Please let me know if/when you hit
issues and we'll fix them as quickly as possible. I think the system is
pretty close to right, but (as usual) there may be things in areas we
can't compile that are broken.

Thanks again for your patience during this transition.
Ralph

On Apr 27, 2014, at 4:39 PM, Ralph Castain wrote:

> WHAT: continue the cleanup of build system abstractions that was started
>       a couple of years ago by Brian, Jeff, and I. The objective is to
>       fix all the naming conventions for things like OMPI_CHECK_PACKAGE
>       so they accurately reflect their targeted level in the code base
>       - e.g., OMPI_foo gets used for things in the MPI layer. This
>       basically just corrects some historical decisions made before we
>       cared as much about abstractions
>
> WHEN: to be done in a series of commits over the next two months
>
> HOW: a simple search_replace.pl across the repo
>
> First step:
>     OMPI_CHECK_PACKAGE        -> OPAL_CHECK_PACKAGE
>     OMPI_CHECK_FUNC_LIB       -> OPAL_CHECK_FUNC_LIB
>     OMPI_CHECK_COMPILER_WORKS -> OPAL_CHECK_COMPILER_WORKS
>     OMPI_CHECK_WITHDIR        -> OPAL_CHECK_WITHDIR
>
> TIMEOUT: if nobody raises an objection, sometime after the Tues telecon
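For anyone maintaining out-of-tree components against this change, a hedged
sketch of the kind of tree-wide rename the RFC describes (the actual tool
was search_replace.pl; this sed-based equivalent is illustrative only, and
the macro mapping is taken verbatim from the RFC's "First step" list):

# Rename the four macros across all m4 files in the tree.
for pair in \
    OMPI_CHECK_PACKAGE:OPAL_CHECK_PACKAGE \
    OMPI_CHECK_FUNC_LIB:OPAL_CHECK_FUNC_LIB \
    OMPI_CHECK_COMPILER_WORKS:OPAL_CHECK_COMPILER_WORKS \
    OMPI_CHECK_WITHDIR:OPAL_CHECK_WITHDIR
do
    old=${pair%:*}; new=${pair#*:}
    # find every m4 file that mentions the old name, then rewrite in place
    grep -rl "$old" --include='*.m4' . | xargs -r sed -i "s/$old/$new/g"
done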
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 08/05/14 23:45, Ralph Castain wrote:

> Artem and I are working on a new PMIx plugin that will resolve it
> for non-Mellanox cases.

Ah yes of course, sorry my bad!

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/        http://twitter.com/vlsci
Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested
On 09/05/14 00:16, Joshua Ladd wrote:

> The necessary packages will be supported and available in community
> OFED.

We're constrained to what is in RHEL6, I'm afraid. This is because we have
to run GPFS over IB to BG/Q from the same NSDs that talk GPFS to all our
Intel clusters.

We did try MOFED 2.x (in connected mode) on a new Intel cluster during its
bring-up last year, which worked for MPI but stopped it talking to the
NSDs. Reverting to vanilla RHEL6 fixed it.

Not your problem though. :-) As Ralph has said, there is work on an
alternative solution that we will be able to use.

Thanks!
Chris

--
Christopher Samuel        Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: sam...@unimelb.edu.au    Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/        http://twitter.com/vlsci