Re: [OMPI devel] Exit status
On Apr 14 2011, Ralph Castain wrote:

> I've run across an interesting issue for which I don't have a ready answer. If an MPI process aborts, we automatically abort the entire job. If an MPI process returns a non-zero exit status, indicating that there was something abnormal about its termination, we ignore it and let the job continue. We do print an error message out upon completion of the job, but we don't terminate the job upon receiving the non-zero status.
>
> Note that non-zero status is considered a "standard" method of indicating abnormal termination, though no meaning has been agreed upon for the specific value.

Not really. See below.

> Should we be allowing the job to continue in that circumstance? In the case I'm reviewing, the user's code indicates there is an error in the result. Since he has already called MPI_Finalize, he can't call MPI_Abort, and his system won't allow him to drop cores by calling "abort". So the exit status is his only way of indicating "abnormal termination". Obviously, in this case, he would prefer the job terminate as nothing useful is going to be accomplished - so no point in tying up the machine.
>
> Thoughts?

Blame Unix. Seriously. Many or most mainframes had the following categories:

  Complete success - or, rather, a failure to detect an error :-)
  Partial success, with warnings of potential problems
  Failure that was diagnosed and partially cleaned-up
  Heap horrible failure - all bets are off

Unix has no such categorisation. The distinction between a zero return and other values can occur at any point, and some programs even use them as flags. It's hopeless, and whatever you do will be wrong for many people. I have no idea what Microsoft do, but assume that it has copied Unix, as that is its SOP.

I recommend NOT rocking this boat. He might do better by calling abort after MPI_Finalize, but that's pretty iffy - just like all other approaches. To improve this needs a new function or argument to MPI_Finalize.

Regards,
Nick Maclaren.
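For concreteness, here is a minimal, generic POSIX sketch (not Open MPI's actual code) of what a launcher such as orterun can observe when a child terminates: a death by signal (abort(), kill -9, a crash) is reported through a separate mechanism, while a plain exit() only yields the low 8 bits of a status whose meaning nobody has agreed on.

#include <sys/types.h>
#include <sys/wait.h>
#include <stdio.h>

/* Reap one child and report the only two things the parent can distinguish. */
static void report_child(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0) {
        return;
    }
    if (WIFSIGNALED(status)) {
        /* abort(), kill -9, SIGSEGV, ...: unambiguously abnormal */
        printf("child %d killed by signal %d\n", (int)pid, WTERMSIG(status));
    } else if (WIFEXITED(status)) {
        /* exit(n): only the low 8 bits survive, and n != 0 carries no agreed meaning */
        printf("child %d exited with status %d\n", (int)pid, WEXITSTATUS(status));
    }
}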
Re: [OMPI devel] Exit status
On Apr 14, 2011, at 4:02 AM, N.M. Maclaren wrote:

> ... It's hopeless, and whatever you do will be wrong for many people. ...

I think that sums it up pretty well. :-)

It does seem a little strange that the scenario you describe somewhat implies that one process is calling MPI_Finalize long before the others do. Specifically, the user is concerned with tying up resources after one process has called Finalize -- which implies that the others may continue on for a while. It's not invalid, of course, but it is a little unusual.

I see two possibilities here:

1. have the user delay calling MPI_Finalize in the application until it can do the test that indicates that the rest of the job should be aborted (i.e., so that it can still call MPI_Abort if it wants to). Don't forget that an implementation is allowed to block in MPI_Finalize until all processes call MPI_Finalize, anyway. (A minimal sketch of this option appears below.)

2. add an MCA param and/or orterun CLI option to abort a job if an MPI process terminates after MPI_Finalize with a nonzero exit status.

Just my $0.02. :-)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
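A minimal sketch of the first option, with a hypothetical application-level check_result() standing in for whatever test the user's code actually performs:

#include <mpi.h>
#include <stdlib.h>

static int check_result(void)
{
    return 0;   /* hypothetical application-specific sanity test: nonzero means "bad" */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* ... compute ... */

    if (check_result() != 0) {
        /* Still inside MPI, so this can take the whole job down immediately. */
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    MPI_Finalize();   /* only reached when the result looks sane */
    return EXIT_SUCCESS;
}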
Re: [OMPI devel] Exit status
On Apr 14, 2011, at 5:33 AM, Jeff Squyres wrote:

> On Apr 14, 2011, at 4:02 AM, N.M. Maclaren wrote:
>
>> ... It's hopeless, and whatever you do will be wrong for many people. ...
>
> I think that sums it up pretty well. :-)
>
> It does seem a little strange that the scenario you describe somewhat implies that one process is calling MPI_Finalize long before the others do. Specifically, the user is concerned with tying up resources after one process has called Finalize -- which implies that the others may continue on for a while. It's not invalid, of course, but it is a little unusual.

I'm finding it more common than we thought. Note that I didn't say that one process called MPI_Finalize before the others. In this case, they call it fairly close together, but the individual processes continue running for quite some time, or until they determine that something is wrong and exit with non-zero status.

> I see two possibilities here:
>
> 1. have the user delay calling MPI_Finalize in the application until it can do the test that indicates that the rest of the job should be aborted (i.e., so that it can still call MPI_Abort if it wants to). Don't forget that an implementation is allowed to block in MPI_Finalize until all processes call MPI_Finalize, anyway.
>
> 2. add an MCA param and/or orterun CLI option to abort a job if an MPI process terminates after MPI_Finalize with a nonzero exit status.

I figure this last is the best option. My point was just that we abort the job if someone calls "abort". However, if they indicate their program is exiting with "something is wrong", we ignore it. Not that big a deal - the param was my option too. Just thought I'd raise it to the group since it had never been discussed.

> Just my $0.02. :-)
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Exit status
On Apr 14, 2011, at 9:13 AM, Ralph Castain wrote:

> I figure this last is the best option. My point was just that we abort the job if someone calls "abort". However, if they indicate their program is exiting with "something is wrong", we ignore it.

Another option for the user is to call kill(getpid(), 9). That would kill the entire job, no? :-)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
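In the user's code that suggestion would look roughly like the sketch below (the something_went_wrong flag is a hypothetical stand-in for the application's own test). The process then dies "by signal" rather than by a nonzero exit status, which the runtime already treats as abnormal termination:

#include <mpi.h>
#include <signal.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int something_went_wrong = 1;   /* hypothetical application-level flag */

    MPI_Init(&argc, &argv);
    /* ... work ... */
    MPI_Finalize();

    if (something_went_wrong) {
        /* Terminate via SIGKILL instead of exit(nonzero). */
        kill(getpid(), SIGKILL);
    }
    return 0;
}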
Re: [OMPI devel] Exit status
On Apr 14 2011, Ralph Castain wrote:

> ... It's hopeless, and whatever you do will be wrong for many people. ...
>
> I think that sums it up pretty well. :-)
>
> It does seem a little strange that the scenario you describe somewhat implies that one process is calling MPI_Finalize long before the others do. Specifically, the user is concerned with tying up resources after one process has called Finalize -- which implies that the others may continue on for a while. It's not invalid, of course, but it is a little unusual.
>
> I'm finding it more common than we thought. Note that I didn't say that one process called MPI_Finalize before the others. In this case, they call it fairly close together, but the individual processes continue running for quite some time, or until they determine that something is wrong and exit with non-zero status.

Nobody is denying that it is common. Now, what happens when you encounter a language or compiler that uses return codes for mere warnings (e.g. ignored IEEE 754 flags, as stated to be desirable by LIA-1)? Bang!

Remember that C is not the universe and many languages use MPI via the C interface, but do not let C control their model.

Regards,
Nick Maclaren.
Re: [OMPI devel] Exit status
Point well made, Nick. In other words, irrespective of OS or language, are we citing the need for "application correcting code" from OpenMPI, (relocate a/o retry) similar to ECC in memory? Ken On Thu, 2011-04-14 at 14:31 +0100, N.M. Maclaren wrote: > On Apr 14 2011, Ralph Castain wrote: > >> > >>> ... It's hopeless, and whatever you do will be wrong for many > >>> people. ... > >> > >> I think that sums it up pretty well. :-) > >> > >> It does seem a little strange that the scenario you describe somewhat > >> implies that one process is calling MPI_Finalize lng before the > >> others do. Specifically, the user is concerned with tying up resources > >> after one process has called Finalize -- which implies that the others > >> may continue on for a while. It's not invalid, of course, but it is a > >> little unusual. > > > > I'm finding it more common than we thought. Note that I didn't say that > > one process called MPI_Finalize before the others. In this case, they > > call it fairly close together, but the individual processes continue > > running for quite some time, or until they determine that something is > > wrong and exit with non-zero status. > > Nobody is denying that it is common. Now, what happens when you encounter > a language or compiler that uses return codes for mere warnings (e.g. > ignored IEEE 754 flags, as stated to be desirable by LIA-1)? Bang! > > Remember that C is not the universe and many languages use MPI via the > C interface, but do not let C control their model. > > Regards, > Nick Maclaren. > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel = Kenneth A. Lloyd CEO - Director of Systems Science Watt Systems Technologies Inc. www.wattsys.com kenneth.ll...@wattsys.com This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.
Re: [OMPI devel] Exit status
I think Ralph's point is that OMPI is providing the run-time environment for the application, and it would probably behoove us to support both kinds of behaviors since there are obviously people in both camps out there. It's pretty easy to add a non-default MCA param / orterun CLI option to say "abort the job if any of them exit with a non-zero status." On Apr 14, 2011, at 9:43 AM, Ken Lloyd wrote: > Point well made, Nick. In other words, irrespective of OS or language, are we > citing the need for "application correcting code" from OpenMPI, (relocate a/o > retry) similar to ECC in memory? > > Ken > > On Thu, 2011-04-14 at 14:31 +0100, N.M. Maclaren wrote: >> On Apr 14 2011, Ralph Castain wrote: >> >> >> >>> ... It's hopeless, and whatever you do will be wrong for many >> >>> people. ... >> >> >> >> I think that sums it up pretty well. :-) >> >> >> >> It does seem a little strange that the scenario you describe somewhat >> >> implies that one process is calling MPI_Finalize lng before the >> >> others do. Specifically, the user is concerned with tying up resources >> >> after one process has called Finalize -- which implies that the others >> >> may continue on for a while. It's not invalid, of course, but it is a >> >> little unusual. >> > >> > I'm finding it more common than we thought. Note that I didn't say that >> > one process called MPI_Finalize before the others. In this case, they >> > call it fairly close together, but the individual processes continue >> > running for quite some time, or until they determine that something is >> > wrong and exit with non-zero status. >> >> Nobody is denying that it is common. Now, what happens when you encounter >> a language or compiler that uses return codes for mere warnings (e.g. >> ignored IEEE 754 flags, as stated to be desirable by LIA-1)? Bang! >> >> Remember that C is not the universe and many languages use MPI via the >> C interface, but do not let C control their model. >> >> Regards, >> Nick Maclaren. >> >> ___ >> devel mailing list >> >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel > > = > Kenneth A. Lloyd > CEO - Director of Systems Science > Watt Systems Technologies Inc. > www.wattsys.com > kenneth.ll...@wattsys.com > > This e-mail is covered by the Electronic Communications Privacy Act, 18 > U.S.C. 2510-2521 and is intended only for the addressee named above. It may > contain privileged or confidential information. If you are not the addressee > you must not copy, distribute, disclose or use any of the information in it. > If you have received it in error please delete it and immediately notify the > sender. > > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
Hello Rolf,

CUDA support is always welcome. Please see my comments below.

+#if OMPI_CUDA_SUPPORT
+fl->fl_frag_block_alignment = 0;
+fl->fl_flags = 0;
+#endif

[pasha] It seems that the "fl_flags" is a hack that allows you to do the second (cuda) registration in mpool_rdma:

+#if OMPI_CUDA_SUPPORT
+if ((flags & MCA_MPOOL_FLAGS_CUDA_MEM) && mca_common_cuda_registered_memory) {
+mca_common_cuda_register(addr, size,
+ mpool->mpool_component->mpool_version.mca_component_name);
+ }
+#endif

[pasha] This is really a _hack_ to enable multiple device registration. I would prefer to see a new mpool component that supports multiple device registrations, in contrast to the single device registration in mpool_rdma.

 fl->fl_payload_buffer_size=0;
 fl->fl_payload_buffer_alignment=0;
 fl->fl_frag_class = OBJ_CLASS(ompi_free_list_item_t);
@@ -190,7 +194,19 @@
 alloc_size = num_elements * head_size + sizeof(ompi_free_list_memory_t) + flist->fl_frag_alignment;
+#if OMPI_CUDA_SUPPORT
+/* Hack for TCP since there is no memory pool. */
+if (flist->fl_frag_block_alignment) {
+alloc_size = OPAL_ALIGN(alloc_size, 4096, size_t);
+if((errno = posix_memalign((void *)&alloc_ptr, 4096, alloc_size)) != 0) {
+alloc_ptr = NULL;
+}
+} else {
+alloc_ptr = (ompi_free_list_memory_t*)malloc(alloc_size);
+}
+#else
 alloc_ptr = (ompi_free_list_memory_t*)malloc(alloc_size);
+#endif

[pasha] I would prefer not to _hack_ ompi_free_list in order to work around BTL-related issues. Such kinds of problems should be handled by the tcp btl. If you think that the free list or mpool interface is not flexible enough, we can discuss updating or modifying the interface. IMHO that is much better than a hack.

Regards,
Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory

On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote:

> WHAT: Add support to send data directly from CUDA device memory via MPI calls.
>
> TIMEOUT: April 25, 2011
>
> DETAILS: When programming in a mixed MPI and CUDA environment, one cannot currently send data directly from CUDA device memory. The programmer first has to move the data into host memory, and then send it. On the receiving side, it has to first be received into host memory, and then copied into CUDA device memory.
>
> This RFC adds the ability to send and receive CUDA device memory directly.
>
> There are three basic changes being made to add the support. First, when it is detected that a buffer is CUDA device memory, the protocols that can be used are restricted to the ones that first copy data into internal buffers. This means that we will not be using the PUT and RGET protocols, just the send and receive ones. Secondly, rather than using memcpy to move the data into and out of the host buffers, the library has to use a special CUDA copy routine called cuMemcpy. Lastly, to improve performance, the internal host buffers have to also be registered with the CUDA environment (although currently it is unclear how helpful that is).
>
> By default, the code is disabled and has to be configured into the library.
> --with-cuda(=DIR) Build cuda support, optionally adding DIR/include, DIR/lib, and DIR/lib64
> --with-cuda-libdir=DIR Search for cuda libraries in DIR
>
> An initial implementation can be viewed at:
> https://bitbucket.org/rolfv/ompi-trunk-cuda-3
>
> Here is a list of the files being modified so one can see the scope of the impact.
> $ svn status
> M VERSION
> M opal/datatype/opal_convertor.h
> M opal/datatype/opal_datatype_unpack.c
> M opal/datatype/opal_datatype_pack.h
> M opal/datatype/opal_convertor.c
> M opal/datatype/opal_datatype_unpack.h
> M configure.ac
> M ompi/mca/btl/sm/btl_sm.c
> M ompi/mca/btl/sm/Makefile.am
> M ompi/mca/btl/tcp/btl_tcp_component.c
> M ompi/mca/btl/tcp/btl_tcp.c
> M ompi/mca/btl/tcp/Makefile.am
> M ompi/mca/btl/openib/btl_openib_component.c
> M ompi/mca/btl/openib/btl_openib_endpoint.c
> M ompi/mca/btl/openib/btl_openib_mca.c
> M ompi/mca/mpool/sm/Makefile.am
> M ompi/mca/mpool/sm/mpool_sm_module.c
> M ompi/mca/mpool/rdma/mpool_rdma_module.c
> M ompi/mca/mpool/rdma/Makefile.am
> M ompi/mca/mpool/mpool.h
> A ompi/mca/common/cuda
> A ompi/mca/common/cuda/configure.m4
> A ompi/mca/common/cuda/common_cuda.c
> A ompi/mca/common/cuda/help-mpi-common-cuda.txt
> A ompi/mca/common/cuda/Makefile.am
> A ompi/mca/common/cuda/common_cuda.h
> M ompi/mca/pml/ob1/pml_ob1_component.c
> M ompi/mca/pml/ob1/pml_ob1_sendreq.h
> M ompi/mca/pml/ob1/pml_ob1_recvreq.h
> M ompi/mca/pml/ob1/Makefile.am
>
Re: [OMPI devel] Exit status
On Apr 14 2011, Jeff Squyres wrote: I think Ralph's point is that OMPI is providing the run-time environment for the application, and it would probably behoove us to support both kinds of behaviors since there are obviously people in both camps out there. It's pretty easy to add a non-default MCA param / orterun CLI option to say "abort the job if any of them exit with a non-zero status." That's not a problem! Any more than a similar one to provide timeouts, both on inactivity and on total running time (useful for teaching). Unless options are unclean or excessive, they can be ignored by people who don't want them. Regards, Nick Maclaren.
[OMPI devel] Problem of memory lost in MPI_Type_create_hindexed() with count = 1 (patch proposed)
Calling MPI_Type_create_hindexed(int count, int array_of_blocklengths[], MPI_Aint array_of_displacements[], MPI_Datatype oldtype, MPI_Datatype *newtype) with a count parameter of 1 causes a loss of memory detected by valgrind:

==2053== 576 (448 direct, 128 indirect) bytes in 1 blocks are definitely lost in loss record 157 of 182
==2053==    at 0x4C2415D: malloc (vg_replace_malloc.c:195)
==2053==    by 0x4E7CEC7: opal_obj_new (opal_object.h:469)
==2053==    by 0x4E7D134: ompi_datatype_create (ompi_datatype_create.c:71)
==2053==    by 0x4E7D58E: ompi_datatype_create_hindexed (ompi_datatype_create_indexed.c:89)
==2053==    by 0x4EA74D0: PMPI_Type_create_hindexed (ptype_create_hindexed.c:75)
==2053==    by 0x401A5C: main (in /home_nfs/xxx/type_create_hindexed)

This can be reproduced with the following trivial code:
=
#include "mpi.h"

MPI_Datatype newtype;
int lg[3];
MPI_Aint disp[3];

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    disp[0] = (MPI_Aint)disp;
    disp[1] = (MPI_Aint)disp+1;
    lg[0] = 5;
    lg[1] = 5;

    MPI_Type_create_hindexed(1, lg, disp, MPI_BYTE, &newtype);
    MPI_Type_free(&newtype);

    MPI_Finalize();
}
==
If MPI_Type_create_hindexed() is called with a count parameter greater than 1, valgrind does not detect any lost record.

Patch proposed:

hg diff ompi/datatype/ompi_datatype_create_indexed.c
diff -r a2d94a70f474 ompi/datatype/ompi_datatype_create_indexed.c
--- a/ompi/datatype/ompi_datatype_create_indexed.c   Wed Mar 30 18:47:31 2011 +0200
+++ b/ompi/datatype/ompi_datatype_create_indexed.c   Thu Apr 14 16:16:08 2011 +0200
@@ -91,11 +91,6 @@
     dLength = pBlockLength[0];
     endat = disp + dLength * extent;
-    if( 1 >= count ) {
-        pdt = ompi_datatype_create( oldType->super.desc.used + 2 );
-        /* multiply by count to make it zero if count is zero */
-        ompi_datatype_add( pdt, oldType, count * dLength, disp, extent );
-    } else {
     for( i = 1; i < count; i++ ) {
         if( endat == pDisp[i] ) {
             /* contiguous with the previsious */
@@ -109,7 +104,6 @@
         }
     }
     ompi_datatype_add( pdt, oldType, dLength, disp, extent );
-    }
     *newType = pdt;
     return OMPI_SUCCESS;
 }

Explanation:
The case (0 == count) was already handled earlier by returning. The problem is that, in the case (1 >= count), ompi_datatype_create() is called again (it has just been called before). In fact the case (1 == count) is not different from the case (1 < count), so it is possible to simply drop the if-else statement.

We need a patch for the OpenMPI 1.5 branch.
Re: [OMPI devel] Problem of memory lost in MPI_Type_create_hindexed() with count = 1 (patch proposed)
That looks reasonable to me, but I'd also re-indent the body of the else{} (i.e., remove 4 spaces from each). George? On Apr 14, 2011, at 10:48 AM, Pascal Deveze wrote: > Calling MPI_Type_create_hindexed(int count, int array_of_blocklengths[], > MPI_Aint array_of_displacements[], MPI_Datatype oldtype, > MPI_Datatype *newtype) > with a count parameter of 1 causes a loss of memory detected by valgrind : > > ==2053== 576 (448 direct, 128 indirect) bytes in 1 blocks are definitely lost > in loss record 157 of 182 > ==2053==at 0x4C2415D: malloc (vg_replace_malloc.c:195) > ==2053==by 0x4E7CEC7: opal_obj_new (opal_object.h:469) > ==2053==by 0x4E7D134: ompi_datatype_create (ompi_datatype_create.c:71) > ==2053==by 0x4E7D58E: ompi_datatype_create_hindexed > (ompi_datatype_create_indexed.c:89) > ==2053==by 0x4EA74D0: PMPI_Type_create_hindexed > (ptype_create_hindexed.c:75) > ==2053==by 0x401A5C: main (in /home_nfs/xxx/type_create_hindexed) > > This can be reproduced with the following trivial code: > = > #include "mpi.h" > > MPI_Datatype newtype; > int lg[3]; > MPI_Aint disp[3]; > > int main(int argc, char **argv) { > MPI_Init(&argc,&argv); > > disp[0] = (MPI_Aint)disp; > disp[1] = (MPI_Aint)disp+1; > lg[0] = 5; > lg[1] = 5; > > MPI_Type_create_hindexed(1, lg, disp, MPI_BYTE, &newtype); > MPI_Type_free(&newtype); > > MPI_Finalize(); > } > == > If MPI_Type_create_hindexed() is called with a count parameter greater 1, > valgrind does not detect any lost record. > > Patch proposed: > > hg diff ompi/datatype/ompi_datatype_create_indexed.c > diff -r a2d94a70f474 ompi/datatype/ompi_datatype_create_indexed.c > --- a/ompi/datatype/ompi_datatype_create_indexed.c Wed Mar 30 18:47:31 > 2011 +0200 > +++ b/ompi/datatype/ompi_datatype_create_indexed.c Thu Apr 14 16:16:08 > 2011 +0200 > @@ -91,11 +91,6 @@ >dLength = pBlockLength[0]; >endat = disp + dLength * extent; > -if( 1 >= count ) { > -pdt = ompi_datatype_create( oldType->super.desc.used + 2 ); > -/* multiply by count to make it zero if count is zero */ > -ompi_datatype_add( pdt, oldType, count * dLength, disp, extent ); > -} else { >for( i = 1; i < count; i++ ) { >if( endat == pDisp[i] ) { >/* contiguous with the previsious */ > @@ -109,7 +104,6 @@ >} >} >ompi_datatype_add( pdt, oldType, dLength, disp, extent ); > -} >*newType = pdt; >return OMPI_SUCCESS; > } > > Explanation: > The case (0 == count) was resolved before by returning. > The problem is that, in the case ( 1 >= count ), ompi_datatype_create() is > called again (it has been just called before). > In fact the case (1 == count) is not different of the case (1 < count), so > it is possible to just avoid the if-else statement. > > We need a patch for OpenMPI 1.5 branch. > > ___ > devel mailing list > de...@open-mpi.org > http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote:

> By default, the code is disabled and has to be configured into the library.
> --with-cuda(=DIR) Build cuda support, optionally adding DIR/include, DIR/lib, and DIR/lib64
> --with-cuda-libdir=DIR Search for cuda libraries in DIR

My $0.02: cuda shouldn't be disabled by default (and only enabled if you --with-cuda). If configure finds all the Right cuda magic, then cuda support should be enabled by default. Just like all other optional support libraries that OMPI uses.

More specifically: the cuda support code in OMPI should strive to be such that it can be enabled by default and not cause any performance penalties to codes that do not need/use any cuda stuff. I'm not saying I know how to do that -- I'm just saying that that should be the goal. :-)

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote:

> An initial implementation can be viewed at:
> https://bitbucket.org/rolfv/ompi-trunk-cuda-3

Random comments on the code...

1. I see changes like this:

   mca_btl_sm_la_LIBADD += \
       $(top_ompi_builddir)/ompi/mca/common/cuda/libmca_common_cuda.la

   But I don't see any common/cuda function calls in the SM BTL. Why the link?

2. I see a new "opal_output(-1,.." in btl_tcp.c. If it's a developer-only opal_output, it should be compiled out by default.

3. In ompi_free_list.c, you call posix_memalign(), protected by OMPI_CUDA_SUPPORT. Does posix_memalign() exist in Windows, and/or does OMPI_CUDA_SUPPORT exclude Windows?

4. Along with what Pasha said, it seems odd to put a CUDA-specific value in mpool.h (MCA_MPOOL_FLAGS_CUDA_MEM). --> Some explanation is required for this comment. My gut reaction is to have portable code in OMPI, such that we can support multiple registration-necessary memory pools. That being said, NVIDIA is the first mover here; is there any other interest in ever wanting to be able to register other kinds of memory, too? Or should we let NVIDIA do it this way on the assumption that it will be years before anyone *might* want to use some other multi-memory-registration scheme? I can see both sides of the coin here...

5. In pml_ob1_sendreq.h, you set size to 0 if OMPI_CUDA_SUPPORT. That means that any OMPI compiled with CUDA support will have this value -- regardless of whether they're using accelerators or not. Shouldn't there be a compile-time check AND a run-time check for this kind of thing?

6. Instead of #if OMPI_CUDA_SUPPORT to select which memcpy to use, why not have a different opal memcopy MCA component for the cuda memcpy? Would that make a bunch of convertor #if OMPI_CUDA_SUPPORT's go away? (A rough sketch combining points 5 and 6 appears below.)

7. Using the name OMPI_* down in OPAL doesn't seem like a good idea (there are still some OMPI_* preprocessor names down in there that haven't yet been converted to OPAL_*, but adding new OMPI_* names down there doesn't seem to be a good idea).

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
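To illustrate points 5 and 6: a rough sketch (not code from the RFC branch; opal_cuda_runtime_enabled and opal_cuda_is_gpu_buffer are invented names) of how a convertor copy could combine the compile-time guard with a run-time check, so that builds without CUDA, and CUDA builds that find no device at startup, stay on the plain memcpy path:

#include <string.h>
#include <stdint.h>
#if OMPI_CUDA_SUPPORT
#include <cuda.h>
#endif

extern int opal_cuda_runtime_enabled;               /* hypothetical: set once at startup */
extern int opal_cuda_is_gpu_buffer(const void *p);  /* hypothetical: UVA pointer query */

static inline void convertor_copy(void *dst, const void *src, size_t len)
{
#if OMPI_CUDA_SUPPORT                                /* compile-time: code only exists when built with CUDA */
    if (opal_cuda_runtime_enabled &&                 /* run-time: a usable device was actually found */
        (opal_cuda_is_gpu_buffer(src) || opal_cuda_is_gpu_buffer(dst))) {
        /* cuMemcpy handles host or device pointers in either direction under UVA (CUDA 4.0). */
        cuMemcpy((CUdeviceptr)(uintptr_t)dst, (CUdeviceptr)(uintptr_t)src, len);
        return;
    }
#endif
    memcpy(dst, src, len);                           /* the normal, penalty-free path */
}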
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
I'd suggest supporting CUDA device queries in carto and hwloc. Ken On Thu, 2011-04-14 at 11:25 -0400, Jeff Squyres wrote: > On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote: > > > By default, the code is disable and has to be configured into the library. > > --with-cuda(=DIR) Build cuda support, optionally adding DIR/include, > > DIR/lib, and DIR/lib64 > > --with-cuda-libdir=DIR Search for cuda libraries in DIR > > My $0.02: cuda shouldn't be disabled by default (and only enabled if you > --with-cuda). If configure finds all the Right cuda magic, then cuda support > should be enabled by default. Just like all other optional support libraries > that OMPI uses. > > More specifically: the cuda support code in OMPI should strive to be such > that it can be enabled by default and not cause any performance penalties to > codes that do not need/use any cuda stuff. I'm not saying I know how to do > that -- I'm just saying that that should be the goal. :-) >
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
>> By default, the code is disabled and has to be configured into the library.
>> --with-cuda(=DIR) Build cuda support, optionally adding DIR/include, DIR/lib, and DIR/lib64
>> --with-cuda-libdir=DIR Search for cuda libraries in DIR
>
> My $0.02: cuda shouldn't be disabled by default (and only enabled if you --with-cuda). If configure finds all the Right cuda magic, then cuda support should be enabled by default. Just like all other optional support libraries that OMPI uses.

Actually I'm not sure that it is a good idea to enable CUDA by default, since it disables the zero-copy protocol, which is critical for good performance.

My 0.02$
Pasha.
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 13, 2011, at 20:07, Ken Lloyd wrote:

> George, Yes. GPUDirect eliminated an additional (host) memory buffering step between the HCA and the GPU that took CPU cycles.

If this is the case then why do we need to use special memcpy functions to copy the data back into the host memory prior to using the send/recv protocol? If GPUDirect removes the need for host buffering then as soon as the memory is identified as being on the device (using the Unified Virtual Addressing), the device can deliver it directly to the network card.

george.

> I was never very comfortable with the kernel patch necessary, nor the patched OFED required to make it all work. Having said that, it did provide a ~14% improvement in throughput on some of my code. Not bad.
>
> Now comes GPUDirect 2.0 (mostly helping GPU-GPU across PCIe) and Unified Virtual Addressing. Holds great promise, but the real understanding comes from whitebox analysis, and instrumenting my app code.
>
> On Wed, 2011-04-13 at 17:21 -0400, George Bosilca wrote:
>> On Apr 13, 2011, at 14:48, Rolf vandeVaart wrote:
>>
>>> This work does not depend on GPU Direct. It is making use of the fact that one can malloc memory, register it with IB, and register it with CUDA via the new 4.0 cuMemHostRegister API. Then one can copy device memory into this memory.
>>
>> Wasn't that the point behind GPUDirect? To allow direct memory copy between the GPU and the network card without external intervention?
>>
>> george.
>>
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> =
> Kenneth A. Lloyd
> CEO - Director of Systems Science
> Watt Systems Technologies Inc.
> www.wattsys.com
> kenneth.ll...@wattsys.com
>
> This e-mail is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521 and is intended only for the addressee named above. It may contain privileged or confidential information. If you are not the addressee you must not copy, distribute, disclose or use any of the information in it. If you have received it in error please delete it and immediately notify the sender.
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

"I disapprove of what you say, but I will defend to the death your right to say it" -- Evelyn Beatrice Hall
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On 14/04/2011 17:58, George Bosilca wrote:

> On Apr 13, 2011, at 20:07, Ken Lloyd wrote:
>
>> George, Yes. GPUDirect eliminated an additional (host) memory buffering step between the HCA and the GPU that took CPU cycles.
>
> If this is the case then why do we need to use special memcpy functions to copy the data back into the host memory prior to using the send/recv protocol? If GPUDirect removes the need for host buffering then as soon as the memory is identified as being on the device (using the Unified Virtual Addressing), the device can deliver it directly to the network card.

GPUDirect is only about using the same host buffer for DMA from/to both the NIC and the GPU. Without GPUDirect, you have a host buffer for the GPU and another one for IB (looks like some strange memory registration problem to me...), and you have to memcpy between them in the middle.

We're all confused with the name "GPUDirect" because we remember people doing DMA directly between the NIC and a GPU or SCSI disk ten years ago. GPUDirect doesn't go that far unfortunately :/

Brice
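To make that concrete, here is a rough sketch (not the RFC's actual code; error handling omitted, and a verbs protection domain plus a current CUDA context are assumed to exist) of the "same host buffer for both" idea: one bounce buffer registered with the HCA and with the CUDA driver, into which cuMemcpyDtoH copies the device data before the fragment is posted for sending.

#include <cuda.h>
#include <infiniband/verbs.h>
#include <stdlib.h>

static struct ibv_mr *stage_gpu_fragment(struct ibv_pd *pd, CUdeviceptr src,
                                         size_t len, void **host_buf_out)
{
    void *host_buf = NULL;
    struct ibv_mr *mr;

    posix_memalign(&host_buf, 4096, len);            /* page-aligned bounce buffer */
    mr = ibv_reg_mr(pd, host_buf, len,               /* register once with the HCA */
                    IBV_ACCESS_LOCAL_WRITE);
    cuMemHostRegister(host_buf, len, 0);             /* register the SAME buffer with CUDA (4.0 API) */
    cuMemcpyDtoH(host_buf, src, len);                /* device -> shared host buffer, no extra memcpy */

    *host_buf_out = host_buf;
    return mr;                                       /* ready to be posted via ibv_post_send() */
}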
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
hwloc (since 1.1, on Linux) can already tell you which CPUs are close to a CUDA device, see
https://svn.open-mpi.org/trac/hwloc/browser/trunk/include/hwloc/cuda.h and
https://svn.open-mpi.org/trac/hwloc/browser/trunk/include/hwloc/cudart.h
Do you need anything else?

Brice

On 14/04/2011 17:44, Ken Lloyd wrote:

> I'd suggest supporting CUDA device queries in carto and hwloc.
>
> Ken
>
> On Thu, 2011-04-14 at 11:25 -0400, Jeff Squyres wrote:
>> On Apr 13, 2011, at 12:47 PM, Rolf vandeVaart wrote:
>>
>>> By default, the code is disabled and has to be configured into the library.
>>> --with-cuda(=DIR) Build cuda support, optionally adding DIR/include, DIR/lib, and DIR/lib64
>>> --with-cuda-libdir=DIR Search for cuda libraries in DIR
>>
>> My $0.02: cuda shouldn't be disabled by default (and only enabled if you --with-cuda). If configure finds all the Right cuda magic, then cuda support should be enabled by default. Just like all other optional support libraries that OMPI uses.
>>
>> More specifically: the cuda support code in OMPI should strive to be such that it can be enabled by default and not cause any performance penalties to codes that do not need/use any cuda stuff. I'm not saying I know how to do that -- I'm just saying that that should be the goal. :-)
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
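For example, with the hwloc 1.1 bitmap API, something along these lines (a sketch, assuming a loaded topology and an already-obtained CUdevice handle) is enough to find the CPUs near a given CUDA device:

#include <hwloc.h>
#include <hwloc/cuda.h>
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

static void show_cuda_locality(hwloc_topology_t topology, CUdevice dev)
{
    hwloc_cpuset_t cpuset = hwloc_bitmap_alloc();
    char *str = NULL;

    if (0 == hwloc_cuda_get_device_cpuset(topology, dev, cpuset)) {
        hwloc_bitmap_asprintf(&str, cpuset);
        printf("CUDA device is close to cpuset %s\n", str);
        free(str);
    }
    hwloc_bitmap_free(cpuset);
}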
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 14, 2011, at 11:48 AM, Shamis, Pavel wrote: > Actually I'm not sure that it is good idea to enable CUDA by default, since > it disables zero-copy protocol, that is critical for good performance. That can easily be a run-time check during startup. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
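A startup probe of that sort could be as small as the sketch below (the opal_cuda_runtime_enabled flag is a hypothetical name); the point is that the CUDA code paths are only switched on when a driver and at least one device are actually present:

#include <cuda.h>

static int opal_cuda_runtime_enabled = 0;   /* hypothetical flag consulted by the data path */

static void cuda_startup_check(void)
{
    int ndev = 0;

    if (CUDA_SUCCESS == cuInit(0) &&
        CUDA_SUCCESS == cuDeviceGetCount(&ndev) &&
        ndev > 0) {
        opal_cuda_runtime_enabled = 1;
    }
}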
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 14, 2011, at 12:37 PM, Brice Goglin wrote: > GPUDirect is only about using the same host buffer for DMA from/to both > the NIC and the GPU. Without GPUDirect, you have a host buffer for the > GPU and another one for IB (looks like some strange memory registration > problem to me...), and you have to memcpy between them in the middle . > > We're all confused with the name "GPUDirect" because we remember people > doing DMA directly between the NIC and a GPU or SCSI disk ten years ago. > GPUDirect doesn't go that far unfortunately :/ Correct. GPUDirect is a brilliant marketing name. Its name has nothing to do with what it really is: the ability to register the same buffer to both CUDA and OpenFabrics. As Brice says: GPUDirect does NOT send/receive data directly from the accelerator's memory. -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 14, 2011, at 12:41 PM, Brice Goglin wrote: > hwloc (since 1.1, on Linux) can already tell you which CPUs are close to a > CUDA device, see > https://svn.open-mpi.org/trac/hwloc/browser/trunk/include/hwloc/cuda.h and > https://svn.open-mpi.org/trac/hwloc/browser/trunk/include/hwloc/cudart.h > Do you need anything else ? Nope. I think the inference was that *all* CUDA support should be under carto/hwloc. I don't think that's quite possible, though, for some of the reasons Rolf mentioned (i.e., we need to do more than just know *where* the accelerators are). -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
>> Actually I'm not sure that it is a good idea to enable CUDA by default, since it disables the zero-copy protocol, which is critical for good performance.
>
> That can easily be a run-time check during startup.

It could be fixed. My point was that in the existing code, it is a compile-time decision and not a run-time one.

Pasha
Re: [OMPI devel] RFC: Add support to send/receive CUDA device memory directly
On Apr 14, 2011, at 3:13 PM, Shamis, Pavel wrote: >> That can easily be a run-time check during startup. > > It could be fixed. My point was that in the existing code, it's compile time > decision and not run time. I agree; I mentioned the same issue in my review, too. Some of the code can clearly use both a compile time and a run time check (like the part that we're talking about right now :-) ). -- Jeff Squyres jsquy...@cisco.com For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [OMPI devel] Problem of memory lost in MPI_Type_create_hindexed() with count = 1 (patch proposed)
Interesting, this issue exists in 2 out of 3 functions defined in the ompi_datatype_create_indexed.c file. Based on your patch I create one that fixes all the issues with the indexed type creation. Attached is the patch. I'll push it in the trunk and create CMRs. Thanks, george. Index: ompi/datatype/ompi_datatype_create_indexed.c === --- ompi/datatype/ompi_datatype_create_indexed.c(revision 24616) +++ ompi/datatype/ompi_datatype_create_indexed.c(working copy) @@ -3,7 +3,7 @@ * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana * University Research and Technology * Corporation. All rights reserved. - * Copyright (c) 2004-2009 The University of Tennessee and The University + * Copyright (c) 2004-2010 The University of Tennessee and The University * of Tennessee Research Foundation. All rights * reserved. * Copyright (c) 2004-2006 High Performance Computing Center Stuttgart, @@ -46,26 +46,21 @@ dLength = pBlockLength[0]; endat = disp + dLength; ompi_datatype_type_extent( oldType, &extent ); -if( 1 >= count ) { -pdt = ompi_datatype_create( oldType->super.desc.used + 2 ); -/* multiply by count to make it zero if count is zero */ -ompi_datatype_add( pdt, oldType, count * dLength, disp * extent, extent ); -} else { -pdt = ompi_datatype_create( count * (2 + oldType->super.desc.used) ); -for( i = 1; i < count; i++ ) { -if( endat == pDisp[i] ) { -/* contiguous with the previsious */ -dLength += pBlockLength[i]; -endat += pBlockLength[i]; -} else { -ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); -disp = pDisp[i]; -dLength = pBlockLength[i]; -endat = disp + pBlockLength[i]; -} + +pdt = ompi_datatype_create( count * (2 + oldType->super.desc.used) ); +for( i = 1; i < count; i++ ) { +if( endat == pDisp[i] ) { +/* contiguous with the previsious */ +dLength += pBlockLength[i]; +endat += pBlockLength[i]; +} else { +ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); +disp = pDisp[i]; +dLength = pBlockLength[i]; +endat = disp + pBlockLength[i]; } -ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); } +ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); *newType = pdt; return OMPI_SUCCESS; @@ -91,25 +86,20 @@ dLength = pBlockLength[0]; endat = disp + dLength * extent; -if( 1 >= count ) { -pdt = ompi_datatype_create( oldType->super.desc.used + 2 ); -/* multiply by count to make it zero if count is zero */ -ompi_datatype_add( pdt, oldType, count * dLength, disp, extent ); -} else { -for( i = 1; i < count; i++ ) { -if( endat == pDisp[i] ) { -/* contiguous with the previsious */ -dLength += pBlockLength[i]; -endat += pBlockLength[i] * extent; -} else { -ompi_datatype_add( pdt, oldType, dLength, disp, extent ); -disp = pDisp[i]; -dLength = pBlockLength[i]; -endat = disp + pBlockLength[i] * extent; -} +for( i = 1; i < count; i++ ) { +if( endat == pDisp[i] ) { +/* contiguous with the previsious */ +dLength += pBlockLength[i]; +endat += pBlockLength[i] * extent; +} else { +ompi_datatype_add( pdt, oldType, dLength, disp, extent ); +disp = pDisp[i]; +dLength = pBlockLength[i]; +endat = disp + pBlockLength[i] * extent; } -ompi_datatype_add( pdt, oldType, dLength, disp, extent ); } +ompi_datatype_add( pdt, oldType, dLength, disp, extent ); + *newType = pdt; return OMPI_SUCCESS; } On Apr 14, 2011, at 10:48 , Pascal Deveze wrote: > Calling MPI_Type_create_hindexed(int count, int array_of_blocklengths[], > MPI_Aint array_of_displacements[], MPI_Datatype oldtype, > MPI_Datatype *newtype) > with a count parameter of 1 causes a loss of memory 
detected by valgrind : > > ==2053== 576 (448 direct, 128 indirect) bytes in 1 blocks are definitely lost > in loss record 157 of 182 > ==2053==at 0x4C2415D: malloc (vg_replace_malloc.c:195) > ==2053==by 0x4E7CEC7: opal_obj_new (opal_object.h:469) > ==2053==by 0x4E7D134: ompi_datatype_create (ompi_datatype_create.c:71) > ==2053==by 0x4E7D58E: ompi_datatype_create_hindexed > (ompi_datatype_create_indexed.c:89) > ==2053==by 0x4EA74D0: PMPI_Type_create_hindexed > (ptype_create_hindexed.c:75) > ==2
[OMPI devel] Fwd: [OMPI svn-full] svn:open-mpi r24617
George -- Unfortunately, this didn't automatically create CMRs (I'm not sure why). :-( Begin forwarded message: > From: bosi...@osl.iu.edu > Date: April 14, 2011 5:50:07 PM EDT > To: svn-f...@open-mpi.org > Subject: [OMPI svn-full] svn:open-mpi r24617 > Reply-To: de...@open-mpi.org > > Author: bosilca > Date: 2011-04-14 17:50:06 EDT (Thu, 14 Apr 2011) > New Revision: 24617 > URL: https://svn.open-mpi.org/trac/ompi/changeset/24617 > > Log: > Based on the patch submitted by Pascal Deveze, here is the memory leak fix > for the type indexed creation. > > CMR v1.4 and v1.5. > > Text files modified: > trunk/ompi/datatype/ompi_datatype_create_indexed.c |62 > --- > 1 files changed, 26 insertions(+), 36 deletions(-) > > Modified: trunk/ompi/datatype/ompi_datatype_create_indexed.c > == > --- trunk/ompi/datatype/ompi_datatype_create_indexed.c(original) > +++ trunk/ompi/datatype/ompi_datatype_create_indexed.c2011-04-14 > 17:50:06 EDT (Thu, 14 Apr 2011) > @@ -3,7 +3,7 @@ > * Copyright (c) 2004-2006 The Trustees of Indiana University and Indiana > * University Research and Technology > * Corporation. All rights reserved. > - * Copyright (c) 2004-2009 The University of Tennessee and The University > + * Copyright (c) 2004-2010 The University of Tennessee and The University > * of Tennessee Research Foundation. All rights > * reserved. > * Copyright (c) 2004-2006 High Performance Computing Center Stuttgart, > @@ -46,26 +46,21 @@ > dLength = pBlockLength[0]; > endat = disp + dLength; > ompi_datatype_type_extent( oldType, &extent ); > -if( 1 >= count ) { > -pdt = ompi_datatype_create( oldType->super.desc.used + 2 ); > -/* multiply by count to make it zero if count is zero */ > -ompi_datatype_add( pdt, oldType, count * dLength, disp * extent, > extent ); > -} else { > -pdt = ompi_datatype_create( count * (2 + oldType->super.desc.used) ); > -for( i = 1; i < count; i++ ) { > -if( endat == pDisp[i] ) { > -/* contiguous with the previsious */ > -dLength += pBlockLength[i]; > -endat += pBlockLength[i]; > -} else { > -ompi_datatype_add( pdt, oldType, dLength, disp * extent, > extent ); > -disp = pDisp[i]; > -dLength = pBlockLength[i]; > -endat = disp + pBlockLength[i]; > -} > + > +pdt = ompi_datatype_create( count * (2 + oldType->super.desc.used) ); > +for( i = 1; i < count; i++ ) { > +if( endat == pDisp[i] ) { > +/* contiguous with the previsious */ > +dLength += pBlockLength[i]; > +endat += pBlockLength[i]; > +} else { > +ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent > ); > +disp = pDisp[i]; > +dLength = pBlockLength[i]; > +endat = disp + pBlockLength[i]; > } > -ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); > } > +ompi_datatype_add( pdt, oldType, dLength, disp * extent, extent ); > > *newType = pdt; > return OMPI_SUCCESS; > @@ -91,25 +86,20 @@ > dLength = pBlockLength[0]; > endat = disp + dLength * extent; > > -if( 1 >= count ) { > -pdt = ompi_datatype_create( oldType->super.desc.used + 2 ); > -/* multiply by count to make it zero if count is zero */ > -ompi_datatype_add( pdt, oldType, count * dLength, disp, extent ); > -} else { > -for( i = 1; i < count; i++ ) { > -if( endat == pDisp[i] ) { > -/* contiguous with the previsious */ > -dLength += pBlockLength[i]; > -endat += pBlockLength[i] * extent; > -} else { > -ompi_datatype_add( pdt, oldType, dLength, disp, extent ); > -disp = pDisp[i]; > -dLength = pBlockLength[i]; > -endat = disp + pBlockLength[i] * extent; > -} > +for( i = 1; i < count; i++ ) { > +if( endat == pDisp[i] ) { > +/* contiguous with the previsious */ > 
+dLength += pBlockLength[i]; > +endat += pBlockLength[i] * extent; > +} else { > +ompi_datatype_add( pdt, oldType, dLength, disp, extent ); > +disp = pDisp[i]; > +dLength = pBlockLength[i]; > +endat = disp + pBlockLength[i] * extent; > } > -ompi_datatype_add( pdt, oldType, dLength, disp, extent ); > } > +ompi_datatype_add( pdt, oldType, dLength, disp, extent ); > + > *newType = pdt; > return OMPI_SUCCESS; > } > ___ > svn-full mailing list > svn-f...@open-mpi.or