Hello,

It seems that this was really a bug. It was recently fixed in the repository (https://svn.open-mpi.org/trac/ompi/changeset/23030) and will likely be included in the next 1.4 release. Here is the corresponding thread on ompi-devel: http://www.open-mpi.org/community/lists/devel/2010/04/7787.php
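Until the fix reaches a release, the workaround discussed in the quoted thread below is applied on the mpirun command line with an MCA parameter. A minimal sketch, reusing the 200000000 value and the appfile from the quoted messages (see Aurélien's warning about memory consumption further down before adopting it):

$ ompi_info --all | grep btl_tcp_eager_limit        # check the current setting
$ mpirun -hetero --mca btl_tcp_eager_limit 200000000 --app appfile

A small C sketch of the pre-posted-receive pattern Aurélien refers to is appended after the quoted thread.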
On Fri, 2010-03-05 at 10:51 +0100, TRINH Minh Hieu wrote:
> Hi,
>
> Thank you for this information.
> For the moment, I haven't encountered those problems yet. Maybe
> because my program doesn't use much memory (100 MB) and the master
> machine has a lot of RAM (8 GB).
> So for now the solution seems to be the parameter
> "btl_tcp_eager_limit", but a cleaner solution is very welcome :-)
>
> TMHieu
>
> 2010/3/5 Aurélien Bouteiller <boute...@eecs.utk.edu>:
> > Hi,
> >
> > Setting the eager limit to such a drastically high value will
> > generate gigantic memory consumption for unexpected messages. Any
> > message you send which does not have a pre-posted receive will
> > malloc 150 MB of temporary storage, and the data will be memcpy'd
> > from that internal buffer to the receive buffer when the receive is
> > posted. You should expect very poor bandwidth and probably a
> > crash/abort due to memory exhaustion on the receivers.
> >
> > Aurelien
> > --
> > Dr. Aurelien Bouteiller
> > Innovative Computing Laboratory
> > University of Tennessee
> > Knoxville, TN, USA
> >
> > On 4 March 2010, at 09:02, TRINH Minh Hieu wrote:
> >
> >> Hi,
> >>
> >> I have a new discovery about this problem:
> >>
> >> It seems that the array size sendable from a 32-bit to a 64-bit
> >> machine is proportional to the parameter "btl_tcp_eager_limit".
> >> When I set it to 200 000 000 (2e08 bytes, about 190 MB), I can send
> >> an array of up to 2e07 doubles (152 MB).
> >>
> >> I didn't find much information about btl_tcp_eager_limit other than
> >> in the "ompi_info --all" output. If I leave it at 2e08, will it
> >> impact the performance of Open MPI?
> >>
> >> It may also be noteworthy that if the master (rank 0) is a 32-bit
> >> machine, I don't get a segfault. I can send a big array with a small
> >> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
> >>
> >> Do I have to move this thread to the devel mailing list?
> >>
> >> Regards,
> >>
> >> TMHieu
> >>
> >> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
> >>> Hello,
> >>>
> >>> Yes, I compiled Open MPI with --enable-heterogeneous. More precisely,
> >>> I compiled with:
> >>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
> >>>   --enable-cxx-exceptions --enable-shared
> >>>   --enable-orterun-prefix-by-default
> >>> $ make all install
> >>>
> >>> I attach the ompi_info output from my 2 machines.
> >>>
> >>> TMHieu
> >>>
> >>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> >>>> Did you configure Open MPI with --enable-heterogeneous?
> >>>>
> >>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> I have some problems running MPI on my heterogeneous cluster. More
> >>>>> precisely, I get a segmentation fault when sending a large array
> >>>>> (about 10000 elements) of doubles from an i686 machine to an x86_64
> >>>>> machine. It does not happen with a small array.
> >>>>> Here is the send/recv source code (the complete source is in the
> >>>>> attached file):
> >>>>> ======== code ================
> >>>>> if (me == 0) {
> >>>>>     for (int pe = 1; pe < nprocs; pe++) {
> >>>>>         printf("Receiving from proc %d : ", pe); fflush(stdout);
> >>>>>         d = (double *)malloc(sizeof(double) * n);
> >>>>>         MPI_Recv(d, n, MPI_DOUBLE, pe, 999, MPI_COMM_WORLD, &status);
> >>>>>         printf("OK\n"); fflush(stdout);
> >>>>>     }
> >>>>>     printf("All done.\n");
> >>>>> } else {
> >>>>>     d = (double *)malloc(sizeof(double) * n);
> >>>>>     MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
> >>>>> }
> >>>>> ======== code ================
> >>>>>
> >>>>> I get a segmentation fault with n=10000 but no error with n=1000.
> >>>>> I have 2 machines:
> >>>>>     sbtn155 : Intel Xeon, x86_64
> >>>>>     sbtn211 : Intel Pentium 4, i686
> >>>>>
> >>>>> The code is compiled on the x86_64 and the i686 machine, using
> >>>>> Open MPI 1.4.1 installed in /tmp/openmpi:
> >>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.i686.o -o hetero.i686 -lm
> >>>>>
> >>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
> >>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
> >>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include hetero.x86_64.o -o hetero.x86_64 -lm
> >>>>>
> >>>>> I run the code using an appfile and get this error:
> >>>>> $ cat appfile
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn155 -np 1 hetero.x86_64
> >>>>> --host sbtn211 -np 1 hetero.i686
> >>>>>
> >>>>> $ mpirun -hetero --app appfile
> >>>>> Input array length :
> >>>>> 10000
> >>>>> Receiving from proc 1 : OK
> >>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
> >>>>> [sbtn155:26386] Signal: Segmentation fault (11)
> >>>>> [sbtn155:26386] Signal code: Address not mapped (1)
> >>>>> [sbtn155:26386] Failing at address: 0x200627bd8
> >>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
> >>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
> >>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
> >>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
> >>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
> >>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
> >>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
> >>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
> >>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
> >>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
> >>>>> [sbtn155:26386] *** End of error message ***
> >>>>> --------------------------------------------------------------------------
> >>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
> >>>>> exited on signal 11 (Segmentation fault).
> >>>>> --------------------------------------------------------------------------
> >>>>>
> >>>>> Am I missing an option needed to run on a heterogeneous cluster?
> >>>>> Do MPI_Send/Recv have a limit on array size when used on a
> >>>>> heterogeneous cluster?
> >>>>> Thanks for your help. Regards
> >>>>>
> >>>>> --
> >>>>> ============================================
> >>>>> M. TRINH Minh Hieu
> >>>>> CEA, IBEB, SBTN/LIRM,
> >>>>> F-30207 Bagnols-sur-Cèze, FRANCE
> >>>>> ============================================
> >>>>>
> >>>>> <hetero.c.bz2>
> >>>>
> >>>> --
> >>>> Jeff Squyres
> >>>> jsquy...@cisco.com
> >>>> For corporate legal information go to:
> >>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> ============================================
> M. TRINH Minh Hieu
> CEA, IBEB, SBTN/LIRM,
> F-30207 Bagnols-sur-Cèze, FRANCE
> ============================================

--
Kind regards,
Timur Magomedov
Senior C++ Developer
DevelopOnBox LLC / Zodiac Interactive
http://www.zodiac.tv/
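As a footnote to the eager-limit discussion above: the temporary allocation Aurélien describes only happens for unexpected messages, i.e. messages that arrive before the matching receive has been posted. Below is a minimal, illustrative sketch (this is not the attached hetero.c; the structure and variable names are assumptions) in which rank 0 pre-posts all receives with MPI_Irecv before waiting, so eagerly-sent data is delivered straight into the user buffers rather than into eager-limit-sized unexpected-message storage. It does not address the 32/64-bit conversion bug itself; that requires the fix referenced at the top of this message.

======== code ================
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int me, nprocs;
    int n = 10000;                      /* array length from the report */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (me == 0) {
        /* Pre-post every receive before the matching sends can arrive,
         * then wait for all of them to complete. */
        double      **bufs = malloc(nprocs * sizeof *bufs);
        MPI_Request  *reqs = malloc(nprocs * sizeof *reqs);
        for (int pe = 1; pe < nprocs; pe++) {
            bufs[pe] = malloc(n * sizeof(double));
            MPI_Irecv(bufs[pe], n, MPI_DOUBLE, pe, 999,
                      MPI_COMM_WORLD, &reqs[pe]);
        }
        MPI_Waitall(nprocs - 1, &reqs[1], MPI_STATUSES_IGNORE);
        printf("All done.\n");
        for (int pe = 1; pe < nprocs; pe++)
            free(bufs[pe]);
        free(bufs);
        free(reqs);
    } else {
        double *d = calloc(n, sizeof(double));   /* zero-filled payload */
        MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
        free(d);
    }

    MPI_Finalize();
    return 0;
}
======== code ================

With the receives already posted, the malloc/memcpy path for unexpected messages is skipped; this only changes memory behaviour, not the heterogeneous datatype handling that caused the segfault.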