Hi,

Thank you for that information.
For the moment, I haven't encountered those problems. Maybe that's because
my program doesn't use much memory (100 MB) and the master machine has
plenty of RAM (8 GB).
So for now the workaround seems to be the "btl_tcp_eager_limit" parameter,
but a cleaner solution is very welcome :-)
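
For the record, this is roughly how I raise it at the moment, just on the
mpirun command line (the 2e08 value is simply the one from my earlier test,
not a tuned recommendation):

$ mpirun --mca btl_tcp_eager_limit 200000000 -hetero --app appfile

(the same thing can be done by exporting OMPI_MCA_btl_tcp_eager_limit
before calling mpirun, or in an mca-params.conf file).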

   TMHieu

2010/3/5 Aurélien Bouteiller <boute...@eecs.utk.edu>:
> Hi,
>
> setting the eager limit to such a drastically high value will have the
> effect of generating gigantic memory consumption for unexpected messages.
> Any message you send which does not have a preposted ready recv will
> mallocate 150mb of temporary storage, and will be memcopied from that
> internal buffer to the recv buffer when the recv is posted. You should
> expect very poor bandwidth and probably crash/abort due to memory exhaustion
> on the receivers.
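
If I understand the above correctly, the cleaner fix on my side would be to
make sure the receives are preposted before the senders start, rather than
inflating the eager limit. Here is an untested sketch of how my test program
could do that (same tag and message size as my hetero.c; the MPI_Barrier is
only there to guarantee the MPI_Irecv calls are posted before any MPI_Send
starts):

======== sketch ================
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int me, nprocs;
    int n = 10000;                        /* array length, as in my test */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (me == 0) {
        /* Prepost one receive per sender, then release the senders.    */
        MPI_Request *reqs = malloc(sizeof(MPI_Request) * (nprocs - 1));
        double **bufs = malloc(sizeof(double *) * nprocs);
        for (int pe = 1; pe < nprocs; pe++) {
            bufs[pe] = malloc(sizeof(double) * n);
            MPI_Irecv(bufs[pe], n, MPI_DOUBLE, pe, 999,
                      MPI_COMM_WORLD, &reqs[pe - 1]);
        }
        MPI_Barrier(MPI_COMM_WORLD);  /* recvs are posted; senders may go */
        MPI_Waitall(nprocs - 1, reqs, MPI_STATUSES_IGNORE);
        printf("All done.\n");
    } else {
        double *d = malloc(sizeof(double) * n);
        MPI_Barrier(MPI_COMM_WORLD);  /* wait until rank 0 posted its recvs */
        MPI_Send(d, n, MPI_DOUBLE, 0, 999, MPI_COMM_WORLD);
        free(d);
    }

    MPI_Finalize();
    return 0;
}
======== sketch ================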
>
> Aurelien
> --
> Dr. Aurelien Bouteiller
> Innovative Computing Laboratory
> University of Tennessee
> Knoxville, TN, USA
>
>
> Le 4 mars 2010 à 09:02, TRINH Minh Hieu a écrit :
>
>> Hi,
>>
>> I have a new discovery about this problem:
>>
>> It seems that the array size that can be sent from a 32-bit to a 64-bit
>> machine is proportional to the parameter "btl_tcp_eager_limit".
>> When I set it to 200 000 000 (2e08 bytes, about 190MB), I can send an
>> array of up to 2e07 doubles (152MB).
>>
>> I didn't find much information about btl_tcp_eager_limit other than in
>> the "ompi_info --all" output. If I leave it at 2e08, will it impact
>> the performance of Open MPI?
>>
>> It may also be noteworthy that if the master (rank 0) is a 32-bit
>> machine, I don't get a segfault. I can send a big array with a small
>> "btl_tcp_eager_limit" from a 64-bit machine to a 32-bit one.
>>
>> Should I move this thread to the devel mailing list?
>>
>> Regards,
>>
>>   TMHieu
>>
>> On Tue, Mar 2, 2010 at 2:54 PM, TRINH Minh Hieu <mhtr...@gmail.com> wrote:
>>> Hello,
>>>
>>> Yes, I compiled OpenMPI with --enable-heterogeneous. More precisely I
>>> compiled with :
>>> $ ./configure --prefix=/tmp/openmpi --enable-heterogeneous
>>> --enable-cxx-exceptions --enable-shared
>>> --enable-orterun-prefix-by-default
>>> $ make all install
>>>
>>> I attach the output of ompi_info of my 2 machines.
>>>
>>>    TMHieu
>>>
>>> On Tue, Mar 2, 2010 at 1:57 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>>> Did you configure Open MPI with --enable-heterogeneous?
>>>>
>>>> On Feb 28, 2010, at 1:22 PM, TRINH Minh Hieu wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> I have some problems running MPI on my heterogeneous cluster. More
>>>>> precisely, I get a segmentation fault when sending a large array (about
>>>>> 10000 elements) of doubles from an i686 machine to an x86_64 machine. It
>>>>> does not happen with a small array. Here is the send/recv code (the
>>>>> complete source is in the attached file):
>>>>> ========code ================
>>>>>     if (me == 0 ) {
>>>>>         for (int pe=1; pe<nprocs; pe++)
>>>>>         {
>>>>>                 printf("Receiving from proc %d : ",pe); fflush(stdout);
>>>>>             d=(double *)malloc(sizeof(double)*n);
>>>>>             MPI_Recv(d,n,MPI_DOUBLE,pe,999,MPI_COMM_WORLD,&status);
>>>>>             printf("OK\n"); fflush(stdout);
>>>>>         }
>>>>>         printf("All done.\n");
>>>>>     }
>>>>>     else {
>>>>>       d=(double *)malloc(sizeof(double)*n);
>>>>>       MPI_Send(d,n,MPI_DOUBLE,0,999,MPI_COMM_WORLD);
>>>>>     }
>>>>> ======== code ================
>>>>>
>>>>> I get a segmentation fault with n=10000 but no error with n=1000.
>>>>> I have 2 machines :
>>>>> sbtn155 : Intel Xeon,         x86_64
>>>>> sbtn211 : Intel Pentium 4, i686
>>>>>
>>>>> The code is compiled on both the x86_64 and i686 machines, using
>>>>> Open MPI 1.4.1 installed in /tmp/openmpi:
>>>>> [mhtrinh@sbtn211 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.i686.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>>> hetero.i686.o -o hetero.i686 -lm
>>>>>
>>>>> [mhtrinh@sbtn155 heterogenous]$ make hetero
>>>>> gcc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include -c hetero.c -o hetero.x86_64.o
>>>>> /tmp/openmpi/bin/mpicc -Wall -I. -std=c99 -O3 -I/tmp/openmpi/include
>>>>> hetero.x86_64.o -o hetero.x86_64 -lm
>>>>>
>>>>> I run the code using an appfile and get these errors:
>>>>> $ cat appfile
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn155 -np 1 hetero.x86_64
>>>>> --host sbtn211 -np 1 hetero.i686
>>>>>
>>>>> $ mpirun -hetero --app appfile
>>>>> Input array length :
>>>>> 10000
>>>>> Receiving from proc 1 : OK
>>>>> Receiving from proc 2 : [sbtn155:26386] *** Process received signal ***
>>>>> [sbtn155:26386] Signal: Segmentation fault (11)
>>>>> [sbtn155:26386] Signal code: Address not mapped (1)
>>>>> [sbtn155:26386] Failing at address: 0x200627bd8
>>>>> [sbtn155:26386] [ 0] /lib64/libpthread.so.0 [0x3fa4e0e540]
>>>>> [sbtn155:26386] [ 1] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d7908]
>>>>> [sbtn155:26386] [ 2] /tmp/openmpi/lib/openmpi/mca_btl_tcp.so [0x2aaaae2fc6e3]
>>>>> [sbtn155:26386] [ 3] /tmp/openmpi/lib/libopen-pal.so.0 [0x2aaaaafe39db]
>>>>> [sbtn155:26386] [ 4] /tmp/openmpi/lib/libopen-pal.so.0(opal_progress+0x9e) [0x2aaaaafd8b9e]
>>>>> [sbtn155:26386] [ 5] /tmp/openmpi/lib/openmpi/mca_pml_ob1.so [0x2aaaad8d4b25]
>>>>> [sbtn155:26386] [ 6] /tmp/openmpi/lib/libmpi.so.0(MPI_Recv+0x13b) [0x2aaaaab30f9b]
>>>>> [sbtn155:26386] [ 7] hetero.x86_64(main+0xde) [0x400cbe]
>>>>> [sbtn155:26386] [ 8] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3fa421e074]
>>>>> [sbtn155:26386] [ 9] hetero.x86_64 [0x400b29]
>>>>> [sbtn155:26386] *** End of error message ***
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 26386 on node sbtn155
>>>>> exited on signal 11 (Segmentation fault).
>>>>>
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> Am I missing an option needed to run on a heterogeneous cluster?
>>>>> Do MPI_Send/Recv have a limit on array size when used on a
>>>>> heterogeneous cluster? Thanks for your help. Regards,
>>>>>
>>>>> --
>>>>> ============================================
>>>>>    M. TRINH Minh Hieu
>>>>>    CEA, IBEB, SBTN/LIRM,
>>>>>    F-30207 Bagnols-sur-Cèze, FRANCE
>>>>> ============================================
>>>>>
>>>>> <hetero.c.bz2>
>>>>
>>>>
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>>
>>>>
>>>
>>
>
>
>



-- 
============================================
  M. TRINH Minh Hieu
  CEA, IBEB, SBTN/LIRM,
  F-30207 Bagnols-sur-Cèze, FRANCE
============================================
