Thanks, it's at least good to know that the behaviour isn't normal!

Could it be some sort of memory leak in the call? The code in

    ompi/runtime/ompi_mpi_preconnect.c

looks reasonably safe, though maybe doing thousands of isend/irecv
pairs is causing problems with the buffers used for point-to-point
messages?
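
To make sure I'm reading it right: my (possibly naive) understanding
is that the preconnect step does something roughly like the sketch
below, i.e. one small isend/irecv pair per peer followed by a waitall.
This is not the actual ompi_mpi_preconnect.c code, just an
illustration of why the number of outstanding requests (and whatever
per-connection state sits behind them) grows with the job size:

    /* Illustrative sketch only -- not the real ompi_mpi_preconnect.c.
     * Post one small isend/irecv pair to every other rank, then wait,
     * so that every pairwise connection is established up front. */
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, peer, n = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char sbuf = 0;
        char *rbuf = calloc(size, 1);     /* one recv byte per peer */
        MPI_Request *reqs = malloc(2 * size * sizeof(MPI_Request));

        for (peer = 0; peer < size; peer++) {
            if (peer == rank)
                continue;
            MPI_Irecv(&rbuf[peer], 1, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &reqs[n++]);
            MPI_Isend(&sbuf, 1, MPI_CHAR, peer, 0,
                      MPI_COMM_WORLD, &reqs[n++]);
        }
        MPI_Waitall(n, reqs, MPI_STATUSES_IGNORE);

        free(reqs);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }

At ~6000 ranks that is ~12000 requests per process, which by itself
shouldn't amount to much memory, so presumably it's the per-connection
state underneath that adds up.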

I'm running the job under valgrind to see if it turns anything up, but
nothing from ompi_init_preconnect_mpi is being reported (although there
are some other warnings).
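
In case it helps anyone reproduce this, the test program mentioned
below really is just a bare init/finalize, along these lines:

    #include <mpi.h>

    /* Minimal reproducer: MPI_Init and MPI_Finalize only. */
    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        MPI_Finalize();
        return 0;
    }

compiled to `mpi_init` and launched with
`mpirun -mca mpi_preconnect_mpi 1 mpi_init`.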


On Sun, Oct 19, 2014 at 2:37 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> On Oct 17, 2014, at 3:37 AM, Marshall Ward <marshall.w...@gmail.com> wrote:
>>
>> I currently have a numerical model that, for reasons unknown, requires
>> preconnection to avoid hanging on an initial MPI_Allreduce call.
>
> That is indeed odd - it might take a while for all the connections to form, 
> but it shouldn’t hang
>
>> But
>> when we try to scale out beyond around 1000 cores, we are unable to
>> get past MPI_Init's preconnection phase.
>>
>> To test this, I have a basic C program containing only MPI_Init() and
>> MPI_Finalize() named `mpi_init`, which I compile and run using `mpirun
>> -mca mpi_preconnect_mpi 1 mpi_init`.
>
> I doubt preconnect has been tested in a rather long time as I’m unaware of 
> anyone still using it (we originally provided it for some legacy code that 
> otherwise took a long time to initialize). However, I could give it a try and 
> see what happens. FWIW: because it was so targeted and hasn’t been used in a 
> long time, the preconnect algo is really not very efficient. Still, shouldn’t 
> have anything to do with memory footprint.
>
>>
>> This preconnection seems to consume a large amount of memory, and is
>> exceeding the available memory on our nodes (~2 GiB/core) as the
>> number of cores gets into the thousands (~4000 or so). If we try to
>> preconnect around 6000 cores, we start to see hangs and crashes.
>>
>> A failed 5600-core preconnection gave this warning (~10k times)
>> while hanging for 30 minutes:
>>
>>    [warn] opal_libevent2021_event_base_loop: reentrant invocation.
>> Only one event_base_loop can run on each event_base at once.
>>
>> A failed 6000-core preconnection job crashed almost immediately with
>> the following errors:
>>
>>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file ras_tm_module.c at line 159
>>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file ras_tm_module.c at line 85
>>    [r104:18459] [[32743,0],0] ORTE_ERROR_LOG: File open failure in
>> file base/ras_base_allocate.c at line 187
>
> This doesn’t have anything to do with preconnect - it indicates that mpirun 
> was unable to open the Torque allocation file. However, it shouldn’t have 
> “crashed”, but instead simply exited with an error message.
>
>>
>> Should we expect to use very large amounts of memory for
>> preconnections of thousands of CPUs? And can these
>>
>> I am using Open MPI 1.8.2 on Linux 2.6.32 (CentOS) with an FDR
>> InfiniBand network. This is probably not enough information, but I'll
>> try to provide more if necessary. My knowledge of the implementation
>> is unfortunately very limited.
