Looking at your output some more, the "connect to address" lines below don't match any messages I see in the Open MPI source code. The "trying normal rsh (/usr/bin/rsh)" line also looks odd to me.

You may want to set the MCA parameter mpi_abort_delay, attach a debugger to the aborting process, and dump out a stack trace. That should give a better idea of where the failure is being triggered. See question 4 at http://www.open-mpi.org/faq/?category=debugging for more information on the parameter.
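For example, a rough sketch only -- the hostfile and LS-DYNA executable names below are placeholders, and your actual mpirun line will differ:

    # Make the aborting rank pause before exiting so a debugger can be attached.
    # A negative delay waits indefinitely; a positive value is a delay in seconds.
    mpirun --mca mpi_abort_delay -1 -np 56 -hostfile my_hosts ./mppdyna i=input.k

    # Then, on the node running the aborting rank, attach gdb to its PID and
    # dump a stack trace:
    gdb -p <pid>
    (gdb) bt

If your build supports it, the related parameter mpi_abort_print_stack will make the aborting process print its own stack trace, which may be enough by itself.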

--td

On 05/02/2011 03:40 PM, Robert Walters wrote:

I've attached the typical error message I've been getting. This is from a run I initiated this morning. The first few lines are from the LS-DYNA program itself and are just there to show that it ran successfully for an hour and a half.

What's interesting is that this doesn't happen on every job I run, but it does recur for the same simulation. For instance, Simulation A will run for 40 hours and complete successfully, while Simulation B will run for 6 hours and die with an error, and any further attempts to run Simulation B will always end in an error. This makes me think there is some kind of bad calculation happening that OpenMPI doesn't know how to handle, or that LS-DYNA doesn't know how to pass to OpenMPI. On the other hand, this particular simulation is one of those "benchmarks" that everyone runs, so I should not be getting errors from the FE code itself. Odd... I think I'll try this as an SMP job as well as an MPP job over a single node and see if the issue continues. That way I can figure out whether it's OpenMPI-related or FE-code-related, but as I mentioned, I don't think it is FE-code-related since others have successfully run this particular benchmarking simulation.

*_Error Message:_*

 Parallel execution with     56 MPP proc
 NLQ used/max               152/   152
 Start time   05/02/2011 10:02:20
 End time     05/02/2011 11:24:46
Elapsed time 4946 seconds( 1 hours 22 min. 26 sec.) for 9293 cycles
 E r r o r   t e r m i n a t i o n
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -1525207032.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
connect to address xx.xxx.xx.xxx port 544: Connection refused
connect to address xx.xxx.xx.xxx port 544: Connection refused
trying normal rsh (/usr/bin/rsh)
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 24488 on
node allision exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Regards,

Robert Walters

------------------------------------------------------------------------

*From:* users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] *On Behalf Of* Terry Dontje
*Sent:* Monday, May 02, 2011 2:50 PM
*To:* us...@open-mpi.org
*Subject:* Re: [OMPI users] OpenMPI LS-DYNA Connection refused

On 05/02/2011 02:04 PM, Robert Walters wrote:

Terry,

I was under the impression that all connections are made, given the nature of the program that OpenMPI is invoking. LS-DYNA is a finite element solver, and for any given simulation I run, the cores on each node must constantly communicate with one another to check for various occurrences (contact between various pieces/parts, updating nodal coordinates, etc.).

You might be right: the connections might have been established, but the error message you cite (connection refused) seems out of place if the connections were already established.

Were there any more error messages from OMPI besides "connection refused"? If so, could you provide that output to us? It might give us a hint as to where in the library things are going wrong.

I've run the program using --mca mpi_preconnect_mpi 1, and the simulation has started up successfully, which I think means the preconnect passed, since all of the child processes have started on each individual node. Thanks for the suggestion; it's a good place to start.
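For reference, the launch looked roughly like the following (the hostfile and input deck names here are placeholders rather than my exact command line):

    mpirun --mca mpi_preconnect_mpi 1 -np 56 -hostfile my_hosts ./mppdyna i=input.k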

Yeah, it possibly could be telling if things do work with this setting.

I've been worried (though I have no basis for it) that messages may be getting queued up and hitting some kind of ceiling or timeout. Since this is a finite element code, I think the communication occurs on a large scale: lots of very small packets going back and forth quickly. A few studies have been done by the High Performance Computing Advisory Council (http://www.hpcadvisorycouncil.com/pdf/LS-DYNA%20_analysis.pdf), and they suggest that LS-DYNA communicates at very, very high rates (I'm not certain, but p. 15 of that document suggests hundreds of millions of messages in only a few hours). Is there any kind of buffer or queue that OpenMPI builds up if messages are created too quickly? Does it dispatch them immediately, or does it attempt to apply some kind of traffic flow control?

The queuing really depends on what type of calls the application is making. If it is doing blocking sends, then I wouldn't expect much queuing to happen with the tcp btl. As far as traffic flow control is concerned, I believe the tcp btl does little or none itself and lets TCP handle that. Maybe someone else on the list can chime in if I am wrong here.
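If you want to see what knobs the tcp btl does expose (send/receive socket buffer sizes and the like), something along these lines should list them -- the exact parameter names vary between Open MPI versions:

    # List the MCA parameters of the tcp btl (e.g. btl_tcp_sndbuf, btl_tcp_rcvbuf)
    ompi_info --param btl tcp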

In the past I have seen cases where lots of traffic on the network, or to a particular node, caused some connections not to be established. But I don't know of any outstanding problems of that kind right now.


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle *- Performance Technologies*
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


