Hi
We are getting the following kind of error messages when trying to run
MPI_Alltoall on 170 nodes with slots=8 on each node (i.e. 170*8 = 1360
MPI processes in total):
$ mpiexec -n 1360 -hostfile ./mach.8 ./a.out
...
You may try to use the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/)
Pasha.
Prentice Bisbal wrote:
Several jobs on my cluster just died with the error below.
Are there any IB/Open MPI diagnostics I should use to diagnose, should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout parameters?
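For that last option, the retry count and timeout are MCA parameters of the openib BTL, so the user can raise them without rebooting anything. A sketch of what could go in the user's MCA params file (the parameter names are the standard openib BTL ones; the values here are illustrative, not recommendations):

```
# $HOME/.openmpi/mca-params.conf
# InfiniBand transport timeout is 4.096 us * 2^N; larger N waits longer
# before declaring a send dead (illustrative value):
btl_openib_ib_timeout = 20
# Transport retries before the HCA gives up (7 is the maximum):
btl_openib_ib_retry_count = 7
```

The same parameters can be passed on the command line with `mpirun --mca btl_openib_ib_timeout 20 ...` for a one-off test before making them permanent.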
Is doing blocking communication in a separate thread better than
asynchronous progress?
(At least as a workaround until the proper implementation gets
improved)
At the moment, yes. OMPI's asynchronous progress is "loosely
tested" (at best).
OMPI's threading support is somewhat limited as well.
Thank you very much for your help.
Julia
--- On Wed, 8/19/09, Eugene Loh wrote:
From: Eugene Loh
Subject: Re: [OMPI users] MPI loop problem
To: "Open MPI Users"
Date: Wednesday, August 19, 2009,
Hello,
(I don't know whether this should have been sent to the dev list, but
the last time this error occurred, it was posted to the users list, so
I'm doing the same.)
Over the last few days I have had problems compiling Open MPI on a Debian
and a SuSE Linux system.
The bug had already been reported in 2007.