Good suggestion: increasing the timeout to around 12 allowed the job to finish. Initial experimentation showed that even larger timeouts gave a 3-4x improvement in performance, matching the times for 64 processes and 1/4 of the data set. The cluster is presently having scheduler issues; I'll post again if I find anything else interesting.
Thanks-
-Neil

> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)" <pa...@dev.mellanox.co.il>
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers <de...@open-mpi.org>
> Message-ID: <469c6c64.4040...@dev.mellanox.co.il>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Hi,
> Try to increase the IB time out parameter: --mca btl_mvapi_ib_timeout 14
> If the 14 will not work , try to increase little bit more (16)
>
> Thanks,
> Pasha
>
> Neil Ludban wrote:
> > Hi,
> >
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation. It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node. If anyone wants
> > any additional information or has suggestions to try, please let me
> > know. Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> >
> > -Neil
> >
> >
> >
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> > from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> > mca_btl_mvapi_component_progress] from c069 error polling HP
> > CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> > 0x2ab6590200 to: c078 error polling HP CQ with status
> > VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --------------------------------------------------------------------------
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38):
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> >
> > Note that two mca parameters are involved here:
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry (defaulted to 7, the maximum value).
> >
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10).
> > The actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout).
> > See InfiniBand Spec 1.2 (12.7.34) for more details.
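
For reference, the timeout formula in the quoted help text can be worked through numerically. This is only a sketch: the helper name local_ack_timeout_us is made up for illustration and is not part of Open MPI. It simply evaluates 4.096 microseconds * 2^btl_mvapi_ib_timeout for the default (10) and the values mentioned in this thread (12, 14, 16).

```python
# Illustrative only: evaluates the formula from the quoted Open MPI help text,
#   timeout = 4.096 microseconds * 2^btl_mvapi_ib_timeout
# The function name is hypothetical, not an Open MPI API.

def local_ack_timeout_us(ib_timeout: int) -> float:
    """Return the local ACK timeout in microseconds for a parameter value."""
    return 4.096 * (2 ** ib_timeout)


if __name__ == "__main__":
    # 10 is the default from the help text; 12, 14, 16 come from this thread.
    for value in (10, 12, 14, 16):
        ms = local_ack_timeout_us(value) / 1000.0
        print(f"btl_mvapi_ib_timeout={value:2d} -> {ms:8.3f} ms")
```

Each increase of 2 in the parameter multiplies the timeout by 4, so the default of 10 works out to roughly 4.2 ms, while 14 works out to roughly 67 ms.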