Good suggestion: increasing the timeout to around 12
allowed the job to finish.  Initial experimentation showed that
I could get a 3-4x improvement in performance using
even larger timeouts, matching the times for 64 processes and
1/4 the data set.  The cluster is presently having scheduler
issues; I'll post again if I find anything else interesting.

Thanks-
-Neil

> Date: Tue, 17 Jul 2007 10:14:44 +0300
> From: "Pavel Shamis (Pasha)" <pa...@dev.mellanox.co.il>
> Subject: Re: [OMPI devel] InfiniBand timeout errors
> To: Open MPI Developers <de...@open-mpi.org>
> Message-ID: <469c6c64.4040...@dev.mellanox.co.il>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
> 
> Hi,
> Try increasing the IB timeout parameter: --mca btl_mvapi_ib_timeout 14
> If 14 does not work, try increasing it a little more (16).
> 
> Thanks,
> Pasha
> 
> Neil Ludban wrote:
> > Hi,
> >
> > I'm getting the errors below when calling MPI_Alltoallv() as part of
> > a matrix transpose operation.  It's 100% repeatable when testing with
> > 16M matrix elements divided between 64 processes on 32 dual core nodes.
> > There are never any errors with fewer processes or elements, including
> > the same 32 nodes with only one process per node.  If anyone wants
> > any additional information or has suggestions to try, please let me
> > know.  Otherwise, I'll have the system rebooted and hope the problem
> > goes away.
> >
> > -Neil
> >
> >
> >
> > [0,1,7][btl_mvapi_component.c:854:mca_btl_mvapi_component_progress]
> >     from c065 to: c077 [0,1,3][btl_mvapi_component.c:854:
> >     mca_btl_mvapi_component_progress] from c069 error polling HP
> >     CQ with status VAPI_RETRY_EXC_ERR status number 12 for Frag :
> >     0x2ab6590200 to: c078 error polling HP CQ with status
> >     VAPI_RETRY_EXC_ERR status number 12 for Frag : 0x2ab61f6380
> > --------------------------------------------------------------------------
> > The retry count is a down counter initialized on creation of the QP. Retry
> > count is defined in the InfiniBand Spec 1.2 (12.7.38): 
> > The total number of times that the sender wishes the receiver to retry
> > timeout, packet sequence, etc. errors before posting a completion error.
> >
> > Note that two mca parameters are involved here: 
> > btl_mvapi_ib_retry_count - The number of times the sender will attempt to
> > retry  (defaulted to 7, the maximum value). 
> >
> > btl_mvapi_ib_timeout - The local ack timeout parameter (defaulted to 10).
> > The actual timeout value used is calculated as:
> > (4.096 micro-seconds * 2^btl_mvapi_ib_timeout). 
> > See InfiniBand Spec 1.2 (12.7.34) for more details.
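
For reference, the formula in the help text above can be evaluated for the
values mentioned in this thread (10, 12, 14, 16). This is only a minimal
sketch of the arithmetic: the function name ib_ack_timeout_us is my own and
is not part of Open MPI, and it computes the nominal value from the formula,
not whatever the HCA actually enforces.

```python
def ib_ack_timeout_us(exponent: int) -> float:
    """Nominal local ack timeout in microseconds.

    Implements the formula from the Open MPI help message:
    4.096 microseconds * 2^btl_mvapi_ib_timeout.
    The function name is hypothetical, chosen for this illustration.
    """
    if exponent < 0:
        raise ValueError("timeout exponent must be non-negative")
    return 4.096 * (2 ** exponent)


if __name__ == "__main__":
    # Values discussed in this thread: the default (10), the value that
    # let the job finish (about 12), and the two suggested values (14, 16).
    for exponent in (10, 12, 14, 16):
        timeout_ms = ib_ack_timeout_us(exponent) / 1000.0
        print(f"btl_mvapi_ib_timeout={exponent}: {timeout_ms:.3f} ms")
```

Each step of 2 in the parameter multiplies the nominal timeout by 4: the
default of 10 gives roughly 4.2 ms, 12 gives roughly 16.8 ms, 14 roughly
67.1 ms, and 16 roughly 268.4 ms.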
