Re: [OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Pavel Shamis (Pasha)

You may try to use the ibdiagnet tool:
http://linux.die.net/man/1/ibdiagnet

The tool is part of OFED (http://www.openfabrics.org/)

Pasha.

Prentice Bisbal wrote:

Several jobs on my cluster just died with the error below.

Are there any IB/Open MPI diagnostics I should use to diagnose this, should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout parameters?


[0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
from node14.aurora to: node40.aurora error polling HP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 9
--
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:

4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.
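
For a sense of scale, that formula works out as follows (a small Python
sketch; the values other than the default of 10 are arbitrary examples):

    # ACK timeout per retry attempt, per the formula quoted above:
    #   4.096 microseconds * (2 ** btl_openib_ib_timeout)
    for ib_timeout in (10, 14, 20):   # 10 is the Open MPI default
        seconds = 4.096e-6 * (2 ** ib_timeout)
        print("btl_openib_ib_timeout=%2d -> %.4f s per retry" % (ib_timeout, seconds))

which prints roughly 0.0042 s, 0.067 s, and 4.3 s per retry, respectively.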

  




[OMPI users] RETRY EXCEEDED ERROR status number 12

2009-08-21 Thread Prentice Bisbal
Several jobs on my cluster just died with the error below.

Are there any IB/Open MPI diagnostics I should use to diagnose this, should I
just reboot the nodes, or should I have the user who submitted these
jobs just increase the retry count/timeout parameters?


[0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
from node14.aurora to: node40.aurora error polling HP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 9
--
The InfiniBand retry count between two MPI processes has been
exceeded. "Retry count" is defined in the InfiniBand spec 1.2
(section 12.7.38):

The total number of times that the sender wishes the receiver to
retry timeout, packet sequence, etc. errors before posting a
completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this
error has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
attempt to retry (defaulted to 7, the maximum value).

* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
to 10). The actual timeout value used is calculated as:

4.096 microseconds * (2^btl_openib_ib_timeout)

See the InfiniBand spec 1.2 (section 12.7.34) for more details.
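
For reference, a minimal sketch of how one might raise these values when
launching a job. It uses Open MPI's OMPI_MCA_* environment variables, which
are equivalent to passing --mca on the mpirun command line; the process count
and application name below are placeholders:

    import os
    import subprocess

    env = dict(os.environ)
    env["OMPI_MCA_btl_openib_ib_retry_count"] = "7"   # already the default and maximum
    env["OMPI_MCA_btl_openib_ib_timeout"] = "20"      # 4.096 us * 2**20 ~= 4.3 s per retry

    # Placeholder launch; substitute the real process count and binary.
    subprocess.run(["mpirun", "-np", "64", "./my_app"], env=env, check=True)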

-- 
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ


Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Pavel Shamis (Pasha)



Thanks Pasha!
ibdiagnet reports the following:

-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join
due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter.  Now, I just need to find what
system this maps to.
  

My guess is that it is a bad cable.

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns.  Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641(threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225   (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14:  FAILED 
#warn: counter LinkRecovers = 181   (threshold 10) lid 193 port 1

#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1:  FAILED 
#warn: counter LinkRecovers = 103   (threshold 10) lid 193 port 3

#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3:  FAILED 
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4

#warn: counter RcvErrors = 109  (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507   (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4:  FAILED 


## Summary: 209 nodes checked, 0 bad nodes found
##  716 ports checked, 103 ports have errors beyond threshold

  
It reports a lot of symbol errors. I recommend resetting all of these
counters (if I remember correctly, it is the -c flag in ibdiagnet) and
rerunning the test after the MPI process failure.


Thanks,
Pasha



Re: [OMPI users] RETRY EXCEEDED ERROR

2009-03-05 Thread Jan Lindheim
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> 
> >Time to dig up diagnostics tools and look at port statistics.
> >  
> You may use the ibdiagnet tool for network debugging -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
> 
> Pasha.

Thanks Pasha!
ibdiagnet reports the following:

-I---
-I- IPoIB Subnets Check
-I---
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join
due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter.  Now, I just need to find what
system this maps to.
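
One possible way to map that GUID back to a host, assuming the OFED
ibnetdiscover utility is available, is to scan its topology dump for the GUID
reported above; a rough sketch:

    import subprocess

    guid = "001e0b4ced75"   # from the ibdiagnet warning above
    topology = subprocess.run(["ibnetdiscover"], capture_output=True,
                              text=True).stdout
    for line in topology.splitlines():
        if guid in line:
            print(line)    # matching lines carry the node description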

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns.  Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23  (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641(threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225   (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14:  FAILED 
#warn: counter LinkRecovers = 181   (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417  (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1:  FAILED 
#warn: counter LinkRecovers = 103   (threshold 10) lid 193 port 3
#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3:  FAILED 
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109  (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507   (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4:  FAILED 

## Summary: 209 nodes checked, 0 bad nodes found
##  716 ports checked, 103 ports have errors beyond threshold
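
To see which ports and counters are worst, one rough approach is to
post-process that output with something like the sketch below (the log file
name "ibcheckerrors.log" and the exact line format are assumptions based on
the lines pasted above):

    import re

    pattern = re.compile(r"#warn: counter (\S+)\s*=\s*(\d+)\s*"
                         r"\(threshold (\d+)\) lid (\d+) port (\d+)")

    warnings = []
    with open("ibcheckerrors.log") as f:        # saved ibcheckerrors output
        for line in f:
            m = pattern.search(line)
            if m:
                name, value, threshold, lid, port = m.groups()
                warnings.append((int(value) / int(threshold), name, lid, port, value))

    # Print the ten counters that exceed their threshold by the largest factor.
    for ratio, name, lid, port, value in sorted(warnings, reverse=True)[:10]:
        print("lid %s port %s: %s = %s (%.0fx threshold)"
              % (lid, port, name, value, ratio))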


I wonder if this is something that needs to be tuned in the InfiniBand
switch or if there is something in Open MPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim