Re: [OMPI users] RETRY EXCEEDED ERROR status number 12
You may try to use the ibdiagnet tool: http://linux.die.net/man/1/ibdiagnet
The tool is part of OFED (http://www.openfabrics.org/).

Pasha.

Prentice Bisbal wrote:
> Several jobs on my cluster just died with the error below. Are there any
> IB/Open MPI diagnostics I should use to diagnose this, should I just
> reboot the nodes, or should I have the user who submitted these jobs
> simply increase the retry count/timeout parameters?
>
> [0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
> from node14.aurora to: node40.aurora error polling HP CQ with status
> RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 9
> --------------------------------------------------------------------------
> The InfiniBand retry count between two MPI processes has been exceeded.
> "Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):
>
>     The total number of times that the sender wishes the receiver to
>     retry timeout, packet sequence, etc. errors before posting a
>     completion error.
>
> This error typically means that there is something awry within the
> InfiniBand fabric itself. You should note the hosts on which this error
> has occurred; it has been observed that rebooting or removing a
> particular host from the job can sometimes resolve this issue.
>
> Two MCA parameters can be used to control Open MPI's behavior with
> respect to the retry count:
>
> * btl_openib_ib_retry_count - The number of times the sender will
>   attempt to retry (defaulted to 7, the maximum value).
> * btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
>   to 10). The actual timeout value used is calculated as:
>
>     4.096 microseconds * (2^btl_openib_ib_timeout)
>
>   See the InfiniBand spec 1.2 (section 12.7.34) for more details.
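For reference, a minimal sketch of running that check (assuming the OFED
ibdiagnet documented at the man page above, whose -lw/-ls options flag links
that come up below the expected width/speed):

    # Full fabric sweep, run from any node with an active IB port
    ibdiagnet

    # Additionally flag any link running below 4x width or 10 Gbps speed
    ibdiagnet -lw 4x -ls 10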
[OMPI users] RETRY EXCEEDED ERROR status number 12
Several jobs on my cluster just died with the error below. Are there any
IB/Open MPI diagnostics I should use to diagnose this, should I just reboot
the nodes, or should I have the user who submitted these jobs simply
increase the retry count/timeout parameters?

[0,1,6][../../../../../ompi/mca/btl/openib/btl_openib_component.c:1375:btl_openib_component_progress]
from node14.aurora to: node40.aurora error polling HP CQ with status
RETRY EXCEEDED ERROR status number 12 for wr_id 13606831800 opcode 9
--------------------------------------------------------------------------
The InfiniBand retry count between two MPI processes has been exceeded.
"Retry count" is defined in the InfiniBand spec 1.2 (section 12.7.38):

    The total number of times that the sender wishes the receiver to
    retry timeout, packet sequence, etc. errors before posting a
    completion error.

This error typically means that there is something awry within the
InfiniBand fabric itself. You should note the hosts on which this error
has occurred; it has been observed that rebooting or removing a
particular host from the job can sometimes resolve this issue.

Two MCA parameters can be used to control Open MPI's behavior with
respect to the retry count:

* btl_openib_ib_retry_count - The number of times the sender will
  attempt to retry (defaulted to 7, the maximum value).
* btl_openib_ib_timeout - The local ACK timeout parameter (defaulted
  to 10). The actual timeout value used is calculated as:

    4.096 microseconds * (2^btl_openib_ib_timeout)

  See the InfiniBand spec 1.2 (section 12.7.34) for more details.

--
Prentice Bisbal
Linux Software Support Specialist/System Administrator
School of Natural Sciences
Institute for Advanced Study
Princeton, NJ
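For reference, a worked example of that timeout formula: with the default
btl_openib_ib_timeout of 10, each retry waits 4.096 us * 2^10 ~= 4.2 ms;
raising it to 14 gives 4.096 us * 2^14 ~= 67 ms. Since
btl_openib_ib_retry_count already defaults to its maximum of 7, only the
timeout has headroom. A sketch of passing the parameter on the mpirun
command line (the process count and ./my_app are placeholders):

    # Raise the ACK timeout exponent from 10 to 14
    # (~4.2 ms -> ~67 ms per retry attempt)
    mpirun --mca btl_openib_ib_timeout 14 -np 64 ./my_app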
Re: [OMPI users] RETRY EXCEEDED ERROR
Jan Lindheim wrote:
> Thanks Pasha! ibdiagnet reports the following:
>
> -I---------------------------------------------------
> -I- IPoIB Subnets Check
> -I---------------------------------------------------
> -I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
> -W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join due to rate:2.5Gbps < group:10Gbps
>
> I guess this may indicate a bad adapter. Now, I just need to find what
> system this maps to.

I guess it is some bad cable.

> I also ran ibcheckerrors and it reports a lot of problems with buffer
> overruns. Here's the tail end of the output, with only some of the last
> ports reported:
>
> #warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
> #warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
> #warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
> #warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
> #warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
> #warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
> #warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
> #warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
> #warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
> #warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
> #warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
> #warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
> #warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
> Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED
>
> ## Summary: 209 nodes checked, 0 bad nodes found
> ##          716 ports checked, 103 ports have errors beyond threshold

It reports a lot of symbol errors. I recommend resetting all these counters
(if I remember correctly it is the -c flag in ibdiagnet) and rerunning the
check after the next MPI process failure.

Thanks,
Pasha
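A hedged sketch of that reset-and-rerun cycle: in the OFED ibdiagnet the
counter-clearing flag appears to be -pc rather than -c (and perfquery -R is
an alternative per-port reset); both are worth checking against the man
pages on your installation.

    # Clear all fabric port counters, reproduce the failure, then re-check
    # so that only freshly accumulated errors are counted
    ibdiagnet -pc
    # ... rerun the failing MPI job ...
    ibdiagnet
    ibcheckerrors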
Re: [OMPI users] RETRY EXCEEDED ERROR
On Thu, Mar 05, 2009 at 10:27:27AM +0200, Pavel Shamis (Pasha) wrote:
> > Time to dig up diagnostics tools and look at port statistics.
>
> You may use the ibdiagnet tool for the network debug -
> http://linux.die.net/man/1/ibdiagnet. This tool is part of OFED.
>
> Pasha.
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

Thanks Pasha! ibdiagnet reports the following:

-I---------------------------------------------------
-I- IPoIB Subnets Check
-I---------------------------------------------------
-I- Subnet: IPv4 PKey:0x7fff QKey:0x0b1b MTU:2048Byte rate:10Gbps SL:0x00
-W- Port localhost/P1 lid=0x00e2 guid=0x001e0b4ced75 dev=25218 can not join due to rate:2.5Gbps < group:10Gbps

I guess this may indicate a bad adapter. Now, I just need to find what
system this maps to.

I also ran ibcheckerrors and it reports a lot of problems with buffer
overruns. Here's the tail end of the output, with only some of the last
ports reported:

#warn: counter SymbolErrors = 36905 (threshold 10) lid 193 port 14
#warn: counter LinkDowned = 23 (threshold 10) lid 193 port 14
#warn: counter RcvErrors = 15641 (threshold 10) lid 193 port 14
#warn: counter RcvSwRelayErrors = 225 (threshold 100) lid 193 port 14
#warn: counter ExcBufOverrunErrors = 10 (threshold 10) lid 193 port 14
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 14: FAILED
#warn: counter LinkRecovers = 181 (threshold 10) lid 193 port 1
#warn: counter RcvSwRelayErrors = 2417 (threshold 100) lid 193 port 1
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 1: FAILED
#warn: counter LinkRecovers = 103 (threshold 10) lid 193 port 3
#warn: counter RcvErrors = 9035 (threshold 10) lid 193 port 3
#warn: counter RcvSwRelayErrors = 64670 (threshold 100) lid 193 port 3
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 3: FAILED
#warn: counter SymbolErrors = 13151 (threshold 10) lid 193 port 4
#warn: counter RcvErrors = 109 (threshold 10) lid 193 port 4
#warn: counter RcvSwRelayErrors = 507 (threshold 100) lid 193 port 4
Error check on lid 193 (ISR9288/ISR9096 Voltaire sLB-24) port 4: FAILED

## Summary: 209 nodes checked, 0 bad nodes found
##          716 ports checked, 103 ports have errors beyond threshold

I wonder if this is something that needs to be tuned in the InfiniBand
switch or if there is something in Open MPI/OpenIB that can be tuned.

Thanks,
Jan Lindheim
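To answer the "find what system this maps to" question above, one common
approach (assuming the standard infiniband-diags tools are installed) is to
dump the fabric topology and search for the GUID from the ibdiagnet
warning:

    # ibnetdiscover prints every node and port with its GUID and node
    # description; grep for the flagged GUID to identify the host
    ibnetdiscover | grep -i 001e0b4ced75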