Re: [MTT users] RETRY EXCEEDED ERROR
On Jul 31, 2008, at 12:42 PM, Rafael Folco wrote:

> Thanks for the response, Pasha. Yes, I agree this is some issue with the
> IB network. I came to the list looking for previous experience from
> other users... I wonder why 10.2.1.90 works with all other nodes, and
> 10.2.1.50 works with all other nodes as well, but they can't work
> together. Maybe the OFED lists would be more appropriate for this kind
> of question.

Unless 10.2.x is your IPoIB network, the pings and whatnot don't really
have much bearing on what the IB network is doing. The OF general list is
a good place to ask questions, but they'll ask you to do the standard
level 1 network diagnostics first (check counters and cables, etc.).

--
Jeff Squyres
Cisco Systems
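[Editor's note] The "standard level 1 network diagnostics" Jeff mentions usually amount to checking port state and error counters on both ends and then sweeping the fabric. A minimal sketch, assuming the stock OFED diagnostic utilities (ibstat, perfquery, ibdiagnet) are installed and run as root on both 10.2.1.90 and 10.2.1.50; exact counter names and output formats vary by OFED release:

```shell
# Port state and link rate on this host: the port should be Active
# and running at the expected width/speed.
ibstat

# Per-port IB error counters; rising SymbolErrorCounter,
# LinkErrorRecoveryCounter, or LinkDownedCounter on the path between
# the two hosts typically points at a bad cable or connector.
perfquery -a

# Fabric-wide sweep from any one node; reports bad links, duplicate
# GUIDs, and other topology problems.
ibdiagnet
```

These commands require InfiniBand hardware and cannot be exercised off-cluster, so treat the snippet as a checklist rather than a script.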
Re: [MTT users] RETRY EXCEEDED ERROR
Thanks for the response, Pasha. Yes, I agree this is some issue with the
IB network. I came to the list looking for previous experience from other
users... I wonder why 10.2.1.90 works with all other nodes, and 10.2.1.50
works with all other nodes as well, but they can't work together. Maybe
the OFED lists would be more appropriate for this kind of question.

Regards,
Rafael

On Thu, 2008-07-31 at 18:52 +0300, Pavel Shamis (Pasha) wrote:
> The "RETRY EXCEEDED ERROR" error is related to IB, not MTT. The error
> says that IB failed to send an IB packet from machine 10.2.1.90 to
> 10.2.1.50. You need to run your IB network monitoring tool and find
> the issue. Usually it is a bad cable in the IB fabric that causes such
> errors.
>
> Regards,
> Pasha
>
> Rafael Folco wrote:
> > Hi,
> >
> > I need some help, please.
> >
> > I'm running a set of MTT tests on my cluster and I have issues on a
> > particular node.
> >
> > [0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from
> > 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED
> > ERROR status number 12 for wr_id 268870712 opcode 0
> >
> > I am able to ping from 10.2.1.90 to 10.2.1.50, and they are visible to
> > each other on the network, just like the other nodes. I've already
> > checked the drivers and reinstalled Open MPI, but nothing changes...
> >
> > On 10.2.1.90:
> > # ping 10.2.1.50
> > PING 10.2.1.50 (10.2.1.50) 56(84) bytes of data.
> > 64 bytes from 10.2.1.50: icmp_seq=1 ttl=64 time=9.95 ms
> > 64 bytes from 10.2.1.50: icmp_seq=2 ttl=64 time=0.076 ms
> > 64 bytes from 10.2.1.50: icmp_seq=3 ttl=64 time=0.114 ms
> >
> > The cable connections are the same for every node, and all tests run
> > fine without 10.2.1.90. On the other hand, when I add 10.2.1.90 to
> > the host list, I get many failures.
> >
> > Please, could someone tell me why 10.2.1.90 doesn't like 10.2.1.50?
> > Any clue? I don't see any problems with other combinations of nodes.
> > This is very, very weird.
> >
> > MTT Results Summary
> > hostname: p6ihopenhpc1-ib0
> > uname: Linux p6ihopenhpc1-ib0 2.6.16.60-0.21-ppc64 #1 SMP Tue May 6
> > 12:41:02 UTC 2008 ppc64 ppc64 ppc64 GNU/Linux
> > who am i: root pts/3 Jul 31 13:31 (elm3b150:S.0)
> >
> > +-------------+-----------------+------+------+----------+------+
> > | Phase       | Section         | Pass | Fail | Time out | Skip |
> > +-------------+-----------------+------+------+----------+------+
> > | MPI install | openmpi-1.2.5   |    1 |    0 |        0 |    0 |
> > | Test Build  | trivial         |    1 |    0 |        0 |    0 |
> > | Test Build  | ibm             |    1 |    0 |        0 |    0 |
> > | Test Build  | onesided        |    1 |    0 |        0 |    0 |
> > | Test Build  | mpicxx          |    1 |    0 |        0 |    0 |
> > | Test Build  | imb             |    1 |    0 |        0 |    0 |
> > | Test Build  | netpipe         |    1 |    0 |        0 |    0 |
> > | Test Run    | trivial         |    4 |    4 |        0 |    0 |
> > | Test Run    | ibm             |   59 |  120 |        0 |    3 |
> > | Test Run    | onesided        |   95 |   37 |        0 |    0 |
> > | Test Run    | mpicxx          |    0 |    1 |        0 |    0 |
> > | Test Run    | imb correctness |    0 |    1 |        0 |    0 |
> > | Test Run    | imb performance |    0 |   12 |        0 |    0 |
> > | Test Run    | netpipe         |    1 |    0 |        0 |    0 |
> > +-------------+-----------------+------+------+----------+------+
> >
> > I also attached one of the errors here.
> >
> > Thanks in advance,
> >
> > Rafael
> >
> > ___
> > mtt-users mailing list
> > mtt-us...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/mtt-users

--
Rafael Folco
OpenHPC / Brazil Test Lead
IBM Linux Technology Center
E-Mail: rfo...@linux.vnet.ibm.com
Re: [MTT users] RETRY EXCEEDED ERROR
The "RETRY EXCEEDED ERROR" error is related to IB, not MTT. The error
says that IB failed to send an IB packet from machine 10.2.1.90 to
10.2.1.50. You need to run your IB network monitoring tool and find the
issue. Usually it is a bad cable in the IB fabric that causes such errors.

Regards,
Pasha

Rafael Folco wrote:
> Hi,
>
> I need some help, please.
>
> I'm running a set of MTT tests on my cluster and I have issues on a
> particular node.
>
> [0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from
> 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED
> ERROR status number 12 for wr_id 268870712 opcode 0
>
> I am able to ping from 10.2.1.90 to 10.2.1.50, and they are visible to
> each other on the network, just like the other nodes. I've already
> checked the drivers and reinstalled Open MPI, but nothing changes...
>
> On 10.2.1.90:
> # ping 10.2.1.50
> PING 10.2.1.50 (10.2.1.50) 56(84) bytes of data.
> 64 bytes from 10.2.1.50: icmp_seq=1 ttl=64 time=9.95 ms
> 64 bytes from 10.2.1.50: icmp_seq=2 ttl=64 time=0.076 ms
> 64 bytes from 10.2.1.50: icmp_seq=3 ttl=64 time=0.114 ms
>
> The cable connections are the same for every node, and all tests run
> fine without 10.2.1.90. On the other hand, when I add 10.2.1.90 to the
> host list, I get many failures.
>
> Please, could someone tell me why 10.2.1.90 doesn't like 10.2.1.50?
> Any clue? I don't see any problems with other combinations of nodes.
> This is very, very weird.
>
> MTT Results Summary
> hostname: p6ihopenhpc1-ib0
> uname: Linux p6ihopenhpc1-ib0 2.6.16.60-0.21-ppc64 #1 SMP Tue May 6
> 12:41:02 UTC 2008 ppc64 ppc64 ppc64 GNU/Linux
> who am i: root pts/3 Jul 31 13:31 (elm3b150:S.0)
>
> +-------------+-----------------+------+------+----------+------+
> | Phase       | Section         | Pass | Fail | Time out | Skip |
> +-------------+-----------------+------+------+----------+------+
> | MPI install | openmpi-1.2.5   |    1 |    0 |        0 |    0 |
> | Test Build  | trivial         |    1 |    0 |        0 |    0 |
> | Test Build  | ibm             |    1 |    0 |        0 |    0 |
> | Test Build  | onesided        |    1 |    0 |        0 |    0 |
> | Test Build  | mpicxx          |    1 |    0 |        0 |    0 |
> | Test Build  | imb             |    1 |    0 |        0 |    0 |
> | Test Build  | netpipe         |    1 |    0 |        0 |    0 |
> | Test Run    | trivial         |    4 |    4 |        0 |    0 |
> | Test Run    | ibm             |   59 |  120 |        0 |    3 |
> | Test Run    | onesided        |   95 |   37 |        0 |    0 |
> | Test Run    | mpicxx          |    0 |    1 |        0 |    0 |
> | Test Run    | imb correctness |    0 |    1 |        0 |    0 |
> | Test Run    | imb performance |    0 |   12 |        0 |    0 |
> | Test Run    | netpipe         |    1 |    0 |        0 |    0 |
> +-------------+-----------------+------+------+----------+------+
>
> I also attached one of the errors here.
>
> Thanks in advance,
>
> Rafael
[MTT users] RETRY EXCEEDED ERROR
Hi,

I need some help, please.

I'm running a set of MTT tests on my cluster and I have issues on a
particular node.

[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] from
10.2.1.90 to: 10.2.1.50 error polling HP CQ with status RETRY EXCEEDED
ERROR status number 12 for wr_id 268870712 opcode 0

I am able to ping from 10.2.1.90 to 10.2.1.50, and they are visible to
each other on the network, just like the other nodes. I've already
checked the drivers and reinstalled Open MPI, but nothing changes...

On 10.2.1.90:
# ping 10.2.1.50
PING 10.2.1.50 (10.2.1.50) 56(84) bytes of data.
64 bytes from 10.2.1.50: icmp_seq=1 ttl=64 time=9.95 ms
64 bytes from 10.2.1.50: icmp_seq=2 ttl=64 time=0.076 ms
64 bytes from 10.2.1.50: icmp_seq=3 ttl=64 time=0.114 ms

The cable connections are the same for every node, and all tests run fine
without 10.2.1.90. On the other hand, when I add 10.2.1.90 to the host
list, I get many failures.

Please, could someone tell me why 10.2.1.90 doesn't like 10.2.1.50? Any
clue? I don't see any problems with other combinations of nodes. This is
very, very weird.

MTT Results Summary
hostname: p6ihopenhpc1-ib0
uname: Linux p6ihopenhpc1-ib0 2.6.16.60-0.21-ppc64 #1 SMP Tue May 6
12:41:02 UTC 2008 ppc64 ppc64 ppc64 GNU/Linux
who am i: root pts/3 Jul 31 13:31 (elm3b150:S.0)

+-------------+-----------------+------+------+----------+------+
| Phase       | Section         | Pass | Fail | Time out | Skip |
+-------------+-----------------+------+------+----------+------+
| MPI install | openmpi-1.2.5   |    1 |    0 |        0 |    0 |
| Test Build  | trivial         |    1 |    0 |        0 |    0 |
| Test Build  | ibm             |    1 |    0 |        0 |    0 |
| Test Build  | onesided        |    1 |    0 |        0 |    0 |
| Test Build  | mpicxx          |    1 |    0 |        0 |    0 |
| Test Build  | imb             |    1 |    0 |        0 |    0 |
| Test Build  | netpipe         |    1 |    0 |        0 |    0 |
| Test Run    | trivial         |    4 |    4 |        0 |    0 |
| Test Run    | ibm             |   59 |  120 |        0 |    3 |
| Test Run    | onesided        |   95 |   37 |        0 |    0 |
| Test Run    | mpicxx          |    0 |    1 |        0 |    0 |
| Test Run    | imb correctness |    0 |    1 |        0 |    0 |
| Test Run    | imb performance |    0 |   12 |        0 |    0 |
| Test Run    | netpipe         |    1 |    0 |        0 |    0 |
+-------------+-----------------+------+------+----------+------+

I also attached one of the errors here.

Thanks in advance,

Rafael

--
Rafael Folco
OpenHPC / Brazil Test Lead
IBM Linux Technology Center
E-Mail: rfo...@linux.vnet.ibm.com

| command        | mpirun --hostfile /tmp/ompi-core-testers/hosts.list -np 8
|                | --mca btl openib,self
|                | --mca btl_openib_warn_default_gid_prefix 0
|                | --prefix /usr/lib/mpi/gcc/openmpi collective/gather
| duration       | 0 seconds
| exit_value     | 143
| result_message | Failed; exit status: 143
| result_stdout  | [0,1,7][btl_openib_component.c:1332:btl_openib_component_progress]
|                | from 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status
|                | RETRY EXCEEDED ERROR status number 12 for wr_id 268870712 opcode 0
|                | --------------------------------------------------------------
|                | The InfiniBand retry count between two MPI processes has been
|                | exceeded. "Retry count" is defined in the InfiniBand spec 1.2
|                | (section 12.7.38):
|                |
|                |     The total number of times that the sender wishes the
|                |     receiver to retry timeout, packet sequence, etc. errors
|                |     before posting a completion error.
|                |
|                | This error typically means that there is something awry within
|                | the InfiniBand fabric itself. You should note the hosts on
|                | which this error has occurred; it has been observed that
|                | rebooting or removing a
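[Editor's note] With 120-odd failures in the ibm suite alone, it helps to parse the openib BTL error lines out of the MTT stdout and count failures per node pair: if every failure involves the 10.2.1.90/10.2.1.50 pair, that corroborates a single bad link rather than a sick node. A minimal sketch; the regex and `failing_pairs` helper are hand-written for the message format quoted in this thread, not part of MTT or Open MPI:

```python
import re
from collections import Counter

# Matches the openib BTL progress-error line quoted in this thread.
BTL_ERR = re.compile(
    r"from\s+(\d+\.\d+\.\d+\.\d+)\s+to:\s+(\d+\.\d+\.\d+\.\d+)\s+"
    r"error polling .*? with status (.+?) status number (\d+)"
)

def failing_pairs(lines):
    """Count (src, dst, status) triples seen in a stream of log lines."""
    counts = Counter()
    for line in lines:
        m = BTL_ERR.search(line)
        if m:
            src, dst, status, _num = m.groups()
            counts[(src, dst, status)] += 1
    return counts

log = [
    "[0,1,7][btl_openib_component.c:1332:btl_openib_component_progress] "
    "from 10.2.1.90 to: 10.2.1.50 error polling HP CQ with status "
    "RETRY EXCEEDED ERROR status number 12 for wr_id 268870712 opcode 0",
]
print(failing_pairs(log))
# -> Counter({('10.2.1.90', '10.2.1.50', 'RETRY EXCEEDED ERROR'): 1})
```

Feeding the full MTT stdout through `failing_pairs` gives a per-pair tally at a glance, which is a quicker sanity check than rereading hundreds of test transcripts.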