Don,

On Tue, 2006-05-30 at 10:55, [EMAIL PROTECTED] wrote:
> Hal,
>
> With your patch to OpenSM, I think everything is ok on the local node.

That patch with one minor change (elimination of the CL_ASSERT) will be
part of the upcoming RC6.

> The remote node is definitely having some problems, resulting in not
> responding to the MAD packets. I have entered a separate message on
> the problems with the "ib0" interface on that machine.
>
> > On Fri, 2006-05-26 at 20:59, Hal Rosenstock wrote:
> > > > What next, coach?
> > >
> > > Can you turn on madeye on the remote node and see what packets are
> > > received and sent ? Let me know if you need help with that. I think you
> > > said you were running OFED, right ?
>
> Yes, I am running kernel 2.6.16 with the OFED RC5 release. I will
> investigate how to run madeye, but the hangs on the remote machine are
> probably the root cause of the link failure.

Ah; got it. It's tied into the other problem. Yes, when the hangs are
resolved, the SMA on the remote node will respond and I would expect the
port to get to active and you should be on your way then.

> > I don't think madeye is part of OFED :-( Can it get added for RC6,
> > Tziporet ? I think it would be a useful tool to add for problems like
> > this.
> >
> > Also, was this a working setup before ? Did anything else change besides
> > installing RC5 on both nodes ?
>
> This back to back setup was working originally with a backported
> 2.6.11-34 kernel and I believe it was revision 6500 from the OpenIB
> svn trunk at that time. The problems started when I tried to move to
> RC4 and now RC5 of the OFED release, with the 2.6.16 kernel.
>
> > I have two more experiments I'd like you to try, before we go down the
> > madeye "route":
> >
> > 1. Do you have another IB cable to try ?
> >
> > 2. Can you completely shutdown and repower the remote node and see if it
> > starts responding ?
>
> It is difficult for me to debug this sort of thing, since I
> telecommute from Tucson and the machines are located in Phoenix. But
> I can get someone there to power the machine down and reboot.

It's OK; you explained the state of the remote node so neither of those
experiments is necessary.

-- Hal

> -Don Albert-