I was corresponding with Hal Rosenstock about this problem, but he suggested that I resubmit it to a wider audience. The previous messages are under the subject "How do I use "madeye" to diagnose a problem?". I was trying to use "madeye" to find out whether any MAD packets were being received by a node whose link fails to initialize.

I have a small two-node testbed system consisting of two EM64T machines ("koa" and "jatoba") cabled back-to-back with two Mellanox MT25204 (4x DDR) HCAs. This configuration worked with a backported 2.6.11-34 kernel and revision 6500 from the OpenIB svn trunk. I was able to run basic tests and several sets of MPI benchmarks.

Since moving to a 2.6.16 kernel and the OFED-1.0 release, we cannot get the link on the "jatoba" machine to come up. The "madeye" module seems to show that no MAD packets are being received on "jatoba" when the Subnet Manager is run on the other machine. When I try to run the SM on "jatoba", or any other program that uses MADs, the process hangs. Here is a portion of the stack trace for one of the hung processes, obtained by running "echo t > /proc/sysrq-trigger" and looking at the dmesg output.


ibis          D 0000000000000003     0  5489   5097  5522               (NOTLB)
ffff8100788c7d28 ffff810037cb9030 ffff8100788c7c78 ffff81007c606640
       ffffffff803c1b65 0000000000000001 ffffffff801350ce ffff810003392418
       ffff8100788c6000 ffff8100788c7cb8
Call Trace: <ffffffff803c1b65>{_spin_lock_irqsave+14}
       <ffffffff801350ce>{lock_timer_base+27} <ffffffff880c4a0d>{:ib_mthca:mthca_table_put+65}
       <ffffffff803c1c20>{_spin_unlock_irq+9} <ffffffff803bfd5f>{wait_for_completion+179}
       <ffffffff80127468>{default_wake_function+0} <ffffffff80127468>{default_wake_function+0}
       <ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
       <ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
       <ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564} <ffffffff80140025>{autoremove_wake_function+0}
       <ffffffff80180d4d>{do_ioctl+45} <ffffffff80181034>{vfs_ioctl+658}
       <ffffffff8018948e>{mntput_no_expire+28} <ffffffff80181083>{sys_ioctl+60}
       <ffffffff8010aa52>{system_call+126}

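The process appears to be blocked in wait_for_completion, with ib_cancel_rmpp_recvs and ib_unregister_mad_agent below it on the stack, so it seems to be a lock or mutex problem, but I don't know how to proceed from here.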
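To make a trace like the one above easier to scan, here is a minimal Python sketch that pulls out the symbol names and flags which frames belong to loadable modules. It assumes only the frame syntax visible in this trace, `<address>{symbol+offset}` and `<address>{:module:symbol+offset}`. The function names `parse_frames` and `module_frames` are made up for illustration, and the offset is kept as the printed string rather than interpreted.

```python
import re

# Frame syntax as it appears in the trace above (an assumption about format,
# taken only from this dmesg output):
#   <address>{symbol+offset}            core kernel symbol
#   <address>{:module:symbol+offset}    symbol inside a loadable module
FRAME_RE = re.compile(r"<([0-9a-f]+)>\{(?::(\w+):)?(\w+)\+(\w+)\}")


def parse_frames(trace_text):
    """Return (module, symbol, offset) tuples in the order they are printed.

    module is None for frames that carry no module prefix.
    offset is left as the printed string.
    """
    frames = []
    for _addr, module, symbol, offset in FRAME_RE.findall(trace_text):
        frames.append((module or None, symbol, offset))
    return frames


def module_frames(trace_text):
    """Keep only the frames that belong to a loadable module."""
    return [frame for frame in parse_frames(trace_text) if frame[0] is not None]


# Three lines copied from the trace above.
SAMPLE = """
       <ffffffff88023909>{:ib_mad:ib_cancel_rmpp_recvs+144}
       <ffffffff88020933>{:ib_mad:ib_unregister_mad_agent+1019}
       <ffffffff8803bc29>{:ib_umad:ib_umad_ioctl+564} <ffffffff80140025>{autoremove_wake_function+0}
"""

for module, symbol, offset in module_frames(SAMPLE):
    print(f"{module}: {symbol}+{offset}")
```

Run against the full trace, this lists the ib_mthca, ib_mad and ib_umad frames only, which may help when comparing hung processes against each other.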

Some things I have tried are:
  1. Connecting the two machines to a switch instead of back-to-back, to use the SM in the switch. The link to "koa" comes up, but the link to "jatoba" does not.
  2. Physically swapping the two HCAs between the two machines; the problem stays on the "jatoba" side.
  3. Turning on "debug_level" traces with "modprobe ib_mthca debug_level=1" on both machines. The traces seem to be identical on both, except for the actual PCI bus location and the memory addresses being mapped. No additional traces are generated when the hangs occur.
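For the comparison in item 3, a rough Python sketch of how the two debug logs could be diffed while ignoring the PCI location and mapped addresses. This is an illustration, not what was actually run: the helper names `normalize` and `diff_logs` are hypothetical, the sample lines are invented placeholders rather than real ib_mthca output, and the hex pattern is deliberately crude (it also masks any 8 to 16 digit number).

```python
import difflib
import re

# PCI location such as "06:00.0" or "0000:06:00.0".
PCI_ADDR_RE = re.compile(
    r"\b(?:[0-9a-fA-F]{4}:)?[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]\b"
)
# "0x..." values, or bare runs of 8 to 16 hex digits (crude on purpose).
HEX_RE = re.compile(r"\b0x[0-9a-fA-F]+\b|\b[0-9a-fA-F]{8,16}\b")


def normalize(line):
    """Mask the machine-specific parts of one log line."""
    line = PCI_ADDR_RE.sub("<pci>", line)
    line = HEX_RE.sub("<addr>", line)
    return line


def diff_logs(lines_a, lines_b, name_a="koa", name_b="jatoba"):
    """Return unified-diff lines for the two logs after masking.

    An empty list means the logs differ only in PCI location and addresses.
    """
    a = [normalize(line) for line in lines_a]
    b = [normalize(line) for line in lines_b]
    return list(difflib.unified_diff(a, b, name_a, name_b, lineterm=""))


# Invented placeholder lines, only to show the masking.
koa_log = ["driver 0000:06:00.0: region mapped at 0xdd800000"]
jatoba_log = ["driver 0000:03:00.0: region mapped at 0xfe900000"]

for line in diff_logs(koa_log, jatoba_log):
    print(line)
```

With real input, the two lists would come from the relevant dmesg lines saved on each machine; any line that survives the masking and still differs is worth a closer look.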

The machines are both EM64T but are not identical. The "koa" side has the HCA on PCI "06:00.0", and the "jatoba" side has the HCA on "03:00.0". The two machines are:

   koa (the working one) is an Intel SE7520BD2 motherboard (7520 chipset).
   jatoba (the bad one) is an Intel SE7525GP2 motherboard (7525 chipset).

Can anyone suggest what to try or look at next?

        -Don Albert-