So, take a set of 3 or 4 hosts on a fabric, and run

    # ibping -S

on each. Then, on each, run the ibping client in a loop so that each host sends a few packets to each other host. For example:

    # while true; do
          echo; date; echo
          ibping -c 3 -L 3
          ibping -c 3 -L 5
          ibping -c 3 -L 1
          sleep 1
      done

What you will discover is that ib_mad on one or more of the hosts will begin consuming 100% of a CPU core on that host:

      PID USER  PR  NI VIRT RES SHR S  %CPU %MEM    TIME+  COMMAND
      919 root   1 -19    0   0   0 R 100.0  0.0 29:23.82  ib_mad1

Even more interesting, ib_mad will continue to consume 100% of that CPU core even after the ibping processes are stopped. In fact, you may not be able to terminate the ibping processes on the affected machine at all - they become trapped inside umad_recv(). Once in this state, ib_mad keeps consuming 100% of the CPU even if the SM is stopped and all activity on the fabric has been terminated - and even though the HCA reports that it is neither sending nor receiving packets.

This behavior has been demonstrated on OFED 1.5.4 and 1.5.4.1, and on RHEL 6.0, 6.1, and 5.7.

I used systemtap to trace ib_mad on an install of OFED 1.5.4.1; it filled 15 MB of log in a few seconds.
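A call-graph trace in that format can be produced with a short SystemTap script along these lines (a sketch, not the exact script I used - the probe points are just "every function in the ib_mad module", and the second column of my log, an event offset, is omitted here):

```systemtap
# trace-ib_mad.stp: log entry/exit of every function in the ib_mad module.
# Needs kernel debuginfo for ib_mad; run as root with: stap trace-ib_mad.stp
probe module("ib_mad").function("*").call {
    printf("%d %s(%d): -->%s\n",
           gettimeofday_s(), execname(), tid(), probefunc())
}
probe module("ib_mad").function("*").return {
    printf("%d %s(%d): <--%s\n",
           gettimeofday_s(), execname(), tid(), probefunc())
}
```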
The trace makes it look like a MAD is looping between the receive and send queues:

    1328740603   0 ib_mad1(919): -->ib_response_mad
    1328740603   2 ib_mad1(919):  -->agent_send_response
    1328740603   5 ib_mad1(919):   -->ib_create_send_mad
    1328740603   7 ib_mad1(919):   <--ib_create_send_mad
    1328740603   9 ib_mad1(919):   -->ib_post_send_mad
    1328740603  12 ib_mad1(919):    -->ib_send_mad
    1328740603  15 ib_mad1(919):    <--ib_send_mad
    1328740603  17 ib_mad1(919):   <--ib_post_send_mad
    1328740603  19 ib_mad1(919):  <--agent_send_response
    1328740603  22 ib_mad1(919): -->ib_mad_complete_send_wr
    1328740603  24 ib_mad1(919):  -->ib_free_send_mad
    1328740603  26 ib_mad1(919):  <--ib_free_send_mad
    1328740603  28 ib_mad1(919): <--ib_mad_complete_send_wr
    1328740603  30 ib_mad1(919): -->ib_response_mad
    1328740603  33 ib_mad1(919):  -->agent_send_response
    1328740603  35 ib_mad1(919):   -->ib_create_send_mad
    1328740603  37 ib_mad1(919):   <--ib_create_send_mad
    1328740603  40 ib_mad1(919):   -->ib_post_send_mad
    1328740603  42 ib_mad1(919):    -->ib_send_mad
    1328740603  45 ib_mad1(919):    <--ib_send_mad
    1328740603  47 ib_mad1(919):   <--ib_post_send_mad
    1328740603  49 ib_mad1(919):  <--agent_send_response
    1328740603  52 ib_mad1(919): -->ib_mad_complete_send_wr
    1328740603  54 ib_mad1(919):  -->ib_free_send_mad
    1328740603  56 ib_mad1(919):  <--ib_free_send_mad
    1328740603  59 ib_mad1(919): <--ib_mad_complete_send_wr
    1328740603  61 ib_mad1(919): -->ib_response_mad
    1328740603  63 ib_mad1(919):  -->agent_send_response
    1328740603  66 ib_mad1(919):   -->ib_create_send_mad
    1328740603  68 ib_mad1(919):   <--ib_create_send_mad
    1328740603  70 ib_mad1(919):   -->ib_post_send_mad
    1328740603  72 ib_mad1(919):    -->ib_send_mad
    1328740603  75 ib_mad1(919):    <--ib_send_mad
    1328740603  78 ib_mad1(919):   <--ib_post_send_mad
    1328740603  80 ib_mad1(919):  <--agent_send_response
    1328740603  82 ib_mad1(919): -->ib_mad_complete_send_wr
    1328740603  85 ib_mad1(919):  -->ib_free_send_mad
    1328740603  87 ib_mad1(919):  <--ib_free_send_mad
    1328740603  89 ib_mad1(919): <--ib_mad_complete_send_wr
    1328740603  91 ib_mad1(919): -->ib_response_mad
    1328740603  94 ib_mad1(919):  -->agent_send_response
    1328740603  96 ib_mad1(919):   -->ib_create_send_mad
    1328740603  98 ib_mad1(919):   <--ib_create_send_mad
    1328740603 101 ib_mad1(919):   -->ib_post_send_mad
    1328740603 103 ib_mad1(919):    -->ib_send_mad
    1328740603 106 ib_mad1(919):    <--ib_send_mad
    1328740603 109 ib_mad1(919):   <--ib_post_send_mad
    1328740603 111 ib_mad1(919):  <--agent_send_response
    1328740603 113 ib_mad1(919): -->ib_mad_complete_send_wr
    1328740603 116 ib_mad1(919):  -->ib_free_send_mad
    1328740603 118 ib_mad1(919):  <--ib_free_send_mad
    1328740603 120 ib_mad1(919): <--ib_mad_complete_send_wr
    1328740603 123 ib_mad1(919): -->ib_response_mad
    . . .

I'm going to continue looking at this, but I thought it was important enough to post this information now.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html