Re: NULL pointer dereference in rdma_ucm

2010-10-06 Thread Josh England
Sorry to revive a stale thread, but I wanted to post an update and see about getting this rolling again. I tried the suggestion of removing the call to ib_send_cm_mra(). Unfortunately, doing this wedges the stack 100% of the time. Since the original post, I've upgraded to CentOS-5.5 with

RE: NULL pointer dereference in rdma_ucm

2010-10-06 Thread Hefty, Sean
I tried the suggestion of removing the call to ib_send_cm_mra(). Unfortunately, doing this wedges the stack 100% of the time. Since the original post, I've upgraded to CentOS-5.5 with OFED-1.5.1. The NULL pointer reference now occurs far less frequently but still with some regularity. I

Re: NULL pointer dereference in rdma_ucm

2010-10-06 Thread Josh England
Good news! I'll apply this and test right away. -JE On Wed, Oct 6, 2010 at 12:04 PM, Hefty, Sean sean.he...@intel.com wrote: I tried the suggestion of removing the call to ib_send_cm_mra(). Unfortunately, doing this wedges the stack 100% of the time.  Since the original post, I've upgraded

Re: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Or Gerlitz
Josh England wrote: Do you think upgrading to OFED-1.5.1 would help at all? It might help you diagnose the problem better. If you read through the thread I pointed to (it's very short, four messages, less than two minutes), you would see that Arthur is reporting on the lap_state and Sean is

Re: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Josh England
I do have the sysfs counters in /sys/class/infiniband_cm/device/port_num/cm_tx_msgs/. Could you point me to a reference for what they all mean? There are a few patches I've had to throw into 1.4.2 so I'll need to check whether they are still needed in 1.5.1, but I'll work on that today. Now,

RE: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Hefty, Sean
I do have the sysfs counters in /sys/class/infiniband_cm/device/port_num/cm_tx_msgs/. Could you point me to a reference for what they all mean? These count the number of CM messages sent for each type. You would need to refer to the InfiniBand specification to understand the CM
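
A minimal sketch for dumping those counters and comparing two snapshots, assuming each file in the cm_tx_msgs directory holds a single integer. The function names read_cm_counters and diff_counters are hypothetical helpers, not part of OFED, and the directory is passed on the command line because device and port_num above are placeholders:

```python
import os
import sys


def read_cm_counters(counter_dir):
    """Return {file name: integer value} for each counter file in counter_dir.

    Assumption: every counter is a regular file containing one integer.
    Files that do not parse as an integer are skipped.
    """
    counters = {}
    for name in sorted(os.listdir(counter_dir)):
        path = os.path.join(counter_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path) as f:
            text = f.read().strip()
        try:
            counters[name] = int(text)
        except ValueError:
            continue
    return counters


def diff_counters(before, after):
    """Return how much each counter in `after` grew relative to `before`."""
    return {name: after[name] - before.get(name, 0) for name in after}


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python cm_counters.py <path to the cm_tx_msgs directory>
    for name, value in read_cm_counters(sys.argv[1]).items():
        print(f"{name}: {value}")
```

Taking one snapshot before a load run and one after, then passing both to diff_counters, shows which CM message types were sent during the run.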

RE: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Hefty, Sean
[88407850] :rdma_cm:rdma_init_qp_attr+0xed/0x13f
[8841725e] :rdma_ucm:ucma_init_qp_attr+0x97/0xe4
[8008a461] default_wake_function+0x0/0xe
[8008a461] default_wake_function+0x0/0xe
[800d66d2] shmem_file_write+0x23f/0x251
[88416326]

RE: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Hefty, Sean
Timed out connections might be something that can be compensated for in the app. It is definitely preferable to a kernel panic. Still, I'll work on making OFED-1.5.1 happy before playing around with removing the ib_send_cm_mra() call. FYI - the possible issue I'm describing is in the

Re: NULL pointer dereference in rdma_ucm

2010-07-21 Thread Josh England
Upgrading to 1.5.1 is the way to go for me. I have other dependencies tying me down to the CentOS kernel for the time being. Hopefully any patch to mainstream should apply fairly cleanly to 1.5.1. -JE On Wed, Jul 21, 2010 at 1:51 PM, Hefty, Sean sean.he...@intel.com wrote: Timed out connections

Re: NULL pointer dereference in rdma_ucm

2010-07-20 Thread Or Gerlitz
Josh England wrote: It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port, causing the Oops, but I don't know for sure. Any ideas on how to debug this? Seems like this was reported in the past but remained unsolved,

Re: NULL pointer dereference in rdma_ucm

2010-07-20 Thread Josh England
Do you think upgrading to OFED-1.5.1 would help at all? -JE On Mon, Jul 19, 2010 at 11:40 PM, Or Gerlitz ogerl...@voltaire.com wrote: Josh England wrote: It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port, causing the Oops, but I don't know for sure. Any ideas on how to

Re: NULL pointer dereference in rdma_ucm

2010-07-20 Thread Roland Dreier
I'm experimenting with an rdma_cm application to push data around between nodes on an ~1000 node cluster (CentOS-5.3 with 2.6.18-128.el5 and OFED-1.4.2). Under heavy load, I'm seeing several nodes per day kernel panic due to a NULL pointer dereference. It may be that the in-kernel

Re: NULL pointer dereference in rdma_ucm

2010-07-20 Thread Josh England
My kernel selections are limited to which versions I can get a panfs module for. The most recent kernel I see support for is a 2.6.31 variant from FC12. I could ask them for a custom port for a newer mainline kernel but the turn-around will likely be several weeks. I'll go ahead and ask for one

NULL pointer dereference in rdma_ucm

2010-07-19 Thread Josh England
Hi, I'm experimenting with an rdma_cm application to push data around between nodes on an ~1000 node cluster (CentOS-5.3 with 2.6.18-128.el5 and OFED-1.4.2). Under heavy load, I'm seeing several nodes per day kernel panic due to a NULL pointer dereference. It may be that the in-kernel field