Sorry to revive a stale thread, but I wanted to post an update and see
about getting this rolling again.
I tried the suggestion of removing the call to ib_send_cm_mra().
Unfortunately, doing this wedges the stack 100% of the time. Since
the original post, I've upgraded to CentOS-5.5 with OFED-1.5.1. The
NULL pointer dereference now occurs far less frequently but still with
some regularity. I
Good news! I'll apply this and test right away.
-JE
On Wed, Oct 6, 2010 at 12:04 PM, Hefty, Sean sean.he...@intel.com wrote:
I tried the suggestion of removing the call to ib_send_cm_mra().
Unfortunately, doing this wedges the stack 100% of the time. Since
the original post, I've upgraded
Josh England wrote:
Do you think upgrading to OFED-1.5.1 would help at all?
It might help you diagnose the problem better. If you read through the
thread I pointed to (it's very short, four messages, less than two minutes),
you would see that Arthur is reporting on the lap_state and Sean is
I do have the sysfs counters in
/sys/class/infiniband_cm/device/port_num/cm_tx_msgs/. Could you
point me to a reference for what they all mean? There are a few
patches I've had to throw into 1.4.2 so I'll need to check whether
they are still needed in 1.5.1, but I'll work on that today.
Now,
I do have the sysfs counters in
/sys/class/infiniband_cm/device/port_num/cm_tx_msgs/. Could you
point me to a reference for what they all mean?
These count the number of CM messages sent for each type. You would
need to refer to the InfiniBand specification to understand the CM
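
For anyone who wants to watch those counters while reproducing the problem, here is a minimal sketch in Python. It assumes each file in the cm_tx_msgs directory holds a single integer, and it skips anything that does not parse. The helper names read_cm_counters and counter_deltas are made up for this example and are not part of OFED.

```python
import os


def read_cm_counters(counter_dir):
    """Return {file name: integer value} for each counter file in counter_dir.

    Assumption: every counter file contains one integer. Subdirectories and
    files that cannot be read or parsed as an integer are skipped.
    """
    counters = {}
    for name in sorted(os.listdir(counter_dir)):
        path = os.path.join(counter_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as f:
                counters[name] = int(f.read().strip())
        except (OSError, ValueError):
            continue
    return counters


def counter_deltas(before, after):
    """Return how much each counter in `after` grew relative to `before`."""
    return {name: after[name] - before.get(name, 0) for name in after}
```

Take one snapshot with read_cm_counters() before a test run and one after, passing your actual device and port directory in place of the placeholders in the path above, then feed both to counter_deltas() to see which message types were sent during the run.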
[88407850] :rdma_cm:rdma_init_qp_attr+0xed/0x13f
[8841725e] :rdma_ucm:ucma_init_qp_attr+0x97/0xe4
[8008a461] default_wake_function+0x0/0xe
[8008a461] default_wake_function+0x0/0xe
[800d66d2] shmem_file_write+0x23f/0x251
[88416326]
Timed out connections might be something that can be compensated for
in the app. It is definitely preferable to a kernel panic. Still,
I'll work on making OFED-1.5.1 happy before playing around with
removing the ib_send_cm_mra() call.
FYI - the possible issue I'm describing is in the
Upgrading to 1.5.1 is the way to go for me. I have other dependencies
tying me down to the CentOS kernel for the time being. Hopefully any
patch to mainstream should apply fairly cleanly to 1.5.1.
-JE
On Wed, Jul 21, 2010 at 1:51 PM, Hefty, Sean sean.he...@intel.com wrote:
Timed out connections
Josh England wrote:
It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port,
causing the Oops, but I don't know for sure. Any ideas on how to debug this?
seems like this was reported in the past but remained unsolved,
Do you think upgrading to OFED-1.5.1 would help at all?
-JE
On Mon, Jul 19, 2010 at 11:40 PM, Or Gerlitz ogerl...@voltaire.com wrote:
Josh England wrote:
It may be that the in-kernel field cm_id_priv has a NULL ->alt_av.port,
causing the Oops, but I don't know for sure. Any ideas on how to
My kernel selections are limited to which versions I can get a panfs
module for. The most recent kernel I see support for is a 2.6.31
variant from FC12. I could ask them for a custom port for a newer
mainline kernel but the turn-around will likely be several weeks.
I'll go ahead and ask for one
Hi,
I'm experimenting with an rdma_cm application to push data around
between nodes on an ~1000 node cluster (CentOS-5.3 with 2.6.18-128.el5
and OFED-1.4.2). Under heavy load, I'm seeing several nodes per day
kernel panic due to a NULL pointer dereference. It may be that the
in-kernel field