Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
Hi Chuck,

On 03/27/2014 04:59 PM, Chuck Lever wrote:
> On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:
>> The HCA is Mellanox Technologies MT26428.
>>
>> Reproduction:
>> 1) Mount a directory via NFS/RDMA:
>>    mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

An additional "ls /mnt" is needed here (between step 1 and 2).

>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>> 3) ls /mnt
>> 4) wait 5-30 seconds
>
> Thanks for the information. I have that HCA, but I won't have access to my
> test systems for a week (traveling). So can you try this:
>
>   # rpcdebug -m rpc -s trans
>
> then reproduce (starting with step 1 above). Some debugging output will
> appear at the tail of /var/log/messages. Copy it to this thread.

The output of /var/log/messages is:

[ 143.233701] RPC: 1688 xprt_rdma_allocate: size 1112 too large for buffer[1024]: prog 13 vers 4 proc 1
[ 143.233708] RPC: 1688 xprt_rdma_allocate: size 1112, request 0x88105894c000
[ 143.233715] RPC: 1688 rpcrdma_inline_pullup: pad 0 destp 0x88105894d7dc len 124 hdrlen 124
[ 143.233718] RPC: rpcrdma_register_frmr_external: Using frmr 88084e589260 to map 1 segments
[ 143.233722] RPC: 1688 rpcrdma_create_chunks: reply chunk elem 652@0x105894d92c:0xced01 (last)
[ 143.233725] RPC: 1688 rpcrdma_marshal_req: reply chunk: hdrlen 48 rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760 lkey 0x8000
[ 143.233785] RPC: rpcrdma_event_process: event rep 88084e589260 status 0 opcode 8 length 0
[ 177.272397] RPC: rpcrdma_event_process: event rep (null) status C opcode 8808 length 4294967295
[ 177.272649] RPC: rpcrdma_event_process: event rep 880848ed status 5 opcode 8808 length 4294936584
[ 177.272651] RPC: rpcrdma_event_process: WC opcode -30712 status 5, connection lost
[ 177.984996] RPC: 1689 xprt_rdma_allocate: size 436, request 0x880848f0
[ 182.290655] RPC: xprt_rdma_connect_worker: reconnect
[ 182.290992] RPC: rpcrdma_ep_disconnect: after wait, disconnected
[ 187.300726] RPC: xprt_rdma_connect_worker: exit
[ 197.320527] RPC: xprt_rdma_connect_worker: reconnect
[ 197.320795] RPC: rpcrdma_ep_disconnect: after wait, disconnected
[ 202.330477] RPC: xprt_rdma_connect_worker: exit
[ 222.354286] RPC: xprt_rdma_connect_worker: reconnect
[ 222.354624] RPC: rpcrdma_ep_disconnect: after wait, disconnected

The output on the serial terminal is:

[ 227.364376] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 227.364517] RPC: 1689 rpcrdma_inline_pullup: pad 0 destp 0x880848f017c4 len 100 hdrlen 100
[ 227.364519] RPC: rpcrdma_register_frmr_external: Using frmr 88084e588810 to map 1 segments
[ 227.364522] RPC: 1689 rpcrdma_create_chunks: reply chunk elem 152@0x848f0187c:0xcab01 (last)
[ 227.364523] RPC: 1689 rpcrdma_marshal_req: reply chunk: hdrlen 48 rpclen 100 padlen 0 headerp 0x880848f01100 base 0x880848f01760 lkey 0x8000
[ 227.411547] BUG: unable to handle kernel paging request at 880848ed1758
[ 227.418535] IP: [880848ed1758] 0x880848ed1757
[ 227.423781] PGD 1d7c067 PUD 85d52f063 PMD 848f12063 PTE 800848ed1163
[ 227.430544] Oops: 0011 [#1] SMP
[ 227.433802] Modules linked in: auth_rpcgss oid_registry nfsv4 xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca i2c_algo_bit ptp pps_core
[ 227.496536] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.17-allpatches+ #1
[ 227.503583] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
[ 227.510451] task: 81a11440 ti: 81a0 task.ti: 81a0
[ 227.517924] RIP: 0010:[880848ed1758] [880848ed1758] 0x880848ed1757
[ 227.525597] RSP: 0018:88087fc03e88 EFLAGS: 00010282
[ 227.530903] RAX: 0286 RBX: 880848ed1758 RCX: a0354360
[ 227.538032] RDX: 88084e589280 RSI: 0286 RDI: 88084e589260
[ 227.545157] RBP: 88087fc03ea0 R08: a0354360 R09: 05f0
[ 227.552286] R10: 0003 R11: dead00100100 R12: 88084e589260
[ 227.559412] R13: 0006 R14: 0006 R15: 81a5db90
[ 227.566540] FS:
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:
> The output of /var/log/messages is:
> [...]
> [ 177.272397] RPC: rpcrdma_event_process: event rep (null) status C opcode 8808 length 4294967295
> [ 177.272649] RPC: rpcrdma_event_process: event rep 880848ed status 5 opcode 8808 length 4294936584

The mlx4 provider is returning a WC completion status of IB_WC_WR_FLUSH_ERR.

> [ 177.272651] RPC: rpcrdma_event_process: WC opcode -30712 status 5, connection lost

-30712 is a bogus WC opcode, so the mlx4 provider is not filling in the WC opcode. rpcrdma_event_process() thus can't depend on the contents of the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

A copy of the opcode, reachable from the incoming rpcrdma_rep, could be added and initialized in the forward paths. rpcrdma_event_process() could then use that copy in the error case.

> [ 177.984996] RPC: 1689 xprt_rdma_allocate: size 436, request 0x880848f0
> [ 182.290655] RPC: xprt_rdma_connect_worker: reconnect
> [...]
> The output on the serial terminal is:
> [ 227.364376] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
> [...]
> [ 227.411547] BUG: unable to handle kernel paging request at 880848ed1758
> [...]
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On 3/29/2014 1:30 AM, Chuck Lever wrote:
> The mlx4 provider is returning a WC completion status of IB_WC_WR_FLUSH_ERR.
>
> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the WC
> opcode. rpcrdma_event_process() thus can't depend on the contents of the
> ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

Hey Chuck,

That is correct; the opcode field in the wc is not reliable for FLUSH errors.

> A copy of the opcode reachable from the incoming rpcrdma_rep could be added,
> initialized in the forward paths. rpcrdma_event_process() could use the copy
> in the error case.

How about suppressing completions altogether for fast_reg and local_inv work requests? If these fail you will get an error completion, and the QP will transition to the error state, generating FLUSH_ERR completions for all pending WRs. In that case, you can just ignore flushed fast_reg + local_inv errors.

See http://marc.info/?l=linux-rdma&m=139047309831997&w=2

Sagi.
-- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On Mar 28, 2014, at 4:06 PM, sagi grimberg sa...@mellanox.com wrote:
> How about suppressing completions altogether for fast_reg and local_inv
> work requests? If these fail you will get an error completion, and the QP
> will transition to the error state, generating FLUSH_ERR completions for
> all pending WRs. In that case, you can just ignore flushed fast_reg +
> local_inv errors.
>
> See http://marc.info/?l=linux-rdma&m=139047309831997&w=2

While considering your suggestion, I saw that my proposed fix doesn't work. In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no effect. Worse:

 158         if (IB_WC_SUCCESS != wc->status) {
 159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
 160                         __func__, wc->opcode, wc->status);
 161                 rep->rr_len = ~0U;

Suppose this is an IB_WC_FAST_REG_MR completion, so "rep" here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. Even if wc->opcode were always usable, we'd at least have to fix that.

 162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
 163                         rpcrdma_schedule_tasklet(rep);
 164                 return;
 165         }
 166
 167         switch (wc->opcode) {
 168         case IB_WC_FAST_REG_MR:
 169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
 170                 frmr->r.frmr.state = FRMR_IS_VALID;
 171                 break;

To make my initial solution work, you'd have to add a field to both struct rpcrdma_mw and struct rpcrdma_rep, and ensure it is at the same offset in both structures. Ewe. Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way to address this.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On 3/29/2014 3:05 AM, Chuck Lever wrote:
> While considering your suggestion, I saw that my proposed fix doesn't work.
> In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw,
> not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would
> have no effect. Worse:
>
>  158         if (IB_WC_SUCCESS != wc->status) {
>  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
>  160                         __func__, wc->opcode, wc->status);
>  161                 rep->rr_len = ~0U;
>
> Suppose this is an IB_WC_FAST_REG_MR completion, so "rep" here is actually a
> struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at
> the top of that struct rpcrdma_mw. Even if wc->opcode were always usable,
> we'd at least have to fix that.

<nod>

So for error completions one needs to be careful dereferencing wr_id, as the opcode might not be reliable. It would be better to first identify that wr_id is indeed a pointer to a rep.

>  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
>  163                         rpcrdma_schedule_tasklet(rep);
>  164                 return;
>  165         }
>  166
>  167         switch (wc->opcode) {
>  168         case IB_WC_FAST_REG_MR:
>  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
>  170                 frmr->r.frmr.state = FRMR_IS_VALID;
>  171                 break;
>
> To make my initial solution work, you'd have to add a field to both struct
> rpcrdma_mw and struct rpcrdma_rep, and ensure it is at the same offset in
> both structures. Ewe. Eliminating completions for FAST_REG_MR and LOCAL_INV
> might be a preferable way to address this.

Agree. The same applies for the MW.

Sagi.
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On 03/26/2014 07:15 PM, Chuck Lever wrote:
> Hi Rafael-
>
> I'll take a look. Can you report your HCA and how you reproduce this issue?

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA:
   mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
3) ls /mnt
4) wait 5-30 seconds

> Do you (or does anyone) know whether OpenSuSE uses the stock upstream
> providers, or has any of OFED been applied to this kernel?

My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error occurs both with and without Mellanox OFED. The above report references the vanilla kernel.

Rafael
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
Hi-

On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:
> The HCA is Mellanox Technologies MT26428.
>
> Reproduction:
> 1) Mount a directory via NFS/RDMA:
>    mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
> 3) ls /mnt
> 4) wait 5-30 seconds

Thanks for the information. I have that HCA, but I won't have access to my test systems for a week (traveling). So can you try this:

  # rpcdebug -m rpc -s trans

then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.

Do you (or does anyone) know whether OpenSuSE uses the stock upstream providers, or has any of OFED been applied to this kernel?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
On Mar 26, 2014, at 6:20 AM, rafael.reiter rafael.rei...@ims.co.at wrote:

Hello,

I am looking into a problem with NFS/RDMA with openSuse 12.3 and the 3.10.17 kernel. The following kernel oops, followed by a kernel panic, occurs on the client computer after 5-30 seconds when the connection between NFS server and client is disrupted (e.g. by pulling the Infiniband cable on the server or using ibportstate to disable the port on the client) and the mounted directory is accessed (cd, ls).

The export on the server side is done with:

/data 172.16.100.0/24(rw,wdelay,crossmnt,insecure,no_root_squash,no_subtree_check,fsid=0,mountpoint)

The following command is used for mounting the NFSv4 share:

mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

[ 3336.995934] kernel tried to execute NX-protected page - exploit attempt? (uid: 0)
[ 3337.003415] BUG: unable to handle kernel paging request at 880842900768
[ 3337.010404] IP: [880842900768] 0x880842900767
[ 3337.015658] PGD 1d7c067 PUD 85d4e1063 PMD 842f48063 PTE 800842900163
[ 3337.022420] Oops: 0011 [#1] SMP
[ 3337.025681] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4 cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O) ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O) ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca i2c_algo_bit ptp pps_core
[ 3337.102467] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P O 3.10.17-ims2 #2
[ 3337.109863] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
[ 3337.116736] task: 81a11440 ti: 81a0 task.ti: 81a0
[ 3337.124218] RIP: 0010:[880842900768] [880842900768] 0x880842900767
[ 3337.131892] RSP: 0018:88087fc03e88 EFLAGS: 00010282
[ 3337.137208] RAX: 0286 RBX: 880842900768 RCX: 88085e2454a0
[ 3337.144335] RDX: 88085e2454a0 RSI: 0286 RDI: 88085e245480
[ 3337.151463] RBP: 88087fc03ea0 R08: 88085e24b170 R09: 0040
[ 3337.158588] R10: 0003 R11: dead00100100 R12: 88085e245480
[ 3337.165716] R13: 0006 R14: 0006 R15: 81a5db90
[ 3337.172842] FS: () GS:88087fc0() knlGS:
[ 3337.180932] CS: 0010 DS: ES: CR0: 80050033
[ 3337.186671] CR2: 880842900768 CR3: 01a0c000 CR4: 000407f0
[ 3337.193809] DR0: DR1: DR2:
[ 3337.200934] DR3: DR6: 0ff0 DR7: 0400
[ 3337.208061] Stack:
[ 3337.210073] a04f7cbe a04fd388 88087fc03ec0
[ 3337.217530] 81049c82 0001 81a050b0 88087fc03f30
[ 3337.224987] 81049870 81a01fd8 0001000b940f 00202000
[ 3337.232443] Call Trace:
[ 3337.234903] <IRQ>
[ 3337.236838] [a04f7cbe] ? rpcrdma_run_tasklet+0x7e/0xc0 [xprtrdma]
[ 3337.244116] [81049c82] tasklet_action+0x52/0xc0
[ 3337.249611] [81049870] __do_softirq+0xe0/0x220
[ 3337.255014] [8155cbac] call_softirq+0x1c/0x30
[ 3337.260335] [8100452d] do_softirq+0x4d/0x80
[ 3337.265470] [81049b05] irq_exit+0x95/0xa0
[ 3337.270437] [8100411e] do_IRQ+0x5e/0xd0
[ 3337.275224] [81553eaa] common_interrupt+0x6a/0x6a
[ 3337.280884] <EOI>
[ 3337.282809] [81069090] ? __hrtimer_start_range_ns+0x1c0/0x400
[ 3337.289729] [8141de86] ? cpuidle_enter_state+0x56/0xd0
[ 3337.295824] [8141de82] ? cpuidle_enter_state+0x52/0xd0
[ 3337.301928] [8141dfb6] cpuidle_idle_call+0xb6/0x200
[ 3337.307764] [8100aa39] arch_cpu_idle+0x9/0x20
[ 3337.313087] [81087cc0] cpu_startup_entry+0x80/0x200
[ 3337.318924] [815358a2] rest_init+0x72/0x80
[ 3337.323984] [81ac4e28] start_kernel+0x3b2/0x3bf
[ 3337.329464] [81ac4875] ? repair_env_string+0x5e/0x5e
[ 3337.335386] [81ac45a5] x86_64_start_reservations+0x2a/0x2c
[ 3337.341820] [81ac4675] x86_64_start_kernel+0xce/0xd2
[ 3337.347732] Code: 88 ff ff e7 33 4f a0 ff ff ff ff 00 b0 dd 57 10 88 ff ff a9 be a9 be 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 d0 35 24 5e 08 88 ff ff 98 54 24 5e 08 88 ff ff