Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-04-04 Thread Chuck Lever

On Mar 28, 2014, at 8:52 PM, sagi grimberg sa...@mellanox.com wrote:

 On 3/29/2014 3:05 AM, Chuck Lever wrote:
 On Mar 28, 2014, at 4:06 PM, sagi grimberg sa...@mellanox.com wrote:
 
 On 3/29/2014 1:30 AM, Chuck Lever wrote:
 On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:
 
 Hi Chuck,
 
 On 03/27/2014 04:59 PM, Chuck Lever wrote:
 Hi-
 
 
 On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at 
 wrote:
 
 On 03/26/2014 07:15 PM, Chuck Lever wrote:
 Hi Rafael-
 
 I’ll take a look. Can you report your HCA and how you reproduce this 
 issue?
 The HCA is Mellanox Technologies MT26428.
 
 Reproduction:
 1) Mount a directory via NFS/RDMA
 mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
 An additional ls /mnt is needed here (between steps 1 and 2)
 
 2) Pull the Infiniband cable or use ibportstate to disrupt the 
 Infiniband connection
 3) ls /mnt
 4) wait 5-30 seconds
 Thanks for the information.
 
 I have that HCA, but I won’t have access to my test systems for a week 
 (traveling). So can you try this:
 
 # rpcdebug -m rpc -s trans
 
 then reproduce (starting with step 1 above). Some debugging output will 
 appear at the tail of /var/log/messages. Copy it to this thread.
 
 The output of /var/log/messages is:
 
 [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
 buffer[1024]: prog 13 vers 4 proc 1
 [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
 0x88105894c000
 [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
 0x88105894d7dc len 124 hdrlen 124
 [  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
 88084e589260 to map 1 segments
 [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
 652@0x105894d92c:0xced01 (last)
 [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
 rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
 lkey 0x8000
 [  143.233785] RPC:   rpcrdma_event_process: event rep
 88084e589260 status 0 opcode 8 length 0
 [  177.272397] RPC:   rpcrdma_event_process: event rep
 (null) status C opcode 8808 length 4294967295
 [  177.272649] RPC:   rpcrdma_event_process: event rep
 880848ed status 5 opcode 8808 length 4294936584
 The mlx4 provider is returning a WC completion status of
 IB_WC_WR_FLUSH_ERR.
 
 [  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
 5, connection lost
 -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
 WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
 the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.
 Hey Chuck,
 
 That is correct, the opcode field in the wc is not reliable in FLUSH errors.
 
 A copy of the opcode reachable from the incoming rpcrdma_rep could be
 added, initialized in the forward paths. rpcrdma_event_process() could
 use the copy in the error case.
 How about suppressing completions altogether for fast_reg and local_inv work requests?
 If these fail you will get an error completion, and the QP will transition to the error state, generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore the flushed fast_reg + local_inv errors.
 
 see http://marc.info/?l=linux-rdma&m=139047309831997&w=2
 While considering your suggestion, I see that my proposed fix doesn’t work. 
 In the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, 
 not a struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would 
 have no effect. Worse:
 
  158         if (IB_WC_SUCCESS != wc->status) {
  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
  160                         __func__, wc->opcode, wc->status);
  161                 rep->rr_len = ~0U;

 Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode were always usable, we’d at least have to fix that.
 
 nod
 So for error completions one needs to be careful dereferencing wr_id, as the opcode might not be reliable. It would be better to first identify that wr_id is indeed a pointer to a rep.
 
  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
  163                         rpcrdma_schedule_tasklet(rep);
  164                 return;
  165         }
  166
  167         switch (wc->opcode) {
  168         case IB_WC_FAST_REG_MR:
  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
  170                 frmr->r.frmr.state = FRMR_IS_VALID;
  171                 break;
 
 To make my initial solution work, you’d have to add a field to both struct 
 rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in 
 both structures. Ewe.
 
 Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable 
 way to address this.
 

Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-28 Thread Senn Klemens
Hi Chuck,

On 03/27/2014 04:59 PM, Chuck Lever wrote:
 Hi-
 
 
 On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:
 
 On 03/26/2014 07:15 PM, Chuck Lever wrote:

 Hi Rafael-

 I’ll take a look. Can you report your HCA and how you reproduce this issue?

 The HCA is Mellanox Technologies MT26428.

 Reproduction:
 1) Mount a directory via NFS/RDMA
 mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

An additional ls /mnt is needed here (between steps 1 and 2)

 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
 connection
 3) ls /mnt
 4) wait 5-30 seconds
 
 Thanks for the information.
 
 I have that HCA, but I won’t have access to my test systems for a week 
 (traveling). So can you try this:
 
  # rpcdebug -m rpc -s trans
 
 then reproduce (starting with step 1 above). Some debugging output will 
 appear at the tail of /var/log/messages. Copy it to this thread.
 

The output of /var/log/messages is:

[  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
buffer[1024]: prog 13 vers 4 proc 1
[  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
0x88105894c000
[  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
0x88105894d7dc len 124 hdrlen 124
[  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
88084e589260 to map 1 segments
[  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
652@0x105894d92c:0xced01 (last)
[  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
lkey 0x8000
[  143.233785] RPC:   rpcrdma_event_process: event rep
88084e589260 status 0 opcode 8 length 0
[  177.272397] RPC:   rpcrdma_event_process: event rep
(null) status C opcode 8808 length 4294967295
[  177.272649] RPC:   rpcrdma_event_process: event rep
880848ed status 5 opcode 8808 length 4294936584
[  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
5, connection lost
[  177.984996] RPC:  1689 xprt_rdma_allocate: size 436, request
0x880848f0
[  182.290655] RPC:   xprt_rdma_connect_worker: reconnect
[  182.290992] RPC:   rpcrdma_ep_disconnect: after wait, disconnected
[  187.300726] RPC:   xprt_rdma_connect_worker: exit
[  197.320527] RPC:   xprt_rdma_connect_worker: reconnect
[  197.320795] RPC:   rpcrdma_ep_disconnect: after wait, disconnected
[  202.330477] RPC:   xprt_rdma_connect_worker: exit
[  222.354286] RPC:   xprt_rdma_connect_worker: reconnect
[  222.354624] RPC:   rpcrdma_ep_disconnect: after wait, disconnected


The output on the serial terminal is:

[  227.364376] kernel tried to execute NX-protected page - exploit
attempt? (uid: 0)
[  227.364517] RPC:  1689 rpcrdma_inline_pullup: pad 0 destp
0x880848f017c4 len 100 hdrlen 100
[  227.364519] RPC:   rpcrdma_register_frmr_external: Using frmr
88084e588810 to map 1 segments
[  227.364522] RPC:  1689 rpcrdma_create_chunks: reply chunk elem
152@0x848f0187c:0xcab01 (last)
[  227.364523] RPC:  1689 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 100 padlen 0 headerp 0x880848f01100 base 0x880848f01760
lkey 0x8000
[  227.411547] BUG: unable to handle kernel paging request at
880848ed1758
[  227.418535] IP: [880848ed1758] 0x880848ed1757
[  227.423781] PGD 1d7c067 PUD 85d52f063 PMD 848f12063 PTE 800848ed1163
[  227.430544] Oops: 0011 [#1] SMP
[  227.433802] Modules linked in: auth_rpcgss oid_registry nfsv4
xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
i2c_algo_bit ptp pps_core
[  227.496536] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
3.10.17-allpatches+ #1
[  227.503583] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
09/04/2013
[  227.510451] task: 81a11440 ti: 81a0 task.ti:
81a0
[  227.517924] RIP: 0010:[880848ed1758]  [880848ed1758]
0x880848ed1757
[  227.525597] RSP: 0018:88087fc03e88  EFLAGS: 00010282
[  227.530903] RAX: 0286 RBX: 880848ed1758 RCX:
a0354360
[  227.538032] RDX: 88084e589280 RSI: 0286 RDI:
88084e589260
[  227.545157] RBP: 88087fc03ea0 R08: a0354360 R09:
05f0
[  227.552286] R10: 0003 R11: dead00100100 R12:
88084e589260
[  227.559412] R13: 0006 R14: 0006 R15:
81a5db90
[  227.566540] FS:  

Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-28 Thread Chuck Lever

On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:

 Hi Chuck,
 
 On 03/27/2014 04:59 PM, Chuck Lever wrote:
 Hi-
 
 
 On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:
 
 On 03/26/2014 07:15 PM, Chuck Lever wrote:
 
 Hi Rafael-
 
 I’ll take a look. Can you report your HCA and how you reproduce this issue?
 
 The HCA is Mellanox Technologies MT26428.
 
 Reproduction:
 1) Mount a directory via NFS/RDMA
 mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
 
 An additional ls /mnt is needed here (between steps 1 and 2)
 
 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
 connection
 3) ls /mnt
 4) wait 5-30 seconds
 
 Thanks for the information.
 
 I have that HCA, but I won’t have access to my test systems for a week 
 (traveling). So can you try this:
 
 # rpcdebug -m rpc -s trans
 
 then reproduce (starting with step 1 above). Some debugging output will 
 appear at the tail of /var/log/messages. Copy it to this thread.
 
 
 The output of /var/log/messages is:
 
 [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
 buffer[1024]: prog 13 vers 4 proc 1
 [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
 0x88105894c000
 [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
 0x88105894d7dc len 124 hdrlen 124
 [  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
 88084e589260 to map 1 segments
 [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
 652@0x105894d92c:0xced01 (last)
 [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
 rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
 lkey 0x8000
 [  143.233785] RPC:   rpcrdma_event_process: event rep
 88084e589260 status 0 opcode 8 length 0
 [  177.272397] RPC:   rpcrdma_event_process: event rep
 (null) status C opcode 8808 length 4294967295
 [  177.272649] RPC:   rpcrdma_event_process: event rep
 880848ed status 5 opcode 8808 length 4294936584

The mlx4 provider is returning a WC completion status of
IB_WC_WR_FLUSH_ERR.

 [  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
 5, connection lost

-30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

A copy of the opcode reachable from the incoming rpcrdma_rep could be
added, initialized in the forward paths. rpcrdma_event_process() could
use the copy in the error case.
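
To make the idea concrete, here is a rough sketch in kernel C. The rr_saved_opcode field and both type and function names are hypothetical, not existing xprtrdma code, and as noted elsewhere in the thread this approach falls apart when wr_id points at a struct rpcrdma_mw rather than a struct rpcrdma_rep:

    #include <rdma/ib_verbs.h>

    /* Sketch only: rr_saved_opcode is a hypothetical new field in
     * struct rpcrdma_rep, initialized on the forward path when the
     * WR carrying this rep is posted.
     */
    struct rpcrdma_rep_sketch {
            unsigned int            rr_len;
            enum ib_wc_opcode       rr_saved_opcode;
            /* ... remaining rpcrdma_rep fields ... */
    };

    static void event_process_sketch(struct ib_wc *wc)
    {
            struct rpcrdma_rep_sketch *rep =
                    (struct rpcrdma_rep_sketch *)(unsigned long)wc->wr_id;

            if (wc->status != IB_WC_SUCCESS) {
                    /* wc->opcode is undefined here; trust the cached copy */
                    if (rep->rr_saved_opcode != IB_WC_FAST_REG_MR &&
                        rep->rr_saved_opcode != IB_WC_LOCAL_INV) {
                            rep->rr_len = ~0U;      /* mark the reply dead */
                            /* schedule the reply tasklet as before */
                    }
                    return;
            }
            /* success path unchanged: wc->opcode is valid here */
    }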


 [  177.984996] RPC:  1689 xprt_rdma_allocate: size 436, request
 0x880848f0
 [  182.290655] RPC:   xprt_rdma_connect_worker: reconnect
 [  182.290992] RPC:   rpcrdma_ep_disconnect: after wait, disconnected
 [  187.300726] RPC:   xprt_rdma_connect_worker: exit
 [  197.320527] RPC:   xprt_rdma_connect_worker: reconnect
 [  197.320795] RPC:   rpcrdma_ep_disconnect: after wait, disconnected
 [  202.330477] RPC:   xprt_rdma_connect_worker: exit
 [  222.354286] RPC:   xprt_rdma_connect_worker: reconnect
 [  222.354624] RPC:   rpcrdma_ep_disconnect: after wait, disconnected
 
 
 The output on the serial terminal is:
 
 [  227.364376] kernel tried to execute NX-protected page - exploit
 attempt? (uid: 0)
 [  227.364517] RPC:  1689 rpcrdma_inline_pullup: pad 0 destp
 0x880848f017c4 len 100 hdrlen 100
 [  227.364519] RPC:   rpcrdma_register_frmr_external: Using frmr
 88084e588810 to map 1 segments
 [  227.364522] RPC:  1689 rpcrdma_create_chunks: reply chunk elem
 152@0x848f0187c:0xcab01 (last)
 [  227.364523] RPC:  1689 rpcrdma_marshal_req: reply chunk: hdrlen 48
 rpclen 100 padlen 0 headerp 0x880848f01100 base 0x880848f01760
 lkey 0x8000
 [  227.411547] BUG: unable to handle kernel paging request at
 880848ed1758
 [  227.418535] IP: [880848ed1758] 0x880848ed1757
 [  227.423781] PGD 1d7c067 PUD 85d52f063 PMD 848f12063 PTE 800848ed1163
 [  227.430544] Oops: 0011 [#1] SMP
 [  227.433802] Modules linked in: auth_rpcgss oid_registry nfsv4
 xprtrdma cpuid af_packet 8021q garp stp llc rdma_ucm ib_ucm rdma_cm
 iw_cm ib_addr ib_ipoib ib_cm ib_uverbs ib_umad mlx4_en mlx4_ib ib_sa
 ib_mad ib_core joydev usbhid mlx4_core iTCO_wdt iTCO_vendor_support
 acpi_cpufreq coretemp kvm_intel kvm crc32c_intel ghash_clmulni_intel
 aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64
 microcode pcspkr sb_edac edac_core i2c_i801 isci sg libsas ehci_pci
 ehci_hcd scsi_transport_sas usbcore lpc_ich ioatdma mfd_core usb_common
 shpchp pci_hotplug wmi mperf processor thermal_sys button edd autofs4
 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
 i2c_algo_bit ptp pps_core
 [  227.496536] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
 3.10.17-allpatches+ #1
 [  227.503583] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0
 09/04/2013
 [  227.510451] task: 

Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-28 Thread sagi grimberg

On 3/29/2014 1:30 AM, Chuck Lever wrote:

On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:


Hi Chuck,

On 03/27/2014 04:59 PM, Chuck Lever wrote:

Hi-


On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:


On 03/26/2014 07:15 PM, Chuck Lever wrote:

Hi Rafael-

I’ll take a look. Can you report your HCA and how you reproduce this issue?

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA
mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

An additional ls /mnt is needed here (between steps 1 and 2)


2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
connection
3) ls /mnt
4) wait 5-30 seconds

Thanks for the information.

I have that HCA, but I won’t have access to my test systems for a week 
(traveling). So can you try this:

# rpcdebug -m rpc -s trans

then reproduce (starting with step 1 above). Some debugging output will appear 
at the tail of /var/log/messages. Copy it to this thread.


The output of /var/log/messages is:

[  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
buffer[1024]: prog 13 vers 4 proc 1
[  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
0x88105894c000
[  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
0x88105894d7dc len 124 hdrlen 124
[  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
88084e589260 to map 1 segments
[  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
652@0x105894d92c:0xced01 (last)
[  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
lkey 0x8000
[  143.233785] RPC:   rpcrdma_event_process: event rep
88084e589260 status 0 opcode 8 length 0
[  177.272397] RPC:   rpcrdma_event_process: event rep
(null) status C opcode 8808 length 4294967295
[  177.272649] RPC:   rpcrdma_event_process: event rep
880848ed status 5 opcode 8808 length 4294936584

The mlx4 provider is returning a WC completion status of
IB_WC_WR_FLUSH_ERR.


[  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
5, connection lost

-30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.


Hey Chuck,

That is correct, the opcode field in the wc is not reliable in FLUSH errors.



A copy of the opcode reachable from the incoming rpcrdma_rep could be
added, initialized in the forward paths. rpcrdma_event_process() could
use the copy in the error case.


How about suppressing completions altogether for fast_reg and local_inv work requests?
If these fail you will get an error completion, and the QP will transition to the error state, generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore the flushed fast_reg + local_inv errors.

see http://marc.info/?l=linux-rdma&m=139047309831997&w=2
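
For concreteness, a minimal sketch of an unsignaled fast_reg post with the 3.10-era kernel verbs API (the function and parameter names are made up, and this assumes the QP was created with sq_sig_type = IB_SIGNAL_REQ_WR, so only WRs that set IB_SEND_SIGNALED generate success completions):

    #include <rdma/ib_verbs.h>

    /* Leaving IB_SEND_SIGNALED out of send_flags suppresses the success
     * completion; a failure still generates an error completion and moves
     * the QP to the error state, flushing the rest of the send queue.
     */
    static int post_unsignaled_fastreg(struct ib_qp *qp, struct ib_mr *mr,
                                       struct ib_fast_reg_page_list *pl,
                                       int npages, u64 iova, u32 len)
    {
            struct ib_send_wr wr, *bad_wr;

            memset(&wr, 0, sizeof(wr));
            wr.opcode                    = IB_WR_FAST_REG_MR;
            wr.send_flags                = 0;       /* unsignaled */
            wr.wr.fast_reg.iova_start    = iova;
            wr.wr.fast_reg.page_list     = pl;
            wr.wr.fast_reg.page_list_len = npages;
            wr.wr.fast_reg.page_shift    = PAGE_SHIFT;
            wr.wr.fast_reg.length        = len;
            wr.wr.fast_reg.access_flags  = IB_ACCESS_LOCAL_WRITE |
                                           IB_ACCESS_REMOTE_WRITE;
            wr.wr.fast_reg.rkey          = mr->rkey;

            return ib_post_send(qp, &wr, &bad_wr);
    }

One caveat: completions are also how the provider retires send queue entries, so a send queue cannot stay entirely unsignaled; some WR must still be signaled periodically or the SQ eventually fills.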

Sagi.


Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-28 Thread Chuck Lever

On Mar 28, 2014, at 4:06 PM, sagi grimberg sa...@mellanox.com wrote:

 On 3/29/2014 1:30 AM, Chuck Lever wrote:
 On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:
 
 Hi Chuck,
 
 On 03/27/2014 04:59 PM, Chuck Lever wrote:
 Hi-
 
 
 On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at 
 wrote:
 
 On 03/26/2014 07:15 PM, Chuck Lever wrote:
 Hi Rafael-
 
 I’ll take a look. Can you report your HCA and how you reproduce this 
 issue?
 The HCA is Mellanox Technologies MT26428.
 
 Reproduction:
 1) Mount a directory via NFS/RDMA
 mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
 An additional ls /mnt is needed here (between steps 1 and 2)
 
 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
 connection
 3) ls /mnt
 4) wait 5-30 seconds
 Thanks for the information.
 
 I have that HCA, but I won’t have access to my test systems for a week 
 (traveling). So can you try this:
 
 # rpcdebug -m rpc -s trans
 
 then reproduce (starting with step 1 above). Some debugging output will 
 appear at the tail of /var/log/messages. Copy it to this thread.
 
 The output of /var/log/messages is:
 
 [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
 buffer[1024]: prog 13 vers 4 proc 1
 [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
 0x88105894c000
 [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
 0x88105894d7dc len 124 hdrlen 124
 [  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
 88084e589260 to map 1 segments
 [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
 652@0x105894d92c:0xced01 (last)
 [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
 rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
 lkey 0x8000
 [  143.233785] RPC:   rpcrdma_event_process: event rep
 88084e589260 status 0 opcode 8 length 0
 [  177.272397] RPC:   rpcrdma_event_process: event rep
 (null) status C opcode 8808 length 4294967295
 [  177.272649] RPC:   rpcrdma_event_process: event rep
 880848ed status 5 opcode 8808 length 4294936584
 The mlx4 provider is returning a WC completion status of
 IB_WC_WR_FLUSH_ERR.
 
 [  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
 5, connection lost
 -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
 WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
 the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.
 
 Hey Chuck,
 
 That is correct, the opcode field in the wc is not reliable in FLUSH errors.
 
 
 A copy of the opcode reachable from the incoming rpcrdma_rep could be
 added, initialized in the forward paths. rpcrdma_event_process() could
 use the copy in the error case.
 
 How about suppressing completions altogether for fast_reg and local_inv work requests?
 If these fail you will get an error completion, and the QP will transition to the error state, generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore the flushed fast_reg + local_inv errors.
 
 see http://marc.info/?l=linux-rdma&m=139047309831997&w=2

While considering your suggestion, I see that my proposed fix doesn’t work. In 
the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a 
struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no 
effect. Worse:

  158         if (IB_WC_SUCCESS != wc->status) {
  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
  160                         __func__, wc->opcode, wc->status);
  161                 rep->rr_len = ~0U;

Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode were always usable, we’d at least have to fix that.

  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
  163                         rpcrdma_schedule_tasklet(rep);
  164                 return;
  165         }
  166
  167         switch (wc->opcode) {
  168         case IB_WC_FAST_REG_MR:
  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
  170                 frmr->r.frmr.state = FRMR_IS_VALID;
  171                 break;


To make my initial solution work, you’d have to add a field to both struct 
rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in 
both structures. Ewe.
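
Roughly, making that work would require a layout pact like the following (a sketch only; neither structure below matches the real xprtrdma definitions), which is exactly why it is unappealing:

    /* Both structures would need the cached opcode at the same offset --
     * say, as their first member -- so the error path could read it
     * without knowing which type wr_id really points to.  Nothing in C
     * enforces that the two layouts stay in sync.
     */
    struct rpcrdma_rep_sketch {
            enum ib_wc_opcode       rr_saved_opcode;        /* must stay first */
            unsigned int            rr_len;
            /* ... */
    };

    struct rpcrdma_mw_sketch {
            enum ib_wc_opcode       r_saved_opcode;         /* must stay first */
            /* ... */
    };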

Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way 
to address this.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-28 Thread sagi grimberg

On 3/29/2014 3:05 AM, Chuck Lever wrote:

On Mar 28, 2014, at 4:06 PM, sagi grimberg sa...@mellanox.com wrote:


On 3/29/2014 1:30 AM, Chuck Lever wrote:

On Mar 28, 2014, at 2:42 AM, Senn Klemens klemens.s...@ims.co.at wrote:


Hi Chuck,

On 03/27/2014 04:59 PM, Chuck Lever wrote:

Hi-


On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:


On 03/26/2014 07:15 PM, Chuck Lever wrote:

Hi Rafael-

I’ll take a look. Can you report your HCA and how you reproduce this issue?

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA
mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/

An additional ls /mnt is needed here (between steps 1 and 2)


2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
connection
3) ls /mnt
4) wait 5-30 seconds

Thanks for the information.

I have that HCA, but I won’t have access to my test systems for a week 
(traveling). So can you try this:

# rpcdebug -m rpc -s trans

then reproduce (starting with step 1 above). Some debugging output will appear 
at the tail of /var/log/messages. Copy it to this thread.


The output of /var/log/messages is:

[  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
buffer[1024]: prog 13 vers 4 proc 1
[  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
0x88105894c000
[  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
0x88105894d7dc len 124 hdrlen 124
[  143.233718] RPC:   rpcrdma_register_frmr_external: Using frmr
88084e589260 to map 1 segments
[  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
652@0x105894d92c:0xced01 (last)
[  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
rpclen 124 padlen 0 headerp 0x88105894d100 base 0x88105894d760
lkey 0x8000
[  143.233785] RPC:   rpcrdma_event_process: event rep
88084e589260 status 0 opcode 8 length 0
[  177.272397] RPC:   rpcrdma_event_process: event rep
(null) status C opcode 8808 length 4294967295
[  177.272649] RPC:   rpcrdma_event_process: event rep
880848ed status 5 opcode 8808 length 4294936584

The mlx4 provider is returning a WC completion status of
IB_WC_WR_FLUSH_ERR.


[  177.272651] RPC:   rpcrdma_event_process: WC opcode -30712 status
5, connection lost

-30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

Hey Chuck,

That is correct, the opcode field in the wc is not reliable in FLUSH errors.


A copy of the opcode reachable from the incoming rpcrdma_rep could be
added, initialized in the forward paths. rpcrdma_event_process() could
use the copy in the error case.

How about suppressing completions altogether for fast_reg and local_inv work requests?
If these fail you will get an error completion, and the QP will transition to the error state, generating FLUSH_ERR completions for all pending WRs. In this case, you can just ignore the flushed fast_reg + local_inv errors.

see http://marc.info/?l=linux-rdma&m=139047309831997&w=2

While considering your suggestion, I see that my proposed fix doesn’t work. In 
the FAST_REG_MR and LOCAL_INV cases, wr_id points to a struct rpcrdma_mw, not a 
struct rpcrdma_rep. Putting a copy of the opcode in rpcrdma_rep would have no 
effect. Worse:


  158         if (IB_WC_SUCCESS != wc->status) {
  159                 dprintk("RPC:       %s: WC opcode %d status %X, connection lost\n",
  160                         __func__, wc->opcode, wc->status);
  161                 rep->rr_len = ~0U;

Suppose this is an IB_WC_FAST_REG_MR completion, so “rep” here is actually a struct rpcrdma_mw, not a struct rpcrdma_rep. Line 161 pokes 32 one-bits at the top of that struct rpcrdma_mw. If wc->opcode were always usable, we’d at least have to fix that.


nod
So for error completions one needs to be careful dereferencing wr_id, as the opcode might not be reliable. It would be better to first identify that wr_id is indeed a pointer to a rep.
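
One common way to do that (hypothetical here; not what xprtrdma actually adopted) is to tag the wr_id value itself, for example with a low pointer bit, so the handler can classify a completion without ever reading wc->opcode:

    /* Structures are at least word-aligned, so bit 0 of the pointer is
     * free to carry a type tag inside the 64-bit wr_id.
     */
    #define RPCRDMA_WRID_MW         0x1UL

    /* When posting a fast_reg or local_inv WR: */
    send_wr.wr_id = (u64)(unsigned long)mw | RPCRDMA_WRID_MW;

    /* In the completion handler, safe even for flush errors: */
    if (wc->wr_id & RPCRDMA_WRID_MW) {
            struct rpcrdma_mw *mw = (struct rpcrdma_mw *)
                    (unsigned long)(wc->wr_id & ~RPCRDMA_WRID_MW);
            /* update frmr state only */
    } else {
            struct rpcrdma_rep *rep = (struct rpcrdma_rep *)
                    (unsigned long)wc->wr_id;
            /* wr_id is known to be a rep; rr_len etc. can be touched */
    }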



  162                 if (wc->opcode != IB_WC_FAST_REG_MR && wc->opcode != IB_WC_LOCAL_INV)
  163                         rpcrdma_schedule_tasklet(rep);
  164                 return;
  165         }
  166
  167         switch (wc->opcode) {
  168         case IB_WC_FAST_REG_MR:
  169                 frmr = (struct rpcrdma_mw *)(unsigned long)wc->wr_id;
  170                 frmr->r.frmr.state = FRMR_IS_VALID;
  171                 break;


To make my initial solution work, you’d have to add a field to both struct 
rpcrdma_mw and struct rpcrdma_rep, and ensure they are at the same offset in 
both structures. Ewe.

Eliminating completions for FAST_REG_MR and LOCAL_INV might be a preferable way 
to address this.


Agreed. The same applies for MW.

Sagi.

Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-27 Thread Reiter Rafael
On 03/26/2014 07:15 PM, Chuck Lever wrote:
 
 Hi Rafael-
 
 I’ll take a look. Can you report your HCA and how you reproduce this issue?

The HCA is Mellanox Technologies MT26428.

Reproduction:
1) Mount a directory via NFS/RDMA
mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
connection
3) ls /mnt
4) wait 5-30 seconds

 
 Do you (or does anyone) know whether OpenSuSE uses the stock upstream 
 providers, or has any of OFED been applied to this kernel?

My first post was misleading -- I use the vanilla 3.10.17 kernel, and the error 
occurs with and without Mellanox OFED. The above report
references the vanilla kernel.

Rafael



Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-27 Thread Chuck Lever
Hi-


On Mar 27, 2014, at 12:53 AM, Reiter Rafael rafael.rei...@ims.co.at wrote:

 On 03/26/2014 07:15 PM, Chuck Lever wrote:
 
 Hi Rafael-
 
 I’ll take a look. Can you report your HCA and how you reproduce this issue?
 
 The HCA is Mellanox Technologies MT26428.
 
 Reproduction:
 1) Mount a directory via NFS/RDMA
 mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband 
 connection
 3) ls /mnt
 4) wait 5-30 seconds

Thanks for the information.

I have that HCA, but I won’t have access to my test systems for a week 
(traveling). So can you try this:

 # rpcdebug -m rpc -s trans

then reproduce (starting with step 1 above). Some debugging output will appear 
at the tail of /var/log/messages. Copy it to this thread.

 
 
 Do you (or does anyone) know whether OpenSuSE uses the stock upstream 
 providers, or has any of OFED been applied to this kernel?
 
 My first post was misleading -- I use the vanilla 3.10.17 kernel, and the 
 error occurs with and without Mellanox OFED. The above report
 references the vanilla kernel.
 
 Rafael
 

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

2014-03-26 Thread Chuck Lever

On Mar 26, 2014, at 6:20 AM, rafael.reiter rafael.rei...@ims.co.at wrote:

 Hello,
 
 I am looking into a problem with NFS/RDMA with openSuse 12.3 and the 3.10.17 
 kernel. The following kernel oops, followed by a kernel panic, occurs on
 the client computer after 5-30 seconds when the connection between NFS server 
 and client is disrupted (e.g. by pulling the Infiniband cable on the
 server or using ibportstate to disable the port on the client) and the 
 mounted directory is accessed (cd, ls).
 
 
 The export on the server side is done with
 /data
 172.16.100.0/24(rw,wdelay,crossmnt,insecure,no_root_squash,no_subtree_check,fsid=0,mountpoint)
 
 Following command is used for mounting the NFSv4 share:
  mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
 
 
 [ 3336.995934] kernel tried to execute NX-protected page - exploit attempt?
 (uid: 0)
 [ 3337.003415] BUG: unable to handle kernel paging request at 880842900768
 [ 3337.010404] IP: [880842900768] 0x880842900767
 [ 3337.015658] PGD 1d7c067 PUD 85d4e1063 PMD 842f48063 PTE 800842900163
 [ 3337.022420] Oops: 0011 [#1] SMP
 [ 3337.025681] Modules linked in: xprtrdma(O) auth_rpcgss oid_registry nfsv4
 cpuid af_packet 8021q garp stp llc rdma_ucm(O) ib_ucm(O) rdma_cm(O) iw_cm(O)
 ib_ipoib(O) ib_cm(O) ib_uverbs(O) ib_umad(O) mlx4_en(O) mlx4_ib(O) ib_sa(O)
 ib_mad(O) ib_core(O) ib_addr(O) sr_mod cdrom usb_storage nvidia(PO) joydev
 usbhid mlx4_core(O) compat(O) adm1021 lm90 coretemp nouveau kvm_intel kvm
 crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul
 glue_helper aes_x86_64 acpi_cpufreq iTCO_wdt iTCO_vendor_support microcode
 pcspkr ttm drm_kms_helper sb_edac edac_core isci drm i2c_i801 libsas ehci_pci
 ehci_hcd scsi_transport_sas mxm_wmi sg video usbcore lpc_ich ioatdma mfd_core
 usb_common shpchp pci_hotplug wmi mperf processor thermal_sys button edd fuse
 autofs4 xfs libcrc32c nfsv3 nfs fscache lockd nfs_acl sunrpc igb dca
 i2c_algo_bit ptp pps_core
 [ 3337.102467] CPU: 0 PID: 0 Comm: swapper/0 Tainted: P   O 
 3.10.17-ims2
 #2
 [ 3337.109863] Hardware name: Supermicro B9DRG-E/B9DRG-E, BIOS 3.0 09/04/2013
 [ 3337.116736] task: 81a11440 ti: 81a0 task.ti:
 81a0
 [ 3337.124218] RIP: 0010:[880842900768]  [880842900768]
 0x880842900767
 [ 3337.131892] RSP: 0018:88087fc03e88  EFLAGS: 00010282
 [ 3337.137208] RAX: 0286 RBX: 880842900768 RCX: 
 88085e2454a0
 [ 3337.144335] RDX: 88085e2454a0 RSI: 0286 RDI: 
 88085e245480
 [ 3337.151463] RBP: 88087fc03ea0 R08: 88085e24b170 R09: 
 0040
 [ 3337.158588] R10: 0003 R11: dead00100100 R12: 
 88085e245480
 [ 3337.165716] R13: 0006 R14: 0006 R15: 
 81a5db90
 [ 3337.172842] FS:  () GS:88087fc0()
 knlGS:
 [ 3337.180932] CS:  0010 DS:  ES:  CR0: 80050033
 [ 3337.186671] CR2: 880842900768 CR3: 01a0c000 CR4: 
 000407f0
 [ 3337.193809] DR0:  DR1:  DR2: 
 
 [ 3337.200934] DR3:  DR6: 0ff0 DR7: 
 0400
 [ 3337.208061] Stack:
 [ 3337.210073]  a04f7cbe a04fd388 
 88087fc03ec0
 [ 3337.217530]  81049c82 0001 81a050b0
 88087fc03f30
 [ 3337.224987]  81049870 81a01fd8 0001000b940f
 00202000
 [ 3337.232443] Call Trace:
 [ 3337.234903]  IRQ
 [ 3337.236838]  [a04f7cbe] ? rpcrdma_run_tasklet+0x7e/0xc0 
 [xprtrdma]
 [ 3337.244116]  [81049c82] tasklet_action+0x52/0xc0
 [ 3337.249611]  [81049870] __do_softirq+0xe0/0x220
 [ 3337.255014]  [8155cbac] call_softirq+0x1c/0x30
 [ 3337.260335]  [8100452d] do_softirq+0x4d/0x80
 [ 3337.265470]  [81049b05] irq_exit+0x95/0xa0
 [ 3337.270437]  [8100411e] do_IRQ+0x5e/0xd0
 [ 3337.275224]  [81553eaa] common_interrupt+0x6a/0x6a
 [ 3337.280884]  EOI
 [ 3337.282809]  [81069090] ? __hrtimer_start_range_ns+0x1c0/0x400
 [ 3337.289729]  [8141de86] ? cpuidle_enter_state+0x56/0xd0
 [ 3337.295824]  [8141de82] ? cpuidle_enter_state+0x52/0xd0
 [ 3337.301928]  [8141dfb6] cpuidle_idle_call+0xb6/0x200
 [ 3337.307764]  [8100aa39] arch_cpu_idle+0x9/0x20
 [ 3337.313087]  [81087cc0] cpu_startup_entry+0x80/0x200
 [ 3337.318924]  [815358a2] rest_init+0x72/0x80
 [ 3337.323984]  [81ac4e28] start_kernel+0x3b2/0x3bf
 [ 3337.329464]  [81ac4875] ? repair_env_string+0x5e/0x5e
 [ 3337.335386]  [81ac45a5] x86_64_start_reservations+0x2a/0x2c
 [ 3337.341820]  [81ac4675] x86_64_start_kernel+0xce/0xd2
 [ 3337.347732] Code: 88 ff ff e7 33 4f a0 ff ff ff ff 00 b0 dd 57 10 88 ff ff 
 a9
 be a9 be 10 00 00 00 00 01 00 00 00 00 00 00 20 00 00 00 00 00 00 00 d0 35 
 24
 5e 08 88 ff ff 98 54 24 5e 08 88 ff ff