Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)
On Thu, Dec 10, 2015 at 07:49:59PM -0500, Chuck Lever wrote:
> > On Dec 10, 2015, at 6:30 PM, Christoph Hellwig wrote:
> > 
> > On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:
> >> The ARM folks do this sort of stuff on a regular basis. Very early on
> >> Doug prepares a topic branch with only the big change, NFS folks pull
> >> it and then pull your work. Then Doug would send the topic branch to
> >> Linus as soon as the merge window opens, then NFS would send theirs.
> >> 
> >> This is a lot less work overall than trying to sequence multiple
> >> patches over multiple releases.
> > 
> > Agreed. Staging has always been a giant pain and things tend to never
> > finish moving over that way if they are non-trivial enough.
> 
> In that case:
> 
> You need to make sure you have all the right Acks. I've added
> Anna and Bruce to Ack the NFS-related portions. Santosh should
> Ack the RDS part.
> 
> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Fine by me.

> Given the proximity to the holidays and the next merge window,
> Doug will need to get a properly-acked topic branch published
> in the next day or two so the rest of us can rebase and start
> testing before the relevant parties disappear for the holiday.

What branch should I be working from?

--b.
Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk
On Mon, Dec 21, 2015 at 05:11:56PM -0500, Chuck Lever wrote:
> > On Dec 21, 2015, at 4:29 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Mon, Dec 21, 2015 at 04:15:23PM -0500, Chuck Lever wrote:
> >> 
> >>> On Dec 21, 2015, at 4:07 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> >>> 
> >>> On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> >>>> Minor optimization: when dealing with write chunk XDR roundup, do
> >>>> not post a Write WR for the zero bytes in the pad. Simply update
> >>>> the write segment in the RPC-over-RDMA header to reflect the extra
> >>>> pad bytes.
> >>>> 
> >>>> The Reply chunk is also a write chunk, but the server does not use
> >>>> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >>>> the server Upper Layer typically marshals the Reply chunk contents
> >>>> in a single contiguous buffer, without a separate tail for the XDR
> >>>> pad.
> >>>> 
> >>>> The comments and the variable naming refer to "chunks" but what is
> >>>> really meant is "segments." The existing code sends only one
> >>>> xdr_write_chunk per RPC reply.
> >>>> 
> >>>> The fix assumes this as well. When the XDR pad in the first write
> >>>> chunk is reached, the assumption is the Write list is complete and
> >>>> send_write_chunks() returns.
> >>>> 
> >>>> That will remain a valid assumption until the server Upper Layer can
> >>>> support multiple bulk payload results per RPC.
> >>>> 
> >>>> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >>>> ---
> >>>>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
> >>>>  1 file changed, 7 insertions(+)
> >>>> 
> >>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> index 969a1ab..bad5eaa 100644
> >>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> >>>>  					arg_ch->rs_handle,
> >>>>  					arg_ch->rs_offset,
> >>>>  					write_len);
> >>>> +
> >>>> +		/* Do not send XDR pad bytes */
> >>>> +		if (chunk_no && write_len < 4) {
> >>>> +			chunk_no++;
> >>>> +			break;
> >>>> +		}
> >>>> +
> >>>>  		chunk_off = 0;
> >>>>  		while (write_len) {
> >>>>  			ret = send_write(xprt, rqstp,
> >>> 
> >>> I'm pretty lost in this code. Why does (chunk_no && write_len < 4) mean
> >>> this is xdr padding?
> >> 
> >> Chunk zero is always data. Padding is always going to be
> >> after the first chunk. Any chunk after chunk zero that is
> >> shorter than XDR quad alignment is going to be a pad.
> >> 
> >> Probably too clever. Is there a better way to detect
> >> the XDR pad?
> > 
> > I don't really know what a chunk is. Looking at the code:
> > 
> > 	write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));
> > 
> > so I guess the assumption is just that those rs_length's are always a
> > multiple of four?
> 
> The example you recently gave was a two-byte NFS READ
> that crosses a page boundary.
> 
> In that case, the NFSD would pass down an xdr_buf that
> has one byte in a page, one byte in another page, and
> a two-byte XDR pad. The logic introduced by this
> optimization would be fooled, and neither the second
> byte nor the XDR pad would be written to the client.
> 
> Unless you can think of a way to recognize an XDR pad
> in the xdr_buf 100% of the time, you should drop this
> patch.

It might be best to make this explicit and mark bytes as padding somehow
while encoding.

> As far as I know, none of the other patches in this
> series depend on this optimization, so please merge
> them if you can.

OK, I'll take a look at the rest, thanks.

--b.
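For background on the padding being discussed: XDR encodes data in four-byte ("quad") units, so a payload of N bytes is followed by 0-3 zero bytes of roundup. The following standalone sketch (hypothetical helper name, not kernel code) shows that arithmetic; the patch's heuristic assumes any post-first segment shorter than one quad must be this pad, which the two-byte READ example above violates.

#include <stdio.h>

/* bytes of zero padding XDR requires after a len-byte payload */
static unsigned int xdr_pad_len(unsigned int len)
{
	return (4 - (len & 3)) & 3;
}

int main(void)
{
	unsigned int lengths[] = { 1, 2, 3, 4, 5, 8191 };

	for (unsigned int i = 0; i < sizeof(lengths) / sizeof(lengths[0]); i++)
		printf("payload %4u -> pad %u\n", lengths[i],
		       xdr_pad_len(lengths[i]));
	return 0;
}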
Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk
On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> Minor optimization: when dealing with write chunk XDR roundup, do
> not post a Write WR for the zero bytes in the pad. Simply update
> the write segment in the RPC-over-RDMA header to reflect the extra
> pad bytes.
> 
> The Reply chunk is also a write chunk, but the server does not use
> send_write_chunks() to send the Reply chunk. That's OK in this case:
> the server Upper Layer typically marshals the Reply chunk contents
> in a single contiguous buffer, without a separate tail for the XDR
> pad.
> 
> The comments and the variable naming refer to "chunks" but what is
> really meant is "segments." The existing code sends only one
> xdr_write_chunk per RPC reply.
> 
> The fix assumes this as well. When the XDR pad in the first write
> chunk is reached, the assumption is the Write list is complete and
> send_write_chunks() returns.
> 
> That will remain a valid assumption until the server Upper Layer can
> support multiple bulk payload results per RPC.
> 
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 969a1ab..bad5eaa 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  				arg_ch->rs_handle,
>  				arg_ch->rs_offset,
>  				write_len);
> +
> +	/* Do not send XDR pad bytes */
> +	if (chunk_no && write_len < 4) {
> +		chunk_no++;
> +		break;

I'm pretty lost in this code. Why does (chunk_no && write_len < 4) mean
this is xdr padding?

> +	}
> +
>  	chunk_off = 0;
>  	while (write_len) {
>  		ret = send_write(xprt, rqstp,
Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk
On Mon, Dec 21, 2015 at 04:15:23PM -0500, Chuck Lever wrote:
> > On Dec 21, 2015, at 4:07 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> >> Minor optimization: when dealing with write chunk XDR roundup, do
> >> not post a Write WR for the zero bytes in the pad. Simply update
> >> the write segment in the RPC-over-RDMA header to reflect the extra
> >> pad bytes.
> >> 
> >> The Reply chunk is also a write chunk, but the server does not use
> >> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >> the server Upper Layer typically marshals the Reply chunk contents
> >> in a single contiguous buffer, without a separate tail for the XDR
> >> pad.
> >> 
> >> The comments and the variable naming refer to "chunks" but what is
> >> really meant is "segments." The existing code sends only one
> >> xdr_write_chunk per RPC reply.
> >> 
> >> The fix assumes this as well. When the XDR pad in the first write
> >> chunk is reached, the assumption is the Write list is complete and
> >> send_write_chunks() returns.
> >> 
> >> That will remain a valid assumption until the server Upper Layer can
> >> support multiple bulk payload results per RPC.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >> ---
> >>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
> >>  1 file changed, 7 insertions(+)
> >> 
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> index 969a1ab..bad5eaa 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> >>  			arg_ch->rs_handle,
> >>  			arg_ch->rs_offset,
> >>  			write_len);
> >> +
> >> +	/* Do not send XDR pad bytes */
> >> +	if (chunk_no && write_len < 4) {
> >> +		chunk_no++;
> >> +		break;
> >> +	}
> >> +
> >>  	chunk_off = 0;
> >>  	while (write_len) {
> >>  		ret = send_write(xprt, rqstp,
> > 
> > I'm pretty lost in this code. Why does (chunk_no && write_len < 4) mean
> > this is xdr padding?
> 
> Chunk zero is always data. Padding is always going to be
> after the first chunk. Any chunk after chunk zero that is
> shorter than XDR quad alignment is going to be a pad.
> 
> Probably too clever. Is there a better way to detect
> the XDR pad?

I don't really know what a chunk is. Looking at the code:

	write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));

so I guess the assumption is just that those rs_length's are always a
multiple of four?

--b.
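To make the ambiguity concrete, here is a standalone sketch (hypothetical types and values, not the server code) of the segment walk discussed above. Under the patch's heuristic, a trailing two-byte segment is always treated as XDR pad, even when it holds real data such as the tail of a two-byte READ crossing a page boundary:

#include <stdint.h>
#include <stdio.h>

struct segment { uint32_t rs_length; };	/* from the client's Write list */

static void walk_write_list(const struct segment *segs, int nsegs,
			    uint32_t xfer_len)
{
	for (int chunk_no = 0; chunk_no < nsegs && xfer_len; chunk_no++) {
		uint32_t write_len = xfer_len < segs[chunk_no].rs_length ?
				     xfer_len : segs[chunk_no].rs_length;

		if (chunk_no && write_len < 4) {
			/* the patch's heuristic: assume this is the pad */
			printf("segment %d: %u bytes -> assumed pad, stop\n",
			       chunk_no, write_len);
			return;
		}
		printf("segment %d: write %u bytes\n", chunk_no, write_len);
		xfer_len -= write_len;
	}
}

int main(void)
{
	struct segment segs[] = { { 4096 }, { 4096 } };

	/* 4096 bytes of data plus a 2-byte tail: data or pad? */
	walk_write_list(segs, 2, 4098);
	return 0;
}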
Re: warning in ext4 with nfs/rdma server
On Tue, Dec 08, 2015 at 07:31:56AM -0600, Steve Wise wrote:
> > -----Original Message-----
> > From: Chuck Lever [mailto:chuck.le...@oracle.com]
> > Sent: Monday, December 07, 2015 9:45 AM
> > To: Steve Wise
> > Cc: linux-rdma@vger.kernel.org; Veeresh U. Kokatnur; Linux NFS Mailing List
> > Subject: Re: warning in ext4 with nfs/rdma server
> > 
> > Hi Steve-
> > 
> > > On Dec 7, 2015, at 10:38 AM, Steve Wise wrote:
> > > 
> > > Hey Chuck/NFS developers,
> > > 
> > > We're hitting this warning in ext4 on the linux-4.3 nfs server running
> > > over RDMA/cxgb4. We're still gathering data, like if it happens with
> > > NFS/TCP. But has anyone seen this warning on 4.3? Is it likely to
> > > indicate some bug in the xprtrdma transport or above it in NFS?
> > 
> > Yes, please confirm with NFS/TCP. Thanks!
> 
> The same thing happens with NFS/TCP, so this isn't related to xprtrdma.
> 
> > > We can hit this running cthon tests over 2 mount points:
> > > 
> > > -----
> > > #!/bin/bash
> > > rm -rf /root/cthon04/loop_iter.txt
> > > while [ 1 ]
> > > do
> > > {
> > > ./server -s -m /mnt/share1 -o rdma,port=20049,vers=4 -p /mnt/share1 -N 100 102.1.1.162 &
> > > ./server -s -m /mnt/share2 -o rdma,port=20049,vers=3,rsize=65535,wsize=65535 -p /mnt/share2 -N 100 102.2.2.162 &
> > > wait
> > > echo "iteration $i" >>/root/cthon04/loop_iter.txt
> > > date >>/root/cthon04/loop_iter.txt
> > > }
> > > done
> > > ------
> > > 
> > > Thanks,
> > > 
> > > Steve.
> > > 
> > > [ cut here ]
> > > WARNING: CPU: 14 PID: 6689 at fs/ext4/inode.c:231 ext4_evict_inode+0x41e/0x490 [ext4]()

Looks like this is the

	WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));

in ext4_evict_inode? Ext4 developers, any idea how that could happen?

--b.

> > > Modules linked in: nfsd(E) lockd(E) grace(E) nfs_acl(E) exportfs(E)
> > > auth_rpcgss(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) rdma_cm(E)
> > > ib_cm(E) ib_sa(E) ib_mad(E) iw_cxgb4(E) iw_cm(E) ib_core(E) ib_addr(E)
> > > cxgb4(E) autofs4(E) target_core_iblock(E) target_core_file(E)
> > > target_core_pscsi(E) target_core_mod(E) configfs(E) bnx2fc(E) cnic(E)
> > > uio(E) fcoe(E) libfcoe(E) 8021q(E) libfc(E) garp(E) stp(E) llc(E)
> > > cpufreq_ondemand(E) cachefiles(E) fscache(E) ipv6(E) dm_mirror(E)
> > > dm_region_hash(E) dm_log(E) vhost_net(E) macvtap(E) macvlan(E) vhost(E)
> > > tun(E) kvm(E) uinput(E) microcode(E) sg(E) pcspkr(E) serio_raw(E)
> > > fam15h_power(E) k10temp(E) amd64_edac_mod(E) edac_core(E)
> > > edac_mce_amd(E) i2c_piix4(E) igb(E) dca(E) i2c_algo_bit(E) i2c_core(E)
> > > ptp(E) pps_core(E) scsi_transport_fc(E) acpi_cpufreq(E) dm_mod(E)
> > > ext4(E) jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) ahci(E)
> > > libahci(E) [last unloaded: cxgb4]
> > > CPU: 14 PID: 6689 Comm: nfsd Tainted: G E 4.3.0 #1
> > > Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.5 12/19/2013
> > > 00e7 88400634fad8 812a4084 a00c96eb
> > > 88400634fb18 81059fd5 88400634fbd8
> > > 880fd1a460c8 880fd1a461d8 880fd1a46008 88400634fbd8
> > > Call Trace:
> > > [] dump_stack+0x48/0x64
> > > [] warn_slowpath_common+0x95/0xe0
> > > [] warn_slowpath_null+0x1a/0x20
> > > [] ext4_evict_inode+0x41e/0x490 [ext4]
> > > [] evict+0xae/0x1a0
> > > [] iput_final+0xe5/0x170
> > > [] iput+0xa3/0xf0
> > > [] ? fsnotify_destroy_marks+0x64/0x80
> > > [] dentry_unlink_inode+0xa9/0xe0
> > > [] d_delete+0xa6/0xb0
> > > [] vfs_unlink+0x138/0x140
> > > [] nfsd_unlink+0x165/0x200 [nfsd]
> > > [] ? lru_put_end+0x5c/0x70 [nfsd]
> > > [] nfsd3_proc_remove+0x83/0x120 [nfsd]
> > > [] nfsd_dispatch+0xdc/0x210 [nfsd]
> > > [] svc_process_common+0x311/0x620 [sunrpc]
> > > [] ? nfsd_set_nrthreads+0x1b0/0x1b0 [nfsd]
> > > [] svc_process+0x128/0x1b0 [sunrpc]
> > > [] nfsd+0xf3/0x160 [nfsd]
> > > [] kthread+0xcc/0xf0
> > > [] ? schedule_tail+0x1e/0xc0
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > [] ret_from_fork+0x3f/0x70
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > ---[ end trace 39afe9aeef2cfb34 ]---
> > > [ cut here ]
> > 
> > --
> > Chuck Lever
Re: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk
On Wed, Nov 11, 2015 at 03:20:33PM -0500, Chuck Lever wrote:
> > On Nov 3, 2015, at 3:10 PM, Chuck Lever wrote:
> > 
> >> On Oct 16, 2015, at 9:30 AM, Chuck Lever wrote:
> >> 
> >> Minor optimization: when dealing with write chunk XDR roundup, do
> >> not post a Write WR for the zero bytes in the pad. Simply update
> >> the write segment in the RPC-over-RDMA header to reflect the extra
> >> pad bytes.
> >> 
> >> The Reply chunk is also a write chunk, but the server does not use
> >> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >> the server Upper Layer typically marshals the Reply chunk contents
> >> in a single contiguous buffer, without a separate tail for the XDR
> >> pad.
> >> 
> >> The comments and the variable naming refer to "chunks" but what is
> >> really meant is "segments." The existing code sends only one
> >> xdr_write_chunk per RPC reply.
> >> 
> >> The fix assumes this as well. When the XDR pad in the first write
> >> chunk is reached, the assumption is the Write list is complete and
> >> send_write_chunks() returns.
> >> 
> >> That will remain a valid assumption until the server Upper Layer can
> >> support multiple bulk payload results per RPC.
> >> 
> >> Signed-off-by: Chuck Lever
> >> --
> > 
> > Bruce, can you take this for 4.4?
> 
> Hi Bruce-
> 
> Your 4.4 pull request did not include this patch.

Whoops, sorry. For some reason I don't have the original email, though I
do see your first followup. Would you mind resending?

Also, it looks like that wasn't actually sent to me? I really want
submitted patches to have my address on the To: line. Obviously I also
try to pick up stuff sent only to the mailing list, but I'm a little
more likely to miss those.

But my fault either way, apologies.

--b.

> >>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
> >>  1 files changed, 7 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> index 1dfae83..6c48901 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> >>  			arg_ch->rs_handle,
> >>  			arg_ch->rs_offset,
> >>  			write_len);
> >> +
> >> +	/* Do not send XDR pad bytes */
> >> +	if (chunk_no && write_len < 4) {
> >> +		chunk_no++;
> >> +		break;
> >> +	}
> >> +
> >>  	chunk_off = 0;
> >>  	while (write_len) {
> >>  		ret = send_write(xprt, rqstp,
> 
> --
> Chuck Lever
Re: Fwd: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk
On Thu, Nov 12, 2015 at 09:47:13AM -0500, Chuck Lever wrote:
> I wish git send-email had an address book. I’m really tired
> of misspelling the to/cc addresses on patches.

Hah. Yeah, my hack is a text file of addresses that I cut-and-paste
from.

--b.

> Resending so there is one thread for all replies. Apologies
> for the noise.
> 
> > Begin forwarded message:
> > 
> > From: Chuck Lever
> > Subject: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk
> > Date: November 12, 2015 at 9:44:33 AM EST
> > To: bfie...@fieldses.org
> > Cc: linux-...@vger.kernel.org, linux-r...@vger.kernel.org
> > 
> > Minor optimization: when dealing with write chunk XDR roundup, do
> > not post a Write WR for the zero bytes in the pad. Simply update
> > the write segment in the RPC-over-RDMA header to reflect the extra
> > pad bytes.
> > 
> > The Reply chunk is also a write chunk, but the server does not use
> > send_write_chunks() to send the Reply chunk. That's OK in this case:
> > the server Upper Layer typically marshals the Reply chunk contents
> > in a single contiguous buffer, without a separate tail for the XDR
> > pad.
> > 
> > The comments and the variable naming refer to "chunks" but what is
> > really meant is "segments." The existing code sends only one
> > xdr_write_chunk per RPC reply.
> > 
> > The fix assumes this as well. When the XDR pad in the first write
> > chunk is reached, the assumption is the Write list is complete and
> > send_write_chunks() returns.
> > 
> > That will remain a valid assumption until the server Upper Layer can
> > support multiple bulk payload results per RPC.
> > 
> > Signed-off-by: Chuck Lever
> > ---
> >  net/sunrpc/xprtrdma/svc_rdma_sendto.c |    7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > index 1dfae83..6c48901 100644
> > --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> >  			arg_ch->rs_handle,
> >  			arg_ch->rs_offset,
> >  			write_len);
> > +
> > +	/* Do not send XDR pad bytes */
> > +	if (chunk_no && write_len < 4) {
> > +		chunk_no++;
> > +		break;
> > +	}
> > +
> >  	chunk_off = 0;
> >  	while (write_len) {
> >  		ret = send_write(xprt, rqstp,
> 
> --
> Chuck Lever
Re: [PATCH] svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE
On Mon, Oct 12, 2015 at 10:53:39AM -0400, Chuck Lever wrote:
> Now that the NFS server advertises a maximum payload size of 1MB
> for RPC/RDMA again, it crashes in svc_process_common() when an NFS
> client sends a 1MB NFS WRITE on an NFS/RDMA mount.
> 
> The server has set up a 259 element array of struct page pointers
> in rq_pages[] for each incoming request. The last element of the
> array is NULL.
> 
> When an incoming request has been completely received,
> rdma_read_complete() attempts to set the starting page of the
> incoming page vector:
> 
>   rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
> 
> and the page to use for the reply:
> 
>   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> 
> But the value of page_no has already accounted for head->hdr_count.
> Thus rq_respages now points past the end of the incoming pages.
> 
> For NFS WRITE operations smaller than the maximum, this is harmless.
> But when the NFS WRITE operation is as large as the server's max
> payload size, rq_respages now points at the last entry in rq_pages,
> which is NULL.
> 
> Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .')
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> Reviewed-by: Sagi Grimberg
> Reviewed-by: Steve Wise
> Reviewed-by: Shirley Ma
> ---
> 
> Hi Bruce-
> 
> This is a regression in 4.3. Can you send this to Linus?

OK, queuing for 4.3, thanks.

--b.

>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index cb51742..37b4341 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -531,7 +531,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>  	rqstp->rq_arg.page_base = head->arg.page_base;
> 
>  	/* rq_respages starts after the last arg page */
> -	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> +	rqstp->rq_respages = &rqstp->rq_pages[page_no];
>  	rqstp->rq_next_page = rqstp->rq_respages + 1;
> 
>  	/* Rebuild rq_arg head and tail. */
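A standalone sketch of the indexing bug (simplified, with hypothetical values): rq_arg.pages already starts hdr_count entries into rq_pages, so adding a page_no that also counts the header pages lands past the end of the array for a maximum-sized WRITE.

#include <stdio.h>

#define RPCSVC_MAXPAGES 259		/* last element is a NULL sentinel */

int main(void)
{
	void *rq_pages[RPCSVC_MAXPAGES] = { 0 };
	int hdr_count = 1;		/* hypothetical header page count */
	int page_no = 258;		/* already includes hdr_count */

	void **rq_arg_pages = &rq_pages[hdr_count];
	(void)rq_arg_pages;

	/* buggy: &rq_arg_pages[page_no] is &rq_pages[page_no + hdr_count] */
	printf("buggy index: %d (past the last valid entry %d)\n",
	       page_no + hdr_count, RPCSVC_MAXPAGES - 1);

	/* fixed: index rq_pages directly */
	printf("fixed index: %d\n", page_no);
	return 0;
}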
Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset
On Wed, Oct 07, 2015 at 01:33:05PM -0400, Doug Ledford wrote:
> On 10/07/2015 01:01 PM, J. Bruce Fields wrote:
> > On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> >> On 09/28/2015 05:46 PM, Steve Wise wrote:
> >>> The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> >>> were not taking into account the initial page_offset when determining
> >>> the rdma read length. This resulted in a read whose starting address
> >>> and length exceeded the base/bounds of the frmr.
> >>> 
> >>> Most workloads don't tickle this bug apparently, but one test hit it
> >>> every time: building the linux kernel on a 16 core node with 'make -j
> >>> 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> >>> 
> >>> This bug seems to only be tripped with devices having small fastreg page
> >>> list depths. I didn't see it with mlx4, for instance.
> >> 
> >> Bruce, what's your take on this? Do you want to push this through or
> >> would you care if I push it through my tree?
> > 
> > Whoops, sorry, I meant to send a pull request for that last week. Uh, I
> > think I'll go ahead and do that now if it's OK with you.
> 
> Fine with me. I was just trying to make sure it didn't get forgotten ;-)

Understood, thanks! I've sent the pull request.

--b.

> >>> Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> >>> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> >>> Tested-by: Chuck Lever <chuck.le...@oracle.com>
> >>> ---
> >>> 
> >>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    6 ++++--
> >>>  1 files changed, 4 insertions(+), 2 deletions(-)
> >>> 
> >>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> index cb51742..5f6ca47 100644
> >>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> >>>  	ctxt->direction = DMA_FROM_DEVICE;
> >>>  	ctxt->read_hdr = head;
> >>>  	pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> >>> -	read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> +	read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +		     rs_length);
> >>> 
> >>>  	for (pno = 0; pno < pages_needed; pno++) {
> >>>  		int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> >>> @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> >>>  	ctxt->direction = DMA_FROM_DEVICE;
> >>>  	ctxt->frmr = frmr;
> >>>  	pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> >>> -	read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> +	read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +		     rs_length);
> >>> 
> >>>  	frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> >>>  	frmr->direction = DMA_FROM_DEVICE;
> 
> -- 
> Doug Ledford <dledf...@redhat.com>
> GPG KeyID: 0E572FDD
Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset
On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> On 09/28/2015 05:46 PM, Steve Wise wrote:
> > The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > were not taking into account the initial page_offset when determining
> > the rdma read length. This resulted in a read whose starting address
> > and length exceeded the base/bounds of the frmr.
> > 
> > Most workloads don't tickle this bug apparently, but one test hit it
> > every time: building the linux kernel on a 16 core node with 'make -j
> > 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > 
> > This bug seems to only be tripped with devices having small fastreg page
> > list depths. I didn't see it with mlx4, for instance.
> 
> Bruce, what's your take on this? Do you want to push this through or
> would you care if I push it through my tree?

Whoops, sorry, I meant to send a pull request for that last week. Uh, I
think I'll go ahead and do that now if it's OK with you.

--b.

> > Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > Signed-off-by: Steve Wise
> > Tested-by: Chuck Lever
> > ---
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    6 ++++--
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > index cb51742..5f6ca47 100644
> > --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> >  	ctxt->direction = DMA_FROM_DEVICE;
> >  	ctxt->read_hdr = head;
> >  	pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> > -	read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +	read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +		     rs_length);
> > 
> >  	for (pno = 0; pno < pages_needed; pno++) {
> >  		int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> > @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> >  	ctxt->direction = DMA_FROM_DEVICE;
> >  	ctxt->frmr = frmr;
> >  	pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> > -	read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +	read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +		     rs_length);
> > 
> >  	frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> >  	frmr->direction = DMA_FROM_DEVICE;
> 
> -- 
> Doug Ledford
> GPG KeyID: 0E572FDD
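The arithmetic behind the fix, as a standalone sketch (illustrative values): pages_needed pages hold (pages_needed << PAGE_SHIFT) bytes only when the data starts at page offset zero; with a non-zero initial offset the capacity shrinks by that offset, and the unpatched code could program an RDMA read beyond the bounds of the FRMR.

#include <stdio.h>

#define PAGE_SHIFT 12	/* 4KB pages assumed */

int main(void)
{
	unsigned int pages_needed = 2, page_offset = 512, rs_length = 8192;

	unsigned int old_cap = pages_needed << PAGE_SHIFT;		    /* 8192 */
	unsigned int new_cap = (pages_needed << PAGE_SHIFT) - page_offset;  /* 7680 */

	unsigned int read_old = rs_length < old_cap ? rs_length : old_cap;
	unsigned int read_new = rs_length < new_cap ? rs_length : new_cap;

	printf("old read length: %u (overruns the mapped region by %u bytes)\n",
	       read_old, read_old - read_new);
	printf("new read length: %u\n", read_new);
	return 0;
}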
Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset
On Mon, Sep 28, 2015 at 04:49:21PM -0500, Steve Wise wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfie...@fieldses.org]
> > Sent: Monday, September 28, 2015 4:05 PM
> > To: Steve Wise
> > Cc: trond.mykleb...@primarydata.com; linux-...@vger.kernel.org;
> > linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset
> > 
> > On Mon, Sep 28, 2015 at 09:31:25AM -0500, Steve Wise wrote:
> > > On 9/21/2015 12:24 PM, Steve Wise wrote:
> > > > The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > > > were not taking into account the initial page_offset when determining
> > > > the rdma read length. This resulted in a read whose starting address
> > > > and length exceeded the base/bounds of the frmr.
> > > > 
> > > > Most workloads don't tickle this bug apparently, but one test hit it
> > > > every time: building the linux kernel on a 16 core node with 'make -j
> > > > 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > > > 
> > > > This bug seems to only be tripped with devices having small fastreg page
> > > > list depths. I didn't see it with mlx4, for instance.
> > > > 
> > > > Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > > > Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> > > > Tested-by: Chuck Lever <chuck.le...@oracle.com>
> > > > ---
> > > 
> > > Hey Bruce, can this make 4.3-rc? Also, what do you think about
> > > pushing it to stable?
> > 
> > It looks like a reasonable candidate for stable. Apologies, somehow I
> > missed it when you posted it--would you mind resending?
> > 
> > --b.
> 
> resent this one patch.
> 
> What is your process for pushing to stable?

Currently my standard regression tests don't seem to be passing on
recent 4.3; once I figure that out, I'll push out a branch with your
patch (and any others required) to git://linux-nfs.org/~bfields for-4.3
and then send a pull request to Linus. Should be by the end of the week.

--b.
Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset
On Mon, Sep 28, 2015 at 09:31:25AM -0500, Steve Wise wrote:
> On 9/21/2015 12:24 PM, Steve Wise wrote:
> > The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > were not taking into account the initial page_offset when determining
> > the rdma read length. This resulted in a read whose starting address
> > and length exceeded the base/bounds of the frmr.
> > 
> > Most workloads don't tickle this bug apparently, but one test hit it
> > every time: building the linux kernel on a 16 core node with 'make -j
> > 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > 
> > This bug seems to only be tripped with devices having small fastreg page
> > list depths. I didn't see it with mlx4, for instance.
> > 
> > Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > Signed-off-by: Steve Wise
> > Tested-by: Chuck Lever
> > ---
> 
> Hey Bruce, can this make 4.3-rc? Also, what do you think about
> pushing it to stable?

It looks like a reasonable candidate for stable. Apologies, somehow I
missed it when you posted it--would you mind resending?

--b.
Re: [PATCH] svcrdma: limit FRMR page list lengths to device max
On Fri, Aug 07, 2015 at 11:11:20AM -0500, Steve Wise wrote:
> Svcrdma was incorrectly allocating fastreg MRs and page lists using
> RPCSVC_MAXPAGES, which can exceed the device capabilities. So limit
> the depth to the minimum of RPCSVC_MAXPAGES and
> xprt->sc_frmr_pg_list_len.
> 
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
> 
> Doug, this patch needs to be added after this commit:
> 
>   e20684a xprtrdma, svcrdma: Convert to ib_alloc_mr
> 
> and before these commits:
> 
>   af78181 cxgb3: Support ib_alloc_mr verb
>   b7e06cd iw_cxgb4: Support ib_alloc_mr verb
> 
> This will avoid a bisect window where NFSRDMA over cxgb4 is broken.
> 
> Bruce, please ACK if this commit looks good, and also if you're ok with
> this flowing through Doug's rdma tree due to the dependencies.

Fine by me. On a quick check the only pending change in my tree nearby
there is 31193fe5f6fb "svcrdma: Remove svc_rdma_fastreg()", which
shouldn't conflict.

Acked-by: J. Bruce Fields <bfie...@redhat.com>

--b.

> ---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 8752a2d..11d5133 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -733,17 +733,19 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>  	struct ib_mr *mr;
>  	struct ib_fast_reg_page_list *pl;
>  	struct svc_rdma_fastreg_mr *frmr;
> +	u32 num_sg;
> 
>  	frmr = kmalloc(sizeof(*frmr), GFP_KERNEL);
>  	if (!frmr)
>  		goto err;
> 
> -	mr = ib_alloc_mr(xprt->sc_pd, IB_MR_TYPE_MEM_REG, RPCSVC_MAXPAGES);
> +	num_sg = min_t(u32, RPCSVC_MAXPAGES, xprt->sc_frmr_pg_list_len);
> +	mr = ib_alloc_mr(xprt->sc_pd, IB_MR_TYPE_MEM_REG, num_sg);
>  	if (IS_ERR(mr))
>  		goto err_free_frmr;
> 
>  	pl = ib_alloc_fast_reg_page_list(xprt->sc_cm_id->device,
> -					 RPCSVC_MAXPAGES);
> +					 num_sg);
>  	if (IS_ERR(pl))
>  		goto err_free_mr;
> 
> --
> 1.7.1
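The clamp itself is simple; here is a standalone sketch (hypothetical device value, not kernel code) of the min_t pattern the patch applies:

#include <stdint.h>
#include <stdio.h>

#define RPCSVC_MAXPAGES 259	/* what the server could ask for */

static uint32_t min_u32(uint32_t a, uint32_t b)
{
	return a < b ? a : b;
}

int main(void)
{
	uint32_t frmr_pg_list_len = 32;	/* e.g. a device with a small limit */
	uint32_t num_sg = min_u32(RPCSVC_MAXPAGES, frmr_pg_list_len);

	printf("allocate the MR with %u pages (device max %u, server max %d)\n",
	       num_sg, frmr_pg_list_len, RPCSVC_MAXPAGES);
	return 0;
}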
Re: [PATCH RESEND 0/4] Use max_sge_rd device capability
On Mon, Jul 27, 2015 at 06:09:55PM -0500, Steve Wise wrote:
> Resending because I forgot to cc linux-rdma :(
> 
> Some devices were not setting this capability, so fix those devices,
> and svcrdma to use max_sge_rd. Also remove rdma_cap_read_multi_sge()
> since it isn't needed.
> 
> These patches were originally part of:
> 
>   http://www.spinics.net/lists/linux-rdma/msg27436.html
> 
> They really aren't part of iSER/iWARP at all, so I've split them out.
> 
> Bruce: This hits svcrdma, but I suggest they get merged via Doug's tree
> to avoid any merge problems.

Sounds fine. I've got some changes to those files in my for-next as
well, but they don't appear to conflict. Feel free to add my

	Acked-by: J. Bruce Fields <bfie...@redhat.com>

to patch 3/4 if it helps.

--b.

> ---
> 
> Sagi Grimberg (1):
>       mlx4, mlx5, mthca: Expose max_sge_rd correctly
> 
> Steve Wise (3):
>       RDMA/Core: remove rdma_cap_read_multi_sge() helper
>       svcrdma: Use max_sge_rd for destination read depths
>       ipath,qib: Expose max_sge_rd correctly
> 
>  drivers/infiniband/hw/ipath/ipath_verbs.c    |    1 +
>  drivers/infiniband/hw/mlx4/main.c            |    1 +
>  drivers/infiniband/hw/mlx5/main.c            |    1 +
>  drivers/infiniband/hw/mthca/mthca_provider.c |    1 +
>  drivers/infiniband/hw/qib/qib_verbs.c        |    1 +
>  include/linux/sunrpc/svc_rdma.h              |    1 +
>  include/rdma/ib_verbs.h                      |   28 --
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c      |   12 +--
>  net/sunrpc/xprtrdma/svc_rdma_transport.c     |    4 -
>  9 files changed, 11 insertions(+), 39 deletions(-)
> 
> -- 
> Steve
Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum
On Fri, Jul 10, 2015 at 10:59:48AM -0400, Chuck Lever wrote:
> > On Jul 10, 2015, at 10:18 AM, bfie...@fieldses.org wrote:
> > 
> > On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
> >> Increased to 1 megabyte.
> > 
> > Why not more or less? Why do we even have this constant, why
> > shouldn't we just use RPCSVC_MAXPAYLOAD?
> 
> The payload size maximum for RDMA is based on RPCRDMA_MAX_SVC_SEGS.
> 
> We could reverse the relationship and make RPCRDMA_MAX_SVC_SEGS based
> on RPCSVC_MAXPAYLOAD divided by the platform’s page size.

But there'd be no reason to do that, because we're not using
RPCRDMA_MAX_SVC_SEGS anywhere. Should we be?

--b.

> >> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >> ---
> >>  include/linux/sunrpc/svc_rdma.h |    2 +-
> >>  1 files changed, 1 insertions(+), 1 deletions(-)
> >> 
> >> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> >> index 13af61b..1bca6dd 100644
> >> --- a/include/linux/sunrpc/svc_rdma.h
> >> +++ b/include/linux/sunrpc/svc_rdma.h
> >> @@ -172,7 +172,7 @@ struct svcxprt_rdma {
> >>  #define RDMAXPRT_SQ_PENDING	2
> >>  #define RDMAXPRT_CONN_PENDING	3
> >> 
> >> -#define RPCRDMA_MAX_SVC_SEGS	(64)	/* server max scatter/gather */
> >> +#define RPCRDMA_MAX_SVC_SEGS	(256)	/* server max scatter/gather */
> >>  #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
> >>  #define RPCRDMA_MAXPAYLOAD	RPCSVC_MAXPAYLOAD
> >>  #else
> 
> --
> Chuck Lever
Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum
On Fri, Jul 10, 2015 at 10:18:14AM -0400, J. Bruce Fields wrote:
> On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
> > Increased to 1 megabyte.
> 
> Why not more or less? Why do we even have this constant, why shouldn't
> we just use RPCSVC_MAXPAYLOAD?

(That one question aside, these look fine; I'll apply unless I hear
otherwise.)

--b.
Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum
On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
> Increased to 1 megabyte.

Why not more or less? Why do we even have this constant, why shouldn't
we just use RPCSVC_MAXPAYLOAD?

--b.

> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> ---
>  include/linux/sunrpc/svc_rdma.h |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 13af61b..1bca6dd 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -172,7 +172,7 @@ struct svcxprt_rdma {
>  #define RDMAXPRT_SQ_PENDING	2
>  #define RDMAXPRT_CONN_PENDING	3
> 
> -#define RPCRDMA_MAX_SVC_SEGS	(64)	/* server max scatter/gather */
> +#define RPCRDMA_MAX_SVC_SEGS	(256)	/* server max scatter/gather */
>  #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
>  #define RPCRDMA_MAXPAYLOAD	RPCSVC_MAXPAYLOAD
>  #else
Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum
On Fri, Jul 10, 2015 at 11:59:20AM -0400, Chuck Lever wrote:
> > On Jul 10, 2015, at 11:54 AM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Fri, Jul 10, 2015 at 10:59:48AM -0400, Chuck Lever wrote:
> >> On Jul 10, 2015, at 10:18 AM, bfie...@fieldses.org wrote:
> >>> 
> >>> On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
> >>>> Increased to 1 megabyte.
> >>> 
> >>> Why not more or less? Why do we even have this constant, why
> >>> shouldn't we just use RPCSVC_MAXPAYLOAD?
> >> 
> >> The payload size maximum for RDMA is based on RPCRDMA_MAX_SVC_SEGS.
> >> 
> >> We could reverse the relationship and make RPCRDMA_MAX_SVC_SEGS based
> >> on RPCSVC_MAXPAYLOAD divided by the platform's page size.
> > 
> > But there'd be no reason to do that, because we're not using
> > RPCRDMA_MAX_SVC_SEGS anywhere. Should we be?
> 
> Let me try using RPCSVC_MAXPAYLOAD. That is based on RPCSVC_MAXPAGES,
> which is actually used in svc_rdma_*.c.

OK, thanks, that sounds like that would make more sense.

--b.

> >>>> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >>>> ---
> >>>>  include/linux/sunrpc/svc_rdma.h |    2 +-
> >>>>  1 files changed, 1 insertions(+), 1 deletions(-)
> >>>> 
> >>>> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> >>>> index 13af61b..1bca6dd 100644
> >>>> --- a/include/linux/sunrpc/svc_rdma.h
> >>>> +++ b/include/linux/sunrpc/svc_rdma.h
> >>>> @@ -172,7 +172,7 @@ struct svcxprt_rdma {
> >>>>  #define RDMAXPRT_SQ_PENDING	2
> >>>>  #define RDMAXPRT_CONN_PENDING	3
> >>>> 
> >>>> -#define RPCRDMA_MAX_SVC_SEGS	(64)	/* server max scatter/gather */
> >>>> +#define RPCRDMA_MAX_SVC_SEGS	(256)	/* server max scatter/gather */
> >>>>  #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
> >>>>  #define RPCRDMA_MAXPAYLOAD	RPCSVC_MAXPAYLOAD
> >>>>  #else
> 
> --
> Chuck Lever
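For reference, the sizes being traded off above, worked through in a short standalone program (4KB pages assumed, i.e. PAGE_SHIFT = 12):

#include <stdio.h>

#define PAGE_SHIFT 12

int main(void)
{
	/* 64 segments was the old cap, 256 the proposed one */
	printf("64 segs  -> %u KB\n", (64u << PAGE_SHIFT) >> 10);   /* 256 KB */
	printf("256 segs -> %u KB\n", (256u << PAGE_SHIFT) >> 10);  /* 1024 KB = 1 MB */
	return 0;
}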
Re: NFSoRDMA bi-weekly developers meeting minutes (5/28/15)
On Thu, May 28, 2015 at 11:11:30AM -0700, Shirley Ma wrote:
> NFSD workqueue mode vs. threading mode performance impacts: (Shirley)
> - Shirley has spent some time testing, fixing, and evaluating the NFSD
>   workqueue mode originally implemented by Jeff Layton (the NFSD
>   workqueue patchset).
> - NFSD workqueue mode benefits some workloads while global threading
>   mode benefits some other workloads; workqueue mode is better than
>   per-cpu threading mode in general.
> - The workqueue is per-cpu, while global threading mode is not per-cpu
>   based.
> - There might be some room to make NFSD workqueue mode no worse than
>   global threading mode, but that requires more in-depth performance
>   investigation work.
> 
> Feel free to reply here for anything missing or incorrect. Thanks
> everyone for joining the call and providing valuable inputs/work to
> the community to make NFSoRDMA better.

Are you planning to repost those patches to the list with a summary of
the results so far?

I don't necessarily expect that we'll have all performance problems
worked out before merging, but we do need to judge the risk that the new
threading mode never overcomes those regressions so that we're stuck
supporting both old and new threading modes indefinitely.

--b.
Re: [PATCH v2 00/10] NFS/RDMA server for 3.20
On Tue, Jan 13, 2015 at 11:31:16AM -0600, Steve Wise wrote:
> Reviewed-by: Steve Wise <sw...@opengridcomputing.com>

Thanks, applying for 3.20.

--b.
Re: [PATCH v1 09/10] Move read list XDR round-up logic
On Fri, Jan 09, 2015 at 02:23:11PM -0500, Chuck Lever wrote:
> Simpler approach, and this needs to be done before the tail is copied
> to the end of the read list (pre-requisite for a subsequent patch).
> 
> Fixes: e560e3b510d2 ("svcrdma: Add zero padding if the client...")

Wait, does this actually fix a bug in that previous commit, or is it
just cleanup? If the latter, the Fixes:... line is confusing; just
mention it in the text ("Simpler approach compared to the original
attempt in e560e3b510d2...").

--b.

> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   37 ---
>  1 files changed, 9 insertions(+), 28 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 36cf51a..a345cad 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -43,7 +43,6 @@
>  #include <linux/sunrpc/debug.h>
>  #include <linux/sunrpc/rpc_rdma.h>
>  #include <linux/spinlock.h>
> -#include <linux/highmem.h>
>  #include <asm/unaligned.h>
>  #include <rdma/ib_verbs.h>
>  #include <rdma/rdma_cm.h>
> @@ -434,6 +433,15 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
>  		}
>  	}
> 
> +	/* Read list may need XDR round-up (see RFC 5666, s. 3.7) */
> +	if (page_offset & 3) {
> +		u32 pad = 4 - (page_offset & 3);
> +
> +		head->arg.page_len += pad;
> +		head->arg.len += pad;
> +		head->arg.buflen += pad;
> +	}
> +
>  	ret = 1;
>  	head->position = position;
> 
> @@ -446,32 +454,6 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
>  	return ret;
>  }
> 
> -/*
> - * To avoid a separate RDMA READ just for a handful of zero bytes,
> - * RFC 5666 section 3.7 allows the client to omit the XDR zero pad
> - * in chunk lists.
> - */
> -static void
> -rdma_fix_xdr_pad(struct xdr_buf *buf)
> -{
> -	unsigned int page_len = buf->page_len;
> -	unsigned int size = (XDR_QUADLEN(page_len) << 2) - page_len;
> -	unsigned int offset, pg_no;
> -	char *p;
> -
> -	if (size == 0)
> -		return;
> -
> -	pg_no = page_len >> PAGE_SHIFT;
> -	offset = page_len & ~PAGE_MASK;
> -	p = page_address(buf->pages[pg_no]);
> -	memset(p + offset, 0, size);
> -
> -	buf->page_len += size;
> -	buf->buflen += size;
> -	buf->len += size;
> -}
> -
>  static int rdma_read_complete(struct svc_rqst *rqstp,
>  			      struct svc_rdma_op_ctxt *head)
>  {
> @@ -499,7 +481,6 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>  	}
> 
>  	/* Point rq_arg.pages past header */
> -	rdma_fix_xdr_pad(&head->arg);
>  	rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
>  	rqstp->rq_arg.page_len = head->arg.page_len;
>  	rqstp->rq_arg.page_base = head->arg.page_base;
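The two round-up computations in this patch, side by side in a standalone sketch (not kernel code): the removed rdma_fix_xdr_pad() derived the pad from the total page_len and also zeroed the pad bytes, while the new check derives it from the final page offset and only extends the lengths. When the read list starts quad-aligned, so that the final offset and page_len agree modulo 4, both yield the same 0-3 byte pad:

#include <stdio.h>

/* new logic: pad from the final page offset */
static unsigned int pad_from_offset(unsigned int page_offset)
{
	return (page_offset & 3) ? 4 - (page_offset & 3) : 0;
}

/* old logic: pad from page_len, XDR_QUADLEN-style round-up */
static unsigned int pad_from_len(unsigned int page_len)
{
	return (((page_len + 3) >> 2) << 2) - page_len;
}

int main(void)
{
	for (unsigned int len = 0; len < 8; len++)
		printf("len %u: old pad %u, new pad %u\n", len,
		       pad_from_len(len), pad_from_offset(len));
	return 0;
}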
Re: [PATCH v1 00/10] NFS/RDMA server for 3.20
On Fri, Jan 09, 2015 at 02:21:54PM -0500, Chuck Lever wrote:
> In addition to miscellaneous clean up, the following series of
> patches for the Linux NFS server introduces support in the server’s
> RPC/RDMA transport implementation for RDMA_NOMSG type messages, and
> fixes a bug that prevents the server from handling RPC/RDMA messages
> with inline content following a read list.
> 
> These patches are contained in the branch "nfsd-rdma-for-3.20" at:
> 
>   git://git.linux-nfs.org/projects/cel/cel-2.6.git

They look OK to me; applying unless someone tells me otherwise.

--b.

> ---
> 
> Chuck Lever (10):
>       svcrdma: Handle additional inline content
>       Move read list XDR round-up logic
>       svcrdma: Support RDMA_NOMSG requests
>       svcrdma: rc_position sanity checking
>       svcrdma: Plant reader function in struct svcxprt_rdma
>       svcrdma: Find rmsgp more reliably
>       svcrdma: Scrub BUG_ON() and WARN_ON() call sites
>       svcrdma: Clean up read chunk counting
>       svcrdma: Remove unused variable
>       svcrdma: Clean up dprintk
> 
>  include/linux/sunrpc/svc_rdma.h          |   13 +-
>  net/sunrpc/xprtrdma/svc_rdma_marshal.c   |   16 --
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  251 +++---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |   46 +++--
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   47 +++---
>  5 files changed, 224 insertions(+), 149 deletions(-)
> 
> -- 
> Chuck Lever
Re: [PATCH v1 00/10] NFS/RDMA server for 3.20
On Fri, Jan 09, 2015 at 03:40:51PM -0500, Chuck Lever wrote:
> On Jan 9, 2015, at 3:39 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Fri, Jan 09, 2015 at 02:21:54PM -0500, Chuck Lever wrote:
> >> In addition to miscellaneous clean up, the following series of
> >> patches for the Linux NFS server introduces support in the server’s
> >> RPC/RDMA transport implementation for RDMA_NOMSG type messages, and
> >> fixes a bug that prevents the server from handling RPC/RDMA messages
> >> with inline content following a read list.
> >> 
> >> These patches are contained in the branch "nfsd-rdma-for-3.20" at:
> >> 
> >>   git://git.linux-nfs.org/projects/cel/cel-2.6.git
> > 
> > They look OK to me; applying unless someone tells me otherwise.
> 
> I’d like to update the patch descriptions in several of these before
> you apply the series.

OK, let me know.

--b.

> >> ---
> >> 
> >> Chuck Lever (10):
> >>       svcrdma: Handle additional inline content
> >>       Move read list XDR round-up logic
> >>       svcrdma: Support RDMA_NOMSG requests
> >>       svcrdma: rc_position sanity checking
> >>       svcrdma: Plant reader function in struct svcxprt_rdma
> >>       svcrdma: Find rmsgp more reliably
> >>       svcrdma: Scrub BUG_ON() and WARN_ON() call sites
> >>       svcrdma: Clean up read chunk counting
> >>       svcrdma: Remove unused variable
> >>       svcrdma: Clean up dprintk
> >> 
> >>  include/linux/sunrpc/svc_rdma.h          |   13 +-
> >>  net/sunrpc/xprtrdma/svc_rdma_marshal.c   |   16 --
> >>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  251 +++---
> >>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |   46 +++--
> >>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   47 +++---
> >>  5 files changed, 224 insertions(+), 149 deletions(-)
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Re: [PATCH v2 06/10] xprtrdma: Enable pad optimization
On Mon, Nov 10, 2014 at 08:54:27AM -0600, Chuck Lever wrote:
> On Nov 10, 2014, at 8:36 AM, Anna Schumaker <anna.schuma...@netapp.com> wrote:
> > 
> > Hey Chuck,
> > 
> > On 11/08/2014 08:14 PM, Chuck Lever wrote:
> >> The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
> >> pad optimization was enabled. That bug was fixed by commit
> >> e560e3b510d2 ("svcrdma: Add zero padding if the client doesn't send
> >> it").
> > 
> > Do we need to worry about backwards compatibility with servers that
> > don't have this patch?
> 
> My impression is that we have a window where the server is assumed not
> to work and thus is not enabled in distributions, and that therefore
> changes like this are allowed. I could be wrong.
> 
> Bruce, any guidance on this?
> 
> In any event, if things break, they break immediately, and the fix is
> simply to set this feature flag via /proc.

I don't think there's any hard-and-fast rule here, but my impression is
that nfs/rdma still isn't widely used, even less so with linux knfsd,
so on balance it's probably not worth much to humor older servers.

--b.

> > Anna
> > 
> >> We can now enable pad optimization on the client, which helps
> >> performance and is supported now by both Linux and Solaris servers.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >> ---
> >>  net/sunrpc/xprtrdma/transport.c |    2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >> 
> >> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> >> index cfe9a81..8ed2576 100644
> >> --- a/net/sunrpc/xprtrdma/transport.c
> >> +++ b/net/sunrpc/xprtrdma/transport.c
> >> @@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = RPCRDMA_DEF_INLINE;
> >>  static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
> >>  static unsigned int xprt_rdma_inline_write_padding;
> >>  static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
> >> -int xprt_rdma_pad_optimize = 0;
> >> +int xprt_rdma_pad_optimize = 1;
> >> 
> >>  #ifdef RPC_DEBUG
> 
> --
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
Re: [PATCH V3] svcrdma: advertise the correct max payload
On Mon, Sep 29, 2014 at 11:07:25AM -0500, Steve Wise wrote:
> Hey Bruce, is this version acceptable for 3.18?

Yes, applying, thanks for the reminder.

--b.
Re: [PATCH V2] svcrdma: Advertise the correct max payload
On Tue, Sep 23, 2014 at 02:42:34PM -0500, Steve Wise wrote:
> > > diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> > > index c419498..a9cf5c3 100644
> > > --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> > > +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> > > @@ -392,4 +392,8 @@ extern struct kmem_cache *svc_rdma_ctxt_cachep;
> > >  /* Workqueue created in svc_rdma.c */
> > >  extern struct workqueue_struct *svc_rdma_wq;
> > > 
> > > +#define RPCSVC_MAXPAYLOAD_RDMA \
> > > +	(RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT) ? \
> > > +	 RPCSVC_MAXPAYLOAD : (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT))
> > > +
> > 
> > Couldn't you use:
> > 
> > 	#if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
> > 	#define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
> > 	#else
> > 	#define RPCSVC_MAXPAYLOAD_RDMA (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
> > 	#endif
> > 
> > That seems more idiomatic.
> 
> Sure. That makes it easier to read in my opinion too. I'll send out V3
> with this change.

While we're bikeshedding, why not use min()?

--b.
Re: [PATCH V2] svcrdma: Advertise the correct max payload
On Tue, Sep 23, 2014 at 02:53:28PM -0500, Steve Wise wrote:
> > -----Original Message-----
> > From: 'J. Bruce Fields' [mailto:bfie...@fieldses.org]
> > Sent: Tuesday, September 23, 2014 2:48 PM
> > To: Steve Wise
> > Cc: 'Chuck Lever'; linux-...@vger.kernel.org; linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH V2] svcrdma: Advertise the correct max payload
> > 
> > On Tue, Sep 23, 2014 at 02:42:34PM -0500, Steve Wise wrote:
> > > > > +#define RPCSVC_MAXPAYLOAD_RDMA \
> > > > > +	(RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT) ? \
> > > > > +	 RPCSVC_MAXPAYLOAD : (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT))
> > > > 
> > > > Couldn't you use:
> > > > 
> > > > 	#if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
> > > > 	#define RPCSVC_MAXPAYLOAD_RDMA RPCSVC_MAXPAYLOAD
> > > > 	#else
> > > > 	#define RPCSVC_MAXPAYLOAD_RDMA (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
> > > > 	#endif
> > > > 
> > > > That seems more idiomatic.
> > > 
> > > Sure. That makes it easier to read in my opinion too. I'll send out V3
> > > with this change.
> > 
> > While we're bikeshedding, why not use min()?
> > 
> > --b.
> 
> I tried that initially. But min() and min_t() don't work because of the
> way we use the #define. With it defined thusly:
> 
> #define RPCSVC_MAXPAYLOAD_RDMA \
> 	min(RPCSVC_MAXPAYLOAD, RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
> 
> I see this error:
> 
>   CC [M]  net/sunrpc/xprtrdma/svc_rdma_transport.o
> net/sunrpc/xprtrdma/svc_rdma_transport.c:94: error: braced-group within
> expression allowed only inside a function
> make[3]: *** [net/sunrpc/xprtrdma/svc_rdma_transport.o] Error 1
> make[2]: *** [net/sunrpc/xprtrdma] Error 2
> make[1]: *** [net/sunrpc] Error 2
> make: *** [net] Error 2
> 
> min() looks like this:
> 
> #define min(x, y) ({				\
> 	typeof(x) _min1 = (x);			\
> 	typeof(y) _min2 = (y);			\
> 	(void) (&_min1 == &_min2);		\
> 	_min1 < _min2 ? _min1 : _min2; })

OK, OK. Whatever works.

--b.
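The failure Steve describes can be reproduced outside the kernel. A minimal gcc example (hypothetical macro names) showing why a statement-expression min() cannot appear in a file-scope initializer, such as the static svc_rdma_class struct, while the #if or ternary forms can:

#define MIN_EXPR(x, y) ({			\
	__typeof__(x) _a = (x);			\
	__typeof__(y) _b = (y);			\
	_a < _b ? _a : _b; })

#define MIN_TERNARY(x, y) ((x) < (y) ? (x) : (y))

/* static int bad = MIN_EXPR(3, 5);
 *   gcc: "braced-group within expression allowed only inside a function" */
static int good = MIN_TERNARY(3, 5);	/* fine at file scope */

int main(void)
{
	int ok = MIN_EXPR(3, 5);	/* fine inside a function */
	return good - ok;		/* 0 */
}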
Re: [PATCH] svcrdma: Advertise the correct max payload
On Mon, Sep 22, 2014 at 01:36:53PM -0500, Steve Wise wrote:
> Svcrdma currently advertises 1MB, which is too large.  The correct
> value is the max scatter-gather allowed in an NFSRDMA IO chunk * the
> host page size.  This bug is usually benign because the Linux X64
> NFSRDMA client correctly limits the payload size to the correct value
> (64*4096 = 256KB).  But if the Linux client is PPC64 with a 64KB page
> size, then the client will indeed use a payload size that will
> overflow the server.
>
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    2 +-
>  net/sunrpc/xprtrdma/xprt_rdma.h          |    2 ++
>  2 files changed, 3 insertions(+), 1 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 374feb4..4e61880 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -91,7 +91,7 @@ struct svc_xprt_class svc_rdma_class = {
>  	.xcl_name = "rdma",
>  	.xcl_owner = THIS_MODULE,
>  	.xcl_ops = &svc_rdma_ops,
> -	.xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
> +	.xcl_max_payload = RPCSVC_MAXPAYLOAD_RDMA,
>  	.xcl_ident = XPRT_TRANSPORT_RDMA,
>  };
>
> diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
> index c419498..467a77c 100644
> --- a/net/sunrpc/xprtrdma/xprt_rdma.h
> +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
> @@ -392,4 +392,6 @@ extern struct kmem_cache *svc_rdma_ctxt_cachep;
>  /* Workqueue created in svc_rdma.c */
>  extern struct workqueue_struct *svc_rdma_wq;
>
> +#define RPCSVC_MAXPAYLOAD_RDMA (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)

Do you want to define this as the minimum of this and
RPCSVC_MAXPAYLOAD_TCP, in case RPCRDMA_MAX_DATA_SEGS gets increased some
day?

--b.

> +
>  #endif	/* _LINUX_SUNRPC_XPRT_RDMA_H */
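[To make the arithmetic in that commit message concrete -- a sketch, with
RPCRDMA_MAX_DATA_SEGS taken as 64 per the 64*4096 figure above:

	#include <stdio.h>

	#define RPCRDMA_MAX_DATA_SEGS 64

	int main(void)
	{
		/* Server-side limit with 4KB pages (x86-64, PAGE_SHIFT = 12): */
		printf("x86-64: %d KB\n", (RPCRDMA_MAX_DATA_SEGS << 12) / 1024);  /* 256 KB */

		/* A ppc64 client with 64KB pages (PAGE_SHIFT = 16) computes: */
		printf("ppc64:  %d KB\n", (RPCRDMA_MAX_DATA_SEGS << 16) / 1024);  /* 4096 KB */

		/* The ppc64 client then caps that at the advertised 1MB --
		 * still four times the 256KB the server can actually
		 * scatter into one IO chunk, hence the overflow. */
		return 0;
	}
]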
Re: [PATCH] svcrdma: remove rdma_create_qp() failure recovery logic
Thanks, applying for 3.17.

--b.

On Thu, Jul 31, 2014 at 03:26:07PM -0500, Steve Wise wrote:
> In svc_rdma_accept(), if rdma_create_qp() fails, there is useless
> logic to try and call rdma_create_qp() again with reduced sge depths.
> The assumption, I guess, was that perhaps the initial sge depths
> chosen were too big.  However, the initial depths are selected based
> on the rdma device attribute max_sge returned from ib_query_device().
> If rdma_create_qp() fails, it would not be because the max_send_sge
> and max_recv_sge values passed in exceed the device's max.  So just
> remove this code.
>
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   19 ++-----------------
>  1 files changed, 2 insertions(+), 17 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index e7323fb..282a43b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -942,23 +942,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>
>  	ret = rdma_create_qp(newxprt->sc_cm_id, newxprt->sc_pd, &qp_attr);
>  	if (ret) {
> -		/*
> -		 * XXX: This is a hack. We need a xx_request_qp interface
> -		 * that will adjust the qp_attr's with a best-effort
> -		 * number
> -		 */
> -		qp_attr.cap.max_send_sge -= 2;
> -		qp_attr.cap.max_recv_sge -= 2;
> -		ret = rdma_create_qp(newxprt->sc_cm_id, newxprt->sc_pd,
> -				     &qp_attr);
> -		if (ret) {
> -			dprintk("svcrdma: failed to create QP, ret=%d\n", ret);
> -			goto errout;
> -		}
> -		newxprt->sc_max_sge = qp_attr.cap.max_send_sge;
> -		newxprt->sc_max_sge = qp_attr.cap.max_recv_sge;
> -		newxprt->sc_sq_depth = qp_attr.cap.max_send_wr;
> -		newxprt->sc_max_requests = qp_attr.cap.max_recv_wr;
> +		dprintk("svcrdma: failed to create QP, ret=%d\n", ret);
> +		goto errout;
>  	}
>  	newxprt->sc_qp = newxprt->sc_cm_id->qp;
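[For the surrounding context the commit message alludes to: the sge depths
passed to rdma_create_qp() are derived from the device's advertised limits
earlier in svc_rdma_accept(), roughly like the following.  This is a
from-memory sketch of that setup path, not a quote from the patch, and field
names are best-effort:

	struct ib_device_attr devattr;
	int ret;

	/* Ask the device what it can do before sizing the QP caps: */
	ret = ib_query_device(newxprt->sc_cm_id->device, &devattr);
	if (ret)
		goto errout;

	newxprt->sc_max_sge = min_t(size_t, devattr.max_sge, RPCSVC_MAXPAGES);
	qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
	qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;

Since the caps are already clamped to devattr.max_sge, a later
rdma_create_qp() failure cannot mean "too many sges", which is why the retry
with reduced depths was dead weight.]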
Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3
On Thu, Jul 10, 2014 at 04:43:54PM -0400, Chuck Lever wrote:
> I think the sanity check you pointed out is strictly satisfied by
> testing against the unaligned number of bytes.  Is there a strong
> reason to do the extra math for that check during each WRITE?

I can't think of any good reason why the check should be against the
rounded-up length.

So my worry was just a more general "uh-oh, a non-multiple-of-4 length
is pretty weird, future me may not remember that's a possibility".

--b.
Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3
On Thu, Jul 10, 2014 at 01:44:35PM -0400, Chuck Lever wrote:
> The server should comply with RFC 5667,

Where's the relevant language?  (I took a quick look but didn't see
it.)

> and handle WRITE payloads that may not have a zero pad on the end
> (XDR length round-up).
>
> Fixes: f34b9568 ("The NFSv2/NFSv3 server does not handle zero...")
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> ---
>  fs/nfsd/nfs3xdr.c |   15 +--------------
>  fs/nfsd/nfsxdr.c  |   16 +---------------
>  2 files changed, 2 insertions(+), 29 deletions(-)
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index e6c01e8..6074f6a 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -361,7 +361,7 @@ int
>  nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
>  					struct nfsd3_writeargs *args)
>  {
> -	unsigned int len, v, hdr, dlen;
> +	unsigned int len, v, hdr;
>  	u32 max_blocksize = svc_max_payload(rqstp);
>
>  	p = decode_fh(p, &args->fh);
> @@ -383,19 +383,6 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
>  	 * bytes.
>  	 */
>  	hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
> -	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len
> -		- hdr;
> -	/*
> -	 * Round the length of the data which was specified up to
> -	 * the next multiple of XDR units and then compare that
> -	 * against the length which was actually received.
> -	 * Note that when RPCSEC/GSS (for example) is used, the
> -	 * data buffer can be padded so dlen might be larger
> -	 * than required.  It must never be smaller.
> -	 */
> -	if (dlen < XDR_QUADLEN(len)*4)
> -		return 0;
> -
>  	if (args->count > max_blocksize) {
>  		args->count = max_blocksize;
>  		len = args->len = max_blocksize;

And then:

>  	}
>  	rqstp->rq_vec[0].iov_base = (void*)p;
>  	rqstp->rq_vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - hdr;
>  	v = 0;
>  	while (len > rqstp->rq_vec[v].iov_len) {
>  		len -= rqstp->rq_vec[v].iov_len;
>  		v++;
>  		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
>  		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
>  	}
>  	rqstp->rq_vec[v].iov_len = len;
>  	args->vlen = v + 1;
>  	return 1;

So if len is allowed to be larger than the available space, then the
rq_vec will get filled with data beyond the end of the client's
request.

I believe the max_blocksize check will at least ensure that rq_pages[v]
is always still a valid page, so we shouldn't crash.  But it could
contain random data, and writing that data to a file could disclose
information we don't mean to.

Am I missing something?  The v2 case looks similar.

So I think you just want to drop the round-up of len, not the whole
check.

--b.

> diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
> index 1ac306b..1f5347e 100644
> --- a/fs/nfsd/nfsxdr.c
> +++ b/fs/nfsd/nfsxdr.c
> @@ -280,7 +280,7 @@ int
>  nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
>  					struct nfsd_writeargs *args)
>  {
> -	unsigned int len, hdr, dlen;
> +	unsigned int len, hdr;
>  	int v;
>
>  	p = decode_fh(p, &args->fh);
> @@ -302,20 +302,6 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
>  	 * bytes.
>  	 */
>  	hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
> -	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len
> -		- hdr;
> -
> -	/*
> -	 * Round the length of the data which was specified up to
> -	 * the next multiple of XDR units and then compare that
> -	 * against the length which was actually received.
> -	 * Note that when RPCSEC/GSS (for example) is used, the
> -	 * data buffer can be padded so dlen might be larger
> -	 * than required.  It must never be smaller.
> -	 */
> -	if (dlen < XDR_QUADLEN(len)*4)
> -		return 0;
> -
>  	rqstp->rq_vec[0].iov_base = (void*)p;
>  	rqstp->rq_vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - hdr;
>  	v = 0;
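[Concretely, the retained check Bruce is suggesting would test the unpadded
length -- something like this sketch of the decode path, not the patch that
was eventually merged:

	hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len - hdr;

	/*
	 * An RDMA transport may legitimately omit the trailing XDR pad,
	 * so compare against the raw length rather than XDR_QUADLEN(len)*4.
	 * dlen may still be larger than len (e.g. RPCSEC_GSS padding),
	 * but it must never be smaller.
	 */
	if (dlen < len)
		return 0;

This keeps the protection against a len that overruns the received data while
tolerating a missing pad.]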
Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3
On Thu, Jul 10, 2014 at 02:24:57PM -0400, Chuck Lever wrote:
> On Jul 10, 2014, at 2:19 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > On Thu, Jul 10, 2014 at 01:44:35PM -0400, Chuck Lever wrote:
> > > The server should comply with RFC 5667,
> >
> > Where's the relevant language?  (I took a quick look but didn't see
> > it.)
>
> Sorry, I listed the wrong RFC when I wrote the description of bug 246.
> It’s actually RFC 5666, section 3.7.

Thanks.

> > So I think you just want to drop the round-up of len, not the whole
> > check.
>
> I’ll try that, thanks!

Actually, I'd really rather this get fixed up in the rpc layer.

The padding is really required for correct xdr.  The fact that RDMA
doesn't require those zeroes to be literally transmitted across the
network sounds like a transport-level detail that should be hidden from
the xdr code.

--b.
Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3
On Thu, Jul 10, 2014 at 03:07:34PM -0400, Chuck Lever wrote:
> Hi Bruce-
>
> On Jul 10, 2014, at 2:49 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > > > So I think you just want to drop the round-up of len, not the
> > > > whole check.
> > >
> > > I’ll try that, thanks!
>
> Works-as-expected.
>
> > Actually, I'd really rather this get fixed up in the rpc layer.
> >
> > The padding is really required for correct xdr.
>
> How so?

Well, to be spec-lawyerly, rfc 1832 3.9 defines opaque data as including
the zero-padding; a sequence of bytes isn't legal xdr if it just ends
early.

> All of NFSv4 and all other NFSv3 operations work as expected without
> that padding present.  There doesn’t seem to be any operational
> dependency on the presence of padding.  Help?

I can believe that the code deals with it now, I just wonder if this
check may not be the only case where someone writing xdr code expects
total length to be a multiple of four.

The drc code also depends on the length being right, see
nfsd_cache_csum.  I don't know whether that will cause a practical
problem in this case.  (What about the krb5i case?)

> > The fact that RDMA doesn't require those zeroes to be literally
> > transmitted across the network sounds like a transport-level detail
> > that should be hidden from the xdr code.
>
> The best I can think of is adding a false page array entry to the
> xdr_buf if the last incoming page is short by a few bytes.

The padding just gets added to the end of whichever page the write
ended on, and you only use another page if you run out of space, right?

I don't know, if that's a huge pain then I guess we can live with this.

--b.
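[For readers without RFC 1832 at hand: a variable-length opaque is a 4-byte
length word, the bytes themselves, and then zero padding up to the next
4-byte boundary.  A standalone sketch of the size arithmetic:

	#include <stdio.h>

	/* Wire size of an XDR variable-length opaque per RFC 1832, 3.9:
	 * 4-byte length word + data rounded up to a multiple of 4. */
	static size_t xdr_opaque_wire_size(size_t len)
	{
		return 4 + ((len + 3) & ~(size_t)3);
	}

	int main(void)
	{
		/* A 2-byte NFS WRITE payload occupies 4 + 4 = 8 bytes on
		 * the wire; the 2 trailing zero bytes are exactly the pad
		 * an RDMA transport is allowed not to transmit. */
		printf("%zu\n", xdr_opaque_wire_size(2));
		return 0;
	}
]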
Re: [PATCH] svcrdma: send_write() must not overflow the device's max sge
On Wed, Jul 09, 2014 at 01:49:15PM -0500, Steve Wise wrote:
> Function send_write() must stop creating sges when it reaches the
> device max and return the amount sent in the RDMA Write to the caller.

Queueing up for 3.17, thanks.

--b.

> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |   39 +++++++++++----------------
>  1 files changed, 15 insertions(+), 24 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 49fd21a..9f1b506 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -192,6 +192,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  		xdr_sge_no++;
>  		BUG_ON(xdr_sge_no > vec->count);
>  		bc -= sge_bytes;
> +		if (sge_no == xprt->sc_max_sge)
> +			break;
>  	}
>
>  	/* Prepare WRITE WR */
> @@ -209,7 +211,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>  	atomic_inc(&rdma_stat_write);
>  	if (svc_rdma_send(xprt, &write_wr))
>  		goto err;
> -	return 0;
> +	return write_len - bc;
>  err:
>  	svc_rdma_unmap_dma(ctxt);
>  	svc_rdma_put_context(ctxt, 0);
> @@ -225,7 +227,6 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  {
>  	u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
>  	int write_len;
> -	int max_write;
>  	u32 xdr_off;
>  	int chunk_off;
>  	int chunk_no;
> @@ -239,8 +240,6 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  	res_ary = (struct rpcrdma_write_array *)
>  		&rdma_resp->rm_body.rm_chunks[1];
>
> -	max_write = xprt->sc_max_sge * PAGE_SIZE;
> -
>  	/* Write chunks start at the pagelist */
>  	for (xdr_off = rqstp->rq_res.head[0].iov_len, chunk_no = 0;
>  	     xfer_len && chunk_no < arg_ary->wc_nchunks;
> @@ -260,23 +259,21 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>  						write_len);
>  		chunk_off = 0;
>  		while (write_len) {
> -			int this_write;
> -			this_write = min(write_len, max_write);
>  			ret = send_write(xprt, rqstp,
>  					 ntohl(arg_ch->rs_handle),
>  					 rs_offset + chunk_off,
>  					 xdr_off,
> -					 this_write,
> +					 write_len,
>  					 vec);
> -			if (ret) {
> +			if (ret <= 0) {
>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>  					ret);
>  				return -EIO;
>  			}
> -			chunk_off += this_write;
> -			xdr_off += this_write;
> -			xfer_len -= this_write;
> -			write_len -= this_write;
> +			chunk_off += ret;
> +			xdr_off += ret;
> +			xfer_len -= ret;
> +			write_len -= ret;
>  		}
>  	}
>  	/* Update the req with the number of chunks actually used */
> @@ -293,7 +290,6 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>  {
>  	u32 xfer_len = rqstp->rq_res.len;
>  	int write_len;
> -	int max_write;
>  	u32 xdr_off;
>  	int chunk_no;
>  	int chunk_off;
> @@ -311,8 +307,6 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>  	res_ary = (struct rpcrdma_write_array *)
>  		&rdma_resp->rm_body.rm_chunks[2];
>
> -	max_write = xprt->sc_max_sge * PAGE_SIZE;
> -
>  	/* xdr offset starts at RPC message */
>  	nchunks = ntohl(arg_ary->wc_nchunks);
>  	for (xdr_off = 0, chunk_no = 0;
> @@ -330,24 +324,21 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
>  						write_len);
>  		chunk_off = 0;
>  		while (write_len) {
> -			int this_write;
> -
> -			this_write = min(write_len, max_write);
>  			ret = send_write(xprt, rqstp,
>  					 ntohl(ch->rs_handle),
>  					 rs_offset + chunk_off,
>  					 xdr_off,
> -					 this_write,
> +					 write_len,
>  					 vec);
> -			if (ret <= 0) {
>  				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
>  					ret);
>  				return -EIO;
>  			}
> -			chunk_off += this_write;
> -			xdr_off += this_write;
> -
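[The calling-convention change, restated in isolation (a sketch distilled
from the diff above, not compilable on its own): send_write() now posts at
most sc_max_sge sges worth of data and returns the byte count it actually
sent, with <= 0 signalling an error, so each caller advances its offsets and
loops until the chunk is exhausted:

	while (write_len) {
		ret = send_write(xprt, rqstp, ntohl(arg_ch->rs_handle),
				 rs_offset + chunk_off, xdr_off,
				 write_len, vec);
		if (ret <= 0) {
			dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n", ret);
			return -EIO;
		}
		chunk_off += ret;	/* advance within the chunk */
		xdr_off   += ret;	/* advance within the xdr_buf */
		xfer_len  -= ret;
		write_len -= ret;
	}

This also removes the need for the callers' old max_write pre-clamping, since
send_write() itself now respects the device's sge limit.]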
Re: [PATCH V2] svcrdma: Fence LOCAL_INV work requests
On Thu, Jun 05, 2014 at 09:54:31AM -0500, Steve Wise wrote:
> This applies on top of:
>
> commit e5a070216356dbcb03607cb264cc3104e17339b3
> Author: Steve Wise <sw...@opengridcomputing.com>
> Date:   Wed May 28 15:12:01 2014 -0500
>
>     svcrdma: refactor marshalling logic

Sorry, I lost track--was there another revision of that patch coming
too?

--b.

> Fencing forces the invalidate to only happen after all prior send work
> requests have been completed.
>
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 52d9f2c..8f92a61 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -338,7 +338,7 @@ static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>  		memset(&inv_wr, 0, sizeof(inv_wr));
>  		inv_wr.wr_id = (unsigned long)ctxt;
>  		inv_wr.opcode = IB_WR_LOCAL_INV;
> -		inv_wr.send_flags = IB_SEND_SIGNALED;
> +		inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
>  		inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
>  	}
>  	ctxt->wr_op = read_wr.opcode;
Re: [PATCH V3] svcrdma: refactor marshalling logic
On Mon, Jun 02, 2014 at 11:47:58AM -0500, Steve Wise wrote:
> > Steve, I have not checked the code because I am away from my laptop,
> > but I assume mlx and mthca are using FMRs and cxgb4 is using
> > rdma-read-with-invalidate, which has an implicit fence.  If the
> > invalidate is issued before the read is completed, it's a problem.
>
> You're correct.  And this bug appears to be in the current upstream
> code as well.  If an IB_WR_LOCAL_INV wr is used, it must include
> IB_SEND_FENCE to fence it until the prior read completes.
>
> Good catch!  I'll post V4 soon.

Any chance that can be handled as a separate patch rather than folded
in?  (Disclaimer: I've been following the discussion only very
superficially.)

--b.
Re: [PATCH V3] svcrdma: refactor marshalling logic
On Mon, Jun 02, 2014 at 11:52:47AM -0500, Steve Wise wrote:
> > > You're correct.  And this bug appears to be in the current
> > > upstream code as well.  If an IB_WR_LOCAL_INV wr is used, it must
> > > include IB_SEND_FENCE to fence it until the prior read completes.
> > >
> > > Good catch!  I'll post V4 soon.
> >
> > Any chance that can be handled as a separate patch rather than
> > folded in?  (Disclaimer: I've been following the discussion only
> > very superficially.)
>
> Sure.  I'll post the patch soon.

Thanks, and, again, I'm not terribly happy about the monster
patch--anything you can split off it is great, even if that thing's
small.  As long as all the intermediate stages still build and run.

(And any bugs you've identified in upstream code are good candidates
for separate patches, hopefully preceding the rewrite.  That also
allows us to apply those fixes to stable kernels if appropriate.)

--b.
Re: [PATCH V3] svcrdma: refactor marshalling logic
On Mon, Jun 02, 2014 at 12:06:39PM -0500, Steve Wise wrote:
> > Thanks, and, again, I'm not terribly happy about the monster
> > patch--anything you can split off it is great, even if that thing's
> > small.  As long as all the intermediate stages still build and run.
>
> I don't see any way to do this for this particular patch.  It rewrites
> the entire rdma read logic.

There's almost always a way.

> > (And any bugs you've identified in upstream code are good candidates
> > for separate patches, hopefully preceding the rewrite.  That also
> > allows us to apply those fixes to stable kernels if appropriate.)
>
> If I do this, then I'd have to respin the refactor patch.  I really
> would like to get this merged as-is (with the one change I'm sending
> soon), and move on.  I definitely will try and keep the patches
> smaller and more discrete going forward.  Will that work?

Yes, just this once, we'll live.

--b.
Re: [PATCH] NFSD: Ignore client's source port on RDMA transports
On Mon, May 19, 2014 at 01:40:22PM -0400, Chuck Lever wrote:
> An NFS/RDMA client's source port is meaningless for RDMA transports.
> The transport layer typically sets the source port value on the
> connection to a random ephemeral port.
>
> Currently, NFS server administrators must specify the "insecure"
> export option to enable clients to access exports via RDMA.  But this
> means NFS clients can access such an export via IP using an ephemeral
> port, which may not be desirable.
>
> This patch eliminates the need to specify the "insecure" export option
> to allow NFS/RDMA clients access to an export.
>
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250
> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> ---
> Hi Bruce-
>
> I've done some simple testing.  "insecure" still behaves correctly for
> IP-based clients, and is now unnecessary for RDMA-based clients.
>
> Would you consider this for nfsd-next?

Yes, looks god to me; applying for 3.16.

--b.

>  include/linux/sunrpc/svc_xprt.h          |    1 +
>  net/sunrpc/svc_xprt.c                    |    2 +-
>  net/sunrpc/svcsock.c                     |    9 +
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |    7 +++
>  4 files changed, 18 insertions(+), 1 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f..0cec1b9 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -24,6 +24,7 @@ struct svc_xprt_ops {
>  	void		(*xpo_release_rqst)(struct svc_rqst *);
>  	void		(*xpo_detach)(struct svc_xprt *);
>  	void		(*xpo_free)(struct svc_xprt *);
> +	int		(*xpo_secure_port)(struct svc_rqst *);
>  };
>
>  struct svc_xprt_class {
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 06c6ff0..614956f 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -793,7 +793,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
>
>  	clear_bit(XPT_OLD, &xprt->xpt_flags);
>
> -	rqstp->rq_secure = svc_port_is_privileged(svc_addr(rqstp));
> +	rqstp->rq_secure = xprt->xpt_ops->xpo_secure_port(rqstp);
>  	rqstp->rq_chandle.defer = svc_defer;
>
>  	if (serv->sv_stats)
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 43bcb46..0cb34f5 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -400,6 +400,12 @@ static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
>  	release_sock(sock->sk);
>  #endif
>  }
> +
> +static int svc_sock_secure_port(struct svc_rqst *rqstp)
> +{
> +	return svc_port_is_privileged(svc_addr(rqstp));
> +}
> +
>  /*
>   * INET callback when data has been received on the socket.
>   */
> @@ -678,6 +684,7 @@ static struct svc_xprt_ops svc_udp_ops = {
>  	.xpo_prep_reply_hdr = svc_udp_prep_reply_hdr,
>  	.xpo_has_wspace = svc_udp_has_wspace,
>  	.xpo_accept = svc_udp_accept,
> +	.xpo_secure_port = svc_sock_secure_port,
>  };
>
>  static struct svc_xprt_class svc_udp_class = {
> @@ -1234,6 +1241,7 @@ static struct svc_xprt_ops svc_tcp_bc_ops = {
>  	.xpo_detach = svc_bc_tcp_sock_detach,
>  	.xpo_free = svc_bc_sock_free,
>  	.xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
> +	.xpo_secure_port = svc_sock_secure_port,
>  };
>
>  static struct svc_xprt_class svc_tcp_bc_class = {
> @@ -1272,6 +1280,7 @@ static struct svc_xprt_ops svc_tcp_ops = {
>  	.xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
>  	.xpo_has_wspace = svc_tcp_has_wspace,
>  	.xpo_accept = svc_tcp_accept,
> +	.xpo_secure_port = svc_sock_secure_port,
>  };
>
>  static struct svc_xprt_class svc_tcp_class = {
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 25688fa..02db8d9 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -65,6 +65,7 @@ static void dto_tasklet_func(unsigned long data);
>  static void svc_rdma_detach(struct svc_xprt *xprt);
>  static void svc_rdma_free(struct svc_xprt *xprt);
>  static int svc_rdma_has_wspace(struct svc_xprt *xprt);
> +static int svc_rdma_secure_port(struct svc_rqst *);
>  static void rq_cq_reap(struct svcxprt_rdma *xprt);
>  static void sq_cq_reap(struct svcxprt_rdma *xprt);
> @@ -82,6 +83,7 @@ static struct svc_xprt_ops svc_rdma_ops = {
>  	.xpo_prep_reply_hdr = svc_rdma_prep_reply_hdr,
>  	.xpo_has_wspace = svc_rdma_has_wspace,
>  	.xpo_accept = svc_rdma_accept,
> +	.xpo_secure_port = svc_rdma_secure_port,
>  };
>
>  struct svc_xprt_class svc_rdma_class = {
> @@ -1207,6 +1209,11 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
>  	return 1;
>  }
>
> +static int svc_rdma_secure_port(struct svc_rqst *rqstp)
> +{
> +	return 1;
> +}
> +
>  /*
>   * Attempt to register the kvec representing the RPC memory with the
>   * device.
Re: [PATCH] NFSD: Ignore client's source port on RDMA transports
On Thu, May 22, 2014 at 03:55:17PM -0400, bfields wrote:
> On Mon, May 19, 2014 at 01:40:22PM -0400, Chuck Lever wrote:
> > An NFS/RDMA client's source port is meaningless for RDMA transports.
> > The transport layer typically sets the source port value on the
> > connection to a random ephemeral port.
> >
> > Currently, NFS server administrators must specify the "insecure"
> > export option to enable clients to access exports via RDMA.  But
> > this means NFS clients can access such an export via IP using an
> > ephemeral port, which may not be desirable.
> >
> > This patch eliminates the need to specify the "insecure" export
> > option to allow NFS/RDMA clients access to an export.
> >
> > Would you consider this for nfsd-next?
>
> Yes, looks god to me; applying for 3.16.

Um, you know what I meant.

--b.
Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes
On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
> From: Tom Tucker <t...@opengridcomputing.com>
>
> Change poll logic to grab up to 6 completions at a time.
>
> RDMA write and send completions no longer deal with fastreg objects.
>
> Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
> capabilities.
>
> Signed-off-by: Tom Tucker <t...@opengridcomputing.com>
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> ---
>  include/linux/sunrpc/svc_rdma.h          |    3 -
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   62 +-
>  2 files changed, 37 insertions(+), 28 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 0b8e3e6..5cf99a0 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
>  	struct list_head frmr_list;
>  };
>  struct svc_rdma_req_map {
> -	struct svc_rdma_fastreg_mr *frmr;
>  	unsigned long count;
>  	union {
>  		struct kvec sge[RPCSVC_MAXPAGES];
>  		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
> +		unsigned long lkey[RPCSVC_MAXPAGES];
>  	};
>  };
> -#define RDMACTXT_F_FAST_UNREG	1
>  #define RDMACTXT_F_LAST_CTXT	2
>
>  #define SVCRDMA_DEVCAP_FAST_REG	1	/* fast mr registration */
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 25688fa..2c5b201 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
>   * Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
>  		schedule_timeout_uninterruptible(msecs_to_jiffies(500));
>  	}
>  	map->count = 0;
> -	map->frmr = NULL;
>  	return map;
>  }
> @@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
>
>  	switch (ctxt->wr_op) {
>  	case IB_WR_SEND:
> -		if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> -			svc_rdma_put_frmr(xprt, ctxt->frmr);
> +		BUG_ON(ctxt->frmr);
>  		svc_rdma_put_context(ctxt, 1);
>  		break;
>
>  	case IB_WR_RDMA_WRITE:
> +		BUG_ON(ctxt->frmr);
>  		svc_rdma_put_context(ctxt, 0);
>  		break;
>
>  	case IB_WR_RDMA_READ:
>  	case IB_WR_RDMA_READ_WITH_INV:
> +		svc_rdma_put_frmr(xprt, ctxt->frmr);
>  		if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
>  			struct svc_rdma_op_ctxt *read_hdr = ctxt->read_hdr;
>  			BUG_ON(!read_hdr);
> -			if (test_bit(RDMACTXT_F_FAST_UNREG, &ctxt->flags))
> -				svc_rdma_put_frmr(xprt, ctxt->frmr);
>  			spin_lock_bh(&xprt->sc_rq_dto_lock);
>  			set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
>  			list_add_tail(&read_hdr->dto_q,
> @@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
>  		break;
>
>  	default:
> +		BUG_ON(1);
>  		printk(KERN_ERR "svcrdma: unexpected completion type, "
>  		       "opcode=%d\n",
>  		       ctxt->wr_op);

Note the printk's unreachable now.

Should some of these BUG_ON()'s be WARN_ON()'s?

> @@ -378,29 +378,42 @@ static void process_context(struct svcxprt_rdma *xprt,
>  static void sq_cq_reap(struct svcxprt_rdma *xprt)
>  {
>  	struct svc_rdma_op_ctxt *ctxt = NULL;
> -	struct ib_wc wc;
> +	struct ib_wc wc_a[6];
> +	struct ib_wc *wc;
>  	struct ib_cq *cq = xprt->sc_sq_cq;
>  	int ret;

May want to keep an eye on the stack usage here?

--b.
> +	memset(wc_a, 0, sizeof(wc_a));
> +
>  	if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
>  		return;
>
>  	ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
>  	atomic_inc(&rdma_stat_sq_poll);
> -	while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
> -		if (wc.status != IB_WC_SUCCESS)
> -			/* Close the transport */
> -			set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> +	while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
> +		int i;
>
> -		/* Decrement used SQ WR count */
> -		atomic_dec(&xprt->sc_sq_count);
> -		wake_up(&xprt->sc_send_wait);
> +		for (i = 0; i < ret; i++) {
> +			wc = &wc_a[i];
> +			if (wc->status != IB_WC_SUCCESS) {
> +				dprintk("svcrdma: sq wc err status %d\n",
> +					wc->status);
>
> -		ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
> -		if (ctxt)
> -			process_context(xprt, ctxt);
> +
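[The heart of the change, restated as a sketch -- handle_completion is a
hypothetical stand-in for the error handling and process_context() calls in
the real patch:

	/* Poll up to ARRAY_SIZE(wc_a) completions per ib_poll_cq() call
	 * instead of one at a time, amortizing the per-poll overhead. */
	struct ib_wc wc_a[6];
	int ret, i;

	while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
		for (i = 0; i < ret; i++)
			handle_completion(xprt, &wc_a[i]);
	}

ib_poll_cq() returns the number of completions actually reaped, so the inner
loop only touches the entries the CQ filled in.]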
Re: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic
On Tue, May 06, 2014 at 12:46:21PM -0500, Steve Wise wrote:
> This patch series refactors the NFSRDMA server marshalling logic to
> remove the intermediary map structures.  It also fixes an existing bug
> where the NFSRDMA server was not minding the device fast register page
> list length limitations.
>
> I've also made a git repo available with these patches on top of
> 3.15-rc4:
>
> git://git.openfabrics.org/~swise/linux  svcrdma-refactor
>
> Changes since V1:
>
> - fixed regression for devices that don't support FRMRs (see
>   rdma_read_chunk_lcl())
>
> - split patch up for closer review.  However I request it be squashed
>   before merging as it is not bisectable, and I think these changes
>   should all be a single commit anyway.

If it's not split up in a way that's bisectable, then yes, just don't
bother.

--b.

> Please review, and test if you can.
>
> Signed-off-by: Tom Tucker <t...@opengridcomputing.com>
> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
>
> ---
>
> Tom Tucker (3):
>       svcrdma: Sendto changes
>       svcrdma: Recvfrom changes
>       svcrdma: Transport and header file changes
>
>  include/linux/sunrpc/svc_rdma.h          |    3
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  633 --
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  230 +--
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   62 ++-
>  4 files changed, 318 insertions(+), 610 deletions(-)
>
> --
> Steve / Tom
Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes
On Tue, May 06, 2014 at 04:02:41PM -0500, Steve Wise wrote:
> On 5/6/2014 2:21 PM, J. Bruce Fields wrote:
> > On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
> > >  	default:
> > > +		BUG_ON(1);
> > >  		printk(KERN_ERR "svcrdma: unexpected completion type, "
> > >  		       "opcode=%d\n",
> > >  		       ctxt->wr_op);
> >
> > Note the printk's unreachable now.
> >
> > Should some of these BUG_ON()'s be WARN_ON()'s?
>
> I'll remove the printk.  And if any of the new BUG_ON()'s can be
> WARN_ON(), then I'll do that.  But only if proceeding after a WARN_ON()
> results in a working server.

The other thing to keep in mind is what the consequences of the BUG
might be--e.g. if we BUG while holding an important lock then that lock
never gets dropped and the system can freeze pretty quickly--possibly
before we get any useful information to the system logs.

On a quick check that doesn't look like the case here, though.

--b.
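[A sketch of the warn-and-continue shape being discussed -- a hypothetical
rework for illustration, not what was merged:

	/* Warn loudly on an unexpected completion but keep the server
	 * alive; a BUG_ON() here would halt the machine, possibly while
	 * holding locks and before anything useful hits the logs. */
	static void note_unexpected_completion(struct svc_rdma_op_ctxt *ctxt)
	{
		WARN_ONCE(1, "svcrdma: unexpected completion type, opcode=%d\n",
			  ctxt->wr_op);
	}

The tradeoff is exactly the one stated above: WARN only makes sense where the
code can safely proceed afterwards.]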
Re: [PATCH] svcrdma: refactor marshalling logic
On Thu, Apr 10, 2014 at 01:37:23PM -0500, Steve Wise wrote:
> From: Tom Tucker <t...@ogc.us>
>
> This patch refactors the marshalling logic to remove the intermediary
> map structures.  It also fixes an existing bug where the NFSRDMA
> server was not minding the device fast register page list length
> limitations.
>
> Signed-off-by: Tom Tucker <t...@ogc.us>
> ---
>  include/linux/sunrpc/svc_rdma.h          |    3
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  551 +-
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    |  230 +
>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   57 ++-
>  4 files changed, 222 insertions(+), 619 deletions(-)

Is it possible to make this change in more than one step?

RDMA is pretty esoteric to most of us, so honestly this will probably
get merged based just on your having tested it, but if it was possible
to break this up into smaller patches you might give us at least a
fighting chance of giving it some review....

--b.

> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 0b8e3e6..5cf99a0 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
>  	struct list_head frmr_list;
>  };
>  struct svc_rdma_req_map {
> -	struct svc_rdma_fastreg_mr *frmr;
>  	unsigned long count;
>  	union {
>  		struct kvec sge[RPCSVC_MAXPAGES];
>  		struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
> +		unsigned long lkey[RPCSVC_MAXPAGES];
>  	};
>  };
> -#define RDMACTXT_F_FAST_UNREG	1
>  #define RDMACTXT_F_LAST_CTXT	2
>
>  #define SVCRDMA_DEVCAP_FAST_REG	1	/* fast mr registration */
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 8d904e4..9cf65ff 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -1,4 +1,5 @@
>  /*
> + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
>   * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
>   *
>   * This software is available to you under a choice of one of two
> @@ -69,7 +70,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>
>  	/* Set up the XDR head */
>  	rqstp->rq_arg.head[0].iov_base = page_address(page);
> -	rqstp->rq_arg.head[0].iov_len = min(byte_count, ctxt->sge[0].length);
> +	rqstp->rq_arg.head[0].iov_len =
> +		min_t(size_t, byte_count, ctxt->sge[0].length);
>  	rqstp->rq_arg.len = byte_count;
>  	rqstp->rq_arg.buflen = byte_count;
> @@ -85,7 +87,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>  		page = ctxt->pages[sge_no];
>  		put_page(rqstp->rq_pages[sge_no]);
>  		rqstp->rq_pages[sge_no] = page;
> -		bc -= min(bc, ctxt->sge[sge_no].length);
> +		bc -= min_t(u32, bc, ctxt->sge[sge_no].length);
>  		rqstp->rq_arg.buflen += ctxt->sge[sge_no].length;
>  		sge_no++;
>  	}
> @@ -113,291 +115,154 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
>  	rqstp->rq_arg.tail[0].iov_len = 0;
>  }
>
> -/* Encode a read-chunk-list as an array of IB SGE
> - *
> - * Assumptions:
> - * - chunk[0]->position points to pages[0] at an offset of 0
> - * - pages[] is not physically or virtually contiguous and consists of
> - *   PAGE_SIZE elements.
> - *
> - * Output:
> - * - sge array pointing into pages[] array.
> - * - chunk_sge array specifying sge index and count for each
> - *   chunk in the read list
> - *
> - */
> -static int map_read_chunks(struct svcxprt_rdma *xprt,
> -			   struct svc_rqst *rqstp,
> -			   struct svc_rdma_op_ctxt *head,
> -			   struct rpcrdma_msg *rmsgp,
> -			   struct svc_rdma_req_map *rpl_map,
> -			   struct svc_rdma_req_map *chl_map,
> -			   int ch_count,
> -			   int byte_count)
> -{
> -	int sge_no;
> -	int sge_bytes;
> -	int page_off;
> -	int page_no;
> -	int ch_bytes;
> -	int ch_no;
> -	struct rpcrdma_read_chunk *ch;
> -
> -	sge_no = 0;
> -	page_no = 0;
> -	page_off = 0;
> -	ch = (struct rpcrdma_read_chunk *)&rmsgp->rm_body.rm_chunks[0];
> -	ch_no = 0;
> -	ch_bytes = ntohl(ch->rc_target.rs_length);
> -	head->arg.head[0] = rqstp->rq_arg.head[0];
> -	head->arg.tail[0] = rqstp->rq_arg.tail[0];
> -	head->arg.pages = &head->pages[head->count];
> -	head->hdr_count = head->count; /* save count of hdr pages */
> -	head->arg.page_base = 0;
> -	head->arg.page_len = ch_bytes;
> -	head->arg.len = rqstp->rq_arg.len + ch_bytes;
> -	head->arg.buflen = rqstp->rq_arg.buflen + ch_bytes;
> -	head->count++;
> -	chl_map->ch[0].start = 0;
> -	while (byte_count) {
> -		rpl_map->sge[sge_no].iov_base =
> -			page_address(rqstp->rq_arg.pages[page_no]) + page_off;
> -		sge_bytes = min_t(int, PAGE_SIZE-page_off,
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 10:42:48AM -0400, J. Bruce Fields wrote:
> On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
> > > > > > I've been trying to do some benchmarks for NFS over RDMA and
> > > > > > I seem to only get about half of the bandwidth that the HW
> > > > > > can give me.  My setup consists of 2 servers each with 16
> > > > > > cores, 32Gb of memory, and Mellanox ConnectX3 QDR card over
> > > > > > PCI-e gen3.  These servers are connected to a QDR IB switch.
> > > > > > The backing storage on the server is tmpfs mounted with
> > > > > > noatime.  I am running kernel 3.5.7.
> > > > > >
> > > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block
> > > > > > sizes 4-512K.  When I run fio over rdma mounted nfs, I get
> > > > > > 260-2200MB/sec for the same block sizes (4-512K).  Running
> > > > > > over IPoIB-CM, I get 200-980MB/sec.
> > ...
> > I am trying to get maximum performance from a single server - I used
> > 2 processes in fio test - more than 2 did not show any performance
> > boost.  I tried running fio from 2 different PCs on 2 different
> > files, but the sum of the two is more or less the same as running
> > from single client PC.
> >
> > What I did see is that server is sweating a lot more than the
> > clients and more than that, it has 1 core (CPU5) in 100% softirq
> > tasklet:
> >
> > cat /proc/softirqs
> > ...
> > Perf top for the CPU with high tasklet count gives:
> >
> >   samples  pcnt     RIP        function             DSO
> >   2787.00 24.1% 81062a00 mutex_spin_on_owner  /root/vmlinux
> > ...
> > > Googling around....  I think we want:
> > >
> > > 	perf record -a --call-graph
> > > 	(give it a chance to collect some samples, then ^C)
> > > 	perf report --call-graph --stdio
> >
> > Sorry it took me a while to get perf to show the call trace (did not
> > enable frame pointers in kernel and struggled with perf options...),
> > but what I get is:
> >
> >     36.18%    nfsd  [kernel.kallsyms]     [k] mutex_spin_on_owner
> >               |
> >               --- mutex_spin_on_owner
> >                  |
> >                  |--99.99%-- __mutex_lock_slowpath
> >                  |          mutex_lock
> >                  |          |
> >                  |          |--85.30%-- generic_file_aio_write
>
> That's the inode i_mutex.

Looking at the code....  With CONFIG_MUTEX_SPIN_ON_OWNER it spins
(instead of sleeping) as long as the lock owner's still running.

So this is just a lot of contention on the i_mutex, I guess.  Not sure
what to do about that.

--b.
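[Roughly what mutex_spin_on_owner() is doing, for context -- a loose,
from-memory paraphrase of the kernel's optimistic mutex spinning; the helper
name owner_running is best-effort, not quoted from this kernel version:

	/* A contender does not sleep right away: while the current owner
	 * is still running on a CPU, busy-wait and recheck, betting the
	 * mutex will be released soon.  Only when the owner is scheduled
	 * out does the contender fall back to the sleeping slow path. */
	while (owner_running(lock, owner))
		cpu_relax();

That spinning is why contention here shows up as burned CPU in perf rather
than as sleeping nfsd threads.]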
Re: NFS over RDMA benchmark
On Sun, Apr 28, 2013 at 06:28:16AM +0000, Yan Burman wrote:
> > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
> > > > > I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > seem to only get about half of the bandwidth that the HW can
> > > > > give me.  My setup consists of 2 servers each with 16 cores,
> > > > > 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e
> > > > > gen3.  These servers are connected to a QDR IB switch.  The
> > > > > backing storage on the server is tmpfs mounted with noatime.
> > > > > I am running kernel 3.5.7.
> > > > >
> > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > 4-512K.  When I run fio over rdma mounted nfs, I get
> > > > > 260-2200MB/sec for the same block sizes (4-512K).  Running over
> > > > > IPoIB-CM, I get 200-980MB/sec.
> ...
> > Googling around....  I think we want:
> >
> > 	perf record -a --call-graph
> > 	(give it a chance to collect some samples, then ^C)
> > 	perf report --call-graph --stdio
>
> Sorry it took me a while to get perf to show the call trace (did not
> enable frame pointers in kernel and struggled with perf options...),
> but what I get is:
>
>     36.18%    nfsd  [kernel.kallsyms]     [k] mutex_spin_on_owner
>               |
>               --- mutex_spin_on_owner
>                  |
>                  |--99.99%-- __mutex_lock_slowpath
>                  |          mutex_lock
>                  |          |
>                  |          |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

>                  |          |          do_sync_readv_writev
>                  |          |          do_readv_writev
>                  |          |          vfs_writev
>                  |          |          nfsd_vfs_write
>                  |          |          nfsd_write
>                  |          |          nfsd3_proc_write
>                  |          |          nfsd_dispatch
>                  |          |          svc_process_common
>                  |          |          svc_process
>                  |          |          nfsd
>                  |          |          kthread
>                  |          |          kernel_thread_helper
>                  |          |
>                  |           --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

>                  |          svc_process
>                  |          nfsd
>                  |          kthread
>                  |          kernel_thread_helper
>                   --0.01%-- [...]
>
>      9.63%    nfsd  [kernel.kallsyms]     [k] _raw_spin_lock_irqsave
>               |
>               --- _raw_spin_lock_irqsave
>                  |
>                  |--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

>                  |          intel_alloc_iova
>                  |          __intel_map_single
>                  |          intel_map_page
>                  |          |
>                  |          |--60.47%-- svc_rdma_sendto
>                  |          |          svc_send
>                  |          |          svc_process
>                  |          |          nfsd
>                  |          |          kthread
>                  |          |          kernel_thread_helper
>                  |          |
>                  |          |--30.10%-- rdma_read_xdr
>                  |          |          svc_rdma_recvfrom
>                  |          |          svc_recv
>                  |          |          nfsd
>                  |          |          kthread
>                  |          |          kernel_thread_helper
>                  |          |
>                  |          |--6.69%-- svc_rdma_post_recv
>                  |          |          send_reply
>                  |          |          svc_rdma_sendto
>                  |
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfie...@fieldses.org]
> > Sent: Wednesday, April 24, 2013 00:06
> > To: Yan Burman
> > Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
> > linux-...@vger.kernel.org; Or Gerlitz
> > Subject: Re: NFS over RDMA benchmark
> >
> > On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > > What I did see is that server is sweating a lot more than the
> > > clients and more than that, it has 1 core (CPU5) in 100% softirq
> > > tasklet:
> > >
> > > cat /proc/softirqs
> >
> > Would any profiling help figure out which code it's spending time in?
> > (E.g. something simple as perf top might have useful output.)
>
> Perf top for the CPU with high tasklet count gives:
>
>  samples  pcnt     RIP      function                 DSO
>  _______ _____ ________ ________________________ ___________
>
>  2787.00 24.1% 81062a00 mutex_spin_on_owner      /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one....

perf should also be able to collect stack statistics, I forget how.

--b.
>   978.00  8.4% 810297f0 clflush_cache_range      /root/vmlinux
>   445.00  3.8% 812ea440 __domain_mapping         /root/vmlinux
>   441.00  3.8% 00018c30 svc_recv                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   344.00  3.0% 813a1bc0 _raw_spin_lock_bh        /root/vmlinux
>   333.00  2.9% 813a19e0 _raw_spin_lock_irqsave   /root/vmlinux
>   288.00  2.5% 813a07d0 __schedule               /root/vmlinux
>   249.00  2.1% 811a87e0 rb_prev                  /root/vmlinux
>   242.00  2.1% 813a19b0 _raw_spin_lock           /root/vmlinux
>   184.00  1.6% 2e90     svc_rdma_sendto          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>   177.00  1.5% 810ac820 get_page_from_freelist   /root/vmlinux
>   174.00  1.5% 812e6da0 alloc_iova               /root/vmlinux
>   165.00  1.4% 810b1390 put_page                 /root/vmlinux
>   148.00  1.3% 00014760 sunrpc_cache_lookup      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   128.00  1.1% 00017f20 svc_xprt_enqueue         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
>   126.00  1.1% 8139f820 __mutex_lock_slowpath    /root/vmlinux
>   108.00  0.9% 811a81d0 rb_insert_color          /root/vmlinux
>   107.00  0.9% 4690     svc_rdma_recvfrom        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>   102.00  0.9% 2640     send_reply               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
>    99.00  0.9% 810e6490 kmem_cache_alloc         /root/vmlinux
>    96.00  0.8% 810e5840 __slab_alloc             /root/vmlinux
Re: NFS over RDMA benchmark
On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
> On Wed, Apr 24, 2013 at 12:35:03PM +0000, Yan Burman wrote:
> > Perf top for the CPU with high tasklet count gives:
> >
> >  samples  pcnt     RIP      function                 DSO
> >  2787.00 24.1% 81062a00 mutex_spin_on_owner      /root/vmlinux
>
> I guess that means lots of contention on some mutex?  If only we knew
> which one....
>
> perf should also be able to collect stack statistics, I forget how.

Googling around....  I think we want:

	perf record -a --call-graph
	(give it a chance to collect some samples, then ^C)
	perf report --call-graph --stdio

--b.
Re: NFS over RDMA benchmark
On Thu, Apr 18, 2013 at 12:47:09PM +0000, Yan Burman wrote:
> > -----Original Message-----
> > From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
> > Sent: Wednesday, April 17, 2013 21:06
> > To: Atchley, Scott
> > Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
> > linux-...@vger.kernel.org
> > Subject: Re: NFS over RDMA benchmark
> >
> > On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott <atchle...@ornl.gov> wrote:
> > > On Apr 17, 2013, at 1:15 PM, Wendy Cheng <s.wendy.ch...@gmail.com> wrote:
> > > > On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman <y...@mellanox.com> wrote:
> > > > > Hi.
> > > > >
> > > > > I've been trying to do some benchmarks for NFS over RDMA and I
> > > > > seem to only get about half of the bandwidth that the HW can
> > > > > give me.  My setup consists of 2 servers each with 16 cores,
> > > > > 32Gb of memory, and Mellanox ConnectX3 QDR card over PCI-e
> > > > > gen3.  These servers are connected to a QDR IB switch.  The
> > > > > backing storage on the server is tmpfs mounted with noatime.
> > > > > I am running kernel 3.5.7.
> > > > >
> > > > > When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes
> > > > > 4-512K.  When I run fio over rdma mounted nfs, I get
> > > > > 260-2200MB/sec for the same block sizes (4-512K).  Running over
> > > > > IPoIB-CM, I get 200-980MB/sec.
> > >
> > > Yan,
> > >
> > > Are you trying to optimize single client performance or server
> > > performance with multiple clients?
>
> I am trying to get maximum performance from a single server - I used 2
> processes in fio test - more than 2 did not show any performance
> boost.  I tried running fio from 2 different PCs on 2 different files,
> but the sum of the two is more or less the same as running from single
> client PC.
>
> What I did see is that server is sweating a lot more than the clients
> and more than that, it has 1 core (CPU5) in 100% softirq tasklet:
>
> cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something simple as perf top might have useful output.)

--b.

>                 CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7 CPU8 CPU9 CPU10 CPU11 CPU12 CPU13 CPU14 CPU15
> HI:             0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> TIMER:          418767 46596 43515 44547 50099 34815 40634 40337 39551 93442 73733 42631 42509 41592 40351 61793
> NET_TX:         28719309 1421 1294 1730 1243832937 11 44 41 20 26 19 15 29
> NET_RX:         612070 19 22 21 6 235 3 2 9 6 17 16 20 13 16 10
> BLOCK:          5941 0 0 0 0 0 0 0519259 1238272 253174215 2618
> BLOCK_IOPOLL:   0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
> TASKLET:        28 1 1 1 1 1540653 1 1 29 1 1 1 1 1 1 2
> SCHED:          364965 26547 16807 18403 22919 8678 14358 14091 16981 64903 47141 18517 19179 18036 17037 38261
> HRTIMER:        13 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1
> RCU:            945823 841546 715281 892762 823564 42663 863063 841622 333577 389013 393501 239103 221524 258159 313426 234030
>
> > Remember there are always gaps between wire speed (that ib_send_bw
> > measures) and real world applications.
>
> I realize that, but I don't expect the difference to be more than
> twice.
>
> > That being said, does your server use the default export (sync)
> > option?  Exporting the share with the async option can bring you
> > closer to wire speed.  However, the practice (async) is generally
> > not recommended in a real production system - as it can cause data
> > integrity issues, e.g. you have more chances to lose data when the
> > boxes crash.
>
> I am running with async export option, but that should not matter too
> much, since my backing storage is tmpfs mounted with noatime.
>
> > -- Wendy
>
> > Wendy,
> >
> > It has been a few years since I looked at RPCRDMA, but I seem to
> > remember that RPCs were limited to 32KB, which means that you have
> > to pipeline them to get line rate.  In addition to requiring
Re: NFS over RDMA crashing
On Mon, Feb 11, 2013 at 03:19:42PM +0000, Yan Burman wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:bfie...@fieldses.org]
> > Sent: Thursday, February 07, 2013 18:42
> > To: Yan Burman
> > Cc: linux-...@vger.kernel.org; sw...@opengridcomputing.com;
> > linux-r...@vger.kernel.org; Or Gerlitz
> > Subject: Re: NFS over RDMA crashing
> >
> > On Wed, Feb 06, 2013 at 05:24:35PM -0500, J. Bruce Fields wrote:
> > > On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
> > > > When killing mount command that got stuck:
> > > > ---
> > > > BUG: unable to handle kernel paging request at 880324dc7ff8
> > > > IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
> > > > PGD 1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
> > > > Oops: 0003 [#1] PREEMPT SMP
> > > > Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm iw_cm
> > > > ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
> > > > iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
> > > > nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
> > > > target_core_file target_core_pscsi target_core_mod configfs 8021q
> > > > bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net macvtap
> > > > macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel kvm
> > > > crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
> > > > ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
> > > > mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3 jbd
> > > > sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod
> > > > CPU 6
> > > > Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro X8DTH-i/6/iF/6F/X8DTH
> > > > RIP: 0010:[a05f3dfb] [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
> > > > RSP: 0018:880324c3dbf8 EFLAGS: 00010297
> > > > RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
> > > > RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
> > > > RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
> > > > R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
> > > > R13: 0003 R14: 0001 R15: 0010
> > > > FS: () GS:88063fc0() knlGS:
> > > > CS: 0010 DS: ES: CR0: 8005003b
> > > > CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
> > > > DR0: DR1: DR2: DR3: DR6: 0ff0 DR7: 0400
> > > > Process nfsd (pid: 4744, threadinfo 880324c3c000, task 88033055)
> > > > Stack:
> > > >  880324c3dc78 880324c3dcd8 0282 880631cec000
> > > >  880324dd8000 88062ed33040 000124c3dc48 880324dd8000
> > > >  88062ed33058 880630ce2b90 8806299e8000 0003
> > > > Call Trace:
> > > >  [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
> > > >  [81086540] ? try_to_wake_up+0x2f0/0x2f0
> > > >  [a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
> > > >  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
> > > >  [a0571e5d] nfsd+0xad/0x130 [nfsd]
> > > >  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
> > > >  [81071df6] kthread+0xd6/0xe0
> > > >  [81071d20] ? __init_kthread_worker+0x70/0x70
> > > >  [814b462c] ret_from_fork+0x7c/0xb0
> > > >  [81071d20] ? __init_kthread_worker+0x70/0x70
> > > > Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
> > > > 00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00 48
> > > > c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00
> > > > RIP [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] RSP 880324c3dbf8
> > > > CR2: 880324dc7ff8
> > > > ---[ end trace 06d0384754e9609a ]---
> > > >
> > > > It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e "nfsd4:
> > > > cleanup: replace rq_resused count by rq_next_page pointer" is
> > > > responsible for the crash (it seems to be crashing in
> > > > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527).
> > > > It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
> > > > CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
> > > >
> > > > When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I was
> > > > no longer getting the server crashes, so the rest of my tests were
> > > > done using that point (it is somewhere in the middle of 3.7.0-rc2).
> > > OK, so this part's clearly my fault--I'll work on a patch, but the
> > > rdma's use of the ->rq_pages array is pretty confusing.

Does this help? They must have added this for some reason, but I'm not
seeing how it could have ever done anything....

--b.

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..e8f25ec 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -520,13 +520,6 @@ next_sge:
 	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
 		rqstp->rq_pages[ch_no] = NULL;
 
-	/*
-	 * Detach res pages. If svc_release sees any it will attempt to
-	 * put them.
-	 */
-	while (rqstp->rq_next_page != rqstp->rq_respages)
-		*(--rqstp->rq_next_page) = NULL;
-
 	return err;
 }
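A rough picture of the bookkeeping in question may help here. In the sketch below (an illustrative userspace mock, not kernel code -- the struct is a stripped-down stand-in for svc_rqst, reflecting one reading of the post-rq_next_page layout), rq_respages points into the rq_pages array at the first reply page and rq_next_page points one past the last reply page in use. The loop being removed only ever does work if something has advanced rq_next_page past rq_respages -- and if rq_next_page is stale it walks off the array, which is consistent with the page-boundary fault above.

/* A schematic mock of the svc_rqst page-array bookkeeping after the
 * rq_next_page conversion.  Names mirror the kernel's, but this is an
 * illustrative userspace model, not kernel code.
 */
#include <stddef.h>
#include <stdio.h>

#define RPCSVC_MAXPAGES 8		/* tiny, for illustration */

struct page;				/* opaque here */

struct mock_rqst {
	struct page *rq_pages[RPCSVC_MAXPAGES];
	struct page **rq_respages;	/* first reply page, points into rq_pages */
	struct page **rq_next_page;	/* one past the last reply page in use */
};

/* The loop the patch above removes: detach the reply pages so a later
 * release won't put them.  It is a no-op unless rq_next_page has been
 * advanced past rq_respages.
 */
static void detach_respages(struct mock_rqst *rqstp)
{
	while (rqstp->rq_next_page != rqstp->rq_respages)
		*(--rqstp->rq_next_page) = NULL;
}

int main(void)
{
	struct mock_rqst rq = { { 0 } };

	rq.rq_respages = &rq.rq_pages[2];	/* pages 0..1: request */
	rq.rq_next_page = &rq.rq_pages[5];	/* pages 2..4: reply */
	detach_respages(&rq);
	printf("reply pages detached: %td\n",
	       &rq.rq_pages[5] - rq.rq_respages);
	return 0;
}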
Re: NFS over RDMA crashing
On Wed, Feb 06, 2013 at 05:24:35PM -0500, J. Bruce Fields wrote:
> On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
> > When killing mount command that got stuck:
> > ---
> > BUG: unable to handle kernel paging request at 880324dc7ff8
> > IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
> > [... remainder of the oops identical to the trace quoted in full above ...]
> >
> > It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e "nfsd4:
> > cleanup: replace rq_resused count by rq_next_page pointer" is
> > responsible for the crash (it seems to be crashing in
> > net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527).
> > It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
> > CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
> >
> > When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I was
> > no longer getting the server crashes, so the rest of my tests were
> > done using that point (it is somewhere in the middle of 3.7.0-rc2).
>
> OK, so this part's clearly my fault--I'll work on a patch, but the
> rdma's use of the ->rq_pages array is pretty confusing.

Does this help? They must have added this for some reason, but I'm not
seeing how it could have ever done anything....

--b.
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..e8f25ec 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -520,13 +520,6 @@ next_sge:
 	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
 		rqstp->rq_pages[ch_no] = NULL;
 
-	/*
-	 * Detach res pages. If svc_release sees any it will attempt to
-	 * put them.
-	 */
-	while (rqstp->rq_next_page != rqstp->rq_respages)
-		*(--rqstp->rq_next_page) = NULL;
-
 	return err;
 }
Re: NFS over RDMA crashing
On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
> When killing mount command that got stuck:
> ---
> BUG: unable to handle kernel paging request at 880324dc7ff8
> IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
> [... remainder of the oops identical to the trace quoted in full above ...]
>
> It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e "nfsd4:
> cleanup: replace rq_resused count by rq_next_page pointer" is
> responsible for the crash (it seems to be crashing in
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527).
> It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
> CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
>
> When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I was
> no longer getting the server crashes, so the rest of my tests were
> done using that point (it is somewhere in the middle of 3.7.0-rc2).

OK, so this part's clearly my fault--I'll work on a patch, but the
rdma's use of the ->rq_pages array is pretty confusing.

--b.
Re: [Q] I tried to mount nfs_server:/mnt volume, but It mounted another volume.
On Sun, Nov 21, 2010 at 03:41:18PM +0200, Boaz Harrosh wrote:
> On 11/21/2010 02:23 PM, Hiroyuki Sato wrote:
> > Hello lists
> >
> > OS: CentOS 5.5
> > kernel: 2.6.36, rebuilt myself.
> >
> > I'm trying to test NFS/RDMA. I tried to mount the nfs_server:/mnt
> > volume on nfs_client, but it mounted the nfs_server:/nfstest volume.
> > note: /nfstest is tmpfs
> >
> > this is mount output
> > tmpfs on /nfstest type tmpfs (rw,size=4g)
> > /dev/sdb1 on /mnt type ext3 (rw)
> >
> > Is this a bug??
> >
> > NFS server config
> > # ls -1 /mnt
> > This_is_mnt_volume
> > # ls -1 /nfstest
> > This_is_nfstest_volume
> > # cat /etc/exports
> > /nfstest 192.168.100.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
> > /mnt 192.168.100.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
>
> You must not have two exports with fsid=0. The first one is picked.
> nfs4 will only export a single namespace point; other exports are
> subdirs of that root export. (Use bind mounts to present a single
> directory tree.)
>
> http://www.citi.umich.edu/projects/nfsv4/linux/using-nfsv4.html

We should update and/or replace that documentation; with new nfs-utils
and kernel, if you omit the fsid=0, you'll end up with an nfsv4
namespace that's the same as v2/v3's always was.

--b.

	NFSv4 exports on linux
	~~~~~~~~~~~~~~~~~~~~~~

	NFSv4 no longer has a separate mount protocol. Instead of
	exporting a number of distinct exports, an NFSv4 client sees the
	NFSv4 server's exports as existing inside a single filesystem,
	called the nfsv4 pseudofilesystem.

	On the current linux implementation, the pseudofilesystem is a
	single real filesystem, identified at export with the fsid=0
	option. In the example above, we exported only a single
	filesystem, which the client mounted as /.

	You can provide clients with multiple filesystems to mount,
	producing NFSv3-like behavior, by creative use of mount --bind.
	For example, you could export /usr/local/bin to clients as /bin
	and /usr/local/etc as /etc as follows:

	mkdir /export
	mkdir /export/bin
	mkdir /export/etc
	mount --bind /usr/local/bin /export/bin
	mount --bind /usr/local/etc /export/etc
	exportfs -ofsid=0,insecure,no_subtree_check *:/export
	exportfs -orw,nohide,insecure,no_subtree_check *:/export/bin
	exportfs -orw,nohide,insecure,no_subtree_check *:/export/etc

	Note that the paths returned by the showmount program are
	meaningful only to clients using nfs versions 2 and 3; in the
	above example, showmount will list the paths /export,
	/export/bin, and /export/etc, but nfsv4 clients should mount
	yourserver:/, yourserver:/bin, or yourserver:/etc.
> http://www.citi.umich.edu/projects/nfsv4/linux/using-nfsv4.html
>
> Boaz

> > # modprobe svcrdma
> > # /sbin/service nfs start
> > # echo rdma 20049 > /proc/fs/nfsd/portlist
> >
> > Client Setting
> > /sbin/modprobe xprtrdma
> > /sbin/mount.rnfs 192.168.100.231:/mnt /mnt -i -o rdma,port=20049
> > # ls -1 /mnt
> > This_is_nfstest_volume
> >
> > NFS Server log
> > sysctl -w sunrpc.nfsd_debug=1023
> >
> > Nov 21 20:47:37 dell1435 mountd[3575]: authenticated mount request from 192.168.100.232:766 for /mnt (/mnt)
> > Nov 21 20:47:37 dell1435 mountd[3575]: /nfstest and /mnt have same filehandle for 192.168.100.0/255.255.255.0, using first
> > Nov 21 20:48:55 dell1435 mountd[3575]: authenticated unmount request from 192.168.100.232:912 for /mnt (/mnt)
> > Nov 21 20:48:55 dell1435 mountd[3575]: authenticated unmount request from 192.168.100.232:913 for /mnt (/mnt)
> > Nov 21 20:49:00 dell1435 mountd[3575]: authenticated unmount request from 192.168.100.232:917 for /mnt (/mnt)
> > Nov 21 20:49:16 dell1435 mountd[3575]: authenticated mount request from 192.168.100.232:865 for /mnt (/mnt)
> > Nov 21 21:02:22 dell1435 mountd[3575]: authenticated unmount request from 192.168.100.232:955 for /mnt (/mnt)
> > Nov 21 21:02:26 dell1435 mountd[3575]: authenticated mount request from 192.168.100.232:884 for /mnt (/mnt)
> > Nov 21 21:02:26 dell1435 kernel: nfsd: exp_rootfh(/mnt [88011f586740] 192.168.100.0/255.255.255.0:sdb1/2)
> >
> > --
> > Hiroyuki Sato
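At the syscall level, the mount --bind steps in the documentation quoted above come down to mount(2) with the MS_BIND flag. A minimal C equivalent of that example (an illustrative sketch using the same paths as the quoted example; the exportfs steps still happen in userspace as shown):

/* bind_exports.c - C equivalent of the "mount --bind" steps in the
 * quoted NFSv4 documentation.  Illustrative sketch; run as root.
 */
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

static int bind_mount(const char *src, const char *dst)
{
	mkdir(dst, 0755);		/* ignore EEXIST for brevity */
	if (mount(src, dst, NULL, MS_BIND, NULL)) {
		perror(dst);
		return -1;
	}
	return 0;
}

int main(void)
{
	mkdir("/export", 0755);
	if (bind_mount("/usr/local/bin", "/export/bin"))
		return 1;
	if (bind_mount("/usr/local/etc", "/export/etc"))
		return 1;
	/* The exportfs invocations (fsid=0 on /export, nohide on the
	 * subdirectories) are still done separately, as shown above. */
	return 0;
}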
Re: [PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37
On Tue, Oct 12, 2010 at 03:33:46PM -0500, Tom Tucker wrote:
> Hi Bruce,
>
> These fixes are ready for 2.6.37. They fix two bugs in the server-side
> NFSRDMA transport.

Both applied and pushed out, thanks.

--b.

> Thanks,
> Tom
>
> ---
>
> Tom Tucker (2):
>       svcrdma: Cleanup DMA unmapping in error paths.
>       svcrdma: Change DMA mapping logic to avoid the page_address kernel API
>
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  | 19 ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c    | 82 ++
>  net/sunrpc/xprtrdma/svc_rdma_transport.c | 41 +++
>  3 files changed, 92 insertions(+), 50 deletions(-)
>
> --
> Signed-off-by: Tom Tucker <t...@ogc.us>
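The page_address() reference in the second patch title points at a real constraint: page_address() only yields a usable kernel virtual address for lowmem (or already-kmapped) pages. A rough sketch of the mapping pattern such a change moves toward -- not the actual commit, just the general ib_dma_map_page() shape, with hypothetical function and parameter names:

/* Sketch of mapping a struct page for RDMA directly, instead of
 * deriving a kernel virtual address first.  Illustrative only; see
 * the actual commit for the real svcrdma changes.  map_one_sge() and
 * its arguments are hypothetical.
 */
#include <linux/errno.h>
#include <rdma/ib_verbs.h>

static int map_one_sge(struct ib_device *dev, struct ib_sge *sge,
		       struct page *page, unsigned int off, unsigned int len)
{
	u64 dma;

	/* Old pattern (roughly): ib_dma_map_single(dev,
	 * page_address(page) + off, len, DMA_TO_DEVICE), which assumes
	 * page_address() is valid, i.e. the page is not highmem.
	 */
	dma = ib_dma_map_page(dev, page, off, len, DMA_TO_DEVICE);
	if (ib_dma_mapping_error(dev, dma))
		return -ENOMEM;

	sge->addr = dma;
	sge->length = len;
	/* lkey setup omitted in this sketch */
	return 0;
}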
Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6
On Mon, Apr 05, 2010 at 12:16:18PM -0400, J. Bruce Fields wrote:
> On Mon, Apr 05, 2010 at 10:50:16AM -0500, Tom Tucker wrote:
> > J. Bruce Fields wrote:
> > > On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote:
> > > > On 04/03/2010 09:27 AM, Tom Tucker wrote:
> > > > > RPC6 requires that it be possible to create endpoints that listen
> > > > > exclusively for IPv4 or IPv6 connection requests. This is not
> > > > > currently supported by the RDMA API.
> > > > >
> > > > > Signed-off-by: Tom Tucker <t...@opengridcomputing.com>
> > > > > Tested-by: Steve Wise <sw...@opengridcomputing.com>
> > > >
> > > > Reviewed-by: Chuck Lever <chuck.le...@oracle.com>
> > >
> > > Thanks to all. I take it the problem began with 37498292a "NFSD:
> > > Create PF_INET6 listener in write_ports"?
> >
> > Yes.
>
> Thanks. I'll pass along git://linux-nfs.org/~bfields/linux.git
> for-2.6.34 soon.

And: sorry we didn't catch this when it happened. I have some of the
equipment I'd need to do basic regression tests, but haven't set it up.
I hope I get to it at some point. For now I depend on others to catch
even basic rdma regressions--let me know if there's some way I could
make your testing easier.

--b.
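For context on what "listen exclusively" means here: with plain sockets it would be the IPV6_V6ONLY option set before bind(), whereas the rdma_cm bind/listen sequence has no equivalent knob, which is the gap the patch description points to. A userspace librdmacm sketch of that sequence (illustrative only; error handling trimmed, port 20049 as used elsewhere in this digest):

/* v6_listen.c - bind an rdma_cm listener on an IPv6 wildcard address.
 * Illustrative sketch: note there is no IPV6_V6ONLY-style option
 * anywhere in this sequence.  Link with -lrdmacm.
 */
#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <rdma/rdma_cma.h>

int main(void)
{
	struct rdma_event_channel *ch = rdma_create_event_channel();
	struct rdma_cm_id *id;
	struct sockaddr_in6 sin6;

	if (!ch || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
		perror("rdma_create_id");
		return 1;
	}

	memset(&sin6, 0, sizeof(sin6));
	sin6.sin6_family = AF_INET6;
	sin6.sin6_addr = in6addr_any;
	sin6.sin6_port = htons(20049);	/* nfsrdma port */

	if (rdma_bind_addr(id, (struct sockaddr *)&sin6) ||
	    rdma_listen(id, 16)) {
		perror("bind/listen");
		return 1;
	}
	/* No setsockopt(IPV6_V6ONLY) equivalent exists at this layer. */
	puts("listening");
	return 0;
}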