Re: device attr cleanup (was: Handle mlx4 max_sge_rd correctly)

2015-12-23 Thread J. Bruce Fields
On Thu, Dec 10, 2015 at 07:49:59PM -0500, Chuck Lever wrote:
> 
> > On Dec 10, 2015, at 6:30 PM, Christoph Hellwig  wrote:
> > 
> > On Thu, Dec 10, 2015 at 11:07:03AM -0700, Jason Gunthorpe wrote:
> >> The ARM folks do this sort of stuff on a regular basis.. Very early on
> >> Doug prepares a topic branch with only the big change, NFS folks pull
> >> it and then pull your work. Then Doug would send the topic branch to
> >> Linus as soon as the merge window opens, then NFS would send theirs.
> >> 
> >> This is a lot less work overall than trying to sequence multiple
> >> patches over multiple releases..
> > 
> > Agreed.  Staging has always been a giant pain and things tend to never
> > finish moving over that way if they are non-trivial enough.
> 
> In that case:
> 
> You need to make sure you have all the right Acks. I've added
> Anna and Bruce to Ack the NFS-related portions. Santosh should
> Ack the RDS part.
> 
> http://git.infradead.org/users/hch/rdma.git/shortlog/refs/heads/ib_device_attr

Fine by me.

> Given the proximity to the holidays and the next merge window,
> Doug will need to get a properly-acked topic branch published
> in the next day or two so the rest of us can rebase and start
> testing before the relevant parties disappear for the holiday.

What branch should I be working from?

--b.


Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-23 Thread J. Bruce Fields
On Mon, Dec 21, 2015 at 05:11:56PM -0500, Chuck Lever wrote:
> 
> > On Dec 21, 2015, at 4:29 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Mon, Dec 21, 2015 at 04:15:23PM -0500, Chuck Lever wrote:
> >> 
> >>> On Dec 21, 2015, at 4:07 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> >>> 
> >>> On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> >>>> Minor optimization: when dealing with write chunk XDR roundup, do
> >>>> not post a Write WR for the zero bytes in the pad. Simply update
> >>>> the write segment in the RPC-over-RDMA header to reflect the extra
> >>>> pad bytes.
> >>>> 
> >>>> The Reply chunk is also a write chunk, but the server does not use
> >>>> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >>>> the server Upper Layer typically marshals the Reply chunk contents
> >>>> in a single contiguous buffer, without a separate tail for the XDR
> >>>> pad.
> >>>> 
> >>>> The comments and the variable naming refer to "chunks" but what is
> >>>> really meant is "segments." The existing code sends only one
> >>>> xdr_write_chunk per RPC reply.
> >>>> 
> >>>> The fix assumes this as well. When the XDR pad in the first write
> >>>> chunk is reached, the assumption is the Write list is complete and
> >>>> send_write_chunks() returns.
> >>>> 
> >>>> That will remain a valid assumption until the server Upper Layer can
> >>>> support multiple bulk payload results per RPC.
> >>>> 
> >>>> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >>>> ---
> >>>> net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
> >>>> 1 file changed, 7 insertions(+)
> >>>> 
> >>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> >>>> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> index 969a1ab..bad5eaa 100644
> >>>> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >>>> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma 
> >>>> *xprt,
> >>>>  arg_ch->rs_handle,
> >>>>  arg_ch->rs_offset,
> >>>>  write_len);
> >>>> +
> >>>> +/* Do not send XDR pad bytes */
> >>>> +if (chunk_no && write_len < 4) {
> >>>> +chunk_no++;
> >>>> +break;
> >>> 
> >>> I'm pretty lost in this code.  Why does (chunk_no && write_len < 4) mean
> >>> this is xdr padding?
> >> 
> >> Chunk zero is always data. Padding is always going to be
> >> after the first chunk. Any chunk after chunk zero that is
> >> shorter than XDR quad alignment is going to be a pad.
> > 
> > I don't really know what a chunk is...  Looking at the code:
> > 
> > write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));
> > 
> > so I guess the assumption is just that those rs_length's are always a
> > multiple of four?
> 
> The example you recently gave was a two-byte NFS READ
> that crosses a page boundary.
> 
> In that case, the NFSD would pass down an xdr_buf that
> has one byte in a page, one byte in another page, and
> a two-byte XDR pad. The logic introduced by this
> optimization would be fooled, and neither the second
> byte nor the XDR pad would be written to the client.
> 
> Unless you can think of a way to recognize an XDR pad
> in the xdr_buf 100% of the time, you should drop this
> patch.

It might be best to make this explicit and mark bytes as padding somehow
while encoding.
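
For reference, a standalone sketch (plain userspace C, not the svcrdma code)
of the ambiguity under discussion: in the two-byte READ example, the second
data segment is also shorter than four bytes, so a "chunk_no && write_len < 4"
style test cannot tell real data from the roundup pad.  Segment boundaries
are assumed to follow page boundaries in this simplified model.

#include <stdio.h>

struct seg {
        const char *what;
        unsigned int len;
};

int main(void)
{
        /* Lengths a send_write_chunks()-style loop would see for a 2-byte
         * READ split across a page boundary, plus the 2-byte XDR pad. */
        struct seg segs[] = {
                { "data, last byte of page A",  1 },
                { "data, first byte of page B", 1 },
                { "XDR roundup pad",            2 },
        };
        unsigned int i;

        for (i = 0; i < sizeof(segs) / sizeof(segs[0]); i++)
                printf("segment %u: %-26s len=%u -> %s\n",
                       i, segs[i].what, segs[i].len,
                       (i && segs[i].len < 4) ? "skipped as pad" : "written");
        return 0;
}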

> As far as I know, none of the other patches in this
> series depend on this optimization, so please merge
> them if you can.

OK, I'll take a look at the rest, thanks.

--b.

> >> Probably too clever. Is there a better way to detect
> >> the XDR pad?
> >> 
> >> 
> >>>> +}
> >>>> +
> >>>>  chunk_off = 0;
> >>>>  while (write_len) {
> >>>>  ret = send_write(xprt, rqstp,
> >> 
> >> --
> >> Chuck Lever
> >> 
> >> 
> >> 
> 
> --
> Chuck Lever
> 
> 
> 


Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-21 Thread J. Bruce Fields
On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> Minor optimization: when dealing with write chunk XDR roundup, do
> not post a Write WR for the zero bytes in the pad. Simply update
> the write segment in the RPC-over-RDMA header to reflect the extra
> pad bytes.
> 
> The Reply chunk is also a write chunk, but the server does not use
> send_write_chunks() to send the Reply chunk. That's OK in this case:
> the server Upper Layer typically marshals the Reply chunk contents
> in a single contiguous buffer, without a separate tail for the XDR
> pad.
> 
> The comments and the variable naming refer to "chunks" but what is
> really meant is "segments." The existing code sends only one
> xdr_write_chunk per RPC reply.
> 
> The fix assumes this as well. When the XDR pad in the first write
> chunk is reached, the assumption is the Write list is complete and
> send_write_chunks() returns.
> 
> That will remain a valid assumption until the server Upper Layer can
> support multiple bulk payload results per RPC.
> 
> Signed-off-by: Chuck Lever 
> ---
>  net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 969a1ab..bad5eaa 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
>   arg_ch->rs_handle,
>   arg_ch->rs_offset,
>   write_len);
> +
> + /* Do not send XDR pad bytes */
> + if (chunk_no && write_len < 4) {
> + chunk_no++;
> + break;

I'm pretty lost in this code.  Why does (chunk_no && write_len < 4) mean
this is xdr padding?

> + }
> +
>   chunk_off = 0;
>   while (write_len) {
>   ret = send_write(xprt, rqstp,


Re: [PATCH v4 01/11] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-12-21 Thread J. Bruce Fields
On Mon, Dec 21, 2015 at 04:15:23PM -0500, Chuck Lever wrote:
> 
> > On Dec 21, 2015, at 4:07 PM, J. Bruce Fields <bfie...@fieldses.org> wrote:
> > 
> > On Mon, Dec 14, 2015 at 04:30:09PM -0500, Chuck Lever wrote:
> >> Minor optimization: when dealing with write chunk XDR roundup, do
> >> not post a Write WR for the zero bytes in the pad. Simply update
> >> the write segment in the RPC-over-RDMA header to reflect the extra
> >> pad bytes.
> >> 
> >> The Reply chunk is also a write chunk, but the server does not use
> >> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >> the server Upper Layer typically marshals the Reply chunk contents
> >> in a single contiguous buffer, without a separate tail for the XDR
> >> pad.
> >> 
> >> The comments and the variable naming refer to "chunks" but what is
> >> really meant is "segments." The existing code sends only one
> >> xdr_write_chunk per RPC reply.
> >> 
> >> The fix assumes this as well. When the XDR pad in the first write
> >> chunk is reached, the assumption is the Write list is complete and
> >> send_write_chunks() returns.
> >> 
> >> That will remain a valid assumption until the server Upper Layer can
> >> support multiple bulk payload results per RPC.
> >> 
> >> Signed-off-by: Chuck Lever <chuck.le...@oracle.com>
> >> ---
> >> net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
> >> 1 file changed, 7 insertions(+)
> >> 
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> >> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> index 969a1ab..bad5eaa 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma 
> >> *xprt,
> >>arg_ch->rs_handle,
> >>arg_ch->rs_offset,
> >>write_len);
> >> +
> >> +  /* Do not send XDR pad bytes */
> >> +  if (chunk_no && write_len < 4) {
> >> +  chunk_no++;
> >> +  break;
> > 
> > I'm pretty lost in this code.  Why does (chunk_no && write_len < 4) mean
> > this is xdr padding?
> 
> Chunk zero is always data. Padding is always going to be
> after the first chunk. Any chunk after chunk zero that is
> shorter than XDR quad alignment is going to be a pad.

I don't really know what a chunk is...  Looking at the code:

write_len = min(xfer_len, be32_to_cpu(arg_ch->rs_length));

so I guess the assumption is just that those rs_length's are always a
multiple of four?

--b.

> 
> Probably too clever. Is there a better way to detect
> the XDR pad?
> 
> 
> >> +  }
> >> +
> >>chunk_off = 0;
> >>while (write_len) {
> >>ret = send_write(xprt, rqstp,
> 
> --
> Chuck Lever
> 
> 
> 


Re: warning in ext4 with nfs/rdma server

2015-12-16 Thread J. Bruce Fields
On Tue, Dec 08, 2015 at 07:31:56AM -0600, Steve Wise wrote:
> 
> 
> > -Original Message-
> > From: Chuck Lever [mailto:chuck.le...@oracle.com]
> > Sent: Monday, December 07, 2015 9:45 AM
> > To: Steve Wise
> > Cc: linux-rdma@vger.kernel.org; Veeresh U. Kokatnur; Linux NFS Mailing List
> > Subject: Re: warning in ext4 with nfs/rdma server
> > 
> > Hi Steve-
> > 
> > > On Dec 7, 2015, at 10:38 AM, Steve Wise  
> > > wrote:
> > >
> > > Hey Chuck/NFS developers,
> > >
> > > We're hitting this warning in ext4 on the linux-4.3 nfs server running 
> > > over RDMA/cxgb4.  We're still gathering data, like if it
> > > happens with NFS/TCP.  But has anyone seen this warning on 4.3?  Is it 
> > > likely to indicate some bug in the xprtrdma transport or
> > > above it in NFS?
> > 
> > Yes, please confirm with NFS/TCP. Thanks!
> >
> 
> The same thing happens with NFS/TCP, so this isn't related to xprtrdma.
>  
> > 
> > > We can hit this running cthon tests over 2 mount points:
> > >
> > > -
> > > #!/bin/bash
> > > rm -rf /root/cthon04/loop_iter.txt
> > > while [ 1 ]
> > > do
> > > {
> > >
> > > ./server -s -m /mnt/share1 -o rdma,port=20049,vers=4 -p /mnt/share1 -N 100
> > > 102.1.1.162 &
> > > ./server -s -m /mnt/share2 -o 
> > > rdma,port=20049,vers=3,rsize=65535,wsize=65535 -p
> > > /mnt/share2 -N 100 102.2.2.162 &
> > > wait
> > > echo "iteration $i" >>/root/cthon04/loop_iter.txt
> > > date >>/root/cthon04/loop_iter.txt
> > > }
> > > done
> > > --
> > >
> > > Thanks,
> > >
> > > Steve.
> > >
> > > [ cut here ]
> > > WARNING: CPU: 14 PID: 6689 at fs/ext4/inode.c:231 
> > > ext4_evict_inode+0x41e/0x490

Looks like this is the

WARN_ON(atomic_read(&EXT4_I(inode)->i_ioend_count));

in ext4_evict_inode?  Ext4 developers, any idea how that could happen?

--b.


> > > [ext4]()
> > > Modules linked in: nfsd(E) lockd(E) grace(E) nfs_acl(E) exportfs(E)
> > > auth_rpcgss(E) rpcrdma(E) sunrpc(E) rdma_ucm(E) ib_uverbs(E) rdma_cm(E)
> > > ib_cm(E) ib_sa(E) ib_mad(E) iw_cxgb4(E) iw_cm(E) ib_core(E) ib_addr(E) 
> > > cxgb4(E)
> > > autofs4(E) target_core_iblock(E) target_core_file(E) target_core_pscsi(E)
> > > target_core_mod(E) configfs(E) bnx2fc(E) cnic(E) uio(E) fcoe(E) libfcoe(E)
> > > 8021q(E) libfc(E) garp(E) stp(E) llc(E) cpufreq_ondemand(E) cachefiles(E)
> > > fscache(E) ipv6(E) dm_mirror(E) dm_region_hash(E) dm_log(E) vhost_net(E)
> > > macvtap(E) macvlan(E) vhost(E) tun(E) kvm(E) uinput(E) microcode(E) sg(E)
> > > pcspkr(E) serio_raw(E) fam15h_power(E) k10temp(E) amd64_edac_mod(E)
> > > edac_core(E) edac_mce_amd(E) i2c_piix4(E) igb(E) dca(E) i2c_algo_bit(E)
> > > i2c_core(E) ptp(E) pps_core(E) scsi_transport_fc(E) acpi_cpufreq(E) 
> > > dm_mod(E)
> > > ext4(E) jbd2(E) mbcache(E) sr_mod(E) cdrom(E) sd_mod(E) ahci(E) libahci(E)
> > > [last unloaded: cxgb4]
> > > CPU: 14 PID: 6689 Comm: nfsd Tainted: GE   4.3.0 #1
> > > Hardware name: Supermicro H8QGL/H8QGL, BIOS 3.512/19/2013
> > > 00e7 88400634fad8 812a4084 a00c96eb
> > >  88400634fb18 81059fd5 88400634fbd8
> > > 880fd1a460c8 880fd1a461d8 880fd1a46008 88400634fbd8
> > > Call Trace:
> > > [] dump_stack+0x48/0x64
> > > [] warn_slowpath_common+0x95/0xe0
> > > [] warn_slowpath_null+0x1a/0x20
> > > [] ext4_evict_inode+0x41e/0x490 [ext4]
> > > [] evict+0xae/0x1a0
> > > [] iput_final+0xe5/0x170
> > > [] iput+0xa3/0xf0
> > > [] ? fsnotify_destroy_marks+0x64/0x80
> > > [] dentry_unlink_inode+0xa9/0xe0
> > > [] d_delete+0xa6/0xb0
> > > [] vfs_unlink+0x138/0x140
> > > [] nfsd_unlink+0x165/0x200 [nfsd]
> > > [] ? lru_put_end+0x5c/0x70 [nfsd]
> > > [] nfsd3_proc_remove+0x83/0x120 [nfsd]
> > > [] nfsd_dispatch+0xdc/0x210 [nfsd]
> > > [] svc_process_common+0x311/0x620 [sunrpc]
> > > [] ? nfsd_set_nrthreads+0x1b0/0x1b0 [nfsd]
> > > [] svc_process+0x128/0x1b0 [sunrpc]
> > > [] nfsd+0xf3/0x160 [nfsd]
> > > [] kthread+0xcc/0xf0
> > > [] ? schedule_tail+0x1e/0xc0
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > [] ret_from_fork+0x3f/0x70
> > > [] ? kthread_freezable_should_stop+0x70/0x70
> > > ---[ end trace 39afe9aeef2cfb34 ]---
> > > [ cut here ]
> > 
> > --
> > Chuck Lever
> > 
> > 
> 
> 


Re: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-11-12 Thread J. Bruce Fields
On Wed, Nov 11, 2015 at 03:20:33PM -0500, Chuck Lever wrote:
> 
> > On Nov 3, 2015, at 3:10 PM, Chuck Lever  wrote:
> > 
> > 
> >> On Oct 16, 2015, at 9:30 AM, Chuck Lever  wrote:
> >> 
> >> Minor optimization: when dealing with write chunk XDR roundup, do
> >> not post a Write WR for the zero bytes in the pad. Simply update
> >> the write segment in the RPC-over-RDMA header to reflect the extra
> >> pad bytes.
> >> 
> >> The Reply chunk is also a write chunk, but the server does not use
> >> send_write_chunks() to send the Reply chunk. That's OK in this case:
> >> the server Upper Layer typically marshals the Reply chunk contents
> >> in a single contiguous buffer, without a separate tail for the XDR
> >> pad.
> >> 
> >> The comments and the variable naming refer to "chunks" but what is
> >> really meant is "segments." The existing code sends only one
> >> xdr_write_chunk per RPC reply.
> >> 
> >> The fix assumes this as well. When the XDR pad in the first write
> >> chunk is reached, the assumption is the Write list is complete and
> >> send_write_chunks() returns.
> >> 
> >> That will remain a valid assumption until the server Upper Layer can
> >> support multiple bulk payload results per RPC.
> >> 
> >> Signed-off-by: Chuck Lever 
> >> --
> > 
> > Bruce, can you take this for 4.4 ?
> 
> Hi Bruce-
> 
> Your 4.4 pull request did not include this patch.

Whoops, sorry.

For some reason I don't have the original email, though I do see your
first followup.  Would you mind resending?

Also, it looks like that wasn't actually sent to me?  I really want
submitted patches to have my address on the To: line.  Obviously I also
try to pick up stuff sent only to the mailing list, but I'm a little
more likely to miss those.

But my fault either way, apologies.

--b.

> 
> 
> >> net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
> >> 1 files changed, 7 insertions(+), 0 deletions(-)
> >> 
> >> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> >> b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> index 1dfae83..6c48901 100644
> >> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> >> @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma 
> >> *xprt,
> >>arg_ch->rs_handle,
> >>arg_ch->rs_offset,
> >>write_len);
> >> +
> >> +  /* Do not send XDR pad bytes */
> >> +  if (chunk_no && write_len < 4) {
> >> +  chunk_no++;
> >> +  break;
> >> +  }
> >> +
> >>chunk_off = 0;
> >>while (write_len) {
> >>ret = send_write(xprt, rqstp,
> >> 
> > 
> > —
> > Chuck Lever
> > 
> > 
> > 
> 
> —
> Chuck Lever
> 
> 


Re: Fwd: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk

2015-11-12 Thread J. Bruce Fields
On Thu, Nov 12, 2015 at 09:47:13AM -0500, Chuck Lever wrote:
> I wish git send-email had an address book. I’m really tired
> of misspelling the to/cc addresses on patches.

Hah.  Yeah, my hack is a text file of addresses that I cut-and-paste
from.

--b.

> 
> Resending so there is one thread for all replies. Apologies
> for the noise.
> 
> 
> > Begin forwarded message:
> > 
> > From: Chuck Lever 
> > Subject: [PATCH] svcrdma: Do not send XDR roundup bytes for a write chunk
> > Date: November 12, 2015 at 9:44:33 AM EST
> > To: bfie...@fieldses.org
> > Cc: linux-...@vger.kernel.org, linux-r...@vger.kernel.org
> > 
> > Minor optimization: when dealing with write chunk XDR roundup, do
> > not post a Write WR for the zero bytes in the pad. Simply update
> > the write segment in the RPC-over-RDMA header to reflect the extra
> > pad bytes.
> > 
> > The Reply chunk is also a write chunk, but the server does not use
> > send_write_chunks() to send the Reply chunk. That's OK in this case:
> > the server Upper Layer typically marshals the Reply chunk contents
> > in a single contiguous buffer, without a separate tail for the XDR
> > pad.
> > 
> > The comments and the variable naming refer to "chunks" but what is
> > really meant is "segments." The existing code sends only one
> > xdr_write_chunk per RPC reply.
> > 
> > The fix assumes this as well. When the XDR pad in the first write
> > chunk is reached, the assumption is the Write list is complete and
> > send_write_chunks() returns.
> > 
> > That will remain a valid assumption until the server Upper Layer can
> > support multiple bulk payload results per RPC.
> > 
> > Signed-off-by: Chuck Lever 
> > ---
> > net/sunrpc/xprtrdma/svc_rdma_sendto.c |7 +++
> > 1 file changed, 7 insertions(+)
> > 
> > diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
> > b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > index 1dfae83..6c48901 100644
> > --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> > @@ -342,6 +342,13 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> > arg_ch->rs_handle,
> > arg_ch->rs_offset,
> > write_len);
> > +
> > +   /* Do not send XDR pad bytes */
> > +   if (chunk_no && write_len < 4) {
> > +   chunk_no++;
> > +   break;
> > +   }
> > +
> > chunk_off = 0;
> > while (write_len) {
> > ret = send_write(xprt, rqstp,
> > 
> 
> --
> Chuck Lever
> 
> 
> 


Re: [PATCH] svcrdma: Fix NFS server crash triggered by 1MB NFS WRITE

2015-10-12 Thread J. Bruce Fields
On Mon, Oct 12, 2015 at 10:53:39AM -0400, Chuck Lever wrote:
> Now that the NFS server advertises a maximum payload size of 1MB
> for RPC/RDMA again, it crashes in svc_process_common() when NFS
> client sends a 1MB NFS WRITE on an NFS/RDMA mount.
> 
> The server has set up a 259 element array of struct page pointers
> in rq_pages[] for each incoming request. The last element of the
> array is NULL.
> 
> When an incoming request has been completely received,
> rdma_read_complete() attempts to set the starting page of the
> incoming page vector:
> 
>   rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
> 
> and the page to use for the reply:
> 
>   rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> 
> But the value of page_no has already accounted for head->hdr_count.
> Thus rq_respages now points past the end of the incoming pages.
> 
> For NFS WRITE operations smaller than the maximum, this is harmless.
> But when the NFS WRITE operation is as large as the server's max
> payload size, rq_respages now points at the last entry in rq_pages,
> which is NULL.
> 
> Fixes: cc9a903d915c ('svcrdma: Change maximum server payload . . .')
> BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=270
> Signed-off-by: Chuck Lever 
> Reviewed-by: Sagi Grimberg 
> Reviewed-by: Steve Wise 
> Reviewed-by: Shirley Ma 
> ---
> 
> Hi Bruce-
> 
> This is a regression in 4.3. Can you send this to Linus?

OK, queuing for 4.3, thanks.

--b.

> 
> 
>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index cb51742..37b4341 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -531,7 +531,7 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
>   rqstp->rq_arg.page_base = head->arg.page_base;
>  
>   /* rq_respages starts after the last arg page */
> - rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> + rqstp->rq_respages = &rqstp->rq_pages[page_no];
>   rqstp->rq_next_page = rqstp->rq_respages + 1;
>  
>   /* Rebuild rq_arg head and tail. */
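
For reference, a standalone sketch (plain C with made-up but representative
counts, not the kernel code) of the double-counted hdr_count that the
one-line change above removes:

#include <stdio.h>

#define RQ_PAGES 259    /* 258 usable entries plus a NULL terminator */

int main(void)
{
        char dummy;
        void *rq_pages[RQ_PAGES];
        void **rq_arg_pages;
        int hdr_count = 1;              /* pages holding the RPC/RDMA header */
        int data_pages = 256;           /* 1 MB payload with 4 KiB pages */
        int page_no, i;

        for (i = 0; i < RQ_PAGES - 1; i++)
                rq_pages[i] = &dummy;
        rq_pages[RQ_PAGES - 1] = NULL;  /* the last element is NULL */

        rq_arg_pages = &rq_pages[hdr_count];    /* rq_arg.pages */
        page_no = hdr_count + data_pages;       /* already includes hdr_count */

        /* Buggy: indexes rq_pages[2 * hdr_count + data_pages], the NULL slot. */
        printf("buggy rq_respages -> rq_pages[%ld] (%s)\n",
               (long)(&rq_arg_pages[page_no] - rq_pages),
               rq_arg_pages[page_no] ? "valid" : "NULL");
        /* Fixed: indexes rq_pages[page_no] directly. */
        printf("fixed rq_respages -> rq_pages[%d] (%s)\n",
               page_no, rq_pages[page_no] ? "valid" : "NULL");
        return 0;
}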


Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset

2015-10-07 Thread J. Bruce Fields
On Wed, Oct 07, 2015 at 01:33:05PM -0400, Doug Ledford wrote:
> On 10/07/2015 01:01 PM, J. Bruce Fields wrote:
> > On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> >> On 09/28/2015 05:46 PM, Steve Wise wrote:
> >>> The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> >>> were not taking into account the initial page_offset when determining
> >>> the rdma read length.  This resulted in a read whose starting address
> >>> and length exceeded the base/bounds of the frmr.
> >>>
> >>> Most work loads don't tickle this bug apparently, but one test hit it
> >>> every time: building the linux kernel on a 16 core node with 'make -j
> >>> 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> >>>
> >>> This bug seems to only be tripped with devices having small fastreg page
> >>> list depths.  I didn't see it with mlx4, for instance.
> >>
> >> Bruce, what's your take on this?  Do you want to push this through or
> >> would you care if I push it through my tree?
> > 
> > Whoops, sorry, I meant to send a pull request for that last week.  Uh, I
> > think I'll go ahead and do that now if it's OK with you.
> 
> Fine with me.  I was just trying to make sure it didn't get forgotten ;-)

Understood, thanks!  I've sent the pull request.

--b.

> 
> > --b.
> > 
> >>
> >>> Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> >>> Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> >>> Tested-by: Chuck Lever <chuck.le...@oracle.com>
> >>> ---
> >>>
> >>>  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
> >>>  1 files changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> >>> b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> index cb51742..5f6ca47 100644
> >>> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> >>> @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> >>>   ctxt->direction = DMA_FROM_DEVICE;
> >>>   ctxt->read_hdr = head;
> >>>   pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> >>> - read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> + read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +  rs_length);
> >>>  
> >>>   for (pno = 0; pno < pages_needed; pno++) {
> >>>   int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> >>> @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> >>>   ctxt->direction = DMA_FROM_DEVICE;
> >>>   ctxt->frmr = frmr;
> >>>   pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> >>> - read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> >>> + read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> >>> +  rs_length);
> >>>  
> >>>   frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> >>>   frmr->direction = DMA_FROM_DEVICE;
> >>>
> >>>
> >>
> >>
> >> -- 
> >> Doug Ledford <dledf...@redhat.com>
> >>   GPG KeyID: 0E572FDD
> >>
> >>
> > 
> > 
> 
> 
> -- 
> Doug Ledford <dledf...@redhat.com>
>   GPG KeyID: 0E572FDD
> 
> 




Re: [PATCH RESEND] svcrdma: handle rdma read with a non-zero initial page offset

2015-10-07 Thread J. Bruce Fields
On Tue, Oct 06, 2015 at 01:44:25PM -0400, Doug Ledford wrote:
> On 09/28/2015 05:46 PM, Steve Wise wrote:
> > The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > were not taking into account the initial page_offset when determining
> > the rdma read length.  This resulted in a read whose starting address
> > and length exceeded the base/bounds of the frmr.
> > 
> > Most work loads don't tickle this bug apparently, but one test hit it
> > every time: building the linux kernel on a 16 core node with 'make -j
> > 16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > 
> > This bug seems to only be tripped with devices having small fastreg page
> > list depths.  I didn't see it with mlx4, for instance.
> 
> Bruce, what's your take on this?  Do you want to push this through or
> would you care if I push it through my tree?

Whoops, sorry, I meant to send a pull request for that last week.  Uh, I
think I'll go ahead and do that now if it's OK with you.

--b.

> 
> > Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > Signed-off-by: Steve Wise 
> > Tested-by: Chuck Lever 
> > ---
> > 
> >  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |6 --
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
> > b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > index cb51742..5f6ca47 100644
> > --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> > @@ -136,7 +136,8 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> > ctxt->direction = DMA_FROM_DEVICE;
> > ctxt->read_hdr = head;
> > pages_needed = min_t(int, pages_needed, xprt->sc_max_sge_rd);
> > -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +rs_length);
> >  
> > for (pno = 0; pno < pages_needed; pno++) {
> > int len = min_t(int, rs_length, PAGE_SIZE - pg_off);
> > @@ -235,7 +236,8 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> > ctxt->direction = DMA_FROM_DEVICE;
> > ctxt->frmr = frmr;
> > pages_needed = min_t(int, pages_needed, xprt->sc_frmr_pg_list_len);
> > -   read = min_t(int, pages_needed << PAGE_SHIFT, rs_length);
> > +   read = min_t(int, (pages_needed << PAGE_SHIFT) - *page_offset,
> > +rs_length);
> >  
> > frmr->kva = page_address(rqstp->rq_arg.pages[pg_no]);
> > frmr->direction = DMA_FROM_DEVICE;
> > 
> > 
> 
> 
> -- 
> Doug Ledford 
>   GPG KeyID: 0E572FDD
> 
> 




Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset

2015-09-29 Thread 'J. Bruce Fields'
On Mon, Sep 28, 2015 at 04:49:21PM -0500, Steve Wise wrote:
> 
> 
> > -Original Message-
> > From: J. Bruce Fields [mailto:bfie...@fieldses.org]
> > Sent: Monday, September 28, 2015 4:05 PM
> > To: Steve Wise
> > Cc: trond.mykleb...@primarydata.com; linux-...@vger.kernel.org; 
> > linux-rdma@vger.kernel.org
> > Subject: Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial 
> > page offset
> > 
> > On Mon, Sep 28, 2015 at 09:31:25AM -0500, Steve Wise wrote:
> > > On 9/21/2015 12:24 PM, Steve Wise wrote:
> > > >The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> > > >were not taking into account the initial page_offset when determining
> > > >the rdma read length.  This resulted in a read whose starting address
> > > >and length exceeded the base/bounds of the frmr.
> > > >
> > > >Most work loads don't tickle this bug apparently, but one test hit it
> > > >every time: building the linux kernel on a 16 core node with 'make -j
> > > >16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> > > >
> > > >This bug seems to only be tripped with devices having small fastreg page
> > > >list depths.  I didn't see it with mlx4, for instance.
> > > >
> > > >Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> > > >Signed-off-by: Steve Wise <sw...@opengridcomputing.com>
> > > >Tested-by: Chuck Lever <chuck.le...@oracle.com>
> > > >---
> > > >
> > > >
> > >
> > > Hey Bruce, can this make 4.3-rc?  Also, what do you think about
> > > pushing it to stable?
> > 
> > It looks like a reasonable candidate for stable.  Apologies, somehow I
> > missed it when you posted it--would you mind resending?
> > 
> > --b.
> 
> resent this one patch.
> 
> What is your process for pushing to stable?

Currently my standard regression tests don't seem to be passing on
recent 4.3; once I figure that out, I'll push out a branch with your
patch (and any others required) to

git://linux-nfs.org/~bfields for-4.3

and then send a pull request to Linus.  Should be by the end of the
week.

--b.


Re: [PATCH 2/3] svcrdma: handle rdma read with a non-zero initial page offset

2015-09-28 Thread J. Bruce Fields
On Mon, Sep 28, 2015 at 09:31:25AM -0500, Steve Wise wrote:
> On 9/21/2015 12:24 PM, Steve Wise wrote:
> >The server rdma_read_chunk_lcl() and rdma_read_chunk_frmr() functions
> >were not taking into account the initial page_offset when determining
> >the rdma read length.  This resulted in a read whose starting address
> >and length exceeded the base/bounds of the frmr.
> >
> >Most work loads don't tickle this bug apparently, but one test hit it
> >every time: building the linux kernel on a 16 core node with 'make -j
> >16 O=/mnt/0' where /mnt/0 is a ramdisk mounted via NFSRDMA.
> >
> >This bug seems to only be tripped with devices having small fastreg page
> >list depths.  I didn't see it with mlx4, for instance.
> >
> >Fixes: 0bf4828983df ('svcrdma: refactor marshalling logic')
> >Signed-off-by: Steve Wise 
> >Tested-by: Chuck Lever 
> >---
> >
> >
> 
> Hey Bruce, can this make 4.3-rc?  Also, what do you think about
> pushing it to stable?

It looks like a reasonable candidate for stable.  Apologies, somehow I
missed it when you posted it--would you mind resending?

--b.
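
For reference, the overrun Steve describes can be shown with simple
arithmetic.  A standalone sketch (plain C, made-up numbers, 4 KiB pages
assumed; not the kernel code):

#include <stdio.h>

#define SKETCH_PAGE_SHIFT 12            /* assume 4 KiB pages */

static unsigned int min_u(unsigned int a, unsigned int b)
{
        return a < b ? a : b;
}

int main(void)
{
        unsigned int page_offset = 512;         /* data starts mid-page */
        unsigned int pages_needed = 2;          /* pages the FRMR covers */
        unsigned int rs_length = 8192;          /* client-advertised length */
        unsigned int avail = (pages_needed << SKETCH_PAGE_SHIFT) - page_offset;

        /* Old calculation: ignores page_offset, so base + length can run
         * past the registered region. */
        unsigned int old_read = min_u(pages_needed << SKETCH_PAGE_SHIFT, rs_length);
        /* Fixed calculation from the patch: subtract the initial offset. */
        unsigned int new_read = min_u(avail, rs_length);

        printf("bytes available in FRMR: %u\n", avail);
        printf("old read length: %u (%s)\n", old_read,
               old_read > avail ? "exceeds the FRMR bounds" : "ok");
        printf("new read length: %u (%s)\n", new_read,
               new_read > avail ? "exceeds the FRMR bounds" : "ok");
        return 0;
}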


Re: [PATCH] svcrdma: limit FRMR page list lengths to device max

2015-08-07 Thread J. Bruce Fields
On Fri, Aug 07, 2015 at 11:11:20AM -0500, Steve Wise wrote:
 Svcrdma was incorrectly allocating fastreg MRs and page lists using
 RPCSVC_MAXPAGES, which can exceed the device capabilities.  So limit
 the depth to the minimum of RPCSVC_MAXPAGES and xprt->sc_frmr_pg_list_len.
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 Doug, this patch needs to be added after this commit:
 
 e20684a xprtrdma, svcrdma: Convert to ib_alloc_mr
 
 and before these commits:
 
 af78181 cxgb3: Support ib_alloc_mr verb
 b7e06cd iw_cxgb4: Support ib_alloc_mr verb
 
 This will avoid a bisect window where NFSRDMA over cxgb4 is broken.
 
 Bruce, please ACK if this commit looks good, and also if you're ok with
 this flowing through Doug's rdma tree due to the dependencies.

Fine by me.  On a quick check the only pending change in my tree nearby
there is 31193fe5f6fb svcrdma: Remove svc_rdma_fastreg(), which
shouldn't conflict.

Acked-by: J. Bruce Fields bfie...@redhat.com

--b.

 ---
  net/sunrpc/xprtrdma/svc_rdma_transport.c |6 --
  1 files changed, 4 insertions(+), 2 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index 8752a2d..11d5133 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -733,17 +733,19 @@ static struct svc_rdma_fastreg_mr 
 *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
   struct ib_mr *mr;
   struct ib_fast_reg_page_list *pl;
   struct svc_rdma_fastreg_mr *frmr;
 + u32 num_sg;
  
   frmr = kmalloc(sizeof(*frmr), GFP_KERNEL);
   if (!frmr)
   goto err;
  
 - mr = ib_alloc_mr(xprt->sc_pd, IB_MR_TYPE_MEM_REG, RPCSVC_MAXPAGES);
 + num_sg = min_t(u32, RPCSVC_MAXPAGES, xprt->sc_frmr_pg_list_len);
 + mr = ib_alloc_mr(xprt->sc_pd, IB_MR_TYPE_MEM_REG, num_sg);
   if (IS_ERR(mr))
   goto err_free_frmr;
  
   pl = ib_alloc_fast_reg_page_list(xprt->sc_cm_id->device,
 -  RPCSVC_MAXPAGES);
 +  num_sg);
   if (IS_ERR(pl))
   goto err_free_mr;
  
 -- 
 1.7.1


Re: [PATCH RESEND 0/4] Use max_sge_rd device capability

2015-07-28 Thread J. Bruce Fields
On Mon, Jul 27, 2015 at 06:09:55PM -0500, Steve Wise wrote:
 Resending because I forgot to cc linux-rdma :(
 
 Some devices were not setting this capability, so fix those devices,
 and svcrdma to use max_sge_rd.  Also remove rdma_cap_read_multi_sge()
 since it isn't needed.
 
 These patches were originally part of:
 
 http://www.spinics.net/lists/linux-rdma/msg27436.html
 
 They really aren't part of iSER/iWARP at all, so I've split 
 them out.  
 
 Bruce: This hits svcrdma, but I suggest they get merged via Doug's tree
 to avoid any merge problems.

Sounds fine.  I've got some changes to those files in my for-next as
well, but they don't appear to conflict.

Feel free to add my

Acked-by: J. Bruce Fields bfie...@redhat.com

to patch 3/4 if it helps.

--b.

 
 ---
 
 Sagi Grimberg (1):
   mlx4, mlx5, mthca: Expose max_sge_rd correctly
 
 Steve Wise (3):
   RDMA/Core: remove rdma_cap_read_multi_sge() helper
   svcrdma: Use max_sge_rd for destination read depths
   ipath,qib: Expose max_sge_rd correctly
 
 
  drivers/infiniband/hw/ipath/ipath_verbs.c|1 +
  drivers/infiniband/hw/mlx4/main.c|1 +
  drivers/infiniband/hw/mlx5/main.c|1 +
  drivers/infiniband/hw/mthca/mthca_provider.c |1 +
  drivers/infiniband/hw/qib/qib_verbs.c|1 +
  include/linux/sunrpc/svc_rdma.h  |1 +
  include/rdma/ib_verbs.h  |   28 
 --
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   12 +--
  net/sunrpc/xprtrdma/svc_rdma_transport.c |4 
  9 files changed, 11 insertions(+), 39 deletions(-)
 
 -- 
 
 Steve


Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum

2015-07-10 Thread J. Bruce Fields
On Fri, Jul 10, 2015 at 10:59:48AM -0400, Chuck Lever wrote:
 
 On Jul 10, 2015, at 10:18 AM, bfie...@fieldses.org wrote:
 
  On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
  Increased to 1 megabyte.
  
  Why not more or less?
  
  Why do we even have this constant, why shouldn't we just use
  RPCSVC_MAXPAYLOAD?
 
 The payload size maximum for RDMA is based on RPCRDMA_MAX_SVC_SEGS.
 We could reverse the relationship and make RPCRDMA_MAX_SVC_SEGS
 based on RPCSVC_MAXPAYLOAD divided by the platform’s page size.

But there'd be no reason to do that, because we're not using
RPCRDMA_MAX_SVC_SEGS anywhere.  Should we be?

--b.

 
 
  --b.
  
  
  Signed-off-by: Chuck Lever chuck.le...@oracle.com
  ---
  
  include/linux/sunrpc/svc_rdma.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/include/linux/sunrpc/svc_rdma.h 
  b/include/linux/sunrpc/svc_rdma.h
  index 13af61b..1bca6dd 100644
  --- a/include/linux/sunrpc/svc_rdma.h
  +++ b/include/linux/sunrpc/svc_rdma.h
  @@ -172,7 +172,7 @@ struct svcxprt_rdma {
   #define RDMAXPRT_SQ_PENDING    2
  #define RDMAXPRT_CONN_PENDING  3
  
   -#define RPCRDMA_MAX_SVC_SEGS  (64)    /* server max scatter/gather */
  +#define RPCRDMA_MAX_SVC_SEGS  (256)   /* server max scatter/gather */
   #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
  #define RPCRDMA_MAXPAYLOAD RPCSVC_MAXPAYLOAD
  #else
  
 
 --
 Chuck Lever
 
 


Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum

2015-07-10 Thread J. Bruce Fields
On Fri, Jul 10, 2015 at 10:18:14AM -0400, J. Bruce Fields wrote:
 On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
  Increased to 1 megabyte.
 
 Why not more or less?
 
 Why do we even have this constant, why shouldn't we just use
 RPCSVC_MAXPAYLOAD?

(That one question aside these look fine, I'll apply unless I hear
otherwise.)

--b.


Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum

2015-07-10 Thread J. Bruce Fields
On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
 Increased to 1 megabyte.

Why not more or less?

Why do we even have this constant, why shouldn't we just use
RPCSVC_MAXPAYLOAD?

--b.

 
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
 
  include/linux/sunrpc/svc_rdma.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
 index 13af61b..1bca6dd 100644
 --- a/include/linux/sunrpc/svc_rdma.h
 +++ b/include/linux/sunrpc/svc_rdma.h
 @@ -172,7 +172,7 @@ struct svcxprt_rdma {
  #define RDMAXPRT_SQ_PENDING  2
   #define RDMAXPRT_CONN_PENDING  3
  
  -#define RPCRDMA_MAX_SVC_SEGS (64)    /* server max scatter/gather */
 +#define RPCRDMA_MAX_SVC_SEGS (256)   /* server max scatter/gather */
   #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
  #define RPCRDMA_MAXPAYLOAD   RPCSVC_MAXPAYLOAD
  #else
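
For reference, the arithmetic behind "increased to 1 megabyte", as a
standalone sketch assuming 4 KiB pages (not the kernel code):

#include <stdio.h>

#define SKETCH_PAGE_SHIFT    12         /* assume 4 KiB pages */
#define RPCRDMA_MAX_SVC_SEGS 256        /* value proposed in the patch */

int main(void)
{
        /* 256 scatter/gather segments of one page each. */
        printf("max RDMA payload: %u bytes (%u KiB)\n",
               RPCRDMA_MAX_SVC_SEGS << SKETCH_PAGE_SHIFT,
               (RPCRDMA_MAX_SVC_SEGS << SKETCH_PAGE_SHIFT) >> 10);
        return 0;
}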
 


Re: [PATCH v1 5/5] svcrdma: Boost NFS READ/WRITE payload size maximum

2015-07-10 Thread J. Bruce Fields
On Fri, Jul 10, 2015 at 11:59:20AM -0400, Chuck Lever wrote:
 
 On Jul 10, 2015, at 11:54 AM, J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Fri, Jul 10, 2015 at 10:59:48AM -0400, Chuck Lever wrote:
  
  On Jul 10, 2015, at 10:18 AM, bfie...@fieldses.org wrote:
  
  On Thu, Jul 09, 2015 at 04:45:47PM -0400, Chuck Lever wrote:
  Increased to 1 megabyte.
  
  Why not more or less?
  
  Why do we even have this constant, why shouldn't we just use
  RPCSVC_MAXPAYLOAD?
  
  The payload size maximum for RDMA is based on RPCRDMA_MAX_SVC_SEGS.
  We could reverse the relationship and make RPCRDMA_MAX_SVC_SEGS
  based on RPCSVC_MAXPAYLOAD divided by the platform’s page size.
  
  But there'd be no reason to do that, because we're not using
  RPCRDMA_MAX_SVC_SEGS anywhere.  Should we be?
 
 Let me try using RPCSVC_MAXPAYLOAD. That is based on RPCSVC_MAXPAGES,
 which is actually used in svc_rdma_*.c

OK, thanks, that sounds like that would make more sense.

--b.

 
 
  --b.
  
  
  
  --b.
  
  
  Signed-off-by: Chuck Lever chuck.le...@oracle.com
  ---
  
  include/linux/sunrpc/svc_rdma.h |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
  
  diff --git a/include/linux/sunrpc/svc_rdma.h 
  b/include/linux/sunrpc/svc_rdma.h
  index 13af61b..1bca6dd 100644
  --- a/include/linux/sunrpc/svc_rdma.h
  +++ b/include/linux/sunrpc/svc_rdma.h
  @@ -172,7 +172,7 @@ struct svcxprt_rdma {
  #define RDMAXPRT_SQ_PENDING  2
   #define RDMAXPRT_CONN_PENDING  3
  
   -#define RPCRDMA_MAX_SVC_SEGS  (64)    /* server max scatter/gather */
   +#define RPCRDMA_MAX_SVC_SEGS  (256)   /* server max scatter/gather */
   #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_SVC_SEGS << PAGE_SHIFT)
  #define RPCRDMA_MAXPAYLOAD   RPCSVC_MAXPAYLOAD
  #else
  
  
  --
  Chuck Lever
  
  
 
 --
 Chuck Lever
 
 


Re: NFSoRDMA bi-weekly developers meeting minutes (5/28/15)

2015-06-03 Thread J. Bruce Fields
On Thu, May 28, 2015 at 11:11:30AM -0700, Shirley Ma wrote:
 NFSD workqueue mode vs. threading mode performance impacts: (Shirley)
 -
 Shirley has spent some time testing, fixing, and evaluating the NFSD
 workqueue mode originally implemented by Jeff Layton. The NFSD workqueue
 mode benefits some workloads while the global threading mode benefits
 others; in general the workqueue mode is better than a per-cpu threading
 mode. The workqueue is per-cpu, while the global threading mode is not.
 There may be some room to make the NFSD workqueue mode no worse than the
 global threading mode, but that requires more in-depth performance
 investigation.
 
 Feel free to reply here for anything missing or incorrect. Thanks
 everyone for joining the call and providing valuable inputs/work to
 the community to make NFSoRDMA better.

Are you planning to repost those patches to the list with a summary of
the results so far?

I don't necessarily expect that we'll have all performance problems
worked out before merging, but we do need to judge the risk that the new
threading mode never overcomes those regressions so that we're stuck
supporting both old and new threading modes indefinitely.

--b.


Re: [PATCH v2 00/10] NFS/RDMA server for 3.20

2015-01-15 Thread J. Bruce Fields
On Tue, Jan 13, 2015 at 11:31:16AM -0600, Steve Wise wrote:
 
 Reviewed-by: Steve Wise sw...@opengridcomputing.com

Thanks, applying for 3.20.

--b.


Re: [PATCH v1 09/10] Move read list XDR round-up logic

2015-01-09 Thread J. Bruce Fields
On Fri, Jan 09, 2015 at 02:23:11PM -0500, Chuck Lever wrote:
 Simpler approach, and this needs to be done before the tail is
 copied to the end of the read list (pre-requisite for a subsequent
 patch).
 
 Fixes: e560e3b510d2 (svcrdma: Add zero padding if the client... )

Wait, does this actually fix a bug in that previous commit or is it
just cleanup?

If the latter, the Fixes:.. line is confusing, just mention it in the
text (Simpler approach compared to the original attempt in
e560e3b510d2...).

--b.

 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
 
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |   37 
 ---
  1 files changed, 9 insertions(+), 28 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 index 36cf51a..a345cad 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 @@ -43,7 +43,6 @@
  #include linux/sunrpc/debug.h
  #include linux/sunrpc/rpc_rdma.h
  #include linux/spinlock.h
 -#include linux/highmem.h
  #include asm/unaligned.h
  #include rdma/ib_verbs.h
  #include rdma/rdma_cm.h
 @@ -434,6 +433,15 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
   }
   }
  
 + /* Read list may need XDR round-up (see RFC 5666, s. 3.7) */
 + if (page_offset & 3) {
 + u32 pad = 4 - (page_offset & 3);
 +
 + head->arg.page_len += pad;
 + head->arg.len += pad;
 + head->arg.buflen += pad;
 + }
 +
   ret = 1;
   head->position = position;
  
 @@ -446,32 +454,6 @@ static int rdma_read_chunks(struct svcxprt_rdma *xprt,
   return ret;
  }
  
 -/*
 - * To avoid a separate RDMA READ just for a handful of zero bytes,
 - * RFC 5666 section 3.7 allows the client to omit the XDR zero pad
 - * in chunk lists.
 - */
 -static void
 -rdma_fix_xdr_pad(struct xdr_buf *buf)
 -{
 - unsigned int page_len = buf->page_len;
 - unsigned int size = (XDR_QUADLEN(page_len) << 2) - page_len;
 - unsigned int offset, pg_no;
 - char *p;
 -
 - if (size == 0)
 - return;
 -
 - pg_no = page_len >> PAGE_SHIFT;
 - offset = page_len & ~PAGE_MASK;
 - p = page_address(buf->pages[pg_no]);
 - memset(p + offset, 0, size);
 -
 - buf->page_len += size;
 - buf->buflen += size;
 - buf->len += size;
 -}
 -
  static int rdma_read_complete(struct svc_rqst *rqstp,
 struct svc_rdma_op_ctxt *head)
  {
 @@ -499,7 +481,6 @@ static int rdma_read_complete(struct svc_rqst *rqstp,
   }
  
   /* Point rq_arg.pages past header */
 - rdma_fix_xdr_pad(&head->arg);
   rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
   rqstp->rq_arg.page_len = head->arg.page_len;
   rqstp->rq_arg.page_base = head->arg.page_base;
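
As a quick sanity check, a standalone sketch (plain C, not kernel code)
showing that the new "4 - (page_offset & 3)" form and the removed
"(XDR_QUADLEN(len) << 2) - len" form compute the same pad whenever a pad
is needed:

#include <stdio.h>

/* Mirrors the kernel's XDR_QUADLEN: round a byte count up to 4-byte quads. */
#define XDR_QUADLEN(l) (((l) + 3) >> 2)

int main(void)
{
        unsigned int len;

        for (len = 0; len <= 12; len++) {
                unsigned int old_pad = (XDR_QUADLEN(len) << 2) - len;
                unsigned int new_pad = (len & 3) ? 4 - (len & 3) : 0;

                printf("len=%2u old pad=%u new pad=%u%s\n",
                       len, old_pad, new_pad,
                       old_pad == new_pad ? "" : " MISMATCH");
        }
        return 0;
}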
 


Re: [PATCH v1 00/10] NFS/RDMA server for 3.20

2015-01-09 Thread J. Bruce Fields
On Fri, Jan 09, 2015 at 02:21:54PM -0500, Chuck Lever wrote:
 In addition to miscellaneous clean up, the following series of
 patches for the Linux NFS server introduces support in the
 server’s RPC/RDMA transport implementation for RDMA_NOMSG type
 messages, and fixes a bug that prevents the server from handling
 RPC/RDMA messages with inline content following a read list.
 
 These patches are contained in the branch nfsd-rdma-for-3.20 at:
 
   git://git.linux-nfs.org/projects/cel/cel-2.6.git

They look OK to me; applying unless someone tells me otherwise.

--b.

 
 ---
 
 Chuck Lever (10):
   svcrdma: Handle additional inline content
   Move read list XDR round-up logic
   svcrdma: Support RDMA_NOMSG requests
   svcrdma: rc_position sanity checking
   svcrdma: Plant reader function in struct svcxprt_rdma
   svcrdma: Find rmsgp more reliably
   svcrdma: Scrub BUG_ON() and WARN_ON() call sites
   svcrdma: Clean up read chunk counting
   svcrdma: Remove unused variable
   svcrdma: Clean up dprintk
 
 
  include/linux/sunrpc/svc_rdma.h  |   13 +-
  net/sunrpc/xprtrdma/svc_rdma_marshal.c   |   16 --
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  251 
 +++---
  net/sunrpc/xprtrdma/svc_rdma_sendto.c|   46 +++--
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   47 +++---
  5 files changed, 224 insertions(+), 149 deletions(-)
 
 --
 Chuck Lever


Re: [PATCH v1 00/10] NFS/RDMA server for 3.20

2015-01-09 Thread J. Bruce Fields
On Fri, Jan 09, 2015 at 03:40:51PM -0500, Chuck Lever wrote:
 
 On Jan 9, 2015, at 3:39 PM, J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Fri, Jan 09, 2015 at 02:21:54PM -0500, Chuck Lever wrote:
  In addition to miscellaneous clean up, the following series of
  patches for the Linux NFS server introduces support in the
  server’s RPC/RDMA transport implementation for RDMA_NOMSG type
  messages, and fixes a bug that prevents the server from handling
  RPC/RDMA messages with inline content following a read list.
  
  These patches are contained in the branch nfsd-rdma-for-3.20 at:
  
   git://git.linux-nfs.org/projects/cel/cel-2.6.git
  
  They look OK to me; applying unless someone tells me otherwise.
 
 I’d like to update the patch descriptions in several of these before
 you apply the series.

OK, let me know.

--b.

 
  
  --b.
  
  
  ---
  
  Chuck Lever (10):
   svcrdma: Handle additional inline content
   Move read list XDR round-up logic
   svcrdma: Support RDMA_NOMSG requests
   svcrdma: rc_position sanity checking
   svcrdma: Plant reader function in struct svcxprt_rdma
   svcrdma: Find rmsgp more reliably
   svcrdma: Scrub BUG_ON() and WARN_ON() call sites
   svcrdma: Clean up read chunk counting
   svcrdma: Remove unused variable
   svcrdma: Clean up dprintk
  
  
  include/linux/sunrpc/svc_rdma.h  |   13 +-
  net/sunrpc/xprtrdma/svc_rdma_marshal.c   |   16 --
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  251 
  +++---
  net/sunrpc/xprtrdma/svc_rdma_sendto.c|   46 +++--
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   47 +++---
  5 files changed, 224 insertions(+), 149 deletions(-)
  
  --
  Chuck Lever
 
 --
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 


Re: [PATCH v2 06/10] xprtrdma: Enable pad optimization

2014-11-10 Thread J. Bruce Fields
On Mon, Nov 10, 2014 at 08:54:27AM -0600, Chuck Lever wrote:
 
 On Nov 10, 2014, at 8:36 AM, Anna Schumaker anna.schuma...@netapp.com wrote:
 
  Hey Chuck,
  
  
  On 11/08/2014 08:14 PM, Chuck Lever wrote:
  The Linux NFS/RDMA server used to reject NFSv3 WRITE requests when
  pad optimization was enabled. That bug was fixed by commit
  e560e3b510d2 (svcrdma: Add zero padding if the client doesn't send
  it).
  
  Do we need to worry about backwards compatibility with servers that don't 
  have this patch?
 
 My impression is that we have a window where the server is assumed not
 to work and thus is not enabled in distributions, and that therefore
 changes like this are allowed. I could be wrong. Bruce, any guidance
 on this?
 
 In any event, if things break, they break immediately, and the fix is
 simply to set this feature flag via /proc.

I don't think there's any hard-and-fast rule here, but my impression is
that nfs/rdma still isn't widely used, even less so with linux knfsd, so
on balance it's probably not worth much to humor older servers.

--b.

 
 
 
 
  Anna
  
  
  We can now enable pad optimization on the client, which helps
  performance and is supported now by both Linux and Solaris servers.
  
  Signed-off-by: Chuck Lever chuck.le...@oracle.com
  ---
  net/sunrpc/xprtrdma/transport.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)
  
  diff --git a/net/sunrpc/xprtrdma/transport.c 
  b/net/sunrpc/xprtrdma/transport.c
  index cfe9a81..8ed2576 100644
  --- a/net/sunrpc/xprtrdma/transport.c
  +++ b/net/sunrpc/xprtrdma/transport.c
  @@ -73,7 +73,7 @@ static unsigned int xprt_rdma_max_inline_read = 
  RPCRDMA_DEF_INLINE;
  static unsigned int xprt_rdma_max_inline_write = RPCRDMA_DEF_INLINE;
  static unsigned int xprt_rdma_inline_write_padding;
  static unsigned int xprt_rdma_memreg_strategy = RPCRDMA_FRMR;
  -int xprt_rdma_pad_optimize = 0;
   +int xprt_rdma_pad_optimize = 1;
  
  #ifdef RPC_DEBUG
  
  
  
 
 --
 Chuck Lever
 chuck[dot]lever[at]oracle[dot]com
 
 
 


Re: [PATCH V3] svcrdma: advertise the correct max payload

2014-09-29 Thread J. Bruce Fields
On Mon, Sep 29, 2014 at 11:07:25AM -0500, Steve Wise wrote:
 
 Hey Bruce, is this version acceptable for 3.18?

Yes, applying, thanks for the reminder.

--b.


Re: [PATCH V2] svcrdma: Advertise the correct max payload

2014-09-23 Thread 'J. Bruce Fields'
On Tue, Sep 23, 2014 at 02:42:34PM -0500, Steve Wise wrote:
 
   diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
   b/net/sunrpc/xprtrdma/xprt_rdma.h
   index c419498..a9cf5c3 100644
   --- a/net/sunrpc/xprtrdma/xprt_rdma.h
   +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
   @@ -392,4 +392,8 @@ extern struct kmem_cache *svc_rdma_ctxt_cachep;
   /* Workqueue created in svc_rdma.c */
   extern struct workqueue_struct *svc_rdma_wq;
  
   +#define RPCSVC_MAXPAYLOAD_RDMA \
   + (RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT) ? \
   +  RPCSVC_MAXPAYLOAD : (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT))
   +
  
  Couldn't you use:
  
  #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
  #define RPCSVC_MAXPAYLOAD_RDMA RPC_MAXPAYLOAD
  #else
  #define RPCSVC_MAXPAYLOAD_RDMA (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
  #endif
  
  That seems more idiomatic.
 
 Sure.  That makes it easier to read in my opinion too.
 
 I'll send out V3 with this change.

While we're bikeshedding, why not use min()?

--b.


Re: [PATCH V2] svcrdma: Advertise the correct max payload

2014-09-23 Thread 'J. Bruce Fields'
On Tue, Sep 23, 2014 at 02:53:28PM -0500, Steve Wise wrote:
 
 
  -Original Message-
  From: 'J. Bruce Fields' [mailto:bfie...@fieldses.org]
  Sent: Tuesday, September 23, 2014 2:48 PM
  To: Steve Wise
  Cc: 'Chuck Lever'; linux-...@vger.kernel.org; linux-rdma@vger.kernel.org
  Subject: Re: [PATCH V2] svcrdma: Advertise the correct max payload
  
  On Tue, Sep 23, 2014 at 02:42:34PM -0500, Steve Wise wrote:
  
 diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h 
 b/net/sunrpc/xprtrdma/xprt_rdma.h
 index c419498..a9cf5c3 100644
 --- a/net/sunrpc/xprtrdma/xprt_rdma.h
 +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
 @@ -392,4 +392,8 @@ extern struct kmem_cache *svc_rdma_ctxt_cachep;
 /* Workqueue created in svc_rdma.c */
 extern struct workqueue_struct *svc_rdma_wq;

 +#define RPCSVC_MAXPAYLOAD_RDMA \
  + (RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT) ? \
  +  RPCSVC_MAXPAYLOAD : (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT))
 +
   
Couldn't you use:
   
 #if RPCSVC_MAXPAYLOAD < (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
 #define RPCSVC_MAXPAYLOAD_RDMA RPC_MAXPAYLOAD
 #else
 #define RPCSVC_MAXPAYLOAD_RDMA (RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
#endif
   
That seems more idiomatic.
  
   Sure.  That makes it easier to read in my opinion too.
  
   I'll send out V3 with this change.
  
  While we're bikeshedding, why not use min()?
  
  --b.
 
 I tried that initially.  But min() and min_t() don't work because of the way 
 we use the #define.  With it defined thusly:

OK, OK.  Whatever works.--b.

 
  #define RPCSVC_MAXPAYLOAD_RDMA min(RPC_MAXPAYLOAD, RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)
 
 I see this error:
 
   CC [M]  net/sunrpc/xprtrdma/svc_rdma_transport.o
 net/sunrpc/xprtrdma/svc_rdma_transport.c:94: error: braced-group within 
 expression allowed only inside a function
 make[3]: *** [net/sunrpc/xprtrdma/svc_rdma_transport.o] Error 1
 make[2]: *** [net/sunrpc/xprtrdma] Error 2
 make[1]: *** [net/sunrpc] Error 2
 make: *** [net] Error 2
 
 min() looks like this:
 
  #define min(x, y) ({\
  typeof(x) _min1 = (x);  \
  typeof(y) _min2 = (y);  \
  (void) (&_min1 == &_min2);  \
  _min1 < _min2 ? _min1 : _min2; })
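 
 (Aside, purely illustrative and not from the thread: the underlying problem is
 that ({ ... }) is a GCC statement expression, which is only legal inside a
 function body, while .xcl_max_payload is initialized at file scope. A plain
 constant conditional expression works there; a statement expression does not:

 	/* fine: ordinary constant expression in a file-scope initializer */
 	static const int ok_payload  = (262144 < 1048576 ? 262144 : 1048576);

 	/* error: "braced-group within expression allowed only inside a function"
 	static const int bad_payload = ({ int v = 262144; v; });
 	*/
 )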
 


Re: [PATCH] svcrdma: Advertise the correct max payload

2014-09-22 Thread J. Bruce Fields
On Mon, Sep 22, 2014 at 01:36:53PM -0500, Steve Wise wrote:
 Svcrdma currently advertises 1MB, which is too large.  The correct value
 is the max scatter-gather allowed in an NFSRDMA IO chunk * the host page
 size. This bug is usually benign because the Linux X64 NFSRDMA client
 correctly limits the payload size to the correct value (64*4096 = 256KB).
 But if the Linux client is PPC64 with a 64KB page size, then the client
 will indeed use a payload size that will overflow the server.
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  net/sunrpc/xprtrdma/svc_rdma_transport.c |2 +-
  net/sunrpc/xprtrdma/xprt_rdma.h  |2 ++
  2 files changed, 3 insertions(+), 1 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index 374feb4..4e61880 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -91,7 +91,7 @@ struct svc_xprt_class svc_rdma_class = {
   .xcl_name = "rdma",
   .xcl_owner = THIS_MODULE,
   .xcl_ops = svc_rdma_ops,
 - .xcl_max_payload = RPCSVC_MAXPAYLOAD_TCP,
 + .xcl_max_payload = RPCSVC_MAXPAYLOAD_RDMA,
   .xcl_ident = XPRT_TRANSPORT_RDMA,
  };
  
 diff --git a/net/sunrpc/xprtrdma/xprt_rdma.h b/net/sunrpc/xprtrdma/xprt_rdma.h
 index c419498..467a77c 100644
 --- a/net/sunrpc/xprtrdma/xprt_rdma.h
 +++ b/net/sunrpc/xprtrdma/xprt_rdma.h
 @@ -392,4 +392,6 @@ extern struct kmem_cache *svc_rdma_ctxt_cachep;
  /* Workqueue created in svc_rdma.c */
  extern struct workqueue_struct *svc_rdma_wq;
  
  +#define RPCSVC_MAXPAYLOAD_RDMA	(RPCRDMA_MAX_DATA_SEGS << PAGE_SHIFT)

Do you want to define this as the minimum of this and
RPCSVC_MAXPAYLOAD_TCP, in case RPCRDMA_MAX_DATA_SEGS gets increased some
day?

--b.

 +
  #endif   /* _LINUX_SUNRPC_XPRT_RDMA_H */
 


Re: [PATCH] svcrdma: remove rdma_create_qp() failure recovery logic

2014-08-05 Thread J. Bruce Fields
Thanks, applying for 3.17.--b.

On Thu, Jul 31, 2014 at 03:26:07PM -0500, Steve Wise wrote:
 In svc_rdma_accept(), if rdma_create_qp() fails, there is useless
 logic to try and call rdma_create_qp() again with reduced sge depths.
 The assumption, I guess, was that perhaps the initial sge depths
  chosen were too big.  However, the initial depths are selected based
 on the rdma device attribute max_sge returned from ib_query_device().
 If rdma_create_qp() fails, it would not be because the max_send_sge and
 max_recv_sge values passed in exceed the device's max.  So just remove
 this code.
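 
  (For context, a simplified sketch of how those depths get picked from the
  device attributes; names follow svc_rdma_accept(), but this is illustrative
  only, not the exact upstream code:
 
 	struct ib_device_attr devattr;
 
 	if (ib_query_device(newxprt->sc_cm_id->device, &devattr))
 		goto errout;
 	/* cap the SGE depth at what the device reports, so a later
 	 * rdma_create_qp() should not fail on max_send_sge/max_recv_sge */
 	newxprt->sc_max_sge = min_t(int, devattr.max_sge, RPCSVC_MAXPAGES);
 	qp_attr.cap.max_send_sge = newxprt->sc_max_sge;
 	qp_attr.cap.max_recv_sge = newxprt->sc_max_sge;
  )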
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   19 ++-
  1 files changed, 2 insertions(+), 17 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index e7323fb..282a43b 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -942,23 +942,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt 
 *xprt)
  
 	ret = rdma_create_qp(newxprt->sc_cm_id, newxprt->sc_pd, &qp_attr);
 	if (ret) {
  -		/*
  -		 * XXX: This is a hack. We need a xx_request_qp interface
  -		 * that will adjust the qp_attr's with a best-effort
  -		 * number
  -		 */
  -		qp_attr.cap.max_send_sge -= 2;
  -		qp_attr.cap.max_recv_sge -= 2;
  -		ret = rdma_create_qp(newxprt->sc_cm_id, newxprt->sc_pd,
  -				     &qp_attr);
  -		if (ret) {
  -			dprintk("svcrdma: failed to create QP, ret=%d\n", ret);
  -			goto errout;
  -		}
  -		newxprt->sc_max_sge = qp_attr.cap.max_send_sge;
  -		newxprt->sc_max_sge = qp_attr.cap.max_recv_sge;
  -		newxprt->sc_sq_depth = qp_attr.cap.max_send_wr;
  -		newxprt->sc_max_requests = qp_attr.cap.max_recv_wr;
  +		dprintk("svcrdma: failed to create QP, ret=%d\n", ret);
  +		goto errout;
 	}
 	newxprt->sc_qp = newxprt->sc_cm_id->qp;
  
 


Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3

2014-07-11 Thread J. Bruce Fields
On Thu, Jul 10, 2014 at 04:43:54PM -0400, Chuck Lever wrote:
 I think the sanity check you pointed out is strictly satisfied by
 testing against the unaligned number of bytes. Is there a strong
 reason to do the extra math for that check during each WRITE?

I can't think of any good reason why the check should be against the
rounded-up length.

So my worry was just a more general uh-oh, a non-multiple-of-4-length
is pretty weird, future me may not remember that's a possibility.

--b.


Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3

2014-07-10 Thread J. Bruce Fields
On Thu, Jul 10, 2014 at 01:44:35PM -0400, Chuck Lever wrote:
 The server should comply with RFC 5667,

Where's the relevant language?  (I took a quick look but didn't see it.)

 and handle WRITE payloads
 that may not have a zero pad on the end (XDR length round-up).
 
  Fixes: f34b9568 ("The NFSv2/NFSv3 server does not handle zero...")
 BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=246
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
 
  fs/nfsd/nfs3xdr.c |   15 +--
  fs/nfsd/nfsxdr.c  |   16 +---
  2 files changed, 2 insertions(+), 29 deletions(-)
 
 diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
 index e6c01e8..6074f6a 100644
 --- a/fs/nfsd/nfs3xdr.c
 +++ b/fs/nfsd/nfs3xdr.c
 @@ -361,7 +361,7 @@ int
  nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
   struct nfsd3_writeargs *args)
  {
 - unsigned int len, v, hdr, dlen;
 + unsigned int len, v, hdr;
   u32 max_blocksize = svc_max_payload(rqstp);
  
 	p = decode_fh(p, &args->fh);
 @@ -383,19 +383,6 @@ nfs3svc_decode_writeargs(struct svc_rqst *rqstp, __be32 
 *p,
* bytes.
*/
 	hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
  -	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len
 - - hdr;
 - /*
 -  * Round the length of the data which was specified up to
 -  * the next multiple of XDR units and then compare that
 -  * against the length which was actually received.
 -  * Note that when RPCSEC/GSS (for example) is used, the
 -  * data buffer can be padded so dlen might be larger
 -  * than required.  It must never be smaller.
 -  */
  -	if (dlen < XDR_QUADLEN(len)*4)
 - return 0;
 -
 	if (args->count > max_blocksize) {
 		args->count = max_blocksize;
 		len = args->len = max_blocksize;
And then:
}
	rqstp->rq_vec[0].iov_base = (void*)p;
	rqstp->rq_vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - hdr;
	v = 0;
	while (len > rqstp->rq_vec[v].iov_len) {
		len -= rqstp->rq_vec[v].iov_len;
		v++;
		rqstp->rq_vec[v].iov_base = page_address(rqstp->rq_pages[v]);
		rqstp->rq_vec[v].iov_len = PAGE_SIZE;
	}
	rqstp->rq_vec[v].iov_len = len;
	args->vlen = v + 1;
return 1;

So if len is allowed to be larger than the available space, then the rq_vec
will get filled with data beyond the end of the client's request.

I believe the max_blocksize check will at least ensure that rq_pages[v]
is always still a valid page, so we shouldn't crash.  But it could
contain random data, and writing that data to a file could disclose
information we don't mean to.

Am I missing something?  The v2 case looks similar.

So I think you just want to drop the round-up of len, not the whole
check.
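
Untested sketch of what I mean (keep the dlen computation, just stop rounding
len up before the comparison):

	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len - hdr;
	/* The transport may omit the XDR pad, so compare against the
	 * unrounded length; dlen must still never be smaller than len. */
	if (dlen < len)
		return 0;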

--b.


 diff --git a/fs/nfsd/nfsxdr.c b/fs/nfsd/nfsxdr.c
 index 1ac306b..1f5347e 100644
 --- a/fs/nfsd/nfsxdr.c
 +++ b/fs/nfsd/nfsxdr.c
 @@ -280,7 +280,7 @@ int
  nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 *p,
   struct nfsd_writeargs *args)
  {
 - unsigned int len, hdr, dlen;
 + unsigned int len, hdr;
   int v;
  
 	p = decode_fh(p, &args->fh);
 @@ -302,20 +302,6 @@ nfssvc_decode_writeargs(struct svc_rqst *rqstp, __be32 
 *p,
* bytes.
*/
 	hdr = (void*)p - rqstp->rq_arg.head[0].iov_base;
  -	dlen = rqstp->rq_arg.head[0].iov_len + rqstp->rq_arg.page_len
 - - hdr;
 -
 - /*
 -  * Round the length of the data which was specified up to
 -  * the next multiple of XDR units and then compare that
 -  * against the length which was actually received.
 -  * Note that when RPCSEC/GSS (for example) is used, the
 -  * data buffer can be padded so dlen might be larger
 -  * than required.  It must never be smaller.
 -  */
  -	if (dlen < XDR_QUADLEN(len)*4)
 - return 0;
 -
 	rqstp->rq_vec[0].iov_base = (void*)p;
 	rqstp->rq_vec[0].iov_len = rqstp->rq_arg.head[0].iov_len - hdr;
   v = 0;
 

Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3

2014-07-10 Thread J. Bruce Fields
On Thu, Jul 10, 2014 at 02:24:57PM -0400, Chuck Lever wrote:
 
 On Jul 10, 2014, at 2:19 PM, J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Thu, Jul 10, 2014 at 01:44:35PM -0400, Chuck Lever wrote:
  The server should comply with RFC 5667,
  
  Where's the relevant language?  (I took a quick look but didn't see it.)
 
 Sorry, I listed the wrong RFC when I wrote the description of bug 246.
 It’s actually RFC 5666, section 3.7.

Thanks.

  So I think you just want to drop the round-up of len, not the whole
  check.
 
 I’ll try that, thanks!

Actually, I'd really rather this get fixed up in the rpc layer.  The
padding is really required for correct xdr.  The fact that RDMA doesn't
require those zeroes to be literally transmitted across the network
sounds like a transport-level detail that should be hidden from the xdr
code.

--b.


Re: [PATCH 2/2] svcrdma: Remove extra writeargs sanity check for NFSv2/3

2014-07-10 Thread J. Bruce Fields
On Thu, Jul 10, 2014 at 03:07:34PM -0400, Chuck Lever wrote:
 Hi Bruce-
 
 On Jul 10, 2014, at 2:49 PM, J. Bruce Fields bfie...@fieldses.org wrote:
 
  On Thu, Jul 10, 2014 at 02:24:57PM -0400, Chuck Lever wrote:
  
  On Jul 10, 2014, at 2:19 PM, J. Bruce Fields bfie...@fieldses.org wrote:
  
  On Thu, Jul 10, 2014 at 01:44:35PM -0400, Chuck Lever wrote:
  The server should comply with RFC 5667,
  
  Where's the relevant language?  (I took a quick look but didn't see it.)
  
  Sorry, I listed the wrong RFC when I wrote the description of bug 246.
  It’s actually RFC 5666, section 3.7.
  
  Thanks.
  
  So I think you just want to drop the round-up of len, not the whole
  check.
  
  I’ll try that, thanks!
 
 Works-as-expected.
 
  Actually, I'd really rather this get fixed up in the rpc layer.  The
  padding is really required for correct xdr.
 
 How so?

Well, to be spec-lawyerly, rfc 1832 3.9 defines opaque data as including
the zero-padding; a sequence of bytes isn't legal xdr if it just ends
early.
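
(Concretely — illustration only, not from the thread — an XDR opaque of length
n is padded with zero bytes up to the next multiple of four, so a 5-byte WRITE
payload occupies 8 XDR bytes on the wire:

	/* wire size of an XDR opaque of n bytes, i.e. XDR_QUADLEN(n) * 4 */
	static inline unsigned int xdr_opaque_wire_len(unsigned int n)
	{
		return (n + 3) & ~3u;
	}
)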

 All of NFSv4 and all other NFSv3 operations work as expected
 without that padding present. There doesn’t seem to be any operational
 dependency on the presence of padding. Help?

I can believe that the code deals with it now, I just wonder if this
check may not be the only case where someone writing xdr code expects
total length to be a multiple of four.

The drc code also depends on the length being right, see
nfsd_cache_csum.  I don't know whether that will cause a practical
problem in this case.

(What about the krb5i case?)

  The fact that RDMA doesn't
  require those zeroes to be literally transmitted across the network
  sounds like a transport-level detail that should be hidden from the xdr
  code.
 
 The best I can think of is adding a false page array entry to the
 xdr_buf if the last incoming page is short by a few bytes.

The padding just gets added to the end of whichever page the write ended
on, and you only use another page if you run out of space, right?

I don't know, if that's a huge pain then I guess we can live with this.
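
(Purely to illustrate the shape such a transport-level fix could take —
hypothetical code, not something either of us has posted: point the xdr_buf
tail at a few static zero bytes whenever the received payload stops short of
the XDR pad.

	static char xdr_zero_pad[3];
	unsigned int pad = ((len + 3) & ~3u) - len;	/* 0..3 pad bytes */

	if (pad) {
		rqstp->rq_arg.tail[0].iov_base = xdr_zero_pad;
		rqstp->rq_arg.tail[0].iov_len  = pad;
		rqstp->rq_arg.len += pad;
	}
)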

--b.


Re: [PATCH] svcrdma: send_write() must not overflow the device's max sge

2014-07-10 Thread J. Bruce Fields
On Wed, Jul 09, 2014 at 01:49:15PM -0500, Steve Wise wrote:
 Function send_write() must stop creating sges when it reaches the device
 max and return the amount sent in the RDMA Write to the caller.

Queueing up for 3.17, thanks.--b.

 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  net/sunrpc/xprtrdma/svc_rdma_sendto.c |   39 
 +
  1 files changed, 15 insertions(+), 24 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c 
 b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
 index 49fd21a..9f1b506 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
 @@ -192,6 +192,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
 svc_rqst *rqstp,
   xdr_sge_no++;
 		BUG_ON(xdr_sge_no > vec->count);
 		bc -= sge_bytes;
  +		if (sge_no == xprt->sc_max_sge)
 + break;
   }
  
   /* Prepare WRITE WR */
 @@ -209,7 +211,7 @@ static int send_write(struct svcxprt_rdma *xprt, struct 
 svc_rqst *rqstp,
   atomic_inc(rdma_stat_write);
 	if (svc_rdma_send(xprt, &write_wr))
   goto err;
 - return 0;
 + return write_len - bc;
   err:
   svc_rdma_unmap_dma(ctxt);
   svc_rdma_put_context(ctxt, 0);
 @@ -225,7 +227,6 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
  {
   u32 xfer_len = rqstp-rq_res.page_len + rqstp-rq_res.tail[0].iov_len;
   int write_len;
 - int max_write;
   u32 xdr_off;
   int chunk_off;
   int chunk_no;
 @@ -239,8 +240,6 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
   res_ary = (struct rpcrdma_write_array *)
   rdma_resp-rm_body.rm_chunks[1];
  
 - max_write = xprt-sc_max_sge * PAGE_SIZE;
 -
   /* Write chunks start at the pagelist */
   for (xdr_off = rqstp-rq_res.head[0].iov_len, chunk_no = 0;
xfer_len  chunk_no  arg_ary-wc_nchunks;
 @@ -260,23 +259,21 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
   write_len);
   chunk_off = 0;
   while (write_len) {
 - int this_write;
 - this_write = min(write_len, max_write);
   ret = send_write(xprt, rqstp,
ntohl(arg_ch-rs_handle),
rs_offset + chunk_off,
xdr_off,
 -  this_write,
 +  write_len,
vec);
 - if (ret) {
  +			if (ret <= 0) {
 				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
 					ret);
   return -EIO;
   }
 - chunk_off += this_write;
 - xdr_off += this_write;
 - xfer_len -= this_write;
 - write_len -= this_write;
 + chunk_off += ret;
 + xdr_off += ret;
 + xfer_len -= ret;
 + write_len -= ret;
   }
   }
   /* Update the req with the number of chunks actually used */
 @@ -293,7 +290,6 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
  {
   u32 xfer_len = rqstp-rq_res.len;
   int write_len;
 - int max_write;
   u32 xdr_off;
   int chunk_no;
   int chunk_off;
 @@ -311,8 +307,6 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
   res_ary = (struct rpcrdma_write_array *)
   rdma_resp-rm_body.rm_chunks[2];
  
 - max_write = xprt-sc_max_sge * PAGE_SIZE;
 -
   /* xdr offset starts at RPC message */
   nchunks = ntohl(arg_ary-wc_nchunks);
   for (xdr_off = 0, chunk_no = 0;
 @@ -330,24 +324,21 @@ static int send_reply_chunks(struct svcxprt_rdma *xprt,
   write_len);
   chunk_off = 0;
   while (write_len) {
 - int this_write;
 -
 - this_write = min(write_len, max_write);
   ret = send_write(xprt, rqstp,
ntohl(ch-rs_handle),
rs_offset + chunk_off,
xdr_off,
 -  this_write,
 +  write_len,
vec);
 - if (ret) {
  +			if (ret <= 0) {
 				dprintk("svcrdma: RDMA_WRITE failed, ret=%d\n",
 					ret);
   return -EIO;
   }
 - chunk_off += this_write;
 - xdr_off += this_write;
 - 

Re: [PATCH V2] svcrdma: Fence LOCAL_INV work requests

2014-06-05 Thread J. Bruce Fields
On Thu, Jun 05, 2014 at 09:54:31AM -0500, Steve Wise wrote:
 This applies on top of:
 
   commit e5a070216356dbcb03607cb264cc3104e17339b3
   Author: Steve Wise sw...@opengridcomputing.com
   Date:   Wed May 28 15:12:01 2014 -0500
 
   svcrdma: refactor marshalling logic

Sorry, I lost track--was there another revision of that patch coming
too?

--b.

 
 Fencing forces the invalidate to only happen after all prior send
 work requests have been completed.
 
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c |2 +-
  1 files changed, 1 insertions(+), 1 deletions(-)
 
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 index 52d9f2c..8f92a61 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 @@ -338,7 +338,7 @@ static int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
 		memset(&inv_wr, 0, sizeof(inv_wr));
 		inv_wr.wr_id = (unsigned long)ctxt;
 		inv_wr.opcode = IB_WR_LOCAL_INV;
  -		inv_wr.send_flags = IB_SEND_SIGNALED;
  +		inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
 		inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
 	}
 	ctxt->wr_op = read_wr.opcode;
 


Re: [PATCH V3] svcrdma: refactor marshalling logic

2014-06-02 Thread J. Bruce Fields
On Mon, Jun 02, 2014 at 11:47:58AM -0500, Steve Wise wrote:
 
 
  Steve
  
  I have not checked the code because I am away from my laptop
  But I assume mlx and mthca are using fmr and cxgb4 is using
  rdma-read-with-invalidate, which has an implicit fence.
  
  If the invalidate is issued before the read completes, it's a problem.
  
 
 You're correct.  And this bug appears to be in the current upstream code as 
 well.  If an
 IB_WR_LOCAL_INV wr is used, it must include IB_SEND_FENCE to fence it until 
 the prior read
 completes.
 
 Good catch!  I'll post V4 soon.

Any chance that can be handled as a separate patch rather than folded
in?

(Disclaimer: I've been following the discussion only very
superficially.)

--b.


Re: [PATCH V3] svcrdma: refactor marshalling logic

2014-06-02 Thread 'J. Bruce Fields'
On Mon, Jun 02, 2014 at 11:52:47AM -0500, Steve Wise wrote:
   You're correct.  And this bug appears to be in the current upstream code 
   as well.  If
 an
   IB_WR_LOCAL_INV wr is used, it must include IB_SEND_FENCE to fence it 
   until the prior
  read
   completes.
  
   Good catch!  I'll post V4 soon.
  
  Any chance that can be handled as a separate patch rather than folded
  in?
  
  (Disclaimer: I've been following the discussion only very
  superficially.)
  
 
 Sure.  I'll post the patch soon.

Thanks, and, again, I'm not terribly happy about the monster
patch--anything you can split off it is great, even if that thing's
small.  As long as all the intermediate stages still build and run.

(And any bugs you've identified in upstream code are good candidates for
separate patches, hopefully preceding the rewrite.  That also allows us
to apply those fixes to stable kernels if appropriate.)

--b.


Re: [PATCH V3] svcrdma: refactor marshalling logic

2014-06-02 Thread 'J. Bruce Fields'
On Mon, Jun 02, 2014 at 12:06:39PM -0500, Steve Wise wrote:
 
  On Mon, Jun 02, 2014 at 11:52:47AM -0500, Steve Wise wrote:
 You're correct.  And this bug appears to be in the current upstream 
 code as well.
 If
   an
 IB_WR_LOCAL_INV wr is used, it must include IB_SEND_FENCE to fence it 
 until the
 prior
read
 completes.

 Good catch!  I'll post V4 soon.
   
Any chance that can be handled as a separate patch rather than folded
in?
   
(Disclaimer: I've been following the discussion only very
superficially.)
   
  
   Sure.  I'll post the patch soon.
  
  Thanks, and, again, I'm not terribly happy about the monster
  patch--anything you can split off it is great, even if that thing's
  small.  As long as all the intermediate stages still build and run.
 
 I don't see any way to do this for this particular patch.  It rewrites the 
 entire rdma
 read logic.  

There's almost always a way.

  (And any bugs you've identified in upstream code are good candidates for
  separate patches, hopefully preceding the rewrite.  That also allows us
  to apply those fixes to stable kernels if appropriate.)
  
 
 If I do this, then I'd have to respin the refactor patch.  I really would 
 like to get this
 merged as-is (with the one change I'm sending soon), and move on.  I 
 definitely will try
 and keep the patches smaller and more discrete going forward.  
 
 Will that work?

Yes, just this once, we'll live.

--b.


Re: [PATCH] NFSD: Ignore client's source port on RDMA transports

2014-05-22 Thread J. Bruce Fields
On Mon, May 19, 2014 at 01:40:22PM -0400, Chuck Lever wrote:
 An NFS/RDMA client's source port is meaningless for RDMA transports.
 The transport layer typically sets the source port value on the
 connection to a random ephemeral port.
 
 Currently, NFS server administrators must specify the insecure
 export option to enable clients to access exports via RDMA.
 
 But this means NFS clients can access such an export via IP using an
 ephemeral port, which may not be desirable.
 
 This patch eliminates the need to specify the insecure export
 option to allow NFS/RDMA clients access to an export.
 
 BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250
 Signed-off-by: Chuck Lever chuck.le...@oracle.com
 ---
 
 Hi Bruce-
 
 I've done some simple testing. insecure still behaves correctly
 for IP-based clients, and is now unnecessary for RDMA-based clients.
 Would you consider this for nfsd-next?

Yes, looks god to me; applying for 3.16.

--b.

 
 
  include/linux/sunrpc/svc_xprt.h  |1 +
  net/sunrpc/svc_xprt.c|2 +-
  net/sunrpc/svcsock.c |9 +
  net/sunrpc/xprtrdma/svc_rdma_transport.c |7 +++
  4 files changed, 18 insertions(+), 1 deletions(-)
 
 diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
 index b05963f..0cec1b9 100644
 --- a/include/linux/sunrpc/svc_xprt.h
 +++ b/include/linux/sunrpc/svc_xprt.h
 @@ -24,6 +24,7 @@ struct svc_xprt_ops {
   void(*xpo_release_rqst)(struct svc_rqst *);
   void(*xpo_detach)(struct svc_xprt *);
   void(*xpo_free)(struct svc_xprt *);
 + int (*xpo_secure_port)(struct svc_rqst *);
  };
  
  struct svc_xprt_class {
 diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
 index 06c6ff0..614956f 100644
 --- a/net/sunrpc/svc_xprt.c
 +++ b/net/sunrpc/svc_xprt.c
 @@ -793,7 +793,7 @@ int svc_recv(struct svc_rqst *rqstp, long timeout)
  
 	clear_bit(XPT_OLD, &xprt->xpt_flags);
  
  -	rqstp->rq_secure = svc_port_is_privileged(svc_addr(rqstp));
  +	rqstp->rq_secure = xprt->xpt_ops->xpo_secure_port(rqstp);
 	rqstp->rq_chandle.defer = svc_defer;
  
 	if (serv->sv_stats)
 diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
 index 43bcb46..0cb34f5 100644
 --- a/net/sunrpc/svcsock.c
 +++ b/net/sunrpc/svcsock.c
 @@ -400,6 +400,12 @@ static void svc_sock_setbufsize(struct socket *sock, 
 unsigned int snd,
   release_sock(sock-sk);
  #endif
  }
 +
 +static int svc_sock_secure_port(struct svc_rqst *rqstp)
 +{
 + return svc_port_is_privileged(svc_addr(rqstp));
 +}
 +
  /*
   * INET callback when data has been received on the socket.
   */
 @@ -678,6 +684,7 @@ static struct svc_xprt_ops svc_udp_ops = {
   .xpo_prep_reply_hdr = svc_udp_prep_reply_hdr,
   .xpo_has_wspace = svc_udp_has_wspace,
   .xpo_accept = svc_udp_accept,
 + .xpo_secure_port = svc_sock_secure_port,
  };
  
  static struct svc_xprt_class svc_udp_class = {
 @@ -1234,6 +1241,7 @@ static struct svc_xprt_ops svc_tcp_bc_ops = {
   .xpo_detach = svc_bc_tcp_sock_detach,
   .xpo_free = svc_bc_sock_free,
   .xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
 + .xpo_secure_port = svc_sock_secure_port,
  };
  
  static struct svc_xprt_class svc_tcp_bc_class = {
 @@ -1272,6 +1280,7 @@ static struct svc_xprt_ops svc_tcp_ops = {
   .xpo_prep_reply_hdr = svc_tcp_prep_reply_hdr,
   .xpo_has_wspace = svc_tcp_has_wspace,
   .xpo_accept = svc_tcp_accept,
 + .xpo_secure_port = svc_sock_secure_port,
  };
  
  static struct svc_xprt_class svc_tcp_class = {
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index 25688fa..02db8d9 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -65,6 +65,7 @@ static void dto_tasklet_func(unsigned long data);
  static void svc_rdma_detach(struct svc_xprt *xprt);
  static void svc_rdma_free(struct svc_xprt *xprt);
  static int svc_rdma_has_wspace(struct svc_xprt *xprt);
 +static int svc_rdma_secure_port(struct svc_rqst *);
  static void rq_cq_reap(struct svcxprt_rdma *xprt);
  static void sq_cq_reap(struct svcxprt_rdma *xprt);
  
 @@ -82,6 +83,7 @@ static struct svc_xprt_ops svc_rdma_ops = {
   .xpo_prep_reply_hdr = svc_rdma_prep_reply_hdr,
   .xpo_has_wspace = svc_rdma_has_wspace,
   .xpo_accept = svc_rdma_accept,
 + .xpo_secure_port = svc_rdma_secure_port,
  };
  
  struct svc_xprt_class svc_rdma_class = {
 @@ -1207,6 +1209,11 @@ static int svc_rdma_has_wspace(struct svc_xprt *xprt)
   return 1;
  }
  
 +static int svc_rdma_secure_port(struct svc_rqst *rqstp)
 +{
 + return 1;
 +}
 +
  /*
   * Attempt to register the kvec representing the RPC memory with the
   * device.
 

Re: [PATCH] NFSD: Ignore client's source port on RDMA transports

2014-05-22 Thread J. Bruce Fields
On Thu, May 22, 2014 at 03:55:17PM -0400, bfields wrote:
 On Mon, May 19, 2014 at 01:40:22PM -0400, Chuck Lever wrote:
  An NFS/RDMA client's source port is meaningless for RDMA transports.
  The transport layer typically sets the source port value on the
  connection to a random ephemeral port.
  
  Currently, NFS server administrators must specify the insecure
  export option to enable clients to access exports via RDMA.
  
  But this means NFS clients can access such an export via IP using an
  ephemeral port, which may not be desirable.
  
  This patch eliminates the need to specify the insecure export
  option to allow NFS/RDMA clients access to an export.
  
  BugLink: https://bugzilla.linux-nfs.org/show_bug.cgi?id=250
  Signed-off-by: Chuck Lever chuck.le...@oracle.com
  ---
  
  Hi Bruce-
  
  I've done some simple testing. insecure still behaves correctly
  for IP-based clients, and is now unnecessary for RDMA-based clients.
  Would you consider this for nfsd-next?
 
 Yes, looks god to me; applying for 3.16.

Um, you know what I meant.

--b.

 

Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

2014-05-06 Thread J. Bruce Fields
On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
 From: Tom Tucker t...@opengridcomputing.com
 
 Change poll logic to grab up to 6 completions at a time.
 
 RDMA write and send completions no longer deal with fastreg objects.
 
 Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
 capabilities.
 
 Signed-off-by: Tom Tucker t...@opengridcomputing.com
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
  include/linux/sunrpc/svc_rdma.h  |3 -
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   62 
 +-
  2 files changed, 37 insertions(+), 28 deletions(-)
 
 diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
 index 0b8e3e6..5cf99a0 100644
 --- a/include/linux/sunrpc/svc_rdma.h
 +++ b/include/linux/sunrpc/svc_rdma.h
 @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
   struct list_head frmr_list;
  };
  struct svc_rdma_req_map {
 - struct svc_rdma_fastreg_mr *frmr;
   unsigned long count;
   union {
   struct kvec sge[RPCSVC_MAXPAGES];
   struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
 + unsigned long lkey[RPCSVC_MAXPAGES];
   };
  };
 -#define RDMACTXT_F_FAST_UNREG1
  #define RDMACTXT_F_LAST_CTXT 2
  
  #define  SVCRDMA_DEVCAP_FAST_REG 1   /* fast mr registration 
 */
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index 25688fa..2c5b201 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -1,4 +1,5 @@
  /*
 + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
   * Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
   *
   * This software is available to you under a choice of one of two
 @@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
   schedule_timeout_uninterruptible(msecs_to_jiffies(500));
   }
   map-count = 0;
 - map-frmr = NULL;
   return map;
  }
  
 @@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
  
   switch (ctxt-wr_op) {
   case IB_WR_SEND:
 - if (test_bit(RDMACTXT_F_FAST_UNREG, ctxt-flags))
 - svc_rdma_put_frmr(xprt, ctxt-frmr);
 + BUG_ON(ctxt-frmr);
   svc_rdma_put_context(ctxt, 1);
   break;
  
   case IB_WR_RDMA_WRITE:
 + BUG_ON(ctxt-frmr);
   svc_rdma_put_context(ctxt, 0);
   break;
  
   case IB_WR_RDMA_READ:
   case IB_WR_RDMA_READ_WITH_INV:
 + svc_rdma_put_frmr(xprt, ctxt-frmr);
   if (test_bit(RDMACTXT_F_LAST_CTXT, ctxt-flags)) {
   struct svc_rdma_op_ctxt *read_hdr = ctxt-read_hdr;
   BUG_ON(!read_hdr);
 - if (test_bit(RDMACTXT_F_FAST_UNREG, ctxt-flags))
 - svc_rdma_put_frmr(xprt, ctxt-frmr);
   spin_lock_bh(xprt-sc_rq_dto_lock);
   set_bit(XPT_DATA, xprt-sc_xprt.xpt_flags);
   list_add_tail(read_hdr-dto_q,
 @@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
   break;
  
   default:
 + BUG_ON(1);
 		printk(KERN_ERR "svcrdma: unexpected completion type, opcode=%d\n",
 		       ctxt->wr_op);

Note the printk's unreachable now.  Should some of these BUG_ON()'s be
WARN_ON()'s?
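
(For instance — just a sketch of the WARN_ON() flavour, not a request for this
exact code — the default arm could log and close the transport instead of
taking the machine down:

	default:
		WARN_ONCE(1, "svcrdma: unexpected completion type, opcode=%d\n",
			  ctxt->wr_op);
		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
		svc_rdma_put_context(ctxt, 0);
		break;
)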

 @@ -378,29 +378,42 @@ static void process_context(struct svcxprt_rdma *xprt,
  static void sq_cq_reap(struct svcxprt_rdma *xprt)
  {
   struct svc_rdma_op_ctxt *ctxt = NULL;
 - struct ib_wc wc;
 + struct ib_wc wc_a[6];
 + struct ib_wc *wc;
 	struct ib_cq *cq = xprt->sc_sq_cq;
   int ret;

May want to keep an eye on the stack usage here?
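
(Rough numbers: struct ib_wc is somewhere around 60-100 bytes depending on the
kernel version, so six of them add a few hundred bytes to this frame.  If you
want that visible at build time, something like this — illustrative only —
would do it:

	struct ib_wc wc_a[6];

	BUILD_BUG_ON(sizeof(wc_a) > 1024);	/* arbitrary ceiling, adjust to taste */
)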

--b.

  
 + memset(wc_a, 0, sizeof(wc_a));
 +
 	if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
 		return;
  
 	ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
 	atomic_inc(&rdma_stat_sq_poll);
  -	while ((ret = ib_poll_cq(cq, 1, &wc)) > 0) {
  -		if (wc.status != IB_WC_SUCCESS)
  -			/* Close the transport */
  -			set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
  +	while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
 + int i;
  
 - /* Decrement used SQ WR count */
  -		atomic_dec(&xprt->sc_sq_count);
  -		wake_up(&xprt->sc_send_wait);
  +		for (i = 0; i < ret; i++) {
  +			wc = &wc_a[i];
  +			if (wc->status != IB_WC_SUCCESS) {
  +				dprintk("svcrdma: sq wc err status %d\n",
  +					wc->status);
  
 - ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
 - if (ctxt)
 - process_context(xprt, ctxt);
 +  

Re: [PATCH V2 RFC 0/3] svcrdma: refactor marshalling logic

2014-05-06 Thread J. Bruce Fields
On Tue, May 06, 2014 at 12:46:21PM -0500, Steve Wise wrote:
 This patch series refactors the NFSRDMA server marshalling logic to
 remove the intermediary map structures.  It also fixes an existing bug
 where the NFSRDMA server was not minding the device fast register page
 list length limitations.
 
 I've also made a git repo available with these patches on top of 3.15-rc4:
 
 git://git.openfabrics.org/~swise/linux svcrdma-refactor
 
 Changes since V1:
 
 - fixed regression for devices that don't support FRMRs (see
   rdma_read_chunk_lcl())
 
 - split patch up for closer review.  However I request it be squashed
    before merging as they are not bisectable, and I think these changes
   should all be a single commit anyway.

If it's not split up in a way that's bisectable, then yes, just don't
bother.

--b.

 
 Please review, and test if you can.
 
 Signed-off-by: Tom Tucker t...@opengridcomputing.com
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 
 ---
 
 Tom Tucker (3):
   svcrdma: Sendto changes
   svcrdma: Recvfrom changes
   svcrdma: Transport and header file changes
 
 
  include/linux/sunrpc/svc_rdma.h  |3 
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  633 
 --
  net/sunrpc/xprtrdma/svc_rdma_sendto.c|  230 +--
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   62 ++-
  4 files changed, 318 insertions(+), 610 deletions(-)
 
 -- 
 
 Steve / Tom


Re: [PATCH V2 RFC 1/3] svcrdma: Transport and header file changes

2014-05-06 Thread J. Bruce Fields
On Tue, May 06, 2014 at 04:02:41PM -0500, Steve Wise wrote:
 On 5/6/2014 2:21 PM, J. Bruce Fields wrote:
 On Tue, May 06, 2014 at 12:46:27PM -0500, Steve Wise wrote:
 From: Tom Tucker t...@opengridcomputing.com
 
 Change poll logic to grab up to 6 completions at a time.
 
 RDMA write and send completions no longer deal with fastreg objects.
 
 Set SVCRDMA_DEVCAP_FAST_REG and allocate a dma_mr based on the device
 capabilities.
 
 Signed-off-by: Tom Tucker t...@opengridcomputing.com
 Signed-off-by: Steve Wise sw...@opengridcomputing.com
 ---
 
   include/linux/sunrpc/svc_rdma.h  |3 -
   net/sunrpc/xprtrdma/svc_rdma_transport.c |   62 
  +-
   2 files changed, 37 insertions(+), 28 deletions(-)
 
 diff --git a/include/linux/sunrpc/svc_rdma.h 
 b/include/linux/sunrpc/svc_rdma.h
 index 0b8e3e6..5cf99a0 100644
 --- a/include/linux/sunrpc/svc_rdma.h
 +++ b/include/linux/sunrpc/svc_rdma.h
 @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
 struct list_head frmr_list;
   };
   struct svc_rdma_req_map {
 -   struct svc_rdma_fastreg_mr *frmr;
 unsigned long count;
 union {
 struct kvec sge[RPCSVC_MAXPAGES];
 struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
 +   unsigned long lkey[RPCSVC_MAXPAGES];
 };
   };
 -#define RDMACTXT_F_FAST_UNREG  1
   #define RDMACTXT_F_LAST_CTXT  2
   #define   SVCRDMA_DEVCAP_FAST_REG 1   /* fast mr registration 
  */
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c 
 b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 index 25688fa..2c5b201 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
 @@ -1,4 +1,5 @@
   /*
 + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
* Copyright (c) 2005-2007 Network Appliance, Inc. All rights reserved.
*
* This software is available to you under a choice of one of two
 @@ -160,7 +161,6 @@ struct svc_rdma_req_map *svc_rdma_get_req_map(void)
 schedule_timeout_uninterruptible(msecs_to_jiffies(500));
 }
 map-count = 0;
 -   map-frmr = NULL;
 return map;
   }
 @@ -336,22 +336,21 @@ static void process_context(struct svcxprt_rdma *xprt,
 switch (ctxt-wr_op) {
 case IB_WR_SEND:
 -   if (test_bit(RDMACTXT_F_FAST_UNREG, ctxt-flags))
 -   svc_rdma_put_frmr(xprt, ctxt-frmr);
 +   BUG_ON(ctxt-frmr);
 svc_rdma_put_context(ctxt, 1);
 break;
 case IB_WR_RDMA_WRITE:
 +   BUG_ON(ctxt-frmr);
 svc_rdma_put_context(ctxt, 0);
 break;
 case IB_WR_RDMA_READ:
 case IB_WR_RDMA_READ_WITH_INV:
 +   svc_rdma_put_frmr(xprt, ctxt-frmr);
 if (test_bit(RDMACTXT_F_LAST_CTXT, ctxt-flags)) {
 struct svc_rdma_op_ctxt *read_hdr = ctxt-read_hdr;
 BUG_ON(!read_hdr);
 -   if (test_bit(RDMACTXT_F_FAST_UNREG, ctxt-flags))
 -   svc_rdma_put_frmr(xprt, ctxt-frmr);
 spin_lock_bh(xprt-sc_rq_dto_lock);
 set_bit(XPT_DATA, xprt-sc_xprt.xpt_flags);
 list_add_tail(read_hdr-dto_q,
 @@ -363,6 +362,7 @@ static void process_context(struct svcxprt_rdma *xprt,
 break;
 default:
 +   BUG_ON(1);
 printk(KERN_ERR svcrdma: unexpected completion type, 
opcode=%d\n,
ctxt-wr_op);
 Note the printk's unreachable now.  Should some of these BUG_ON()'s be
 WARN_ON()'s?
 
 I'll remove the printk.  And if any of the new BUG_ON()'s can be
 WARN_ON(), then I'll do that.  But only if proceeding after a
 WARN_ON() results in a working server.

The other thing to keep in mind is what the consequences of the BUG
might be--e.g. if we BUG while holding an important lock then that lock
never gets dropped and the system can freeze pretty quickly--possibly
before we get any useful information to the system logs.  On a quick
check that doesn't look like the case here, though.
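
(A contrived illustration of the pattern to watch for — hypothetical code, not
a quote from the patch:

	spin_lock_bh(&xprt->sc_rq_dto_lock);
	BUG_ON(!read_hdr);	/* if this fires, sc_rq_dto_lock is never released */
	list_add_tail(&read_hdr->dto_q, &xprt->sc_read_complete_q);
	spin_unlock_bh(&xprt->sc_rq_dto_lock);
)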

--b.


Re: [PATCH] svcrdma: refactor marshalling logic

2014-04-25 Thread J. Bruce Fields
On Thu, Apr 10, 2014 at 01:37:23PM -0500, Steve Wise wrote:
 From: Tom Tucker t...@ogc.us
 
 This patch refactors the marshalling logic to remove the intermediary
 map structures.  It also fixes an existing bug where the NFSRDMA server
 was not minding the device fast register page list length limitations.
 
 Signed-off-by: Tom Tucker t...@ogc.us
 ---
 
  include/linux/sunrpc/svc_rdma.h  |3 
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |  551 
 +-
  net/sunrpc/xprtrdma/svc_rdma_sendto.c|  230 +
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   57 ++-
  4 files changed, 222 insertions(+), 619 deletions(-)

Is it possible to make this change in more than one step?

RDMA is pretty esoteric to most of us, so honestly this will probably
get merged based just on your having tested it, but if it was possible
to break this up into smaller patches you might give us at least a
fighting chance of giving it some review.

--b.

 
 diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
 index 0b8e3e6..5cf99a0 100644
 --- a/include/linux/sunrpc/svc_rdma.h
 +++ b/include/linux/sunrpc/svc_rdma.h
 @@ -115,14 +115,13 @@ struct svc_rdma_fastreg_mr {
   struct list_head frmr_list;
  };
  struct svc_rdma_req_map {
 - struct svc_rdma_fastreg_mr *frmr;
   unsigned long count;
   union {
   struct kvec sge[RPCSVC_MAXPAGES];
   struct svc_rdma_chunk_sge ch[RPCSVC_MAXPAGES];
 + unsigned long lkey[RPCSVC_MAXPAGES];
   };
  };
 -#define RDMACTXT_F_FAST_UNREG1
  #define RDMACTXT_F_LAST_CTXT 2
  
  #define  SVCRDMA_DEVCAP_FAST_REG 1   /* fast mr registration 
 */
 diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c 
 b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 index 8d904e4..9cf65ff 100644
 --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
 @@ -1,4 +1,5 @@
  /*
 + * Copyright (c) 2014 Open Grid Computing, Inc. All rights reserved.
   * Copyright (c) 2005-2006 Network Appliance, Inc. All rights reserved.
   *
   * This software is available to you under a choice of one of two
 @@ -69,7 +70,8 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
  
   /* Set up the XDR head */
   rqstp-rq_arg.head[0].iov_base = page_address(page);
 - rqstp-rq_arg.head[0].iov_len = min(byte_count, ctxt-sge[0].length);
 + rqstp-rq_arg.head[0].iov_len =
 + min_t(size_t, byte_count, ctxt-sge[0].length);
   rqstp-rq_arg.len = byte_count;
   rqstp-rq_arg.buflen = byte_count;
  
 @@ -85,7 +87,7 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   page = ctxt-pages[sge_no];
   put_page(rqstp-rq_pages[sge_no]);
   rqstp-rq_pages[sge_no] = page;
 - bc -= min(bc, ctxt-sge[sge_no].length);
 + bc -= min_t(u32, bc, ctxt-sge[sge_no].length);
   rqstp-rq_arg.buflen += ctxt-sge[sge_no].length;
   sge_no++;
   }
 @@ -113,291 +115,154 @@ static void rdma_build_arg_xdr(struct svc_rqst *rqstp,
   rqstp-rq_arg.tail[0].iov_len = 0;
  }
  
 -/* Encode a read-chunk-list as an array of IB SGE
 - *
 - * Assumptions:
 - * - chunk[0]-position points to pages[0] at an offset of 0
 - * - pages[] is not physically or virtually contiguous and consists of
 - *   PAGE_SIZE elements.
 - *
 - * Output:
 - * - sge array pointing into pages[] array.
 - * - chunk_sge array specifying sge index and count for each
 - *   chunk in the read list
 - *
 - */
 -static int map_read_chunks(struct svcxprt_rdma *xprt,
 -struct svc_rqst *rqstp,
 -struct svc_rdma_op_ctxt *head,
 -struct rpcrdma_msg *rmsgp,
 -struct svc_rdma_req_map *rpl_map,
 -struct svc_rdma_req_map *chl_map,
 -int ch_count,
 -int byte_count)
 -{
 - int sge_no;
 - int sge_bytes;
 - int page_off;
 - int page_no;
 - int ch_bytes;
 - int ch_no;
 - struct rpcrdma_read_chunk *ch;
 -
 - sge_no = 0;
 - page_no = 0;
 - page_off = 0;
 - ch = (struct rpcrdma_read_chunk *)rmsgp-rm_body.rm_chunks[0];
 - ch_no = 0;
 - ch_bytes = ntohl(ch-rc_target.rs_length);
 - head-arg.head[0] = rqstp-rq_arg.head[0];
 - head-arg.tail[0] = rqstp-rq_arg.tail[0];
 - head-arg.pages = head-pages[head-count];
 - head-hdr_count = head-count; /* save count of hdr pages */
 - head-arg.page_base = 0;
 - head-arg.page_len = ch_bytes;
 - head-arg.len = rqstp-rq_arg.len + ch_bytes;
 - head-arg.buflen = rqstp-rq_arg.buflen + ch_bytes;
 - head-count++;
 - chl_map-ch[0].start = 0;
 - while (byte_count) {
 - rpl_map-sge[sge_no].iov_base =
 - page_address(rqstp-rq_arg.pages[page_no]) + page_off;
 - sge_bytes = min_t(int, PAGE_SIZE-page_off, 

Re: NFS over RDMA benchmark

2013-04-30 Thread J. Bruce Fields
On Sun, Apr 28, 2013 at 10:42:48AM -0400, J. Bruce Fields wrote:
 On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
 y...@mellanox.com
 I've been trying to do some benchmarks for NFS over RDMA
 and I seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of
 memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The
 backing storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block 
 sizes 4-
   512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec
 for the
same block sizes (4-512K). running over IPoIB-CM, I get 200-
   980MB/sec.
 ...
   I am trying to get maximum performance from a single server - I
   used 2
  processes in fio test - more than 2 did not show any performance 
  boost.
   I tried running fio from 2 different PCs on 2 different files,
   but the sum of
  the two is more or less the same as running from single client PC.
  
   What I did see is that server is sweating a lot more than the
   clients and
  more than that, it has 1 core (CPU5) in 100% softirq tasklet:
   cat /proc/softirqs
 ...
 Perf top for the CPU with high tasklet count gives:

  samples  pcnt RIPfunction
 DSO
 ...
  2787.00 24.1% 81062a00 mutex_spin_on_owner
   /root/vmlinux
 ...
   Googling around  I think we want:
   
 perf record -a --call-graph
 (give it a chance to collect some samples, then ^C)
 perf report --call-graph --stdio
   
  
  Sorry it took me a while to get perf to show the call trace (did not enable 
  frame pointers in kernel and struggled with perf options...), but what I 
  get is:
  36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
  |
  --- mutex_spin_on_owner
 |
 |--99.99%-- __mutex_lock_slowpath
 |  mutex_lock
 |  |
 |  |--85.30%-- generic_file_aio_write
 
 That's the inode i_mutex.

Looking at the code... With CONFIG_MUTEX_SPIN_ON_OWNER it spins
(instead of sleeping) as long as the lock owner's still running.  So
this is just a lot of contention on the i_mutex, I guess.  Not sure what
to do about that.

--b.


Re: NFS over RDMA benchmark

2013-04-28 Thread J. Bruce Fields
On Sun, Apr 28, 2013 at 06:28:16AM +, Yan Burman wrote:
On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman
y...@mellanox.com
I've been trying to do some benchmarks for NFS over RDMA
and I seem to
   only get about half of the bandwidth that the HW can give me.
My setup consists of 2 servers each with 16 cores, 32Gb of
memory, and
   Mellanox ConnectX3 QDR card over PCI-e gen3.
These servers are connected to a QDR IB switch. The
backing storage on
   the server is tmpfs mounted with noatime.
I am running kernel 3.5.7.
   
When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
4-
  512K.
When I run fio over rdma mounted nfs, I get 260-2200MB/sec
for the
   same block sizes (4-512K). running over IPoIB-CM, I get 200-
  980MB/sec.
...
  I am trying to get maximum performance from a single server - I
  used 2
 processes in fio test - more than 2 did not show any performance 
 boost.
  I tried running fio from 2 different PCs on 2 different files,
  but the sum of
 the two is more or less the same as running from single client PC.
 
  What I did see is that server is sweating a lot more than the
  clients and
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
  cat /proc/softirqs
...
Perf top for the CPU with high tasklet count gives:
   
 samples  pcnt RIPfunction  
  DSO
...
 2787.00 24.1% 81062a00 mutex_spin_on_owner
  /root/vmlinux
...
  Googling around  I think we want:
  
  perf record -a --call-graph
  (give it a chance to collect some samples, then ^C)
  perf report --call-graph --stdio
  
 
 Sorry it took me a while to get perf to show the call trace (did not enable 
 frame pointers in kernel and struggled with perf options...), but what I get 
 is:
 36.18%  nfsd  [kernel.kallsyms]   [k] mutex_spin_on_owner
 |
 --- mutex_spin_on_owner
|
|--99.99%-- __mutex_lock_slowpath
|  mutex_lock
|  |
|  |--85.30%-- generic_file_aio_write

That's the inode i_mutex.

|  |  do_sync_readv_writev
|  |  do_readv_writev
|  |  vfs_writev
|  |  nfsd_vfs_write
|  |  nfsd_write
|  |  nfsd3_proc_write
|  |  nfsd_dispatch
|  |  svc_process_common
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|   --14.70%-- svc_send

That's the xpt_mutex (ensuring rpc replies aren't interleaved).

| svc_process
| nfsd
| kthread
| kernel_thread_helper
 --0.01%-- [...]
 
  9.63%  nfsd  [kernel.kallsyms]   [k] _raw_spin_lock_irqsave
 |
 --- _raw_spin_lock_irqsave
|
|--43.97%-- alloc_iova

And that (and __free_iova below) looks like iova_rbtree_lock.

--b.

|  intel_alloc_iova
|  __intel_map_single
|  intel_map_page
|  |
|  |--60.47%-- svc_rdma_sendto
|  |  svc_send
|  |  svc_process
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--30.10%-- rdma_read_xdr
|  |  svc_rdma_recvfrom
|  |  svc_recv
|  |  nfsd
|  |  kthread
|  |  kernel_thread_helper
|  |
|  |--6.69%-- svc_rdma_post_recv
|  |  send_reply
|  |  svc_rdma_sendto
|  

Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
 
 
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Wednesday, April 24, 2013 00:06
  To: Yan Burman
  Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA benchmark
  
  On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
  
  
-Original Message-
From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
Sent: Wednesday, April 17, 2013 21:06
To: Atchley, Scott
Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
Subject: Re: NFS over RDMA benchmark
   
On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
atchle...@ornl.gov
wrote:
 On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
wrote:

 On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
wrote:
 Hi.

 I've been trying to do some benchmarks for NFS over RDMA and I
 seem to
only get about half of the bandwidth that the HW can give me.
 My setup consists of 2 servers each with 16 cores, 32Gb of
 memory, and
Mellanox ConnectX3 QDR card over PCI-e gen3.
 These servers are connected to a QDR IB switch. The backing
 storage on
the server is tmpfs mounted with noatime.
 I am running kernel 3.5.7.

 When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
 4-512K.
 When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
 the
same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.

 Yan,

 Are you trying to optimize single client performance or server
 performance
with multiple clients?

  
   I am trying to get maximum performance from a single server - I used 2
   processes in fio test - more than 2 did not show any performance boost.
   I tried running fio from 2 different PCs on 2 different files, but the
   sum of the two is more or less the same as running from single client PC.

   What I did see is that server is sweating a lot more than the clients and
   more than that, it has 1 core (CPU5) in 100% softirq tasklet:
   cat /proc/softirqs
  Would any profiling help figure out which code it's spending time in?
  (E.g. something simple as perf top might have useful output.)
  
 
 
 Perf top for the CPU with high tasklet count gives:

  samples  pcnt RIP       function                 DSO
  _______ _____ _________ ________________________ _____________

  2787.00 24.1% 81062a00  mutex_spin_on_owner      /root/vmlinux

I guess that means lots of contention on some mutex?  If only we knew
which one.  perf should also be able to collect stack statistics; I
forget how.

--b.

   978.00  8.4% 810297f0  clflush_cache_range      /root/vmlinux
   445.00  3.8% 812ea440  __domain_mapping         /root/vmlinux
   441.00  3.8% 00018c30  svc_recv                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   344.00  3.0% 813a1bc0  _raw_spin_lock_bh        /root/vmlinux
   333.00  2.9% 813a19e0  _raw_spin_lock_irqsave   /root/vmlinux
   288.00  2.5% 813a07d0  __schedule               /root/vmlinux
   249.00  2.1% 811a87e0  rb_prev                  /root/vmlinux
   242.00  2.1% 813a19b0  _raw_spin_lock           /root/vmlinux
   184.00  1.6% 2e90      svc_rdma_sendto          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   177.00  1.5% 810ac820  get_page_from_freelist   /root/vmlinux
   174.00  1.5% 812e6da0  alloc_iova               /root/vmlinux
   165.00  1.4% 810b1390  put_page                 /root/vmlinux
   148.00  1.3% 00014760  sunrpc_cache_lookup      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   128.00  1.1% 00017f20  svc_xprt_enqueue         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   126.00  1.1% 8139f820  __mutex_lock_slowpath    /root/vmlinux
   108.00  0.9% 811a81d0  rb_insert_color          /root/vmlinux
   107.00  0.9% 4690      svc_rdma_recvfrom        /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   102.00  0.9% 2640      send_reply               /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
    99.00  0.9% 810e6490  kmem_cache_alloc         /root/vmlinux
    96.00  0.8% 810e5840  __slab_alloc             /root/vmlinux

Re: NFS over RDMA benchmark

2013-04-24 Thread J. Bruce Fields
On Wed, Apr 24, 2013 at 11:05:40AM -0400, J. Bruce Fields wrote:
 On Wed, Apr 24, 2013 at 12:35:03PM +, Yan Burman wrote:
  
  
   -Original Message-
   From: J. Bruce Fields [mailto:bfie...@fieldses.org]
   Sent: Wednesday, April 24, 2013 00:06
   To: Yan Burman
   Cc: Wendy Cheng; Atchley, Scott; Tom Tucker; linux-rdma@vger.kernel.org;
   linux-...@vger.kernel.org; Or Gerlitz
   Subject: Re: NFS over RDMA benchmark
   
   On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
   
   
 -Original Message-
 From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
 Sent: Wednesday, April 17, 2013 21:06
 To: Atchley, Scott
 Cc: Yan Burman; J. Bruce Fields; Tom Tucker;
 linux-rdma@vger.kernel.org; linux-...@vger.kernel.org
 Subject: Re: NFS over RDMA benchmark

 On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott
 atchle...@ornl.gov
 wrote:
  On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
 wrote:
 
  On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
 wrote:
  Hi.
 
  I've been trying to do some benchmarks for NFS over RDMA and I
  seem to
 only get about half of the bandwidth that the HW can give me.
  My setup consists of 2 servers each with 16 cores, 32Gb of
  memory, and
 Mellanox ConnectX3 QDR card over PCI-e gen3.
  These servers are connected to a QDR IB switch. The backing
  storage on
 the server is tmpfs mounted with noatime.
  I am running kernel 3.5.7.
 
  When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 
  4-512K.
  When I run fio over rdma mounted nfs, I get 260-2200MB/sec for
  the
 same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
 
  Yan,
 
  Are you trying to optimize single client performance or server
  performance
 with multiple clients?
 
   
    I am trying to get maximum performance from a single server - I used 2
    processes in fio test - more than 2 did not show any performance boost.
    I tried running fio from 2 different PCs on 2 different files, but the
    sum of the two is more or less the same as running from single client PC.

    What I did see is that server is sweating a lot more than the clients and
    more than that, it has 1 core (CPU5) in 100% softirq tasklet:
    cat /proc/softirqs
   
   Would any profiling help figure out which code it's spending time in?
   (E.g. something simple as perf top might have useful output.)
   
  
  
  Perf top for the CPU with high tasklet count gives:

   samples  pcnt RIP       function                 DSO
   _______ _____ _________ ________________________ _____________

   2787.00 24.1% 81062a00  mutex_spin_on_owner      /root/vmlinux
 
 I guess that means lots of contention on some mutex?  If only we knew
 which one.  perf should also be able to collect stack statistics; I
 forget how.

Googling around, I think we want:

perf record -a --call-graph
(give it a chance to collect some samples, then ^C)
perf report --call-graph --stdio

--b.

 
 --b.
 
   978.00  8.4% 810297f0  clflush_cache_range      /root/vmlinux
   445.00  3.8% 812ea440  __domain_mapping         /root/vmlinux
   441.00  3.8% 00018c30  svc_recv                 /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   344.00  3.0% 813a1bc0  _raw_spin_lock_bh        /root/vmlinux
   333.00  2.9% 813a19e0  _raw_spin_lock_irqsave   /root/vmlinux
   288.00  2.5% 813a07d0  __schedule               /root/vmlinux
   249.00  2.1% 811a87e0  rb_prev                  /root/vmlinux
   242.00  2.1% 813a19b0  _raw_spin_lock           /root/vmlinux
   184.00  1.6% 2e90      svc_rdma_sendto          /lib/modules/3.5.7-dbg/kernel/net/sunrpc/xprtrdma/svcrdma.ko
   177.00  1.5% 810ac820  get_page_from_freelist   /root/vmlinux
   174.00  1.5% 812e6da0  alloc_iova               /root/vmlinux
   165.00  1.4% 810b1390  put_page                 /root/vmlinux
   148.00  1.3% 00014760  sunrpc_cache_lookup      /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   128.00  1.1% 00017f20  svc_xprt_enqueue         /lib/modules/3.5.7-dbg/kernel/net/sunrpc/sunrpc.ko
   126.00  1.1% 8139f820  __mutex_lock_slowpath    /root/vmlinux
   108.00  0.9% 811a81d0  rb_insert_color          /root/vmlinux
   107.00  0.9% 4690      svc_rdma_recvfrom        /lib

Re: NFS over RDMA benchmark

2013-04-23 Thread J. Bruce Fields
On Thu, Apr 18, 2013 at 12:47:09PM +, Yan Burman wrote:
 
 
  -Original Message-
  From: Wendy Cheng [mailto:s.wendy.ch...@gmail.com]
  Sent: Wednesday, April 17, 2013 21:06
  To: Atchley, Scott
  Cc: Yan Burman; J. Bruce Fields; Tom Tucker; linux-rdma@vger.kernel.org;
  linux-...@vger.kernel.org
  Subject: Re: NFS over RDMA benchmark
  
  On Wed, Apr 17, 2013 at 10:32 AM, Atchley, Scott atchle...@ornl.gov
  wrote:
   On Apr 17, 2013, at 1:15 PM, Wendy Cheng s.wendy.ch...@gmail.com
  wrote:
  
   On Wed, Apr 17, 2013 at 7:36 AM, Yan Burman y...@mellanox.com
  wrote:
   Hi.
  
   I've been trying to do some benchmarks for NFS over RDMA and I seem to
  only get about half of the bandwidth that the HW can give me.
   My setup consists of 2 servers each with 16 cores, 32Gb of memory, and
  Mellanox ConnectX3 QDR card over PCI-e gen3.
   These servers are connected to a QDR IB switch. The backing storage on
  the server is tmpfs mounted with noatime.
   I am running kernel 3.5.7.
  
   When running ib_send_bw, I get 4.3-4.5 GB/sec for block sizes 4-512K.
   When I run fio over rdma mounted nfs, I get 260-2200MB/sec for the
  same block sizes (4-512K). running over IPoIB-CM, I get 200-980MB/sec.
  
   Yan,
  
   Are you trying to optimize single client performance or server performance
  with multiple clients?
  
 
 I am trying to get maximum performance from a single server - I used 2 
 processes in fio test - more than 2 did not show any performance boost.
 I tried running fio from 2 different PCs on 2 different files, but the sum of 
 the two is more or less the same as running from single client PC.
 
 What I did see is that server is sweating a lot more than the clients and 
 more than that, it has 1 core (CPU5) in 100% softirq tasklet:
 cat /proc/softirqs

Would any profiling help figure out which code it's spending time in?
(E.g. something simple as perf top might have useful output.)

--b.

 CPU0   CPU1   CPU2   CPU3   CPU4   
 CPU5   CPU6   CPU7   CPU8   CPU9   CPU10  CPU11  
 CPU12  CPU13  CPU14  CPU15
   HI:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
TIMER: 418767  46596  43515  44547  50099  
 34815  40634  40337  39551  93442  73733  42631  
 42509  41592  40351  61793
   NET_TX:  28719309   1421   1294   1730   
 1243832937 11 44 41 20
  26 19 15 29
   NET_RX: 612070 19 22 21  6
 235  3  2  9  6 17 16 
 20 13 16 10
BLOCK:   5941  0  0  0  0  
 0  0  0519259   1238272
 253174215   2618
 BLOCK_IOPOLL:  0  0  0  0  0  
 0  0  0  0  0  0  0  
 0  0  0  0
  TASKLET: 28  1  1  1  1
 1540653  1  1 29  1  1  1 
  1  1  1  2
SCHED: 364965  26547  16807  18403  22919   
 8678  14358  14091  16981  64903  47141  18517  
 19179  18036  17037  38261
  HRTIMER: 13  0  1  1  0  
 0  0  0  0  0  0  0  
 1  1  0  1
  RCU: 945823 841546 715281 892762 823564  
 42663 863063 841622 333577 389013 393501 239103 
 221524 258159 313426 234030
  
   Remember there are always gaps between wire speed (that ib_send_bw
   measures) and real world applications.
 
 I realize that, but I don't expect the difference to be more than twice.
 
  
   That being said, does your server use default export (sync) option ?
   Export the share with async option can bring you closer to wire
   speed. However, the practice (async) is generally not recommended in
   a real production system - as it can cause data integrity issues, e.g.
   you have more chances to lose data when the boxes crash.
 
 I am running with async export option, but that should not matter too much, 
 since my backing storage is tmpfs mounted with noatime.
 
  
   -- Wendy
  
  
   Wendy,
  
   It has been a few years since I looked at RPCRDMA, but I seem to
  remember that RPCs were limited to 32KB, which means that you have to
  pipeline them to get line rate. In addition to requiring

Re: NFS over RDMA crashing

2013-02-15 Thread J. Bruce Fields
On Mon, Feb 11, 2013 at 03:19:42PM +, Yan Burman wrote:
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Thursday, February 07, 2013 18:42
  To: Yan Burman
  Cc: linux-...@vger.kernel.org; sw...@opengridcomputing.com; linux-
  r...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA crashing
  
  On Wed, Feb 06, 2013 at 05:24:35PM -0500, J. Bruce Fields wrote:
   On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
When killing mount command that got stuck:
---
   
BUG: unable to handle kernel paging request at 880324dc7ff8
IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] PGD
1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
Oops: 0003 [#1] PREEMPT SMP
Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm
  iw_cm
ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
target_core_file target_core_pscsi target_core_mod configfs 8021q
bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel
kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3
  jbd
sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod CPU 6
Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro
X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[a05f3dfb]  [a05f3dfb]
rdma_read_xdr+0x8bb/0xd40 [svcrdma]
RSP: 0018:880324c3dbf8  EFLAGS: 00010297
RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
R13: 0003 R14: 0001 R15: 0010
FS:  () GS:88063fc0()
knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process nfsd (pid: 4744, threadinfo 880324c3c000, task
88033055)
Stack:
 880324c3dc78 880324c3dcd8 0282 880631cec000
 880324dd8000 88062ed33040 000124c3dc48 880324dd8000
 88062ed33058 880630ce2b90 8806299e8000 0003
Call Trace:
 [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
[81086540] ? try_to_wake_up+0x2f0/0x2f0
[a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
[a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
[a0571e5d] nfsd+0xad/0x130 [nfsd]  [a0571db0] ?
nfsd_svc+0x740/0x740 [nfsd]  [81071df6] kthread+0xd6/0xe0
[81071d20] ? __init_kthread_worker+0x70/0x70
[814b462c] ret_from_fork+0x7c/0xb0  [81071d20] ?
__init_kthread_worker+0x70/0x70
Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00
48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00 RIP
[a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]  RSP
880324c3dbf8
CR2: 880324dc7ff8
---[ end trace 06d0384754e9609a ]---
   
   
It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e
nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
is responsible for the crash (it seems to be crashing in
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527)
It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
   
When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I
was no longer getting the server crashes, so the rest of my tests
were done using that point (it is somewhere in the middle of
3.7.0-rc2).
  
   OK, so this part's clearly my fault--I'll work on a patch, but the
   rdma's use of the ->rq_pages array is pretty confusing.
  
  Does this help?
  
  They must have added this for some reason, but I'm not seeing how it could
  have ever done anything
  
  --b.
  
  diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  index 0ce7552..e8f25ec 100644
  --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  @@ -520,13 +520,6 @@ next_sge:
   	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
   		rqstp->rq_pages[ch_no] = NULL

Re: NFS over RDMA crashing

2013-02-11 Thread J. Bruce Fields
On Mon, Feb 11, 2013 at 03:19:42PM +, Yan Burman wrote:
  -Original Message-
  From: J. Bruce Fields [mailto:bfie...@fieldses.org]
  Sent: Thursday, February 07, 2013 18:42
  To: Yan Burman
  Cc: linux-...@vger.kernel.org; sw...@opengridcomputing.com; linux-
  r...@vger.kernel.org; Or Gerlitz
  Subject: Re: NFS over RDMA crashing
  
  On Wed, Feb 06, 2013 at 05:24:35PM -0500, J. Bruce Fields wrote:
   On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
When killing mount command that got stuck:
---
   
BUG: unable to handle kernel paging request at 880324dc7ff8
IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma] PGD
1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
Oops: 0003 [#1] PREEMPT SMP
Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm
  iw_cm
ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
target_core_file target_core_pscsi target_core_mod configfs 8021q
bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net
macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel
kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3
  jbd
sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod CPU 6
Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro
X8DTH-i/6/iF/6F/X8DTH
RIP: 0010:[a05f3dfb]  [a05f3dfb]
rdma_read_xdr+0x8bb/0xd40 [svcrdma]
RSP: 0018:880324c3dbf8  EFLAGS: 00010297
RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
R13: 0003 R14: 0001 R15: 0010
FS:  () GS:88063fc0()
knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process nfsd (pid: 4744, threadinfo 880324c3c000, task
88033055)
Stack:
 880324c3dc78 880324c3dcd8 0282 880631cec000
 880324dd8000 88062ed33040 000124c3dc48 880324dd8000
 88062ed33058 880630ce2b90 8806299e8000 0003
Call Trace:
 [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
[81086540] ? try_to_wake_up+0x2f0/0x2f0
[a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
[a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
[a0571e5d] nfsd+0xad/0x130 [nfsd]  [a0571db0] ?
nfsd_svc+0x740/0x740 [nfsd]  [81071df6] kthread+0xd6/0xe0
[81071d20] ? __init_kthread_worker+0x70/0x70
[814b462c] ret_from_fork+0x7c/0xb0  [81071d20] ?
__init_kthread_worker+0x70/0x70
Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00
48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00 RIP
[a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]  RSP
880324c3dbf8
CR2: 880324dc7ff8
---[ end trace 06d0384754e9609a ]---
   
   
It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e
nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
is responsible for the crash (it seems to be crashing in
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527)
It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
   
When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I
was no longer getting the server crashes, so the rest of my tests
were done using that point (it is somewhere in the middle of
3.7.0-rc2).
  
   OK, so this part's clearly my fault--I'll work on a patch, but the
    rdma's use of the ->rq_pages array is pretty confusing.
  
  Does this help?
  
  They must have added this for some reason, but I'm not seeing how it could
  have ever done anything
  
  --b.
  
  diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  index 0ce7552..e8f25ec 100644
  --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
  @@ -520,13 +520,6 @@ next_sge:
   	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
   		rqstp->rq_pages[ch_no] = NULL

Re: NFS over RDMA crashing

2013-02-07 Thread J. Bruce Fields
On Wed, Feb 06, 2013 at 05:24:35PM -0500, J. Bruce Fields wrote:
 On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
  When killing mount command that got stuck:
  ---
  
  BUG: unable to handle kernel paging request at 880324dc7ff8
  IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
  PGD 1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
  Oops: 0003 [#1] PREEMPT SMP
  Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm iw_cm
  ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
  iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
  nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
  target_core_file target_core_pscsi target_core_mod configfs 8021q
  bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net
  macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel
  kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
  ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
  mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3 jbd
  sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod
  CPU 6
  Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro
  X8DTH-i/6/iF/6F/X8DTH
  RIP: 0010:[a05f3dfb]  [a05f3dfb]
  rdma_read_xdr+0x8bb/0xd40 [svcrdma]
  RSP: 0018:880324c3dbf8  EFLAGS: 00010297
  RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
  RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
  RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
  R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
  R13: 0003 R14: 0001 R15: 0010
  FS:  () GS:88063fc0() knlGS:
  CS:  0010 DS:  ES:  CR0: 8005003b
  CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: 0ff0 DR7: 0400
  Process nfsd (pid: 4744, threadinfo 880324c3c000, task 88033055)
  Stack:
   880324c3dc78 880324c3dcd8 0282 880631cec000
   880324dd8000 88062ed33040 000124c3dc48 880324dd8000
   88062ed33058 880630ce2b90 8806299e8000 0003
  Call Trace:
   [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
   [81086540] ? try_to_wake_up+0x2f0/0x2f0
   [a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
   [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
   [a0571e5d] nfsd+0xad/0x130 [nfsd]
   [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
   [81071df6] kthread+0xd6/0xe0
   [81071d20] ? __init_kthread_worker+0x70/0x70
   [814b462c] ret_from_fork+0x7c/0xb0
   [81071d20] ? __init_kthread_worker+0x70/0x70
  Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
  00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00
  48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00
  RIP  [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
   RSP 880324c3dbf8
  CR2: 880324dc7ff8
  ---[ end trace 06d0384754e9609a ]---
  
  
  It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e
  nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
  is responsible for the crash (it seems to be crashing in
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527)
  It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
  CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
  
  When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I
  was no longer getting the server crashes,
  so the rest of my tests were done using that point (it is somewhere
  in the middle of 3.7.0-rc2).
 
 OK, so this part's clearly my fault--I'll work on a patch, but the
 rdma's use of the ->rq_pages array is pretty confusing.

Does this help?

They must have added this for some reason, but I'm not seeing how it
could have ever done anything

--b.

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0ce7552..e8f25ec 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -520,13 +520,6 @@ next_sge:
 	for (ch_no = 0; &rqstp->rq_pages[ch_no] < rqstp->rq_respages; ch_no++)
 		rqstp->rq_pages[ch_no] = NULL;
 
-	/*
-	 * Detach res pages. If svc_release sees any it will attempt to
-	 * put them.
-	 */
-	while (rqstp->rq_next_page != rqstp->rq_respages)
-		*(--rqstp->rq_next_page) = NULL;
-
 	return err;
 }
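
For readers following along, here is a standalone sketch of the bookkeeping
named in the commit subject quoted earlier in this message.  The types below
are deliberately simplified stand-ins for illustration, not the kernel's
struct svc_rqst; the point is only the equivalence between the old count and
the new cursor.

/*
 * Old scheme: a count of response pages (rq_resused) relative to
 * rq_respages.  New scheme: a cursor (rq_next_page) pointing one past
 * the last response page in rq_pages[].  Assumed layout, not kernel code.
 */
#include <assert.h>

#define MAXPAGES 8                      /* illustrative size only */

struct fake_page { int dummy; };

struct fake_rqst {
	struct fake_page  *rq_pages[MAXPAGES];
	struct fake_page **rq_respages;     /* first page of the response   */
	struct fake_page **rq_next_page;    /* one past the last resp. page */
};

int main(void)
{
	static struct fake_page pages[MAXPAGES];
	struct fake_rqst rq;
	int old_style_resused = 0;

	for (int i = 0; i < MAXPAGES; i++)
		rq.rq_pages[i] = &pages[i];

	/* Suppose the reply starts at page 2 and consumes three pages. */
	rq.rq_respages  = &rq.rq_pages[2];
	rq.rq_next_page = rq.rq_respages;
	for (int i = 0; i < 3; i++) {
		rq.rq_next_page++;          /* new style: advance the cursor */
		old_style_resused++;        /* old style: bump the count     */
	}

	/* The two schemes describe exactly the same range of pages. */
	assert(rq.rq_next_page == rq.rq_respages + old_style_resused);
	return 0;
}

The hunk above simply deletes the backwards walk over that cursor, which, as
noted, never seemed to do anything useful on this path.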
 


Re: NFS over RDMA crashing

2013-02-06 Thread J. Bruce Fields
On Wed, Feb 06, 2013 at 05:48:15PM +0200, Yan Burman wrote:
 When killing mount command that got stuck:
 ---
 
 BUG: unable to handle kernel paging request at 880324dc7ff8
 IP: [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
 PGD 1a0c063 PUD 32f82e063 PMD 32f2fd063 PTE 800324dc7161
 Oops: 0003 [#1] PREEMPT SMP
 Modules linked in: md5 ib_ipoib xprtrdma svcrdma rdma_cm ib_cm iw_cm
 ib_addr nfsd exportfs netconsole ip6table_filter ip6_tables
 iptable_filter ip_tables ebtable_nat nfsv3 nfs_acl ebtables x_tables
 nfsv4 auth_rpcgss nfs lockd autofs4 sunrpc target_core_iblock
 target_core_file target_core_pscsi target_core_mod configfs 8021q
 bridge stp llc ipv6 dm_mirror dm_region_hash dm_log vhost_net
 macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support kvm_intel
 kvm crc32c_intel microcode pcspkr joydev i2c_i801 lpc_ich mfd_core
 ehci_pci ehci_hcd sg ioatdma ixgbe mdio mlx4_ib ib_sa ib_mad ib_core
 mlx4_en mlx4_core igb hwmon dca ptp pps_core button dm_mod ext3 jbd
 sd_mod ata_piix libata uhci_hcd megaraid_sas scsi_mod
 CPU 6
 Pid: 4744, comm: nfsd Not tainted 3.8.0-rc5+ #4 Supermicro
 X8DTH-i/6/iF/6F/X8DTH
 RIP: 0010:[a05f3dfb]  [a05f3dfb]
 rdma_read_xdr+0x8bb/0xd40 [svcrdma]
 RSP: 0018:880324c3dbf8  EFLAGS: 00010297
 RAX: 880324dc8000 RBX: 0001 RCX: 880324dd8428
 RDX: 880324dc7ff8 RSI: 880324dd8428 RDI: 81149618
 RBP: 880324c3dd78 R08: 60f9c860 R09: 0001
 R10: 880324dd8000 R11: 0001 R12: 8806299dcb10
 R13: 0003 R14: 0001 R15: 0010
 FS:  () GS:88063fc0() knlGS:
 CS:  0010 DS:  ES:  CR0: 8005003b
 CR2: 880324dc7ff8 CR3: 01a0b000 CR4: 07e0
 DR0:  DR1:  DR2: 
 DR3:  DR6: 0ff0 DR7: 0400
 Process nfsd (pid: 4744, threadinfo 880324c3c000, task 88033055)
 Stack:
  880324c3dc78 880324c3dcd8 0282 880631cec000
  880324dd8000 88062ed33040 000124c3dc48 880324dd8000
  88062ed33058 880630ce2b90 8806299e8000 0003
 Call Trace:
  [a05f466e] svc_rdma_recvfrom+0x3ee/0xd80 [svcrdma]
  [81086540] ? try_to_wake_up+0x2f0/0x2f0
  [a045963f] svc_recv+0x3ef/0x4b0 [sunrpc]
  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
  [a0571e5d] nfsd+0xad/0x130 [nfsd]
  [a0571db0] ? nfsd_svc+0x740/0x740 [nfsd]
  [81071df6] kthread+0xd6/0xe0
  [81071d20] ? __init_kthread_worker+0x70/0x70
  [814b462c] ret_from_fork+0x7c/0xb0
  [81071d20] ? __init_kthread_worker+0x70/0x70
 Code: 63 c2 49 8d 8c c2 18 02 00 00 48 39 ce 77 e1 49 8b 82 40 0a 00
 00 48 39 c6 0f 84 92 f7 ff ff 90 48 8d 50 f8 49 89 92 40 0a 00 00
 48 c7 40 f8 00 00 00 00 49 8b 82 40 0a 00 00 49 3b 82 30 0a 00
 RIP  [a05f3dfb] rdma_read_xdr+0x8bb/0xd40 [svcrdma]
  RSP 880324c3dbf8
 CR2: 880324dc7ff8
 ---[ end trace 06d0384754e9609a ]---
 
 
 It seems that commit afc59400d6c65bad66d4ad0b2daf879cbff8e23e
 nfsd4: cleanup: replace rq_resused count by rq_next_page pointer
 is responsible for the crash (it seems to be crashing in
 net/sunrpc/xprtrdma/svc_rdma_recvfrom.c:527)
 It may be because I have CONFIG_DEBUG_SET_MODULE_RONX and
 CONFIG_DEBUG_RODATA enabled. I did not try to disable them yet.
 
 When I moved to commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e I
 was no longer getting the server crashes,
 so the rest of my tests were done using that point (it is somewhere
 in the middle of 3.7.0-rc2).

OK, so this part's clearly my fault--I'll work on a patch, but the
rdma's use of the ->rq_pages array is pretty confusing.
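
Roughly, the layout in question (a simplified stand-in for illustration,
not the kernel's struct svc_rqst): one rq_pages[] array holds both the
pages of the received request and the pages reserved for the reply, with
rq_respages marking where the reply's portion begins.

/*
 * Simplified illustration of that split -- assumed layout, not kernel code.
 */
#include <stdio.h>

#define RPCSVC_MAXPAGES 8	/* illustrative size only */

struct fake_page { int id; };

int main(void)
{
	static struct fake_page pool[RPCSVC_MAXPAGES];
	struct fake_page *rq_pages[RPCSVC_MAXPAGES];
	struct fake_page **rq_respages;

	for (int i = 0; i < RPCSVC_MAXPAGES; i++) {
		pool[i].id = i;
		rq_pages[i] = &pool[i];
	}

	/* Say the incoming request consumed the first three pages... */
	rq_respages = &rq_pages[3];

	/* ...then everything before rq_respages is argument data... */
	for (struct fake_page **p = rq_pages; p < rq_respages; p++)
		printf("arg  page %d\n", (*p)->id);

	/* ...and everything from rq_respages on is available for the reply. */
	for (struct fake_page **p = rq_respages; p < rq_pages + RPCSVC_MAXPAGES; p++)
		printf("resp page %d\n", (*p)->id);

	return 0;
}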

--b.


Re: [Q] I tried to mount nfs_server:/mnt volume, but It mounted another volume.

2010-11-21 Thread J. Bruce Fields
On Sun, Nov 21, 2010 at 03:41:18PM +0200, Boaz Harrosh wrote:
 On 11/21/2010 02:23 PM, Hiroyuki Sato wrote:
  Hello lists
  
  OS: CentOS 5.5
  kernel: 2.6.36 rebuild myself.
  
  
I'm trying to test NFS/RDMA.
  I tried to mount the nfs_server:/mnt volume on nfs_client,
  but it mounted the nfs_server:/nfstest volume instead
  
note: /nfstest is tmpfs
  
   this is mount output
  tmpfs on /nfstest type tmpfs (rw,size=4g)
  /dev/sdb1 on /mnt type ext3 (rw)
  
  Is this a bug?
  
  NFS server config
  
# ls -1 /mnt
This_is_mnt_volume
  
# ls -1 /nfstest
This_is_nfstest_volume
  
# cat /etc/exports
/nfstest   
  192.168.100.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
/mnt   
  192.168.100.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
  
 
 You must not have two exports with fsid=0. First one is picked. 
 
 nfs4 will only export a single name space point, other exports are
 subdirs of that root export. (use bind mounts to present a single
 directory tree)
 
 http://www.citi.umich.edu/projects/nfsv4/linux/using-nfsv4.html

We should update and/or replace that documentation; with new nfs-utils
and kernel, if you omit the fsid=0, you'll end up with an nfsv4
namespace that's the same as v2/v3's always was.
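
Concretely (an untested sketch, with an arbitrary fsid value for the second
export): giving each export its own fsid, instead of two fsid=0 roots, keeps
mountd from collapsing the two paths onto the same filehandle, e.g.:

	/nfstest   192.168.100.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash)
	/mnt       192.168.100.0/255.255.255.0(fsid=1,rw,async,insecure,no_root_squash)

On a current nfs-utils/kernel the fsid=0 can often be dropped entirely, as
above, though a tmpfs export like /nfstest generally still wants an explicit
fsid of its own.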

--b.

 
 NFSv4 exports on linux
 ~~
 
 NFSv4 no longer has a separate mount protocol. Instead of exporting a 
 number of distinct exports, an NFSv4 client sees the NFSv4 server's exports 
 as existing inside a single filesystem, called the nfsv4 pseudofilesystem.
 
 On the current linux implementation, the pseudofilesystem is a single real 
 filesystem, identified at export with the fsid=0 option.
 
 In the example above, we exported only a single filesystem, which the client 
 mounted as /. You can provide clients with multiple filesystems to mount, 
 producing NFSv3-like-behavior, by creative use of mount --bind. For example, 
 you could export /usr/local/bin to clients as /bin and /usr/local/etc as /etc 
 as follows:
 
 mkdir /export
 mkdir /export/bin
 mkdir /export/etc
 mount --bind /usr/local/bin /export/bin
 mount --bind /usr/local/etc /export/etc
 exportfs -ofsid=0,insecure,no_subtree_check *:/export
 exportfs -orw,nohide,insecure,no_subtree_check *:/export/bin
 exportfs -orw,nohide,insecure,no_subtree_check *:/export/etc
 
 Note that the paths returned by the showmount program are meaningful only 
 to clients using nfs versions 2 and 3; in the above example, showmount will 
 list the paths /export, /export/bin/, and /export/etc, but nfsv4 clients 
 should mount yourserver:/, yourserver:/bin, or yourserver:/etc.
 
 http://www.citi.umich.edu/projects/nfsv4/linux/using-nfsv4.html
 
 Boaz
  
# modprobe svcrdma
  
# /sbin/service nfs start
  
# echo rdma 20049 > /proc/fs/nfsd/portlist
  
  
  Client Setting
  
/sbin/modprobe xprtrdma
/sbin/mount.rnfs 192.168.100.231:/mnt /mnt -i -o rdma,port=20049
  
# ls -1 /mnt
This_is_nfstest_volume
  
  NFS Server log
sysctl -w sunrpc.nfsd_debug=1023
  
Nov 21 20:47:37 dell1435 mountd[3575]: authenticated mount request
  from 192.168.100.232:766 for /mnt (/mnt)
Nov 21 20:47:37 dell1435 mountd[3575]: /nfstest and /mnt have same
  filehandle for 192.168.100.0/255.255.255.0, using first
Nov 21 20:48:55 dell1435 mountd[3575]: authenticated unmount request
  from 192.168.100.232:912 for /mnt (/mnt)
Nov 21 20:48:55 dell1435 mountd[3575]: authenticated unmount request
  from 192.168.100.232:913 for /mnt (/mnt)
Nov 21 20:49:00 dell1435 mountd[3575]: authenticated unmount request
  from 192.168.100.232:917 for /mnt (/mnt)
Nov 21 20:49:16 dell1435 mountd[3575]: authenticated mount request
  from 192.168.100.232:865 for /mnt (/mnt)
Nov 21 21:02:22 dell1435 mountd[3575]: authenticated unmount request
  from 192.168.100.232:955 for /mnt (/mnt)
Nov 21 21:02:26 dell1435 mountd[3575]: authenticated mount request
  from 192.168.100.232:884 for /mnt (/mnt)
Nov 21 21:02:26 dell1435 kernel: nfsd: exp_rootfh(/mnt
  [88011f586740] 192.168.100.0/255.255.255.0:sdb1/2)
  
  
  --
  Hiroyuki Sato


Re: [PATCH 0/2] svcrdma: NFSRDMA Server fixes for 2.6.37

2010-10-19 Thread J. Bruce Fields
On Tue, Oct 12, 2010 at 03:33:46PM -0500, Tom Tucker wrote:
 Hi Bruce,
 
 These fixes are ready for 2.6.37. They fix two bugs in the server-side
 NFSRDMA transport.

Both applied and pushed out, thanks.

--b.

 
 Thanks,
 Tom
 ---
 
 Tom Tucker (2):
   svcrdma: Cleanup DMA unmapping in error paths.
   svcrdma: Change DMA mapping logic to avoid the page_address kernel API
 
 
  net/sunrpc/xprtrdma/svc_rdma_recvfrom.c  |   19 ---
  net/sunrpc/xprtrdma/svc_rdma_sendto.c|   82 
 ++
  net/sunrpc/xprtrdma/svc_rdma_transport.c |   41 +++
  3 files changed, 92 insertions(+), 50 deletions(-)
 
 -- 
 Signed-off-by: Tom Tucker t...@ogc.us


Re: [PATCH] svcrdma: RDMA support not yet compatible with RPC6

2010-04-05 Thread J. Bruce Fields
On Mon, Apr 05, 2010 at 12:16:18PM -0400, J. Bruce Fields wrote:
 On Mon, Apr 05, 2010 at 10:50:16AM -0500, Tom Tucker wrote:
  J. Bruce Fields wrote:
  On Mon, Apr 05, 2010 at 10:55:12AM -0400, Chuck Lever wrote:

  On 04/03/2010 09:27 AM, Tom Tucker wrote:
  
  RPC6 requires that it be possible to create endpoints that listen
  exclusively for IPv4 or IPv6 connection requests. This is not currently
  supported by the RDMA API.
 
  Signed-off-by: Tom Tucker t...@opengridcomputing.com
  Tested-by: Steve Wise sw...@opengridcomputing.com

  Reviewed-by: Chuck Lever chuck.le...@oracle.com
  
 
  Thanks to all.  I take it the problem began with 37498292a NFSD: Create
  PF_INET6 listener in write_ports?
 

 
  Yes.
 
 Thanks.  I'll pass along
 
   git://linux-nfs.org/~bfields/linux.git for-2.6.34
 
 soon.

And: sorry we didn't catch this when it happened.  I have some of the
equipment I'd need to do basic regression tests, but haven't set it up.

I hope I get to it at some point.  For now I depend on others to
catch even basic rdma regressions--let me know if there's some way I
could make your testing easier.
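
As background to the description quoted above: with ordinary sockets,
"listen exclusively for IPv6" is what the IPV6_V6ONLY option selects.  The
sketch below is plain userspace socket code for illustration only (the port
number is just the NFS/RDMA port used elsewhere in these threads); it is not
part of svcrdma, which at the time had no equivalent control in the RDMA API.

#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	int on = 1;
	int fd = socket(AF_INET6, SOCK_STREAM, 0);
	struct sockaddr_in6 sa;

	if (fd < 0) {
		perror("socket");
		return 1;
	}
	/* Restrict the listener to IPv6 only (no v4-mapped connections). */
	if (setsockopt(fd, IPPROTO_IPV6, IPV6_V6ONLY, &on, sizeof(on)) < 0)
		perror("setsockopt(IPV6_V6ONLY)");

	memset(&sa, 0, sizeof(sa));
	sa.sin6_family = AF_INET6;
	sa.sin6_addr = in6addr_any;
	sa.sin6_port = htons(20049);	/* example port from these threads */

	if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0)
		perror("bind");
	else if (listen(fd, 16) < 0)
		perror("listen");

	close(fd);
	return 0;
}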

--b.