Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-23 Thread Christoph Hellwig
On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> In the current xprtrdma implementation, some memreg strategies
> implement ro_unmap synchronously (the MR is knocked down before the
> method returns) and some asynchonously (the MR will be knocked down
> and returned to the pool in the background).
> 
> To guarantee the MR is truly invalid before the RPC consumer is
> allowed to resume execution, we need an unmap method that is
> always synchronous, invoked from the RPC/RDMA reply handler.
> 
> The new method unmaps all MRs for an RPC. The existing ro_unmap
> method unmaps only one MR at a time.

Do we really want to go down that road?  It seems like we've decided
in general that while the protocol specs say MR must be unmapped before
proceeding with the data that is painful enough to ignore this
requirement.  E.g. iser for example only does the local invalidate
just before reusing the MR.

I'd like to hear arguments for and against each method instead of
adding more magic to drivers to either optimize MR performance and
add clunky workarounds to make it even slower, and instead handled
the semantics we agreed upo in common code.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-23 Thread Jason Gunthorpe
On Mon, Nov 23, 2015 at 10:45:56PM -0800, Christoph Hellwig wrote:
> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
> > In the current xprtrdma implementation, some memreg strategies
> > implement ro_unmap synchronously (the MR is knocked down before the
> > method returns) and some asynchonously (the MR will be knocked down
> > and returned to the pool in the background).
> > 
> > To guarantee the MR is truly invalid before the RPC consumer is
> > allowed to resume execution, we need an unmap method that is
> > always synchronous, invoked from the RPC/RDMA reply handler.
> > 
> > The new method unmaps all MRs for an RPC. The existing ro_unmap
> > method unmaps only one MR at a time.
> 
> Do we really want to go down that road?  It seems like we've decided
> in general that while the protocol specs say MR must be unmapped before
> proceeding with the data that is painful enough to ignore this

That is not my impression, I was thinking we keep finding that ULPs
are not implemented correctly. The various clean up exercises keep
exposing flaws.

The common code is intended to drive RDMA properly.

Async invalidating the rkey is fundamentally a security issue and
should be treated as such. The kernel never trades security for
performance without a user opt in. This is the same logic we've used
for purging the global writable rkey stuff, even though it often had
performance.

> requirement.  E.g. iser for example only does the local invalidate
> just before reusing the MR.

Ugh :(

> I'd like to hear arguments for and against each method instead of
> adding more magic to drivers to either optimize MR performance and
> add clunky workarounds to make it even slower, and instead handled
> the semantics we agreed upo in common code.

Common code should make it easy to do this right, an invalidate of the
MR ordered before the dma unmap, which must complete before the buffer
is handed back to the caller. With easy support for send with
invalidate.

If the common code has an opt-in to make some of these steps run
async, and that gives performance, then fine, but the default should be
secure operation.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Sagi Grimberg



On 24/11/2015 08:45, Christoph Hellwig wrote:

On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:

In the current xprtrdma implementation, some memreg strategies
implement ro_unmap synchronously (the MR is knocked down before the
method returns) and some asynchonously (the MR will be knocked down
and returned to the pool in the background).

To guarantee the MR is truly invalid before the RPC consumer is
allowed to resume execution, we need an unmap method that is
always synchronous, invoked from the RPC/RDMA reply handler.

The new method unmaps all MRs for an RPC. The existing ro_unmap
method unmaps only one MR at a time.


Do we really want to go down that road?  It seems like we've decided
in general that while the protocol specs say MR must be unmapped before
proceeding with the data that is painful enough to ignore this
requirement.  E.g. iser for example only does the local invalidate
just before reusing the MR.


It is painful, too painful. The entire value proposition of RDMA is
low-latency and waiting for the extra HW round-trip for a local
invalidation to complete is unacceptable, moreover it adds a huge loads
of extra interrupts and cache-line pollutions.

As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does) then local invalidate before
re-use is only marginally worse. For iSER, remote invalidate solves this 
(patches submitted!) and I'd say we should push for all the

storage standards to include remote invalidate. There is the question
of multi-rkey transactions, which is why I stated in the past that
arbitrary sg registration is important (which will be submitted soon
for ConnectX-4).

Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Tom Talpey

On 11/24/2015 5:59 AM, Sagi Grimberg wrote:

As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does)


For the record, that is false. Windows quite strictly performs these
steps, and deliver millions of real 4K IOPS over SMB Direct.


Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.


Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.

Tom.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 5:59 AM, Sagi Grimberg  wrote:
> 
> 
> 
> On 24/11/2015 08:45, Christoph Hellwig wrote:
>> On Mon, Nov 23, 2015 at 05:14:14PM -0500, Chuck Lever wrote:
>>> In the current xprtrdma implementation, some memreg strategies
>>> implement ro_unmap synchronously (the MR is knocked down before the
>>> method returns) and some asynchonously (the MR will be knocked down
>>> and returned to the pool in the background).
>>> 
>>> To guarantee the MR is truly invalid before the RPC consumer is
>>> allowed to resume execution, we need an unmap method that is
>>> always synchronous, invoked from the RPC/RDMA reply handler.
>>> 
>>> The new method unmaps all MRs for an RPC. The existing ro_unmap
>>> method unmaps only one MR at a time.
>> 
>> Do we really want to go down that road?  It seems like we've decided
>> in general that while the protocol specs say MR must be unmapped before
>> proceeding with the data that is painful enough to ignore this
>> requirement.  E.g. iser for example only does the local invalidate
>> just before reusing the MR.

That leaves the MR exposed to the remote indefinitely. If
the MR is registered for remote write, that seems hazardous.


> It is painful, too painful. The entire value proposition of RDMA is
> low-latency and waiting for the extra HW round-trip for a local
> invalidation to complete is unacceptable, moreover it adds a huge loads
> of extra interrupts and cache-line pollutions.

The killer is the extra context switches, I’ve found.


> As I see it, if we don't wait for local-invalidate to complete before
> unmap and IO completion (and no one does) then local invalidate before
> re-use is only marginally worse. For iSER, remote invalidate solves this 
> (patches submitted!) and I'd say we should push for all the
> storage standards to include remote invalidate.

I agree: the right answer is to use remote invalidation,
and to ensure the order is always:

  1. invalidate the MR
  2. unmap the MR
  3. wake up the consumer

And that is exactly my strategy for NFS/RDMA. I don’t have
a choice: as Tom observed yesterday, krb5i is meaningless
unless the integrity of the data is guaranteed by fencing
the server before the client performs checksumming. I
expect the same is true for T10-PI.


> There is the question
> of multi-rkey transactions, which is why I stated in the past that
> arbitrary sg registration is important (which will be submitted soon
> for ConnectX-4).
> 
> Waiting for local invalidate to complete would be a really big
> sacrifice for our storage ULPs.

I’ve noticed only a marginal loss of performance on modern
hardware.


--
Chuck Lever




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Sagi Grimberg

Hey Tom,


On 11/24/2015 5:59 AM, Sagi Grimberg wrote:

As I see it, if we don't wait for local-invalidate to complete before
unmap and IO completion (and no one does)


For the record, that is false. Windows quite strictly performs these
steps,


I meant Linux ULPs. I'm not very familiar with Windows SMBD.


and deliver millions of real 4K IOPS over SMB Direct.


That's very encouraging! Is this the client side scaling?
From my experience, getting a storage client/initiator to scale
up to 1.5-3 MIOPs over a single HCA with limited number of cores
is really a struggle for each interrupt and cacheline.

Still the latency of each IO will see a noticable increase.


Waiting for local invalidate to complete would be a really big
sacrifice for our storage ULPs.


Not waiting would also be a sacrifice, by compromising data integrity.
It's a tough tradeoff, but if you choose to go that way it will be
critical to be honest about the consequences to the data.


That's true. I assume that the best compromise would be remote
invalidation but some standards needs enhancements and it doesn't solve
the multi-rkey transactions.

As storage devices are becoming faster we really should try to do our
best to keep up.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Sagi Grimberg

Hey Chuck,




It is painful, too painful. The entire value proposition of RDMA is
low-latency and waiting for the extra HW round-trip for a local
invalidation to complete is unacceptable, moreover it adds a huge loads
of extra interrupts and cache-line pollutions.


The killer is the extra context switches, I’ve found.


That too...


I’ve noticed only a marginal loss of performance on modern
hardware.


Would you mind sharing your observations?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Chuck Lever

> On Nov 24, 2015, at 9:44 AM, Sagi Grimberg  wrote:
> 
> Hey Chuck,
> 
>> 
>>> It is painful, too painful. The entire value proposition of RDMA is
>>> low-latency and waiting for the extra HW round-trip for a local
>>> invalidation to complete is unacceptable, moreover it adds a huge loads
>>> of extra interrupts and cache-line pollutions.
>> 
>> The killer is the extra context switches, I’ve found.
> 
> That too...
> 
>> I’ve noticed only a marginal loss of performance on modern
>> hardware.
> 
> Would you mind sharing your observations?

I’m testing with CX-3 Pro on FDR.

NFS READ and WRITE round trip latency, which includes the cost
of registration and now invalidation, is not noticeably longer.
dbench and fio results are marginally slower (in the neighborhood
of 5%).

For NFS, the cost of invalidation is probably not significant
compared to other bottlenecks in our stack (lock contention and
scheduling overhead are likely the largest contributors).

Notice that xprtrdma chains together all the LOCAL_INV WRs for
an RPC, and only signals the final one. Before, every LOCAL_INV
WR was signaled. So this patch actually reduces the send
completion rate.

The main benefit for NFS of waiting for invalidation to complete
is better send queue accounting. Even without the data integrity
issue, we have to ensure the WQEs consumed by invalidation
requests are released before dispatching another RPC. Otherwise
the send queue can be overrun.


--
Chuck Lever




--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 3/9] xprtrdma: Introduce ro_unmap_sync method

2015-11-24 Thread Jason Gunthorpe
On Tue, Nov 24, 2015 at 09:39:33AM -0500, Chuck Lever wrote:
> >> Do we really want to go down that road?  It seems like we've decided
> >> in general that while the protocol specs say MR must be unmapped before
> >> proceeding with the data that is painful enough to ignore this
> >> requirement.  E.g. iser for example only does the local invalidate
> >> just before reusing the MR.
> 
> That leaves the MR exposed to the remote indefinitely. If
> the MR is registered for remote write, that seems hazardous.

Agree.

If we have a performance knob it should be to post the invalidate,
then immediately dma unmap and start using the buffer, not to defer
the invalidate potentially indefinitely.  Then it is just a race to be
exploited not a long lived window into random page cache memory.

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html