subject:"\\\\\\\[Nbd\\\\\\\] \\\\\\\[RESEND\\\\\\\]\\\\\\\[PATCH 0\\\\\\\/5\\\\\\\] nbd improvements"

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Paolo Bonzini



On 15/09/2016 17:23, Alex Bligh wrote:
> Paolo,
> 
>> On 15 Sep 2016, at 15:07, Paolo Bonzini  wrote:
>>
>> I don't think QEMU forbids multiple clients to the single server, and
>> guarantees consistency as long as there is no overlap between writes and
>> reads.  These are the same guarantees you have for multiple commands on
>> a single connection.
>>
>> In other words, from the POV of QEMU there's no difference whether
>> multiple commands come from one or more connections.
> 
> This isn't really about ordering, it's about cache coherency
> and persisting things to disk.
> 
> What you say is correct as far as it goes in terms of ordering. However
> consider the scenario with read and writes on two channels as follows
> of the same block:
> 
>  Channel1 Channel2
> 
>  R  Block read, and cached in user space in
> channel 1's cache
> Reply sent
> 
>   W New value written, channel 2's cache updated
> channel 1's cache not
> 
>  R  Value returned from channel 1's cache.
> 
> 
> In the above scenario, there is a problem if the server(s) handling the
> two channels each use a read cache which is not coherent between the
> two channels. An example would be a read-through cache on a server that
> did fork() and shared no state between connections.

qemu-nbd does not fork(), so there is no coherency issue if W has replied.

However, if W hasn't replied, channel1 can get garbage.  Typically the
VM will be the one during writes, everyone else must be ready to handle
whatever mess the VM throws at them.

Paolo

> Similarly, if there is a write on channel 1 that has completed, and
> the flush goes to channel 2, it may not (if state is not shared) guarantee
> that the write on channel 1 (which has completed) is persisted to non-volatile
> media. Obviously if the 'state' is OS block cache/buffers/whatever, it
> will, but if it's (e.g.) a user-space per process write-through cache,
> it won't.
> 
> I don't know whether qemu-nbd is likely to suffer from either of these.

It can't happen.  On the other hand, channel1 must be ready to handle
garbage, it's illegal.






signature.asc
Description: OpenPGP digital signature

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Eric Blake

On 09/15/2016 11:27 AM, Wouter Verhelst wrote:
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>>
>>> The server can always refuse to allow multiple connections.
>>
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g.  'multiple connections
>> unsafe').
> 
> I suppose that's not a bad idea.

Seems like it may need to be a per-export flag, rather than a global
flag (as a given server may be able to serve multiple types of files,
where the safety depends on the type of file being served).


-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org



signature.asc
Description: OpenPGP digital signature

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

Wouter,

> On 15 Sep 2016, at 17:27, Wouter Verhelst  wrote:
> 
> On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
>> Wouter,
>> 
>>> The server can always refuse to allow multiple connections.
>> 
>> Sure, but it would be neater to warn the client of that at negotiation
>> stage (it would only be one flag, e.g.  'multiple connections
>> unsafe').
> 
> I suppose that's not a bad idea.

Good.

> [...]
>>> I was thinking of changing the spec as follows:
>>> 
>>> diff --git a/doc/proto.md b/doc/proto.md
>>> index 217f57e..cb099e2 100644
>>> --- a/doc/proto.md
>>> +++ b/doc/proto.md
>>> @@ -308,6 +308,23 @@ specification, the
>>> [kernel 
>>> documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
>>> may be useful.
>>> 
>>> +For performance reasons, clients MAY open multiple connections to the
>>> +same server. To support such clients, servers SHOULD ensure that at
>>> +least one of the following conditions hold:
>>> +
>>> +* Flush commands are processed for ALL connections. That is, when an
>>> +  `NBD_CMD_WRITE` is processed on one connection, and then an
>>> +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
>>> +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
>>> +  before the reply of the `NBD_CMD_FLUSH` is sent.
>>> +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
>>> +  connection
>>> +* Multiple connections are not allowed
>>> +
>>> +In addition, clients using multiple connections SHOULD NOT send
>>> +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
>>> +the flush has not been replied to yet.
>>> +
>> 
>> I don't think that should be a mandatory behaviour.
> 
> Which part of it?

I've read it again :-) The wording was slightly contorted. I think
what I mean is that if you don't support flush at all, that's
another option.

The final paragraph I am not sure is right, as that's not what the kernel
currently does. If we are going to suggest a change in our main client's
behaviour, should we not just request that flush is done on all channels?

>> For once, it would
>> be reasonably easy on gonbdserver but a PITA on the reference server.
>> You'd need to put in IPC between each of the forked processes OR rely
>> on fdatasync() only - I'm not sure that would necessarily work
>> safely with (e.g.) the 'treefiles' / COW options.
>> 
>> I think better would be to say that the server MUST either
>> 
>> * Not support NBD_CMD_FLUSH at all
> 
> I think we should discourage not supporting FLUSH, rather than
> suggesting it. 

Sure, but some backends just don't support flush. For them, this
aspect at least is not a problem.

>> * Support NBD_CMD_FLUSH across channels (as you set out above), or
>> * Indicate that it does not support multiple channels.
> 
> You dropped the one with no writes. I said "at most" there for a reason.
> Originally I was going to say "if the server is read-only", but then
> thought that it could work to do the "at most" thing. After having given
> that some more thought, I now realize that if you write, you need to
> flush across to other channels, regardless of whether they write too, so
> that bit of it is moot now anyway.
> 
> Still, a server which exports read-only should still be safe for
> multiple connections, even if there is no cache coherency (since
> presumably nothing's going to change anyway).

Yes

>> Actually I think this is a problem anyway. A simpler failure case is
>> one where (by chance) one channel gets the writes, and one channel
>> gets the flushes. The flush reply is delayed beyond the replies to
>> unconnected writes (on the other channel) and hence the kernel thinks
>> replied-to writes have been persisted when they have not been.
> 
> Yes, that is another example of essentially the same problem.

Yeah, I was just trying to simplify things.

>> The only way to fix that (as far as I can see) without changing flush
>> semantics is for the block layer to issue flush requests on each
>> channel when flushing on one channel.
> 
> Christoph just said that that doesn't (currently) happen; I don't know
> whether the kernel currently already (optimistically) sends out flush
> requests before the writes that it expects to hit permanent storage have
> finished, but if it doesn't do that, then there is no problem and my
> suggested bit of spec would be okay.
> 
> If there are good reasons to do so, however, we do indeed have a problem
> and something else is necessary. I don't think flushing across all
> connections is the best solution, though.

Well, the way I look at it is that we have a proposed change in
client behaviour (multiple channels) which causes problems at
least with flush and also (I think) with cache coherency (see other
email). We should either not make that change, or ensure other changes
are added which mitigate these issues.

Flush is actually the obvious one. Cache coherency is far more
subtle (though

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 05:08:21PM +0100, Alex Bligh wrote:
> Wouter,
> 
> > The server can always refuse to allow multiple connections.
> 
> Sure, but it would be neater to warn the client of that at negotiation
> stage (it would only be one flag, e.g.  'multiple connections
> unsafe').

I suppose that's not a bad idea.

[...]
> > I was thinking of changing the spec as follows:
> > 
> > diff --git a/doc/proto.md b/doc/proto.md
> > index 217f57e..cb099e2 100644
> > --- a/doc/proto.md
> > +++ b/doc/proto.md
> > @@ -308,6 +308,23 @@ specification, the
> > [kernel 
> > documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
> > may be useful.
> > 
> > +For performance reasons, clients MAY open multiple connections to the
> > +same server. To support such clients, servers SHOULD ensure that at
> > +least one of the following conditions hold:
> > +
> > +* Flush commands are processed for ALL connections. That is, when an
> > +  `NBD_CMD_WRITE` is processed on one connection, and then an
> > +  `NBD_CMD_FLUSH` is processed on another connection, the data of the
> > +  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
> > +  before the reply of the `NBD_CMD_FLUSH` is sent.
> > +* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
> > +  connection
> > +* Multiple connections are not allowed
> > +
> > +In addition, clients using multiple connections SHOULD NOT send
> > +`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
> > +the flush has not been replied to yet.
> > +
> 
> I don't think that should be a mandatory behaviour.

Which part of it?

> For once, it would
> be reasonably easy on gonbdserver but a PITA on the reference server.
> You'd need to put in IPC between each of the forked processes OR rely
> on fdatasync() only - I'm not sure that would necessarily work
> safely with (e.g.) the 'treefiles' / COW options.
> 
> I think better would be to say that the server MUST either
> 
> * Not support NBD_CMD_FLUSH at all

I think we should discourage not supporting FLUSH, rather than
suggesting it. 

> * Support NBD_CMD_FLUSH across channels (as you set out above), or
> * Indicate that it does not support multiple channels.

You dropped the one with no writes. I said "at most" there for a reason.
Originally I was going to say "if the server is read-only", but then
thought that it could work to do the "at most" thing. After having given
that some more thought, I now realize that if you write, you need to
flush across to other channels, regardless of whether they write too, so
that bit of it is moot now anyway.

Still, a server which exports read-only should still be safe for
multiple connections, even if there is no cache coherency (since
presumably nothing's going to change anyway).

[...]
> > The latter bit (on the client side) is because even if your backend has
> > no cache coherency issues, TCP does not guarantee ordering between
> > multiple connections. I don't know if the above is in line with what
> > blk-mq does, but consider the following scenario:
> > 
> > - A client sends two writes to the server, followed (immediately) by a
> >  flush, where at least the second write and the flush are not sent over
> >  the same connection.
> > - The first write is a small one, and it is handled almost immediately.
> > - The second write takes a little longer, so the flush is handled
> >  earlier than the second write
> > - The network packet containing the flush reply gets lost for whatever
> >  reason, so the client doesn't get it, and we fall into TCP
> >  retransmits.
> > - The second write finishes, and its reply header does not get lost
> > - After the second write reply reaches the client, the TCP retransmits
> >  for the flush reply are handled.
> > 
> > In the above scenario, the flush reply arrives on the client side after
> > a write reply which it did not cover; so the client will (incorrectly)
> > assume that the write has reached permanent storage when in fact it may
> > not have done so yet.
> > 
> > If the kernel does not care about the ordering of the two writes versus
> > the flush, then there is no problem. I don't know how blk-mq works in
> > that context, but if the above is a likely scenario, we may have to
> > reconsider adding blk-mq to nbd.
> 
> Actually I think this is a problem anyway. A simpler failure case is
> one where (by chance) one channel gets the writes, and one channel
> gets the flushes. The flush reply is delayed beyond the replies to
> unconnected writes (on the other channel) and hence the kernel thinks
> replied-to writes have been persisted when they have not been.

Yes, that is another example of essentially the same problem.

> The only way to fix that (as far as I can see) without changing flush
> semantics is for the block layer to issue flush requests on each
> channel when flushing on one channel.

Christoph just said that that doesn't (currently) happen; I don't know

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Josef Bacik


On 09/15/2016 09:17 AM, Wouter Verhelst wrote:

On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:



On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:

On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:

That's probably right in the case of file-based back ends that
are running on a Linux OS. But gonbdserver for instance supports
(e.g.) Ceph based backends, where each connection might be talking
to a completely separate ceph node, and there may be no cache
consistency between connections.


Yes, if you don't have a cache coherent backend you are generally
screwed with a multiqueue protocol.


I wonder if the ability to support multiqueue should be visible
in the negotiation stage. That would allow the client to refuse
to select multiqueue where it isn't safe.


The server can always refuse to allow multiple connections.

I was thinking of changing the spec as follows:

diff --git a/doc/proto.md b/doc/proto.md
index 217f57e..cb099e2 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -308,6 +308,23 @@ specification, the
 [kernel 
documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
 may be useful.

+For performance reasons, clients MAY open multiple connections to the
+same server. To support such clients, servers SHOULD ensure that at
+least one of the following conditions hold:
+
+* Flush commands are processed for ALL connections. That is, when an
+  `NBD_CMD_WRITE` is processed on one connection, and then an
+  `NBD_CMD_FLUSH` is processed on another connection, the data of the
+  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
+  before the reply of the `NBD_CMD_FLUSH` is sent.
+* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
+  connection
+* Multiple connections are not allowed
+
+In addition, clients using multiple connections SHOULD NOT send
+`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
+the flush has not been replied to yet.
+
  Request message

 The request message, sent by the client, looks as follows:

The latter bit (on the client side) is because even if your backend has
no cache coherency issues, TCP does not guarantee ordering between
multiple connections. I don't know if the above is in line with what
blk-mq does, but consider the following scenario:

- A client sends two writes to the server, followed (immediately) by a
  flush, where at least the second write and the flush are not sent over
  the same connection.
- The first write is a small one, and it is handled almost immediately.
- The second write takes a little longer, so the flush is handled
  earlier than the second write
- The network packet containing the flush reply gets lost for whatever
  reason, so the client doesn't get it, and we fall into TCP
  retransmits.
- The second write finishes, and its reply header does not get lost
- After the second write reply reaches the client, the TCP retransmits
  for the flush reply are handled.

In the above scenario, the flush reply arrives on the client side after
a write reply which it did not cover; so the client will (incorrectly)
assume that the write has reached permanent storage when in fact it may
not have done so yet.


This isn't an NBD problem, this is an application problem.  The application must 
wait for all writes it cares about _before_ issuing a flush.  This is the same 
as for normal storage as it is for NBD.  It is not NBD's responsibility to 
maintain coherency between multiple requests across connections, just simply to 
act on and respond to requests.


I think changing the specification to indicate that this is the case for 
multiple connections is a good thing, to keep NBD servers from doing weird 
things like sending different connections to the same export to different 
backing stores without some sort of synchronization.  It should definitely be 
explicitly stated somewhere that NBD does not provide any ordering guarantees 
and that is up to the application.  Thanks,


Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 01:44:29PM +0100, Alex Bligh wrote:
> 
> > On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:
> > 
> > On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> >> That's probably right in the case of file-based back ends that
> >> are running on a Linux OS. But gonbdserver for instance supports
> >> (e.g.) Ceph based backends, where each connection might be talking
> >> to a completely separate ceph node, and there may be no cache
> >> consistency between connections.
> > 
> > Yes, if you don't have a cache coherent backend you are generally
> > screwed with a multiqueue protocol.
> 
> I wonder if the ability to support multiqueue should be visible
> in the negotiation stage. That would allow the client to refuse
> to select multiqueue where it isn't safe.

The server can always refuse to allow multiple connections.

I was thinking of changing the spec as follows:

diff --git a/doc/proto.md b/doc/proto.md
index 217f57e..cb099e2 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -308,6 +308,23 @@ specification, the
 [kernel 
documentation](https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt)
 may be useful.
 
+For performance reasons, clients MAY open multiple connections to the
+same server. To support such clients, servers SHOULD ensure that at
+least one of the following conditions hold:
+
+* Flush commands are processed for ALL connections. That is, when an
+  `NBD_CMD_WRITE` is processed on one connection, and then an
+  `NBD_CMD_FLUSH` is processed on another connection, the data of the
+  `NBD_CMD_WRITE` on the first connection MUST reach permanent storage
+  before the reply of the `NBD_CMD_FLUSH` is sent.
+* The server allows `NBD_CMD_WRITE` and `NBD_CMD_FLUSH` on at most one
+  connection
+* Multiple connections are not allowed
+
+In addition, clients using multiple connections SHOULD NOT send
+`NBD_CMD_FLUSH` if an `NBD_CMD_WRITE` for which they care in relation to
+the flush has not been replied to yet.
+
  Request message
 
 The request message, sent by the client, looks as follows:

The latter bit (on the client side) is because even if your backend has
no cache coherency issues, TCP does not guarantee ordering between
multiple connections. I don't know if the above is in line with what
blk-mq does, but consider the following scenario:

- A client sends two writes to the server, followed (immediately) by a
  flush, where at least the second write and the flush are not sent over
  the same connection.
- The first write is a small one, and it is handled almost immediately.
- The second write takes a little longer, so the flush is handled
  earlier than the second write
- The network packet containing the flush reply gets lost for whatever
  reason, so the client doesn't get it, and we fall into TCP
  retransmits.
- The second write finishes, and its reply header does not get lost
- After the second write reply reaches the client, the TCP retransmits
  for the flush reply are handled.

In the above scenario, the flush reply arrives on the client side after
a write reply which it did not cover; so the client will (incorrectly)
assume that the write has reached permanent storage when in fact it may
not have done so yet.

If the kernel does not care about the ordering of the two writes versus
the flush, then there is no problem. I don't know how blk-mq works in
that context, but if the above is a likely scenario, we may have to
reconsider adding blk-mq to nbd.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh


> On 15 Sep 2016, at 13:41, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
>> That's probably right in the case of file-based back ends that
>> are running on a Linux OS. But gonbdserver for instance supports
>> (e.g.) Ceph based backends, where each connection might be talking
>> to a completely separate ceph node, and there may be no cache
>> consistency between connections.
> 
> Yes, if you don't have a cache coherent backend you are generally
> screwed with a multiqueue protocol.

I wonder if the ability to support multiqueue should be visible
in the negotiation stage. That would allow the client to refuse
to select multiqueue where it isn't safe.

Wouter?

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig

On Thu, Sep 15, 2016 at 01:39:11PM +0100, Alex Bligh wrote:
> That's probably right in the case of file-based back ends that
> are running on a Linux OS. But gonbdserver for instance supports
> (e.g.) Ceph based backends, where each connection might be talking
> to a completely separate ceph node, and there may be no cache
> consistency between connections.

Yes, if you don't have a cache coherent backend you are generally
screwed with a multiqueue protocol.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh


> On 15 Sep 2016, at 13:36, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:33:20PM +0100, Alex Bligh wrote:
>> At an implementation level that is going to be a little difficult
>> for some NBD servers, e.g. ones that fork() a different process per
>> connection. There is in general no IPC to speak of between server
>> instances. Such servers would thus be unsafe with more than one
>> connection if FLUSH is in use.
>> 
>> I believe such servers include the reference server where there is
>> process per connection (albeit possibly with several threads).
>> 
>> Even single process servers (including mine - gonbdserver) would
>> require logic to pair up multiple connections to the same
>> device.
> 
> Why?  If you only send the completion after your I/O syscall returned
> your are fine if fsync comes from a difference process, no matter
> if you're using direct or buffered I/O underneath.

That's probably right in the case of file-based back ends that
are running on a Linux OS. But gonbdserver for instance supports
(e.g.) Ceph based backends, where each connection might be talking
to a completely separate ceph node, and there may be no cache
consistency between connections.

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

> On 15 Sep 2016, at 13:23, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 02:21:20PM +0200, Wouter Verhelst wrote:
>> Right. So do I understand you correctly that blk-mq currently doesn't
>> look at multiple queues, and just assumes that if a FLUSH is sent over
>> any one of the queues, it applies to all queues?
> 
> Yes.  The same is true at the protocol level for NVMe or SCSI transports
> that can make use of multiple queues.

At an implementation level that is going to be a little difficult
for some NBD servers, e.g. ones that fork() a different process per
connection. There is in general no IPC to speak of between server
instances. Such servers would thus be unsafe with more than one
connection if FLUSH is in use.

I believe such servers include the reference server where there is
process per connection (albeit possibly with several threads).

Even single process servers (including mine - gonbdserver) would
require logic to pair up multiple connections to the same
device.

I suspect ALL nbd servers would require a change. So I think we should
think carefully before introducing this protocol change.

It might be easier to make the client code change work around this
issue (somehow).

-- 
Alex Bligh

--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh


> On 15 Sep 2016, at 13:18, Christoph Hellwig  wrote:
> 
> Yes, please do that.  A "barrier" implies draining of the queue.

Done

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 05:20:08AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 02:01:59PM +0200, Wouter Verhelst wrote:
> > Yes. There was some discussion on that part, and we decided that setting
> > the flag doesn't hurt, but the spec also clarifies that using it on READ
> > does nothing, semantically.
> >
> > 
> > The problem is that there are clients in the wild which do set it on
> > READ, so it's just a matter of "be liberal in what you accept".
> 
> Note that FUA on READ in SCSI and NVMe does have a meaning - it
> requires you to bypass any sort of cache on the target.  I think it's an
> wrong defintion because it mandates implementation details that aren't
> observable by the initiator, but it's still the spec wording and nbd
> diverges from it.  That's not nessecarily a bad thing, but a caveat to
> look out for.

Yes. I think the kernel nbd driver should probably filter out FUA on
READ. It has no meaning in the case of nbd, and whatever expectations
the kernel may have cannot be provided for by nbd anyway.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 04:38:07AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
> > A while back, we spent quite some time defining the semantics of the
> > various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> > write barriers. At the time, we decided that it would be unreasonable
> > to expect servers to make these write barriers effective across
> > different connections.
> 
> Do you have a nbd protocol specification?

https://github.com/yoe/nbd/blob/master/doc/proto.md

> treating a flush or fua as any sort of barrier is incredibly stupid.

Maybe I'm not using the correct terminology here. The point is that
after a FLUSH, the server asserts that all write commands *for which a
reply has already been sent to the client* will also have reached
permanent storage. Nothing is asserted about commands for which the
reply has not yet been sent, regardless of whether they were sent to the
server before or after the FLUSH; they may reach permanent storage as a
result of the FLUSH, or they may not, we don't say.

With FUA, we only assert that the FUA-flagged command reaches permanent
storage before its reply is sent, nothing else.

If that's not a write barrier, then I was using the wrong terminology
(and offer my apologies for the confusion).

The point still remains that "X was sent before Y" is difficult to
determine on the client side if X was sent over a different TCP channel
than Y, because a packet might be dropped (requiring a retransmit) for
X, and similar things. If blk-mq can deal with that, we're good and
nothing further needs to be done. If not, this should be evaluated by
someone more familiar with the internals of the kernel block subsystem
than me.

> Is it really documented that way, and if yes, why?

The documentation does use the term write barrier, but only right before
the same explanation as above. It does so because I assumed that that is
what a write barrier is, and that this would make things easier to
understand. If you tell me that that term is wrong as used there, I can
easily remove it (it's not critical to the rest of the documentation).

Regards,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig

On Thu, Sep 15, 2016 at 01:55:14PM +0200, Wouter Verhelst wrote:
> Maybe I'm not using the correct terminology here. The point is that
> after a FLUSH, the server asserts that all write commands *for which a
> reply has already been sent to the client* will also have reached
> permanent storage. Nothing is asserted about commands for which the
> reply has not yet been sent, regardless of whether they were sent to the
> server before or after the FLUSH; they may reach permanent storage as a
> result of the FLUSH, or they may not, we don't say.
> 
> With FUA, we only assert that the FUA-flagged command reaches permanent
> storage before its reply is sent, nothing else.

Yes, that's the correct semantics.

> If that's not a write barrier, then I was using the wrong terminology
> (and offer my apologies for the confusion).

It's not a write barrier - a write barrier was command that ensured that

 a) all previous writes were completed to the host/client
 b) all previous writes were on non-volatile storage

and

 c) the actual write with the barrier bit was on non-volatile storage

> The point still remains that "X was sent before Y" is difficult to
> determine on the client side if X was sent over a different TCP channel
> than Y, because a packet might be dropped (requiring a retransmit) for
> X, and similar things. If blk-mq can deal with that, we're good and
> nothing further needs to be done. If not, this should be evaluated by
> someone more familiar with the internals of the kernel block subsystem
> than me.

The important bit in all the existing protocols, and which Linux relies
on is that any write the Linux block layer got a completion for earlier
needs to be flushed out to non-volatile storage when a FLUSH command is
set.  Anything that still is in flight does not matter.  Which for
NBD means anything that you already replies to need to be flushed.

Or to say it more practicly - in the nbd server you simply need to
call fdatasync on the backing device or file whenever you get a FLUSH
requires, and it will do the right thing.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 04:52:17AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 15, 2016 at 12:46:07PM +0100, Alex Bligh wrote:
> > Essentially NBD does supports FLUSH/FUA like this:
> > 
> > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
> > 
> > IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).
> > 
> > Link to protocol (per last email) here:
> > 
> > https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> Flush as defined by the Linux block layer (and supported that way in
> SCSI, ATA, NVMe) only requires to flush all already completed writes
> to non-volatile media.

That is precisely what FLUSH in nbd does, too.

> It does not impose any ordering unlike the nbd spec.

If you read the spec differently, then that's a bug in the spec. Can you
clarify which part of it caused that confusion? We should fix it, then.

> FUA as defined by the Linux block layer (and supported that way in SCSI,
> ATA, NVMe) only requires the write operation the FUA bit is set on to be
> on non-volatile media before completing the write operation.  It does
> not impose any ordering, which seems to match the nbd spec.  Unlike the
> NBD spec Linux does not allow FUA to be set on anything by WRITE
> commands.  Some other storage protocols allow a FUA bit on READ
> commands or other commands that write data to the device, though.

Yes. There was some discussion on that part, and we decided that setting
the flag doesn't hurt, but the spec also clarifies that using it on READ
does nothing, semantically.

The problem is that there are clients in the wild which do set it on
READ, so it's just a matter of "be liberal in what you accept".

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig

On Thu, Sep 15, 2016 at 12:43:35PM +0100, Alex Bligh wrote:
> Sure, it's at:
> 
> https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes
> 
> and that link takes you to the specific section.
> 
> The treatment of FLUSH and FUA is meant to mirror exactly the
> linux block layer (or rather how the linux block layer was a few
> years ago). I even asked on LKML to verify a few points.

Linux never expected ordering on the wire.  Before 2010 we had barriers
in the kernel that provided ordering to the caller, but we never
required it from the protocol / hardware.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh


> On 15 Sep 2016, at 12:40, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
>> Yes, and that is why I was asking about this. If the write barriers
>> are expected to be shared across connections, we have a problem. If,
>> however, they are not, then it doesn't matter that the commands may be
>> processed out of order.
> 
> There is no such thing as a write barrier in the Linux kernel.  We'd
> much prefer protocols not to introduce any pointless synchronization
> if we can avoid it.

I suspect the issue is terminological.

Essentially NBD does supports FLUSH/FUA like this:

https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt

IE supports the same FLUSH/FUA primitives as other block drivers (AIUI).

Link to protocol (per last email) here:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Alex Bligh

Christoph,

> On 15 Sep 2016, at 12:38, Christoph Hellwig  wrote:
> 
> On Thu, Sep 15, 2016 at 12:49:35PM +0200, Wouter Verhelst wrote:
>> A while back, we spent quite some time defining the semantics of the
>> various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
>> write barriers. At the time, we decided that it would be unreasonable
>> to expect servers to make these write barriers effective across
>> different connections.
> 
> Do you have a nbd protocol specification?  treating a flush or fua
> as any sort of barrier is incredibly stupid.  Is it really documented
> that way, and if yes, why?

Sure, it's at:

https://github.com/yoe/nbd/blob/master/doc/proto.md#ordering-of-messages-and-writes

and that link takes you to the specific section.

The treatment of FLUSH and FUA is meant to mirror exactly the
linux block layer (or rather how the linux block layer was a few
years ago). I even asked on LKML to verify a few points.

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig

On Thu, Sep 15, 2016 at 01:29:36PM +0200, Wouter Verhelst wrote:
> Yes, and that is why I was asking about this. If the write barriers
> are expected to be shared across connections, we have a problem. If,
> however, they are not, then it doesn't matter that the commands may be
> processed out of order.

There is no such thing as a write barrier in the Linux kernel.  We'd
much prefer protocols not to introduce any pointless synchronization
if we can avoid it.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Christoph Hellwig

On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is permitted by the spec
> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
> process commands out of order anyway, but I suspect this has
> to date been little tested.

The Linux kernel does not assume any synchroniztion between block I/O
commands.  So any sort of synchronization a protocol does is complete
overkill for us.
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

On Thu, Sep 15, 2016 at 12:09:28PM +0100, Alex Bligh wrote:
> Wouter, Josef, (& Eric)
> 
> > On 15 Sep 2016, at 11:49, Wouter Verhelst  wrote:
> > 
> > Hi,
> > 
> > On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> >> I see some practical problems with this:
> > [...]
> > 
> > One more that I didn't think about earlier:
> > 
> > A while back, we spent quite some time defining the semantics of the
> > various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
> > write barriers. At the time, we decided that it would be unreasonable
> > to expect servers to make these write barriers effective across
> > different connections.
> 
> Actually I wonder if there is a wider problem in that implementations
> might mediate access to a device by presence of an extant TCP connection,
> i.e. only permit one TCP connection to access a given block device at
> once. If you think about (for instance) a forking daemon that does
> writeback caching, that would be an entirely reasonable thing to do
> for data consistency.

Sure. They will have to live with the fact that clients connected to
them will run slower; I don't think that's a problem. In addition,
Josef's client implementation requires the user to explicitly ask for
multiple connections.

There are multiple contexts in which NBD can be used, and in some
performance is more important than in others. I think that is fine.

[...]
> A more general point is that with multiple queues requests
> may be processed in a different order even by those servers that
> currently process the requests in strict order, or in something
> similar to strict order. The server is permitted by the spec
> (save as mandated by NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA) to
> process commands out of order anyway, but I suspect this has
> to date been little tested.

Yes, and that is why I was asking about this. If the write barriers
are expected to be shared across connections, we have a problem. If,
however, they are not, then it doesn't matter that the commands may be
processed out of order.

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

2016-09-15 Thread Wouter Verhelst

Hi,

On Fri, Sep 09, 2016 at 10:02:03PM +0200, Wouter Verhelst wrote:
> I see some practical problems with this:
[...]

One more that I didn't think about earlier:

A while back, we spent quite some time defining the semantics of the
various commands in the face of the NBD_CMD_FLUSH and NBD_CMD_FLAG_FUA
write barriers. At the time, we decided that it would be unreasonable
to expect servers to make these write barriers effective across
different connections.

Since my knowledge of kernel internals is limited, I tried finding some
documentation on this, but I guess that either it doesn't exist or I'm
looking in the wrong place; therefore, am I correct in assuming that
blk-mq knows about such semantics, and will handle them correctly (by
either sending a write barrier to all queues, or not making assumptions
about write barriers that were sent over a different queue)? If not,
this may be something that needs to be taken care of.

Thanks,

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12
--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

Re: [Nbd] [RESEND][PATCH 0/5] nbd improvements

22 matches

Site Navigation

Mail list logo

Footer information