Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension

2016-03-30 Thread Wouter Verhelst
On Wed, Mar 30, 2016 at 02:54:41PM -0600, Eric Blake wrote:
> On 03/30/2016 01:51 PM, Wouter Verhelst wrote:
> 
> > (side note: this seems to be mostly an NBD discussion at this point.
> > Does qemu-devel need to be in the loop before we've finished that? I
> > don't care either way, but then I don't want to bore or annoy people...)
> 
> Well, it stemmed out of qemu's desires to implement more efficient ways
> to both push and pull sparse files across NBD; and qemu will ultimately
> want to implement what we discuss.  But I'm okay doing most of the churn
> off of the qemu list, and only adding qemu back in the loop when it is
> actually time to implement the final design, unless someone else speaks
> up asking to remain in on the conversation.

Sure. I was just asking the question...

[...]
> Well, I'm worried about the opposite - if the client does not negotiate
> structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has
> a variable-length payload to reply), then the server has three choices:
> 0) refuse the client (we already said this is a bit undesirable, as it
> may lead to the client retrying in an infloop - having a way to return
> an error message is better than dropping the connection); 1) the server
> must find a way to shoehorn the same data that would be sent in a
> structured reply to fit in the old-style unstructured reply (one nice
> thing about the structured reply is that we get a length for free; that
> length would have to be shoehorned into the old-style reply, different
> from CMD_READ where a length was implied from the request), 2) the
> server must fail the message
> 
> Actually, having typed that, maybe choice 2 is not all that bad.

It isn't.

> It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS
> with EINVAL if structured replies were not negotiated, and to document
> that a client that negotiates GET_LBA_STATUS MUST also negotiate
> structured replies for the command to be useful.  For that matter, we
> just documented that servers SHOULD use EINVAL for an unrecognized (or
> out-of-context) command.  The client can enable the two options in
> either order.  And we'd have the following table:
> 
> enabled   enabled
> structuredGET_LBAresult of:
> replies read GET_LBA   write
> 
> nonoold replyEINVALold reply
> yes   nonew replyEINVAL[*]
> noyes   old replyEINVALold reply
> yes   yes   new replynew reply [*]

Right.

It would also be reasonable to say that if you don't negotiate an option
but do end up using it, you end up squarely in undefined behaviour-land.
The server could send EINVAL, it could honour your request in ways that
you may not expect (including structured replies when you didn't ask for
them), or it could cause nasal demons.

> [*] Here, we're still debating whether it makes sense to allow/require a
> server to send new replies everywhere (and clients must handle both
> styles if they negotiate structured replies), or to require a server to
> send old replies (so that read is the only command where clients have to
> handle two styles, but where the results of negotiating pinpoint which
> style).

I'm still in favour of having it be "old reply" for the whole of that
"write" column. I'm just not convinced there's a downside to that, while
there is an upside.

> > Since only the read reply has a payload at this point in time, you don't
> > really need to special-case it, anyway.
> 
> Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was
> specific to STRUCTURED_READ; it sounds like we are leaning back towards
> STRUCTURED_REPLY and just a caveat that any new command that sends
> payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated.

You could formulate it that way. Alternatively, you could formulate it
so that any command that sends payload may fail if STRUCTURED_REPLY was
not also negotiated, with caveat that there is this backwards
compatibility thing for READ.

(i.e., make READ be the exception rather than the rule)

> I guess that also makes it easier to argue for a server that uses a
> structured reply for ALL messages (REPLY_TYPE_NONE for success changes
> 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes
> into 24).

Perhaps.

[...]
> > The client will already need to keep state to reassemble a fragmented
> > and out-of-order read reply, anyway. If that's already the case, adding
> > in the requirement to *also* keep track of error state (which could in
> > the simplest case be a single bit for a client which doesn't care about
> > offsets or error count) isn't that much more of an issue.
> 
> For a client that is copying NBD_CMD_READ into a local file, dumping
> directly via pwrite() as each chunk comes in doesn't require the client
> track any sta

Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension

2016-03-30 Thread Eric Blake
On 03/30/2016 01:51 PM, Wouter Verhelst wrote:

> (side note: this seems to be mostly an NBD discussion at this point.
> Does qemu-devel need to be in the loop before we've finished that? I
> don't care either way, but then I don't want to bore or annoy people...)

Well, it stemmed out of qemu's desires to implement more efficient ways
to both push and pull sparse files across NBD; and qemu will ultimately
want to implement what we discuss.  But I'm okay doing most of the churn
off of the qemu list, and only adding qemu back in the loop when it is
actually time to implement the final design, unless someone else speaks
up asking to remain in on the conversation.

>> I'm a bit reluctant to put ordering requirements on option haggling
>> (that option A must be turned on before option B),
> 
> Yes, me too.
> 
>> but then again, the
>> SELECT extension requires NBD_OPT_SELECT to be haggled prior to
>> NBD_OPT_GO, so there's precedent.
> 
> Yeah, but then, that's only because GO is meant to transition from
> negotiation to transmission, so it needs to be after *any* other
> negotiation; anything else would defeat its purpose.
> 
> Requiring structured read after negotiating other structured commands
> could easily be done by saying that it's an error to leave the
> negotiation phase with "some other" structured command negotiated, but
> not structured read.
> 
> On the other hand, I feel compelled to point out that we only find
> ourselves in this hole because we've coupled the structured reply with
> the read command. That may have looked like a good idea at the time, but
> obviously it isn't. If we just have it be "negotiate the structured
> reply format" rather than "the structured read reply", and state that
> once negotiated, the structured reply format is required for any command
> with a payload, we're done.

Well, I'm worried about the opposite - if the client does not negotiate
structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has
a variable-length payload to reply), then the server has three choices:
0) refuse the client (we already said this is a bit undesirable, as it
may lead to the client retrying in an infloop - having a way to return
an error message is better than dropping the connection); 1) the server
must find a way to shoehorn the same data that would be sent in a
structured reply to fit in the old-style unstructured reply (one nice
thing about the structured reply is that we get a length for free; that
length would have to be shoehorned into the old-style reply, different
from CMD_READ where a length was implied from the request), 2) the
server must fail the message

Actually, having typed that, maybe choice 2 is not all that bad.  It's
fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS with
EINVAL if structured replies were not negotiated, and to document that a
client that negotiates GET_LBA_STATUS MUST also negotiate structured
replies for the command to be useful.  For that matter, we just
documented that servers SHOULD use EINVAL for an unrecognized (or
out-of-context) command.  The client can enable the two options in
either order.  And we'd have the following table:

enabled   enabled
structuredGET_LBAresult of:
replies read GET_LBA   write

nonoold replyEINVALold reply
yes   nonew replyEINVAL[*]
noyes   old replyEINVALold reply
yes   yes   new replynew reply [*]

[*] Here, we're still debating whether it makes sense to allow/require a
server to send new replies everywhere (and clients must handle both
styles if they negotiate structured replies), or to require a server to
send old replies (so that read is the only command where clients have to
handle two styles, but where the results of negotiating pinpoint which
style).

> 
> Since only the read reply has a payload at this point in time, you don't
> really need to special-case it, anyway.

Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was
specific to STRUCTURED_READ; it sounds like we are leaning back towards
STRUCTURED_REPLY and just a caveat that any new command that sends
payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated.  I
guess that also makes it easier to argue for a server that uses a
structured reply for ALL messages (REPLY_TYPE_NONE for success changes
16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes into 24).

> 
>> I also am thinking of proposing an extension for option haggling to
>> communicate minimum/preferred alignments and maximum don't-fragment
>> sizing, and that option would have to be enabled before
>> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the
>> NBD_REP_SERVER layout in response to those option requests), which
>> would be another case where option A affects the behavi

Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension

2016-03-30 Thread Wouter Verhelst
So,

On Tue, Mar 29, 2016 at 05:01:00PM -0600, Eric Blake wrote:
[...]
> +- `NBD_OPT_STRUCTURED_READ` (8)
> +
> +Defined by the experimental `Structured Read` extension; see below.

(detail: that makes the "structured read" extension be typographically
different from everything else. Either make it all caps, or not
monocase.)

[...]
>  A write request. Length and offset define the location and amount of
> @@ -536,13 +556,16 @@ The following error values are defined:
>  * `ENOMEM` (12), Cannot allocate memory.
>  * `EINVAL` (22), Invalid argument.
>  * `ENOSPC` (28), No space left on device.
> +* `EOVERFLOW` (75), Value too large.
>
>  The server SHOULD return `ENOSPC` if it receives a write request
>  including one or more sectors beyond the size of the device.  It SHOULD
>  return `EINVAL` if it receives a read or trim request including one or
>  more sectors beyond the size of the device.  It also SHOULD map the
> -`EDQUOT` and `EFBIG` errors to `ENOSPC`.  Finally, it SHOULD return
> -`EPERM` if it receives a write or trim request on a read-only export.
> +`EDQUOT` and `EFBIG` errors to `ENOSPC`.  It SHOULD return `EOVERFLOW`
> +on a request to send structured read data without fragmentation but
> +where the length is too large.  Finally, it SHOULD return `EPERM` if
> +it receives a write or trim request on a read-only export.

I'd like some more explicit language here that makes it clear EOVERFLOW
can *only* be used on structured replies. We reduced the set of valid
error numbers a while back for good reason; it would be bad if we
accidentally increase it for existing replies now.

>  The server SHOULD return `EINVAL` if it receives an unknown command.
> 
> @@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This 
> extension adds
>  two option requests and one error reply type, and extends one existing
>  option reply type.
> 
> -* `NBD_OPT_SELECT`
> +* `NBD_OPT_SELECT` (6)

NAK. The spec is currently consistent in associating numbers to
constants in only *one* place. This is no accident, and I'd like to keep
it that way.

(at least I think it is; if not, that's a bug)

>  The client wishes to select the export with the given name for use
>  in the transmission phase, but does not yet want to move to the
> @@ -613,7 +636,7 @@ option reply type.
>handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to
>using `NBD_OPT_EXPORT_NAME`.
> 
> -* `NBD_OPT_GO`
> +* `NBD_OPT_GO` (7)

same.

>  The client wishes to terminate the negotiation phase and progress to
>  the transmission phase. Possible replies from the server include:
> @@ -635,6 +658,235 @@ option reply type.
>message if they do not also send it as a reply to the
>`NBD_OPT_SELECT` message.
> 
> +### `Structured Read` extension
> +
> +Some of the major downsides of the default reply to `NBD_CMD_READ`
> +(without structured replies) are as follows.  First, it is not
> +possible to support partial reads (the command must succeed or fail as
> +a whole, either len bytes of data must be sent or the connection must
> +be closed).  There is no way to efficiently skip over portions of a
> +sparse file that are known to contain all zeroes.  Finally, it is not
> +possible to reliably decode the server traffic without also having
> +context of what pending read requests were sent by the client.
> +
> +To remedy this, a `Structured Read` extension is envisioned. This
> +extension adds a new option request, a new reply type during the
> +transmission phase, and a new command flag, and alters the set of
> +valid replies to an existing command.
> +
> +* `NBD_OPT_STRUCTURED_READ` (8)

same

> +The client wishes to use structured reads during the transmission
> +phase.  The option request has no additional data.
> +
> +The server replies with one of the following:
> +
> +- `NBD_REP_ACK`: Structured reads have been negotiated; the server
> +  MUST use structured replies to `NBD_CMD_READ`
> +- `NBD_REP_UNSUP`: Structured reads are not available; the transmission
  ^ ERR_

(however, see below ;-)

> +  phase MUST remain the same as if the client had not attempted
> +  `NBD_OPT_STRUCTURED_READ`

This makes it seem as though NBD_REP_ERR_UNSUP has a different meaning
here than it usually does. I think the spec should just say that the
server should reply with NBD_REP_ACK, and then mention that for
backwards compatibility the client should be prepared to handle
NBD_REP_UNSUP too (as is done elsewhere).

That is, if the server implements structured replies, it should allow it
(it makes no sense for the server to disallow structured reads if it
knows about them)

[...]
> +A server MUST NOT send a data payload in a normal reply if
> +Structured Reads are negotiated.  It is envisioned that all future
> +extension commands that require a data payload in the response
> +will require independent option negotiation, and therefore, the

Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension

2016-03-30 Thread Wouter Verhelst
Hi all,

(side note: this seems to be mostly an NBD discussion at this point.
Does qemu-devel need to be in the loop before we've finished that? I
don't care either way, but then I don't want to bore or annoy people...)

On Wed, Mar 30, 2016 at 11:45:04AM -0600, Eric Blake wrote:
> On 03/30/2016 12:50 AM, Alex Bligh wrote:
[...]
> So taking your sentence at face value, yes there is another solution
> possible: require that any NBD_OPT_* haggling used to negotiate the use
> of any other command with a structured reply MUST fail if the client has
> not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection
> open so the client can continue option haggling.  That way, the only way
> to end option haggling with the new command enabled is to also enable
> structured reads.  The burden is then on the client to haggle in the
> correct order, and on the server to track haggling state when deciding
> how to answer the option requests for the new commands.
> 
> I'm a bit reluctant to put ordering requirements on option haggling
> (that option A must be turned on before option B),

Yes, me too.

> but then again, the
> SELECT extension requires NBD_OPT_SELECT to be haggled prior to
> NBD_OPT_GO, so there's precedent.

Yeah, but then, that's only because GO is meant to transition from
negotiation to transmission, so it needs to be after *any* other
negotiation; anything else would defeat its purpose.

Requiring structured read after negotiating other structured commands
could easily be done by saying that it's an error to leave the
negotiation phase with "some other" structured command negotiated, but
not structured read.

On the other hand, I feel compelled to point out that we only find
ourselves in this hole because we've coupled the structured reply with
the read command. That may have looked like a good idea at the time, but
obviously it isn't. If we just have it be "negotiate the structured
reply format" rather than "the structured read reply", and state that
once negotiated, the structured reply format is required for any command
with a payload, we're done.

Since only the read reply has a payload at this point in time, you don't
really need to special-case it, anyway.

> I also am thinking of proposing an extension for option haggling to
> communicate minimum/preferred alignments and maximum don't-fragment
> sizing, and that option would have to be enabled before
> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the
> NBD_REP_SERVER layout in response to those option requests), which
> would be another case where option A affects the behavior of option B.

I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT
command since its purpose seems fairly similar ("send metadata about an
export to the client"), but that may have been a mistake. It certainly
wasn't meant that if you say NBD_OPT_SELECT first and then go
NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be
the one as specified in the SELECT extension.

[...]
> > These multiple error chunks are neat. However, I suspect lazy implementors
> > may just send an error without an offset.
> 
> Any ideas are appreciated on how to word it to suggest that servers
> SHOULD try to give full details; but I think we want the fallback to the
> no-offset case because some situations will not have an offset.

I'm wary of making the spec too complicated. Adding such language might
make it unclear. Since as you say it can't be a hard-and-fast rule, I'd
prefer that we just trust implementor's judgement on this.

Not sending an offset (even if it would be possible) can certainly be
the better choice in some cases -- a protocol description can never know
all the trade-offs and choices a particular implementor may want or need
to make.

> >> +The server MAY send additional chunks or offset error replies, if
> >> +`NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply
> >> +also reports an error (that is, the final reply MUST NOT use
> >> +`NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier
> >> +in constructing the final reply.
> > 
> > I'm not sure I get that bit. Why don't you make an errorred reply simply
> > one that contains one or more error chunks. An errorred reply need not 
> > contain
> > all the data requested (though each chunk must be complete). A reply that
> > isn't errorred needs not contain all the data requested. Why do you
> > need anything stronger than that? So if you have a parallelised server which
> > is simply sending several reads in parallel (think Ceph) it sends the
> > result from each thread, possibly followed by an error packet, and some
> > other thread notices when all of these have completed and sends a
> > NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the
> > handle. This seems perfectly natural and no harder for the client to deal
> > with, but you are prohibiting it.
> 
> I was thinking that it's easier on the client if