Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension
On Wed, Mar 30, 2016 at 02:54:41PM -0600, Eric Blake wrote: > On 03/30/2016 01:51 PM, Wouter Verhelst wrote: > > > (side note: this seems to be mostly an NBD discussion at this point. > > Does qemu-devel need to be in the loop before we've finished that? I > > don't care either way, but then I don't want to bore or annoy people...) > > Well, it stemmed out of qemu's desires to implement more efficient ways > to both push and pull sparse files across NBD; and qemu will ultimately > want to implement what we discuss. But I'm okay doing most of the churn > off of the qemu list, and only adding qemu back in the loop when it is > actually time to implement the final design, unless someone else speaks > up asking to remain in on the conversation. Sure. I was just asking the question... [...] > Well, I'm worried about the opposite - if the client does not negotiate > structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has > a variable-length payload to reply), then the server has three choices: > 0) refuse the client (we already said this is a bit undesirable, as it > may lead to the client retrying in an infloop - having a way to return > an error message is better than dropping the connection); 1) the server > must find a way to shoehorn the same data that would be sent in a > structured reply to fit in the old-style unstructured reply (one nice > thing about the structured reply is that we get a length for free; that > length would have to be shoehorned into the old-style reply, different > from CMD_READ where a length was implied from the request), 2) the > server must fail the message > > Actually, having typed that, maybe choice 2 is not all that bad. It isn't. > It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS > with EINVAL if structured replies were not negotiated, and to document > that a client that negotiates GET_LBA_STATUS MUST also negotiate > structured replies for the command to be useful. For that matter, we > just documented that servers SHOULD use EINVAL for an unrecognized (or > out-of-context) command. The client can enable the two options in > either order. And we'd have the following table: > > enabled enabled > structuredGET_LBAresult of: > replies read GET_LBA write > > nonoold replyEINVALold reply > yes nonew replyEINVAL[*] > noyes old replyEINVALold reply > yes yes new replynew reply [*] Right. It would also be reasonable to say that if you don't negotiate an option but do end up using it, you end up squarely in undefined behaviour-land. The server could send EINVAL, it could honour your request in ways that you may not expect (including structured replies when you didn't ask for them), or it could cause nasal demons. > [*] Here, we're still debating whether it makes sense to allow/require a > server to send new replies everywhere (and clients must handle both > styles if they negotiate structured replies), or to require a server to > send old replies (so that read is the only command where clients have to > handle two styles, but where the results of negotiating pinpoint which > style). I'm still in favour of having it be "old reply" for the whole of that "write" column. I'm just not convinced there's a downside to that, while there is an upside. > > Since only the read reply has a payload at this point in time, you don't > > really need to special-case it, anyway. > > Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was > specific to STRUCTURED_READ; it sounds like we are leaning back towards > STRUCTURED_REPLY and just a caveat that any new command that sends > payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated. You could formulate it that way. Alternatively, you could formulate it so that any command that sends payload may fail if STRUCTURED_REPLY was not also negotiated, with caveat that there is this backwards compatibility thing for READ. (i.e., make READ be the exception rather than the rule) > I guess that also makes it easier to argue for a server that uses a > structured reply for ALL messages (REPLY_TYPE_NONE for success changes > 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes > into 24). Perhaps. [...] > > The client will already need to keep state to reassemble a fragmented > > and out-of-order read reply, anyway. If that's already the case, adding > > in the requirement to *also* keep track of error state (which could in > > the simplest case be a single bit for a client which doesn't care about > > offsets or error count) isn't that much more of an issue. > > For a client that is copying NBD_CMD_READ into a local file, dumping > directly via pwrite() as each chunk comes in doesn't require the client > track any sta
Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension
On 03/30/2016 01:51 PM, Wouter Verhelst wrote: > (side note: this seems to be mostly an NBD discussion at this point. > Does qemu-devel need to be in the loop before we've finished that? I > don't care either way, but then I don't want to bore or annoy people...) Well, it stemmed out of qemu's desires to implement more efficient ways to both push and pull sparse files across NBD; and qemu will ultimately want to implement what we discuss. But I'm okay doing most of the churn off of the qemu list, and only adding qemu back in the loop when it is actually time to implement the final design, unless someone else speaks up asking to remain in on the conversation. >> I'm a bit reluctant to put ordering requirements on option haggling >> (that option A must be turned on before option B), > > Yes, me too. > >> but then again, the >> SELECT extension requires NBD_OPT_SELECT to be haggled prior to >> NBD_OPT_GO, so there's precedent. > > Yeah, but then, that's only because GO is meant to transition from > negotiation to transmission, so it needs to be after *any* other > negotiation; anything else would defeat its purpose. > > Requiring structured read after negotiating other structured commands > could easily be done by saying that it's an error to leave the > negotiation phase with "some other" structured command negotiated, but > not structured read. > > On the other hand, I feel compelled to point out that we only find > ourselves in this hole because we've coupled the structured reply with > the read command. That may have looked like a good idea at the time, but > obviously it isn't. If we just have it be "negotiate the structured > reply format" rather than "the structured read reply", and state that > once negotiated, the structured reply format is required for any command > with a payload, we're done. Well, I'm worried about the opposite - if the client does not negotiate structured replies, but does negotiate NBD_CMD_GET_LBA_STATUS (which has a variable-length payload to reply), then the server has three choices: 0) refuse the client (we already said this is a bit undesirable, as it may lead to the client retrying in an infloop - having a way to return an error message is better than dropping the connection); 1) the server must find a way to shoehorn the same data that would be sent in a structured reply to fit in the old-style unstructured reply (one nice thing about the structured reply is that we get a length for free; that length would have to be shoehorned into the old-style reply, different from CMD_READ where a length was implied from the request), 2) the server must fail the message Actually, having typed that, maybe choice 2 is not all that bad. It's fairly easy for a server to ALWAYS fail NBD_CMD_GET_LBA_STATUS with EINVAL if structured replies were not negotiated, and to document that a client that negotiates GET_LBA_STATUS MUST also negotiate structured replies for the command to be useful. For that matter, we just documented that servers SHOULD use EINVAL for an unrecognized (or out-of-context) command. The client can enable the two options in either order. And we'd have the following table: enabled enabled structuredGET_LBAresult of: replies read GET_LBA write nonoold replyEINVALold reply yes nonew replyEINVAL[*] noyes old replyEINVALold reply yes yes new replynew reply [*] [*] Here, we're still debating whether it makes sense to allow/require a server to send new replies everywhere (and clients must handle both styles if they negotiate structured replies), or to require a server to send old replies (so that read is the only command where clients have to handle two styles, but where the results of negotiating pinpoint which style). > > Since only the read reply has a payload at this point in time, you don't > really need to special-case it, anyway. Okay, so in v1 I tried to negotiate STRUCTURED_REPLY; in v2 I was specific to STRUCTURED_READ; it sounds like we are leaning back towards STRUCTURED_REPLY and just a caveat that any new command that sends payload SHOULD/MUST fail if STRUCTURED_REPLY was not also negotiated. I guess that also makes it easier to argue for a server that uses a structured reply for ALL messages (REPLY_TYPE_NONE for success changes 16 bytes into 20, and REPLY_TYPE_ERROR for errors changes 16 bytes into 24). > >> I also am thinking of proposing an extension for option haggling to >> communicate minimum/preferred alignments and maximum don't-fragment >> sizing, and that option would have to be enabled before >> OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the >> NBD_REP_SERVER layout in response to those option requests), which >> would be another case where option A affects the behavi
Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension
So, On Tue, Mar 29, 2016 at 05:01:00PM -0600, Eric Blake wrote: [...] > +- `NBD_OPT_STRUCTURED_READ` (8) > + > +Defined by the experimental `Structured Read` extension; see below. (detail: that makes the "structured read" extension be typographically different from everything else. Either make it all caps, or not monocase.) [...] > A write request. Length and offset define the location and amount of > @@ -536,13 +556,16 @@ The following error values are defined: > * `ENOMEM` (12), Cannot allocate memory. > * `EINVAL` (22), Invalid argument. > * `ENOSPC` (28), No space left on device. > +* `EOVERFLOW` (75), Value too large. > > The server SHOULD return `ENOSPC` if it receives a write request > including one or more sectors beyond the size of the device. It SHOULD > return `EINVAL` if it receives a read or trim request including one or > more sectors beyond the size of the device. It also SHOULD map the > -`EDQUOT` and `EFBIG` errors to `ENOSPC`. Finally, it SHOULD return > -`EPERM` if it receives a write or trim request on a read-only export. > +`EDQUOT` and `EFBIG` errors to `ENOSPC`. It SHOULD return `EOVERFLOW` > +on a request to send structured read data without fragmentation but > +where the length is too large. Finally, it SHOULD return `EPERM` if > +it receives a write or trim request on a read-only export. I'd like some more explicit language here that makes it clear EOVERFLOW can *only* be used on structured replies. We reduced the set of valid error numbers a while back for good reason; it would be bad if we accidentally increase it for existing replies now. > The server SHOULD return `EINVAL` if it receives an unknown command. > > @@ -579,7 +602,7 @@ To remedy this, a `SELECT` extension is envisioned. This > extension adds > two option requests and one error reply type, and extends one existing > option reply type. > > -* `NBD_OPT_SELECT` > +* `NBD_OPT_SELECT` (6) NAK. The spec is currently consistent in associating numbers to constants in only *one* place. This is no accident, and I'd like to keep it that way. (at least I think it is; if not, that's a bug) > The client wishes to select the export with the given name for use > in the transmission phase, but does not yet want to move to the > @@ -613,7 +636,7 @@ option reply type. >handle `NBD_REP_ERR_UNSUP`. In this case, they should fall back to >using `NBD_OPT_EXPORT_NAME`. > > -* `NBD_OPT_GO` > +* `NBD_OPT_GO` (7) same. > The client wishes to terminate the negotiation phase and progress to > the transmission phase. Possible replies from the server include: > @@ -635,6 +658,235 @@ option reply type. >message if they do not also send it as a reply to the >`NBD_OPT_SELECT` message. > > +### `Structured Read` extension > + > +Some of the major downsides of the default reply to `NBD_CMD_READ` > +(without structured replies) are as follows. First, it is not > +possible to support partial reads (the command must succeed or fail as > +a whole, either len bytes of data must be sent or the connection must > +be closed). There is no way to efficiently skip over portions of a > +sparse file that are known to contain all zeroes. Finally, it is not > +possible to reliably decode the server traffic without also having > +context of what pending read requests were sent by the client. > + > +To remedy this, a `Structured Read` extension is envisioned. This > +extension adds a new option request, a new reply type during the > +transmission phase, and a new command flag, and alters the set of > +valid replies to an existing command. > + > +* `NBD_OPT_STRUCTURED_READ` (8) same > +The client wishes to use structured reads during the transmission > +phase. The option request has no additional data. > + > +The server replies with one of the following: > + > +- `NBD_REP_ACK`: Structured reads have been negotiated; the server > + MUST use structured replies to `NBD_CMD_READ` > +- `NBD_REP_UNSUP`: Structured reads are not available; the transmission ^ ERR_ (however, see below ;-) > + phase MUST remain the same as if the client had not attempted > + `NBD_OPT_STRUCTURED_READ` This makes it seem as though NBD_REP_ERR_UNSUP has a different meaning here than it usually does. I think the spec should just say that the server should reply with NBD_REP_ACK, and then mention that for backwards compatibility the client should be prepared to handle NBD_REP_UNSUP too (as is done elsewhere). That is, if the server implements structured replies, it should allow it (it makes no sense for the server to disallow structured reads if it knows about them) [...] > +A server MUST NOT send a data payload in a normal reply if > +Structured Reads are negotiated. It is envisioned that all future > +extension commands that require a data payload in the response > +will require independent option negotiation, and therefore, the
Re: [Qemu-devel] [Nbd] [PATCH v2 3/3] doc: Propose Structured Read extension
Hi all, (side note: this seems to be mostly an NBD discussion at this point. Does qemu-devel need to be in the loop before we've finished that? I don't care either way, but then I don't want to bore or annoy people...) On Wed, Mar 30, 2016 at 11:45:04AM -0600, Eric Blake wrote: > On 03/30/2016 12:50 AM, Alex Bligh wrote: [...] > So taking your sentence at face value, yes there is another solution > possible: require that any NBD_OPT_* haggling used to negotiate the use > of any other command with a structured reply MUST fail if the client has > not first negotiated NBD_OPT_STRUCTURED_READ, but leaves the connection > open so the client can continue option haggling. That way, the only way > to end option haggling with the new command enabled is to also enable > structured reads. The burden is then on the client to haggle in the > correct order, and on the server to track haggling state when deciding > how to answer the option requests for the new commands. > > I'm a bit reluctant to put ordering requirements on option haggling > (that option A must be turned on before option B), Yes, me too. > but then again, the > SELECT extension requires NBD_OPT_SELECT to be haggled prior to > NBD_OPT_GO, so there's precedent. Yeah, but then, that's only because GO is meant to transition from negotiation to transmission, so it needs to be after *any* other negotiation; anything else would defeat its purpose. Requiring structured read after negotiating other structured commands could easily be done by saying that it's an error to leave the negotiation phase with "some other" structured command negotiated, but not structured read. On the other hand, I feel compelled to point out that we only find ourselves in this hole because we've coupled the structured reply with the read command. That may have looked like a good idea at the time, but obviously it isn't. If we just have it be "negotiate the structured reply format" rather than "the structured read reply", and state that once negotiated, the structured reply format is required for any command with a payload, we're done. Since only the read reply has a payload at this point in time, you don't really need to special-case it, anyway. > I also am thinking of proposing an extension for option haggling to > communicate minimum/preferred alignments and maximum don't-fragment > sizing, and that option would have to be enabled before > OPT_LIST/OPT_SELECT/OPT_EXPORT_NAME (because it would change the > NBD_REP_SERVER layout in response to those option requests), which > would be another case where option A affects the behavior of option B. I reused the NBD_REP_SERVER command in reply to the NBD_OPT_SELECT command since its purpose seems fairly similar ("send metadata about an export to the client"), but that may have been a mistake. It certainly wasn't meant that if you say NBD_OPT_SELECT first and then go NBD_OPT_LIST, that the NBD_REP_SERVER reply to NBD_OPT_LIST should be the one as specified in the SELECT extension. [...] > > These multiple error chunks are neat. However, I suspect lazy implementors > > may just send an error without an offset. > > Any ideas are appreciated on how to word it to suggest that servers > SHOULD try to give full details; but I think we want the fallback to the > no-offset case because some situations will not have an offset. I'm wary of making the spec too complicated. Adding such language might make it unclear. Since as you say it can't be a hard-and-fast rule, I'd prefer that we just trust implementor's judgement on this. Not sending an offset (even if it would be possible) can certainly be the better choice in some cases -- a protocol description can never know all the trade-offs and choices a particular implementor may want or need to make. > >> +The server MAY send additional chunks or offset error replies, if > >> +`NBD_REPLY_FLAG_DONE` was not set, but MUST ensure the final reply > >> +also reports an error (that is, the final reply MUST NOT use > >> +`NBD_REPLY_TYPE_NONE`), and MAY reuse an offset reported earlier > >> +in constructing the final reply. > > > > I'm not sure I get that bit. Why don't you make an errorred reply simply > > one that contains one or more error chunks. An errorred reply need not > > contain > > all the data requested (though each chunk must be complete). A reply that > > isn't errorred needs not contain all the data requested. Why do you > > need anything stronger than that? So if you have a parallelised server which > > is simply sending several reads in parallel (think Ceph) it sends the > > result from each thread, possibly followed by an error packet, and some > > other thread notices when all of these have completed and sends a > > NBD_REPLY_TYPE_NONE packet (always, error or not) to close of use of the > > handle. This seems perfectly natural and no harder for the client to deal > > with, but you are prohibiting it. > > I was thinking that it's easier on the client if