On 10/03/2016 07:34 AM, Alex Bligh wrote:
>
>> On 3 Oct 2016, at 08:57, Christoph Hellwig <[email protected]> wrote:
>>
>>> Can you clarify what you mean by that? Why is it an "odd flush
>>> definition", and how would you "properly" define it?
>>
>> E.g. take the definition from NVMe which also supports multiple queues:
>>
>> "The Flush command shall commit data and metadata associated with the
>> specified namespace(s) to non-volatile media. The flush applies to all
>> commands completed prior to the submission of the Flush command.
>> The controller may also flush additional data and/or metadata from any
>> namespace."
>>
>> The focus is completed - we need to get a reply to the host first
>> before we can send the flush command, so anything that we require
>> to be flushed needs to explicitly be completed first.
>
> I think there are two separate issues here:
>
> a) What's described as the "HOL blocking issue".
>
> This comes down to what Wouter said here:
>
>> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
>> when a FLUSH is sent over one channel, and the reply comes in, that all
>> commands which have been received, regardless of which channel they were
>> received over, have reached disk.
>>
>> [1] Message-ID: <[email protected]>
>>
>> It is impossible for nbd to make such a guarantee, due to head-of-line
>> blocking on TCP.
>
> this is perfectly accurate as far as it goes, but this isn't the current
> NBD definition of 'flush'.
>
> That is (from the docs):
>
>> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the
>> server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH
>> MUST be written to non-volatile storage prior to replying to
>> that NBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set
>> within the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent
>> by the client to the server.
>
> I don't think HOL blocking is an issue here by that definition, because all
> FLUSH requires is that commands that are actually completed are flushed to
> disk. If there is head of line blocking which delays the arrival of a write
> issued before a flush, then the sender cannot be relying on whether that
> write is actually completed or not (or it would have waited for the result).
> The flush requires only that those commands COMPLETED are flushed to disk,
> not that those commands RECEIVED have been flushed to disk (and a fortiori
> not that those commands SENT FIRST have been flushed to disk). From the point
> of view of the client, the flush can therefore only guarantee that the data
> associated with those commands for which it's actually received a reply prior
> to issuing the flush will be flushed, because the replies can be disordered
> too.
>
> I don't think there is actually a problem here - Wouter if I'm wrong about
> this, I'd like to understand your argument better.
>
>
>
> b) What I'm describing - which is the lack of synchronisation between
> channels.
>
> Suppose you have a simple forking NBD server which uses (e.g.) a Ceph
> backend. Each process (i.e. each NBD channel) will have a separate connection
> to something with its own cache and buffering. Issuing a flush in Ceph
> requires waiting until a quorum of backends (OSDs) has been written to, and
> with a number of backends substantially greater than the quorum, it is not
> unlikely that a flush on one channel will not wait for writes on what Ceph
> considers a completely independent channel to have fully written (assuming
> the write completes before the flush is done).
>
> The same would happen pretty trivially with a forking server that uses a
> process-space write-back cache.
>
> This is because when the spec says: "All write commands (that
> includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e.
> replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to
> non-volatile storage prior to replying to that NBD_CMD_FLUSH." what it
> currently means is actually "All write commands (that includes NBD_CMD_WRITE,
> and NBD_CMD_TRIM) ***ASSOCIATED WITH THAT CLIENT*** that the server completes
> (i.e. replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to
> non-volatile storage prior to replying to that NBD_CMD_FLUSH".
>
> So what we would need the spec to mean is "All write commands (that includes
> NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL OF THAT
> CLIENT*** that the server completes (i.e. replies to) prior to processing to
> a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to
> that NBD_CMD_FLUSH". And as we have no way to associate different channels of
> the same client, for servers that can't rely on the OS to synchronise
> flushing across different clients relating to the same file, in practice that
> means "All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM)
> ***ASSOCIATED WITH ANY CLIENT AT ALL*** that the server completes (i.e.
> replies to) prior to processing to a NBD_CMD_FLUSH MUST be written to
> non-volatile storage prior to replying to that NBD_CMD_FLUSH" - i.e. a flush
> on any channel of any client must flush every channel of every client,
> because we have no easy way to tell which clients are in fact two channels. I
> have concerns over the scalability of this.
>
> Now, in the reference server, NBD_CMD_FLUSH is implemented through an
> fdatasync(). Each client (and therefore each channel) runs in a different
> process.
>
> Earlier in this thread, someone suggested that if this happens:
>
>      Process A                Process B
>      =========                =========
>
>      fd1=open("file123")
>                               fd2=open("file123")
>
>      write(fd1, ...)
>                               fdatasync(fd2)
>
> then the fdatasync() is guaranteed to sync the write that Process A has
> written. This may or may not be the case under Linux (wiser minds than me
> will know). Is it guaranteed to be the case with (e.g.) the file on NFS? On
> all POSIX platforms?
>
> Looking at
> http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html
> I'd say it was a little ambiguous as to whether it will ALWAYS flush all
> data associated with the file even if it is being written by a different
> process (and a different FD).
>
> If fdatasync() is always guaranteed to flush data associated with writes by a
> different process (and separately opened fd), then as it happens there won't
> be a problem on the reference server, just on servers that don't happen to
> use fdatasync() or similar to implement flushes, and which don't maintain
> their own caches. If fdatasync() is not so guaranteed, we have a problem with
> the reference server too, at least on some platforms and filing systems.
>
> What I'm therefore asking for is either:
> a) that the server can say 'if you are multichannel, you will need to send
>    flush on each channel' (best); OR
> b) that the server can say 'don't go multichannel'
>
> as part of the negotiation stage. Of course as this is dependent on the
> backend, this is going to be something that is per-target (i.e. needs to come
> as a transmission flag or similar).
>

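For reference, the two-process sequence quoted above can be replayed with a
short Python sketch. It assumes Linux (where os.fdatasync is available); the
function name cross_process_sync and the temporary file are illustrative only
and come from neither the thread nor the reference server. It shows the order
of calls and nothing more: the read-back is served from the page cache, so a
successful run says nothing about what POSIX or NFS actually guarantee.

```python
import os
import tempfile


def cross_process_sync(payload: bytes) -> bytes:
    """Process A writes via its own fd; process B calls fdatasync() on another fd."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        pid = os.fork()
        if pid == 0:
            # Process A: separate open() of the file, then a plain write().
            status = 1
            try:
                fd1 = os.open(path, os.O_WRONLY)
                os.write(fd1, payload)
                status = 0
            finally:
                # Exit without any fsync()/fdatasync() of our own.
                os._exit(status)

        # Process B: a separately opened descriptor for the same file.
        fd2 = os.open(path, os.O_RDWR)
        try:
            _, wait_status = os.waitpid(pid, 0)  # A's write() has returned by now
            if wait_status != 0:
                raise RuntimeError("writer process failed")
            # The call under discussion: sync through B's descriptor only.
            os.fdatasync(fd2)
            return os.pread(fd2, len(payload), 0)
        finally:
            os.close(fd2)
    finally:
        os.unlink(path)
```

Whether the fdatasync() on fd2 forces A's data to stable storage is exactly
the open question; the sketch only pins down which calls are being asked about.
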
Ok I understand your objections now.  You aren't arguing that we are unsafe by
default, only that we are unsafe with servers that do something special beyond
simply writing to a single disk or file.  I agree this is problematic, but you
simply don't use this feature if your server can't deal with it well.  Thanks,

Josef
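
To make the "something special" case concrete, here is a toy Python model of
a forking server whose channels each keep a private write-back cache. This is
not the reference server: Channel, flush_all and the dict standing in for
non-volatile storage are all hypothetical names chosen for illustration.

```python
class Channel:
    """One NBD connection served by its own process, with a private cache."""

    def __init__(self, durable: dict):
        self.durable = durable  # shared stand-in for non-volatile storage
        self.cache = {}         # completed but unflushed writes of this channel

    def write(self, offset: int, data: bytes) -> None:
        # The write is "completed" (replied to) here, but is not yet durable.
        self.cache[offset] = data

    def flush(self) -> None:
        # Only this channel's cache is written out; other channels are unseen.
        self.durable.update(self.cache)
        self.cache.clear()


def flush_all(channels) -> None:
    """Option (a) from the quoted proposal: send a flush on each channel."""
    for channel in channels:
        channel.flush()


durable = {}
chan_a, chan_b = Channel(durable), Channel(durable)

chan_a.write(0, b"A")    # completed on channel A
chan_b.write(512, b"B")  # completed on channel B
chan_b.flush()           # flush sent on channel B only
after_one_flush = dict(durable)

flush_all([chan_a, chan_b])
after_all_flushes = dict(durable)
```

In this model, after_one_flush holds only channel B's block even though
channel A's write had already completed, while flushing every channel makes
both writes durable.
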
