> On 3 Oct 2016, at 08:57, Christoph Hellwig <h...@infradead.org> wrote:
> 
>> Can you clarify what you mean by that? Why is it an "odd flush
>> definition", and how would you "properly" define it?
> 
> E.g. take the defintion from NVMe which also supports multiple queues:
> 
> "The Flush command shall commit data and metadata associated with the
> specified namespace(s) to non-volatile media. The flush applies to all
> commands completed prior to the submission of the Flush command.
> The controller may also flush additional data and/or metadata from any
> namespace."
> 
> The focus is completed - we need to get a reply to the host first
> before we can send the flush command, so anything that we require
> to be flushed needs to explicitly be completed first.

I think there are two separate issues here:

a) What's described as the "HOL blocking issue". 

This comes down to what Wouter said here:

> Well, when I asked earlier, Christoph said[1] that blk-mq assumes that
> when a FLUSH is sent over one channel, and the reply comes in, that all
> commands which have been received, regardless of which channel they were
> received over, have reched disk.
> 
> [1] Message-ID: <20160915122304.ga15...@infradead.org>
> 
> It is impossible for nbd to make such a guarantee, due to head-of-line
> blocking on TCP.

this is perfectly accurate as far as it goes, but this isn't the current
NBD definition of 'flush'.

That is (from the docs):

> All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) that the 
> server completes (i.e. replies to) prior to processing to a NBD_CMD_FLUSH 
> MUST be written to non-volatile storage prior to replying to 
> thatNBD_CMD_FLUSH. This paragraph only applies if NBD_FLAG_SEND_FLUSH is set 
> within the transmission flags, as otherwise NBD_CMD_FLUSH will never be sent 
> by the client to the server.

I don't think HOL blocking is an issue here by that definition, because all 
FLUSH requires is that commands that are actually completed are flushed to 
disk. If there is head of line blocking which delays the arrival of a write 
issued before a flush, then the sender cannot be relying on whether that write 
is actually completed or not (or it would have waited for the result). The 
flush requires only that those commands COMPLETED are flushed to disk, not that 
those commands RECEIVED have been flushed to disk (and a fortiori not that 
those commands SENT FIRST) have been flushed to disk. From the point of view of 
the client, the flush can therefore only guarantee that the data associated 
with those commands for which it's actually received a reply prior to issuing 
the flush will be flushed, because the replies can be disordered too.

I don't think there is actually a problem here - Wouter if I'm wrong about 
this, I'd like to understand your argument better.



b) What I'm describing - which is the lack of synchronisation between channels.

Suppose you have a simple forking NBD server which uses (e.g.) a Ceph backend. 
Each process (i.e. each NBD channel) will have a separate connection to 
something with its own cache and buffering. Issuing a flush in Ceph requires 
waiting until a quorum of backends (OSDs) has been written to, and with a 
number of backends substantially greater than the quorum, it is not unlikely 
that a flush on one channel will not wait for writes on what Ceph considers a 
completely independent channel to have fully written (assuming the write 
completes before the flush is done).

The same would happen pretty trivially with a forking server that uses a 
process-space write-back cache.

This is because the spec when the spec says: "All write commands (that includes 
NBD_CMD_WRITE, and NBD_CMD_TRIM) that the server completes (i.e. replies to) 
prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage 
prior to replying to that NBD_CMD_FLUSH." what it currently means is actually 
"All write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) 
***ASSOCIATED WITH THAT CLIENT*** that the server completes (i.e. replies to) 
prior to processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage 
prior to replying to that NBD_CMD_FLUSH". 

So what we would need the spec to mean is "All write commands (that includes 
NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED WITH ANY CHANNEL OF THAT 
CLIENT*** that the server completes (i.e. replies to) prior to processing to a 
NBD_CMD_FLUSH MUST be written to non-volatile storage prior to replying to that 
NBD_CMD_FLUSH". And as we have no way to associate different channels of the 
same client, for servers that can't rely on the OS to synchronise flushing 
across different clients relating to the same file, in practice that means "All 
write commands (that includes NBD_CMD_WRITE, and NBD_CMD_TRIM) ***ASSOCIATED 
WITH ANY CLIENT AT ALL*** that the server completes (i.e. replies to) prior to 
processing to a NBD_CMD_FLUSH MUST be written to non-volatile storage prior to 
replying to that NBD_CMD_FLUSH" - i.e. a flush on any channel of any client 
must flush every channel of every client, because we have no easy way to tell 
which clients are in fact two channels. I have concerns over the scalability of 
this.

Now, in the reference server, NBD_CMD_FLUSH is implemented through an 
fdatasync(). Each client (and therefore each channel) runs in a different 
process.

Earlier in this thread, someone suggested that if this happens:

      Process A                Process B
      =========                =========

      fd1=open("file123")
                               fd2=open("file123")

      write(fd1, ...)
                               fdatasync("fd2")

then the fdatasync() is guaranteed to sync the write that Process A has 
written. This may or may not be the case under Linux (wiser minds than me will 
know). Is it guaranteed to be the case with (e.g.) the file on NFS? On all 
POSIX platforms?

Looking at 
http://pubs.opengroup.org/onlinepubs/009695399/functions/fdatasync.html I'd say 
it was a little ambiguous as to whether it will ALWAYS flush all data 
associated with the file even if it is being written by a different process 
(and a different FD).

If fdatasync() is always guaranteed to flush data associated with writes by a 
different process (and separately opened fd), then as it happens there won't be 
a problem on the reference server, just on servers that don't happen to use 
fdatasync() or similar to implement flushes, and which don't maintain their own 
caches. If fdatasync() is not so guaranteed, we have a problem with the 
reference server too, at least on some platforms and fling systems.

What I'm therefore asking for is either:
a) that the server can say 'if you are multichannel, you will need to send 
flush on each channel' (best); OR
b) that the server can say 'don't go multichannel'

as part of the negotiation stage. Of course as this is dependent on the 
backend, this is going to be something that is per-target (i.e. needs to come 
as a transmission flag or similar).

-- 
Alex Bligh




--
To unsubscribe from this list: send the line "unsubscribe linux-block" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to