Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24/09/2016 14:27, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
>> On 24.09.2016 00:21, Wouter Verhelst wrote:
>>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>>> My preference would be a new flag to the existing commands, with
>>>> explicit documentation that 0 offset and 0 length must be used with
>>>> that flag, when requesting a full-device wipe.
>>> Alternatively, what about a flag that says "if you use this flag, the
>>> size should be left-shifted by X bits before processing"? That allows
>>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>>> limited to "whole disk" commands. We should probably make it an illegal
>>> flag for any command that actually sends data over the wire, though.
>>
>> Note: if disk size is not aligned to X, we will have to send a request
>> larger than the disk size to clear the whole disk.
>
> Also, in this case, which implementation of the bdrv interface in qemu
> would be most appropriate? A similar flag (in this case X must be defined
> in some very transparent way, e.g. as a constant of 64k), a flag
> BDRV_REQ_WHOLE_DISK, or separate .bdrv_zero_all and .bdrv_discard_all?

This makes good sense.  It also matches the SANITIZE command from the SCSI
command set.

Paolo
Re: [Qemu-block] write_zeroes/trim on the whole disk
Am 24.09.2016 um 14:27 hat Vladimir Sementsov-Ogievskiy geschrieben: > On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote: > >On 24.09.2016 00:21, Wouter Verhelst wrote: > >>On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote: > >>>My preference would be a new flag to the existing commands, with > >>>explicit documentation that 0 offset and 0 length must be used > >>>with that > >>>flag, when requesting a full-device wipe. > >>Alternatively, what about a flag that says "if you use this flag, the > >>size should be left-shifted by X bits before processing"? That allows > >>you to do TRIM or WRITE_ZEROES on much larger chunks, without being > >>limited to "whole disk" commands. We should probably make it an illegal > >>flag for any command that actually sends data over the wire, though. > >> > > > > > >Note: if disk size is not aligned to X we will have to send > >request larger than the disk size to clear the whole disk. > > > > Also, in this case, which realization of bdrv interface in qemu > would be most appropriate? Similar flag (in this case X must be > defined in some very transparent way, as a constant of 64k for > example), or flag BDRV_REQ_WHOLE_DISK, or separate .bdrv_zero_all > and .bdrv_discard_all ? Maybe the best would be to extend the existing discard/write_zeroes functions to take a 64 bit byte count and then NBD can internally check whether a request clears the whole disk. Kevin
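For illustration only, a minimal sketch of the check a widened 64-bit
interface would enable on the NBD server side; the helper name and the idea
of comparing against the export size are assumptions here, not existing QEMU
code:

#include <stdbool.h>
#include <stdint.h>

/* Sketch: once discard/write_zeroes take a 64-bit byte count, an NBD
 * server can recognise a whole-device request without any new protocol
 * flag.  'export_size' is the size of the export in bytes. */
static bool covers_whole_disk(uint64_t offset, uint64_t length,
                              uint64_t export_size)
{
    return offset == 0 && length == export_size;
}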
Re: [Qemu-block] write_zeroes/trim on the whole disk
On Sat, Sep 24, 2016 at 11:19:53PM +0300, Vladimir Sementsov-Ogievskiy wrote: > On 24.09.2016 21:24, Alex Bligh wrote: > > > On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy > > >wrote: > > > > > > I just wanted to say, that if we want a possibility of clearing the whole > > > disk in one request for qcow2 we have to take 512 as granularity for such > > > requests (with X = 9). An this is too small. 1tb will be the upper bound > > > for the request. > > Sure. But I do not see the value in optimising these huge commands to run > > as single requests. If you want to do that, do it properly and have a > > negotiation-phase flag that supports 64 bit request lengths. > > And add additional request type with another magic in first field and 64bit > length field? If such solution is appropriate for nbd it is ok for me of > course. I've proposed something like this in first letter - "Increase length > field of the request to 64bit". Changing existing request message type is > wrong of course, but creating an additional one should be ok. I'd be opposed to that. Currently, the request and reply headers are strictly the same, for all messages. That's a feature; it allows a server to read the header (determine whether it needs to read data and if so, read that too), and hand it off to something else. By changing the header's format, you make the protocol more complicated. I'd like to avoid that. [...] > or a separate command/flag for clearing the whole disk, and separate > block-based solution in future if needed. I'd prefer to avoid one-usecase commands; adding too many of those would make the protocol specification too complicated. Currently, NBD is a simple to understand protocol; I'd like to keep it that way. I think it's not a huge problem if we introduce something that has rounding errors; we could easily have a spec saying that if NBD_CMD_WRITE_ZEROES with the NBD_CMD_FLAG_SHIFT flag is used, and the last byte of the device falls between N-1 and N blocks, then it is not an error to specify a length of N blocks. or something along those lines. > or new request type with 64bit length Not going to happen as far as I'm concerned, as per above. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 21:24, Alex Bligh wrote:
>> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy wrote:
>>
>> I just wanted to say that if we want a possibility of clearing the whole
>> disk in one request for qcow2, we have to take 512 as the granularity for
>> such requests (with X = 9). And this is too small: 1tb will be the upper
>> bound for the request.
>
> Sure. But I do not see the value in optimising these huge commands to run
> as single requests. If you want to do that, do it properly and have a
> negotiation-phase flag that supports 64 bit request lengths.

And add an additional request type with another magic in the first field and
a 64bit length field? If such a solution is appropriate for nbd it is ok for
me of course. I've proposed something like this in my first letter -
"Increase length field of the request to 64bit". Changing the existing
request message type is wrong of course, but creating an additional one
should be ok.

>> Full backup, for example:
>>
>> 1. target can do fast write_zeroes: clear the whole disk (great if we can
>> do it in one request, without splitting, etc), then backup all data except
>> zero or unallocated (save a lot of time on this skipping).
>> 2. target can not do fast write_zeroes: just backup all data. We need not
>> clear the disk, as we will not save time by this.
>>
>> So here, we do not need splitting in general. Just clear all or do not
>> clear at all.
>
> As I said, within the current protocol you cannot tell whether a target
> supports 'fast write zeroes', and indeed the support may be partial - for
> instance with a QCOW2 backend, a write that is not cluster aligned would
> likely only partially satisfy the command by deallocating bytes.
>
> There is no current flag for 'supports fast write zeroes' and (given the
> foregoing) it isn't evident to me exactly what it would mean.

I suggest to add this flag - a negotiation-phase flag exposing support of the
whole feature (a separate command or flag for clearing the whole disk). Fast
here means that we can do this in one request. write_zeroes (of any size, up
to the whole disk) is fast if it will not take more time than a usual write
(restricted to 2G).

> It seems however you could support your use case by simply iterating
> through the backup disk, using NBD_CMD_WRITE for the areas that are
> allocated and non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that
> are not allocated or zeroed.
>
> This technique would not require a protocol change (beyond the existing
> NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the target
> supports write zeroes or not, works irrespective of difference in cluster
> allocation size between source and target, is far simpler, and has the
> added advantage of making the existing zeroes-but-not-holes area into holes
> (that is optional if you can tell the difference between zeroes and holes
> on the source media). It also works on a single pass.
>
> Yes, you need to split requests up, but you need to split requests up
> ANYWAY to cope with NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise
> you not to use more than 2^31). And in any case, you probably want to
> parallelise reads and writes and have more than one write in flight, all of
> which suggests you are going to be breaking up requests anyway.

This is slow, see my first letter: iterative zeroing of qcow2 is slow. Why is
a separate command/flag for clearing the whole disk better for me than a
block-based solution with splitting of requests? I want to clear the whole
disk, and I don't want to introduce new functionality which I don't need for
now.
I need to clear the whole disk, but with a block-based solution I have a lot
of code which solves another task, and it only indirectly solves my task.
I.e. instead of simple_realisation+simple_usage+nice_solution_for_my_task I
have harder_realisation+harder_usage+ugly_solution_for_my_task.

I understand that we must take into account that such functionality (large
requests) will likely be needed in future, so a more generic solution is
better for the protocol. And I suggest a compromise:

negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST: command flag
    NBD_CMD_FLAG_BIG_REQUEST is supported for WRITE_ZEROES and TRIM
negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST_REGION: non-zero length is
    supported for a big request

flag NBD_CMD_FLAG_BIG_REQUEST is set and length = 0  -> request on the whole
    disk, offset must be 0
flag NBD_CMD_FLAG_BIG_REQUEST is set and length > 0  -> request on
    (offset*block_size, length*block_size), length*block_size must be <=
    disk_size (only if NBD_FLAG_SEND_BIG_REQUEST_REGION is negotiated)
flag NBD_CMD_FLAG_BIG_REQUEST is unset               -> usual request on
    (offset, length)

or a separate command/flag for clearing the whole disk, and a separate
block-based solution in future if needed.

or a new request type with 64bit length

--
Best regards,
Vladimir
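For illustration, a minimal sketch of how a server might decode the flag
proposed above; the flag bit value, the helper name and the error handling
are all hypothetical, and block_size stands for the constant fixed during
negotiation (e.g. 64k):

#include <stdbool.h>
#include <stdint.h>

#define NBD_CMD_FLAG_BIG_REQUEST  (1 << 4)   /* placeholder bit value */

/* Sketch of server-side decoding of the proposed flag (all names are
 * hypothetical).  Returns false if the request is invalid. */
static bool decode_big_request(uint16_t flags, uint64_t offset,
                               uint32_t length, uint64_t block_size,
                               uint64_t disk_size,
                               uint64_t *real_offset, uint64_t *real_length)
{
    if (!(flags & NBD_CMD_FLAG_BIG_REQUEST)) {
        *real_offset = offset;                 /* usual request */
        *real_length = length;
    } else if (length == 0) {
        if (offset != 0) {
            return false;                      /* offset must be 0 */
        }
        *real_offset = 0;                      /* whole-disk request */
        *real_length = disk_size;
    } else {
        *real_offset = offset * block_size;    /* scaled request */
        *real_length = (uint64_t)length * block_size;
        if (*real_offset + *real_length > disk_size) {
            return false;
        }
    }
    return true;
}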
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy >wrote: > > I just wanted to say, that if we want a possibility of clearing the whole > disk in one request for qcow2 we have to take 512 as granularity for such > requests (with X = 9). An this is too small. 1tb will be the upper bound for > the request. Sure. But I do not see the value in optimising these huge commands to run as single requests. If you want to do that, do it properly and have a negotiation-phase flag that supports 64 bit request lengths. > Full backup, for example: > > 1. target can do fast write_zeroes: clear the whole disk (great if we can do > it in one request, without splitting, etc), then backup all data except zero > or unallocated (save a lot of time on this skipping). > 2. target can not do fast write_zeroes: just backup all data. We need not > clear the disk, as we will not save time by this. > > So here, we need not splitting as a general. Just clear all or not clearing > at all. As I said, within the current protocol you cannot tell whether a target supports 'fast write zeroes', and indeed the support may be partial - for instance with a QCOW2 backend, a write that is not cluster aligned would likely only partially satisfy the command by deallocating bytes. There is no current flag for 'supports fast write zeroes' and (given the foregoing) it isn't evident to me exactly what it would mean. It seems however you could support your use case by simply iterating through the backup disk, using NBD_CMD_WRITE for the areas that are allocated and non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that are not allocated or zeroed. This technique would not require a protocol change (beyond the existing NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the target supports write zeroes or not, works irrespective of difference in cluster allocation size between source and target, is far simpler, and has the added advantage of making the existing zeroes-but-not-holes area into holes (that is optional if you can tell the difference between zeroes and holes on the source media). It also works on a single pass. Yes, you need to split requests up, but you need to split requests up ANYWAY to cope with NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise you not to use more than 2^31). And in any case, you probably want to parallelise reads and writes and have more than one write in flight in any case, all of which suggests you are going to be breaking up requests anyway. -- Alex Bligh
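For illustration, a minimal client-side sketch of the splitting described
above; nbd_write_zeroes() is a stand-in for whatever request function the
client really has, not an existing API:

#include <stdint.h>

#define MAX_REQ_LEN  (1u << 31)   /* stay at 2^31 as advised above */

int nbd_write_zeroes(uint64_t offset, uint32_t length);  /* hypothetical */

/* Zero an arbitrarily large range by splitting it into requests that fit
 * the 32-bit length field. */
static int zero_range(uint64_t offset, uint64_t length)
{
    while (length > 0) {
        uint32_t chunk = length > MAX_REQ_LEN ? MAX_REQ_LEN
                                              : (uint32_t)length;
        int ret = nbd_write_zeroes(offset, chunk);
        if (ret < 0) {
            return ret;
        }
        offset += chunk;
        length -= chunk;
    }
    return 0;
}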
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 20:32, Alex Bligh wrote: On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiywrote: On 24.09.2016 19:49, Alex Bligh wrote: On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote: On 24.09.2016 19:31, Alex Bligh wrote: On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote: Note: if disk size is not aligned to X we will have to send request larger than the disk size to clear the whole disk. If you look at the block size extension, the size of the disk must be an exact multiple of the minimum block size. So that would work. This means that this extension could not be used with any qcow2 disk, as qcow2 may have size not aligned to its cluster size. # qemu-img create -f qcow2 mega 1K Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 lazy_refcounts=off refcount_bits=16 # qemu-img info mega image: mega file format: qcow2 virtual size: 1.0K (1024 bytes) disk size: 196K cluster_size: 65536 Format specific information: compat: 1.1 lazy refcounts: false refcount bits: 16 corrupt: false And there is no such restriction in documentation. Or we have to consider sector-size (512b) as block size for qcow2, which is too small for our needs. If by "this extension" you mean the INFO extension (which reports block sizes) that's incorrect. An nbd server using a QCOW2 file as the backend would report the sector size as the minimum block size. It might report the cluster size or the sector size as the preferred block size, or anything in between. QCOW2 cluster size essentially determines the allocation unit. NBD is not bothered as to the underlying allocation unit. It does not (currently) support the concept of making holes visible to the client. If you use NBD_CMD_WRITE_ZEREOS you get zeroes, which might or might not be implemented as one or more holes or 'real' zeroes (save if you specify NBD_CMD_FLAG_NO_HOLE in which case you are guaranteed to get 'real' zeroes'). If you use NBD_CMD_TRIM then the area trimmed might nor might not be written with one or more whole. There is (currently) no way to detect the presence of holes separately from zeroes (though a bitmap extension was discussed). I just wanted to say, that if we want a possibility of clearing the whole disk in one request for qcow2 we have to take 512 as granularity for such requests (with X = 9). An this is too small. 1tb will be the upper bound for the request. But there is no guarantee that disk_size/block_size < INT_MAX.. I think you mean 2^32-1, but yes there is no guarantee of that. In that case you would need to break the call up into multiple calls. However, being able to break the call up into multiple calls seems pretty sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time, and a REALLY long time if the server doesn't support trim. May be, additional option, specifying the shift would be better. With convention that if offset+length exceeds disk size, length should be recalculated as disk_size-offset. I don't think we should do that. We already have clear semantics that prevent operations beyond the end of the disk. Again, just break the command up into multipl commands. No great hardship. I agree that requests larger than disk size are ugly.. But splitting request brings me again to idea of having separate command or flag for clearing the whole disk without that dance. Server may report availability of this/flag command only if target driver supports fast write_zeroes (qcow2 in our case). Why? 
In the general case you need to break up requests anyway (particularly with the INFO extension where there is a maximum command size), and issuing a command over a TCP connection that might take hours or days to complete with no hint of progress, and no TCP traffic to keep NAT etc. alive, sounds like bad practice. The overhead is tiny. I would be against this change. Full backup, for example: 1. target can do fast write_zeroes: clear the whole disk (great if we can do it in one request, without splitting, etc), then backup all data except zero or unallocated (save a lot of time on this skipping). 2. target can not do fast write_zeroes: just backup all data. We need not clear the disk, as we will not save time by this. So here, we need not splitting as a general. Just clear all or not clearing at all. -- Best regards, Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy >wrote: > > On 24.09.2016 19:49, Alex Bligh wrote: >>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy >>> wrote: >>> >>> On 24.09.2016 19:31, Alex Bligh wrote: > On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy > wrote: > > Note: if disk size is not aligned to X we will have to send request > larger than the disk size to clear the whole disk. If you look at the block size extension, the size of the disk must be an exact multiple of the minimum block size. So that would work. > > This means that this extension could not be used with any qcow2 disk, as > qcow2 may have size not aligned to its cluster size. > > # qemu-img create -f qcow2 mega 1K > Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 > lazy_refcounts=off refcount_bits=16 > # qemu-img info mega > image: mega > file format: qcow2 > virtual size: 1.0K (1024 bytes) > disk size: 196K > cluster_size: 65536 > Format specific information: >compat: 1.1 >lazy refcounts: false >refcount bits: 16 >corrupt: false > > And there is no such restriction in documentation. Or we have to consider > sector-size (512b) as block size for qcow2, which is too small for our needs. If by "this extension" you mean the INFO extension (which reports block sizes) that's incorrect. An nbd server using a QCOW2 file as the backend would report the sector size as the minimum block size. It might report the cluster size or the sector size as the preferred block size, or anything in between. QCOW2 cluster size essentially determines the allocation unit. NBD is not bothered as to the underlying allocation unit. It does not (currently) support the concept of making holes visible to the client. If you use NBD_CMD_WRITE_ZEREOS you get zeroes, which might or might not be implemented as one or more holes or 'real' zeroes (save if you specify NBD_CMD_FLAG_NO_HOLE in which case you are guaranteed to get 'real' zeroes'). If you use NBD_CMD_TRIM then the area trimmed might nor might not be written with one or more whole. There is (currently) no way to detect the presence of holes separately from zeroes (though a bitmap extension was discussed). >>> But there is no guarantee that disk_size/block_size < INT_MAX.. >> I think you mean 2^32-1, but yes there is no guarantee of that. In that case >> you would need to break the call up into multiple calls. >> >> However, being able to break the call up into multiple calls seems pretty >> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of >> time, and a REALLY long time if the server doesn't support trim. >> >>> May be, additional option, specifying the shift would be better. With >>> convention that if offset+length exceeds disk size, length should be >>> recalculated as disk_size-offset. >> I don't think we should do that. We already have clear semantics that >> prevent operations beyond the end of the disk. Again, just break the command >> up into multipl commands. No great hardship. >> > > I agree that requests larger than disk size are ugly.. But splitting request > brings me again to idea of having separate command or flag for clearing the > whole disk without that dance. Server may report availability of this/flag > command only if target driver supports fast write_zeroes (qcow2 in our case). Why? 
In the general case you need to break up requests anyway (particularly with the INFO extension where there is a maximum command size), and issuing a command over a TCP connection that might take hours or days to complete with no hint of progress, and no TCP traffic to keep NAT etc. alive, sounds like bad practice. The overhead is tiny. I would be against this change. -- Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 20:13, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 19:49, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>>
>>>>> Note: if disk size is not aligned to X we will have to send a request
>>>>> larger than the disk size to clear the whole disk.
>>>> If you look at the block size extension, the size of the disk must be
>>>> an exact multiple of the minimum block size. So that would work.
>>>
>>> This means that this extension could not be used with any qcow2 disk, as
>>> qcow2 may have a size not aligned to its cluster size.
>>>
>>> # qemu-img create -f qcow2 mega 1K
>>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>>> lazy_refcounts=off refcount_bits=16
>>> # qemu-img info mega
>>> image: mega
>>> file format: qcow2
>>> virtual size: 1.0K (1024 bytes)
>>> disk size: 196K
>>> cluster_size: 65536
>>> Format specific information:
>>>     compat: 1.1
>>>     lazy refcounts: false
>>>     refcount bits: 16
>>>     corrupt: false
>>>
>>> And there is no such restriction in the documentation. Or we have to
>>> consider the sector size (512b) as the block size for qcow2, which is
>>> too small for our needs.
>>>
>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>> I think you mean 2^32-1, but yes there is no guarantee of that. In that
>> case you would need to break the call up into multiple calls.
>>
>> However, being able to break the call up into multiple calls seems pretty
>> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
>> and a REALLY long time if the server doesn't support trim.
>>
>>> Maybe an additional option, specifying the shift, would be better. With
>>> the convention that if offset+length exceeds the disk size, length
>>> should be recalculated as disk_size-offset.
>> I don't think we should do that. We already have clear semantics that
>> prevent operations beyond the end of the disk. Again, just break the
>> command up into multiple commands. No great hardship.
>
> I agree that requests larger than the disk size are ugly.. But splitting
> the request brings me again to the idea of having a separate command or
> flag for clearing the whole disk without that dance. The server may report
> availability of this command/flag only if the target driver supports fast
> write_zeroes (qcow2 in our case).

Also, such a flag may be used to satisfy all needs:

flag BIG_REQUEST is set and length = 0  -> request on the whole disk, offset
    must be 0
flag BIG_REQUEST is set and length > 0  -> request on (offset*block_size,
    length*block_size), length*block_size must be <= disk_size
flag BIG_REQUEST is unset               -> usual request on (offset, length)

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:49, Alex Bligh wrote:
>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>
>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>
>>>> Note: if disk size is not aligned to X we will have to send a request
>>>> larger than the disk size to clear the whole disk.
>>> If you look at the block size extension, the size of the disk must be an
>>> exact multiple of the minimum block size. So that would work.
>>
>> This means that this extension could not be used with any qcow2 disk, as
>> qcow2 may have a size not aligned to its cluster size.
>>
>> # qemu-img create -f qcow2 mega 1K
>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>> lazy_refcounts=off refcount_bits=16
>> # qemu-img info mega
>> image: mega
>> file format: qcow2
>> virtual size: 1.0K (1024 bytes)
>> disk size: 196K
>> cluster_size: 65536
>> Format specific information:
>>     compat: 1.1
>>     lazy refcounts: false
>>     refcount bits: 16
>>     corrupt: false
>>
>> And there is no such restriction in the documentation. Or we have to
>> consider the sector size (512b) as the block size for qcow2, which is too
>> small for our needs.
>>
>> But there is no guarantee that disk_size/block_size < INT_MAX..
> I think you mean 2^32-1, but yes there is no guarantee of that. In that
> case you would need to break the call up into multiple calls.
>
> However, being able to break the call up into multiple calls seems pretty
> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
> and a REALLY long time if the server doesn't support trim.
>
>> Maybe an additional option, specifying the shift, would be better. With
>> the convention that if offset+length exceeds the disk size, length should
>> be recalculated as disk_size-offset.
> I don't think we should do that. We already have clear semantics that
> prevent operations beyond the end of the disk. Again, just break the
> command up into multiple commands. No great hardship.

I agree that requests larger than the disk size are ugly.. But splitting the
request brings me again to the idea of having a separate command or flag for
clearing the whole disk without that dance. The server may report
availability of this command/flag only if the target driver supports fast
write_zeroes (qcow2 in our case).

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:52, Alex Blighwrote: > > In *your* use-case holes may be desirable. However in the general case, you > cannot assume a server supports holes. Optional support for holes isn't even > in the mainline spec yet (AFAIR). You should also be aware that the minimum granularity for writes (currently 1 byte in the spec, but the blocksize extension allows for larger) may be smaller than the minimum granularity for holes, which may depend on the underlying filing system. A hole is in essence merely an optimised way of storing zeroes. And the trim operation says 'I'm not worried about this data any more - so zero it if you like'. -- Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:48, Vladimir Sementsov-Ogievskiy wrote:
>
>>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
>>> much assume that a server that supports holes will write holes. A server
>>> that does not support holes will write zeroes. If you don't care whether
>>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care
>>> (see above) you must be prepared for a 'thick' write of zeroes on
>>> servers that don't support it.
>>
>> No, holes are critical. Concrete case: incremental backup to a delta
>> file. If we write zeroes instead of holes, we will lose underlying data
>> (from the previous incremental).
>
> hmm, no, sorry, that is not needed.

In *your* use-case holes may be desirable. However in the general case, you
cannot assume a server supports holes. Optional support for holes isn't even
in the mainline spec yet (AFAIR).

It's up to you if you choose not to connect to servers that don't support
NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM, but even if you take that strategy,
you cannot guarantee that the server will not implement them by ignoring
NBD_CMD_TRIM (this is perfectly acceptable, as NBD_CMD_TRIM is advisory
only) and making NBD_CMD_WRITE_ZEROES do a long write of zeroes. IE there is
no way to detect whether the server actually supports holes.

--
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>
> On 24.09.2016 19:31, Alex Bligh wrote:
>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> Note: if disk size is not aligned to X we will have to send a request
>>> larger than the disk size to clear the whole disk.
>> If you look at the block size extension, the size of the disk must be an
>> exact multiple of the minimum block size. So that would work.
>
> But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
and a REALLY long time if the server doesn't support trim.

> Maybe an additional option, specifying the shift, would be better. With
> the convention that if offset+length exceeds the disk size, length should
> be recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that
prevent operations beyond the end of the disk. Again, just break the command
up into multiple commands. No great hardship.

--
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:44, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 19:35, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> Also, according to the documentation, NBD_CMD_TRIM is not appropriate
>>> for disk clearing:
>>>
>>> * `NBD_CMD_TRIM` (4)
>>>
>>> A hint to the server that the data defined by len and offset is no
>>> longer needed. A server MAY discard len bytes starting at offset, but
>>> is not required to.
>>>
>>> After issuing this command, a client MUST NOT make any assumptions
>>> about the contents of the export affected by this command, until
>>> overwriting it again with `NBD_CMD_WRITE`.
>>>
>>> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag
>>> for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?
>>
>> You cannot force a hole, because the NBD server is not guaranteed to
>> support holes.
>>
>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
>> much assume that a server that supports holes will write holes. A server
>> that does not support holes will write zeroes. If you don't care whether
>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care
>> (see above) you must be prepared for a 'thick' write of zeroes on servers
>> that don't support it.
>
> No, holes are critical. Concrete case: incremental backup to a delta file.
> If we write zeroes instead of holes, we will lose underlying data (from
> the previous incremental).

Hmm, no, sorry, that is not needed.

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:35, Alex Bligh wrote:
>> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>>
>> Also, according to the documentation, NBD_CMD_TRIM is not appropriate for
>> disk clearing:
>>
>> * `NBD_CMD_TRIM` (4)
>>
>> A hint to the server that the data defined by len and offset is no longer
>> needed. A server MAY discard len bytes starting at offset, but is not
>> required to.
>>
>> After issuing this command, a client MUST NOT make any assumptions about
>> the contents of the export affected by this command, until overwriting it
>> again with `NBD_CMD_WRITE`.
>>
>> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag for
>> this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?
>
> You cannot force a hole, because the NBD server is not guaranteed to
> support holes.
>
> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
> much assume that a server that supports holes will write holes. A server
> that does not support holes will write zeroes. If you don't care whether
> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care (see
> above) you must be prepared for a 'thick' write of zeroes on servers that
> don't support it.

No, holes are critical. Concrete case: incremental backup to a delta file.
If we write zeroes instead of holes, we will lose underlying data (from the
previous incremental).

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:31, Alex Bligh wrote:
>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>
>> Note: if disk size is not aligned to X we will have to send a request
>> larger than the disk size to clear the whole disk.
>
> If you look at the block size extension, the size of the disk must be an
> exact multiple of the minimum block size. So that would work.

But there is no guarantee that disk_size/block_size < INT_MAX..

Maybe an additional option, specifying the shift, would be better. With the
convention that if offset+length exceeds the disk size, length should be
recalculated as disk_size-offset.

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>
> Also, according to the documentation, NBD_CMD_TRIM is not appropriate for
> disk clearing:
>
> * `NBD_CMD_TRIM` (4)
>
> A hint to the server that the data defined by len and offset is no
> longer needed. A server MAY discard len bytes starting at offset, but
> is not required to.
>
> After issuing this command, a client MUST NOT make any assumptions
> about the contents of the export affected by this command, until
> overwriting it again with `NBD_CMD_WRITE`.
>
> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag for
> this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

You cannot force a hole, because the NBD server is not guaranteed to support
holes.

Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
much assume that a server that supports holes will write holes. A server
that does not support holes will write zeroes. If you don't care whether the
resultant data is zero, just use NBD_CMD_TRIM. But as you do care (see
above) you must be prepared for a 'thick' write of zeroes on servers that
don't support it.

--
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy >wrote: > > Note: if disk size is not aligned to X we will have to send request larger > than the disk size to clear the whole disk. If you look at the block size extension, the size of the disk must be an exact multiple of the minimum block size. So that would work. -- Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 09/24/2016 12:21 AM, Wouter Verhelst wrote: > On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote: >> My preference would be a new flag to the existing commands, with >> explicit documentation that 0 offset and 0 length must be used with that >> flag, when requesting a full-device wipe. > Alternatively, what about a flag that says "if you use this flag, the > size should be left-shifted by X bits before processing"? That allows > you to do TRIM or WRITE_ZEROES on much larger chunks, without being > limited to "whole disk" commands. We should probably make it an illegal > flag for any command that actually sends data over the wire, though. > interesting. Actually I like this ;)
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 16:42, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
>> On 24.09.2016 00:21, Wouter Verhelst wrote:
>>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>>> My preference would be a new flag to the existing commands, with
>>>> explicit documentation that 0 offset and 0 length must be used with
>>>> that flag, when requesting a full-device wipe.
>>> Alternatively, what about a flag that says "if you use this flag, the
>>> size should be left-shifted by X bits before processing"? That allows
>>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>>> limited to "whole disk" commands. We should probably make it an illegal
>>> flag for any command that actually sends data over the wire, though.
>>
>> Note: if disk size is not aligned to X we will have to send a request
>> larger than the disk size to clear the whole disk.
>
> which is not beautiful.. And splitting the request into two:
> (disk_size / X * X) and (disk_size % X) is not beautiful either. Maybe a
> flag that says 'whole disk request' is more appropriate? Nothing prevents
> adding a flag about the X-shift later.

Also, according to the documentation, NBD_CMD_TRIM is not appropriate for
disk clearing:

* `NBD_CMD_TRIM` (4)

A hint to the server that the data defined by len and offset is no longer
needed. A server MAY discard len bytes starting at offset, but is not
required to.

After issuing this command, a client MUST NOT make any assumptions about the
contents of the export affected by this command, until overwriting it again
with `NBD_CMD_WRITE`.

- it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag for
this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 00:21, Wouter Verhelst wrote:
>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>> My preference would be a new flag to the existing commands, with
>>> explicit documentation that 0 offset and 0 length must be used with
>>> that flag, when requesting a full-device wipe.
>> Alternatively, what about a flag that says "if you use this flag, the
>> size should be left-shifted by X bits before processing"? That allows
>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>> limited to "whole disk" commands. We should probably make it an illegal
>> flag for any command that actually sends data over the wire, though.
>
> Note: if disk size is not aligned to X we will have to send a request
> larger than the disk size to clear the whole disk.

which is not beautiful.. And splitting the request into two:
(disk_size / X * X) and (disk_size % X) is not beautiful either. Maybe a
flag that says 'whole disk request' is more appropriate? Nothing prevents
adding a flag about the X-shift later.

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 00:21, Wouter Verhelst wrote:
>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>> My preference would be a new flag to the existing commands, with
>>> explicit documentation that 0 offset and 0 length must be used with
>>> that flag, when requesting a full-device wipe.
>> Alternatively, what about a flag that says "if you use this flag, the
>> size should be left-shifted by X bits before processing"? That allows
>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>> limited to "whole disk" commands. We should probably make it an illegal
>> flag for any command that actually sends data over the wire, though.
>
> Note: if disk size is not aligned to X we will have to send a request
> larger than the disk size to clear the whole disk.

Also, in this case, which implementation of the bdrv interface in qemu would
be most appropriate? A similar flag (in this case X must be defined in some
very transparent way, e.g. as a constant of 64k), a flag BDRV_REQ_WHOLE_DISK,
or separate .bdrv_zero_all and .bdrv_discard_all?

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 00:21, Wouter Verhelst wrote:
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with
>> that flag, when requesting a full-device wipe.
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.

Note: if disk size is not aligned to X we will have to send a request larger
than the disk size to clear the whole disk.

--
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote: > My preference would be a new flag to the existing commands, with > explicit documentation that 0 offset and 0 length must be used with that > flag, when requesting a full-device wipe. Alternatively, what about a flag that says "if you use this flag, the size should be left-shifted by X bits before processing"? That allows you to do TRIM or WRITE_ZEROES on much larger chunks, without being limited to "whole disk" commands. We should probably make it an illegal flag for any command that actually sends data over the wire, though. -- < ron> I mean, the main *practical* problem with C++, is there's like a dozen people in the world who think they really understand all of its rules, and pretty much all of them are just lying to themselves too. -- #debian-devel, OFTC, 2016-02-12
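For illustration, the arithmetic such a flag implies on the server side; the
helper name and the example shift are assumptions, X = 16 being one possible
value:

#include <stdint.h>

/* With a hypothetical shift flag, the 32-bit length on the wire is scaled
 * by a fixed number of bits before processing; e.g. X = 16 lets a single
 * TRIM or WRITE_ZEROES request cover (2^32 - 1) * 64KiB, roughly 256TiB. */
static uint64_t effective_length(uint32_t wire_length, unsigned shift_bits)
{
    return (uint64_t)wire_length << shift_bits;
}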
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 09/23/2016 01:32 PM, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
>
> There is the following problem. When we need to write_zeroes or trim the
> whole disk, we have to do it iteratively, because of the 32-bit
> restriction on request length.
> For example, the current implementation of mirror (see
> mirror_dirty_init()) does this in chunks of 2147418112 bytes (with the
> default granularity of 65536). So, to zero a 16tb disk we will make 8192
> requests instead of one.
>
> Incremental zeroing of 1tb qcow2 takes > 80 seconds for me (see below).
> This means ~20 minutes for copying an empty 16tb qcow2 disk, which is
> obviously a waste of time.
>
> We see the following solutions for nbd:
>
> 1. Add command NBD_MAKE_EMPTY, with a flag saying what should be done:
> trim or write_zeroes.

Presumably spelled NBD_CMD_MAKE_EMPTY.

> 2. Add flag NBD_CMD_FLAG_WHOLE for commands NBD_TRIM and
> NBD_WRITE_ZEROES, which will say (with zeroed offset and length of the
> request) that the whole disk should be discarded/zeroed.

Both of these are possible. As it is, NBD_CMD_WRITE_ZEROES is not even
formally part of the NBD spec yet, although NBD_CMD_TRIM is (I'm still
sitting on my qemu proof-of-concept patches for WRITE_ZEROES, and need to
resubmit them now that the qemu 2.8 development window is open). Either way,
the server would have to advertise if the new command and/or new flags to
existing commands are available for a whole-disk trim/zero, before a client
could use it, and clients must be prepared to fall back to incremental
approaches otherwise.

My preference would be a new flag to the existing commands, with explicit
documentation that 0 offset and 0 length must be used with that flag, when
requesting a full-device wipe.

> 3. Increase length field of the request to 64bit.

No; that won't work. It would be a fundamental change to the NBD protocol,
and require both new servers and new clients to talk a different wire
protocol with different size length parameters.

> As soon as we have some way to empty a disk in nbd, we can use
> qcow2_make_empty to trim the whole disk (and something similar should
> be done for zeroing).
>
> What do you think about this all, and which way has a chance to get into
> the nbd proto?

It's not necessarily obvious that the ability to bulk-trim or bulk-zero a
device should be fundamentally faster than doing it incrementally in 2G
chunks; but I concede that there may indeed be scenarios such as qemu's
qcow2 file where that is true. So it does sound like a useful option and/or
command to be proposed for addition to the NBD protocol, from that point of
view.

As with other extensions to NBD, the best way is to write up a proposal for
how the documentation should change, submit that as patches to the nbd list,
and accompany it with a proof-of-concept implementation (qemu's nbd server
and nbd client work well), so that we can iron out the details of the
documentation before making it a formal part of the spec. It's important to
remember that such a proposal should still be optional (a server need not
implement the new mode, and a client should be prepared to fall back to
other means if the server does not support a whole-device action).

--
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
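For illustration, the sanity check this preference implies on the server
side; the flag name and bit value are placeholders, not part of the NBD
spec:

#include <stdbool.h>
#include <stdint.h>

#define NBD_CMD_FLAG_WHOLE  (1 << 5)   /* placeholder bit value */

/* A request carrying the whole-device flag is only well-formed with
 * offset == 0 and length == 0; anything else is a client error. */
static bool whole_device_request_ok(uint16_t flags, uint64_t offset,
                                    uint32_t length)
{
    if (!(flags & NBD_CMD_FLAG_WHOLE)) {
        return true;               /* ordinary request, nothing to check */
    }
    return offset == 0 && length == 0;
}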
[Qemu-block] write_zeroes/trim on the whole disk
Hi all!

There is the following problem. When we need to write_zeroes or trim the
whole disk, we have to do it iteratively, because of the 32-bit restriction
on request length.
For example, the current implementation of mirror (see mirror_dirty_init())
does this in chunks of 2147418112 bytes (with the default granularity of
65536). So, to zero a 16tb disk we will make 8192 requests instead of one.

Incremental zeroing of 1tb qcow2 takes > 80 seconds for me (see below).
This means ~20 minutes for copying an empty 16tb qcow2 disk, which is
obviously a waste of time.

We see the following solutions for nbd:

1. Add command NBD_MAKE_EMPTY, with a flag saying what should be done:
   trim or write_zeroes.
2. Add flag NBD_CMD_FLAG_WHOLE for commands NBD_TRIM and NBD_WRITE_ZEROES,
   which will say (with zeroed offset and length of the request) that the
   whole disk should be discarded/zeroed.
3. Increase the length field of the request to 64bit.

As soon as we have some way to empty a disk in nbd, we can use
qcow2_make_empty to trim the whole disk (and something similar should be
done for zeroing).

What do you think about this all, and which way has a chance to get into the
nbd proto?

== test incremental qcow2 zeroing in mirror ==

1. enable it. If we will use nbd, it will be enabled.

diff --git a/block/mirror.c b/block/mirror.c
index f9d1fec..4ac0c39 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -556,7 +556,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)
 
     end = s->bdev_length / BDRV_SECTOR_SIZE;
 
-    if (base == NULL && !bdrv_has_zero_init(target_bs)) {
+    if (base == NULL) {
         if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
             bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
             return 0;

2. test

qemu-img create -f qcow2 /tmp/1tb.qcow2 1T

virsh start backup-vm --paused
Domain backup-vm started

virsh qemu-monitor-command backup-vm {"execute":"blockdev-add","arguments":{"options": {"aio": "native", "file": {"driver": "file", "filename": "/tmp/1tb.qcow2"}, "discard": "unmap", "cache": {"direct": true}, "driver": "qcow2", "id": "disk"}}}
{"return":{},"id":"libvirt-32"}

/usr/bin/time -f '%e seconds' sh -c 'virsh qemu-monitor-event' &

virsh qemu-monitor-command backup-vm {"execute":"drive-mirror","arguments":{"device": "disk", "sync": "full", "target": "/tmp/targ"}}
{"return":{},"id":"libvirt-33"}

[root@kvm qemu]# event BLOCK_JOB_READY at 1474652677.668624 for domain backup-vm: {"device":"disk","len":1099511627776,"offset":1099511627776,"speed":0,"type":"mirror"}
events received: 1
86.39 seconds

- the same for a 2tb empty disk: 180.19 seconds
- and without the patch, it takes < 1 second, of course.

--
Best regards,
Vladimir