Re: [Qemu-block] write_zeroes/trim on the whole disk
On Sat, Sep 24, 2016 at 11:19:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 21:24, Alex Bligh wrote:
> > > On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy wrote:
> > >
> > > I just wanted to say that if we want a possibility of clearing the whole
> > > disk in one request for qcow2, we have to take 512 as the granularity for
> > > such requests (with X = 9). And this is too small: 1 TB will be the upper
> > > bound for the request.
> > Sure. But I do not see the value in optimising these huge commands to run
> > as single requests. If you want to do that, do it properly and have a
> > negotiation-phase flag that supports 64 bit request lengths.
> And add an additional request type with another magic in the first field
> and a 64-bit length field? If such a solution is appropriate for nbd, it is
> ok for me, of course. I proposed something like this in my first letter -
> "Increase the length field of the request to 64 bits". Changing the
> existing request message type is wrong of course, but creating an
> additional one should be ok.

I'd be opposed to that. Currently, the request and reply headers are
strictly the same, for all messages. That's a feature; it allows a server to
read the header (determine whether it needs to read data and if so, read
that too), and hand it off to something else. By changing the header's
format, you make the protocol more complicated. I'd like to avoid that.

[...]

> or a separate command/flag for clearing the whole disk, and a separate
> block-based solution in future if needed.

I'd prefer to avoid one-usecase commands; adding too many of those would
make the protocol specification too complicated. Currently, NBD is a
simple-to-understand protocol; I'd like to keep it that way.

I think it's not a huge problem if we introduce something that has rounding
errors; we could easily have the spec say that if NBD_CMD_WRITE_ZEROES with
the NBD_CMD_FLAG_SHIFT flag is used, and the last byte of the device falls
between N-1 and N blocks, then it is not an error to specify a length of N
blocks. Or something along those lines.

> or a new request type with a 64-bit length

Not going to happen as far as I'm concerned, as per above.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a
       dozen people in the world who think they really understand all of
       its rules, and pretty much all of them are just lying to
       themselves too.
 -- #debian-devel, OFTC, 2016-02-12
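Wouter's point about keeping the header uniform can be made concrete. The sketch below shows the single fixed-size request header that every command shares, with field widths as in the NBD protocol document around the time of this thread; the struct name and the packing attribute are mine, for illustration only:

```c
#include <stdint.h>

/* Illustrative sketch of the fixed NBD request header: every command uses
 * this one layout, so a server can read exactly 28 bytes, decide whether a
 * payload follows, and dispatch - the property Wouter wants to keep. */
#define NBD_REQUEST_MAGIC 0x25609513u

struct nbd_request {
    uint32_t magic;   /* NBD_REQUEST_MAGIC */
    uint16_t flags;   /* command flags, e.g. NBD_CMD_FLAG_NO_HOLE */
    uint16_t type;    /* command type, e.g. NBD_CMD_WRITE_ZEROES */
    uint64_t handle;  /* opaque cookie, echoed back in the reply */
    uint64_t offset;  /* already 64 bits */
    uint32_t length;  /* only 32 bits: the field the proposals would widen */
} __attribute__((packed));
```

The 32-bit `length` at the end is exactly what the 64-bit proposals would change, either in place or via a second header format, which is why they break the "one header for all messages" property.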
Re: [Qemu-block] [Nbd] write_zeroes/trim on the whole disk
On Sat, Sep 24, 2016 at 11:31:25AM +0100, Alex Bligh wrote:
> > On 23 Sep 2016, at 22:21, Wouter Verhelst wrote:
> >
> > On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
> >> My preference would be a new flag to the existing commands, with
> >> explicit documentation that 0 offset and 0 length must be used with that
> >> flag, when requesting a full-device wipe.
> >
> > Alternatively, what about a flag that says "if you use this flag, the
> > size should be left-shifted by X bits before processing"? That allows
> > you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> > limited to "whole disk" commands. We should probably make it an illegal
> > flag for any command that actually sends data over the wire, though.
>
> I'd prefer an approach like this. Perhaps X could be negotiated
> with the block size extension (or simply be defined as the
> 'preferred block size').
>
> This could then be defined to work with all commands.

The downside of making the shift size too small is that it wouldn't help as
much with the use case that was presented here (that is, blanking a whole
disk with one command). I think that if the shift size is to be negotiated,
it should be something that the client prefers, rather than something the
server prefers.

I really don't think allowing a shift for things that require data transfer
makes much sense. Yes, a shift for things like "clear all data" can be
sensible, because (depending on the backend) this can be a fairly simple
operation. However, "write more than 2G of data" or "read more than 2G of
data" isn't a simple operation and takes a lot of data to be transferred, so
allowing a shift for those operations doesn't make sense at all; and
suggesting that we allow it could actually make people believe it's a good
idea to do so when it really really (really) isn't.
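The left-shift idea under discussion reduces to one line of arithmetic. A minimal sketch, assuming a flag along the lines of the NBD_CMD_FLAG_SHIFT name Wouter floats elsewhere in the thread (a proposal, not part of the spec; the helper name is hypothetical):

```c
#include <stdint.h>

/* Sketch of the proposed shift semantics: when the flag is set, the 32-bit
 * wire length is left-shifted by a negotiated X before processing, so a
 * single request can address far more than 2^32-1 bytes. */
static uint64_t effective_length(uint32_t length, unsigned shift_x, int flag_shift)
{
    return flag_shift ? (uint64_t)length << shift_x : (uint64_t)length;
}
```

With X = 9 (512-byte granularity), a request of 2^31 blocks covers 1 TB, which matches the upper bound Vladimir mentions elsewhere in the thread; a larger X raises the ceiling but coarsens the granularity, which is the trade-off being argued here.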
Re: [Qemu-block] [Nbd] write_zeroes/trim on the whole disk
On 24.09.2016 19:33, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 20:13, Vladimir Sementsov-Ogievskiy wrote:
>> I agree that requests larger than disk size are ugly.. But splitting the
>> request brings me again to the idea of having a separate command or flag
>> for clearing the whole disk without that dance. The server may report
>> availability of this command/flag only if the target driver supports fast
>> write_zeroes (qcow2 in our case).
> Also, such a flag may be used to satisfy all needs:
>
> flag BIG_REQUEST is set and length = 0 -> request on the whole disk,
> offset must be 0
> flag BIG_REQUEST is set and length > 0 -> request on (offset*block_size,
> length*block_size), length*block_size must be <= disk_size

What happens if length*block_size <= disk_size, but
offset*block_size + length*block_size > disk_size? Wraparound?

Regards,
Carl-Daniel
Re: [Qemu-block] [Nbd] write_zeroes/trim on the whole disk
On 24.09.2016 23:14, Carl-Daniel Hailfinger wrote:
> On 24.09.2016 19:33, Vladimir Sementsov-Ogievskiy wrote:
>> Also, such a flag may be used to satisfy all needs:
>>
>> flag BIG_REQUEST is set and length = 0 -> request on the whole disk,
>> offset must be 0
>> flag BIG_REQUEST is set and length > 0 -> request on (offset*block_size,
>> length*block_size), length*block_size must be <= disk_size
> What happens if length*block_size <= disk_size, but
> offset*block_size + length*block_size > disk_size? Wraparound?

Personally, for me the main scenario is length=0, to clear the whole disk.
In general, allowing requests larger than the disk size may not be a very
good idea.. Also:

On 24.09.2016 19:49, Alex Bligh wrote:
>> May be, an additional option, specifying the shift, would be better. With
>> the convention that if offset+length exceeds the disk size, length should
>> be recalculated as disk_size-offset.
> I don't think we should do that. We already have clear semantics that
> prevent operations beyond the end of the disk. Again, just break the
> command up into multiple commands. No great hardship.

So, the most probable answer to your question: it should be an error.

-- 
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 21:24, Alex Bligh wrote:
>> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy wrote:
>>
>> I just wanted to say that if we want a possibility of clearing the whole
>> disk in one request for qcow2, we have to take 512 as the granularity for
>> such requests (with X = 9). And this is too small: 1 TB will be the upper
>> bound for the request.
> Sure. But I do not see the value in optimising these huge commands to run
> as single requests. If you want to do that, do it properly and have a
> negotiation-phase flag that supports 64 bit request lengths.

And add an additional request type with another magic in the first field and
a 64-bit length field? If such a solution is appropriate for nbd, it is ok
for me, of course. I proposed something like this in my first letter -
"Increase the length field of the request to 64 bits". Changing the existing
request message type is wrong of course, but creating an additional one
should be ok.

>> Full backup, for example:
>>
>> 1. target can do fast write_zeroes: clear the whole disk (great if we can
>> do it in one request, without splitting, etc), then back up all data
>> except zero or unallocated (saving a lot of time on this skipping).
>> 2. target can not do fast write_zeroes: just back up all data. We need
>> not clear the disk, as we will not save time by this.
>>
>> So here, we do not need splitting in general. Just clear all, or do not
>> clear at all.
> As I said, within the current protocol you cannot tell whether a target
> supports 'fast write zeroes', and indeed the support may be partial - for
> instance with a QCOW2 backend, a write that is not cluster aligned would
> likely only partially satisfy the command by deallocating bytes. There is
> no current flag for 'supports fast write zeroes' and (given the foregoing)
> it isn't evident to me exactly what it would mean.

I suggest adding this flag - a negotiation-phase flag, exposing support of
the whole feature (a separate command or flag for clearing the whole disk).
Fast here means that we can do this in one request. write_zeroes (of any
size, up to the whole disk) is fast if it does not take more time than a
usual write (restricted to 2G).

> It seems however you could support your use case by simply iterating
> through the backup disk, using NBD_CMD_WRITE for the areas that are
> allocated and non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that
> are not allocated or zeroed. This technique would not require a protocol
> change (beyond the existing NBD_CMD_WRITE_ZEROES extension), works
> irrespective of whether the target supports write zeroes or not, works
> irrespective of differences in cluster allocation size between source and
> target, is far simpler, and has the added advantage of turning the
> existing zeroes-but-not-holes areas into holes (that is optional if you
> can tell the difference between zeroes and holes on the source media). It
> also works in a single pass. Yes, you need to split requests up, but you
> need to split requests up ANYWAY to cope with NBD_CMD_WRITE's 2^32-1
> length limit (I strongly advise you not to use more than 2^31). And in
> any case, you probably want to parallelise reads and writes and have more
> than one write in flight, all of which suggests you are going to be
> breaking up requests anyway.

This is slow, see my first letter. Iterative zeroing of qcow2 is slow.

Why is a separate command/flag for clearing the whole disk better for me
than a block-based solution with splitting requests? I want to clear the
whole disk, and I don't want to introduce new functionality which I don't
need for now. I need to clear the whole disk, but with the block-based
solution I get a lot of code which solves another task, and only indirectly
solves mine. I.e. instead of simple_realisation + simple_usage +
nice_solution_for_my_task I get harder_realisation + harder_usage +
ugly_solution_for_my_task. I understand that we must take into account that
such functionality (large requests) will likely be needed in future, so a
more generic solution is better for the protocol.

And I suggest a compromise:

negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST : command flag
NBD_CMD_FLAG_BIG_REQUEST is supported for WRITE_ZEROES and TRIM

negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST_REGION : non-zero length is
supported for big requests

flag NBD_CMD_FLAG_BIG_REQUEST is set and length = 0 -> request on the whole
disk, offset must be 0

flag NBD_CMD_FLAG_BIG_REQUEST is set and length > 0 -> request on
(offset*block_size, length*block_size), length*block_size must be <=
disk_size (only if NBD_FLAG_SEND_BIG_REQUEST_REGION is negotiated)

flag NBD_CMD_FLAG_BIG_REQUEST is unset -> usual request on (offset, length)

or a separate command/flag for clearing the whole disk, and a separate
block-based solution in future if needed.

or a new request type with a 64-bit length

-- 
Best regards,
Vladimir
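A hedged sketch of how a server might validate the block-addressed form of this proposal (all names and semantics follow the proposal above, none of it is in the NBD spec), with the out-of-bounds case from the wraparound question elsewhere in this thread resolved as an error rather than a wrap:

```c
#include <stdint.h>

/* Hypothetical validation for a request carrying the proposed
 * NBD_CMD_FLAG_BIG_REQUEST: offset and length are in blocks.
 * Returns 0 if acceptable, -1 (EINVAL-style) otherwise. */
static int validate_big_request(uint64_t offset, uint32_t length,
                                uint32_t block_size, uint64_t disk_size)
{
    if (length == 0)                        /* whole-disk form */
        return offset == 0 ? 0 : -1;
    if (offset > disk_size / block_size)    /* also guards the multiply below */
        return -1;
    uint64_t byte_off = offset * block_size;
    uint64_t byte_len = (uint64_t)length * block_size;
    if (byte_len > disk_size - byte_off)    /* past EOF: error, not wraparound */
        return -1;
    return 0;
}
```

Note that `length*block_size <= disk_size` alone is not sufficient, which is exactly the hole Carl-Daniel pokes at; the check has to involve the offset too.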
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy wrote:
>
> I just wanted to say that if we want a possibility of clearing the whole
> disk in one request for qcow2, we have to take 512 as the granularity for
> such requests (with X = 9). And this is too small: 1 TB will be the upper
> bound for the request.

Sure. But I do not see the value in optimising these huge commands to run
as single requests. If you want to do that, do it properly and have a
negotiation-phase flag that supports 64 bit request lengths.

> Full backup, for example:
>
> 1. target can do fast write_zeroes: clear the whole disk (great if we can
> do it in one request, without splitting, etc), then back up all data
> except zero or unallocated (saving a lot of time on this skipping).
> 2. target can not do fast write_zeroes: just back up all data. We need not
> clear the disk, as we will not save time by this.
>
> So here, we do not need splitting in general. Just clear all, or do not
> clear at all.

As I said, within the current protocol you cannot tell whether a target
supports 'fast write zeroes', and indeed the support may be partial - for
instance with a QCOW2 backend, a write that is not cluster aligned would
likely only partially satisfy the command by deallocating bytes.

There is no current flag for 'supports fast write zeroes' and (given the
foregoing) it isn't evident to me exactly what it would mean.

It seems however you could support your use case by simply iterating
through the backup disk, using NBD_CMD_WRITE for the areas that are
allocated and non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that
are not allocated or zeroed.
This technique would not require a protocol change (beyond the existing NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the target supports write zeroes or not, works irrespective of difference in cluster allocation size between source and target, is far simpler, and has the added advantage of making the existing zeroes-but-not-holes area into holes (that is optional if you can tell the difference between zeroes and holes on the source media). It also works on a single pass. Yes, you need to split requests up, but you need to split requests up ANYWAY to cope with NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise you not to use more than 2^31). And in any case, you probably want to parallelise reads and writes and have more than one write in flight in any case, all of which suggests you are going to be breaking up requests anyway. -- Alex Bligh
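The one-pass scheme Alex describes reduces to two small decisions per chunk, sketched below. The helper names are mine; the command values are the ones I believe the NBD spec and its WRITE_ZEROES extension assign, but treat them as illustrative:

```c
#include <stdint.h>

/* Per-chunk decisions for the single-pass copy: cap each request at 2^31
 * bytes (well under the 2^32-1 wire limit, as advised), and pick the
 * command from the source chunk's allocation status. */
#define NBD_MAX_REQ          (1u << 31)
#define NBD_CMD_WRITE        1
#define NBD_CMD_WRITE_ZEROES 6

/* Length of the request starting at off, clipped to the end of the disk. */
static uint32_t chunk_len(uint64_t off, uint64_t disk_size)
{
    uint64_t left = disk_size - off;
    return left < NBD_MAX_REQ ? (uint32_t)left : NBD_MAX_REQ;
}

/* Real data goes over the wire; everything else is zeroed, and may become
 * a hole on a server that supports holes. */
static uint16_t chunk_cmd(int allocated_and_nonzero)
{
    return allocated_and_nonzero ? NBD_CMD_WRITE : NBD_CMD_WRITE_ZEROES;
}
```

A client would walk the source from offset 0, advancing by `chunk_len()` each time and issuing `chunk_cmd()` for the chunk; nothing more is needed for the single pass, and the WRITE_ZEROES requests carry no payload.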
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 20:32, Alex Bligh wrote:
>> On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy wrote:
>>
>> On 24.09.2016 19:49, Alex Bligh wrote:
>>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>>>
>>>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>>>
>>>>>> Note: if disk size is not aligned to X we will have to send a request
>>>>>> larger than the disk size to clear the whole disk.
>>>>> If you look at the block size extension, the size of the disk must be
>>>>> an exact multiple of the minimum block size. So that would work.
>>>>
>>>> This means that this extension could not be used with any qcow2 disk,
>>>> as qcow2 may have a size not aligned to its cluster size.
>>>>
>>>> # qemu-img create -f qcow2 mega 1K
>>>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>>>> lazy_refcounts=off refcount_bits=16
>>>> # qemu-img info mega
>>>> image: mega
>>>> file format: qcow2
>>>> virtual size: 1.0K (1024 bytes)
>>>> disk size: 196K
>>>> cluster_size: 65536
>>>> Format specific information:
>>>>     compat: 1.1
>>>>     lazy refcounts: false
>>>>     refcount bits: 16
>>>>     corrupt: false
>>>>
>>>> And there is no such restriction in the documentation. Or we have to
>>>> consider the sector size (512b) as the block size for qcow2, which is
>>>> too small for our needs.
> If by "this extension" you mean the INFO extension (which reports block
> sizes) that's incorrect. An nbd server using a QCOW2 file as the backend
> would report the sector size as the minimum block size. It might report
> the cluster size or the sector size as the preferred block size, or
> anything in between. QCOW2 cluster size essentially determines the
> allocation unit.
>
> NBD is not bothered as to the underlying allocation unit. It does not
> (currently) support the concept of making holes visible to the client. If
> you use NBD_CMD_WRITE_ZEROES you get zeroes, which might or might not be
> implemented as one or more holes or 'real' zeroes (save if you specify
> NBD_CMD_FLAG_NO_HOLE, in which case you are guaranteed to get 'real'
> zeroes). If you use NBD_CMD_TRIM then the area trimmed might or might not
> be written with one or more holes. There is (currently) no way to detect
> the presence of holes separately from zeroes (though a bitmap extension
> was discussed).

I just wanted to say that if we want a possibility of clearing the whole
disk in one request for qcow2, we have to take 512 as the granularity for
such requests (with X = 9). And this is too small: 1 TB will be the upper
bound for the request.

>>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>>> I think you mean 2^32-1, but yes there is no guarantee of that. In that
>>> case you would need to break the call up into multiple calls.
>>>
>>> However, being able to break the call up into multiple calls seems
>>> pretty sensible given that NBD_CMD_WRITE_ZEROES may take a large amount
>>> of time, and a REALLY long time if the server doesn't support trim.
>>>
>>>> May be, an additional option, specifying the shift, would be better.
>>>> With the convention that if offset+length exceeds the disk size, length
>>>> should be recalculated as disk_size-offset.
>>> I don't think we should do that. We already have clear semantics that
>>> prevent operations beyond the end of the disk. Again, just break the
>>> command up into multiple commands. No great hardship.
>> I agree that requests larger than disk size are ugly.. But splitting the
>> request brings me again to the idea of having a separate command or flag
>> for clearing the whole disk without that dance. The server may report
>> availability of this command/flag only if the target driver supports
>> fast write_zeroes (qcow2 in our case).
> Why? In the general case you need to break up requests anyway
> (particularly with the INFO extension where there is a maximum command
> size), and issuing a command over a TCP connection that might take hours
> or days to complete with no hint of progress, and no TCP traffic to keep
> NAT etc. alive, sounds like bad practice. The overhead is tiny. I would
> be against this change.

Full backup, for example:

1. target can do fast write_zeroes: clear the whole disk (great if we can do
it in one request, without splitting, etc), then back up all data except
zero or unallocated (saving a lot of time on this skipping).
2. target can not do fast write_zeroes: just back up all data. We need not
clear the disk, as we will not save time by this.

So here, we do not need splitting in general. Just clear all, or do not
clear at all.

-- 
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy wrote:
>
> On 24.09.2016 19:49, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>>
>>>>> Note: if disk size is not aligned to X we will have to send a request
>>>>> larger than the disk size to clear the whole disk.
>>>> If you look at the block size extension, the size of the disk must be
>>>> an exact multiple of the minimum block size. So that would work.
>>>
>>> This means that this extension could not be used with any qcow2 disk, as
>>> qcow2 may have a size not aligned to its cluster size.
>>>
>>> # qemu-img create -f qcow2 mega 1K
>>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>>> lazy_refcounts=off refcount_bits=16
>>> # qemu-img info mega
>>> image: mega
>>> file format: qcow2
>>> virtual size: 1.0K (1024 bytes)
>>> disk size: 196K
>>> cluster_size: 65536
>>> Format specific information:
>>>     compat: 1.1
>>>     lazy refcounts: false
>>>     refcount bits: 16
>>>     corrupt: false
>>>
>>> And there is no such restriction in the documentation. Or we have to
>>> consider the sector size (512b) as the block size for qcow2, which is
>>> too small for our needs.

If by "this extension" you mean the INFO extension (which reports block
sizes) that's incorrect. An nbd server using a QCOW2 file as the backend
would report the sector size as the minimum block size. It might report the
cluster size or the sector size as the preferred block size, or anything in
between. QCOW2 cluster size essentially determines the allocation unit.

NBD is not bothered as to the underlying allocation unit. It does not
(currently) support the concept of making holes visible to the client. If
you use NBD_CMD_WRITE_ZEROES you get zeroes, which might or might not be
implemented as one or more holes or 'real' zeroes (save if you specify
NBD_CMD_FLAG_NO_HOLE, in which case you are guaranteed to get 'real'
zeroes). If you use NBD_CMD_TRIM then the area trimmed might or might not be
written with one or more holes. There is (currently) no way to detect the
presence of holes separately from zeroes (though a bitmap extension was
discussed).

>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>> I think you mean 2^32-1, but yes there is no guarantee of that. In that
>> case you would need to break the call up into multiple calls.
>>
>> However, being able to break the call up into multiple calls seems pretty
>> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
>> and a REALLY long time if the server doesn't support trim.
>>
>>> May be, an additional option, specifying the shift, would be better.
>>> With the convention that if offset+length exceeds the disk size, length
>>> should be recalculated as disk_size-offset.
>> I don't think we should do that. We already have clear semantics that
>> prevent operations beyond the end of the disk. Again, just break the
>> command up into multiple commands. No great hardship.
>
> I agree that requests larger than disk size are ugly.. But splitting the
> request brings me again to the idea of having a separate command or flag
> for clearing the whole disk without that dance. The server may report
> availability of this command/flag only if the target driver supports fast
> write_zeroes (qcow2 in our case).

Why? In the general case you need to break up requests anyway (particularly
with the INFO extension where there is a maximum command size), and issuing
a command over a TCP connection that might take hours or days to complete
with no hint of progress, and no TCP traffic to keep NAT etc. alive, sounds
like bad practice. The overhead is tiny. I would be against this change.

-- 
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 20:13, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 19:49, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>>
>>>>> Note: if disk size is not aligned to X we will have to send a request
>>>>> larger than the disk size to clear the whole disk.
>>>> If you look at the block size extension, the size of the disk must be
>>>> an exact multiple of the minimum block size. So that would work.
>>>
>>> This means that this extension could not be used with any qcow2 disk, as
>>> qcow2 may have a size not aligned to its cluster size.
>>>
>>> # qemu-img create -f qcow2 mega 1K
>>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>>> lazy_refcounts=off refcount_bits=16
>>> # qemu-img info mega
>>> image: mega
>>> file format: qcow2
>>> virtual size: 1.0K (1024 bytes)
>>> disk size: 196K
>>> cluster_size: 65536
>>> Format specific information:
>>>     compat: 1.1
>>>     lazy refcounts: false
>>>     refcount bits: 16
>>>     corrupt: false
>>>
>>> And there is no such restriction in the documentation. Or we have to
>>> consider the sector size (512b) as the block size for qcow2, which is
>>> too small for our needs.
>>>
>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>> I think you mean 2^32-1, but yes there is no guarantee of that. In that
>> case you would need to break the call up into multiple calls.
>>
>> However, being able to break the call up into multiple calls seems pretty
>> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
>> and a REALLY long time if the server doesn't support trim.
>>
>>> May be, an additional option, specifying the shift, would be better.
>>> With the convention that if offset+length exceeds the disk size, length
>>> should be recalculated as disk_size-offset.
>> I don't think we should do that. We already have clear semantics that
>> prevent operations beyond the end of the disk. Again, just break the
>> command up into multiple commands. No great hardship.
> I agree that requests larger than disk size are ugly.. But splitting the
> request brings me again to the idea of having a separate command or flag
> for clearing the whole disk without that dance. The server may report
> availability of this command/flag only if the target driver supports fast
> write_zeroes (qcow2 in our case).

Also, such a flag may be used to satisfy all needs:

flag BIG_REQUEST is set and length = 0 -> request on the whole disk, offset
must be 0
flag BIG_REQUEST is set and length > 0 -> request on (offset*block_size,
length*block_size), length*block_size must be <= disk_size
flag BIG_REQUEST is unset -> usual request on (offset, length)

-- 
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:49, Alex Bligh wrote:
>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>>
>> On 24.09.2016 19:31, Alex Bligh wrote:
>>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>>
>>>> Note: if disk size is not aligned to X we will have to send a request
>>>> larger than the disk size to clear the whole disk.
>>> If you look at the block size extension, the size of the disk must be an
>>> exact multiple of the minimum block size. So that would work.
>>
>> This means that this extension could not be used with any qcow2 disk, as
>> qcow2 may have a size not aligned to its cluster size.
>>
>> # qemu-img create -f qcow2 mega 1K
>> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536
>> lazy_refcounts=off refcount_bits=16
>> # qemu-img info mega
>> image: mega
>> file format: qcow2
>> virtual size: 1.0K (1024 bytes)
>> disk size: 196K
>> cluster_size: 65536
>> Format specific information:
>>     compat: 1.1
>>     lazy refcounts: false
>>     refcount bits: 16
>>     corrupt: false
>>
>> And there is no such restriction in the documentation. Or we have to
>> consider the sector size (512b) as the block size for qcow2, which is too
>> small for our needs.
>>
>> But there is no guarantee that disk_size/block_size < INT_MAX..
> I think you mean 2^32-1, but yes there is no guarantee of that. In that
> case you would need to break the call up into multiple calls.
>
> However, being able to break the call up into multiple calls seems pretty
> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
> and a REALLY long time if the server doesn't support trim.
>
>> May be, an additional option, specifying the shift, would be better. With
>> the convention that if offset+length exceeds the disk size, length should
>> be recalculated as disk_size-offset.
> I don't think we should do that. We already have clear semantics that
> prevent operations beyond the end of the disk. Again, just break the
> command up into multiple commands. No great hardship.

I agree that requests larger than disk size are ugly.. But splitting the
request brings me again to the idea of having a separate command or flag for
clearing the whole disk without that dance. The server may report
availability of this command/flag only if the target driver supports fast
write_zeroes (qcow2 in our case).

-- 
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:52, Alex Bligh wrote:
>
> In *your* use-case holes may be desirable. However in the general case,
> you cannot assume a server supports holes. Optional support for holes
> isn't even in the mainline spec yet (AFAIR).

You should also be aware that the minimum granularity for writes (currently
1 byte in the spec, but the blocksize extension allows for larger) may be
smaller than the minimum granularity for holes, which may depend on the
underlying filing system.

A hole is in essence merely an optimised way of storing zeroes. And the trim
operation says 'I'm not worried about this data any more - so zero it if you
like'.

-- 
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:48, Vladimir Sementsov-Ogievskiy wrote:
>
>>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
>>> much assume that a server that supports holes will write holes. A server
>>> that does not support holes will write zeroes. If you don't care whether
>>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care
>>> (see above) you must be prepared for a 'thick' write of zeroes on
>>> servers that don't support it.
>>
>> No, holes are critical. Concrete case: incremental backup to a delta
>> file. If we write zeroes instead of holes, we will lose the underlying
>> data (from the previous incremental).
>
> hmm, no, sorry, that is not needed.

In *your* use-case holes may be desirable. However in the general case, you
cannot assume a server supports holes. Optional support for holes isn't even
in the mainline spec yet (AFAIR).

It's up to you if you choose not to connect to servers that don't support
NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM, but even if you take that strategy,
you cannot guarantee that the server will not implement them by ignoring
NBD_CMD_TRIM (this is perfectly acceptable, as NBD_CMD_TRIM is advisory
only) and making NBD_CMD_WRITE_ZEROES do a long write of zeroes. IE there is
no way to detect whether the server actually supports holes.

-- 
Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy wrote:
>
> On 24.09.2016 19:31, Alex Bligh wrote:
>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> Note: if disk size is not aligned to X we will have to send a request
>>> larger than the disk size to clear the whole disk.
>> If you look at the block size extension, the size of the disk must be an
>> exact multiple of the minimum block size. So that would work.
>
> But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of time,
and a REALLY long time if the server doesn't support trim.

> May be, an additional option, specifying the shift, would be better. With
> the convention that if offset+length exceeds the disk size, length should
> be recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that
prevent operations beyond the end of the disk. Again, just break the command
up into multiple commands. No great hardship.

-- 
Alex Bligh
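The "just break the command up" advice is simple arithmetic. A hypothetical client-side helper (the name and the 2^31 cap are mine; Alex only advises staying at or below 2^31):

```c
#include <stdint.h>

/* How many spec-sized commands (WRITE_ZEROES or TRIM) a client would issue
 * to cover `total` bytes at `max_req` bytes apiece. */
static uint64_t split_count(uint64_t total, uint32_t max_req)
{
    return (total + max_req - 1) / max_req;   /* ceil(total / max_req) */
}
```

Even a 1 TB zeroing request splits into only 512 wire commands at 2^31 bytes each, a few dozen bytes of header traffic in total, which is the "tiny" overhead Alex refers to, and each command boundary doubles as a progress marker and TCP keepalive.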
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:44, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 19:35, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>>>
>>> Also, according to the documentation, NBD_CMD_TRIM is not appropriate
>>> for disk clearing:
>>>
>>> * `NBD_CMD_TRIM` (4)
>>>
>>> A hint to the server that the data defined by len and offset is no
>>> longer needed. A server MAY discard len bytes starting at offset, but
>>> is not required to. After issuing this command, a client MUST NOT make
>>> any assumptions about the contents of the export affected by this
>>> command, until overwriting it again with `NBD_CMD_WRITE`.
>>>
>>> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag
>>> for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?
>> You cannot force a hole, because NBD is not guaranteed to support holes.
>>
>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
>> much assume that a server that supports holes will write holes. A server
>> that does not support holes will write zeroes. If you don't care whether
>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care
>> (see above) you must be prepared for a 'thick' write of zeroes on
>> servers that don't support it.
> No, holes are critical. Concrete case: incremental backup to a delta
> file. If we write zeroes instead of holes, we will lose the underlying
> data (from the previous incremental).

hmm, no, sorry, that is not needed.

-- 
Best regards,
Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:35, Alex Bligh wrote:
> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>> Also, according to the documentation, NBD_CMD_TRIM is not appropriate
>> for disk clearing:
>>
>> * `NBD_CMD_TRIM` (4)
>>
>>   A hint to the server that the data defined by len and offset is no
>>   longer needed. A server MAY discard len bytes starting at offset, but
>>   is not required to.
>>
>>   After issuing this command, a client MUST NOT make any assumptions
>>   about the contents of the export affected by this command, until
>>   overwriting it again with `NBD_CMD_WRITE`.
>>
>> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag
>> for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?
>
> You cannot force a hole, because NBD is not guaranteed to support holes.
>
> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty
> much assume that a server that supports holes will write holes. A server
> that does not support holes will write zeroes.
>
> If you don't care whether the resultant data is zero, just use
> NBD_CMD_TRIM. But as you do care (see above) you must be prepared for a
> 'thick' write of zeroes on servers that don't support it.

No, holes are critical. Concrete case: incremental backup to a delta file. If we write zeroes instead of holes, we will lose the underlying data (from the previous incremental).

-- Best regards, Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 19:31, Alex Bligh wrote:
> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>> Note: if disk size is not aligned to X we will have to send a request
>> larger than the disk size to clear the whole disk.
>
> If you look at the block size extension, the size of the disk must be an
> exact multiple of the minimum block size. So that would work.

But there is no guarantee that disk_size/block_size < INT_MAX..

Maybe an additional option specifying the shift would be better, with the convention that if offset+length exceeds the disk size, length should be recalculated as disk_size-offset.

-- Best regards, Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy wrote:
>
> Also, according to the documentation, NBD_CMD_TRIM is not appropriate for
> disk clearing:
>
> * `NBD_CMD_TRIM` (4)
>
>   A hint to the server that the data defined by len and offset is no
>   longer needed. A server MAY discard len bytes starting at offset, but
>   is not required to.
>
>   After issuing this command, a client MUST NOT make any assumptions
>   about the contents of the export affected by this command, until
>   overwriting it again with `NBD_CMD_WRITE`.
>
> - it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag for
> this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

You cannot force a hole, because NBD is not guaranteed to support holes.

Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty much assume that a server that supports holes will write holes. A server that does not support holes will write zeroes.

If you don't care whether the resultant data is zero, just use NBD_CMD_TRIM. But as you do care (see above) you must be prepared for a 'thick' write of zeroes on servers that don't support it.

-- Alex Bligh
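The rule Alex states can be reduced to a small decision: need the range to read back as zeroes, or not. A hedged sketch (the command and flag values follow the NBD spec; the helper itself is illustrative):

```python
# Choosing between NBD_CMD_TRIM and NBD_CMD_WRITE_ZEROES, per the rule
# above. Illustrative helper, not from any real client.

NBD_CMD_TRIM = 4
NBD_CMD_WRITE_ZEROES = 6
NBD_CMD_FLAG_NO_HOLE = 1 << 1  # set only to force a 'thick' zero write


def pick_discard_request(need_zeroes):
    """Return (command, flags) for discarding a byte range."""
    if need_zeroes:
        # Without NBD_CMD_FLAG_NO_HOLE, a server that supports holes
        # punches a hole; one that does not writes zeroes instead.
        return (NBD_CMD_WRITE_ZEROES, 0)
    # Content afterwards is undefined; the server MAY do nothing at all.
    return (NBD_CMD_TRIM, 0)
```

Either way the client gets a defined result only in the `need_zeroes` case, which is exactly why TRIM alone cannot implement disk clearing.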
Re: [Qemu-block] write_zeroes/trim on the whole disk
> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy wrote:
>
> Note: if disk size is not aligned to X we will have to send request larger
> than the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact multiple of the minimum block size. So that would work.

-- Alex Bligh
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 09/24/2016 12:21 AM, Wouter Verhelst wrote:
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with that
>> flag, when requesting a full-device wipe.
>
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.

Interesting. Actually I like this ;)
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 16:42, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
>> On 24.09.2016 00:21, Wouter Verhelst wrote:
>>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>>> My preference would be a new flag to the existing commands, with
>>>> explicit documentation that 0 offset and 0 length must be used with
>>>> that flag, when requesting a full-device wipe.
>>>
>>> Alternatively, what about a flag that says "if you use this flag, the
>>> size should be left-shifted by X bits before processing"? That allows
>>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>>> limited to "whole disk" commands. We should probably make it an illegal
>>> flag for any command that actually sends data over the wire, though.
>>
>> Note: if disk size is not aligned to X we will have to send a request
>> larger than the disk size to clear the whole disk.
>
> which is not beautiful.. And splitting the request into two:
> (disk_size / X * X) and (disk_size % X) is not beautiful either.
>
> Maybe a flag that says 'whole disk request' is more appropriate? Nothing
> then prevents adding a flag for the X-shift.

Also, according to the documentation, NBD_CMD_TRIM is not appropriate for disk clearing:

* `NBD_CMD_TRIM` (4)

  A hint to the server that the data defined by len and offset is no
  longer needed. A server MAY discard len bytes starting at offset, but
  is not required to.

  After issuing this command, a client MUST NOT make any assumptions
  about the contents of the export affected by this command, until
  overwriting it again with `NBD_CMD_WRITE`.

- it may do nothing.. So, what to do with this? Add a FORCE_TRIM flag for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

-- Best regards, Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 00:21, Wouter Verhelst wrote:
>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>> My preference would be a new flag to the existing commands, with
>>> explicit documentation that 0 offset and 0 length must be used with
>>> that flag, when requesting a full-device wipe.
>>
>> Alternatively, what about a flag that says "if you use this flag, the
>> size should be left-shifted by X bits before processing"? That allows
>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>> limited to "whole disk" commands. We should probably make it an illegal
>> flag for any command that actually sends data over the wire, though.
>
> Note: if disk size is not aligned to X we will have to send a request
> larger than the disk size to clear the whole disk.

which is not beautiful.. And splitting the request into two: (disk_size / X * X) and (disk_size % X) is not beautiful either.

Maybe a flag that says 'whole disk request' is more appropriate? Nothing then prevents adding a flag for the X-shift.

-- Best regards, Vladimir
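The two-request split being called "not beautiful" can be made concrete. A sketch, assuming the hypothetical shift flag under discussion, where the first request's on-wire length is a block count and the unaligned tail goes out as a plain byte-length request:

```python
# Covering an unaligned disk with exactly two requests: one shifted
# request for the aligned prefix (disk_size // X * X bytes) and one
# plain request for the remainder (disk_size % X bytes). The flag and
# helper are hypothetical; only the arithmetic comes from the thread.

def split_whole_disk(disk_size, shift):
    """Return a list of (length, shifted) request descriptors.

    `shifted` means the length is in units of (1 << shift) bytes and
    would be sent with the proposed shift flag set."""
    block = 1 << shift
    reqs = [(disk_size // block, True)]  # aligned prefix, in blocks
    tail = disk_size % block
    if tail:
        reqs.append((tail, False))       # unaligned remainder, in bytes
    return reqs
```

For example, a 1000-byte disk with shift = 9 (512-byte blocks) needs one shifted request of length 1 plus a 488-byte plain request, while an exactly aligned disk needs only the shifted request.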
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 00:21, Wouter Verhelst wrote:
>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>> My preference would be a new flag to the existing commands, with
>>> explicit documentation that 0 offset and 0 length must be used with
>>> that flag, when requesting a full-device wipe.
>>
>> Alternatively, what about a flag that says "if you use this flag, the
>> size should be left-shifted by X bits before processing"? That allows
>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>> limited to "whole disk" commands. We should probably make it an illegal
>> flag for any command that actually sends data over the wire, though.
>
> Note: if disk size is not aligned to X we will have to send a request
> larger than the disk size to clear the whole disk.

Also, in this case, which realization of the bdrv interface in qemu would be most appropriate? A similar flag (in this case X must be defined in some very transparent way, for example as a constant of 64k), or a BDRV_REQ_WHOLE_DISK flag, or separate .bdrv_zero_all and .bdrv_discard_all callbacks?

-- Best regards, Vladimir
Re: [Qemu-block] write_zeroes/trim on the whole disk
On 24.09.2016 00:21, Wouter Verhelst wrote:
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with that
>> flag, when requesting a full-device wipe.
>
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.

Note: if disk size is not aligned to X we will have to send request larger than the disk size to clear the whole disk.

-- Best regards, Vladimir
Re: [Qemu-block] [Nbd] write_zeroes/trim on the whole disk
> On 23 Sep 2016, at 22:21, Wouter Verhelst wrote:
>
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with that
>> flag, when requesting a full-device wipe.
>
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.

I'd prefer an approach like this. Perhaps X could be negotiated with the block size extension (or simply be defined as the 'preferred block size'). This could then be defined to work with all commands.

-- Alex Bligh
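As a rough illustration of the shift-flag semantics being discussed (the on-wire length left-shifted by X bits before processing), a server-side decode might look like this. The flag bit value is invented; no such flag exists in the NBD spec, and X would come from negotiation as Alex suggests:

```python
# Decoding the proposed shift flag on the server side. The constant
# below is a hypothetical bit, not part of the NBD protocol.

NBD_CMD_FLAG_SHIFT = 1 << 8  # hypothetical; no real flag assignment


def effective_length(length, flags, shift_bits):
    """Return the byte count a request actually covers: the raw 32-bit
    length, left-shifted by the negotiated shift if the flag is set."""
    if flags & NBD_CMD_FLAG_SHIFT:
        return length << shift_bits
    return length
```

With shift_bits = 16, for instance, a single 32-bit length can describe up to (2^32 - 1) << 16 bytes, roughly 256 TiB, without any change to the request header format.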