Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-26 Thread Paolo Bonzini


On 24/09/2016 14:27, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
>> On 24.09.2016 00:21, Wouter Verhelst wrote:
>>> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>>>> My preference would be a new flag to the existing commands, with
>>>> explicit documentation that 0 offset and 0 length must be used with
>>>> that flag, when requesting a full-device wipe.
>>> Alternatively, what about a flag that says "if you use this flag, the
>>> size should be left-shifted by X bits before processing"? That allows
>>> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
>>> limited to "whole disk" commands. We should probably make it an illegal
>>> flag for any command that actually sends data over the wire, though.
>>
>> Note: if the disk size is not aligned to X, we will have to send a request
>> larger than the disk size to clear the whole disk.
> 
> Also, in this case, which realization of the bdrv interface in qemu would
> be most appropriate? A similar flag (in this case X must be defined in some
> very transparent way, e.g. as a constant of 64k), a flag
> BDRV_REQ_WHOLE_DISK, or separate .bdrv_zero_all and .bdrv_discard_all?

This makes good sense.

It also matches the SANITIZE command from the SCSI command set.

Paolo
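The shift-flag idea discussed above can be sketched as follows. This is a minimal illustration, not part of the NBD spec: the flag name, the helper, and the value of X are assumptions.

```python
# Hypothetical sketch of the shift flag: when the (illustrative) flag is set,
# the 32-bit length field is left-shifted by X bits before processing, so a
# single request can address far more than 2^32-1 bytes.

SHIFT_BITS = 16  # X, an assumed granularity of 64 KiB blocks

def effective_extent(offset: int, length: int, shift_flag: bool):
    """Return the (offset, byte_count) the server would act on."""
    if shift_flag:
        return offset, length << SHIFT_BITS
    return offset, length
```

With X = 16, a 32-bit length of 2**20 covers 64 GiB in one request, which is why the flag must be rejected for commands that actually carry data on the wire.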



Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-26 Thread Kevin Wolf
On 24.09.2016 at 14:27, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:
> >On 24.09.2016 00:21, Wouter Verhelst wrote:
> >>On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
> >>>My preference would be a new flag to the existing commands, with
> >>>explicit documentation that 0 offset and 0 length must be used
> >>>with that
> >>>flag, when requesting a full-device wipe.
> >>Alternatively, what about a flag that says "if you use this flag, the
> >>size should be left-shifted by X bits before processing"? That allows
> >>you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> >>limited to "whole disk" commands. We should probably make it an illegal
> >>flag for any command that actually sends data over the wire, though.
> >>
> >
> >
> >Note: if disk size is not aligned to X we will have to send
> >request larger than the disk size to clear the whole disk.
> >
> 
> Also, in this case, which realization of bdrv interface in qemu
> would be most appropriate? Similar flag (in this case X must be
> defined in some very transparent way, as a constant of 64k for
> example), or flag BDRV_REQ_WHOLE_DISK, or separate .bdrv_zero_all
> and .bdrv_discard_all ?

Maybe the best would be to extend the existing discard/write_zeroes
functions to take a 64 bit byte count and then NBD can internally
check whether a request clears the whole disk.

Kevin
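Kevin's suggestion could look roughly like this. The function shape, constants, and fast path are assumptions for illustration, not qemu's actual bdrv API.

```python
# Sketch: one write_zeroes entry point taking a 64-bit byte count; the NBD
# layer checks internally whether a request clears the whole disk and, if
# so, takes a (hypothetical) whole-disk fast path.

DISK_SIZE = 10 * 2**40  # example: a 10 TiB export

def write_zeroes(offset: int, count: int) -> str:
    if offset == 0 and count == DISK_SIZE:
        return "whole-disk fast path"
    return "ordinary zeroing of %d bytes at %d" % (count, offset)
```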



Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Wouter Verhelst
On Sat, Sep 24, 2016 at 11:19:53PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> On 24.09.2016 21:24, Alex Bligh wrote:
> > > On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy 
> > >  wrote:
> > > 
> > > I just wanted to say that if we want a possibility of clearing the whole
> > > disk in one request for qcow2, we have to take 512 as the granularity for
> > > such requests (with X = 9). And this is too small: 1 TB will be the upper
> > > bound for the request.
> > Sure. But I do not see the value in optimising these huge commands to run 
> > as single requests. If you want to do that, do it properly and have a 
> > negotiation-phase flag that supports 64 bit request lengths.
> 
> And add an additional request type, with another magic in the first field
> and a 64-bit length field? If such a solution is appropriate for NBD, it is
> OK for me, of course. I proposed something like this in my first letter:
> "Increase length field of the request to 64bit". Changing the existing
> request message type is wrong, of course, but creating an additional one
> should be OK.

I'd be opposed to that.

Currently, the request and reply headers are strictly the same, for all
messages. That's a feature; it allows a server to read the header
(determine whether it needs to read data and if so, read that too), and
hand it off to something else.

By changing the header's format, you make the protocol more complicated.
I'd like to avoid that.

[...]
> or a separate command/flag for clearing the whole disk, and separate
> block-based solution in future if needed.

I'd prefer to avoid one-usecase commands; adding too many of those would
make the protocol specification too complicated. Currently, NBD is a
simple to understand protocol; I'd like to keep it that way.

I think it's not a huge problem if we introduce something that has
rounding errors; we could easily have a spec saying that

if NBD_CMD_WRITE_ZEROES with the NBD_CMD_FLAG_SHIFT flag is used,
and the last byte of the device falls between N-1 and N blocks, then
it is not an error to specify a length of N blocks.

or something along those lines.
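The rounding rule sketched above amounts to a ceiling division; a minimal illustration (the helper name is an assumption):

```python
# Sketch of the proposed rounding rule: when the export size is not a
# multiple of the block size, it is not an error to round the block count up
# so the final (shifted) request still covers the tail of the device.

def blocks_to_cover(disk_size: int, block_size: int) -> int:
    # smallest N with N * block_size >= disk_size (ceiling division)
    return (disk_size + block_size - 1) // block_size
```

For a 1024-byte qcow2 image with 64 KiB clusters this gives N = 1, even though 1 * 65536 overshoots the end of the device.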

> or new request type with 64bit length

Not going to happen as far as I'm concerned, as per above.

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 21:24, Alex Bligh wrote:

On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy 
 wrote:

I just wanted to say that if we want a possibility of clearing the whole disk
in one request for qcow2, we have to take 512 as the granularity for such
requests (with X = 9). And this is too small: 1 TB will be the upper bound for
the request.

Sure. But I do not see the value in optimising these huge commands to run as 
single requests. If you want to do that, do it properly and have a 
negotiation-phase flag that supports 64 bit request lengths.


And add an additional request type, with another magic in the first field and
a 64-bit length field? If such a solution is appropriate for NBD, it is OK
for me, of course. I proposed something like this in my first letter:
"Increase length field of the request to 64bit". Changing the existing
request message type is wrong, of course, but creating an additional one
should be OK.





Full backup, for example:

1. target can do fast write_zeroes: clear the whole disk (great if we can do
it in one request, without splitting, etc.), then back up all data except
zero or unallocated areas (saving a lot of time by skipping them).
2. target can not do fast write_zeroes: just back up all data. We need not
clear the disk, as we would not save time by doing so.

So here we do not need splitting in general: either clear the whole disk, or
do not clear at all.

As I said, within the current protocol you cannot tell whether a target 
supports 'fast write zeroes', and indeed the support may be partial - for 
instance with a QCOW2 backend, a write that is not cluster aligned would likely 
only partially satisfy the command by deallocating bytes. There is no current 
flag for 'supports fast write zeroes' and (given the foregoing) it isn't 
evident to me exactly what it would mean.


I suggest adding this flag as a negotiation-phase flag, exposing support for
the whole feature (a separate command or flag for clearing the whole disk).
"Fast" here means that we can do this in one request: write_zeroes (of any
size, up to the whole disk) is fast if it will not take more time than a
usual write (restricted to 2G).




It seems however you could support your use case by simply iterating through 
the backup disk, using NBD_CMD_WRITE for the areas that are allocated and 
non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that are not allocated 
or zeroed. This technique would not require a protocol change (beyond the 
existing NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the 
target supports write zeroes or not, works irrespective of difference in 
cluster allocation size between source and target, is far simpler, and has the 
added advantage of making the existing zeroes-but-not-holes area into holes 
(that is optional if you can tell the difference between zeroes and holes on 
the source media). It also works on a single pass. Yes, you need to split 
requests up, but you need to split requests up ANYWAY to cope with 
NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise you not to use more than 
2^31). And in any case, you probably want to parallelise reads and writes and 
have more than one write in flight in any case, all of which suggests you are 
going to be breaking up requests anyway.


This is slow, see my first letter. Iterative zeroing of qcow2 is slow.

Why is a separate command/flag for clearing the whole disk better for me
than a block-based solution with splitting requests? I want to clear the
whole disk, and I don't want to introduce new functionality which I don't
need for now. I need to clear the whole disk, but with a block-based
solution I get a lot of code which solves another task, and only indirectly
solves mine. I.e. instead of
simple_realisation+simple_usage+nice_solution_for_my_task I have
harder_realisation+harder_usage+ugly_solution_for_my_task.


I understand that we must take into account that such functionality (large
requests) will likely be needed in future, so a more generic solution is
better for the protocol. So I suggest a compromise:


negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST: the command flag
NBD_CMD_FLAG_BIG_REQUEST is supported for WRITE_ZEROES and TRIM
negotiation-phase flag NBD_FLAG_SEND_BIG_REQUEST_REGION: a non-zero
length is supported for big requests


flag NBD_CMD_FLAG_BIG_REQUEST is set and length = 0 -> request on the
whole disk, offset must be 0
flag NBD_CMD_FLAG_BIG_REQUEST is set and length > 0 -> request on
(offset*block_size, length*block_size); length*block_size must be <=
disk_size (only if NBD_FLAG_SEND_BIG_REQUEST_REGION is negotiated)
flag NBD_CMD_FLAG_BIG_REQUEST is unset -> usual request on
(offset, length)
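The three cases above could be decoded as in the sketch below. The block-size constant and the helper are illustrative assumptions; the proposal itself is not implemented anywhere.

```python
# Sketch of the proposed NBD_CMD_FLAG_BIG_REQUEST semantics.

BLOCK_SIZE = 64 * 1024  # the "very transparent" constant suggested earlier

def decode_request(offset: int, length: int, big_flag: bool, disk_size: int):
    """Return the byte range (offset, count) the request addresses."""
    if not big_flag:
        return offset, length                      # usual request
    if length == 0:
        assert offset == 0, "whole-disk request requires offset 0"
        return 0, disk_size                        # whole disk
    off, count = offset * BLOCK_SIZE, length * BLOCK_SIZE
    assert count <= disk_size, "region must fit in the disk"
    return off, count                              # big-request region
```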




or a separate command/flag for clearing the whole disk, and separate 
block-based solution in future if needed.




or new request type with 64bit length


--
Best regards,
Vladimir





Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 18:47, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> I just wanted to say that if we want a possibility of clearing the whole
> disk in one request for qcow2, we have to take 512 as the granularity for
> such requests (with X = 9). And this is too small: 1 TB will be the upper
> bound for the request.

Sure. But I do not see the value in optimising these huge commands to run as 
single requests. If you want to do that, do it properly and have a 
negotiation-phase flag that supports 64 bit request lengths.

> Full backup, for example:
> 
> 1. target can do fast write_zeroes: clear the whole disk (great if we can do 
> it in one request, without splitting, etc), then backup all data except zero 
> or unallocated (save a lot of time on this skipping).
> 2. target can not do fast write_zeroes: just backup all data. We need not 
> clear the disk, as we will not save time by this.
> 
> So here we do not need splitting in general: either clear the whole disk,
> or do not clear at all.

As I said, within the current protocol you cannot tell whether a target 
supports 'fast write zeroes', and indeed the support may be partial - for 
instance with a QCOW2 backend, a write that is not cluster aligned would likely 
only partially satisfy the command by deallocating bytes. There is no current 
flag for 'supports fast write zeroes' and (given the foregoing) it isn't 
evident to me exactly what it would mean.

It seems however you could support your use case by simply iterating through 
the backup disk, using NBD_CMD_WRITE for the areas that are allocated and 
non-zero, and using NBD_CMD_WRITE_ZEROES for the areas that are not allocated 
or zeroed. This technique would not require a protocol change (beyond the 
existing NBD_CMD_WRITE_ZEROES extension), works irrespective of whether the 
target supports write zeroes or not, works irrespective of difference in 
cluster allocation size between source and target, is far simpler, and has the 
added advantage of making the existing zeroes-but-not-holes area into holes 
(that is optional if you can tell the difference between zeroes and holes on 
the source media). It also works on a single pass. Yes, you need to split 
requests up, but you need to split requests up ANYWAY to cope with 
NBD_CMD_WRITE's 2^32-1 length limit (I strongly advise you not to use more than 
2^31). And in any case, you probably want to parallelise reads and writes and 
have more than one write in flight in any case, all of which suggests you are 
going to be breaking up requests anyway.
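The single-pass copy Alex describes can be sketched as below; the extent list and the planner function are stand-ins for a real NBD client, assumed for illustration.

```python
# Sketch: walk the source extents once, issuing NBD_CMD_WRITE for allocated
# non-zero data and NBD_CMD_WRITE_ZEROES for zero/unallocated areas, and
# split every request to stay at or below a 2^31-byte cap.

MAX_REQ = 2**31  # stay well under NBD_CMD_WRITE's 2^32-1 length limit

def plan_requests(extents):
    """extents: iterable of (offset, length, is_data) -> [(cmd, off, len)]."""
    plan = []
    for offset, length, is_data in extents:
        cmd = "NBD_CMD_WRITE" if is_data else "NBD_CMD_WRITE_ZEROES"
        while length > 0:
            chunk = min(length, MAX_REQ)
            plan.append((cmd, offset, chunk))
            offset += chunk
            length -= chunk
    return plan
```

Splitting happens anyway for the data writes, so the zeroing side gets it for free.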

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 20:32, Alex Bligh wrote:

On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy 
 wrote:

On 24.09.2016 19:49, Alex Bligh wrote:

On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
 wrote:

On 24.09.2016 19:31, Alex Bligh wrote:

On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
 wrote:

Note: if disk size is not aligned to X we will have to send request larger than 
the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact 
multiple of the minimum block size. So that would work.

This means that this extension could not be used with any qcow2 disk, as qcow2 
may have size not aligned to its cluster size.

# qemu-img create -f qcow2 mega 1K
Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 
lazy_refcounts=off refcount_bits=16
# qemu-img info mega
image: mega
file format: qcow2
virtual size: 1.0K (1024 bytes)
disk size: 196K
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

And there is no such restriction in documentation. Or we have to consider 
sector-size (512b) as block size for qcow2, which is too small for our needs.

If by "this extension" you mean the INFO extension (which reports block sizes) 
that's incorrect.

An nbd server using a QCOW2 file as the backend would report the sector size as 
the minimum block size. It might report the cluster size or the sector size as 
the preferred block size, or anything in between.

QCOW2 cluster size essentially determines the allocation unit. NBD is not 
bothered as to the underlying allocation unit. It does not (currently) support 
the concept of making holes visible to the client. If you use 
NBD_CMD_WRITE_ZEROES you get zeroes, which might or might not be implemented as 
one or more holes or 'real' zeroes (save if you specify NBD_CMD_FLAG_NO_HOLE in 
which case you are guaranteed to get 'real' zeroes). If you use NBD_CMD_TRIM 
then the area trimmed might or might not be written with one or more holes. 
There is (currently) no way to detect the presence of holes separately from 
zeroes (though a bitmap extension was discussed).
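The behaviour described above condenses to a small decision table; the server-capability parameter is an assumption for illustration.

```python
# Sketch of NBD_CMD_WRITE_ZEROES semantics as described: without
# NBD_CMD_FLAG_NO_HOLE a hole-capable server may punch a hole; with the flag
# (or on a server without hole support) you get 'real', allocated zeroes.

def zero_strategy(server_supports_holes: bool, no_hole_flag: bool) -> str:
    if no_hole_flag or not server_supports_holes:
        return "write real zeroes"  # allocated, 'thick' zeroes
    return "may punch hole"         # deallocated; still reads back as zeroes
```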


I just wanted to say that if we want a possibility of clearing the 
whole disk in one request for qcow2, we have to take 512 as the granularity 
for such requests (with X = 9). And this is too small: 1 TB will be the 
upper bound for the request.





But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty 
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
time, and a REALLY long time if the server doesn't support trim.


Maybe an additional option specifying the shift would be better, with the 
convention that if offset+length exceeds the disk size, length is 
recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that prevent 
operations beyond the end of the disk. Again, just break the command up into 
multiple commands. No great hardship.


I agree that requests larger than the disk size are ugly. But splitting 
requests brings me back to the idea of having a separate command or flag for 
clearing the whole disk without that dance. A server may report availability 
of this command/flag only if the target driver supports fast write_zeroes 
(qcow2 in our case).

Why? In the general case you need to break up requests anyway (particularly 
with the INFO extension where there is a maximum command size), and issuing a 
command over a TCP connection that might take hours or days to complete with no 
hint of progress, and no TCP traffic to keep NAT etc. alive, sounds like bad 
practice. The overhead is tiny.

I would be against this change.



Full backup, for example:

1. target can do fast write_zeroes: clear the whole disk (great if we 
can do it in one request, without splitting, etc), then backup all data 
except zero or unallocated (save a lot of time on this skipping).
2. target can not do fast write_zeroes: just backup all data. We need 
not clear the disk, as we will not save time by this.


So here, we need not splitting as a general. Just clear all or not 
clearing at all.


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 18:13, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> On 24.09.2016 19:49, Alex Bligh wrote:
>>> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
>>>  wrote:
>>> 
>>> On 24.09.2016 19:31, Alex Bligh wrote:
> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> Note: if disk size is not aligned to X we will have to send request 
> larger than the disk size to clear the whole disk.
 If you look at the block size extension, the size of the disk must be an 
 exact multiple of the minimum block size. So that would work.
> 
> This means that this extension could not be used with any qcow2 disk, as 
> qcow2 may have size not aligned to its cluster size.
> 
> # qemu-img create -f qcow2 mega 1K
> Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 
> lazy_refcounts=off refcount_bits=16
> # qemu-img info mega
> image: mega
> file format: qcow2
> virtual size: 1.0K (1024 bytes)
> disk size: 196K
> cluster_size: 65536
> Format specific information:
>compat: 1.1
>lazy refcounts: false
>refcount bits: 16
>corrupt: false
> 
> And there is no such restriction in documentation. Or we have to consider 
> sector-size (512b) as block size for qcow2, which is too small for our needs.

If by "this extension" you mean the INFO extension (which reports block sizes) 
that's incorrect.

An nbd server using a QCOW2 file as the backend would report the sector size as 
the minimum block size. It might report the cluster size or the sector size as 
the preferred block size, or anything in between.

QCOW2 cluster size essentially determines the allocation unit. NBD is not 
bothered as to the underlying allocation unit. It does not (currently) support 
the concept of making holes visible to the client. If you use 
NBD_CMD_WRITE_ZEROES you get zeroes, which might or might not be implemented as 
one or more holes or 'real' zeroes (save if you specify NBD_CMD_FLAG_NO_HOLE in 
which case you are guaranteed to get 'real' zeroes). If you use NBD_CMD_TRIM 
then the area trimmed might or might not be written with one or more holes. 
There is (currently) no way to detect the presence of holes separately from 
zeroes (though a bitmap extension was discussed).

>>> But there is no guarantee that disk_size/block_size < INT_MAX..
>> I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
>> you would need to break the call up into multiple calls.
>> 
>> However, being able to break the call up into multiple calls seems pretty 
>> sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
>> time, and a REALLY long time if the server doesn't support trim.
>> 
>>> Maybe an additional option specifying the shift would be better, with the 
>>> convention that if offset+length exceeds the disk size, length is 
>>> recalculated as disk_size-offset.
>> I don't think we should do that. We already have clear semantics that 
>> prevent operations beyond the end of the disk. Again, just break the 
>> command up into multiple commands. No great hardship.
>> 
> 
> I agree that requests larger than the disk size are ugly. But splitting 
> requests brings me back to the idea of having a separate command or flag for 
> clearing the whole disk without that dance. A server may report availability 
> of this command/flag only if the target driver supports fast write_zeroes 
> (qcow2 in our case).

Why? In the general case you need to break up requests anyway (particularly 
with the INFO extension where there is a maximum command size), and issuing a 
command over a TCP connection that might take hours or days to complete with no 
hint of progress, and no TCP traffic to keep NAT etc. alive, sounds like bad 
practice. The overhead is tiny.

I would be against this change.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 20:13, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 19:49, Alex Bligh wrote:
On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
 wrote:


On 24.09.2016 19:31, Alex Bligh wrote:
On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
 wrote:


Note: if disk size is not aligned to X we will have to send 
request larger than the disk size to clear the whole disk.
If you look at the block size extension, the size of the disk must 
be an exact multiple of the minimum block size. So that would work.


This means that this extension could not be used with any qcow2 disk, 
as qcow2 may have size not aligned to its cluster size.


# qemu-img create -f qcow2 mega 1K
Formatting 'mega', fmt=qcow2 size=1024 encryption=off 
cluster_size=65536 lazy_refcounts=off refcount_bits=16

# qemu-img info mega
image: mega
file format: qcow2
virtual size: 1.0K (1024 bytes)
disk size: 196K
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

And there is no such restriction in documentation. Or we have to 
consider sector-size (512b) as block size for qcow2, which is too 
small for our needs.





But there is no guarantee that disk_size/block_size < INT_MAX..
I think you mean 2^32-1, but yes there is no guarantee of that. In 
that case you would need to break the call up into multiple calls.


However, being able to break the call up into multiple calls seems 
pretty sensible given that NBD_CMD_WRITE_ZEROES may take a large 
amount of

time, and a REALLY long time if the server doesn't support trim.

Maybe an additional option specifying the shift would be better, 
with the convention that if offset+length exceeds the disk size, length 
is recalculated as disk_size-offset.
I don't think we should do that. We already have clear semantics that 
prevent operations beyond the end of the disk. Again, just break the 
command up into multiple commands. No great hardship.




I agree that requests larger than the disk size are ugly. But splitting 
requests brings me back to the idea of having a separate command or flag for 
clearing the whole disk without that dance. A server may report 
availability of this command/flag only if the target driver supports fast 
write_zeroes (qcow2 in our case).




Also, such a flag may be used to satisfy all needs:

flag BIG_REQUEST is set and length = 0 -> request on the whole 
disk, offset must be 0
flag BIG_REQUEST is set and length > 0 -> request on 
(offset*block_size, length*block_size); length*block_size must be <= 
disk_size

flag BIG_REQUEST is unset -> usual request on (offset, length)

--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 19:49, Alex Bligh wrote:

On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
 wrote:

On 24.09.2016 19:31, Alex Bligh wrote:

On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
 wrote:

Note: if disk size is not aligned to X we will have to send request larger than 
the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact 
multiple of the minimum block size. So that would work.


This means that this extension could not be used with any qcow2 disk, as 
qcow2 may have size not aligned to its cluster size.


# qemu-img create -f qcow2 mega 1K
Formatting 'mega', fmt=qcow2 size=1024 encryption=off cluster_size=65536 
lazy_refcounts=off refcount_bits=16

# qemu-img info mega
image: mega
file format: qcow2
virtual size: 1.0K (1024 bytes)
disk size: 196K
cluster_size: 65536
Format specific information:
compat: 1.1
lazy refcounts: false
refcount bits: 16
corrupt: false

And there is no such restriction in documentation. Or we have to 
consider sector-size (512b) as block size for qcow2, which is too small 
for our needs.





But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty 
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
time, and a REALLY long time if the server doesn't support trim.


Maybe an additional option specifying the shift would be better, with the 
convention that if offset+length exceeds the disk size, length is 
recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that prevent 
operations beyond the end of the disk. Again, just break the command up into 
multiple commands. No great hardship.



I agree that requests larger than the disk size are ugly. But splitting 
requests brings me back to the idea of having a separate command or flag for 
clearing the whole disk without that dance. A server may report 
availability of this command/flag only if the target driver supports fast 
write_zeroes (qcow2 in our case).


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:52, Alex Bligh  wrote:
> 
> In *your* use-case holes may be desirable. However in the general case, you 
> cannot assume a server supports holes. Optional support for holes isn't even 
> in the mainline spec yet (AFAIR).

You should also be aware that the minimum granularity for writes (currently 1 
byte in the spec, but the blocksize extension allows for larger) may be smaller 
than the minimum granularity for holes, which may depend on the underlying 
filing system.

A hole is in essence merely an optimised way of storing zeroes. And the trim 
operation says 'I'm not worried about this data any more - so zero it if you 
like'.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:48, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
>>> Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty 
>>> much assume that a server that supports holes will write holes. A server 
>>> that does not support holes will write zeroes. If you don't care whether 
>>> the resultant data is zero, just use NBD_CMD_TRIM. But as you do care (see 
>>> above) you must be prepared for a 'thick' write of zeroes on servers that 
>>> don't support it.
>>> 
>> 
>> No, holes are critical. Concrete case: incremental backup to a delta file. 
>> If we write zeroes instead of holes, we will lose the underlying data (from 
>> the previous incremental).
>> 
> hmm, no, sorry, that is not needed.

In *your* use-case holes may be desirable. However in the general case, you 
cannot assume a server supports holes. Optional support for holes isn't even in 
the mainline spec yet (AFAIR).

It's up to you if you choose not to connect to servers that don't support 
NBD_CMD_WRITE_ZEROES or NBD_CMD_TRIM, but even if you take that strategy, you 
cannot guarantee that the server will not implement them by ignoring 
NBD_CMD_TRIM (this is perfectly acceptable as NBD_CMD_TRIM is advisory only) 
and making NBD_CMD_WRITE_ZEROES do a long write of zeroes.

IE there is no way to detect whether the server actually supports holes.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:42, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> On 24.09.2016 19:31, Alex Bligh wrote:
>>> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
>>>  wrote:
>>> 
>>> Note: if disk size is not aligned to X we will have to send request larger 
>>> than the disk size to clear the whole disk.
>> If you look at the block size extension, the size of the disk must be an 
>> exact multiple of the minimum block size. So that would work.
>> 
> 
> But there is no guarantee that disk_size/block_size < INT_MAX..

I think you mean 2^32-1, but yes there is no guarantee of that. In that case 
you would need to break the call up into multiple calls.

However, being able to break the call up into multiple calls seems pretty 
sensible given that NBD_CMD_WRITE_ZEROES may take a large amount of
time, and a REALLY long time if the server doesn't support trim.

> Maybe an additional option specifying the shift would be better, with the 
> convention that if offset+length exceeds the disk size, length is 
> recalculated as disk_size-offset.

I don't think we should do that. We already have clear semantics that prevent 
operations beyond the end of the disk. Again, just break the command up into 
multiple commands. No great hardship.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 19:44, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 19:35, Alex Bligh wrote:
On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy 
 wrote:


Also, according to the documentation, NBD_CMD_TRIM is not appropriate 
for disk clearing:


  * `NBD_CMD_TRIM` (4)

  A hint to the server that the data defined by len and offset is no
  longer needed. A server MAY discard len bytes starting at offset, but
  is not required to.

  After issuing this command, a client MUST NOT make any assumptions
  about the contents of the export affected by this command, until
  overwriting it again with `NBD_CMD_WRITE`.

- it may do nothing. So, what to do with this? Add a flag FORCE_TRIM 
for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?
You cannot force a hole, because NBD is not guaranteed to support 
holes.


Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can 
pretty much assume that a server that supports holes will write 
holes. A server that does not support holes will write zeroes. If you 
don't care whether the resultant data is zero, just use NBD_CMD_TRIM. 
But as you do care (see above) you must be prepared for a 'thick' 
write of zeroes on servers that don't support it.




No, holes are critical. Concrete case: incremental backup to a delta 
file. If we write zeroes instead of holes, we will lose the underlying 
data (from the previous incremental).



hmm, no, sorry, that is not needed.

--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 19:35, Alex Bligh wrote:

On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy 
 wrote:

Also, according to the documentation, NBD_CMD_TRIM is not appropriate for disk 
clearing:

  * `NBD_CMD_TRIM` (4)

  A hint to the server that the data defined by len and offset is no
  longer needed. A server MAY discard len bytes starting at offset, but
  is not required to.

  After issuing this command, a client MUST NOT make any assumptions
  about the contents of the export affected by this command, until
  overwriting it again with `NBD_CMD_WRITE`.

- it may do nothing. So, what should we do about this? Add a FORCE_TRIM flag 
for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

You cannot force a hole, because the NBD server is not guaranteed to support holes.

Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty much 
assume that a server that supports holes will write holes. A server that does 
not support holes will write zeroes. If you don't care whether the resultant 
data is zero, just use NBD_CMD_TRIM. But as you do care (see above) you must be 
prepared for a 'thick' write of zeroes on servers that don't support it.



No, holes are critical. Concrete case: incremental backup to a delta 
file. If we write zeroes instead of holes, we will lose the underlying data 
(from the previous incremental).


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 19:31, Alex Bligh wrote:

On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
 wrote:

Note: if the disk size is not aligned to X, we will have to send a request 
larger than the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact 
multiple of the minimum block size. So that would work.



But there is no guarantee that disk_size/block_size < INT_MAX. Maybe an 
additional option specifying the shift would be better, with the convention 
that if offset+length exceeds the disk size, the length is recalculated 
as disk_size-offset.
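A minimal sketch of that clamping convention (the function name and the shift parameter are illustrative, not part of the NBD protocol):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sketch: with a shift option, the effective length is
 * (length << shift); if offset plus the effective length would run past
 * the end of the disk, clamp it to disk_size - offset instead of
 * rejecting the request. */
static uint64_t effective_request_len(uint64_t disk_size, uint64_t offset,
                                      uint32_t length, unsigned shift)
{
    uint64_t len = (uint64_t)length << shift;
    if (offset + len > disk_size) {
        len = disk_size - offset;   /* clamp to end of disk */
    }
    return len;
}
```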


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 17:20, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> Also, according to the documentation, NBD_CMD_TRIM is not appropriate for disk 
> clearing:
> 
>  * `NBD_CMD_TRIM` (4)
> 
>  A hint to the server that the data defined by len and offset is no
>  longer needed. A server MAY discard len bytes starting at offset, but
>  is not required to.
> 
>  After issuing this command, a client MUST NOT make any assumptions
>  about the contents of the export affected by this command, until
>  overwriting it again with `NBD_CMD_WRITE`.
> 
> - it may do nothing. So, what should we do about this? Add a FORCE_TRIM flag 
> for this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?

You cannot force a hole, because the NBD server is not guaranteed to support holes.

Use NBD_CMD_WRITE_ZEROES without NBD_CMD_FLAG_NO_HOLE and you can pretty much 
assume that a server that supports holes will write holes. A server that does 
not support holes will write zeroes. If you don't care whether the resultant 
data is zero, just use NBD_CMD_TRIM. But as you do care (see above) you must be 
prepared for a 'thick' write of zeroes on servers that don't support it.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Alex Bligh

> On 24 Sep 2016, at 13:06, Vladimir Sementsov-Ogievskiy 
>  wrote:
> 
> Note: if the disk size is not aligned to X, we will have to send a request 
> larger than the disk size to clear the whole disk.

If you look at the block size extension, the size of the disk must be an exact 
multiple of the minimum block size. So that would work.

-- 
Alex Bligh







Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Denis V. Lunev
On 09/24/2016 12:21 AM, Wouter Verhelst wrote:
> On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
>> My preference would be a new flag to the existing commands, with
>> explicit documentation that 0 offset and 0 length must be used with that
>> flag, when requesting a full-device wipe.
> Alternatively, what about a flag that says "if you use this flag, the
> size should be left-shifted by X bits before processing"? That allows
> you to do TRIM or WRITE_ZEROES on much larger chunks, without being
> limited to "whole disk" commands. We should probably make it an illegal
> flag for any command that actually sends data over the wire, though.
>
interesting. Actually I like this ;)



Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 16:42, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 00:21, Wouter Verhelst wrote:

On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:

My preference would be a new flag to the existing commands, with
explicit documentation that 0 offset and 0 length must be used with 
that

flag, when requesting a full-device wipe.

Alternatively, what about a flag that says "if you use this flag, the
size should be left-shifted by X bits before processing"? That allows
you to do TRIM or WRITE_ZEROES on much larger chunks, without being
limited to "whole disk" commands. We should probably make it an illegal
flag for any command that actually sends data over the wire, though.




Note: if the disk size is not aligned to X, we will have to send a request 
larger than the disk size to clear the whole disk.


which is not elegant. And splitting the request into two parts, 
(disk_size / X * X) and (disk_size % X), is not elegant either. Maybe a 
flag that says 'whole disk request' would be more appropriate? Nothing 
prevents adding an X-shift flag later.





Also, according to the documentation, NBD_CMD_TRIM is not appropriate for 
disk clearing:


  * `NBD_CMD_TRIM` (4)

  A hint to the server that the data defined by len and offset is no
  longer needed. A server MAY discard len bytes starting at offset, but
  is not required to.

  After issuing this command, a client MUST NOT make any assumptions
  about the contents of the export affected by this command, until
  overwriting it again with `NBD_CMD_WRITE`.

- it may do nothing. So, what should we do about this? Add a FORCE_TRIM flag for 
this command? Or add a FORCE_HOLES flag to WRITE_ZEROES?



--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 00:21, Wouter Verhelst wrote:

On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:

My preference would be a new flag to the existing commands, with
explicit documentation that 0 offset and 0 length must be used with 
that

flag, when requesting a full-device wipe.

Alternatively, what about a flag that says "if you use this flag, the
size should be left-shifted by X bits before processing"? That allows
you to do TRIM or WRITE_ZEROES on much larger chunks, without being
limited to "whole disk" commands. We should probably make it an illegal
flag for any command that actually sends data over the wire, though.




Note: if the disk size is not aligned to X, we will have to send a request 
larger than the disk size to clear the whole disk.


which is not elegant. And splitting the request into two parts, (disk_size 
/ X * X) and (disk_size % X), is not elegant either. Maybe a flag that 
says 'whole disk request' would be more appropriate? Nothing prevents 
adding an X-shift flag later.
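For illustration, the two-request split described above could be computed like this (the function names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of splitting a whole-disk clear into an X-aligned request plus
 * an unaligned remainder, where X = 1 << shift. Purely illustrative. */
static uint64_t aligned_part(uint64_t disk_size, unsigned shift)
{
    uint64_t x = 1ULL << shift;
    return disk_size / x * x;               /* largest X-aligned prefix */
}

static uint64_t tail_part(uint64_t disk_size, unsigned shift)
{
    return disk_size % (1ULL << shift);     /* leftover tail request */
}
```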



--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 15:06, Vladimir Sementsov-Ogievskiy wrote:

On 24.09.2016 00:21, Wouter Verhelst wrote:

On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:

My preference would be a new flag to the existing commands, with
explicit documentation that 0 offset and 0 length must be used with 
that

flag, when requesting a full-device wipe.

Alternatively, what about a flag that says "if you use this flag, the
size should be left-shifted by X bits before processing"? That allows
you to do TRIM or WRITE_ZEROES on much larger chunks, without being
limited to "whole disk" commands. We should probably make it an illegal
flag for any command that actually sends data over the wire, though.




Note: if the disk size is not aligned to X, we will have to send a request 
larger than the disk size to clear the whole disk.




Also, in this case, which implementation of the bdrv interface in qemu would 
be most appropriate? A similar flag (in which case X must be defined in some 
very transparent way, e.g. as a constant of 64k), a BDRV_REQ_WHOLE_DISK flag, 
or separate .bdrv_zero_all and .bdrv_discard_all callbacks?


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-24 Thread Vladimir Sementsov-Ogievskiy

On 24.09.2016 00:21, Wouter Verhelst wrote:

On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:

My preference would be a new flag to the existing commands, with
explicit documentation that 0 offset and 0 length must be used with that
flag, when requesting a full-device wipe.

Alternatively, what about a flag that says "if you use this flag, the
size should be left-shifted by X bits before processing"? That allows
you to do TRIM or WRITE_ZEROES on much larger chunks, without being
limited to "whole disk" commands. We should probably make it an illegal
flag for any command that actually sends data over the wire, though.




Note: if the disk size is not aligned to X, we will have to send a request 
larger than the disk size to clear the whole disk.


--
Best regards,
Vladimir




Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-23 Thread Wouter Verhelst
On Fri, Sep 23, 2016 at 02:00:06PM -0500, Eric Blake wrote:
> My preference would be a new flag to the existing commands, with
> explicit documentation that 0 offset and 0 length must be used with that
> flag, when requesting a full-device wipe.

Alternatively, what about a flag that says "if you use this flag, the
size should be left-shifted by X bits before processing"? That allows
you to do TRIM or WRITE_ZEROES on much larger chunks, without being
limited to "whole disk" commands. We should probably make it an illegal
flag for any command that actually sends data over the wire, though.
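As a sketch of the arithmetic behind this proposal (the choice of a 16-bit shift below is purely hypothetical): the effective extent would be the on-the-wire 32-bit length left-shifted by X bits, so a single request could describe far more than the current ~4GiB limit:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the shift-flag idea: the 32-bit wire length is left-shifted
 * by X bits before processing. With X = 16 (64KiB granularity), one
 * request can cover up to (2^32 - 1) << 16 bytes, i.e. nearly 256TiB. */
static uint64_t shifted_len(uint32_t length, unsigned shift)
{
    return (uint64_t)length << shift;
}
```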

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-block] write_zeroes/trim on the whole disk

2016-09-23 Thread Eric Blake
On 09/23/2016 01:32 PM, Vladimir Sementsov-Ogievskiy wrote:
> Hi all!
> 
> There is a following problem. When we need to write_zeroes or trim the
> whole disk, we have to do it iteratively, because of 32-bit restriction
> on request length.
> For example, current implementation of mirror (see mirror_dirty_init())
> do this by chunks of 2147418112 bytes (with default granularity of
> 65536). So, to zero 16tb disk we will make 8192 requests instead of one.
> 
> Incremental zeroing of 1tb qcow2 takes > 80 seconds for me (see below).
> This means ~20 minutes for copying empty 16tb qcow2 disk which is
> obviously a waste of time.
> 
> We see the following solutions for nbd:
> ||
> 1. Add command NBD_MAKE_EMPTY, with flag, saying what should be done:
> trim or write_zeroes.

Presumably spelled NBD_CMD_MAKE_EMPTY.

> 2. Add flag NBD_CMD_FLAG_WHOLE for commands NBD_TRIM and
> NBD_WRITE_ZEROES, which will say (with a zeroed offset and length of the
> request), that the whole disk should be discarded/zeroed.

Both of these are possible.  As it is, NBD_CMD_WRITE_ZEROES is not even
formally part of the NBD spec yet, although NBD_CMD_TRIM is (I'm still
sitting on my qemu proof-of-concept patches for WRITE_ZEROES, and need
to resubmit them now that the qemu 2.8 development window is open).
Either way, the server would have to advertise if the new command and/or
new flags to existing commands are available for a whole-disk trim/zero,
before a client could use it, and clients must be prepared to fall back
to incremental approaches otherwise.

My preference would be a new flag to the existing commands, with
explicit documentation that 0 offset and 0 length must be used with that
flag, when requesting a full-device wipe.

> 3. Increase length field of the request to 64bit.

No; that won't work.  It would be a fundamental change to the NBD
protocol, and require both new servers and new clients to talk a
different wire protocol with different size length parameters.
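For context, a sketch of the request header as described in the NBD protocol document (field layout to the best of my reading of the spec); the 32-bit length field is the restriction in question, and widening it to 64 bits would change the header size for every command on the wire:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the NBD transmission-phase request header. The length
 * field is only 32 bits wide, which is why option 3 above (a 64-bit
 * length) would be an incompatible wire-format change. */
struct nbd_request {
    uint32_t magic;    /* NBD_REQUEST_MAGIC, 0x25609513 */
    uint16_t flags;    /* command flags, e.g. NBD_CMD_FLAG_FUA */
    uint16_t type;     /* NBD_CMD_READ, NBD_CMD_TRIM, ... */
    uint64_t handle;   /* opaque cookie echoed back in the reply */
    uint64_t offset;   /* 64-bit: offsets are not the problem */
    uint32_t length;   /* 32-bit: requests are capped near 4GiB */
} __attribute__((packed));
```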

> 
> As soon as we have some way to empty disk  in nbd, we can use
> qcow2_make_empty, to trim the whole disk (and something similar should
> be done for zeroing).
> 
> What do you think about this all, and which way has a chance to get into
> nbd proto?

It's not necessarily obvious that the ability to bulk-trim or bulk-zero
a device should be fundamentally faster than doing it incrementally in
2G chunks; but I concede that there may indeed be scenarios such as
qemu's qcow2 file where that is true.  So it does sound like a useful
option and/or command to be proposed for addition to the NBD protocol,
from that point of view.

As with other extensions to NBD, the best way is to write up a proposal
for how the documentation should change, submit that as patches to the
nbd list, and accompany it with a proof-of-concept implementation
(qemu's nbd server and nbd client work well), so that we can iron out
the details of the documentation before making it a formal part of the
spec.  It's important to remember that such a proposal should still be
optional (a server need not implement the new mode, and a client should
be prepared to fall back to other means if the server does not support a
whole-device action).

-- 
Eric Blake   eblake redhat com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org





[Qemu-block] write_zeroes/trim on the whole disk

2016-09-23 Thread Vladimir Sementsov-Ogievskiy

Hi all!

There is the following problem: when we need to write_zeroes or trim the 
whole disk, we have to do it iteratively, because of the 32-bit restriction 
on request length.
For example, the current implementation of mirror (see mirror_dirty_init()) 
does this in chunks of 2147418112 bytes (with the default granularity of 
65536). So, to zero a 16TB disk we will make 8192 requests instead of one.
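The request-count arithmetic can be checked with a quick sketch (ceiling division over mirror's chunk size; for 16TiB the exact count is 8193, since the chunk size falls slightly short of 2GiB and a final partial chunk is needed):

```c
#include <assert.h>
#include <stdint.h>

/* Sketch: number of requests needed to cover a disk using mirror's
 * 2147418112-byte (0x7FFF0000) chunks, via ceiling division. */
static uint64_t chunks_needed(uint64_t disk_size, uint64_t chunk)
{
    return (disk_size + chunk - 1) / chunk;
}
```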


Incremental zeroing of a 1TB qcow2 image takes > 80 seconds for me (see 
below). This means ~20 minutes for copying an empty 16TB qcow2 disk, which 
is obviously a waste of time.


We see the following solutions for NBD:

1. Add a command NBD_MAKE_EMPTY, with a flag saying what should be done: 
trim or write_zeroes.
2. Add a flag NBD_CMD_FLAG_WHOLE for the commands NBD_TRIM and 
NBD_WRITE_ZEROES, which will say (with a zeroed offset and length of the 
request) that the whole disk should be discarded/zeroed.

3. Increase the length field of the request to 64 bits.

As soon as we have some way to empty a disk in NBD, we can use 
qcow2_make_empty to trim the whole disk (and something similar should 
be done for zeroing).


What do you think about all this, and which way has a chance of getting into 
the NBD protocol?



== test incremental qcow2 zeroing in mirror ==

1. Enable it (if we use NBD, it would be enabled anyway):
diff --git a/block/mirror.c b/block/mirror.c
index f9d1fec..4ac0c39 100644
--- a/block/mirror.c
+++ b/block/mirror.c
@@ -556,7 +556,7 @@ static int coroutine_fn mirror_dirty_init(MirrorBlockJob *s)


 end = s->bdev_length / BDRV_SECTOR_SIZE;

-if (base == NULL && !bdrv_has_zero_init(target_bs)) {
+if (base == NULL) {
 if (!bdrv_can_write_zeroes_with_unmap(target_bs)) {
 bdrv_set_dirty_bitmap(s->dirty_bitmap, 0, end);
 return 0;


==== test ====
qemu-img create  -f qcow2 /tmp/1tb.qcow2 1T

virsh start backup-vm --paused
Domain backup-vm started

virsh qemu-monitor-command backup-vm 
{"execute":"blockdev-add","arguments":{"options": {"aio": "native", 
"file": {"driver": "file", "filename": "/tmp/1tb.qcow2"}, "discard": 
"unmap", "cache": {"direct": true}, "driver": "qcow2", "id": "disk"}}}

{"return":{},"id":"libvirt-32"}

/usr/bin/time -f '%e seconds' sh -c 'virsh qemu-monitor-event' &
virsh qemu-monitor-command backup-vm 
{"execute":"drive-mirror","arguments":{"device": "disk", "sync": "full", 
"target": "/tmp/targ"}}

{"return":{},"id":"libvirt-33"}

[root@kvm qemu]# event BLOCK_JOB_READY at 1474652677.668624 for domain 
backup-vm: 
{"device":"disk","len":1099511627776,"offset":1099511627776,"speed":0,"type":"mirror"}

events received: 1

86.39 seconds

- the same for 2tb empty disk: 180.19 seconds
- and without patch, it takes < 1 second, of course.

--
Best regards,
Vladimir