Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-12 Thread Alex Bligh

On 11 May 2016, at 22:12, Wouter Verhelst  wrote:

> On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
>> On 10 May 2016, at 16:29, Eric Blake  wrote:
>>> So the kernel is currently one of the clients that does NOT honor block
>>> sizes, and as such, servers should be prepared for ANY size up to
>>> UINT_MAX (other than DoS handling).
>> 
>> Or not to permit a connection.
> 
> Right -- and this is why I was recommending against making this a MUST in the
> first place.

Indeed, and it currently is a 'MAY':

> except that if a server believes a client's behaviour constitutes
> a denial of service, it MAY initiate a hard disconnect.

-- 
Alex Bligh







Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-11 Thread Wouter Verhelst
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> On 10 May 2016, at 16:29, Eric Blake  wrote:
> > So the kernel is currently one of the clients that does NOT honor block
> > sizes, and as such, servers should be prepared for ANY size up to
> > UINT_MAX (other than DoS handling).
> 
> Or not to permit a connection.

Right -- and this is why I was recommending against making this a MUST in the
first place.

[...]

-- 
< ron> I mean, the main *practical* problem with C++, is there's like a dozen
   people in the world who think they really understand all of its rules,
   and pretty much all of them are just lying to themselves too.
 -- #debian-devel, OFTC, 2016-02-12



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-11 Thread Wouter Verhelst
On Tue, May 10, 2016 at 09:29:23AM -0600, Eric Blake wrote:
> On 05/10/2016 09:08 AM, Alex Bligh wrote:
> > Eric,
> > 
> >> Hmm. The current wording of the experimental block size additions does
> >> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> >> maximum NBD_CMD_WRITE:
> >> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
> > 
> > Correct
> > 
> >> Maybe we should revisit that in the spec, and/or advertise yet another
> >> block size (since the maximum size for a trim and/or write_zeroes
> >> request may indeed be different than the maximum size for a read/write).
> > 
> > I think it's up to the server to either handle large requests, or
> > for the client to break these up.
> 
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.
> 
> > 
> > The core problem here is that the kernel (and, ahem, most servers) are
> > ignorant of the block size extension, and need to guess how to break
> > things up. In my view the client (kernel in this case) should
> > be breaking the trim requests up into whatever size it uses as the
> > maximum size write requests. But then it would have to know about block
> > sizes which are in (another) experimental extension.
> 
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.  (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size, but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size, and are at the
> mercy of however the kernel currently decides to split large requests).

I don't actually think it does that at all, tbh. There is an
"integrityhuge" test in the reference server test suite which performs a
number of large requests (up to 50M), and which was created by a script
that just does direct read requests to /dev/nbdX.

It just so happens that most upper layers (filesystems etc) don't make
requests larger than about 32MiB, but that's not related.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).  My question above only applies to
> clients that use the experimental block size extensions.

Right.

[...]




Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-11 Thread Wouter Verhelst
On Tue, May 10, 2016 at 04:08:50PM +0100, Alex Bligh wrote:
> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change. Or does
> fstrim work some other way?

What makes you say NBD_CMD_TRIM is experimental? It's been implemented for
years.




Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-11 Thread Paolo Bonzini


On 10/05/2016 18:23, Alex Bligh wrote:
> > Is there a use case where you'd want to split up a single big TRIM request
> > in smaller ones (as in some hardware would not support it or something)?
> > Even then, it looks like this splitting up would be hardware dependant and
> > better implemented in block device drivers.
> 
> Part of the point of the block size extension is to push such limits to the
> client.
> 
> I could make up use cases (e.g. that performing a multi-gigabyte trim in
> a single threaded server will effectively block all other I/O), but the
> main reason in my book is orthogonality, and the fact the client needs
> to do some breaking up anyway.

That's why SCSI for example has a limit to UNMAP and WRITE SAME requests
(actually it has three limits: number of ranges per unmap, which in NBD
and in SCSI WRITE SAME is 1; number of blocks per unmap descriptor;
number of blocks per WRITE SAME operation).  These limits, however, are
separate from the read/write limit, because you want to support
systems where TRIM is much faster than regular I/O (and possibly even
O(1) when trimming something that is already trimmed).
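A client honouring separate per-command ceilings would pick the limit by
command type rather than applying one global maximum. A sketch in Python,
with invented limit values (NBD does not currently advertise per-command
maxima, so everything below is hypothetical):

```python
# Invented per-command maxima, for illustration only; real values would come
# from whatever the server advertises (NBD currently has no such fields).
LIMITS = {
    "READ": 32 * 1024 * 1024,        # 32 MiB of data returned per request
    "WRITE": 32 * 1024 * 1024,       # 32 MiB of data sent per request
    "TRIM": 2 * 1024 * 1024 * 1024,  # 2 GiB: no payload, possibly O(1) server-side
}

def exceeds_limit(cmd: str, length: int) -> bool:
    """True if a request of this length would violate the per-command ceiling."""
    return length > LIMITS[cmd]
```

With this scheme a 64 MiB trim is fine while a 64 MiB write must be split,
which is exactly the asymmetry the SCSI limits above encode.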

Paolo



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Eric Blake
On 05/10/2016 10:33 AM, Quentin Casasnovas wrote:

> Looks like there's an easier way:
> 
>  $ qemu-img create -f qcow2 foo.qcow2 10G
>  $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
>  $ mkfs.ext4 /dev/nbd0
>  mke2fs 1.42.13 (17-May-2015)
>  Discarding device blocks: failed - Input/output error

strace on mkfs.ext4 shows:
...
open("/dev/nbd1", O_RDWR|O_EXCL)= 3
stat("/dev/nbd1", {st_mode=S_IFBLK|0660, st_rdev=makedev(43, 1), ...}) = 0
ioctl(3, BLKDISCARDZEROES, [0]) = 0
ioctl(3, BLKROGET, [0]) = 0
uname({sysname="Linux", nodename="red", ...}) = 0
ioctl(3, BLKDISCARD, [0, 100])  = 0
write(1, "Discarding device blocks: ", 26) = 26
write(1, "   4096/2621440", 15) = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10", 15) = 15
ioctl(3, BLKDISCARD, [100, 8000]) = -1 EIO (Input/output error)
write(1, "               ", 15) = 15
write(1, "\10\10\10\10\10\10\10\10\10\10\10\10\10\10\10", 15) = 15
write(1, "failed - ", 9)        = 9

so it does indeed look like the smaller request [0, 0x100] succeeded
while the larger [0x100, 0x8000] failed with EIO.  But I
instrumented my qemu-nbd server to output all of the incoming requests
it was receiving from the kernel, and I never saw a mention of
NBD_CMD_TRIM being received.  Maybe it's dependent on what version of
the kernel and/or NBD kernel module you are running?

I also tried fallocate(1) on /dev/nbd1, as in:
 # strace fallocate -o 0 -l 64k -p /dev/nbd1
but it looks like the kernel doesn't yet wire up fallocate(2) to forward
to the appropriate device ioctls and/or NBD_CMD_TRIM, where strace says:

open("/dev/nbd1", O_RDWR)   = 3
fallocate(3, FALLOC_FL_KEEP_SIZE|FALLOC_FL_PUNCH_HOLE, 0, 65536) = -1
ENODEV (No such device)

The ENODEV is a rather unusual choice of error.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Paolo Bonzini


On 10/05/2016 17:38, Alex Bligh wrote:
> > and are at the
> > mercy of however the kernel currently decides to split large requests).
> 
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.

The payload of TRIM has constant size, so it makes sense not to fragment
it the way READ and WRITE are fragmented.  The kernel then self-imposes a
2GB limit in blkdev_issue_discard.

On the other hand, READ and WRITE of size n can possibly require O(n)
memory.
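The asymmetry is easy to see from the wire format: an NBD request header is
a fixed 28 bytes, and TRIM sends nothing beyond it, while WRITE carries its
full payload. A rough Python illustration (command set reduced to just the
cases discussed here):

```python
# NBD request header: magic(4) + flags(2) + type(2) + handle(8) + offset(8) + length(4)
NBD_REQUEST_HDR = 28

def wire_bytes(cmd: str, length: int) -> int:
    """Bytes crossing the wire for one request: only WRITE carries a payload."""
    payload = length if cmd == "WRITE" else 0
    return NBD_REQUEST_HDR + payload
```

A 2 GiB TRIM costs 28 bytes on the wire; a 2 GiB WRITE costs 2 GiB plus 28,
and the server may need O(length) memory to buffer it.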

Paolo



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Quentin Casasnovas
On Tue, May 10, 2016 at 05:54:44PM +0200, Quentin Casasnovas wrote:
> On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> > On 05/10/2016 09:41 AM, Alex Bligh wrote:
> > > 
> > > On 10 May 2016, at 16:29, Eric Blake  wrote:
> > > 
> > >> So the kernel is currently one of the clients that does NOT honor block
> > >> sizes, and as such, servers should be prepared for ANY size up to
> > >> UINT_MAX (other than DoS handling).
> > > 
> > > Interesting followup question:
> > > 
> > > If the kernel does not fragment TRIM requests at all (in the
> > > same way it fragments read and write requests), I suspect
> > > something bad may happen with TRIM requests over 2^31
> > > in size (particularly over 2^32 in size), as the length
> > > field in nbd only has 32 bits.
> > > 
> > > Whether it supports block size constraints or not, it is
> > > going to need to do *some* breaking up of requests.
> > 
> > Does anyone have an easy way to cause the kernel to request a trim
> > operation that large on a > 4G export?  I'm not familiar enough with
> > EXT4 operation to know what file system operations you can run to
> > ultimately indirectly create a file system trim operation that large.
> > But maybe there is something simpler - does the kernel let you use the
> > fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> > FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
> > 
> 
> It was fairly reproducible here, we just used a random qcow2 image with
> some Debian minimal system pre-installed, mounted that qcow2 image through
> qemu-nbd then compiled a whole kernel inside it.  Then you can make clean
> and run fstrim on the mount point.  I'm assuming you can go faster than
> that by just writing a big file to the qcow2 image mounted without -o
> discard, delete the big file, then remount with -o discard + run fstrim.
> 

Looks like there's an easier way:

 $ qemu-img create -f qcow2 foo.qcow2 10G
 $ qemu-nbd --discard=on -c /dev/nbd0 foo.qcow2
 $ mkfs.ext4 /dev/nbd0
 mke2fs 1.42.13 (17-May-2015)
 Discarding device blocks: failed - Input/output error
 Creating filesystem with 2621440 4k blocks and 655360 inodes
 Filesystem UUID: 25aeb51f-0dea-4c1d-8b65-61f6bcdf97e9
 Superblock backups stored on blocks:
   32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632

 Allocating group tables: done
 Writing inode tables: done
 Creating journal (32768 blocks): done
 Writing superblocks and filesystem accounting information: done

Notice the "Discarding device blocks: failed - Input/output error" line, I
bet that it is mkfs.ext4 trying to trim all blocks prior to writing the
filesystem, but it gets an I/O error while doing so.  I haven't verified it
is the same problem, but if it isn't, simply mount the resulting filesystem
and run fstrim on it:

 $ mount -o discard /dev/nbd0 /tmp/foo
 $ fstrim /tmp/foo
 fstrim: /tmp/foo: FITRIM ioctl failed: Input/output error

Quentin



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Quentin Casasnovas
On Tue, May 10, 2016 at 04:49:57PM +0100, Alex Bligh wrote:
> 
> On 10 May 2016, at 16:45, Quentin Casasnovas  
> wrote:
> 
> > I'm by no means an expert in this, but why would the kernel break up those
> > TRIM commands?  After all, breaking things up makes sense when the length
> > of the request is big, not that much when it only contains the request
> > header, which is the case for TRIM commands.
> 
> 1. You are assuming that the only reason for limiting the size of operations 
> is
>limiting the data transferred within one request. That is not necessarily
>the case. There are good reasons (if only orthogonality) to have limits
>in place even where no data is transferred.
> 
> 2. As and when the blocksize extension is implemented in the kernel (it isn't
>now), the protocol requires it.
> 
> 3. The maximum length of an NBD trim operation is 2^32. The maximum length
>of a trim operation is larger. Therefore the kernel needs to do at least
>some breaking up.

Alright, I assumed that by 'length of an NBD request' the specification
meant the length of... well, the request itself, as opposed to whatever is
in the length field of the request header.

Is there a use case where you'd want to split up a single big TRIM request
in smaller ones (as in some hardware would not support it or something)?
Even then, it looks like this splitting up would be hardware dependant and
better implemented in block device drivers.

I'm just finding it odd that something that fits inside the length field can't
be used.  I do agree with your point number 3, obviously if the length
field type doesn't allow something bigger than a u32, then the kernel has
to do some breaking up in that case.

Quentin



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Quentin Casasnovas
On Tue, May 10, 2016 at 09:46:36AM -0600, Eric Blake wrote:
> On 05/10/2016 09:41 AM, Alex Bligh wrote:
> > 
> > On 10 May 2016, at 16:29, Eric Blake  wrote:
> > 
> >> So the kernel is currently one of the clients that does NOT honor block
> >> sizes, and as such, servers should be prepared for ANY size up to
> >> UINT_MAX (other than DoS handling).
> > 
> > Interesting followup question:
> > 
> > If the kernel does not fragment TRIM requests at all (in the
> > same way it fragments read and write requests), I suspect
> > something bad may happen with TRIM requests over 2^31
> > in size (particularly over 2^32 in size), as the length
> > field in nbd only has 32 bits.
> > 
> > Whether it supports block size constraints or not, it is
> > going to need to do *some* breaking up of requests.
> 
> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export?  I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?
> 

It was fairly reproducible here, we just used a random qcow2 image with
some Debian minimal system pre-installed, mounted that qcow2 image through
qemu-nbd then compiled a whole kernel inside it.  Then you can make clean
and run fstrim on the mount point.  I'm assuming you can go faster than
that by just writing a big file to the qcow2 image mounted without -o
discard, delete the big file, then remount with -o discard + run fstrim.

Quentin



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Quentin Casasnovas
On Tue, May 10, 2016 at 04:38:29PM +0100, Alex Bligh wrote:
> Eric,
> 
> On 10 May 2016, at 16:29, Eric Blake  wrote:
> >>> Maybe we should revisit that in the spec, and/or advertise yet another
> >>> block size (since the maximum size for a trim and/or write_zeroes
> >>> request may indeed be different than the maximum size for a read/write).
> >> 
> >> I think it's up to the server to either handle large requests, or
> >> for the client to break these up.
> > 
> > But the question at hand here is whether we should permit servers to
> > advertise multiple maximum block sizes (one for read/write, another one
> > for trim/write_zero, or even two [at least qemu tracks a separate
> > maximum trim vs. write_zero sizing in its generic block layer]), or
> > merely stick with the current wording that requires clients that honor
> > maximum block size to obey the same maximum for ALL commands, regardless
> > of amount of data sent over the wire.
> 
> In my view, we should not change this. Block size maxima are not there
> to support DoS prevention (that's a separate phrase). They are there
> to impose maximum block sizes. Adding a different maximum block size
> for different commands is way too overengineered. There are after
> all reasons (especially without structured replies) why you'd want
> different maximum block sizes for writes and reads. If clients support
> block sizes, they will necessarily have to have the infrastructure
> to break requests up.
> 
> IE maximum block size should continue to mean maximum block size.
> 
> >> 
> >> The core problem here is that the kernel (and, ahem, most servers) are
> >> ignorant of the block size extension, and need to guess how to break
> >> things up. In my view the client (kernel in this case) should
> >> be breaking the trim requests up into whatever size it uses as the
> >> maximum size write requests. But then it would have to know about block
> >> sizes which are in (another) experimental extension.
> > 
> > Correct - no one has yet patched the kernel to honor block sizes
> > advertised through what is currently an experimental extension.
> 
> Unsurprising, as it's still experimental, and it only settled down a couple
> of weeks ago :-)
> 
> >  (We
> > have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> > minimum block size,
> 
> Technically that is 'out of band transmission of block size
> constraints' :-)
> 
> > but I haven't audited whether the kernel actually
> > guarantees that all client requests are sent aligned to the value passed
> > that way - but we have nothing to set the maximum size,
> 
> indeed
> 
> > and are at the
> > mercy of however the kernel currently decides to split large requests).
> 
> I am surprised TRIM doesn't get broken up the same way READ and WRITE
> do.
>

I'm by no means an expert in this, but why would the kernel break up those
TRIM commands?  After all, breaking things up makes sense when the length
of the request is big, not that much when it only contains the request
header, which is the case for TRIM commands.

What am I missing?

Quentin



Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:45, Quentin Casasnovas  
wrote:

> I'm by no means an expert in this, but why would the kernel break up those
> TRIM commands?  After all, breaking things up makes sense when the length
> of the request is big, not that much when it only contains the request
> header, which is the case for TRIM commands.

1. You are assuming that the only reason for limiting the size of operations is
   limiting the data transferred within one request. That is not necessarily
   the case. There are good reasons (if only orthogonality) to have limits
   in place even where no data is transferred.

2. As and when the blocksize extension is implemented in the kernel (it isn't
   now), the protocol requires it.

3. The maximum length of an NBD trim operation is 2^32. The maximum length
   of a trim operation is larger. Therefore the kernel needs to do at least
   some breaking up.
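The breaking up in point 3 amounts to a simple loop. A sketch (the 2^32 - 1
ceiling reflects the 32-bit length field; a client honouring a smaller
advertised maximum would just lower `max_len`):

```python
MAX_NBD_LEN = 2**32 - 1  # largest value the 32-bit NBD length field can carry

def split_trim(offset: int, length: int, max_len: int = MAX_NBD_LEN):
    """Yield (offset, length) sub-requests, each fitting the NBD length field."""
    while length > 0:
        chunk = min(length, max_len)
        yield offset, chunk
        offset += chunk
        length -= chunk
```

With `max_len` lowered to a 2 GiB write ceiling, an 8 GiB trim becomes four
contiguous sub-requests.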

-- 
Alex Bligh







Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:46, Eric Blake  wrote:

> Does anyone have an easy way to cause the kernel to request a trim
> operation that large on a > 4G export?  I'm not familiar enough with
> EXT4 operation to know what file system operations you can run to
> ultimately indirectly create a file system trim operation that large.
> But maybe there is something simpler - does the kernel let you use the
> fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
> FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

Not tried it, but fallocate(1) with -p ?

http://man7.org/linux/man-pages/man1/fallocate.1.html

As it takes length and offset in TB, PB, EB, ZB and YB, it
seems to be 64 bit aware :-)

--
Alex Bligh








Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Eric Blake
On 05/10/2016 09:41 AM, Alex Bligh wrote:
> 
> On 10 May 2016, at 16:29, Eric Blake  wrote:
> 
>> So the kernel is currently one of the clients that does NOT honor block
>> sizes, and as such, servers should be prepared for ANY size up to
>> UINT_MAX (other than DoS handling).
> 
> Interesting followup question:
> 
> If the kernel does not fragment TRIM requests at all (in the
> same way it fragments read and write requests), I suspect
> something bad may happen with TRIM requests over 2^31
> in size (particularly over 2^32 in size), as the length
> field in nbd only has 32 bits.
> 
> Whether it supports block size constraints or not, it is
> going to need to do *some* breaking up of requests.

Does anyone have an easy way to cause the kernel to request a trim
operation that large on a > 4G export?  I'm not familiar enough with
EXT4 operation to know what file system operations you can run to
ultimately indirectly create a file system trim operation that large.
But maybe there is something simpler - does the kernel let you use the
fallocate(2) syscall operation with FALLOC_FL_PUNCH_HOLE or
FALLOC_FL_ZERO_RANGE on an fd backed by an NBD device?

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh

On 10 May 2016, at 16:29, Eric Blake  wrote:

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Interesting followup question:

If the kernel does not fragment TRIM requests at all (in the
same way it fragments read and write requests), I suspect
something bad may happen with TRIM requests over 2^31
in size (particularly over 2^32 in size), as the length
field in nbd only has 32 bits.

Whether it supports block size constraints or not, it is
going to need to do *some* breaking up of requests.

--
Alex Bligh








Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh
Eric,

On 10 May 2016, at 16:29, Eric Blake  wrote:
>>> Maybe we should revisit that in the spec, and/or advertise yet another
>>> block size (since the maximum size for a trim and/or write_zeroes
>>> request may indeed be different than the maximum size for a read/write).
>> 
>> I think it's up to the server to either handle large requests, or
>> for the client to break these up.
> 
> But the question at hand here is whether we should permit servers to
> advertise multiple maximum block sizes (one for read/write, another one
> for trim/write_zero, or even two [at least qemu tracks a separate
> maximum trim vs. write_zero sizing in its generic block layer]), or
> merely stick with the current wording that requires clients that honor
> maximum block size to obey the same maximum for ALL commands, regardless
> of amount of data sent over the wire.

In my view, we should not change this. Block size maxima are not there
to support DoS prevention (that's a separate phrase). They are there
to impose maximum block sizes. Adding a different maximum block size
for different commands is way too overengineered. There are after
all reasons (especially without structured replies) why you'd want
different maximum block sizes for writes and reads. If clients support
block sizes, they will necessarily have to have the infrastructure
to break requests up.

IE maximum block size should continue to mean maximum block size.

>> 
>> The core problem here is that the kernel (and, ahem, most servers) are
>> ignorant of the block size extension, and need to guess how to break
>> things up. In my view the client (kernel in this case) should
>> be breaking the trim requests up into whatever size it uses as the
>> maximum size write requests. But then it would have to know about block
>> sizes which are in (another) experimental extension.
> 
> Correct - no one has yet patched the kernel to honor block sizes
> advertised through what is currently an experimental extension.

Unsurprising, as it's still experimental, and it only settled down a couple
of weeks ago :-)

>  (We
> have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
> minimum block size,

Technically that is 'out of band transmission of block size
constraints' :-)

> but I haven't audited whether the kernel actually
> guarantees that all client requests are sent aligned to the value passed
> that way - but we have nothing to set the maximum size,

indeed

> and are at the
> mercy of however the kernel currently decides to split large requests).

I am surprised TRIM doesn't get broken up the same way READ and WRITE
do.

> So the kernel is currently one of the clients that does NOT honor block
> sizes, and as such, servers should be prepared for ANY size up to
> UINT_MAX (other than DoS handling).

Or not to permit a connection.

>  My question above only applies to
> clients that use the experimental block size extensions.

Indeed.

>> What surprises me is that a release kernel is using experimental
>> NBD extensions; there's no guarantee these won't change. Or does
>> fstrim work some other way?
> 
> No extension in play.  The kernel is obeying NBD_FLAG_SEND_TRIM, which
> is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
> extensions.

My mistake. I was confusing 'WRITE_ZEROES' with 'TRIM'.

--
Alex Bligh








Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Eric Blake
On 05/10/2016 09:08 AM, Alex Bligh wrote:
> Eric,
> 
>> Hmm. The current wording of the experimental block size additions does
>> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
>> maximum NBD_CMD_WRITE:
>> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints
> 
> Correct
> 
>> Maybe we should revisit that in the spec, and/or advertise yet another
>> block size (since the maximum size for a trim and/or write_zeroes
>> request may indeed be different than the maximum size for a read/write).
> 
> I think it's up to the server to either handle large requests, or
> for the client to break these up.

But the question at hand here is whether we should permit servers to
advertise multiple maximum block sizes (one for read/write, another one
for trim/write_zero, or even two [at least qemu tracks a separate
maximum trim vs. write_zero sizing in its generic block layer]), or
merely stick with the current wording that requires clients that honor
maximum block size to obey the same maximum for ALL commands, regardless
of amount of data sent over the wire.

> 
> The core problem here is that the kernel (and, ahem, most servers) are
> ignorant of the block size extension, and need to guess how to break
> things up. In my view the client (kernel in this case) should
> be breaking the trim requests up into whatever size it uses as the
> maximum size write requests. But then it would have to know about block
> sizes which are in (another) experimental extension.

Correct - no one has yet patched the kernel to honor block sizes
advertised through what is currently an experimental extension.  (We
have ioctl(NBD_SET_BLKSIZE) which can be argued to set the kernel's
minimum block size, but I haven't audited whether the kernel actually
guarantees that all client requests are sent aligned to the value passed
that way - but we have nothing to set the maximum size, and are at the
mercy of however the kernel currently decides to split large requests).
 So the kernel is currently one of the clients that does NOT honor block
sizes, and as such, servers should be prepared for ANY size up to
UINT_MAX (other than DoS handling).  My question above only applies to
clients that use the experimental block size extensions.

> 
> I've nothing against you fixing it in qemu's server (after all I don't
> think there is anything to /prohibit/ a server working on something
> larger than the maximum block size), and ...
> 
>> But since the kernel is the one sending the large length request, and
>> since you are right that this is not a denial-of-service in the amount
>> of data being sent in a single NBD message,
> 
> ... currently there isn't actually a maximum block size advertised (that
> being in another experimental addition), so this is just DoS prevention,
> which qemu is free to do how it wishes.

Okay, sounds like that part is uncontroversial, and it is indeed worth
improving qemu's behavior.

> 
> What surprises me is that a release kernel is using experimental
> NBD extensions; there's no guarantee these won't change. Or does
> fstrim work some other way?

No extension in play.  The kernel is obeying NBD_FLAG_SEND_TRIM, which
is in the normative standard, and unrelated to the INFO/BLOCK_SIZE
extensions.

-- 
Eric Blake   eblake redhat com+1-919-301-3266
Libvirt virtualization library http://libvirt.org





Re: [Qemu-block] [Nbd] [Qemu-devel] [PATCH] nbd: fix trim/discard commands with a length bigger than NBD_MAX_BUFFER_SIZE

2016-05-10 Thread Alex Bligh
Eric,

> Hmm. The current wording of the experimental block size additions does
> NOT allow the client to send a NBD_CMD_TRIM with a size larger than the
> maximum NBD_CMD_WRITE:
> https://github.com/yoe/nbd/blob/extension-info/doc/proto.md#block-size-constraints

Correct

> Maybe we should revisit that in the spec, and/or advertise yet another
> block size (since the maximum size for a trim and/or write_zeroes
> request may indeed be different than the maximum size for a read/write).

I think it's up to the server to either handle large requests, or
for the client to break these up.

The core problem here is that the kernel (and, ahem, most servers) are
ignorant of the block size extension, and need to guess how to break
things up. In my view the client (kernel in this case) should
be breaking the trim requests up into whatever size it uses as the
maximum size for write requests. But then it would have to know about block
sizes which are in (another) experimental extension.

I've nothing against you fixing it in qemu's server (after all I don't
think there is anything to /prohibit/ a server working on something
larger than the maximum block size), and ...

> But since the kernel is the one sending the large length request, and
> since you are right that this is not a denial-of-service in the amount
> of data being sent in a single NBD message,

... currently there isn't actually a maximum block size advertised (that
being in another experimental addition), so this is just DoS prevention,
which qemu is free to do how it wishes.

> I definitely agree that qemu
> would be wise as a quality-of-implementation to allow the larger size,
> for maximum interoperability, even if it exceeds advertised limits (that
> is, when no limits are advertised, we should handle everything possible
> if it is not so large as to be construed a denial-of-service, and
> NBD_CMD_TRIM is not large; and when limits ARE advertised, a client that
> violates limits is out of spec but we can still be liberal and respond
> successfully to such a client rather than having to outright reject it).
> So I think this patch is headed in the right direction.

I'd agree with that.

What surprises me is that a release kernel is using experimental
NBD extensions; there's no guarantee these won't change. Or does
fstrim work some other way?

Alex

> 
>> 
>> and looking at the kernel side implementation of the nbd device
>> (drivers/block/nbd.c) where it only sends the request header with no data
>> for a NBD_CMD_TRIM.
>> 
>> With this fix in, I am now able to run fstrim on my qcow2 images and keep
>> them small (or at least keep their size proportional to the amount of data
>> present on them).
>> 
>> Signed-off-by: Quentin Casasnovas 
>> CC: Paolo Bonzini 
>> CC: 
>> CC: 
>> CC: 
> 
> This is NOT trivial material and should not go in through that tree.
> However, I concur that it qualifies for a backport on a stable branch.
> 
>> ---
>> nbd.c | 8 +++++++-
>> 1 file changed, 7 insertions(+), 1 deletion(-)
>> 
>> diff --git a/nbd.c b/nbd.c
>> index b3d9654..e733669 100644
>> --- a/nbd.c
>> +++ b/nbd.c
>> @@ -1209,6 +1209,11 @@ static ssize_t nbd_co_send_reply(NBDRequest *req, 
>> struct nbd_reply *reply,
>> return rc;
>> }
>> 
>> +static bool nbd_should_check_request_size(const struct nbd_request *request)
>> +{
>> +return (request->type & NBD_CMD_MASK_COMMAND) != NBD_CMD_TRIM;
>> +}
>> +
>> static ssize_t nbd_co_receive_request(NBDRequest *req, struct nbd_request 
>> *request)
>> {
>> NBDClient *client = req->client;
>> @@ -1227,7 +1232,8 @@ static ssize_t nbd_co_receive_request(NBDRequest *req, 
>> struct nbd_request *reque
>> goto out;
>> }
>> 
>> -if (request->len > NBD_MAX_BUFFER_SIZE) {
>> +if (nbd_should_check_request_size(request) &&
>> +request->len > NBD_MAX_BUFFER_SIZE) {
> 
> I'd rather sort out the implications of this on the NBD protocol before
> taking anything into qemu.  We've got time on our hand, so let's use it
> to get this right.  (That, and I have several pending patches that
> conflict with this as part of adding WRITE_ZEROES and INFO_BLOCK_SIZE
> support, where it may be easier to resubmit this fix on top of my
> pending patches).
> 
>> LOG("len (%u) is larger than max len (%u)",
>> request->len, NBD_MAX_BUFFER_SIZE);
>> rc = -EINVAL;
>> 
> 
> 