Re: [Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
On 05/22/2018 10:33 AM, Vladimir Sementsov-Ogievskiy wrote:
> 02.05.2018 00:13, Eric Blake wrote:
>> The NBD spec is clarifying [1] that a server may want to advertise
>> different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
>> (in our case, nearly 4G). Implement the client side support for these
>> alternate limits, by always requesting the new information (a
>> compliant server must ignore the request if it is unknown), and by
>> validating anything reported by the server before populating the
>> block layer limits.

I still need to do a v2 of this series based on feedback from the NBD
list, but answering your question in the meantime:

>>      bs->bl.request_alignment = min ? min : BDRV_SECTOR_SIZE;
>> -    bs->bl.max_pdiscard = max;
>> -    bs->bl.max_pwrite_zeroes = max;
>> +    if (s->info.max_trim) {
>> +        bs->bl.max_pdiscard = MIN(s->info.max_trim, BDRV_REQUEST_MAX_BYTES);
>> +    } else {
>> +        bs->bl.max_pdiscard = max;
>> +    }
>> +    bs->bl.pdiscard_alignment = s->info.min_trim;
>> +    if (s->info.max_zero) {
>> +        bs->bl.max_pwrite_zeroes = MIN(s->info.max_zero,
>> +                                       BDRV_REQUEST_MAX_BYTES);
>> +    } else {
>> +        bs->bl.max_pwrite_zeroes = max;
>> +    }
>> +    bs->bl.pwrite_zeroes_alignment = s->info.min_zero;
>>      bs->bl.max_transfer = max;
>
> Should not max_transfer be updated too? Looks like it limits all
> requests, is it right?

max_transfer affects read and write requests, but not write_zero
requests (where max_pwrite_zeroes is used instead) or trim requests
(where max_pdiscard is used instead). Which is precisely the semantics
we want - the NBD extension is mirroring the fact that qemu already has
a way to track independent limits for trim/zero that can be larger than
the normal limits for read/write, because trim/zero do not involve a
transfer of a large data buffer.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
02.05.2018 00:13, Eric Blake wrote:
> The NBD spec is clarifying [1] that a server may want to advertise
> different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
> (in our case, nearly 4G). Implement the client side support for these
> alternate limits, by always requesting the new information (a
> compliant server must ignore the request if it is unknown), and by
> validating anything reported by the server before populating the block
> layer limits.
>
> [1] https://lists.debian.org/nbd/2018/03/msg00048.html
>
> Signed-off-by: Eric Blake
> ---
> The given URL for the NBD spec was v3; it will change to be a v4
> version of that patch in part to point back to this qemu commit as a
> proof of implementation.
> ---
>  include/block/nbd.h |   4 ++
>  block/nbd.c         |  15 ++-
>  nbd/client.c        | 116 ++--
>  nbd/trace-events    |   2 +
>  4 files changed, 131 insertions(+), 6 deletions(-)
>
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index cbf51628f78..90ddef32bb3 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -270,6 +270,10 @@ struct NBDExportInfo {
>      uint32_t min_block;
>      uint32_t opt_block;
>      uint32_t max_block;
> +    uint32_t min_trim;
> +    uint32_t max_trim;
> +    uint32_t min_zero;
> +    uint32_t max_zero;
>      uint32_t meta_base_allocation_id;
>  };
>
> diff --git a/block/nbd.c b/block/nbd.c
> index 1e2b3ba2d3b..823d10b251d 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -478,8 +478,19 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
>      uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);
>
>      bs->bl.request_alignment = min ? min : BDRV_SECTOR_SIZE;
> -    bs->bl.max_pdiscard = max;
> -    bs->bl.max_pwrite_zeroes = max;
> +    if (s->info.max_trim) {
> +        bs->bl.max_pdiscard = MIN(s->info.max_trim, BDRV_REQUEST_MAX_BYTES);
> +    } else {
> +        bs->bl.max_pdiscard = max;
> +    }
> +    bs->bl.pdiscard_alignment = s->info.min_trim;
> +    if (s->info.max_zero) {
> +        bs->bl.max_pwrite_zeroes = MIN(s->info.max_zero,
> +                                       BDRV_REQUEST_MAX_BYTES);
> +    } else {
> +        bs->bl.max_pwrite_zeroes = max;
> +    }
> +    bs->bl.pwrite_zeroes_alignment = s->info.min_zero;
>      bs->bl.max_transfer = max;

Should not max_transfer be updated too? Looks like it limits all
requests, is it right?

--
Best regards,
Vladimir
Re: [Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
03.05.2018 17:49, Eric Blake wrote:
> On 05/03/2018 04:17 AM, Vladimir Sementsov-Ogievskiy wrote:
>> 02.05.2018 00:13, Eric Blake wrote:
>>> The NBD spec is clarifying [1] that a server may want to advertise
>>> different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
>>> (in our case, nearly 4G). Implement the client side support for these
>>> alternate limits, by always requesting the new information (a
>>> compliant server must ignore the request if it is unknown), and by
>>> validating anything reported by the server before populating the
>>> block layer limits.
>>>
>>> [1] https://lists.debian.org/nbd/2018/03/msg00048.html
>>>
>>> Signed-off-by: Eric Blake
>>>
>>> +++ b/nbd/client.c
>>> @@ -337,16 +337,18 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
>>>      info->flags = 0;
>>>      trace_nbd_opt_go_start(wantname);
>>>
>>> -    buf = g_malloc(4 + len + 2 + 2 * info->request_sizes + 1);
>>> +    buf = g_malloc(4 + len + 2 + 3 * 2 * info->request_sizes + 1);
>>
>> Hmm, what does the "+1" mean?
>
> Doesn't appear to be necessary, but a little slop never hurts. Better
> might be struct-ifying things rather than open-coding offsets, or even
> using a vectored approach rather than requiring a single buffer. But if
> useful, that's refactoring that is independent of this patch, while
> this is just demonstrating the bare minimum implementation of the new
> feature.
>
>>>      stl_be_p(buf, len);
>>>      memcpy(buf + 4, wantname, len);
>>> -    /* At most one request, everything else up to server */
>>> -    stw_be_p(buf + 4 + len, info->request_sizes);
>>> +    /* Either 0 or 3 requests, everything else up to server */
>>> +    stw_be_p(buf + 4 + len, 3 * info->request_sizes);
>>>      if (info->request_sizes) {
>>>          stw_be_p(buf + 4 + len + 2, NBD_INFO_BLOCK_SIZE);
>>> +        stw_be_p(buf + 4 + len + 2 + 2, NBD_INFO_TRIM_SIZE);
>>> +        stw_be_p(buf + 4 + len + 2 + 2 + 2, NBD_INFO_ZERO_SIZE);
>>>      }
>>>      error = nbd_send_option_request(ioc, NBD_OPT_GO,
>>> -                                    4 + len + 2 + 2 * info->request_sizes,
>>> +                                    4 + len + 2 + 3 * 2 * info->request_sizes,
>>>                                      buf, errp);
>>
>> magic chunk) Change looks correct. Is it worth introducing a buf_len
>> variable, to not retype it? And increment as we go?
>
> Yeah, it might make things more legible.
>
>>> +            be32_to_cpus(&info->max_trim);
>>> +            if (!info->min_trim || !info->max_trim ||
>>> +                (info->max_trim != UINT32_MAX &&
>>> +                 !QEMU_IS_ALIGNED(info->max_trim, info->min_trim))) {
>>> +                error_setg(errp, "server trim sizes %" PRIu32 "/%" PRIu32
>>> +                           " are not valid", info->min_trim, info->max_trim);
>>> +                nbd_send_opt_abort(ioc);
>>> +                return -1;
>>> +            }
>>
>> Didn't you forget to check that min_trim is a power of two?
>
> Nope. The NBD spec wording specifically uses "The minimum trim size
> SHOULD be a power of 2, and MUST be at least as large as the preferred
> block size advertised in `NBD_INFO_BLOCK_SIZE`", in part because there
> ARE existing hardware iscsi devices that have an advertised
> preferred/maximum trim size of exactly 15 megabytes (you can request
> smaller trims 1M at a time, and the device then caches things to
> eventually report unmapped slices at a 15M boundary). For more
> information on this odd hardware, look in the qemu logs for the Dell
> Equallogic device, such as commit 3482b9bc.

Ok. Interesting, thank you.

> Since the NBD 'min trim' is advisory, and merely telling the client
> the ideal alignments to use for a trim most likely to have an effect
> (on Equallogic, a 1M trim by itself might not trim anything unless
> neighboring areas are also trimmed, while an aligned 15M trim is
> immediately observable), an NBD server wrapping such an iscsi device
> may prefer to mirror the hardware's preferred alignment of 15M, rather
> than inventing a minimum alignment of 1M, in what it advertises in
> NBD_INFO_TRIM_SIZE.
>
> And maybe I should tweak the NBD spec addition to call things
> "preferred trim/zero" rather than "min trim/zero", if that makes
> things any easier to grasp on first read, and better matches the iscsi
> concept.

Good idea, and it will better correspond to pref_block for
INFO_BLOCK_SIZE, as they are compared with it.

>> hmm, so, now nbd_opt_go() is full of very similar code parts, which
>> may be worth some refactoring now or in the future. Are we going to
>> add similar limits for BLOCK_STATUS? And for CACHE? I have an idea:
>> why not make a universal option for it in the protocol? Something
>> like option INFO_CMD_SIZE:
>>
>>   uint16_t command
>>   uint32_t min_block
>>   uint32_t max_block
>>
>> and the server can send several such options. Most of the semantics
>> for these additional limits are common, so it will simplify both the
>> documentation and the implementation. I'll copy this to the nbd
>> thread; it is better to discuss it there.
>
> Yep, I've followed up on that thread, but will post a v2 of this
> proposal along those lines.

--
Best regards,
Vladimir
Re: [Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
On 05/03/2018 04:17 AM, Vladimir Sementsov-Ogievskiy wrote:
> 02.05.2018 00:13, Eric Blake wrote:
>> The NBD spec is clarifying [1] that a server may want to advertise
>> different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
>> (in our case, nearly 4G). Implement the client side support for these
>> alternate limits, by always requesting the new information (a
>> compliant server must ignore the request if it is unknown), and by
>> validating anything reported by the server before populating the
>> block layer limits.
>>
>> [1] https://lists.debian.org/nbd/2018/03/msg00048.html
>>
>> Signed-off-by: Eric Blake
>>
>> +++ b/nbd/client.c
>> @@ -337,16 +337,18 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
>>      info->flags = 0;
>>      trace_nbd_opt_go_start(wantname);
>>
>> -    buf = g_malloc(4 + len + 2 + 2 * info->request_sizes + 1);
>> +    buf = g_malloc(4 + len + 2 + 3 * 2 * info->request_sizes + 1);
>
> Hmm, what does the "+1" mean?

Doesn't appear to be necessary, but a little slop never hurts. Better
might be struct-ifying things rather than open-coding offsets, or even
using a vectored approach rather than requiring a single buffer. But if
useful, that's refactoring that is independent of this patch, while this
is just demonstrating the bare minimum implementation of the new
feature.

>>      stl_be_p(buf, len);
>>      memcpy(buf + 4, wantname, len);
>> -    /* At most one request, everything else up to server */
>> -    stw_be_p(buf + 4 + len, info->request_sizes);
>> +    /* Either 0 or 3 requests, everything else up to server */
>> +    stw_be_p(buf + 4 + len, 3 * info->request_sizes);
>>      if (info->request_sizes) {
>>          stw_be_p(buf + 4 + len + 2, NBD_INFO_BLOCK_SIZE);
>> +        stw_be_p(buf + 4 + len + 2 + 2, NBD_INFO_TRIM_SIZE);
>> +        stw_be_p(buf + 4 + len + 2 + 2 + 2, NBD_INFO_ZERO_SIZE);
>>      }
>>      error = nbd_send_option_request(ioc, NBD_OPT_GO,
>> -                                    4 + len + 2 + 2 * info->request_sizes,
>> +                                    4 + len + 2 + 3 * 2 * info->request_sizes,
>>                                      buf, errp);
>
> magic chunk) Change looks correct. Is it worth introducing a buf_len
> variable, to not retype it? And increment as we go?

Yeah, it might make things more legible.

>> +            be32_to_cpus(&info->max_trim);
>> +            if (!info->min_trim || !info->max_trim ||
>> +                (info->max_trim != UINT32_MAX &&
>> +                 !QEMU_IS_ALIGNED(info->max_trim, info->min_trim))) {
>> +                error_setg(errp, "server trim sizes %" PRIu32 "/%" PRIu32
>> +                           " are not valid", info->min_trim, info->max_trim);
>> +                nbd_send_opt_abort(ioc);
>> +                return -1;
>> +            }
>
> Didn't you forget to check that min_trim is a power of two?

Nope. The NBD spec wording specifically uses "The minimum trim size
SHOULD be a power of 2, and MUST be at least as large as the preferred
block size advertised in `NBD_INFO_BLOCK_SIZE`", in part because there
ARE existing hardware iscsi devices that have an advertised
preferred/maximum trim size of exactly 15 megabytes (you can request
smaller trims 1M at a time, and the device then caches things to
eventually report unmapped slices at a 15M boundary). For more
information on this odd hardware, look in the qemu logs for the Dell
Equallogic device, such as commit 3482b9bc.

Since the NBD 'min trim' is advisory, and merely telling the client the
ideal alignments to use for a trim most likely to have an effect (on
Equallogic, a 1M trim by itself might not trim anything unless
neighboring areas are also trimmed, while an aligned 15M trim is
immediately observable), an NBD server wrapping such an iscsi device may
prefer to mirror the hardware's preferred alignment of 15M, rather than
inventing a minimum alignment of 1M, in what it advertises in
NBD_INFO_TRIM_SIZE.

And maybe I should tweak the NBD spec addition to call things "preferred
trim/zero" rather than "min trim/zero", if that makes things any easier
to grasp on first read, and better matches the iscsi concept.

> hmm, so, now nbd_opt_go() is full of very similar code parts, which
> may be worth some refactoring now or in the future. Are we going to
> add similar limits for BLOCK_STATUS? And for CACHE? I have an idea:
> why not make a universal option for it in the protocol? Something like
> option INFO_CMD_SIZE:
>
>   uint16_t command
>   uint32_t min_block
>   uint32_t max_block
>
> and the server can send several such options. Most of the semantics
> for these additional limits are common, so it will simplify both the
> documentation and the implementation. I'll copy this to the nbd
> thread; it is better to discuss it there.

Yep, I've followed up on that thread, but will post a v2 of this
proposal along those lines.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
Re: [Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
02.05.2018 00:13, Eric Blake wrote:
> The NBD spec is clarifying [1] that a server may want to advertise
> different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
> (in our case, nearly 4G). Implement the client side support for these
> alternate limits, by always requesting the new information (a
> compliant server must ignore the request if it is unknown), and by
> validating anything reported by the server before populating the block
> layer limits.
>
> [1] https://lists.debian.org/nbd/2018/03/msg00048.html
>
> Signed-off-by: Eric Blake
> ---
> The given URL for the NBD spec was v3; it will change to be a v4
> version of that patch in part to point back to this qemu commit as a
> proof of implementation.
> ---
>  include/block/nbd.h |   4 ++
>  block/nbd.c         |  15 ++-
>  nbd/client.c        | 116 ++--
>  nbd/trace-events    |   2 +
>  4 files changed, 131 insertions(+), 6 deletions(-)
>
> diff --git a/include/block/nbd.h b/include/block/nbd.h
> index cbf51628f78..90ddef32bb3 100644
> --- a/include/block/nbd.h
> +++ b/include/block/nbd.h
> @@ -270,6 +270,10 @@ struct NBDExportInfo {
>      uint32_t min_block;
>      uint32_t opt_block;
>      uint32_t max_block;
> +    uint32_t min_trim;
> +    uint32_t max_trim;
> +    uint32_t min_zero;
> +    uint32_t max_zero;
>      uint32_t meta_base_allocation_id;
>  };
>
> diff --git a/block/nbd.c b/block/nbd.c
> index 1e2b3ba2d3b..823d10b251d 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -478,8 +478,19 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
>      uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);
>
>      bs->bl.request_alignment = min ? min : BDRV_SECTOR_SIZE;
> -    bs->bl.max_pdiscard = max;
> -    bs->bl.max_pwrite_zeroes = max;
> +    if (s->info.max_trim) {
> +        bs->bl.max_pdiscard = MIN(s->info.max_trim, BDRV_REQUEST_MAX_BYTES);
> +    } else {
> +        bs->bl.max_pdiscard = max;
> +    }
> +    bs->bl.pdiscard_alignment = s->info.min_trim;
> +    if (s->info.max_zero) {
> +        bs->bl.max_pwrite_zeroes = MIN(s->info.max_zero,
> +                                       BDRV_REQUEST_MAX_BYTES);
> +    } else {
> +        bs->bl.max_pwrite_zeroes = max;
> +    }
> +    bs->bl.pwrite_zeroes_alignment = s->info.min_zero;
>      bs->bl.max_transfer = max;
>
>      if (s->info.opt_block &&
>
> diff --git a/nbd/client.c b/nbd/client.c
> index 0abb195b856..f1364747ba1 100644
> --- a/nbd/client.c
> +++ b/nbd/client.c
> @@ -337,16 +337,18 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
>      info->flags = 0;
>      trace_nbd_opt_go_start(wantname);
>
> -    buf = g_malloc(4 + len + 2 + 2 * info->request_sizes + 1);
> +    buf = g_malloc(4 + len + 2 + 3 * 2 * info->request_sizes + 1);

Hmm, what does the "+1" mean?

>      stl_be_p(buf, len);
>      memcpy(buf + 4, wantname, len);
> -    /* At most one request, everything else up to server */
> -    stw_be_p(buf + 4 + len, info->request_sizes);
> +    /* Either 0 or 3 requests, everything else up to server */
> +    stw_be_p(buf + 4 + len, 3 * info->request_sizes);
>      if (info->request_sizes) {
>          stw_be_p(buf + 4 + len + 2, NBD_INFO_BLOCK_SIZE);
> +        stw_be_p(buf + 4 + len + 2 + 2, NBD_INFO_TRIM_SIZE);
> +        stw_be_p(buf + 4 + len + 2 + 2 + 2, NBD_INFO_ZERO_SIZE);
>      }
>      error = nbd_send_option_request(ioc, NBD_OPT_GO,
> -                                    4 + len + 2 + 2 * info->request_sizes,
> +                                    4 + len + 2 + 3 * 2 * info->request_sizes,
>                                      buf, errp);

magic chunk) Change looks correct. Is it worth introducing a buf_len
variable, to not retype it?

>      g_free(buf);
>      if (error < 0) {
>
> @@ -467,6 +469,72 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
>                                         info->max_block);
>              break;
>
> +        case NBD_INFO_TRIM_SIZE:
> +            if (len != sizeof(info->min_trim) * 2) {
> +                error_setg(errp, "remaining trim info len %" PRIu32
> +                           " is unexpected size", len);
> +                nbd_send_opt_abort(ioc);
> +                return -1;
> +            }
> +            if (nbd_read(ioc, &info->min_trim, sizeof(info->min_trim),
> +                         errp) < 0) {
> +                error_prepend(errp, "failed to read info minimum trim size: ");
> +                nbd_send_opt_abort(ioc);
> +                return -1;
> +            }
> +            be32_to_cpus(&info->min_trim);
> +            if (nbd_read(ioc, &info->max_trim, sizeof(info->max_trim),
> +                         errp) < 0) {
> +                error_prepend(errp,
> +                              "failed to read info maximum trim size: ");
> +                nbd_send_opt_abort(ioc);
> +                return -1;
> +            }
> +            be32_to_cpus(&info->max_trim);
> +            if (!info->min_trim || !info->max_trim ||
> +                (info->max_trim != UINT32_MAX &&
> +                 !QEMU_IS_ALIGNED(info->max_trim, info->min_trim))) {
[Qemu-block] [PATCH 3/4] nbd/client: Support requests of additional block sizing info
The NBD spec is clarifying [1] that a server may want to advertise
different limits for READ/WRITE (in our case, 32M) than for TRIM/ZERO
(in our case, nearly 4G). Implement the client side support for these
alternate limits, by always requesting the new information (a compliant
server must ignore the request if it is unknown), and by validating
anything reported by the server before populating the block layer
limits.

[1] https://lists.debian.org/nbd/2018/03/msg00048.html

Signed-off-by: Eric Blake
---
The given URL for the NBD spec was v3; it will change to be a v4 version
of that patch in part to point back to this qemu commit as a proof of
implementation.
---
 include/block/nbd.h |   4 ++
 block/nbd.c         |  15 ++-
 nbd/client.c        | 116 ++--
 nbd/trace-events    |   2 +
 4 files changed, 131 insertions(+), 6 deletions(-)

diff --git a/include/block/nbd.h b/include/block/nbd.h
index cbf51628f78..90ddef32bb3 100644
--- a/include/block/nbd.h
+++ b/include/block/nbd.h
@@ -270,6 +270,10 @@ struct NBDExportInfo {
     uint32_t min_block;
     uint32_t opt_block;
     uint32_t max_block;
+    uint32_t min_trim;
+    uint32_t max_trim;
+    uint32_t min_zero;
+    uint32_t max_zero;
     uint32_t meta_base_allocation_id;
 };

diff --git a/block/nbd.c b/block/nbd.c
index 1e2b3ba2d3b..823d10b251d 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -478,8 +478,19 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
     uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);

     bs->bl.request_alignment = min ? min : BDRV_SECTOR_SIZE;
-    bs->bl.max_pdiscard = max;
-    bs->bl.max_pwrite_zeroes = max;
+    if (s->info.max_trim) {
+        bs->bl.max_pdiscard = MIN(s->info.max_trim, BDRV_REQUEST_MAX_BYTES);
+    } else {
+        bs->bl.max_pdiscard = max;
+    }
+    bs->bl.pdiscard_alignment = s->info.min_trim;
+    if (s->info.max_zero) {
+        bs->bl.max_pwrite_zeroes = MIN(s->info.max_zero,
+                                       BDRV_REQUEST_MAX_BYTES);
+    } else {
+        bs->bl.max_pwrite_zeroes = max;
+    }
+    bs->bl.pwrite_zeroes_alignment = s->info.min_zero;
     bs->bl.max_transfer = max;

     if (s->info.opt_block &&

diff --git a/nbd/client.c b/nbd/client.c
index 0abb195b856..f1364747ba1 100644
--- a/nbd/client.c
+++ b/nbd/client.c
@@ -337,16 +337,18 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
     info->flags = 0;
     trace_nbd_opt_go_start(wantname);

-    buf = g_malloc(4 + len + 2 + 2 * info->request_sizes + 1);
+    buf = g_malloc(4 + len + 2 + 3 * 2 * info->request_sizes + 1);
     stl_be_p(buf, len);
     memcpy(buf + 4, wantname, len);
-    /* At most one request, everything else up to server */
-    stw_be_p(buf + 4 + len, info->request_sizes);
+    /* Either 0 or 3 requests, everything else up to server */
+    stw_be_p(buf + 4 + len, 3 * info->request_sizes);
     if (info->request_sizes) {
         stw_be_p(buf + 4 + len + 2, NBD_INFO_BLOCK_SIZE);
+        stw_be_p(buf + 4 + len + 2 + 2, NBD_INFO_TRIM_SIZE);
+        stw_be_p(buf + 4 + len + 2 + 2 + 2, NBD_INFO_ZERO_SIZE);
     }
     error = nbd_send_option_request(ioc, NBD_OPT_GO,
-                                    4 + len + 2 + 2 * info->request_sizes,
+                                    4 + len + 2 + 3 * 2 * info->request_sizes,
                                     buf, errp);
     g_free(buf);
     if (error < 0) {
@@ -467,6 +469,72 @@ static int nbd_opt_go(QIOChannel *ioc, const char *wantname,
                                        info->max_block);
             break;

+        case NBD_INFO_TRIM_SIZE:
+            if (len != sizeof(info->min_trim) * 2) {
+                error_setg(errp, "remaining trim info len %" PRIu32
+                           " is unexpected size", len);
+                nbd_send_opt_abort(ioc);
+                return -1;
+            }
+            if (nbd_read(ioc, &info->min_trim, sizeof(info->min_trim),
+                         errp) < 0) {
+                error_prepend(errp, "failed to read info minimum trim size: ");
+                nbd_send_opt_abort(ioc);
+                return -1;
+            }
+            be32_to_cpus(&info->min_trim);
+            if (nbd_read(ioc, &info->max_trim, sizeof(info->max_trim),
+                         errp) < 0) {
+                error_prepend(errp,
+                              "failed to read info maximum trim size: ");
+                nbd_send_opt_abort(ioc);
+                return -1;
+            }
+            be32_to_cpus(&info->max_trim);
+            if (!info->min_trim || !info->max_trim ||
+                (info->max_trim != UINT32_MAX &&
+                 !QEMU_IS_ALIGNED(info->max_trim, info->min_trim))) {