Re: [PATCH v2 4/4] nbd: Enable multi-conn using round-robin
On Thu, May 22, 2025 at 08:37:58PM +0300, Andrey Drobyshev wrote: > On 4/28/25 9:46 PM, Eric Blake wrote: > > From: "Richard W.M. Jones" > > > > Enable NBD multi-conn by spreading operations across multiple > > connections. > > > > (XXX) This uses a naive round-robin approach which could be improved. > > For example we could look at how many requests are in flight and > > assign operations to the connections with fewest. Or we could try to > > estimate (based on size of requests outstanding) the load on each > > connection. But this implementation doesn't do any of that. > > > > Signed-off-by: Richard W.M. Jones > > Message-ID: <[email protected]> > > --- > > block/nbd.c | 67 +++-- > > 1 file changed, 49 insertions(+), 18 deletions(-) > > > > diff --git a/block/nbd.c b/block/nbd.c > > index 19da1a7a1fe..bf5bc57569c 100644 > > --- a/block/nbd.c > > +++ b/block/nbd.c > > > > [...] > > > > @@ -2207,24 +2233,29 @@ static const char *const nbd_strong_runtime_opts[] > > = { > > static void nbd_cancel_in_flight(BlockDriverState *bs) > > { > > BDRVNBDState *s = (BDRVNBDState *)bs->opaque; > > -NBDConnState * const cs = s->conns[0]; > > +size_t i; > > +NBDConnState *cs; > > > > -reconnect_delay_timer_del(cs); > > +for (i = 0; i < MAX_MULTI_CONN; ++i) { > > +cs = s->conns[i]; > > > > -qemu_mutex_lock(&cs->requests_lock); > > -if (cs->state == NBD_CLIENT_CONNECTING_WAIT) { > > -cs->state = NBD_CLIENT_CONNECTING_NOWAIT; > > +reconnect_delay_timer_del(cs); > > + > > This code is causing iotests/{185,264,281} to segfault. E.g.: > > > (gdb) bt > > #0 0x55bbaec58119 in reconnect_delay_timer_del (cs=0x0) at > > ../block/nbd.c:205 > > #1 0x55bbaec5d8e4 in nbd_cancel_in_flight (bs=0x55bbb1458020) at > > ../block/nbd.c:2242 > > #2 0x55bbaec4ff16 in bdrv_cancel_in_flight (bs=0x55bbb1458020) at > > ../block/io.c:3737 > > #3 0x55bbaec54ec1 in mirror_cancel (job=0x55bbb21ce800, force=true) at > > ../block/mirror.c:1335 > > #4 0x55bbaec18278 in job_cancel_async_locked (job=0x55bbb21ce800, > > force=true) at ../job.c:893 > > #5 0x55bbaec18df2 in job_cancel_locked (job=0x55bbb21ce800, > > force=true) at ../job.c:1143 > > #6 0x55bbaec18ef3 in job_force_cancel_err_locked (job=0x55bbb21ce800, > > errp=0x7fff44f247a0) at ../job.c:1190 > > #7 0x55bbaec192a4 in job_finish_sync_locked (job=0x55bbb21ce800, > > finish=0x55bbaec18ed2 , errp=0x0) at > > ../job.c:1253 > > #8 0x55bbaec18f2e in job_cancel_sync_locked (job=0x55bbb21ce800, > > force=true) at ../job.c:1196 > > #9 0x55bbaec19086 in job_cancel_sync_all () at ../job.c:1214 > > #10 0x55bbaed55177 in qemu_cleanup (status=0) at > > ../system/runstate.c:949 > > #11 0x55bbaedd0aad in qemu_default_main (opaque=0x0) at > > ../system/main.c:51 > > #12 0x55bbaedd0b4f in main (argc=21, argv=0x7fff44f249d8) at > > ../system/main.c:80 > > We're dereferencing a pointer to NBDConnState that was never > initialized. So it should either be > > > -for (i = 0; i < MAX_MULTI_CONN; ++i) { > > +for (i = 0; i < s->multi_conn; ++i) { Thanks for pointing it out; I'll fix that. -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org
Re: [PATCH v2 4/4] nbd: Enable multi-conn using round-robin
On 4/28/25 9:46 PM, Eric Blake wrote: > From: "Richard W.M. Jones" > > Enable NBD multi-conn by spreading operations across multiple > connections. > > (XXX) This uses a naive round-robin approach which could be improved. > For example we could look at how many requests are in flight and > assign operations to the connections with fewest. Or we could try to > estimate (based on size of requests outstanding) the load on each > connection. But this implementation doesn't do any of that. > > Signed-off-by: Richard W.M. Jones > Message-ID: <[email protected]> > --- > block/nbd.c | 67 +++-- > 1 file changed, 49 insertions(+), 18 deletions(-) > > diff --git a/block/nbd.c b/block/nbd.c > index 19da1a7a1fe..bf5bc57569c 100644 > --- a/block/nbd.c > +++ b/block/nbd.c > > [...] > @@ -2207,24 +2233,29 @@ static const char *const nbd_strong_runtime_opts[] = { > static void nbd_cancel_in_flight(BlockDriverState *bs) > { > BDRVNBDState *s = (BDRVNBDState *)bs->opaque; > -NBDConnState * const cs = s->conns[0]; > +size_t i; > +NBDConnState *cs; > > -reconnect_delay_timer_del(cs); > +for (i = 0; i < MAX_MULTI_CONN; ++i) { > +cs = s->conns[i]; > > -qemu_mutex_lock(&cs->requests_lock); > -if (cs->state == NBD_CLIENT_CONNECTING_WAIT) { > -cs->state = NBD_CLIENT_CONNECTING_NOWAIT; > +reconnect_delay_timer_del(cs); > + This code is causing iotests/{185,264,281} to segfault. E.g.: > (gdb) bt > #0 0x55bbaec58119 in reconnect_delay_timer_del (cs=0x0) at > ../block/nbd.c:205 > #1 0x55bbaec5d8e4 in nbd_cancel_in_flight (bs=0x55bbb1458020) at > ../block/nbd.c:2242 > #2 0x55bbaec4ff16 in bdrv_cancel_in_flight (bs=0x55bbb1458020) at > ../block/io.c:3737 > #3 0x55bbaec54ec1 in mirror_cancel (job=0x55bbb21ce800, force=true) at > ../block/mirror.c:1335 > #4 0x55bbaec18278 in job_cancel_async_locked (job=0x55bbb21ce800, > force=true) at ../job.c:893 > #5 0x55bbaec18df2 in job_cancel_locked (job=0x55bbb21ce800, force=true) > at ../job.c:1143 > #6 0x55bbaec18ef3 in job_force_cancel_err_locked (job=0x55bbb21ce800, > errp=0x7fff44f247a0) at ../job.c:1190 > #7 0x55bbaec192a4 in job_finish_sync_locked (job=0x55bbb21ce800, > finish=0x55bbaec18ed2 , errp=0x0) at > ../job.c:1253 > #8 0x55bbaec18f2e in job_cancel_sync_locked (job=0x55bbb21ce800, > force=true) at ../job.c:1196 > #9 0x55bbaec19086 in job_cancel_sync_all () at ../job.c:1214 > #10 0x55bbaed55177 in qemu_cleanup (status=0) at ../system/runstate.c:949 > #11 0x55bbaedd0aad in qemu_default_main (opaque=0x0) at > ../system/main.c:51 > #12 0x55bbaedd0b4f in main (argc=21, argv=0x7fff44f249d8) at > ../system/main.c:80 We're dereferencing a pointer to NBDConnState that was never initialized. So it should either be > -for (i = 0; i < MAX_MULTI_CONN; ++i) { > +for (i = 0; i < s->multi_conn; ++i) { or > -cs = s->conns[i]; > +if (s->conns[i]) { > +cs = s->conns[i]; > +... [...] Andrey
Re: [PATCH v2 4/4] nbd: Enable multi-conn using round-robin
On Mon, Apr 28, 2025 at 08:27:54PM +0100, Richard W.M. Jones wrote: > On Mon, Apr 28, 2025 at 01:46:47PM -0500, Eric Blake wrote: > [...] > > This all looks similar to when I posted it before. However I noted a > couple of problems ... > > > (XXX) The strategy here is very naive. Firstly if you were going to > > open them, you'd probably want to open them in parallel. Secondly it > > would make sense to delay opening until multiple parallel requests are > > seen (perhaps above some threshold), so that simple or shortlived NBD > > operations do not require multiple connections to be made. > > > (XXX) This uses a naive round-robin approach which could be improved. > > For example we could look at how many requests are in flight and > > assign operations to the connections with fewest. Or we could try to > > estimate (based on size of requests outstanding) the load on each > > connection. But this implementation doesn't do any of that. > > Plus there was a third rather more fundamental problem that apparently > I didn't write about. That is that connections were serialised on a > single thread (called from many coroutines). This bottleneck meant > that there wasn't very much advantage to multi-conn, compared to what > we get in libnbd / nbdcopy. > > Are these fixed / planned to be fixed, especially the third? That indeed is what I hope to address - to provide additional patches on top of this rebase to make it possible to specify a pool of more than one thread using the recent multiqueue work in the block layer. It's taking me longer than I want to get something that I'm happy with posting. -- Eric Blake, Principal Software Engineer Red Hat, Inc. Virtualization: qemu.org | libguestfs.org
Re: [PATCH v2 4/4] nbd: Enable multi-conn using round-robin
On Mon, Apr 28, 2025 at 01:46:47PM -0500, Eric Blake wrote: [...] This all looks similar to when I posted it before. However I noted a couple of problems ... > (XXX) The strategy here is very naive. Firstly if you were going to > open them, you'd probably want to open them in parallel. Secondly it > would make sense to delay opening until multiple parallel requests are > seen (perhaps above some threshold), so that simple or shortlived NBD > operations do not require multiple connections to be made. > (XXX) This uses a naive round-robin approach which could be improved. > For example we could look at how many requests are in flight and > assign operations to the connections with fewest. Or we could try to > estimate (based on size of requests outstanding) the load on each > connection. But this implementation doesn't do any of that. Plus there was a third rather more fundamental problem that apparently I didn't write about. That is that connections were serialised on a single thread (called from many coroutines). This bottleneck meant that there wasn't very much advantage to multi-conn, compared to what we get in libnbd / nbdcopy. Are these fixed / planned to be fixed, especially the third? Rich. -- Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones Read my programming and virtualization blog: http://rwmj.wordpress.com virt-top is 'top' for virtual machines. Tiny program with many powerful monitoring features, net stats, disk stats, logging, etc. http://people.redhat.com/~rjones/virt-top
