Re: [Qemu-devel] [PATCH v4 08/10] block/nbd: add cmdline and qapi parameter reconnect-delay
05.01.2019 1:25, Eric Blake wrote: > On 7/31/18 12:30 PM, Vladimir Sementsov-Ogievskiy wrote: >> Reconnect will be implemented in the following commit, so for now, >> in semantics below, disconnect itself is a "serious error". >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy >> --- >> qapi/block-core.json | 12 +++- >> block/nbd-client.h | 1 + >> block/nbd-client.c | 1 + >> block/nbd.c | 16 +++- >> 4 files changed, 28 insertions(+), 2 deletions(-) >> >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 5b9084a394..cf03402ec5 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -3511,13 +3511,23 @@ >> # traditional "base:allocation" block status (see >> # NBD_OPT_LIST_META_CONTEXT in the NBD protocol) (since >> 3.0) >> # >> +# @reconnect-delay: Reconnect delay. On disconnect, nbd client tries to >> connect > > Maybe 'On unexpected disconnect', since intentional disconnect is not > unexpected. > >> +# again, until success or serious error. During first >> +# @reconnect-delay seconds of reconnecting loop all >> requests >> +# are paused and have a chance to rerun, if successful >> +# connect occures during this time. After @reconnect-delay > > occurs > >> +# seconds all delayed requests are failed and all >> following >> +# requests will be failed to (until successfull >> reconnect). > > successful > >> +# Default 300 seconds (Since 3.1) > > My delay in reviewing means this now has to be 4.0. > > I'm guessing that a delay of 0 means disable auto-reconnect. I doubt, as reconnect-delay=0 is a valid case, when all request are failed (pre-patch behavior), but we still try to reconnect in background. Any reason not to try? If any, I suggest one of the following: 1. treat absence of the option as no-reconnect-at-all. So absence and 0 would be not the same 2. use -1 value as no-reconnect-at-all 3. additional bool option "reconnect" > From a > backwards-compatibility standpoint, no auto-reconnect is more in line > with what we previously had - but from a usability standpoint, trying to > reconnect can avoid turning transient network hiccups into permanent > loss of a device to EIO errors, especially if the retry timeout is long > enough to allow an administrator to reroute the network to an > alternative server. So I'm probably okay with the default being > non-zero - but it DOES mean that where you used to get instant EIO > failures when a network connection was severed, you now have to wait for > the reconnect delay to expire, and 5 minutes can be a long wait. Since > the long delay is guest-observable, can we run into issues where a guest > that is currently used to instant EIO and total loss of the device could > instead get confused by not getting any response for up to 5 minutes, > whether or not that response eventually turns out to be EIO or a > successful recovery? > >> +++ b/block/nbd.c >> @@ -360,6 +360,18 @@ static QemuOptsList nbd_runtime_opts = { >> .help = "experimental: expose named dirty bitmap in place of " >> "block status", >> }, >> +{ >> +.name = "reconnect-delay", >> +.type = QEMU_OPT_NUMBER, >> +.help = "Reconnect delay. On disconnect, nbd client tries to" >> +"connect again, until success or serious error. During" >> +"first @reconnect-delay seconds of reconnecting loop >> all" >> +"requests are paused and have a chance to rerun, if" >> +"successful connect occures during this time. After" >> +"@reconnect-delay seconds all delayed requests are >> failed" >> +"and all following requests will be failed to (until" >> +"successfull reconnect). Default 300 seconds", > > Same typos as in qapi. > > The UI aspects look fine, now I need to review the patch series for code > issues :) > > -- Best regards, Vladimir
Re: [Qemu-devel] [PATCH v4 08/10] block/nbd: add cmdline and qapi parameter reconnect-delay
05.01.2019 1:25, Eric Blake wrote: > On 7/31/18 12:30 PM, Vladimir Sementsov-Ogievskiy wrote: >> Reconnect will be implemented in the following commit, so for now, >> in semantics below, disconnect itself is a "serious error". >> >> Signed-off-by: Vladimir Sementsov-Ogievskiy >> --- >> qapi/block-core.json | 12 +++- >> block/nbd-client.h | 1 + >> block/nbd-client.c | 1 + >> block/nbd.c | 16 +++- >> 4 files changed, 28 insertions(+), 2 deletions(-) >> >> diff --git a/qapi/block-core.json b/qapi/block-core.json >> index 5b9084a394..cf03402ec5 100644 >> --- a/qapi/block-core.json >> +++ b/qapi/block-core.json >> @@ -3511,13 +3511,23 @@ >> # traditional "base:allocation" block status (see >> # NBD_OPT_LIST_META_CONTEXT in the NBD protocol) (since >> 3.0) >> # >> +# @reconnect-delay: Reconnect delay. On disconnect, nbd client tries to >> connect > > Maybe 'On unexpected disconnect', since intentional disconnect is not > unexpected. > >> +# again, until success or serious error. During first >> +# @reconnect-delay seconds of reconnecting loop all >> requests >> +# are paused and have a chance to rerun, if successful >> +# connect occures during this time. After @reconnect-delay > > occurs > >> +# seconds all delayed requests are failed and all >> following >> +# requests will be failed to (until successfull >> reconnect). > > successful > >> +# Default 300 seconds (Since 3.1) > > My delay in reviewing means this now has to be 4.0. > > I'm guessing that a delay of 0 means disable auto-reconnect. From a > backwards-compatibility standpoint, no auto-reconnect is more in line > with what we previously had - but from a usability standpoint, trying to > reconnect can avoid turning transient network hiccups into permanent > loss of a device to EIO errors, especially if the retry timeout is long > enough to allow an administrator to reroute the network to an > alternative server. So I'm probably okay with the default being > non-zero - but it DOES mean that where you used to get instant EIO > failures when a network connection was severed, you now have to wait for > the reconnect delay to expire, and 5 minutes can be a long wait. Since > the long delay is guest-observable, can we run into issues where a guest > that is currently used to instant EIO and total loss of the device could > instead get confused by not getting any response for up to 5 minutes, > whether or not that response eventually turns out to be EIO or a > successful recovery? Hmm, do you have in mind such scenarios? Who could know? If we unsure, I'm OK to proceed with no-reconnect behavior by default. We can change the default in a separate patch later, if needed. > >> +++ b/block/nbd.c >> @@ -360,6 +360,18 @@ static QemuOptsList nbd_runtime_opts = { >> .help = "experimental: expose named dirty bitmap in place of " >> "block status", >> }, >> +{ >> +.name = "reconnect-delay", >> +.type = QEMU_OPT_NUMBER, >> +.help = "Reconnect delay. On disconnect, nbd client tries to" >> +"connect again, until success or serious error. During" >> +"first @reconnect-delay seconds of reconnecting loop >> all" >> +"requests are paused and have a chance to rerun, if" >> +"successful connect occures during this time. After" >> +"@reconnect-delay seconds all delayed requests are >> failed" >> +"and all following requests will be failed to (until" >> +"successfull reconnect). Default 300 seconds", > > Same typos as in qapi. > > The UI aspects look fine, now I need to review the patch series for code > issues :) > > -- Best regards, Vladimir
Re: [Qemu-devel] [PATCH v4 08/10] block/nbd: add cmdline and qapi parameter reconnect-delay
On 7/31/18 12:30 PM, Vladimir Sementsov-Ogievskiy wrote: > Reconnect will be implemented in the following commit, so for now, > in semantics below, disconnect itself is a "serious error". > > Signed-off-by: Vladimir Sementsov-Ogievskiy > --- > qapi/block-core.json | 12 +++- > block/nbd-client.h | 1 + > block/nbd-client.c | 1 + > block/nbd.c | 16 +++- > 4 files changed, 28 insertions(+), 2 deletions(-) > > diff --git a/qapi/block-core.json b/qapi/block-core.json > index 5b9084a394..cf03402ec5 100644 > --- a/qapi/block-core.json > +++ b/qapi/block-core.json > @@ -3511,13 +3511,23 @@ > # traditional "base:allocation" block status (see > # NBD_OPT_LIST_META_CONTEXT in the NBD protocol) (since 3.0) > # > +# @reconnect-delay: Reconnect delay. On disconnect, nbd client tries to > connect Maybe 'On unexpected disconnect', since intentional disconnect is not unexpected. > +# again, until success or serious error. During first > +# @reconnect-delay seconds of reconnecting loop all > requests > +# are paused and have a chance to rerun, if successful > +# connect occures during this time. After @reconnect-delay occurs > +# seconds all delayed requests are failed and all following > +# requests will be failed to (until successfull reconnect). successful > +# Default 300 seconds (Since 3.1) My delay in reviewing means this now has to be 4.0. I'm guessing that a delay of 0 means disable auto-reconnect. From a backwards-compatibility standpoint, no auto-reconnect is more in line with what we previously had - but from a usability standpoint, trying to reconnect can avoid turning transient network hiccups into permanent loss of a device to EIO errors, especially if the retry timeout is long enough to allow an administrator to reroute the network to an alternative server. So I'm probably okay with the default being non-zero - but it DOES mean that where you used to get instant EIO failures when a network connection was severed, you now have to wait for the reconnect delay to expire, and 5 minutes can be a long wait. Since the long delay is guest-observable, can we run into issues where a guest that is currently used to instant EIO and total loss of the device could instead get confused by not getting any response for up to 5 minutes, whether or not that response eventually turns out to be EIO or a successful recovery? > +++ b/block/nbd.c > @@ -360,6 +360,18 @@ static QemuOptsList nbd_runtime_opts = { > .help = "experimental: expose named dirty bitmap in place of " > "block status", > }, > +{ > +.name = "reconnect-delay", > +.type = QEMU_OPT_NUMBER, > +.help = "Reconnect delay. On disconnect, nbd client tries to" > +"connect again, until success or serious error. During" > +"first @reconnect-delay seconds of reconnecting loop all" > +"requests are paused and have a chance to rerun, if" > +"successful connect occures during this time. After" > +"@reconnect-delay seconds all delayed requests are > failed" > +"and all following requests will be failed to (until" > +"successfull reconnect). Default 300 seconds", Same typos as in qapi. The UI aspects look fine, now I need to review the patch series for code issues :) -- Eric Blake, Principal Software Engineer Red Hat, Inc. +1-919-301-3226 Virtualization: qemu.org | libvirt.org signature.asc Description: OpenPGP digital signature
[Qemu-devel] [PATCH v4 08/10] block/nbd: add cmdline and qapi parameter reconnect-delay
Reconnect will be implemented in the following commit, so for now, in semantics below, disconnect itself is a "serious error". Signed-off-by: Vladimir Sementsov-Ogievskiy --- qapi/block-core.json | 12 +++- block/nbd-client.h | 1 + block/nbd-client.c | 1 + block/nbd.c | 16 +++- 4 files changed, 28 insertions(+), 2 deletions(-) diff --git a/qapi/block-core.json b/qapi/block-core.json index 5b9084a394..cf03402ec5 100644 --- a/qapi/block-core.json +++ b/qapi/block-core.json @@ -3511,13 +3511,23 @@ # traditional "base:allocation" block status (see # NBD_OPT_LIST_META_CONTEXT in the NBD protocol) (since 3.0) # +# @reconnect-delay: Reconnect delay. On disconnect, nbd client tries to connect +# again, until success or serious error. During first +# @reconnect-delay seconds of reconnecting loop all requests +# are paused and have a chance to rerun, if successful +# connect occures during this time. After @reconnect-delay +# seconds all delayed requests are failed and all following +# requests will be failed to (until successfull reconnect). +# Default 300 seconds (Since 3.1) +# # Since: 2.9 ## { 'struct': 'BlockdevOptionsNbd', 'data': { 'server': 'SocketAddress', '*export': 'str', '*tls-creds': 'str', -'*x-dirty-bitmap': 'str' } } +'*x-dirty-bitmap': 'str', +'*reconnect-delay': 'uint32' } } ## # @BlockdevOptionsRaw: diff --git a/block/nbd-client.h b/block/nbd-client.h index e7bda4d3c4..ef8a6a9239 100644 --- a/block/nbd-client.h +++ b/block/nbd-client.h @@ -53,6 +53,7 @@ int nbd_client_init(BlockDriverState *bs, QCryptoTLSCreds *tlscreds, const char *hostname, const char *x_dirty_bitmap, +uint32_t reconnect_delay, Error **errp); void nbd_client_close(BlockDriverState *bs); diff --git a/block/nbd-client.c b/block/nbd-client.c index c184a9f0e9..41e6e6e702 100644 --- a/block/nbd-client.c +++ b/block/nbd-client.c @@ -1079,6 +1079,7 @@ int nbd_client_init(BlockDriverState *bs, QCryptoTLSCreds *tlscreds, const char *hostname, const char *x_dirty_bitmap, +uint32_t reconnect_delay, Error **errp) { NBDClientSession *client = nbd_get_client_session(bs); diff --git a/block/nbd.c b/block/nbd.c index 9db5eded89..b40fb5a977 100644 --- a/block/nbd.c +++ b/block/nbd.c @@ -360,6 +360,18 @@ static QemuOptsList nbd_runtime_opts = { .help = "experimental: expose named dirty bitmap in place of " "block status", }, +{ +.name = "reconnect-delay", +.type = QEMU_OPT_NUMBER, +.help = "Reconnect delay. On disconnect, nbd client tries to" +"connect again, until success or serious error. During" +"first @reconnect-delay seconds of reconnecting loop all" +"requests are paused and have a chance to rerun, if" +"successful connect occures during this time. After" +"@reconnect-delay seconds all delayed requests are failed" +"and all following requests will be failed to (until" +"successfull reconnect). Default 300 seconds", +}, { /* end of list */ } }, }; @@ -411,7 +423,9 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags, /* NBD handshake */ ret = nbd_client_init(bs, s->saddr, s->export, tlscreds, hostname, - qemu_opt_get(opts, "x-dirty-bitmap"), errp); + qemu_opt_get(opts, "x-dirty-bitmap"), + qemu_opt_get_number(opts, "reconnect-delay", 300), + errp); error: if (tlscreds) { -- 2.11.1