Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-04-07 Thread Vladimir Sementsov-Ogievskiy

07.04.2021 10:45, Roman Kagan wrote:

On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:

15.03.2021 09:06, Roman Kagan wrote:

The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.

Roman Kagan (7):
block/nbd: avoid touching freed connect_thread
block/nbd: use uniformly nbd_client_connecting_wait
block/nbd: assert attach/detach runs in the proper context
block/nbd: transfer reconnection stuff across aio_context switch
block/nbd: better document a case in nbd_co_establish_connection
block/nbd: decouple reconnect from drain
block/nbd: stop manipulating in_flight counter

   block/nbd.c  | 191 +++
   nbd/client.c |   2 -
   2 files changed, 86 insertions(+), 107 deletions(-)




Hmm. The huge source of problems for this series is weird logic around
drain and aio context switch in NBD driver.

Why do we have all these too complicated logic with abuse of in_flight
counter in NBD? The answer is connection_co. NBD differs from other
drivers, it has a coroutine independent of request coroutines. And we
have to move this coroutine carefully to new aio context. We can't
just enter it from the new context, we want to be sure that
connection_co is in one of yield points that supports reentering.

I have an idea of how to avoid this thing: drop connection_co at all.

1. nbd negotiation goes to connection thread and becomes independent
of any aio context.

2. waiting for server reply goes to request code. So, instead of
reading the replay from socket always in connection_co, we read in the
request coroutine, after sending the request. We'll need a CoMutex for
it (as only one request coroutine should read from socket), and be
prepared to coming reply is not for _this_ request (in this case we
should wake another request and continue read from socket).


The problem with this approach is that it would change the reconnect
behavior.

Currently connection_co purpose is three-fold:

1) receive the header of the server response, identify the request it
pertains to, and wake the resective request coroutine

2) take on the responsibility to reestablish the connection when it's
lost

3) monitor the idle connection and initiate the reconnect as soon as the
connection is lost

Points 1 and 2 can be moved to the request coroutines indeed.  However I
don't see how to do 3 without an extra ever-running coroutine.
Sacrificing it would mean that a connection loss wouldn't be noticed and
the recovery wouldn't be attempted until a request arrived.

This change looks to me like a degradation compared to the current
state.



For 3 we can check the connection by timeout:

 - getsockopt( .. SO_ERROR .. ), which could be done from bs aio context, or 
even from reconnect-thread context

 - or, we can create a PING request: just use some request with parameters for 
which we are sure that NBD server should do no action but report some expected 
error. We can create such request by timeout when there no more requests, just 
to check that connection still works.

Note, that neither of this (and nor current [3] which is just endless read from 
socket) will work only with keep-alive set, which is not by default for now.

Anyway I think first step is splitting connect-thread out of nbd.c which is 
overcomplicated now, I'm going to send a refactoring series for this.

--
Best regards,
Vladimir



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-04-07 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).

The problem with this approach is that it would change the reconnect
behavior.

Currently connection_co purpose is three-fold:

1) receive the header of the server response, identify the request it
   pertains to, and wake the resective request coroutine

2) take on the responsibility to reestablish the connection when it's
   lost

3) monitor the idle connection and initiate the reconnect as soon as the
   connection is lost

Points 1 and 2 can be moved to the request coroutines indeed.  However I
don't see how to do 3 without an extra ever-running coroutine.
Sacrificing it would mean that a connection loss wouldn't be noticed and
the recovery wouldn't be attempted until a request arrived.

This change looks to me like a degradation compared to the current
state.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-26 Thread Roman Kagan
On Wed, Mar 17, 2021 at 11:35:31AM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> > 
> > Roman Kagan (7):
> >block/nbd: avoid touching freed connect_thread
> >block/nbd: use uniformly nbd_client_connecting_wait
> >block/nbd: assert attach/detach runs in the proper context
> >block/nbd: transfer reconnection stuff across aio_context switch
> >block/nbd: better document a case in nbd_co_establish_connection
> >block/nbd: decouple reconnect from drain
> >block/nbd: stop manipulating in_flight counter
> > 
> >   block/nbd.c  | 191 +++
> >   nbd/client.c |   2 -
> >   2 files changed, 86 insertions(+), 107 deletions(-)
> > 
> 
> 
> Hmm. The huge source of problems for this series is weird logic around
> drain and aio context switch in NBD driver.
> 
> Why do we have all these too complicated logic with abuse of in_flight
> counter in NBD? The answer is connection_co. NBD differs from other
> drivers, it has a coroutine independent of request coroutines. And we
> have to move this coroutine carefully to new aio context. We can't
> just enter it from the new context, we want to be sure that
> connection_co is in one of yield points that supports reentering.
> 
> I have an idea of how to avoid this thing: drop connection_co at all.
> 
> 1. nbd negotiation goes to connection thread and becomes independent
> of any aio context.
> 
> 2. waiting for server reply goes to request code. So, instead of
> reading the replay from socket always in connection_co, we read in the
> request coroutine, after sending the request. We'll need a CoMutex for
> it (as only one request coroutine should read from socket), and be
> prepared to coming reply is not for _this_ request (in this case we
> should wake another request and continue read from socket).
> 
> but this may be too much for soft freeze.

This approach does look appealing to me, and I gave it a quick shot but
the amount of changes this involves exceeds the rc tolerance indeed.

> Another idea:
> 
> You want all the requests be completed on drain_begin(), not
> cancelled. Actually, you don't need reconnect runnning during drained
> section for it. It should be enough just wait for all currenct
> requests before disabling the reconnect in drain_begin handler.

So effectively you suggest doing nbd's own drain within
bdrv_co_drain_begin callback.  I'm not totally sure there are no
assumptions this may break, but I'll try to look into this possibility.

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-17 Thread Vladimir Sementsov-Ogievskiy

15.03.2021 09:06, Roman Kagan wrote:

The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.

Roman Kagan (7):
   block/nbd: avoid touching freed connect_thread
   block/nbd: use uniformly nbd_client_connecting_wait
   block/nbd: assert attach/detach runs in the proper context
   block/nbd: transfer reconnection stuff across aio_context switch
   block/nbd: better document a case in nbd_co_establish_connection
   block/nbd: decouple reconnect from drain
   block/nbd: stop manipulating in_flight counter

  block/nbd.c  | 191 +++
  nbd/client.c |   2 -
  2 files changed, 86 insertions(+), 107 deletions(-)




Hmm. The huge source of problems for this series is weird logic around drain 
and aio context switch in NBD driver.

Why do we have all these too complicated logic with abuse of in_flight counter 
in NBD? The answer is connection_co. NBD differs from other drivers, it has a 
coroutine independent of request coroutines. And we have to move this coroutine 
carefully to new aio context. We can't just enter it from the new context, we 
want to be sure that connection_co is in one of yield points that supports 
reentering.

I have an idea of how to avoid this thing: drop connection_co at all.

1. nbd negotiation goes to connection thread and becomes independent of any aio 
context.

2. waiting for server reply goes to request code. So, instead of reading the 
replay from socket always in connection_co, we read in the request coroutine, 
after sending the request. We'll need a CoMutex for it (as only one request 
coroutine should read from socket), and be prepared to coming reply is not for 
_this_ request (in this case we should wake another request and continue read 
from socket).

but this may be too much for soft freeze.


Another idea:

You want all the requests be completed on drain_begin(), not cancelled. 
Actually, you don't need reconnect runnning during drained section for it. It 
should be enough just wait for all currenct requests before disabling the 
reconnect in drain_begin handler.

--
Best regards,
Vladimir



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Tue, Mar 16, 2021 at 09:41:36AM -0500, Eric Blake wrote:
> On 3/15/21 1:06 AM, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> Soft freeze is today.  I'm leaning towards declaring this series as a
> bug fix (and so give it some more soak time to get right, but still okay
> for -rc1) rather than a feature addition (and therefore would need to be
> in a pull request today).  Speak up now if this characterization is off
> base.

Yes I'd consider it a bug fix, too.  I'll do my best beating it into
shape before -rc2.

Thanks,
Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Roman Kagan
On Mon, Mar 15, 2021 at 10:45:39PM +0300, Vladimir Sementsov-Ogievskiy wrote:
> 15.03.2021 09:06, Roman Kagan wrote:
> > The reconnection logic doesn't need to stop while in a drained section.
> > Moreover it has to be active during the drained section, as the requests
> > that were caught in-flight with the connection to the server broken can
> > only usefully get drained if the connection is restored.  Otherwise such
> > requests can only either stall resulting in a deadlock (before
> > 8c517de24a), or be aborted defeating the purpose of the reconnection
> > machinery (after 8c517de24a).
> > 
> > This series aims to just stop messing with the drained section in the
> > reconnection code.
> > 
> > While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> > connection_co reentrance"); as I've missed the point of that commit I'd
> > appreciate more scrutiny in this area.
> 
> 
> The actual point is:
> 
> connection_co (together with all functions called from it) has a lot of yield 
> points. And we can't just enter the coroutine in any of the when we want, as 
> it may break some BH which is actually waited for in this yield point..
> 
> Still, we should care only about yield points possible during drained 
> section, so we don't need to care about direct qemu_coroutine_yield() inside 
> nbd_connection_entry().
> 
> Many things changed since 5ad81b4946.. So probably, now all the (possible 
> during drained section) yield points in nbd_connection_entry support 
> reentering. But some analysis of possible yield points should be done.

Thanks for the explanation.  Will do this analysis.

Roman.



Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-16 Thread Eric Blake
On 3/15/21 1:06 AM, Roman Kagan wrote:
> The reconnection logic doesn't need to stop while in a drained section.
> Moreover it has to be active during the drained section, as the requests
> that were caught in-flight with the connection to the server broken can
> only usefully get drained if the connection is restored.  Otherwise such
> requests can only either stall resulting in a deadlock (before
> 8c517de24a), or be aborted defeating the purpose of the reconnection
> machinery (after 8c517de24a).
> 
> This series aims to just stop messing with the drained section in the
> reconnection code.
> 
> While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
> connection_co reentrance"); as I've missed the point of that commit I'd
> appreciate more scrutiny in this area.

Soft freeze is today.  I'm leaning towards declaring this series as a
bug fix (and so give it some more soak time to get right, but still okay
for -rc1) rather than a feature addition (and therefore would need to be
in a pull request today).  Speak up now if this characterization is off
base.

> 
> Roman Kagan (7):
>   block/nbd: avoid touching freed connect_thread
>   block/nbd: use uniformly nbd_client_connecting_wait
>   block/nbd: assert attach/detach runs in the proper context
>   block/nbd: transfer reconnection stuff across aio_context switch
>   block/nbd: better document a case in nbd_co_establish_connection
>   block/nbd: decouple reconnect from drain
>   block/nbd: stop manipulating in_flight counter
> 
>  block/nbd.c  | 191 +++
>  nbd/client.c |   2 -
>  2 files changed, 86 insertions(+), 107 deletions(-)
> 

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.   +1-919-301-3226
Virtualization:  qemu.org | libvirt.org




Re: [PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Vladimir Sementsov-Ogievskiy

15.03.2021 09:06, Roman Kagan wrote:

The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.



The actual point is:

connection_co (together with all functions called from it) has a lot of yield 
points. And we can't just enter the coroutine in any of the when we want, as it 
may break some BH which is actually waited for in this yield point..

Still, we should care only about yield points possible during drained section, 
so we don't need to care about direct qemu_coroutine_yield() inside 
nbd_connection_entry().

Many things changed since 5ad81b4946.. So probably, now all the (possible 
during drained section) yield points in nbd_connection_entry support 
reentering. But some analysis of possible yield points should be done.



Roman Kagan (7):
   block/nbd: avoid touching freed connect_thread
   block/nbd: use uniformly nbd_client_connecting_wait
   block/nbd: assert attach/detach runs in the proper context
   block/nbd: transfer reconnection stuff across aio_context switch
   block/nbd: better document a case in nbd_co_establish_connection
   block/nbd: decouple reconnect from drain
   block/nbd: stop manipulating in_flight counter

  block/nbd.c  | 191 +++
  nbd/client.c |   2 -
  2 files changed, 86 insertions(+), 107 deletions(-)




--
Best regards,
Vladimir



[PATCH 0/7] block/nbd: decouple reconnect from drain

2021-03-15 Thread Roman Kagan
The reconnection logic doesn't need to stop while in a drained section.
Moreover it has to be active during the drained section, as the requests
that were caught in-flight with the connection to the server broken can
only usefully get drained if the connection is restored.  Otherwise such
requests can only either stall resulting in a deadlock (before
8c517de24a), or be aborted defeating the purpose of the reconnection
machinery (after 8c517de24a).

This series aims to just stop messing with the drained section in the
reconnection code.

While doing so it undoes the effect of 5ad81b4946 ("nbd: Restrict
connection_co reentrance"); as I've missed the point of that commit I'd
appreciate more scrutiny in this area.

Roman Kagan (7):
  block/nbd: avoid touching freed connect_thread
  block/nbd: use uniformly nbd_client_connecting_wait
  block/nbd: assert attach/detach runs in the proper context
  block/nbd: transfer reconnection stuff across aio_context switch
  block/nbd: better document a case in nbd_co_establish_connection
  block/nbd: decouple reconnect from drain
  block/nbd: stop manipulating in_flight counter

 block/nbd.c  | 191 +++
 nbd/client.c |   2 -
 2 files changed, 86 insertions(+), 107 deletions(-)

-- 
2.30.2