Am 07.06.2019 um 16:27 hat Vladimir Sementsov-Ogievskiy geschrieben:
> 07.06.2019 16:26, Kevin Wolf wrote:
> > Am 07.06.2019 um 14:02 hat Vladimir Sementsov-Ogievskiy geschrieben:
> >> 07.06.2019 11:06, Kevin Wolf wrote:
> >>> Am 07.06.2019 um 05:17 hat Eric Blake geschrieben:
> >>>> On 4/11/19 12:27 PM, Vladimir Sementsov-Ogievskiy wrote:
> >>>>> +static coroutine_fn void nbd_reconnect_loop(NBDConnection *con)
> >>>>> +{
> >>>>> +    NBDClientSession *s = nbd_get_client_session(con->bs);
> >>>>> +    uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> >>>>> +    uint64_t delay_ns = s->reconnect_delay * 1000000000UL;
> >>>>
> >>>> Do we have a #define constant for nanoseconds in a second to make this
> >>>> more legible than counting '0's?
> >>>>
> >>>>> +    uint64_t timeout = 1000000000UL;     /* 1 sec */
> >>>>> +    uint64_t max_timeout = 16000000000UL; /* 16 sec */
> >>>>
> >>>> 1 * constant, 16 * constant
> >>>>
> >>>>> +
> >>>>> +    nbd_reconnect_attempt(con);
> >>>>> +
> >>>>> +    while (nbd_client_connecting(s)) {
> >>>>> +        if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
> >>>>> +            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > delay_ns)
> >>>>> +        {
> >>>>> +            s->state = NBD_CLIENT_CONNECTING_NOWAIT;
> >>>>> +            qemu_co_queue_restart_all(&s->free_sema);
> >>>>> +        }
> >>>>> +
> >>>>> +        bdrv_dec_in_flight(con->bs);
> >>>>> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
> >>>>
> >>>> Another place where I'd like someone more familiar with coroutines to
> >>>> also have a look.
> >>>
> >>> What's the exact question you'd like me to answer?
> >>>
> >>> But anyway, bdrv_dec/inc_in_flight() around the sleep looks wrong to me.
> >>> Either the operation must be waited for in drain, then you can't
> >>> decrease the counter even during the sleep. Or drain doesn't have to
> >>> consider it, then why is the counter even increased in the first place?
> >>>
> >>> The way it currently is, drain can return assuming that there are no
> >>> requests, but after the timeout the request automatically comes back
> >>> while the drain caller assumes that there is no more activity. This
> >>> doesn't look right.
> >>
> >> Hmm.
> >>
> >> You added this inc/dec around the whole lifetime of the connection
> >> coroutine in
> >>
> >> commit 5ad81b4946baf948c65cf7e1436d9b74803c1280
> >> Author: Kevin Wolf <kw...@redhat.com>
> >> Date:   Fri Feb 15 16:53:51 2019 +0100
> >>
> >>     nbd: Restrict connection_co reentrance
> >>
> >> And now I try to connect in an endless loop, with qemu_co_sleep_ns()
> >> inside. I need to give bdrv_drain a chance to complete, so I have to
> >> sometimes drop this in_flight request. But I understand your point.
> >
> > Ah, right, I see. I think it's fine to add a second point where we
> > decrease the counter (though I would add a comment telling why we do
> > this) as long as the right conditions are met.
> >
> > The right conditions are probably something like: Once drained, we
> > guarantee that we don't call any callbacks until the drained section
> > ends. In nbd_read_eof() this is true because we can't get an answer if
> > we had no request running.
> >
> > Or actually... This assumes a nice compliant server that doesn't just
> > arbitrarily send unexpected messages. Is the existing code broken if we
> > connect to a malicious server?
> >
> >> Will the following work better?
> >>
> >>     bdrv_dec_in_flight(con->bs);
> >>     qemu_co_sleep_ns(...);
> >>     while (s->drained) {
> >>         s->wait_drained_end = true;
> >>         qemu_coroutine_yield();
> >>     }
> >>     bdrv_inc_in_flight(con->bs);
> >>
> >>     ...
> >>     .drained_begin() {
> >>         s->drained = true;
> >>     }
> >>
> >>     .drained_end() {
> >>         if (s->wait_drained_end) {
> >>             s->wait_drained_end = false;
> >>             aio_co_wake(s->connection_co);
> >>         }
> >>     }
> >
> > This should make sure that we don't call any callbacks before the drain
> > section has ended.
> >
> > We probably still have a problem because nbd_client_attach_aio_context()
> > reenters the coroutine with qemu_aio_coroutine_enter(), which will cause
> > an assertion failure if co->scheduled is set. So this needs to use a
> > version that can cancel a sleep.
> >
> > I see that your patch currently simply ignores attaching a new context,
> > but then the coroutine stays in the old AioContext. Did I miss some
> > additional code that moves it to the new context somehow, or will it just
> > stay where it was if the coroutine happens to be reconnecting when the
> > AioContext was supposed to change?
>
> Hmm. Do you mean "in the latter case we do nothing" in
>
> void nbd_client_attach_aio_context(BlockDriverState *bs,
>                                    AioContext *new_context)
> {
>     NBDClientSession *client = nbd_get_client_session(bs);
>
>     /*
>      * client->connection_co is either yielded from nbd_receive_reply or
>      * from nbd_reconnect_loop(); in the latter case we do nothing
>      */
>     if (client->state == NBD_CLIENT_CONNECTED) {
>         qio_channel_attach_aio_context(QIO_CHANNEL(client->ioc), new_context);
>
>         bdrv_inc_in_flight(bs);
>
>         /*
>          * Need to wait here for the BH to run because the BH must run
>          * while the node is still drained.
>          */
>         aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
>     }
> }
>
> ?
>
> Not sure why I ignored this case. So, I should run the if() body
> unconditionally here and support interrupting the timer-sleeping coroutine
> in nbd_client_attach_aio_context_bh, yes?
Yes, I think so. We now have two places where the coroutine can be yielded:
the old one, which simply yields in a loop and can be reentered without a
problem, and the new one in a sleep. Both need to be supported when we
switch the coroutine to a new AioContext.

Kevin