07.06.2019 16:26, Kevin Wolf wrote:
> On 07.06.2019 at 14:02, Vladimir Sementsov-Ogievskiy wrote:
>> 07.06.2019 11:06, Kevin Wolf wrote:
>>> On 07.06.2019 at 05:17, Eric Blake wrote:
>>>> On 4/11/19 12:27 PM, Vladimir Sementsov-Ogievskiy wrote:
>>>>> +static coroutine_fn void nbd_reconnect_loop(NBDConnection *con)
>>>>> +{
>>>>> +    NBDClientSession *s = nbd_get_client_session(con->bs);
>>>>> +    uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>>>> +    uint64_t delay_ns = s->reconnect_delay * 1000000000UL;
>>>>
>>>> Do we have a #define constant for nanoseconds in a second to make this
>>>> more legible than counting '0's?
>>>>
>>>>> +    uint64_t timeout = 1000000000UL;      /* 1 sec */
>>>>> +    uint64_t max_timeout = 16000000000UL; /* 16 sec */
>>>>
>>>> 1 * constant, 16 * constant
>>>>
>>>>> +
>>>>> +    nbd_reconnect_attempt(con);
>>>>> +
>>>>> +    while (nbd_client_connecting(s)) {
>>>>> +        if (s->state == NBD_CLIENT_CONNECTING_WAIT &&
>>>>> +            qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns >
>>>>> +            delay_ns)
>>>>> +        {
>>>>> +            s->state = NBD_CLIENT_CONNECTING_NOWAIT;
>>>>> +            qemu_co_queue_restart_all(&s->free_sema);
>>>>> +        }
>>>>> +
>>>>> +        bdrv_dec_in_flight(con->bs);
>>>>> +        qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
>>>>
>>>> Another place where I'd like someone more familiar with coroutines to
>>>> also have a look.
>>>
>>> What's the exact question you'd like me to answer?
>>>
>>> But anyway, bdrv_dec/inc_in_flight() around the sleep looks wrong to me.
>>> Either the operation must be waited for in drain, and then you can't
>>> decrease the counter even during the sleep; or drain doesn't have to
>>> consider it, and then why is the counter even increased in the first
>>> place?
>>>
>>> The way it currently is, drain can return assuming that there are no
>>> requests, but after the timeout the request automatically comes back
>>> while the drain caller assumes that there is no more activity. This
>>> doesn't look right.
>>
>> Hmm.
>>
>> This inc/dec around the whole lifetime of the connection coroutine is
>> what you added in
>>
>> commit 5ad81b4946baf948c65cf7e1436d9b74803c1280
>> Author: Kevin Wolf <kw...@redhat.com>
>> Date:   Fri Feb 15 16:53:51 2019 +0100
>>
>>     nbd: Restrict connection_co reentrance
>>
>> And now I try to connect in an endless loop, with qemu_co_sleep_ns()
>> inside. I need to give bdrv_drain a chance to complete, so I have to
>> sometimes drop this in_flight request. But I understand your point.
>
> Ah, right, I see. I think it's fine to add a second point where we
> decrease the counter (though I would add a comment telling why we do
> this) as long as the right conditions are met.
>
> The right conditions are probably something like: once drained, we
> guarantee that we don't call any callbacks until the drained section
> ends. In nbd_read_eof() this is true because we can't get an answer if
> we had no request running.
>
> Or actually... This assumes a nice compliant server that doesn't just
> arbitrarily send unexpected messages. Is the existing code broken if we
> connect to a malicious server?
>
>> Will the following work better?
>>
>>     bdrv_dec_in_flight(con->bs);
>>     qemu_co_sleep_ns(...);
>>     while (s->drained) {
>>         s->wait_drained_end = true;
>>         qemu_coroutine_yield();
>>     }
>>     bdrv_inc_in_flight(con->bs);
>>
>> ...
>>
>> .drained_begin() {
>>     s->drained = true;
>> }
>>
>> .drained_end() {
>>     if (s->wait_drained_end) {
>>         s->wait_drained_end = false;
>>         aio_co_wake(s->connection_co);
>>     }
>> }
>
> This should make sure that we don't call any callbacks before the drain
> section has ended.
>
> We probably still have a problem because nbd_client_attach_aio_context()
> reenters the coroutine with qemu_aio_coroutine_enter(), which will cause
> an assertion failure if co->scheduled is set. So this needs to use a
> version that can cancel a sleep.
>
> I see that your patch currently simply ignores attaching a new context,
> but then the coroutine stays in the old AioContext. Did I miss some
> additional code that moves it to the new context somehow, or will it just
> stay where it was if the coroutine happens to be reconnecting when the
> AioContext was supposed to change?
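First, about the drain interaction: to spell my proposal out as one piece
(an untested sketch; s->drained and s->wait_drained_end would be new
NBDClientSession fields, and the drain hook names are approximate):

    /* in nbd_reconnect_loop(): drop the in_flight reference around the
     * sleep so that bdrv_drain() can complete, and don't take it back
     * (i.e. don't resume any activity) while the node is drained */
    bdrv_dec_in_flight(con->bs);
    qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout);
    while (s->drained) {
        s->wait_drained_end = true;
        qemu_coroutine_yield();
    }
    bdrv_inc_in_flight(con->bs);

    static void nbd_client_drained_begin(BlockDriverState *bs)
    {
        NBDClientSession *s = nbd_get_client_session(bs);

        s->drained = true;
    }

    static void nbd_client_drained_end(BlockDriverState *bs)
    {
        NBDClientSession *s = nbd_get_client_session(bs);

        s->drained = false;
        if (s->wait_drained_end) {
            s->wait_drained_end = false;
            aio_co_wake(s->connection_co);
        }
    }

Note that .drained_end() also has to clear s->drained (my sketch above
forgot that), otherwise the while loop never terminates.

(On Eric's point: include/qemu/timer.h already has NANOSECONDS_PER_SECOND,
so delay_ns can become s->reconnect_delay * NANOSECONDS_PER_SECOND.)

As for attaching a new AioContext: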
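If so, something along these lines might work (untested sketch;
s->sleeping_co and s->reconnect_timer are made-up names, just for
illustration):

    /* timer callback: wake the reconnect coroutine when the sleep
     * expires */
    static void nbd_reconnect_timer_cb(void *opaque)
    {
        NBDClientSession *s = opaque;

        if (s->sleeping_co) {
            aio_co_wake(s->sleeping_co);
        }
    }

    /* in nbd_reconnect_loop(), instead of qemu_co_sleep_ns(): a plain
     * timer + yield leaves co->scheduled unset, so the coroutine can
     * also be entered from elsewhere without tripping the assertion */
    s->sleeping_co = qemu_coroutine_self();
    timer_mod(s->reconnect_timer,
              qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + timeout);
    qemu_coroutine_yield();
    s->sleeping_co = NULL;

    /* in the now-unconditional attach path: cancel the pending sleep
     * and kick the coroutine so it continues in the new context */
    if (s->sleeping_co) {
        timer_del(s->reconnect_timer);
        aio_co_wake(s->sleeping_co);
    }

The timer itself would also have to be recreated in new_context, and some
care is needed so that the timer callback and the attach path don't both
wake the coroutine.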
Hmm. Do you mean "in latter case we do nothing" in

void nbd_client_attach_aio_context(BlockDriverState *bs,
                                   AioContext *new_context)
{
    NBDClientSession *client = nbd_get_client_session(bs);

    /*
     * client->connection_co is either yielded from nbd_receive_reply or from
     * nbd_reconnect_loop(), in latter case we do nothing
     */
    if (client->state == NBD_CLIENT_CONNECTED) {
        qio_channel_attach_aio_context(QIO_CHANNEL(client->ioc), new_context);

        bdrv_inc_in_flight(bs);

        /*
         * Need to wait here for the BH to run because the BH must run while
         * the node is still drained.
         */
        aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs);
    }
}

?

Not sure why I ignored this case. So, I should run the if() body
unconditionally here and support interrupting the timer-sleeping coroutine
in nbd_client_attach_aio_context_bh, yes?

-- 
Best regards,
Vladimir