07.06.2019 16:26, Kevin Wolf wrote: > Am 07.06.2019 um 14:02 hat Vladimir Sementsov-Ogievskiy geschrieben: >> 07.06.2019 11:06, Kevin Wolf wrote: >>> Am 07.06.2019 um 05:17 hat Eric Blake geschrieben: >>>> On 4/11/19 12:27 PM, Vladimir Sementsov-Ogievskiy wrote: >>>>> +static coroutine_fn void nbd_reconnect_loop(NBDConnection *con) >>>>> +{ >>>>> + NBDClientSession *s = nbd_get_client_session(con->bs); >>>>> + uint64_t start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME); >>>>> + uint64_t delay_ns = s->reconnect_delay * 1000000000UL; >>>> >>>> Do we have a #define constant for nanoseconds in a second to make this >>>> more legible than counting '0's? >>>> >>>>> + uint64_t timeout = 1000000000UL; /* 1 sec */ >>>>> + uint64_t max_timeout = 16000000000UL; /* 16 sec */ >>>> >>>> 1 * constant, 16 * constant >>>> >>>>> + >>>>> + nbd_reconnect_attempt(con); >>>>> + >>>>> + while (nbd_client_connecting(s)) { >>>>> + if (s->state == NBD_CLIENT_CONNECTING_WAIT && >>>>> + qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - start_time_ns > >>>>> delay_ns) >>>>> + { >>>>> + s->state = NBD_CLIENT_CONNECTING_NOWAIT; >>>>> + qemu_co_queue_restart_all(&s->free_sema); >>>>> + } >>>>> + >>>>> + bdrv_dec_in_flight(con->bs); >>>>> + qemu_co_sleep_ns(QEMU_CLOCK_REALTIME, timeout); >>>> >>>> Another place where I'd like someone more familiar with coroutines to >>>> also have a look. >>> >>> What's the exact question you'd like me to answer? >>> >>> But anyway, bdrv_dec/inc_in_flight() around the sleep looks wrong to me. >>> Either the operation must be waited for in drain, then you can't >>> decrease the counter even during the sleep. Or drain doesn't have to >>> consider it, then why is the counter even increased in the first place? >>> >>> The way it currently is, drain can return assuming that there are no >>> requests, but after the timeout the request automatically comes back >>> while the drain caller assumes that there is no more activity. This >>> doesn't look right. >> >> Hmm. >> >> This ind/dec around all lifetime of connection coroutine you added in >> >> commit 5ad81b4946baf948c65cf7e1436d9b74803c1280 >> Author: Kevin Wolf <kw...@redhat.com> >> Date: Fri Feb 15 16:53:51 2019 +0100 >> >> nbd: Restrict connection_co reentrance >> >> >> And now I try to connect in endless loop, with qemu_co_sleep_ns() inside. >> I need to get a change to bdrv_drain to complete, so I have to sometimes >> drop this in_flight request. But I understand your point. > > Ah, right, I see. I think it's fine to add a second point where we > decrease the counter (though I would add a comment telling why we do > this) as long as the right conditions are met. > > The right conditions are probably something like: Once drained, we > guarantee that we don't call any callbacks until the drained section > ends. In nbd_read_eof() this is true because we can't get an answer if > we had no request running. > > Or actually... This assumes a nice compliant server that doesn't just > arbitrarily send unexpected messages. Is the existing code broken if we > connect to a malicious server? > >> Will the following work better? >> >> bdrv_dec_in_flight(con->bs); >> qemu_co_sleep_ns(...); >> while (s->drained) { >> s->wait_drained_end = true; >> qemu_coroutine_yield(); >> } >> bdrv_inc_in_flight(con->bs); >> >> ... >> .drained_begin() { >> s->drained = true; >> } >> >> .drained_end() { >> if (s->wait_drained_end) { >> s->wait_drained_end = false; >> aio_co_wake(s->connection_co); >> } >> } > > This should make sure that we don't call any callbacks before the drain > section has ended. > > We probably still have a problem because nbd_client_attach_aio_context() > reenters the coroutine with qemu_aio_coroutine_enter(), which will cause > an assertion failure if co->scheduled is set. So this needs to use a > version that can cancel a sleep. > > I see that your patch currently simply ignores attaching a new context, > but then the coroutine stays in the old AioContext. Did I miss some > additional code that moves it to the new context somehow or will it just > stay where it was if the coroutine happens to be reconnecting when the > AioContext was supposed to change?
Hmm. Do you mean "in latter case we do nothing" in void nbd_client_attach_aio_context(BlockDriverState *bs, AioContext *new_context) { NBDClientSession *client = nbd_get_client_session(bs); /* * client->connection_co is either yielded from nbd_receive_reply or from * nbd_reconnect_loop(), in latter case we do nothing */ if (client->state == NBD_CLIENT_CONNECTED) { qio_channel_attach_aio_context(QIO_CHANNEL(client->ioc), new_context); bdrv_inc_in_flight(bs); /* * Need to wait here for the BH to run because the BH must run while the * node is still drained. */ aio_wait_bh_oneshot(new_context, nbd_client_attach_aio_context_bh, bs); } } ? Not sure why I ignored this case. So, I should run if() body unconditionally here and support interrupting timer-sleeping coroutine in nbd_client_attach_aio_context_bh, yes? -- Best regards, Vladimir