On 10.02.2016 at 13:51, Pavel Dovgalyuk wrote:
> > From: Kevin Wolf [mailto:kw...@redhat.com]
> > On 10.02.2016 at 13:05, Pavel Dovgalyuk wrote:
> > > > On 09.02.2016 at 12:52, Pavel Dovgalyuk wrote:
> > > > > > From: Kevin Wolf [mailto:kw...@redhat.com]
> > > > > > But even this doesn't feel completely right, because block drivers are
> > > > > > already layered and there is no need to hardcode something optional (and
> > > > > > rarely used) in the hot code path that could just be another layer.
> > > > > >
> > > > > > I assume that you know beforehand if you want to replay something, so
> > > > > > requiring you to configure your block devices with a replay driver on
> > > > > > top of the stack seems reasonable enough.
> > > > >
> > > > > I cannot use block drivers for this. When driver functions are used, QEMU
> > > > > is already used coroutines (and probably started bottom halves).
> > > > > Coroutines make execution non-deterministic.
> > > > > That's why we have to intercept blk_aio_ functions, that are called
> > > > > deterministically.
> > > >
> > > > What does "deterministic" mean in this context, i.e. what are your exact
> > > > requirements?
> > >
> > > "Deterministic" means that the replayed execution should run exactly
> > > the same guest instructions in the same sequence, as in recording session.
> >
> > Okay. I think with this we can do better than what you have now.
> >
> > > > I don't think that coroutines introduce anything non-deterministic per
> > > > se. Depending on what you mean by it, the block layer code paths in
> > > > block.c may contain problematic code.
> > >
> > > They are non-deterministic if we need instruction-level accuracy.
> > > Thread switching (and therefore callbacks and BH execution) is
> > > non-deterministic.
> >
> > Thread switching depends on an external event (the kernel scheduler
> > deciding to switch), so agreed, if a thread switch ever influences what
> > the guest sees, that would be a problem.
> >
> > Generally, however, callbacks and BHs don't involve a thread switch at
> > all (BHs can be invoked from a different thread in theory, but we have
> > very few of those cases and they shouldn't be visible for the guest).
> > The same is true for coroutines, which are semantically equivalent to
> > callbacks.
> >
> > > In two different executions these callbacks may happen at different moments of
> > > time (counting in number of executed instructions).
> > > All operations with virtual devices (including memory, interrupt controller,
> > > and disk drive controller) should happen at deterministic moments of time
> > > to be replayable.
> >
> > Right, so let's talk about what this external non-deterministic event
> > really is.
> >
> > I think the only thing whose timing is unknown in the block layer is the
> > completion of I/O requests. This non-determinism comes from the time the
> > I/O syscalls made by the lowest layer (usually raw-posix) take.
>
> Right.
>
> > This means that we can add logic to remove the non-determinism at the
> > point of our choice between raw-posix and the guest device emulation. A
> > block driver on top is as good as anything else.
> >
> > While recording, this block driver would just pass the request to next
> > lower layer (starting a request is deterministic, so it doesn't need to
> > be logged) and once the request completes it logs it. While replaying,
> > the completion of requests is delayed until we read it in the log; if we
> > read it in the log and the request hasn't completed yet, we do a busy
> > wait for it (while(!completed) aio_poll();).
>
> I tried serializing all bottom halves and worker thread callbacks in
> previous version of the patches.
> That code was much more complicated
> and error-prone than the current version. We had to classify all bottom
> halves to recorded and non-recorded (because sometimes they are used
> for qemu's purposes, not the guest ones).
>
> However, I don't understand yet which layer do you offer as the candidate
> for record/replay? What functions should be changed?
> I would like to investigate this way, but I don't got it yet.

At the core, I wouldn't change any existing function, but introduce a new
block driver. You could copy raw_bsd.c for a start and then tweak it.

Leave out functions that you don't want to support, and add the necessary
magic to .bdrv_co_readv/writev. Something like this (can probably be
generalised for more than just reads as the part after the
bdrv_co_readv() call should be the same for reads, writes and any other
request types):

    int blkreplay_co_readv()
    {
        BlockReplayState *s = bs->opaque;
        int reqid = s->reqid++;

        bdrv_co_readv(bs->file, ...);

        if (mode == record) {
            log(reqid, time);
        } else {
            assert(mode == replay);
            /* Has the log already replayed this completion? */
            bool *done = req_replayed_list_get(reqid);
            if (done) {
                /* Yes: release the busy wait in blkreplay_run_event() */
                *done = true;
            } else {
                /* No: park this coroutine until the log event arrives */
                req_completed_list_insert(reqid, qemu_coroutine_self());
                qemu_coroutine_yield();
            }
        }
    }

    /* called by replay.c */
    int blkreplay_run_event()
    {
        if (mode == replay) {
            co = req_completed_list_get(e.reqid);
            if (co) {
                /* Request already completed: resume the parked coroutine */
                qemu_coroutine_enter(co);
            } else {
                bool done = false;
                req_replayed_list_insert(e.reqid, &done);
                /* wait synchronously for completion */
                while (!done) {
                    aio_poll();
                }
            }
        }
    }

Where we could consider changing existing code is that it might be
desirable to automatically put an instance of this block driver on top
of every block device when record/replay is used. If we don't do that,
you need to explicitly specify -drive driver=blkreplay,...

> > This model would get rid of the bdrv_drain_all() that you call
> > everywhere and therefore allow concurrent requests, giving a result that
> > is much closer to the "normal" behaviour without replay.
> >
> > > > The block layer uses bottom halves in some cases for request completion,
> > > > but not until calling into the first driver (why would they be a
> > > > problem?). What could happen is that a request is serialised and
> > > > therefore delayed in some special configurations, which sounds a lot
> > > > like what you wanted to avoid.
> > >
> > > Drivers cannot distinguish the requests from guest CPU and from
> > > savevm/loadvm.
> > > First ones have to be deterministic, because they affect guest memory,
> > > virtual disk controller, and interrupts.
> >
> > Sure they can, these are two different callbacks. But even if they
> > couldn't, making more things than necessary deterministic might be
> > wasteful, but not really harmful.
>
> Is there any universal way to check this?

No, the block layer doesn't know its caller. It's only possible in this
specific case because the guest never calls .bdrv_save_vmstate(). But as
I said, logging more requests than necessary doesn't really hurt, except
for making the log a bit larger.

Kevin