Alex Bennée <alex.ben...@linaro.org> writes:

> Pavel Dovgalyuk <dovga...@ispras.ru> writes:
>
>>> From: mttcg-requ...@listserver.greensocs.com
>>> [mailto:mttcg-requ...@listserver.greensocs.com]
>>> Pavel Dovgalyuk <dovga...@ispras.ru> writes:
>>> >> From: mttcg-requ...@listserver.greensocs.com
>>> >> [mailto:mttcg-requ...@listserver.greensocs.com]
>>> >> Pavel Dovgalyuk <dovga...@ispras.ru> writes:
>>> >> >> From: mttcg-requ...@listserver.greensocs.com
>>> >> >> [mailto:mttcg-requ...@listserver.greensocs.com]
>>> >> >> Pavel Dovgalyuk <dovga...@ispras.ru> writes:
>>> <snip>
>>> >> >> > I tested on vexpress-a9 platform with Debian wheezy.
>>> >> >>
>>> >> >> Thanks for that. I now have a test case that I can reproduce
>>> >> >> failures on without needing graphics.
>>> >> >>
>>> >> >> I've been investigating if there are any problems with the timer
>>> >> >> processing now they have been moved into the TCG thread. The record
>>> >> >> stage seems to work fine but I'm having difficulty figuring out why
>>> >> >> playback freezes. It seems we get to a point where we are stuck
>>> >> >> waiting for a suspiciously exact timer deadline:
>>> >> >
>>> >> > I've encountered the following behavior at replay stage:
>>> >> > - replay takes some instructions to execute (qemu_icount += counter)
>>> >> > - virtual timer is fired
>>> >>
>>> >> This is the virtual timer based on icount not the virtual rt timer? So
>>> >
>>> > Regular virtual timer. Its value is based on icount.
>>> > virtual_rt is used for internal icount purposes.
>>>
>>> And this is where the clock warps come in? The bias brings the
>>> virtual_rt time forward because execution is waiting for some external
>>> event to fire (e.g. a timer IRQ)?
>>
>> I guess so. But bias is not updated when the vCPU works.
>> vCPU thread updates only qemu_icount which is used for virtual clock
>> calculation.
>>
>>> >> under the new scheme of being processed in the vCPU loop we should only
>>> >> fire this one once all execution is done (although you may exit the loop
>>> >> early before icount is exhausted).
>>> >
>>> > We should stop the vCPU before processing virtual timers because:
>>> > - virtual clock depends on instruction counter
>>> > - virtual timers may change virtual hardware state
>>>
>>> If we do the processing in the top of main vCPU loop this should be
>>> equivalent. The instruction counter cannot run because we haven't
>>> entered tcg_exec_cpu. We also process virtual timers in this thread
>>> outside the loop so nothing else can be poking the hardware state.
>>
>> This is how qemu worked in older version - it switched between
>> processing tasks (vCPU and timers) in one thread.
>
> The only real difference is the sequencing in the old case was protected
> by the BQL - now it's by program order.
>
>> But how we'll join this behavior with the current design and MTTCG?
>
> Ignore MTTCG for now. We don't even try and run multiple-threads when
> icount is enabled. However the change in the BQL locking is what has
> triggered the failures.
>
> Anyway I think I'm getting closer to narrowing it down. On record I see
> replay_current_step jump backwards with this:
>
> /* A common event print, called when reading or saving an event */
> static void print_event(uint8_t event)
> {
>     static int event_count;
>     static uint64_t last_step;
>     uint64_t this_step = replay_get_current_step();
>
>     fprintf(stderr, "replay: event %d is %d @ step=%#" PRIx64 "\n",
>             event_count, event, this_step);
>
>     /* The step counter should never decrease between events */
>     if (this_step < last_step) {
>         fprintf(stderr, "%s: !!! step %d went backwards "
>                 "%#" PRIx64 "=>%#" PRIx64 "!!!\n",
>                 __func__, event_count, last_step, this_step);
>         abort();
>     }
>     last_step = this_step;
>     event_count++;
> }
>
> void replay_put_event(uint8_t event)
> {
>     assert(event < EVENT_COUNT);
>     replay_put_byte(event);
>     print_event(event);
> }
>
> The jump back is fairly consistent in my case (event 66 in the vexpress
> run) but I'm fairly sure that is the bug. I can't see any reason for the
> step count to go backwards - indeed that breaks the premise of .
>
> Unfortunately when I start putting break and watchpoints in to see how
> this jump back occurs the problem goes away until I disable them. So
> this very much looks like a race condition corrupting the icount data.

So this is a symptom of two things: cpu_get_icount_raw(void) only
accounts for in-progress instructions when called in the vCPU context,
and timers_state.qemu_icount is "in credit" while the vCPU is running.

--
Alex Bennée
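
To make the effect concrete, here is a toy model of the accounting. It is
only a sketch under stated assumptions, not the QEMU code: every toy_*
name is hypothetical, and the NULL argument stands in for "the caller is
not the vCPU thread". It assumes the global count is charged with the
whole instruction budget up front and that only a reader in vCPU context
subtracts the part of the budget that has not executed yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for timers_state; only the one field we need. */
struct toy_timers_state {
    int64_t qemu_icount;    /* charged with the full budget up front */
};

/* Hypothetical stand-in for the per-vCPU budget bookkeeping. */
struct toy_cpu {
    int64_t budget_left;    /* instructions granted but not yet executed */
};

static struct toy_timers_state toy_state;

/* Grant a budget: the global count goes "into credit" immediately. */
static void toy_grant_budget(struct toy_cpu *cpu, int64_t budget)
{
    toy_state.qemu_icount += budget;
    cpu->budget_left = budget;
}

/* Execute n instructions out of the remaining budget. */
static void toy_execute(struct toy_cpu *cpu, int64_t n)
{
    assert(n <= cpu->budget_left);
    cpu->budget_left -= n;
}

/*
 * Read the raw instruction count. With a vCPU context the unexecuted
 * part of the budget is subtracted; with cpu == NULL (a reader outside
 * the vCPU thread) the full credit is still included.
 */
static int64_t toy_get_icount_raw(const struct toy_cpu *cpu)
{
    int64_t icount = toy_state.qemu_icount;

    if (cpu) {
        icount -= cpu->budget_left;
    }
    return icount;
}

/*
 * Grant 1000 instructions, execute 100, then read the count first from
 * outside and then from inside the vCPU context. The return value is
 * how far the second reading falls behind the first.
 */
static int64_t toy_demo_backwards_gap(void)
{
    struct toy_cpu cpu = { 0 };

    toy_state.qemu_icount = 0;
    toy_grant_budget(&cpu, 1000);
    toy_execute(&cpu, 100);

    return toy_get_icount_raw(NULL) - toy_get_icount_raw(&cpu);
}
```

In this model toy_demo_backwards_gap() returns 900: the outside reader
sees 1000, while the vCPU-context reader that follows sees 100. Two
events stamped in that order would show the step going backwards, which
is the pattern print_event() aborts on.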