On Nov 15 22:36, Alex Bennée wrote: > > Aaron Lindsay <aa...@os.amperecomputing.com> writes: > > > Hello, > > > > I have been wrestling with what might be a bug in the plugin memory > > callbacks. The immediate error is that I hit the > > `g_assert_not_reached()` in the 'default:' case in > > qemu_plugin_vcpu_mem_cb, indicating the callback type was invalid. When > > breaking on this assertion in gdb, the contents of cpu->plugin_mem_cbs > > are obviously bogus (`len` was absurdly high, for example). After doing > > some further digging/instrumenting, I eventually found that > > `free_dyn_cb_arr(void *p, ...)` is being called shortly before the > > assertion is hit with `p` pointing to the same address as > > `cpu->plugin_mem_cbs` will later hold at assertion-time. We are freeing > > the memory still pointed to by `cpu->plugin_mem_cbs`. > > > > I believe the code *should* always reset `cpu->plugin_mem_cbs` to NULL at > > the > > end of an instruction/TB's execution, so its not exactly clear to me how > > this > > is occurring. However, I suspect it may be relevant that we are calling > > `free_dyn_cb_arr()` because my plugin called `qemu_plugin_reset()`. > > Hmm I'm going to have to remind myself about how this bit works. > > > > > I have additionally found that the below addition allows me to run > > successfully > > without hitting the assert: > > > > diff --git a/plugins/core.c b/plugins/core.c > > --- a/plugins/core.c > > +++ b/plugins/core.c > > @@ -427,9 +427,14 @@ static bool free_dyn_cb_arr(void *p, uint32_t h, void > > *userp) > > > > void qemu_plugin_flush_cb(void) > > { > > + CPUState *cpu; > > qht_iter_remove(&plugin.dyn_cb_arr_ht, free_dyn_cb_arr, NULL); > > qht_reset(&plugin.dyn_cb_arr_ht); > > > > + CPU_FOREACH(cpu) { > > + cpu->plugin_mem_cbs = NULL; > > + } > > + > > This is essentially qemu_plugin_disable_mem_helpers() but for all CPUs. > I think we should be able to treat the CPUs separately.
I agree it's similar to qemu_plugin_disable_mem_helpers(), but for all CPUs. Though a perhaps important distinction is that its occurring unconditionally in conjunction with the event which flushes the TBs and frees the callback arrays. Isn't the code calling into qemu_plugin_flush_cb() already clearing TBs for all CPUs? Can you help me understand why treating the CPUs separately would be better? > > plugin_cb__simple(QEMU_PLUGIN_EV_FLUSH); > > } > > > > Unfortunately, the workload/setup I have encountered this bug with are > > difficult to reproduce in a way suitable for sharing upstream (admittedly > > potentially because I do not fully understand the conditions necessary to > > trigger it). It is also deep into a run > > How many full TB flushes have there been? You only see > qemu_plugin_flush_cb when we flush whole translation buffer (which is > something we do more often when plugins exit). There have been maybe hundreds of TB flushes at this point (I, erm, use qemu_plugin_reset() somewhat liberally in this plugin). I believe it is the most recent such flush that is problematic - I observe the call to free_dyn_cb_arr() mentioned above as a result of it. > Does lowering tb-size make it easier to hit the failure mode? Hrm, interesting, I have not tried that. I'll poke at that if the rr debug doesn't pan out. > > , and I haven't found a good way > > to break in gdb immediately prior to it happening in order to inspect > > it, without perturbing it enough such that it doesn't happen... > > This is exactly the sort of thing rr is great for. Can you trigger it in > that? > > https://rr-project.org/ I had not used rr before, thanks for the push to do so! I have, so far, discovered the following timeline: 1. My plugin receives a instruction execution callback for a load instruction. At this time, cpu->plugin_mem_cbs points to the same memory which will later be freed 2. During the handling of this callback, my plugin calls qemu_plugin_reset() 3. Ostensibly something goes wrong here with the cleanup of cpu->plugin_mem_cbs??? 4. Step 2 triggers the TBs to be flushed, which frees the memory pointed to by cpu->plugin_mem_cbs Since I have this nicely recorded now with rr, it ought to be easier to poke at, though I admit I'm not entirely sure how to poke at the generated code to see what's going wrong (i.e. why wouldn't the tb exit stuff be clearing this pointer like normal?). > > I welcome any feedback or insights on how to further nail down the > > failure case and/or help in working towards an appropriate solution. -Aaron