Emilio G Cota writes:

> On Fri, Sep 29, 2017 at 16:16:41 +0300, Lluís Vilanova wrote:
>> Lluís Vilanova writes:
>> [...]
>> > This was working on a much older version of instrumentation for QEMU,
>> > but I can implement something that does the first use-case point above
>> > and some filtering example (second use-case point) to see what's the
>> > performance difference.
>>
>> Ok, so here's some numbers for the discussion (booting Emilio's ARM full
>> system image that immediately shuts down):
>>
>> * Without instrumentation
>>
>>     real 0m10,099s
>>     user 0m9,876s
>>     sys  0m0,128s
>>
>> * Count number of memory access writes, by instrumenting only when they
>>   are executed
>>
>>     real 0m15,896s
>>     user 0m15,752s
>>     sys  0m0,108s
>>
>> * Counting same, but the filtering is done at translation time (i.e., not
>>   generate an execute-time callback if it's not a write)
>>
>>     real 0m11,084s
>>     user 0m10,880s
>>     sys  0m0,112s
>>
>> As Peter said, the filtering can be added into the API to take advantage
>> of this "speedup", without exposing translation vs execution time
>> callbacks.
>
> I'm not sure I understand this concept of filtering. Are you saying that in
> the first case, all memory accesses are instrumented, and then in the
> "access helper" we only call the user's callback if it's a memory write?
> And in the second case, we simply just generate a "write helper" instead
> of an "access helper". Am I understanding this correctly?

In the previous case (no filtering), the user callback is always called when a
memory access is *executed*, and the user then checks if the access mode is a
write to decide whether to increment a counter.

In this case (with filtering), a user callback is called when a memory access
is *translated*, and if the access mode is a write, the user generates a call
to a second callback that runs every time that access is executed (so it is
generated only for memory writes, the ones we care about).

Is this clearer?

>> * Counting number of executed instructions, by instrumenting the beginning
>>   of each one of them
>>
>>     real 0m24,583s
>>     user 0m24,352s
>>     sys  0m0,184s
>>
>> * Counting same, but per-TB numbers are collected at translation-time, and
>>   we only generate a per-TB execution time callback to add the
>>   corresponding number of instructions for that TB
>>
>>     real 0m11,151s
>>     user 0m10,952s
>>     sys  0m0,092s
>>
>> This really needs to expose translation vs execution time callbacks to
>> take advantage of this "speedup".
>
> Clearly instrumenting per-TB is a significant net gain. I think we should
> definitely allow instrumenters to use this option.
>
> FWIW my experiments so far show similar numbers for instrumenting each
> instruction (haven't done the per-tb yet). The difference is that I'm
> exposing to instrumenters a copy of the guest instructions (const void *data,
> size_t size). These copies are kept around until TB's are flushed.
> Luckily there seems to be very little overhead in keeping these around,
> apart from the memory overhead -- but in terms of performance, the
> necessary allocations do not induce significant overhead.

To keep this use-case simpler, I added the memory access API I posted in this
series, where instrumenters can read guest memory (more general than passing a
copy of the current instruction).

Cheers,
  Lluis
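
PS: In case it helps the discussion, here is a toy model of the four measured
configurations (writes checked at execution time, writes filtered at
translation time, per-instruction counting, per-TB counting). It is only a
sketch of the idea, not the API from this series and not QEMU code: every
name in it (ToyInsn, ToyTB, translate, execute, run_mode, the callbacks) is
hypothetical, and a "TB" is just a fixed array of four fake instructions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: none of these names are QEMU APIs. */
enum { MAX_TB_CALLS = 16 };

typedef struct {
    bool is_mem;    /* instruction performs a memory access */
    bool is_write;  /* the access is a write (meaningful only if is_mem) */
} ToyInsn;

typedef struct {
    uint64_t count;       /* the value the instrumenter is measuring */
    uint64_t exec_calls;  /* how many execution-time callbacks actually ran */
} Stats;

typedef void (*ExecCb)(const ToyInsn *insn, uint64_t arg, Stats *s);

typedef struct { ExecCb fn; const ToyInsn *insn; uint64_t arg; } EmittedCall;

/* A "translated" TB: the list of callbacks emitted at translation time. */
typedef struct { EmittedCall calls[MAX_TB_CALLS]; size_t n_calls; } ToyTB;

typedef enum { WRITES_AT_EXEC, WRITES_FILTERED, INSNS_AT_EXEC, INSNS_PER_TB } Mode;

static void emit(ToyTB *tb, ExecCb fn, const ToyInsn *insn, uint64_t arg)
{
    assert(tb->n_calls < MAX_TB_CALLS);
    tb->calls[tb->n_calls++] = (EmittedCall){ fn, insn, arg };
}

/* No filtering: runs on every memory access and checks the mode itself. */
static void on_access(const ToyInsn *insn, uint64_t arg, Stats *s)
{
    (void)arg;
    s->exec_calls++;
    if (insn->is_write) {
        s->count++;
    }
}

/* With filtering: only ever emitted for writes, so no check is needed. */
static void on_write(const ToyInsn *insn, uint64_t arg, Stats *s)
{
    (void)insn; (void)arg;
    s->exec_calls++;
    s->count++;
}

/* Per-instruction counting: one callback per executed instruction. */
static void on_insn(const ToyInsn *insn, uint64_t arg, Stats *s)
{
    (void)insn; (void)arg;
    s->exec_calls++;
    s->count++;
}

/* Per-TB counting: arg is the instruction count found at translation time. */
static void on_tb(const ToyInsn *insn, uint64_t arg, Stats *s)
{
    (void)insn;
    s->exec_calls++;
    s->count += arg;
}

/* "Translation": runs once per TB and decides which callbacks to emit. */
static ToyTB translate(const ToyInsn *insns, size_t n, Mode mode)
{
    ToyTB tb = { .n_calls = 0 };
    for (size_t i = 0; i < n; i++) {
        if (mode == WRITES_AT_EXEC && insns[i].is_mem) {
            emit(&tb, on_access, &insns[i], 0);
        } else if (mode == WRITES_FILTERED && insns[i].is_mem && insns[i].is_write) {
            emit(&tb, on_write, &insns[i], 0);
        } else if (mode == INSNS_AT_EXEC) {
            emit(&tb, on_insn, &insns[i], 0);
        }
    }
    if (mode == INSNS_PER_TB) {
        emit(&tb, on_tb, NULL, (uint64_t)n);
    }
    return tb;
}

/* "Execution": replays the emitted callbacks each time the TB runs. */
static Stats execute(const ToyTB *tb, unsigned times)
{
    Stats s = { 0, 0 };
    for (unsigned t = 0; t < times; t++) {
        for (size_t i = 0; i < tb->n_calls; i++) {
            tb->calls[i].fn(tb->calls[i].insn, tb->calls[i].arg, &s);
        }
    }
    return s;
}

/* Translate a fixed 4-instruction TB once, then execute it `times` times. */
static Stats run_mode(Mode mode, unsigned times)
{
    static const ToyInsn insns[] = {
        { false, false },  /* ALU op */
        { true,  false },  /* load   */
        { true,  true  },  /* store  */
        { true,  false },  /* load   */
    };
    ToyTB tb = translate(insns, sizeof insns / sizeof insns[0], mode);
    return execute(&tb, times);
}
```

Calling run_mode(mode, 1000) for each mode gives the same count with and
without filtering (1000 writes, 4000 instructions), but the number of
execution-time callbacks drops from 3000 to 1000 for the write counter and
from 4000 to 1000 for the instruction counter. The toy says nothing about
absolute timings; it only shows where the saved work comes from.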
