[Qemu-devel] [RFC 00/48] Plugin support

Emilio G. Cota Thu, 25 Oct 2018 11:06:27 -0700

For those of you who need some context: "plugins" are dynamic
libraries that are loaded at run-time. These plugins can
subscribe to interesting events (e.g. instruction execution)
via an API, to then do something interesting with them. This
functionality is similar to what other instrumentation tools (e.g.
Pin and DynamoRIO) provide, although since QEMU is a full-system
emulator, we have some additional features.


As an example application, I've been using this plugin implementation
for the last year or so to implement a parallel computer simulator
that uses QEMU as its execution frontend.

The key features of this plugin implementation are:

- Support for an arbitrary number of plugins

- Focus on speed. "Dynamic" callbacks are used for frequent events,
  such as memory callbacks, to call the plugin code directly, i.e.
  without going through an intermediate helper. This provides
  an average 1.33x speedup for SPEC06 over using helpers with a list
  of subscribers, and it becomes more important as more subscribers
  are added. I can share more detailed numbers if you want them.

- Instruction-granularity instrumentation. Getting callbacks
  on *all* TBs/mem accesses/instructions is not flexible. Consider
  a plugin that just wants to get callbacks on the specific memory
  accesses of a set of instructions (e.g. cmpxchg); the API
  must provide a way for the plugin to subscribe to those events
  *only*, instead of giving it all events (e.g. all mem accesses)
  for the plugin to then discard 99.9% of them.

- 2-pass translation. Once a "TB translation" callback is called,
  the plugin must know the span of the TB. We should not
  force plugins to guess where the TB will end; that is strictly
  QEMU's job, and can change at any time. A TB is thus a sequence
  of instructions of whatever length the particular QEMU
  implementation decides. Thus, for each TB, a 3-step process
  is followed: (1) the plugin layer keeps a copy of the contents
  of the current TB, (2) once the TB is well-defined, its
  descriptor and contents are passed to plugins, which then
  register their desired instrumentation (e.g. "call me back
  on this particular instruction", or "call me back when
  the whole TB executes"); note that plugins can use a disassembler
  like capstone to decide what to do with each instruction; they
  can also allocate memory and then get a pointer to it passed
  back from the callbacks. And finally, (3) the target translator
  is called again to generate the final instrumented translated TB.
  This is what I call "2-pass translation", since we go
  twice over the translation loop in translator.c. Note that the
  2-pass approach has virtually no overhead (0.40% for SPEC06int);
  translation is much cheaper than execution. But anyway, if no
  plugins have subscribed to TB translation, we only do one pass.

- Support for inlining instrumentation. This is done via an
  explicit API, i.e. we do not export TCG ops, which are internal
  to QEMU. For now, I just have support for incrementing a u64
  with an immediate, e.g. to increment a counter.

- Treating the plugins as "malicious", in that we don't export
  any pointers to key QEMU data structures (CPUState, TB).
  I implemented this after a comment from Stefan, but maybe it is
  a bit overkill.

- Other features that go beyond passively getting callbacks (I need
  these for the simulator):
  + Control of the virtual clock from plugins
  + CPU lockstep execution, where plugins decide when CPUs must
    synchronize to reduce their execution skew. This can be understood
    as a "parallel icount" mode, although plugins can decide to
    synchronize whenever they want, not only after a certain number of
    instructions have executed. For instance, I am using this to
    synchronize CPUs every X number of simulated cycles, thereby
    having the ability to limit skew while maintaining parallelism.
    When a CPU is idle, then we assume its "execution window" (aka
    "time slice") has expired.
  + Guest hooks. Instead of using "magic" instructions, export a
    PCI device and let plugins determine what encoding to follow.
    I'm using this to mark regions of interest in guest programs,
    so that in the simulator I start/stop recording simulation events.

- Things I haven't included here:
  + Ability to emulate devices from plugins. I'm using this to
    simulate peripherals. These are devices whose timing is important
    to overall performance (e.g. 'accelerators' to which the main
    CPU offloads computation, e.g. a JPEG encoder).
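
The 3-step flow and the selective subscription described in the list
above can be illustrated with a toy model in C. This is only a sketch
under stated assumptions: every name in it (toy_tb, toy_tb_record,
toy_plugin_tb_trans, toy_tb_exec, TOY_OP_CMPXCHG) is hypothetical and
is not the API proposed in this series, and "execution" is simulated
by walking an array rather than by running generated code. It shows a
plugin seeing the complete TB, asking for a callback on one kind of
instruction only, and requesting an inline add of an immediate to a
u64 counter.

```c
/* Toy model only: none of these names come from QEMU or this series. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TB_MAX_INSNS 8
#define TOY_OP_CMPXCHG   0xb1   /* arbitrary marker for "cmpxchg" */

typedef void (*toy_insn_cb)(uint64_t vaddr, void *udata);

struct toy_insn {
    uint64_t vaddr;
    uint8_t opcode;         /* stand-in for the instruction bytes */
    toy_insn_cb cb;         /* NULL: no callback requested */
    void *udata;            /* plugin-owned pointer passed back to cb */
    uint64_t *inline_ctr;   /* NULL: no inline add requested */
    uint64_t inline_imm;    /* immediate added to *inline_ctr */
};

struct toy_tb {
    size_t n_insns;
    struct toy_insn insns[TOY_TB_MAX_INSNS];
};

/* Step 1: during the first pass, keep a copy of the TB's contents. */
static void toy_tb_record(struct toy_tb *tb, uint64_t vaddr, uint8_t op)
{
    assert(tb->n_insns < TOY_TB_MAX_INSNS);
    tb->insns[tb->n_insns++] =
        (struct toy_insn){ .vaddr = vaddr, .opcode = op };
}

/* Callback run only for the instructions the plugin subscribed to. */
static void toy_count_cb(uint64_t vaddr, void *udata)
{
    (void)vaddr;
    ++*(uint64_t *)udata;
}

/* Step 2: the plugin sees the well-defined TB and registers its
 * instrumentation, here on cmpxchg-like instructions only. */
static void toy_plugin_tb_trans(struct toy_tb *tb, uint64_t *n_cmpxchg,
                                uint64_t *n_insns_executed)
{
    for (size_t i = 0; i < tb->n_insns; i++) {
        if (tb->insns[i].opcode == TOY_OP_CMPXCHG) {
            tb->insns[i].cb = toy_count_cb;
            tb->insns[i].udata = n_cmpxchg;
        }
    }
    if (tb->n_insns > 0) {
        /* Inline op: add the TB length to a u64 on each execution. */
        tb->insns[0].inline_ctr = n_insns_executed;
        tb->insns[0].inline_imm = tb->n_insns;
    }
}

/* Step 3 stand-in: the real second pass would emit instrumented code;
 * here we just walk the TB and fire what was registered. */
static void toy_tb_exec(const struct toy_tb *tb)
{
    for (size_t i = 0; i < tb->n_insns; i++) {
        const struct toy_insn *insn = &tb->insns[i];
        if (insn->inline_ctr) {
            *insn->inline_ctr += insn->inline_imm;
        }
        if (insn->cb) {
            insn->cb(insn->vaddr, insn->udata);
        }
    }
}

/* Builds a 3-instruction TB, instruments it and executes it twice.
 * Returns the callback count; stores the inline counter in *n_insns. */
uint64_t toy_demo(uint64_t *n_insns)
{
    struct toy_tb tb = { 0 };
    uint64_t n_cmpxchg = 0;

    *n_insns = 0;
    toy_tb_record(&tb, 0x1000, 0x90);
    toy_tb_record(&tb, 0x1001, TOY_OP_CMPXCHG);
    toy_tb_record(&tb, 0x1004, 0x90);
    toy_plugin_tb_trans(&tb, &n_cmpxchg, n_insns);
    toy_tb_exec(&tb);
    toy_tb_exec(&tb);
    return n_cmpxchg;
}
```

Calling toy_demo() yields 2 callbacks (one per execution of the single
cmpxchg-like instruction) and an inline counter of 6 (3 instructions,
executed twice). The other two instructions never reach plugin code,
which is the point of subscribing at instruction granularity.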

The design I'm showing here shares nothing with the tracing infrastructure.
While it is true that some features (e.g. syscall callbacks) are
identical, some others (instruction-granularity instrumentation,
2-pass translation, lockstep execution) are not. So I'm open to
discussing where we could save code (e.g. having a single trace+plugin
generator for syscalls), as long as performance and/or the
ability to instrument aren't compromised.

Peter: I remember you asked for an API first. I am including that as
a single patch in patch 14; see also patches 40, 45 and 47.

The first 10 or so patches in the series are preliminary work,
including the support of runtime TCG helpers. I think a subset
of this could be in a proper patch series, particularly the
xxhash patches. Then I've added plugin-related patches, trying
to break down my original 80-or-so patches into something
a little easier to review. The "core" plugin code is perhaps the last
place to look, because when it is added nothing is calling it yet.
The last patch in the series adds some example plugins just for
discussion's sake.

This series applies on top of my cpu-lock-v4 series. You can fetch
it from:
  https://github.com/cota/qemu/tree/plugin

Cheers,

                Emilio
