Peter Feiner <pe...@gridcentric.ca> writes:

>> This is not reasonable IMHO.
>>
>> I was okay with sticking a name on a ramblock, but encoding a guest PA
>> offset turns this into a supported ABI which I'm not willing to do.
>>
>> A one line change is one thing, but not a complex new option that
>> introduces an ABI only for a proprietary product that's jumping through
>> hoops to keep from contributing useful logic to QEMU.
>
> Hi Anthony,
>
> Thanks for getting back to me.
>
> Sticking a name on the ramblock file would suit our product just
> fine. Indeed, this is what we had agreed upon at the KVM forum.
> However, I submitted a more complex patch in an attempt to expose a
> more general & easy to use feature; I was trying to make a more useful
> contribution than the simple patch :-)
>
> Perhaps I can assuage your ABI concern and argue the utility of this
> patch vs the one-line version. However, if you aren't satisfied,
> please let me know and I'll resubmit the one-line version.

Yes, please submit the oneliner.

> On ABI: This patch doesn't add a new ABI. QEMU already has this ABI
> due to Xen live migration.
>
> When a Xen domain is booted, a new domain is created with an empty
> physmap. Then QEMU is launched. QEMU creates its ramblocks and, via
> memory callbacks (xen_add_to_physmap), populates Xen's physmap using
> ramblock sizes & offsets.
>
> On incoming migration, the Xen toolstack creates a new domain,
> populates its physmap, and copies RAM from the outgoing migration.
> When QEMU is launched, it populates its Xen memory model (i.e.,
> XenIOState) by reading the domain's existing physmap from xenstore.
> When QEMU creates ramblocks, the callbacks in xen-all.c _ignore_ the
> new ramblocks because their offsets are already in the physmap. If the
> new ramblocks had different sizes & offsets than those from the
> outgoing QEMU process, then QEMU's memory model would be inconsistent
> with Xen's (i.e., the physmap maintained by the hypervisor and the
> XenIOState maintained in userspace). In particular, QEMU would expect
> memory at a particular physmap offset that wouldn't have been
> populated by the Xen toolstack during live migration.

This is an internal detail between Xen and QEMU. That doesn't mean it's
a general public API. I'm fairly certain that Xen does not support
arbitrary versions of QEMU to be used as qemu-dm.

Regards,

Anthony Liguori

>
> On utility: Just adding ramblock names to backing file paths makes
> post-copy migration & cloning possible, but involves some painful VFS
> contortions, which I give a detailed example of below. On the other
> hand, these new -mem-path parameters make post-copy migration &
> cloning simple by leveraging an existing QMP command, existing
> filesystems, and kernel behavior. Put another way, the useful logic
> for memory sharing and post-copy live migration already exists in the
> kernel and a myriad of filesystems. A fairly small patch (albeit not
> one line) enables that logic in QEMU.
>
> Peter
>
> Detailed example:
>
> Suppose you have a patched QEMU that adds ramblock names to their
> backing files and you want to implement memory sharing via cloning.
> When clones come up, each of their ramblocks' backing files needs to
> contain the same data as the corresponding backing file from the
> parent (obviously you want those new backing files to somehow share
> pages and COW). The basic idea is to save the parent's ramblock files
> and arrange for the clones to open them.
>
> You can see the parent's ramblock files easily enough by looking at
> the unlinked ramblock files (e.g., /proc/pid/fd/10 is a symlink to
> /tmp/qemu_back_mem.pc.ram.WHFZYw (deleted), /proc/pid/fd/11 is a
> symlink to /tmp/qemu_back_mem.vga.vram.WT1yQW (deleted), etc.).
> Unfortunately, since they're all mapped MAP_PRIVATE, these symlinks,
> when opened, will give all zeros. So you can either implement your own
> filesystem that gives you a backdoor to the MAP_PRIVATE pages (fast
> but complicated), or you can use qemu's monitor to dump guest RAM
> (slow but works).
>
> When a clone runs and creates a new backing file using mkstemp, you
> need to arrange for that backing file to somehow contain the same data
> as the corresponding file from the parent. There is an obvious
> heuristic for determining this correspondence: parse the ramblock name
> from the child's file and use the matching file from the parent.
> Correctness aside (e.g., multiple ramblocks can have the same name,
> e.g., e1000.rom, but this is moot because the _important_ ramblocks,
> i.e., pc.ram and vga.vram, are unique in the emulated system we care
> about), implementing this heuristic is a pain. To see the file being
> created, you need to implement a custom file system. Moreover, to
> share memory with another file that's been opened MAP_PRIVATE, you
> have to implement your own VMA operations. Oye!
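
For illustration only (this is not code from either patch): a minimal
Python sketch of two points in the detailed example above. It assumes
backing files are named qemu_back_mem.<ramblock name>.<six-character
mkstemp suffix>, as in the example paths, and that /proc/pid/fd targets
for unlinked files end in " (deleted)". The function names are
hypothetical. The last function shows why reading such a file yields
zeros: stores through a MAP_PRIVATE mapping never reach the file.

```python
import mmap
import os
import tempfile

PREFIX = "qemu_back_mem."  # assumed from the example paths in this thread
DELETED = " (deleted)"     # suffix on /proc/<pid>/fd targets of unlinked files


def ramblock_name(link_target):
    """Return the ramblock name encoded in a backing-file path, or None.

    Example input: '/tmp/qemu_back_mem.pc.ram.WHFZYw (deleted)'.
    """
    path = link_target
    if path.endswith(DELETED):
        path = path[:-len(DELETED)]
    base = os.path.basename(path)
    if not base.startswith(PREFIX):
        return None
    # Drop the trailing '.XXXXXX' that mkstemp filled in.
    name, sep, suffix = base[len(PREFIX):].rpartition(".")
    if not sep or not name or len(suffix) != 6:
        return None
    return name


def match_parent_files(parent_targets, child_target):
    """Return every parent path whose ramblock name equals the child's.

    A list is returned because ramblock names are not guaranteed unique.
    """
    wanted = ramblock_name(child_target)
    if wanted is None:
        return []
    return [t for t in parent_targets if ramblock_name(t) == wanted]


def private_write_visible_in_file(size=4096):
    """Write through a MAP_PRIVATE mapping, then read the file itself.

    Returns (bytes seen through the mapping, bytes read from the file).
    """
    with tempfile.TemporaryFile() as f:
        f.truncate(size)  # the file is extended with zero bytes
        m = mmap.mmap(f.fileno(), size, flags=mmap.MAP_PRIVATE,
                      prot=mmap.PROT_READ | mmap.PROT_WRITE)
        try:
            m[:5] = b"guest"      # copy-on-write: only this mapping changes
            in_mapping = m[:5]
        finally:
            m.close()
        f.seek(0)
        in_file = f.read(5)       # the file still holds its original zeros
    return in_mapping, in_file
```

The name match is only the heuristic described above; it does nothing
to intercept the mkstemp call or to share pages between the files,
which is where the custom filesystem and VMA operations come in.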