This is a pretty big patch, but I've done my best to make sense of it. Here's a reasonably thorough first pass review.
First off: I've tested it; it seems to broadly work as advertised!

Tested-by: Phil Dennis-Jordan <p...@philjordan.eu>

Second: Nice work on figuring out the undocumented bits for the
vmapple-related things!

General comments/questions on code:

- This is alluded to in the comment about
PGDeviceDescriptorExt/PGIOSurfaceHostDeviceDescriptor/PGIOSurfaceHostDevice:
PV graphics support is useful for running an x86-64 macOS host/guest
combination as well as aarch64/vmapple. It would be more useful to have a
generic device class as well as subclasses with (a) the vmapple-specific
concrete implementation, and (b) the PCI interface for x86-64. Obviously we
could merge this patch roughly as it stands and tease out the split later…
At minimum though, the device type/name for the vmapple-specific tweaks
should I think include 'vmapple' so we can separate out a generic one later
without renaming things.

(Disclosure: I have something of a vested interest in this, as I've got a
working x86-64 PV Graphics patch set which I've been off-and-on cleaning up
for upstreaming, but your patch has beaten me to it. And yes, this also
means I have bandwidth for helping out with making it happen.)

- I'm pretty sure there are some threading issues with this code. I've
commented inline where I suspect the issues. Essentially, it comes down to
ParavirtualizedGraphics.framework using Apple's GCD dispatch queues,
regardless of what constraints its calling code might have. The display
object's handler blocks fire on the specified queue; the PGDevice's handler
blocks make no guarantees about reentrancy or what threads will be used.

Finally, I have what turned out to be lots of comments, questions, notes,
etc. of varying degrees of nit-pickiness, inline in the patch below.

Cheers,
Phil

On Wed, Aug 30, 2023 at 6:18 PM Alexander Graf <g...@amazon.com> wrote:

> +#define MAX_MRS 512

This appears to be unused.
> +
> +static const PGDisplayCoord_t apple_gfx_modes[] = {
> +    { .x = 1440, .y = 1080 },
> +    { .x = 1280, .y = 1024 },
> +};

Any specific reason for choosing these modes and not some more… conventional
ones?

> + * We have to map PVG memory into our address space. Use the one below
> + * as base start address. In normal linker setups it points to a free
> + * memory range.
> + */
> +#define APPLE_GFX_BASE_VA ((void *)(uintptr_t)0x500000000000UL)

Hard-coding this and hoping for the best seems unnecessary? We can just
reserve a range of address space for each task on demand
(mach_vm_allocate()), with no risk of collisions. This doesn't add any
complexity, and mach_vm_remap() will still work.

> +typedef struct AppleGFXTask {
> +    QTAILQ_ENTRY(AppleGFXTask) node;
> +    void *mem;
> +    uint64_t len;
> +} AppleGFXTask;

Using the name PGTask_s for the struct, as used by the PV Graphics
framework, would save a bunch of casting. Any particular reason not to use
that name? (We could typedef it to something matching Qemu's naming
conventions if that's the concern.)

> +typedef QTAILQ_HEAD(, AppleGFXTask) AppleGFXTaskList;
> +
> +typedef struct AppleGFXState {
> +    /* Private */
> +    SysBusDevice parent_obj;

Some reviews on the other patches in the series commented on this - I guess
this way of defining QOM types no longer matches Qemu convention.

> +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +    uint64_t res = 0;
> +
> +    switch (offset) {
> +    default:
> +        res = [s->pgdev mmioReadAtOffset:offset];
> +        break;
> +    }

This switch block looks like it might have had a purpose once, but is no
longer needed.
> +
> +    trace_apple_gfx_read(offset, res);
> +
> +    return res;
> +}
> +
> +static void apple_gfx_write(void *opaque, hwaddr offset, uint64_t val, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +
> +    trace_apple_gfx_write(offset, val);
> +
> +    qemu_mutex_unlock_iothread();
> +    [s->pgdev mmioWriteAtOffset:offset value:val];
> +    qemu_mutex_lock_iothread();
> +}

Is a momentary unlock like that safe to do here? Does the calling code
expect the lock to continue being held?

> +static uint64_t apple_iosfc_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +    uint64_t res = 0;
> +
> +    qemu_mutex_unlock_iothread();
> +    res = [s->pgiosfc mmioReadAtOffset:offset];
> +    qemu_mutex_lock_iothread();

As above.

> +
> +    trace_apple_iosfc_read(offset, res);
> +
> +    return res;
> +}
> +

[…]

> +static void apple_gfx_fb_update_display(void *opaque)
> +{
> +    AppleGFXState *s = opaque;
> +
> +    if (!s->new_frame || !s->handles_frames) {
> +        return;
> +    }
> +
> +    s->new_frame = false;
> +
> +    BOOL r;
> +    uint32_t width = surface_width(s->surface);
> +    uint32_t height = surface_height(s->surface);
> +    MTLRegion region = MTLRegionMake2D(0, 0, width, height);
> +    id<MTLCommandQueue> commandQueue = [s->mtl newCommandQueue];
> +    id<MTLCommandBuffer> mipmapCommandBuffer = [commandQueue commandBuffer];
> +
> +    r = [s->pgdisp encodeCurrentFrameToCommandBuffer:mipmapCommandBuffer
> +                                              texture:s->texture
> +                                               region:region];
> +
> +    if (r != YES) {
> +        return;
> +    }
> +
> +    id<MTLBlitCommandEncoder> blitCommandEncoder = [mipmapCommandBuffer blitCommandEncoder];
> +    [blitCommandEncoder generateMipmapsForTexture:s->texture];

It shouldn't be necessary to generate a mipmap for the rendered frame,
especially as it's just mip level 0 being read back to a system memory
buffer below.
> +    [blitCommandEncoder endEncoding];
> +    [mipmapCommandBuffer commit];
> +    [mipmapCommandBuffer waitUntilCompleted];

Instead of waiting synchronously for the GPU to do its thing (while holding
the BQL), this would be a textbook case for using .gfx_update_async = true
in the GraphicHwOps. Perhaps an improvement for a future patch though.

> +    [s->texture getBytes:s->vram bytesPerRow:(width * 4)
> +           bytesPerImage:(width * height * 4)
> +              fromRegion:region
> +             mipmapLevel:0
> +                   slice:0];
> +
> +    /* Need to render cursor manually if not supported by backend */
> +    if (!dpy_cursor_define_supported(s->con) && s->cursor && s->cursor_show) {
> +        pixman_image_t *image =
> +            pixman_image_create_bits(PIXMAN_a8r8g8b8,
> +                                     s->cursor->width,
> +                                     s->cursor->height,
> +                                     (uint32_t *)s->cursor->data,
> +                                     s->cursor->width * 4);
> +
> +        pixman_image_composite(PIXMAN_OP_OVER,
> +                               image, NULL, s->surface->image,
> +                               0, 0, 0, 0, s->pgdisp.cursorPosition.x,
> +                               s->pgdisp.cursorPosition.y, s->cursor->width,
> +                               s->cursor->height);
> +
> +        pixman_image_unref(image);
> +    }
> +
> +    dpy_gfx_update_full(s->con);
> +
> +    [commandQueue release];

I notice the command buffer isn't being released. I assume that has
autorelease semantics?
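For reference, the async variant could look roughly like the following. This is an untested sketch, not a drop-in patch: it assumes the readback moves into Metal's command-buffer completion handler, and the BH function name apple_gfx_update_done_bh is made up for illustration (aio_bh_schedule_oneshot() and graphic_hw_update_done() are the existing QEMU APIs I'd expect to use here).

```objc
/* BH body: runs on the main loop thread with the BQL held. */
static void apple_gfx_update_done_bh(void *opaque)
{
    AppleGFXState *s = opaque;
    dpy_gfx_update_full(s->con);
    graphic_hw_update_done(s->con);
}

static const GraphicHwOps apple_gfx_fb_ops = {
    .gfx_update       = apple_gfx_fb_update_display,
    .gfx_update_async = true,
};

/* In apple_gfx_fb_update_display(), instead of waitUntilCompleted: */
[mipmapCommandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    /* GPU work is done; copy out the frame, then signal the console
     * from the main loop rather than from GCD's completion thread. */
    [s->texture getBytes:s->vram bytesPerRow:(width * 4)
           bytesPerImage:(width * height * 4)
              fromRegion:region
             mipmapLevel:0
                   slice:0];
    aio_bh_schedule_oneshot(qemu_get_aio_context(),
                            apple_gfx_update_done_bh, s);
}];
[mipmapCommandBuffer commit];
```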
> +}
> +
> +static const GraphicHwOps apple_gfx_fb_ops = {
> +    .gfx_update = apple_gfx_fb_update_display,
> +};
> +
> +static void update_cursor(AppleGFXState *s)
> +{
> +    dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x,
> +                  s->pgdisp.cursorPosition.y, s->cursor_show);
> +
> +    /* Need to render manually if cursor is not natively supported */
> +    if (!dpy_cursor_define_supported(s->con)) {
> +        s->new_frame = true;
> +    }
> +}
> +
> +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t height)
> +{
> +    void *vram = g_malloc0(width * height * 4);
> +    void *old_vram = s->vram;
> +    DisplaySurface *surface;
> +    MTLTextureDescriptor *textureDescriptor;
> +    id<MTLTexture> old_texture = s->texture;

Accesses to various fields including s->vram and s->texture, above and
below, appear not to be serialised with their uses in the gfx update
callback. So although mode setting is a fairly rare occurrence, these
accesses can race and cause use-after-free, etc.

> +
> +    if (s->surface &&
> +        width == surface_width(s->surface) &&
> +        height == surface_height(s->surface)) {
> +        return;
> +    }
> +
> +    surface = qemu_create_displaysurface_from(width, height, PIXMAN_LE_a8r8g8b8,
> +                                              width * 4, vram);
> +    s->surface = surface;
> +    dpy_gfx_replace_surface(s->con, surface);
> +    s->vram = vram;
> +    g_free(old_vram);
> +
> +    textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
> +                                                                            width:width
> +                                                                           height:height
> +                                                                        mipmapped:NO];
> +    textureDescriptor.usage = s->pgdisp.minimumTextureUsage;
> +    s->texture = [s->mtl newTextureWithDescriptor:textureDescriptor];
> +
> +    if (old_texture) {
> +        [old_texture release];
> +    }
> +}
> +
> +
> +static void apple_gfx_realize(DeviceState *dev, Error **errp)

This function keeps going and going and going, but performs a bunch of
distinct operations. Perhaps split these out into separate functions?
> +{
> +    AppleGFXState *s = APPLE_GFX(dev);
> +    PGDeviceDescriptor *desc = [PGDeviceDescriptor new];
> +    PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new];
> +    PGIOSurfaceHostDeviceDescriptor *iosfc_desc = [PGIOSurfaceHostDeviceDescriptor new];

These 'new' calls don't seem to be matched by 'release' calls. This code is
only called once in the lifetime of a VM, so the leaks are of course minor.

> +    PGDeviceDescriptorExt *desc_ext = (PGDeviceDescriptorExt *)desc;

I don't quite understand why we need 2 pointer variables to refer to the
same object? The base class properties are accessible via a subclass
pointer.

> +    PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)];
> +    int i;
> +
> +    for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) {
> +        modes[i] = [PGDisplayMode new];

Again, no matching release.

> +        [modes[i] initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.];
> +    }
> +
> +    s->mtl = MTLCreateSystemDefaultDevice();
> +
> +    desc.device = s->mtl;
> +    desc_ext.usingIOSurfaceMapper = true;
> +
> +    desc.createTask = ^(uint64_t vmSize, void * _Nullable * _Nonnull baseAddress) {

Are we in any way certain about thread safety guarantees for this and the
other 3 "task"-related handler blocks? I've seen them called from at least
2 different dispatch queues ("com.apple.root.default-qos.overcommit" and
"FIFOQueue").
> +        AppleGFXTask *task = apple_gfx_new_task(s, vmSize);
> +        *baseAddress = task->mem;
> +        trace_apple_gfx_create_task(vmSize, *baseAddress);
> +        return (PGTask_t *)task;
> +    };
> +
> +    desc.destroyTask = ^(PGTask_t * _Nonnull _task) {
> +        AppleGFXTask *task = (AppleGFXTask *)_task;
> +        trace_apple_gfx_destroy_task(task);
> +        QTAILQ_REMOVE(&s->tasks, task, node);
> +        g_free(task);
> +    };
> +
> +    desc.mapMemory = ^(PGTask_t * _Nonnull _task, uint32_t rangeCount, uint64_t virtualOffset, bool readOnly, PGPhysicalMemoryRange_t * _Nonnull ranges) {
> +        AppleGFXTask *task = (AppleGFXTask*)_task;
> +        mach_port_t mtask = mach_task_self();
> +        trace_apple_gfx_map_memory(task, rangeCount, virtualOffset, readOnly);
> +        for (int i = 0; i < rangeCount; i++) {
> +            PGPhysicalMemoryRange_t *range = &ranges[i];
> +            MemoryRegion *tmp_mr;
> +            /* TODO: Bounds checks? r/o? */
> +            qemu_mutex_lock_iothread();
> +            AppleGFXMR *mr = apple_gfx_mapMemory(s, task, virtualOffset,
> +                                                 range->physicalAddress,
> +                                                 range->physicalLength);
> +
> +            trace_apple_gfx_map_memory_range(i, range->physicalAddress, range->physicalLength, mr->va);
> +
> +            vm_address_t target = (vm_address_t)mr->va;
> +            uint64_t mask = 0;
> +            bool anywhere = false;
> +            vm_address_t source = (vm_address_t)gpa2hva(&tmp_mr, mr->pa, mr->len, NULL);
> +            vm_prot_t cur_protection = 0;
> +            vm_prot_t max_protection = 0;
> +            kern_return_t retval = vm_remap(mtask, &target, mr->len, mask,
> +                                            anywhere, mtask, source, false,
> +                                            &cur_protection, &max_protection,
> +                                            VM_INHERIT_DEFAULT);
> +            trace_apple_gfx_remap(retval, source, target);
> +            g_assert(retval == KERN_SUCCESS);
> +
> +            qemu_mutex_unlock_iothread();
> +
> +            virtualOffset += mr->len;
> +        }
> +        return (bool)true;
> +    };
> +
> +    desc.unmapMemory = ^(PGTask_t * _Nonnull _task, uint64_t virtualOffset, uint64_t length) {
> +        AppleGFXTask *task = (AppleGFXTask *)_task;
> +        AppleGFXMR *mr, *next;
> +
> +        trace_apple_gfx_unmap_memory(task, virtualOffset, length);
> +
> +        qemu_mutex_lock_iothread();

Why does this handler take the lock around the whole loop while the
mapMemory one locks and unlocks per range?

> +        QTAILQ_FOREACH_SAFE(mr, &s->mrs, node, next) {

Is this a good choice of data structure for the operation? How many list
members are we expecting here, and how hot is this handler?

> +            if (mr->va >= (task->mem + virtualOffset) &&
> +                (mr->va + mr->len) <= (task->mem + virtualOffset + length)) {
> +                vm_address_t addr = (vm_address_t)mr->va;
> +                vm_deallocate(mach_task_self(), addr, mr->len);
> +                QTAILQ_REMOVE(&s->mrs, mr, node);
> +                g_free(mr);
> +            }
> +        }
> +        qemu_mutex_unlock_iothread();
> +        return (bool)true;
> +    };
> +
> +    desc.addTraceRange = ^(PGPhysicalMemoryRange_t * _Nonnull range, PGTraceRangeHandler _Nonnull handler) {
> +        /* Never saw this called. Return a bogus pointer so we catch access. */
> +        return (PGTraceRange_t *)(void *)(uintptr_t)0x4242;
> +    };
> +
> +    desc.removeTraceRange = ^(PGTraceRange_t * _Nonnull range) {
> +        /* Never saw this called. Nothing to do. */
> +    };

The "trace range" functionality is optional and, I believe, only used by
the UEFI driver for the PV graphics device. vmapple doesn't use UEFI, so
you won't see these called on aarch64 guests. It's safe not to set any
addTraceRange/removeTraceRange handlers at all; no need for dummy ones.
(Especially not ones with undefined behaviour…)

> +    s->pgdev = PGNewDeviceWithDescriptor(desc);
> +
> +    [disp_desc init];

[PGDisplayDescriptor new] earlier already implies -init. Either use new, or
the alloc/init idiom, but not both.

> +    disp_desc.name = @"QEMU display";
> +    disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A 20" display */
> +    disp_desc.queue = dispatch_get_main_queue();

Note that the GCD main queue does not map to anything well-defined in
Qemu's view of the world. (It doesn't even necessarily map to the process's
starting thread: it's just guaranteed to be serialised with regard to the
starting thread.)
With that in mind:

> +    disp_desc.newFrameEventHandler = ^(void) {
> +        trace_apple_gfx_new_frame();
> +
> +        /* Tell QEMU gfx stack that a new frame arrived */
> +        s->handles_frames = true;
> +        s->new_frame = true;

Setting new_frame on the GCD main queue and performing check-and-clear on
it in the Qemu graphics fb update callback without using atomics or locking
makes me uneasy.

> +    };
> +    disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, OSType pixelFormat) {
> +        trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y);
> +        set_mode(s, sizeInPixels.x, sizeInPixels.y);
> +    };
> +    disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, PGDisplayCoord_t hotSpot) {
> +        uint32_t bpp = glyph.bitsPerPixel;
> +        uint64_t width = glyph.pixelsWide;
> +        uint64_t height = glyph.pixelsHigh;
> +
> +        trace_apple_gfx_cursor_set(bpp, width, height);
> +
> +        if (s->cursor) {
> +            cursor_unref(s->cursor);
> +        }
> +        s->cursor = cursor_alloc(width, height);

I suspect this is not thread safe.

> +
> +        /* TODO handle different bpp */
> +        if (bpp == 32) {
> +            memcpy(s->cursor->data, glyph.bitmapData, glyph.bytesPerPlane);
> +            dpy_cursor_define(s->con, s->cursor);
> +            update_cursor(s);
> +        }
> +    };
> +    disp_desc.cursorShowHandler = ^(BOOL show) {
> +        trace_apple_gfx_cursor_show(show);
> +        s->cursor_show = show;
> +        update_cursor(s);
> +    };
> +    disp_desc.cursorMoveHandler = ^(void) {
> +        trace_apple_gfx_cursor_move();
> +        update_cursor(s);
> +    };

Note that cursorMoveHandler wasn't in every API version of the PV Graphics
framework, although it was only missing from macOS 11.0 and was added with
11.1, so I suspect the issue is fairly theoretical. (I'm not sure what
you're supposed to use instead - I guess polling cursorPosition on every
frame? In any case, setting cursorMoveHandler will crash on runtime
versions where it's not available unless wrapped in an availability check.)
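The availability check for the cursorMoveHandler point would be something like this (sketch only; the handler body is copied from the patch):

```objc
/* Only set the handler on runtimes where the property exists. */
if (@available(macOS 11.1, *)) {
    disp_desc.cursorMoveHandler = ^(void) {
        trace_apple_gfx_cursor_move();
        update_cursor(s);
    };
}
```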
> +    s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc port:0 serialNum:1234];
> +    s->pgdisp.modeList = [NSArray arrayWithObjects:modes count:ARRAY_SIZE(apple_gfx_modes)];
> +
> +    [iosfc_desc init];
> +    iosfc_desc.mapMemory = ^(uint64_t phys, uint64_t len, bool ro, void **va, void *e, void *f) {

I assume we don't really have any concrete information on the meaning of
the unused arguments?

> +        trace_apple_iosfc_map_memory(phys, len, ro, va, e, f);
> +        MemoryRegion *tmp_mr;
> +        *va = gpa2hva(&tmp_mr, phys, len, NULL);
> +        return (bool)true;
> +    };
> +