This is a pretty big patch, but I've done my best to make sense of it. Here's a reasonably thorough first pass review.
First off: I've tested it; it seems to broadly work as advertised!

Tested-by: Phil Dennis-Jordan <p...@philjordan.eu>

Second: Nice work on figuring out the undocumented bits for the
vmapple-related things!

General comments/questions on code:

- This is alluded to in the comment about
PGDeviceDescriptorExt/PGIOSurfaceHostDeviceDescriptor/PGIOSurfaceHostDevice:
PV graphics support is useful for running an x86-64 macOS host/guest
combination as well as aarch64/vmapple. It would be more useful to have a
generic device class as well as subclasses with (a) the vmapple-specific
concrete implementation, and (b) the PCI interface for x86-64. Obviously we
could merge this patch roughly as it stands and tease out the split later…
At minimum though, the device type/name for the vmapple-specific tweaks
should I think include 'vmapple' so we can separate out a generic one later
without renaming things.

(Disclosure: I have something of a vested interest in this, as I've got a
working x86-64 PV Graphics patch set which I've been off-and-on cleaning up
for upstreaming, but your patch has beaten me to it. And yes, this also
means I have bandwidth for helping out with making it happen.)

- I'm pretty sure there are some threading issues with this code. I've
commented inline where I suspect the issues. Essentially, it comes down to
ParavirtualizedGraphics.framework using Apple's GCD dispatch queues,
regardless of what constraints its calling code might have. The display
object's handler blocks fire on the specified queue; the PGDevice's handler
blocks make no guarantees about reentrancy or what threads will be used.

Finally, I have what turned out to be lots of comments, questions, notes,
etc. of varying degrees of nit-pickiness, inline in the patch below.

Cheers,
Phil

On Wed, Aug 30, 2023 at 6:18 PM Alexander Graf <g...@amazon.com> wrote:

> +#define MAX_MRS 512

This appears to be unused.
> +
> +static const PGDisplayCoord_t apple_gfx_modes[] = {
> +    { .x = 1440, .y = 1080 },
> +    { .x = 1280, .y = 1024 },
> +};

Any specific reason for choosing these modes and not some more… conventional
ones?

> + * We have to map PVG memory into our address space. Use the one below
> + * as base start address. In normal linker setups it points to a free
> + * memory range.
> + */
> +#define APPLE_GFX_BASE_VA ((void *)(uintptr_t)0x500000000000UL)

Hard-coding this and hoping for the best seems unnecessary? We can just
reserve a range of address space for each task on demand
(mach_vm_allocate()), with no risk of collisions. This doesn't add any
complexity, and mach_vm_remap() will still work.

> +typedef struct AppleGFXTask {
> +    QTAILQ_ENTRY(AppleGFXTask) node;
> +    void *mem;
> +    uint64_t len;
> +} AppleGFXTask;

Using the name PGTask_s for the struct, as used by the PV Graphics
framework, would save a bunch of casting. Any particular reason not to use
that name? (We could typedef it to something matching Qemu's naming
conventions if that's the concern.)

> +typedef QTAILQ_HEAD(, AppleGFXTask) AppleGFXTaskList;
> +
> +typedef struct AppleGFXState {
> +    /* Private */
> +    SysBusDevice parent_obj;

Some reviews on the other patches in the series commented on this - I guess
this way of defining QOM types no longer matches Qemu convention.

> +static uint64_t apple_gfx_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +    uint64_t res = 0;
> +
> +    switch (offset) {
> +    default:
> +        res = [s->pgdev mmioReadAtOffset:offset];
> +        break;
> +    }

This switch block looks like it might have had a purpose once, but is no
longer needed.
> +
> +    trace_apple_gfx_read(offset, res);
> +
> +    return res;
> +}
> +
> +static void apple_gfx_write(void *opaque, hwaddr offset, uint64_t val, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +
> +    trace_apple_gfx_write(offset, val);
> +
> +    qemu_mutex_unlock_iothread();
> +    [s->pgdev mmioWriteAtOffset:offset value:val];
> +    qemu_mutex_lock_iothread();
> +}

Is a momentary unlock like that safe to do here? Does the calling code
expect the lock to continue being held?

> +static uint64_t apple_iosfc_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +    AppleGFXState *s = opaque;
> +    uint64_t res = 0;
> +
> +    qemu_mutex_unlock_iothread();
> +    res = [s->pgiosfc mmioReadAtOffset:offset];
> +    qemu_mutex_lock_iothread();

As above.

> +
> +    trace_apple_iosfc_read(offset, res);
> +
> +    return res;
> +}
> +

[…]

> +static void apple_gfx_fb_update_display(void *opaque)
> +{
> +    AppleGFXState *s = opaque;
> +
> +    if (!s->new_frame || !s->handles_frames) {
> +        return;
> +    }
> +
> +    s->new_frame = false;
> +
> +    BOOL r;
> +    uint32_t width = surface_width(s->surface);
> +    uint32_t height = surface_height(s->surface);
> +    MTLRegion region = MTLRegionMake2D(0, 0, width, height);
> +    id<MTLCommandQueue> commandQueue = [s->mtl newCommandQueue];
> +    id<MTLCommandBuffer> mipmapCommandBuffer = [commandQueue commandBuffer];
> +
> +    r = [s->pgdisp encodeCurrentFrameToCommandBuffer:mipmapCommandBuffer
> +                                              texture:s->texture
> +                                               region:region];
> +
> +    if (r != YES) {
> +        return;
> +    }
> +
> +    id<MTLBlitCommandEncoder> blitCommandEncoder = [mipmapCommandBuffer blitCommandEncoder];
> +    [blitCommandEncoder generateMipmapsForTexture:s->texture];

It shouldn't be necessary to generate a mipmap for the rendered frame,
especially as it's just mip level 0 being read back to a system memory
buffer below.
> +    [blitCommandEncoder endEncoding];
> +    [mipmapCommandBuffer commit];
> +    [mipmapCommandBuffer waitUntilCompleted];

Instead of waiting synchronously for the GPU to do its thing (while holding
the BQL), this would be a textbook case for using .gfx_update_async = true
in the GraphicHwOps. Perhaps an improvement for a future patch though.

> +    [s->texture getBytes:s->vram bytesPerRow:(width * 4)
> +           bytesPerImage:(width * height * 4)
> +              fromRegion:region
> +             mipmapLevel:0
> +                   slice:0];
> +
> +    /* Need to render cursor manually if not supported by backend */
> +    if (!dpy_cursor_define_supported(s->con) && s->cursor && s->cursor_show) {
> +        pixman_image_t *image =
> +            pixman_image_create_bits(PIXMAN_a8r8g8b8,
> +                                     s->cursor->width,
> +                                     s->cursor->height,
> +                                     (uint32_t *)s->cursor->data,
> +                                     s->cursor->width * 4);
> +
> +        pixman_image_composite(PIXMAN_OP_OVER,
> +                               image, NULL, s->surface->image,
> +                               0, 0, 0, 0, s->pgdisp.cursorPosition.x,
> +                               s->pgdisp.cursorPosition.y, s->cursor->width,
> +                               s->cursor->height);
> +
> +        pixman_image_unref(image);
> +    }
> +
> +    dpy_gfx_update_full(s->con);
> +
> +    [commandQueue release];

I notice the command buffer isn't being released. I assume that has
autorelease semantics?
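For reference, the async variant could look roughly like the following. This is an untested sketch, not a drop-in patch: it assumes the readback moves into Metal's command-buffer completion handler, and the BH function name apple_gfx_update_done_bh is made up for illustration (aio_bh_schedule_oneshot() and graphic_hw_update_done() are the existing QEMU APIs I'd expect to use here).

```objc
/* BH body: runs on the main loop thread with the BQL held. */
static void apple_gfx_update_done_bh(void *opaque)
{
    AppleGFXState *s = opaque;
    dpy_gfx_update_full(s->con);
    graphic_hw_update_done(s->con);
}

static const GraphicHwOps apple_gfx_fb_ops = {
    .gfx_update       = apple_gfx_fb_update_display,
    .gfx_update_async = true,
};

/* In apple_gfx_fb_update_display(), instead of waitUntilCompleted: */
[mipmapCommandBuffer addCompletedHandler:^(id<MTLCommandBuffer> cb) {
    /* GPU work is done; copy out the frame, then signal the console
     * from the main loop rather than from GCD's completion thread. */
    [s->texture getBytes:s->vram bytesPerRow:(width * 4)
           bytesPerImage:(width * height * 4)
              fromRegion:region
             mipmapLevel:0
                   slice:0];
    aio_bh_schedule_oneshot(qemu_get_aio_context(),
                            apple_gfx_update_done_bh, s);
}];
[mipmapCommandBuffer commit];
```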
> +}
> +
> +static const GraphicHwOps apple_gfx_fb_ops = {
> +    .gfx_update = apple_gfx_fb_update_display,
> +};
> +
> +static void update_cursor(AppleGFXState *s)
> +{
> +    dpy_mouse_set(s->con, s->pgdisp.cursorPosition.x,
> +                  s->pgdisp.cursorPosition.y, s->cursor_show);
> +
> +    /* Need to render manually if cursor is not natively supported */
> +    if (!dpy_cursor_define_supported(s->con)) {
> +        s->new_frame = true;
> +    }
> +}
> +
> +static void set_mode(AppleGFXState *s, uint32_t width, uint32_t height)
> +{
> +    void *vram = g_malloc0(width * height * 4);
> +    void *old_vram = s->vram;
> +    DisplaySurface *surface;
> +    MTLTextureDescriptor *textureDescriptor;
> +    id<MTLTexture> old_texture = s->texture;

Accesses to various fields including s->vram and s->texture, above and
below, appear not to be serialised with their uses in the gfx update
callback. So although mode setting is a fairly rare occurrence, these
accesses can race and cause use-after-free, etc.

> +
> +    if (s->surface &&
> +        width == surface_width(s->surface) &&
> +        height == surface_height(s->surface)) {
> +        return;
> +    }
> +
> +    surface = qemu_create_displaysurface_from(width, height, PIXMAN_LE_a8r8g8b8,
> +                                              width * 4, vram);
> +    s->surface = surface;
> +    dpy_gfx_replace_surface(s->con, surface);
> +    s->vram = vram;
> +    g_free(old_vram);
> +
> +    textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
> +                                                                            width:width
> +                                                                           height:height
> +                                                                        mipmapped:NO];
> +    textureDescriptor.usage = s->pgdisp.minimumTextureUsage;
> +    s->texture = [s->mtl newTextureWithDescriptor:textureDescriptor];
> +
> +    if (old_texture) {
> +        [old_texture release];
> +    }
> +}
> +
> +
> +static void apple_gfx_realize(DeviceState *dev, Error **errp)

This function keeps going and going and going, but performs a bunch of
distinct operations. Perhaps split these out into separate functions?
> +{
> +    AppleGFXState *s = APPLE_GFX(dev);
> +    PGDeviceDescriptor *desc = [PGDeviceDescriptor new];
> +    PGDisplayDescriptor *disp_desc = [PGDisplayDescriptor new];
> +    PGIOSurfaceHostDeviceDescriptor *iosfc_desc = [PGIOSurfaceHostDeviceDescriptor new];

These 'new' calls don't seem to be matched by 'release' calls. This code is
only called once in the lifetime of a VM, so the leaks are of course minor.

> +    PGDeviceDescriptorExt *desc_ext = (PGDeviceDescriptorExt *)desc;

I don't quite understand why we need 2 pointer variables to refer to the
same object? The base class properties are accessible via a subclass
pointer.

> +    PGDisplayMode *modes[ARRAY_SIZE(apple_gfx_modes)];
> +    int i;
> +
> +    for (i = 0; i < ARRAY_SIZE(apple_gfx_modes); i++) {
> +        modes[i] = [PGDisplayMode new];

Again, no matching release.

> +        [modes[i] initWithSizeInPixels:apple_gfx_modes[i] refreshRateInHz:60.];
> +    }
> +
> +    s->mtl = MTLCreateSystemDefaultDevice();
> +
> +    desc.device = s->mtl;
> +    desc_ext.usingIOSurfaceMapper = true;
> +
> +    desc.createTask = ^(uint64_t vmSize, void * _Nullable * _Nonnull baseAddress) {

Are we in any way certain about thread safety guarantees for this and the
other 3 "task"-related handler blocks? I've seen them called from at least
2 different dispatch queues ("com.apple.root.default-qos.overcommit" and
"FIFOQueue").
> +        AppleGFXTask *task = apple_gfx_new_task(s, vmSize);
> +        *baseAddress = task->mem;
> +        trace_apple_gfx_create_task(vmSize, *baseAddress);
> +        return (PGTask_t *)task;
> +    };
> +
> +    desc.destroyTask = ^(PGTask_t * _Nonnull _task) {
> +        AppleGFXTask *task = (AppleGFXTask *)_task;
> +        trace_apple_gfx_destroy_task(task);
> +        QTAILQ_REMOVE(&s->tasks, task, node);
> +        g_free(task);
> +    };
> +
> +    desc.mapMemory = ^(PGTask_t * _Nonnull _task, uint32_t rangeCount, uint64_t virtualOffset, bool readOnly, PGPhysicalMemoryRange_t * _Nonnull ranges) {
> +        AppleGFXTask *task = (AppleGFXTask*)_task;
> +        mach_port_t mtask = mach_task_self();
> +        trace_apple_gfx_map_memory(task, rangeCount, virtualOffset, readOnly);
> +        for (int i = 0; i < rangeCount; i++) {
> +            PGPhysicalMemoryRange_t *range = &ranges[i];
> +            MemoryRegion *tmp_mr;
> +            /* TODO: Bounds checks? r/o? */
> +            qemu_mutex_lock_iothread();
> +            AppleGFXMR *mr = apple_gfx_mapMemory(s, task, virtualOffset,
> +                                                 range->physicalAddress,
> +                                                 range->physicalLength);
> +
> +            trace_apple_gfx_map_memory_range(i, range->physicalAddress, range->physicalLength, mr->va);
> +
> +            vm_address_t target = (vm_address_t)mr->va;
> +            uint64_t mask = 0;
> +            bool anywhere = false;
> +            vm_address_t source = (vm_address_t)gpa2hva(&tmp_mr, mr->pa, mr->len, NULL);
> +            vm_prot_t cur_protection = 0;
> +            vm_prot_t max_protection = 0;
> +            kern_return_t retval = vm_remap(mtask, &target, mr->len, mask,
> +                                            anywhere, mtask, source, false,
> +                                            &cur_protection, &max_protection,
> +                                            VM_INHERIT_DEFAULT);
> +            trace_apple_gfx_remap(retval, source, target);
> +            g_assert(retval == KERN_SUCCESS);
> +
> +            qemu_mutex_unlock_iothread();
> +
> +            virtualOffset += mr->len;
> +        }
> +        return (bool)true;
> +    };
> +
> +    desc.unmapMemory = ^(PGTask_t * _Nonnull _task, uint64_t virtualOffset, uint64_t length) {
> +        AppleGFXTask *task = (AppleGFXTask *)_task;
> +        AppleGFXMR *mr, *next;
> +
> +        trace_apple_gfx_unmap_memory(task, virtualOffset, length);
> +
> +        qemu_mutex_lock_iothread();

Why does this handler take the lock around the whole loop while the
mapMemory one locks and unlocks per range?

> +        QTAILQ_FOREACH_SAFE(mr, &s->mrs, node, next) {

Is this a good choice of data structure for the operation? How many list
members are we expecting here, and how hot is this handler?

> +            if (mr->va >= (task->mem + virtualOffset) &&
> +                (mr->va + mr->len) <= (task->mem + virtualOffset + length)) {
> +                vm_address_t addr = (vm_address_t)mr->va;
> +                vm_deallocate(mach_task_self(), addr, mr->len);
> +                QTAILQ_REMOVE(&s->mrs, mr, node);
> +                g_free(mr);
> +            }
> +        }
> +        qemu_mutex_unlock_iothread();
> +        return (bool)true;
> +    };
> +
> +    desc.addTraceRange = ^(PGPhysicalMemoryRange_t * _Nonnull range, PGTraceRangeHandler _Nonnull handler) {
> +        /* Never saw this called. Return a bogus pointer so we catch access. */
> +        return (PGTraceRange_t *)(void *)(uintptr_t)0x4242;
> +    };
> +
> +    desc.removeTraceRange = ^(PGTraceRange_t * _Nonnull range) {
> +        /* Never saw this called. Nothing to do. */
> +    };

The "trace range" functionality is optional and, I believe, only used by
the UEFI driver for the PV graphics device. vmapple doesn't use UEFI, so
you won't see these called on aarch64 guests. It's safe not to set any
addTraceRange/removeTraceRange handlers at all; no need for dummy ones.
(Especially not ones with undefined behaviour…)

> +    s->pgdev = PGNewDeviceWithDescriptor(desc);
> +
> +    [disp_desc init];

[PGDisplayDescriptor new] earlier already implies -init. Either use new, or
the alloc/init idiom, but not both.

> +    disp_desc.name = @"QEMU display";
> +    disp_desc.sizeInMillimeters = NSMakeSize(400., 300.); /* A 20" display */
> +    disp_desc.queue = dispatch_get_main_queue();

Note that the GCD main queue does not map to anything well-defined in
Qemu's view of the world. (It doesn't even necessarily map to the process's
starting thread: it's just guaranteed to be serialised with regard to the
starting thread.)
With that in mind:

> +    disp_desc.newFrameEventHandler = ^(void) {
> +        trace_apple_gfx_new_frame();
> +
> +        /* Tell QEMU gfx stack that a new frame arrived */
> +        s->handles_frames = true;
> +        s->new_frame = true;

Setting new_frame on the GCD main queue and performing check-and-clear on
it in the Qemu graphics fb update callback without using atomics or locking
makes me uneasy.

> +    };
> +    disp_desc.modeChangeHandler = ^(PGDisplayCoord_t sizeInPixels, OSType pixelFormat) {
> +        trace_apple_gfx_mode_change(sizeInPixels.x, sizeInPixels.y);
> +        set_mode(s, sizeInPixels.x, sizeInPixels.y);
> +    };
> +    disp_desc.cursorGlyphHandler = ^(NSBitmapImageRep *glyph, PGDisplayCoord_t hotSpot) {
> +        uint32_t bpp = glyph.bitsPerPixel;
> +        uint64_t width = glyph.pixelsWide;
> +        uint64_t height = glyph.pixelsHigh;
> +
> +        trace_apple_gfx_cursor_set(bpp, width, height);
> +
> +        if (s->cursor) {
> +            cursor_unref(s->cursor);
> +        }
> +        s->cursor = cursor_alloc(width, height);

I suspect this is not thread safe.

> +
> +        /* TODO handle different bpp */
> +        if (bpp == 32) {
> +            memcpy(s->cursor->data, glyph.bitmapData, glyph.bytesPerPlane);
> +            dpy_cursor_define(s->con, s->cursor);
> +            update_cursor(s);
> +        }
> +    };
> +    disp_desc.cursorShowHandler = ^(BOOL show) {
> +        trace_apple_gfx_cursor_show(show);
> +        s->cursor_show = show;
> +        update_cursor(s);
> +    };
> +    disp_desc.cursorMoveHandler = ^(void) {
> +        trace_apple_gfx_cursor_move();
> +        update_cursor(s);
> +    };

Note that cursorMoveHandler wasn't in every API version of the PV Graphics
framework, although it was only missing from macOS 11.0 and was added with
11.1, so I suspect the issue is fairly theoretical. (I'm not sure what
you're supposed to use instead - I guess polling cursorPosition on every
frame? In any case, setting cursorMoveHandler will crash on runtime
versions where it's not available unless wrapped in an availability check.)
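The availability check for the cursorMoveHandler point would be something like this (sketch only; the handler body is copied from the patch):

```objc
/* Only set the handler on runtimes where the property exists. */
if (@available(macOS 11.1, *)) {
    disp_desc.cursorMoveHandler = ^(void) {
        trace_apple_gfx_cursor_move();
        update_cursor(s);
    };
}
```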
> +    s->pgdisp = [s->pgdev newDisplayWithDescriptor:disp_desc port:0 serialNum:1234];
> +    s->pgdisp.modeList = [NSArray arrayWithObjects:modes count:ARRAY_SIZE(apple_gfx_modes)];
> +
> +    [iosfc_desc init];
> +    iosfc_desc.mapMemory = ^(uint64_t phys, uint64_t len, bool ro, void **va, void *e, void *f) {

I assume we don't really have any concrete information on the meaning of
the unused arguments?

> +        trace_apple_iosfc_map_memory(phys, len, ro, va, e, f);
> +        MemoryRegion *tmp_mr;
> +        *va = gpa2hva(&tmp_mr, phys, len, NULL);
> +        return (bool)true;
> +    };
> +