Hi All, the Cc list is a bit of a guess, so please add anyone else who might be interested in this topic.
This came up in discussion of the CXL emulation series a while back and I've finally gotten around to looking more closely at it (having carried a local hack in the meantime).

https://lore.kernel.org/qemu-devel/20210218165010.00004...@huawei.com/#t

The code in question is:

+static uint64_t mailbox_reg_read(void *opaque, hwaddr offset, unsigned size)
+{
+    CXLDeviceState *cxl_dstate = opaque;
+
+    switch (size) {
+    case 8:
+        return cxl_dstate->mbox_reg_state64[offset / 8];
+    case 4:
+        return cxl_dstate->mbox_reg_state32[offset / 4];
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static void mailbox_reg_write(void *opaque, hwaddr offset, uint64_t value,
+                              unsigned size)
+{
+    CXLDeviceState *cxl_dstate = opaque;
+
+    if (offset >= A_CXL_DEV_CMD_PAYLOAD) {
+        memcpy(cxl_dstate->mbox_reg_state + offset, &value, size);
+        return;
+    }
+
+    /*
+     * Lock is needed to prevent concurrent writes as well as to prevent writes
+     * coming in while the firmware is processing. Without background commands
+     * or the second mailbox implemented, this serves no purpose since the
+     * memory access is synchronized at a higher level (per memory region).
+     */
+    RCU_READ_LOCK_GUARD();
+
+    switch (size) {
+    case 4:
+        mailbox_mem_writel(cxl_dstate->mbox_reg_state32, offset, value);
+        break;
+    case 8:
+        mailbox_mem_writeq(cxl_dstate->mbox_reg_state64, offset, value);
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    if (ARRAY_FIELD_EX32(cxl_dstate->mbox_reg_state32, CXL_DEV_MAILBOX_CTRL,
+                         DOORBELL))
+        cxl_process_mailbox(cxl_dstate);
+}

...
+static const MemoryRegionOps mailbox_ops = {
+    .read = mailbox_reg_read,
+    .write = mailbox_reg_write,
+    .endianness = DEVICE_LITTLE_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+        .unaligned = false,
+    },
+    .impl = {
+        .min_access_size = 4,
+        .max_access_size = 8,
+    },
+};

When run against the Linux driver, a particular memcpy_fromio() on arm64 results in byte reads when the buffer is not 8-byte aligned, whereas on x86 it results in 4-byte accesses. Today, a byte read under these circumstances returns whatever is in the lowest byte of a 4-byte unaligned read (which ends up being force-aligned by the simple implementation of the callback).

My initial suggestion was to fix this by adding the relatively simple code needed in the driver to implement byte read/write, but Ben pointed at the QEMU docs, docs/devel/memory.rst, which say:

"
.impl.min_access_size, .impl.max_access_size define the access sizes
(in bytes) supported by the *implementation*; other access sizes will be
emulated using the ones available. For example a 4-byte write will be
emulated using four 1-byte writes, if .impl.max_access_size = 1.
"

This isn't true in the situation where .valid.min_access_size < .impl.min_access_size.

So do we change the docs, or try to make this work? Assuming no side effects (if there are any, the emulated device should not use this), it is reasonably easy to augment access_with_adjusted_size() for the read case, but the write case would require a read-modify-write cycle. You could argue that any driver doing this really needs to be sure that reads and writes are side-effect free, but it is pretty nasty.

So, in conclusion: are the docs wrong, or is the implementation of this corner case wrong? (Or are we misreading the docs or code?)

Thanks,

Jonathan
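To make the trade-off concrete, here is a standalone sketch of what "augmenting" the adjustment logic would amount to. This is not QEMU's actual access_with_adjusted_size() code; all names (regs, impl_read4, emulated_read, ...) are made up for illustration, and it assumes a little-endian device, .valid.unaligned = false, and side-effect-free register reads:

```c
#include <assert.h>
#include <stdint.h>

/* Backing store standing in for the device's register file. */
static uint32_t regs[4] = { 0x44332211, 0x88776655, 0, 0 };

/* The "implementation": aligned 4-byte accesses only (.impl.min_access_size = 4). */
static uint32_t impl_read4(unsigned offset)
{
    assert((offset & 3) == 0);
    return regs[offset / 4];
}

static void impl_write4(unsigned offset, uint32_t val)
{
    assert((offset & 3) == 0);
    regs[offset / 4] = val;
}

/*
 * Read 'size' bytes at 'offset'. A sub-4-byte read is widened to the
 * containing aligned 4-byte read, then shifted and masked down -
 * harmless only if reads have no side effects.
 */
static uint64_t emulated_read(unsigned offset, unsigned size)
{
    assert(offset % size == 0);              /* .valid.unaligned = false */
    if (size >= 4) {
        uint64_t v = impl_read4(offset);
        if (size == 8) {
            v |= (uint64_t)impl_read4(offset + 4) << 32;
        }
        return v;
    }
    unsigned aligned = offset & ~3u;
    unsigned shift = (offset - aligned) * 8;
    return (impl_read4(aligned) >> shift) & ((1u << (size * 8)) - 1);
}

/*
 * Write 'size' bytes at 'offset'. A sub-4-byte write needs a
 * read-modify-write of the containing 4-byte register - the part that
 * is hard to justify if the read half can have side effects.
 */
static void emulated_write(unsigned offset, uint64_t value, unsigned size)
{
    assert(offset % size == 0);
    if (size >= 4) {
        impl_write4(offset, (uint32_t)value);
        if (size == 8) {
            impl_write4(offset + 4, (uint32_t)(value >> 32));
        }
        return;
    }
    unsigned aligned = offset & ~3u;
    unsigned shift = (offset - aligned) * 8;
    uint32_t mask = ((1u << (size * 8)) - 1) << shift;
    uint32_t cur = impl_read4(aligned);          /* the "read" in RMW */
    impl_write4(aligned, (cur & ~mask) | (((uint32_t)value << shift) & mask));
}
```

With this shape, the arm64 driver's 1-byte read at offset 1 becomes an aligned 4-byte read plus a shift (returning the correct byte rather than the lowest byte of the force-aligned word), while a 1-byte write becomes a 4-byte read followed by a 4-byte write, which is exactly why the no-side-effects caveat matters.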