On Fri, 2 Feb 2024 16:56:18 +0000 Peter Maydell <peter.mayd...@linaro.org> wrote:
> On Fri, 2 Feb 2024 at 16:50, Gregory Price <gregory.pr...@memverge.com> wrote:
> >
> > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote:
> > > Here we are trying to take an interrupt. This isn't related to the
> > > other can_do_io stuff, it's happening because do_ld_mmio_beN assumes
> > > it's called with the BQL not held, but in fact there are some
> > > situations where we call into the memory subsystem and we do
> > > already have the BQL.
> > > It's bugs all the way down as usual!
> >
> > https://xkcd.com/1416/
> >
> > I'll dig in a little next week to see if there's an easy fix. We can see
> > the return address is already 0 going into mmu_translate, so it does
> > look unrelated to the patch I threw out - but probably still has to do
> > with things being on IO.
>
> Yes, the low level memory accessors only need to take the BQL if the thing
> being accessed is an MMIO device. Probably what is wanted is for those
> functions to do "take the lock if we don't already have it", something
> like hw/core/cpu-common.c:cpu_reset_interrupt() does.
>
> -- PMM

Still a work in progress, but I thought I'd give an update on some of
the fun...

I have a set of somewhat dubious workarounds that sort of do the job
(where the aim is to be able to safely run any workload on top of any
valid emulated CXL device setup).

To recap, the issue is that for CXL memory interleaving we need to have
fine-grained routing to each device (16k max granularity). That was fine
whilst pretty much all the testing was DAX based, so software wasn't
running out of it. Now that the kernel is rather more aggressive about
defaulting any volatile CXL memory it finds to being normal memory (in
some configs anyway), people have started hitting problems. Given that
one of the most important functions of the emulation is to check that
data ends up in the right backing stores, I'm not keen to drop that
feature unless we absolutely have to.
1) For the simple case of no interleave, I have working code that just
shoves the MemoryRegion in directly and all works fine. That was always
on the todo list for virtualization cases anyway, where we pretend the
underlying devices aren't interleaved and frig the reported perf numbers
to present aggregate performance etc. I'll tidy this up and post it.
We may want a config parameter to 'reject' address decoder programming
that would result in interleave - it's not remotely spec compliant, but
meh, it will make it easier to understand. For the virt case we'll
probably present locked-down decoders (as if a FW has set them up), but
for emulation that limits usefulness too much.

2) Unfortunately, for the interleaved case we can't just add a lot of
memory regions, because even at the highest granularity (16k) and the
minimum size (512MiB) it takes forever and eventually runs into an
assert in phys_section_add with the comment:
"The physical section number is ORed with a page-aligned pointer to
produce the iotlb entries. Thus it should never overflow into the
page-aligned value."
That sounds hard to 'fix', though I've not looked into it.

So back to plan (A): papering over the cracks with TCG. I've focused on
arm64, which seems a bit easier than x86 (and is arguably part of my
day job).

Challenges:

1) The atomic updates of accessed and dirty bits in arm_casq_ptw() fail
because we don't have a proper address to do them on. However, there is
precedent for non-atomic updates in there already (used when the host
system doesn't support a big enough cas). I think we can do something
similar under the BQL for this case. Not 100% sure I'm writing to the
correct address, but a simple frig superficially appears to work.

2) Emulated devices try to do DMA to buffers in the CXL emulated
interleaved memory (virtio-blk for example). We can't do that because
there is no actual translation available - just read and write
functions. So it should be easy to avoid, as we know how to handle DMA
limitations.
Just set the max DMA address width to 40 bits (so below the CXL Fixed
Memory Windows) and rely on Linux to bounce buffer with swiotlb.

For a while I couldn't work out why changing the IORT to provide this
didn't work and I saw errors for virtio-pci-blk. So digging ensued.
Virtio devices by default (sort of) bypass the dma-api in Linux - see
vring_use_dma_api() in Linux. That is reasonable from the translation
point of view, but not for the DMA limits (and the resulting need to
use bounce buffers). Maybe we could put a sanity check in Linux for
no iommu + a DMA restriction to below 64 bits, but I'm not 100% sure
we wouldn't break other platforms. Alternatively, just use an emulated
real device and all seems fine - I've tested with nvme.

3) I need to fix the kernel handling of CXL CDAT table originated NUMA
nodes on arm64. For now I have a hack in place so I can make sure I hit
the memory I intend to when testing. I suspect we need some significant
work to sort this out.

Suggestions for other approaches would definitely be welcome!

Jonathan