On Thu, Feb 15, 2024 at 4:29 PM Jonathan Cameron < jonathan.came...@huawei.com> wrote:
> On Thu, 8 Feb 2024 14:50:59 +0000 > Jonathan Cameron <jonathan.came...@huawei.com> wrote: > > > On Wed, 7 Feb 2024 17:34:15 +0000 > > Jonathan Cameron <jonathan.came...@huawei.com> wrote: > > > > > On Fri, 2 Feb 2024 16:56:18 +0000 > > > Peter Maydell <peter.mayd...@linaro.org> wrote: > > > > > > > On Fri, 2 Feb 2024 at 16:50, Gregory Price < > gregory.pr...@memverge.com> wrote: > > > > > > > > > > On Fri, Feb 02, 2024 at 04:33:20PM +0000, Peter Maydell wrote: > > > > > > > Here we are trying to take an interrupt. This isn't related to > the > > > > > > other can_do_io stuff, it's happening because do_ld_mmio_beN > assumes > > > > > > it's called with the BQL not held, but in fact there are some > > > > > > situations where we call into the memory subsystem and we do > > > > > > already have the BQL. > > > > > > > > > It's bugs all the way down as usual! > > > > > https://xkcd.com/1416/ > > > > > > > > > > I'll dig in a little next week to see if there's an easy fix. We > can see > > > > > the return address is already 0 going into mmu_translate, so it > does > > > > > look unrelated to the patch I threw out - but probably still has > to do > > > > > with things being on IO. > > > > > > > > Yes, the low level memory accessors only need to take the BQL if the > thing > > > > being accessed is an MMIO device. Probably what is wanted is for > those > > > > functions to do "take the lock if we don't already have it", > something > > > > like hw/core/cpu-common.c:cpu_reset_interrupt() does. > > > > Got back to x86 testing and indeed not taking the lock in that one path > > does get things running (with all Gregory's earlier hacks + DMA limits as > > described below). Guess it's time to roll some cleaned up patches and > > see how much everyone screams :) > > > > 3 series sent out: > (all also on gitlab.com/jic23/qemu cxl-2024-02-15 though I updated patch > descriptions > a little after pushing that out) > > Main set of fixes (x86 'works' under my light testing after this one) > > https://lore.kernel.org/qemu-devel/20240215150133.2088-1-jonathan.came...@huawei.com/ > > ARM FEAT_HADFS (access and dirty it updating in PTW) workaround for > missing atomic CAS > > https://lore.kernel.org/qemu-devel/20240215151804.2426-1-jonathan.came...@huawei.com/T/#t > > DMA / virtio fix: > > https://lore.kernel.org/qemu-devel/20240215142817.1904-1-jonathan.came...@huawei.com/ > > Last thing I need to do is propose a suitable flag to make > Mattias' bounce buffering size parameter apply to "memory" address space. For background, I actually had a global bounce buffer size parameter apply to all address spaces in an earlier version of my series. After discussion on the list, we settled on an address-space specific parameter so it can be configured per device. I haven't looked into where the memory accesses in your context originate from - can they be attributed to a specific entity to house the parameter? > Currently > I'm carrying this: (I've no idea how much is need but it's somewhere > between 4k and 1G) > > diff --git a/system/physmem.c b/system/physmem.c > index 43b37942cf..49b961c7a5 100644 > --- a/system/physmem.c > +++ b/system/physmem.c > @@ -2557,6 +2557,7 @@ static void memory_map_init(void) > memory_region_init(system_memory, NULL, "system", UINT64_MAX); > address_space_init(&address_space_memory, system_memory, "memory"); > > + address_space_memory.max_bounce_buffer_size = 1024 * 1024 * 1024; > system_io = g_malloc(sizeof(*system_io)); > memory_region_init_io(system_io, NULL, &unassigned_io_ops, NULL, "io", > 65536); > > Please take a look. These are all in areas of QEMU I'm not particularly > confident > about so relying on nice people giving feedback even more than normal! > > Thanks to all those who helped with debugging and suggestions. > > Thanks, > > Jonathan > > > Jonathan > > > > > > > > > > > > -- PMM > > > > > > Still a work in progress but I thought I'd give an update on some of > the fun... > > > > > > I have a set of somewhat dubious workarounds that sort of do the job > (where > > > the aim is to be able to safely run any workload on top of any valid > > > emulated CXL device setup). > > > > > > To recap, the issue is that for CXL memory interleaving we need to have > > > find grained routing to each device (16k Max Gran). That was fine > whilst > > > pretty much all the testing was DAX based so software wasn't running > out > > > of it. Now the kernel is rather more aggressive in defaulting any > volatile > > > CXL memory it finds to being normal memory (in some configs anyway) > people > > > started hitting problems. Given one of the most important functions of > the > > > emulation is to check data ends up in the right backing stores, I'm not > > > keen to drop that feature unless we absolutely have to. > > > > > > 1) For the simple case of no interleave I have working code that just > > > shoves the MemoryRegion in directly and all works fine. That was > always > > > on the todo list for virtualization cases anyway were we pretend the > > > underlying devices aren't interleaved and frig the reported perf > numbers > > > to present aggregate performance etc. I'll tidy this up and post > it. > > > We may want a config parameter to 'reject' address decoder > programming > > > that would result in interleave - it's not remotely spec compliant, > but > > > meh, it will make it easier to understand. For virt case we'll > probably > > > present locked down decoders (as if a FW has set them up) but for > emulation > > > that limits usefulness too much. > > > > > > 2) Unfortunately, for the interleaved case can't just add a lot of > memory > > > regions because even at highest granularity (16k) and minimum size > > > 512MiB it takes for ever to eventually run into an assert in > > > phys_section_add with the comment: > > > "The physical section number is ORed with a page-aligned > > > pointer to produce the iotlb entries. Thus it should > > > never overflow into the page-aligned value." > > > That sounds hard to 'fix' though I've not looked into it. > > > > > > So back to plan (A) papering over the cracks with TCG. > > > > > > I've focused on arm64 which seems a bit easier than x86 (and is > arguably > > > part of my day job) > > > > > > Challenges > > > 1) The atomic updates of accessed and dirty bits in > > > arm_casq_ptw() fail because we don't have a proper address to do > them > > > on. However, there is precedence for non atomic updates in there > > > already (used when the host system doesn't support big enough cas) > > > I think we can do something similar under the bql for this case. > > > Not 100% sure I'm writing to the correct address but a simple frig > > > superficially appears to work. > > > 2) Emulated devices try to do DMA to buffers in the CXL emulated > interleave > > > memory (virtio_blk for example). Can't do that because there is no > > > actual translation available - just read and write functions. > > > > > > So should be easy to avoid as we know how to handle DMA limitations. > > > Just set the max dma address width to 40 bits (so below the CXL > Fixed Memory > > > Windows and rely on Linux to bounce buffer with swiotlb). For a > while > > > I couldn't work out why changing IORT to provide this didn't work > and > > > I saw errors for virtio-pci-blk. So digging ensued. > > > Virtio devices by default (sort of) bypass the dma-api in linux. > > > vring_use_dma_api() in Linux. That is reasonable from the > translation > > > point of view, but not the DMA limits (and resulting need to use > bounce > > > buffers). Maybe could put a sanity check in linux on no iommu + > > > a DMA restriction to below 64 bits but I'm not 100% sure we wouldn't > > > break other platforms. > > > Alternatively just use emulated real device and all seems fine > > > - I've tested with nvme. > > > > > > 3) I need to fix the kernel handling for CXL CDAT table originated > > > NUMA nodes on ARM64. For now I have a hack in place so I can make > > > sure I hit the memory I intend to when testing. I suspect we need > > > some significant work to sort > > > > > > Suggestions for other approaches would definitely be welcome! > > > > > > Jonathan > > > > > > > > >