On 2022/5/13 17:10, Bruce Richardson wrote:
> On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
>> On 2022/4/8 14:29, Pai G, Sunil wrote:
>>>> -----Original Message-----
>>>> From: Richardson, Bruce <bruce.richard...@intel.com>
>>>> Sent: Tuesday, April 5, 2022 5:38 PM
>>>> To: Ilya Maximets <i.maxim...@ovn.org>; Chengwen Feng <fengcheng...@huawei.com>;
>>>> Radha Mohan Chintakuntla <rad...@marvell.com>; Veerasenareddy Burru <vbu...@marvell.com>;
>>>> Gagandeep Singh <g.si...@nxp.com>; Nipun Gupta <nipun.gu...@nxp.com>
>>>> Cc: Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian <ian.sto...@intel.com>;
>>>> Hu, Jiayu <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>;
>>>> Van Haaren, Harry <harry.van.haa...@intel.com>;
>>>> Maxime Coquelin (maxime.coque...@redhat.com) <maxime.coque...@redhat.com>;
>>>> ovs-dev@openvswitch.org; d...@dpdk.org; Mcnamara, John <john.mcnam...@intel.com>;
>>>> O'Driscoll, Tim <tim.odrisc...@intel.com>; Finn, Emma <emma.f...@intel.com>
>>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
>>>>
>>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
>>>>> On 3/30/22 16:09, Bruce Richardson wrote:
>>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
>>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
>>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
>>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
>>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
>>>>>>>>>>
>>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a completely
>>>>>>>>>> separate PCI device. However, I don't see any memory ordering being
>>>>>>>>>> enforced or even described in the dmadev API or documentation.
>>>>>>>>>> Please point me to the correct documentation if I somehow missed it.
>>>>>>>>>>
>>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing respectively
>>>>>>>>>> the data and the descriptor info. CPU core (C) is reading the
>>>>>>>>>> descriptor and the data it points to.
>>>>>>>>>>
>>>>>>>>>> A few things about that process:
>>>>>>>>>>
>>>>>>>>>> 1. There is no memory barrier between writes A and B (did I miss
>>>>>>>>>>    them?), meaning that those operations can be seen by C in a
>>>>>>>>>>    different order regardless of barriers issued by C and regardless
>>>>>>>>>>    of the nature of devices A and B.
>>>>>>>>>>
>>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
>>>>>>>>>>    no guarantee that C will see these writes in the same order,
>>>>>>>>>>    as C doesn't use real memory barriers because vhost advertises
>>>>>>>>>
>>>>>>>>> s/advertises/does not advertise/
>>>>>>>>>
>>>>>>>>>> VIRTIO_F_ORDER_PLATFORM.
>>>>>>>>>>
>>>>>>>>>> So, I'm coming to the conclusion that there is a missing write
>>>>>>>>>> barrier on the vhost side and vhost itself must not advertise the
>>>>>>>>>
>>>>>>>>> s/must not/must/
>>>>>>>>>
>>>>>>>>> Sorry, I wrote things backwards. :)
>>>>>>>>>
>>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use actual
>>>>>>>>>> memory barriers.
>>>>>>>>>>
>>>>>>>>>> I would like to hear some thoughts on that topic. Is it a real issue?
>>>>>>>>>> Is it an issue considering all possible CPU architectures and
>>>>>>>>>> DMA HW variants?
>>>>>>>>
>>>>>>>> In terms of ordering of operations using dmadev:
>>>>>>>>
>>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g. Intel
>>>>>>>>   IOAT, while other hardware may not guarantee the order of operations /
>>>>>>>>   may do things in parallel, e.g. Intel DSA. Therefore the dmadev API
>>>>>>>>   provides the fence operation, which allows the order to be enforced.
>>>>>>>>   The fence can be thought of as a full memory barrier, meaning no jobs
>>>>>>>>   after the barrier can be started until all those before it have
>>>>>>>>   completed. Obviously, for HW where order is always enforced this will
>>>>>>>>   be a no-op, but for hardware that parallelizes we want to reduce the
>>>>>>>>   fences to get best performance.
>>>>>>>>
>>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can only
>>>>>>>>   write after a DMA copy has been done, the CPU must wait for the DMA
>>>>>>>>   completion to guarantee ordering. Once the completion has been
>>>>>>>>   returned, the completed operation is globally visible to all cores.
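
A minimal sketch of how those two points map onto the public dmadev calls (the
dev_id/vchan values and iova addresses are assumed to be set up elsewhere;
error handling is trimmed):

    #include <stdbool.h>
    #include <rte_dmadev.h>

    /* Enqueue two dependent copies: the second carries RTE_DMA_OP_FLAG_FENCE,
     * so hardware that otherwise runs jobs in parallel (e.g. DSA) must finish
     * the first copy before starting the second. */
    static void
    enqueue_ordered_copies(int16_t dev_id, uint16_t vchan,
                           rte_iova_t src1, rte_iova_t dst1,
                           rte_iova_t src2, rte_iova_t dst2, uint32_t len)
    {
        rte_dma_copy(dev_id, vchan, src1, dst1, len, 0);
        rte_dma_copy(dev_id, vchan, src2, dst2, len,
                     RTE_DMA_OP_FLAG_FENCE | RTE_DMA_OP_FLAG_SUBMIT);
    }

    /* The core that needs the copied data polls for completion; once
     * rte_dma_completed() reports the jobs done, the copied data is expected
     * to be visible (the guarantee under discussion in this thread). */
    static void
    wait_for_copies(int16_t dev_id, uint16_t vchan)
    {
        uint16_t done = 0, last_idx;
        bool error = false;

        while (done < 2 && !error)
            done += rte_dma_completed(dev_id, vchan, 2 - done,
                                      &last_idx, &error);
    }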
>>>>>>>
>>>>>>> Thanks for the explanation! Some questions though:
>>>>>>>
>>>>>>> In our case one CPU waits for completion and another CPU is actually
>>>>>>> using the data. IOW, "CPU must wait" is a bit ambiguous. Which CPU must
>>>>>>> wait?
>>>>>>>
>>>>>>> Or should it be "Once the completion is visible on any core, the
>>>>>>> completed operation is globally visible to all cores."?
>>>>>>
>>>>>> The latter.
>>>>>> Once the change to memory/cache is visible to any core, it is visible to
>>>>>> all of them. This applies to regular CPU memory writes too - at least on
>>>>>> IA, and I expect on many other architectures - once the write is visible
>>>>>> outside the current core it is visible to every other core. Once the data
>>>>>> hits the L1 or L2 cache of any core, any subsequent request for that data
>>>>>> from any other core will "snoop" the latest data from that core's cache,
>>>>>> even if it has not made its way down to a shared cache, e.g. L3 on most
>>>>>> IA systems.
>>>>>
>>>>> It sounds like you're referring to the "multicopy atomicity" of the
>>>>> architecture. However, that is not a universally supported property.
>>>>> AFAICT, POWER and older ARM systems don't support it, so writes performed
>>>>> by one core are not necessarily available to all other cores at the same
>>>>> time. That means that if CPU0 writes the data and the completion flag,
>>>>> CPU1 reads the completion flag and writes the ring, CPU2 may see the ring
>>>>> write but may still not see the write of the data, even though there was
>>>>> a control dependency on CPU1. There should be a full memory barrier on
>>>>> CPU1 in order to fulfill the memory ordering requirements for CPU2, IIUC.
>>>>>
>>>>> In our scenario CPU0 is a DMA device, which may or may not be part of a
>>>>> CPU and may have different memory consistency/ordering requirements. So,
>>>>> the question is: does the DPDK DMA API guarantee multicopy atomicity
>>>>> between the DMA device and all CPU cores regardless of CPU architecture
>>>>> and the nature of the DMA device?
>>>>
>>>> Right now it doesn't, because this never came up in discussion. In order
>>>> to be useful, it sounds like it explicitly should do so. At least for the
>>>> Intel ioat and idxd driver cases this will be supported, so we just need
>>>> to ensure all other drivers currently upstreamed can offer this too. If
>>>> they cannot, we cannot offer it as a global guarantee, and we should see
>>>> about adding a capability flag for this to indicate when the guarantee is
>>>> there or not.
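
If such a flag were added, applications could check it once at setup time. A
rough sketch, where RTE_DMA_CAPA_GLOBAL_VISIBILITY is a made-up name for the
capability bit proposed here (not part of the current dmadev API), while
rte_dma_info_get() and dev_capa are existing API:

    #include <stdbool.h>
    #include <rte_bitops.h>
    #include <rte_dmadev.h>

    /* Hypothetical capability bit for the guarantee discussed above;
     * the value is a placeholder, not an existing dmadev define. */
    #define RTE_DMA_CAPA_GLOBAL_VISIBILITY RTE_BIT64(63)

    static bool
    dma_completion_globally_visible(int16_t dev_id)
    {
        struct rte_dma_info info;

        if (rte_dma_info_get(dev_id, &info) != 0)
            return false;

        /* If the device cannot promise global visibility, the application
         * must add its own barrier/cache maintenance before handing the
         * copied data to another core. */
        return (info.dev_capa & RTE_DMA_CAPA_GLOBAL_VISIBILITY) != 0;
    }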
>>>>
>>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok to
>>>> document for dmadev that once a DMA operation is completed, the op is
>>>> guaranteed visible to all cores/threads? If not, any thoughts on what
>>>> guarantees we can provide in this regard, or what capabilities should be
>>>> exposed?
>>>
>>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru,
>>> @Gagandeep Singh, @Nipun Gupta,
>>> Requesting your valuable opinions for the queries on this thread.
>>
>> Sorry for the late reply; I didn't follow this thread.
>>
>> I don't think the DMA API should provide such a guarantee, because:
>> 1. DMA is an acceleration device, just like an encryption/decryption device
>>    or a network device.
>> 2. For the Hisilicon Kunpeng platform:
>>    The DMA device supports:
>>      a) IO coherency: it can read the latest data, which may still be in a
>>         CPU cache, and on writes it will invalidate the cached data and
>>         write the data to DDR.
>>      b) Ordering within one request: it only writes the completion
>>         descriptor after the copy is done.
>>         Note: ordering between multiple requests can be enforced through
>>         the fence mechanism.
>>    The DMA driver only needs to:
>>      a) add one write memory barrier (a lightweight mb) when ringing the
>>         doorbell.
>>    So once the DMA is completed, the operation is guaranteed visible to all
>>    cores, and a 3rd core will observe the right order: core-B prepares the
>>    data and issues the request to the DMA, the DMA does the work, core-B
>>    gets the completion status.
>> 3. I worked on a TI multi-core SoC many years ago; that SoC didn't support
>>    cache coherence and consistency between cores. The SoC also had a DMA
>>    device with many channels. As a hypothetical design of its DMA driver
>>    under the DPDK DMA framework, the driver should:
>>      a) write back the DMA's src buffer, so that no dirty cache data
>>         remains while the DMA is running,
>>      b) invalidate the DMA's dst buffer,
>>      c) issue a full memory barrier,
>>      d) update the DMA's registers.
>>    Then the DMA will execute the copy task: it copies from DDR and writes
>>    to DDR, and after the copy it sets its status register to completed. In
>>    this case a 3rd core will also observe the right order.
>>    A particular point here: if a buffer is shared by multiple cores, the
>>    application should explicitly maintain the cache.
>>
>> Based on the above, I don't think the DMA API should explicitly document
>> this guarantee; it is the driver's, and sometimes even the application's
>> (e.g. on the above TI SoC), duty to ensure it.
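
A rough sketch of the driver-side submit path described in points 2 and 3
above (the cache-maintenance helpers and the register layout are invented for
illustration; on an IO-coherent platform like Kunpeng only the barrier before
the doorbell remains):

    #include <stddef.h>
    #include <stdint.h>
    #include <rte_atomic.h>

    /* Hypothetical platform hooks for a non-cache-coherent SoC. */
    extern void plat_cache_writeback(const void *addr, size_t len);
    extern void plat_cache_invalidate(void *addr, size_t len);

    /* Hypothetical device register block. */
    struct dma_chan_regs {
        volatile uint64_t src;
        volatile uint64_t dst;
        volatile uint32_t len;
        volatile uint32_t doorbell;
    };

    static void
    dma_submit_copy(struct dma_chan_regs *regs, const void *src, void *dst,
                    uint32_t len, uint64_t src_iova, uint64_t dst_iova)
    {
        plat_cache_writeback(src, len);  /* a) flush dirty source lines to DDR */
        plat_cache_invalidate(dst, len); /* b) drop stale destination lines    */
        rte_mb();                        /* c) full barrier before programming */

        regs->src = src_iova;            /* d) program the copy                */
        regs->dst = dst_iova;
        regs->len = len;

        rte_io_wmb();                    /* lightweight write barrier before
                                          * the doorbell (Kunpeng-style case)  */
        regs->doorbell = 1;
    }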
> Hi,
>
> Thanks for that. So if I understand correctly, your current HW does provide
> this guarantee, but you don't think it should always be the case for dmadev,
> correct?

Yes, our HW will provide the guarantee. If some HW cannot provide it, it's the
driver's and maybe the application's duty to provide it.

> Based on that, what do you think should be the guarantee on completion?
> Once a job is completed, is the completion visible to the submitting core,
> or to the core reading the completion?

Both cores will be able to see it.

> Do you think it's acceptable to add a capability flag for drivers to
> indicate that they do support a "globally visible" guarantee?

I think the driver (together with the HW) should support the "globally
visible" guarantee. And for some HW, even the application (or middleware)
should take care of it.

> Thanks,
> /Bruce
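
For the two-core case discussed above - one core reaping the completion and a
different core consuming the data - the usual pattern is to publish the
completion with a release store and read it with an acquire load, so the
consumer does not rely on multicopy atomicity between itself and the reaping
core. A minimal sketch (the shared flag and destination buffer are assumptions
for illustration):

    #include <stdbool.h>
    #include <stdint.h>
    #include <rte_dmadev.h>

    static uint32_t copy_ready; /* shared: 0 = not ready, 1 = data valid */

    /* Core B: reap the DMA completion, then publish it. */
    static void
    core_b_reap(int16_t dev_id, uint16_t vchan)
    {
        uint16_t last_idx;
        bool error = false;

        while (rte_dma_completed(dev_id, vchan, 1, &last_idx, &error) == 0 &&
               !error)
            ;
        /* Release store: everything core B has observed - including the DMA'd
         * data, per the "globally visible on completion" guarantee discussed
         * here - is published before the flag. */
        __atomic_store_n(&copy_ready, 1, __ATOMIC_RELEASE);
    }

    /* Core C: wait for the flag; afterwards the copied data is safe to read. */
    static void
    core_c_consume(const volatile uint8_t *dst_buf)
    {
        while (__atomic_load_n(&copy_ready, __ATOMIC_ACQUIRE) == 0)
            ;
        (void)dst_buf[0]; /* data written by the DMA engine is now visible */
    }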