> -----Original Message-----
> From: Bruce Richardson <bruce.richard...@intel.com>
> Sent: Friday, May 13, 2022 3:34 AM
> To: fengchengwen <fengcheng...@huawei.com>
> Cc: Pai G, Sunil <sunil.pa...@intel.com>; Ilya Maximets <i.maxim...@ovn.org>;
> Radha Chintakuntla <rad...@marvell.com>; Veerasenareddy Burru
> <vbu...@marvell.com>; Gagandeep Singh <g.si...@nxp.com>; Nipun Gupta
> <nipun.gu...@nxp.com>; Stokes, Ian <ian.sto...@intel.com>; Hu, Jiayu
> <jiayu...@intel.com>; Ferriter, Cian <cian.ferri...@intel.com>; Van Haaren,
> Harry <harry.van.haa...@intel.com>; Maxime Coquelin
> (maxime.coque...@redhat.com) <maxime.coque...@redhat.com>;
> ovs-d...@openvswitch.org; dev@dpdk.org; Mcnamara, John
> <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>;
> Finn, Emma <emma.f...@intel.com>
> Subject: [EXT] Re: OVS DPDK DMA-Dev library/Design Discussion
>
> External Email
>
> ----------------------------------------------------------------------
> On Fri, May 13, 2022 at 05:48:35PM +0800, fengchengwen wrote:
> > On 2022/5/13 17:10, Bruce Richardson wrote:
> > > On Fri, May 13, 2022 at 04:52:10PM +0800, fengchengwen wrote:
> > >> On 2022/4/8 14:29, Pai G, Sunil wrote:
> > >>>> -----Original Message-----
> > >>>> From: Richardson, Bruce <bruce.richard...@intel.com>
> > >>>> Sent: Tuesday, April 5, 2022 5:38 PM
> > >>>> To: Ilya Maximets <i.maxim...@ovn.org>; Chengwen Feng
> > >>>> <fengcheng...@huawei.com>; Radha Mohan Chintakuntla
> > >>>> <rad...@marvell.com>; Veerasenareddy Burru <vbu...@marvell.com>;
> > >>>> Gagandeep Singh <g.si...@nxp.com>; Nipun Gupta <nipun.gu...@nxp.com>
> > >>>> Cc: Pai G, Sunil <sunil.pa...@intel.com>; Stokes, Ian
> > >>>> <ian.sto...@intel.com>; Hu, Jiayu <jiayu...@intel.com>; Ferriter,
> > >>>> Cian <cian.ferri...@intel.com>; Van Haaren, Harry
> > >>>> <harry.van.haa...@intel.com>; Maxime Coquelin
> > >>>> (maxime.coque...@redhat.com) <maxime.coque...@redhat.com>;
> > >>>> ovs-...@openvswitch.org; dev@dpdk.org; Mcnamara, John
> > >>>> <john.mcnam...@intel.com>; O'Driscoll, Tim <tim.odrisc...@intel.com>;
> > >>>> Finn, Emma <emma.f...@intel.com>
> > >>>> Subject: Re: OVS DPDK DMA-Dev library/Design Discussion
> > >>>>
> > >>>> On Tue, Apr 05, 2022 at 01:29:25PM +0200, Ilya Maximets wrote:
> > >>>>> On 3/30/22 16:09, Bruce Richardson wrote:
> > >>>>>> On Wed, Mar 30, 2022 at 01:41:34PM +0200, Ilya Maximets wrote:
> > >>>>>>> On 3/30/22 13:12, Bruce Richardson wrote:
> > >>>>>>>> On Wed, Mar 30, 2022 at 12:52:15PM +0200, Ilya Maximets wrote:
> > >>>>>>>>> On 3/30/22 12:41, Ilya Maximets wrote:
> > >>>>>>>>>> Forking the thread to discuss a memory consistency/ordering model.
> > >>>>>>>>>>
> > >>>>>>>>>> AFAICT, dmadev can be anything from part of a CPU to a
> > >>>>>>>>>> completely separate PCI device. However, I don't see any
> > >>>>>>>>>> memory ordering being enforced or even described in the
> > >>>>>>>>>> dmadev API or documentation.
> > >>>>>>>>>> Please, point me to the correct documentation, if I somehow
> > >>>>>>>>>> missed it.
> > >>>>>>>>>>
> > >>>>>>>>>> We have a DMA device (A) and a CPU core (B) writing
> > >>>>>>>>>> respectively the data and the descriptor info. CPU core
> > >>>>>>>>>> (C) is reading the descriptor and the data it points to.
> > >>>>>>>>>>
> > >>>>>>>>>> A few things about that process:
> > >>>>>>>>>>
> > >>>>>>>>>> 1. There is no memory barrier between writes A and B (Did I
> > >>>>>>>>>>    miss them?).
> > >>>>>>>>>>    Meaning that those operations can be seen by C in a
> > >>>>>>>>>>    different order regardless of barriers issued by C and
> > >>>>>>>>>>    regardless of the nature of devices A and B.
> > >>>>>>>>>>
> > >>>>>>>>>> 2. Even if there is a write barrier between A and B, there is
> > >>>>>>>>>>    no guarantee that C will see these writes in the same order
> > >>>>>>>>>>    as C doesn't use real memory barriers because vhost
> > >>>>>>>>>>    advertises
> > >>>>>>>>>
> > >>>>>>>>> s/advertises/does not advertise/
> > >>>>>>>>>
> > >>>>>>>>>>    VIRTIO_F_ORDER_PLATFORM.
> > >>>>>>>>>>
> > >>>>>>>>>> So, I'm getting to the conclusion that there is a missing write
> > >>>>>>>>>> barrier on the vhost side and vhost itself must not
> > >>>>>>>>>> advertise the
> > >>>>>>>>>
> > >>>>>>>>> s/must not/must/
> > >>>>>>>>>
> > >>>>>>>>> Sorry, I wrote things backwards. :)
> > >>>>>>>>>
> > >>>>>>>>>> VIRTIO_F_ORDER_PLATFORM, so the virtio driver can use
> > >>>>>>>>>> actual memory barriers.
> > >>>>>>>>>>
> > >>>>>>>>>> Would like to hear some thoughts on that topic. Is it a real issue?
> > >>>>>>>>>> Is it an issue considering all possible CPU architectures
> > >>>>>>>>>> and DMA HW variants?
> > >>>>>>>>>>
> > >>>>>>>>
> > >>>>>>>> In terms of ordering of operations using dmadev:
> > >>>>>>>>
> > >>>>>>>> * Some DMA HW will perform all operations strictly in order, e.g. Intel
> > >>>>>>>>   IOAT, while other hardware may not guarantee order of operations/do
> > >>>>>>>>   things in parallel, e.g. Intel DSA. Therefore the dmadev API provides
> > >>>>>>>>   the fence operation which allows the order to be enforced. The fence
> > >>>>>>>>   can be thought of as a full memory barrier, meaning no jobs after the
> > >>>>>>>>   barrier can be started until all those before it have completed.
> > >>>>>>>>   Obviously, for HW where order is always enforced, this will be a
> > >>>>>>>>   no-op, but for hardware that parallelizes, we want to reduce the
> > >>>>>>>>   fences to get best performance.
> > >>>>>>>>
> > >>>>>>>> * For synchronization between DMA devices and CPUs, where a CPU can
> > >>>>>>>>   only write after a DMA copy has been done, the CPU must wait for the
> > >>>>>>>>   dma completion to guarantee ordering. Once the completion has been
> > >>>>>>>>   returned, the completed operation is globally visible to all cores.
> > >>>>>>>
> > >>>>>>> Thanks for the explanation! Some questions though:
> > >>>>>>>
> > >>>>>>> In our case one CPU waits for completion and another CPU is
> > >>>>>>> actually using the data. IOW, "CPU must wait" is a bit ambiguous.
> > >>>>>>> Which CPU must wait?
> > >>>>>>>
> > >>>>>>> Or should it be "Once the completion is visible on any core,
> > >>>>>>> the completed operation is globally visible to all cores." ?
> > >>>>>>>
> > >>>>>>
> > >>>>>> The latter.
> > >>>>>> Once the change to memory/cache is visible to any core, it is
> > >>>>>> visible to all of them. This applies to regular CPU memory writes
> > >>>>>> too - at least on IA, and I expect on many other architectures
> > >>>>>> - once the write is visible outside the current core it is
> > >>>>>> visible to every other core.
> > >>>>>> Once the data hits the l1 or l2 cache of any core, any
> > >>>>>> subsequent requests for that data from any other core will "snoop"
> > >>>>>> the latest data from the core's cache, even if it has not made
> > >>>>>> its way down to a shared cache, e.g. l3 on most IA systems.
> > >>>>>
> > >>>>> It sounds like you're referring to the "multicopy atomicity" of
> > >>>>> the architecture. However, that is not a universally supported thing.
> > >>>>> AFAICT, POWER and older ARM systems don't support it, so
> > >>>>> writes performed by one core are not necessarily available to
> > >>>>> all other cores at the same time. That means that if CPU0
> > >>>>> writes the data and the completion flag, CPU1 reads the
> > >>>>> completion flag and writes the ring,
> > >>>>> CPU2 may see the ring write, but may still not see the write of
> > >>>>> the data, even though there was a control dependency on CPU1.
> > >>>>> There should be a full memory barrier on CPU1 in order to
> > >>>>> fulfill the memory ordering requirements for CPU2, IIUC.
> > >>>>>
> > >>>>> In our scenario the CPU0 is a DMA device, which may or may not
> > >>>>> be part of a CPU and may have different memory
> > >>>>> consistency/ordering requirements. So, the question is: does the
> > >>>>> DPDK DMA API guarantee multicopy atomicity between the DMA device
> > >>>>> and all CPU cores regardless of CPU architecture and the nature
> > >>>>> of the DMA device?
> > >>>>>
> > >>>>
> > >>>> Right now, it doesn't, because this never came up in discussion.
> > >>>> In order to be useful, it sounds like it explicitly should do so.
> > >>>> At least for the Intel ioat and idxd driver cases, this will be
> > >>>> supported, so we just need to ensure all other drivers currently
> > >>>> upstreamed can offer this too. If they cannot, we cannot offer it
> > >>>> as a global guarantee, and we should see about adding a
> > >>>> capability flag for this to indicate when the guarantee is there or not.
> > >>>>
> > >>>> Maintainers of dma/cnxk, dma/dpaa and dma/hisilicon - are we ok
> > >>>> to document for dmadev that once a DMA operation is completed,
> > >>>> the op is guaranteed visible to all cores/threads? If not, any
> > >>>> thoughts on what guarantees we can provide in this regard, or
> > >>>> what capabilities should be exposed?
> > >>>
> > >>>
> > >>>
> > >>> Hi @Chengwen Feng, @Radha Mohan Chintakuntla, @Veerasenareddy Burru,
> > >>> @Gagandeep Singh, @Nipun Gupta,
> > >>> Requesting your valuable opinions for the queries on this thread.
> > >>
> > >> Sorry for the late reply; I didn't follow this thread.
> > >>
> > >> I don't think the DMA API should provide such a guarantee because:
> > >> 1. DMA is an acceleration device, which is the same as an
> > >>    encryption/decryption device or a network device.
> > >> 2. For the Hisilicon Kunpeng platform:
> > >>    The DMA device supports:
> > >>      a) IO coherency: which means it can read the latest data, which may
> > >>         still be in the cache, and will invalidate the cache's data and
> > >>         write the data to DDR when writing.
> > >>      b) Order within one request: which means it only writes the
> > >>         completion descriptor after the copy is done.
> > >>         Note: ordering between multiple requests can be implemented
> > >>         through the fence mechanism.
> > >>    The DMA driver only needs to:
> > >>      a) Add one write memory barrier (use a lightweight mb) when ringing
> > >>         the doorbell.
> > >>    So once the DMA is completed the operation is guaranteed visible to
> > >>    all cores, and the 3rd core will observe the right order: core-B
> > >>    prepares data and issues a request to the DMA, the DMA starts work,
> > >>    core-B gets the completion status.
> > >> 3. I worked on a TI multi-core SoC many years ago; the SoC doesn't
> > >>    support cache coherence and consistency between cores. The SoC also
> > >>    has a DMA device which has many channels. Here we do a hypothetical
> > >>    design of the DMA driver with the DPDK DMA framework:
> > >>    The DMA driver should:
> > >>      a) write back the DMA's src buffer, so that there is no cached data
> > >>         while the DMA is running.
> > >>      b) invalidate the DMA's dst buffer
> > >>      c) do a full mb
> > >>      d) update the DMA's registers.
> > >>    Then the DMA will execute the copy task: it copies from DDR and
> > >>    writes to DDR, and after the copy it will set its status register to
> > >>    completed.
> > >>    In this case, the 3rd core will also observe the right order.
> > >>    A particular point of this is: if one buffer is shared by multiple
> > >>    cores, the application should explicitly maintain the cache.
> > >>
> > >> Based on the above, I don't think the DMA API should explicitly add this
> > >> guarantee to its description; it's the driver's and even the
> > >> application's (e.g. on the above TI SoC) duty to make sure of it.
> > >>
> > > Hi,
> > >
> > > thanks for that. So if I understand correctly, your current HW does
> > > provide this guarantee, but you don't think it should always be the
> > > case for dmadev, correct?
> >
> > Yes, our HW will provide the guarantee.
> > If some HW cannot provide it, it's the driver's and maybe the
> > application's duty to provide it.
> >
> > >
> > > Based on that, what do you think should be the guarantee on completion?
> > > Once a job is completed, should the completion be visible to the
> > > submitting core, or to the core reading the completion? Do you think
> > > it's acceptable to add a
> >
> > It will be visible to both cores.
> >
> > > capability flag for drivers to indicate that they do support a
> > > "globally visible" guarantee?
> >
> > I think the driver (together with the HW) should support the "globally
> > visible" guarantee.
> > And for some HW, even the application (or middleware) should care about it.
> >
>
> From a dmadev API viewpoint, whether the driver handles it or the HW itself,
> does not matter. However, if the application needs to take special actions to
> guarantee visibility, then that needs to be flagged as part of the dmadev API.
>
> I see three possibilities:
> 1 Wait until we have a driver that does not have global visibility on
>   return from rte_dma_completed, and at that point add a flag indicating
>   the lack of that support. Until then, document that results of ops will
>   be globally visible.
> 2 Add a flag now to allow drivers to indicate *lack* of global visibility,
>   and document that results are visible unless the flag is set.
> 3 Add a flag now to allow drivers to call out that all results are g.v., and
>   update drivers to use this flag.
>
> I would be very much in favour of #1, because:
> * YAGNI principle - (subject to confirmation by other maintainers) if we
>   don't have a driver right now that needs non-g.v. behaviour we may never
>   need one.
> * In the absence of a concrete case where g.v. is not guaranteed, we may
>   struggle to document correctly what the actual guarantees are, especially
>   if submitter core and completer core are different.
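
For illustration, if option #2 or #3 were taken, an application-side check
might look like the short sketch below. The RTE_DMA_CAPA_GLOBAL_VISIBILITY
bit is purely hypothetical (no such capability flag exists in dmadev today);
rte_dma_info_get() and struct rte_dma_info are the existing dmadev API.

#include <stdbool.h>
#include <rte_bitops.h>
#include <rte_dmadev.h>

/* Hypothetical capability bit, for illustration only; not part of dmadev. */
#define RTE_DMA_CAPA_GLOBAL_VISIBILITY RTE_BIT64(63)

static bool
dma_completion_globally_visible(int16_t dev_id)
{
    struct rte_dma_info info;

    if (rte_dma_info_get(dev_id, &info) != 0)
        return false;
    /* Option #3: drivers set the bit when a completed op is visible to all
     * cores. Under option #2 the sense of the flag would be inverted. */
    return (info.dev_capa & RTE_DMA_CAPA_GLOBAL_VISIBILITY) != 0;
}

Under option #1, no such check would be needed; global visibility on
completion would simply be documented as part of the dmadev contract.
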
>
> @Radha Mohan Chintakuntla, @Veerasenareddy Burru, @Gagandeep Singh,
> @Nipun Gupta, as driver maintainers, can you please confirm whether, on
> receipt of a completion from the HW/driver, the operation results are
> visible on all application cores, i.e. the app does not need additional
> barriers to propagate visibility to other cores? Your opinions on this
> discussion would also be useful.
[Radha Chintakuntla] Yes, as of today on our HW the completion is visible on all cores.
>
> Regards,
> /Bruce
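
For reference, here is a minimal sketch of the two-core pattern discussed in
this thread, written against the public dmadev API and assuming the
"completion implies global visibility" guarantee that option #1 would
document. The desc_ready flag, the function names and the busy-wait loops are
illustrative only; rte_dma_copy(), rte_dma_submit() and rte_dma_completed()
are the existing dmadev calls.

#include <stdbool.h>
#include <stdint.h>
#include <rte_dmadev.h>

/* Illustrative flag shared between the submitting core (B) and the
 * consuming core (C); it stands in for the ring/descriptor write above. */
static uint32_t desc_ready;

/* Core B: enqueue one copy, wait for the HW completion, then publish. */
static int
dma_copy_and_publish(int16_t dev_id, uint16_t vchan,
                     rte_iova_t src, rte_iova_t dst, uint32_t len)
{
    uint16_t last_idx;
    bool error = false;
    int ret;

    ret = rte_dma_copy(dev_id, vchan, src, dst, len, 0);
    if (ret < 0)
        return ret;
    rte_dma_submit(dev_id, vchan);

    /* Busy-wait for the completion. Per the discussion above, once the
     * completion is returned the copied data is expected to be visible
     * to all cores, not only to this one. */
    while (rte_dma_completed(dev_id, vchan, 1, &last_idx, &error) == 0)
        ;
    if (error)
        return -1;

    /* Normal CPU-to-CPU publication still applies: a release store so that
     * core C's acquire load orders the flag after the copied data. */
    __atomic_store_n(&desc_ready, 1, __ATOMIC_RELEASE);
    return 0;
}

/* Core C: wait for the flag, then read the DMA-written buffer directly,
 * with no extra cache maintenance or barriers for the copied data. */
static uint32_t
dma_consume(const uint8_t *dst_buf, uint32_t len)
{
    uint32_t i, sum = 0;

    while (__atomic_load_n(&desc_ready, __ATOMIC_ACQUIRE) == 0)
        ;
    for (i = 0; i < len; i++)
        sum += dst_buf[i];
    return sum;
}

If a driver could not give that guarantee (options #2/#3 above), the point
just before the release store is where the application would need an extra
barrier or cache-maintenance step.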