Hi,

On Thu, Feb 22, 2018 at 7:42 PM, Tomasz Figa <tf...@chromium.org> wrote:
> On Thu, Feb 22, 2018 at 10:45 PM, Robin Murphy <robin.mur...@arm.com> wrote:
>> [sorry, I had intended to reply sooner but clearly forgot]
>>
>> On 16/02/18 00:13, Tomasz Figa wrote:
>>> On Fri, Feb 16, 2018 at 2:14 AM, Robin Murphy <robin.mur...@arm.com> wrote:
>>>> On 15/02/18 04:17, Tomasz Figa wrote:
>>>> [...]
>>>>>> Could you elaborate on what kind of locking you are concerned about?
>>>>>> As I explained before, the normally occurring fast path would lock
>>>>>> dev->power_lock only for the brief moment of incrementing the runtime
>>>>>> PM usage counter.
>>>>>
>>>>> My bad, that's not even it.
>>>>>
>>>>> The atomic usage counter is incremented beforehand, without any
>>>>> locking [1], and the spinlock is acquired only for the sake of
>>>>> validating that the device's runtime PM state indeed remained valid [2],
>>>>> which would be the case in the fast path of the same driver doing two
>>>>> mappings in parallel, with the master powered on (and so the SMMU,
>>>>> through device links; if the master was not powered on already, powering
>>>>> on the SMMU is unavoidable anyway and would add much more latency
>>>>> than the spinlock itself).
>>>>
>>>> We now have no locking at all in the map path, and only a per-domain
>>>> lock around TLB sync in unmap, which is unfortunately necessary for
>>>> correctness; the latter isn't too terrible, since in "serious" hardware
>>>> it should only be serialising a few CPUs serving the same device against
>>>> each other (e.g. for multiple queues on a single NIC).
>>>>
>>>> Putting in a global lock which serialises *all* concurrent map and unmap
>>>> calls for *all* unrelated devices makes things worse. Period. Even if the
>>>> lock itself were held for the minimum possible time, i.e. trivially
>>>> "spin_lock(&lock); spin_unlock(&lock)", the cost of repeatedly bouncing
>>>> that one cache line around between 96 CPUs across two sockets is not
>>>> negligible.
>>>
>>> Fair enough. Note that we're in a quite interesting situation now:
>>> a) we need runtime PM enabled on the Qualcomm SoC to have power
>>>    properly managed,
>>> b) we need lock-free map/unmap on such distributed systems,
>>> c) if runtime PM is enabled, we need to call into runtime PM from any
>>>    code that does hardware accesses, otherwise the IOMMU API (and so the
>>>    DMA API, and then any V4L2 driver) becomes unusable.
>>>
>>> I can see one more way that could potentially let us have all three.
>>> How about enabling runtime PM only on selected implementations
>>> (e.g. qcom,smmu) and then surrounding all the runtime PM calls with
>>> if (pm_runtime_enabled()), which is lockless?
>>
>> Yes, that's the kind of thing I was gravitating towards - my vague thought
>> was adding some flag to the smmu_domain, but pm_runtime_enabled() does look
>> conceptually a lot cleaner.
>
> Great, thanks. Looks like we're in agreement now. \o/
>
> Vivek, does this sound reasonable to you?
Yeah, sounds good to me. I will respin the patches.

Thanks & Regards
Vivek

>
> Best regards,
> Tomasz

--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation
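
[Editorial note: a minimal sketch of the if (pm_runtime_enabled()) gating agreed on above, assuming hypothetical helper names arm_smmu_rpm_get()/arm_smmu_rpm_put() and an arm_smmu_device that carries a ->dev pointer; it illustrates the idea discussed in the thread, not the actual respun patches.]

#include <linux/pm_runtime.h>

/*
 * Hypothetical helpers (names and struct member assumed for illustration)
 * showing the pm_runtime_enabled() gating discussed above.
 */
static inline int arm_smmu_rpm_get(struct arm_smmu_device *smmu)
{
	/*
	 * pm_runtime_enabled() is a plain read of power.disable_depth, so
	 * platforms that never call pm_runtime_enable() for the SMMU (i.e.
	 * everything except implementations that opt in, such as qcom,smmu)
	 * keep a completely lock-free map/unmap fast path.
	 */
	if (pm_runtime_enabled(smmu->dev))
		return pm_runtime_get_sync(smmu->dev);

	return 0;
}

static inline void arm_smmu_rpm_put(struct arm_smmu_device *smmu)
{
	if (pm_runtime_enabled(smmu->dev))
		pm_runtime_put(smmu->dev);
}

Callers on the map/unmap/attach paths would then bracket their hardware
accesses with these helpers instead of calling pm_runtime_get_sync() and
pm_runtime_put() directly, so systems that leave runtime PM disabled for
the SMMU pay only a lockless flag check.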