On Thu, Jan 30, 2020 at 9:06 AM Auger Eric <eric.au...@redhat.com> wrote:
>
> Hi Rob,
> On 1/17/20 10:16 PM, Rob Herring wrote:
> > Arm SMMUv3.2 adds support for TLB range invalidate operations.
> > Support for range invalidate is determined by the RIL bit in the IDR3
> > register.
> >
> > The range invalidate is in units of the leaf page size and operates on
> > 1-32 chunks of a power of 2 multiple pages. First, we determine from the
> > size what power of 2 multiple we can use. Then we calculate how many
> > chunks (1-31) of the power of 2 size for the range on the iteration. On
> > each iteration, we move up in size by at least 5 bits.
> >
> > Cc: Eric Auger <eric.au...@redhat.com>
> > Cc: Jean-Philippe Brucker <jean-phili...@linaro.org>
> > Cc: Will Deacon <w...@kernel.org>
> > Cc: Robin Murphy <robin.mur...@arm.com>
> > Cc: Joerg Roedel <j...@8bytes.org>
> > Signed-off-by: Rob Herring <r...@kernel.org>
> > ---
> >  drivers/iommu/arm-smmu-v3.c | 66 ++++++++++++++++++++++++++++++++++++-
> >  1 file changed, 65 insertions(+), 1 deletion(-)


> > @@ -2003,7 +2024,7 @@ static void arm_smmu_tlb_inv_range(unsigned long 
> > iova, size_t size,
> >  {
> >       u64 cmds[CMDQ_BATCH_ENTRIES * CMDQ_ENT_DWORDS];
> >       struct arm_smmu_device *smmu = smmu_domain->smmu;
> > -     unsigned long start = iova, end = iova + size;
> > +     unsigned long start = iova, end = iova + size, num_pages = 0, tg = 0;
> >       int i = 0;
> >       struct arm_smmu_cmdq_ent cmd = {
> >               .tlbi = {
> > @@ -2022,12 +2043,50 @@ static void arm_smmu_tlb_inv_range(unsigned long 
> > iova, size_t size,
> >               cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> >       }
> >
> > +     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > +             /* Get the leaf page size */
> > +             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> > +
> > +             /* Convert page size of 12,14,16 (log2) to 1,2,3 */
> > +             cmd.tlbi.tg = ((tg - ilog2(SZ_4K)) / 2) + 1;
> > +
> > +             /* Determine what level the granule is at */
> > +             cmd.tlbi.ttl = 4 - ((ilog2(granule) - 3) / (tg - 3));
> > +
> > +             num_pages = size / (1UL << tg);
> > +     }
> > +
> >       while (iova < end) {
> >               if (i == CMDQ_BATCH_ENTRIES) {
> >                       arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, false);
> >                       i = 0;
> >               }
> >
> > +             if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> > +                     /*
> > +                      * On each iteration of the loop, the range is 5 bits
> > +                      * worth of the aligned size remaining.
> > +                      * The range in pages is:
> > +                      *
> > +                      * range = (num_pages & (0x1f << __ffs(num_pages)))
> > +                      */
> > +                     unsigned long scale, num;
> > +
> > +                     /* Determine the power of 2 multiple number of pages 
> > */
> > +                     scale = __ffs(num_pages);
> > +                     cmd.tlbi.scale = scale;
> > +
> > +                     /* Determine how many chunks of 2^scale size we have 
> > */
> > +                     num = (num_pages >> scale) & CMDQ_TLBI_RANGE_NUM_MAX;
> > +                     cmd.tlbi.num = num - 1;
> > +
> > +                     /* range is num * 2^scale * pgsize */
> > +                     granule = num << (scale + tg);
> > +
> > +                     /* Clear out the lower order bits for the next 
> > iteration */
> > +                     num_pages -= num << scale;
> Regarding the 2 options given in
> https://lore.kernel.org/linux-arm-kernel/CAL_JsqKABoE+0crGwyZdNogNgEoG=moopf6deqgh6s73c0u...@mail.gmail.com/raw,
>
> I understand you implemented 2) but I still do not understand why you
> preferred that one against 1).
>
> In your case of 1023*4k pages this will invalidate by 31 32*2^0*4K +
> 31*2^0*4K pages
> whereas you could achieve that with 10 invalidations with the 1st algo.
> I did not get the case where it is more efficient. Please can you detail.

No, it's only 2 commands. We do 31*4K and then 31*2^5*4K. Here's a the
output of a test case:

iova=10001000, num_pages=0x3e0, granule=1f000, num=31, scale=0, ttl=3
iova=10020000, num_pages=0x0, granule=3e0000, num=31, scale=5, ttl=3

(num_pages being what's left at end of the loop)

As I mentioned on v1, worst case is 4 commands for up to 4GB. It's
20-bits of size (32-12) and each loop processes a minimum of 5 bits.
Each loop becomes a larger aligned size, so scale goes up each pass.
This is what I tried to explain in the top comment.

Rob
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu

Reply via email to