Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-18 Thread Will Deacon
On Fri, Aug 18, 2017 at 11:19:00AM +0800, Leizhen (ThunderTown) wrote:
> 
> 
> On 2017/8/17 22:36, Will Deacon wrote:
> > Thunder, Nate, Robin,
> > 
> > On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
> >> I described the optimization more detail in patch 1 and 2, and patch 3-5 
> >> are
> >> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
> >>
> >> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
> >> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
> >> memory contents. I thought more than 3 whole days and got this one.
> >>
> >> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
> >> removal.
> > 
> > For the time being, I think we should focus on the new TLB flushing
> > interface posted by Joerg:
> > 
> > http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org
> > 
> > which looks like it can give us most of the benefits of this series. Once
> > we've got that, we can see what's left in the way of performance and focus
> > on the cmdq batching separately (because I'm still not convinced about it).
> OK, this is a good news.
> 
> But I have a review comment(sorry, I have not subscribed it yet, so can not 
> directly reply it):
> I don't think we should add tlb sync for map operation
> 1. at init time, all tlbs will be invalidated
> 2. when we try to map a new range, there are no related ptes bufferd in tlb, 
> because of above 1 and below 3
> 3. when we unmap the above range, make sure all related ptes bufferd in tlb 
> to be invalidated before unmap finished

Yup, you're completely correct and I raised that with Joerg, who is looking
into a way to avoid it.

Will


Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-18 Thread Will Deacon
On Fri, Aug 18, 2017 at 11:19:00AM +0800, Leizhen (ThunderTown) wrote:
> 
> 
> On 2017/8/17 22:36, Will Deacon wrote:
> > Thunder, Nate, Robin,
> > 
> > On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
> >> I described the optimization more detail in patch 1 and 2, and patch 3-5 
> >> are
> >> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
> >>
> >> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
> >> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
> >> memory contents. I thought more than 3 whole days and got this one.
> >>
> >> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
> >> removal.
> > 
> > For the time being, I think we should focus on the new TLB flushing
> > interface posted by Joerg:
> > 
> > http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org
> > 
> > which looks like it can give us most of the benefits of this series. Once
> > we've got that, we can see what's left in the way of performance and focus
> > on the cmdq batching separately (because I'm still not convinced about it).
> OK, this is a good news.
> 
> But I have a review comment(sorry, I have not subscribed it yet, so can not 
> directly reply it):
> I don't think we should add tlb sync for map operation
> 1. at init time, all tlbs will be invalidated
> 2. when we try to map a new range, there are no related ptes bufferd in tlb, 
> because of above 1 and below 3
> 3. when we unmap the above range, make sure all related ptes bufferd in tlb 
> to be invalidated before unmap finished

Yup, you're completely correct and I raised that with Joerg, who is looking
into a way to avoid it.

Will


Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-17 Thread Leizhen (ThunderTown)


On 2017/8/17 22:36, Will Deacon wrote:
> Thunder, Nate, Robin,
> 
> On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
>> I described the optimization more detail in patch 1 and 2, and patch 3-5 are
>> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
>>
>> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
>> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
>> memory contents. I thought more than 3 whole days and got this one.
>>
>> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
>> removal.
> 
> For the time being, I think we should focus on the new TLB flushing
> interface posted by Joerg:
> 
> http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org
> 
> which looks like it can give us most of the benefits of this series. Once
> we've got that, we can see what's left in the way of performance and focus
> on the cmdq batching separately (because I'm still not convinced about it).
OK, this is a good news.

But I have a review comment(sorry, I have not subscribed it yet, so can not 
directly reply it):
I don't think we should add tlb sync for map operation
1. at init time, all tlbs will be invalidated
2. when we try to map a new range, there are no related ptes bufferd in tlb, 
because of above 1 and below 3
3. when we unmap the above range, make sure all related ptes bufferd in tlb to 
be invalidated before unmap finished

> 
> Thanks,
> 
> Will
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-17 Thread Leizhen (ThunderTown)


On 2017/8/17 22:36, Will Deacon wrote:
> Thunder, Nate, Robin,
> 
> On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
>> I described the optimization more detail in patch 1 and 2, and patch 3-5 are
>> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
>>
>> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
>> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
>> memory contents. I thought more than 3 whole days and got this one.
>>
>> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
>> removal.
> 
> For the time being, I think we should focus on the new TLB flushing
> interface posted by Joerg:
> 
> http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org
> 
> which looks like it can give us most of the benefits of this series. Once
> we've got that, we can see what's left in the way of performance and focus
> on the cmdq batching separately (because I'm still not convinced about it).
OK, this is a good news.

But I have a review comment(sorry, I have not subscribed it yet, so can not 
directly reply it):
I don't think we should add tlb sync for map operation
1. at init time, all tlbs will be invalidated
2. when we try to map a new range, there are no related ptes bufferd in tlb, 
because of above 1 and below 3
3. when we unmap the above range, make sure all related ptes bufferd in tlb to 
be invalidated before unmap finished

> 
> Thanks,
> 
> Will
> 
> .
> 

-- 
Thanks!
BestRegards



Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-17 Thread Will Deacon
Thunder, Nate, Robin,

On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
> I described the optimization more detail in patch 1 and 2, and patch 3-5 are
> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
> 
> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
> memory contents. I thought more than 3 whole days and got this one.
> 
> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
> removal.

For the time being, I think we should focus on the new TLB flushing
interface posted by Joerg:

http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org

which looks like it can give us most of the benefits of this series. Once
we've got that, we can see what's left in the way of performance and focus
on the cmdq batching separately (because I'm still not convinced about it).

Thanks,

Will


Re: [PATCH 0/5] arm-smmu: performance optimization

2017-08-17 Thread Will Deacon
Thunder, Nate, Robin,

On Mon, Jun 26, 2017 at 09:38:45PM +0800, Zhen Lei wrote:
> I described the optimization more detail in patch 1 and 2, and patch 3-5 are
> the implementation on arm-smmu/arm-smmu-v3 of patch 2.
> 
> Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
> queue_inc_prod. But Robin figured that it may lead SMMU consume stale
> memory contents. I thought more than 3 whole days and got this one.
> 
> This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock 
> removal.

For the time being, I think we should focus on the new TLB flushing
interface posted by Joerg:

http://lkml.kernel.org/r/1502974596-23835-1-git-send-email-j...@8bytes.org

which looks like it can give us most of the benefits of this series. Once
we've got that, we can see what's left in the way of performance and focus
on the cmdq batching separately (because I'm still not convinced about it).

Thanks,

Will


[PATCH 0/5] arm-smmu: performance optimization

2017-06-26 Thread Zhen Lei
I described the optimization more detail in patch 1 and 2, and patch 3-5 are
the implementation on arm-smmu/arm-smmu-v3 of patch 2.

Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
queue_inc_prod. But Robin figured that it may lead SMMU consume stale
memory contents. I thought more than 3 whole days and got this one.

This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock removal.

Zhen Lei (5):
  iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock
confliction
  iommu: add a new member unmap_tlb_sync into struct iommu_ops
  iommu/arm-smmu-v3: add support for unmap an iova range with only one
tlb sync
  iommu/arm-smmu: add support for unmap a memory range with only one tlb
sync
  iommu/io-pgtable: delete member tlb_sync_pending of struct io_pgtable

 drivers/iommu/arm-smmu-v3.c| 52 ++
 drivers/iommu/arm-smmu.c   | 10 
 drivers/iommu/io-pgtable-arm-v7s.c | 32 +++
 drivers/iommu/io-pgtable-arm.c | 30 ++
 drivers/iommu/io-pgtable.h |  9 ++-
 drivers/iommu/iommu.c  |  3 +++
 include/linux/iommu.h  |  1 +
 7 files changed, 104 insertions(+), 33 deletions(-)

-- 
2.5.0




[PATCH 0/5] arm-smmu: performance optimization

2017-06-26 Thread Zhen Lei
I described the optimization more detail in patch 1 and 2, and patch 3-5 are
the implementation on arm-smmu/arm-smmu-v3 of patch 2.

Patch 1 is v2. In v1, I directly replaced writel with writel_relaxed in
queue_inc_prod. But Robin figured that it may lead SMMU consume stale
memory contents. I thought more than 3 whole days and got this one.

This patchset is based on Robin Murphy's [PATCH v2 0/8] io-pgtable lock removal.

Zhen Lei (5):
  iommu/arm-smmu-v3: put off the execution of TLBI* to reduce lock
confliction
  iommu: add a new member unmap_tlb_sync into struct iommu_ops
  iommu/arm-smmu-v3: add support for unmap an iova range with only one
tlb sync
  iommu/arm-smmu: add support for unmap a memory range with only one tlb
sync
  iommu/io-pgtable: delete member tlb_sync_pending of struct io_pgtable

 drivers/iommu/arm-smmu-v3.c| 52 ++
 drivers/iommu/arm-smmu.c   | 10 
 drivers/iommu/io-pgtable-arm-v7s.c | 32 +++
 drivers/iommu/io-pgtable-arm.c | 30 ++
 drivers/iommu/io-pgtable.h |  9 ++-
 drivers/iommu/iommu.c  |  3 +++
 include/linux/iommu.h  |  1 +
 7 files changed, 104 insertions(+), 33 deletions(-)

-- 
2.5.0