Re: [PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-27 Thread Benjamin Herrenschmidt
On Mon, 2014-07-28 at 14:09 +0800, Liu Ping Fan wrote:
> In the current code, the setup of the hpte risks racing with
> mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
> Resolve this issue by synchronizing the two actions with KVMPPC_RMAP_LOCK_BIT.

Please describe the race you think you see. I'm quite sure both Paul and
I went over that code and somewhat convinced ourselves that it was OK,
but it's possible that we were both wrong :-)

Cheers,
Ben.

> Signed-off-by: Liu Ping Fan 
> ---
>  arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
>  1 file changed, 10 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
> b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> index 8056107..e6dcff4 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
> struct kvm_vcpu *vcpu,
>  
>   if (hptep[0] & HPTE_V_VALID) {
>   /* HPTE was previously valid, so we need to invalidate it */
> - unlock_rmap(rmap);
>   hptep[0] |= HPTE_V_ABSENT;
>   kvmppc_invalidate_hpte(kvm, hptep, index);
>   /* don't lose previous R and C bits */
>   r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
> +
> + hptep[1] = r;
> + eieio();
> + hptep[0] = hpte[0];
> + asm volatile("ptesync" : : : "memory");
> + unlock_rmap(rmap);
>   } else {
> + hptep[1] = r;
> + eieio();
> + hptep[0] = hpte[0];
> + asm volatile("ptesync" : : : "memory");
>   kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
>   }
>  
> - hptep[1] = r;
> - eieio();
> - hptep[0] = hpte[0];
> - asm volatile("ptesync" : : : "memory");
>   preempt_enable();
>   if (page && hpte_is_writable(r))
>   SetPageDirty(page);
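For readers following along, the ordering the patch enforces — write the second HPTE doubleword, eieio, then the first doubleword with the valid bit, then ptesync, all while holding the rmap lock bit — can be modeled in portable C11. This is a sketch only: all names below are invented, and the C11 fences merely stand in for the PowerPC barriers.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RMAP_LOCK_BIT (1ULL << 63)	/* stand-in for KVMPPC_RMAP_LOCK_BIT */
#define HPTE_V_VALID  0x1ULL		/* stand-in valid bit */

/* Take the per-rmap lock bit (acquire), as lock_rmap() does. */
static void lock_rmap_bit(_Atomic uint64_t *rmap)
{
	while (atomic_fetch_or_explicit(rmap, RMAP_LOCK_BIT,
					memory_order_acquire) & RMAP_LOCK_BIT)
		;	/* spin until the previous holder clears it */
}

/* Drop the lock bit (release), as unlock_rmap() does. */
static void unlock_rmap_bit(_Atomic uint64_t *rmap)
{
	atomic_fetch_and_explicit(rmap, ~RMAP_LOCK_BIT, memory_order_release);
}

/*
 * Publish an HPTE: the second doubleword (pfn, permissions) must be
 * visible before the first doubleword marks the entry valid.  The
 * release fence models eieio; the final fence models ptesync.
 */
static void publish_hpte(_Atomic uint64_t *hpte, uint64_t v, uint64_t r)
{
	atomic_store_explicit(&hpte[1], r, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* ~ eieio */
	atomic_store_explicit(&hpte[0], v | HPTE_V_VALID,
			      memory_order_relaxed);
	atomic_thread_fence(memory_order_seq_cst);	/* ~ ptesync */
}
```

The point of contention in the thread is simply whether publish_hpte() must run inside the lock_rmap_bit()/unlock_rmap_bit() critical section so the mmu notifier cannot invalidate the pfn in between.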


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH] powerpc: kvm: make the setup of hpte under the protection of KVMPPC_RMAP_LOCK_BIT

2014-07-27 Thread Liu Ping Fan
In the current code, the setup of the hpte risks racing with
mmu_notifier_invalidate, i.e. we may set up an hpte with an invalid pfn.
Resolve this issue by synchronizing the two actions with KVMPPC_RMAP_LOCK_BIT.

Signed-off-by: Liu Ping Fan 
---
 arch/powerpc/kvm/book3s_64_mmu_hv.c | 15 ++-
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c 
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 8056107..e6dcff4 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -754,19 +754,24 @@ int kvmppc_book3s_hv_page_fault(struct kvm_run *run, 
struct kvm_vcpu *vcpu,
 
if (hptep[0] & HPTE_V_VALID) {
/* HPTE was previously valid, so we need to invalidate it */
-   unlock_rmap(rmap);
hptep[0] |= HPTE_V_ABSENT;
kvmppc_invalidate_hpte(kvm, hptep, index);
/* don't lose previous R and C bits */
r |= hptep[1] & (HPTE_R_R | HPTE_R_C);
+
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
+   unlock_rmap(rmap);
} else {
+   hptep[1] = r;
+   eieio();
+   hptep[0] = hpte[0];
+   asm volatile("ptesync" : : : "memory");
kvmppc_add_revmap_chain(kvm, rev, rmap, index, 0);
}
 
-   hptep[1] = r;
-   eieio();
-   hptep[0] = hpte[0];
-   asm volatile("ptesync" : : : "memory");
preempt_enable();
if (page && hpte_is_writable(r))
SetPageDirty(page);
-- 
1.8.1.4


Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()

2014-07-27 Thread Benjamin Herrenschmidt
On Mon, 2014-07-28 at 14:32 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 11:19 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> >> This adds missing permission bits to the translated TCE.
> > 
> > Is this a bug fix for existing stuff ? If yes, submit it separately.
> 
> 
> Patch 15/18 fixes a possible bug with leaking pages, and that
> patch won't work until this one is applied.

Please collapse them then.

> Merge this one and "[PATCH v3 15/18] powerpc/iommu: Implement put_page() if
> TCE had non-zero value"?

Right, and if the result is a bug fix on top of something already
upstream, please send it separately.

Cheers,
Ben.

> 
> 
> > 
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> >>  arch/powerpc/kernel/iommu.c | 1 +
> >>  1 file changed, 1 insertion(+)
> >>
> >> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >> index 0cda2e8..5af2319 100644
> >> --- a/arch/powerpc/kernel/iommu.c
> >> +++ b/arch/powerpc/kernel/iommu.c
> >> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, 
> >> unsigned long entry,
> >>return -EFAULT;
> >>}
> >>hwaddr = (unsigned long) page_address(page) + offset;
> >> +  hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
> >>  
> >>ret = iommu_tce_build(tbl, entry, hwaddr, direction);
> >>if (ret)
> > 
> > 
> 
> 



Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm

2014-07-27 Thread Benjamin Herrenschmidt
On Mon, 2014-07-28 at 14:23 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 10:43 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> >> Signed-off-by: Alexey Kardashevskiy 
> >> ---
> > 
> > You need a description.
> > 
> >>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++-
> >>  1 file changed, 34 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> >> b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 516f2ee..48b7ed4 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long 
> >> window_size)
> >> * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
> >>  }
> >>  
> >> +/*
> >> + * Checks ulimit in order not to let the user space to pin all
> >> + * available memory for TCE tables.
> >> + */
> >> +static long kvmppc_account_memlimit(long npages)
> >> +{
> >> +  unsigned long ret = 0, locked, lock_limit;
> >> +
> >> +  if (!current->mm)
> >> +  return -ESRCH; /* process exited */
> >> +
> >> +  down_write(&current->mm->mmap_sem);
> >> +  locked = current->mm->locked_vm + npages;
> >> +  lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> >> +  if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> >> +  pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> >> +  rlimit(RLIMIT_MEMLOCK));
> >> +  ret = -ENOMEM;
> >> +  } else {
> >> +  current->mm->locked_vm += npages;
> >> +  }
> >> +  up_write(&current->mm->mmap_sem);
> >> +
> >> +  return ret;
> >> +}
> >> +
> >>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
> >>  {
> >>struct kvm *kvm = stt->kvm;
> >>int i;
> >> +  long npages = kvmppc_stt_npages(stt->window_size);
> >>  
> >>mutex_lock(&kvm->lock);
> >>list_del(&stt->list);
> >> -  for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> >> +  for (i = 0; i < npages; i++)
> >>__free_page(stt->pages[i]);
> >> +
> >>kfree(stt);
> >>mutex_unlock(&kvm->lock);
> >>  
> >> +  kvmppc_account_memlimit(-(npages + 1));
> >> +
> >>kvm_put_kvm(kvm);
> >>  }
> >>  
> >> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>}
> >>  
> >>npages = kvmppc_stt_npages(args->window_size);
> >> +  ret = kvmppc_account_memlimit(npages + 1);
> >> +  if (ret)
> >> +  goto fail;
> > 
> > This is called for VFIO only or is it also called when creating TCE
> > tables for emulated devices ? Because in the latter case, you don't
> > want to account the pages as locked, do you ?
> 
> At the moment TCE-containing pages (for emulated TCE) are allocated with
> alloc_page() which is kernel memory and therefore always locked, no?

So the npages up there is the number of TCE-containing pages, not the
number of mapped-by-TCE pages? In that case it makes sense, yes.

> 
> > Also, you need to explain what +1
> > 
> > Finally, do I correctly deduce that creating 10 TCE tables of 2G
> > each will end up accounting 20G as locked even if the guest for
> > example only has 4G of RAM ? 
> 
> 
> The user is required to set the limit to 20G, correct. But this does not
> mean all 20G will be pinned. Ugly but better than nothing. As I remember
> from your explanations, even if we give up real/virtual mode handlers for
> H_PUT_TCE&Co, we cannot rely on existing counters in page struct in order
> to understand whether we need to account a page again or not, so we are
> stuck with this code till we have a "clone DDW window" API.

Right but please put that explanation somewhere in one of the changeset
comments or as comments near the code.

> But this patch is not about guest pages, it is about pages with TCEs, there
> was no counting for this at all.

Ok.

> 
> 
> > 
> >>stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
> >>  GFP_KERNEL);
> > 
> > Ben.
> > 
> > 
> 
> 



Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()

2014-07-27 Thread Alexey Kardashevskiy
On 07/28/2014 11:19 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> This adds missing permission bits to the translated TCE.
> 
> Is this a bug fix for existing stuff ? If yes, submit it separately.


Patch 15/18 fixes a possible bug with leaking pages, and that
patch won't work until this one is applied.

Merge this one and "[PATCH v3 15/18] powerpc/iommu: Implement put_page() if
TCE had non-zero value"?




> 
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
>>  arch/powerpc/kernel/iommu.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index 0cda2e8..5af2319 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, 
>> unsigned long entry,
>>  return -EFAULT;
>>  }
>>  hwaddr = (unsigned long) page_address(page) + offset;
>> +hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
>>  
>>  ret = iommu_tce_build(tbl, entry, hwaddr, direction);
>>  if (ret)
> 
> 


-- 
Alexey

Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables

2014-07-27 Thread Benjamin Herrenschmidt
On Mon, 2014-07-28 at 14:11 +1000, Alexey Kardashevskiy wrote:
> ===
> At the moment the spapr_tce_tables list is not protected against races
> which may happen if the userspace issues the KVM_CREATE_SPAPR_TCE ioctl
> to KVM from different threads.
> 
> This makes use of _rcu helpers for list_add()/list_del().
> ===
> 
> and it is missing this bit as well (was under impression that all tables
> get freed at once at KVM exit but this might not be the case for malicious
> guest):
> 
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -211,7 +211,7 @@ static void release_spapr_tce_table(struct
> kvmppc_spapr_tce_table *stt)
> long npages = kvmppc_stt_npages(stt->size);
> 
> mutex_lock(&kvm->lock);
> -   list_del(&stt->list);
> +   list_del_rcu(&stt->list);
> 
> 
> 
> Would this be sufficient?

Well, not really, we'd need to see the corresponding RCU'isms when
walking the list, including the rcu read lock ... which could be
problematic if we are in real mode.

Cheers,
Ben.


> 
> 
> > 
> > Cheers,
> > Ben.
> > 
> >> ---
> >>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> >> b/arch/powerpc/kvm/book3s_64_vio.c
> >> index 54cf9bc..516f2ee 100644
> >> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
> >>kvm_get_kvm(kvm);
> >>  
> >>mutex_lock(&kvm->lock);
> >> -  list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> >> +  list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
> >>  
> >>mutex_unlock(&kvm->lock);
> >>  
> > 
> > 
> 
> 



Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm

2014-07-27 Thread Alexey Kardashevskiy
On 07/28/2014 10:43 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy 
>> ---
> 
> You need a description.
> 
>>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++-
>>  1 file changed, 34 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index 516f2ee..48b7ed4 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
>>   * sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>>  }
>>  
>> +/*
>> + * Checks ulimit in order not to let the user space to pin all
>> + * available memory for TCE tables.
>> + */
>> +static long kvmppc_account_memlimit(long npages)
>> +{
>> +unsigned long ret = 0, locked, lock_limit;
>> +
>> +if (!current->mm)
>> +return -ESRCH; /* process exited */
>> +
>> +down_write(&current->mm->mmap_sem);
>> +locked = current->mm->locked_vm + npages;
>> +lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
>> +if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
>> +pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
>> +rlimit(RLIMIT_MEMLOCK));
>> +ret = -ENOMEM;
>> +} else {
>> +current->mm->locked_vm += npages;
>> +}
>> +up_write(&current->mm->mmap_sem);
>> +
>> +return ret;
>> +}
>> +
>>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>>  {
>>  struct kvm *kvm = stt->kvm;
>>  int i;
>> +long npages = kvmppc_stt_npages(stt->window_size);
>>  
>>  mutex_lock(&kvm->lock);
>>  list_del(&stt->list);
>> -for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
>> +for (i = 0; i < npages; i++)
>>  __free_page(stt->pages[i]);
>> +
>>  kfree(stt);
>>  mutex_unlock(&kvm->lock);
>>  
>> +kvmppc_account_memlimit(-(npages + 1));
>> +
>>  kvm_put_kvm(kvm);
>>  }
>>  
>> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  }
>>  
>>  npages = kvmppc_stt_npages(args->window_size);
>> +ret = kvmppc_account_memlimit(npages + 1);
>> +if (ret)
>> +goto fail;
> 
> This is called for VFIO only or is it also called when creating TCE
> tables for emulated devices ? Because in the latter case, you don't
> want to account the pages as locked, do you ?

At the moment TCE-containing pages (for emulated TCE) are allocated with
alloc_page() which is kernel memory and therefore always locked, no?


> Also, you need to explain what +1
> 
> Finally, do I correctly deduce that creating 10 TCE tables of 2G
> each will end up accounting 20G as locked even if the guest for
> example only has 4G of RAM ? 


The user is required to set the limit to 20G, correct. But this does not
mean all 20G will be pinned. Ugly but better than nothing. As I remember
from your explanations, even if we give up real/virtual mode handlers for
H_PUT_TCE&Co, we cannot rely on existing counters in page struct in order
to understand whether we need to account a page again or not, so we are
stuck with this code till we have a "clone DDW window" API.


But this patch is not about guest pages, it is about pages with TCEs, there
was no counting for this at all.
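The accounting scheme being discussed reduces to a small, testable model: charge pages against a hard limit, refuse the charge if it would exceed the limit, and release with a negative count. This is a userspace sketch with invented names, not the kernel helper (no mmap_sem, no CAP_IPC_LOCK escape hatch):

```c
#include <assert.h>

/*
 * Toy model of the locked_vm accounting in the patch: a positive npages
 * charges pages against a hard limit, a negative npages releases them.
 * Returns 0 on success, -1 (think -ENOMEM) if the charge would exceed
 * the limit.
 */
struct mm_acct {
	long locked_vm;		/* pages currently accounted */
	long lock_limit;	/* ~ rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT */
};

static int account_memlimit(struct mm_acct *mm, long npages)
{
	long locked = mm->locked_vm + npages;

	if (npages > 0 && locked > mm->lock_limit)
		return -1;	/* charge refused, counter unchanged */
	mm->locked_vm = locked;
	return 0;
}
```

The key property Ben's review is probing is visible here: each create charges the counter, so ten independent tables each charge their full page count regardless of how much is ever actually pinned.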



> 
>>  stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
>>GFP_KERNEL);
> 
> Ben.
> 
> 


-- 
Alexey

Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()

2014-07-27 Thread Benjamin Herrenschmidt
On Mon, 2014-07-28 at 13:55 +1000, Alexey Kardashevskiy wrote:
> On 07/28/2014 11:18 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> >> At the moment the iommu_table struct has a set_bypass() which enables/
> >> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> >> which calls this callback when external IOMMU users such as VFIO are
> >> about to take over a PHB.
> >>
> >> Since the set_bypass() is not really an iommu_table function but PE's
> >> function, and we have an ops struct per IOMMU owner, let's move
> >> set_bypass() to the spapr_tce_iommu_ops struct.
> >>
> >> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
> >> has very little to do with PEs, this moves take_ownership() calls to
> >> the VFIO SPAPR TCE driver.
> >>
> >> This renames set_bypass() to take_ownership() as it is not necessarily
> >> just enabling bypassing, it can be something else/more so let's give it
> >> a generic name. The bool parameter is inverted.
> > 
> > I disagree with the name change. take_ownership() is the right semantic
> > at the high level, but down to the actual table, it *is* about disabling
> > bypass.
> 
> It is still pnv_pci_ioda2_set_bypass() :-/
> 
> take_ownership() is a callback of PNV IOMMU group.
> 
> s/take_ownership/set_bypass/ ?
> 

Hrm, ... ok. It's a bit messy but should do for now.
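The ops-struct direction being settled on here can be sketched in plain C (all names below are invented for illustration; this is not the kernel code): the callback hangs off a per-owner ops table rather than off the iommu_table, and the caller invokes it only if the owner provides one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct tce_group;

struct tce_ops {
	/* per-owner callback, the moved equivalent of set_bypass() */
	void (*take_ownership)(struct tce_group *grp, bool enable);
};

struct tce_group {
	const struct tce_ops *ops;
	bool bypass_enabled;
};

/* An IODA2-flavoured owner: taking ownership means disabling bypass. */
static void ioda2_take_ownership(struct tce_group *grp, bool enable)
{
	grp->bypass_enabled = !enable;
}

static const struct tce_ops ioda2_ops = {
	.take_ownership = ioda2_take_ownership,
};

/* VFIO-side caller: invoke the callback only if the owner provides one. */
static void vfio_take_ownership(struct tce_group *grp, bool enable)
{
	if (grp->ops && grp->ops->take_ownership)
		grp->ops->take_ownership(grp, enable);
}
```

This is why the naming argument matters: at the caller the semantic is "take ownership", while the IODA2 implementation underneath really is just toggling bypass.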

> > 
> >> Signed-off-by: Alexey Kardashevskiy 
> >> Reviewed-by: Gavin Shan 
> >> ---
> >>  arch/powerpc/include/asm/iommu.h  |  1 -
> >>  arch/powerpc/include/asm/tce.h|  2 ++
> >>  arch/powerpc/kernel/iommu.c   | 12 
> >>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++---
> >>  drivers/vfio/vfio_iommu_spapr_tce.c   | 16 
> >>  5 files changed, 29 insertions(+), 20 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/iommu.h 
> >> b/arch/powerpc/include/asm/iommu.h
> >> index 84ee339..2b0b01d 100644
> >> --- a/arch/powerpc/include/asm/iommu.h
> >> +++ b/arch/powerpc/include/asm/iommu.h
> >> @@ -77,7 +77,6 @@ struct iommu_table {
> >>  #ifdef CONFIG_IOMMU_API
> >>struct iommu_group *it_group;
> >>  #endif
> >> -  void (*set_bypass)(struct iommu_table *tbl, bool enable);
> >>  };
> >>  
> >>  /* Pure 2^n version of get_order */
> >> diff --git a/arch/powerpc/include/asm/tce.h 
> >> b/arch/powerpc/include/asm/tce.h
> >> index 8bfe98f..5ee4987 100644
> >> --- a/arch/powerpc/include/asm/tce.h
> >> +++ b/arch/powerpc/include/asm/tce.h
> >> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
> >>struct iommu_table *(*get_table)(
> >>struct spapr_tce_iommu_group *data,
> >>phys_addr_t addr);
> >> +  void (*take_ownership)(struct spapr_tce_iommu_group *data,
> >> +  bool enable);
> >>  };
> >>  
> >>  struct spapr_tce_iommu_group {
> >> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> >> index e203314..06984d5 100644
> >> --- a/arch/powerpc/kernel/iommu.c
> >> +++ b/arch/powerpc/kernel/iommu.c
> >> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
> >>memset(tbl->it_map, 0xff, sz);
> >>iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
> >>  
> >> -  /*
> >> -   * Disable iommu bypass, otherwise the user can DMA to all of
> >> -   * our physical memory via the bypass window instead of just
> >> -   * the pages that has been explicitly mapped into the iommu
> >> -   */
> >> -  if (tbl->set_bypass)
> >> -  tbl->set_bypass(tbl, false);
> >> -
> >>return 0;
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
> >> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table 
> >> *tbl)
> >>/* Restore bit#0 set by iommu_init_table() */
> >>if (tbl->it_offset == 0)
> >>set_bit(0, tbl->it_map);
> >> -
> >> -  /* The kernel owns the device now, we can restore the iommu bypass */
> >> -  if (tbl->set_bypass)
> >> -  tbl->set_bypass(tbl, true);
> >>  }
> >>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
> >>  
> >> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> >> b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> index 495137b..f828c57 100644
> >> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> >> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> >> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
> >> *phb,
> >>__free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
> >>  }
> >>  
> >> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> >> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
> >>  {
> >> -  struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> >> -tce32.table);
> >>uint16_t window_id = (pe->pe_number << 1 ) + 1;
> >>int64_t rc;
> >>  
> >> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(s

Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables

2014-07-27 Thread Alexey Kardashevskiy
On 07/28/2014 10:40 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
>> Signed-off-by: Alexey Kardashevskiy 
>> Reviewed-by: Gavin Shan 
> 
> You need a much better changeset here, explaining what you are
> protecting against and why just changing that one list_add is
> sufficient.



===
At the moment the spapr_tce_tables list is not protected against races
which may happen if the userspace issues the KVM_CREATE_SPAPR_TCE ioctl
to KVM from different threads.

This makes use of _rcu helpers for list_add()/list_del().
===

and it is missing this bit as well (was under impression that all tables
get freed at once at KVM exit but this might not be the case for malicious
guest):

--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -211,7 +211,7 @@ static void release_spapr_tce_table(struct
kvmppc_spapr_tce_table *stt)
long npages = kvmppc_stt_npages(stt->size);

mutex_lock(&kvm->lock);
-   list_del(&stt->list);
+   list_del_rcu(&stt->list);



Would this be sufficient?



> 
> Cheers,
> Ben.
> 
>> ---
>>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
>> b/arch/powerpc/kvm/book3s_64_vio.c
>> index 54cf9bc..516f2ee 100644
>> --- a/arch/powerpc/kvm/book3s_64_vio.c
>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
>> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>>  kvm_get_kvm(kvm);
>>  
>>  mutex_lock(&kvm->lock);
>> -list_add(&stt->list, &kvm->arch.spapr_tce_tables);
>> +list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
>>  
>>  mutex_unlock(&kvm->lock);
>>  
> 
> 


-- 
Alexey
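The guarantee that list_add_rcu() provides — publish a fully initialized element so a lockless reader never sees a half-built one — can be modeled in portable C11. This is a sketch with invented names: the release store stands in for rcu_assign_pointer(), the acquire loads for rcu_dereference(), and no grace-period machinery (the part Ben notes is the hard bit, especially in real mode) is modeled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct node {
	int val;
	struct node *_Atomic next;
};

static struct node *make_node(int val)
{
	struct node *n = malloc(sizeof(*n));
	n->val = val;
	atomic_store_explicit(&n->next, NULL, memory_order_relaxed);
	return n;
}

/*
 * ~ list_add_rcu(): fully initialize the element, then publish it with
 * a release store so a lockless reader never observes a half-built node.
 */
static void list_add_front(struct node *_Atomic *head, struct node *n)
{
	struct node *old = atomic_load_explicit(head, memory_order_relaxed);

	atomic_store_explicit(&n->next, old, memory_order_relaxed);
	atomic_store_explicit(head, n, memory_order_release);
}

/* Reader side: acquire loads play the role of rcu_dereference(). */
static int sum_list(struct node *_Atomic *head)
{
	int s = 0;

	for (struct node *p = atomic_load_explicit(head, memory_order_acquire);
	     p != NULL;
	     p = atomic_load_explicit(&p->next, memory_order_acquire))
		s += p->val;
	return s;
}
```

Deletion is the part this sketch deliberately omits: list_del_rcu() alone is not enough, since the element may only be freed after a grace period once all readers are done with it.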

Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()

2014-07-27 Thread Alexey Kardashevskiy
On 07/28/2014 11:18 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
>> At the moment the iommu_table struct has a set_bypass() which enables/
>> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
>> which calls this callback when external IOMMU users such as VFIO are
>> about to take over a PHB.
>>
>> Since the set_bypass() is not really an iommu_table function but PE's
>> function, and we have an ops struct per IOMMU owner, let's move
>> set_bypass() to the spapr_tce_iommu_ops struct.
>>
>> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
>> has very little to do with PEs, this moves take_ownership() calls to
>> the VFIO SPAPR TCE driver.
>>
>> This renames set_bypass() to take_ownership() as it is not necessarily
>> just enabling bypassing, it can be something else/more so let's give it
>> a generic name. The bool parameter is inverted.
> 
> I disagree with the name change. take_ownership() is the right semantic
> at the high level, but down to the actual table, it *is* about disabling
> bypass.

It is still pnv_pci_ioda2_set_bypass() :-/

take_ownership() is a callback of PNV IOMMU group.

s/take_ownership/set_bypass/ ?


> 
>> Signed-off-by: Alexey Kardashevskiy 
>> Reviewed-by: Gavin Shan 
>> ---
>>  arch/powerpc/include/asm/iommu.h  |  1 -
>>  arch/powerpc/include/asm/tce.h|  2 ++
>>  arch/powerpc/kernel/iommu.c   | 12 
>>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++---
>>  drivers/vfio/vfio_iommu_spapr_tce.c   | 16 
>>  5 files changed, 29 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/iommu.h 
>> b/arch/powerpc/include/asm/iommu.h
>> index 84ee339..2b0b01d 100644
>> --- a/arch/powerpc/include/asm/iommu.h
>> +++ b/arch/powerpc/include/asm/iommu.h
>> @@ -77,7 +77,6 @@ struct iommu_table {
>>  #ifdef CONFIG_IOMMU_API
>>  struct iommu_group *it_group;
>>  #endif
>> -void (*set_bypass)(struct iommu_table *tbl, bool enable);
>>  };
>>  
>>  /* Pure 2^n version of get_order */
>> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
>> index 8bfe98f..5ee4987 100644
>> --- a/arch/powerpc/include/asm/tce.h
>> +++ b/arch/powerpc/include/asm/tce.h
>> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
>>  struct iommu_table *(*get_table)(
>>  struct spapr_tce_iommu_group *data,
>>  phys_addr_t addr);
>> +void (*take_ownership)(struct spapr_tce_iommu_group *data,
>> +bool enable);
>>  };
>>  
>>  struct spapr_tce_iommu_group {
>> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
>> index e203314..06984d5 100644
>> --- a/arch/powerpc/kernel/iommu.c
>> +++ b/arch/powerpc/kernel/iommu.c
>> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>>  memset(tbl->it_map, 0xff, sz);
>>  iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
>>  
>> -/*
>> - * Disable iommu bypass, otherwise the user can DMA to all of
>> - * our physical memory via the bypass window instead of just
>> - * the pages that has been explicitly mapped into the iommu
>> - */
>> -if (tbl->set_bypass)
>> -tbl->set_bypass(tbl, false);
>> -
>>  return 0;
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
>> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>>  /* Restore bit#0 set by iommu_init_table() */
>>  if (tbl->it_offset == 0)
>>  set_bit(0, tbl->it_map);
>> -
>> -/* The kernel owns the device now, we can restore the iommu bypass */
>> -if (tbl->set_bypass)
>> -tbl->set_bypass(tbl, true);
>>  }
>>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>>  
>> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
>> b/arch/powerpc/platforms/powernv/pci-ioda.c
>> index 495137b..f828c57 100644
>> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
>> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
>> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
>> *phb,
>>  __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>>  }
>>  
>> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
>> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>>  {
>> -struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
>> -  tce32.table);
>>  uint16_t window_id = (pe->pe_number << 1 ) + 1;
>>  int64_t rc;
>>  
>> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
>> pnv_phb *phb,
>>  /* TVE #1 is selected by PCI address bit 59 */
>>  pe->tce_bypass_base = 1ull << 59;
>>  
>> -/* Install set_bypass callback for VFIO */
>> -pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
>> -
>>  /* Enable bypass by default */
>> -pnv_pci_iod

[PATCH] Add Michael Ellerman as powerpc co-maintainer

2014-07-27 Thread Benjamin Herrenschmidt
Michael has been backing me up and helping with all aspects of
maintainership for a while now, let's make it official.

Signed-off-by: Benjamin Herrenschmidt 
--
Linus: I'll put this patch in powerpc-next, I'm just CC'ing
you so you are aware, in case I'm not available and Michael starts
sending you stuff directly.

diff --git a/MAINTAINERS b/MAINTAINERS
index c43ea88..815cd6f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5357,6 +5357,7 @@ F:	arch/powerpc/boot/rs6000.h
 LINUX FOR POWERPC (32-BIT AND 64-BIT)
 M: Benjamin Herrenschmidt 
 M: Paul Mackerras 
+M: Michael Ellerman 
 W: http://www.penguinppc.org/
 L: linuxppc-dev@lists.ozlabs.org
 Q: http://patchwork.ozlabs.org/project/linuxppc-dev/list/




[git pull] Please pull powerpc.git merge branch

2014-07-27 Thread Benjamin Herrenschmidt
Hi Linus !

Here are 3 more small powerpc fixes that should still go into .16.
One is a recent regression (MMCR2 business), the other is a trivial
endian fix without which FW updates won't work on LE in IBM machines,
and the 3rd one turns a BUG_ON into a WARN_ON which is definitely
a LOT more friendly especially when the whole thing is about retrieving
error logs ...

Cheers,
Ben.

The following changes since commit 6f5405bc2ee0102bb3856e2cdea64ff415db2e0c:

  powerpc: use _GLOBAL_TOC for memmove (2014-07-22 15:56:04 +1000)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git merge

for you to fetch changes up to 396a34340cdf7373c00e3977db27d1a20ea65ebc:

  powerpc: Fix endianness of flash_block_list in rtas_flash (2014-07-28 
11:30:54 +1000)


Michael Ellerman (1):
  powerpc/perf: Fix MMCR2 handling for EBB

Thomas Falcon (1):
  powerpc: Fix endianness of flash_block_list in rtas_flash

Vasant Hegde (1):
  powerpc/powernv: Change BUG_ON to WARN_ON in elog code

 arch/powerpc/kernel/rtas_flash.c   | 6 --
 arch/powerpc/perf/core-book3s.c| 6 +++---
 arch/powerpc/platforms/powernv/opal-elog.c | 4 ++--
 3 files changed, 9 insertions(+), 7 deletions(-)



Re: [PATCH] powerpc: fixing endianness of flash_block_list in rtas_flash

2014-07-27 Thread Michael Ellerman
Hi Thomas,

On Fri, 2014-07-25 at 12:47 -0500, Thomas Falcon wrote:
> [PATCH] powerpc: fixing endianness of flash_block_list in rtas_flash

Minor nit, but the commit title should use the imperative mood, so eg:

  [PATCH] powerpc: Fix endianness of flash_block_list in rtas_flash

> The function rtas_flash_firmware passes the address of a data structure,
> flash_block_list, when making the update-flash-64-and-reboot rtas call.
> While the endianness of the address is handled correctly, the endianness
> of the data is not.  This patch ensures that the data in flash_block_list
> is big endian when passed to rtas on little endian hosts.

This looks good.

But you can do even better by changing the data types to be explicitly BE, so
eg. length should be __be64 I think.

Then if you build with "make C=2 CF=-D__CHECK_ENDIAN__" and have sparse
installed, it will tell you when you incorrectly assign from/to the BE types.

cheers
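A portable illustration of the conversion being suggested (names here are invented for the sketch; in the kernel one would use cpu_to_be64() and type the struct fields __be64 so sparse can flag bad assignments):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t be64;	/* with sparse, this would be __be64 */

/* Portable cpu_to_be64 analogue: lay the value out in big-endian byte
 * order regardless of host endianness. */
static be64 cpu_to_be64_sketch(uint64_t v)
{
	uint8_t b[8];
	be64 out;
	int i;

	for (i = 0; i < 8; i++)
		b[i] = (uint8_t)(v >> (56 - 8 * i));
	memcpy(&out, b, sizeof(out));
	return out;
}

/* A flash_block-style struct whose on-wire fields are big endian. */
struct flash_block_sketch {
	be64 data;	/* block address, already converted in the patch */
	be64 length;	/* the field the suggestion would type as __be64 */
};

static void flash_block_fill(struct flash_block_sketch *fb,
			     uint64_t data, uint64_t length)
{
	fb->data = cpu_to_be64_sketch(data);
	fb->length = cpu_to_be64_sketch(length);
}
```

With the fields typed as __be64, a plain `fb->length = length;` would trigger a sparse warning, which is exactly the safety net being recommended.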



Re: [PATCH v3 12/18] powerpc/iommu: Fix missing permission bits in iommu_put_tce_user_mode()

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> This adds missing permission bits to the translated TCE.

Is this a bug fix for existing stuff ? If yes, submit it separately.

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/kernel/iommu.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 0cda2e8..5af2319 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1088,6 +1088,7 @@ int iommu_put_tce_user_mode(struct iommu_table *tbl, 
> unsigned long entry,
>   return -EFAULT;
>   }
>   hwaddr = (unsigned long) page_address(page) + offset;
> + hwaddr |= tce & (TCE_PCI_READ | TCE_PCI_WRITE);
>  
>   ret = iommu_tce_build(tbl, entry, hwaddr, direction);
>   if (ret)
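The one-line fix works because page addresses are page-aligned, so the low bits of hwaddr are free to carry the permission flags copied from the TCE. A standalone sketch of that packing (helper names are invented; the flag values mirror TCE_PCI_READ/TCE_PCI_WRITE but are an assumption of this sketch, not a quote of the header):

```c
#include <assert.h>
#include <stdint.h>

#define TCE_READ_SK   0x1ULL
#define TCE_WRITE_SK  0x2ULL
#define TCE_PERM_MASK (TCE_READ_SK | TCE_WRITE_SK)

/*
 * Page addresses are at least 4K-aligned, so the bottom bits of hwaddr
 * are free to carry the permission flags copied from the TCE -- which
 * is what the one-line `hwaddr |= tce & (...)` fix does.
 */
static uint64_t tce_pack(uint64_t page_addr, uint64_t tce)
{
	return page_addr | (tce & TCE_PERM_MASK);
}

static uint64_t tce_addr(uint64_t hwaddr)
{
	return hwaddr & ~TCE_PERM_MASK;
}

static uint64_t tce_perms(uint64_t hwaddr)
{
	return hwaddr & TCE_PERM_MASK;
}
```

Without the fix, the permissions are silently dropped, which is why the later put_page() cleanup patch in the series depends on this one.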



Re: [PATCH v3 10/18] powerpc/powernv: Convert/move set_bypass() callback to take_ownership()

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> At the moment the iommu_table struct has a set_bypass() which enables/
> disables DMA bypass on IODA2 PHB. This is exposed to POWERPC IOMMU code
> which calls this callback when external IOMMU users such as VFIO are
> about to take over a PHB.
> 
> Since the set_bypass() is not really an iommu_table function but a PE's
> function, and we have an ops struct per IOMMU owner, let's move
> set_bypass() to the spapr_tce_iommu_ops struct.
> 
> As arch/powerpc/kernel/iommu.c is more about POWERPC IOMMU tables and
> has very little to do with PEs, this moves take_ownership() calls to
> the VFIO SPAPR TCE driver.
> 
> This renames set_bypass() to take_ownership() as it is not necessarily
> just enabling bypassing, it can be something else/more, so let's give it
> a generic name. The bool parameter is inverted.

I disagree with the name change. take_ownership() is the right semantic
at the high level, but down to the actual table, it *is* about disabling
bypass.

> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Gavin Shan 
> ---
>  arch/powerpc/include/asm/iommu.h  |  1 -
>  arch/powerpc/include/asm/tce.h|  2 ++
>  arch/powerpc/kernel/iommu.c   | 12 
>  arch/powerpc/platforms/powernv/pci-ioda.c | 18 +++---
>  drivers/vfio/vfio_iommu_spapr_tce.c   | 16 
>  5 files changed, 29 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index 84ee339..2b0b01d 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -77,7 +77,6 @@ struct iommu_table {
>  #ifdef CONFIG_IOMMU_API
>   struct iommu_group *it_group;
>  #endif
> - void (*set_bypass)(struct iommu_table *tbl, bool enable);
>  };
>  
>  /* Pure 2^n version of get_order */
> diff --git a/arch/powerpc/include/asm/tce.h b/arch/powerpc/include/asm/tce.h
> index 8bfe98f..5ee4987 100644
> --- a/arch/powerpc/include/asm/tce.h
> +++ b/arch/powerpc/include/asm/tce.h
> @@ -58,6 +58,8 @@ struct spapr_tce_iommu_ops {
>   struct iommu_table *(*get_table)(
>   struct spapr_tce_iommu_group *data,
>   phys_addr_t addr);
> + void (*take_ownership)(struct spapr_tce_iommu_group *data,
> + bool enable);
>  };
>  
>  struct spapr_tce_iommu_group {
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index e203314..06984d5 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -1116,14 +1116,6 @@ int iommu_take_ownership(struct iommu_table *tbl)
>   memset(tbl->it_map, 0xff, sz);
>   iommu_clear_tces_and_put_pages(tbl, tbl->it_offset, tbl->it_size);
>  
> - /*
> -  * Disable iommu bypass, otherwise the user can DMA to all of
> -  * our physical memory via the bypass window instead of just
> -  * the pages that has been explicitly mapped into the iommu
> -  */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, false);
> -
>   return 0;
>  }
>  EXPORT_SYMBOL_GPL(iommu_take_ownership);
> @@ -1138,10 +1130,6 @@ void iommu_release_ownership(struct iommu_table *tbl)
>   /* Restore bit#0 set by iommu_init_table() */
>   if (tbl->it_offset == 0)
>   set_bit(0, tbl->it_map);
> -
> - /* The kernel owns the device now, we can restore the iommu bypass */
> - if (tbl->set_bypass)
> - tbl->set_bypass(tbl, true);
>  }
>  EXPORT_SYMBOL_GPL(iommu_release_ownership);
>  
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 495137b..f828c57 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -709,10 +709,8 @@ static void pnv_pci_ioda_setup_dma_pe(struct pnv_phb 
> *phb,
>   __free_pages(tce_mem, get_order(TCE32_TABLE_SIZE * segs));
>  }
>  
> -static void pnv_pci_ioda2_set_bypass(struct iommu_table *tbl, bool enable)
> +static void pnv_pci_ioda2_set_bypass(struct pnv_ioda_pe *pe, bool enable)
>  {
> - struct pnv_ioda_pe *pe = container_of(tbl, struct pnv_ioda_pe,
> -   tce32.table);
>   uint16_t window_id = (pe->pe_number << 1 ) + 1;
>   int64_t rc;
>  
> @@ -752,15 +750,21 @@ static void pnv_pci_ioda2_setup_bypass_pe(struct 
> pnv_phb *phb,
>   /* TVE #1 is selected by PCI address bit 59 */
>   pe->tce_bypass_base = 1ull << 59;
>  
> - /* Install set_bypass callback for VFIO */
> - pe->tce32.table.set_bypass = pnv_pci_ioda2_set_bypass;
> -
>   /* Enable bypass by default */
> - pnv_pci_ioda2_set_bypass(&pe->tce32.table, true);
> + pnv_pci_ioda2_set_bypass(pe, true);
> +}
> +
> +static void pnv_ioda2_take_ownership(struct spapr_tce_iommu_group *data,
> +  bool enable)
> +{
> + struct pnv_ioda_pe *pe = data->iommu_

Re: [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> The tce_build/tce_build_rm callbacks are used to implement H_PUT_TCE/etc
> hypercalls. The PAPR spec does not allow the hypercall to fail if the TCE is not empty.
> However we cannot just overwrite the existing TCE value with the new one
> as we still have to do page counting.
> 
> This adds an optional @old_tces return parameter. If it is not NULL,
> it must point to an array of @npages size where the callbacks will
> store old TCE values. Since tce_build receives virtual addresses,
> the old_tces array will contain virtual addresses as well.
> 
> As this patch is mechanical, no change in behaviour is expected.

Yeah well ... you added the new argument but you never actually populate
it. Don't do that, make that one single patch.

Also this is broken in many other ways. You don't implement it for all
iommus, which is not a reasonable interface. On the other hand,
implementing it for all of them would have a significant cost, since on
pseries it would mean another H-call.

So I'd suggest you actually create a separate callback, something like
tce_get_and_set()...

In fact, since you're going to touch each iommu interface like that you
should probably do what I suggested earlier and move those callbacks
to an iommu_ops structure per iommu.

> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Gavin Shan 
> ---
>  arch/powerpc/include/asm/machdep.h |  2 ++
>  arch/powerpc/kernel/iommu.c|  8 +---
>  arch/powerpc/platforms/powernv/pci.c   | 13 -
>  arch/powerpc/platforms/pseries/iommu.c |  7 +--
>  arch/powerpc/sysdev/dart_iommu.c   |  1 +
>  5 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h 
> b/arch/powerpc/include/asm/machdep.h
> index f92b0b5..f11596c 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -69,6 +69,7 @@ struct machdep_calls {
>long index,
>long npages,
>unsigned long uaddr,
> +  unsigned long *old_tces,
>enum dma_data_direction direction,
>struct dma_attrs *attrs);
>   void(*tce_free)(struct iommu_table *tbl,
> @@ -83,6 +84,7 @@ struct machdep_calls {
>long index,
>long npages,
>unsigned long uaddr,
> +  long *old_tces,
>enum dma_data_direction direction,
>struct dma_attrs *attrs);
>   void(*tce_free_rm)(struct iommu_table *tbl,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5af2319..ccf7510 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -324,7 +324,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct 
> iommu_table *tbl,
>   /* Put the TCEs in the HW table */
>   build_fail = ppc_md.tce_build(tbl, entry, npages,
> (unsigned long)page &
> -   IOMMU_PAGE_MASK(tbl), direction, attrs);
> +   IOMMU_PAGE_MASK(tbl), NULL, direction,
> +   attrs);
>  
>   /* ppc_md.tce_build() only returns non-zero for transient errors.
>* Clean up the table bitmap in this case and return
> @@ -497,7 +498,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table 
> *tbl,
>   /* Insert into HW table */
>   build_fail = ppc_md.tce_build(tbl, entry, npages,
> vaddr & IOMMU_PAGE_MASK(tbl),
> -   direction, attrs);
> +   NULL, direction, attrs);
>   if(unlikely(build_fail))
>   goto failure;
>  
> @@ -1059,7 +1060,8 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned 
> long entry,
>   oldtce = ppc_md.tce_get(tbl, entry);
>   /* Add new entry if it is not busy */
>   if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
> + ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
> + direction, NULL);
>  
>   spin_unlock(&(pool->lock));
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index cc54e3b..4b764c2 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -572,7 +572,8 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, 
> __be64 *startp,
>  }
>  
>  static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> -  

Re: [PATCH v3 13/18] powerpc/iommu: Extend ppc_md.tce_build(_rm) to return old TCE values

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:48 +1000, Alexey Kardashevskiy wrote:
> The tce_build/tce_build_rm callbacks are used to implement H_PUT_TCE/etc
> hypercalls. The PAPR spec does not allow the hypercall to fail if the TCE is not empty.
> However we cannot just overwrite the existing TCE value with the new one
> as we still have to do page counting.
> 
> This adds an optional @old_tces return parameter. If it is not NULL,
> it must point to an array of @npages size where the callbacks will
> store old TCE values. Since tce_build receives virtual addresses,
> the old_tces array will contain virtual addresses as well.
> 
> As this patch is mechanical, no change in behaviour is expected.

You missed a few iommus in the conversion ... pasemi, cell, ...

Ben.

> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Gavin Shan 
> ---
>  arch/powerpc/include/asm/machdep.h |  2 ++
>  arch/powerpc/kernel/iommu.c|  8 +---
>  arch/powerpc/platforms/powernv/pci.c   | 13 -
>  arch/powerpc/platforms/pseries/iommu.c |  7 +--
>  arch/powerpc/sysdev/dart_iommu.c   |  1 +
>  5 files changed, 21 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/machdep.h 
> b/arch/powerpc/include/asm/machdep.h
> index f92b0b5..f11596c 100644
> --- a/arch/powerpc/include/asm/machdep.h
> +++ b/arch/powerpc/include/asm/machdep.h
> @@ -69,6 +69,7 @@ struct machdep_calls {
>long index,
>long npages,
>unsigned long uaddr,
> +  unsigned long *old_tces,
>enum dma_data_direction direction,
>struct dma_attrs *attrs);
>   void(*tce_free)(struct iommu_table *tbl,
> @@ -83,6 +84,7 @@ struct machdep_calls {
>long index,
>long npages,
>unsigned long uaddr,
> +  long *old_tces,
>enum dma_data_direction direction,
>struct dma_attrs *attrs);
>   void(*tce_free_rm)(struct iommu_table *tbl,
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index 5af2319..ccf7510 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -324,7 +324,8 @@ static dma_addr_t iommu_alloc(struct device *dev, struct 
> iommu_table *tbl,
>   /* Put the TCEs in the HW table */
>   build_fail = ppc_md.tce_build(tbl, entry, npages,
> (unsigned long)page &
> -   IOMMU_PAGE_MASK(tbl), direction, attrs);
> +   IOMMU_PAGE_MASK(tbl), NULL, direction,
> +   attrs);
>  
>   /* ppc_md.tce_build() only returns non-zero for transient errors.
>* Clean up the table bitmap in this case and return
> @@ -497,7 +498,7 @@ int iommu_map_sg(struct device *dev, struct iommu_table 
> *tbl,
>   /* Insert into HW table */
>   build_fail = ppc_md.tce_build(tbl, entry, npages,
> vaddr & IOMMU_PAGE_MASK(tbl),
> -   direction, attrs);
> +   NULL, direction, attrs);
>   if(unlikely(build_fail))
>   goto failure;
>  
> @@ -1059,7 +1060,8 @@ int iommu_tce_build(struct iommu_table *tbl, unsigned 
> long entry,
>   oldtce = ppc_md.tce_get(tbl, entry);
>   /* Add new entry if it is not busy */
>   if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> - ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, direction, NULL);
> + ret = ppc_md.tce_build(tbl, entry, 1, hwaddr, NULL,
> + direction, NULL);
>  
>   spin_unlock(&(pool->lock));
>  
> diff --git a/arch/powerpc/platforms/powernv/pci.c 
> b/arch/powerpc/platforms/powernv/pci.c
> index cc54e3b..4b764c2 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -572,7 +572,8 @@ static void pnv_tce_invalidate(struct iommu_table *tbl, 
> __be64 *startp,
>  }
>  
>  static int pnv_tce_build(struct iommu_table *tbl, long index, long npages,
> -  unsigned long uaddr, enum dma_data_direction direction,
> +  unsigned long uaddr, unsigned long *old_tces,
> +  enum dma_data_direction direction,
>struct dma_attrs *attrs, bool rm)
>  {
>   u64 proto_tce;
> @@ -597,12 +598,12 @@ static int pnv_tce_build(struct iommu_table *tbl, long 
> index, long npages,
>  }
>  
>  static int pnv_tce_build_vm(struct iommu_table *tbl, long index, long npages,
> - unsigned long uaddr,
> + unsigned long uad

Re: [PATCH v3 04/18] vfio: powerpc: Move locked_vm accounting to a helper

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Additional DMA windows support is coming and accounting will include them
> so let's move this code to a helper for reuse.

This code looks a LOT like the one you added in the previous patch
(ie. kvmppc_account_memlimit()), though in a different place. I
wonder if we should re-arrange things so that there is really
only one version of it...

> Signed-off-by: Alexey Kardashevskiy 
> ---
>  drivers/vfio/vfio_iommu_spapr_tce.c | 54 
> -
>  1 file changed, 36 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c 
> b/drivers/vfio/vfio_iommu_spapr_tce.c
> index a84788b..c8f7284 100644
> --- a/drivers/vfio/vfio_iommu_spapr_tce.c
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -47,10 +47,40 @@ struct tce_container {
>   bool enabled;
>  };
>  
> +/*
> + * Checks ulimit in order not to let the user space to pin all
> + * available memory for TCE tables.
> + */
> +static long tce_iommu_account_memlimit(struct iommu_table *tbl, bool inc)
> +{
> + unsigned long ret = 0, locked, lock_limit;
> + long npages;
> +
> + if (!current->mm)
> + return -ESRCH; /* process exited */
> +
> + npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> + if (!inc)
> + npages = -npages;
> +
> + down_write(&current->mm->mmap_sem);
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> + rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + } else {
> + current->mm->locked_vm += npages;
> + }
> + up_write(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
>  static int tce_iommu_enable(struct tce_container *container)
>  {
>   int ret = 0;
> - unsigned long locked, lock_limit, npages;
>   struct iommu_table *tbl = container->tbl;
>  
>   if (!container->tbl)
> @@ -80,20 +110,11 @@ static int tce_iommu_enable(struct tce_container 
> *container)
>* that would effectively kill the guest at random points, much better
>* enforcing the limit based on the max that the guest can map.
>*/
> - down_write(&current->mm->mmap_sem);
> - npages = (tbl->it_size << IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> - locked = current->mm->locked_vm + npages;
> - lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> - if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> - pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> - rlimit(RLIMIT_MEMLOCK));
> - ret = -ENOMEM;
> - } else {
> + ret = tce_iommu_account_memlimit(tbl, true);
> + if (ret)
> + return ret;
>  
> - current->mm->locked_vm += npages;
> - container->enabled = true;
> - }
> - up_write(&current->mm->mmap_sem);
> + container->enabled = true;
>  
>   return ret;
>  }
> @@ -108,10 +129,7 @@ static void tce_iommu_disable(struct tce_container 
> *container)
>   if (!container->tbl || !current->mm)
>   return;
>  
> - down_write(&current->mm->mmap_sem);
> - current->mm->locked_vm -= (container->tbl->it_size <<
> - IOMMU_PAGE_SHIFT_4K) >> PAGE_SHIFT;
> - up_write(&current->mm->mmap_sem);
> + tce_iommu_account_memlimit(container->tbl, false);
>  }
>  
>  static void *tce_iommu_open(unsigned long arg)



Re: [PATCH v3 03/18] KVM: PPC: Account TCE pages in locked_vm

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy 
> ---

You need a description.

>  arch/powerpc/kvm/book3s_64_vio.c | 35 ++-
>  1 file changed, 34 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 516f2ee..48b7ed4 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -45,18 +45,48 @@ static long kvmppc_stt_npages(unsigned long window_size)
>* sizeof(u64), PAGE_SIZE) / PAGE_SIZE;
>  }
>  
> +/*
> + * Checks ulimit in order not to let the user space to pin all
> + * available memory for TCE tables.
> + */
> +static long kvmppc_account_memlimit(long npages)
> +{
> + unsigned long ret = 0, locked, lock_limit;
> +
> + if (!current->mm)
> + return -ESRCH; /* process exited */
> +
> + down_write(&current->mm->mmap_sem);
> + locked = current->mm->locked_vm + npages;
> + lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> + if (locked > lock_limit && !capable(CAP_IPC_LOCK)) {
> + pr_warn("RLIMIT_MEMLOCK (%ld) exceeded\n",
> + rlimit(RLIMIT_MEMLOCK));
> + ret = -ENOMEM;
> + } else {
> + current->mm->locked_vm += npages;
> + }
> + up_write(&current->mm->mmap_sem);
> +
> + return ret;
> +}
> +
>  static void release_spapr_tce_table(struct kvmppc_spapr_tce_table *stt)
>  {
>   struct kvm *kvm = stt->kvm;
>   int i;
> + long npages = kvmppc_stt_npages(stt->window_size);
>  
>   mutex_lock(&kvm->lock);
>   list_del(&stt->list);
> - for (i = 0; i < kvmppc_stt_npages(stt->window_size); i++)
> + for (i = 0; i < npages; i++)
>   __free_page(stt->pages[i]);
> +
>   kfree(stt);
>   mutex_unlock(&kvm->lock);
>  
> + kvmppc_account_memlimit(-(npages + 1));
> +
>   kvm_put_kvm(kvm);
>  }
>  
> @@ -112,6 +142,9 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   }
>  
>   npages = kvmppc_stt_npages(args->window_size);
> + ret = kvmppc_account_memlimit(npages + 1);
> + if (ret)
> + goto fail;

Is this called for VFIO only, or is it also called when creating TCE
tables for emulated devices? Because in the latter case, you don't
want to account the pages as locked, do you?

Also, you need to explain what the +1 is for.

Finally, do I correctly deduce that creating 10 TCE tables of 2G
each will end up accounting 20G as locked even if the guest for
example only has 4G of RAM?

>   stt = kzalloc(sizeof(*stt) + npages * sizeof(struct page *),
> GFP_KERNEL);

Ben.



Re: [PATCH v3 02/18] KVM: PPC: Use RCU when adding to arch.spapr_tce_tables

2014-07-27 Thread Benjamin Herrenschmidt
On Thu, 2014-07-24 at 18:47 +1000, Alexey Kardashevskiy wrote:
> Signed-off-by: Alexey Kardashevskiy 
> Reviewed-by: Gavin Shan 

You need a much better changeset here, explaining what you are
protecting against and why just changing that one list_add is
sufficient.

Cheers,
Ben.

> ---
>  arch/powerpc/kvm/book3s_64_vio.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_vio.c 
> b/arch/powerpc/kvm/book3s_64_vio.c
> index 54cf9bc..516f2ee 100644
> --- a/arch/powerpc/kvm/book3s_64_vio.c
> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> @@ -131,7 +131,7 @@ long kvm_vm_ioctl_create_spapr_tce(struct kvm *kvm,
>   kvm_get_kvm(kvm);
>  
>   mutex_lock(&kvm->lock);
> - list_add(&stt->list, &kvm->arch.spapr_tce_tables);
> + list_add_rcu(&stt->list, &kvm->arch.spapr_tce_tables);
>  
>   mutex_unlock(&kvm->lock);
>  

