Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alexey Kardashevskiy

On 13/12/12 13:29, Benjamin Herrenschmidt wrote:

On Wed, 2012-12-12 at 07:34 -0700, Alex Williamson wrote:

But what would I put there?... IOMMU ID is more than enough at the moment
and struct iommu_table does not have anything what would have made sense to
show in the sysfs...


I believe David mentioned that PEs had user visible names.  Perhaps they
match an enclosure location or something.  Group numbers are rather
arbitrary and really have no guarantee of persistence.  Thanks,


I agree. Make up something, for example domain[PE] or something like
that.


To be able to add a PE number, I need to call iommu_group_alloc() in a 
place where I already know this number, OR I have to carry it in iommu_table 
until the moment iommu_group_alloc() is called (acceptable but not cool).


I will post a patch which would help as a response to this mail.


--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alex Williamson
On Thu, 2012-12-13 at 13:57 +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2012-12-12 at 16:30 -0700, Alex Williamson wrote:
> > Locked page accounting in this version is very, very broken.  How do
> > powerpc folks feel about seemingly generic kernel iommu interfaces
> > messing with the current task mm?  Besides that, more problems
> > below...
> 
> After a second look & thought...
> 
> This whole accounting business is fucked. First, we simply can't just
> randomly return errors from H_PUT_TCE because the process reached some
> rlimit. This is not a proper failure mode. That means that the guest
> will probably panic() ... possibly right in the middle of some disk
> writeback or god knows what. Not good.
> 
> Also the overhead of doing all that crap on every TCE map/unmap is
> ridiculous.
> 
> Finally, it's just not going to work for real mode which we really want,
> since we can't take the mmap-sem in real mode anyway, so unless we
> convert that counter to an atomic, we can't do it.
> 
> I'd suggest just not bothering, or if you want to bother, check once
> when creating a TCE table that the rlimit is enough to bolt as many
> pages as can be populated in that table and fail to create *that*. The
> failure mode is much better, ie, qemu failing to create a PCI bus due to
> insufficient rlimits.

I agree, we don't seem to be headed in the right direction.  x86 needs
to track rlimits or else a user can exploit the interface to pin all the
memory in the system.  On power, only the iova window can be pinned, so
it's a fixed amount.  I could see it as granting access to a group
implicitly grants access to pinning the iova window.  We can still make
it more explicit by handling the rlimit accounting upfront.  Thanks,

Alex



Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Benjamin Herrenschmidt
On Wed, 2012-12-12 at 16:30 -0700, Alex Williamson wrote:
> Locked page accounting in this version is very, very broken.  How do
> powerpc folks feel about seemingly generic kernel iommu interfaces
> messing with the current task mm?  Besides that, more problems
> below...

After a second look & thought...

This whole accounting business is fucked. First, we simply can't just
randomly return errors from H_PUT_TCE because the process reached some
rlimit. This is not a proper failure mode. That means that the guest
will probably panic() ... possibly right in the middle of some disk
writeback or god knows what. Not good.

Also the overhead of doing all that crap on every TCE map/unmap is
ridiculous.

Finally, it's just not going to work for real mode which we really want,
since we can't take the mmap-sem in real mode anyway, so unless we
convert that counter to an atomic, we can't do it.

I'd suggest just not bothering, or if you want to bother, check once
when creating a TCE table that the rlimit is enough to bolt as many
pages as can be populated in that table and fail to create *that*. The
failure mode is much better, ie, qemu failing to create a PCI bus due to
insufficient rlimits.

Cheers,
Ben.




Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Benjamin Herrenschmidt
On Wed, 2012-12-12 at 16:30 -0700, Alex Williamson wrote:

> Locked page accounting in this version is very, very broken.  How do
> powerpc folks feel about seemingly generic kernel iommu interfaces
> messing with the current task mm?  Besides that, more problems below...

Not good at all :-)

I don't understand tho ... H_PUT_TCE calls should be in the qemu context
(or the guest) as current at the point of the call, so everything should
be accounted fine on the *current* task when those calls occur, what's
the point of the work queue Alexey ?

This code looks horribly complicated ... where does it come from ?

> > +/*
> > + * iommu_reset_table is called when it started/stopped being used.
> > + *
> > + * restore==true says to bring the iommu_table into the state as it was
> > + * before being used by VFIO.
> > + */
> > +void iommu_reset_table(struct iommu_table *tbl, bool restore)
> > +{
> > +   /* Page#0 is marked as used in iommu_init_table, so we clear it... */
> > +   if (!restore && (tbl->it_offset == 0))
> > +   clear_bit(0, tbl->it_map);
> > +
> > +   iommu_clear_tces(tbl, tbl->it_offset, tbl->it_size);
> 
> This does locked page accounting and unpins pages, even on startup when
> the pages aren't necessarily pinned or accounted against the current
> process.

Not sure what you mean Alex, and not sure either what Alexey's
implementation actually does, but indeed, pages inside an iommu table
that was used by the host don't have their refcount elevated by the fact
that they are there.

So when taking ownership of an iommu for vfio, you probably need to FAIL
if any page is already mapped. Only once you know the iommu is clear for
use, then you can start populating it and account for anything you put
in it (and de-account anything you remove from it when cleaning things
up).

> > +
> > +   /* ... or restore  */
> > +   if (restore && (tbl->it_offset == 0))
> > +   set_bit(0, tbl->it_map);
> > +}
> > +EXPORT_SYMBOL_GPL(iommu_reset_table);
> > +
> > +/*
> > + * Returns the number of used IOMMU pages (4K) within
> > + * the same system page (4K or 64K).
> > + *
> > + * syspage_weight_zero is optimized for expected case == 0
> > + * syspage_weight_one is optimized for expected case > 1
> > + * Other case are not used in this file.
> > + */
> > +#if PAGE_SIZE == IOMMU_PAGE_SIZE
> > +
> > +#define syspage_weight_zero(map, offset)   test_bit((map), (offset))
> > +#define syspage_weight_one(map, offset)    test_bit((map), (offset))
> > +
> > +#elif PAGE_SIZE/IOMMU_PAGE_SIZE == 16
> > +
> > +static int syspage_weight_zero(unsigned long *map, unsigned long offset)
> > +{
> > +   offset &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
> > +   return 0xffffUL & (map[BIT_WORD(offset)] >>
> > +   (offset & (BITS_PER_LONG-1)));
> > +}
> 
> I would have expected these to be bools and return true if the weight
> matches the value.

What is that business anyway ? It's very obscure.

> If you replaced 0xffff above w/ this, would you need the #error below?
> 
> (1UL << (PAGE_SIZE/IOMMU_PAGE_SIZE)) - 1)
> 
> > +
> > +static int syspage_weight_one(unsigned long *map, unsigned long offset)
> > +{
> > +   int ret = 0, nbits = PAGE_SIZE/IOMMU_PAGE_SIZE;
> > +
> > +   /* Aligns TCE entry number to system page boundary */
> > +   offset &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
> > +
> > +   /* Count used 4K pages */
> > +   while (nbits && (ret < 2)) {
> 
> Don't you have a ffs()?  Could also be used for _zero.  Surely there are
> some bitops helpers that could help here even on big endian.  hweight
> really doesn't work?
> 
> > +   if (test_bit(offset, map))
> > +   ++ret;
> > +
> > +   --nbits;
> > +   ++offset;
> > +   }
> > +
> > +   return ret;
> > +}
> > +#else
> > +#error TODO: support other page size
> > +#endif

What combinations do you support ?

> > +static void tce_flush(struct iommu_table *tbl)
> > +{
> > +   /* Flush/invalidate TLB caches if necessary */
> > +   if (ppc_md.tce_flush)
> > +   ppc_md.tce_flush(tbl);
> > +
> > +   /* Make sure updates are seen by hardware */
> > +   mb();
> > +}
> > +
> > +/*
> > + * iommu_clear_tces clears tces and returned the number of system pages
> > + * which it called put_page() on
> > + */
> > +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> > +   unsigned long pages)
> > +{
> > +   int i, retpages = 0, clr;
> > +   unsigned long oldtce, oldweight;
> > +   struct page *page;
> > +
> > +   for (i = 0; i < pages; ++i, ++entry) {
> > +   if (!test_bit(entry - tbl->it_offset, tbl->it_map))
> > +   continue;
> > +
> > +   oldtce = ppc_md.tce_get(tbl, entry);
> > +   ppc_md.tce_free(tbl, entry, 1);
> > +
> > +   oldweight = syspage_weight_one(tbl->it_map,
> > +   entry - tbl->it_offset);
> > +   clr = __test_and_clear_bit(entry - tbl->it_offset,
> > +   tbl->it_map);
> > +
> > +

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alexey Kardashevskiy

On 13/12/12 10:30, Alex Williamson wrote:

On Wed, 2012-12-12 at 23:34 +1100, Alexey Kardashevskiy wrote:

This patch initializes IOMMU groups based on the IOMMU
configuration discovered during the PCI scan on POWERNV
(POWER non virtualized) platform. The IOMMU groups are
to be used later by VFIO driver (PCI pass through).

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables.

To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson 
Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/include/asm/iommu.h |   10 ++
  arch/powerpc/kernel/iommu.c  |  329 ++
  arch/powerpc/platforms/powernv/pci.c |  134 ++
  drivers/iommu/Kconfig|8 +
  4 files changed, 481 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..3c861ae 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
struct iommu_pool large_pool;
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+   struct iommu_group *it_group;
+#endif
  };

  struct scatterlist;
@@ -147,5 +150,12 @@ static inline void iommu_restore(void)
  }
  #endif

+extern void iommu_reset_table(struct iommu_table *tbl, bool restore);
+extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long ioba,
+   unsigned long size);
+extern long iommu_put_tces(struct iommu_table *tbl, unsigned long ioba,
+   uint64_t tce, enum dma_data_direction direction,
+   unsigned long size);
+
  #endif /* __KERNEL__ */
  #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff5a6ce..f3bb2e7 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -36,6 +36,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -44,6 +45,7 @@
  #include 
  #include 
  #include 
+#include 

  #define DBG(...)

@@ -856,3 +858,330 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
free_pages((unsigned long)vaddr, get_order(size));
}
  }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+struct vwork {
+   struct mm_struct*mm;
+   longnpage;
+   struct work_struct  work;
+};
+
+/* delayed decrement/increment for locked_vm */
+static void lock_acct_bg(struct work_struct *work)
+{
+   struct vwork *vwork = container_of(work, struct vwork, work);
+   struct mm_struct *mm;
+
+   mm = vwork->mm;
+   down_write(&mm->mmap_sem);
+   mm->locked_vm += vwork->npage;
+   up_write(&mm->mmap_sem);
+   mmput(mm);
+   kfree(vwork);
+}
+
+static void lock_acct(long npage)
+{
+   struct vwork *vwork;
+   struct mm_struct *mm;
+
+   if (!current->mm)
+   return; /* process exited */
+
+   if (down_write_trylock(¤t->mm->mmap_sem)) {
+   current->mm->locked_vm += npage;
+   up_write(¤t->mm->mmap_sem);
+   return;
+   }
+
+   /*
+* Couldn't get mmap_sem lock, so must setup to update
+* mm->locked_vm later. If locked_vm were atomic, we
+* wouldn't need this silliness
+*/
+   vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
+   if (!vwork)
+   return;
+   mm = get_task_mm(current);
+   if (!mm) {
+   kfree(vwork);
+   return;
+   }
+   INIT_WORK(&vwork->work, lock_acct_bg);
+   vwork->mm = mm;
+   vwork->npage = npage;
+   schedule_work(&vwork->work);
+}


Locked page accounting in this version is very, very broken.  How do
powerpc folks feel about seemingly generic kernel iommu interfaces
messing with the current task mm?  Besides that, more problems below...


+
+/*
+ * iommu_reset_table is called when it started/stopped being used.
+ *
+ * restore==true says to bring the iommu_table into the state as it was
+ * before being used by VFIO.
+ */
+void iommu_reset_table(struct iommu_table *tbl, bool restore)
+{
+   /* Page#0 is marked as used in iommu_init_table, so we clear it... */
+   if (!restore && (tbl->it_offset == 0))
+   clear_bit(0, tbl->it_map);
+
+   iommu_clear_tces(tbl, tbl->it_offset, tbl->it_size);


This does locked page accounting and unpins pages, even on startup when
the pages aren't necessarily pinned or accounted against the current
process.

>

+
+   /* ... or restore  */
+   if (restore && (tbl->it_offset == 0))
+   set_bit(0, tbl->it_map);
+}

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Benjamin Herrenschmidt
On Wed, 2012-12-12 at 07:34 -0700, Alex Williamson wrote:
> > But what would I put there?... IOMMU ID is more than enough at the moment 
> > and struct iommu_table does not have anything what would have made sense to 
> > show in the sysfs...
> 
> I believe David mentioned that PEs had user visible names.  Perhaps they
> match an enclosure location or something.  Group numbers are rather
> arbitrary and really have no guarantee of persistence.  Thanks, 

I agree. Make up something, for example domain[PE] or something like
that.

Cheers,
Ben.




Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alex Williamson
On Wed, 2012-12-12 at 23:34 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |   10 ++
>  arch/powerpc/kernel/iommu.c  |  329 
> ++
>  arch/powerpc/platforms/powernv/pci.c |  134 ++
>  drivers/iommu/Kconfig|8 +
>  4 files changed, 481 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index cbfe678..3c861ae 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,12 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern void iommu_reset_table(struct iommu_table *tbl, bool restore);
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long ioba,
> + unsigned long size);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long ioba,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long size);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..f3bb2e7 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -36,6 +36,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -44,6 +45,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +858,330 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +
> +struct vwork {
> + struct mm_struct*mm;
> + longnpage;
> + struct work_struct  work;
> +};
> +
> +/* delayed decrement/increment for locked_vm */
> +static void lock_acct_bg(struct work_struct *work)
> +{
> + struct vwork *vwork = container_of(work, struct vwork, work);
> + struct mm_struct *mm;
> +
> + mm = vwork->mm;
> + down_write(&mm->mmap_sem);
> + mm->locked_vm += vwork->npage;
> + up_write(&mm->mmap_sem);
> + mmput(mm);
> + kfree(vwork);
> +}
> +
> +static void lock_acct(long npage)
> +{
> + struct vwork *vwork;
> + struct mm_struct *mm;
> +
> + if (!current->mm)
> + return; /* process exited */
> +
> + if (down_write_trylock(¤t->mm->mmap_sem)) {
> + current->mm->locked_vm += npage;
> + up_write(¤t->mm->mmap_sem);
> + return;
> + }
> +
> + /*
> +  * Couldn't get mmap_sem lock, so must setup to update
> +  * mm->locked_vm later. If locked_vm were atomic, we
> +  * wouldn't need this silliness
> +  */
> + vwork = kmalloc(sizeof(struct vwork), GFP_KERNEL);
> + if (!vwork)
> + return;
> + mm = get_task_mm(current);
> + if (!mm) {
> + kfree(vwork);
> + return;
> + }
> + INIT_WORK(&vwork->work, lock_acct_bg);
> + vwork->mm = mm;
> + vwork->npage = npage;
> + schedule_work(&vwork->work);
> +}

Locked page accounting in this version is very, very broken.  How do
powerpc folks feel about seemingly generic kernel iommu interfaces
messing with the current task mm?  Besides that, more problems below...

> +
> +/*
> + * iommu_reset_table is called when it started/stopped being used.
> + *
> + * restore==true says to bring the iommu_table into the state as it was
> + * before being used by VFIO.
> + */
> +void iommu_reset_table(struct iommu_table *tbl, bool restore)
> +{
> + /* Page#0 is marked as used in iommu_init_table, so we clear it... */
> + if (!restore && (tbl->it_offset == 0))
> + clear_bit(0, tbl->it_map);
> +
> + iommu_clear_tces(tbl, tbl->it_offset, tbl->it_size);

This does locked page accounting and unpins pages, even on startup when
the pages aren't necessarily pinned or accounted against the current
process.

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alex Williamson
On Wed, 2012-12-12 at 17:14 +1100, Alexey Kardashevskiy wrote:
> On 08/12/12 04:38, Alex Williamson wrote:
> >> +static int __init tce_iommu_init(void)
> >> +{
> >> +  struct pci_dev *pdev = NULL;
> >> +  struct iommu_table *tbl;
> >> +  struct iommu_group *grp;
> >> +
> >> +  /* Allocate and initialize IOMMU groups */
> >> +  for_each_pci_dev(pdev) {
> >> +  tbl = get_iommu_table_base(&pdev->dev);
> >> +  if (!tbl)
> >> +  continue;
> >> +
> >> +  /* Skip already initialized */
> >> +  if (tbl->it_group)
> >> +  continue;
> >> +
> >> +  grp = iommu_group_alloc();
> >> +  if (IS_ERR(grp)) {
> >> +  pr_info("tce_vfio: cannot create new IOMMU group, ret=%ld\n",
> >> +  PTR_ERR(grp));
> >> +  return PTR_ERR(grp);
> >> +  }
> >> +  tbl->it_group = grp;
> >> +  iommu_group_set_iommudata(grp, tbl, group_release);
> >
> > BTW, groups have a name property that shows up in sysfs that can be set
> > with iommu_group_set_name().  IIRC, this was a feature David requested
> > for PEs.  It'd be nice if it was used for PEs...  Thanks,
> 
> 
> 
> But what would I put there?... IOMMU ID is more than enough at the moment 
> and struct iommu_table does not have anything what would have made sense to 
> show in the sysfs...

I believe David mentioned that PEs had user visible names.  Perhaps they
match an enclosure location or something.  Group numbers are rather
arbitrary and really have no guarantee of persistence.  Thanks,

Alex




Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-12 Thread Alexey Kardashevskiy

Hi Alex,

I posted other pair of patches. While debugging and testing my stuff I 
implemented some rough hack to support IOMMU mappings without passing those 
hypercalls to the QEMU, this is why I moved pieces of code around - want to 
support both QEMU-VFIO and kernel optimized H_PUT_TCE handler.




On 12/12/12 23:34, Alexey Kardashevskiy wrote:

This patch initializes IOMMU groups based on the IOMMU
configuration discovered during the PCI scan on POWERNV
(POWER non virtualized) platform. The IOMMU groups are
to be used later by VFIO driver (PCI pass through).

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables.

To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson 
Signed-off-by: Alexey Kardashevskiy 
---



--
Alexey


Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-11 Thread Alexey Kardashevskiy

On 08/12/12 04:38, Alex Williamson wrote:

+static int __init tce_iommu_init(void)
+{
+   struct pci_dev *pdev = NULL;
+   struct iommu_table *tbl;
+   struct iommu_group *grp;
+
+   /* Allocate and initialize IOMMU groups */
+   for_each_pci_dev(pdev) {
+   tbl = get_iommu_table_base(&pdev->dev);
+   if (!tbl)
+   continue;
+
+   /* Skip already initialized */
+   if (tbl->it_group)
+   continue;
+
+   grp = iommu_group_alloc();
+   if (IS_ERR(grp)) {
+   pr_info("tce_vfio: cannot create new IOMMU group, ret=%ld\n",
+   PTR_ERR(grp));
+   return PTR_ERR(grp);
+   }
+   tbl->it_group = grp;
+   iommu_group_set_iommudata(grp, tbl, group_release);


BTW, groups have a name property that shows up in sysfs that can be set
with iommu_group_set_name().  IIRC, this was a feature David requested
for PEs.  It'd be nice if it was used for PEs...  Thanks,




But what would I put there?... IOMMU ID is more than enough at the moment 
and struct iommu_table does not have anything what would have made sense to 
show in the sysfs...



--
Alexey


Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-12-07 Thread Alex Williamson
On Fri, 2012-12-07 at 18:35 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |   10 ++
>  arch/powerpc/kernel/iommu.c  |  214 
> ++
>  arch/powerpc/platforms/powernv/pci.c |  134 +
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 366 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h 
> b/arch/powerpc/include/asm/iommu.h
> index cbfe678..be3b11b 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,12 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern void iommu_reset_table(struct iommu_table *tbl, bool release);
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..123431a 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,216 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +
> +/*
> + * iommu_reset_table is called when it started/stopped being used
> + */
> +void iommu_reset_table(struct iommu_table *tbl, bool release)
> +{
> + /*
> +  * Page at 0 is marked as used in iommu_init_table,
> +  * so here we clear it when called with release=false...
> +  */
> + if (!release && (tbl->it_offset == 0))
> + clear_bit(0, tbl->it_map);

Isn't this redundant to the memset below?

> +
> + iommu_clear_tces(tbl, tbl->it_offset, tbl->it_size);
> +
> + memset(tbl->it_map, 0, (tbl->it_size + 7) >> 3);
> +
> + /*
> +  * ... or restore when release=true
> +  */
> + if (release && (tbl->it_offset == 0))
> + set_bit(0, tbl->it_map);

"release" to me implies something is freed, maybe this should just be
called "restore".

> +}
> +EXPORT_SYMBOL_GPL(iommu_reset_table);
> +
> +/*
> + * Returns the number of used IOMMU pages (4K) within
> + * the same system page (4K or 64K).
> + * bitmap_weight is not used as it does not support bigendian maps.
> + * "offset" is an IOMMU page number relative to DMA window start.
> + */
> +static int syspage_weight(unsigned long *map, unsigned long offset)
> +{
> + int ret = 0, nbits = PAGE_SIZE/IOMMU_PAGE_SIZE;
> +
> + /* Aligns TCE entry number to system page boundary */
> + offset &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
> +
> + /* Count used 4K pages */
> + while (nbits) {
> + if (test_bit(offset, map))
> + ++ret;
> + --nbits;
> + ++offset;
> + }
> +
> + return ret;
> +}
> +
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returned the number of system pages
> + * which it called put_page() on
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, retpages = 0, clr;
> + unsigned long oldtce, oldweight;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {

Any reason not to increment "entry" and avoid the 5 cases of "entry + i"
below?

> + if (!test_bit(entr

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-30 Thread Alexey Kardashevskiy

On 01/12/12 03:48, Alex Williamson wrote:

On Fri, 2012-11-30 at 17:14 +1100, Alexey Kardashevskiy wrote:

This patch initializes IOMMU groups based on the IOMMU
configuration discovered during the PCI scan on POWERNV
(POWER non virtualized) platform. The IOMMU groups are
to be used later by VFIO driver (PCI pass through).

It also implements an API for mapping/unmapping pages for
guest PCI drivers and providing DMA window properties.
This API is going to be used later by QEMU-VFIO to handle
h_put_tce hypercalls from the KVM guest.

Although this driver has been tested only on the POWERNV
platform, it should work on any platform which supports
TCE tables.

To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
option and configure VFIO as required.

Cc: David Gibson 
Signed-off-by: Alexey Kardashevskiy 
---
  arch/powerpc/include/asm/iommu.h |9 ++
  arch/powerpc/kernel/iommu.c  |  186 ++
  arch/powerpc/platforms/powernv/pci.c |  135 
  drivers/iommu/Kconfig|8 ++
  4 files changed, 338 insertions(+)

diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
index cbfe678..5c7087a 100644
--- a/arch/powerpc/include/asm/iommu.h
+++ b/arch/powerpc/include/asm/iommu.h
@@ -76,6 +76,9 @@ struct iommu_table {
struct iommu_pool large_pool;
struct iommu_pool pools[IOMMU_NR_POOLS];
unsigned long *it_map;   /* A simple allocation bitmap for now */
+#ifdef CONFIG_IOMMU_API
+   struct iommu_group *it_group;
+#endif
  };

  struct scatterlist;
@@ -147,5 +150,11 @@ static inline void iommu_restore(void)
  }
  #endif

+extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
+   unsigned long pages);
+extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
+   uint64_t tce, enum dma_data_direction direction,
+   unsigned long pages);
+
  #endif /* __KERNEL__ */
  #endif /* _ASM_IOMMU_H */
diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ff5a6ce..0646c50 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -44,6 +44,7 @@
  #include 
  #include 
  #include 
+#include 

  #define DBG(...)

@@ -856,3 +857,188 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
free_pages((unsigned long)vaddr, get_order(size));
}
  }
+
+#ifdef CONFIG_IOMMU_API
+/*
+ * SPAPR TCE API
+ */
+
+/*
+ * Returns the number of used IOMMU pages (4K) within
+ * the same system page (4K or 64K).
+ * bitmap_weight is not used as it does not support bigendian maps.
+ */
+static int syspage_weight(unsigned long *map, unsigned long entry)
+{
+   int ret = 0, nbits = PAGE_SIZE/IOMMU_PAGE_SIZE;
+
+   /* Aligns TCE entry number to system page boundary */
+   entry &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
+
+   /* Count used 4K pages */
+   while (nbits--)
+   ret += (test_bit(entry++, map) == 0) ? 0 : 1;


Ok, entry is the iova page number.  So presumably it's relative to the
start of dma32_window_start since you're unlikely to have a bitmap that
covers all of memory.  I hadn't realized that previously.


No, it is zero based. The DMA window is a filter, not an offset. But you 
are right, the it_map does not cover the whole global table (roughly one 
per PHB); I will fix it, thanks for pointing that out. On my test system 
the IOMMU group is a whole PHB and the DMA window always starts from 0, 
so my tests do not show everything :)



Doesn't that
mean that it's actually impossible to create an ioctl based interface to
the dma64_window since we're not going to know which window is the
target?  I know you're not planning on one, but it seems limiting.


No, it is not limiting, as the iova is zero based. Even if it were, there are 
flags in the map/unmap ioctls which we could use, no?



We
at least need some documentation here, but I'm wondering if iova
shouldn't be zero based so we can determine which window it hits.  Also,
now that I look at it, I can't find any range checking on the iova.


True... I have not hit this problem yet :) Good point, will fix, thanks.



--
Alexey
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-30 Thread Alex Williamson
On Fri, 2012-11-30 at 17:14 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |9 ++
>  arch/powerpc/kernel/iommu.c  |  186 ++
>  arch/powerpc/platforms/powernv/pci.c |  135 
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 338 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5c7087a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,11 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..0646c50 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,188 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +
> +/*
> + * Returns the number of used IOMMU pages (4K) within
> + * the same system page (4K or 64K).
> + * bitmap_weight is not used as it does not support bigendian maps.
> + */
> +static int syspage_weight(unsigned long *map, unsigned long entry)
> +{
> + int ret = 0, nbits = PAGE_SIZE/IOMMU_PAGE_SIZE;
> +
> + /* Aligns TCE entry number to system page boundary */
> + entry &= PAGE_MASK >> IOMMU_PAGE_SHIFT;
> +
> + /* Count used 4K pages */
> + while (nbits--)
> + ret += (test_bit(entry++, map) == 0) ? 0 : 1;

Ok, entry is the iova page number.  So presumably it's relative to the
start of dma32_window_start since you're unlikely to have a bitmap that
covers all of memory.  I hadn't realized that previously.  Doesn't that
mean that it's actually impossible to create an ioctl based interface to
the dma64_window since we're not going to know which window is the
target?  I know you're not planning on one, but it seems limiting.  We
at least need some documentation here, but I'm wondering if iova
shouldn't be zero based so we can determine which window it hits.  Also,
now that I look at it, I can't find any range checking on the iova.
Thanks,

Alex

> +
> + return ret;
> +}
> +
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returns the number of system pages
> + * which it called put_page() on
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, retpages = 0;
> + unsigned long oldtce, oldweight;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {
> + oldtce = ppc_md.tce_get(tbl, entry + i);
> + ppc_md.tce_free(tbl, entry + i, 1);
> +
> + oldweight = syspage_weight(tbl->it_map, entry);
> + __clear_bit(entry, tbl->it_map);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + continue;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (!page)
> + continue;
> +
> + if (oldtce & TCE_PCI_WRITE)
> +   

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-29 Thread Alexey Kardashevskiy

On 29/11/12 15:20, Alex Williamson wrote:


+   /* Put tces to the table */
+   for (i = 0; (i < pages) && !ret; ++i, tce += IOMMU_PAGE_SIZE) {
+   ret = put_tce(tbl, entry + i, tce, direction);
+   /*
+* As the IOMMU page size is always 4K while the system page size
+* can be 64K, and there is no special tracking for IOMMU pages,
+* we only do the rlimit check/update for the very first
+* 4K IOMMU page within a 64K system page.
+*/
+   if (!(tce & ~PAGE_MASK))
+   ++retpages;


Ah, here's the comment I was looking for, though I'm still not sure
about the read/write bits.

Isn't there an exploit here that a user can lock pages beyond their
limits if they just skip mapping the first 4k of each page?  Thanks,



Heh. True. I have posted another patch which tracks 4K IOMMU page usage per system page.



--
Alexey


Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-28 Thread Alex Williamson
On Thu, 2012-11-29 at 14:53 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |9 ++
>  arch/powerpc/kernel/iommu.c  |  159 ++
>  arch/powerpc/platforms/powernv/pci.c |  135 +
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 311 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5c7087a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,11 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..1225fbb 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,161 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returns the number of pages
> + * which it called put_page() on.
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, retpages = 0;
> + unsigned long oldtce;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {
> + oldtce = ppc_md.tce_get(tbl, entry + i);
> + ppc_md.tce_free(tbl, entry + i, 1);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + continue;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (!page)
> + continue;
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + if (!(oldtce & ~PAGE_MASK))
> + ++retpages;

I'm confused, it looks like you're trying to only increment the counter
for tce pages aligned at the start of a page, but don't we need to mask
out the read/write and valid bits?  Trickiness like this demands a
comment.

> +
> + put_page(page);
> + }
> +
> + return retpages;
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returns the number of released pages
> + */
> +long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int ret;
> + struct iommu_pool *pool = get_pool(tbl, entry);
> +
> + spin_lock(&(pool->lock));
> + ret = clear_tces_nolock(tbl, entry, pages);
> + tce_flush(tbl);
> + spin_unlock(&(pool->lock));
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_clear_tces);
> +
> +static int put_tce(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction)
> +{
> + int ret;
> + struct page *page = NULL;
> + unsigned long kva, offset;
> +
> + /* Map new TCE */
> + offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> +
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);
> +

Re: [PATCH] vfio powerpc: enabled on powernv platform

2012-11-28 Thread Alex Williamson
On Wed, 2012-11-28 at 18:18 +1100, Alexey Kardashevskiy wrote:
> This patch initializes IOMMU groups based on the IOMMU
> configuration discovered during the PCI scan on POWERNV
> (POWER non virtualized) platform. The IOMMU groups are
> to be used later by VFIO driver (PCI pass through).
> 
> It also implements an API for mapping/unmapping pages for
> guest PCI drivers and providing DMA window properties.
> This API is going to be used later by QEMU-VFIO to handle
> h_put_tce hypercalls from the KVM guest.
> 
> Although this driver has been tested only on the POWERNV
> platform, it should work on any platform which supports
> TCE tables.
> 
> To enable VFIO on POWER, enable SPAPR_TCE_IOMMU config
> option and configure VFIO as required.
> 
> Cc: David Gibson 
> Signed-off-by: Alexey Kardashevskiy 
> ---
>  arch/powerpc/include/asm/iommu.h |9 +++
>  arch/powerpc/kernel/iommu.c  |  147 ++
>  arch/powerpc/platforms/powernv/pci.c |  135 +++
>  drivers/iommu/Kconfig|8 ++
>  4 files changed, 299 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index cbfe678..5c7087a 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -76,6 +76,9 @@ struct iommu_table {
>   struct iommu_pool large_pool;
>   struct iommu_pool pools[IOMMU_NR_POOLS];
>   unsigned long *it_map;   /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> + struct iommu_group *it_group;
> +#endif
>  };
>  
>  struct scatterlist;
> @@ -147,5 +150,11 @@ static inline void iommu_restore(void)
>  }
>  #endif
>  
> +extern long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages);
> +extern long iommu_put_tces(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction,
> + unsigned long pages);
> +
>  #endif /* __KERNEL__ */
>  #endif /* _ASM_IOMMU_H */
> diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> index ff5a6ce..1456b6e 100644
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -44,6 +44,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #define DBG(...)
>  
> @@ -856,3 +857,149 @@ void iommu_free_coherent(struct iommu_table *tbl, size_t size,
>   free_pages((unsigned long)vaddr, get_order(size));
>   }
>  }
> +
> +#ifdef CONFIG_IOMMU_API
> +/*
> + * SPAPR TCE API
> + */
> +static void tce_flush(struct iommu_table *tbl)
> +{
> + /* Flush/invalidate TLB caches if necessary */
> + if (ppc_md.tce_flush)
> + ppc_md.tce_flush(tbl);
> +
> + /* Make sure updates are seen by hardware */
> + mb();
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returns the number of pages
> + * which it called put_page() on.
> + */
> +static long clear_tces_nolock(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int i, pages_put = 0;
> + unsigned long oldtce;
> + struct page *page;
> +
> + for (i = 0; i < pages; ++i) {
> + oldtce = ppc_md.tce_get(tbl, entry + i);
> + ppc_md.tce_free(tbl, entry + i, 1);
> +
> + if (!(oldtce & (TCE_PCI_WRITE | TCE_PCI_READ)))
> + continue;
> +
> + page = pfn_to_page(oldtce >> PAGE_SHIFT);
> +
> + WARN_ON(!page);
> + if (!page)
> + continue;
> +
> + if (oldtce & TCE_PCI_WRITE)
> + SetPageDirty(page);
> +
> + ++pages_put;
> + put_page(page);
> + }
> +
> + return pages_put;
> +}
> +
> +/*
> + * iommu_clear_tces clears tces and returns the number of released pages
> + */
> +long iommu_clear_tces(struct iommu_table *tbl, unsigned long entry,
> + unsigned long pages)
> +{
> + int ret;
> + struct iommu_pool *pool = get_pool(tbl, entry);
> +
> + spin_lock(&(pool->lock));
> + ret = clear_tces_nolock(tbl, entry, pages);
> + tce_flush(tbl);
> + spin_unlock(&(pool->lock));
> +
> + return ret;
> +}
> +EXPORT_SYMBOL_GPL(iommu_clear_tces);
> +
> +static int put_tce(struct iommu_table *tbl, unsigned long entry,
> + uint64_t tce, enum dma_data_direction direction)
> +{
> + int ret;
> + struct page *page = NULL;
> + unsigned long kva, offset;
> +
> + /* Map new TCE */
> + offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> +
> + ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> + direction != DMA_TO_DEVICE, &page);
> + if (ret < 1) {
> + printk(KERN_ERR "tce_vfio: get_user_pages_fast failed tce=%llx ioba=%lx ret=%d\n",
> + tce, entry << IOMMU_PAGE_SHIFT, ret);
> + if (!ret)
> + ret = -EFAULT;
> + ret