Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-10-01 Thread Jason Gunthorpe
On Wed, Sep 29, 2021 at 12:17:35AM +0300, Oded Gabbay wrote:
> On Tue, Sep 28, 2021 at 8:36 PM Jason Gunthorpe  wrote:
> >
> > On Sun, Sep 12, 2021 at 07:53:09PM +0300, Oded Gabbay wrote:
> > > From: Tomer Tayar 
> > >
> > > Implement the calls to the dma-buf kernel api to create a dma-buf
> > > object backed by FD.
> > >
> > > We block the option to mmap the DMA-BUF object because we don't support
> > > DIRECT_IO and implicit P2P.
> >
> > This statement doesn't make sense, you can mmap your dmabuf if you
> > like. All dmabuf mmaps are supposed to set the special bit/etc to
> > exclude them from get_user_pages() anyhow - and since this is BAR
> > memory not struct page memory this driver would be doing it anyhow.
> >
> But we block mmap of the dmabuf fd from user-space.
> If you try to do it, you will get MAP_FAILED.

You do, I'm saying the above paragraph explaining *why* that was done
is not correct.

> > > We check the p2p distance using pci_p2pdma_distance_many() and refuse
> > > to map the dmabuf in case the distance doesn't allow p2p.
> >
> > Does this actually allow the p2p transfer for your intended use cases?
>
> It depends on the system. If we are working bare-metal, then yes, it allows.
> If inside a VM, then no. The virtualized root complex is not
> white-listed and the kernel can't know the distance.
> But I remember you asked me to add this check, in v3 of the review IIRC.
> I don't mind removing this check if you don't object.

Yes, it is the right code. I was curious how far along things have
gotten.

> > Don't write to the kernel log from user space triggered actions
> at all ?

At all.

> It's the first time I hear about this limitation...

Oh? It is a security issue; we don't want to allow userspace to DoS
the kernel logging.

> How do you tell the user it has done something wrong ?

dev_dbg is the usual way and then users doing debugging can opt in to
the logging.
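The opt-in pattern Jason describes can be modeled in plain userspace C (the names below are illustrative stand-ins, not the driver's actual helpers): the caller still learns of the failure through the errno-style return code, while the log line compiles away unless debugging is enabled, so a misbehaving userspace cannot flood the log.

```c
#include <stdio.h>

/* Stand-in for the kernel's dev_dbg(): compiled out unless DEBUG is
 * defined, so userspace-triggered paths cannot spam the log. */
#ifdef DEBUG
#define dev_dbg(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
#else
#define dev_dbg(fmt, ...) do { } while (0)
#endif

#define EINVAL_ERR 22 /* models the kernel's -EINVAL */

/* Hypothetical model of the exported-handle check discussed in this
 * thread: the error is reported via the return value; the message
 * only appears when debugging is opted in at build time. */
static int free_exported_handle(unsigned int handle,
				unsigned int exporting_cnt)
{
	if (exporting_cnt) {
		dev_dbg("handle %u is exported, cannot free\n", handle);
		return -EINVAL_ERR;
	}
	return 0;
}
```

With `DEBUG` undefined the call site costs nothing and prints nothing, yet the user-visible contract (the return code) is unchanged.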


> > Why doesn't this return a sg_table * and an ERR_PTR?
> Basically I modeled this function after amdgpu_vram_mgr_alloc_sgt()
> And in that function they also return int and pass the sg_table as **
> 
> If it's critical I can change.

Please follow the normal kernel style

Jason


Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-10-01 Thread Jason Gunthorpe
On Thu, Sep 30, 2021 at 03:46:35PM +0300, Oded Gabbay wrote:

> After reading the kernel iommu code, I think this is not relevant
> here, and I'll add a comment appropriately but I'll also write it
> here, and please correct me if my understanding is wrong.
> 
> The memory behind this specific dma-buf has *always* resided on the
> device itself, i.e. it lives only in the 'device' domain (after all,
> it maps a PCI bar address which points to the device memory).
> Therefore, it was never in the 'CPU' domain and hence, there is no
> need to perform a sync of the memory to the CPU's cache, as it was
> never inside that cache to begin with.
> 
> This is not the same case as with regular memory which is dma-mapped
> and then copied into the device using a dma engine. In that case,
> the memory started in the 'CPU' domain and moved to the 'device'
> domain. When it is unmapped it will indeed be recycled to be used
> for another purpose and therefore we need to sync the CPU cache.
> 
> Is my understanding correct ?

It makes sense to me

Jason


Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-09-30 Thread Oded Gabbay
On Wed, Sep 29, 2021 at 12:17 AM Oded Gabbay  wrote:
>
> On Tue, Sep 28, 2021 at 8:36 PM Jason Gunthorpe  wrote:
> >
> > On Sun, Sep 12, 2021 at 07:53:09PM +0300, Oded Gabbay wrote:
> > > From: Tomer Tayar 
> > >
> > > Implement the calls to the dma-buf kernel api to create a dma-buf
> > > object backed by FD.
> > >
> > > We block the option to mmap the DMA-BUF object because we don't support
> > > DIRECT_IO and implicit P2P.
> >
> > This statement doesn't make sense, you can mmap your dmabuf if you
> > like. All dmabuf mmaps are supposed to set the special bit/etc to
> > exclude them from get_user_pages() anyhow - and since this is BAR
> > memory not struct page memory this driver would be doing it anyhow.
> >
> But we block mmap of the dmabuf fd from user-space.
> If you try to do it, you will get MAP_FAILED.
> That's because we don't supply a function to the mmap callback in dmabuf.
> We did that per Christian's advice. It is in one of the long email
> threads on previous versions of this patch.
>
>
> > > We check the p2p distance using pci_p2pdma_distance_many() and refuse
> > > to map the dmabuf in case the distance doesn't allow p2p.
> >
> > Does this actually allow the p2p transfer for your intended use cases?
> >
> It depends on the system. If we are working bare-metal, then yes, it allows.
> If inside a VM, then no. The virtualized root complex is not
> white-listed and the kernel can't know the distance.
> But I remember you asked me to add this check, in v3 of the review IIRC.
> I don't mind removing this check if you don't object.
>
> > > diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
> > > index 33986933aa9e..8cf5437c0390 100644
> > > +++ b/drivers/misc/habanalabs/common/memory.c
> > > @@ -1,7 +1,7 @@
> > >  // SPDX-License-Identifier: GPL-2.0
> > >
> > >  /*
> > > - * Copyright 2016-2019 HabanaLabs, Ltd.
> > > + * Copyright 2016-2021 HabanaLabs, Ltd.
> > >   * All Rights Reserved.
> > >   */
> > >
> > > @@ -11,11 +11,13 @@
> > >
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #define HL_MMU_DEBUG 0
> > >
> > >  /* use small pages for supporting non-pow2 (32M/40M/48M) DRAM phys page sizes */
> > > -#define DRAM_POOL_PAGE_SIZE SZ_8M
> > > +#define DRAM_POOL_PAGE_SIZE  SZ_8M
> > > +
> >
> > ??
> ok, I'll remove
> >
> > >  /*
> > >   * The va ranges in context object contain a list with the available chunks of
> > > @@ -347,6 +349,13 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
> > >   return -EINVAL;
> > >   }
> > >
> > > + if (phys_pg_pack->exporting_cnt) {
> > > + dev_err(hdev->dev,
> > > + "handle %u is exported, cannot free\n", handle);
> > > + spin_unlock(&vm->idr_lock);
> >
> > Don't write to the kernel log from user space triggered actions
> at all ?
> It's the first time I hear about this limitation...
> How do you tell the user it has done something wrong ?
> I agree it might be better to rate limit it, but why not give the
> information to the user ?
>
> >
> > > +static int alloc_sgt_from_device_pages(struct hl_device *hdev,
> > > + struct sg_table **sgt, u64 *pages,
> > > + u64 npages, u64 page_size,
> > > + struct device *dev,
> > > + enum dma_data_direction dir)
> >
> > Why doesn't this return a sg_table * and an ERR_PTR?
> Basically I modeled this function after amdgpu_vram_mgr_alloc_sgt()
> And in that function they also return int and pass the sg_table as **
>
> If it's critical I can change.
>
> >
> > > +{
> > > + u64 chunk_size, bar_address, dma_max_seg_size;
> > > + struct asic_fixed_properties *prop;
> > > + int rc, i, j, nents, cur_page;
> > > + struct scatterlist *sg;
> > > +
> > > + prop = &hdev->asic_prop;
> > > +
> > > + dma_max_seg_size = dma_get_max_seg_size(dev);
> >
> > > +
> > > + /* We would like to align the max segment size to PAGE_SIZE, so the
> > > +  * SGL will contain aligned addresses that can be easily mapped to
> > > +  * an MMU
> > > +  */
> > > + dma_max_seg_size = ALIGN_DOWN(dma_max_seg_size, PAGE_SIZE);
> > > + if (dma_max_seg_size < PAGE_SIZE) {
> > > + dev_err_ratelimited(hdev->dev,
> > > + "dma_max_seg_size %llu can't be smaller than PAGE_SIZE\n",
> > > + dma_max_seg_size);
> > > + return -EINVAL;
> > > + }
> > > +
> > > + *sgt = kzalloc(sizeof(**sgt), GFP_KERNEL);
> > > + if (!*sgt)
> > > + return -ENOMEM;
> > > +
> > > + /* If the size of each page is larger than the dma max segment size,
> > > +  * then we can't combine pages and the number of entries in the SGL
> > > +  * will just be the
> > >

Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-09-28 Thread Oded Gabbay
On Tue, Sep 28, 2021 at 8:36 PM Jason Gunthorpe  wrote:
>
> On Sun, Sep 12, 2021 at 07:53:09PM +0300, Oded Gabbay wrote:
> > From: Tomer Tayar 
> >
> > Implement the calls to the dma-buf kernel api to create a dma-buf
> > object backed by FD.
> >
> > We block the option to mmap the DMA-BUF object because we don't support
> > DIRECT_IO and implicit P2P.
>
> This statement doesn't make sense, you can mmap your dmabuf if you
> like. All dmabuf mmaps are supposed to set the special bit/etc to
> exclude them from get_user_pages() anyhow - and since this is BAR
> memory not struct page memory this driver would be doing it anyhow.
>
But we block mmap of the dmabuf fd from user-space.
If you try to do it, you will get MAP_FAILED.
That's because we don't supply a function for the mmap callback in dmabuf.
We did that per Christian's advice. It is in one of the long email
threads on previous versions of this patch.
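Mechanically, leaving the mmap callback unset means the kernel rejects the mapping before the driver is ever called. The same behavior can be observed from userspace with any fd whose file_operations lack `.mmap`; the sketch below (an illustration assuming Linux semantics, using a pipe fd as a stand-in for the dmabuf fd) shows mmap() returning MAP_FAILED:

```c
#include <errno.h>
#include <sys/mman.h>
#include <unistd.h>

/* Try to mmap an fd that has no .mmap file operation. The kernel
 * fails such mappings (with -ENODEV on Linux), which is exactly what
 * a dmabuf exporter that leaves the mmap callback unset relies on.
 * Returns 1 if mmap() failed as expected, 0 otherwise. */
static int mmap_fails_without_mmap_op(void)
{
	int fds[2];
	void *p;
	int failed;

	if (pipe(fds))
		return -1;

	p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fds[0], 0);
	failed = (p == MAP_FAILED);
	if (!failed)
		munmap(p, 4096);

	close(fds[0]);
	close(fds[1]);
	return failed;
}
```

The pipe fd here is only a convenient example of the general rule; the dmabuf fd exported by the driver hits the same check in the mm code.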


> > We check the p2p distance using pci_p2pdma_distance_many() and refuse
> > to map the dmabuf in case the distance doesn't allow p2p.
>
> Does this actually allow the p2p transfer for your intended use cases?
>
It depends on the system. If we are running bare-metal, then yes, it is allowed.
If inside a VM, then no. The virtualized root complex is not
white-listed and the kernel can't know the distance.
But I remember you asked me to add this check, in v3 of the review IIRC.
I don't mind removing this check if you don't object.

> > diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
> > index 33986933aa9e..8cf5437c0390 100644
> > +++ b/drivers/misc/habanalabs/common/memory.c
> > @@ -1,7 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0
> >
> >  /*
> > - * Copyright 2016-2019 HabanaLabs, Ltd.
> > + * Copyright 2016-2021 HabanaLabs, Ltd.
> >   * All Rights Reserved.
> >   */
> >
> > @@ -11,11 +11,13 @@
> >
> >  #include 
> >  #include 
> > +#include 
> >
> >  #define HL_MMU_DEBUG 0
> >
> >  /* use small pages for supporting non-pow2 (32M/40M/48M) DRAM phys page sizes */
> > -#define DRAM_POOL_PAGE_SIZE SZ_8M
> > +#define DRAM_POOL_PAGE_SIZE  SZ_8M
> > +
>
> ??
ok, I'll remove
>
> >  /*
> >   * The va ranges in context object contain a list with the available chunks of
> > @@ -347,6 +349,13 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
> >   return -EINVAL;
> >   }
> >
> > + if (phys_pg_pack->exporting_cnt) {
> > + dev_err(hdev->dev,
> > + "handle %u is exported, cannot free\n", handle);
> > + spin_unlock(&vm->idr_lock);
>
> Don't write to the kernel log from user space triggered actions
at all ?
This is the first time I've heard about this limitation...
How do you tell the user it has done something wrong ?
I agree it might be better to rate limit it, but why not give the
information to the user ?

>
> > +static int alloc_sgt_from_device_pages(struct hl_device *hdev,
> > + struct sg_table **sgt, u64 *pages,
> > + u64 npages, u64 page_size,
> > + struct device *dev,
> > + enum dma_data_direction dir)
>
> Why doesn't this return a sg_table * and an ERR_PTR?
Basically I modeled this function after amdgpu_vram_mgr_alloc_sgt(),
and in that function they also return int and pass the sg_table as **.

If it's critical I can change.

>
> > +{
> > + u64 chunk_size, bar_address, dma_max_seg_size;
> > + struct asic_fixed_properties *prop;
> > + int rc, i, j, nents, cur_page;
> > + struct scatterlist *sg;
> > +
> > + prop = &hdev->asic_prop;
> > +
> > + dma_max_seg_size = dma_get_max_seg_size(dev);
>
> > +
> > + /* We would like to align the max segment size to PAGE_SIZE, so the
> > +  * SGL will contain aligned addresses that can be easily mapped to
> > +  * an MMU
> > +  */
> > + dma_max_seg_size = ALIGN_DOWN(dma_max_seg_size, PAGE_SIZE);
> > + if (dma_max_seg_size < PAGE_SIZE) {
> > + dev_err_ratelimited(hdev->dev,
> > + "dma_max_seg_size %llu can't be smaller than PAGE_SIZE\n",
> > + dma_max_seg_size);
> > + return -EINVAL;
> > + }
> > +
> > + *sgt = kzalloc(sizeof(**sgt), GFP_KERNEL);
> > + if (!*sgt)
> > + return -ENOMEM;
> > +
> > + /* If the size of each page is larger than the dma max segment size,
> > +  * then we can't combine pages and the number of entries in the SGL
> > +  * will just be the
> > +  *  * 
> > +  */
> > + if (page_size > dma_max_seg_size)
> > + nents = npages * DIV_ROUND_UP_ULL(page_size, dma_max_seg_size);
> > + else
> > + /* Get number of non-contiguous chunks */
> > + for (i = 1, nents = 1, chunk_size = page_size ; i < npages 

Re: [PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-09-28 Thread Jason Gunthorpe
On Sun, Sep 12, 2021 at 07:53:09PM +0300, Oded Gabbay wrote:
> From: Tomer Tayar 
> 
> Implement the calls to the dma-buf kernel api to create a dma-buf
> object backed by FD.
> 
> We block the option to mmap the DMA-BUF object because we don't support
> DIRECT_IO and implicit P2P. 

This statement doesn't make sense, you can mmap your dmabuf if you
like. All dmabuf mmaps are supposed to set the special bit/etc to
exclude them from get_user_pages() anyhow - and since this is BAR
memory not struct page memory this driver would be doing it anyhow.

> We check the p2p distance using pci_p2pdma_distance_many() and refuse
> to map the dmabuf in case the distance doesn't allow p2p.

Does this actually allow the p2p transfer for your intended use cases?

> diff --git a/drivers/misc/habanalabs/common/memory.c b/drivers/misc/habanalabs/common/memory.c
> index 33986933aa9e..8cf5437c0390 100644
> +++ b/drivers/misc/habanalabs/common/memory.c
> @@ -1,7 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0
>  
>  /*
> - * Copyright 2016-2019 HabanaLabs, Ltd.
> + * Copyright 2016-2021 HabanaLabs, Ltd.
>   * All Rights Reserved.
>   */
>  
> @@ -11,11 +11,13 @@
>  
>  #include 
>  #include 
> +#include 
>  
>  #define HL_MMU_DEBUG 0
>  
>  /* use small pages for supporting non-pow2 (32M/40M/48M) DRAM phys page sizes */
> -#define DRAM_POOL_PAGE_SIZE SZ_8M
> +#define DRAM_POOL_PAGE_SIZE  SZ_8M
> +

??

>  /*
>   * The va ranges in context object contain a list with the available chunks of
> @@ -347,6 +349,13 @@ static int free_device_memory(struct hl_ctx *ctx, struct hl_mem_in *args)
>   return -EINVAL;
>   }
>  
> + if (phys_pg_pack->exporting_cnt) {
> + dev_err(hdev->dev,
> + "handle %u is exported, cannot free\n", handle);
> + spin_unlock(&vm->idr_lock);

Don't write to the kernel log from user space triggered actions

> +static int alloc_sgt_from_device_pages(struct hl_device *hdev,
> + struct sg_table **sgt, u64 *pages,
> + u64 npages, u64 page_size,
> + struct device *dev,
> + enum dma_data_direction dir)

Why doesn't this return a sg_table * and an ERR_PTR?

> +{
> + u64 chunk_size, bar_address, dma_max_seg_size;
> + struct asic_fixed_properties *prop;
> + int rc, i, j, nents, cur_page;
> + struct scatterlist *sg;
> +
> + prop = &hdev->asic_prop;
> +
> + dma_max_seg_size = dma_get_max_seg_size(dev);

> +
> + /* We would like to align the max segment size to PAGE_SIZE, so the
> +  * SGL will contain aligned addresses that can be easily mapped to
> +  * an MMU
> +  */
> + dma_max_seg_size = ALIGN_DOWN(dma_max_seg_size, PAGE_SIZE);
> + if (dma_max_seg_size < PAGE_SIZE) {
> + dev_err_ratelimited(hdev->dev,
> + "dma_max_seg_size %llu can't be smaller than PAGE_SIZE\n",
> + dma_max_seg_size);
> + return -EINVAL;
> + }
> +
> + *sgt = kzalloc(sizeof(**sgt), GFP_KERNEL);
> + if (!*sgt)
> + return -ENOMEM;
> +
> + /* If the size of each page is larger than the dma max segment size,
> +  * then we can't combine pages and the number of entries in the SGL
> +  * will just be the
> +  *  * 
> +  */
> + if (page_size > dma_max_seg_size)
> + nents = npages * DIV_ROUND_UP_ULL(page_size, dma_max_seg_size);
> + else
> + /* Get number of non-contiguous chunks */
> + for (i = 1, nents = 1, chunk_size = page_size ; i < npages ; i++) {
> + if (pages[i - 1] + page_size != pages[i] ||
> + chunk_size + page_size > dma_max_seg_size) {
> + nents++;
> + chunk_size = page_size;
> + continue;
> + }
> +
> + chunk_size += page_size;
> + }
> +
> + rc = sg_alloc_table(*sgt, nents, GFP_KERNEL | __GFP_ZERO);
> + if (rc)
> + goto error_free;
> +
> + /* Because we are not going to include a CPU list we want to have some
> +  * chance that other users will detect this by setting the orig_nents
> +  * to 0 and using only nents (length of DMA list) when going over the
> +  * sgl
> +  */
> + (*sgt)->orig_nents = 0;

Maybe do this at the end so you'd have to undo it on the error path?

> + cur_page = 0;
> +
> + if (page_size > dma_max_seg_size) {
> + u64 size_left, cur_device_address = 0;
> +
> + size_left = page_size;
> +
> + /* Need to split each page into the number of chunks of
> +  * dma_max_seg_size
> +  */
> + for_each_sgtable_dma_sg((*sgt), sg, i) {
> + 
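The entry-count logic quoted above can be lifted into a self-contained userspace sketch (illustrative names, not the driver's actual helpers; the driver first aligns dma_max_seg_size down to PAGE_SIZE, which the sketch assumes has already been done): physically contiguous device pages are merged into chunks capped at the DMA max segment size, and each page larger than the max segment is split into several entries.

```c
#include <stdint.h>

/* Userspace model of the SGL entry count computed in the patch:
 * merge physically contiguous pages into chunks no larger than
 * dma_max_seg_size; split each page that is larger than it.
 * Assumes npages >= 1, as the driver guarantees. */
static int count_sgl_entries(const uint64_t *pages, uint64_t npages,
			     uint64_t page_size, uint64_t dma_max_seg_size)
{
	uint64_t chunk_size, i;
	int nents;

	/* Each over-sized page contributes ceil(page_size / max_seg)
	 * entries, mirroring DIV_ROUND_UP_ULL() in the patch. */
	if (page_size > dma_max_seg_size)
		return (int)(npages *
			((page_size + dma_max_seg_size - 1) /
			 dma_max_seg_size));

	/* Otherwise count the non-contiguous (or size-capped) chunks. */
	for (i = 1, nents = 1, chunk_size = page_size; i < npages; i++) {
		if (pages[i - 1] + page_size != pages[i] ||
		    chunk_size + page_size > dma_max_seg_size) {
			nents++;
			chunk_size = page_size;
			continue;
		}
		chunk_size += page_size;
	}
	return nents;
}
```

For example, three contiguous 4 KiB pages under a large segment limit collapse to a single entry, while a gap between pages or hitting the segment cap starts a new chunk.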

[PATCH v6 2/2] habanalabs: add support for dma-buf exporter

2021-09-12 Thread Oded Gabbay
From: Tomer Tayar 

Implement the calls to the dma-buf kernel API to create a dma-buf
object backed by an FD.

We block the option to mmap the DMA-BUF object because we don't support
DIRECT_IO and implicit P2P. We only implement support for explicit P2P
through importing the FD of the DMA-BUF.

In the export phase, we provide to the DMA-BUF object an array of pages
that represent the device's memory area. During the map callback,
we convert the array of pages into an SGT. We split/merge the pages
according to the dma max segment size of the importer.

To get the DMA address of the PCI bar, we use the dma_map_resources()
kernel API, because our device memory is not backed by page struct
and this API doesn't need page struct to map the physical address to
a DMA address.

We set the orig_nents member of the SGT to be 0, to indicate to other
drivers that we don't support CPU mappings.

Note that in Habanalabs's ASICs, the device memory is pinned and
immutable. Therefore, there is no need for dynamic mappings and pinning
callbacks.

Also note that in GAUDI we don't have an MMU towards the device memory
and the user works on physical addresses. Therefore, the user doesn't
pass through the kernel driver to allocate memory there. As a result,
only for GAUDI we receive from the user a device memory physical address
(instead of a handle) and a size.

We check the p2p distance using pci_p2pdma_distance_many() and refuse
to map the dmabuf in case the distance doesn't allow p2p.

Signed-off-by: Tomer Tayar 
Reviewed-by: Oded Gabbay 
Reviewed-by: Gal Pressman 
Reviewed-by: Greg Kroah-Hartman 
Signed-off-by: Oded Gabbay 
---
 drivers/misc/habanalabs/Kconfig |   1 +
 drivers/misc/habanalabs/common/habanalabs.h |  22 +
 drivers/misc/habanalabs/common/memory.c | 522 +++-
 drivers/misc/habanalabs/gaudi/gaudi.c   |   1 +
 drivers/misc/habanalabs/goya/goya.c |   1 +
 5 files changed, 543 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/habanalabs/Kconfig b/drivers/misc/habanalabs/Kconfig
index 293d79811372..c82d2e7b2035 100644
--- a/drivers/misc/habanalabs/Kconfig
+++ b/drivers/misc/habanalabs/Kconfig
@@ -8,6 +8,7 @@ config HABANA_AI
depends on PCI && HAS_IOMEM
select GENERIC_ALLOCATOR
select HWMON
+   select DMA_SHARED_BUFFER
help
  Enables PCIe card driver for Habana's AI Processors (AIP) that are
  designed to accelerate Deep Learning inference and training workloads.
diff --git a/drivers/misc/habanalabs/common/habanalabs.h b/drivers/misc/habanalabs/common/habanalabs.h
index bebebcb163ee..df49376a0880 100644
--- a/drivers/misc/habanalabs/common/habanalabs.h
+++ b/drivers/misc/habanalabs/common/habanalabs.h
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #define HL_NAME"habanalabs"
 
@@ -1352,6 +1353,23 @@ struct hl_cs_counters_atomic {
atomic64_t validation_drop_cnt;
 };
 
+/**
+ * struct hl_dmabuf_wrapper - a dma-buf wrapper object.
+ * @dmabuf: pointer to dma-buf object.
+ * @ctx: pointer to the dma-buf owner's context.
+ * @phys_pg_pack: pointer to physical page pack if the dma-buf was exported for
+ *memory allocation handle.
+ * @device_address: physical address of the device's memory. Relevant only
+ *  if phys_pg_pack is NULL (dma-buf was exported from address).
+ *  The total size can be taken from the dmabuf object.
+ */
+struct hl_dmabuf_wrapper {
+   struct dma_buf  *dmabuf;
+   struct hl_ctx   *ctx;
+   struct hl_vm_phys_pg_pack   *phys_pg_pack;
+   uint64_tdevice_address;
+};
+
 /**
  * struct hl_ctx - user/kernel context.
  * @mem_hash: holds mapping from virtual address to virtual memory area
@@ -1662,6 +1680,7 @@ struct hl_vm_hw_block_list_node {
  * @npages: num physical pages in the pack.
  * @total_size: total size of all the pages in this list.
  * @mapping_cnt: number of shared mappings.
+ * @exporting_cnt: number of dma-buf exports.
  * @asid: the context related to this list.
  * @page_size: size of each page in the pack.
  * @flags: HL_MEM_* flags related to this list.
@@ -1676,6 +1695,7 @@ struct hl_vm_phys_pg_pack {
u64 npages;
u64 total_size;
atomic_tmapping_cnt;
+   u32 exporting_cnt;
u32 asid;
u32 page_size;
u32 flags;
@@ -2396,6 +2416,7 @@ struct multi_cs_data {
  *  the error will be ignored by the driver during
  *  device initialization. Mainly used to debug and
  *  workaround firmware bugs
+ * @dram_pci_bar_start: start bus address of PCIe bar towards DRAM.
  * @last_successful_open_jif: timestamp (jiffies) of the last successful
  *