Re: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

2020-11-08 Thread Jason Gunthorpe
On Fri, Nov 06, 2020 at 01:11:38AM +, Xiong, Jianxin wrote:
> > On Thu, Nov 05, 2020 at 02:48:08PM -0800, Jianxin Xiong wrote:
> > > @@ -966,7 +969,10 @@ static struct mlx5_ib_mr *alloc_mr_from_cache(struct ib_pd *pd,
> > >   struct mlx5_ib_mr *mr;
> > >   unsigned int page_size;
> > >
> > > - page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
> > > + if (umem->is_dmabuf)
> > > + page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
> > 
> > You said the sgl is not set here, why doesn't this crash? It is certainly 
> > wrong to call this function without a SGL.
> 
> The sgl is NULL, and nmap is 0. The 'for_each_sg' loop is just skipped and 
> won't crash.

Just wire this to 4k, it is clearer than calling some no-op pgsz
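
Roughly like this (just a sketch of what I mean, untested against this
series; PAGE_SIZE standing in for the literal 4k):

	if (umem->is_dmabuf)
		page_size = PAGE_SIZE;	/* wired to 4k, no sgl needed yet */
	else
		page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size,
						     0, iova);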


> > > + if (!mr->cache_ent) {
> > > + mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);
> > > + WARN_ON(mr->descs);
> > > + }
> > > +}
> > 
> > I would expect this to call ib_umem_dmabuf_unmap_pages() ?
> > 
> > Who calls it on the dereg path?
> > 
> > This looks quite strange to me, it calls ib_umem_dmabuf_unmap_pages() only 
> > from the invalidate callback?
> 
> It is also called from ib_umem_dmabuf_release(). 

Hmm, that is not how the other APIs work, the unmap should be paired
with the map in the caller, and the sequence for destroy should be

 invalidate
 unmap
 destroy_mkey
 release_umem

I have another series coming that makes the other three destroy flows
much closer to that ideal.
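
As a rough sketch of the dereg side in that order (the umem helper
names are from patch 1, but the resv locking around the unmap and the
exact signatures are my assumption, error handling omitted):

	mlx5_ib_fence_dmabuf_mr(mr);				/* invalidate */
	dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
	ib_umem_dmabuf_unmap_pages(umem_dmabuf);		/* unmap, pairs with the map at fault time */
	dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
	mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);	/* destroy_mkey */
	ib_umem_release(&umem_dmabuf->umem);			/* release_umem */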

> > I feel uneasy how this seems to assume everything works sanely, we can have 
> > parallel page faults so pagefault_dmabuf_mr() can be called
> > multiple times after an invalidation, and it doesn't protect itself against 
> > calling ib_umem_dmabuf_map_pages() twice.
> > 
> > Perhaps the umem code should keep track of the current map state and exit 
> > if there is already a sgl. NULL or not NULL sgl would do and
> > seems quite reasonable.
> 
> ib_umem_dmabuf_map() already checks the sgl and will do nothing if it is
> already set.

How? What I see in patch 1 is an unconditional call to
dma_buf_map_attachment()?
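
Something like this is what I had in mind (a sketch only; the
ib_umem_dmabuf field names are guessed from patch 1 and the
offset/length trimming of the real patch is glossed over):

	int ib_umem_dmabuf_map_pages(struct ib_umem_dmabuf *umem_dmabuf)
	{
		struct sg_table *sgt;

		dma_resv_assert_held(umem_dmabuf->attach->dmabuf->resv);

		/* already mapped, nothing to do */
		if (umem_dmabuf->sgt)
			return 0;

		sgt = dma_buf_map_attachment(umem_dmabuf->attach,
					     DMA_BIDIRECTIONAL);
		if (IS_ERR(sgt))
			return PTR_ERR(sgt);

		umem_dmabuf->umem.sg_head = *sgt;
		umem_dmabuf->umem.nmap = sgt->nents;
		umem_dmabuf->sgt = sgt;
		return 0;
	}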

> > > + if (is_dmabuf_mr(mr))
> > > + return pagefault_dmabuf_mr(mr, umem_dmabuf, user_va,
> > > +bcnt, bytes_mapped, flags);
> > 
> > But this doesn't care about user_va or bcnt it just triggers the whole 
> > thing to be remapped, so why calculate it?
> 
> The range check is still needed to catch application errors of using
> an incorrect address or count in the verbs command. Passing the
> values further in is to allow pagefault_dmabuf_mr to generate a
> return value and set bytes_mapped in a way consistent with the page
> fault handler chain.

The HW validates the range. The range check in the ODP case is to
protect against a HW bug that would cause the kernel to
malfunction. For dmabuf you don't need to do it
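
In other words the dmabuf case can just branch out before any of the
va/end math, something like this (a sketch; dropping user_va from the
call is my assumption, not what the patch does):

	if (is_dmabuf_mr(mr))
		return pagefault_dmabuf_mr(mr, to_ib_umem_dmabuf(mr->umem),
					   bcnt, bytes_mapped, flags);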

Jason


RE: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

2020-11-06 Thread Xiong, Jianxin
> -----Original Message-----
> From: Jason Gunthorpe
> Sent: Friday, November 06, 2020 4:49 AM
> To: Xiong, Jianxin
> Cc: linux-r...@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford; Leon Romanovsky; Sumit Semwal; Christian Koenig; Vetter, Daniel
> Subject: Re: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region
> 
> On Fri, Nov 06, 2020 at 01:11:38AM +, Xiong, Jianxin wrote:
> > > On Thu, Nov 05, 2020 at 02:48:08PM -0800, Jianxin Xiong wrote:
> > > > @@ -966,7 +969,10 @@ static struct mlx5_ib_mr *alloc_mr_from_cache(struct ib_pd *pd,
> > > > struct mlx5_ib_mr *mr;
> > > > unsigned int page_size;
> > > >
> > > > -   page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
> > > > +   if (umem->is_dmabuf)
> > > > +   page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
> > >
> > > You said the sgl is not set here, why doesn't this crash? It is certainly 
> > > wrong to call this function without a SGL.
> >
> > The sgl is NULL, and nmap is 0. The 'for_each_sg' loop is just skipped and 
> > won't crash.
> 
> Just wire this to 4k, it is clearer than calling some no-op pgsz

Ok

> 
> 
> > > > +   if (!mr->cache_ent) {
> > > > +   mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);
> > > > +   WARN_ON(mr->descs);
> > > > +   }
> > > > +}
> > >
> > > I would expect this to call ib_umem_dmabuf_unmap_pages() ?
> > >
> > > Who calls it on the dereg path?
> > >
> > > This looks quite strange to me, it calls ib_umem_dmabuf_unmap_pages() 
> > > only from the invalidate callback?
> >
> > It is also called from ib_umem_dmabuf_release().
> 
> Hmm, that is not how the other APIs work, the unmap should be paired with the 
> map in the caller, and the sequence for destroy should be
> 
>  invalidate
>  unmap
>  destroy_mkey
>  release_umem
> 
> I have another series coming that makes the other three destroy flows much 
> closer to that ideal.
> 

Can fix that.

> > > I feel uneasy how this seems to assume everything works sanely, we
> > > can have parallel page faults so pagefault_dmabuf_mr() can be called 
> > > multiple times after an invalidation, and it doesn't protect itself
> against calling ib_umem_dmabuf_map_pages() twice.
> > >
> > > Perhaps the umem code should keep track of the current map state and
> > > exit if there is already a sgl. NULL or not NULL sgl would do and seems 
> > > quite reasonable.
> >
> > ib_umem_dmabuf_map() already checks the sgl and will do nothing if it is
> > already set.
> 
> How? What I see in patch 1 is an unconditional call to
> dma_buf_map_attachment()?

My bad. I misread the lines. It used to be there (in v3) but somehow got lost. 

> 
> > > > +   if (is_dmabuf_mr(mr))
> > > > +   return pagefault_dmabuf_mr(mr, umem_dmabuf, user_va,
> > > > +  bcnt, bytes_mapped, flags);
> > >
> > > But this doesn't care about user_va or bcnt it just triggers the whole 
> > > thing to be remapped, so why calculate it?
> >
> > The range check is still needed to catch application errors of
> > using an incorrect address or count in the verbs command. Passing
> > the values further in is to allow pagefault_dmabuf_mr to generate a
> > return value and set bytes_mapped in a way consistent with the page
> > fault handler chain.
> 
> The HW validates the range. The range check in the ODP case is to protect 
> against a HW bug that would cause the kernel to malfunction.
> For dmabuf you don't need to do it

Ok.  So the handler can simply return 0 (as the number of pages mapped) and 
leave bytes_mapped untouched?

> 
> Jason


Re: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

2020-11-06 Thread Jason Gunthorpe
On Thu, Nov 05, 2020 at 02:48:08PM -0800, Jianxin Xiong wrote:
> @@ -966,7 +969,10 @@ static struct mlx5_ib_mr *alloc_mr_from_cache(struct ib_pd *pd,
>   struct mlx5_ib_mr *mr;
>   unsigned int page_size;
>  
> - page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
> + if (umem->is_dmabuf)
> + page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);

You said the sgl is not set here, why doesn't this crash? It is
certainly wrong to call this function without a SGL.

> +/**
> + * mlx5_ib_fence_dmabuf_mr - Stop all access to the dmabuf MR
> + * @mr: to fence
> + *
> + * On return no parallel threads will be touching this MR and no DMA will be
> + * active.
> + */
> +void mlx5_ib_fence_dmabuf_mr(struct mlx5_ib_mr *mr)
> +{
> + struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
> +
> + /* Prevent new page faults and prefetch requests from succeeding */
> + xa_erase(&mr->dev->odp_mkeys, mlx5_base_mkey(mr->mmkey.key));
> +
> + /* Wait for all running page-fault handlers to finish. */
> + synchronize_srcu(&mr->dev->odp_srcu);
> +
> + wait_event(mr->q_deferred_work, !atomic_read(&mr->num_deferred_work));
> +
> + dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> + mlx5_mr_cache_invalidate(mr);
> + umem_dmabuf->private = NULL;
> + dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
> +
> + if (!mr->cache_ent) {
> + mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);
> + WARN_ON(mr->descs);
> + }
> +}

I would expect this to call ib_umem_dmabuf_unmap_pages() ?

Who calls it on the dereg path?

This looks quite strange to me, it calls ib_umem_dmabuf_unmap_pages()
only from the invalidate callback?

I feel uneasy how this seems to assume everything works sanely, we can
have parallel page faults so pagefault_dmabuf_mr() can be called
multiple times after an invalidation, and it doesn't protect itself
against calling ib_umem_dmabuf_map_pages() twice.

Perhaps the umem code should keep track of the current map state and
exit if there is already a sgl. NULL or not NULL sgl would do and
seems quite reasonable.

> @@ -810,22 +871,31 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
>   u32 *bytes_mapped, u32 flags)
>  {
>   struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
> + struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
>  
>   lockdep_assert_held(&mr->dev->odp_srcu);
>   if (unlikely(io_virt < mr->mmkey.iova))
>   return -EFAULT;
>  
> - if (!odp->is_implicit_odp) {
> + if (is_dmabuf_mr(mr) || !odp->is_implicit_odp) {
>   u64 user_va;
> + u64 end;
>  
>   if (check_add_overflow(io_virt - mr->mmkey.iova,
> -(u64)odp->umem.address, &user_va))
> +(u64)mr->umem->address, &user_va))
>   return -EFAULT;
> - if (unlikely(user_va >= ib_umem_end(odp) ||
> -  ib_umem_end(odp) - user_va < bcnt))
> + if (is_dmabuf_mr(mr))
> + end = mr->umem->address + mr->umem->length;
> + else
> + end = ib_umem_end(odp);
> + if (unlikely(user_va >= end || end - user_va < bcnt))
>   return -EFAULT;
> - return pagefault_real_mr(mr, odp, user_va, bcnt, bytes_mapped,
> -  flags);
> + if (is_dmabuf_mr(mr))
> + return pagefault_dmabuf_mr(mr, umem_dmabuf, user_va,
> +bcnt, bytes_mapped, flags);

But this doesn't care about user_va or bcnt it just triggers the whole
thing to be remapped, so why calculate it?

Jason


RE: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region

2020-11-05 Thread Xiong, Jianxin
> -----Original Message-----
> From: Jason Gunthorpe
> Sent: Thursday, November 05, 2020 4:25 PM
> To: Xiong, Jianxin
> Cc: linux-r...@vger.kernel.org; dri-devel@lists.freedesktop.org; Doug Ledford; Leon Romanovsky; Sumit Semwal; Christian Koenig; Vetter, Daniel
> Subject: Re: [PATCH v8 4/5] RDMA/mlx5: Support dma-buf based userspace memory region
> 
> On Thu, Nov 05, 2020 at 02:48:08PM -0800, Jianxin Xiong wrote:
> > @@ -966,7 +969,10 @@ static struct mlx5_ib_mr *alloc_mr_from_cache(struct ib_pd *pd,
> > struct mlx5_ib_mr *mr;
> > unsigned int page_size;
> >
> > -   page_size = mlx5_umem_find_best_pgsz(umem, mkc, log_page_size, 0, iova);
> > +   if (umem->is_dmabuf)
> > +   page_size = ib_umem_find_best_pgsz(umem, PAGE_SIZE, iova);
> 
> You said the sgl is not set here, why doesn't this crash? It is certainly 
> wrong to call this function without a SGL.

The sgl is NULL, and nmap is 0. The 'for_each_sg' loop is just skipped and 
won't crash.
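
For reference, for_each_sg() is just a counted loop, roughly (quoting
the scatterlist macro from memory, so double check):

	#define for_each_sg(sglist, sg, nr, __i) \
		for (__i = 0, sg = (sglist); __i < (nr); __i++, sg = sg_next(sg))

With nr == 0 the loop body never runs, so the NULL sgl is never
dereferenced.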

> 
> > +/**
> > + * mlx5_ib_fence_dmabuf_mr - Stop all access to the dmabuf MR
> > + * @mr: to fence
> > + *
> > + * On return no parallel threads will be touching this MR and no DMA will be
> > + * active.
> > + */
> > +void mlx5_ib_fence_dmabuf_mr(struct mlx5_ib_mr *mr)
> > +{
> > +   struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
> > +
> > +   /* Prevent new page faults and prefetch requests from succeeding */
> > +   xa_erase(&mr->dev->odp_mkeys, mlx5_base_mkey(mr->mmkey.key));
> > +
> > +   /* Wait for all running page-fault handlers to finish. */
> > +   synchronize_srcu(&mr->dev->odp_srcu);
> > +
> > +   wait_event(mr->q_deferred_work, !atomic_read(&mr->num_deferred_work));
> > +
> > +   dma_resv_lock(umem_dmabuf->attach->dmabuf->resv, NULL);
> > +   mlx5_mr_cache_invalidate(mr);
> > +   umem_dmabuf->private = NULL;
> > +   dma_resv_unlock(umem_dmabuf->attach->dmabuf->resv);
> > +
> > +   if (!mr->cache_ent) {
> > +   mlx5_core_destroy_mkey(mr->dev->mdev, &mr->mmkey);
> > +   WARN_ON(mr->descs);
> > +   }
> > +}
> 
> I would expect this to call ib_umem_dmabuf_unmap_pages() ?
> 
> Who calls it on the dereg path?
> 
> This looks quite strange to me, it calls ib_umem_dmabuf_unmap_pages() only 
> from the invalidate callback?
>

It is also called from ib_umem_dmabuf_release(). 
 
> I feel uneasy how this seems to assume everything works sanely, we can have 
> parallel page faults so pagefault_dmabuf_mr() can be called
> multiple times after an invalidation, and it doesn't protect itself against 
> calling ib_umem_dmabuf_map_pages() twice.
> 
> Perhaps the umem code should keep track of the current map state and exit if 
> there is already a sgl. NULL or not NULL sgl would do and
> seems quite reasonable.
> 

ib_umem_dmabuf_map() already checks the sgl and will do nothing if it is
already set.

> > @@ -810,22 +871,31 @@ static int pagefault_mr(struct mlx5_ib_mr *mr, u64 io_virt, size_t bcnt,
> > u32 *bytes_mapped, u32 flags)
> >  {
> > struct ib_umem_odp *odp = to_ib_umem_odp(mr->umem);
> > +   struct ib_umem_dmabuf *umem_dmabuf = to_ib_umem_dmabuf(mr->umem);
> >
> > lockdep_assert_held(&mr->dev->odp_srcu);
> > if (unlikely(io_virt < mr->mmkey.iova))
> > return -EFAULT;
> >
> > -   if (!odp->is_implicit_odp) {
> > +   if (is_dmabuf_mr(mr) || !odp->is_implicit_odp) {
> > u64 user_va;
> > +   u64 end;
> >
> > if (check_add_overflow(io_virt - mr->mmkey.iova,
> > -  (u64)odp->umem.address, &user_va))
> > +  (u64)mr->umem->address, &user_va))
> > return -EFAULT;
> > -   if (unlikely(user_va >= ib_umem_end(odp) ||
> > -ib_umem_end(odp) - user_va < bcnt))
> > +   if (is_dmabuf_mr(mr))
> > +   end = mr->umem->address + mr->umem->length;
> > +   else
> > +   end = ib_umem_end(odp);
> > +   if (unlikely(user_va >= end || end - user_va < bcnt))
> > return -EFAULT;
> > -   return pagefault_real_mr(mr, odp, user_va, bcnt, bytes_mapped,
> > -flags);
> > +   if (is_dmabuf_mr(mr))
> > +   return pagefault_dmabuf_mr(mr, umem_dmabuf, user_va,
> > +  bcnt, bytes_mapped, flags);
> 
> But this doesn't care about user_va or bcnt it just triggers the whole thing 
> to be remapped, so why calculate it?

The range check is still needed to catch application errors of using an
incorrect address or count in the verbs command. Passing the values further
in is to allow pagefault_dmabuf_mr to generate a return value and set
bytes_mapped in a way consistent with the page fault handler chain.
  
> 
> Jason