On Thu, 2010-12-02 at 12:24 -0800, Tom Tucker wrote:
> Personally I think the biggest issue is that I don't think the pfn to dma 
> address mapping logic is portable.

Perhaps. That is why the core Linux VM folks should be
involved.

> On 12/2/10 1:35 PM, Ralph Campbell wrote:
> > I understand the need for something like this patch since
> > the GPU folks would also like to mmap memory (although the
> > memory is marked as vma->vm_flags & VM_IO).
> >
> > It seems to me that duplicating the page walking code is
> > the wrong approach and exporting a new interface from
> > mm/memory.c is more appropriate.
> >
> Perhaps, but that's kernel proper (not a module) and has its own issues. 
> For example, it represents an exported kernel interface and therefore a 
> kernel compatibility commitment going forward. I suggest that a new kernel 
> interface is a separate effort that this code could utilize going forward.

I agree. That was what I was implying.

> > Also, the quick check to find_vma() is essentially duplicated
> > if get_user_pages() is called
> 
> You need to know the type before you know how to handle it. Unless you 
> want to tear up get_user_pages, I think this double lookup on a 
> non-performance path is a non-issue.

You only need the type if the translation is handled the way the
patch proposes. get_user_pages() could handle PFNMAP regions
differently, collecting the physical addresses itself as it checks
each vma region.
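
For example, rather than duplicating the pgd/pud/pmd/pte walk, the
lookup could sit on top of follow_pfn(), which mm/memory.c already
exports for VM_IO/VM_PFNMAP vmas. This is only an untested sketch
(ib_get_io_pfns() is a made-up name, and unlike the patch it does not
check pte_write() for writable mappings):

#include <linux/mm.h>
#include <linux/sched.h>

/*
 * Untested sketch: walk [start, start + npages * PAGE_SIZE) and collect
 * the PFNs of VM_PFNMAP mappings via follow_pfn() instead of a private
 * page-table walker.  Caller must hold mmap_sem.
 */
static int ib_get_io_pfns(struct mm_struct *mm, unsigned long start,
                          int npages, unsigned long *pfn_list)
{
        int i;

        for (i = 0; i < npages; i++) {
                struct vm_area_struct *vma = find_vma(mm, start);
                unsigned long pfn;

                if (!vma || !(vma->vm_flags & VM_PFNMAP))
                        return -EINVAL;
                /* follow_pfn() itself rejects vmas that are not IO/PFNMAP */
                if (follow_pfn(vma, start, &pfn))
                        return -EFAULT;
                pfn_list[i] = pfn;
                start += PAGE_SIZE;
                cond_resched();
        }
        return i;
}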

> > and it doesn't handle the case
> > when the region spans multiple vma regions with different flags.
> 
> Actually, it specifically does not allow that and I'm not sure that is 
> something you would want to support.

True. It doesn't support it because without reference counting,
there would need to be a callback mechanism to let the caller know
if/when the mapping changes or is invalidated.
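
An mmu_notifier (in mainline since 2.6.27) is roughly the shape such a
callback would take. Purely to illustrate the idea, and assuming a
struct mmu_notifier member were added to struct ib_umem (the
ib_umem_block_dma() helper below is hypothetical):

#include <linux/kernel.h>
#include <linux/sched.h>
#include <linux/mmu_notifier.h>
#include <rdma/ib_umem.h>

/*
 * Sketch only: assumes struct ib_umem grew a "struct mmu_notifier mn"
 * member.  ib_umem_block_dma() is a hypothetical helper that would tell
 * the HCA driver to quiesce DMA to the affected range.
 */
static void ib_umem_invalidate_range_start(struct mmu_notifier *mn,
                                           struct mm_struct *mm,
                                           unsigned long start,
                                           unsigned long end)
{
        struct ib_umem *umem = container_of(mn, struct ib_umem, mn);

        /* Stop the device before the PTEs (and the pages behind them) go away. */
        ib_umem_block_dma(umem, start, end);
}

static const struct mmu_notifier_ops ib_umem_mn_ops = {
        .invalidate_range_start = ib_umem_invalidate_range_start,
};

static int ib_umem_register_notifier(struct ib_umem *umem)
{
        umem->mn.ops = &ib_umem_mn_ops;
        return mmu_notifier_register(&umem->mn, current->mm);
}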

> > Maybe we can modify get_user_pages to have a new flag which
> > allows VM_PFNMAP segments to be accessed as IB memory regions.
> > The problem is that VM_PFNMAP means there is no corresponding
> > struct page to handle reference counting. What happens if the
> > device that exports the VM_PFNMAP memory is hot removed?
> 
> Bus Address Error.

> > Can the device reference count be incremented to prevent that?
> >
> 
> I don't think that would go in this code; it would go in the driver that 
> gave the user the address in the first place.

Perhaps. Once the IB core has the physical address, the user may
munmap the range, which would notify the exporting driver via its
unmap path, but the IB core wouldn't know that had happened without
some sort of callback notification. The IB hardware could happily go
on DMA'ing to that physical address.
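
That is, the only teardown notification goes to the exporter: on
munmap() the kernel calls the vma's .close op in the driver that
established the mapping, so that driver can release its BAR pages, but
nothing in that path reaches the IB core. Roughly (all the example_*
names below are made up):

#include <linux/mm.h>

/* Hypothetical exporter-side code, for illustration only. */
struct example_dev {
        /* exporter's private device state */
        int placeholder;
};

static void example_dev_put(struct example_dev *dev)
{
        /* drop the exporter's reference; details don't matter here */
}

static void example_vma_close(struct vm_area_struct *vma)
{
        struct example_dev *dev = vma->vm_private_data;

        /*
         * The exporter learns the mapping is gone and can release its
         * backing pages, but nothing in this path reaches the IB core,
         * so the HCA may still be DMA'ing to those bus addresses.
         */
        example_dev_put(dev);
}

static const struct vm_operations_struct example_vm_ops = {
        .close = example_vma_close,
};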

> > On Thu, 2010-12-02 at 11:02 -0800, Tom Tucker wrote:
> >> Added support to the ib_umem_get helper function for handling
> >> mmaped memory.
> >>
> >> Signed-off-by: Tom Tucker <t...@ogc.us>
> >> ---
> >>
> >>   drivers/infiniband/core/umem.c |  272 +++++++++++++++++++++++++++++++++++++---
> >>   1 files changed, 253 insertions(+), 19 deletions(-)
> >>
> >> diff --git a/drivers/infiniband/core/umem.c b/drivers/infiniband/core/umem.c
> >> index 415e186..357ca5e 100644
> >> --- a/drivers/infiniband/core/umem.c
> >> +++ b/drivers/infiniband/core/umem.c
> >> @@ -52,30 +52,24 @@ static void __ib_umem_release(struct ib_device *dev, struct ib_umem *umem, int d
> >>    int i;
> >>
> >>    list_for_each_entry_safe(chunk, tmp, &umem->chunk_list, list) {
> >> -          ib_dma_unmap_sg(dev, chunk->page_list,
> >> -                          chunk->nents, DMA_BIDIRECTIONAL);
> >> -          for (i = 0; i < chunk->nents; ++i) {
> >> -                  struct page *page = sg_page(&chunk->page_list[i]);
> >> -
> >> -                  if (umem->writable && dirty)
> >> -                          set_page_dirty_lock(page);
> >> -                  put_page(page);
> >> -          }
> >> +          if (umem->type == IB_UMEM_MEM_MAP) {
> >> +                  ib_dma_unmap_sg(dev, chunk->page_list,
> >> +                                  chunk->nents, DMA_BIDIRECTIONAL);
> >> +                  for (i = 0; i < chunk->nents; ++i) {
> >> +                          struct page *page = sg_page(&chunk->page_list[i]);
> >> +
> >> +                          if (umem->writable && dirty)
> >> +                                  set_page_dirty_lock(page);
> >> +                          put_page(page);
> >> +                  }
> >> +          }
> >>            kfree(chunk);
> >>    }
> >>   }
> >>
> >> -/**
> >> - * ib_umem_get - Pin and DMA map userspace memory.
> >> - * @context: userspace context to pin memory for
> >> - * @addr: userspace virtual address to start at
> >> - * @size: length of region to pin
> >> - * @access: IB_ACCESS_xxx flags for memory being pinned
> >> - * @dmasync: flush in-flight DMA when the memory region is written
> >> - */
> >> -struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >> -                      size_t size, int access, int dmasync)
> >> +static struct ib_umem *__umem_get(struct ib_ucontext *context,
> >> +                            unsigned long addr, size_t size,
> >> +                            int access, int dmasync)
> >>   {
> >>    struct ib_umem *umem;
> >>    struct page **page_list;
> >> @@ -100,6 +94,7 @@ struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >>    if (!umem)
> >>            return ERR_PTR(-ENOMEM);
> >>
> >> +  umem->type      = IB_UMEM_MEM_MAP;
> >>    umem->context   = context;
> >>    umem->length    = size;
> >>    umem->offset    = addr & ~PAGE_MASK;
> >> @@ -215,6 +210,245 @@ out:
> >>
> >>    return ret < 0 ? ERR_PTR(ret) : umem;
> >>   }
> >> +
> >> +/*
> >> + * Return the PFN for the specified address in the vma. This only
> >> + * works for a vma that is VM_PFNMAP.
> >> + */
> >> +static unsigned long __follow_io_pfn(struct vm_area_struct *vma,
> >> +                               unsigned long address, int write)
> >> +{
> >> +  pgd_t *pgd;
> >> +  pud_t *pud;
> >> +  pmd_t *pmd;
> >> +  pte_t *ptep, pte;
> >> +  spinlock_t *ptl;
> >> +  unsigned long pfn;
> >> +  struct mm_struct *mm = vma->vm_mm;
> >> +
> >> +  pgd = pgd_offset(mm, address);
> >> +  if (pgd_none(*pgd) || unlikely(pgd_bad(*pgd)))
> >> +          return 0;
> >> +
> >> +  pud = pud_offset(pgd, address);
> >> +  if (pud_none(*pud))
> >> +          return 0;
> >> +  if (unlikely(pud_bad(*pud)))
> >> +          return 0;
> >> +
> >> +  pmd = pmd_offset(pud, address);
> >> +  if (pmd_none(*pmd))
> >> +          return 0;
> >> +  if (unlikely(pmd_bad(*pmd)))
> >> +          return 0;
> >> +
> >> +  ptep = pte_offset_map_lock(mm, pmd, address, &ptl);
> >> +  pte = *ptep;
> >> +  if (!pte_present(pte))
> >> +          goto bad;
> >> +  if (write && !pte_write(pte))
> >> +          goto bad;
> >> +
> >> +  pfn = pte_pfn(pte);
> >> +  pte_unmap_unlock(ptep, ptl);
> >> +  return pfn;
> >> + bad:
> >> +  pte_unmap_unlock(ptep, ptl);
> >> +  return 0;
> >> +}
> >> +
> >> +static int __get_io_pfn(struct task_struct *tsk, struct mm_struct *mm,
> >> +                  unsigned long start, int len, int write, int force,
> >> +                  unsigned long *pfn_list, struct vm_area_struct **vmas)
> >> +{
> >> +  unsigned long pfn;
> >> +  int i;
> >> +  if (len <= 0)
> >> +          return 0;
> >> +
> >> +  i = 0;
> >> +  do {
> >> +          struct vm_area_struct *vma;
> >> +
> >> +          vma = find_vma(mm, start);
> >> +          if (!(vma->vm_flags & VM_PFNMAP))
> >> +                  return -EINVAL;
> >> +
> >> +          if (!(vma->vm_flags & VM_IO))
> >> +                  return -EFAULT;
> >> +
> >> +          if (is_vm_hugetlb_page(vma))
> >> +                  return -EFAULT;
> >> +
> >> +          do {
> >> +                  cond_resched();
> >> +                  pfn = __follow_io_pfn(vma, start, write);
> >> +                  if (!pfn)
> >> +                          return -EFAULT;
> >> +                  if (pfn_list)
> >> +                          pfn_list[i] = pfn;
> >> +                  if (vmas)
> >> +                          vmas[i] = vma;
> >> +                  i++;
> >> +                  start += PAGE_SIZE;
> >> +                  len--;
> >> +          } while (len && start < vma->vm_end);
> >> +  } while (len);
> >> +  return i;
> >> +}
> >> +
> >> +static struct ib_umem *__iomem_get(struct ib_ucontext *context,
> >> +                             unsigned long addr, size_t size,
> >> +                             int access, int dmasync)
> >> +{
> >> +  struct ib_umem *umem;
> >> +  unsigned long *pfn_list;
> >> +  struct ib_umem_chunk *chunk;
> >> +  unsigned long locked;
> >> +  unsigned long lock_limit;
> >> +  unsigned long cur_base;
> >> +  unsigned long npages;
> >> +  int ret;
> >> +  int off;
> >> +  int i;
> >> +  DEFINE_DMA_ATTRS(attrs);
> >> +
> >> +  if (dmasync)
> >> +          dma_set_attr(DMA_ATTR_WRITE_BARRIER, &attrs);
> >> +
> >> +  if (!can_do_mlock())
> >> +          return ERR_PTR(-EPERM);
> >> +
> >> +  umem = kmalloc(sizeof *umem, GFP_KERNEL);
> >> +  if (!umem)
> >> +          return ERR_PTR(-ENOMEM);
> >> +
> >> +  umem->type      = IB_UMEM_IO_MAP;
> >> +  umem->context   = context;
> >> +  umem->length    = size;
> >> +  umem->offset    = addr & ~PAGE_MASK;
> >> +  umem->page_size = PAGE_SIZE;
> >> +
> >> +  /*
> >> +   * We ask for writable memory if any access flags other than
> >> +   * "remote read" are set.  "Local write" and "remote write"
> >> +   * obviously require write access.  "Remote atomic" can do
> >> +   * things like fetch and add, which will modify memory, and
> >> +   * "MW bind" can change permissions by binding a window.
> >> +   */
> >> +  umem->writable  = !!(access & ~IB_ACCESS_REMOTE_READ);
> >> +
> >> +  /* IO memory is not hugetlb memory */
> >> +  umem->hugetlb   = 0;
> >> +
> >> +  INIT_LIST_HEAD(&umem->chunk_list);
> >> +
> >> +  pfn_list = (unsigned long *) __get_free_page(GFP_KERNEL);
> >> +  if (!pfn_list) {
> >> +          kfree(umem);
> >> +          return ERR_PTR(-ENOMEM);
> >> +  }
> >> +
> >> +  npages = PAGE_ALIGN(size + umem->offset) >> PAGE_SHIFT;
> >> +
> >> +  down_write(&current->mm->mmap_sem);
> >> +
> >> +  locked     = npages + current->mm->locked_vm;
> >> +  lock_limit = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
> >> +
> >> +  if ((locked > lock_limit) && !capable(CAP_IPC_LOCK)) {
> >> +          ret = -ENOMEM;
> >> +          goto out;
> >> +  }
> >> +
> >> +  cur_base = addr & PAGE_MASK;
> >> +
> >> +  ret = 0;
> >> +  while (npages) {
> >> +          ret = __get_io_pfn(current, current->mm, cur_base,
> >> +                             min_t(unsigned long, npages,
> >> +                                   PAGE_SIZE / sizeof(unsigned long *)),
> >> +                             umem->writable,
> >> +                             !umem->writable, pfn_list, NULL);
> >> +
> >> +          if (ret < 0)
> >> +                  goto out;
> >> +
> >> +          cur_base += ret * PAGE_SIZE;
> >> +          npages   -= ret;
> >> +
> >> +          off = 0;
> >> +
> >> +          while (ret) {
> >> +                  chunk = kmalloc(sizeof *chunk +
> >> +                                  sizeof(struct scatterlist) *
> >> +                                  min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK),
> >> +                                  GFP_KERNEL);
> >> +                  if (!chunk) {
> >> +                          ret = -ENOMEM;
> >> +                          goto out;
> >> +                  }
> >> +                  chunk->nents = min_t(int, ret, IB_UMEM_MAX_PAGE_CHUNK);
> >> +                  sg_init_table(chunk->page_list, chunk->nents);
> >> +                  /* The pfn_list we built is a set of Page
> >> +                   * Frame Numbers (PFN) whose physical address
> >> +                   * is PFN << PAGE_SHIFT. The SG DMA mapping
> >> +                   * services expect page addresses, not PFN,
> >> +                   * therefore, we have to do the dma mapping
> >> +                   * ourselves here. */
> >> +                  for (i = 0; i < chunk->nents; ++i) {
> >> +                          sg_set_page(&chunk->page_list[i], 0,
> >> +                                      PAGE_SIZE, 0);
> >> +                          chunk->page_list[i].dma_address =
> >> +                                  (pfn_list[i + off] << PAGE_SHIFT);
> >> +                          chunk->page_list[i].dma_length = PAGE_SIZE;
> >> +                  }
> >> +                  chunk->nmap = chunk->nents;
> >> +                  ret -= chunk->nents;
> >> +                  off += chunk->nents;
> >> +                  list_add_tail(&chunk->list, &umem->chunk_list);
> >> +          }
> >> +  }
> >> +
> >> +out:
> >> +  if (ret < 0) {
> >> +          __ib_umem_release(context->device, umem, 0);
> >> +          kfree(umem);
> >> +  } else
> >> +          current->mm->locked_vm = locked;
> >> +  up_write(&current->mm->mmap_sem);
> >> +  free_page((unsigned long) pfn_list);
> >> +  return ret < 0 ? ERR_PTR(ret) : umem;
> >> +}
> >> +
> >> +/**
> >> + * ib_umem_get - Pin and DMA map userspace memory.
> >> + * @context: userspace context to pin memory for
> >> + * @addr: userspace virtual address to start at
> >> + * @size: length of region to pin
> >> + * @access: IB_ACCESS_xxx flags for memory being pinned
> >> + * @dmasync: flush in-flight DMA when the memory region is written
> >> + */
> >> +struct ib_umem *ib_umem_get(struct ib_ucontext *context, unsigned long addr,
> >> +                      size_t size, int access, int dmasync)
> >> +{
> >> +  /* Do a quick lookup of the containing VMA to determine its type. */
> >> +  struct vm_area_struct *vma;
> >> +  int do_pfn_map = 0;
> >> +
> >> +  down_write(&current->mm->mmap_sem);
> >> +  vma = find_vma(current->mm, addr & PAGE_MASK);
> >> +  if (vma)
> >> +          do_pfn_map = vma->vm_flags & VM_PFNMAP;
> >> +  up_write(&current->mm->mmap_sem);
> >> +  if (!vma)
> >> +          return ERR_PTR(-EINVAL);
> >> +
> >> +  if (do_pfn_map)
> >> +          return __iomem_get(context, addr, size, access, dmasync);
> >> +
> >> +  return __umem_get(context, addr, size, access, dmasync);
> >> +}
> >>   EXPORT_SYMBOL(ib_umem_get);
> >>
> >>   static void ib_umem_account(struct work_struct *work)
> >>
