On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote: > > > On 2017/11/1 20:26, Robin Murphy wrote: > > On 01/11/17 07:46, Wei Hu (Xavier) wrote: > >> > >> On 2017/10/12 20:59, Robin Murphy wrote: > >>> On 12/10/17 13:31, Wei Hu (Xavier) wrote: > >>>> On 2017/10/1 0:10, Leon Romanovsky wrote: > >>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote: > >>>>>> If the IOMMU is enabled, the length of sg obtained from > >>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg > >>>>>> dma address, the IOVA will not be page continuous. and the VA > >>>>>> returned from dma_alloc_coherent is a vmalloc address. However, > >>>>>> the VA obtained by the page_address is a discontinuous VA. Under > >>>>>> these circumstances, the IOVA should be calculated based on the > >>>>>> sg length, and record the VA returned from dma_alloc_coherent > >>>>>> in the struct of hem. > >>>>>> > >>>>>> Signed-off-by: Wei Hu (Xavier) <[email protected]> > >>>>>> Signed-off-by: Shaobo Xu <[email protected]> > >>>>>> Signed-off-by: Lijun Ou <[email protected]> > >>>>>> --- > >>>>> Doug, > >>>>> > >>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in > >>>>> driver code to deal with dma_alloc_coherent is most probably wrong. > >>>>> > >>>>> Thanks > >>>> Hi, Leon & Doug > >>>> We refered the function named __ttm_dma_alloc_page in the kernel > >>>> code as below: > >>>> And there are similar methods in bch_bio_map and mem_to_page > >>>> functions in current 4.14-rcx. > >>>> > >>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool > >>>> *pool) > >>>> { > >>>> struct dma_page *d_page; > >>>> > >>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL); > >>>> if (!d_page) > >>>> return NULL; > >>>> > >>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size, > >>>> &d_page->dma, > >>>> pool->gfp_flags); > >>>> if (d_page->vaddr) { > >>>> if (is_vmalloc_addr(d_page->vaddr)) > >>>> d_page->p = vmalloc_to_page(d_page->vaddr); > >>>> else > >>>> d_page->p = virt_to_page(d_page->vaddr); > >>> There are cases on various architectures where neither of those is > >>> right. Whether those actually intersect with TTM or RDMA use-cases is > >>> another matter, of course. > >>> > >>> What definitely is a problem is if you ever take that page and end up > >>> accessing it through any virtual address other than the one explicitly > >>> returned by dma_alloc_coherent(). That can blow the coherency wide open > >>> and invite data loss, right up to killing the whole system with a > >>> machine check on certain architectures. > >>> > >>> Robin. > >> Hi, Robin > >> Thanks for your comment. > >> > >> We have one problem and the related code as below. > >> 1. call dma_alloc_coherent function serval times to alloc memory. > >> 2. vmap the allocated memory pages. > >> 3. software access memory by using the return virt addr of vmap > >> and hardware using the dma addr of dma_alloc_coherent. > > The simple answer is "don't do that". Seriously. dma_alloc_coherent() > > gives you a CPU virtual address and a DMA address with which to access > > your buffer, and that is the limit of what you may infer about it. You > > have no guarantee that the virtual address is either in the linear map > > or vmalloc, and not some other special place. You have no guarantee that > > the underlying memory even has an associated struct page at all. > > > >> When IOMMU is disabled in ARM64 architecture, we use virt_to_page() > >> before vmap(), it works. And when IOMMU is enabled using > >> virt_to_page() will cause calltrace later, we found the return > >> addr of dma_alloc_coherent is vmalloc addr, so we add the > >> condition judgement statement as below, it works. > >> for (i = 0; i < buf->nbufs; ++i) > >> pages[i] = > >> is_vmalloc_addr(buf->page_list[i].buf) ? > >> vmalloc_to_page(buf->page_list[i].buf) : > >> virt_to_page(buf->page_list[i].buf); > >> Can you give us suggestion? better method? > > Oh my goodness, having now taken a closer look at this driver, I'm lost > > for words in disbelief. To pick just one example: > > > > u32 bits_per_long = BITS_PER_LONG; > > ... > > if (bits_per_long == 64) { > > /* memory mapping nonsense */ > > } > > > > WTF does the size of a long have to do with DMA buffer management!? > > > > Of course I can guess that it might be trying to make some tortuous > > inference about vmalloc space being constrained on 32-bit platforms, but > > still... > > > >> The related code as below: > >> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list), > >> GFP_KERNEL); > >> if (!buf->page_list) > >> return -ENOMEM; > >> > >> for (i = 0; i < buf->nbufs; ++i) { > >> buf->page_list[i].buf = dma_alloc_coherent(dev, > >> page_size, &t, > >> GFP_KERNEL); > >> if (!buf->page_list[i].buf) > >> goto err_free; > >> > >> buf->page_list[i].map = t; > >> memset(buf->page_list[i].buf, 0, page_size); > >> } > >> > >> pages = kmalloc_array(buf->nbufs, sizeof(*pages), > >> GFP_KERNEL); > >> if (!pages) > >> goto err_free; > >> > >> for (i = 0; i < buf->nbufs; ++i) > >> pages[i] = > >> is_vmalloc_addr(buf->page_list[i].buf) ? > >> vmalloc_to_page(buf->page_list[i].buf) : > >> virt_to_page(buf->page_list[i].buf); > >> > >> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP, > >> PAGE_KERNEL); > >> kfree(pages); > >> if (!buf->direct.buf) > >> goto err_free; > > OK, this is complete crap. As above, you cannot assume that a struct > > page even exists; even if it does you cannot assume that using a > > PAGE_KERNEL mapping will not result in mismatched attributes, > > unpredictable behaviour and data loss. Trying to remap coherent DMA > > allocations like this is just egregiously wrong. > > > > What I do like is that you can seemingly fix all this by simply deleting > > hns_roce_buf::direct and all the garbage code related to it, and using > > the page_list entries consistently because the alternate paths involving > > those appear to do the right thing already. > > > > That is, of course, assuming that the buffers involved can be so large > > that it's not practical to just always make a single allocation and > > fragment it into multiple descriptors if the hardware does have some > > maximum length constraint - frankly I'm a little puzzled by the > > PAGE_SIZE * 2 threshold, given that that's not a fixed size. > > > > Robin. > Hiļ¼Robin > > We reconstruct the code as below: > It replaces dma_alloc_coherent with __get_free_pages and > dma_map_single > functions. So, we can vmap serveral ptrs returned by > __get_free_pages, right?
Most probably not, you should get rid of your virt_to_page/vmap calls.
Thanks
>
>
> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
> GFP_KERNEL);
> if (!buf->page_list)
> return -ENOMEM;
>
> for (i = 0; i < buf->nbufs; ++i) {
> ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
> get_order(page_size));
> if (!ptr) {
> dev_err(dev, "Alloc pages error.\n");
> goto err_free;
> }
>
> t = dma_map_single(dev, ptr, page_size,
> DMA_BIDIRECTIONAL);
> if (dma_mapping_error(dev, t)) {
> dev_err(dev, "DMA mapping error.\n");
> free_pages((unsigned long)ptr,
> get_order(page_size));
> goto err_free;
> }
>
> buf->page_list[i].buf = ptr;
> buf->page_list[i].map = t;
> }
>
> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
> GFP_KERNEL);
> if (!pages)
> goto err_free;
>
> for (i = 0; i < buf->nbufs; ++i)
> pages[i] = virt_to_page(buf->page_list[i].buf);
>
> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> PAGE_KERNEL);
> kfree(pages);
> if (!buf->direct.buf)
> goto err_free;
>
>
> Regards
> Wei Hu
> >> Regards
> >> Wei Hu
> >>>> } else {
> >>>> kfree(d_page);
> >>>> d_page = NULL;
> >>>> }
> >>>> return d_page;
> >>>> }
> >>>>
> >>>> Regards
> >>>> Wei Hu
> >>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
> >>>>>> +++++++++++++++++++++++++++---
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
> >>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22
> >>>>>> +++++++++++++++-------
> >>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
> >>>>>>
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> index 3e4c525..a69cd4b 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
> >>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
> >>>>>> *hr_dev, u32 size, u32 max_direct,
> >>>>>> goto err_free;
> >>>>>>
> >>>>>> for (i = 0; i < buf->nbufs; ++i)
> >>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
> >>>>>> + pages[i] =
> >>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
> >>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
> >>>>>> + virt_to_page(buf->page_list[i].buf);
> >>>>>>
> >>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
> >>>>>> PAGE_KERNEL);
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> index 8388ae2..4a3d1d4 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
> >>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> gfp_t gfp_mask)
> >>>>>> {
> >>>>>> struct hns_roce_hem_chunk *chunk = NULL;
> >>>>>> + struct hns_roce_vmalloc *vmalloc;
> >>>>>> struct hns_roce_hem *hem;
> >>>>>> struct scatterlist *mem;
> >>>>>> int order;
> >>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
> >>>>>> chunk->npages = 0;
> >>>>>> chunk->nsg = 0;
> >>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
> >>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
> >>>>>> }
> >>>>>>
> >>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> if (!buf)
> >>>>>> goto fail;
> >>>>>>
> >>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
> >>>>>> + if (is_vmalloc_addr(buf)) {
> >>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
> >>>>>> + vmalloc->is_vmalloc_addr = true;
> >>>>>> + vmalloc->vmalloc_addr = buf;
> >>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
> >>>>>> + PAGE_SIZE << order, offset_in_page(buf));
> >>>>>> + } else {
> >>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
> >>>>>> + }
> >>>>>> WARN_ON(mem->offset);
> >>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
> >>>>>>
> >>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
> >>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
> >>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
> >>>>>> hns_roce_hem *hem)
> >>>>>> {
> >>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
> >>>>>> + void *cpu_addr;
> >>>>>> int i;
> >>>>>>
> >>>>>> if (!hem)
> >>>>>> return;
> >>>>>>
> >>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
> >>>>>> - for (i = 0; i < chunk->npages; ++i)
> >>>>>> + for (i = 0; i < chunk->npages; ++i) {
> >>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
> >>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
> >>>>>> + else
> >>>>>> + cpu_addr =
> >>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
> >>>>>> +
> >>>>>> dma_free_coherent(hr_dev->dev,
> >>>>>> chunk->mem[i].length,
> >>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
> >>>>>> + cpu_addr,
> >>>>>> sg_dma_address(&chunk->mem[i]));
> >>>>>> + }
> >>>>>> kfree(chunk);
> >>>>>> }
> >>>>>>
> >>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
> >>>>>> *hr_dev,
> >>>>>>
> >>>>>> if (chunk->mem[i].length > (u32)offset) {
> >>>>>> page = sg_page(&chunk->mem[i]);
> >>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
> >>>>>> + mutex_unlock(&table->mutex);
> >>>>>> + return page ?
> >>>>>> + chunk->vmalloc[i].vmalloc_addr
> >>>>>> + + offset : NULL;
> >>>>>> + }
> >>>>>> goto out;
> >>>>>> }
> >>>>>> offset -= chunk->mem[i].length;
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> index af28bbf..62d712a 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
> >>>>>> @@ -72,11 +72,17 @@ enum {
> >>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
> >>>>>> };
> >>>>>>
> >>>>>> +struct hns_roce_vmalloc {
> >>>>>> + bool is_vmalloc_addr;
> >>>>>> + void *vmalloc_addr;
> >>>>>> +};
> >>>>>> +
> >>>>>> struct hns_roce_hem_chunk {
> >>>>>> struct list_head list;
> >>>>>> int npages;
> >>>>>> int nsg;
> >>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
> >>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
> >>>>>> };
> >>>>>>
> >>>>>> struct hns_roce_hem {
> >>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> index b99d70a..9e19bf1 100644
> >>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
> >>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
> >>>>>> *mb_buf, struct hns_roce_mr *mr,
> >>>>>> {
> >>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
> >>>>>> struct scatterlist *sg;
> >>>>>> + u64 page_addr = 0;
> >>>>>> u64 *pages;
> >>>>>> + int i = 0, j = 0;
> >>>>>> + int len = 0;
> >>>>>> int entry;
> >>>>>> - int i;
> >>>>>>
> >>>>>> mpt_entry = mb_buf;
> >>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
> >>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
> >>>>>> *mb_buf, struct hns_roce_mr *mr,
> >>>>>>
> >>>>>> i = 0;
> >>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
> >>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
> >>>>>> -
> >>>>>> - /* Record the first 2 entry directly to MTPT table */
> >>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> >>>>>> - break;
> >>>>>> - i++;
> >>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
> >>>>>> + for (j = 0; j < len; ++j) {
> >>>>>> + page_addr = sg_dma_address(sg) +
> >>>>>> + (j << mr->umem->page_shift);
> >>>>>> + pages[i] = page_addr >> 6;
> >>>>>> +
> >>>>>> + /* Record the first 2 entry directly to MTPT table */
> >>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
> >>>>>> + goto found;
> >>>>>> + i++;
> >>>>>> + }
> >>>>>> }
> >>>>>>
> >>>>>> +found:
> >>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
> >>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
> >>>>>> V2_MPT_BYTE_56_PA0_H_S,
> >>>>>> --
> >>>>>> 1.9.1
> >>>>>>
> >>>> _______________________________________________
> >>>> iommu mailing list
> >>>> [email protected]
> >>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
> >>> .
> >>>
> >>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> > .
> >
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
signature.asc
Description: PGP signature

