Hi David, > > On 18.11.23 07:32, Vivek Kasireddy wrote: > > For drivers that would like to longterm-pin the pages associated > > with a file, the pin_user_pages_fd() API provides an option to > > not only pin the pages via FOLL_PIN but also to check and migrate > > them if they reside in movable zone or CMA block. This API > > currently works with files that belong to either shmem or hugetlbfs. > > Files belonging to other filesystems are rejected for now. > > > > The pages need to be located first before pinning them via FOLL_PIN. > > If they are found in the page cache, they can be immediately pinned. > > Otherwise, they need to be allocated using the filesystem specific > > APIs and then pinned. > > > > v2: > > - Drop gup_flags and improve comments and commit message (David) > > - Allocate a page if we cannot find in page cache for the hugetlbfs > > case as well (David) > > - Don't unpin pages if there is a migration related failure (David) > > - Drop the unnecessary nr_pages <= 0 check (Jason) > > - Have the caller of the API pass in file * instead of fd (Jason) > > > > v3: (David) > > - Enclose the huge page allocation code with #ifdef > CONFIG_HUGETLB_PAGE > > (Build error reported by kernel test robot <l...@intel.com>) > > - Don't forget memalloc_pin_restore() on non-migration related errors > > - Improve the readability of the cleanup code associated with > > non-migration related errors > > - Augment the comments by describing FOLL_LONGTERM like behavior > > - Include the R-b tag from Jason > > > > v4: > > - Remove the local variable "page" and instead use 3 return statements > > in alloc_file_page() (David) > > - Add the R-b tag from David > > > > Cc: David Hildenbrand <da...@redhat.com> > > Cc: Daniel Vetter <daniel.vet...@ffwll.ch> > > Cc: Mike Kravetz <mike.krav...@oracle.com> > > Cc: Hugh Dickins <hu...@google.com> > > Cc: Peter Xu <pet...@redhat.com> > > Cc: Gerd Hoffmann <kra...@redhat.com> > > Cc: Dongwon Kim <dongwon....@intel.com> > > Cc: Junxiao Chang <junxiao.ch...@intel.com> > > Suggested-by: Jason Gunthorpe <j...@nvidia.com> > > Reviewed-by: Jason Gunthorpe <j...@nvidia.com> (v2) > > Reviewed-by: David Hildenbrand <da...@redhat.com> (v3) > > Signed-off-by: Vivek Kasireddy <vivek.kasire...@intel.com> > > --- > > > [...] > > > > +static struct page *alloc_file_page(struct file *file, pgoff_t idx) > > +{ > > +#ifdef CONFIG_HUGETLB_PAGE > > + struct folio *folio; > > + int err; > > + > > + if (is_file_hugepages(file)) { > > + folio = alloc_hugetlb_folio_nodemask(hstate_file(file), > > + NUMA_NO_NODE, > > + NULL, > > + GFP_USER); > > + if (folio && folio_try_get(folio)) { > > + err = hugetlb_add_to_page_cache(folio, > > + file->f_mapping, > > + idx); > > + if (err) { > > + folio_put(folio); > > + free_huge_folio(folio); > > + return ERR_PTR(err); > > + } > > + return &folio->page; > > While looking at the user of pin_user_pages_fd(), I realized something: > > Assume idx is not aligned to the hugetlb page size. > find_get_page_flags() would always return a tail page in that case, but > you'd be returning the head page here. > > See pagecache_get_page()->folio_file_page(folio, index); Thank you for catching this. Looking at how udambuf uses this API for hugetlb case: hpstate = hstate_file(memfd); mapidx = list[i].offset >> huge_page_shift(hpstate); do { nr_pages = shmem_file(memfd) ? pgcnt : 1; ret = pin_user_pages_fd(memfd, mapidx, nr_pages, ubuf->pages + pgbuf); As the raw file offset is translated into huge page size units, represented by mapidx, I was expecting find_get_page_flags() to return a head page but I did not realize that find_get_page_flags() now returns tail pages given that it had returned head pages in the previous kernel versions I had tested IIRC. As my goal is to only grab the head pages, __filemap_get_folio() seems like the right API to use instead of find_get_page_flags(). With this change, the hugetlb subtest (that I have not tested with kernels >= 6.7) that fails with kernel 6.7 RC1 now seems to work as expected.
> > > + } > > + return ERR_PTR(-ENOMEM); > > + } > > +#endif > > + return shmem_read_mapping_page(file->f_mapping, idx); > > +} > > + > > +/** > > + * pin_user_pages_fd() - pin user pages associated with a file > > + * @file: the file whose pages are to be pinned > > + * @start: starting file offset > > + * @nr_pages: number of pages from start to pin > > + * @pages: array that receives pointers to the pages pinned. > > + * Should be at-least nr_pages long. > > + * > > + * Attempt to pin pages associated with a file that belongs to either > shmem > > + * or hugetlb. The pages are either found in the page cache or allocated if > > + * necessary. Once the pages are located, they are all pinned via FOLL_PIN. > > + * And, these pinned pages need to be released either using > unpin_user_pages() > > + * or unpin_user_page(). > > + * > > + * It must be noted that the pages may be pinned for an indefinite amount > > + * of time. And, in most cases, the duration of time they may stay pinned > > + * would be controlled by the userspace. This behavior is effectively the > > + * same as using FOLL_LONGTERM with other GUP APIs. > > + * > > + * Returns number of pages pinned. This would be equal to the number of > > + * pages requested. If no pages were pinned, it returns -errno. > > + */ > > +long pin_user_pages_fd(struct file *file, pgoff_t start, > > + unsigned long nr_pages, struct page **pages) > > +{ > > + struct page *page; > > + unsigned int flags, i; > > + long ret; > > + > > + if (start < 0) > > + return -EINVAL; > > + > > + if (!file) > > + return -EINVAL; > > + > > + if (!shmem_file(file) && !is_file_hugepages(file)) > > + return -EINVAL; > > + > > + flags = memalloc_pin_save(); > > + do { > > + for (i = 0; i < nr_pages; i++) { > > + /* > > + * In most cases, we should be able to find the page > > + * in the page cache. If we cannot find it, we try to > > + * allocate one and add it to the page cache. > > + */ > > + page = find_get_page_flags(file->f_mapping, > > + start + i, > > + FGP_ACCESSED); > > + if (!page) { > > + page = alloc_file_page(file, start + i); > > + if (IS_ERR(page)) { > > + ret = PTR_ERR(page); > > + goto err; > > While looking at above, I do wonder: what if two parties tried to alloc > the page at the same time? I suspect we'd want to handle -EEXIST a bit > nicer here, right? At-least with the udmabuf use-case, there cannot be multiple entities allocating and adding a page at a given offset in the memfd at the same time. However, I can make the following changes to protect against this potential failure mode: do { for (i = 0; i < nr_pages; i++) { +retry: + page = NULL; /* * In most cases, we should be able to find the page * in the page cache. If we cannot find it, we try to * allocate one and add it to the page cache. */ - page = find_get_page_flags(file->f_mapping, - start + i, - FGP_ACCESSED); + folio = __filemap_get_folio(file->f_mapping, + start + i, + FGP_ACCESSED, 0); + if (!IS_ERR(folio)) + page = &folio->page; + if (!page) { page = alloc_file_page(file, start + i); if (IS_ERR(page)) { ret = PTR_ERR(page); + if (ret == -EEXIST) + goto retry; goto err; } Thanks, Vivek > > > -- > Cheers, > > David / dhildenb