Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> Right, but then you need to set that in the VMA's, and thus gone is your > nice fast g_u_p() that doesn't touch VMAs :-) Registering memory is a slow path thing in the RDMA world. Speeding it up is nice, so we make userspace do the madvise(VM_DONTCOPY) if it cares but if it doesn't it can leave it out. > > Yes, but unfortunately MPI says apps can allocate memory however they > > damn well please... in any case these issues are all-too-well-known in > > the RDMA world for quite a while. > Yup. What do you think of the idea of pre-COWing pages with an elevated > count at fork time ? Super-duper sucks if the first thing the child does is exec() :) Also if the parent has registered > half the memory in the system then it's instant OOM. So not that useful for the RDMA case :) The one thing that might make sense is to pre-COW any partial pages that the parent has registered -- ie if half a page can be used by the child, at least pre-COW that, but leave all the full pages with VM_DONTCOPY. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wed, 2009-02-04 at 21:10 -0800, Roland Dreier wrote: > > Note that g_u_p() has all sort of shortcommings... we were discussing > > some of that recently due to bugs reported from the field. > > > > The problem mostly is that you cannot guarantee that the physical page > > will remain mapped to that virtual address in the process. For example, > > if your code is part of some library used by an application, and that > > application somewhere does a fork/exec (for example, a system() call to > > run a shell helper), copy-on-write will hit, and you may end up with > > the child process getting the original physical page and the original > > process getting the copy... > > Believe me, I'm well aware of that. We added VM_DONTCOPY to allow apps > to fork() without the child triggering COW, but that means only the > parent process can use the mapped memory (and the app has to worry about > page boundaries etc). Right, but then you need to set that in the VMA's, and thus gone is your nice fast g_u_p() that doesn't touch VMAs :-) > Yes, but unfortunately MPI says apps can allocate memory however they > damn well please... in any case these issues are all-too-well-known in > the RDMA world for quite a while. Yup. What do you think of the idea of pre-COWing pages with an elevated count at fork time ? Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> Note that g_u_p() has all sort of shortcommings... we were discussing > some of that recently due to bugs reported from the field. > > The problem mostly is that you cannot guarantee that the physical page > will remain mapped to that virtual address in the process. For example, > if your code is part of some library used by an application, and that > application somewhere does a fork/exec (for example, a system() call to > run a shell helper), copy-on-write will hit, and you may end up with > the child process getting the original physical page and the original > process getting the copy... Believe me, I'm well aware of that. We added VM_DONTCOPY to allow apps to fork() without the child triggering COW, but that means only the parent process can use the mapped memory (and the app has to worry about page boundaries etc). > We've been discussing that at KS with various people, Linus says g_u_p() > sucks, don't do that :-) Most of the time, the other approach should be > used, ie, the driver allocates memory, and userspace mmap's it, in which > case you get access to the VMA to set flags such as don't copy on fork. Yes, but unfortunately MPI says apps can allocate memory however they damn well please... in any case these issues are all-too-well-known in the RDMA world for quite a while. Thanks, Roland ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> get_user_pages() also gives us the vma back, and we can see from > is_vm_hugetlb_page() (-- BTW can I just say that a function > is_xxx_page() that operates on vmas is horribly misnamed --) that these > pages all come from a hugetlb mapping, but figuring out the size of that > mapping is I guess a challenge. Note that g_u_p() has all sort of shortcommings... we were discussing some of that recently due to bugs reported from the field. The problem mostly is that you cannot guarantee that the physical page will remain mapped to that virtual address in the process. For example, if your code is part of some library used by an application, and that application somewhere does a fork/exec (for example, a system() call to run a shell helper), copy-on-write will hit, and you may end up with the child process getting the original physical page and the original process getting the copy... So your HW will still DMA to a valid page (ie, it's count will have been incremented) but it's not going to be the one the application uses any more. There are similar issues that can be cause, afaik, by madvise, etc... We've been discussing that at KS with various people, Linus says g_u_p() sucks, don't do that :-) Most of the time, the other approach should be used, ie, the driver allocates memory, and userspace mmap's it, in which case you get access to the VMA to set flags such as don't copy on fork. An option possibly would be to make fork() pre-COW pages with an elevated count to ensure that at least the original process is the one to keep the original physical page... but that has other potential side effects or performance issues. A can of worms.. Cheers, Ben. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> You should be able to find the head of a compound page using the > compound_head() inline, so try > PAGE_SIZE << compound_order(compound_head(page)) Thanks! Looks like that should be exactly what we need. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wed, 04 Feb 2009 11:11:22 -0800 Roland Dreier wrote: > > > > huge_page_size(page_hstate(page)) > > > > That would suit. I assume the intention is for that to be usable by > > > driver modules on any architecture? > > > erm, you overestimate the amount of planning and forethought which goes > > into these things ;) > > > The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint. > > Heh. Looking into the implementation, it seems that I could actually do > > PAGE_SIZE << compound_order(page) > > directly (since there's no reason to go from size to hstate and back to > size. I don't know all the details of these VM internals, but that > seems to only work on the first (small) page of a giant page? Which > causes problems for what we're trying to do here... > > To summarize the goal, we are mapping user memory to a device that has > its own page tables, where the device's page tables can also use > multiple page sizes. Using big pages on the device leads to similar > efficiencies as hugetlb pages do on the CPU, and in fact if a user has > used hugetlb pages for the memory they're giving to the device, that's a > very strong hint that the device should use big pages too. > > But one valid situation we have to handle in the driver is if, say, > userspace has a hugetlb mapped at virtual address 0x20, and wants to > map 0x8 bytes at 0x28 to the device. In that case, we're going > to do essentially > > get_user_pages(..., 0x28, 0x8 / PAGE_SIZE, ...) > > and get_user_pages() is going to give us a bunch of normal PAGE_SIZE > pages starting at offset 0x80 within the compound page that makes up > the huge page mapped at 0x20. > > get_user_pages() also gives us the vma back, and we can see from > is_vm_hugetlb_page() (-- BTW can I just say that a function > is_xxx_page() that operates on vmas is horribly misnamed --) that these > pages all come from a hugetlb mapping, but figuring out the size of that > mapping is I guess a challenge. compound_head() will convert any page* inside a hugepage into a pointer to the head page. It should work OK for regular pages as well as CONFIG_HUGETLB=n. So.. PAGE_SIZE << compound_order(compound_head(page)) ? ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wed, Feb 04, 2009 at 11:11:22AM -0800, Roland Dreier wrote: > Heh. Looking into the implementation, it seems that I could actually do > PAGE_SIZE << compound_order(page) > directly (since there's no reason to go from size to hstate and back to > size. I don't know all the details of these VM internals, but that > seems to only work on the first (small) page of a giant page? Which > causes problems for what we're trying to do here... You should be able to find the head of a compound page using the compound_head() inline, so try PAGE_SIZE << compound_order(compound_head(page)) -- wli ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> > > huge_page_size(page_hstate(page)) > > That would suit. I assume the intention is for that to be usable by > > driver modules on any architecture? > erm, you overestimate the amount of planning and forethought which goes > into these things ;) > The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint. Heh. Looking into the implementation, it seems that I could actually do PAGE_SIZE << compound_order(page) directly (since there's no reason to go from size to hstate and back to size. I don't know all the details of these VM internals, but that seems to only work on the first (small) page of a giant page? Which causes problems for what we're trying to do here... To summarize the goal, we are mapping user memory to a device that has its own page tables, where the device's page tables can also use multiple page sizes. Using big pages on the device leads to similar efficiencies as hugetlb pages do on the CPU, and in fact if a user has used hugetlb pages for the memory they're giving to the device, that's a very strong hint that the device should use big pages too. But one valid situation we have to handle in the driver is if, say, userspace has a hugetlb mapped at virtual address 0x20, and wants to map 0x8 bytes at 0x28 to the device. In that case, we're going to do essentially get_user_pages(..., 0x28, 0x8 / PAGE_SIZE, ...) and get_user_pages() is going to give us a bunch of normal PAGE_SIZE pages starting at offset 0x80 within the compound page that makes up the huge page mapped at 0x20. get_user_pages() also gives us the vma back, and we can see from is_vm_hugetlb_page() (-- BTW can I just say that a function is_xxx_page() that operates on vmas is horribly misnamed --) that these pages all come from a hugetlb mapping, but figuring out the size of that mapping is I guess a challenge. - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt >>> Do the generic hugetlbfs code provides such an API ? If not, we may need >>> to add one. On Wednesday 04 February 2009 16:13:29 Andrew Morton wrote: >> I think it's something like >> huge_page_size(page_hstate(page)) On Wed, Feb 04, 2009 at 04:31:38PM +1100, Nick Piggin wrote: > That would work if you have a page, yes. If you want to query which hugepage > sizes are available, then you probably want for_each_hstate() (which is only > within mm/hugetlb.c at the moment, but I have no objections to exporting it > and symbols it requires). Exciting things have happened in mm/hugetlb.c while I was out sick. I've got quite some catching up to do. -- wli ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Tue, 03 Feb 2009 22:16:08 -0800 Roland Dreier wrote: > > I think it's something like > > > >huge_page_size(page_hstate(page)) > > That would suit. I assume the intention is for that to be usable by > driver modules on any architecture? > erm, you overestimate the amount of planning and forethought which goes into these things ;) The lack of any EXPORT_SYMBOL(size_to_hstate) is a broadish hint. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
> I think it's something like > > huge_page_size(page_hstate(page)) That would suit. I assume the intention is for that to be usable by driver modules on any architecture? - R. ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wednesday 04 February 2009 16:13:29 Andrew Morton wrote: > On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt > > Do the generic hugetlbfs code provides such an API ? If not, we may need > > to add one. > > I think it's something like > > huge_page_size(page_hstate(page)) That would work if you have a page, yes. If you want to query which hugepage sizes are available, then you probably want for_each_hstate() (which is only within mm/hugetlb.c at the moment, but I have no objections to exporting it and symbols it requires). ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Wed, 04 Feb 2009 12:50:48 +1100 Benjamin Herrenschmidt wrote: > On Tue, 2009-02-03 at 17:08 -0800, Roland Dreier wrote: > > Forwarding Eli's patch below, since PowerPC guys may have missed it. I > > guess the question for Ben et al is whether there is any issue with > > exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel > > it's an internal detail). It would probably make sense to roll this > > change into the mlx4 change that Eli alludes to below and merge through > > my tree (with ppc maintainer acks of course), rather than splitting this > > patch out and introducing cross-tree dependencies (and also separating > > the rationale for the change from the change itself). > > > > Thanks, > > Roland > > > > > > Drivers may want to take advantage of the large pages used for memory > > obtained > > from hugetlbfs. One example is mlx4_ib which can use much less MTT entries > > (in > > the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus > > scale > > significantly better when registering larger memory regions. Other drivers > > could also benefit from this. > > Except that we support multiple large page sizes nowadays ... I think > the size can be specified per mountpoint of hugetlbfs no ? Thus things > like mellanox would have to query the page size used for a given > mapping. > > Do the generic hugetlbfs code provides such an API ? If not, we may need > to add one. > I think it's something like huge_page_size(page_hstate(page)) ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
Re: FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
On Tue, 2009-02-03 at 17:08 -0800, Roland Dreier wrote: > Forwarding Eli's patch below, since PowerPC guys may have missed it. I > guess the question for Ben et al is whether there is any issue with > exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel > it's an internal detail). It would probably make sense to roll this > change into the mlx4 change that Eli alludes to below and merge through > my tree (with ppc maintainer acks of course), rather than splitting this > patch out and introducing cross-tree dependencies (and also separating > the rationale for the change from the change itself). > > Thanks, > Roland > > > Drivers may want to take advantage of the large pages used for memory obtained > from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in > the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale > significantly better when registering larger memory regions. Other drivers > could also benefit from this. Except that we support multiple large page sizes nowadays ... I think the size can be specified per mountpoint of hugetlbfs no ? Thus things like mellanox would have to query the page size used for a given mapping. Do the generic hugetlbfs code provides such an API ? If not, we may need to add one. Cheers, Ben. > Signed-off-by: Eli Cohen > --- > arch/powerpc/mm/hash_utils_64.c |1 + > 1 files changed, 1 insertions(+), 0 deletions(-) > > diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c > index 8d5b475..6cff8c7 100644 > --- a/arch/powerpc/mm/hash_utils_64.c > +++ b/arch/powerpc/mm/hash_utils_64.c > @@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M; > u16 mmu_slb_size = 64; > #ifdef CONFIG_HUGETLB_PAGE > unsigned int HPAGE_SHIFT; > +EXPORT_SYMBOL(HPAGE_SHIFT); > #endif > #ifdef CONFIG_PPC_64K_PAGES > int mmu_ci_restrictions; ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev
FW: [PATCH] powerpc/mm: Export HPAGE_SHIFT
Forwarding Eli's patch below, since PowerPC guys may have missed it. I guess the question for Ben et al is whether there is any issue with exporting HPAGE_SHIFT for modules (can be EXPORT_SYMBOL_GPL if you feel it's an internal detail). It would probably make sense to roll this change into the mlx4 change that Eli alludes to below and merge through my tree (with ppc maintainer acks of course), rather than splitting this patch out and introducing cross-tree dependencies (and also separating the rationale for the change from the change itself). Thanks, Roland Drivers may want to take advantage of the large pages used for memory obtained from hugetlbfs. One example is mlx4_ib which can use much less MTT entries (in the order of HPAGE_SIZE / PAGE_SIZE) when registering such memory, thus scale significantly better when registering larger memory regions. Other drivers could also benefit from this. Signed-off-by: Eli Cohen --- arch/powerpc/mm/hash_utils_64.c |1 + 1 files changed, 1 insertions(+), 0 deletions(-) diff --git a/arch/powerpc/mm/hash_utils_64.c b/arch/powerpc/mm/hash_utils_64.c index 8d5b475..6cff8c7 100644 --- a/arch/powerpc/mm/hash_utils_64.c +++ b/arch/powerpc/mm/hash_utils_64.c @@ -104,6 +104,7 @@ int mmu_highuser_ssize = MMU_SEGSIZE_256M; u16 mmu_slb_size = 64; #ifdef CONFIG_HUGETLB_PAGE unsigned int HPAGE_SHIFT; +EXPORT_SYMBOL(HPAGE_SHIFT); #endif #ifdef CONFIG_PPC_64K_PAGES int mmu_ci_restrictions; -- 1.6.1 ___ Linuxppc-dev mailing list Linuxppc-dev@ozlabs.org https://ozlabs.org/mailman/listinfo/linuxppc-dev