Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Wed, Mar 06, 2019 at 06:15:25PM +0530, Aneesh Kumar K.V wrote:
> On 3/6/19 5:14 PM, Michal Suchánek wrote:
> > On Wed, 06 Mar 2019 14:47:33 +0530
> > "Aneesh Kumar K.V" wrote:
> >
> > > Dan Williams writes:
> > >
> > > > On Thu, Feb 28, 2019 at 1:40 AM Oliver wrote:
> > > > >
> > > > > On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > > > wrote:
> > >
> > > Also even if the user decided to not use THP, by
> > > echo "never" > transparent_hugepage/enabled , we should continue to map
> > > dax faults using huge pages on platforms that can support huge pages.
> >
> > Is this a good idea?
> >
> > This knob is there for a reason. In some situations having huge pages
> > can severely impact performance of the system (due to host-guest
> > interaction or whatever) and the ability to really turn off all THP
> > would be important in those cases, right?
>
> My understanding was that this is not true for dax pages? These are not
> regular memory that got allocated. They are allocated out of /dev/dax/ or
> /dev/pmem*. Do we have a reason not to use hugepages for mapping pages in
> that case?

Yes. For example when you don't want dax to compete for TLB with a
mission-critical application (which uses hugetlb, for instance).

--
Kirill A. Shutemov
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On 3/6/19 5:14 PM, Michal Suchánek wrote:
> On Wed, 06 Mar 2019 14:47:33 +0530
> "Aneesh Kumar K.V" wrote:
>
> > Dan Williams writes:
> >
> > > On Thu, Feb 28, 2019 at 1:40 AM Oliver wrote:
> > > >
> > > > On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > > wrote:
> >
> > Also even if the user decided to not use THP, by
> > echo "never" > transparent_hugepage/enabled , we should continue to map
> > dax faults using huge pages on platforms that can support huge pages.
>
> Is this a good idea?
>
> This knob is there for a reason. In some situations having huge pages
> can severely impact performance of the system (due to host-guest
> interaction or whatever) and the ability to really turn off all THP
> would be important in those cases, right?

My understanding was that this is not true for dax pages? These are not
regular memory that got allocated. They are allocated out of /dev/dax/
or /dev/pmem*. Do we have a reason not to use hugepages for mapping
pages in that case?

-aneesh
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Wed, 06 Mar 2019 14:47:33 +0530
"Aneesh Kumar K.V" wrote:

> Dan Williams writes:
>
> > On Thu, Feb 28, 2019 at 1:40 AM Oliver wrote:
> > >
> > > On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> > > wrote:
>
> Also even if the user decided to not use THP, by
> echo "never" > transparent_hugepage/enabled , we should continue to map
> dax faults using huge pages on platforms that can support huge pages.

Is this a good idea?

This knob is there for a reason. In some situations having huge pages
can severely impact performance of the system (due to host-guest
interaction or whatever) and the ability to really turn off all THP
would be important in those cases, right?

Thanks

Michal
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
Dan Williams writes:

> On Thu, Feb 28, 2019 at 1:40 AM Oliver wrote:
>>
>> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
>> wrote:
>> >
>> > Add a flag to indicate the ability to do huge page dax mapping. On
>> > architectures like ppc64, the hypervisor can disable huge page support
>> > in the guest. In such a case, we should not enable huge page dax
>> > mapping. This patch adds a flag which the architecture code will
>> > update to indicate huge page dax mapping support.
>>
>> *groan*
>>
>> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
>> > do hugepages. That also takes care of disabling dax hugepage mapping
>> > with this change.
>> >
>> > Without this patch we get the below error with kvm on ppc64.
>> >
>> > [  118.849975] lpar: Failed hash pte insert with error -4
>> >
>> > NOTE: The patch also uses
>> >
>> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
>> >
>> > to disable dax huge page mapping.
>> >
>> > Signed-off-by: Aneesh Kumar K.V
>> > ---
>> > TODO:
>> > * Add Fixes: tag
>> >
>> >  include/linux/huge_mm.h | 4 +++-
>> >  mm/huge_memory.c        | 4 ++++
>> >  2 files changed, 7 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> > index 381e872bfde0..01ad5258545e 100644
>> > --- a/include/linux/huge_mm.h
>> > +++ b/include/linux/huge_mm.h
>> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>> >                         pud_t *pud, pfn_t pfn, bool write);
>> >  enum transparent_hugepage_flag {
>> >         TRANSPARENT_HUGEPAGE_FLAG,
>> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
>> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>> >                 return true;
>> >
>> > -       if (vma_is_dax(vma))
>> > +       if (vma_is_dax(vma) &&
>> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>> >                 return true;
>>
>> Forcing PTE sized faults should be fine for fsdax, but it'll break
>> devdax. The devdax driver requires the fault size be >= the namespace
>> alignment since devdax tries to guarantee hugepage mappings will be
>> used and PMD alignment is the default. We can probably have devdax
>> fall back to the largest size the hypervisor has made available, but
>> it does run contrary to the design. Ah well, I suppose it's better off
>> being degraded rather than unusable.
>
> Given this is an explicit setting I think device-dax should explicitly
> fail to enable in the presence of this flag to preserve the
> application visible behavior.
>
> I.e. if device-dax was enabled after this setting was made then I
> think future faults should fail as well.

Not sure I understood that. Now we are disabling the ability to map
pages as huge pages. I am now considering that this should not be user
configurable. That is, this is something the platform can use to avoid
dax forcing huge page mapping, but if the architecture can enable huge
dax mapping, we should always default to using that.

Now w.r.t. failures, can device-dax do opportunistic huge page usage?
I haven't looked at the device-dax details fully yet. Do we make an
assumption about the mapping page size as a format w.r.t. device-dax?
Is that derived from the nd_pfn->align value?

Here is what I am working on:

1) If the platform doesn't support huge pages and the device superblock
indicates that it was created with huge page support, we fail the
device init.

2) If we are creating a new namespace without huge page support in the
platform, then we force the align details to PAGE_SIZE. In such a
configuration, when handling a dax fault even with THP enabled during
the build, we should not try to use hugepages. This I think we can
achieve by using TRANSPARENT_HUGEPAGE_DAX_FLAG.
Also even if the user decided to not use THP, by
echo "never" > transparent_hugepage/enabled , we should continue to map
dax faults using huge pages on platforms that can support huge pages.

This still doesn't cover the details of a device-dax created with
PAGE_SIZE align and later booted with a kernel that can do hugepage
dax. How should we handle that? That makes me think this should be a
VMA flag derived from the device config? Maybe use VM_HUGEPAGE to
indicate whether the device should use a hugepage mapping or not?

-aneesh
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Thu, Feb 28, 2019 at 1:40 AM Oliver wrote:
>
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> wrote:
> >
> > Add a flag to indicate the ability to do huge page dax mapping. On
> > architectures like ppc64, the hypervisor can disable huge page support
> > in the guest. In such a case, we should not enable huge page dax
> > mapping. This patch adds a flag which the architecture code will
> > update to indicate huge page dax mapping support.
>
> *groan*
>
> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
> > do hugepages. That also takes care of disabling dax hugepage mapping
> > with this change.
> >
> > Without this patch we get the below error with kvm on ppc64.
> >
> > [  118.849975] lpar: Failed hash pte insert with error -4
> >
> > NOTE: The patch also uses
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
> > to disable dax huge page mapping.
> >
> > Signed-off-by: Aneesh Kumar K.V
> > ---
> > TODO:
> > * Add Fixes: tag
> >
> >  include/linux/huge_mm.h | 4 +++-
> >  mm/huge_memory.c        | 4 ++++
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 381e872bfde0..01ad5258545e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> >                         pud_t *pud, pfn_t pfn, bool write);
> >  enum transparent_hugepage_flag {
> >         TRANSPARENT_HUGEPAGE_FLAG,
> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
> >                 return true;
> >
> > -       if (vma_is_dax(vma))
> > +       if (vma_is_dax(vma) &&
> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
> >                 return true;
>
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.

Given this is an explicit setting I think device-dax should explicitly
fail to enable in the presence of this flag to preserve the
application visible behavior.

I.e. if device-dax was enabled after this setting was made then I
think future faults should fail as well.
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On 2/28/19 3:10 PM, Oliver wrote:
> On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V
> wrote:
> >
> > Add a flag to indicate the ability to do huge page dax mapping. On
> > architectures like ppc64, the hypervisor can disable huge page support
> > in the guest. In such a case, we should not enable huge page dax
> > mapping. This patch adds a flag which the architecture code will
> > update to indicate huge page dax mapping support.
>
> *groan*
>
> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
> > do hugepages. That also takes care of disabling dax hugepage mapping
> > with this change.
> >
> > Without this patch we get the below error with kvm on ppc64.
> >
> > [  118.849975] lpar: Failed hash pte insert with error -4
> >
> > NOTE: The patch also uses
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
> > to disable dax huge page mapping.
> >
> > Signed-off-by: Aneesh Kumar K.V
> > ---
> > TODO:
> > * Add Fixes: tag
> >
> >  include/linux/huge_mm.h | 4 +++-
> >  mm/huge_memory.c        | 4 ++++
> >  2 files changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > index 381e872bfde0..01ad5258545e 100644
> > --- a/include/linux/huge_mm.h
> > +++ b/include/linux/huge_mm.h
> > @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
> >                         pud_t *pud, pfn_t pfn, bool write);
> >  enum transparent_hugepage_flag {
> >         TRANSPARENT_HUGEPAGE_FLAG,
> > +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
> >         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> >         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> > @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
> >         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
> >                 return true;
> >
> > -       if (vma_is_dax(vma))
> > +       if (vma_is_dax(vma) &&
> > +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
> >                 return true;
>
> Forcing PTE sized faults should be fine for fsdax, but it'll break
> devdax. The devdax driver requires the fault size be >= the namespace
> alignment since devdax tries to guarantee hugepage mappings will be
> used and PMD alignment is the default. We can probably have devdax
> fall back to the largest size the hypervisor has made available, but
> it does run contrary to the design. Ah well, I suppose it's better off
> being degraded rather than unusable.

Will fix that. I will make PFN_DEFAULT_ALIGNMENT arch specific.

> >         if (transparent_hugepage_flags &
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index faf357eaf0ce..43d742fe0341 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
> >  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
> >         (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
> >  #endif
> > +       (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
> >         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
> >         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
> >         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> > @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
> >                           &transparent_hugepage_flags);
> >                 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
> >                           &transparent_hugepage_flags);
> > +               clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> > +                         &transparent_hugepage_flags);
> >                 ret = 1;
> >         }
> > out:
> > @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
> >         spinlock_t *ptl;
> >
> >         ptl = pmd_lock(mm, pmd);
> > +       /* should we check for none here again? */
>
> VM_WARN_ON() maybe? If THP is disabled and we're here then something
> has gone wrong.

I was wondering whether we can end up calling insert_pfn_pmd in
parallel and hence end up having a pmd entry here already. Usually we
check for !pmd_none(pmd) after holding the pmd_lock. I was not sure
whether there is anything preventing a parallel fault in the case of
dax.

> >         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
> >         if (pfn_t_devmap(pfn))
> >                 entry = pmd_mkdevmap(entry);
> > --
> > 2.20.1
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On 2/28/19 3:10 PM, Jan Kara wrote:
> On Thu 28-02-19 14:05:22, Aneesh Kumar K.V wrote:
> > Add a flag to indicate the ability to do huge page dax mapping. On
> > architectures like ppc64, the hypervisor can disable huge page support
> > in the guest. In such a case, we should not enable huge page dax
> > mapping. This patch adds a flag which the architecture code will
> > update to indicate huge page dax mapping support.
> >
> > Architectures mostly do transparent_hugepage_flag = 0; if they can't
> > do hugepages. That also takes care of disabling dax hugepage mapping
> > with this change.
> >
> > Without this patch we get the below error with kvm on ppc64.
> >
> > [  118.849975] lpar: Failed hash pte insert with error -4
> >
> > NOTE: The patch also uses
> >
> > echo never > /sys/kernel/mm/transparent_hugepage/enabled
> >
> > to disable dax huge page mapping.
> >
> > Signed-off-by: Aneesh Kumar K.V
>
> Added Dan to CC for opinion.
>
> I kind of fail to see why you don't use TRANSPARENT_HUGEPAGE_FLAG for
> this. I know that technically DAX huge pages and normal THPs are
> different things but so far we've tried to avoid making that
> distinction visible to userspace.

I would also like to use the same flag. I was not sure whether it was
ok. In fact that is one of the reasons I hooked this to
/sys/kernel/mm/transparent_hugepage/enabled. If we are ok with using
the same flag, we can kill the vma_is_dax() check completely.

-aneesh
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Thu, Feb 28, 2019 at 7:35 PM Aneesh Kumar K.V wrote:
>
> Add a flag to indicate the ability to do huge page dax mapping. On
> architectures like ppc64, the hypervisor can disable huge page support
> in the guest. In such a case, we should not enable huge page dax
> mapping. This patch adds a flag which the architecture code will
> update to indicate huge page dax mapping support.

*groan*

> Architectures mostly do transparent_hugepage_flag = 0; if they can't
> do hugepages. That also takes care of disabling dax hugepage mapping
> with this change.
>
> Without this patch we get the below error with kvm on ppc64.
>
> [  118.849975] lpar: Failed hash pte insert with error -4
>
> NOTE: The patch also uses
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> to disable dax huge page mapping.
>
> Signed-off-by: Aneesh Kumar K.V
> ---
> TODO:
> * Add Fixes: tag
>
>  include/linux/huge_mm.h | 4 +++-
>  mm/huge_memory.c        | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..01ad5258545e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>                         pud_t *pud, pfn_t pfn, bool write);
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>                 return true;
>
> -       if (vma_is_dax(vma))
> +       if (vma_is_dax(vma) &&
> +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>                 return true;

Forcing PTE sized faults should be fine for fsdax, but it'll break
devdax. The devdax driver requires the fault size be >= the namespace
alignment since devdax tries to guarantee hugepage mappings will be
used and PMD alignment is the default. We can probably have devdax
fall back to the largest size the hypervisor has made available, but
it does run contrary to the design. Ah well, I suppose it's better off
being degraded rather than unusable.

>         if (transparent_hugepage_flags &
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faf357eaf0ce..43d742fe0341 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>         (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>  #endif
> +       (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>                           &transparent_hugepage_flags);
>                 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>                           &transparent_hugepage_flags);
> +               clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> +                         &transparent_hugepage_flags);
>                 ret = 1;
>         }
> out:
> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>         spinlock_t *ptl;
>
>         ptl = pmd_lock(mm, pmd);
> +       /* should we check for none here again? */

VM_WARN_ON() maybe? If THP is disabled and we're here then something
has gone wrong.

>         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>         if (pfn_t_devmap(pfn))
>                 entry = pmd_mkdevmap(entry);
> --
> 2.20.1
Re: [PATCH 2/2] mm/dax: Don't enable huge dax mapping by default
On Thu 28-02-19 14:05:22, Aneesh Kumar K.V wrote:
> Add a flag to indicate the ability to do huge page dax mapping. On
> architectures like ppc64, the hypervisor can disable huge page support
> in the guest. In such a case, we should not enable huge page dax
> mapping. This patch adds a flag which the architecture code will
> update to indicate huge page dax mapping support.
>
> Architectures mostly do transparent_hugepage_flag = 0; if they can't
> do hugepages. That also takes care of disabling dax hugepage mapping
> with this change.
>
> Without this patch we get the below error with kvm on ppc64.
>
> [  118.849975] lpar: Failed hash pte insert with error -4
>
> NOTE: The patch also uses
>
> echo never > /sys/kernel/mm/transparent_hugepage/enabled
>
> to disable dax huge page mapping.
>
> Signed-off-by: Aneesh Kumar K.V

Added Dan to CC for opinion.

I kind of fail to see why you don't use TRANSPARENT_HUGEPAGE_FLAG for
this. I know that technically DAX huge pages and normal THPs are
different things but so far we've tried to avoid making that
distinction visible to userspace.

								Honza

> ---
> TODO:
> * Add Fixes: tag
>
>  include/linux/huge_mm.h | 4 +++-
>  mm/huge_memory.c        | 4 ++++
>  2 files changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index 381e872bfde0..01ad5258545e 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -53,6 +53,7 @@ vm_fault_t vmf_insert_pfn_pud(struct vm_area_struct *vma, unsigned long addr,
>                         pud_t *pud, pfn_t pfn, bool write);
>  enum transparent_hugepage_flag {
>         TRANSPARENT_HUGEPAGE_FLAG,
> +       TRANSPARENT_HUGEPAGE_DAX_FLAG,
>         TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
>         TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> @@ -111,7 +112,8 @@ static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
>         if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
>                 return true;
>
> -       if (vma_is_dax(vma))
> +       if (vma_is_dax(vma) &&
> +           (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG)))
>                 return true;
>
>         if (transparent_hugepage_flags &
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index faf357eaf0ce..43d742fe0341 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -53,6 +53,7 @@ unsigned long transparent_hugepage_flags __read_mostly =
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE_MADVISE
>         (1<<TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG)|
>  #endif
> +       (1 << TRANSPARENT_HUGEPAGE_DAX_FLAG) |
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
>         (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> @@ -475,6 +476,8 @@ static int __init setup_transparent_hugepage(char *str)
>                           &transparent_hugepage_flags);
>                 clear_bit(TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
>                           &transparent_hugepage_flags);
> +               clear_bit(TRANSPARENT_HUGEPAGE_DAX_FLAG,
> +                         &transparent_hugepage_flags);
>                 ret = 1;
>         }
> out:
> @@ -753,6 +756,7 @@ static void insert_pfn_pmd(struct vm_area_struct *vma, unsigned long addr,
>         spinlock_t *ptl;
>
>         ptl = pmd_lock(mm, pmd);
> +       /* should we check for none here again? */
>         entry = pmd_mkhuge(pfn_t_pmd(pfn, prot));
>         if (pfn_t_devmap(pfn))
>                 entry = pmd_mkdevmap(entry);
> --
> 2.20.1

--
Jan Kara
SUSE Labs, CR