Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry
On Thu, Jun 13, 2024 at 04:43:57PM +, LEROY Christophe wrote:
> I can test whatever you want on my 8xx boards.
>
> I have two types of board:
> - One with MPC866 microcontroller and 32Mbytes memory
> - One with MPC885 microcontroller and 128Mbytes memory

That is great.
I will code up some tests, and once you have the pmd_* stuff on 8xx we
can give it a shot.

Thanks!

-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v5 16/18] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
On Mon, Jun 10, 2024 at 07:55:01AM +0200, Christophe Leroy wrote:
> On book3s/64, the only user of hugepd is hash in 4k mode.
>
> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>
> Rework hash-4k to use contiguous PMD and PUD instead.
>
> In that setup there are only two huge page sizes: 16M and 16G.
>
> 16M sits at PMD level and 16G at PUD level.
>
> pte_update doesn't know page size, lets use the same trick as
> hpte_need_flush() to get page size from segment properties. That's
> not the most efficient way but let's do that until callers of
> pte_update() provide page size instead of just a huge flag.
>
> Signed-off-by: Christophe Leroy
> ---

...

> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> +					     unsigned long addr,
> +					     pte_t *ptep, unsigned long clr,
> +					     unsigned long set,
> +					     int huge)
> +{
> +	unsigned long old;
> +
> +	old = hash__pte_update_one(ptep, clr, set);
> +
> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && huge) {
> +		unsigned int psize = get_slice_psize(mm, addr);
> +		int nb, i;
> +
> +		if (psize == MMU_PAGE_16M)
> +			nb = SZ_16M / PMD_SIZE;
> +		else if (psize == MMU_PAGE_16G)
> +			nb = SZ_16G / PUD_SIZE;
> +		else
> +			nb = 1;

Although that might be a safe default, it might carry consequences down
the road? It might not, but if we reach that, something went wrong, so I
would put a WARN_ON_ONCE at least.

> --- a/arch/powerpc/mm/book3s64/hugetlbpage.c
> +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
> @@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  	/* If PTE permissions don't match, take page fault */
>  	if (unlikely(!check_pte_access(access, old_pte)))
>  		return 1;
> +	/*
> +	 * If hash-4k, hugepages use seeral contiguous PxD entries
> +	 * so bail out and let mm make the page young or dirty
> +	 */
> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> +		if (!(old_pte & _PAGE_ACCESSED))
> +			return 1;
> +		if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY))
> +			return 1;
> +	}

You mentioned that we need to bail out, otherwise only the first PxD
would be updated.
In the comment you say that mm will take care of making the page young
or dirty.
Does this mean that the PxDs underneath will not have their bits updated?

-- 
Oscar Salvador
SUSE Labs
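A minimal sketch of the guard suggested above, assuming the fallback
really should be unreachable (the surrounding lines are the ones quoted
from the patch):

	if (psize == MMU_PAGE_16M)
		nb = SZ_16M / PMD_SIZE;
	else if (psize == MMU_PAGE_16G)
		nb = SZ_16G / PUD_SIZE;
	else {
		/* No other huge page size is expected on hash-4k */
		WARN_ON_ONCE(1);
		nb = 1;
	}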
Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry
On Tue, Jun 11, 2024 at 07:00:14PM +, LEROY Christophe wrote:
> We have space available in PMD if we need more flags, but in PTE I can't
> see anything possible without additional churn that would require
> additional instructions in TLB miss handlers, which is what I want to
> avoid most.
>
> Bits mapped to HW PTE:
>
> #define _PAGE_PRESENT	0x0001	/* V: Page is valid */
> #define _PAGE_NO_CACHE	0x0002	/* CI: cache inhibit */
> #define _PAGE_SH	0x0004	/* SH: No ASID (context) compare */
> #define _PAGE_SPS	0x0008	/* SPS: Small Page Size (1 if 16k, 512k or 8M) */
> #define _PAGE_DIRTY	0x0100	/* C: page changed */
> #define _PAGE_NA	0x0200	/* Supervisor NA, User no access */
> #define _PAGE_RO	0x0600	/* Supervisor RO, User no access */
>
> SW bits masked out in TLB miss handler:
>
> #define _PAGE_GUARDED	0x0010	/* Copied to L1 G entry in DTLB */
> #define _PAGE_ACCESSED	0x0020	/* Copied to L1 APG 1 entry in I/DTLB */
> #define _PAGE_EXEC	0x0040	/* Copied to PP (bit 21) in ITLB */
> #define _PAGE_SPECIAL	0x0080	/* SW entry */
> #define _PAGE_HUGE	0x0800	/* Copied to L1 PS bit 29 */
>
> All bits are used. The only thing would could do but that would have a
> performance cost is to retrieve _PAGE_SH from the PMD and use that bit
> for something else.

I guess that this would be the last resort if we run out of options.
But at least it is good to know that there is a plan B (or Z if you
will :-))

> But I was maybe thinking another way. Lets take the exemple of
> pmd_write() helper:
>
> #define pmd_write(pmd)		pte_write(pmd_pte(pmd))
>
> At the time being we have
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return __pte(pmd_val(pmd));
> }
>
> But what about something like
>
> static inline pte_t pmd_pte(pmd_t pmd)
> {
> 	return *(pte_t *)pmd_page_vaddr(pmd);
> }

I think this could work, yes.
So, we should define all pmd_*(pmd) operations for 8xx the way they are
defined in include/asm/book3s/64/pgtable.h.
Other page sizes would not interfere because they can already perform
operations at pte level.

Ok, I think we might have a shot here.

I would help testing, but I do not have 8xx hardware, and Qemu does not
support 8xx emulation, but I think that if we are careful enough, this
can work.

Actually, as a smoketest it would be enough to have a task with an 8MB
hugepage mapped, and then do:

 static const struct mm_walk_ops test_walk_ops = {
	.pmd_entry = test_8mbp_hugepage,
	.pte_entry = test_16k_and_512k_hugepage,
	.hugetlb_entry = check_hugetlb_entry,
	.walk_lock = PGWALK_RDLOCK,
 };

 static int test(void)
 {
	pr_info("%s: %s [0 - %lx]\n", __func__, current->comm, TASK_SIZE);

	mmap_read_lock(current->mm);
	ret = walk_page_range(current->mm, 0, TASK_SIZE, &test_walk_ops, NULL);
	mmap_read_unlock(current->mm);

	pr_info("%s: %s ret: %d\n", __func__, current->comm, ret);

	return 0;
 }

This is an extract of a debugging mechanism I have to check that I am
not going off rails when unifying hugetlb and normal walkers.

test_8mbp_hugepage() could do some checks with pmd_ operations, print
the results, and then compare them with those that check_hugetlb_entry()
would give us.
If everything is alright, both results should be the same.

I can write the tests up, so we run some sort of smoketests.

So yes, I do think that this is a good initiative.

Thanks a lot Christophe

-- 
Oscar Salvador
SUSE Labs
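A rough sketch of what test_8mbp_hugepage() could look like; the name
and the exact set of pmd_* checks are just assumptions for illustration
(pmd_dirty()/pmd_young() may not be available on every config):

 static int test_8mbp_hugepage(pmd_t *pmd, unsigned long addr,
			       unsigned long next, struct mm_walk *walk)
 {
	pmd_t val = pmdp_get(pmd);

	/* Only interested in the 8M leaf PMDs */
	if (!pmd_leaf(val))
		return 0;

	pr_info("pmd @%lx: pfn=%lx write=%d dirty=%d young=%d\n",
		addr, pmd_pfn(val), pmd_write(val),
		pmd_dirty(val), pmd_young(val));

	return 0;
 }

check_hugetlb_entry() would print the same information obtained through
huge_ptep_get(), and the two outputs should match.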
Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry
On Tue, Jun 11, 2024 at 11:20:01AM -0400, Peter Xu wrote:
> On Tue, Jun 11, 2024 at 05:08:45PM +0200, Oscar Salvador wrote:
> > The problem is that we do not have spare bits for 8xx to mark these ptes
> > as cont-ptes or mark them pte as 8MB, so I do not see a clear path on how
> > we could remove huge_ptep_get for 8xx.
>
> Right, I remember I thought about this too when I initially looked at one
> previous version of the series, I didn't come up yet with a good solution,
> but I guess we probably need to get rid of hugepd first anyway.  We may
> somehow still need to identify this is a 8M large leaf, and I guess this is
> again the only special case where contpte can go over >1 pmds.

Yes, we definitely need to get rid of hugepd first, which is a huge step,
and one step closer to our goal, but at some point we will have to see
what we can do about 8MB cont-ptes for 8xx and how to mark them, so
ptep_get can work the same way as e.g: arm64 (ptep_get->contpte_ptep_get).

@Christophe: Can you think of a way to flag those ptes? (e.g: a
combination of bits etc)

> I'll leave this to Christophe, but IIUC thp is only PMD_ORDER sized, so
> shouldn't apply to the 8MB pages.

That might be it, yes.

-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry
On Tue, Jun 11, 2024 at 10:17:30AM -0400, Peter Xu wrote:
> Oscar,
>
> On Tue, Jun 11, 2024 at 11:34:23AM +0200, Oscar Salvador wrote:
> > Which means that they would be caught in the following code:
> >
> >  ptl = pmd_huge_lock(pmd, vma);
> >  if (ptl) {
> >       - 8MB hugepages will be handled here
> >       smaps_pmd_entry(pmd, addr, walk);
> >       spin_unlock(ptl);
> >  }
> >  /* pte stuff */
> >  ...
>
> Just one quick comment: I think there's one challenge though as this is
> also not a generic "pmd leaf", but a pgtable page underneath.  I think it
> means smaps_pmd_entry() won't trivially work here, e.g., it will start to
> do this:
>
> 	if (pmd_present(*pmd)) {
> 		page = vm_normal_page_pmd(vma, addr, *pmd);
>
> Here vm_normal_page_pmd() will only work if pmd_leaf() satisfies its
> definition as:
>
>  * - It should contain a huge PFN, which points to a huge page larger than
>  *   PAGE_SIZE of the platform.  The PFN format isn't important here.
>
> But now it's a pgtable page, containing cont-ptes.  Similarly, I think most
> pmd_*() helpers will stop working there if we report it as a leaf.

Heh, I think I managed to confuse myself.
I do not know why, but I thought that

 static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
	if (ptep_is_8m_pmdp(mm, addr, ptep))
		ptep = pte_offset_kernel((pmd_t *)ptep, 0);
	return ptep_get(ptep);
 }

would return the address of the pmd for 8MB hugepages, but it will
return the address of the first pte?
Then yeah, this will not work as I thought.

The problem is that we do not have spare bits for 8xx to mark these ptes
as cont-ptes or mark them as 8MB ptes, so I do not see a clear path on
how we could remove huge_ptep_get for 8xx.

I am really curious though how we handle that for THP? Or does THP on
8xx not support that size?

-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v5 02/18] mm: Define __pte_leaf_size() to also take a PMD entry
On Mon, Jun 10, 2024 at 07:54:47AM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So allow architectures to provide __pte_leaf_size() instead of
> pte_leaf_size() and provide the PMD entry to that function.
>
> When __pte_leaf_size() is not defined, define it as a pte_leaf_size()
> so that architectures not interested in the PMD arguments are not
> impacted.
>
> Only define a default pte_leaf_size() when __pte_leaf_size() is not
> defined to make sure nobody adds new calls to pte_leaf_size() in the
> core.

Hi Christophe,

Now I am going to give you a hard time, so sorry in advance.
I should have raised this before, but I was not fully aware of it.

There is an ongoing effort of unifying pagewalkers [1], so hugetlb does
not have to be special-cased anymore, and the operations we do for THP
on a page-table basis work for hugetlb as well.

The most special bit about this is huge_ptep_get.
huge_ptep_get() gets specially handled on arm/arm64/riscv and s390.
arm64 and riscv are about cont-pmd/pte and propagating the dirty/young
bits, so that is fine as walkers can already understand that.
s390 is a funny one because it converts pud/pmd to pte and vice versa,
because hugetlb *works* with ptes, so before returning the pte it has to
transfer all bits from PUD/PMD level into something that PTE level can
understand. As you can imagine, this can go away as we already have all
the information in PUD/PMD and that is all pagewalkers need.

But we are left with the one you will introduce in patch#8.
8MB pages get mapped as cont-pte, but all the information is stored in
the PMD entries (size, dirtiness, present etc).
huge_ptep_get() will return the PMD for 8MB, and so all operations
hugetlb code performs with what huge_ptep_get returns will be performed
on those PMDs.

Which brings me to this point:

I do not think __pte_leaf_size is needed. AFAICS, it should be enough to
define pmd_leaf on 8xx, returning true if it is an 8MB hugepage:

 #define pmd_leaf pmd_leaf
 static inline bool pmd_leaf(pmd_t pmd)
 {
	return pmd_val(pmd) & _PMD_PAGE_8M;
 }

and then pmd_leaf_size to return the 8M size.

This will help because with the ongoing effort of unifying hugetlb and
getting rid of huge_ptep_get() [1], pagewalkers will stumble upon the
8MB-PMD as they do for regular PMDs.

Which means that they would be caught in the following code:

 ptl = pmd_huge_lock(pmd, vma);
 if (ptl) {
      - 8MB hugepages will be handled here
      smaps_pmd_entry(pmd, addr, walk);
      spin_unlock(ptl);
 }
 /* pte stuff */
 ...

where pmd_huge_lock is:

 static inline spinlock_t *pmd_huge_lock(pmd_t *pmd, struct vm_area_struct *vma)
 {
	spinlock_t *ptl = pmd_lock(vma->vm_mm, pmd);

	if (pmd_leaf(*pmd) || pmd_devmap(*pmd))
		return ptl;
	spin_unlock(ptl);
	return NULL;
 }

So, since pmd_leaf() will return true for 8MB hugepages, we are fine,
because we want to perform pagetable operations on *that* PMD anyway,
and not on the ptes that are cont-mapped, which is different for e.g:
512K hugepages, where we perform them at pte level.

So I would suggest that instead of this patch, we have one implementing
pmd_leaf and pmd_leaf_size for 8MB hugepages on powerpc 8xx, as that
takes us closer to our goal of unifying hugetlb.

[1] https://github.com/leberus/linux/tree/hugetlb-pagewalk-v2

-- 
Oscar Salvador
SUSE Labs
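A minimal sketch of the pmd_leaf_size() counterpart I have in mind
(assuming _PMD_PAGE_8M is the flag the series uses to mark 8M mappings,
so the only leaf PMD on 8xx is the 8M one):

 #define pmd_leaf_size pmd_leaf_size
 static inline unsigned long pmd_leaf_size(pmd_t pmd)
 {
	/* Only ever called on a leaf PMD, which on 8xx means 8M */
	return SZ_8M;
 }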
Re: [RFC PATCH v4 16/16] mm: Remove CONFIG_ARCH_HAS_HUGEPD
On Mon, May 27, 2024 at 03:30:14PM +0200, Christophe Leroy wrote:
> powerpc was the only user of CONFIG_ARCH_HAS_HUGEPD and doesn't
> use it anymore, so remove all related code.
>
> Signed-off-by: Christophe Leroy
> ---
> v4: Rebased on v6.10-rc1
> ---
>  arch/powerpc/mm/hugetlbpage.c |   1 -
>  include/linux/hugetlb.h       |   6 --
>  mm/Kconfig                    |  10 --
>  mm/gup.c                      | 183 +-
>  mm/pagewalk.c                 |  57 +--
>  5 files changed, 9 insertions(+), 248 deletions(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 76846c6014e4..6b043180220a 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -78,7 +78,6 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>
>  	return pte_alloc_huge(mm, pmd, addr);
>  }
> -#endif

Did not notice this before.
This belongs to the previous patch.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 00/16] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
On Mon, May 27, 2024 at 03:29:58PM +0200, Christophe Leroy wrote:
> This is the continuation of the RFC v1 series "Reimplement huge pages
> without hugepd on powerpc 8xx". It now get rid of hugepd completely
> after handling also e500 and book3s/64
>
> Also see https://github.com/linuxppc/issues/issues/483
>
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
>
> Possible sizes on 8xx are 4k, 16k, 512k and 8M.
>
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
>
> For e500 and book3s/64 there are less constraints because it is not
> tied to the HW assisted tablewalk like on 8xx, so it is easier to use
> leaf PMDs (and PUDs).
>
> On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at
> PMD level on e500/32 (mpc85xx) and mix of PMD and PUD for e500/64. We
> encode page size with 4 available bits in PTE entries. On e300/32 PGD
> entries size is increases to 64 bits in order to allow leaf-PMD entries
> because PTE are 64 bits on e500.
>
> On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
> as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-6k
> and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
> entries. The hash processing make things a bit more complex. To ease
> things, __hash_page_huge() is modified to bail out when DIRTY or ACCESSED
> bits are missing, leaving it to mm core to fix it.

Ok, I managed to go through the series and provide some feedback.
Sorry you had to bear some dumb questions, but I am used to the x86
realm where things are fairly easy wrt. hugepage sizes.

I will go over v5 when you send it, but I think this would benefit from
another pair of eyes (with more powerpc knowledge than me) having a look.

Anyway, I think this is a great step in the right direction, and
definitely a big help for the upcoming tasks.
I plan to start working on the walk_page API to get rid of hugetlb
specific hooks basing it on this patchset.

Thanks a lot for this work Christophe

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 15/16] powerpc/mm: Remove hugepd leftovers
On Mon, May 27, 2024 at 03:30:13PM +0200, Christophe Leroy wrote:
> All targets have now opted out of CONFIG_ARCH_HAS_HUGEPD so
> remove left over code.
>
> Signed-off-by: Christophe Leroy

Acked-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 14/16] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
On Wed, May 29, 2024 at 10:07:55AM +, Christophe Leroy wrote:
> We can't but I didn't want to leave nb undefined or with a value that
> might lead to writing in the weed. Value 1 seems a safe default.

Might be worth throwing a WARN_ON?

> >> diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> b/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> index 5a2e512e96db..83c3361b358b 100644
> >> --- a/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
> >> @@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
> >>  	/* If PTE permissions don't match, take page fault */
> >>  	if (unlikely(!check_pte_access(access, old_pte)))
> >>  		return 1;
> >> +	/*
> >> +	 * If hash-4k, hugepages use seeral contiguous PxD entries
> >
> > 'several'
> >
> >> +	 * so bail out and let mm make the page young or dirty
> >> +	 */
> >> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> >> +		if (!(old_pte & _PAGE_ACCESSED))
> >> +			return 1;
> >> +		if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY))
> >> +			return 1;
> >
> > I have 0 clue about this code. What would happen if we do not bail out?
> >
>
> In that case the pte_xchg() in the while () will only set ACCESS or
> DIRTY bit on the first PxD entry, not on all cont-PxD entries.

I see, thanks for explaining.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 16/16] mm: Remove CONFIG_ARCH_HAS_HUGEPD
On Mon, May 27, 2024 at 03:30:14PM +0200, Christophe Leroy wrote:
> powerpc was the only user of CONFIG_ARCH_HAS_HUGEPD and doesn't
> use it anymore, so remove all related code.
>
> Signed-off-by: Christophe Leroy

Acked-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 12/16] powerpc/e500: Encode hugepage size in PTE bits
On Wed, May 29, 2024 at 10:14:15AM +, Christophe Leroy wrote:
>
>
> Le 29/05/2024 à 12:09, Oscar Salvador a écrit :
> > On Wed, May 29, 2024 at 09:49:48AM +, Christophe Leroy wrote:
> >> Doesn't really matter if it's PUD or PMD at this point. On a 32 bits
> >> kernel it will be all PMD while on a 64 bits kernel it is both PMD and PUD.
> >>
> >> At the time being (as implemented with hugepd), Linux support 4M, 16M,
> >> 64M, 256M and 1G (Shifts 22, 24, 26, 28, 30)
> >>
> >> The hardware supports the following page sizes, and encodes them on 4
> >> bits allthough it is not directly a shift. Maybe it would be better to
> >> use that encoding after all:
> >
> > I think so.
> >
> >>
> >> 0001 4 Kbytes (Shift 12)
> >> 0010 16 Kbytes (Shift 14)
> >> 0011 64 Kbytes (Shift 16)
> >> 0100 256 Kbytes (Shift 18)
> >> 0101 1 Mbyte (Shift 20)
> >> 0110 4 Mbytes (Shift 22)
> >> 0111 16 Mbytes (Shift 24)
> >> 1000 64 Mbytes (Shift 26)
> >> 1001 256 Mbytes (Shift 28)
> >> 1010 1 Gbyte (e500v2 only) (Shift 30)
> >> 1011 4 Gbytes (e500v2 only) (Shift 32)
> >
> > You say hugehages start at 2MB (shift 21), but you say that the smallest
> > hugepage Linux support is 4MB (shift 22).?
> >
> >
>
> No I say PMD_SIZE is 2MB on e500 with 64 bits PTE and at the time being
> Linux powerpc implementation for e500 supports sizes 4M, 16M, 64M, 256M
> and 1G.

Got it. I got confused.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 12/16] powerpc/e500: Encode hugepage size in PTE bits
On Wed, May 29, 2024 at 09:49:48AM +, Christophe Leroy wrote:
> Doesn't really matter if it's PUD or PMD at this point. On a 32 bits
> kernel it will be all PMD while on a 64 bits kernel it is both PMD and PUD.
>
> At the time being (as implemented with hugepd), Linux support 4M, 16M,
> 64M, 256M and 1G (Shifts 22, 24, 26, 28, 30)
>
> The hardware supports the following page sizes, and encodes them on 4
> bits allthough it is not directly a shift. Maybe it would be better to
> use that encoding after all:

I think so.

>
> 0001 4 Kbytes (Shift 12)
> 0010 16 Kbytes (Shift 14)
> 0011 64 Kbytes (Shift 16)
> 0100 256 Kbytes (Shift 18)
> 0101 1 Mbyte (Shift 20)
> 0110 4 Mbytes (Shift 22)
> 0111 16 Mbytes (Shift 24)
> 1000 64 Mbytes (Shift 26)
> 1001 256 Mbytes (Shift 28)
> 1010 1 Gbyte (e500v2 only) (Shift 30)
> 1011 4 Gbytes (e500v2 only) (Shift 32)

You say hugepages start at 2MB (shift 21), but you say that the smallest
hugepage Linux supports is 4MB (shift 22)?

-- 
Oscar Salvador
SUSE Labs
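For what it's worth, the encoding in the table above is a simple function
of the shift (every size is 4^n Kbytes), so a conversion helper could be
as small as this (the helper name is mine, just a sketch derived from the
table):

 static inline unsigned int shift_to_hw_enc(unsigned int shift)
 {
	/* 0001 -> 4K (shift 12), 0010 -> 16K (shift 14), ... 1011 -> 4G (shift 32) */
	return (shift - 10) >> 1;
 }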
Re: [RFC PATCH v4 13/16] powerpc/e500: Use contiguous PMD instead of hugepd
On Wed, May 29, 2024 at 09:58:35AM +, Christophe Leroy wrote:
> Yes I now have :
>
> +#define _PAGE_HSIZE_MSK	(_PAGE_U0 | _PAGE_U1 | _PAGE_U2 | _PAGE_U3)
> +#define _PAGE_HSIZE_SHIFT	14
> +#define _PAGE_HSIZE_SHIFT_OFFSET	20
>
> and have added a helper to avoid doing the calculation at several places:
>
> +static inline unsigned long pte_huge_size(pte_t pte)
> +{
> +	pte_basic_t val = pte_val(pte);
> +
> +	return 1UL << (((val & _PAGE_HSIZE_MSK) >> _PAGE_HSIZE_SHIFT) + _PAGE_HSIZE_SHIFT_OFFSET);
> +}

Great, this looks much better.

> That's what I did before but it didn't work. The problem is that
> pte_advance_pfn() takes a long not a long long:
>
> static inline pte_t pte_advance_pfn(pte_t pte, unsigned long nr)
> {
> 	return __pte(pte_val(pte) + (nr << PFN_PTE_SHIFT));
> }
>
> And when I called it with nr = PMD_SIZE / PAGE_SIZE = 2M / 4k = 512, as
> we have PFN_PTE_SHIFT = 24, I got 512 << 24 = 0

Ah, I missed that trickery with the types.

Thanks!

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 14/16] powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
On Mon, May 27, 2024 at 03:30:12PM +0200, Christophe Leroy wrote:
> On book3s/64, the only user of hugepd is hash in 4k mode.
>
> All other setups (hash-64, radix-4, radix-64) use leaf PMD/PUD.
>
> Rework hash-4k to use contiguous PMD and PUD instead.
>
> In that setup there are only two huge page sizes: 16M and 16G.
>
> 16M sits at PMD level and 16G at PUD level.

On 4k mode, PMD_SIZE is 2MB and PUD_SIZE is 256MB, right?

> +static inline unsigned long hash__pte_update(struct mm_struct *mm,
> +					     unsigned long addr,
> +					     pte_t *ptep, unsigned long clr,
> +					     unsigned long set,
> +					     int huge)
> +{
> +	unsigned long old;
> +
> +	old = hash__pte_update_one(ptep, clr, set);
> +
> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES) && huge) {
> +		unsigned int psize = get_slice_psize(mm, addr);
> +		int nb, i;
> +
> +		if (psize == MMU_PAGE_16M)
> +			nb = SZ_16M / PMD_SIZE;
> +		else if (psize == MMU_PAGE_16G)
> +			nb = SZ_16G / PUD_SIZE;
> +		else
> +			nb = 1;

On 4K, hugepages are either 16M or 16G.
How can we end up in a situation where the pte is huge, but it is
neither MMU_PAGE_16M nor MMU_PAGE_16G?

> diff --git a/arch/powerpc/mm/book3s64/hugetlbpage.c b/arch/powerpc/mm/book3s64/hugetlbpage.c
> index 5a2e512e96db..83c3361b358b 100644
> --- a/arch/powerpc/mm/book3s64/hugetlbpage.c
> +++ b/arch/powerpc/mm/book3s64/hugetlbpage.c
> @@ -53,6 +53,16 @@ int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
>  	/* If PTE permissions don't match, take page fault */
>  	if (unlikely(!check_pte_access(access, old_pte)))
>  		return 1;
> +	/*
> +	 * If hash-4k, hugepages use seeral contiguous PxD entries

'several'

> +	 * so bail out and let mm make the page young or dirty
> +	 */
> +	if (IS_ENABLED(CONFIG_PPC_4K_PAGES)) {
> +		if (!(old_pte & _PAGE_ACCESSED))
> +			return 1;
> +		if ((access & _PAGE_WRITE) && !(old_pte & _PAGE_DIRTY))
> +			return 1;

I have 0 clue about this code. What would happen if we do not bail out?

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 13/16] powerpc/e500: Use contiguous PMD instead of hugepd
On Mon, May 27, 2024 at 03:30:11PM +0200, Christophe Leroy wrote:
> e500 supports many page sizes among which the following size are
> implemented in the kernel at the time being: 4M, 16M, 64M, 256M, 1G.
>
> On e500, TLB miss for hugepages is exclusively handled by SW even
> on e6500 which has HW assistance for 4k pages, so there are no
> constraints like on the 8xx.
>
> On e500/32, all are at PGD/PMD level and can be handled as
> cont-PMD.
>
> On e500/64, smaller ones are on PMD while bigger ones are on PUD.
> Again, they can easily be handled as cont-PMD and cont-PUD instead
> of hugepd.
>
> Signed-off-by: Christophe Leroy

...

> diff --git a/arch/powerpc/include/asm/nohash/pgtable.h b/arch/powerpc/include/asm/nohash/pgtable.h
> index 90d6a0943b35..f7421d1a1693 100644
> --- a/arch/powerpc/include/asm/nohash/pgtable.h
> +++ b/arch/powerpc/include/asm/nohash/pgtable.h
> @@ -52,11 +52,36 @@ static inline pte_basic_t pte_update(struct mm_struct *mm, unsigned long addr, p
>  {
>  	pte_basic_t old = pte_val(*p);
>  	pte_basic_t new = (old & ~(pte_basic_t)clr) | set;
> +	unsigned long sz;
> +	unsigned long pdsize;
> +	int i;
>
>  	if (new == old)
>  		return old;
>
> -	*p = __pte(new);
> +#ifdef CONFIG_PPC_E500
> +	if (huge)
> +		sz = 1UL << (((old & _PAGE_HSIZE_MSK) >> _PAGE_HSIZE_SHIFT) + 20);
> +	else

I think this will not compile when CONFIG_PPC_85xx && !CONFIG_PTE_64BIT.

You have declared _PAGE_HSIZE_MSK and _PAGE_HSIZE_SHIFT in
arch/powerpc/include/asm/nohash/hugetlb-e500.h.
But hugetlb-e500.h is only included if CONFIG_PPC_85xx && CONFIG_PTE_64BIT
(see arch/powerpc/include/asm/nohash/32/pgtable.h).

> +#endif
> +		sz = PAGE_SIZE;
> +
> +	if (!huge || sz < PMD_SIZE)
> +		pdsize = PAGE_SIZE;
> +	else if (sz < PUD_SIZE)
> +		pdsize = PMD_SIZE;
> +	else if (sz < P4D_SIZE)
> +		pdsize = PUD_SIZE;
> +	else if (sz < PGDIR_SIZE)
> +		pdsize = P4D_SIZE;
> +	else
> +		pdsize = PGDIR_SIZE;
> +
> +	for (i = 0; i < sz / pdsize; i++, p++) {
> +		*p = __pte(new);
> +		if (new)
> +			new += (unsigned long long)(pdsize / PAGE_SIZE) << PTE_RPN_SHIFT;

I guess 'new' can be 0 if pte_update() is called on behalf of clearing
the pte?

> +static inline unsigned long pmd_leaf_size(pmd_t pmd)
> +{
> +	return 1UL << (((pmd_val(pmd) & _PAGE_HSIZE_MSK) >> _PAGE_HSIZE_SHIFT) + 20);

Can we have the '20' somewhere defined, with a comment on top explaining
what it is, so it is not a magic number?
Otherwise people might come look at this and wonder why 20.

> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -331,6 +331,37 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>  		__set_huge_pte_at(pmdp, ptep, pte_val(pte));
>  	}
>  }
> +#elif defined(CONFIG_PPC_E500)
> +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> +		     pte_t pte, unsigned long sz)
> +{
> +	unsigned long pdsize;
> +	int i;
> +
> +	pte = set_pte_filter(pte, addr);
> +
> +	/*
> +	 * Make sure hardware valid bit is not set. We don't do
> +	 * tlb flush for this update.
> +	 */
> +	VM_WARN_ON(pte_hw_valid(*ptep) && !pte_protnone(*ptep));
> +
> +	if (sz < PMD_SIZE)
> +		pdsize = PAGE_SIZE;
> +	else if (sz < PUD_SIZE)
> +		pdsize = PMD_SIZE;
> +	else if (sz < P4D_SIZE)
> +		pdsize = PUD_SIZE;
> +	else if (sz < PGDIR_SIZE)
> +		pdsize = P4D_SIZE;
> +	else
> +		pdsize = PGDIR_SIZE;
> +
> +	for (i = 0; i < sz / pdsize; i++, ptep++, addr += pdsize) {
> +		__set_pte_at(mm, addr, ptep, pte, 0);
> +		pte = __pte(pte_val(pte) + ((unsigned long long)pdsize / PAGE_SIZE << PFN_PTE_SHIFT));

You can use pte_advance_pfn() here?
Just have

 nr = (unsigned long long)pdsize / PAGE_SIZE << PFN_PTE_SHIFT;
 pte_advance_pfn(pte, nr)

Which 'sz's can we have here? You mentioned that e500 supports:
4M, 16M, 64M, 256M and 1G.
Which of these ones can be huge?

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 08/16] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
On Mon, May 27, 2024 at 03:30:06PM +0200, Christophe Leroy wrote:
> In order to fit better with standard Linux page tables layout, add
> support for 8M pages using contiguous PTE entries in a standard
> page table. Page tables will then be populated with 1024 similar
> entries and two PMD entries will point to that page table.
>
> The PMD entries also get a flag to tell it is addressing an 8M page,
> this is required for the HW tablewalk assistance.
>
> Signed-off-by: Christophe Leroy
> Reviewed-by: Oscar Salvador
> ---

...

> +#define __HAVE_ARCH_HUGE_PTEP_GET
> +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
> +	if (ptep_is_8m_pmdp(mm, addr, ptep))
> +		ptep = pte_offset_kernel((pmd_t *)ptep, 0);

Yes, you are right that this should have had the addr aligned down.

I cannot speak for others, but for me it is clearer to think of it this
way:

 1) check whether ptep points to the first PMD entry for the address
 2) if it does, we know that the PMD describes an 8MB hugepage
 3) return the PMD

That is why I thought that directly calling pmd_page_vaddr() gave a
clearer overview.
Now, feel free to ignore this if you think this is not clear or adds
confusion, I just wanted to give my insight reflecting on what I
considered more logical.

-- 
Oscar Salvador
SUSE Labs
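Roughly what I had in mind, as a sketch only (keeping the
ptep_is_8m_pmdp() check from the patch; pte_offset_kernel(pmd, 0) and
pmd_page_vaddr() should end up at the same place):

 #define __HAVE_ARCH_HUGE_PTEP_GET
 static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
 {
	/* For 8M pages, go straight to the PTE table the PMD points to */
	if (ptep_is_8m_pmdp(mm, addr, ptep))
		ptep = (pte_t *)pmd_page_vaddr(*(pmd_t *)ptep);
	return ptep_get(ptep);
 }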
Re: [RFC PATCH v4 12/16] powerpc/e500: Encode hugepage size in PTE bits
On Mon, May 27, 2024 at 03:30:10PM +0200, Christophe Leroy wrote:
> Use U0-U3 bits to encode hugepage size, more exactly page shift.
>
> As we start using hugepages at shift 21 (2Mbytes), substract 20
> so that it fits into 4 bits. That may change in the future if
> we want to use smaller hugepages.

What other shifts can we have here on e500? PUD_SHIFT?
Could you please spell them out here? Or even better,

>
> Signed-off-by: Christophe Leroy
> ---
>  arch/powerpc/include/asm/nohash/hugetlb-e500.h | 6 ++
>  arch/powerpc/include/asm/nohash/pte-e500.h     | 3 +++
>  2 files changed, 9 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/nohash/hugetlb-e500.h b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
> index 8f04ad20e040..d8e51a3f8557 100644
> --- a/arch/powerpc/include/asm/nohash/hugetlb-e500.h
> +++ b/arch/powerpc/include/asm/nohash/hugetlb-e500.h
> @@ -42,4 +42,10 @@ static inline int check_and_get_huge_psize(int shift)
>  	return shift_to_mmu_psize(shift);
>  }
>
> +static inline pte_t arch_make_huge_pte(pte_t entry, unsigned int shift, vm_flags_t flags)
> +{
> +	return __pte(pte_val(entry) | (_PAGE_U3 * (shift - 20)));
> +}
> +#define arch_make_huge_pte arch_make_huge_pte
> +
>  #endif /* _ASM_POWERPC_NOHASH_HUGETLB_E500_H */
> diff --git a/arch/powerpc/include/asm/nohash/pte-e500.h b/arch/powerpc/include/asm/nohash/pte-e500.h
> index 975facc7e38e..091e4bff1fba 100644
> --- a/arch/powerpc/include/asm/nohash/pte-e500.h
> +++ b/arch/powerpc/include/asm/nohash/pte-e500.h
> @@ -46,6 +46,9 @@
>  #define _PAGE_NO_CACHE	0x40 /* I: cache inhibit */
>  #define _PAGE_WRITETHRU	0x80 /* W: cache write-through */
> +#define _PAGE_HSIZE_MSK	(_PAGE_U0 | _PAGE_U1 | _PAGE_U2 | _PAGE_U3)
> +#define _PAGE_HSIZE_SHIFT	14

Add a comment above explaining which P*_SHIFTs we need to cover with
these 4 bits.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 03/16] mm: Provide mm_struct and address to huge_ptep_get()
On Mon, May 27, 2024 at 03:30:01PM +0200, Christophe Leroy wrote:
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -547,7 +547,7 @@ static int gup_hugepte(struct vm_area_struct *vma, pte_t *ptep, unsigned long sz
>  	if (pte_end < end)
>  		end = pte_end;
>
> -	pte = huge_ptep_get(ptep);
> +	pte = huge_ptep_get(vma->mm, addr, ptep);

I looked again and I stumbled upon this.
It should have been "vma->vm_mm".

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v4 03/16] mm: Provide mm_struct and address to huge_ptep_get()
On Mon, May 27, 2024 at 03:30:01PM +0200, Christophe Leroy wrote:
> On powerpc 8xx huge_ptep_get() will need to know whether the given
> ptep is a PTE entry or a PMD entry. This cannot be known with the
> PMD entry itself because there is no easy way to know it from the
> content of the entry.
>
> So huge_ptep_get() will need to know either the size of the page
> or get the pmd.
>
> In order to be consistent with huge_ptep_get_and_clear(), give
> mm and address to huge_ptep_get().
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 03/16] mm: Provide mm_struct and address to huge_ptep_get()
On Mon, May 27, 2024 at 03:51:41PM +, Christophe Leroy wrote:
> We could be is that worth the churn ? Probably not.
> With patch 1 there was only one callsite.

Yes, you are right here.

> Here we have many callsites, and we also have huge_ptep_get_and_clear()
> which already takes three arguments. So for me it make more sense to
> adapt huge_ptep_get() here.
>
> Today several of the huge-related functions already have parameters that
> are used only by a few architectures and everytime one architecture
> needs a new parameter it is added for all of them, and there are
> exemples in the past of new functions added to get new parameters for
> only a few architectures that ended up with a mess and a need to
> re-factor at the end.
>
> See for instance the story around arch_make_huge_pte() and pte_mkhuge(),
> both do the same but arch_make_huge_pte() was added to take additional
> parameters by commit d9ed9faac283 ("mm: add new arch_make_huge_pte()
> method for tile support") then they were merged by commit 16785bd77431
> ("mm: merge pte_mkhuge() call into arch_make_huge_pte()")
>
> So I'm open to any suggestion but we need to try not make it a bigger
> mess at the end.
>
> By the way, I think most if not all huge related helpers should all take
> the same parameters even if not all of them are used, then it would make
> things easier. And maybe the cleanest would be to give the page size to
> all those functions instead of having them guess it.
>
> So let's have your ideas here on the most straight forward way to handle
> that.

It is probably not worth pursuing this then.
As you said, there are many callers and we would have to create some
kind of hook for only those interested places, which I guess would end
up looking just too ugly in order to save little code in arch code.
So please disregard my comment here, and stick with what we have.

> By the way, after commit 01d89b93e176 ("mm/gup: fix hugepd handling in
> hugetlb rework") we now have the vma in gup_hugepte() so we now pass
> vma->vm_mm

I did not notice, thanks.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 08/16] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
On Sun, May 26, 2024 at 11:22:28AM +0200, Christophe Leroy wrote:
> In order to fit better with standard Linux page tables layout, add
> support for 8M pages using contiguous PTE entries in a standard
> page table. Page tables will then be populated with 1024 similar
> entries and two PMD entries will point to that page table.
>
> The PMD entries also get a flag to tell it is addressing an 8M page,
> this is required for the HW tablewalk assistance.
>
> Signed-off-by: Christophe Leroy

I did not look closely into the KASAN bits, and I trust you with the
assembly part, but other than that looks good to me, so FWIW:

Reviewed-by: Oscar Salvador

Just a nit below:

> +#define __HAVE_ARCH_HUGE_PTEP_GET
> +static inline pte_t huge_ptep_get(struct mm_struct *mm, unsigned long addr, pte_t *ptep)
> +{
> +	if (ptep_is_8m_pmdp(mm, addr, ptep))
> +		ptep = pte_offset_kernel((pmd_t *)ptep, 0);

Would it not be more clear to use pmd_page_vaddr directly there?

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 06/16] powerpc/mm: Allow hugepages without hugepd
On Sun, May 26, 2024 at 11:22:26AM +0200, Christophe Leroy wrote:
> In preparation of implementing huge pages on powerpc 8xx
> without hugepd, enclose hugepd related code inside an
> ifdef CONFIG_ARCH_HAS_HUGEPD
>
> This also allows removing some stubs.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 05/16] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
On Sun, May 26, 2024 at 11:22:25AM +0200, Christophe Leroy wrote:
> Building on 32 bits with pmd_leaf() not returning always false leads
> to the following error:
>
>   CC      arch/powerpc/mm/pgtable.o
> arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
> arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
>   506 | }
>       | ^
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>   394 |         pud_t pud, *pudp;
>       |               ^~~
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>
> This is due to pmd_offset() being a no-op in that case.
>
> So rework it for powerpc/32 so that pXd_offset() are used on real
> pointers and not on on-stack copies.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 03/16] mm: Provide mm_struct and address to huge_ptep_get()
On Sun, May 26, 2024 at 11:22:23AM +0200, Christophe Leroy wrote:
> On powerpc 8xx huge_ptep_get() will need to know whether the given
> ptep is a PTE entry or a PMD entry. This cannot be known with the
> PMD entry itself because there is no easy way to know it from the
> content of the entry.
>
> So huge_ptep_get() will need to know either the size of the page
> or get the pmd.
>
> In order to be consistent with huge_ptep_get_and_clear(), give
> mm and address to huge_ptep_get().
>
> Signed-off-by: Christophe Leroy
> ---
> v2: Add missing changes in arch implementations
> v3: Fixed a comment in ARM and missing changes in S390
> ---
>  arch/arm/include/asm/hugetlb-3level.h | 4 +--
>  arch/arm64/include/asm/hugetlb.h      | 2 +-
>  arch/arm64/mm/hugetlbpage.c           | 2 +-
>  arch/riscv/include/asm/hugetlb.h      | 2 +-
>  arch/riscv/mm/hugetlbpage.c           | 2 +-
>  arch/s390/include/asm/hugetlb.h       | 4 +--
>  arch/s390/mm/hugetlbpage.c            | 4 +--

I was wondering whether we could do something similar to what we did in
patch#1, so we do not touch architectures code.

> diff --git a/mm/gup.c b/mm/gup.c
> index 1611e73b1121..86b5105b82a1 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2812,7 +2812,7 @@ static int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
>  	if (pte_end < end)
>  		end = pte_end;
>
> -	pte = huge_ptep_get(ptep);
> +	pte = huge_ptep_get(NULL, addr, ptep);

I know that after this series all this code is gone, but I was not sure
about the behaviour between this patch and the last one.
It made me nervous, until I realized that this code is only used on
CONFIG_ARCH_HAS_HUGEPD, which should not be the case anymore for 8xx
after patch#8, and since 8xx is the only one that will use the mm
parameter from huge_ptep_get, we are all good.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 07/16] powerpc/8xx: Fix size given to set_huge_pte_at()
On Sun, May 26, 2024 at 11:22:27AM +0200, Christophe Leroy wrote:
> set_huge_pte_at() expects the size of the hugepage as an int, not the
> psize which is the index of the page definition in table mmu_psize_defs[]
>
> Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to set_huge_pte_at()")
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

> ---
>  arch/powerpc/mm/nohash/8xx.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c
> index 43d4842bb1c7..d93433e26ded 100644
> --- a/arch/powerpc/mm/nohash/8xx.c
> +++ b/arch/powerpc/mm/nohash/8xx.c
> @@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long va, phys_addr_t pa,
>  		return -EINVAL;
>
>  	set_huge_pte_at(&init_mm, va, ptep,
> -			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize);
> +			pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)),
> +			1UL << mmu_psize_to_shift(psize));
>
>  	return 0;
>  }
> --
> 2.44.0
>

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 05/16] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
On Sun, May 26, 2024 at 11:22:25AM +0200, Christophe Leroy wrote:
> Building on 32 bits with pmd_leaf() not returning always false leads
> to the following error:
>
>   CC      arch/powerpc/mm/pgtable.o
> arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
> arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
>   506 | }
>       | ^
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>   394 |         pud_t pud, *pudp;
>       |               ^~~
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>
> This is due to pmd_offset() being a no-op in that case.
>
> So rework it for powerpc/32 so that pXd_offset() are used on real
> pointers and not on on-stack copies.
>
> Signed-off-by: Christophe Leroy

Maybe this could be folded into the patch that makes pmd_leaf() not
always return false, but no strong feelings:

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 02/16] mm: Define __pte_leaf_size() to also take a PMD entry
On Sun, May 26, 2024 at 11:22:22AM +0200, Christophe Leroy wrote:
> On powerpc 8xx, when a page is 8M size, the information is in the PMD
> entry. So allow architectures to provide __pte_leaf_size() instead of
> pte_leaf_size() and provide the PMD entry to that function.
>
> When __pte_leaf_size() is not defined, define it as a pte_leaf_size()
> so that architectures not interested in the PMD arguments are not
> impacted.
>
> Only define a default pte_leaf_size() when __pte_leaf_size() is not
> defined to make sure nobody adds new calls to pte_leaf_size() in the
> core.
>
> Signed-off-by: Christophe Leroy

Thanks, this looks much cleaner.

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v3 00/16] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
On Sun, May 26, 2024 at 11:22:20AM +0200, Christophe Leroy wrote:
> This is the continuation of the RFC v1 series "Reimplement huge pages
> without hugepd on powerpc 8xx". It now get rid of hugepd completely
> after handling also e500 and book3s/64
>
> Also see https://github.com/linuxppc/issues/issues/483
>
> Unlike most architectures, powerpc 8xx HW requires a two-level
> pagetable topology for all page sizes. So a leaf PMD-contig approach
> is not feasible as such.
>
> Possible sizes on 8xx are 4k, 16k, 512k and 8M.
>
> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
> must point to a single entry level-2 page table. Until now that was
> done using hugepd. This series changes it to use standard page tables
> where the entry is replicated 1024 times on each of the two pagetables
> refered by the two associated PMD entries for that 8M page.
>
> For e500 and book3s/64 there are less constraints because it is not
> tied to the HW assisted tablewalk like on 8xx, so it is easier to use
> leaf PMDs (and PUDs).
>
> On e500 the supported page sizes are 4M, 16M, 64M, 256M and 1G. All at
> PMD level on e500/32 (mpc85xx) and mix of PMD and PUD for e500/64. We
> encode page size with 4 available bits in PTE entries. On e300/32 PGD
> entries size is increases to 64 bits in order to allow leaf-PMD entries
> because PTE are 64 bits on e500.
>
> On book3s/64 only the hash-4k mode is concerned. It supports 16M pages
> as cont-PMD and 16G pages as cont-PUD. In other modes (radix-4k, radix-6k
> and hash-64k) the sizes match with PMD and PUD sizes so that's just leaf
> entries. The hash processing make things a bit more complex. To ease
> things, __hash_page_huge() is modified to bail out when DIRTY or ACCESSED
> bits are missing, leaving it to mm core to fix it.
>
> Global changes in v3:
> - Removed patches 1 and 2
> - Squashed patch 11 into patch 5
> - Replaced patches 12 and 13 with a series from Michael
> - Reordered patches a bit to have more general patches up front
>
> For more details on changes, see in each patch.
>
> Christophe Leroy (15):
>   mm: Define __pte_leaf_size() to also take a PMD entry
>   mm: Provide mm_struct and address to huge_ptep_get()
>   powerpc/mm: Remove _PAGE_PSIZE
>   powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
>   powerpc/mm: Allow hugepages without hugepd
>   powerpc/8xx: Fix size given to set_huge_pte_at()
>   powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
>   powerpc/8xx: Simplify struct mmu_psize_def
>   powerpc/e500: Remove enc and ind fields from struct mmu_psize_def
>   powerpc/e500: Switch to 64 bits PGD on 85xx (32 bits)
>   powerpc/e500: Encode hugepage size in PTE bits
>   powerpc/e500: Use contiguous PMD instead of hugepd
>   powerpc/64s: Use contiguous PMD/PUD instead of HUGEPD
>   powerpc/mm: Remove hugepd leftovers
>   mm: Remove CONFIG_ARCH_HAS_HUGEPD

I glanced over it and it looks much better, not having to fiddle with
other arch code and generic declarations is a big plus.

I plan to do a proper review tomorrow.

Thanks for working on this Christophe!

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
On Sat, May 25, 2024 at 06:44:06AM +, Christophe Leroy wrote:
> No, all have cont-PMD but only 8xx handles pages greater than PMD_SIZE
> as cont-PTE instead of cont-PMD.

Yes, sorry, I managed to confuse myself. It is obvious from the code.

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 15/20] powerpc/85xx: Switch to 64 bits PGD
On Fri, May 17, 2024 at 09:00:09PM +0200, Christophe Leroy wrote:
> In order to allow leaf PMD entries, switch the PGD to 64 bits entries.
>
> Signed-off-by: Christophe Leroy

I do not quite understand this change.
Aren't powerpc e500 and powerpc 85xx two different things?
You are switching the PGD to 64 bits entries when CONFIG_PPC_E500 &&
CONFIG_PTE_64BIT, but you are updating head_85xx. Are they sharing this
code?

Also, we would benefit from a slightly bigger changelog, explaining why
we need this change in some more detail.

> diff --git a/arch/powerpc/include/asm/pgtable-types.h b/arch/powerpc/include/asm/pgtable-types.h
> index 082c85cc09b1..db965d98e0ae 100644
> --- a/arch/powerpc/include/asm/pgtable-types.h
> +++ b/arch/powerpc/include/asm/pgtable-types.h
> @@ -49,7 +49,11 @@ static inline unsigned long pud_val(pud_t x)
>  #endif /* CONFIG_PPC64 */
>
>  /* PGD level */
> +#if defined(CONFIG_PPC_E500) && defined(CONFIG_PTE_64BIT)
> +typedef struct { unsigned long long pgd; } pgd_t;
> +#else
>  typedef struct { unsigned long pgd; } pgd_t;
> +#endif
>  #define __pgd(x)	((pgd_t) { (x) })
>  static inline unsigned long pgd_val(pgd_t x)
>  {
> diff --git a/arch/powerpc/kernel/head_85xx.S b/arch/powerpc/kernel/head_85xx.S
> index 39724ff5ae1f..a305244afc9f 100644
> --- a/arch/powerpc/kernel/head_85xx.S
> +++ b/arch/powerpc/kernel/head_85xx.S
> @@ -307,8 +307,9 @@ set_ivor:
>  #ifdef CONFIG_PTE_64BIT
>  #ifdef CONFIG_HUGETLB_PAGE
>  #define FIND_PTE	\
> -	rlwinm	r12, r10, 13, 19, 29;	/* Compute pgdir/pmd offset */	\
> -	lwzx	r11, r12, r11;		/* Get pgd/pmd entry */		\
> +	rlwinm	r12, r10, 14, 18, 28;	/* Compute pgdir/pmd offset */	\
> +	add	r12, r11, r12;

You add the offset to pgdir?

> +	lwz	r11, 4(r12);		/* Get pgd/pmd entry */		\

What is at offset 4?

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 14/20] powerpc/e500: Remove enc field from struct mmu_psize_def
On Fri, May 17, 2024 at 09:00:08PM +0200, Christophe Leroy wrote:
> enc field is hidden behind BOOK3E_PAGESZ_XX macros, and when you look
> closer you realise that this field is nothing else than the value of
> shift minus ten.
>
> So remove enc field and calculate tsize from shift field.
>
> Also remove inc filed which is unused.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 11/20] powerpc/mm: Complement huge_pte_alloc() for all non HUGEPD setups
On Fri, May 17, 2024 at 09:00:05PM +0200, Christophe Leroy wrote:
> huge_pte_alloc() for non-HUGEPD targets is reserved for 8xx at the
> moment. In order to convert other targets for non-HUGEPD, complement
> huge_pte_alloc() to support any standard cont-PxD setup.
>
> Signed-off-by: Christophe Leroy
> ---
>  arch/powerpc/mm/hugetlbpage.c | 25 -
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/mm/hugetlbpage.c b/arch/powerpc/mm/hugetlbpage.c
> index 42b12e1ec851..f8aefa1e7363 100644
> --- a/arch/powerpc/mm/hugetlbpage.c
> +++ b/arch/powerpc/mm/hugetlbpage.c
> @@ -195,11 +195,34 @@ pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  pte_t *huge_pte_alloc(struct mm_struct *mm, struct vm_area_struct *vma,
>  		      unsigned long addr, unsigned long sz)
>  {
> -	pmd_t *pmd = pmd_off(mm, addr);
> +	pgd_t *pgd;
> +	p4d_t *p4d;
> +	pud_t *pud;
> +	pmd_t *pmd;
> +
> +	addr &= ~(sz - 1);
> +	pgd = pgd_offset(mm, addr);
> +
> +	p4d = p4d_offset(pgd, addr);
> +	if (sz >= PGDIR_SIZE)
> +		return (pte_t *)p4d;
> +
> +	pud = pud_alloc(mm, p4d, addr);
> +	if (!pud)
> +		return NULL;
> +	if (sz >= PUD_SIZE)
> +		return (pte_t *)pud;
> +
> +	pmd = pmd_alloc(mm, pud, addr);
> +	if (!pmd)
> +		return NULL;
>
>  	if (sz < PMD_SIZE)
>  		return pte_alloc_huge(mm, pmd, addr, sz);
>
> +	if (!IS_ENABLED(CONFIG_PPC_8xx))
> +		return (pte_t *)pmd;

So only 8xx has cont-PMD for hugepages?

> +
>  	if (sz != SZ_8M)
>  		return NULL;

Since this function is the core for allocating huge pages, I think it
would benefit from a comment at the top explaining the possible layouts,
e.g: who can have cont-{P4D,PUD,PMD} etc.
A brief explanation of the possible scheme for all powerpc platforms.
That would help people looking into this in the future.

-- 
Oscar Salvador
SUSE Labs
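Something along these lines is what I have in mind; the summary of the
layouts is mine, taken from the cover letter, so take it as a sketch
rather than the final wording:

 /*
  * huge_pte_alloc() has to cope with several hugepage layouts:
  *  - 8xx:       sizes below PMD_SIZE are handled at PTE level (cont-PTE),
  *               8M pages use two PMD entries pointing to the same
  *               replicated PTE table
  *  - e500/32:   all huge sizes sit at PMD level
  *  - e500/64:   smaller sizes at PMD level, bigger ones at PUD level
  *  - book3s/64: hash-4k uses cont-PMD (16M) and cont-PUD (16G), the
  *               other modes use plain leaf PMD/PUD entries
  */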
Re: [RFC PATCH v2 10/20] powerpc/mm: Fix __find_linux_pte() on 32 bits with PMD leaf entries
On Fri, May 17, 2024 at 09:00:04PM +0200, Christophe Leroy wrote:
> Building on 32 bits with pmd_leaf() not returning always false leads
> to the following error:

I am curious though. pmd_leaf is only defined in include/linux/pgtable.h
for 32 bits, and is hardcoded to false. I do not see where we change it
in previous patches, so is this artificial?

>
>   CC      arch/powerpc/mm/pgtable.o
> arch/powerpc/mm/pgtable.c: In function '__find_linux_pte':
> arch/powerpc/mm/pgtable.c:506:1: error: function may return address of local variable [-Werror=return-local-addr]
>   506 | }
>       | ^
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>   394 |         pud_t pud, *pudp;
>       |               ^~~
> arch/powerpc/mm/pgtable.c:394:15: note: declared here
>
> This is due to pmd_offset() being a no-op in that case.

This is because 32-bit powerpc includes pgtable-nopmd.h?

> So rework it for powerpc/32 so that pXd_offset() are used on real
> pointers and not on on-stack copies.
>
> Signed-off-by: Christophe Leroy
> ---
>  arch/powerpc/mm/pgtable.c | 14 --
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index 59f0d7706d2f..51ee508eeb5b 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -390,8 +390,12 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  			bool *is_thp, unsigned *hpage_shift)
>  {
>  	pgd_t *pgdp;
> -	p4d_t p4d, *p4dp;
> -	pud_t pud, *pudp;
> +	p4d_t *p4dp;
> +	pud_t *pudp;
> +#ifdef CONFIG_PPC64
> +	p4d_t p4d;
> +	pud_t pud;
> +#endif
>  	pmd_t pmd, *pmdp;
>  	pte_t *ret_pte;
>  	hugepd_t *hpdp = NULL;
> @@ -412,6 +416,7 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>  	 */
>  	pgdp = pgdir + pgd_index(ea);
>  	p4dp = p4d_offset(pgdp, ea);
> +#ifdef CONFIG_PPC64
>  	p4d = READ_ONCE(*p4dp);
>  	pdshift = P4D_SHIFT;
>
> @@ -452,6 +457,11 @@ pte_t *__find_linux_pte(pgd_t *pgdir, unsigned long ea,
>
>  	pdshift = PMD_SHIFT;
>  	pmdp = pmd_offset(&pud, ea);
> +#else
> +	p4dp = p4d_offset(pgdp, ea);
> +	pudp = pud_offset(p4dp, ea);
> +	pmdp = pmd_offset(pudp, ea);

I would drop a comment on top explaining that these are no-ops for
32 bits, otherwise it might not be obvious to people why this
distinction between 64 and 32 bits exists.

Other than that looks good to me

-- 
Oscar Salvador
SUSE Labs
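Something as simple as this would do; the wording is mine and assumes
the upper levels really are folded on 32 bits:

 #else
	/*
	 * On 32 bits the upper page table levels are folded, so these
	 * pXd_offset() calls are no-ops that only cast the pointer down;
	 * we can keep working on the real page table pointers here.
	 */
	p4dp = p4d_offset(pgdp, ea);
	pudp = pud_offset(p4dp, ea);
	pmdp = pmd_offset(pudp, ea);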
Re: [RFC PATCH v2 08/20] powerpc/8xx: Simplify struct mmu_psize_def
On Fri, May 17, 2024 at 09:00:02PM +0200, Christophe Leroy wrote:
> On 8xx, only the shift field is used in struct mmu_psize_def
>
> Remove other fields and related macros.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 09/20] powerpc/mm: Remove _PAGE_PSIZE
On Fri, May 17, 2024 at 09:00:03PM +0200, Christophe Leroy wrote:
> _PAGE_PSIZE macro is never used outside the place it is defined
> and is used only on 8xx and e500.
>
> Remove indirection, remove it and use its content directly.
>
> Signed-off-by: Christophe Leroy

Reviewed-by: Oscar Salvador

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 07/20] powerpc/8xx: Rework support for 8M pages using contiguous PTE entries
nohash/8xx.c:	err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_8M, new);
arch/powerpc/mm/nohash/8xx.c:	err = __early_map_kernel_hugepage(v, p, prot, MMU_PAGE_512K, new);
arch/powerpc/mm/nohash/8xx.c:	__early_map_kernel_hugepage(VIRT_IMMR_BASE, PHYS_IMMR_BASE, PAGE_KERNEL_NCG, MMU_PAGE_512K, true);

I think we can drop the 'new' and the block of code that tries to
handle it?

> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index acdf64c9b93e..59f0d7706d2f 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> +void set_huge_pte_at(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
> +		     pte_t pte, unsigned long sz)
> +{
> +	pmd_t *pmdp = pmd_off(mm, addr);
> +
> +	pte = set_pte_filter(pte, addr);
> +
> +	if (sz == SZ_8M) {
> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp, 0), pte_val(pte));
> +		__set_huge_pte_at(pmdp, pte_offset_kernel(pmdp + 1, 0), pte_val(pte) + SZ_4M);

You also mentioned that this would slightly change after you drop
patch#0 and patch#1.
The only comment I have right now would be to add a little comment
explaining the layout (the replication of 1024 entries), or just
something like "see comment from number_of_cells_per_pte".

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 00/20] Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)
On Thu, May 23, 2024 at 03:40:20PM -0400, Peter Xu wrote:
> I requested for help on the lsfmm hugetlb unification session, but
> unfortunately I don't think there were Power people around.. I'd like to
> request help from Power developers again here on the list: it will be very
> appreciated if you can help have a look at this series.

I am not a powerpc developer, but I plan to keep reviewing this series
today and next week.

Thanks

-- 
Oscar Salvador
SUSE Labs
Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors
On Wed, May 22, 2024 at 05:46:09PM -0400, Peter Xu wrote:
> > Now, ProcessB still has the page mapped, so upon re-accessing it,
> > it will trigger a new MCE event. memory-failure code will see that this
>
> The question is why accessing that hwpoison entry from ProcB triggers an
> MCE.  Logically that's a swap entry and it should generate a page fault
> rather than MCE.  Then in the pgfault hanlder we don't need that encoded
> pfn as we have vmf->address.

It would be a swap entry if we reach try_to_unmap_one() without trouble.
Then we have the code that converts it:

 ...
 if (PageHWPoison(p))
	pteval = swp_entry_to_pte(make_hwpoison_entry(subpage));
 set_{huge_}pte_at
 ...

But maybe we could only do that for ProcA, while ProcB failed to do that,
which means that for ProcA that is a hwpoisoned swap entry, but ProcB
still has this page mapped as usual, so if ProcB re-accesses it, that
will not trigger a fault (because the page is still mapped in its
pagetables).

-- 
Oscar Salvador
SUSE Labs
Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
On Mon, May 20, 2024 at 04:24:51PM +, Christophe Leroy wrote:
> I had a quick look at that document and it seems to provide a good
> summary of MMU features and principles. However there are some
> theoritical information which is not fully right in practice. For
> instance when they say "Segment attributes. These fields define
> attributes common to all pages in this segment.". This is right in
> theory if you consider it from Linux page table topology point of view,
> hence what they call a segment is a PMD entry for Linux. However, in
> practice each page has its own L1 and L2 attributes and there is not
> requirement at HW level to have all L1 attributes of all pages of a
> segment the same.

Thanks for taking the time Christophe, highly appreciated.

> rlwimi = Rotate Left Word Immediate then Mask Insert. Here it rotates
> r10 by 23 bits to the left (or 9 to the right) then masks with
> _PMD_PAGE_512K and inserts it into r11.
>
> It means _PAGE_HUGE bit is copied into lower bit of PS attribute.
>
> PS takes the following values:
>
> PS = 00 ==> Small page (4k or 16k)
> PS = 01 ==> 512k page
> PS = 10 ==> Undefined
> PS = 11 ==> 8M page

I see, thanks for the explanation.

> That's a RFC, all ideas are welcome, I needed something to replace
> hugepd_populate()

The only user interested in pmd_populate() having a sz parameter is 8xx,
because it will toggle _PMD_PAGE_8M in case of an 8MB mapping.

Would it be possible for 8xx to encode the 'sz' in the *pmd pointer
prior to calling down the chain? (something like what we do for
PTR_ERR()).
Then pmd_populate_{kernel_}size() from 8xx would extract it like:

 unsigned long sz = PTR_SIZE(pmd)

Then we would not need all these 'sz' parameters scattered around.
Can that work?

PS: Do you know a way to emulate a 8xx VM? Qemu does not seem to support
8xx emulation.

-- 
Oscar Salvador
SUSE Labs
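A rough sketch of the pointer-encoding idea, just to illustrate it; the
PTR_WITH_SIZE()/PTR_SIZE() names and the bit layout are made up, and it
assumes the low bits of a page-table-aligned pointer are free to use:

 /*
  * Stash a size hint in the low bits of the PTE-table pointer, in the
  * spirit of PTR_ERR(); a real implementation would have to pick bits
  * that are guaranteed clear by the pagetable alignment.
  */
 #define PMD_SZ_8M			0x1UL
 #define PTR_WITH_SIZE(ptr, sz)		((void *)((unsigned long)(ptr) | (sz)))
 #define PTR_SIZE(ptr)			((unsigned long)(ptr) & 0x3UL)
 #define PTR_STRIP_SIZE(ptr)		((void *)((unsigned long)(ptr) & ~0x3UL))

The caller would pass PTR_WITH_SIZE(ptep, PMD_SZ_8M) down, and
pmd_populate_kernel_size() on 8xx would peel the hint off again with
PTR_SIZE()/PTR_STRIP_SIZE().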
Re: [RFC PATCH v2 03/20] mm: Provide pmd to pte_leaf_size()
On Fri, May 17, 2024 at 08:59:57PM +0200, Christophe Leroy wrote: > On powerpc 8xx, when a page is 8M size, the information is in the PMD > entry. So provide it to pte_leaf_size(). > > Signed-off-by: Christophe Leroy Overall looks good to me. Would be nicer if we could leave the arch code untouched. I wanted to see how this would be if we go down that road and focus only on 8xx at the risk of being more esoteric. pmd_pte_leaf_size() is a hell of a name, but could be replaced with __pte_leaf_size for example. Worth it? Maybe not, anyway, just wanted to give it a go: diff --git a/arch/powerpc/include/asm/nohash/32/pte-8xx.h b/arch/powerpc/include/asm/nohash/32/pte-8xx.h index 137dc3c84e45..9e3fe6e1083f 100644 --- a/arch/powerpc/include/asm/nohash/32/pte-8xx.h +++ b/arch/powerpc/include/asm/nohash/32/pte-8xx.h @@ -151,7 +151,7 @@ static inline unsigned long pgd_leaf_size(pgd_t pgd) #define pgd_leaf_size pgd_leaf_size -static inline unsigned long pte_leaf_size(pte_t pte) +static inline unsigned long pmd_pte_leaf_size(pte_t pte) { pte_basic_t val = pte_val(pte); @@ -162,7 +162,7 @@ static inline unsigned long pte_leaf_size(pte_t pte) return SZ_4K; } -#define pte_leaf_size pte_leaf_size +#define pmd_pte_leaf_size pmd_pte_leaf_size /* * On the 8xx, the page tables are a bit special. For 16k pages, we have diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 18019f037bae..2bc2fe3b2b53 100644 --- a/include/linux/pgtable.h +++ b/include/linux/pgtable.h @@ -1891,6 +1891,9 @@ typedef unsigned int pgtbl_mod_mask; #ifndef pte_leaf_size #define pte_leaf_size(x) PAGE_SIZE #endif +#ifndef pmd_pte_leaf_size +#define pmd_pte_leaf_size(x, y) pte_leaf_size(y) +#endif /* * We always define pmd_pfn for all archs as it's used in lots of generic diff --git a/kernel/events/core.c b/kernel/events/core.c index f0128c5ff278..e90a547d2fb2 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -7596,7 +7596,7 @@ static u64 perf_get_pgtable_size(struct mm_struct *mm, unsigned long addr) pte = ptep_get_lockless(ptep); if (pte_present(pte)) - size = pte_leaf_size(pte); + size = pmd_pte_leaf_size(pmd, pte); pte_unmap(ptep); #endif /* CONFIG_HAVE_GUP_FAST */ -- Oscar Salvador SUSE Labs
Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
On Tue, May 21, 2024 at 10:48:21AM +1000, Michael Ellerman wrote: > Yeah I can. Does it actually cause a bug at runtime (I assume so)? No, currently set_huge_pte_at() from 8xx ignores the 'sz' parameter. But it will be used after this series. -- Oscar Salvador SUSE Labs
Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
On Mon, May 20, 2024 at 04:31:39PM +, Christophe Leroy wrote: > Hi Oscar, hi Michael, > > Le 20/05/2024 à 11:14, Oscar Salvador a écrit : > > On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote: > >> set_huge_pte_at() expects the real page size, not the psize which is > > > > "expects the size of the huge page" sounds bettter? > > Parameter 'pzize' already provides the size of the hugepage, but not in > the way set_huge_pte_at() expects it. > > psize has one of the values defined by MMU_PAGE_XXX macros defined in > arch/powerpc/include/asm/mmu.h while set_huge_pte_at() expects the size > as a value. Yes, psize is an index, which is not a size by itself but used to get mmu_psize_def.shift to see the actual size, I guess. This is why I thought that being explicit about "expects the size of the huge page" was better. But no strong feelings here. -- Oscar Salvador SUSE Labs
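For readers following along, a tiny illustration of the index-vs-size distinction discussed above (sketch only; the helper and macros are the existing powerpc ones, the values follow from mmu_psize_defs[]):

	/*
	 * psize is an index into mmu_psize_defs[], not a size in bytes:
	 *
	 *   1UL << mmu_psize_to_shift(MMU_PAGE_512K) == SZ_512K
	 *   1UL << mmu_psize_to_shift(MMU_PAGE_8M)   == SZ_8M
	 *
	 * which is why psize cannot be passed to set_huge_pte_at() as-is.
	 */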
Re: [RFC PATCH v2 06/20] powerpc/8xx: Fix size given to set_huge_pte_at()
On Fri, May 17, 2024 at 09:00:00PM +0200, Christophe Leroy wrote: > set_huge_pte_at() expects the real page size, not the psize which is "expects the size of the huge page" sounds better? > the index of the page definition in table mmu_psize_defs[] > > Fixes: 935d4f0c6dc8 ("mm: hugetlb: add huge page size param to > set_huge_pte_at()") > Signed-off-by: Christophe Leroy Reviewed-by: Oscar Salvador AFAICS, this fixup is not related to the series, right? (yes, you will use the parameter later) I would have it at the very beginning of the series. > --- > arch/powerpc/mm/nohash/8xx.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/nohash/8xx.c b/arch/powerpc/mm/nohash/8xx.c > index 43d4842bb1c7..d93433e26ded 100644 > --- a/arch/powerpc/mm/nohash/8xx.c > +++ b/arch/powerpc/mm/nohash/8xx.c > @@ -94,7 +94,8 @@ static int __ref __early_map_kernel_hugepage(unsigned long > va, phys_addr_t pa, > return -EINVAL; > > set_huge_pte_at(&init_mm, va, ptep, > - pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), psize); > + pte_mkhuge(pfn_pte(pa >> PAGE_SHIFT, prot)), > + 1UL << mmu_psize_to_shift(psize)); > > return 0; > } > -- > 2.44.0 > -- Oscar Salvador SUSE Labs
Re: [RFC PATCH v2 01/20] mm: Provide pagesize to pmd_populate()
MD_MODIFIED;0;})))?\ > + (__pte_alloc_kernel(pmd, PAGE_SIZE) || > ({*(mask)|=PGTBL_PMD_MODIFIED;0;})))?\ > NULL: pte_offset_kernel(pmd, address)) > > #endif /* _LINUX_PGALLOC_TRACK_H */ > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > index 3c3539c573e7..0f129d5c5aa2 100644 > --- a/mm/userfaultfd.c > +++ b/mm/userfaultfd.c > @@ -764,7 +764,7 @@ static __always_inline ssize_t mfill_atomic(struct > userfaultfd_ctx *ctx, > break; > } > if (unlikely(pmd_none(dst_pmdval)) && > - unlikely(__pte_alloc(dst_mm, dst_pmd))) { > + unlikely(__pte_alloc(dst_mm, dst_pmd, PAGE_SIZE))) { > err = -ENOMEM; > break; > } > @@ -1687,7 +1687,7 @@ ssize_t move_pages(struct userfaultfd_ctx *ctx, > unsigned long dst_start, > err = -ENOENT; > break; > } > - if (unlikely(__pte_alloc(mm, src_pmd))) { > + if (unlikely(__pte_alloc(mm, src_pmd, > PAGE_SIZE))) { > err = -ENOMEM; > break; > } > -- > 2.44.0 > -- Oscar Salvador SUSE Labs
Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx
On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote: > This series reimplements hugepages with hugepd on powerpc 8xx. > > Unlike most architectures, powerpc 8xx HW requires a two-level > pagetable topology for all page sizes. So a leaf PMD-contig approach > is not feasible as such. > > Possible sizes are 4k, 16k, 512k and 8M. > > First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries > must point to a single entry level-2 page table. Until now that was > done using hugepd. This series changes it to use standard page tables > where the entry is replicated 1024 times on each of the two pagetables > refered by the two associated PMD entries for that 8M page. > > At the moment it has to look into each helper to know if the > hugepage ptep is a PTE or a PMD in order to know it is a 8M page or > a lower size. I hope this can me handled by core-mm in the future. > > There are probably several ways to implement stuff, so feedback is > very welcome. Hi Christophe, I have been looking into this because I am interested in the ongoing work of the hugetlb unification, but my knowledge of ppc pagetables tends to zero, so be prepared for some stupid questions. First, let me have a clear picture of the current situation: power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operates on 2-level pagetables. Wiki [1] mentions PGD + PTE, here you seem to be referring to them as PMD + PTE though. And we can have 1024 PGDs, each of which covers 4MB, so we can cover a total of 4GB. Looking at the page table diagram for power8xx, it seems power8xx also has some sort of CONTIG_PTE (same as arm64 does)? So we can have contig_ptes representing bigger page sizes? I also guess that although power8xx supports all these different sizes, only one of them can be active at any time, right? It also seems that this whole hugepd thing is only used when we are using 8MB hugepages, right? And when that is active, we do have 2 PGDs (4MB each) pointing to the same 8MB hugepd. E.g: [PGD#0] ----> [ 8MB hugepd ] <---- [PGD#1] What you want to do with this work is to get rid of that hugepd abstraction because it is something power8xx/hugetlb specific and cannot be represented with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE). I did not check, but I guess we cannot walk the hugepd thing with a normal page table walker, or can we? (how special is a hugepd? can you describe its internal layout?) So, what you propose is something like the following? [PGD#X] ---> [PTE#0 ... PTE#1023] [PGD#Y] ---> [PTE#0 ... PTE#1023] so an 8MB hugepage will be covered by PGD#X and PGD#Y using contiguous PTEs. The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is just me). They say that a Level2 table (aka PTE) covers 4KB chunks regardless of the pagesize, but either I read that wrong or something else is going on. Because on 16K page size, they show that each pte covers a 16KB memory chunk. But that would mean 16KB << 10 = 16MB per PGD entry, which is not really the case, so what is the deal there? Or is it just that we always use 4KB PTEs, and use contiguous PTEs for bigger sizes? Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right? Peter mentioned that we should have something like: [PGD] - [P4D] - [PUD] - [PMD] - [PTE] where the PMD and PTE would be the ones we use for representing the 2-level page table, and PGD, P4D and PUD would just be dummies. But is it not the convention to at least have PGD and PTE always, and have anything in between as optional? E.g:
[PGD](always) - [P4D](optional) - [PUD](optional) - [PMD](optional) - [PTE](always) I mean, are page table walkers ready to deal with non-PGD? I thought they were not. Also, in patch#1, you mentioned: "At the time being, for 512k pages the flag is kept in the PTE and inserted in the PMD entry at TLB miss exception". Can you point out where to look for that in the code? Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()? Is it the size of the current mapping we are establishing? I see that you only make a distinction when the mapping size is 8MB. So the PMD will have _PMD_PAGE_8MB, so it works out that all 1024 PTEs below are contiguous, representing a 4MB chunk? I will start looking deeper into this series on Monday, but just wanted to have a better insight of what is going on. PS: I think we could make the changelog of the cover letter a bit fatter and cover some details in there, e.g: layout of page-tables for different page sizes, layout of hugepd, expected layout after the work, etc. I think it would help in reviewing this series. Thanks! [1] https://github.com/linuxppc/wiki/wiki/Huge-pages -- Oscar Salvador SUSE Labs
Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors
On Wed, May 15, 2024 at 12:41:42PM +0200, Borislav Petkov wrote: > On Fri, May 10, 2024 at 11:29:26AM -0700, Axel Rasmussen wrote: > > @@ -3938,7 +3938,7 @@ static vm_fault_t handle_pte_marker(struct vm_fault > > *vmf) > > > > /* Higher priority than uffd-wp when data corrupted */ > > if (marker & PTE_MARKER_POISONED) > > - return VM_FAULT_HWPOISON; > > + return VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_SILENT; > > If you know here that this poisoning should be silent, why do you have > to make it all complicated and propagate it into arch code, waste > a separate VM_FAULT flag just for that instead of simply returning here > a VM_FAULT_COMPLETED or some other innocuous value which would stop > processing the fault? AFAIK, he only wants it to be silent wrt. the arch fault handler not screaming, but he still wants to be able to trigger force_sig_mceerr(). -- Oscar Salvador SUSE Labs
Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors
On Tue, May 14, 2024 at 03:34:24PM -0600, Peter Xu wrote: > The question is whether we can't. > > Now we reserved a swp entry just for hwpoison and it makes sense only > because we cached the poisoned pfn inside. My long standing question is > why do we ever need that pfn after all. If we don't need the pfn, we > simply need a bit in the pgtable entry saying that it's poisoned, if > accessed we should kill the process using sigbus. > > I used to comment on this before, the only path that uses that pfn is > check_hwpoisoned_entry(), which was introduced in: > > commit a3f5d80ea401ac857f2910e28b15f35b2cf902f4 > Author: Naoya Horiguchi > Date: Mon Jun 28 19:43:14 2021 -0700 > > mm,hwpoison: send SIGBUS with error virutal address > > Now an action required MCE in already hwpoisoned address surely sends a > SIGBUS to current process, but the SIGBUS doesn't convey error virtual > address. That's not optimal for hwpoison-aware applications. > > To fix the issue, make memory_failure() call kill_accessing_process(), > that does pagetable walk to find the error virtual address. It could find > multiple virtual addresses for the same error page, and it seems hard to > tell which virtual address is correct one. But that's rare and sending > incorrect virtual address could be better than no address. So let's > report the first found virtual address for now. > > So this time I read more on this and Naoya explained why - it's only used > so far to dump the VA of the poisoned entry. Well, not really dumped, but we just pass the VA down the chain to the signal handler. But on the question whether we need the pfn encoded, I am not sure actually. check_hwpoisoned_entry() checks whether the pfn the pte points to is the same as the hwpoisoned one, so we know if the process has to be killed. Now, could we get away with using pte_page() instead of pte_pfn() in there, and passing the hwpoisoned page instead of the pfn? I am trying to think of the implications; then we should not need the encoded pfn? > However what confused me is, if an entry is poisoned already logically we > dump that message in the fault handler not memory_failure(), which is: > > MCE: Killing uffd-unit-tests:650 due to hardware memory corruption fault at > 7f3589d7e000 > > So perhaps we're trying to also dump that when the MCEs (points to the same > pfn) are only generated concurrently? I donno much on hwpoison so I cannot > tell, there's also implication where it's only triggered if > MF_ACTION_REQUIRED. But I think it means hwpoison may work without pfn > encoded, but I don't know the implication to lose that dmesg line. Not necessarily concurrently, but simply if for any reason the page could not have been unmapped. Let us say you have ProcessA and ProcessB mapping the same page. We get an MCE for that page but we could not unmap it, at least not from all processes (maybe just ProcessA). memory-failure code will mark it as hwpoison, now ProcessA faults it in and gets killed because we see that the page is hwpoisoned in the fault path, so we will send VM_FAULT_HWPOISON all the way down till you see the: "MCE: Killing uffd-unit-tests:650 due to hardware memory corruption fault at 7f3589d7e000" from e.g: arch/x86/mm/fault.c:do_sigbus() Now, ProcessB still has the page mapped, so upon re-accessing it, it will trigger a new MCE event.
memory-failure code will see that this page has already been marked as hwpoisoned, but since we failed to unmap it (otherwise no one should be re-accessing it), it goes: "ok, let's just kill everybody who has this page mapped". > We used to not dump error for swapin error. Note that here what I am > saying is not that Axel is doing things wrong, but it's just that logically > swapin error (as pte marker) can also be with !QUIET, so my final point is > we may want to avoid having the assumption that "pte marker should always > be QUITE", because I want to make it clear that pte marker can used in any > form, so itself shouldn't imply anything.. I think it would make more sense if we had a separate marker for swapin errors? I mean, deep down, they do not mean the same as poison, right? Then you can choose which events get to be silent because you do not care, and which ones need to scream loud. I think swapin errors belong to the latter. At least, a heads-up on why a process is getting killed is appreciated, IMHO. -- Oscar Salvador SUSE Labs
Re: [PATCH v2 1/1] arch/fault: don't print logs for pte marker poison errors
On Fri, May 10, 2024 at 03:29:48PM -0400, Peter Xu wrote: > IMHO we shouldn't mention that detail, but only state the effect which is > to not report the event to syslog. > > There's no hard rule that a pte marker can't reflect a real page poison in > the future even MCE. Actually I still remember most places don't care > about the pfn in the hwpoison swap entry so maybe we can even do it? But > that's another story regardless.. But we should not use pte markers for real hwpoison events (aka MCE), right? I mean, we do have the means to mark a page as hwpoisoned when a real MCE gets triggered, why would we want a pte marker to also reflect that? Or is that something for the userfaultfd realm? > And also not report swapin error is, IMHO, only because arch errors said > "MCE" in the error logs which may not apply here. Logically speaking > swapin error should also be reported so admin knows better on why a proc is > killed. Now it can still confuse the admin if it really happens, iiuc. I am a bit confused by this. It seems we create poisoned pte markers on swap errors (e.g: unuse_pte()), which get passed down the chain with VM_FAULT_HWPOISON, which ends up in a sigbus (I guess?). This all seems very subtle to me. First of all, why not pass VM_FAULT_SIGBUS if that is what will end up happening? I mean, at the moment that is not possible because we conflate swapping errors and uffd poison in the same type of marker, so we do not have any means to differentiate between the two of them. Would it make sense to create yet another pte marker type to split that up? Because when I look at VM_FAULT_HWPOISON, I get reminded of MCE stuff, and that does not hold here. -- Oscar Salvador SUSE Labs
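To make the "yet another pte marker type" idea above a bit more concrete, a rough sketch (the marker name and bit value are purely illustrative, nothing is implied about the final shape):

	#define PTE_MARKER_SWAPIN_ERROR	BIT(2)	/* hypothetical */

	/* in handle_pte_marker(), a swapin error could then stay a plain
	 * SIGBUS, while real poison keeps the MCE-flavoured return value: */
	if (marker & PTE_MARKER_SWAPIN_ERROR)
		return VM_FAULT_SIGBUS;
	if (marker & PTE_MARKER_POISONED)
		return VM_FAULT_HWPOISON;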
Re: [PATCH v3] powerpc/memhotplug: Add add_pages override for PPC
On Wed, Jun 29, 2022 at 10:39:25AM +0530, Aneesh Kumar K.V wrote: > With commit ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & > 32-bit") > the kernel now validate the addr against high_memory value. This results > in the below BUG_ON with dax pfns. > > [ 635.798741][T26531] kernel BUG at mm/page_alloc.c:5521! > 1:mon> e > cpu 0x1: Vector: 700 (Program Check) at [c7287630] > pc: c055ed48: free_pages.part.0+0x48/0x110 > lr: c053ca70: tlb_finish_mmu+0x80/0xd0 > sp: c72878d0 >msr: 8282b033 > current = 0xcafabe00 > paca= 0xc0037300 irqmask: 0x03 irq_happened: 0x05 > pid = 26531, comm = 50-landscape-sy > kernel BUG at :5521! > Linux version 5.19.0-rc3-14659-g4ec05be7c2e1 (kvaneesh@ltc-boston8) (gcc > (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0, GNU ld (GNU Binutils for Ubuntu) 2.34) > #625 SMP Thu Jun 23 00:35:43 CDT 2022 > 1:mon> t > [link register ] c053ca70 tlb_finish_mmu+0x80/0xd0 > [c72878d0] c053ca54 tlb_finish_mmu+0x64/0xd0 (unreliable) > [c7287900] c0539424 exit_mmap+0xe4/0x2a0 > [c72879e0] c019fc1c mmput+0xcc/0x210 > [c7287a20] c0629230 begin_new_exec+0x5e0/0xf40 > [c7287ae0] c070b3cc load_elf_binary+0x3ac/0x1e00 > [c7287c10] c0627af0 bprm_execve+0x3b0/0xaf0 > [c7287cd0] c0628414 do_execveat_common.isra.0+0x1e4/0x310 > [c7287d80] c062858c sys_execve+0x4c/0x60 > [c7287db0] c002c1b0 system_call_exception+0x160/0x2c0 > [c7287e10] c000c53c system_call_common+0xec/0x250 > > The fix is to make sure we update high_memory on memory hotplug. > This is similar to what x86 does in commit 3072e413e305 ("mm/memory_hotplug: > introduce add_pages") > > Fixes: ffa0b64e3be5 ("powerpc: Fix virt_addr_valid() for 64-bit Book3E & > 32-bit") > Cc: Kefeng Wang > Cc: Christophe Leroy > Signed-off-by: Aneesh Kumar K.V Reviewed-by: Oscar Salvador > --- > Changes from v2: > * drop WARN_ON_ONCE > * check for error from __add_pages > > arch/powerpc/Kconfig | 4 > arch/powerpc/mm/mem.c | 33 - > 2 files changed, 36 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig > index c2ce2e60c8f0..7aa12e88c580 100644 > --- a/arch/powerpc/Kconfig > +++ b/arch/powerpc/Kconfig > @@ -358,6 +358,10 @@ config ARCH_SUSPEND_NONZERO_CPU > def_bool y > depends on PPC_POWERNV || PPC_PSERIES > > +config ARCH_HAS_ADD_PAGES > + def_bool y > + depends on ARCH_ENABLE_MEMORY_HOTPLUG > + > config PPC_DCR_NATIVE > bool > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c > index 52b77684acda..a97128a48817 100644 > --- a/arch/powerpc/mm/mem.c > +++ b/arch/powerpc/mm/mem.c > @@ -105,6 +105,37 @@ void __ref arch_remove_linear_mapping(u64 start, u64 > size) > vm_unmap_aliases(); > } > > +/* > + * After memory hotplug the variables max_pfn, max_low_pfn and high_memory > need > + * updating. 
> + */ > +static void update_end_of_memory_vars(u64 start, u64 size) > +{ > + unsigned long end_pfn = PFN_UP(start + size); > + > + if (end_pfn > max_pfn) { > + max_pfn = end_pfn; > + max_low_pfn = end_pfn; > + high_memory = (void *)__va(max_pfn * PAGE_SIZE - 1) + 1; > + } > +} > + > +int __ref add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages, > + struct mhp_params *params) > +{ > + int ret; > + > + ret = __add_pages(nid, start_pfn, nr_pages, params); > + if (ret) > + return ret; > + > + /* update max_pfn, max_low_pfn and high_memory */ > + update_end_of_memory_vars(start_pfn << PAGE_SHIFT, > + nr_pages << PAGE_SHIFT); > + > + return ret; > +} > + > int __ref arch_add_memory(int nid, u64 start, u64 size, > struct mhp_params *params) > { > @@ -115,7 +146,7 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, > rc = arch_create_linear_mapping(nid, start, size, params); > if (rc) > return rc; > - rc = __add_pages(nid, start_pfn, nr_pages, params); > + rc = add_pages(nid, start_pfn, nr_pages, params); > if (rc) > arch_remove_linear_mapping(start, size); > return rc; > -- > 2.36.1 > > -- Oscar Salvador SUSE Labs
Re: [PATCH] powerpc/numa: Associate numa node to its cpu earlier
On Mon, Apr 11, 2022 at 02:28:08PM +0530, Srikar Dronamraju wrote: > Given that my patch got accepted into powerpc tree > https://git.kernel.org/powerpc/c/e4ff77598a109bd36789ad5e80aba66fc53d0ffb > is now part of Linus tree, this line may need a slight tweak. Right. @Michael: Will you resolve the conflict, or would you rather have me send v2 with the amendment? -- Oscar Salvador SUSE Labs
Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
On Sun, Apr 10, 2022 at 09:28:38PM +1000, Michael Ellerman wrote: > Yeah agreed, thanks for getting to the root of the problem. > > Can you resend as a standalone patch. Because you sent it as a reply it > won't be recognised by patchwork[1] which means it risks getting lost. Hi Michael, It's done [1]. thanks! [1] http://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220411074934.4632-1-osalva...@suse.de/ -- Oscar Salvador SUSE Labs
[PATCH] powerpc/numa: Associate numa node to its cpu earlier
powerpc is the only platform that do not rely on cpu_up()->try_online_node() to bring up a numa node, and special cases it, instead, deep in its own machinery: dlpar_online_cpu find_and_online_cpu_nid try_online_node This should not be needed, but the thing is that the try_online_node() from cpu_up() will not apply on the right node, because cpu_to_node() will return the old mapping numa<->cpu that gets set on boot stage for all possible cpus. That can be seen easily if we try to print out the numa node passed to try_online_node() in cpu_up(). The thing is that the numa<->cpu mapping does not get updated till a much later stage in start_secondary: start_secondary: set_numa_node(numa_cpu_lookup_table[cpu]) But we do not really care, as we already now the CPU <-> NUMA associativity back in find_and_online_cpu_nid(), so let us make use of that and set the proper numa<->cpu mapping, so cpu_to_node() in cpu_up() returns the right node and try_online_node() can do its work. Signed-off-by: Oscar Salvador Reviewed-by: Srikar Dronamraju Tested-by: Geetika Moolchandani --- arch/powerpc/include/asm/topology.h | 8 ++- arch/powerpc/mm/numa.c | 31 +++- arch/powerpc/platforms/pseries/hotplug-cpu.c | 2 +- 3 files changed, 11 insertions(+), 30 deletions(-) diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h index 36fcafb1fd6d..6ae1b2dce83e 100644 --- a/arch/powerpc/include/asm/topology.h +++ b/arch/powerpc/include/asm/topology.h @@ -111,14 +111,10 @@ static inline void unmap_cpu_from_node(unsigned long cpu) {} #endif /* CONFIG_NUMA */ #if defined(CONFIG_NUMA) && defined(CONFIG_PPC_SPLPAR) -extern int find_and_online_cpu_nid(int cpu); +extern void find_and_update_cpu_nid(int cpu); extern int cpu_to_coregroup_id(int cpu); #else -static inline int find_and_online_cpu_nid(int cpu) -{ - return 0; -} - +static inline void find_and_update_cpu_nid(int cpu) {} static inline int cpu_to_coregroup_id(int cpu) { #ifdef CONFIG_SMP diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c index b9b7fefbb64b..b5bc8b1a833d 100644 --- a/arch/powerpc/mm/numa.c +++ b/arch/powerpc/mm/numa.c @@ -1423,43 +1423,28 @@ static long vphn_get_associativity(unsigned long cpu, return rc; } -int find_and_online_cpu_nid(int cpu) +void find_and_update_cpu_nid(int cpu) { __be32 associativity[VPHN_ASSOC_BUFSIZE] = {0}; int new_nid; /* Use associativity from first thread for all siblings */ if (vphn_get_associativity(cpu, associativity)) - return cpu_to_node(cpu); + return; + /* Do not have previous associativity, so find it now. */ new_nid = associativity_to_nid(associativity); + if (new_nid < 0 || !node_possible(new_nid)) new_nid = first_online_node; - - if (NODE_DATA(new_nid) == NULL) { -#ifdef CONFIG_MEMORY_HOTPLUG - /* -* Need to ensure that NODE_DATA is initialized for a node from -* available memory (see memblock_alloc_try_nid). If unable to -* init the node, then default to nearest node that has memory -* installed. Skip onlining a node if the subsystems are not -* yet initialized. -*/ - if (!topology_inited || try_online_node(new_nid)) - new_nid = first_online_node; -#else - /* -* Default to using the nearest node that has memory installed. -* Otherwise, it would be necessary to patch the kernel MM code -* to deal with more memoryless-node error conditions. + else + /* Associate node <-> cpu, so cpu_up() calls +* try_online_node() on the right node. 
*/ - new_nid = first_online_node; -#endif - } + set_cpu_numa_node(cpu, new_nid); pr_debug("%s:%d cpu %d nid %d\n", __FUNCTION__, __LINE__, cpu, new_nid); - return new_nid; } int cpu_to_coregroup_id(int cpu) diff --git a/arch/powerpc/platforms/pseries/hotplug-cpu.c b/arch/powerpc/platforms/pseries/hotplug-cpu.c index b81fc846d99c..0f8cd8b06432 100644 --- a/arch/powerpc/platforms/pseries/hotplug-cpu.c +++ b/arch/powerpc/platforms/pseries/hotplug-cpu.c @@ -398,7 +398,7 @@ static int dlpar_online_cpu(struct device_node *dn) if (get_hard_smp_processor_id(cpu) != thread) continue; cpu_maps_update_done(); - find_and_online_cpu_nid(cpu); + find_and_update_cpu_nid(cpu); rc = device_online(get_cpu_device(cpu)); if (rc) { dlpar_offline_cpu(dn); -- 2.16.4
Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* >* Need to ensure that NODE_DATA is initialized for a node from Because of this fix, I wanted to check whether we might have any more fallouts due to ("mm: handle uninitialized numa nodes gracefully"), and it made me look closer as to why powerpc is the only platform that special cases try_online_node(), while all others rely on cpu_up()->try_online_node() to do the right thing. So, I had a look. Let us rewind a bit: The commit that sets find_and_online_cpu_nid() in dlpar_online_cpu was e67e02a544e9 ("powerpc/pseries: Fix cpu hotplug crash with memoryless nodes"). In there, it says that we have the following picture: partition_sched_domains arch_update_cpu_topology numa_update_cpu_topology find_and_online_cpu_nid and by the time find_and_online_cpu_nid() gets called to online the node, it might be too late as we might have referenced some NODE_DATA() fields. Note that this happens at a much later stage in cpuhp. Also note that at a much earlier stage, we do already have a try_online_node() in cpu_up(), which should allocate-and-online the node and prevent accessing garbage. But the problem is that, on powerpc, all possible cpus have the same node set at boot stage (see arch/powerpc/mm/numa.c:mem_topology_setup), so cpu_to_node() returns the same thing until the mapping gets updated (which happens in start_secondary()->set_numa_node()), and so, the try_online_node() from cpu_up() does not apply on the right node, because it still holds the not-up-to-date mapping node <-> cpu. (e.g: in my test case, when onlining a CPU belonging to node1, cpu_up()->try_online_node() tries to online node0, or whatever old mapping numa<->cpu is there). So, we have something like: dlpar_online_cpu device_online dev->bus->online cpu_subsys_online cpu_device_up cpu_up try_online_node (old mapping nid <-> cpu ) ... ... cphp_callbacks sched_cpu_activate cpuset_update_active_cpus schedule_work(&cpuset_hotplug_work) cpuset_hotplug_work partition_sched_domains arch_update_cpu_topology numa_update_cpu_topology find_and_online_cpu_nid (online new_nid) So, yeah, the real onlining in numa_update_cpu_topology()->find_and_online_cpu_nid() happens too late, that is why adding find_and_online_cpu_nid() back in dlpar_online_cpu() fixed the issue, but we should not need this special casing at all. We do already know the numa<->cpu associativity in dlpar_online_cpu()->find_and_online_cpu_nid(), so we should just be able to update the numa<->cpu mapping, and let the try_online_node() do the right thing. With this in mind, I came up with the following patch, where I carried out a battery of tests for CPU hotplug stuff and it worked as expected. But I am not familiar with all powerpc pitfalls, e.g: dedicated vs shared cpus etc, so I would like to hear from more familiar people.
The patch is: From: Oscar Salvador Date: Wed, 6 Apr 2022 14:39:15 +0200 Subject: [PATCH] powerpc/numa: Associate numa node to its cpu earlier powerpc is the only platform that do not rely on cpu_up()->try_online_node() to bring up a numa node, and special cases it, instead, deep in its own machinery: dlpar_online_cpu find_and_online_cpu_nid try_online_node This should not be needed, but the thing is that the try_online_node() from cpu_up() will not apply on the right node, because cpu_to_node() will return the old mapping numa<->cpu that gets set on boot stage for all possible cpus. That can be seen easily if we try to print out the numa node passed to try_online_node() in cpu_up(). The thing is that the numa<->cpu mapping does not get updated till a much later stage in start_secondary: start_secondary: set_numa_node(numa_cpu_lookup_table[cpu]) But we do not really care, as we already now the CPU <-> NUMA associativity back in find_and_online_cpu_nid(), so let us make use of that and set the proper numa<->cpu mapping, so cpu_to_node() in cpu_up() returns the right node and try_online_node() can do its work. Signed-off-by: Oscar Salvador --- arch/powerpc/include/asm/topology.h | 8 ++- arch/powerpc/mm/numa.c
Re: [PATCH] powerpc/numa: Handle partially initialized numa nodes
On Wed, Mar 30, 2022 at 07:21:23PM +0530, Srikar Dronamraju wrote: > With commit 09f49dca570a ("mm: handle uninitialized numa nodes > gracefully") NODE_DATA for even a memoryless/cpuless node is partially > initialized at boot time. > > Before onlining the node, current Powerpc code checks for NODE_DATA to > be NULL. However since NODE_DATA is partially initialized, this check > will end up always being false. > > This causes hotplugging a CPU to a memoryless/cpuless node to fail. > > Before adding CPUs > $ numactl -H > available: 1 nodes (4) > node 4 cpus: 0 1 2 3 4 5 6 7 > node 4 size: 97372 MB > node 4 free: 95545 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=1 mem=99709440 kB cpus=0 ent=1.00 > > %user %sys %wait%idlephysc %entc lbusy app vcsw phint > - - --- - - - - - > 2.66 2.67 0.1694.51 0.00 0.00 5.33 0.00 67749 0 > > After hotplugging 32 cores > $ numactl -H > node 4 cpus: 0 1 2 3 4 5 6 7 120 121 122 123 124 125 126 127 128 129 130 > 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 > 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 > 167 168 169 170 171 172 173 174 175 > node 4 size: 97372 MB > node 4 free: 93636 MB > node distances: > node 4 > 4: 10 > > $ lparstat > System Configuration > type=Dedicated mode=Capped smt=8 lcpu=33 mem=99709440 kB cpus=0 ent=33.00 > > %user %sys %wait%idlephysc %entc lbusy app vcsw phint > - - --- - - - - - > 0.04 0.02 0.0099.94 0.00 0.00 0.06 0.00 1128751 3 > > As we can see numactl is listing only 8 cores while lparstat is showing > 33 cores. > > Also dmesg is showing messages like: > [ 2261.318350 ] BUG: arch topology borken > [ 2261.318357 ] the DIE domain not a subset of the NODE domain > > Fixes: 09f49dca570a ("mm: handle uninitialized numa nodes gracefully") > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: Michal Hocko > Cc: Michael Ellerman > Reported-by: Geetika Moolchandani > Signed-off-by: Srikar Dronamraju Acked-by: Oscar Salvador Thanks Srikar! > --- > arch/powerpc/mm/numa.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c > index b9b7fefbb64b..13022d734951 100644 > --- a/arch/powerpc/mm/numa.c > +++ b/arch/powerpc/mm/numa.c > @@ -1436,7 +1436,7 @@ int find_and_online_cpu_nid(int cpu) > if (new_nid < 0 || !node_possible(new_nid)) > new_nid = first_online_node; > > - if (NODE_DATA(new_nid) == NULL) { > + if (!node_online(new_nid)) { > #ifdef CONFIG_MEMORY_HOTPLUG > /* >* Need to ensure that NODE_DATA is initialized for a node from > -- > 2.27.0 > > -- Oscar Salvador SUSE Labs
Re: [PATCH v4 4/7] mm: make alloc_contig_range work at pageblock granularity
On Wed, Jan 19, 2022 at 02:06:20PM -0500, Zi Yan wrote: > From: Zi Yan > > alloc_contig_range() worked at MAX_ORDER-1 granularity to avoid merging > pageblocks with different migratetypes. It might unnecessarily convert > extra pageblocks at the beginning and at the end of the range. Change > alloc_contig_range() to work at pageblock granularity. > > It is done by restoring pageblock types and split >pageblock_order free > pages after isolating at MAX_ORDER-1 granularity and migrating pages > away at pageblock granularity. The reason for this process is that > during isolation, some pages, either free or in-use, might have >pageblock > sizes and isolating part of them can cause free accounting issues. > Restoring the migratetypes of the pageblocks not in the interesting > range later is much easier. Hi Zi Yan, Due to time constraints I only glanced over, so some comments below about stuff that caught my eye: > +static inline void split_free_page_into_pageblocks(struct page *free_page, > + int order, struct zone *zone) > +{ > + unsigned long pfn; > + > + spin_lock(&zone->lock); > + del_page_from_free_list(free_page, zone, order); > + for (pfn = page_to_pfn(free_page); > + pfn < page_to_pfn(free_page) + (1UL << order); It might make sense to have an end_pfn variable so that it does not have to be constantly evaluated. Or maybe the compiler is clever enough to only evaluate it once. > + pfn += pageblock_nr_pages) { > + int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn); > + > + __free_one_page(pfn_to_page(pfn), pfn, zone, pageblock_order, > + mt, FPI_NONE); > + } > + spin_unlock(&zone->lock); It is possible that free_page's order is already pageblock_order, so I would add a one-liner upfront to catch that case and return, otherwise we do the delete_from_freelist-and-free_it_again dance (a sketch of both suggestions follows right after this mail). > + /* Save the migratepages of the pageblocks before start and after end */ > + num_pageblock_to_save = (alloc_start - isolate_start) / > pageblock_nr_pages > + + (isolate_end - alloc_end) / > pageblock_nr_pages; > + saved_mt = > + kmalloc_array(num_pageblock_to_save, > + sizeof(unsigned char), GFP_KERNEL); > + if (!saved_mt) > + return -ENOMEM; > + > + num = save_migratetypes(saved_mt, isolate_start, alloc_start); > + > + num = save_migratetypes(&saved_mt[num], alloc_end, isolate_end); I really hope we can put all this magic within start_isolate_page_range, and the counterparts in undo_isolate_page_range. Also, I kinda dislike the &saved_mt thing. I thought about some other approaches but nothing that wasn't too specific for this case, and I guess we want that function to be as generic as possible. > + /* > + * Split free page spanning [alloc_end, isolate_end) and put the > + * pageblocks in the right migratetype list > + */ > + for (outer_end = alloc_end; outer_end < isolate_end;) { > + unsigned long begin_pfn = outer_end; > + > + order = 0; > + while (!PageBuddy(pfn_to_page(outer_end))) { > + if (++order >= MAX_ORDER) { > + outer_end = begin_pfn; > + break; > + } > + outer_end &= ~0UL << order; > + } > + > + if (outer_end != begin_pfn) { > + order = buddy_order(pfn_to_page(outer_end)); > + > + /* > + * split the free page has start page and put the > pageblocks > + * in the right migratetype list > + */ > + VM_BUG_ON(outer_end + (1UL << order) <= begin_pfn); How could this possibly happen?
> + { > + struct page *free_page = pfn_to_page(outer_end); > + > + split_free_page_into_pageblocks(free_page, > order, cc.zone); > + } > + outer_end += 1UL << order; > + } else > + outer_end = begin_pfn + 1; > } I think there are cases we could optimize for. If the page has already been split into pageblocks by the outer_start loop, we could skip this outer_end logic altogether. E.g: An order-10 page is split into two pageblocks. There's nothing else to be done, right? We could skip this. -- Oscar Salvador SUSE Labs
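A rough sketch of the two split_free_page_into_pageblocks() suggestions above (on top of the patch, keeping its signature; untested):

	static inline void split_free_page_into_pageblocks(struct page *free_page,
			int order, struct zone *zone)
	{
		unsigned long pfn = page_to_pfn(free_page);
		unsigned long end_pfn = pfn + (1UL << order);

		/* Already a single pageblock, no need to delete and re-free it. */
		if (order == pageblock_order)
			return;

		spin_lock(&zone->lock);
		del_page_from_free_list(free_page, zone, order);
		for (; pfn < end_pfn; pfn += pageblock_nr_pages) {
			int mt = get_pfnblock_migratetype(pfn_to_page(pfn), pfn);

			__free_one_page(pfn_to_page(pfn), pfn, zone, pageblock_order,
					mt, FPI_NONE);
		}
		spin_unlock(&zone->lock);
	}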
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On Wed, Feb 02, 2022 at 01:25:28PM +0100, David Hildenbrand wrote: > That's the whole idea for being able to allocate parts of an unmovable > pageblock that are movable. > > If the first part is unmovable but the second part is movable, nothing > should stop us from trying to allocate the second part. Yeah, I see, I was a bit slow there, but I see the point now. Thanks David -- Oscar Salvador SUSE Labs
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On Wed, Jan 19, 2022 at 02:06:19PM -0500, Zi Yan wrote: > From: Zi Yan > > Enable set_migratetype_isolate() to check specified sub-range for > unmovable pages during isolation. Page isolation is done > at max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) granularity, but not all > pages within that granularity are intended to be isolated. For example, > alloc_contig_range(), which uses page isolation, allows ranges without > alignment. This commit makes unmovable page check only look for > interesting pages, so that page isolation can succeed for any > non-overlapping ranges. Another thing that came to my mind. Prior to this patch, has_unmovable_pages() was checking on pageblock granularity, starting at pfn#0 of the pageblock. With this patch, you no longer check on pageblock granularity, which means you might isolate a pageblock, but some pages that sneaked in might actually be unmovable. E.g: Let's say you have a pageblock that spans (pfn#512,pfn#1024), and you pass alloc_contig_range() (pfn#514,pfn#1024). has_unmovable_pages() will start checking the pageblock at pfn#514, and so it will miss pfn#512 and pfn#513. Isn't that a problem, if those pfns turn out to be actually unmovable? -- Oscar Salvador SUSE Labs
Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
On Fri, Jan 28, 2022 at 04:15:40PM +0100, David Hildenbrand wrote: > ... and call node_dev_init() after memory_dev_init() from driver_init(), > so before any of the existing arch/subsys calls. All online nodes should > be known at that point. > > This is in line with memory_dev_init(), which initializes the memory > device subsystem and creates all memory block devices. > > Similar to memory_dev_init(), panic() if anything goes wrong, we don't > want to continue with such basic initialization errors. > > The important part is that node_dev_init() gets called after > memory_dev_init() and after cpu_dev_init(), but before any of the > relevant archs call register_cpu() to register the new cpu device under > the node device. The latter should be the case for the current users > of topology_init(). > > Cc: Andrew Morton > Cc: Greg Kroah-Hartman > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Catalin Marinas > Cc: Will Deacon > Cc: Thomas Bogendoerfer > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Heiko Carstens > Cc: Vasily Gorbik > Cc: Yoshinori Sato > Cc: Rich Felker > Cc: "David S. Miller" > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Dave Hansen > Cc: "Rafael J. Wysocki" > Cc: x...@kernel.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linux-ker...@vger.kernel.org > Cc: linux-i...@vger.kernel.org > Cc: linux-m...@vger.kernel.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux-ri...@lists.infradead.org > Cc: linux-s...@vger.kernel.org > Cc: linux...@vger.kernel.org > Cc: sparcli...@vger.kernel.org > Cc: linux...@kvack.org > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE Labs
Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
On Mon, Jan 31, 2022 at 08:48:54AM +0100, David Hildenbrand wrote: > Hi Oscar, Hi David :-), > Right, and the idea is that the online state of nodes (+ node/zone > ranges) already has to be known at that point in time, because > otherwise, we'd be in bigger trouble. Yeah, I wanted to check where exactly we mark the nodes online, and for the few architectures I checked it happens in setup_arch(), which is called very early in start_kernel(), while driver_init() gets called through arch_call_rest_init(), which happens at the end of the function. I am not sure whether we want to remark that somehow in the changelog, so it is crystal clear that by the time node_dev_init() gets called, we already set the nodes online. Anyway, just saying, but it is fine as is. -- Oscar Salvador SUSE Labs
Re: [PATCH RFC v1] drivers/base/node: consolidate node device subsystem initialization in node_dev_init()
On Fri, Jan 28, 2022 at 04:15:40PM +0100, David Hildenbrand wrote: > ... and call node_dev_init() after memory_dev_init() from driver_init(), > so before any of the existing arch/subsys calls. All online nodes should > be known at that point. > > This is in line with memory_dev_init(), which initializes the memory > device subsystem and creates all memory block devices. > > Similar to memory_dev_init(), panic() if anything goes wrong, we don't > want to continue with such basic initialization errors. > > The important part is that node_dev_init() gets called after > memory_dev_init() and after cpu_dev_init(), but before any of the > relevant archs call register_cpu() to register the new cpu device under > the node device. The latter should be the case for the current users > of topology_init(). So, before this change we had something like this: do_basic_setup driver_init memory_dev_init do_init_calls ... topology_init register_nodes/register_one_node And after the patch all happens in driver_init() driver_init memory_dev_init node_dev_init I guess this is fine as we do not have any ordering problems (aka: none of the functions we used to call before expect the nodes not to be there for some weird reason). So, no functional change, right? This certainly looks like an improvment. -- Oscar Salvador SUSE Labs
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On Tue, Jan 25, 2022 at 02:19:46PM +0100, Oscar Salvador wrote: > I know that this has been discussed previously, and the cover-letter already > mentions it, but I think it would be great to have some sort of information > about > the problem in the commit message as well, so people do not have to go and > find > it somewhere else. Sorry, the commit already points it out, but I meant to elaborate some more. -- Oscar Salvador SUSE Labs
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On Mon, Jan 24, 2022 at 12:17:23PM -0500, Zi Yan wrote: > You are right. Sorry for the confusion. I think it should be > "Page isolation is done at least on max(MAX_ORDER_NR_PAEGS, > pageblock_nr_pages) granularity." > > memory_hotplug uses PAGES_PER_SECTION. It is greater than that. Or just specify that the max(MAX_ORDER_NR_PAGES, pageblock_nr_pages) granularity only comes from alloc_contig_range at the moment. Other callers might want to work at other granularities (e.g: memory-hotplug) although ultimately the range has to be aligned to something. > > True is that start_isolate_page_range() expects the range to be pageblock > > aligned and works in pageblock_nr_pages chunks, but I do not think that is > > what you meant to say here. > > Actually, start_isolate_page_range() should expect max(MAX_ORDER_NR_PAEGS, > pageblock_nr_pages) alignment instead of pageblock alignment. It seems to > be an uncovered bug in the current code, since all callers uses at least > max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) alignment. > > The reason is that if start_isolate_page_range() is only pageblock aligned > and a caller wants to isolate one pageblock from a MAX_ORDER-1 > (2 pageblocks on x84_64 systems) free page, this will lead to MIGRATE_ISOLATE > accounting error. To avoid it, start_isolate_page_range() needs to isolate > the max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) aligned range. So, let me see if I get this straight: You are saying that, currently, alloc_contig_range() works on the biggest alignment, otherwise we might have this scenario: [ MAX_ORDER-1 ] [pageblock#0][pageblock#1] We only want to isolate pageblock#1, so we pass a pageblock-aligned range to start_isolate_page_range(), but the page belonging to pageblock#1 spans pageblock#0 and pageblock#1 because it is a MAX_ORDER-1 page. So when we call set_migratetype_isolate()->set_pageblock_migratetype(), this will mark either pageblock#0 or pageblock#1 as isolated, but the whole page will be put in the MIGRATE_ISOLATE freelist by move_freepages_block()->move_freepages(). Meaning, we will effectively have two pageblocks isolated, but only one marked as such? Did I get it right or did I miss something? I know that this has been discussed previously, and the cover-letter already mentions it, but I think it would be great to have some sort of information about the problem in the commit message as well, so people do not have to go and find it somewhere else. -- Oscar Salvador SUSE Labs
Re: [PATCH v4 2/7] mm: page_isolation: move has_unmovable_pages() to mm/page_isolation.c
On 2022-01-19 20:06, Zi Yan wrote: From: Zi Yan has_unmovable_pages() is only used in mm/page_isolation.c. Move it from mm/page_alloc.c and make it static. Signed-off-by: Zi Yan Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE Labs
Re: [PATCH v4 3/7] mm: page_isolation: check specified range for unmovable pages
On 2022-01-19 20:06, Zi Yan wrote: From: Zi Yan Enable set_migratetype_isolate() to check specified sub-range for unmovable pages during isolation. Page isolation is done at max(MAX_ORDER_NR_PAEGS, pageblock_nr_pages) granularity, but not all pages within that granularity are intended to be isolated. For example, alloc_contig_range(), which uses page isolation, allows ranges without alignment. This commit makes unmovable page check only look for interesting pages, so that page isolation can succeed for any non-overlapping ranges. Hi Zi Yan, I had to re-read this several times as I found this a bit misleading. I was mainly confused by the fact that memory_hotplug does isolation on PAGES_PER_SECTION granularity, and reading the above seems to indicate that we either do it at MAX_ORDER_NR_PAGES or at pageblock_nr_pages granularity. True is that start_isolate_page_range() expects the range to be pageblock aligned and works in pageblock_nr_pages chunks, but I do not think that is what you meant to say here. Now, to the change itself, below: @@ -47,8 +51,8 @@ static struct page *has_unmovable_pages(struct zone *zone, struct page *page, return page; } - for (; iter < pageblock_nr_pages - offset; iter++) { - page = pfn_to_page(pfn + iter); + for (pfn = first_pfn; pfn < last_pfn; pfn++) { You already did pfn = first_pfn before. /** * start_isolate_page_range() - make page-allocation-type of range of pages to * be MIGRATE_ISOLATE. - * @start_pfn: The lower PFN of the range to be isolated. - * @end_pfn: The upper PFN of the range to be isolated. + * @start_pfn: The lower PFN of the range to be checked for + * possibility of isolation. + * @end_pfn: The upper PFN of the range to be checked for + * possibility of isolation. + * @isolate_start: The lower PFN of the range to be isolated. + * @isolate_end: The upper PFN of the range to be isolated. So, what does "possibility" mean here? I think this needs to be clarified a bit better. From what you pointed out in the commit message I think what you are doing is: - alloc_contig_range() gets a range to be isolated. - then you pass two ranges to start_isolate_page_range() (start_pfn, end_pfn]: which is the unaligned range you got in alloc_contig_range() (isolate_start, isolate_end]: which got aligned, let's say, to MAX_ORDER_NR_PAGES Now, most likely, (start_pfn, end_pfn] only covers a sub-range of (isolate_start, isolate_end], and that sub-range is what you really want to isolate (so (start_pfn, end_pfn])? If so, should not the names be reversed? -- Oscar Salvador SUSE Labs
Re: [PATCH v1 6/6] x86: remove memory hotplug support on X86_32
On Wed, Sep 29, 2021 at 04:36:00PM +0200, David Hildenbrand wrote: > CONFIG_MEMORY_HOTPLUG was marked BROKEN over one year and we just > restricted it to 64 bit. Let's remove the unused x86 32bit implementation > and simplify the Kconfig. > > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > arch/x86/Kconfig | 6 +++--- > arch/x86/mm/init_32.c | 31 --- > 2 files changed, 3 insertions(+), 34 deletions(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index ab83c22d274e..85f4762429f1 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -62,7 +62,7 @@ config X86 > select ARCH_32BIT_OFF_T if X86_32 > select ARCH_CLOCKSOURCE_INIT > select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && > MIGRATION > - select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM) > + select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 > select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG > select ARCH_ENABLE_SPLIT_PMD_PTLOCK if (PGTABLE_LEVELS > 2) && (X86_64 > || X86_PAE) > select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE > @@ -1615,7 +1615,7 @@ config ARCH_SELECT_MEMORY_MODEL > > config ARCH_MEMORY_PROBE > bool "Enable sysfs memory/probe interface" > - depends on X86_64 && MEMORY_HOTPLUG > + depends on MEMORY_HOTPLUG > help > This option enables a sysfs memory/probe interface for testing. > See Documentation/admin-guide/mm/memory-hotplug.rst for more > information. > @@ -2395,7 +2395,7 @@ endmenu > > config ARCH_HAS_ADD_PAGES > def_bool y > - depends on X86_64 && ARCH_ENABLE_MEMORY_HOTPLUG > + depends on ARCH_ENABLE_MEMORY_HOTPLUG > > config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE > def_bool y > diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c > index bd90b8fe81e4..5cd7ea6d645c 100644 > --- a/arch/x86/mm/init_32.c > +++ b/arch/x86/mm/init_32.c > @@ -779,37 +779,6 @@ void __init mem_init(void) > test_wp_bit(); > } > > -#ifdef CONFIG_MEMORY_HOTPLUG > -int arch_add_memory(int nid, u64 start, u64 size, > - struct mhp_params *params) > -{ > - unsigned long start_pfn = start >> PAGE_SHIFT; > - unsigned long nr_pages = size >> PAGE_SHIFT; > - int ret; > - > - /* > - * The page tables were already mapped at boot so if the caller > - * requests a different mapping type then we must change all the > - * pages with __set_memory_prot(). > - */ > - if (params->pgprot.pgprot != PAGE_KERNEL.pgprot) { > - ret = __set_memory_prot(start, nr_pages, params->pgprot); > - if (ret) > - return ret; > - } > - > - return __add_pages(nid, start_pfn, nr_pages, params); > -} > - > -void arch_remove_memory(u64 start, u64 size, struct vmem_altmap *altmap) > -{ > - unsigned long start_pfn = start >> PAGE_SHIFT; > - unsigned long nr_pages = size >> PAGE_SHIFT; > - > - __remove_pages(start_pfn, nr_pages, altmap); > -} > -#endif > - > int kernel_set_to_readonly __read_mostly; > > static void mark_nxdata_nx(void) > -- > 2.31.1 > > -- Oscar Salvador SUSE Labs
Re: [PATCH v1 5/6] mm/memory_hotplug: remove stale function declarations
On Wed, Sep 29, 2021 at 04:35:59PM +0200, David Hildenbrand wrote: > These functions no longer exist. > > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > include/linux/memory_hotplug.h | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h > index e5a867c950b2..be48e003a518 100644 > --- a/include/linux/memory_hotplug.h > +++ b/include/linux/memory_hotplug.h > @@ -98,9 +98,6 @@ static inline void zone_seqlock_init(struct zone *zone) > { > seqlock_init(&zone->span_seqlock); > } > -extern int zone_grow_free_lists(struct zone *zone, unsigned long > new_nr_pages); > -extern int zone_grow_waitqueues(struct zone *zone, unsigned long nr_pages); > -extern int add_one_highpage(struct page *page, int pfn, int bad_ppro); > extern void adjust_present_page_count(struct page *page, > struct memory_group *group, > long nr_pages); > -- > 2.31.1 > > -- Oscar Salvador SUSE Labs
Re: [PATCH v1 4/6] mm/memory_hotplug: remove HIGHMEM leftovers
On Wed, Sep 29, 2021 at 04:35:58PM +0200, David Hildenbrand wrote: > We don't support CONFIG_MEMORY_HOTPLUG on 32 bit and consequently not > HIGHMEM. Let's remove any leftover code -- including the unused > "status_change_nid_high" field part of the memory notifier. > > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > Documentation/core-api/memory-hotplug.rst | 3 -- > .../zh_CN/core-api/memory-hotplug.rst | 4 --- > include/linux/memory.h| 1 - > mm/memory_hotplug.c | 36 ++- > 4 files changed, 2 insertions(+), 42 deletions(-) > > diff --git a/Documentation/core-api/memory-hotplug.rst > b/Documentation/core-api/memory-hotplug.rst > index de7467e48067..682259ee633a 100644 > --- a/Documentation/core-api/memory-hotplug.rst > +++ b/Documentation/core-api/memory-hotplug.rst > @@ -57,7 +57,6 @@ The third argument (arg) passes a pointer of struct > memory_notify:: > unsigned long start_pfn; > unsigned long nr_pages; > int status_change_nid_normal; > - int status_change_nid_high; > int status_change_nid; > } > > @@ -65,8 +64,6 @@ The third argument (arg) passes a pointer of struct > memory_notify:: > - nr_pages is # of pages of online/offline memory. > - status_change_nid_normal is set node id when N_NORMAL_MEMORY of nodemask >is (will be) set/clear, if this is -1, then nodemask status is not changed. > -- status_change_nid_high is set node id when N_HIGH_MEMORY of nodemask > - is (will be) set/clear, if this is -1, then nodemask status is not changed. > - status_change_nid is set node id when N_MEMORY of nodemask is (will be) >set/clear. It means a new(memoryless) node gets new memory by online and a >node loses all memory. If this is -1, then nodemask status is not changed. > diff --git a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst > b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst > index 161f4d2c18cc..9a204eb196f2 100644 > --- a/Documentation/translations/zh_CN/core-api/memory-hotplug.rst > +++ b/Documentation/translations/zh_CN/core-api/memory-hotplug.rst > @@ -63,7 +63,6 @@ memory_notify结构体的指针:: > unsigned long start_pfn; > unsigned long nr_pages; > int status_change_nid_normal; > - int status_change_nid_high; > int status_change_nid; > } > > @@ -74,9 +73,6 @@ memory_notify结构体的指针:: > - status_change_nid_normal是当nodemask的N_NORMAL_MEMORY被设置/清除时设置节 >点id,如果是-1,则nodemask状态不改变。 > > -- status_change_nid_high是当nodemask的N_HIGH_MEMORY被设置/清除时设置的节点 > - id,如果这个值为-1,那么nodemask状态不会改变。 > - > - status_change_nid是当nodemask的N_MEMORY被(将)设置/清除时设置的节点id。这 >意味着一个新的(没上线的)节点通过联机获得新的内存,而一个节点失去了所有的内 >存。如果这个值为-1,那么nodemask的状态就不会改变。 > diff --git a/include/linux/memory.h b/include/linux/memory.h > index dd6e608c3e0b..c46ff374d48d 100644 > --- a/include/linux/memory.h > +++ b/include/linux/memory.h > @@ -96,7 +96,6 @@ struct memory_notify { > unsigned long start_pfn; > unsigned long nr_pages; > int status_change_nid_normal; > - int status_change_nid_high; > int status_change_nid; > }; > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 8d7b2b593a26..95c927c8bfb8 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -21,7 +21,6 @@ > #include > #include > #include > -#include > #include > #include > #include > @@ -585,10 +584,6 @@ void generic_online_page(struct page *page, unsigned int > order) > debug_pagealloc_map_pages(page, 1 << order); > __free_pages_core(page, order); > totalram_pages_add(1UL << order); > -#ifdef CONFIG_HIGHMEM > - if (PageHighMem(page)) > - totalhigh_pages_add(1UL << order); > -#endif > } > 
EXPORT_SYMBOL_GPL(generic_online_page); > > @@ -625,16 +620,11 @@ static void node_states_check_changes_online(unsigned > long nr_pages, > > arg->status_change_nid = NUMA_NO_NODE; > arg->status_change_nid_normal = NUMA_NO_NODE; > - arg->status_change_nid_high = NUMA_NO_NODE; > > if (!node_state(nid, N_MEMORY)) > arg->status_change_nid = nid; > if (zone_idx(zone) <= ZONE_NORMAL && !node_state(nid, N_NORMAL_MEMORY)) > arg->status_change_nid_normal = nid; > -#ifdef CONFIG_HIGHMEM > - if (zone_idx(zone) <= ZONE_HIGHMEM && !node_state(nid, N_HIGH_MEMORY)) > - arg->status_change_nid_high = nid; > -#endif > } > >
Re: [PATCH v1 3/6] mm/memory_hotplug: restrict CONFIG_MEMORY_HOTPLUG to 64 bit
On Wed, Sep 29, 2021 at 04:35:57PM +0200, David Hildenbrand wrote: > 32 bit support is broken in various ways: for example, we can online > memory that should actually go to ZONE_HIGHMEM to ZONE_MOVABLE or in > some cases even to one of the other kernel zones. > > We marked it BROKEN in commit b59d02ed0869 ("mm/memory_hotplug: disable the > functionality for 32b") almost one year ago. According to that commit > it might be broken at least since 2017. Further, there is hardly a sane use > case nowadays. > > Let's just depend completely on 64bit, dropping the "BROKEN" dependency to > make clear that we are not going to support it again. Next, we'll remove > some HIGHMEM leftovers from memory hotplug code to clean up. > > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > mm/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index ea8762cd8e1e..88273dd5c6d6 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -125,7 +125,7 @@ config MEMORY_HOTPLUG > select MEMORY_ISOLATION > depends on SPARSEMEM > depends on ARCH_ENABLE_MEMORY_HOTPLUG > - depends on 64BIT || BROKEN > + depends on 64BIT > select NUMA_KEEP_MEMINFO if NUMA > > config MEMORY_HOTPLUG_DEFAULT_ONLINE > -- > 2.31.1 > > -- Oscar Salvador SUSE Labs
Re: [PATCH v1 2/6] mm/memory_hotplug: remove CONFIG_MEMORY_HOTPLUG_SPARSE
On Wed, Sep 29, 2021 at 04:35:56PM +0200, David Hildenbrand wrote: > CONFIG_MEMORY_HOTPLUG depends on CONFIG_SPARSEMEM, so there is no need for > CONFIG_MEMORY_HOTPLUG_SPARSE anymore; adjust all instances to use > CONFIG_MEMORY_HOTPLUG and remove CONFIG_MEMORY_HOTPLUG_SPARSE. > > Signed-off-by: David Hildenbrand Acked-by: Oscar Salvador > --- > arch/powerpc/include/asm/machdep.h| 2 +- > arch/powerpc/kernel/setup_64.c| 2 +- > arch/powerpc/platforms/powernv/setup.c| 4 ++-- > arch/powerpc/platforms/pseries/setup.c| 2 +- > drivers/base/Makefile | 2 +- > drivers/base/node.c | 9 - > drivers/virtio/Kconfig| 2 +- > include/linux/memory.h| 18 +++--- > include/linux/node.h | 4 ++-- > lib/Kconfig.debug | 2 +- > mm/Kconfig| 4 > mm/memory_hotplug.c | 2 -- > tools/testing/selftests/memory-hotplug/config | 1 - > 13 files changed, 21 insertions(+), 33 deletions(-) > > diff --git a/arch/powerpc/include/asm/machdep.h > b/arch/powerpc/include/asm/machdep.h > index 764f2732a821..d8a2ca007082 100644 > --- a/arch/powerpc/include/asm/machdep.h > +++ b/arch/powerpc/include/asm/machdep.h > @@ -32,7 +32,7 @@ struct machdep_calls { > void(*iommu_save)(void); > void(*iommu_restore)(void); > #endif > -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE > +#ifdef CONFIG_MEMORY_HOTPLUG > unsigned long (*memory_block_size)(void); > #endif > #endif /* CONFIG_PPC64 */ > diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c > index eaa79a0996d1..21f15d82f062 100644 > --- a/arch/powerpc/kernel/setup_64.c > +++ b/arch/powerpc/kernel/setup_64.c > @@ -912,7 +912,7 @@ void __init setup_per_cpu_areas(void) > } > #endif > > -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE > +#ifdef CONFIG_MEMORY_HOTPLUG > unsigned long memory_block_size_bytes(void) > { > if (ppc_md.memory_block_size) > diff --git a/arch/powerpc/platforms/powernv/setup.c > b/arch/powerpc/platforms/powernv/setup.c > index a8db3f153063..ad56a54ac9c5 100644 > --- a/arch/powerpc/platforms/powernv/setup.c > +++ b/arch/powerpc/platforms/powernv/setup.c > @@ -440,7 +440,7 @@ static void pnv_kexec_cpu_down(int crash_shutdown, int > secondary) > } > #endif /* CONFIG_KEXEC_CORE */ > > -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE > +#ifdef CONFIG_MEMORY_HOTPLUG > static unsigned long pnv_memory_block_size(void) > { > /* > @@ -553,7 +553,7 @@ define_machine(powernv) { > #ifdef CONFIG_KEXEC_CORE > .kexec_cpu_down = pnv_kexec_cpu_down, > #endif > -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE > +#ifdef CONFIG_MEMORY_HOTPLUG > .memory_block_size = pnv_memory_block_size, > #endif > }; > diff --git a/arch/powerpc/platforms/pseries/setup.c > b/arch/powerpc/platforms/pseries/setup.c > index f79126f16258..d29f6f1f7f37 100644 > --- a/arch/powerpc/platforms/pseries/setup.c > +++ b/arch/powerpc/platforms/pseries/setup.c > @@ -1089,7 +1089,7 @@ define_machine(pseries) { > .machine_kexec = pSeries_machine_kexec, > .kexec_cpu_down = pseries_kexec_cpu_down, > #endif > -#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE > +#ifdef CONFIG_MEMORY_HOTPLUG > .memory_block_size = pseries_memory_block_size, > #endif > }; > diff --git a/drivers/base/Makefile b/drivers/base/Makefile > index ef8e44a7d288..02f7f1358e86 100644 > --- a/drivers/base/Makefile > +++ b/drivers/base/Makefile > @@ -13,7 +13,7 @@ obj-y += power/ > obj-$(CONFIG_ISA_BUS_API)+= isa.o > obj-y+= firmware_loader/ > obj-$(CONFIG_NUMA) += node.o > -obj-$(CONFIG_MEMORY_HOTPLUG_SPARSE) += memory.o > +obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o > ifeq ($(CONFIG_SYSFS),y) > obj-$(CONFIG_MODULES)+= module.o > endif > diff --git a/drivers/base/node.c b/drivers/base/node.c > index 
c56d34f8158f..b5a4ba18f9f9 100644 > --- a/drivers/base/node.c > +++ b/drivers/base/node.c > @@ -629,7 +629,7 @@ static void node_device_release(struct device *dev) > { > struct node *node = to_node(dev); > > -#if defined(CONFIG_MEMORY_HOTPLUG_SPARSE) && defined(CONFIG_HUGETLBFS) > +#if defined(CONFIG_MEMORY_HOTPLUG) && defined(CONFIG_HUGETLBFS) > /* >* We schedule the work only when a memory section is >* onlined/offlined on this node. When we come here, > @@ -782,7 +782,7 @@ int unregister_cpu_under_node(unsigned int
Re: [PATCH v1 1/6] mm/memory_hotplug: remove CONFIG_X86_64_ACPI_NUMA dependency from CONFIG_MEMORY_HOTPLUG
On Wed, Sep 29, 2021 at 04:35:55PM +0200, David Hildenbrand wrote: > SPARSEMEM is the only possible memory model for x86-64, FLATMEM is not > possible: > config ARCH_FLATMEM_ENABLE > def_bool y > depends on X86_32 && !NUMA > > And X86_64_ACPI_NUMA (obviously) only supports x86-64: > config X86_64_ACPI_NUMA > def_bool y > depends on X86_64 && NUMA && ACPI && PCI > > Let's just remove the CONFIG_X86_64_ACPI_NUMA dependency, as it does no > longer make sense. > > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > mm/Kconfig | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index d16ba9249bc5..b7fb3f0b485e 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -123,7 +123,7 @@ config ARCH_ENABLE_MEMORY_HOTPLUG > config MEMORY_HOTPLUG > bool "Allow for memory hot-add" > select MEMORY_ISOLATION > - depends on SPARSEMEM || X86_64_ACPI_NUMA > + depends on SPARSEMEM > depends on ARCH_ENABLE_MEMORY_HOTPLUG > depends on 64BIT || BROKEN > select NUMA_KEEP_MEMINFO if NUMA > -- > 2.31.1 > > -- Oscar Salvador SUSE Labs
Re: [PATCH V2 4/6] mm: Drop redundant ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION
On Thu, Apr 01, 2021 at 12:14:06PM +0530, Anshuman Khandual wrote: > ARCH_ENABLE_[HUGEPAGE|THP]_MIGRATION configs have duplicate definitions on > platforms that subscribe them. Drop these reduntant definitions and instead > just select them appropriately. > > Cc: Catalin Marinas > Cc: Will Deacon > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: "H. Peter Anvin" > Cc: Andrew Morton > Cc: x...@kernel.org > Cc: linux-arm-ker...@lists.infradead.org > Cc: linuxppc-dev@lists.ozlabs.org > Cc: linux...@kvack.org > Cc: linux-ker...@vger.kernel.org > Acked-by: Catalin Marinas (arm64) > Signed-off-by: Anshuman Khandual Hi Anshuman, X86 needs fixing, see below: > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 503d8b2e8676..10702ef1eb57 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -60,8 +60,10 @@ config X86 > select ACPI_SYSTEM_POWER_STATES_SUPPORT if ACPI > select ARCH_32BIT_OFF_T if X86_32 > select ARCH_CLOCKSOURCE_INIT > + select ARCH_ENABLE_HUGEPAGE_MIGRATION if x86_64 && HUGETLB_PAGE && > MIGRATION > select ARCH_ENABLE_MEMORY_HOTPLUG if X86_64 || (X86_32 && HIGHMEM) > select ARCH_ENABLE_MEMORY_HOTREMOVE if MEMORY_HOTPLUG > + select ARCH_ENABLE_THP_MIGRATION if x86_64 && TRANSPARENT_HUGEPAGE you need s/x86_64/X86_64/, otherwise we are left with no migration :-) -- Oscar Salvador SUSE L3
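For reference, the case fix being requested amounts to the following two select lines (a sketch of the corrected hunk only; Kconfig symbols are case sensitive, so the lower-case "x86_64" never evaluates to true and the selects would never fire):

    select ARCH_ENABLE_HUGEPAGE_MIGRATION if X86_64 && HUGETLB_PAGE && MIGRATION
    select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE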
Re: [PATCH v2 8/8] powernv/memtrace: don't abuse memory hot(un)plug infrastructure for memory allocations
On Wed, Nov 11, 2020 at 03:53:22PM +0100, David Hildenbrand wrote: > Suggested-by: Michal Hocko > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Signed-off-by: David Hildenbrand LGTM, and much cleaner definitely: Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v2 7/8] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
On Wed, Nov 11, 2020 at 03:53:21PM +0100, David Hildenbrand wrote: > Let's revert what we did in case seomthing goes wrong and we return an "something" :-) > error - as already done on arm64 and s390x. > > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v2 6/8] powerepc/book3s64/hash: drop WARN_ON in hash__remove_section_mapping
On Wed, Nov 11, 2020 at 03:53:20PM +0100, David Hildenbrand wrote: > The single caller (arch_remove_linear_mapping()) prints a proper warning > when this function fails. No need to eventually crash the kernel - let's > drop this WARN_ON. > > Suggested-by: Oscar Salvador > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Cc: "Aneesh Kumar K.V" > Cc: Nicholas Piggin > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v2 4/8] powerpc/mm: protect linear mapping modifications by a mutex
On Wed, Nov 11, 2020 at 03:53:18PM +0100, David Hildenbrand wrote: > @@ -144,7 +147,9 @@ void __ref arch_remove_linear_mapping(u64 start, u64 size) > start = (unsigned long)__va(start); > flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE); > > + mutex_lock(&linear_mapping_mutex); > ret = remove_section_mapping(start, start + size); > + mutex_unlock(&linear_mapping_mutex); > WARN_ON_ONCE(ret); My expertise in this area is low, so bear with me. Why do we not need to protect flush_dcache_range_chunked() and vm_unmap_aliases()? -- Oscar Salvador SUSE L3
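For context, the wider locking scope being asked about would look roughly like the sketch below (hypothetical only, reusing the calls named in the hunk and in the question; it is not what the patch does):

	mutex_lock(&linear_mapping_mutex);
	flush_dcache_range_chunked(start, start + size, FLUSH_CHUNK_SIZE);
	ret = remove_section_mapping(start, start + size);
	vm_unmap_aliases();
	mutex_unlock(&linear_mapping_mutex);
	WARN_ON_ONCE(ret);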
Re: [PATCH v2 3/8] powerpc/mm: factor out creating/removing linear mapping
On Wed, Nov 11, 2020 at 03:53:17PM +0100, David Hildenbrand wrote: > We want to stop abusing memory hotplug infrastructure in memtrace code > to perform allocations and remove the linear mapping. Instead we will use > alloc_contig_pages() and remove the linear mapping manually. > > Let's factor out creating/removing the linear mapping into > arch_create_linear_mapping() / arch_remove_linear_mapping() - so in the > future, we might be able to have whole arch_add_memory() / > arch_remove_memory() be implemented in common code. > > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v2 2/8] powernv/memtrace: fix crashing the kernel when enabling concurrently
On Wed, Nov 11, 2020 at 03:53:16PM +0100, David Hildenbrand wrote: > Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory > tracing") > Cc: sta...@vger.kernel.org# v4.14+ > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v2 1/8] powernv/memtrace: don't leak kernel memory to user space
On Wed, Nov 11, 2020 at 03:53:15PM +0100, David Hildenbrand wrote: > Reported-by: Michael Ellerman > Fixes: 9d5171a8f248 ("powerpc/powernv: Enable removal of memory for in memory > tracing") > Cc: sta...@vger.kernel.org # v4.14+ > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote: > Let's revert what we did in case seomthing goes wrong and we return an > error. > > Cc: Michael Ellerman > Cc: Benjamin Herrenschmidt > Cc: Paul Mackerras > Cc: Rashmica Gupta > Cc: Andrew Morton > Cc: Mike Rapoport > Cc: Michal Hocko > Cc: Oscar Salvador > Cc: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > arch/powerpc/mm/mem.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c > index 685028451dd2..69b3e8072261 100644 > --- a/arch/powerpc/mm/mem.c > +++ b/arch/powerpc/mm/mem.c > @@ -165,7 +165,10 @@ int __ref arch_add_memory(int nid, u64 start, u64 size, > rc = arch_create_linear_mapping(nid, start, size, params); > if (rc) > return rc; > - return __add_pages(nid, start_pfn, nr_pages, params); > + rc = __add_pages(nid, start_pfn, nr_pages, params); > + if (rc) > + arch_remove_linear_mapping(start, size); > + return rc; > } > > void __ref arch_remove_memory(int nid, u64 start, u64 size, > -- > 2.26.2 > -- Oscar Salvador SUSE L3
Re: [PATCH v1 3/4] powerpc/mm: remove linear mapping if __add_pages() fails in arch_add_memory()
On Wed, Nov 04, 2020 at 02:06:51PM +0200, Mike Rapoport wrote: > On Wed, Nov 04, 2020 at 10:50:07AM +0100, osalvador wrote: > > On Thu, Oct 29, 2020 at 05:27:17PM +0100, David Hildenbrand wrote: > > > Let's revert what we did in case seomthing goes wrong and we return an > > > error. > > > > Dumb question, but should not we do this for other arches as well? > > It seems arm64 and s390 already do that. > x86 could have its arch_add_memory() improved though :) Right, I only looked at x86 and saw it did not have it. I guess we want to have all arches aligned with this. Thanks -- Oscar Salvador SUSE L3
Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory
On Tue, Feb 04, 2020 at 09:45:24AM +0100, David Hildenbrand wrote: > I really hope we'll find more reviewers in general - I'm also not happy > if my patches go upstream with little/no review. However, patches > shouldn't be stuck for multiple merge windows in linux-next IMHO > (excluding exceptions of course) - then they should either be sent > upstream (and eventually fixed later) or dropped. First of all, sorry for my lack of review; lately I have been a bit disconnected from the list because of lack of time. Luckily I managed to find some time, so I went through the patches that still lacked review (#6-#10). I hope this helps in moving the series forward, although Michal's review would be great as well. -- Oscar Salvador SUSE Labs
Re: [PATCH v6 10/10] mm/memory_hotplug: Cleanup __remove_pages()
On Sun, Oct 06, 2019 at 10:56:46AM +0200, David Hildenbrand wrote: > Let's drop the basically unused section stuff and simplify. > > Also, let's use a shorter variant to calculate the number of pages to > the next section boundary. > > Cc: Andrew Morton > Cc: Oscar Salvador > Cc: Michal Hocko > Cc: Pavel Tatashin > Cc: Dan Williams > Cc: Wei Yang > Signed-off-by: David Hildenbrand I have to confess that it took me a while to wrap my head around the new min() change, but it looks ok: Reviewed-by: Oscar Salvador > --- > mm/memory_hotplug.c | 17 ++--- > 1 file changed, 6 insertions(+), 11 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 843481bd507d..2275240cfa10 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -490,25 +490,20 @@ static void __remove_section(unsigned long pfn, > unsigned long nr_pages, > void __remove_pages(unsigned long pfn, unsigned long nr_pages, > struct vmem_altmap *altmap) > { > + const unsigned long end_pfn = pfn + nr_pages; > + unsigned long cur_nr_pages; > unsigned long map_offset = 0; > - unsigned long nr, start_sec, end_sec; > > map_offset = vmem_altmap_offset(altmap); > > if (check_pfn_span(pfn, nr_pages, "remove")) > return; > > - start_sec = pfn_to_section_nr(pfn); > - end_sec = pfn_to_section_nr(pfn + nr_pages - 1); > - for (nr = start_sec; nr <= end_sec; nr++) { > - unsigned long pfns; > - > + for (; pfn < end_pfn; pfn += cur_nr_pages) { > cond_resched(); > - pfns = min(nr_pages, PAGES_PER_SECTION > - - (pfn & ~PAGE_SECTION_MASK)); > - __remove_section(pfn, pfns, map_offset, altmap); > - pfn += pfns; > - nr_pages -= pfns; > + /* Select all remaining pages up to the next section boundary */ > + cur_nr_pages = min(end_pfn - pfn, -(pfn | PAGE_SECTION_MASK)); > + __remove_section(pfn, cur_nr_pages, map_offset, altmap); > map_offset = 0; > } > } > -- > 2.21.0 > > -- Oscar Salvador SUSE L3
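For anyone else pausing on that min() trick: evaluated in unsigned arithmetic, -(pfn | PAGE_SECTION_MASK) is exactly the number of pages from pfn up to the next section boundary, i.e. the same value as the old PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK). A minimal userspace sketch of the arithmetic (assuming PAGES_PER_SECTION is a power of two; 32768 is used here purely as an example value):

	#include <stdio.h>

	#define PAGES_PER_SECTION 32768UL                  /* example value only */
	#define PAGE_SECTION_MASK (~(PAGES_PER_SECTION - 1))

	int main(void)
	{
		unsigned long pfn = 0x12345;               /* arbitrary example pfn */

		/* new form used by the patch */
		unsigned long new_form = -(pfn | PAGE_SECTION_MASK);
		/* old long-hand form it replaces */
		unsigned long old_form = PAGES_PER_SECTION - (pfn & ~PAGE_SECTION_MASK);

		/* both print 23739: pages left until the next section boundary */
		printf("%lu %lu\n", new_form, old_form);
		return 0;
	}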
Re: [PATCH v6 09/10] mm/memory_hotplug: Drop local variables in shrink_zone_span()
On Sun, Oct 06, 2019 at 10:56:45AM +0200, David Hildenbrand wrote: > Get rid of the unnecessary local variables. > > Cc: Andrew Morton > Cc: Oscar Salvador > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Pavel Tatashin > Cc: Dan Williams > Cc: Wei Yang > Signed-off-by: David Hildenbrand > --- > mm/memory_hotplug.c | 15 ++- > 1 file changed, 6 insertions(+), 9 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index 8dafa1ba8d9f..843481bd507d 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -374,14 +374,11 @@ static unsigned long find_biggest_section_pfn(int nid, > struct zone *zone, > static void shrink_zone_span(struct zone *zone, unsigned long start_pfn, >unsigned long end_pfn) > { > - unsigned long zone_start_pfn = zone->zone_start_pfn; > - unsigned long z = zone_end_pfn(zone); /* zone_end_pfn namespace clash */ > - unsigned long zone_end_pfn = z; > unsigned long pfn; > int nid = zone_to_nid(zone); We could also remove the nid, right? AFAICS, the nid is only used in find_{smallest/biggest}_section_pfn so we could place there as well. Anyway, nothing to nit-pick about: Reviewed-by: Oscar Salvador > > zone_span_writelock(zone); > - if (zone_start_pfn == start_pfn) { > + if (zone->zone_start_pfn == start_pfn) { > /* >* If the section is smallest section in the zone, it need >* shrink zone->zone_start_pfn and zone->zone_spanned_pages. > @@ -389,25 +386,25 @@ static void shrink_zone_span(struct zone *zone, > unsigned long start_pfn, >* for shrinking zone. >*/ > pfn = find_smallest_section_pfn(nid, zone, end_pfn, > - zone_end_pfn); > + zone_end_pfn(zone)); > if (pfn) { > + zone->spanned_pages = zone_end_pfn(zone) - pfn; > zone->zone_start_pfn = pfn; > - zone->spanned_pages = zone_end_pfn - pfn; > } else { > zone->zone_start_pfn = 0; > zone->spanned_pages = 0; > } > - } else if (zone_end_pfn == end_pfn) { > + } else if (zone_end_pfn(zone) == end_pfn) { > /* >* If the section is biggest section in the zone, it need >* shrink zone->spanned_pages. >* In this case, we find second biggest valid mem_section for >* shrinking zone. >*/ > - pfn = find_biggest_section_pfn(nid, zone, zone_start_pfn, > + pfn = find_biggest_section_pfn(nid, zone, zone->zone_start_pfn, > start_pfn); > if (pfn) > - zone->spanned_pages = pfn - zone_start_pfn + 1; > + zone->spanned_pages = pfn - zone->zone_start_pfn + 1; > else { > zone->zone_start_pfn = 0; > zone->spanned_pages = 0; > -- > 2.21.0 > -- Oscar Salvador SUSE L3
Re: [PATCH v6 08/10] mm/memory_hotplug: Don't check for "all holes" in shrink_zone_span()
On Sun, Oct 06, 2019 at 10:56:44AM +0200, David Hildenbrand wrote: > If we have holes, the holes will automatically get detected and removed > once we remove the next bigger/smaller section. The extra checks can > go. > > Cc: Andrew Morton > Cc: Oscar Salvador > Cc: Michal Hocko > Cc: David Hildenbrand > Cc: Pavel Tatashin > Cc: Dan Williams > Cc: Wei Yang > Signed-off-by: David Hildenbrand Heh, I have been here before. I have to confess that when I wrote my version of this I was not really 100% sure about removing it, because hotplug was a sort of "catchall" for all sorts of weird and corner-case configurations, but thinking more about it, I cannot think of any situation that would make this blow up. Reviewed-by: Oscar Salvador > --- > mm/memory_hotplug.c | 34 +++--- > 1 file changed, 7 insertions(+), 27 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index f294918f7211..8dafa1ba8d9f 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -393,6 +393,9 @@ static void shrink_zone_span(struct zone *zone, unsigned > long start_pfn, > if (pfn) { > zone->zone_start_pfn = pfn; > zone->spanned_pages = zone_end_pfn - pfn; > + } else { > + zone->zone_start_pfn = 0; > + zone->spanned_pages = 0; > } > } else if (zone_end_pfn == end_pfn) { > /* > @@ -405,34 +408,11 @@ static void shrink_zone_span(struct zone *zone, > unsigned long start_pfn, > start_pfn); > if (pfn) > zone->spanned_pages = pfn - zone_start_pfn + 1; > + else { > + zone->zone_start_pfn = 0; > + zone->spanned_pages = 0; > + } > } > - > - /* > - * The section is not biggest or smallest mem_section in the zone, it > - * only creates a hole in the zone. So in this case, we need not > - * change the zone. But perhaps, the zone has only hole data. Thus > - * it check the zone has only hole or not. > - */ > - pfn = zone_start_pfn; > - for (; pfn < zone_end_pfn; pfn += PAGES_PER_SUBSECTION) { > - if (unlikely(!pfn_to_online_page(pfn))) > - continue; > - > - if (page_zone(pfn_to_page(pfn)) != zone) > - continue; > - > - /* Skip range to be removed */ > - if (pfn >= start_pfn && pfn < end_pfn) > - continue; > - > - /* If we find valid section, we have nothing to do */ > - zone_span_writeunlock(zone); > - return; > - } > - > - /* The zone has no valid section */ > - zone->zone_start_pfn = 0; > - zone->spanned_pages = 0; > zone_span_writeunlock(zone); > } > > -- > 2.21.0 > -- Oscar Salvador SUSE L3
Re: [PATCH v6 07/10] mm/memory_hotplug: We always have a zone in find_(smallest|biggest)_section_pfn
On Sun, Oct 06, 2019 at 10:56:43AM +0200, David Hildenbrand wrote: > With shrink_pgdat_span() out of the way, we now always have a valid > zone. > > Cc: Andrew Morton > Cc: Oscar Salvador > Cc: David Hildenbrand > Cc: Michal Hocko > Cc: Pavel Tatashin > Cc: Dan Williams > Cc: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador > --- > mm/memory_hotplug.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c > index bf5173e7913d..f294918f7211 100644 > --- a/mm/memory_hotplug.c > +++ b/mm/memory_hotplug.c > @@ -337,7 +337,7 @@ static unsigned long find_smallest_section_pfn(int nid, > struct zone *zone, > if (unlikely(pfn_to_nid(start_pfn) != nid)) > continue; > > - if (zone && zone != page_zone(pfn_to_page(start_pfn))) > + if (zone != page_zone(pfn_to_page(start_pfn))) > continue; > > return start_pfn; > @@ -362,7 +362,7 @@ static unsigned long find_biggest_section_pfn(int nid, > struct zone *zone, > if (unlikely(pfn_to_nid(pfn) != nid)) > continue; > > - if (zone && zone != page_zone(pfn_to_page(pfn))) > + if (zone != page_zone(pfn_to_page(pfn))) > continue; > > return pfn; > -- > 2.21.0 > -- Oscar Salvador SUSE L3
Re: [PATCH v6 06/10] mm/memory_hotplug: Poison memmap in remove_pfn_range_from_zone()
On Sun, Oct 06, 2019 at 10:56:42AM +0200, David Hildenbrand wrote: > Let's poison the pages similar to when adding new memory in > sparse_add_section(). Also call remove_pfn_range_from_zone() from > memunmap_pages(), so we can poison the memmap from there as well. > > While at it, calculate the pfn in memunmap_pages() only once. > > Cc: Andrew Morton > Cc: David Hildenbrand > Cc: Oscar Salvador > Cc: Michal Hocko > Cc: Pavel Tatashin > Cc: Dan Williams > Signed-off-by: David Hildenbrand Looks good to me. It is fine as long as we do not access those pages later on, and if my eyes did not lie to me, we have the proper checks (pfn_to_online_page) in place to avoid that, so: Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
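The check referred to is the usual guard pattern below (a sketch only; pfn_to_online_page() returns NULL when the pfn's section is not online, so walkers never dereference a memmap that has been poisoned on offline/removal):

	struct page *page = pfn_to_online_page(pfn);

	if (!page)
		continue;	/* section offline or removed: do not touch its memmap */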
Re: [PATCH v6 05/10] mm/memory_hotplug: Shrink zones when offlining memory
On Sun, Oct 06, 2019 at 10:56:41AM +0200, David Hildenbrand wrote: > Fixes: d0dc12e86b31 ("mm/memory_hotplug: optimize memory hotplug") > Signed-off-by: David Hildenbrand I did not see anything wrong with the taken approach, and makes sense to me. The only thing that puzzles me is we seem to not balance spanned_pages for ZONE_DEVICE anymore. memremap_pages() increments them via move_pfn_range_to_zone, but we skip ZONE_DEVICE in remove_pfn_range_from_zone. That is not really related to this patch, so I might be missing something, but it caught my eye while reviewing this. Anyway, for this one: Reviewed-by: Oscar Salvador off-topic: I __think__ we really need to trim the CC list. > --- > arch/arm64/mm/mmu.c| 4 +--- > arch/ia64/mm/init.c| 4 +--- > arch/powerpc/mm/mem.c | 3 +-- > arch/s390/mm/init.c| 4 +--- > arch/sh/mm/init.c | 4 +--- > arch/x86/mm/init_32.c | 4 +--- > arch/x86/mm/init_64.c | 4 +--- > include/linux/memory_hotplug.h | 7 +-- > mm/memory_hotplug.c| 31 --- > mm/memremap.c | 2 +- > 10 files changed, 29 insertions(+), 38 deletions(-) > > diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c > index 60c929f3683b..d10247fab0fd 100644 > --- a/arch/arm64/mm/mmu.c > +++ b/arch/arm64/mm/mmu.c > @@ -1069,7 +1069,6 @@ void arch_remove_memory(int nid, u64 start, u64 size, > { > unsigned long start_pfn = start >> PAGE_SHIFT; > unsigned long nr_pages = size >> PAGE_SHIFT; > - struct zone *zone; > > /* >* FIXME: Cleanup page tables (also in arch_add_memory() in case > @@ -1078,7 +1077,6 @@ void arch_remove_memory(int nid, u64 start, u64 size, >* unplug. ARCH_ENABLE_MEMORY_HOTREMOVE must not be >* unlocked yet. >*/ > - zone = page_zone(pfn_to_page(start_pfn)); > - __remove_pages(zone, start_pfn, nr_pages, altmap); > + __remove_pages(start_pfn, nr_pages, altmap); > } > #endif > diff --git a/arch/ia64/mm/init.c b/arch/ia64/mm/init.c > index bf9df2625bc8..a6dd80a2c939 100644 > --- a/arch/ia64/mm/init.c > +++ b/arch/ia64/mm/init.c > @@ -689,9 +689,7 @@ void arch_remove_memory(int nid, u64 start, u64 size, > { > unsigned long start_pfn = start >> PAGE_SHIFT; > unsigned long nr_pages = size >> PAGE_SHIFT; > - struct zone *zone; > > - zone = page_zone(pfn_to_page(start_pfn)); > - __remove_pages(zone, start_pfn, nr_pages, altmap); > + __remove_pages(start_pfn, nr_pages, altmap); > } > #endif > diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c > index be941d382c8d..97e5922cb52e 100644 > --- a/arch/powerpc/mm/mem.c > +++ b/arch/powerpc/mm/mem.c > @@ -130,10 +130,9 @@ void __ref arch_remove_memory(int nid, u64 start, u64 > size, > { > unsigned long start_pfn = start >> PAGE_SHIFT; > unsigned long nr_pages = size >> PAGE_SHIFT; > - struct page *page = pfn_to_page(start_pfn) + vmem_altmap_offset(altmap); > int ret; > > - __remove_pages(page_zone(page), start_pfn, nr_pages, altmap); > + __remove_pages(start_pfn, nr_pages, altmap); > > /* Remove htab bolted mappings for this section of memory */ > start = (unsigned long)__va(start); > diff --git a/arch/s390/mm/init.c b/arch/s390/mm/init.c > index a124f19f7b3c..c1d96e588152 100644 > --- a/arch/s390/mm/init.c > +++ b/arch/s390/mm/init.c > @@ -291,10 +291,8 @@ void arch_remove_memory(int nid, u64 start, u64 size, > { > unsigned long start_pfn = start >> PAGE_SHIFT; > unsigned long nr_pages = size >> PAGE_SHIFT; > - struct zone *zone; > > - zone = page_zone(pfn_to_page(start_pfn)); > - __remove_pages(zone, start_pfn, nr_pages, altmap); > + __remove_pages(start_pfn, nr_pages, altmap); > vmem_remove_mapping(start, size); > } > #endif /* 
CONFIG_MEMORY_HOTPLUG */ > diff --git a/arch/sh/mm/init.c b/arch/sh/mm/init.c > index dfdbaa50946e..d1b1ff2be17a 100644 > --- a/arch/sh/mm/init.c > +++ b/arch/sh/mm/init.c > @@ -434,9 +434,7 @@ void arch_remove_memory(int nid, u64 start, u64 size, > { > unsigned long start_pfn = PFN_DOWN(start); > unsigned long nr_pages = size >> PAGE_SHIFT; > - struct zone *zone; > > - zone = page_zone(pfn_to_page(start_pfn)); > - __remove_pages(zone, start_pfn, nr_pages, altmap); > + __remove_pages(start_pfn, nr_pages, altmap); > } > #endif /* CONFIG_MEMORY_HOTPLUG */ > diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c > index 930edeb41ec3..0a74407ef92e 100644 > --- a/arch/x86/
Re: [PATCH v6 00/10] mm/memory_hotplug: Shrink zones before removing memory
On Mon, Dec 02, 2019 at 10:09:51AM +0100, David Hildenbrand wrote: > @Michal, @Oscar, can some of you at least have a patch #5 now so we can > proceed with that? (the other patches can stay in -next some time longer) Hi, I will be having a look at patch#5 shortly. Thanks for the reminder -- Oscar Salvador SUSE L3
Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
On Mon, Jul 15, 2019 at 01:10:33PM +0200, David Hildenbrand wrote: > On 01.07.19 12:27, Michal Hocko wrote: > > On Mon 01-07-19 11:36:44, Oscar Salvador wrote: > >> On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote: > >>> Yeah, we do not allow to offline multi zone (node) ranges so the current > >>> code seems to be over engineered. > >>> > >>> Anyway, I am wondering why do we have to strictly check for already > >>> removed nodes links. Is the sysfs code going to complain we we try to > >>> remove again? > >> > >> No, sysfs will silently "fail" if the symlink has already been removed. > >> At least that is what I saw last time I played with it. > >> > >> I guess the question is what if sysfs handling changes in the future > >> and starts dropping warnings when trying to remove a symlink is not there. > >> Maybe that is unlikely to happen? > > > > And maybe we handle it then rather than have a static allocation that > > everybody with hotremove configured has to pay for. > > > > So what's the suggestion? Dropping the nodemask_t completely and calling > sysfs_remove_link() on already potentially removed links? > > Of course, we can also just use mem_blk->nid and rest assured that it > will never be called for memory blocks belonging to multiple nodes. Hi David, While it is easy to construct a scenario where a memblock belongs to multiple nodes, I have to confess that I have not yet seen that in a real-world scenario. That said, I think the least risky way is to just drop the nodemask_t and not care about calling sysfs_remove_link() for already removed links. As I said, sysfs_remove_link() will silently fail when it fails to find the symlink, so I do not think it is a big deal. -- Oscar Salvador SUSE L3
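A minimal sketch of the direction being suggested, assuming each memory block belongs to a single node so mem_blk->nid is sufficient (hypothetical code, not the merged version; a sysfs_remove_link() call on an already-removed link is simply tolerated):

	static void unregister_memory_block_under_nodes(struct memory_block *mem_blk)
	{
		/* Remove the node -> memory block link; it may already be gone. */
		sysfs_remove_link(&node_devices[mem_blk->nid]->dev.kobj,
				  kobject_name(&mem_blk->dev.kobj));
		/* Remove the memory block -> node link; it may already be gone. */
		sysfs_remove_link(&mem_blk->dev.kobj,
				  kobject_name(&node_devices[mem_blk->nid]->dev.kobj));
	}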
Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
On Mon, Jul 01, 2019 at 10:51:44AM +0200, Michal Hocko wrote: > Yeah, we do not allow to offline multi zone (node) ranges so the current > code seems to be over engineered. > > Anyway, I am wondering why do we have to strictly check for already > removed nodes links. Is the sysfs code going to complain we we try to > remove again? No, sysfs will silently "fail" if the symlink has already been removed. At least that is what I saw last time I played with it. I guess the question is what if sysfs handling changes in the future and starts dropping warnings when trying to remove a symlink that is not there. Maybe that is unlikely to happen? -- Oscar Salvador SUSE L3
Re: [PATCH v3 02/11] s390x/mm: Fail when an altmap is used for arch_add_memory()
On Mon, May 27, 2019 at 01:11:43PM +0200, David Hildenbrand wrote: > ZONE_DEVICE is not yet supported, fail if an altmap is passed, so we > don't forget arch_add_memory()/arch_remove_memory() when unlocking > support. > > Cc: Martin Schwidefsky > Cc: Heiko Carstens > Cc: Andrew Morton > Cc: Michal Hocko > Cc: Mike Rapoport > Cc: David Hildenbrand > Cc: Vasily Gorbik > Cc: Oscar Salvador > Suggested-by: Dan Williams > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 11/11] mm/memory_hotplug: Remove "zone" parameter from sparse_remove_one_section
On Mon, May 27, 2019 at 01:11:52PM +0200, David Hildenbrand wrote: > The parameter is unused, so let's drop it. Memory removal paths should > never care about zones. This is the job of memory offlining and will > require more refactorings. > > Reviewed-by: Dan Williams > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 10/11] mm/memory_hotplug: Make unregister_memory_block_under_nodes() never fail
On Mon, May 27, 2019 at 01:11:51PM +0200, David Hildenbrand wrote: > We really don't want anything during memory hotunplug to fail. > We always pass a valid memory block device, that check can go. Avoid > allocating memory and eventually failing. As we are always called under > lock, we can use a static piece of memory. This avoids having to put > the structure onto the stack, having to guess about the stack size > of callers. > > Patch inspired by a patch from Oscar Salvador. > > In the future, there might be no need to iterate over nodes at all. > mem->nid should tell us exactly what to remove. Memory block devices > with mixed nodes (added during boot) should properly fenced off and never > removed. > > Cc: Greg Kroah-Hartman > Cc: "Rafael J. Wysocki" > Cc: Alex Deucher > Cc: "David S. Miller" > Cc: Mark Brown > Cc: Chris Wilson > Cc: David Hildenbrand > Cc: Oscar Salvador > Cc: Andrew Morton > Cc: Jonathan Cameron > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 01/11] mm/memory_hotplug: Simplify and fix check_hotplug_memory_range()
On Mon, May 27, 2019 at 01:11:42PM +0200, David Hildenbrand wrote: > By converting start and size to page granularity, we actually ignore > unaligned parts within a page instead of properly bailing out with an > error. > > Cc: Andrew Morton > Cc: Oscar Salvador > Cc: Michal Hocko > Cc: David Hildenbrand > Cc: Pavel Tatashin > Cc: Qian Cai > Cc: Wei Yang > Cc: Arun KS > Cc: Mathieu Malaterre > Reviewed-by: Dan Williams > Reviewed-by: Wei Yang > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 2/6] mm/memory_hotplug: make add_memory() take the device_hotplug_lock
On Thu, Sep 27, 2018 at 11:25:50AM +0200, David Hildenbrand wrote: > Reviewed-by: Pavel Tatashin > Reviewed-by: Rafael J. Wysocki > Reviewed-by: Rashmica Gupta > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 1/6] mm/memory_hotplug: make remove_memory() take the device_hotplug_lock
On Thu, Sep 27, 2018 at 11:25:49AM +0200, David Hildenbrand wrote: > Reviewed-by: Pavel Tatashin > Reviewed-by: Rafael J. Wysocki > Reviewed-by: Rashmica Gupta > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 6/6] memory-hotplug.txt: Add some details about locking internals
On Thu, Sep 27, 2018 at 11:25:54AM +0200, David Hildenbrand wrote: > Cc: Jonathan Corbet > Cc: Michal Hocko > Cc: Andrew Morton > Reviewed-by: Pavel Tatashin > Reviewed-by: Rashmica Gupta > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH v3 3/6] mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock
On Thu, Sep 27, 2018 at 11:25:51AM +0200, David Hildenbrand wrote: > Reviewed-by: Pavel Tatashin > Reviewed-by: Rashmica Gupta > Signed-off-by: David Hildenbrand Reviewed-by: Oscar Salvador -- Oscar Salvador SUSE L3
Re: [PATCH RFCv2 0/6] mm: online/offline_pages called w.o. mem_hotplug_lock
On Tue, Aug 21, 2018 at 12:44:12PM +0200, David Hildenbrand wrote: > This is the same approach as in the first RFC, but this time without > exporting device_hotplug_lock (requested by Greg) and with some more > details and documentation regarding locking. Tested only on x86 so far. Hi David, I would like to review this but I am on vacation, so I will not be able to get to it soon. I plan to do it once I am back. Thanks -- Oscar Salvador SUSE L3