Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Thu, Apr 15, 2021 at 01:13:13AM -0600, Yu Zhao wrote: > Page table scanning doesn't replace the existing rmap walk. It is > complementary and only happens when it is likely that most of the > pages on a system under pressure have been referenced, i.e., out of > *inactive* pages, by definition of the existing implementation. Under > such a condition, scanning *active* pages one by one with the rmap is > likely to cost more than scanning them all at once via page tables. > When we evict *inactive* pages, we still use the rmap and share a > common path with the existing code. > > Page table scanning falls back to the rmap walk if the page tables of > a process are apparently sparse, i.e., rss < size of the page tables.

Could you expand a bit more as to how page table scanning and rmap scanning coexist? Say, there is some memory pressure and you want to identify good candidate pages to reclaim. You could scan processes with the page table scanning method, or you could scan the lru list through the rmap method. How do you mix the two - when you use the lru/rmap method, won't you encounter both pages that are mapped in "dense" processes where scanning page tables would have been better, and pages that are mapped in "sparse" processes where you are happy to be using rmap, and even pages that are mapped into both types of processes at once? Or, can you change the lru/rmap scan so that it will efficiently skip over all dense processes when you use it?

Thanks, -- Michel "walken" Lespinasse
Re: [PATCH v2 00/16] Multigenerational LRU Framework
Yu Zhao writes: > On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen wrote: >> >> > We fall back to the rmap when it's obviously not smart to do so. There >> > is still a lot of room for improvement in this function though, i.e., >> > it should be per VMA and NUMA aware. >> >> Okay so it's more a question to tune the cross over heuristic. That >> sounds much easier than replacing everything. >> >> Of course long term it might be a problem to maintain too many >> different ways to do things, but I suppose short term it's a reasonable >> strategy. > > Hi Rik, Ying, > > Sorry for being persistent. I want to make sure we are on the same page: > > Page table scanning doesn't replace the existing rmap walk. It is > complementary and only happens when it is likely that most of the > pages on a system under pressure have been referenced, i.e., out of > *inactive* pages, by definition of the existing implementation. Under > such a condition, scanning *active* pages one by one with the rmap is > likely to cost more than scanning them all at once via page tables. > When we evict *inactive* pages, we still use the rmap and share a > common path with the existing code. > > Page table scanning falls back to the rmap walk if the page tables of > a process are apparently sparse, i.e., rss < size of the page tables. > > I should have clarified this at the very beginning of the discussion. > But it has become so natural to me and I assumed we'd all see it this > way. > > Your concern regarding the NUMA optimization is still valid, and it's > a high priority.

Hi, Yu,

In general, I think it's a good idea to combine the page table scanning and rmap scanning in page reclaiming. For example, when the working set transitions, we can take advantage of the fast page table scanning to identify the new working set quickly, while we can fall back to the rmap scanning if the page table scanning doesn't help.

Best Regards, Huang, Ying
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 9:00 PM Andi Kleen wrote: > > > We fall back to the rmap when it's obviously not smart to do so. There > > is still a lot of room for improvement in this function though, i.e., > > it should be per VMA and NUMA aware. > > Okay so it's more a question to tune the cross over heuristic. That > sounds much easier than replacing everything. > > Of course long term it might be a problem to maintain too many > different ways to do things, but I suppose short term it's a reasonable > strategy. Hi Rik, Ying, Sorry for being persistent. I want to make sure we are on the same page: Page table scanning doesn't replace the existing rmap walk. It is complementary and only happens when it is likely that most of the pages on a system under pressure have been referenced, i.e., out of *inactive* pages, by definition of the existing implementation. Under such a condition, scanning *active* pages one by one with the rmap is likely to cost more than scanning them all at once via page tables. When we evict *inactive* pages, we still use the rmap and share a common path with the existing code. Page table scanning falls back to the rmap walk if the page tables of a process are apparently sparse, i.e., rss < size of the page tables. I should have clarified this at the very beginning of the discussion. But it has become so natural to me and I assumed we'd all see it this way. Your concern regarding the NUMA optimization is still valid, and it's a high priority. Thanks.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
> We fall back to the rmap when it's obviously not smart to do so. There > is still a lot of room for improvement in this function though, i.e., > it should be per VMA and NUMA aware. Okay so it's more a question to tune the cross over heuristic. That sounds much easier than replacing everything. Of course long term it might be a problem to maintain too many different ways to do things, but I suppose short term it's a reasonable strategy. -Andi
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote: > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner wrote: > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote: > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote: > > > > Profiles would be interesting, because it sounds to me like reclaim > > > > *might* be batching page cache removal better (e.g. fewer, larger > > > > batches) and so spending less time contending on the mapping tree > > > > lock... > > > > > > > > IOWs, I suspect this result might actually be a result of less lock > > > > contention due to a change in batch processing characteristics of > > > > the new algorithm rather than it being a "better" algorithm... > > > > > > I appreciate the profile. But there is no batching in > > > __remove_mapping() -- it locks the mapping for each page, and > > > therefore the lock contention penalizes the mainline and this patchset > > > equally. It looks worse on your system because the four kswapd threads > > > from different nodes were working on the same file. > > > > I think you misunderstand exactly what I mean by "batching" here. > > I'm not talking about doing multiple pieces of work under a single > > lock. What I mean is that the overall amount of work done in a > > single reclaim scan (i.e a "reclaim batch") is packaged differently. > > > > We already batch up page reclaim via building a page list and then > > passing it to shrink_page_list() to process the batch of pages in a > > single pass. Each page in this page list batch then calls > > remove_mapping() to pull the page from the LRU, and we have a run of > > contention between the foreground read() thread and the background > > kswapd. > > > > If the size or nature of the pages in the batch passed to > > shrink_page_list() changes, then the amount of time a reclaim batch > > is going to put pressure on the mapping tree lock will also change. > > That's the "change in batching behaviour" I'm referring to here.
I > > haven't read through the patchset to determine if you change the > > shrink_page_list() algorithm, but it likely changes what is passed > > to be reclaimed and that in turn changes the locking patterns that > > fall out of shrink_page_list... > > Ok, if we are talking about the size of the batch passed to > shrink_page_list(), both the mainline and this patchset cap it at > SWAP_CLUSTER_MAX, which is 32. There are corner cases, but when > running fio/io_uring, it's safe to say both use 32.

You're still looking at micro-scale behaviour, not the larger-scale batching effects. Are we passing SWAP_CLUSTER_MAX groups of pages to shrink_page_list() at a different rate? When I say "batch of work" when talking about the page cache cycling *500 thousand pages a second* through the cache, I'm not talking about batches of 32 pages. I'm talking about the entire batch of work kswapd does in an invocation cycle. Is it scanning 100k pages 10 times a second? or 10k pages a hundred times a second? How long does a batch take to run? how long does it sleep between processing batches? Is there any change in these metrics as a result of the multi-gen LRU patches?

Basically, we're looking at how access to the mapping lock is changing the contention profile, and whether that is significant or not. I suspect it is, because when you have highly contended locks and you do something external that reduces unrelated lock contention, it's because that external thing is taking more time to do and so there's less time to spend hitting locks hard... As such, I don't think this test is a good measure of the multi-gen LRU patches at all - performance is dominated by the severity of lock contention external to the LRU scanning algorithm, and it's hard to infer anything through such lock contention.

> I don't want to paste everything here -- they'd clutter. Please see > all the detailed profiles in the attachment. Let me know if their > formats are not to your liking.
> I still have the raw perf.data.

Which makes the discussion thread just about impossible to follow or comment on. Please just post the relevant excerpt of the stack profile that you are commenting on.

> > > And I plan to reach out to other communities, e.g., PostgreSQL, to > > > benchmark the patchset. I heard they have been complaining about the > > > buffered io performance under memory pressure. Any other benchmarks > > > you'd suggest? > > > > > > BTW, you might find another surprise in how less frequently slab > > > shrinkers are called under memory pressure, because this patchset is a > > > lot better at finding pages to reclaim and therefore doesn't overkill > > > slabs. > > > > That's actually very likely to be a Bad Thing and cause unexpected > > performance and OOM based regressions. When the machine finally runs > > out of page cache it can easily reclaim, it's going to get stuck > > with long tail latencies reclaiming huge slab caches as they've had > > no substantial ongoing pressure put on them to keep them in bal
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 08:43:36AM -0600, Jens Axboe wrote: > On 4/13/21 5:14 PM, Dave Chinner wrote: > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > >> On 4/13/21 1:51 AM, SeongJae Park wrote: > >>> From: SeongJae Park > >>> > >>> Hello, > >>> > >>> > >>> Very interesting work, thank you for sharing this :) > >>> > >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > >>> > What's new in v2 > > Special thanks to Jens Axboe for reporting a regression in buffered > I/O and helping test the fix. > >>> > >>> Is the discussion open? If so, could you please give me a link? > >> > >> I wasn't on the initial post (or any of the lists it was posted to), but > >> it's on the google page reclaim list. Not sure if that is public or not. > >> > >> tldr is that I was pretty excited about this work, as buffered IO tends > >> to suck (a lot) for high throughput applications. My test case was > >> pretty simple: > >> > >> Randomly read a fast device, using 4k buffered IO, and watch what > >> happens when the page cache gets filled up. For this particular test, > >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec > >> with kswapd using a lot of CPU trying to keep up. That's mainline > >> behavior. > > > > I see this exact same behaviour here, too, but I RCA'd it to > > contention between the inode and memory reclaim for the mapping > > structure that indexes the page cache. Basically the mapping tree > > lock is the contention point here - you can either be adding pages > > to the mapping during IO, or memory reclaim can be removing pages > > from the mapping, but we can't do both at once. 
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > - 20.06%  0.00%  [kernel]  [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57%
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 1:42 PM Rik van Riel wrote: > > On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote: > > On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel > > wrote: > > > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote: > > > > >2) It will not scan PTE tables under non-leaf PMD entries > > > > > that > > > > > do not > > > > > have the accessed bit set, when > > > > > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > > > > > > > This assumes that workloads have reasonable locality. Could > > > > there > > > > be a worst case where only one or two pages in each PTE are used, > > > > so this PTE skipping trick doesn't work? > > > > > > Databases with large shared memory segments shared between > > > many processes come to mind as a real-world example of a > > > worst case scenario. > > > > Well, I don't think you two are talking about the same thing. Andi > > was > > focusing on sparsity. Your example seems to be about sharing, i.e., > > high mapcount. Of course both can happen at the same time, as I > > tested > > here: > > https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t > > > > I'm skeptical that shared memory used by databases is that sparse, > > i.e., one page per PTE table, because the extremely low locality > > would > > heavily penalize their performance. But my knowledge in databases is > > close to zero. So feel free to enlighten me or just ignore what I > > said. > > A database may have a 200GB shared memory segment, > and a worker task that gets spun up to handle a > query might access only 1MB of memory to answer > that query. > > That memory could be from anywhere inside the > shared memory segment. Maybe some of the accesses > are more dense, and others more sparse, who knows? > > A lot of the locality > will depend on how memory > space inside the shared memory segment is reclaimed > and recycled inside the database.

Thanks. Yeah, I guess we'll just need to see more benchmarks from the database realm. Stay tuned :)
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 8:43 AM Jens Axboe wrote: > > On 4/13/21 5:14 PM, Dave Chinner wrote: > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > >> On 4/13/21 1:51 AM, SeongJae Park wrote: > >>> From: SeongJae Park > >>> > >>> Hello, > >>> > >>> > >>> Very interesting work, thank you for sharing this :) > >>> > >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > >>> > What's new in v2 > > Special thanks to Jens Axboe for reporting a regression in buffered > I/O and helping test the fix. > >>> > >>> Is the discussion open? If so, could you please give me a link? > >> > >> I wasn't on the initial post (or any of the lists it was posted to), but > >> it's on the google page reclaim list. Not sure if that is public or not. > >> > >> tldr is that I was pretty excited about this work, as buffered IO tends > >> to suck (a lot) for high throughput applications. My test case was > >> pretty simple: > >> > >> Randomly read a fast device, using 4k buffered IO, and watch what > >> happens when the page cache gets filled up. For this particular test, > >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec > >> with kswapd using a lot of CPU trying to keep up. That's mainline > >> behavior. > > > > I see this exact same behaviour here, too, but I RCA'd it to > > contention between the inode and memory reclaim for the mapping > > structure that indexes the page cache. Basically the mapping tree > > lock is the contention point here - you can either be adding pages > > to the mapping during IO, or memory reclaim can be removing pages > > from the mapping, but we can't do both at once. 
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > - 20.06%  0.00%  [kernel]  [k] kswapd
> >    - 20.06% kswapd
> >       - 20.05% balance_pgdat
> >          - 20.03% shrink_node
> >             - 19.92% shrink_lruvec
> >                - 19.91% shrink_inactive_list
> >                   - 19.22% shrink_page_list
> >                      - 17.51% __remove_mapping
> >                         - 14.16% _raw_spin_lock_irqsave
> >                            - 14.14% do_raw_spin_lock
> >                                 __pv_queued_spin_lock_slowpath
> >                         - 1.56% __delete_from_page_cache
> >                              0.63% xas_store
> >                         - 0.78% _raw_spin_unlock_irqrestore
> >                            - 0.69% do_raw_spin_unlock
> >                                 __raw_callee_save___pv_queued_spin_unlock
> >                      - 0.82% free_unref_page_list
> >                         - 0.72% free_unref_page_commit
> >                              0.57% free_pcppa
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, 2021-04-14 at 13:14 -0600, Yu Zhao wrote: > On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel > wrote: > > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote: > > > >2) It will not scan PTE tables under non-leaf PMD entries > > > > that > > > > do not > > > > have the accessed bit set, when > > > > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > > > > > This assumes that workloads have reasonable locality. Could > > > there > > > be a worst case where only one or two pages in each PTE are used, > > > so this PTE skipping trick doesn't work? > > > > Databases with large shared memory segments shared between > > many processes come to mind as a real-world example of a > > worst case scenario. > > Well, I don't think you two are talking about the same thing. Andi > was > focusing on sparsity. Your example seems to be about sharing, i.e., > high mapcount. Of course both can happen at the same time, as I > tested > here: > https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t > > I'm skeptical that shared memory used by databases is that sparse, > i.e., one page per PTE table, because the extremely low locality > would > heavily penalize their performance. But my knowledge in databases is > close to zero. So feel free to enlighten me or just ignore what I > said.

A database may have a 200GB shared memory segment, and a worker task that gets spun up to handle a query might access only 1MB of memory to answer that query.

That memory could be from anywhere inside the shared memory segment. Maybe some of the accesses are more dense, and others more sparse, who knows?

A lot of the locality will depend on how memory space inside the shared memory segment is reclaimed and recycled inside the database.

-- All Rights Reversed.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 9:59 AM Rik van Riel wrote: > > On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote: > > >2) It will not scan PTE tables under non-leaf PMD entries that > > > do not > > > have the accessed bit set, when > > > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > > > This assumes that workloads have reasonable locality. Could there > > be a worst case where only one or two pages in each PTE are used, > > so this PTE skipping trick doesn't work? > > Databases with large shared memory segments shared between > many processes come to mind as a real-world example of a > worst case scenario.

Well, I don't think you two are talking about the same thing. Andi was focusing on sparsity. Your example seems to be about sharing, i.e., high mapcount. Of course both can happen at the same time, as I tested here: https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t

I'm skeptical that shared memory used by databases is that sparse, i.e., one page per PTE table, because the extremely low locality would heavily penalize their performance. But my knowledge in databases is close to zero. So feel free to enlighten me or just ignore what I said.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 9:51 AM Andi Kleen wrote: > > >2) It will not scan PTE tables under non-leaf PMD entries that do not > > have the accessed bit set, when > > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > This assumes that workloads have reasonable locality. Could there > be a worst case where only one or two pages in each PTE are used, > so this PTE skipping trick doesn't work?

Hi Andi,

Yes, it does make that assumption. And yes, there could. AFAIK, only x86 supports this. I wrote a crude test to verify this, and it maps exactly one page within each PTE table. And I found page table scanning didn't underperform the rmap: https://lore.kernel.org/linux-mm/yhful%2fddtiml4...@google.com/#t

The reason (sorry for repeating this) is that page table scanning is conditional:

    bool should_skip_mm()
    {
        ...
        /* leave the legwork to the rmap if mapped pages are too sparse */
        if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
            return true;
    }

We fall back to the rmap when it's obviously not smart to do so. There is still a lot of room for improvement in this function though, i.e., it should be per VMA and NUMA aware.

Note that page table scanning doesn't replace the existing rmap scan. It's complementary, and it happens when there is a good chance that most of the pages on a system under pressure have been referenced. IOW, scanning them one by one with the rmap would cost more than scanning them all at once via page tables.

Sounds reasonable? Thanks.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 7:52 AM Rik van Riel wrote: > > On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote: > > Yu Zhao writes: > > > > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying > > > wrote: > > > > > > > NUMA Optimization > > > - > > > Support NUMA policies and per-node RSS counters. > > > > > > We only can move forward one step at a time. Fair? > > > > You don't need to implement that now definitely. But we can discuss > > the > > possible solution now. > That was my intention, too. I want to make sure we don't > end up "painting ourselves into a corner" by moving in some > direction we have no way to get out of. > > The patch set looks promising, but we need some plan to > avoid the worst case behaviors that forced us into rmap > based scanning initially.

Hi Rik,

By design, we voluntarily fall back to the rmap when page tables of a process are too sparse. At the moment, we have

    bool should_skip_mm()
    {
        ...
        /* leave the legwork to the rmap if mapped pages are too sparse */
        if (RSS < mm_pgtables_bytes(mm) / PAGE_SIZE)
            return true;
    }

So yes, I agree we have more work to do in this direction, the fallback should be per VMA and NUMA aware. Note that once the fallback happens, it shares the same path with the existing implementation.

Probably I should have clarified that this patchset does not replace the rmap with page table scanning. It conditionally uses page table scanning when it thinks most of the pages on a system could have been referenced, i.e., when it thinks walking the rmap would be less efficient, based on generations. It *unconditionally* walks the rmap to scan each of the pages it eventually tries to evict, because scanning page tables for a small batch of pages it wants to evict is too costly.
One of the simple ways to look at how the mixture of page table scanning and the rmap works is:
1) it scans page tables (but might fall back to the rmap) to deactivate pages from the active list to the inactive list, when the inactive list becomes empty
2) it walks the rmap (not page table scanning) when it evicts individual pages from the inactive list.

Does it make sense? I fully agree "the mixture" is currently statistically decided, and it must be made worst-case scenario proof.

> > Note that it's possible that only some processes are bound to some > > NUMA > > nodes, while other processes aren't bound. > For workloads like PostgreSQL or Oracle, it is common > to have maybe 70% of memory in a large shared memory > segment, spread between all the NUMA nodes, and mapped > into hundreds, if not thousands, of processes in the > system.

I do plan to reach out to the PostgreSQL community and ask for help to benchmark this patchset. Will keep everybody posted.

> Now imagine we have an 8 node system, and memory > pressure in the DMA32 zone of node 0. > > How will the current VM behave?

At the moment, we don't plan to make the DMA32 zone reclaim a priority. Rather, I'd suggest
1) stay with the existing implementation
2) boost the watermark for DMA32

> What will the virtual scanning need to do?

The high priority items are:

To-do List
==========
KVM Optimization
  Support shadow page table scanning.
NUMA Optimization
  Support NUMA policies and per-node RSS counters.

We are just trying to focus our resources on the trending use cases. Reasonable?

> If we can come up with a solution to make virtual > scanning scale for that kind of workload, great.

It won't be easy, but IMO nothing worth doing is easy :)

> If not ... 
if it turns out most of the benefits of > the multigenerational LRU framework come from sorting > the pages into multiple LRUs, and from being able > to easily reclaim unmapped pages before having to > scan mapped ones, could it be an idea to implement > that first, independently from virtual scanning?

This option is on the table considering the possibilities:
1) there are unforeseeable problems we couldn't solve
2) sorting pages alone has demonstrated its standalone value

I guess 2) alone will help people heavily using page cache. Google isn't one of them though. Personally I'm neutral (at least trying to be), and my goal is to accommodate everybody as best as I can.

> I am all for improving our page reclaim system, I > just want to make sure we don't revisit the old traps > that forced us where we are today :)

Yeah, I do see your concerns and we need more data. Any suggestions on benchmarks you'd be interested in? Thanks.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
Hello Yu, On Tue, Apr 13, 2021 at 12:56:17AM -0600, Yu Zhao wrote: > What's new in v2 > > Special thanks to Jens Axboe for reporting a regression in buffered > I/O and helping test the fix. > > This version includes the support of tiers, which represent levels of > usage from file descriptors only. Pages accessed N times via file > descriptors belong to tier order_base_2(N). Each generation contains > at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2 > bits in page->flags. In contrast to moving across generations which > requires the lru lock, moving across tiers only involves an atomic > operation on page->flags and therefore has a negligible cost. A > feedback loop modeled after the well-known PID controller monitors the > refault rates across all tiers and decides when to activate pages from > which tiers, on the reclaim path. Could you elaborate a bit more on the difference between generations and tiers? A refault, a page table reference, or a buffered read through a file descriptor ultimately all boil down to a memory access. The value of having that memory resident and the cost of bringing it in from backing storage should be the same regardless of how it's accessed by userspace; and whether it's an in-memory reference or a non-resident reference should have the same relative impact on the page's age. With that context, I don't understand why file descriptor refs and refaults get such special treatment. Could you shed some light here? > This feedback model has a few advantages over the current feedforward > model: > 1) It has a negligible overhead in the buffered I/O access path >because activations are done in the reclaim path. This is useful if the workload isn't reclaim bound, but it can be hazardous to defer work to reclaim, too. 
If you go through the git history, there have been several patches to soften access recognition inside reclaim because it can come with large latencies when page reclaim kicks in after a longer period with no memory pressure and doesn't have uptodate reference information - to the point where eating a few extra IOs tends to add less latency to the workload than waiting for reclaim to refresh its aging data. Could you elaborate a bit more on the tradeoff here?

> Highlights from the discussions on v1 > = > Thanks to Ying Huang and Dave Hansen for the comments and suggestions > on page table scanning. > > A simple worst-case scenario test did not find page table scanning > underperforms the rmap because of the following optimizations: > 1) It will not scan page tables from processes that have been sleeping > since the last scan. > 2) It will not scan PTE tables under non-leaf PMD entries that do not > have the accessed bit set, when > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > 3) It will not zigzag between the PGD table and the same PMD or PTE > table spanning multiple VMAs. In other words, it finishes all the > VMAs with the range of the same PMD or PTE table before it returns > to the PGD table. This optimizes workloads that have large numbers > of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. > > TLDR > > The current page reclaim is too expensive in terms of CPU usage and > often making poor choices about what to evict. We would like to offer > an alternative framework that is performant, versatile and > straightforward. > > Repo > > git fetch https://linux-mm.googlesource.com/page-reclaim > refs/changes/73/1173/1 > > Gerrit https://linux-mm-review.googlesource.com/c/page-reclaim/+/1173 > > Background > == > DRAM is a major factor in total cost of ownership, and improving > memory overcommit brings a high return on investment.

RAM cost on one hand. 
On the other, paging backends have seen a revolutionary explosion in iop/s capacity from solid state devices and CPUs that allow in-memory compression at scale, so a higher rate of paging (semi-random IO) and thus larger levels of overcommit are possible than ever before. There is a lot of new opportunity here. > Over the past decade of research and experimentation in memory > overcommit, we observed a distinct trend across millions of servers > and clients: the size of page cache has been decreasing because of > the growing popularity of cloud storage. Nowadays anon pages account > for more than 90% of our memory consumption and page cache contains > mostly executable pages. This gives the impression that because the number of setups heavily using the page cache has reduced somewhat, its significance is waning as well. I don't think that's true. I think we'll continue to have mainstream workloads for which the page cache is significant. Yes, the importance of paging anon memory more efficiently (or paging it at all again, for that matter), has increased dramatically. But IMO not because it's more prevalent, but rather because of the increase in paging capacity from the hardware side. It's
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, 2021-04-14 at 08:51 -0700, Andi Kleen wrote: > >2) It will not scan PTE tables under non-leaf PMD entries that > > do not > > have the accessed bit set, when > > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > This assumes that workloads have reasonable locality. Could there > be a worst case where only one or two pages in each PTE are used, > so this PTE skipping trick doesn't work?

Databases with large shared memory segments shared between many processes come to mind as a real-world example of a worst case scenario.

-- All Rights Reversed.
Re: [page-reclaim] Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 6:52 AM Rik van Riel wrote:
>
> On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> > Yu Zhao writes:
> >
> > > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote:
> > >
> > > NUMA Optimization
> > > -----------------
> > > Support NUMA policies and per-node RSS counters.
> > >
> > > We only can move forward one step at a time. Fair?
> >
> > You don't need to implement that now definitely. But we can discuss the
> > possible solution now.
>
> That was my intention, too. I want to make sure we don't
> end up "painting ourselves into a corner" by moving in some
> direction we have no way to get out of.
>
> The patch set looks promising, but we need some plan to
> avoid the worst case behaviors that forced us into rmap
> based scanning initially.
>
> > Note that it's possible that only some processes are bound to some NUMA
> > nodes, while other processes aren't bound.
>
> For workloads like PostgreSQL or Oracle, it is common
> to have maybe 70% of memory in a large shared memory
> segment, spread between all the NUMA nodes, and mapped
> into hundreds, if not thousands, of processes in the
> system.
>
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.
>
> How will the current VM behave? What will the virtual
> scanning need to do?
>
> If we can come up with a solution to make virtual
> scanning scale for that kind of workload, great.
>
> If not ... if it turns out most of the benefits of
> the multigenerational LRU framework come from sorting
> the pages into multiple LRUs, and from being able
> to easily reclaim unmapped pages before having to
> scan mapped ones, could it be an idea to implement
> that first, independently from virtual scanning?
>
> I am all for improving our page reclaim system, I
> just want to make sure we don't revisit the old traps
> that forced us where we are today :)

One potential idea is to take a hybrid approach of rmap and virtual scanning.
If the number of pages targeted to be scanned is below some threshold, use the rmap; otherwise, use virtual scanning. I think we can experimentally find a good value for that threshold.
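The hybrid policy suggested above can be sketched as a simple decision function. This is an illustrative sketch only: the function name, the percentage-based threshold, and its value are assumptions for the example, not anything the patchset implements.

```c
#include <assert.h>

/* Hypothetical sketch of the hybrid policy discussed above: when the
 * reclaim target is small relative to all mapped pages, per-page rmap
 * walks touch strictly fewer pages; past a threshold, walking the page
 * tables once amortizes better.  All names and the threshold are made
 * up for illustration. */

enum scan_method { SCAN_RMAP, SCAN_PAGE_TABLES };

static enum scan_method pick_scan_method(unsigned long nr_to_scan,
                                         unsigned long nr_mapped,
                                         unsigned long threshold_pct)
{
    /* Scan page tables only when the target covers a large share of
     * all mapped pages; otherwise the rmap visits fewer pages. */
    if (nr_to_scan * 100 >= nr_mapped * threshold_pct)
        return SCAN_PAGE_TABLES;
    return SCAN_RMAP;
}
```

For example, with a 50% threshold, a target of 10 pages out of 1000 mapped picks the rmap, while a target of 600 picks page table scanning. Finding the crossover point experimentally, as suggested, would amount to tuning `threshold_pct` per workload.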
Re: [PATCH v2 00/16] Multigenerational LRU Framework
> Now imagine we have an 8 node system, and memory > pressure in the DMA32 zone of node 0. The question is how much do we still care about DMA32. If there are problems they can probably just turn on the IOMMU for these IO mappings. -Andi
Re: [PATCH v2 00/16] Multigenerational LRU Framework
>2) It will not scan PTE tables under non-leaf PMD entries that do not > have the accessed bit set, when > CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. This assumes that workloads have reasonable locality. Could there be a worst case where only one or two pages in each PTE are used, so this PTE skipping trick doesn't work? -Andi
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On 4/13/21 5:14 PM, Dave Chinner wrote: > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: >> On 4/13/21 1:51 AM, SeongJae Park wrote: >>> From: SeongJae Park >>> >>> Hello, >>> >>> >>> Very interesting work, thank you for sharing this :) >>> >>> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: >>> What's new in v2 Special thanks to Jens Axboe for reporting a regression in buffered I/O and helping test the fix. >>> >>> Is the discussion open? If so, could you please give me a link? >> >> I wasn't on the initial post (or any of the lists it was posted to), but >> it's on the google page reclaim list. Not sure if that is public or not. >> >> tldr is that I was pretty excited about this work, as buffered IO tends >> to suck (a lot) for high throughput applications. My test case was >> pretty simple: >> >> Randomly read a fast device, using 4k buffered IO, and watch what >> happens when the page cache gets filled up. For this particular test, >> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec >> with kswapd using a lot of CPU trying to keep up. That's mainline >> behavior. > > I see this exact same behaviour here, too, but I RCA'd it to > contention between the inode and memory reclaim for the mapping > structure that indexes the page cache. Basically the mapping tree > lock is the contention point here - you can either be adding pages > to the mapping during IO, or memory reclaim can be removing pages > from the mapping, but we can't do both at once. 
>
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
>
> - 20.06%  0.00%  [kernel]  [k] kswapd
>   - 20.06% kswapd
>     - 20.05% balance_pgdat
>       - 20.03% shrink_node
>         - 19.92% shrink_lruvec
>           - 19.91% shrink_inactive_list
>             - 19.22% shrink_page_list
>               - 17.51% __remove_mapping
>                 - 14.16% _raw_spin_lock_irqsave
>                   - 14.14% do_raw_spin_lock
>                       __pv_queued_spin_lock_slowpath
>                 - 1.56% __delete_from_page_cache
>                     0.63% xas_store
>                 - 0.78% _raw_spin_unlock_irqrestore
>                   - 0.69% do_raw_spin_unlock
>                       __raw_callee_save___pv_queued_spin_unlock
>               - 0.82% free_unref_page_list
>                 - 0.72% free_unref_page_commit
>                     0.57% free_pcppages_bulk
>
> And these are the processes consuming CPU:
>
>  5171 root      20   0 1442496   5696   1284 R  99.7   0.0   1:07.7
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, 2021-04-14 at 16:27 +0800, Huang, Ying wrote:
> Yu Zhao writes:
>
> > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote:
> >
> > NUMA Optimization
> > -----------------
> > Support NUMA policies and per-node RSS counters.
> >
> > We only can move forward one step at a time. Fair?
>
> You don't need to implement that now definitely. But we can discuss the
> possible solution now.

That was my intention, too. I want to make sure we don't end up "painting ourselves into a corner" by moving in some direction we have no way to get out of.

The patch set looks promising, but we need some plan to avoid the worst case behaviors that forced us into rmap based scanning initially.

> Note that it's possible that only some processes are bound to some NUMA
> nodes, while other processes aren't bound.

For workloads like PostgreSQL or Oracle, it is common to have maybe 70% of memory in a large shared memory segment, spread between all the NUMA nodes, and mapped into hundreds, if not thousands, of processes in the system.

Now imagine we have an 8 node system, and memory pressure in the DMA32 zone of node 0.

How will the current VM behave? What will the virtual scanning need to do?

If we can come up with a solution to make virtual scanning scale for that kind of workload, great.

If not ... if it turns out most of the benefits of the multigenerational LRU framework come from sorting the pages into multiple LRUs, and from being able to easily reclaim unmapped pages before having to scan mapped ones, could it be an idea to implement that first, independently from virtual scanning?

I am all for improving our page reclaim system, I just want to make sure we don't revisit the old traps that forced us where we are today :)

--
All Rights Reversed.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 01:16:52AM -0600, Yu Zhao wrote: > On Tue, Apr 13, 2021 at 10:50 PM Dave Chinner wrote: > > > > On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote: > > > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote: > > > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > > > > > On 4/13/21 1:51 AM, SeongJae Park wrote: > > > > > > From: SeongJae Park > > > > > > > > > > > > Hello, > > > > > > > > > > > > > > > > > > Very interesting work, thank you for sharing this :) > > > > > > > > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao > > > > > > wrote: > > > > > > > > > > > >> What's new in v2 > > > > > >> > > > > > >> Special thanks to Jens Axboe for reporting a regression in buffered > > > > > >> I/O and helping test the fix. > > > > > > > > > > > > Is the discussion open? If so, could you please give me a link? > > > > > > > > > > I wasn't on the initial post (or any of the lists it was posted to), > > > > > but > > > > > it's on the google page reclaim list. Not sure if that is public or > > > > > not. > > > > > > > > > > tldr is that I was pretty excited about this work, as buffered IO > > > > > tends > > > > > to suck (a lot) for high throughput applications. My test case was > > > > > pretty simple: > > > > > > > > > > Randomly read a fast device, using 4k buffered IO, and watch what > > > > > happens when the page cache gets filled up. For this particular test, > > > > > we'll initially be doing 2.1GB/sec of IO, and then drop to > > > > > 1.5-1.6GB/sec > > > > > with kswapd using a lot of CPU trying to keep up. That's mainline > > > > > behavior. > > > > > > > > I see this exact same behaviour here, too, but I RCA'd it to > > > > contention between the inode and memory reclaim for the mapping > > > > structure that indexes the page cache. 
Basically the mapping tree
> > > > lock is the contention point here - you can either be adding pages
> > > > to the mapping during IO, or memory reclaim can be removing pages
> > > > from the mapping, but we can't do both at once.
> > > >
> > > > So we end up with kswapd spinning on the mapping tree lock like so
> > > > when doing 1.6GB/s in 4kB buffered IO:
> > > >
> > > > - 20.06%  0.00%  [kernel]  [k] kswapd
> > > >   - 20.06% kswapd
> > > >     - 20.05% balance_pgdat
> > > >       - 20.03% shrink_node
> > > >         - 19.92% shrink_lruvec
> > > >           - 19.91% shrink_inactive_list
> > > >             - 19.22% shrink_page_list
> > > >               - 17.51% __remove_mapping
> > > >                 - 14.16% _raw_spin_lock_irqsave
> > > >                   - 14.14% do_raw_spin_lock
> > > >                       __pv_queued_spin_lock_slowpath
> > > >                 - 1.56% __delete_from_page_cache
> > > >                     0.63% xas_store
> > > >                 - 0.78% _raw_spin_unlock_irqrestore
> > > >                   - 0.69% do_raw_spin_unlock
Re: [PATCH v2 00/16] Multigenerational LRU Framework
Yu Zhao writes: > On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote: >> >> Yu Zhao writes: >> >> > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel wrote: >> >> >> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote: >> >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: >> >> > >> >> > > The initial posting of this patchset did no better, in fact it did >> >> > > a bit >> >> > > worse. Performance dropped to the same levels and kswapd was using >> >> > > as >> >> > > much CPU as before, but on top of that we also got excessive >> >> > > swapping. >> >> > > Not at a high rate, but 5-10MB/sec continually. >> >> > > >> >> > > I had some back and forths with Yu Zhao and tested a few new >> >> > > revisions, >> >> > > and the current series does much better in this regard. Performance >> >> > > still dips a bit when page cache fills, but not nearly as much, and >> >> > > kswapd is using less CPU than before. >> >> > >> >> > Profiles would be interesting, because it sounds to me like reclaim >> >> > *might* be batching page cache removal better (e.g. fewer, larger >> >> > batches) and so spending less time contending on the mapping tree >> >> > lock... >> >> > >> >> > IOWs, I suspect this result might actually be a result of less lock >> >> > contention due to a change in batch processing characteristics of >> >> > the new algorithm rather than it being a "better" algorithm... >> >> >> >> That seems quite likely to me, given the issues we have >> >> had with virtual scan reclaim algorithms in the past. 
>> > >> > Hi Rik, >> > >> > Let paste the code so we can move beyond the "batching" hypothesis: >> > >> > static int __remove_mapping(struct address_space *mapping, struct page >> > *page, >> > bool reclaimed, struct mem_cgroup >> > *target_memcg) >> > { >> > unsigned long flags; >> > int refcount; >> > void *shadow = NULL; >> > >> > BUG_ON(!PageLocked(page)); >> > BUG_ON(mapping != page_mapping(page)); >> > >> > xa_lock_irqsave(&mapping->i_pages, flags); >> > >> >> SeongJae, what is this algorithm supposed to do when faced >> >> with situations like this: >> > >> > I'll assume the questions were directed at me, not SeongJae. >> > >> >> 1) Running on a system with 8 NUMA nodes, and >> >> memory >> >>pressure in one of those nodes. >> >> 2) Running PostgresQL or Oracle, with hundreds of >> >>processes mapping the same (very large) shared >> >>memory segment. >> >> >> >> How do you keep your algorithm from falling into the worst >> >> case virtual scanning scenarios that were crippling the >> >> 2.4 kernel 15+ years ago on systems with just a few GB of >> >> memory? >> > >> > There is a fundamental shift: that time we were scanning for cold pages, >> > and nowadays we are scanning for hot pages. >> > >> > I'd be surprised if scanning for cold pages didn't fall apart, because it'd >> > find most of the entries accessed, if they are present at all. >> > >> > Scanning for hot pages, on the other hand, is way better. Let me just >> > reiterate: >> > 1) It will not scan page tables from processes that have been sleeping >> >since the last scan. >> > 2) It will not scan PTE tables under non-leaf PMD entries that do not >> >have the accessed bit set, when >> >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. >> > 3) It will not zigzag between the PGD table and the same PMD or PTE >> >table spanning multiple VMAs. In other words, it finishes all the >> >VMAs with the range of the same PMD or PTE table before it returns >> >to the PGD table. 
This optimizes workloads that have large numbers >> >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. >> > >> > So the cost is roughly proportional to the number of referenced pages it >> > discovers. If there is no memory pressure, no scanning at all. For a system >> > under heavy memory pressure, most of the pages are referenced (otherwise >> > why would it be under memory pressure?), and if we use the rmap, we need to >> > scan a lot of pages anyway. Why not just scan them all? >> >> This may be not the case. For rmap scanning, it's possible to scan only >> a small portion of memory. But with the page table scanning, you need >> to scan almost all (I understand you have some optimization as above). > > Hi Ying, > > Let's take a step back. > > For the sake of discussion, when does the scanning have to happen? Can > we agree that the simplest answer is when we have evicted all inactive > pages? > > If so, my next question is who's filled in the memory space previously > occupied by those inactive pages? Newly faulted in pages, right? They > have the accessed bit set, and we can't evict them without scanning > them first, would you agree? > > And there are also existing active pages, and they were protected from > eviction. But now we need to deactivate some of them. Do you think > whether they'd have been used or not since the last scan? (Remember > they w
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, Apr 14, 2021 at 12:15 AM Huang, Ying wrote: > > Yu Zhao writes: > > > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel wrote: > >> > >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote: > >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > >> > > >> > > The initial posting of this patchset did no better, in fact it did > >> > > a bit > >> > > worse. Performance dropped to the same levels and kswapd was using > >> > > as > >> > > much CPU as before, but on top of that we also got excessive > >> > > swapping. > >> > > Not at a high rate, but 5-10MB/sec continually. > >> > > > >> > > I had some back and forths with Yu Zhao and tested a few new > >> > > revisions, > >> > > and the current series does much better in this regard. Performance > >> > > still dips a bit when page cache fills, but not nearly as much, and > >> > > kswapd is using less CPU than before. > >> > > >> > Profiles would be interesting, because it sounds to me like reclaim > >> > *might* be batching page cache removal better (e.g. fewer, larger > >> > batches) and so spending less time contending on the mapping tree > >> > lock... > >> > > >> > IOWs, I suspect this result might actually be a result of less lock > >> > contention due to a change in batch processing characteristics of > >> > the new algorithm rather than it being a "better" algorithm... > >> > >> That seems quite likely to me, given the issues we have > >> had with virtual scan reclaim algorithms in the past. 
> > > > Hi Rik, > > > > Let paste the code so we can move beyond the "batching" hypothesis: > > > > static int __remove_mapping(struct address_space *mapping, struct page > > *page, > > bool reclaimed, struct mem_cgroup *target_memcg) > > { > > unsigned long flags; > > int refcount; > > void *shadow = NULL; > > > > BUG_ON(!PageLocked(page)); > > BUG_ON(mapping != page_mapping(page)); > > > > xa_lock_irqsave(&mapping->i_pages, flags); > > > >> SeongJae, what is this algorithm supposed to do when faced > >> with situations like this: > > > > I'll assume the questions were directed at me, not SeongJae. > > > >> 1) Running on a system with 8 NUMA nodes, and > >> memory > >>pressure in one of those nodes. > >> 2) Running PostgresQL or Oracle, with hundreds of > >>processes mapping the same (very large) shared > >>memory segment. > >> > >> How do you keep your algorithm from falling into the worst > >> case virtual scanning scenarios that were crippling the > >> 2.4 kernel 15+ years ago on systems with just a few GB of > >> memory? > > > > There is a fundamental shift: that time we were scanning for cold pages, > > and nowadays we are scanning for hot pages. > > > > I'd be surprised if scanning for cold pages didn't fall apart, because it'd > > find most of the entries accessed, if they are present at all. > > > > Scanning for hot pages, on the other hand, is way better. Let me just > > reiterate: > > 1) It will not scan page tables from processes that have been sleeping > >since the last scan. > > 2) It will not scan PTE tables under non-leaf PMD entries that do not > >have the accessed bit set, when > >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > > 3) It will not zigzag between the PGD table and the same PMD or PTE > >table spanning multiple VMAs. In other words, it finishes all the > >VMAs with the range of the same PMD or PTE table before it returns > >to the PGD table. 
This optimizes workloads that have large numbers > >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. > > > > So the cost is roughly proportional to the number of referenced pages it > > discovers. If there is no memory pressure, no scanning at all. For a system > > under heavy memory pressure, most of the pages are referenced (otherwise > > why would it be under memory pressure?), and if we use the rmap, we need to > > scan a lot of pages anyway. Why not just scan them all? > > This may be not the case. For rmap scanning, it's possible to scan only > a small portion of memory. But with the page table scanning, you need > to scan almost all (I understand you have some optimization as above). Hi Ying, Let's take a step back. For the sake of discussion, when does the scanning have to happen? Can we agree that the simplest answer is when we have evicted all inactive pages? If so, my next question is who's filled in the memory space previously occupied by those inactive pages? Newly faulted in pages, right? They have the accessed bit set, and we can't evict them without scanning them first, would you agree? And there are also existing active pages, and they were protected from eviction. But now we need to deactivate some of them. Do you think whether they'd have been used or not since the last scan? (Remember they were active.) You mentioned "a small portion" and "almost all". How do you interpret them in terms of these steps? Intuitively, "a small portion" and "a
Re: [PATCH v2 00/16] Multigenerational LRU Framework
Yu Zhao writes: > On Tue, Apr 13, 2021 at 8:30 PM Rik van Riel wrote: >> >> On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote: >> > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: >> > >> > > The initial posting of this patchset did no better, in fact it did >> > > a bit >> > > worse. Performance dropped to the same levels and kswapd was using >> > > as >> > > much CPU as before, but on top of that we also got excessive >> > > swapping. >> > > Not at a high rate, but 5-10MB/sec continually. >> > > >> > > I had some back and forths with Yu Zhao and tested a few new >> > > revisions, >> > > and the current series does much better in this regard. Performance >> > > still dips a bit when page cache fills, but not nearly as much, and >> > > kswapd is using less CPU than before. >> > >> > Profiles would be interesting, because it sounds to me like reclaim >> > *might* be batching page cache removal better (e.g. fewer, larger >> > batches) and so spending less time contending on the mapping tree >> > lock... >> > >> > IOWs, I suspect this result might actually be a result of less lock >> > contention due to a change in batch processing characteristics of >> > the new algorithm rather than it being a "better" algorithm... >> >> That seems quite likely to me, given the issues we have >> had with virtual scan reclaim algorithms in the past. > > Hi Rik, > > Let paste the code so we can move beyond the "batching" hypothesis: > > static int __remove_mapping(struct address_space *mapping, struct page > *page, > bool reclaimed, struct mem_cgroup *target_memcg) > { > unsigned long flags; > int refcount; > void *shadow = NULL; > > BUG_ON(!PageLocked(page)); > BUG_ON(mapping != page_mapping(page)); > > xa_lock_irqsave(&mapping->i_pages, flags); > >> SeongJae, what is this algorithm supposed to do when faced >> with situations like this: > > I'll assume the questions were directed at me, not SeongJae. 
> >> 1) Running on a system with 8 NUMA nodes, and >> memory >>pressure in one of those nodes. >> 2) Running PostgresQL or Oracle, with hundreds of >>processes mapping the same (very large) shared >>memory segment. >> >> How do you keep your algorithm from falling into the worst >> case virtual scanning scenarios that were crippling the >> 2.4 kernel 15+ years ago on systems with just a few GB of >> memory? > > There is a fundamental shift: that time we were scanning for cold pages, > and nowadays we are scanning for hot pages. > > I'd be surprised if scanning for cold pages didn't fall apart, because it'd > find most of the entries accessed, if they are present at all. > > Scanning for hot pages, on the other hand, is way better. Let me just > reiterate: > 1) It will not scan page tables from processes that have been sleeping >since the last scan. > 2) It will not scan PTE tables under non-leaf PMD entries that do not >have the accessed bit set, when >CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y. > 3) It will not zigzag between the PGD table and the same PMD or PTE >table spanning multiple VMAs. In other words, it finishes all the >VMAs with the range of the same PMD or PTE table before it returns >to the PGD table. This optimizes workloads that have large numbers >of tiny VMAs, especially when CONFIG_PGTABLE_LEVELS=5. > > So the cost is roughly proportional to the number of referenced pages it > discovers. If there is no memory pressure, no scanning at all. For a system > under heavy memory pressure, most of the pages are referenced (otherwise > why would it be under memory pressure?), and if we use the rmap, we need to > scan a lot of pages anyway. Why not just scan them all? This may be not the case. For rmap scanning, it's possible to scan only a small portion of memory. But with the page table scanning, you need to scan almost all (I understand you have some optimization as above). 
As Rik showed in the test case above, there may be memory pressure on only one of the 8 NUMA nodes (because of NUMA binding?). Then rmap scanning only needs to scan pages in this node, while the page table scanning may need to scan pages in other nodes too.

Best Regards,
Huang, Ying

> This way you save a lot because of batching (now it's time to talk
> about batching). Besides, page tables have far better memory locality
> than the rmap. For the shared memory example you gave, the rmap needs
> to lock *each* page it scans. How many 4KB pages does your large file
> have? I'll leave the math to you.
>
> Here are some profiles:
>
> zram with the rmap (mainline)
>   31.03% page_vma_mapped_walk
>   25.59% lzo1x_1_do_compress
>    4.63% do_raw_spin_lock
>    3.89% vma_interval_tree_iter_next
>    3.33% vma_interval_tree_subtree_search
>
> zram with page table scanning (this patchset)
>   49.36% lzo1x_1_do_compress
>    4.54% page_vma_mapped_walk
>    4.45% memset_erms
>    3.47% walk_pte_range
>    2.88% zram_bvec_r
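Ying's point about node-local pressure can be made concrete with back-of-envelope arithmetic. This is purely an illustrative cost model with made-up constants, not a measurement: rmap work scales with the LRU length of the one node under pressure, while a naive page table walk scales with all mapped memory across every node, even if each entry is cheaper to visit than an rmap-locked page.

```c
#include <assert.h>

/* Illustrative cost model (made-up constants, hypothetical names) for
 * the NUMA scenario above: pressure on 1 of 8 nodes. */

/* rmap scanning visits only the pressured node's LRU, but must lock
 * each page's mapping, so the per-page cost is higher. */
static unsigned long rmap_scan_cost(unsigned long node_lru_pages,
                                    unsigned long cost_per_page)
{
    return node_lru_pages * cost_per_page;
}

/* A naive page table walk batches well (low per-entry cost) but covers
 * every process's mapped pages across all nodes. */
static unsigned long pgtable_scan_cost(unsigned long total_mapped_pages,
                                       unsigned long cost_per_entry)
{
    return total_mapped_pages * cost_per_entry;
}
```

With 8 nodes of 1M pages each and a per-page rmap cost 4x the per-entry page table cost, node-local rmap scanning still does half the work of a full walk - which is the scaling concern here, independent of the batching and locality advantages Yu describes for the dense case.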
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Tue, Apr 13, 2021 at 09:40:12PM -0600, Yu Zhao wrote: > On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote: > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > > > On 4/13/21 1:51 AM, SeongJae Park wrote: > > > > From: SeongJae Park > > > > > > > > Hello, > > > > > > > > > > > > Very interesting work, thank you for sharing this :) > > > > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > > > > > > > >> What's new in v2 > > > >> > > > >> Special thanks to Jens Axboe for reporting a regression in buffered > > > >> I/O and helping test the fix. > > > > > > > > Is the discussion open? If so, could you please give me a link? > > > > > > I wasn't on the initial post (or any of the lists it was posted to), but > > > it's on the google page reclaim list. Not sure if that is public or not. > > > > > > tldr is that I was pretty excited about this work, as buffered IO tends > > > to suck (a lot) for high throughput applications. My test case was > > > pretty simple: > > > > > > Randomly read a fast device, using 4k buffered IO, and watch what > > > happens when the page cache gets filled up. For this particular test, > > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec > > > with kswapd using a lot of CPU trying to keep up. That's mainline > > > behavior. > > > > I see this exact same behaviour here, too, but I RCA'd it to > > contention between the inode and memory reclaim for the mapping > > structure that indexes the page cache. Basically the mapping tree > > lock is the contention point here - you can either be adding pages > > to the mapping during IO, or memory reclaim can be removing pages > > from the mapping, but we can't do both at once. 
> >
> > So we end up with kswapd spinning on the mapping tree lock like so
> > when doing 1.6GB/s in 4kB buffered IO:
> >
> > - 20.06%  0.00%  [kernel]  [k] kswapd
> >   - 20.06% kswapd
> >     - 20.05% balance_pgdat
> >       - 20.03% shrink_node
> >         - 19.92% shrink_lruvec
> >           - 19.91% shrink_inactive_list
> >             - 19.22% shrink_page_list
> >               - 17.51% __remove_mapping
> >                 - 14.16% _raw_spin_lock_irqsave
> >                   - 14.14% do_raw_spin_lock
> >                       __pv_queued_spin_lock_slowpath
> >                 - 1.56% __delete_from_page_cache
> >                     0.63% xas_store
> >                 - 0.78% _raw_spin_unlock_irqrestore
> >                   - 0.69% do_raw_spin_unlock
> >                       __raw_callee_save___pv_queued_spin_unlock
> >               - 0.82% free_unref_page_list
> >                 - 0.72% free_unref_page_commit
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Tue, Apr 13, 2021 at 5:14 PM Dave Chinner wrote: > > On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > > On 4/13/21 1:51 AM, SeongJae Park wrote: > > > From: SeongJae Park > > > > > > Hello, > > > > > > > > > Very interesting work, thank you for sharing this :) > > > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > > > > > >> What's new in v2 > > >> > > >> Special thanks to Jens Axboe for reporting a regression in buffered > > >> I/O and helping test the fix. > > > > > > Is the discussion open? If so, could you please give me a link? > > > > I wasn't on the initial post (or any of the lists it was posted to), but > > it's on the google page reclaim list. Not sure if that is public or not. > > > > tldr is that I was pretty excited about this work, as buffered IO tends > > to suck (a lot) for high throughput applications. My test case was > > pretty simple: > > > > Randomly read a fast device, using 4k buffered IO, and watch what > > happens when the page cache gets filled up. For this particular test, > > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec > > with kswapd using a lot of CPU trying to keep up. That's mainline > > behavior. > > I see this exact same behaviour here, too, but I RCA'd it to > contention between the inode and memory reclaim for the mapping > structure that indexes the page cache. Basically the mapping tree > lock is the contention point here - you can either be adding pages > to the mapping during IO, or memory reclaim can be removing pages > from the mapping, but we can't do both at once. 
>
> So we end up with kswapd spinning on the mapping tree lock like so
> when doing 1.6GB/s in 4kB buffered IO:
>
> - 20.06%  0.00%  [kernel]  [k] kswapd
>   - 20.06% kswapd
>     - 20.05% balance_pgdat
>       - 20.03% shrink_node
>         - 19.92% shrink_lruvec
>           - 19.91% shrink_inactive_list
>             - 19.22% shrink_page_list
>               - 17.51% __remove_mapping
>                 - 14.16% _raw_spin_lock_irqsave
>                   - 14.14% do_raw_spin_lock
>                       __pv_queued_spin_lock_slowpath
>                 - 1.56% __delete_from_page_cache
>                     0.63% xas_store
>                 - 0.78% _raw_spin_unlock_irqrestore
>                   - 0.69% do_raw_spin_unlock
>                       __raw_callee_save___pv_queued_spin_unlock
>               - 0.82% free_unref_page_list
>                 - 0.72% free_unref_page_commit
>                     0.57% free_pcppages_bulk
>
> And these are the processes consuming CPU:
>
>  5171 root
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Wed, 2021-04-14 at 09:14 +1000, Dave Chinner wrote:
> On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote:
>
> > The initial posting of this patchset did no better, in fact it did
> > a bit worse. Performance dropped to the same levels and kswapd was
> > using as much CPU as before, but on top of that we also got excessive
> > swapping. Not at a high rate, but 5-10MB/sec continually.
> >
> > I had some back and forths with Yu Zhao and tested a few new revisions,
> > and the current series does much better in this regard. Performance
> > still dips a bit when page cache fills, but not nearly as much, and
> > kswapd is using less CPU than before.
>
> Profiles would be interesting, because it sounds to me like reclaim
> *might* be batching page cache removal better (e.g. fewer, larger
> batches) and so spending less time contending on the mapping tree
> lock...
>
> IOWs, I suspect this result might actually be a result of less lock
> contention due to a change in batch processing characteristics of
> the new algorithm rather than it being a "better" algorithm...

That seems quite likely to me, given the issues we have had with virtual scan reclaim algorithms in the past.

SeongJae, what is this algorithm supposed to do when faced with situations like this:

1) Running on a system with 8 NUMA nodes, and memory pressure in one of those nodes.
2) Running PostgreSQL or Oracle, with hundreds of processes mapping the same (very large) shared memory segment.

How do you keep your algorithm from falling into the worst case virtual scanning scenarios that were crippling the 2.4 kernel 15+ years ago on systems with just a few GB of memory?

--
All Rights Reversed.
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On Tue, Apr 13, 2021 at 10:13:24AM -0600, Jens Axboe wrote: > On 4/13/21 1:51 AM, SeongJae Park wrote: > > From: SeongJae Park > > > > Hello, > > > > > > Very interesting work, thank you for sharing this :) > > > > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote: > > > >> What's new in v2 > >> > >> Special thanks to Jens Axboe for reporting a regression in buffered > >> I/O and helping test the fix. > > > > Is the discussion open? If so, could you please give me a link? > > I wasn't on the initial post (or any of the lists it was posted to), but > it's on the google page reclaim list. Not sure if that is public or not. > > tldr is that I was pretty excited about this work, as buffered IO tends > to suck (a lot) for high throughput applications. My test case was > pretty simple: > > Randomly read a fast device, using 4k buffered IO, and watch what > happens when the page cache gets filled up. For this particular test, > we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec > with kswapd using a lot of CPU trying to keep up. That's mainline > behavior. I see this exact same behaviour here, too, but I RCA'd it to contention between the inode and memory reclaim for the mapping structure that indexes the page cache. Basically the mapping tree lock is the contention point here - you can either be adding pages to the mapping during IO, or memory reclaim can be removing pages from the mapping, but we can't do both at once. 
So we end up with kswapd spinning on the mapping tree lock like so
when doing 1.6GB/s in 4kB buffered IO:

-   20.06%     0.00%  [kernel]            [k] kswapd
   - 20.06% kswapd
      - 20.05% balance_pgdat
         - 20.03% shrink_node
            - 19.92% shrink_lruvec
               - 19.91% shrink_inactive_list
                  - 19.22% shrink_page_list
                     - 17.51% __remove_mapping
                        - 14.16% _raw_spin_lock_irqsave
                           - 14.14% do_raw_spin_lock
                                __pv_queued_spin_lock_slowpath
                        - 1.56% __delete_from_page_cache
                             0.63% xas_store
                        - 0.78% _raw_spin_unlock_irqrestore
                           - 0.69% do_raw_spin_unlock
                                __raw_callee_save___pv_queued_spin_unlock
                     - 0.82% free_unref_page_list
                        - 0.72% free_unref_page_commit
                             0.57% free_pcppages_bulk

And these are the processes consuming CPU:

   5171 root      20   0 1442496   5696   1284 R  99.7  0.0   1:07.78 fio
   1150 root      20   0       0      0      0 S  47.4  0.0   0:22.70 kswapd1
   1146 root      20   0       0      0      0 S  44.0  0.0   0:21.85 kswapd0
   1152 root      20   0       0      0      0
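Dave's batching hypothesis — fewer, larger batches spend less time
contending on the mapping tree lock — can be illustrated with a toy
model. This is plain Python, not kernel code; `mapping_lock` and both
function names are invented for the sketch, and the only thing it
demonstrates is the arithmetic of lock round-trips per page removed.

```python
from threading import Lock

# Toy stand-in for the mapping tree lock; all names are invented for
# illustration and do not correspond to real kernel symbols.
mapping_lock = Lock()
lock_acquisitions = 0

def remove_pages_one_by_one(mapping, pages):
    """One lock round-trip per page: maximal opportunity for the IO
    path to contend with reclaim on every single removal."""
    global lock_acquisitions
    for page in pages:
        with mapping_lock:
            lock_acquisitions += 1
            mapping.discard(page)

def remove_pages_batched(mapping, pages, batch_size=32):
    """One lock round-trip per batch: the same pages are removed, but
    the lock is taken 1/batch_size as often."""
    global lock_acquisitions
    for i in range(0, len(pages), batch_size):
        with mapping_lock:
            lock_acquisitions += 1
            for page in pages[i:i + batch_size]:
                mapping.discard(page)

mapping = set(range(1024))

remove_pages_one_by_one(mapping, list(range(512)))
per_page = lock_acquisitions   # 512 acquisitions for 512 pages

lock_acquisitions = 0
remove_pages_batched(mapping, list(range(512, 1024)))
per_batch = lock_acquisitions  # 16 acquisitions for the same 512 pages

print(per_page, per_batch)
```

The real trade-off is more subtle (larger batches also lengthen each
critical section), but the profile above — most of kswapd's time in
`_raw_spin_lock_irqsave` under `__remove_mapping` — is consistent with
the per-page pattern being the expensive one.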
Re: [PATCH v2 00/16] Multigenerational LRU Framework
From: SeongJae Park

On Tue, 13 Apr 2021 10:13:24 -0600 Jens Axboe wrote:

> On 4/13/21 1:51 AM, SeongJae Park wrote:
> > From: SeongJae Park
> >
> > Hello,
> >
> > Very interesting work, thank you for sharing this :)
> >
> > On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote:
> >
> >> What's new in v2
> >>
> >> Special thanks to Jens Axboe for reporting a regression in buffered
> >> I/O and helping test the fix.
> >
> > Is the discussion open? If so, could you please give me a link?
>
> I wasn't on the initial post (or any of the lists it was posted to), but
> it's on the google page reclaim list. Not sure if that is public or not.
>
> tldr is that I was pretty excited about this work, as buffered IO tends
> to suck (a lot) for high throughput applications. My test case was
> pretty simple:
>
> Randomly read a fast device, using 4k buffered IO, and watch what
> happens when the page cache gets filled up. For this particular test,
> we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
> with kswapd using a lot of CPU trying to keep up. That's mainline
> behavior.
>
> The initial posting of this patchset did no better, in fact it did a bit
> worse. Performance dropped to the same levels and kswapd was using as
> much CPU as before, but on top of that we also got excessive swapping.
> Not at a high rate, but 5-10MB/sec continually.
>
> I had some back and forths with Yu Zhao and tested a few new revisions,
> and the current series does much better in this regard. Performance
> still dips a bit when page cache fills, but not nearly as much, and
> kswapd is using less CPU than before.
>
> Hope that helps,

Appreciate this kind and detailed explanation, Jens!  So, my
understanding is that v2 of this patchset improved the performance by
using frequency (tier) in addition to recency (generation number) for
buffered I/O pages.  That makes sense to me.  If I'm misunderstanding,
please let me know.

Thanks,
SeongJae Park

> --
> Jens Axboe
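The frequency/tier mapping SeongJae refers to can be sketched
numerically. Yu's cover letter (quoted in full later in this thread)
states that pages accessed N times via file descriptors belong to tier
order_base_2(N), where `order_base_2` is the kernel's round-up log2.
In the sketch below, the `MAX_NR_TIERS` value and the cap-at-last-tier
behaviour are illustrative assumptions, not taken from the patchset.

```python
def order_base_2(n):
    """Round-up log2, matching the kernel macro for n >= 1:
    order_base_2(1) == 0, order_base_2(2) == 1, order_base_2(3) == 2,
    order_base_2(4) == 2."""
    return max(0, (n - 1).bit_length())

MAX_NR_TIERS = 4  # assumed value for illustration only

def tier_of(accesses):
    """Tier for a page accessed `accesses` times via file descriptors,
    capped at the last tier (capping is this sketch's assumption)."""
    return min(order_base_2(accesses), MAX_NR_TIERS - 1)

# One access -> tier 0; two -> tier 1; three or four -> tier 2; etc.
print([tier_of(n) for n in (1, 2, 3, 4, 8, 100)])  # [0, 1, 2, 2, 3, 3]
```

This also shows why moving across tiers is cheap in the design: the
tier is a small integer derived from an access counter, so promotion is
a flag update rather than a list operation under the lru lock.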
Re: [PATCH v2 00/16] Multigenerational LRU Framework
On 4/13/21 1:51 AM, SeongJae Park wrote:
> From: SeongJae Park
>
> Hello,
>
> Very interesting work, thank you for sharing this :)
>
> On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote:
>
>> What's new in v2
>>
>> Special thanks to Jens Axboe for reporting a regression in buffered
>> I/O and helping test the fix.
>
> Is the discussion open? If so, could you please give me a link?

I wasn't on the initial post (or any of the lists it was posted to), but
it's on the google page reclaim list. Not sure if that is public or not.

tldr is that I was pretty excited about this work, as buffered IO tends
to suck (a lot) for high throughput applications. My test case was
pretty simple:

Randomly read a fast device, using 4k buffered IO, and watch what
happens when the page cache gets filled up. For this particular test,
we'll initially be doing 2.1GB/sec of IO, and then drop to 1.5-1.6GB/sec
with kswapd using a lot of CPU trying to keep up. That's mainline
behavior.

The initial posting of this patchset did no better, in fact it did a bit
worse. Performance dropped to the same levels and kswapd was using as
much CPU as before, but on top of that we also got excessive swapping.
Not at a high rate, but 5-10MB/sec continually.

I had some back and forths with Yu Zhao and tested a few new revisions,
and the current series does much better in this regard. Performance
still dips a bit when page cache fills, but not nearly as much, and
kswapd is using less CPU than before.

Hope that helps,

-- 
Jens Axboe
Re: [PATCH v2 00/16] Multigenerational LRU Framework
From: SeongJae Park

Hello,

Very interesting work, thank you for sharing this :)

On Tue, 13 Apr 2021 00:56:17 -0600 Yu Zhao wrote:

> What's new in v2
>
> Special thanks to Jens Axboe for reporting a regression in buffered
> I/O and helping test the fix.

Is the discussion open? If so, could you please give me a link?

> This version includes the support of tiers, which represent levels of
> usage from file descriptors only. Pages accessed N times via file
> descriptors belong to tier order_base_2(N). Each generation contains
> at most MAX_NR_TIERS tiers, and they require additional MAX_NR_TIERS-2
> bits in page->flags. In contrast to moving across generations which
> requires the lru lock, moving across tiers only involves an atomic
> operation on page->flags and therefore has a negligible cost. A
> feedback loop modeled after the well-known PID controller monitors the
> refault rates across all tiers and decides when to activate pages from
> which tiers, on the reclaim path.
>
> This feedback model has a few advantages over the current feedforward
> model:
> 1) It has a negligible overhead in the buffered I/O access path
>    because activations are done in the reclaim path.
> 2) It takes mapped pages into account and avoids overprotecting pages
>    accessed multiple times via file descriptors.
> 3) More tiers offer better protection to pages accessed more than
>    twice when buffered-I/O-intensive workloads are under memory
>    pressure.
>
> The fio/io_uring benchmark shows 14% improvement in IOPS when randomly
> accessing Samsung PM981a in the buffered I/O mode.

Improvement under memory pressure, right? How much pressure?

[...]

> Differential scans via page tables
> ----------------------------------
> Each differential scan discovers all pages that have been referenced
> since the last scan. Specifically, it walks the mm_struct list
> associated with an lruvec to scan page tables of processes that have
> been scheduled since the last scan.
Does this mean it scans only the virtual address spaces of processes,
so pages in the page cache that are not mmap()-ed will not be scanned?

> The cost of each differential scan is roughly proportional to the
> number of referenced pages it discovers. Unless address spaces are
> extremely sparse, page tables usually have better memory locality
> than the rmap. The end result is generally a significant reduction
> in CPU usage, for workloads using a large amount of anon memory.

When and how frequently does it scan?

Thanks,
SeongJae Park

[...]
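On the sparse-address-space question, Yu's reply earlier in the thread
states the fallback heuristic: the page-table walk is skipped for a
process when rss < size of the page tables. The sketch below makes
that comparison concrete, but note two assumptions of mine: both sides
are interpreted as page counts, and the page-table size is estimated
from x86-64 4 KiB paging (512 PTEs per page-table page, upper levels
ignored).

```python
PAGE_SIZE = 4096
PTRS_PER_PTE = 512                        # x86-64: 512 entries per PTE page
PTE_COVERAGE = PTRS_PER_PTE * PAGE_SIZE   # one PTE page maps 2 MiB

def page_table_pages(mapped_bytes):
    """Rough count of bottom-level page-table pages needed to map a
    span; upper-level tables are comparatively tiny and ignored."""
    return (mapped_bytes + PTE_COVERAGE - 1) // PTE_COVERAGE

def should_walk_page_tables(rss_pages, mapped_bytes):
    """The stated heuristic: walk page tables unless the process is
    sparse, i.e. unless rss < size of the page tables."""
    return rss_pages >= page_table_pages(mapped_bytes)

# Dense process: 1 GiB mapped, ~1 GiB resident -> page tables pay off.
print(should_walk_page_tables(250_000, 1 * 1024**3))      # True
# Sparse process: 1 TiB mapped, 100 pages resident -> fall back to rmap.
print(should_walk_page_tables(100, 1 * 1024**3 * 1024))   # False
```

The intuition: when rss is smaller than the page tables themselves,
most PTE visits would find nothing resident, so walking the rmap of
the few resident pages is cheaper.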