Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-04-12 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying  wrote:
>>
>> Yu Zhao  writes:
>>
>> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> >> Yu Zhao  writes:
>> >>
>> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> >> Yu Zhao  writes:
>> >> >>
>> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> >> > The scanning overhead is only one of the two major problems of the
>> >> >> > current page reclaim. The other problem is the granularity of the
>> >> >> > active/inactive (sizes). We stopped using them in making job
>> >> >> > scheduling decisions a long time ago. I know another large internet
>> >> >> > company adopted a similar approach to ours, and I'm wondering how
>> >> >> > everybody else is coping with the discrepancy from those counters.
>> >> >>
>> >> >> From intuition, the scanning overhead of full page table scanning
>> >> >> appears higher than that of rmap scanning for a small portion of
>> >> >> system memory.  But from your words, you think the reality is the
>> >> >> reverse?  If others are concerned about the overhead too, then in the
>> >> >> end I think you need to prove, with more data and theory, that the
>> >> >> overhead of page table scanning isn't higher, or is even lower.
>> >> >
>> >> > There is a misunderstanding here. I never said anything about full
>> >> > page table scanning. And this is not how it's done in this series
>> >> > either. I guess the misunderstanding has something to do with the cold
>> >> > memory tracking you are thinking about?
>> >>
>> >> If my understanding were correct, from the following code path in your
>> >> patch 10/14,
>> >>
>> >> age_active_anon
>> >>   age_lru_gens
>> >> try_walk_mm_list
>> >>   walk_mm_list
>> >> walk_mm
>> >>
>> >> So, in kswapd(), the page tables of many processes may be scanned
>> >> fully.  If the number of active processes is high, the overhead
>> >> may be high too.
>> >
>> > That's correct. Just in case we have different definitions of what we
>> > call "full":
>> >
>> >   I understand it as the full range of the address space of a process
>> >   that was loaded by switch_mm() at least once since the last scan.
>> >   This is not the case because we don't scan the full range -- we skip
>> >   holes and VMAs that are unevictable, as well as PTE tables that have
>> >   no accessed entries on x86_64, by should_skip_vma() and
>> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>> >
>> >   If you are referring to the full range of PTE tables that have at
>> >   least one accessed entry, i.e., the other 511 are not none but have not
>> >   been accessed either since the last scan on x86_64, then yes, you
>> >   are right again :) This is the worst-case scenario.
>>
>> OK.  So there's no fundamental difference between us on this.
>>
>> >> > This series uses page tables to discover page accesses when a system
>> >> > has run out of inactive pages. Under such a situation, the system is
>> >> > very likely to have a lot of page accesses, and using the rmap is
>> >> > likely to cost a lot more because of its poor memory locality compared
>> >> > with page tables.
>> >>
>> >> This is the theory.  Can you verify this with more data?  Including the
>> >> CPU cycles or time spent scanning page tables?
>> >
>> > Yes, I'll be happy to do so as I should, because page table scanning
>> > is counterintuitive. Let me add more theory in case it's still unclear
>> > to others.
>> >
>> > From my understanding, the two fundamental questions we need to
>> > consider in terms of page reclaim are:
>> >
>> >   What sizes of hot clusters (spatial locality) should we expect
>> >   under memory pressure?
>> >
>> >   On smaller systems with 4GB memory, our observations are that the
>> >   average size of hot clusters found during each scan is 32KB. On
>> >   larger systems with hundreds of gigabytes of memory, it's well
>> >   above this value -- 512KB or larger. These values vary under
>> >   different workloads and with different memory allocators. Unless
>> >   done deliberately by memory allocators, e.g., Scudo as I've
>> >   mentioned earlier, it's safe to say if a PTE entry has been
>> >   accessed, its neighbors are likely to have been accessed too.
>> >
>> >   What hot memory footprint (total size of hot clusters) should we
>> >   expect when we have run out of inactive pages?
>> >
>> >   Some numbers first: on large and heavily overcommitted systems, we
>> >   have observed close to 90% during a scan. Those systems have
>> >   millions of pages and using the rmap to find out which pages to
>> >   reclaim will just overwhelm kswapd. On smaller systems with less memory
>> >   pressure (due to their weaker CPUs), this number is more reasonable,
>> >   ~50%. Here are some kswapd profiles from a smaller system running
>> >   5.11:
>> >
>> >   the rmap                                  page table scan
>> >   ------------------------------------------------------------------

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-04-10 Thread Yu Zhao
On Wed, Mar 24, 2021 at 12:58 AM Huang, Ying  wrote:
>
> Yu Zhao  writes:
>
> > On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> >> Yu Zhao  writes:
> >>
> >> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> >> Yu Zhao  writes:
> >> >>
> >> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> >> > The scanning overhead is only one of the two major problems of the
> >> >> > current page reclaim. The other problem is the granularity of the
> >> >> > active/inactive (sizes). We stopped using them in making job
> >> >> > scheduling decisions a long time ago. I know another large internet
> >> >> > company adopted a similar approach to ours, and I'm wondering how
> >> >> > everybody else is coping with the discrepancy from those counters.
> >> >>
> >> >> From intuition, the scanning overhead of full page table scanning
> >> >> appears higher than that of rmap scanning for a small portion of
> >> >> system memory.  But from your words, you think the reality is the
> >> >> reverse?  If others are concerned about the overhead too, then in the
> >> >> end I think you need to prove, with more data and theory, that the
> >> >> overhead of page table scanning isn't higher, or is even lower.
> >> >
> >> > There is a misunderstanding here. I never said anything about full
> >> > page table scanning. And this is not how it's done in this series
> >> > either. I guess the misunderstanding has something to do with the cold
> >> > memory tracking you are thinking about?
> >>
> >> If my understanding were correct, from the following code path in your
> >> patch 10/14,
> >>
> >> age_active_anon
> >>   age_lru_gens
> >> try_walk_mm_list
> >>   walk_mm_list
> >> walk_mm
> >>
> >> So, in kswapd(), the page tables of many processes may be scanned
> >> fully.  If the number of active processes is high, the overhead
> >> may be high too.
> >
> > That's correct. Just in case we have different definitions of what we
> > call "full":
> >
> >   I understand it as the full range of the address space of a process
> >   that was loaded by switch_mm() at least once since the last scan.
> >   This is not the case because we don't scan the full range -- we skip
> >   holes and VMAs that are unevictable, as well as PTE tables that have
> >   no accessed entries on x86_64, by should_skip_vma() and
> >   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
> >
> >   If you are referring to the full range of PTE tables that have at
> >   least one accessed entry, i.e., the other 511 are not none but have not
> >   been accessed either since the last scan on x86_64, then yes, you
> >   are right again :) This is the worst-case scenario.
>
> OK.  So there's no fundamental difference between us on this.
>
> >> > This series uses page tables to discover page accesses when a system
> >> > has run out of inactive pages. Under such a situation, the system is
> >> > very likely to have a lot of page accesses, and using the rmap is
> >> > likely to cost a lot more because of its poor memory locality compared
> >> > with page tables.
> >>
> >> This is the theory.  Can you verify this with more data?  Including the
> >> CPU cycles or time spent scanning page tables?
> >
> > Yes, I'll be happy to do so as I should, because page table scanning
> > is counterintuitive. Let me add more theory in case it's still unclear
> > to others.
> >
> > From my understanding, the two fundamental questions we need to
> > consider in terms of page reclaim are:
> >
> >   What sizes of hot clusters (spatial locality) should we expect
> >   under memory pressure?
> >
> >   On smaller systems with 4GB memory, our observations are that the
> >   average size of hot clusters found during each scan is 32KB. On
> >   larger systems with hundreds of gigabytes of memory, it's well
> >   above this value -- 512KB or larger. These values vary under
> >   different workloads and with different memory allocators. Unless
> >   done deliberately by memory allocators, e.g., Scudo as I've
> >   mentioned earlier, it's safe to say if a PTE entry has been
> >   accessed, its neighbors are likely to have been accessed too.
> >
> >   What hot memory footprint (total size of hot clusters) should we
> >   expect when we have run out of inactive pages?
> >
> >   Some numbers first: on large and heavily overcommitted systems, we
> >   have observed close to 90% during a scan. Those systems have
> >   millions of pages and using the rmap to find out which pages to
> >   reclaim will just overwhelm kswapd. On smaller systems with less memory
> >   pressure (due to their weaker CPUs), this number is more reasonable,
> >   ~50%. Here are some kswapd profiles from a smaller system running
> >   5.11:
> >
> >   the rmap                                  page table scan
> >   ------------------------------------------------------------------
> >   31.03%  page_vma_mapped_walk               49.36%  lzo1x_1_do_compress
> >   25.59%  lzo1x_1_do_compress

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-23 Thread Huang, Ying
Yu Zhao  writes:

> On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
>> Yu Zhao  writes:
>> 
>> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> >> Yu Zhao  writes:
>> >> 
>> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> >> > The scanning overhead is only one of the two major problems of the
>> >> > current page reclaim. The other problem is the granularity of the
>> >> > active/inactive (sizes). We stopped using them in making job
>> >> > scheduling decisions a long time ago. I know another large internet
>> >> > company adopted a similar approach to ours, and I'm wondering how
>> >> > everybody else is coping with the discrepancy from those counters.
>> >> 
>> >> From intuition, the scanning overhead of full page table scanning
>> >> appears higher than that of rmap scanning for a small portion of
>> >> system memory.  But from your words, you think the reality is the
>> >> reverse?  If others are concerned about the overhead too, then in the
>> >> end I think you need to prove, with more data and theory, that the
>> >> overhead of page table scanning isn't higher, or is even lower.
>> >
>> > There is a misunderstanding here. I never said anything about full
>> > page table scanning. And this is not how it's done in this series
>> > either. I guess the misunderstanding has something to do with the cold
>> > memory tracking you are thinking about?
>> 
>> If my understanding were correct, from the following code path in your
>> patch 10/14,
>> 
>> age_active_anon
>>   age_lru_gens
>> try_walk_mm_list
>>   walk_mm_list
>> walk_mm
>> 
>> So, in kswapd(), the page tables of many processes may be scanned
>> fully.  If the number of active processes is high, the overhead
>> may be high too.
>
> That's correct. Just in case we have different definitions of what we
> call "full":
>
>   I understand it as the full range of the address space of a process
>   that was loaded by switch_mm() at least once since the last scan.
>   This is not the case because we don't scan the full range -- we skip
>   holes and VMAs that are unevictable, as well as PTE tables that have
>   no accessed entries on x86_64, by should_skip_vma() and
>   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.
>
>   If you are referring to the full range of PTE tables that have at
>   least one accessed entry, i.e., the other 511 are not none but have not
>   been accessed either since the last scan on x86_64, then yes, you
>   are right again :) This is the worst-case scenario.

OK.  So there's no fundamental difference between us on this.

>> > This series uses page tables to discover page accesses when a system
>> > has run out of inactive pages. Under such a situation, the system is
>> > very likely to have a lot of page accesses, and using the rmap is
>> > likely to cost a lot more because of its poor memory locality compared
>> > with page tables.
>> 
>> This is the theory.  Can you verify this with more data?  Including the
>> CPU cycles or time spent scanning page tables?
>
> Yes, I'll be happy to do so as I should, because page table scanning
> is counterintuitive. Let me add more theory in case it's still unclear
> to others.
>
> From my understanding, the two fundamental questions we need to
> consider in terms of page reclaim are:
>
>   What sizes of hot clusters (spatial locality) should we expect
>   under memory pressure?
>
>   On smaller systems with 4GB memory, our observations are that the
>   average size of hot clusters found during each scan is 32KB. On
>   larger systems with hundreds of gigabytes of memory, it's well
>   above this value -- 512KB or larger. These values vary under
>   different workloads and with different memory allocators. Unless
>   done deliberately by memory allocators, e.g., Scudo as I've
>   mentioned earlier, it's safe to say if a PTE entry has been
>   accessed, its neighbors are likely to have been accessed too.
>
>   What hot memory footprint (total size of hot clusters) should we
>   expect when we have run out of inactive pages?
>
>   Some numbers first: on large and heavily overcommitted systems, we
>   have observed close to 90% during a scan. Those systems have
>   millions of pages and using the rmap to find out which pages to
>   reclaim will just overwhelm kswapd. On smaller systems with less memory
>   pressure (due to their weaker CPUs), this number is more reasonable,
>   ~50%. Here are some kswapd profiles from a smaller system running
>   5.11:
>
>   the rmap                                  page table scan
>   ------------------------------------------------------------------
>   31.03%  page_vma_mapped_walk               49.36%  lzo1x_1_do_compress
>   25.59%  lzo1x_1_do_compress                 4.54%  page_vma_mapped_walk
>    4.63%  do_raw_spin_lock                    4.45%  memset_erms
>    3.89%  vma_interval_tree_iter_next         3.47%  walk_pte_range
>    3.33%  vma_interval_tree_subtree_search    2.88%  zram_bvec_rw
>
>   

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-22 Thread Yu Zhao
On Mon, Mar 22, 2021 at 11:13:19AM +0800, Huang, Ying wrote:
> Yu Zhao  writes:
> 
> > On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> >> Yu Zhao  writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> > The scanning overhead is only one of the two major problems of the
> >> > current page reclaim. The other problem is the granularity of the
> >> > active/inactive (sizes). We stopped using them in making job
> >> > scheduling decisions a long time ago. I know another large internet
> >> > company adopted a similar approach to ours, and I'm wondering how
> >> > everybody else is coping with the discrepancy from those counters.
> >> 
> >> From intuition, the scanning overhead of full page table scanning
> >> appears higher than that of rmap scanning for a small portion of
> >> system memory.  But from your words, you think the reality is the
> >> reverse?  If others are concerned about the overhead too, then in the
> >> end I think you need to prove, with more data and theory, that the
> >> overhead of page table scanning isn't higher, or is even lower.
> >
> > There is a misunderstanding here. I never said anything about full
> > page table scanning. And this is not how it's done in this series
> > either. I guess the misunderstanding has something to do with the cold
> > memory tracking you are thinking about?
> 
> If my understanding were correct, from the following code path in your
> patch 10/14,
> 
> age_active_anon
>   age_lru_gens
> try_walk_mm_list
>   walk_mm_list
> walk_mm
> 
> So, in kswapd(), the page tables of many processes may be scanned
> fully.  If the number of active processes is high, the overhead
> may be high too.

That's correct. Just in case we have different definitions of what we
call "full":

  I understand it as the full range of the address space of a process
  that was loaded by switch_mm() at least once since the last scan.
  This is not the case because we don't scan the full range -- we skip
  holes and VMAs that are unevictable, as well as PTE tables that have
  no accessed entries on x86_64, by should_skip_vma() and
  CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG.

  If you are referring to the full range of PTE tables that have at
  least one accessed entry, i.e., the other 511 are not none but have not
  been accessed either since the last scan on x86_64, then yes, you
  are right again :) This is the worst-case scenario.
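
To make that concrete, the filter is roughly of the following shape. This
is only an illustration of the idea -- the actual should_skip_vma() in the
series checks a different, more complete set of conditions:

static bool skip_vma_example(struct vm_area_struct *vma)
{
        /* skip unevictable and special mappings (mlocked, PFN maps, etc.) */
        if (vma->vm_flags & (VM_LOCKED | VM_SPECIAL))
                return true;

        /* skip mappings not backed by normal LRU pages */
        if (is_vm_hugetlb_page(vma))
                return true;

        return false;
}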
  
> > This series uses page tables to discover page accesses when a system
> > has run out of inactive pages. Under such a situation, the system is
> > very likely to have a lot of page accesses, and using the rmap is
> > likely to cost a lot more because of its poor memory locality compared
> > with page tables.
> 
> This is the theory.  Can you verify this with more data?  Including the
> CPU cycles or time spent scanning page tables?

Yes, I'll be happy to do so as I should, because page table scanning
is counterintuitive. Let me add more theory in case it's still unclear
to others.

From my understanding, the two fundamental questions we need to
consider in terms of page reclaim are:

  What sizes of hot clusters (spatial locality) should we expect
  under memory pressure?

  On smaller systems with 4GB memory, our observations are that the
  average size of hot clusters found during each scan is 32KB. On
  larger systems with hundreds of gigabytes of memory, it's well
  above this value -- 512KB or larger. These values vary under
  different workloads and with different memory allocators. Unless
  done deliberately by memory allocators, e.g., Scudo as I've
  mentioned earlier, it's safe to say if a PTE entry has been
  accessed, its neighbors are likely to have been accessed too.

  What hot memory footprint (total size of hot clusters) should we
  expect when we have run out of inactive pages?

  Some numbers first: on large and heavily overcommitted systems, we
  have observed close to 90% during a scan. Those systems have
  millions of pages and using the rmap to find out which pages to
  reclaim will just overwhelm kswapd. On smaller systems with less memory
  pressure (due to their weaker CPUs), this number is more reasonable,
  ~50%. Here are some kswapd profiles from a smaller system running
  5.11:

   the rmap                                  page table scan
   ------------------------------------------------------------------
   31.03%  page_vma_mapped_walk               49.36%  lzo1x_1_do_compress
   25.59%  lzo1x_1_do_compress                 4.54%  page_vma_mapped_walk
    4.63%  do_raw_spin_lock                    4.45%  memset_erms
    3.89%  vma_interval_tree_iter_next         3.47%  walk_pte_range
    3.33%  vma_interval_tree_subtree_search    2.88%  zram_bvec_rw

  The page table scan is only about twice as fast here. On larger
  systems, it's usually more than 4 times faster, without THP. With
  THP, both are negligible (<1% CPU usage). I can grab profiles from
  our servers too if you are interested.
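
To give a sense of where the locality comes from: each PTE table holds 512
contiguous entries, so a walk touches them with sequential reads, whereas
the rmap has to chase mappings page by page. A simplified sketch of
checking one PTE table for accessed entries (not the series' actual
walk_pte_range()) looks like:

static unsigned long count_young_example(pmd_t *pmd, unsigned long addr,
                                         struct vm_area_struct *vma)
{
        unsigned long i, young = 0;
        pte_t *pte = pte_offset_map(pmd, addr);

        for (i = 0; i < PTRS_PER_PTE; i++) {
                if (!pte_present(pte[i]))
                        continue;
                /* test and clear the accessed bit of this entry */
                if (ptep_test_and_clear_young(vma, addr + i * PAGE_SIZE, pte + i))
                        young++;
        }
        pte_unmap(pte);

        return young;
}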

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-21 Thread Huang, Ying
Yu Zhao  writes:

> On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
>> Yu Zhao  writes:
>> 
>> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> > The scanning overhead is only one of the two major problems of the
>> > current page reclaim. The other problem is the granularity of the
>> > active/inactive (sizes). We stopped using them in making job
>> > scheduling decisions a long time ago. I know another large internet
>> > company adopted a similar approach to ours, and I'm wondering how
>> > everybody else is coping with the discrepancy from those counters.
>> 
>> From intuition, the scanning overhead of full page table scanning
>> appears higher than that of rmap scanning for a small portion of
>> system memory.  But from your words, you think the reality is the
>> reverse?  If others are concerned about the overhead too, then in the
>> end I think you need to prove, with more data and theory, that the
>> overhead of page table scanning isn't higher, or is even lower.
>
> There is a misunderstanding here. I never said anything about full
> page table scanning. And this is not how it's done in this series
> either. I guess the misunderstanding has something to do with the cold
> memory tracking you are thinking about?

If my understanding were correct, from the following code path in your
patch 10/14,

age_active_anon
  age_lru_gens
try_walk_mm_list
  walk_mm_list
walk_mm

So, in kswapd(), the page tables of many processes may be scanned
fully.  If the number of active processes is high, the overhead may be
high too.

> This series uses page tables to discover page accesses when a system
> has run out of inactive pages. Under such a situation, the system is
> very likely to have a lot of page accesses, and using the rmap is
> likely to cost a lot more because of its poor memory locality compared
> with page tables.

This is the theory.  Can you verify this with more data?  Including the
CPU cycles or time spent scanning page tables?

> But, page tables can be sparse too, in terms of hot memory tracking.
> Dave has asked me to test the worst case scenario, which I'll do.
> And I'd be happy to share more data. Any specific workload you are
> interested in?

We can start with some simple workloads that are easier to reason
about.  For example:

1. Run a workload with hot and cold pages; when the free memory drops
below the low watermark, kswapd will be woken up to scan and reclaim
some cold pages.  How long will it take to do that?  It's expected
that almost all pages need to be scanned, so page table scanning is
expected to have less overhead.  We can measure how well it performs.

2. Run a workload with hot and cold pages whose whole working set
cannot fit in DRAM, that is, the cold pages will be reclaimed and
swapped in regularly (for example, tens of MB/s).  It's expected that
fewer pages may be scanned with the rmap, but page table scanning is
faster.

3. Run a workload with hot and cold pages on an overcommitted system,
that is, some cold pages will be placed in swap.  But the cold pages
are cold enough, so there's almost no thrashing.  Then the hot working
set of the workload changes, that is, some hot pages become cold while
some cold pages become hot, so page reclaim and swap-in will be
triggered.

For each case, we can use different parameters, and we can measure
things like the number of pages scanned, the time taken to scan them,
the number of pages reclaimed and swapped in, etc.
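
To make case 1 concrete, a minimal user-space sketch could look like the
following; the sizes and the hot/cold split are placeholders and need to
be tuned so that faulting in the cold part pushes free memory below the
low watermark:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define GB (1024UL * 1024 * 1024)

int main(void)
{
        /* placeholder sizes: tune to the test machine */
        size_t cold_size = 8 * GB, hot_size = 1 * GB;
        char *cold = mmap(NULL, cold_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char *hot = mmap(NULL, hot_size, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (cold == MAP_FAILED || hot == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* fault in the cold pages once, then never touch them again */
        memset(cold, 1, cold_size);

        /* keep touching the hot pages; kswapd should reclaim the cold ones */
        for (int i = 0; i < 1000; i++)
                memset(hot, i, hot_size);

        return 0;
}

The time kswapd spends (e.g., its CPU time plus the pgscan/pgsteal
counters in /proc/vmstat) can then be compared between the rmap-based and
the page-table-based kernels.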

Best Regards,
Huang, Ying


Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-17 Thread Yu Zhao
On Wed, Mar 17, 2021 at 11:37:38AM +0800, Huang, Ying wrote:
> Yu Zhao  writes:
> 
> > On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> >> Yu Zhao  writes:
> >> 
> >> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> >> Rik van Riel  writes:
> >> >> 
> >> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >> >
> >> >> >> +/*
> >> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> >> They must
> >> >> >> + * go through aging process twice before they can be evicted. After
> >> >> >> first scan,
> >> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> >> become the
> >> >> >> + * second youngest generation. And second scan makes sure they
> >> >> >> haven't been used
> >> >> >> + * since the first.
> >> >> >> + */
> >> >> >
> >> >> > I have to wonder if the reductions in OOM kills and 
> >> >> > low-memory tab discards are due to this aging policy
> >> >> > change, rather than from the switch to virtual scanning.
> >> >
> >> > There are no policy changes per se. The current page reclaim also
> >> > scans a faulted-in page at least twice before it can reclaim it.
> >> > That said, the new aging yields a better overall result because it
> >> > discovers every page that has been referenced since the last scan,
> >> > in addition to what Ying has mentioned. The current page scan stops
> >> > once it finds enough candidates, which may seem more efficient,
> >> > but actually pays the price for not finding the best.
> >> >
> >> >> If my understanding were correct, the temperature of the processes is
> >> >> considered in addition to that of the individual pages.  That is, the
> >> >> pages of the processes that haven't been scheduled after the previous
> >> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >> >
> >> > Yes, that's correct.
> >> >
> >> >> If so, how about just taking advantage of that information for OOM
> >> >> killing and page reclaiming?  For example, if a process hasn't been
> >> >> scheduled for a long time, just reclaim its private pages.
> >> >
> >> > This is how it works. Pages that haven't been scanned grow older
> >> > automatically because those that have been scanned will be tagged with
> >> > younger generation numbers. Eviction does bucket sort based on
> >> > generation numbers and attacks the oldest.
> >> 
> >> Sorry, my original words were misleading.  What I wanted to say was:
> >> is it good enough that we
> >> 
> >> - do not change the core algorithm of current page reclaiming, and
> >> 
> >> - add some new logic to reclaim the process private pages regardless of
> >>   the Accessed bits if the processes are not scheduled for some long
> >>   enough time?  This can be done before the normal page reclaiming.
> >
> > This is a good idea, which is being used on Android and Chrome OS. We
> > call it per-process reclaim, and I've mentioned it here:
> > https://lore.kernel.org/linux-mm/ybkt6175gmmwb...@google.com/
> >   On Android, our most advanced simulation that generates memory
> >   pressure from realistic user behavior shows 18% fewer low-memory
> >   kills, which in turn reduces cold starts by 16%. This is on top of
> >   per-process reclaim, a predecessor of ``MADV_COLD`` and
> >   ``MADV_PAGEOUT``, against background apps.
> 
> Thanks, now I see your improvement compared with the per-process
> reclaim.  How about the per-process reclaim compared with the normal
> page reclaiming for similar test cases?
> 
> My intention behind this is that your solution includes several
> improvements,
> 
> a) take advantage of scheduler information
> b) more fine-grained active/inactive dividing
> c) page table scanning instead of rmap
> 
> Is it possible to evaluate the benefit of each step?  And is there
> still some potential to optimize the current LRU based algorithm before
> adopting a totally new algorithm?

Well, there isn't really any new algorithm -- it's still the LRU
(algorithm). But I do see your point. In another survey we posted
here:
https://lore.kernel.org/linux-mm/ybkt6175gmmwb...@google.com/
we stated:
  Why not try to improve the existing code?
  -
  We have tried but concluded the two limiting factors [note]_ in the
  existing code are fundamental, and therefore changes made atop them
  will not result in substantial gains on any of the aspects above.

We learned this the hard way.

> > The patches landed not long ago :) See mm/madvise.c
> 
> :) I'm too out-dated.
> 
> >> So this is a small incremental improvement to the current page reclaiming
> >> algorithm, made by taking advantage of the scheduler information.  It's
> >> clearly not as sophisticated as your new algorithm; for example, the cold
> >> pages in the hot processes will not be reclaimed in this stage.  But it
> >> can reduce the overhead of scanning too.
> >
> > The general problems with the direction of per-process reclaim:
> >   1) we can't find the 

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-16 Thread Huang, Ying
Yu Zhao  writes:

> On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
>> Yu Zhao  writes:
>> 
>> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> >> Rik van Riel  writes:
>> >> 
>> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >> >
>> >> >> +/*
>> >> >> + * After pages are faulted in, they become the youngest generation.
>> >> >> They must
>> >> >> + * go through aging process twice before they can be evicted. After
>> >> >> first scan,
>> >> >> + * their accessed bit set during initial faults are cleared and they
>> >> >> become the
>> >> >> + * second youngest generation. And second scan makes sure they
>> >> >> haven't been used
>> >> >> + * since the first.
>> >> >> + */
>> >> >
>> >> > I have to wonder if the reductions in OOM kills and 
>> >> > low-memory tab discards are due to this aging policy
>> >> > change, rather than from the switch to virtual scanning.
>> >
>> > There are no policy changes per se. The current page reclaim also
>> > scans a faulted-in page at least twice before it can reclaim it.
>> > That said, the new aging yields a better overall result because it
>> > discovers every page that has been referenced since the last scan,
>> > in addition to what Ying has mentioned. The current page scan stops
>> > once it finds enough candidates, which may seem more efficient,
>> > but actually pays the price for not finding the best.
>> >
>> >> If my understanding were correct, the temperature of the processes is
>> >> considered in addition to that of the individual pages.  That is, the
>> >> pages of the processes that haven't been scheduled after the previous
>> >> scanning will not be scanned.  I guess that this helps OOM kills?
>> >
>> > Yes, that's correct.
>> >
>> >> If so, how about just taking advantage of that information for OOM
>> >> killing and page reclaiming?  For example, if a process hasn't been
>> >> scheduled for a long time, just reclaim its private pages.
>> >
>> > This is how it works. Pages that haven't been scanned grow older
>> > automatically because those that have been scanned will be tagged with
>> > younger generation numbers. Eviction does bucket sort based on
>> > generation numbers and attacks the oldest.
>> 
>> Sorry, my original words were misleading.  What I wanted to say was:
>> is it good enough that we
>> 
>> - do not change the core algorithm of current page reclaiming, and
>> 
>> - add some new logic to reclaim the process private pages regardless of
>>   the Accessed bits if the processes are not scheduled for some long
>>   enough time?  This can be done before the normal page reclaiming.
>
> This is a good idea, which is being used on Android and Chrome OS. We
> call it per-process reclaim, and I've mentioned it here:
> https://lore.kernel.org/linux-mm/ybkt6175gmmwb...@google.com/
>   On Android, our most advanced simulation that generates memory
>   pressure from realistic user behavior shows 18% fewer low-memory
>   kills, which in turn reduces cold starts by 16%. This is on top of
>   per-process reclaim, a predecessor of ``MADV_COLD`` and
>   ``MADV_PAGEOUT``, against background apps.

Thanks, now I see your improvement compared with the per-process
reclaim.  How about the per-process reclaim compared with the normal
page reclaiming for similar test cases?

My intention behind this is that your solution includes several
improvements,

a) take advantage of scheduler information
b) more fine-grained active/inactive dividing
c) page table scanning instead of rmap

Is it possible to evaluate the benefit of each step?  And is there
still some potential to optimize the current LRU based algorithm before
adopting a totally new algorithm?

> The patches landed not long ago :) See mm/madvise.c

:) I'm too out-dated.

>> So this is a small incremental improvement to the current page reclaiming
>> algorithm, made by taking advantage of the scheduler information.  It's
>> clearly not as sophisticated as your new algorithm; for example, the cold
>> pages in the hot processes will not be reclaimed in this stage.  But it
>> can reduce the overhead of scanning too.
>
> The general problems with the direction of per-process reclaim:
>   1) we can't find the coldest pages, as you have mentioned.
>   2) we can't reach file pages accessed via file descriptors only,
>   especially those caching config files that were read only once.

In theory, this is possible: we can build an inode list based on the
access time too, although this may not be necessary.  We can reclaim
the read-once file cache before the per-process reclaim in theory.

>   3) we can't reclaim lru pages and slab objects proportionally and
>   therefore we leave many stale slab objects behind.
>   4) we have to be proactive, as you suggested (once again, you were
>   right), and this has a serious problem: client's battery life can
>   be affected.

Why can this not be done reactively?  We can start per-process reclaim
under memory pressure.  This has been used in p

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-16 Thread Yu Zhao
On Tue, Mar 16, 2021 at 02:44:31PM +0800, Huang, Ying wrote:
> Yu Zhao  writes:
> 
> > On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> >> Rik van Riel  writes:
> >> 
> >> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >> >
> >> >> +/*
> >> >> + * After pages are faulted in, they become the youngest generation.
> >> >> They must
> >> >> + * go through aging process twice before they can be evicted. After
> >> >> first scan,
> >> >> + * their accessed bit set during initial faults are cleared and they
> >> >> become the
> >> >> + * second youngest generation. And second scan makes sure they
> >> >> haven't been used
> >> >> + * since the first.
> >> >> + */
> >> >
> >> > I have to wonder if the reductions in OOM kills and 
> >> > low-memory tab discards are due to this aging policy
> >> > change, rather than from the switch to virtual scanning.
> >
> > There are no policy changes per se. The current page reclaim also
> > scans a faulted-in page at least twice before it can reclaim it.
> > That said, the new aging yields a better overall result because it
> > discovers every page that has been referenced since the last scan,
> > in addition to what Ying has mentioned. The current page scan stops
> > once it finds enough candidates, which may seem more efficient,
> > but actually pays the price for not finding the best.
> >
> >> If my understanding were correct, the temperature of the processes is
> >> considered in addition to that of the individual pages.  That is, the
> >> pages of the processes that haven't been scheduled after the previous
> >> scanning will not be scanned.  I guess that this helps OOM kills?
> >
> > Yes, that's correct.
> >
> >> If so, how about just taking advantage of that information for OOM
> >> killing and page reclaiming?  For example, if a process hasn't been
> >> scheduled for a long time, just reclaim its private pages.
> >
> > This is how it works. Pages that haven't been scanned grow older
> > automatically because those that have been scanned will be tagged with
> > younger generation numbers. Eviction does bucket sort based on
> > generation numbers and attacks the oldest.
> 
> Sorry, my original words were misleading.  What I wanted to say was:
> is it good enough that we
> 
> - do not change the core algorithm of current page reclaiming, and
> 
> - add some new logic to reclaim the process private pages regardless of
>   the Accessed bits if the processes are not scheduled for some long
>   enough time?  This can be done before the normal page reclaiming.

This is a good idea, which is being used on Android and Chrome OS. We
call it per-process reclaim, and I've mentioned it here:
https://lore.kernel.org/linux-mm/ybkt6175gmmwb...@google.com/
  On Android, our most advanced simulation that generates memory
  pressure from realistic user behavior shows 18% fewer low-memory
  kills, which in turn reduces cold starts by 16%. This is on top of
  per-process reclaim, a predecessor of ``MADV_COLD`` and
  ``MADV_PAGEOUT``, against background apps.

The patches landed not long ago :) See mm/madvise.c
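
For completeness, the user-space side is just madvise() with the new
advice values (available since Linux 5.4; the constants may need recent
kernel headers). A minimal sketch, with the region size purely as an
example:

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
        size_t len = 64UL << 20;        /* 64MB example region */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
        }
        memset(buf, 1, len);            /* fault the pages in */

        /* deactivate the range without reclaiming it */
        if (madvise(buf, len, MADV_COLD))
                perror("madvise(MADV_COLD)");

        /* reclaim the range right away (swap out / drop clean pages) */
        if (madvise(buf, len, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

        return 0;
}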

> So this is a small incremental improvement to the current page reclaiming
> algorithm, made by taking advantage of the scheduler information.  It's
> clearly not as sophisticated as your new algorithm; for example, the cold
> pages in the hot processes will not be reclaimed in this stage.  But it
> can reduce the overhead of scanning too.

The general problems with the direction of per-process reclaim:
  1) we can't find the coldest pages, as you have mentioned.
  2) we can't reach file pages accessed via file descriptors only,
  especially those caching config files that were read only once.
  3) we can't reclaim lru pages and slab objects proportionally and
  therefore we leave many stale slab objects behind.
  4) we have to be proactive, as you suggested (once again, you were
  right), and this has a serious problem: client's battery life can
  be affected.

The scanning overhead is only one of the two major problems of the
current page reclaim. The other problem is the granularity of the
active/inactive (sizes). We stopped using them in making job
scheduling decisions a long time ago. I know another large internet
company adopted a similar approach to ours, and I'm wondering how
everybody else is coping with the discrepancy from those counters.

> All in all, some of your ideas may help the original LRU algorithm too.
> Or some can be tried without replacing the original algorithm.
> 
> But from another point of view, your solution can be seen as a kind of
> improvement on top of the original LRU algorithm too.  It moves the
> recently accessed pages to what amounts to multiple active lists based on
> scanning page tables directly (instead of in reverse).

We hope this series can be a framework or an infrastructure flexible
enough that people can build their complex use cases upon, e.g.,
proactive reclaim (machine-wide, not per process), cold memory
estimation (for job sched

Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-15 Thread Huang, Ying
Yu Zhao  writes:

> On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
>> Rik van Riel  writes:
>> 
>> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>> >
>> >> +/*
>> >> + * After pages are faulted in, they become the youngest generation.
>> >> They must
>> >> + * go through aging process twice before they can be evicted. After
>> >> first scan,
>> >> + * their accessed bit set during initial faults are cleared and they
>> >> become the
>> >> + * second youngest generation. And second scan makes sure they
>> >> haven't been used
>> >> + * since the first.
>> >> + */
>> >
>> > I have to wonder if the reductions in OOM kills and 
>> > low-memory tab discards are due to this aging policy
>> > change, rather than from the switch to virtual scanning.
>
> There are no policy changes per se. The current page reclaim also
> scans a faulted-in page at least twice before it can reclaim it.
> That said, the new aging yields a better overall result because it
> discovers every page that has been referenced since the last scan,
> in addition to what Ying has mentioned. The current page scan stops
> once it finds enough candidates, which may seem more efficient,
> but actually pays the price for not finding the best.
>
>> If my understanding were correct, the temperature of the processes is
>> considered in addition to that of the individual pages.  That is, the
>> pages of the processes that haven't been scheduled after the previous
>> scanning will not be scanned.  I guess that this helps OOM kills?
>
> Yes, that's correct.
>
>> If so, how about just taking advantage of that information for OOM
>> killing and page reclaiming?  For example, if a process hasn't been
>> scheduled for a long time, just reclaim its private pages.
>
> This is how it works. Pages that haven't been scanned grow older
> automatically because those that have been scanned will be tagged with
> younger generation numbers. Eviction does bucket sort based on
> generation numbers and attacks the oldest.

Sorry, my original words were misleading.  What I wanted to say was:
is it good enough that we

- do not change the core algorithm of current page reclaiming, and

- add some new logic to reclaim the process private pages regardless of
  the Accessed bits if the processes are not scheduled for some long
  enough time?  This can be done before the normal page reclaiming.

So this is a small incremental improvement to the current page reclaiming
algorithm, made by taking advantage of the scheduler information.  It's
clearly not as sophisticated as your new algorithm; for example, the cold
pages in the hot processes will not be reclaimed in this stage.  But it
can reduce the overhead of scanning too.

All in all, some of your ideas may help the original LRU algorithm too.
Or some can be tried without replacing the original algorithm.

But from another point of view, your solution can be seen as a kind of
improvement on top of the original LRU algorithm too.  It moves the
recently accessed pages to what amounts to multiple active lists based on
scanning page tables directly (instead of in reverse).

Best Regards,
Huang, Ying



Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-15 Thread Yu Zhao
On Tue, Mar 16, 2021 at 10:07:36AM +0800, Huang, Ying wrote:
> Rik van Riel  writes:
> 
> > On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
> >
> >> +/*
> >> + * After pages are faulted in, they become the youngest generation.
> >> They must
> >> + * go through aging process twice before they can be evicted. After
> >> first scan,
> >> + * their accessed bit set during initial faults are cleared and they
> >> become the
> >> + * second youngest generation. And second scan makes sure they
> >> haven't been used
> >> + * since the first.
> >> + */
> >
> > I have to wonder if the reductions in OOM kills and 
> > low-memory tab discards are due to this aging policy
> > change, rather than from the switch to virtual scanning.

There are no policy changes per se. The current page reclaim also
scans a faulted-in page at least twice before it can reclaim it.
That said, the new aging yields a better overall result because it
discovers every page that has been referenced since the last scan,
in addition to what Ying has mentioned. The current page scan stops
once it finds enough candidates, which may seem more efficient,
but actually pays the price for not finding the best.

> If my understanding were correct, the temperature of the processes is
> considered in addition to that of the individual pages.  That is, the
> pages of the processes that haven't been scheduled after the previous
> scanning will not be scanned.  I guess that this helps OOM kills?

Yes, that's correct.

> If so, how about just taking advantage of that information for OOM
> killing and page reclaiming?  For example, if a process hasn't been
> scheduled for a long time, just reclaim its private pages.

This is how it works. Pages that haven't been scanned grow older
automatically because those that have been scanned will be tagged with
younger generation numbers. Eviction does bucket sort based on
generation numbers and attacks the oldest.
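
A toy illustration of that bucket sort (not the kernel's data structures):
each page carries a generation number set by the aging scan, and eviction
walks the buckets from the oldest generation upward until it has enough
candidates:

#include <stdio.h>

#define NR_PAGES 8

struct toy_page { int id; int gen; };

int main(void)
{
        /* generation numbers as assigned by aging; higher = younger */
        struct toy_page pages[NR_PAGES] = {
                {0, 3}, {1, 0}, {2, 2}, {3, 0}, {4, 1}, {5, 3}, {6, 1}, {7, 0},
        };
        int min_seq = 0, max_seq = 3, want = 4, got = 0;

        for (int gen = min_seq; gen <= max_seq && got < want; gen++) {
                for (int i = 0; i < NR_PAGES && got < want; i++) {
                        if (pages[i].gen == gen) {
                                printf("evict page %d (gen %d)\n",
                                       pages[i].id, gen);
                                got++;
                        }
                }
        }

        return 0;
}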


Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-15 Thread Huang, Ying
Rik van Riel  writes:

> On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:
>
>> +/*
>> + * After pages are faulted in, they become the youngest generation.
>> They must
>> + * go through aging process twice before they can be evicted. After
>> first scan,
>> + * their accessed bit set during initial faults are cleared and they
>> become the
>> + * second youngest generation. And second scan makes sure they
>> haven't been used
>> + * since the first.
>> + */
>
> I have to wonder if the reductions in OOM kills and 
> low-memory tab discards are due to this aging policy
> change, rather than from the switch to virtual scanning.

If my understanding were correct, the temperature of the processes is
considered in addition to that of the individual pages.  That is, the
pages of the processes that haven't been scheduled after the previous
scanning will not be scanned.  I guess that this helps OOM kills?

If so, how about just taking advantage of that information for OOM
killing and page reclaiming?  For example, if a process hasn't been
scheduled for a long time, just reclaim its private pages.

Best Regards,
Huang, Ying


Re: [PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-15 Thread Rik van Riel
On Sat, 2021-03-13 at 00:57 -0700, Yu Zhao wrote:

> +/*
> + * After pages are faulted in, they become the youngest generation.
> They must
> + * go through aging process twice before they can be evicted. After
> first scan,
> + * their accessed bit set during initial faults are cleared and they
> become the
> + * second youngest generation. And second scan makes sure they
> haven't been used
> + * since the first.
> + */

I have to wonder if the reductions in OOM kills and 
low-memory tab discards are due to this aging policy
change, rather than from the switch to virtual scanning.

-- 
All Rights Reversed.




[PATCH v1 09/14] mm: multigenerational lru: mm_struct list

2021-03-13 Thread Yu Zhao
Add an infrastructure that maintains either a system-wide mm_struct
list or per-memcg mm_struct lists. Multiple threads can concurrently
work on the same mm_struct list, and each of them will be given a
different mm_struct. Those who finish early can optionally wait on the
rest after the iterator has reached the end of the list.

This infrastructure also tracks whether an mm_struct is being used on
any CPUs or has been used since the last time a worker looked at it.
In other words, workers will not be given an mm_struct that belongs to
a process that has been sleeping.
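
The hand-off between concurrent workers can be pictured roughly as
follows. This is only an illustration of the protocol described above,
based on the fields of struct lru_gen_mm_list added below; it is not the
actual helper this patch adds to mm/vmscan.c:

/* illustration only -- not the helper added by this patch */
static struct mm_struct *get_next_mm_example(struct lru_gen_mm_list *mm_list,
                                             int nid)
{
        struct mm_struct *mm = NULL;
        bool last = false;

        spin_lock(&mm_list->lock);
        if (mm_list->nodes[nid].iter != &mm_list->head) {
                /* each caller is handed a different mm_struct */
                mm = list_entry(mm_list->nodes[nid].iter, struct mm_struct,
                                lru_gen.list);
                mm_list->nodes[nid].iter = mm_list->nodes[nid].iter->next;
        } else {
                /* this worker has hit the end of the list */
                last = !--mm_list->nodes[nid].nr_workers;
        }
        spin_unlock(&mm_list->lock);

        /* the last worker to finish wakes up those waiting on the rest */
        if (last)
                wake_up_all(&mm_list->nodes[nid].wait);

        return mm;
}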

Signed-off-by: Yu Zhao 
---
 fs/exec.c  |   2 +
 include/linux/memcontrol.h |   4 +
 include/linux/mm_types.h   | 135 +++
 include/linux/mmzone.h |   2 -
 kernel/exit.c  |   1 +
 kernel/fork.c  |  10 ++
 kernel/kthread.c   |   1 +
 kernel/sched/core.c|   2 +
 mm/memcontrol.c|  28 
 mm/vmscan.c| 263 +
 10 files changed, 446 insertions(+), 2 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 18594f11c31f..c691d4d7720c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1008,6 +1008,7 @@ static int exec_mmap(struct mm_struct *mm)
active_mm = tsk->active_mm;
tsk->active_mm = mm;
tsk->mm = mm;
+   lru_gen_add_mm(mm);
/*
 * This prevents preemption while active_mm is being loaded and
 * it and mm are being updated, which could cause problems for
@@ -1018,6 +1019,7 @@ static int exec_mmap(struct mm_struct *mm)
if (!IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
activate_mm(active_mm, mm);
+   lru_gen_switch_mm(active_mm, mm);
if (IS_ENABLED(CONFIG_ARCH_WANT_IRQS_OFF_ACTIVATE_MM))
local_irq_enable();
tsk->mm->vmacache_seqnum = 0;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index f325aeb4b4e8..591557c5b7e2 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -335,6 +335,10 @@ struct mem_cgroup {
struct deferred_split deferred_split_queue;
 #endif
 
+#ifdef CONFIG_LRU_GEN
+   struct lru_gen_mm_list *mm_list;
+#endif
+
struct mem_cgroup_per_node *nodeinfo[0];
/* WARNING: nodeinfo must be the last member here */
 };
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0974ad501a47..b8a038a016f2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -15,6 +15,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 
@@ -382,6 +384,8 @@ struct core_state {
struct completion startup;
 };
 
+#define ANON_AND_FILE 2
+
 struct kioctx_table;
 struct mm_struct {
struct {
@@ -560,6 +564,22 @@ struct mm_struct {
 
 #ifdef CONFIG_IOMMU_SUPPORT
u32 pasid;
+#endif
+#ifdef CONFIG_LRU_GEN
+   struct {
+   /* node of a global or per-memcg mm list */
+   struct list_head list;
+#ifdef CONFIG_MEMCG
+   /* points to memcg of the owner task above */
+   struct mem_cgroup *memcg;
+#endif
+   /* indicates this mm has been used since last walk */
+   nodemask_t nodes[ANON_AND_FILE];
+#ifndef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+   /* number of cpus that are using this mm */
+   atomic_t nr_cpus;
+#endif
+   } lru_gen;
 #endif
} __randomize_layout;
 
@@ -587,6 +607,121 @@ static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
return (struct cpumask *)&mm->cpu_bitmap;
 }
 
+#ifdef CONFIG_LRU_GEN
+
+struct lru_gen_mm_list {
+   /* head of a global or per-memcg mm list */
+   struct list_head head;
+   /* protects the list */
+   spinlock_t lock;
+   struct {
+   /* set to max_seq after each round of walk */
+   unsigned long cur_seq;
+   /* next mm on the list to walk */
+   struct list_head *iter;
+   /* to wait for last worker to finish */
+   struct wait_queue_head wait;
+   /* number of concurrent workers */
+   int nr_workers;
+   } nodes[0];
+};
+
+void lru_gen_init_mm(struct mm_struct *mm);
+void lru_gen_add_mm(struct mm_struct *mm);
+void lru_gen_del_mm(struct mm_struct *mm);
+#ifdef CONFIG_MEMCG
+int lru_gen_alloc_mm_list(struct mem_cgroup *memcg);
+void lru_gen_free_mm_list(struct mem_cgroup *memcg);
+void lru_gen_migrate_mm(struct mm_struct *mm);
+#endif
+
+/*
+ * Track usage so mms that haven't been used since last walk can be skipped.
+ *
+ * This function introduces a theoretical overhead for each mm switch, but it
+ * hasn't been measurable.
+ */
+static inline void lru_gen_switch_mm(struct mm_struct *old, struct mm_struct *new)
+{
+   int file;
+
+   /* exclude init_mm, efi_mm, etc. */
+   if