Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Qian Cai
On Wed, 2019-10-23 at 11:03 -0400, Rafael Aquini wrote:
> > > > this still isn't a bulletproof fix.  Maybe just terminate the list
> > > > walk if freecount reaches 1024.  Would anyone really care?
> > > > 
> > > > Sigh.  I wonder if anyone really uses this thing for anything
> > > > important.  Can we just remove it all?
> > > > 
> > > 
> > > Removing it would break the kernel API.
> > 
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> > 
> 
> Seems that _some_ are using it, aren't they? Otherwise we wouldn't be
> seeing complaints. Moving it out of /proc to /sys/kernel/debug looks 
> like a reasonable compromise with those that care about the interface.

No, some known testing tools blindly read this file just because it
exists; that is not a good reason to keep the file.


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Rafael Aquini
On Wed, Oct 23, 2019 at 10:48:13AM -0400, Qian Cai wrote:
> On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> > On 10/22/19 5:59 PM, Andrew Morton wrote:
> > > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
> > > 
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates each of the free lists to get the counts. There are
> > > > bug reports of hard lockup panics when reading the /proc/pagetypeinfo
> > > > file simply because it takes too long to iterate all the free lists
> > > > within a zone while holding the zone lock with irqs disabled.
> > > > 
> > > > Given that /proc/pagetypeinfo is readable by all, the possibility
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > > 
> > > Yes.
> > > 
> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > > 
> > > > We can actually skip iterating the list of one of the migration types
> > > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > > 
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with a large amount of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > > 
> > > > ...
> > > > 
> > > > --- a/mm/vmstat.c
> > > > +++ b/mm/vmstat.c
> > > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > > > 	pg_data_t *pgdat, struct zone *zone)
> > > >  {
> > > > int order, mtype;
> > > > +   unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > > 
> > > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > > 
> > 
> > This function is called by reading /proc/pagetypeinfo. The call stack is
> > rather shallow:
> > 
> > PID: 58188  TASK: 938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
> >  #0 [9483bf488e48] crash_nmi_callback at b8c551d7
> >  #1 [9483bf488e58] nmi_handle at b931d8cc
> >  #2 [9483bf488eb0] do_nmi at b931dba8
> >  #3 [9483bf488ef0] end_repeat_nmi at b931cd69
> >     [exception RIP: pagetypeinfo_showfree_print+0x73]
> >     RIP: b8db7173  RSP: 938b9fcbfda0  RFLAGS: 0006
> >     RAX: f0c9946d7020  RBX: 96073ffd5528  RCX: 
> >     RDX: 001c7764  RSI: b9676ab1  RDI: 
> >     RBP: 938b9fcbfdd0   R8: 000a   R9: fffe
> >     R10:   R11: 938b9fcbfc36  R12: 942b97758240
> >     R13: b942f730  R14: 96073ffd5000  R15: 96073ffd5180
> >     ORIG_RAX:   CS: 0010  SS: 0018
> > --- <NMI exception stack> ---
> >  #4 [938b9fcbfda0] pagetypeinfo_showfree_print at b8db7173
> >  #5 [938b9fcbfdd8] walk_zones_in_node at b8db74df
> >  #6 [938b9fcbfe20] pagetypeinfo_show at b8db7a29
> >  #7 [938b9fcbfe48] seq_read at b8e45c3c
> >  #8 [938b9fcbfeb8] proc_reg_read at b8e95070
> >  #9 [938b9fcbfed8] vfs_read at b8e1f2af
> > #10 [938b9fcbff08] sys_read at b8e2017f
> > #11 [938b9fcbff50] system_call_fastpath at b932579b
> > 
> > So we should not be in any risk of overflowing the stack.
> > 
> > > > -   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > > -   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > > -   pgdat->node_id,
> > > > -   zone->name,
> > > > -   migratetype_names[mtype]);
> > > > -   for (order = 0; order < MAX_ORDER; ++order) {
> > > > +   lockdep_assert_held(&zone->lock);
> > > > +   lockdep_assert_irqs_disabled();
> > > > +
> > > > +   /*
> > > > +* MIGRATE_MOVABLE is usually the largest one in large memory
> > > > +* systems. We skip iterating that list. Instead, we compute it by
> > > > +* subtracting the total of the rests from free_area->nr_free.
> > > > +*/

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Qian Cai
On Wed, 2019-10-23 at 11:01 -0400, Waiman Long wrote:
> On 10/23/19 10:48 AM, Qian Cai wrote:
> > > > this still isn't a bulletproof fix.  Maybe just terminate the list
> > > > walk if freecount reaches 1024.  Would anyone really care?
> > > > 
> > > > Sigh.  I wonder if anyone really uses this thing for anything
> > > > important.  Can we just remove it all?
> > > > 
> > > 
> > > Removing it would break the kernel API.
> > 
> > Who cares about breaking this part of the API when essentially nobody
> > uses this file?
> > 
> 
> There are certainly tools that use /proc/pagetypeinfo, and that is how
> the problem was found. I am not against removing it, but we have to be
> careful and deprecate it in a way that minimizes user impact.

What are those tools?


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
On 10/23/19 10:48 AM, Qian Cai wrote:
>>> this still isn't a bulletproof fix.  Maybe just terminate the list
>>> walk if freecount reaches 1024.  Would anyone really care?
>>>
>>> Sigh.  I wonder if anyone really uses this thing for anything
>>> important.  Can we just remove it all?
>>>
>> Removing it would break the kernel API.
> Who cares about breaking this part of the API when essentially nobody
> uses this file?
>
There are certainly tools that use /proc/pagetypeinfo, and that is how
the problem was found. I am not against removing it, but we have to be
careful and deprecate it in a way that minimizes user impact.

Cheers,
Longman



Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Qian Cai
On Wed, 2019-10-23 at 10:30 -0400, Waiman Long wrote:
> On 10/22/19 5:59 PM, Andrew Morton wrote:
> > On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
> > 
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates each of the free lists to get the counts. There are
> > > bug reports of hard lockup panics when reading the /proc/pagetypeinfo
> > > file simply because it takes too long to iterate all the free lists
> > > within a zone while holding the zone lock with irqs disabled.
> > > 
> > > Given that /proc/pagetypeinfo is readable by all, the possibility
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Yes.
> > 
> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > > 
> > > We can actually skip iterating the list of one of the migration types
> > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with a large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > > 
> > > ...
> > > 
> > > --- a/mm/vmstat.c
> > > +++ b/mm/vmstat.c
> > > @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
> > >   pg_data_t *pgdat, struct zone *zone)
> > >  {
> > >   int order, mtype;
> > > + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> > 
> > 600+ bytes is a bit much.  I guess it's OK in this situation.
> > 
> 
> This function is called by reading /proc/pagetypeinfo. The call stack is
> rather shallow:
> 
> PID: 58188  TASK: 938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
>  #0 [9483bf488e48] crash_nmi_callback at b8c551d7
>  #1 [9483bf488e58] nmi_handle at b931d8cc
>  #2 [9483bf488eb0] do_nmi at b931dba8
>  #3 [9483bf488ef0] end_repeat_nmi at b931cd69
>     [exception RIP: pagetypeinfo_showfree_print+0x73]
>     RIP: b8db7173  RSP: 938b9fcbfda0  RFLAGS: 0006
>     RAX: f0c9946d7020  RBX: 96073ffd5528  RCX: 
>     RDX: 001c7764  RSI: b9676ab1  RDI: 
>     RBP: 938b9fcbfdd0   R8: 000a   R9: fffe
>     R10:   R11: 938b9fcbfc36  R12: 942b97758240
>     R13: b942f730  R14: 96073ffd5000  R15: 96073ffd5180
>     ORIG_RAX:   CS: 0010  SS: 0018
> --- <NMI exception stack> ---
>  #4 [938b9fcbfda0] pagetypeinfo_showfree_print at b8db7173
>  #5 [938b9fcbfdd8] walk_zones_in_node at b8db74df
>  #6 [938b9fcbfe20] pagetypeinfo_show at b8db7a29
>  #7 [938b9fcbfe48] seq_read at b8e45c3c
>  #8 [938b9fcbfeb8] proc_reg_read at b8e95070
>  #9 [938b9fcbfed8] vfs_read at b8e1f2af
> #10 [938b9fcbff08] sys_read at b8e2017f
> #11 [938b9fcbff50] system_call_fastpath at b932579b
> 
> So we should not be in any risk of overflowing the stack.
> 
> > > - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > > - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > > - pgdat->node_id,
> > > - zone->name,
> > > - migratetype_names[mtype]);
> > > - for (order = 0; order < MAX_ORDER; ++order) {
> > > + lockdep_assert_held(&zone->lock);
> > > + lockdep_assert_irqs_disabled();
> > > +
> > > + /*
> > > +  * MIGRATE_MOVABLE is usually the largest one in large memory
> > > +  * systems. We skip iterating that list. Instead, we compute it by
> > > +  * subtracting the total of the rests from free_area->nr_free.
> > > +  */
> > > + for (order = 0; order < MAX_ORDER; ++order) {
> > > + unsigned long nr_total = 0;
> > > + struct free_area *area = &(zone->free_area[order]);
> > > +
> > > + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Waiman Long
On 10/22/19 5:59 PM, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
>
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates each of the free lists to get the counts. There are
>> bug reports of hard lockup panics when reading the /proc/pagetypeinfo
>> file simply because it takes too long to iterate all the free lists
>> within a zone while holding the zone lock with irqs disabled.
>>
>> Given that /proc/pagetypeinfo is readable by all, the possibility
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Yes.
>
>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
>>
>> We can actually skip iterating the list of one of the migration types
>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long blocking out other zone lock waiters from being
>> run. This can be problematic for systems with a large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
>>
>> ...
>>
>> --- a/mm/vmstat.c
>> +++ b/mm/vmstat.c
>> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>>  pg_data_t *pgdat, struct zone *zone)
>>  {
>>  int order, mtype;
>> +unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
> 600+ bytes is a bit much.  I guess it's OK in this situation.
>
This function is called by reading /proc/pagetypeinfo. The call stack is
rather shallow:

PID: 58188  TASK: 938a4d4f1fa0  CPU: 2   COMMAND: "sosreport"
 #0 [9483bf488e48] crash_nmi_callback at b8c551d7
 #1 [9483bf488e58] nmi_handle at b931d8cc
 #2 [9483bf488eb0] do_nmi at b931dba8
 #3 [9483bf488ef0] end_repeat_nmi at b931cd69
    [exception RIP: pagetypeinfo_showfree_print+0x73]
    RIP: b8db7173  RSP: 938b9fcbfda0  RFLAGS: 0006
    RAX: f0c9946d7020  RBX: 96073ffd5528  RCX: 
    RDX: 001c7764  RSI: b9676ab1  RDI: 
    RBP: 938b9fcbfdd0   R8: 000a   R9: fffe
    R10:   R11: 938b9fcbfc36  R12: 942b97758240
    R13: b942f730  R14: 96073ffd5000  R15: 96073ffd5180
    ORIG_RAX:   CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #4 [938b9fcbfda0] pagetypeinfo_showfree_print at b8db7173
 #5 [938b9fcbfdd8] walk_zones_in_node at b8db74df
 #6 [938b9fcbfe20] pagetypeinfo_show at b8db7a29
 #7 [938b9fcbfe48] seq_read at b8e45c3c
 #8 [938b9fcbfeb8] proc_reg_read at b8e95070
 #9 [938b9fcbfed8] vfs_read at b8e1f2af
#10 [938b9fcbff08] sys_read at b8e2017f
#11 [938b9fcbff50] system_call_fastpath at b932579b

So we should not be in any risk of overflowing the stack.

>> -for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>> -seq_printf(m, "Node %4d, zone %8s, type %12s ",
>> -pgdat->node_id,
>> -zone->name,
>> -migratetype_names[mtype]);
>> -for (order = 0; order < MAX_ORDER; ++order) {
>> +lockdep_assert_held(&zone->lock);
>> +lockdep_assert_irqs_disabled();
>> +
>> +/*
>> + * MIGRATE_MOVABLE is usually the largest one in large memory
>> + * systems. We skip iterating that list. Instead, we compute it by
>> + * subtracting the total of the rests from free_area->nr_free.
>> + */
>> +for (order = 0; order < MAX_ORDER; ++order) {
>> +unsigned long nr_total = 0;
>> +struct free_area *area = &(zone->free_area[order]);
>> +
>> +for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>>  unsigned long freecount = 0;
>> -struct free_area *area;
>>  struct list_head *curr;
>>  
>> -area = &(zone->free_area[order]);
>> -
>> +if (mtype == MIGRATE_MOVABLE)

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Qian Cai



> On Oct 23, 2019, at 5:56 AM, Mel Gorman  wrote:
> 
> Again, the cost is only paid when reading a proc file. From what Andrew
> said, the lock is necessary to safely walk the list. I would
> be ok with limiting the length of the walk but honestly, I would also
> be ok with simply deleting the proc file. The utility for debugging a
> problem with it is limited now (it was more important when fragmentation
> avoidance was first introduced) and there is little an admin can do with
> the information. I can't remember the last time I asked for the contents
> of the file when trying to debug a problem. There is a possibility that
> someone will complain but I'm not aware of any utility that reads the
> information and does something useful with it. In the unlikely event
> something breaks, the file can be re-added with a limited walk.

Agreed. It is better to remove this file altogether in linux-next first
for a while and see if anyone complains loudly.

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Mel Gorman
On Wed, Oct 23, 2019 at 11:04:22AM +0200, Michal Hocko wrote:
> On Wed 23-10-19 09:31:43, Mel Gorman wrote:
> > On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> > > [Cc Mel]
> > > 
> > > On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > > > The pagetypeinfo_showfree_print() function prints out the number of
> > > > free blocks for each of the page orders and migrate types. The current
> > > > code just iterates each of the free lists to get the counts. There are
> > > > bug reports of hard lockup panics when reading the /proc/pagetypeinfo
> > > > file simply because it takes too long to iterate all the free lists
> > > > within a zone while holding the zone lock with irqs disabled.
> > > > 
> > > > Given that /proc/pagetypeinfo is readable by all, the possibility
> > > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > > by any user is a security problem that needs to be addressed.
> > > 
> > > Should we make the file 0400? It is a useful thing when debugging but
> > > not something regular users would really need for life.
> > > 
> > 
> > I think this would be useful in general. The information is not that
> > useful outside of debugging. Even then it's only useful when trying to
> > get a handle on why a path like compaction is taking too long.
> 
> So can we go with this to address the security aspect and have
> something trivial to backport?
> 

Yes.

> > > > There is a free_area structure associated with each page order. There
> > > > is also a nr_free count within the free_area for all the different
> > > > migration types combined. Tracking the number of free list entries
> > > > for each migration type will probably add some overhead to the fast
> > > > paths like moving pages from one migration type to another which may
> > > > not be desirable.
> > > 
> > > Have you tried to measure that overhead?
> > >  
> > 
> > I would prefer this option not be taken. It would increase the cost of
> > watermark calculations which is a relatively fast path.
> 
> Is the change for the wmark check going to require more than
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c0b2e0306720..5d95313ba4a5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
> 	struct free_area *area = &z->free_area[o];
>   int mt;
>  
> - if (!area->nr_free)
> - continue;
> -
>   for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
>   if (!free_area_empty(area, mt))
>   return true;
> 
> Is this really going to be visible in practice? Sure we are going to do
> more checks but most orders tend to have at least some memory in a
> reasonably balanced system and we can hardly expect an optimal
> allocation path on those that are not.
>  

You also have to iterate over them all later in the same function. If the
free counts are per migrate type then they would have to be iterated over
every time.

Similarly, there would be multiple places where all the counters would
have to be iterated -- find_suitable_fallback, show_free_areas,
fast_isolate_freepages, fill_contig_page_info, zone_init_free_lists etc.

It'd be a small cost but given that it's aimed at fixing a problem with
reading pagetypeinfo, is it really worth it? I don't think so.

> > > > We can actually skip iterating the list of one of the migration types
> > > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > > is usually the largest one on large memory systems, this is the one
> > > > to be skipped. Since the printing order is migration-type => order, we
> > > > will have to store the counts in an internal 2D array before printing
> > > > them out.
> > > > 
> > > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > > zone lock for too long blocking out other zone lock waiters from being
> > > > run. This can be problematic for systems with a large amount of memory.
> > > > So a check is added to temporarily release the lock and reschedule if
> > > > more than 64k of list entries have been iterated for each order. With
> > > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > > entries before releasing the lock.
> > > 
> > > But you are still iterating through the whole free_list at once so if it
> > > gets really large then this is still possible. I think it would be
> > > preferable to use per migratetype nr_free if it doesn't cause any
> > > regressions.
> > > 
> > 
> > I think it will. The patch as it is contains the overhead within the
> > reader of the pagetypeinfo proc file which is a non-critical path. The
> > page allocator paths, on the other hand, are very important.
> 
> As pointed out in another email, the problem with this patch is that it
> hasn't really removed the iteration over the whole free_list, which is
> the primary problem.

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Wed 23-10-19 09:31:43, Mel Gorman wrote:
> On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> > [Cc Mel]
> > 
> > On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > > The pagetypeinfo_showfree_print() function prints out the number of
> > > free blocks for each of the page orders and migrate types. The current
> > > code just iterates each of the free lists to get the counts. There are
> > > bug reports of hard lockup panics when reading the /proc/pagetypeinfo
> > > file simply because it takes too long to iterate all the free lists
> > > within a zone while holding the zone lock with irqs disabled.
> > > 
> > > Given that /proc/pagetypeinfo is readable by all, the possibility
> > > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > > by any user is a security problem that needs to be addressed.
> > 
> > Should we make the file 0400? It is a useful thing when debugging but
> > not something regular users would really need for life.
> > 
> 
> I think this would be useful in general. The information is not that
> useful outside of debugging. Even then it's only useful when trying to
> get a handle on why a path like compaction is taking too long.

So can we go with this to address the security aspect and have
something trivial to backport?

> > > There is a free_area structure associated with each page order. There
> > > is also a nr_free count within the free_area for all the different
> > > migration types combined. Tracking the number of free list entries
> > > for each migration type will probably add some overhead to the fast
> > > paths like moving pages from one migration type to another which may
> > > not be desirable.
> > 
> > Have you tried to measure that overhead?
> >  
> 
> I would prefer this option not be taken. It would increase the cost of
> watermark calculations which is a relatively fast path.

Is the change for the wmark check going to require more than
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c0b2e0306720..5d95313ba4a5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3448,9 +3448,6 @@ bool __zone_watermark_ok(struct zone *z, unsigned int order, unsigned long mark,
	struct free_area *area = &z->free_area[o];
int mt;
 
-   if (!area->nr_free)
-   continue;
-
for (mt = 0; mt < MIGRATE_PCPTYPES; mt++) {
if (!free_area_empty(area, mt))
return true;

Is this really going to be visible in practice? Sure we are going to do
more checks but most orders tend to have at least some memory in a
reasonably balanced system and we can hardly expect an optimal
allocation path on those that are not.
 
> > > We can actually skip iterating the list of one of the migration types
> > > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > > is usually the largest one on large memory systems, this is the one
> > > to be skipped. Since the printing order is migration-type => order, we
> > > will have to store the counts in an internal 2D array before printing
> > > them out.
> > > 
> > > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > > zone lock for too long blocking out other zone lock waiters from being
> > > run. This can be problematic for systems with a large amount of memory.
> > > So a check is added to temporarily release the lock and reschedule if
> > > more than 64k of list entries have been iterated for each order. With
> > > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > > entries before releasing the lock.
> > 
> > But you are still iterating through the whole free_list at once so if it
> > gets really large then this is still possible. I think it would be
> > preferable to use per migratetype nr_free if it doesn't cause any
> > regressions.
> > 
> 
> I think it will. The patch as it is contains the overhead within the
> reader of the pagetypeinfo proc file which is a non-critical path. The
> page allocator paths, on the other hand, are very important.

As pointed out in another email, the problem with this patch is that it
hasn't really removed the iteration over the whole free_list, which is
the primary problem. So I think that we should either consider this a
non-issue and make it "admin knows this is potentially expensive" or do
something like Andrew was suggesting if we do not want to change the
nr_free accounting.

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..83c0295ecddc 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1386,8 +1386,16 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
 
area = &(zone->free_area[order]);
 
-   list_for_each(curr, &area->free_list[mtype])
+   list_for_each(curr, &area->free_list[mtype]) {
freecount++;
+   if (freecount > BIG_NUMBER) {
+ 

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Mel Gorman
On Tue, Oct 22, 2019 at 06:57:45PM +0200, Michal Hocko wrote:
> [Cc Mel]
> 
> On Tue 22-10-19 12:21:56, Waiman Long wrote:
> > The pagetypeinfo_showfree_print() function prints out the number of
> > free blocks for each of the page orders and migrate types. The current
> > code just iterates each of the free lists to get the counts. There are
> > bug reports of hard lockup panics when reading the /proc/pagetypeinfo
> > file simply because it takes too long to iterate all the free lists
> > within a zone while holding the zone lock with irqs disabled.
> > 
> > Given that /proc/pagetypeinfo is readable by all, the possibility
> > of crashing a system by the simple act of reading /proc/pagetypeinfo
> > by any user is a security problem that needs to be addressed.
> 
> Should we make the file 0400? It is a useful thing when debugging but
> not something regular users would really need for life.
> 

I think this would be useful in general. The information is not that
useful outside of debugging. Even then it's only useful when trying to
get a handle on why a path like compaction is taking too long.

> > There is a free_area structure associated with each page order. There
> > is also a nr_free count within the free_area for all the different
> > migration types combined. Tracking the number of free list entries
> > for each migration type will probably add some overhead to the fast
> > paths like moving pages from one migration type to another which may
> > not be desirable.
> 
> Have you tried to measure that overhead?
>  

I would prefer this option not be taken. It would increase the cost of
watermark calculations which is a relatively fast path.

> > We can actually skip iterating the list of one of the migration types
> > and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> > is usually the largest one on large memory systems, this is the one
> > to be skipped. Since the printing order is migration-type => order, we
> > will have to store the counts in an internal 2D array before printing
> > them out.
> > 
> > Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> > zone lock for too long blocking out other zone lock waiters from being
> > run. This can be problematic for systems with a large amount of memory.
> > So a check is added to temporarily release the lock and reschedule if
> > more than 64k of list entries have been iterated for each order. With
> > a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> > entries before releasing the lock.
> 
> But you are still iterating through the whole free_list at once so if it
> gets really large then this is still possible. I think it would be
> preferable to use per migratetype nr_free if it doesn't cause any
> regressions.
> 

I think it will. The patch as it is contains the overhead within the
reader of the pagetypeinfo proc file which is a non-critical path. The
page allocator paths, on the other hand, are very important.

-- 
Mel Gorman
SUSE Labs


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-23 Thread Michal Hocko
On Tue 22-10-19 14:59:02, Andrew Morton wrote:
> On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:
[...]
> > -   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > -   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > -   pgdat->node_id,
> > -   zone->name,
> > -   migratetype_names[mtype]);
> > -   for (order = 0; order < MAX_ORDER; ++order) {
> > +   lockdep_assert_held(&zone->lock);
> > +   lockdep_assert_irqs_disabled();
> > +
> > +   /*
> > +* MIGRATE_MOVABLE is usually the largest one in large memory
> > +* systems. We skip iterating that list. Instead, we compute it by
> > +* subtracting the total of the rests from free_area->nr_free.
> > +*/
> > +   for (order = 0; order < MAX_ORDER; ++order) {
> > +   unsigned long nr_total = 0;
> > +   struct free_area *area = &(zone->free_area[order]);
> > +
> > +   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > unsigned long freecount = 0;
> > -   struct free_area *area;
> > struct list_head *curr;
> >  
> > -   area = &(zone->free_area[order]);
> > -
> > +   if (mtype == MIGRATE_MOVABLE)
> > +   continue;
> > list_for_each(curr, &area->free_list[mtype])
> > freecount++;
> > -   seq_printf(m, "%6lu ", freecount);
> > +   nfree[order][mtype] = freecount;
> > +   nr_total += freecount;
> > }
> > +   nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> > +
> > +   /*
> > +* If we have already iterated more than 64k of list
> > +* entries, we might have held the zone lock for too long.
> > +* Temporarily release the lock and reschedule before
> > +* continuing so that other lock waiters have a chance
> > +* to run.
> > +*/
> > +   if (nr_total > (1 << 16)) {
> > +   spin_unlock_irq(&zone->lock);
> > +   cond_resched();
> > +   spin_lock_irq(&zone->lock);
> > +   }
> > +   }
> > +
> > +   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> > +   seq_printf(m, "Node %4d, zone %8s, type %12s ",
> > +   pgdat->node_id,
> > +   zone->name,
> > +   migratetype_names[mtype]);
> > +   for (order = 0; order < MAX_ORDER; ++order)
> > +   seq_printf(m, "%6lu ", nfree[order][mtype]);
> > seq_putc(m, '\n');
> 
> This is not exactly a thing of beauty :( Presumably there might still
> be situations where the irq-off times remain excessive.

Yes. It is the list_for_each over the free_list that needs the lock, and
that is the actual problem here. The free_list can be really large with a
_lot_ of memory. And this is why I objected to the patch: it doesn't
really address the problem. I would like to hear from Mel and Vlastimil
how they would feel about making the free_list fully migrate-type aware
(including nr_free).

> Why are we actually holding zone->lock so much?  Can we get away with
> holding it across the list_for_each() loop and nothing else?  If so,
> this still isn't a bulletproof fix.  Maybe just terminate the list
> walk if freecount reaches 1024.  Would anyone really care?
> 
> Sigh.  I wonder if anyone really uses this thing for anything
> important.  Can we just remove it all?

Vlastimil would know much better, but I have seen this being used for
fragmentation-related debugging. That should imply that 0400 would be
sufficient, and it is a quick and easily backportable fix for the most
pressing immediate problem.
-- 
Michal Hocko
SUSE Labs


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread David Rientjes
On Tue, 22 Oct 2019, Waiman Long wrote:

> >>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> >>> is usually the largest one on large memory systems, this is the one
> >>> to be skipped. Since the printing order is migration-type => order, we
> >>> will have to store the counts in an internal 2D array before printing
> >>> them out.
> >>>
> >>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> >>> zone lock for too long blocking out other zone lock waiters from being
> >>> run. This can be problematic for systems with a large amount of memory.
> >>> So a check is added to temporarily release the lock and reschedule if
> >>> more than 64k of list entries have been iterated for each order. With
> >>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> >>> entries before releasing the lock.
> >> But you are still iterating through the whole free_list at once so if it
> >> gets really large then this is still possible. I think it would be
> >> preferable to use per migratetype nr_free if it doesn't cause any
> >> regressions.
> >>
> > Yes, it is still theoretically possible. I will take a further look at
> > having per-migrate type nr_free. BTW, there is one more place where the
> > free lists are being iterated with zone lock held - mark_free_pages().
> 
> Looking deeper into the code, the exact migration type is not stored in
> the page itself. An initial movable page can be stolen to be put into
> another migration type. So in a delete or move from free_area, we don't
> know exactly what migration type the page is coming from. IOW, it is
> hard to get accurate counts of the number of entries in each list.
> 

I think the suggestion is to maintain a nr_free count of the free_list for 
each order for each migratetype so anytime a page is added or deleted from 
the list, the nr_free is adjusted.  Then the free_area's nr_free becomes 
the sum of its migratetype's nr_free at that order.  That's possible to do 
if you track the migratetype per page, as you said, or like pcp pages 
track it as part of page->index.  It's a trade-off on whether you want to 
impact the performance of maintaining these new nr_frees anytime you 
manipulate the freelists.

I think Vlastimil and I discussed per order per migratetype nr_frees in 
the past and it could be a worthwhile improvement for other reasons, 
specifically it leads to heuristics that can be used to determine how 
fragmented a certain migratetype is for a zone, i.e. a very quick way to
determine what ratio of pages over all MIGRATE_UNMOVABLE pageblocks are 
free.

Or maybe there are other reasons why these nr_frees can't be maintained 
anymore?  (I had a patch to do it on 4.3.)

You may also find that MIGRATE_MOVABLE is not actually the longest
free_list compared to other migratetypes on a severely fragmented
system, so special-casing MIGRATE_MOVABLE might not be the best way
forward.


Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Andrew Morton
On Tue, 22 Oct 2019 12:21:56 -0400 Waiman Long  wrote:

> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts.  There are
> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> file just because it takes too long to iterate all the free lists within
> a zone while holding the zone lock with irqs disabled.
> 
> Given that /proc/pagetypeinfo is readable by all, the possibility
> of crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Yes.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths like moving pages from one migration type to another which may
> not be desirable.
> 
> We can actually skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
> 
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long, blocking other zone lock waiters from being
> run. This can be problematic for systems with a large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.
> 
> ...
>
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>   pg_data_t *pgdat, struct zone *zone)
>  {
>   int order, mtype;
> + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];

600+ bytes is a bit much.  I guess it's OK in this situation.

> - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> - pgdat->node_id,
> - zone->name,
> - migratetype_names[mtype]);
> - for (order = 0; order < MAX_ORDER; ++order) {
> + lockdep_assert_held(&zone->lock);
> + lockdep_assert_irqs_disabled();
> +
> + /*
> +  * MIGRATE_MOVABLE is usually the largest one in large memory
> +  * systems. We skip iterating that list. Instead, we compute it by
> +  * subtracting the total of the rests from free_area->nr_free.
> +  */
> + for (order = 0; order < MAX_ORDER; ++order) {
> + unsigned long nr_total = 0;
> + struct free_area *area = &(zone->free_area[order]);
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>   unsigned long freecount = 0;
> - struct free_area *area;
>   struct list_head *curr;
>  
> - area = &(zone->free_area[order]);
> -
> + if (mtype == MIGRATE_MOVABLE)
> + continue;
> list_for_each(curr, &area->free_list[mtype])
>   freecount++;
> - seq_printf(m, "%6lu ", freecount);
> + nfree[order][mtype] = freecount;
> + nr_total += freecount;
>   }
> + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> + /*
> +  * If we have already iterated more than 64k of list
> +  * entries, we might have held the zone lock for too long.
> +  * Temporarily release the lock and reschedule before
> +  * continuing so that other lock waiters have a chance
> +  * to run.
> +  */
> + if (nr_total > (1 << 16)) {
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }
> + }
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> + seq_printf(m, "Node %4d, zone %8s, type %12s ",
> + pgdat->node_id,
> + zone->name,
> + migratetype_names[mtype]);
> + for (order = 0; order < MAX_ORDER; ++order)
> + seq_printf(m, "%6lu ", nfree[order][mtype]);
>   seq_putc(m, '\n');

This is not exactly a thing of beauty :( Presumably there might still
be situations where the irq-off times remain excessive.

Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Waiman Long
On 10/22/19 2:00 PM, Waiman Long wrote:
> On 10/22/19 12:57 PM, Michal Hocko wrote:
>
>>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>>> is usually the largest one on large memory systems, this is the one
>>> to be skipped. Since the printing order is migration-type => order, we
>>> will have to store the counts in an internal 2D array before printing
>>> them out.
>>>
>>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>>> zone lock for too long blocking out other zone lock waiters from being
>>> run. This can be problematic for systems with a large amount of memory.
>>> So a check is added to temporarily release the lock and reschedule if
>>> more than 64k of list entries have been iterated for each order. With
>>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>>> entries before releasing the lock.
>> But you are still iterating through the whole free_list at once so if it
>> gets really large then this is still possible. I think it would be
>> preferable to use per migratetype nr_free if it doesn't cause any
>> regressions.
>>
> Yes, it is still theoretically possible. I will take a further look at
> having per-migrate type nr_free. BTW, there is one more place where the
> free lists are being iterated with zone lock held - mark_free_pages().

Looking deeper into the code, the exact migration type is not stored in
the page itself. An initial movable page can be stolen to be put into
another migration type. So in a delete or move from free_area, we don't
know exactly what migration type the page is coming from. IOW, it is
hard to get accurate counts of the number of entries in each list.

I am not saying this is impossible, but doing it may require stealing
some bits from the page structure to store this information which is
probably not worth the benefit we can get from it. So if you have any
good suggestion of how to do it without too much cost, please let me
know about it. Otherwise, I will probably stay with the current patch.

Cheers,
Longman




Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Waiman Long
On 10/22/19 12:57 PM, Michal Hocko wrote:
> [Cc Mel]
>
> On Tue 22-10-19 12:21:56, Waiman Long wrote:
>> The pagetypeinfo_showfree_print() function prints out the number of
>> free blocks for each of the page orders and migrate types. The current
>> code just iterates each of the free lists to get the counts.  There are
>> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
>> file just because it takes too long to iterate all the free lists within
>> a zone while holding the zone lock with irqs disabled.
>>
>> Given that /proc/pagetypeinfo is readable by all, the possibility
>> of crashing a system by the simple act of reading /proc/pagetypeinfo
>> by any user is a security problem that needs to be addressed.
> Should we make the file 0400? It is a useful thing when debugging but
> not something regular users would really need for life.
>
I am not against doing that, but it may break existing applications that
somehow need to read pagetypeinfo. That is why I didn't try to advocate
changing the protection.


>> There is a free_area structure associated with each page order. There
>> is also a nr_free count within the free_area for all the different
>> migration types combined. Tracking the number of free list entries
>> for each migration type will probably add some overhead to the fast
>> paths like moving pages from one migration type to another which may
>> not be desirable.
> Have you tried to measure that overhead?

I haven't tried to measure the performance impact yet. I did think
about tracking nr_free for each of the migration types within a
free_area. That will require auditing the code to make sure that all the
intra-free_area migrations are properly accounted for. I can work on it
if people prefer this alternative.


>  
>> We can actually skip iterating the list of one of the migration types
>> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
>> is usually the largest one on large memory systems, this is the one
>> to be skipped. Since the printing order is migration-type => order, we
>> will have to store the counts in an internal 2D array before printing
>> them out.
>>
>> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
>> zone lock for too long, blocking other zone lock waiters from being
>> run. This can be problematic for systems with a large amount of memory.
>> So a check is added to temporarily release the lock and reschedule if
>> more than 64k of list entries have been iterated for each order. With
>> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
>> entries before releasing the lock.
> But you are still iterating through the whole free_list at once so if it
> gets really large then this is still possible. I think it would be
> preferable to use per migratetype nr_free if it doesn't cause any
> regressions.
>
Yes, it is still theoretically possible. I will take a further look at
having per-migrate type nr_free. BTW, there is one more place where the
free lists are being iterated with zone lock held - mark_free_pages().

Cheers,
Longman



Re: [PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Michal Hocko
[Cc Mel]

On Tue 22-10-19 12:21:56, Waiman Long wrote:
> The pagetypeinfo_showfree_print() function prints out the number of
> free blocks for each of the page orders and migrate types. The current
> code just iterates each of the free lists to get the counts.  There are
> bug reports about hard lockup panics when reading the /proc/pagetypeinfo
> file just because it takes too long to iterate all the free lists within
> a zone while holding the zone lock with irqs disabled.
> 
> Given that /proc/pagetypeinfo is readable by all, the possibility
> of crashing a system by the simple act of reading /proc/pagetypeinfo
> by any user is a security problem that needs to be addressed.

Should we make the file 0400? It is a useful thing when debugging but
not something regular users would really need for life.

> There is a free_area structure associated with each page order. There
> is also a nr_free count within the free_area for all the different
> migration types combined. Tracking the number of free list entries
> for each migration type will probably add some overhead to the fast
> paths like moving pages from one migration type to another which may
> not be desirable.

Have you tried to measure that overhead?
 
> We can actually skip iterating the list of one of the migration types
> and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
> is usually the largest one on large memory systems, this is the one
> to be skipped. Since the printing order is migration-type => order, we
> will have to store the counts in an internal 2D array before printing
> them out.
> 
> Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
> zone lock for too long, blocking other zone lock waiters from being
> run. This can be problematic for systems with a large amount of memory.
> So a check is added to temporarily release the lock and reschedule if
> more than 64k of list entries have been iterated for each order. With
> a MAX_ORDER of 11, the worst case will be iterating about 700k of list
> entries before releasing the lock.

But you are still iterating through the whole free_list at once so if it
gets really large then this is still possible. I think it would be
preferable to use per migratetype nr_free if it doesn't cause any
regressions.

> Signed-off-by: Waiman Long 
> ---
>  mm/vmstat.c | 51 +--
>  1 file changed, 41 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/vmstat.c b/mm/vmstat.c
> index 6afc892a148a..40c9a1494709 100644
> --- a/mm/vmstat.c
> +++ b/mm/vmstat.c
> @@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
>   pg_data_t *pgdat, struct zone *zone)
>  {
>   int order, mtype;
> + unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
>  
> - for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
> - seq_printf(m, "Node %4d, zone %8s, type %12s ",
> - pgdat->node_id,
> - zone->name,
> - migratetype_names[mtype]);
> - for (order = 0; order < MAX_ORDER; ++order) {
> + lockdep_assert_held(&zone->lock);
> + lockdep_assert_irqs_disabled();
> +
> + /*
> +  * MIGRATE_MOVABLE is usually the largest one in large memory
> +  * systems. We skip iterating that list. Instead, we compute it by
> +  * subtracting the total of the rests from free_area->nr_free.
> +  */
> + for (order = 0; order < MAX_ORDER; ++order) {
> + unsigned long nr_total = 0;
> + struct free_area *area = &(zone->free_area[order]);
> +
> + for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
>   unsigned long freecount = 0;
> - struct free_area *area;
>   struct list_head *curr;
>  
> - area = &(zone->free_area[order]);
> -
> + if (mtype == MIGRATE_MOVABLE)
> + continue;
> list_for_each(curr, &area->free_list[mtype])
>   freecount++;
> - seq_printf(m, "%6lu ", freecount);
> + nfree[order][mtype] = freecount;
> + nr_total += freecount;
>   }
> + nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
> +
> + /*
> +  * If we have already iterated more than 64k of list
> +  * entries, we might have held the zone lock for too long.
> +  * Temporarily release the lock and reschedule before
> +  * continuing so that other lock waiters have a chance
> +  * to run.
> +  */
> + if (nr_total > (1 << 16)) {
> + spin_unlock_irq(&zone->lock);
> + cond_resched();
> + spin_lock_irq(&zone->lock);
> + }
> + }
> +
> + 

[PATCH] mm/vmstat: Reduce zone lock hold time when reading /proc/pagetypeinfo

2019-10-22 Thread Waiman Long
The pagetypeinfo_showfree_print() function prints out the number of
free blocks for each of the page orders and migrate types. The current
code just iterates each of the free lists to get the counts.  There are
bug reports about hard lockup panics when reading the /proc/pagetypeinfo
file just because it takes too long to iterate all the free lists within
a zone while holding the zone lock with irqs disabled.

Given that /proc/pagetypeinfo is readable by all, the possibility
of crashing a system by the simple act of reading /proc/pagetypeinfo
by any user is a security problem that needs to be addressed.

There is a free_area structure associated with each page order. There
is also a nr_free count within the free_area for all the different
migration types combined. Tracking the number of free list entries
for each migration type will probably add some overhead to the fast
paths like moving pages from one migration type to another which may
not be desirable.

We can actually skip iterating the list of one of the migration types
and use nr_free to compute the missing count. Since MIGRATE_MOVABLE
is usually the largest one on large memory systems, this is the one
to be skipped. Since the printing order is migration-type => order, we
will have to store the counts in an internal 2D array before printing
them out.

Even by skipping the MIGRATE_MOVABLE pages, we may still be holding the
zone lock for too long, blocking other zone lock waiters from being
run. This can be problematic for systems with a large amount of memory.
So a check is added to temporarily release the lock and reschedule if
more than 64k of list entries have been iterated for each order. With
a MAX_ORDER of 11, the worst case will be iterating about 700k of list
entries before releasing the lock.
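The 700k figure is just the per-order threshold multiplied by the number of orders:

```latex
\text{worst case} \approx \mathrm{MAX\_ORDER} \times 2^{16}
  = 11 \times 65536 = 720896 \approx 7 \times 10^{5} \text{ list entries}
```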

Signed-off-by: Waiman Long 
---
 mm/vmstat.c | 51 +--
 1 file changed, 41 insertions(+), 10 deletions(-)

diff --git a/mm/vmstat.c b/mm/vmstat.c
index 6afc892a148a..40c9a1494709 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1373,23 +1373,54 @@ static void pagetypeinfo_showfree_print(struct seq_file *m,
pg_data_t *pgdat, struct zone *zone)
 {
int order, mtype;
+   unsigned long nfree[MAX_ORDER][MIGRATE_TYPES];
 
-   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
-   seq_printf(m, "Node %4d, zone %8s, type %12s ",
-   pgdat->node_id,
-   zone->name,
-   migratetype_names[mtype]);
-   for (order = 0; order < MAX_ORDER; ++order) {
+   lockdep_assert_held(&zone->lock);
+   lockdep_assert_irqs_disabled();
+
+   /*
+* MIGRATE_MOVABLE is usually the largest one in large memory
+* systems. We skip iterating that list. Instead, we compute it by
+* subtracting the total of the rests from free_area->nr_free.
+*/
+   for (order = 0; order < MAX_ORDER; ++order) {
+   unsigned long nr_total = 0;
+   struct free_area *area = &(zone->free_area[order]);
+
+   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
unsigned long freecount = 0;
-   struct free_area *area;
struct list_head *curr;
 
-   area = &(zone->free_area[order]);
-
+   if (mtype == MIGRATE_MOVABLE)
+   continue;
list_for_each(curr, &area->free_list[mtype])
freecount++;
-   seq_printf(m, "%6lu ", freecount);
+   nfree[order][mtype] = freecount;
+   nr_total += freecount;
}
+   nfree[order][MIGRATE_MOVABLE] = area->nr_free - nr_total;
+
+   /*
+* If we have already iterated more than 64k of list
+* entries, we might have held the zone lock for too long.
+* Temporarily release the lock and reschedule before
+* continuing so that other lock waiters have a chance
+* to run.
+*/
+   if (nr_total > (1 << 16)) {
+   spin_unlock_irq(&zone->lock);
+   cond_resched();
+   spin_lock_irq(&zone->lock);
+   }
+   }
+
+   for (mtype = 0; mtype < MIGRATE_TYPES; mtype++) {
+   seq_printf(m, "Node %4d, zone %8s, type %12s ",
+   pgdat->node_id,
+   zone->name,
+   migratetype_names[mtype]);
+   for (order = 0; order < MAX_ORDER; ++order)
+   seq_printf(m, "%6lu ", nfree[order][mtype]);
seq_putc(m, '\n');
}
 }
-- 
2.18.1