Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-06-02 Thread Michal Hocko
On Wed 01-06-16 23:38:30, Oleg Nesterov wrote:
> On 06/01, Michal Hocko wrote:
> >
> > On Wed 01-06-16 01:56:26, Oleg Nesterov wrote:
> > > On 05/31, Michal Hocko wrote:
> > > >
> > > > On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> > > > >
> > > > > This single change in get_scan_count() under for_each_evictable_lru() loop
> > > > >
> > > > >   -   size = lruvec_lru_size(lruvec, lru);
> > > > >   +   size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> > > > >
> > > > > fixes the problem too.
> > > > >
> > > > > Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> > > > > while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> > > > > we do not even try to scan it, lruvec_lru_size() returns zero.
> > > >
> > > > OK, you seem to be really seeing a different issue than me.
> > >
> > > quite possibly, but
> > >
> > > > My debugging
> > > > patch was showing when nothing was really isolated from the LRU lists
> > > > (both for shrink_{in}active_list).
> > >
> > > in my debugging session too. LRU_ACTIVE_FILE was empty, so there is nothing to
> > > isolate even if shrink_active_list() is (wrongly called) with nr_to_scan != 0.
> > > LRU_INACTIVE_FILE is not empty but it is not scanned because nr_to_scan == 0.
> > >
> > > But I am afraid I misunderstood you, and you meant something else.
> >
> > What I wanted to say is that my debugging hasn't shown a single case
> > when nothing would be isolated. Which seems to be the case for you.
> 
> Ah, got it, thanks. Yes, I see that there is no "nothing scanned" in
> oom-test.qcow_serial.log.gz from http://marc.info/?l=linux-kernel&m=146417822608902
> you sent. I applied this patch and I do see "nothing scanned".
> 
> But, unlike you, I do not see the messages from free-pages... perhaps you
> have more active tasks. To remind, I tested this with the single user-space
> process, /bin/sh running with pid==1, then I did "while true; do ./oom; done".

Well, I was booting into a standard init which will have a couple of
processes. So yes this would make a slight difference.
 
> So of course I do not know if you see another issue or the same, but now I am
> wondering if the change in get_scan_count() above fixes the problem for you.

I have played with it but the interfering freed pages just ruined the
whole zone_reclaimable() expectations.
 
> Probably not, but the fact you do not see "nothing scanned" can't prove this,
> it is possible that shrink_*_list() was not called because vm_stat == 0 but
> zone_reclaimable() sees the per-cpu counter. In this case 0db2cb8da89d can
> make a difference, but see below.
> 
> > > > But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
> > > > vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
> > > > Does that help as well?
> > >
> > > I'll test this tomorrow,
> 
> So it doesn't help.

OK, so we at least know this is not a regression.

> > > but even if it helps I am not sure... Yes, this
> > > way zone_reclaimable() and get_scan_count() will see the same numbers, but
> > > how this can help to make zone_reclaimable() == F at the end?
> >
> > It won't in some cases.
> 
> And unless I am totally confused I hit exactly this case.
> 
> > And that has been the case for ages so I do not
> > think we need any steps for the stable.
> 
> OK, agreed.
> 
> > What I meant to address is a
> > potential regression caused by 0db2cb8da89d which would make this more
> > likely because of the mismatch
> 
> Again, I can be easily wrong, but I do not see how 0db2cb8da89d could make
> things worse...
> 
> Unless both get_scan_count() and zone_reclaimable() use "snapshot" variant,
> we can't guarantee zone_reclaimable() becomes false. The fact that they see
> different numbers (after 0db2cb8da89d) doesn't really matter.
> 
> Anyway, this was already fixed, so let's forget it ;)

Yes, especially as this doesn't seem to be a regression.

Thanks for your effort anyway.
-- 
Michal Hocko
SUSE Labs


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-06-01 Thread Oleg Nesterov
On 06/01, Michal Hocko wrote:
>
> On Wed 01-06-16 01:56:26, Oleg Nesterov wrote:
> > On 05/31, Michal Hocko wrote:
> > >
> > > On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> > > >
> > > > This single change in get_scan_count() under for_each_evictable_lru() loop
> > > >
> > > > -   size = lruvec_lru_size(lruvec, lru);
> > > > +   size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> > > >
> > > > fixes the problem too.
> > > >
> > > > Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> > > > while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> > > > we do not even try to scan it, lruvec_lru_size() returns zero.
> > >
> > > OK, you seem to be really seeing a different issue than me.
> >
> > quite possibly, but
> >
> > > My debugging
> > > patch was showing when nothing was really isolated from the LRU lists
> > > (both for shrink_{in}active_list).
> >
> > in my debugging session too. LRU_ACTIVE_FILE was empty, so there is nothing to
> > isolate even if shrink_active_list() is (wrongly called) with nr_to_scan != 0.
> > LRU_INACTIVE_FILE is not empty but it is not scanned because nr_to_scan == 0.
> >
> > But I am afraid I misunderstood you, and you meant something else.
>
> What I wanted to say is that my debugging hasn't shown a single case
> when nothing would be isolated. Which seems to be the case for you.

Ah, got it, thanks. Yes, I see that there is no "nothing scanned" in
oom-test.qcow_serial.log.gz from http://marc.info/?l=linux-kernel&m=146417822608902
you sent. I applied this patch and I do see "nothing scanned".

But, unlike you, I do not see the messages from free-pages... perhaps you
have more active tasks. To remind, I tested this with the single user-space
process, /bin/sh running with pid==1, then I did "while true; do ./oom; done".

So of course I do not know if you see another issue or the same, but now I am
wondering if the change in get_scan_count() above fixes the problem for you.

Probably not, but the fact you do not see "nothing scanned" can't prove this,
it is possible that shrink_*_list() was not called because vm_stat == 0 but
zone_reclaimable() sees the per-cpu counter. In this case 0db2cb8da89d can
make a difference, but see below.

> > > But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
> > > vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
> > > Does that help as well?
> >
> > I'll test this tomorrow,

So it doesn't help.

> > but even if it helps I am not sure... Yes, this
> > way zone_reclaimable() and get_scan_count() will see the same numbers, but
> > how this can help to make zone_reclaimable() == F at the end?
>
> It won't in some cases.

And unless I am totally confused I hit exactly this case.

> And that has been the case for ages so I do not
> think we need any steps for the stable.

OK, agreed.

> What I meant to address is a
> potential regression caused by 0db2cb8da89d which would make this more
> likely because of the mismatch

Again, I can be easily wrong, but I do not see how 0db2cb8da89d could make
things worse...

Unless both get_scan_count() and zone_reclaimable() use "snapshot" variant,
we can't guarantee zone_reclaimable() becomes false. The fact that they see
different numbers (after 0db2cb8da89d) doesn't really matter.

Anyway, this was already fixed, so let's forget it ;)

Oleg.


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-06-01 Thread Michal Hocko
On Wed 01-06-16 01:56:26, Oleg Nesterov wrote:
> On 05/31, Michal Hocko wrote:
> >
> > On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> > >
> > > This single change in get_scan_count() under for_each_evictable_lru() loop
> > >
> > >   -   size = lruvec_lru_size(lruvec, lru);
> > >   +   size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> > >
> > > fixes the problem too.
> > >
> > > Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> > > while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> > > we do not even try to scan it, lruvec_lru_size() returns zero.
> >
> > OK, you seem to be really seeing a different issue than me.
> 
> quite possibly, but
> 
> > My debugging
> > patch was showing when nothing was really isolated from the LRU lists
> > (both for shrink_{in}active_list).
> 
> in my debugging session too. LRU_ACTIVE_FILE was empty, so there is nothing to
> isolate even if shrink_active_list() is (wrongly called) with nr_to_scan != 0.
> LRU_INACTIVE_FILE is not empty but it is not scanned because nr_to_scan == 0.
> 
> But I am afraid I misunderstood you, and you meant something else.

What I wanted to say is that my debugging hasn't shown a single case
when nothing would be isolated. Which seems to be the case for you.

[...]

> > But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
> > vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
> > Does that help as well?
> 
> I'll test this tomorrow, but even if it helps I am not sure... Yes, this
> way zone_reclaimable() and get_scan_count() will see the same numbers, but
> how this can help to make zone_reclaimable() == F at the end?

It won't in some cases. And that has been the case for ages so I do not
think we need any steps for the stable. What I meant to address is a
potential regression caused by 0db2cb8da89d, which would make this more
likely because of the mismatch; the patch really makes much more sense
for the oom detection rework, which really wants more precise numbers.
If the revert doesn't help then I would just leave it as it is and just
note that zone_reclaimable() was a bad decision which fortunately didn't
blow up that often...

Thanks
-- 
Michal Hocko
SUSE Labs


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-31 Thread Oleg Nesterov
On 05/31, Michal Hocko wrote:
>
> On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> >
> > This single change in get_scan_count() under for_each_evictable_lru() loop
> >
> > -   size = lruvec_lru_size(lruvec, lru);
> > +   size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> >
> > fixes the problem too.
> >
> > Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> > while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> > we do not even try to scan it, lruvec_lru_size() returns zero.
>
> OK, you seem to be really seeing a different issue than me.

quite possibly, but

> My debugging
> patch was showing when nothing was really isolated from the LRU lists
> (both for shrink_{in}active_list).

in my debugging session too. LRU_ACTIVE_FILE was empty, so there is nothing to
isolate even if shrink_active_list() is (wrongly called) with nr_to_scan != 0.
LRU_INACTIVE_FILE is not empty but it is not scanned because nr_to_scan == 0.

But I am afraid I misunderstood you, and you meant something else.

> > Then later we recheck zone_reclaimable() and it notices the INACTIVE_FILE
> > counter because it uses the _snapshot variant, this leads to livelock.
> >
> > I guess this doesn't really matter, but in my particular case these
> > ACTIVE/INACTIVE counters were screwed by the recent putback_inactive_pages()
> > logic. The pages we "leak" in INACTIVE list were recently moved from ACTIVE
> > to INACTIVE list, and this updated only the per-cpu ->vm_stat_diff[] counters,
> > so the "non snapshot" lruvec_lru_size() in get_scan_count() sees the "old"
> > numbers.
>
> Hmm. I am not really sure we can use the _snapshot version in lruvec_lru_size.

Yes, yes, I understand,

> But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
> vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
> Does that help as well?

I'll test this tomorrow, but even if it helps I am not sure... Yes, this
way zone_reclaimable() and get_scan_count() will see the same numbers, but
how this can help to make zone_reclaimable() == F at the end?

Again, suppose that (say) ACTIVE list is empty but zone->vm_stat != 0
because there is something in per-cpu counter (so that _snapshot == 0).
This means that we still continue to try to scan this list for no reason.

But Michal, let me repeat that I do not understand this code, so I can
be easily wrong.

Oleg.


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-31 Thread Michal Hocko
On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> sorry for delay,
> 
> On 05/25, Michal Hocko wrote:
[...]
> > This is a bit surprising but my testing shows that the result shouldn't
> > make much difference. I can see some discrepancies between lru_vec size
> > and zone_reclaimable_pages but they are too small to actually matter.
> 
> Yes, the difference is small but it does matter.
> 
> I do not pretend I understand this all, but finally it seems I understand
> whats going on on my system when it hangs. At least, why the change in
> lruvec_lru_size() or calculate_normal_threshold() makes a difference.
> 
> This single change in get_scan_count() under for_each_evictable_lru() loop
> 
>   -   size = lruvec_lru_size(lruvec, lru);
>   +   size = zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> 
> fixes the problem too.
> 
> Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> we do not even try to scan it, lruvec_lru_size() returns zero.

OK, you seem to be really seeing a different issue than me. My debugging
patch was showing when nothing was really isolated from the LRU lists
(both for shrink_{in}active_list).

> Then later we recheck zone_reclaimable() and it notices the INACTIVE_FILE
> counter because it uses the _snapshot variant, this leads to livelock.
> 
> I guess this doesn't really matter, but in my particular case these
> ACTIVE/INACTIVE counters were screwed by the recent putback_inactive_pages()
> logic. The pages we "leak" in INACTIVE list were recently moved from ACTIVE
> to INACTIVE list, and this updated only the per-cpu ->vm_stat_diff[] counters,
> so the "non snapshot" lruvec_lru_size() in get_scan_count() sees the "old"
> numbers.

Hmm. I am not really sure we can use the _snapshot version in lruvec_lru_size().
It is also called outside of slow paths (like add_to_page_cache_lru).
Maybe a patch like below would be acceptable for stable trees.

But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
Does that help as well? The patch is certainly needed for the oom
detection rework but it got merged one release cycle earlier.
--- 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3388ccbab7d6..9f46a29c06b6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -764,7 +764,8 @@ static inline struct zone *lruvec_zone(struct lruvec *lruvec)
 #endif
 }
 
-extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
+   bool precise);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..84420045090b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,11 +213,13 @@ bool zone_reclaimable(struct zone *zone)
zone_reclaimable_pages(zone) * 6;
 }
 
-unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, bool precise)
 {
if (!mem_cgroup_disabled())
return mem_cgroup_get_lru_size(lruvec, lru);
 
+   if (precise)
+   return zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
 }
 
@@ -1902,8 +1904,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
if (!file && !total_swap_pages)
return false;
 
-   inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
-   active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
+   inactive = lruvec_lru_size(lruvec, file * LRU_FILE, true);
+   active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE, true);
 
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -2040,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 * system is under heavy pressure.
 */
if (!inactive_list_is_low(lruvec, true) &&
-   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, true) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2066,10 +2068,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 * anon in [0], file in [1]
 */
 
-   anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
-   lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
-   file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
-   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+   anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, true) +
+   lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, 

Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-31 Thread Michal Hocko
On Sun 29-05-16 23:25:40, Oleg Nesterov wrote:
> sorry for delay,
> 
> On 05/25, Michal Hocko wrote:
[...]
> > This is a bit surprising but my testing shows that the result shouldn't
> > make much difference. I can see some discrepancies between lru_vec size
> > and zone_reclaimable_pages but they are too small to actually matter.
> 
> Yes, the difference is small but it does matter.
> 
> I do not pretend I understand this all, but finally it seems I understand
> whats going on on my system when it hangs. At least, why the change in
> lruvec_lru_size() or calculate_normal_threshold() makes a difference.
> 
> This single change in get_scan_count() under for_each_evictable_lru() loop
> 
>   -   size = lruvec_lru_size(lruvec, lru);
>   +   size = zone_page_state_snapshot(lruvec_zone(lruvec), 
> NR_LRU_BASE + lru);
> 
> fixes the problem too.
> 
> Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
> while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
> we do not even try to scan it, lruvec_lru_size() returns zero.

OK, you seem to be really seeing a different issue than me. My debugging
patch was showing when nothing was really isolated from the LRU lists
(both for shrink_{in}active_list.

> Then later we recheck zone_reclaimable() and it notices the INACTIVE_FILE
> counter because it uses the _snapshot variant, this leads to livelock.
> 
> I guess this doesn't really matter, but in my particular case these
> ACTIVE/INACTIVE counters were screwed by the recent putback_inactive_pages()
> logic. The pages we "leak" in INACTIVE list were recently moved from ACTIVE
> to INACTIVE list, and this updated only the per-cpu ->vm_stat_diff[] counters,
> so the "non snapshot" lruvec_lru_size() in get_scan_count() sees the "old"
> numbers.

Hmm. I am not really sure we can use the _snapshot version in lruvec_lru_size.
It is also called outside of slow paths (like add_to_page_cache_lru).
Maybe a patch like the one below would be acceptable for stable trees.

But I am thinking whether we should simply revert 0db2cb8da89d ("mm,
vmscan: make zone_reclaimable_pages more precise") in 4.6 stable tree.
Does that help as well? The patch is certainly needed for the oom
detection rework but it got merged one release cycle earlier.
--- 
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3388ccbab7d6..9f46a29c06b6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -764,7 +764,8 @@ static inline struct zone *lruvec_zone(struct lruvec *lruvec)
 #endif
 }
 
-extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru);
+extern unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru,
+   bool precise);
 
 #ifdef CONFIG_HAVE_MEMORY_PRESENT
 void memory_present(int nid, unsigned long start, unsigned long end);
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c4a2f4512fca..84420045090b 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -213,11 +213,13 @@ bool zone_reclaimable(struct zone *zone)
zone_reclaimable_pages(zone) * 6;
 }
 
-unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru)
+unsigned long lruvec_lru_size(struct lruvec *lruvec, enum lru_list lru, bool precise)
 {
if (!mem_cgroup_disabled())
return mem_cgroup_get_lru_size(lruvec, lru);
 
+   if (precise)
+   return zone_page_state_snapshot(lruvec_zone(lruvec), NR_LRU_BASE + lru);
return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
 }
 
@@ -1902,8 +1904,8 @@ static bool inactive_list_is_low(struct lruvec *lruvec, bool file)
if (!file && !total_swap_pages)
return false;
 
-   inactive = lruvec_lru_size(lruvec, file * LRU_FILE);
-   active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE);
+   inactive = lruvec_lru_size(lruvec, file * LRU_FILE, true);
+   active = lruvec_lru_size(lruvec, file * LRU_FILE + LRU_ACTIVE, true);
 
gb = (inactive + active) >> (30 - PAGE_SHIFT);
if (gb)
@@ -2040,7 +2042,7 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 * system is under heavy pressure.
 */
if (!inactive_list_is_low(lruvec, true) &&
-   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE) >> sc->priority) {
+   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, true) >> sc->priority) {
scan_balance = SCAN_FILE;
goto out;
}
@@ -2066,10 +2068,10 @@ static void get_scan_count(struct lruvec *lruvec, struct mem_cgroup *memcg,
 * anon in [0], file in [1]
 */
 
-   anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON) +
-   lruvec_lru_size(lruvec, LRU_INACTIVE_ANON);
-   file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE) +
-   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE);
+   anon  = lruvec_lru_size(lruvec, LRU_ACTIVE_ANON, true) +
+   lruvec_lru_size(lruvec, LRU_INACTIVE_ANON, true);
+   file  = lruvec_lru_size(lruvec, LRU_ACTIVE_FILE, true) +
+   lruvec_lru_size(lruvec, LRU_INACTIVE_FILE, true);

Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-29 Thread Oleg Nesterov
sorry for delay,

On 05/25, Michal Hocko wrote:
>
> On Wed 25-05-16 00:43:41, Oleg Nesterov wrote:
> >
> > But. It _seems to me_ that the kernel "leaks" some pages in LRU_INACTIVE_FILE
> > list because inactive_file_is_low() returns the wrong value. And do not even
> > ask me why I think so, unlikely I will be able to explain ;) to remind, I never
> > tried to read vmscan.c before.

No, this is not because of inactive_file_is_low(), but

> >
> > But. if I change lruvec_lru_size()
> >
> > -   return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
> > +   return zone_page_state_snapshot(lruvec_zone(lruvec), 
> > NR_LRU_BASE + lru);
> >
> > the problem goes away too.

Yes,

> This is a bit surprising but my testing shows that the result shouldn't
> make much difference. I can see some discrepancies between lru_vec size
> and zone_reclaimable_pages but they are too small to actually matter.

Yes, the difference is small but it does matter.

I do not pretend I understand this all, but finally it seems I understand
what's going on on my system when it hangs. At least, why the change in
lruvec_lru_size() or calculate_normal_threshold() makes a difference.

This single change in get_scan_count() under for_each_evictable_lru() loop

-   size = lruvec_lru_size(lruvec, lru);
+   size = zone_page_state_snapshot(lruvec_zone(lruvec), 
NR_LRU_BASE + lru);

fixes the problem too.

Without this change shrink*() continues to scan the LRU_ACTIVE_FILE list
while it is empty. LRU_INACTIVE_FILE is not empty (just a few pages) but
we do not even try to scan it, lruvec_lru_size() returns zero.

Then later we recheck zone_reclaimable() and it notices the INACTIVE_FILE
counter because it uses the _snapshot variant, this leads to livelock.

I guess this doesn't really matter, but in my particular case these
ACTIVE/INACTIVE counters were screwed by the recent putback_inactive_pages()
logic. The pages we "leak" in INACTIVE list were recently moved from ACTIVE
to INACTIVE list, and this updated only the per-cpu ->vm_stat_diff[] counters,
so the "non snapshot" lruvec_lru_size() in get_scan_count() sees the "old"
numbers.

I even added more printk's, and yes when the system hangs I have something
like, say,

->vm_stat[ACTIVE]        = NR;  // small number
->vm_stat_diff[ACTIVE]   = -NR; // so it is actually zero but
                                // get_scan_count() sees NR

->vm_stat[INACTIVE]      = 0;   // this is what get_scan_count() sees
->vm_stat_diff[INACTIVE] = NR;  // and this is what zone_reclaimable() sees

Oleg.



Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-24 Thread Oleg Nesterov
On 05/24, Michal Hocko wrote:
>
> On Mon 23-05-16 17:14:19, Oleg Nesterov wrote:
> > On 05/23, Michal Hocko wrote:
> [...]
> > > Could you add some tracing and see what are the numbers
> > > above?
> >
> > with the patch below I can press Ctrl-C when it hangs, this breaks the
> > endless loop and the output looks like
> >
> > vmscan: ZONE=8189f180 0 scanned=0 pages=6
> > vmscan: ZONE=8189eb00 0 scanned=1 pages=0
> > ...
> > vmscan: ZONE=8189eb00 0 scanned=2 pages=1
> > vmscan: ZONE=8189f180 0 scanned=4 pages=6
> > ...
> > vmscan: ZONE=8189f180 0 scanned=4 pages=6
> > vmscan: ZONE=8189f180 0 scanned=4 pages=6
> >
> > the numbers are always small.
>
> Small but scanned is not 0 and constant which means it either gets reset
> repeatedly (something gets freed) or we have stopped scanning. Which
> pattern can you see? I assume that the swap space is full at the time
> (could you add get_nr_swap_pages() to the output).

no, I tested this without SWAP,

> Also zone->name would
> be better than the pointer.

Yes, forgot to mention, this is DMA32. To remind, only 512m of RAM so
this is natural.

> I am trying to reproduce but your test case always hits the oom killer:

Did you try to run it in a loop? Usually it takes a while before the system
hangs.

> Swap:   138236  57740  80496

perhaps this makes a difference? See above, I have no SWAP.


So. I spent almost the whole day trying to understand what's going on, and
of course I failed.

But. It _seems to me_ that the kernel "leaks" some pages in LRU_INACTIVE_FILE
list because inactive_file_is_low() returns the wrong value. And do not even
ask me why I think so, unlikely I will be able to explain ;) to remind, I never
tried to read vmscan.c before.

But. if I change lruvec_lru_size()

-   return zone_page_state(lruvec_zone(lruvec), NR_LRU_BASE + lru);
+   return zone_page_state_snapshot(lruvec_zone(lruvec), 
NR_LRU_BASE + lru);

the problem goes away too.

To remind, it also goes away if I change calculate_normal_threshold() to return
zero, and it was not clear why. Now we can probably conclude that this is
because the change obviously affects lruvec_lru_size().

Oleg.



Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-24 Thread Michal Hocko
On Mon 23-05-16 17:14:19, Oleg Nesterov wrote:
> On 05/23, Michal Hocko wrote:
[...]
> > Could you add some tracing and see what are the numbers
> > above?
> 
> with the patch below I can press Ctrl-C when it hangs, this breaks the
> endless loop and the output looks like
> 
>   vmscan: ZONE=8189f180 0 scanned=0 pages=6
>   vmscan: ZONE=8189eb00 0 scanned=1 pages=0
>   ...
>   vmscan: ZONE=8189eb00 0 scanned=2 pages=1
>   vmscan: ZONE=8189f180 0 scanned=4 pages=6
>   ...
>   vmscan: ZONE=8189f180 0 scanned=4 pages=6
>   vmscan: ZONE=8189f180 0 scanned=4 pages=6
> 
> the numbers are always small.

Small but scanned is not 0 and constant which means it either gets reset
repeatedly (something gets freed) or we have stopped scanning. Which
pattern can you see? I assume that the swap space is full at the time
(could you add get_nr_swap_pages() to the output). Also zone->name would
be better than the pointer.

I am trying to reproduce but your test case always hits the oom killer:

This is in a qemu x86_64 virtual machine:
# free
             total       used       free     shared    buffers     cached
Mem:        490212      96788     393424          0       3196       9976
-/+ buffers/cache:      83616     406596
Swap:       138236      57740      80496

I have tried with much larger swap space but no change except for the
run time of the test which is expected.

# grep "^processor" /proc/cpuinfo | wc -l
1

[... Skipped several previous attempts ...]
[  695.215235] vmscan: XXX: zone:DMA32 nr_pages_scanned:0 reclaimable:20
[  695.215245] vmscan: XXX: zone:DMA32 nr_pages_scanned:0 reclaimable:20
[  695.215255] vmscan: XXX: zone:DMA32 nr_pages_scanned:0 reclaimable:20
[  695.215282] vmscan: XXX: zone:DMA32 nr_pages_scanned:1 reclaimable:27
[  695.215303] vmscan: XXX: zone:DMA32 nr_pages_scanned:5 reclaimable:27
[  695.215327] vmscan: XXX: zone:DMA32 nr_pages_scanned:18 reclaimable:27
[  695.215351] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215362] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215373] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215382] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215392] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215402] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215412] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215422] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215431] vmscan: XXX: zone:DMA32 nr_pages_scanned:45 reclaimable:27
[  695.215442] vmscan: XXX: zone:DMA32 nr_pages_scanned:46 reclaimable:27
[  695.215462] vmscan: XXX: zone:DMA32 nr_pages_scanned:48 reclaimable:27
[  695.215482] vmscan: XXX: zone:DMA32 nr_pages_scanned:53 reclaimable:27
[  695.215504] vmscan: XXX: zone:DMA32 nr_pages_scanned:63 reclaimable:27
[  695.215528] vmscan: XXX: zone:DMA32 nr_pages_scanned:90 reclaimable:27
[...]
[  695.215620] vmscan: XXX: zone:DMA32 nr_pages_scanned:91 reclaimable:27
[  695.215640] vmscan: XXX: zone:DMA32 nr_pages_scanned:94 reclaimable:27
[  695.215659] vmscan: XXX: zone:DMA32 nr_pages_scanned:100 reclaimable:27
[  695.215683] vmscan: XXX: zone:DMA32 nr_pages_scanned:113 reclaimable:27
[...]
[  695.215786] vmscan: XXX: zone:DMA32 nr_pages_scanned:140 reclaimable:27
[  695.215797] vmscan: XXX: zone:DMA32 nr_pages_scanned:141 reclaimable:27
[  695.215816] vmscan: XXX: zone:DMA32 nr_pages_scanned:144 reclaimable:27
[  695.215836] vmscan: XXX: zone:DMA32 nr_pages_scanned:150 reclaimable:27
[  695.215906] test-oleg invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
-- 
Michal Hocko
SUSE Labs


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-23 Thread Oleg Nesterov
On 05/23, Michal Hocko wrote:
>
> > nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
> > if (nr_scanned)
> > __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> >
> > and this doesn't look exactly right: zone_page_state() ignores the per-cpu
> > ->vm_stat_diff[] counters (and we probably do not want for_each_online_cpu()
> > loop here). And I do not know if this is really bad or not, but note that if
> > I change calculate_normal_threshold() to return 0, the problem goes away 
> > too.
>
> You are absolutely right that this is racy. In the worst case we would
> end up missing nr_cpus*threshold scanned pages which would stay behind.

and the sum of ->vm_stat_diff[] can be negative, so...

> But
>
> bool zone_reclaimable(struct zone *zone)
> {
>   return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
>   zone_reclaimable_pages(zone) * 6;
> }
>
> So the left over shouldn't cause it to return true all the time.

well if NR_PAGES_SCANNED doesn't grow enough it can even stay negative,
but zone_page_state_snapshot() returns zero in this case. In any case
we can underestimate zone_page_state_snapshot(NR_PAGES_SCANNED).

> In
> fact it could prematurely say false, right? (note that _snapshot variant
> considers per-cpu diffs [1]).

exactly because _snapshot() doesn't ignore the per-cpu counters.

> That being said I am not really sure why would the 0 threshold help for
> your test case.

Neither me. Except, of course, threshold==0 means the code above will
work correctly. But I do not think this was the root of the problem.

> Could you add some tracing and see what are the numbers
> above?

with the patch below I can press Ctrl-C when it hangs, this breaks the
endless loop and the output looks like

vmscan: ZONE=8189f180 0 scanned=0 pages=6
vmscan: ZONE=8189eb00 0 scanned=1 pages=0
...
vmscan: ZONE=8189eb00 0 scanned=2 pages=1
vmscan: ZONE=8189f180 0 scanned=4 pages=6
...
vmscan: ZONE=8189f180 0 scanned=4 pages=6
vmscan: ZONE=8189f180 0 scanned=4 pages=6

the numbers are always small.

> [1] I am not really sure which kernel version have you tested - your
> config says 4.6.0-rc7 but this is true since 0db2cb8da89d ("mm, vmscan:
> make zone_reclaimable_pages more precise") which is 4.6-rc1.

Yes, I am on c5114626f33b62fa7595e57d87f33d9d1f8298a2, it has this change.

Oleg.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 142cb61..6d221f9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2614,6 +2614,12 @@ static bool shrink_zones(struct zonelist *zonelist, struct scan_control *sc)
if (shrink_zone(zone, sc, zone_idx(zone) == classzone_idx))
reclaimable = true;
 
+		if (fatal_signal_pending(current))
+			pr_crit("ZONE=%p %d scanned=%ld pages=%ld\n",
+				zone, reclaimable,
+				zone_page_state_snapshot(zone, NR_PAGES_SCANNED),
+				zone_reclaimable_pages(zone));
+		else
 		if (global_reclaim(sc) &&
 		    !reclaimable && zone_reclaimable(zone))
 			reclaimable = true;



Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-23 Thread Michal Hocko
Hi,
Tetsuo has already pointed you at my oom detection rework which removes
the zone_reclaimable ugliness (btw. one of the top reasons to rework
this area) and it is likely to fix your problem. I would still like to
understand what happens with your test case because we might want to
prepare a stable patch for older kernels.

On Fri 20-05-16 22:28:17, Oleg Nesterov wrote:
> I don't understand vmscan.c, and in fact I don't even understand NR_PAGES_SCANNED
[...]
> counter... why it has to be atomic/per-cpu? It is always updated under ->lru_lock
> except free_pcppages_bulk/free_one_page try to reset this counter. But note that
> they both do

It doesn't really have to be atomic/per-cpu because it is really updated
under the lock. It just uses the generic vmstat infrastructure...

>   nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
>   if (nr_scanned)
>   __mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);
> 
> and this doesn't look exactly right: zone_page_state() ignores the per-cpu
> ->vm_stat_diff[] counters (and we probably do not want for_each_online_cpu()
> loop here). And I do not know if this is really bad or not, but note that if
> I change calculate_normal_threshold() to return 0, the problem goes away too.

You are absolutely right that this is racy. In the worst case we would
end up missing nr_cpus*threshold scanned pages which would stay behind.
But

bool zone_reclaimable(struct zone *zone)
{
return zone_page_state_snapshot(zone, NR_PAGES_SCANNED) <
zone_reclaimable_pages(zone) * 6;
}

So the left over shouldn't cause it to return true all the time. In
fact it could prematurely say false, right? (note that _snapshot variant
considers per-cpu diffs [1]).

That being said I am not really sure why would the 0 threshold help for
your test case. Could you add some tracing and see what are the numbers
above? Is it possible that zone_reclaimable_pages is some small number
which actually prevents us from scanning anything? Aka a bug in get_scan_count
or somewhere else?

[1] I am not really sure which kernel version have you tested - your
config says 4.6.0-rc7 but this is true since 0db2cb8da89d ("mm, vmscan:
make zone_reclaimable_pages more precise") which is 4.6-rc1.
-- 
Michal Hocko
SUSE Labs


Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-22 Thread Oleg Nesterov
On 05/21, Tetsuo Handa wrote:
>
> On 2016/05/21 5:28, Oleg Nesterov wrote:
> > It spins in __alloc_pages_slowpath() forever, __alloc_pages_may_oom() is never
> > called, it doesn't react to SIGKILL, etc.
> >
> > This is because zone_reclaimable() is always true in shrink_zones(), and the
> > problem goes away if I comment out this code
> >
> > if (global_reclaim(sc) &&
> > !reclaimable && zone_reclaimable(zone))
> > reclaimable = true;
> >
> > in shrink_zones() which otherwise returns this "true" every time, and thus
> > __alloc_pages_slowpath() always sees did_some_progress != 0.
> >
>
> Michal Hocko's OOM detection rework patchset that removes that code was sent
> to Linus 4 hours ago. ( https://marc.info/?l=linux-mm-commits=146378862415399 )
> Please wait for a few days and try reproducing using linux.git .

I guess you mean
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/mm/vmscan.c?id=fa8c5f033ebb43f925d68c29d297bafd36af7114
"mm, oom: rework oom detection"...

Yes thanks a lot Tetsuo, it should fix the problem.

Cough I can't resist I hate Michal^W the fact this was already fixed ;) Because
it took me some time to understand what's going on, initially it looked like some
subtle and hard-to-reproduce bug in userfaultfd.

Thanks!

Oleg.



Re: zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-20 Thread Tetsuo Handa
On 2016/05/21 5:28, Oleg Nesterov wrote:
> Hello,
> 
> Recently I hit the problem, _sometimes_ the system just hangs in OOM situation.
> Surprisingly, this time OOM-killer is innocent ;) and finally I can reproduce
> this more-or-less reliably just running
> 
>   #include <stdlib.h>
>   #include <string.h>
> 
>   int main(void)
>   {
>   for (;;) {
>   void *p = malloc(1024 * 1024);
>   memset(p, 0, 1024 * 1024);
>   }
>   }
> 
> in a loop on the otherwise idle system. 512m RAM, one CPU (but CONFIG_SMP=y),
> no swap, and only one user-space process (apart from test-case above), /bin/sh
running as init with pid==1. I am attaching my .config just in case, but I
think the problem is not really specific to this configuration.
> 
> 
> It spins in __alloc_pages_slowpath() forever, __alloc_pages_may_oom() is never
> called, it doesn't react to SIGKILL, etc.
> 
> This is because zone_reclaimable() is always true in shrink_zones(), and the
> problem goes away if I comment out this code
> 
>   if (global_reclaim(sc) &&
>   !reclaimable && zone_reclaimable(zone))
>   reclaimable = true;
> 
> in shrink_zones() which otherwise returns this "true" every time, and thus
> __alloc_pages_slowpath() always sees did_some_progress != 0.
> 

Michal Hocko's OOM detection rework patchset that removes that code was sent
to Linus 4 hours ago. ( https://marc.info/?l=linux-mm-commits=146378862415399 )
Please wait for a few days and try reproducing using linux.git.



zone_reclaimable() leads to livelock in __alloc_pages_slowpath()

2016-05-20 Thread Oleg Nesterov
Hello,

Recently I hit a problem: _sometimes_ the system just hangs in an OOM situation.
Surprisingly, this time the OOM-killer is innocent ;) and finally I can reproduce
this more-or-less reliably just by running

	#include <stdlib.h>
	#include <string.h>

	int main(void)
	{
		for (;;) {
			void *p = malloc(1024 * 1024);
			memset(p, 0, 1024 * 1024);
		}
	}

in a loop on the otherwise idle system. 512M RAM, one CPU (but CONFIG_SMP=y),
no swap, and only one user-space process (apart from the test-case above), /bin/sh
running as init with pid==1. I am attaching my .config just in case, but I
think the problem is not really specific to this configuration.


It spins in __alloc_pages_slowpath() forever, __alloc_pages_may_oom() is never
called, it doesn't react to SIGKILL, etc.

This is because zone_reclaimable() is always true in shrink_zones(), and the
problem goes away if I comment out this code

	if (global_reclaim(sc) &&
	    !reclaimable && zone_reclaimable(zone))
		reclaimable = true;

in shrink_zones() which otherwise returns this "true" every time, and thus
__alloc_pages_slowpath() always sees did_some_progress != 0.

---
I don't understand vmscan.c, and in fact I don't even understand the
NR_PAGES_SCANNED counter... why does it have to be atomic/per-cpu? It is
always updated under ->lru_lock, except that free_pcppages_bulk/free_one_page
try to reset this counter. But note that they both do

	nr_scanned = zone_page_state(zone, NR_PAGES_SCANNED);
	if (nr_scanned)
		__mod_zone_page_state(zone, NR_PAGES_SCANNED, -nr_scanned);

and this doesn't look exactly right: zone_page_state() ignores the per-cpu
->vm_stat_diff[] counters (and we probably do not want for_each_online_cpu()
loop here). And I do not know if this is really bad or not, but note that if
I change calculate_normal_threshold() to return 0, the problem goes away too.

Oleg.

---
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.6.0-rc7 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_FHANDLE is not set
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
