[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks what zone it was recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx, which is 0 without a recent wakeup request, and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again shortly, there are two consequences -- the pgdat bits that control congestion are cleared prematurely, and direct reclaim is more likely as kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there have been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results.
Otherwise this patch looks artificially excellent.

                                      4.11.0-rc1             4.11.0-rc1             4.11.0-rc1
                                         vanilla               clear-v2           keepawake-v2
Amean    p50-Read          21670074.18 (  0.00%)  19786774.76 (  8.69%)  22668332.52 ( -4.61%)
Amean    p95-Read          25456267.64 (  0.00%)  24101956.27 (  5.32%)  26738688.00 ( -5.04%)
Amean    p99-Read          29369064.73 (  0.00%)  27691872.71 (  5.71%)  30991404.52 ( -5.52%)
Amean    p50-Write             1390.30 (  0.00%)      1011.91 ( 27.22%)      924.91 ( 33.47%)
Amean    p95-Write           412901.57 (  0.00%)     34874.98 ( 91.55%)     1362.62 ( 99.67%)
Amean    p99-Write          6668722.09 (  0.00%)    575449.60 ( 91.37%)    16854.04 ( 99.75%)
Amean    p50-Allocation       78714.31 (  0.00%)     84246.26 ( -7.03%)    74729.74 (  5.06%)
Amean    p95-Allocation      175533.51 (  0.00%)    400058.43 (-127.91%)  101609.74 ( 42.11%)
Amean    p99-Allocation      247003.02 (  0.00%)  10905600.00 (-4315.17%) 125765.57 ( 49.08%)

With this patch on top, write and allocation latencies are massively improved. The read latencies are slightly impaired, but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim.
The vmstats are a bit of a mix but the relevant ones are as follows:

                            4.10.0-rc7   4.10.0-rc7      4.10.0-rc7
                        mmots-20170209  clear-v1r25 keepawake-v1r25
Swap Ins                             0            0               0
Swap Outs                            0          608               0
Direct pages scanned           6910672      3132699         6357298
Kswapd pages scanned          57036946     82488665        56986286
Kswapd pages reclaimed        55993488     63474329        55939113
Direct pages reclaimed         6905990      2964843         6352115
Kswapd efficiency                  98%          76%             98%
Kswapd velocity              12494.375    17597.507       12488.065
Direct efficiency                  99%          94%             99%
Direct velocity               1513.835      668.306        1393.148
Page writes by reclaim           0.000  4410243.000           0.000
Page writes file                     0      4409635               0
Page writes anon                     0          608               0
Page reclaim immediate         1036792     14175203         1042571

                            4.11.0-rc1   4.11.0-rc1      4.11.0-rc1
                               vanilla     clear-v2    keepawake-v2
Swap Ins                             0           12               0
Swap Outs                            0          838               0
Direct pages scanned           6579706      3237270         6256811
Kswapd pages scanned          61853702     79961486        54837791
Kswapd pages reclaimed        60768764     60755788        53849586
Direct pages reclaimed         6579055      2987453         6256151
Kswapd efficiency                  98%          75%             98%
Page writes by reclaim           0.000  4389496.000           0.000
Page writes file                     0      4388658               0
Page writes anon                     0          838               0
Page reclaim immediate         1073573     14473009          982507

Swap-outs are equivalent to baseline. Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On 02/23/2017 04:01 PM, Mel Gorman wrote:
> On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
>>> With this patch on top, all the latencies relative to the baseline are
>>> improved, particularly write latencies. The read latencies are still high
>>> for the number of threads but it's worth noting that this is mostly due
>>> to the IO scheduler and not directly related to reclaim. The vmstats are
>>> a bit of a mix but the relevant ones are as follows;
>>>
>>>                             4.10.0-rc7   4.10.0-rc7      4.10.0-rc7
>>>                         mmots-20170209  clear-v1r25 keepawake-v1r25
>>> Swap Ins                             0            0               0
>>> Swap Outs                            0          608               0
>>> Direct pages scanned           6910672      3132699         6357298
>>> Kswapd pages scanned          57036946     82488665        56986286
>>> Kswapd pages reclaimed        55993488     63474329        55939113
>>> Direct pages reclaimed         6905990      2964843         6352115
>>
>> These stats are confusing me. The earlier description suggests that this patch
>> should cause less direct reclaim and more kswapd reclaim, but compared to
>> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
>> considering direct + kswapd combined)
>>
>
> The description is referring to the impact relative to baseline. It is
> true that, relative to that patch, direct reclaim is higher but there are
> a number of anomalies.
>
> Note that kswapd is scanning very aggressively in "clear-v1" and overall
> efficiency is down to 76%. It's also not clear in the stats but in
> "clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
> due to the patch "mm, vmscan: fix zone balance check in
> prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
> pages from reclaim context and some swapping due to the aggressiveness
> of the scan.
>
> While direct reclaim activity might be lower, it's due to kswapd scanning
> aggressively and trying to reclaim the world which is not the right thing
> to do.
> With the patches applied, there is still direct reclaim but the vast
> bulk of them are when the workload changes phase from "creating work files"
> to starting multiple threads that allocate a lot of anonymous memory with
> a sudden spike in memory pressure that kswapd does not keep ahead of with
> multiple allocating threads.

Thanks for the explanation.

>>> @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
>>>  	return sc.order;
>>>  }
>>>
>>> +/*
>>> + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
>>> + * allocation request woke kswapd for. When kswapd has not woken recently,
>>> + * the value is MAX_NR_ZONES which is not a valid index. This compares a
>>> + * given classzone and returns it or the highest classzone index kswapd
>>> + * was recently woke for.
>>> + */
>>> +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
>>> +					enum zone_type classzone_idx)
>>> +{
>>> +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
>>> +		return classzone_idx;
>>> +
>>> +	return max(pgdat->kswapd_classzone_idx, classzone_idx);
>>
>> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx
>> to a local variable with READ_ONCE(), otherwise something can set it to
>> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
>> Probably not an issue with current callers, but I'd rather future-proof it.
>>
>
> I'm a little wary of adding READ_ONCE unless there is a definite
> problem. Even if it was an issue, I think it would be better to protect
> these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
> if an update is required or a read to fully guarantee the ordering.
>
> The consequences as they are is that kswapd may miss reclaiming at a
> higher order or classzone than it should have although it is very
> unlikely and the update and read are made with a workqueue wake and
> scheduler wakeup which should be sufficient in terms of barriers.

OK then.

Acked-by: Vlastimil Babka
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On Mon, Feb 20, 2017 at 05:42:49PM +0100, Vlastimil Babka wrote:
> > With this patch on top, all the latencies relative to the baseline are
> > improved, particularly write latencies. The read latencies are still high
> > for the number of threads but it's worth noting that this is mostly due
> > to the IO scheduler and not directly related to reclaim. The vmstats are
> > a bit of a mix but the relevant ones are as follows;
> >
> >                             4.10.0-rc7   4.10.0-rc7      4.10.0-rc7
> >                         mmots-20170209  clear-v1r25 keepawake-v1r25
> > Swap Ins                             0            0               0
> > Swap Outs                            0          608               0
> > Direct pages scanned           6910672      3132699         6357298
> > Kswapd pages scanned          57036946     82488665        56986286
> > Kswapd pages reclaimed        55993488     63474329        55939113
> > Direct pages reclaimed         6905990      2964843         6352115
>
> These stats are confusing me. The earlier description suggests that this patch
> should cause less direct reclaim and more kswapd reclaim, but compared to
> "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? (when
> considering direct + kswapd combined)
>

The description is referring to the impact relative to baseline. It is
true that, relative to that patch, direct reclaim is higher but there are
a number of anomalies.

Note that kswapd is scanning very aggressively in "clear-v1" and overall
efficiency is down to 76%. It's also not clear in the stats but in
"clear-v1", pgskip_* is active as the wrong zone is being reclaimed for
due to the patch "mm, vmscan: fix zone balance check in
prepare_kswapd_sleep". It's also doing a lot of writing of file-backed
pages from reclaim context and some swapping due to the aggressiveness
of the scan.

While direct reclaim activity might be lower, it's due to kswapd scanning
aggressively and trying to reclaim the world which is not the right thing
to do.
With the patches applied, there is still direct reclaim but the vast
bulk of them are when the workload changes phase from "creating work files"
to starting multiple threads that allocate a lot of anonymous memory with
a sudden spike in memory pressure that kswapd does not keep ahead of with
multiple allocating threads.

> > @@ -3328,6 +3330,22 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int classzone_idx)
> >  	return sc.order;
> >  }
> >
> > +/*
> > + * pgdat->kswapd_classzone_idx is the highest zone index that a recent
> > + * allocation request woke kswapd for. When kswapd has not woken recently,
> > + * the value is MAX_NR_ZONES which is not a valid index. This compares a
> > + * given classzone and returns it or the highest classzone index kswapd
> > + * was recently woke for.
> > + */
> > +static enum zone_type kswapd_classzone_idx(pg_data_t *pgdat,
> > +					enum zone_type classzone_idx)
> > +{
> > +	if (pgdat->kswapd_classzone_idx == MAX_NR_ZONES)
> > +		return classzone_idx;
> > +
> > +	return max(pgdat->kswapd_classzone_idx, classzone_idx);
>
> A bit paranoid comment: this should probably read pgdat->kswapd_classzone_idx
> to a local variable with READ_ONCE(), otherwise something can set it to
> MAX_NR_ZONES between the check and max(), and compiler can decide to reread.
> Probably not an issue with current callers, but I'd rather future-proof it.
>

I'm a little wary of adding READ_ONCE unless there is a definite
problem. Even if it was an issue, I think it would be better to protect
these kswapd_classzone_idx and kswapd_order with a spinlock that is taken
if an update is required or a read to fully guarantee the ordering.

The consequences as they are is that kswapd may miss reclaiming at a
higher order or classzone than it should have although it is very
unlikely and the update and read are made with a workqueue wake and
scheduler wakeup which should be sufficient in terms of barriers.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On February 21, 2017 12:34 AM Vlastimil Babka wrote:
> On 02/16/2017 09:21 AM, Hillf Danton wrote:
> > Right, but the order-3 request can also come up while kswapd is active and
> > gives up order-5.
>
> "Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming
> order-0 watermarks are OK) and wakeup kcompactd for order-5. There's no way how
> kswapd could help an order-3 allocation at that point - it's up to kcompactd.
>

cpu0                          cpu1
give up order-5
                              fall back to order-0
                              wake up kswapd for order-3
wake up kswapd for order-5
fall in sleep
                              wake up kswapd for order-3

What order would we try? It is order-5 in the patch. Given that kswapd starts afresh after napping, whether for a tenth of a second or 3 minutes, we are free IMHO to select any order and go another round of reclaiming pages.

thanks
Hillf
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On 02/15/2017 10:22 AM, Mel Gorman wrote:
> kswapd is woken to reclaim a node based on a failed allocation request
> from any eligible zone. Once reclaiming in balance_pgdat(), it will
> continue reclaiming until there is an eligible zone available for the
> zone it was woken for. kswapd tracks what zone it was recently woken for
> in pgdat->kswapd_classzone_idx. If it has not been woken recently, this
> zone will be 0.
>
> However, the decision on whether to sleep is made on kswapd_classzone_idx
> which is 0 without a recent wakeup request and that classzone does not
> account for lowmem reserves. This allows kswapd to sleep when a low
> small zone such as ZONE_DMA is balanced for a GFP_DMA request even if
> a stream of allocations cannot use that zone. While kswapd may be woken
> again shortly in the near future there are two consequences -- the pgdat
> bits that control congestion are cleared prematurely and direct reclaim
> is more likely as kswapd slept prematurely.
>
> This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid
> index) when there has been no recent wakeups. If there are no wakeups,
> it'll decide whether to sleep based on the highest possible zone available
> (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced"
> decisions during reclaim and when deciding to sleep are the same. If there is
> a mismatch, kswapd can stay awake continually trying to balance tiny zones.
>
> simoop was used to evaluate it again. Two of the preparation patches regressed
> the workload so they are included as the second set of results.
> Otherwise this patch looks artificially excellent
>
>                                       4.10.0-rc7             4.10.0-rc7             4.10.0-rc7
>                                   mmots-20170209            clear-v1r25        keepawake-v1r25
> Amean    p50-Read          22325202.49 (  0.00%)  19491134.58 ( 12.69%)  22092755.48 (  1.04%)
> Amean    p95-Read          26102988.80 (  0.00%)  24294195.20 (  6.93%)  26101849.04 (  0.00%)
> Amean    p99-Read          30935176.53 (  0.00%)  30397053.16 (  1.74%)  29746220.52 (  3.84%)
> Amean    p50-Write              976.44 (  0.00%)     1077.22 (-10.32%)       952.73 (  2.43%)
> Amean    p95-Write            15471.29 (  0.00%)    36419.56 (-135.40%)     3140.27 ( 79.70%)
> Amean    p99-Write            35108.62 (  0.00%)   102000.36 (-190.53%)     8843.73 ( 74.81%)
> Amean    p50-Allocation       76382.61 (  0.00%)    87485.22 (-14.54%)     76349.22 (  0.04%)
> Amean    p95-Allocation          12.39 (  0.00%)   204588.52 (-60.11%)    108630.26 ( 14.98%)
> Amean    p99-Allocation      187937.39 (  0.00%)   631657.74 (-236.10%)   139094.26 ( 25.99%)
>
> With this patch on top, all the latencies relative to the baseline are
> improved, particularly write latencies. The read latencies are still high
> for the number of threads but it's worth noting that this is mostly due
> to the IO scheduler and not directly related to reclaim. The vmstats are
> a bit of a mix but the relevant ones are as follows;
>
>                             4.10.0-rc7   4.10.0-rc7      4.10.0-rc7
>                         mmots-20170209  clear-v1r25 keepawake-v1r25
> Swap Ins                             0            0               0
> Swap Outs                            0          608               0
> Direct pages scanned           6910672      3132699         6357298
> Kswapd pages scanned          57036946     82488665        56986286
> Kswapd pages reclaimed        55993488     63474329        55939113
> Direct pages reclaimed         6905990      2964843         6352115

These stats are confusing me. The earlier description suggests that this patch
should cause less direct reclaim and more kswapd reclaim, but compared to
"clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then?
(when considering direct + kswapd combined)

> Kswapd efficiency                  98%          76%             98%
> Kswapd velocity              12494.375    17597.507       12488.065
> Direct efficiency                  99%          94%             99%
> Direct velocity               1513.835      668.306        1393.148
> Page writes by reclaim           0.000  4410243.000           0.000
> Page writes file                     0      4409635               0
> Page writes anon                     0          608               0
> Page reclaim immediate         1036792     14175203         1042571
>
> Swap-outs are equivalent to baseline.
>
> Direct reclaim is reduced but not eliminated. It's worth noting
> that there are two periods of direct reclaim for this workload. The
> first is when it switches from preparing the files for the actual
> test itself. It's a lot of file IO followed by a lot of allocs
> that reclaims heavily for a brief window. After that, direct
> reclaim is intermittent when the workload spawns a number of
> threads periodically to do work. kswapd simply cannot wake and
> reclaim
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On 02/15/2017 10:22 AM, Mel Gorman wrote: > kswapd is woken to reclaim a node based on a failed allocation request > from any eligible zone. Once reclaiming in balance_pgdat(), it will > continue reclaiming until there is an eligible zone available for the > zone it was woken for. kswapd tracks what zone it was recently woken for > in pgdat->kswapd_classzone_idx. If it has not been woken recently, this > zone will be 0. > > However, the decision on whether to sleep is made on kswapd_classzone_idx > which is 0 without a recent wakeup request and that classzone does not > account for lowmem reserves. This allows kswapd to sleep when a low > small zone such as ZONE_DMA is balanced for a GFP_DMA request even if > a stream of allocations cannot use that zone. While kswapd may be woken > again shortly in the near future there are two consequences -- the pgdat > bits that control congestion are cleared prematurely and direct reclaim > is more likely as kswapd slept prematurely. > > This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid > index) when there has been no recent wakeups. If there are no wakeups, > it'll decide whether to sleep based on the highest possible zone available > (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" > decisions during reclaim and when deciding to sleep are the same. If there is > a mismatch, kswapd can stay awake continually trying to balance tiny zones. > > simoop was used to evaluate it again. Two of the preparation patches regressed > the workload so they are included as the second set of results. 
Otherwise > this patch looks artifically excellent > > 4.10.0-rc74.10.0-rc7 >4.10.0-rc7 > mmots-20170209 clear-v1r25 > keepawake-v1r25 > Ameanp50-Read 22325202.49 ( 0.00%) 19491134.58 ( 12.69%) > 22092755.48 ( 1.04%) > Ameanp95-Read 26102988.80 ( 0.00%) 24294195.20 ( 6.93%) > 26101849.04 ( 0.00%) > Ameanp99-Read 30935176.53 ( 0.00%) 30397053.16 ( 1.74%) > 29746220.52 ( 3.84%) > Ameanp50-Write 976.44 ( 0.00%) 1077.22 (-10.32%) > 952.73 ( 2.43%) > Ameanp95-Write 15471.29 ( 0.00%)36419.56 (-135.40%) > 3140.27 ( 79.70%) > Ameanp99-Write 35108.62 ( 0.00%) 102000.36 (-190.53%) > 8843.73 ( 74.81%) > Ameanp50-Allocation 76382.61 ( 0.00%)87485.22 (-14.54%) > 76349.22 ( 0.04%) > Ameanp95-Allocation 12.39 ( 0.00%) 204588.52 (-60.11%) > 108630.26 ( 14.98%) > Ameanp99-Allocation 187937.39 ( 0.00%) 631657.74 (-236.10%) > 139094.26 ( 25.99%) > > With this patch on top, all the latencies relative to the baseline are > improved, particularly write latencies. The read latencies are still high > for the number of threads but it's worth noting that this is mostly due > to the IO scheduler and not directly related to reclaim. The vmstats are > a bit of a mix but the relevant ones are as follows; > > 4.10.0-rc7 4.10.0-rc7 4.10.0-rc7 > mmots-20170209 clear-v1r25keepawake-v1r25 > Swap Ins 0 0 0 > Swap Outs0 608 0 > Direct pages scanned 6910672 3132699 6357298 > Kswapd pages scanned 570369468248866556986286 > Kswapd pages reclaimed559934886347432955939113 > Direct pages reclaimed 6905990 2964843 6352115 These stats are confusing me. The earlier description suggests that this patch should cause less direct reclaim and more kswapd reclaim, but compared to "clear-v1r25" it does the opposite? Was clear-v1r25 overreclaiming then? 
(when considering direct + kswapd combined)

> Kswapd efficiency                    98%            76%             98%
> Kswapd velocity                12494.375      17597.507       12488.065
> Direct efficiency                    99%            94%             99%
> Direct velocity                 1513.835        668.306        1393.148
> Page writes by reclaim             0.000    4410243.000           0.000
> Page writes file                       0        4409635               0
> Page writes anon                       0            608               0
> Page reclaim immediate           1036792       14175203         1042571
>
> Swap-outs are equivalent to baseline
> Direct reclaim is reduced but not eliminated. It's worth noting
> that there are two periods of direct reclaim for this workload. The
> first is when it switches from preparing the files for the actual
> test itself. It's a lot of file IO followed by a lot of allocs
> that reclaims heavily for a brief window. After that, direct
> reclaim is intermittent when the workload spawns a number of
> threads periodically to do work. kswapd simply cannot wake and
> reclaim
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On 02/16/2017 09:21 AM, Hillf Danton wrote: > > On February 16, 2017 4:11 PM Mel Gorman wrote: >> On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: >> > On February 15, 2017 5:23 PM Mel Gorman wrote: >> > > */ >> > > static int kswapd(void *p) >> > > { >> > > -unsigned int alloc_order, reclaim_order, classzone_idx; >> > > +unsigned int alloc_order, reclaim_order; >> > > +unsigned int classzone_idx = MAX_NR_ZONES - 1; >> > > pg_data_t *pgdat = (pg_data_t*)p; >> > > struct task_struct *tsk = current; >> > > >> > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) >> > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; >> > > set_freezable(); >> > > >> > > -pgdat->kswapd_order = alloc_order = reclaim_order = 0; >> > > -pgdat->kswapd_classzone_idx = classzone_idx = 0; >> > > +pgdat->kswapd_order = 0; >> > > +pgdat->kswapd_classzone_idx = MAX_NR_ZONES; >> > > for ( ; ; ) { >> > > bool ret; >> > > >> > > +alloc_order = reclaim_order = pgdat->kswapd_order; >> > > +classzone_idx = kswapd_classzone_idx(pgdat, >> > > classzone_idx); >> > > + >> > > kswapd_try_sleep: >> > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, >> > > classzone_idx); >> > > >> > > /* Read the new order and classzone_idx */ >> > > alloc_order = reclaim_order = pgdat->kswapd_order; >> > > -classzone_idx = pgdat->kswapd_classzone_idx; >> > > +classzone_idx = kswapd_classzone_idx(pgdat, 0); >> > > pgdat->kswapd_order = 0; >> > > -pgdat->kswapd_classzone_idx = 0; >> > > +pgdat->kswapd_classzone_idx = MAX_NR_ZONES; >> > > >> > > ret = try_to_freeze(); >> > > if (kthread_should_stop()) >> > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) >> > > reclaim_order = balance_pgdat(pgdat, alloc_order, >> > > classzone_idx); >> > > if (reclaim_order < alloc_order) >> > > goto kswapd_try_sleep; >> > >> > If we fail order-5 request, can we then give up order-5, and >> > try order-3 if requested, after napping? >> > >> >> That has no bearing upon this patch. 
At this point, kswapd has stopped
>> reclaiming at the requested order and is preparing to sleep. If there is
>> a parallel request for order-3 while it's sleeping, it'll wake and start
>> reclaiming at order-3 as requested.
>>
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.

"Giving up on order-5" means it will set sc.order to 0, go to sleep (assuming order-0 watermarks are OK) and wake up kcompactd for order-5. There's no way kswapd could help an order-3 allocation at that point - it's up to kcompactd.

> thanks
> Hillf
>
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On Thu, Feb 16, 2017 at 04:21:04PM +0800, Hillf Danton wrote: > > On February 16, 2017 4:11 PM Mel Gorman wrote: > > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > > > On February 15, 2017 5:23 PM Mel Gorman wrote: > > > > */ > > > > static int kswapd(void *p) > > > > { > > > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > > > + unsigned int alloc_order, reclaim_order; > > > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > > > pg_data_t *pgdat = (pg_data_t*)p; > > > > struct task_struct *tsk = current; > > > > > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > > set_freezable(); > > > > > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > > > + pgdat->kswapd_order = 0; > > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > for ( ; ; ) { > > > > bool ret; > > > > > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > > > + classzone_idx = kswapd_classzone_idx(pgdat, > > > > classzone_idx); > > > > + > > > > kswapd_try_sleep: > > > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > > > classzone_idx); > > > > > > > > /* Read the new order and classzone_idx */ > > > > alloc_order = reclaim_order = pgdat->kswapd_order; > > > > - classzone_idx = pgdat->kswapd_classzone_idx; > > > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > > > pgdat->kswapd_order = 0; > > > > - pgdat->kswapd_classzone_idx = 0; > > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > > > > > ret = try_to_freeze(); > > > > if (kthread_should_stop()) > > > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > > > > reclaim_order = balance_pgdat(pgdat, alloc_order, > > > > classzone_idx); > > > > if (reclaim_order < alloc_order) > > > > goto kswapd_try_sleep; > > > > > > If we fail order-5 request, can we then give up order-5, and > > > try order-3 if requested, after napping? 
> > >
>
> > That has no bearing upon this patch. At this point, kswapd has stopped
> > reclaiming at the requested order and is preparing to sleep. If there is
> > a parallel request for order-3 while it's sleeping, it'll wake and start
> > reclaiming at order-3 as requested.
> >
> Right, but the order-3 request can also come up while kswapd is active and
> gives up order-5.
>

And then it'll be in pgdat->kswapd_order and be picked up on the next wakeup. It won't be immediate but it's also unlikely to be worth picking up immediately. The context here is that a high-order reclaim request failed and, rather than keeping kswapd awake reclaiming the world, it goes to sleep until another wakeup request comes in. Staying awake continually for high orders caused problems with excessive reclaim in the past. It could be revisited again but it's not related to what this patch is aimed at -- avoiding reclaim going to sleep because ZONE_DMA is balanced for a GFP_DMA request which is nowhere in the request stream.

-- 
Mel Gorman
SUSE Labs
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On February 16, 2017 4:11 PM Mel Gorman wrote: > On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > > On February 15, 2017 5:23 PM Mel Gorman wrote: > > > */ > > > static int kswapd(void *p) > > > { > > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > > + unsigned int alloc_order, reclaim_order; > > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > > pg_data_t *pgdat = (pg_data_t*)p; > > > struct task_struct *tsk = current; > > > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > > set_freezable(); > > > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > > + pgdat->kswapd_order = 0; > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > for ( ; ; ) { > > > bool ret; > > > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > > > + > > > kswapd_try_sleep: > > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > > classzone_idx); > > > > > > /* Read the new order and classzone_idx */ > > > alloc_order = reclaim_order = pgdat->kswapd_order; > > > - classzone_idx = pgdat->kswapd_classzone_idx; > > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > > pgdat->kswapd_order = 0; > > > - pgdat->kswapd_classzone_idx = 0; > > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > > > ret = try_to_freeze(); > > > if (kthread_should_stop()) > > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > > > reclaim_order = balance_pgdat(pgdat, alloc_order, > > > classzone_idx); > > > if (reclaim_order < alloc_order) > > > goto kswapd_try_sleep; > > > > If we fail order-5 request, can we then give up order-5, and > > try order-3 if requested, after napping? > > > > That has no bearing upon this patch. At this point, kswapd has stopped > reclaiming at the requested order and is preparing to sleep. 
If there is > a parallel request for order-3 while it's sleeping, it'll wake and start > reclaiming at order-3 as requested. > Right, but the order-3 request can also come up while kswapd is active and gives up order-5. thanks Hillf
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On Thu, Feb 16, 2017 at 02:23:08PM +0800, Hillf Danton wrote: > On February 15, 2017 5:23 PM Mel Gorman wrote: > > */ > > static int kswapd(void *p) > > { > > - unsigned int alloc_order, reclaim_order, classzone_idx; > > + unsigned int alloc_order, reclaim_order; > > + unsigned int classzone_idx = MAX_NR_ZONES - 1; > > pg_data_t *pgdat = (pg_data_t*)p; > > struct task_struct *tsk = current; > > > > @@ -3447,20 +3466,23 @@ static int kswapd(void *p) > > tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD; > > set_freezable(); > > > > - pgdat->kswapd_order = alloc_order = reclaim_order = 0; > > - pgdat->kswapd_classzone_idx = classzone_idx = 0; > > + pgdat->kswapd_order = 0; > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > for ( ; ; ) { > > bool ret; > > > > + alloc_order = reclaim_order = pgdat->kswapd_order; > > + classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx); > > + > > kswapd_try_sleep: > > kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order, > > classzone_idx); > > > > /* Read the new order and classzone_idx */ > > alloc_order = reclaim_order = pgdat->kswapd_order; > > - classzone_idx = pgdat->kswapd_classzone_idx; > > + classzone_idx = kswapd_classzone_idx(pgdat, 0); > > pgdat->kswapd_order = 0; > > - pgdat->kswapd_classzone_idx = 0; > > + pgdat->kswapd_classzone_idx = MAX_NR_ZONES; > > > > ret = try_to_freeze(); > > if (kthread_should_stop()) > > @@ -3486,9 +3508,6 @@ static int kswapd(void *p) > > reclaim_order = balance_pgdat(pgdat, alloc_order, > > classzone_idx); > > if (reclaim_order < alloc_order) > > goto kswapd_try_sleep; > > If we fail order-5 request, can we then give up order-5, and > try order-3 if requested, after napping? > That has no bearing upon this patch. At this point, kswapd has stopped reclaiming at the requested order and is preparing to sleep. If there is a parallel request for order-3 while it's sleeping, it'll wake and start reclaiming at order-3 as requested. -- Mel Gorman SUSE Labs
Re: [PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
On February 15, 2017 5:23 PM Mel Gorman wrote:
>  */
>  static int kswapd(void *p)
>  {
> -	unsigned int alloc_order, reclaim_order, classzone_idx;
> +	unsigned int alloc_order, reclaim_order;
> +	unsigned int classzone_idx = MAX_NR_ZONES - 1;
>  	pg_data_t *pgdat = (pg_data_t*)p;
>  	struct task_struct *tsk = current;
>
> @@ -3447,20 +3466,23 @@ static int kswapd(void *p)
>  	tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
>  	set_freezable();
>
> -	pgdat->kswapd_order = alloc_order = reclaim_order = 0;
> -	pgdat->kswapd_classzone_idx = classzone_idx = 0;
> +	pgdat->kswapd_order = 0;
> +	pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>  	for ( ; ; ) {
>  		bool ret;
>
> +		alloc_order = reclaim_order = pgdat->kswapd_order;
> +		classzone_idx = kswapd_classzone_idx(pgdat, classzone_idx);
> +
>  kswapd_try_sleep:
>  		kswapd_try_to_sleep(pgdat, alloc_order, reclaim_order,
>  					classzone_idx);
>
>  		/* Read the new order and classzone_idx */
>  		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
> +		classzone_idx = kswapd_classzone_idx(pgdat, 0);
>  		pgdat->kswapd_order = 0;
> -		pgdat->kswapd_classzone_idx = 0;
> +		pgdat->kswapd_classzone_idx = MAX_NR_ZONES;
>
>  		ret = try_to_freeze();
>  		if (kthread_should_stop())
> @@ -3486,9 +3508,6 @@ static int kswapd(void *p)
>  		reclaim_order = balance_pgdat(pgdat, alloc_order,
>  						classzone_idx);
>  		if (reclaim_order < alloc_order)
>  			goto kswapd_try_sleep;

If we fail order-5 request, can we then give up order-5, and try order-3 if requested, after napping?

> -
> -		alloc_order = reclaim_order = pgdat->kswapd_order;
> -		classzone_idx = pgdat->kswapd_classzone_idx;
>  	}
>
[PATCH 3/3] mm, vmscan: Prevent kswapd sleeping prematurely due to mismatched classzone_idx
kswapd is woken to reclaim a node based on a failed allocation request from any eligible zone. Once reclaiming in balance_pgdat(), it will continue reclaiming until there is an eligible zone available for the zone it was woken for. kswapd tracks the zone it was most recently woken for in pgdat->kswapd_classzone_idx. If it has not been woken recently, this zone will be 0.

However, the decision on whether to sleep is made on kswapd_classzone_idx, which is 0 without a recent wakeup request, and that classzone does not account for lowmem reserves. This allows kswapd to sleep when a small low zone such as ZONE_DMA is balanced for a GFP_DMA request even if a stream of allocations cannot use that zone. While kswapd may be woken again shortly, there are two consequences -- the pgdat bits that control congestion are cleared prematurely and direct reclaim is more likely as kswapd slept prematurely.

This patch flips kswapd_classzone_idx to default to MAX_NR_ZONES (an invalid index) when there have been no recent wakeups. If there are no wakeups, it'll decide whether to sleep based on the highest possible zone available (MAX_NR_ZONES - 1). It then becomes critical that the "pgdat balanced" decisions during reclaim and when deciding to sleep are the same. If there is a mismatch, kswapd can stay awake continually trying to balance tiny zones.

simoop was used to evaluate it again. Two of the preparation patches regressed the workload so they are included as the second set of results.
Otherwise this patch looks artificially excellent

                                    4.10.0-rc7             4.10.0-rc7             4.10.0-rc7
                                mmots-20170209            clear-v1r25        keepawake-v1r25
Amean    p50-Read      22325202.49 (  0.00%)  19491134.58 ( 12.69%)  22092755.48 (  1.04%)
Amean    p95-Read      26102988.80 (  0.00%)  24294195.20 (  6.93%)  26101849.04 (  0.00%)
Amean    p99-Read      30935176.53 (  0.00%)  30397053.16 (  1.74%)  29746220.52 (  3.84%)
Amean    p50-Write          976.44 (  0.00%)      1077.22 ( -10.32%)      952.73 (  2.43%)
Amean    p95-Write        15471.29 (  0.00%)     36419.56 (-135.40%)     3140.27 ( 79.70%)
Amean    p99-Write        35108.62 (  0.00%)    102000.36 (-190.53%)     8843.73 ( 74.81%)
Amean    p50-Allocation   76382.61 (  0.00%)     87485.22 ( -14.54%)    76349.22 (  0.04%)
Amean    p95-Allocation      12.39 (  0.00%)    204588.52 ( -60.11%)   108630.26 ( 14.98%)
Amean    p99-Allocation  187937.39 (  0.00%)    631657.74 (-236.10%)   139094.26 ( 25.99%)

With this patch on top, all the latencies relative to the baseline are improved, particularly write latencies. The read latencies are still high for the number of threads but it's worth noting that this is mostly due to the IO scheduler and not directly related to reclaim. The vmstats are a bit of a mix but the relevant ones are as follows;

                              4.10.0-rc7     4.10.0-rc7      4.10.0-rc7
                          mmots-20170209    clear-v1r25 keepawake-v1r25
Swap Ins                               0              0               0
Swap Outs                              0            608               0
Direct pages scanned             6910672        3132699         6357298
Kswapd pages scanned            57036946       82488665        56986286
Kswapd pages reclaimed          55993488       63474329        55939113
Direct pages reclaimed           6905990        2964843         6352115
Kswapd efficiency                    98%            76%             98%
Kswapd velocity                12494.375      17597.507       12488.065
Direct efficiency                    99%            94%             99%
Direct velocity                 1513.835        668.306        1393.148
Page writes by reclaim             0.000    4410243.000           0.000
Page writes file                       0        4409635               0
Page writes anon                       0            608               0
Page reclaim immediate           1036792       14175203         1042571

Swap-outs are equivalent to baseline.

Direct reclaim is reduced but not eliminated. It's worth noting that there are two periods of direct reclaim for this workload. The first is when it switches from preparing the files for the actual test itself.
It's a lot of file IO followed by a lot of allocs that reclaims heavily for a brief window. After that, direct reclaim is intermittent when the workload spawns a number of threads periodically to do work. kswapd simply cannot wake and reclaim fast enough between the low and min watermarks. It could be mitigated using vm.watermark_scale_factor but not through special tricks in kswapd.

Page writes from reclaim context are at 0 which is the ideal.

Pages immediately reclaimed after IO completes is back at the baseline.

On UMA, there is almost no change so this is not expected to be a universal win.

Signed-off-by: Mel Gorman

---
 mm/memory_hotplug.c |   2