[PATCH V2] sched,psi: fix the 'int' underflow for psi
psi_group_cpu->tasks, an array of unsigned int, stores the number of tasks that
could be stalled on a psi resource (io/mem/cpu). Decrementing these counters at
zero leads to wrapping, which in turn causes the respective pressure state to be
set in psi_group_cpu->state_mask. This can result in unnecessary time sampling
for that pressure state and thus spurious psi events, which may lead to wrong
actions being taken in userland based on those events. Though psi_bug is set
under these conditions, that is just for debug purposes. Fix it by decrementing
the ->tasks count only when it is non-zero.

Signed-off-by: Charan Teja Reddy
---
Changes in V1: https://lore.kernel.org/patchwork/patch/1413894/

 kernel/sched/psi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..651218d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -711,14 +711,15 @@ static void psi_group_change(struct psi_group *group, int cpu,
 	for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
 		if (!(m & (1 << t)))
 			continue;
-		if (groupc->tasks[t] == 0 && !psi_bug) {
+		if (groupc->tasks[t]) {
+			groupc->tasks[t]--;
+		} else if (!psi_bug) {
 			printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
 					cpu, t, groupc->tasks[0], groupc->tasks[1],
 					groupc->tasks[2], groupc->tasks[3], clear, set);
 			psi_bug = 1;
 		}
-		groupc->tasks[t]--;
 	}

 	for (t = 0; set; set &= ~(1 << t), t++)
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] sched,psi: fix the 'int' underflow for psi
psi_group_cpu->tasks, an array of unsigned int, stores the number of tasks that
could be stalled on a psi resource (io/mem/cpu). Decrementing these counters at
zero leads to wrapping, which in turn causes the respective pressure state to be
set in psi_group_cpu->state_mask. This can result in unnecessary time sampling
for that pressure state and thus spurious psi events, which may lead to wrong
actions being taken in userland based on those events. Though psi_bug is set
under these conditions, that is just for debug purposes. Fix it by decrementing
the ->tasks count only when it is non-zero.

Signed-off-by: Charan Teja Reddy
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..f925468 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -718,7 +718,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
 					groupc->tasks[3], clear, set);
 			psi_bug = 1;
 		}
-		groupc->tasks[t]--;
+		if (groupc->tasks[t])
+			groupc->tasks[t]--;
 	}

 	for (t = 0; set; set &= ~(1 << t), t++)
[PATCH RFC 0/1] mm: balancing the node zones occupancy
I would like to start a discussion about balancing the occupancy of the memory
zones in a node, whose imbalance may be caused by the migration of pages to
other zones during hotremove followed by hotadding the same memory. In this
case there is a lot of free memory in the newly hotadded memory, which can be
filled up by the pages previously migrated out (as part of offline/hotremove),
thus relieving some pressure in the other zones of the node.

Say a system has 2 zones (Normal and Movable), where the Normal zone is almost
filled up by pages from the Movable zone as part of an offline operation, and
then we hotadd memory as a Movable zone. At this moment the Movable zone is
almost empty while the Normal zone is almost full (of the pages migrated during
the previous offline/hotremove operation). Allocation requests from the Normal
zone may now have to go through reclaim cycles, causing some pressure. This
problem is quite common on systems that aggressively offline and online memory
blocks: the former migrates pages to the lower zones, and the latter does not
reverse that, so the offline operation ends up contributing to pressure in the
other zones of the system.

To overcome this problem, we can do the reverse of what the offline operation
did, after onlining the memory block, i.e. try to reverse-migrate the pages
from the other zones that were migrated out as part of the offline operation.
This may free up some of the pressure built up in the other zones by the
offline operation. Since we are reverse-migrating pages, we could call this a
"reverse migration feature", or, since we are actually balancing the occupancy
of the zones by migrating pages, "balancing the system zones occupancy", or any
other name.

We have proof-of-concept code tried on Snapdragon systems with the following
configuration: a single memory node of just 2 zones, a 6GB Normal zone and a
2GB Movable zone.
This Movable zone was hot-added once and thereafter offlined/onlined based on
need. We ran the below unit test to evaluate this:

1) Completely fill up the already hot-added Movable zone with anon pages.
2) Offline this hot-added Movable zone. At this point there is no pressure in
   the Normal zone yet, but this migration of pages can contribute to it in the
   future.
3) Now fill up the Normal zone such that 300MB or less is left in the system.
4) Online the Movable zone memory blocks again.
5) Run a test allocating 512MB of memory from the Normal zone, trying the
   higher order pages first and gradually falling back to lower orders. I took
   the help of the ion system heap memory allocation, which tries to allocate
   memory in the available orders: 9, 4 and 0.
6) Repeat the above steps for 10 iterations; below is the average of the
   results.

We collected the time it takes to complete the test and the distribution of
anon pages (the ones participating in the test) across the node's zones.

a) Without reverse migration, it took an average of around 145 msec to
   complete the test.
b) With reverse migration, it took an average of 120 msec to complete the test.

For the distribution of anon pages, we collected the anon pages left in the
individual zones before and after the test:

                     |         Base          |   Reverse Migration
                     | Beforetest  Aftertest | Beforetest  Aftertest
---------------------|-----------------------|------------------------
Normal zone (Anon)   |                       |
  Active             |   499825      45756   |   481441      203124
  Inactive           |   460084      46687   |   515535        8602
  Free               |    80350     224252   |    84877    **440586**
Movable zone (Anon)  |                       |
  Active             |     2224       2626   |     2239    **484622**
  Inactive           |        8          8   |        9        7663

The above table shows that in the base case (left columns) there exist a lot of
anon pages in the system which could be migrated back to the (almost totally
free) Movable zone, thus freeing up some space in the Normal zone.
With reverse migration (right columns), we see that the anon pages are evenly
distributed across the system and a lot of free memory is left in the Normal
zone, thanks to the reverse migration.

The code shows the PoC by assuming just 2 zones (Normal and Movable) of a
single node in the system. The number of pages to be reverse-migrated is
written to a sysfs interface from userspace, which monitors the memory
pressure events in the system.

Charan Teja Reddy (1):
  mm: proof-of-concept for balancing node zones occupancy

 include/linux/migrate.h | 8 +-
 include/linux/mm.h | 3 +
 include/linux/mmzon
[PATCH 1/1] mm: proof-of-concept for balancing node zones occupancy
This is proof-of-concept code for balancing the node zones occupancy, whose
imbalance may be caused by memory hotplug.

Signed-off-by: Charan Teja Reddy
---
 include/linux/migrate.h | 8 +-
 include/linux/mm.h | 3 +
 include/linux/mmzone.h | 2 +
 kernel/sysctl.c | 11 ++
 mm/compaction.c | 4 +-
 mm/memory_hotplug.c | 265
 6 files changed, 290 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4594838..b7dc259 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -53,6 +53,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
 		struct page *newpage, struct page *page, int extra_count);
+extern void split_map_pages(struct list_head *list);
+extern unsigned long release_freepages(struct list_head *freelist);
 #else

 static inline void putback_movable_pages(struct list_head *l) {}
@@ -81,7 +83,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 {
 	return -ENOSYS;
 }
-
+static inline void split_map_pages(struct list_head *list) { }
+static inline unsigned long release_freepages(struct list_head *freelist)
+{
+	return 0;
+}
 #endif /* CONFIG_MIGRATION */

 #ifdef CONFIG_COMPACTION
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..1014139 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2465,6 +2465,9 @@ extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 extern bool arch_has_descending_max_zone_pfns(void);

+/* memory_hotplug.c */
+extern int balance_node_occupancy_pages;
+
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316..ce417c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -977,6 +977,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
 int numa_zonelist_order_handler(struct ctl_table *, int,
 		void *, size_t *, loff_t *);
+extern int sysctl_balance_node_occupancy_handler(struct ctl_table *tbl,
+		int write, void *buf, size_t *len, loff_t *pos);
 extern int percpu_pagelist_fraction;
 extern char numa_zonelist_order[];
 #define NUMA_ZONELIST_ORDER_LEN	16
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd8..4b95a90 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3140,6 +3140,17 @@ static struct ctl_table vm_table[] = {
 		.extra2		= SYSCTL_ONE,
 	},
 #endif
+#ifdef CONFIG_MEMORY_HOTPLUG
+	{
+		.procname	= "balance_node_occupancy_pages",
+		.data		= &balance_node_occupancy_pages,
+		.maxlen		= sizeof(balance_node_occupancy_pages),
+		.mode		= 0200,
+		.proc_handler	= sysctl_balance_node_occupancy_handler,
+		.extra1		= SYSCTL_ZERO,
+	},
+
+#endif
 	{ }
 };
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccda..da3c015 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -68,7 +68,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
 #define COMPACTION_HPAGE_ORDER	(PMD_SHIFT - PAGE_SHIFT)
 #endif

-static unsigned long release_freepages(struct list_head *freelist)
+unsigned long release_freepages(struct list_head *freelist)
 {
 	struct page *page, *next;
 	unsigned long high_pfn = 0;
@@ -84,7 +84,7 @@ static unsigned long release_freepages(struct list_head *freelist)
 	return high_pfn;
 }

-static void split_map_pages(struct list_head *list)
+void split_map_pages(struct list_head *list)
 {
 	unsigned int i, order, nr_pages;
 	struct page *page, *next;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9..2780c91 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -97,6 +97,271 @@ void mem_hotplug_done(void)

 u64 max_mem_size = U64_MAX;

+int balance_node_occupancy_pages;
+static atomic_t target_migrate_pages = ATOMIC_INIT(0);
+
+struct movable_zone_fill_control {
+	struct list_head freepages;
+	unsigned long start_pfn;
+	unsigned long end_pfn;
+	unsigned long nr_migrate_pages;
+	unsigned long nr_free_pages;
+	unsigned long limit;
+	int target;
+	struct zone *zone;
+};
+
+static void fill_movable_zone_fn(struct work_struct *work);
+static DECLARE_WORK(fill_movable_zone_work, fill_movable_zone_fn);
+static DEFINE_MUTEX(page_migrate_lock);
+
+static inline void reset_page_order(struct page *page)
+{
+	__ClearPageBuddy(page);
+	set_page_private(page, 0);
+}
+
+static int isolate_free_p
[PATCH V2] mm: compaction: update the COMPACT[STALL|FAIL] events properly
By definition, the COMPACT[STALL|FAIL] events need to be counted when there is
'at least in one zone compaction wasn't deferred or skipped from the direct
compaction'. But when compaction is skipped or deferred, COMPACT_SKIPPED is
returned and the code still goes on to update these compaction events, which
is wrong in the sense that COMPACT[STALL|FAIL] is counted without compaction
even being tried. Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoids an
unnecessary trip into get_page_from_freelist() when compaction was not even
tried. There is a corner case where compaction is skipped but the COMPACTSTALL
event should still be counted: an IRQ came in and freed the page, and the same
is captured in capture_control.

Signed-off-by: Charan Teja Reddy
---
Changes in V1: https://lore.kernel.org/patchwork/patch/1373665/

 mm/compaction.c | 8 ++++++++
 mm/page_alloc.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccda..104ebef 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2487,6 +2487,14 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 	 */
 	WRITE_ONCE(current->capture_control, NULL);
 	*capture = READ_ONCE(capc.page);
+	/*
+	 * Technically, it is also possible that compaction is skipped but
+	 * the page is still captured out of luck (IRQ came and freed the
+	 * page). Returning COMPACT_SUCCESS in such cases helps in properly
+	 * accounting the COMPACT[STALL|FAIL] when compaction is skipped.
+	 */
+	if (*capture)
+		ret = COMPACT_SUCCESS;

 	return ret;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d..531f244 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	memalloc_noreclaim_restore(noreclaim_flag);
 	psi_memstall_leave();

+	if (*compact_result == COMPACT_SKIPPED)
+		return NULL;
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
 	 * count a compaction stall
[PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly
By definition, the COMPACT[STALL|FAIL] events need to be counted when there is
'at least in one zone compaction wasn't deferred or skipped from the direct
compaction'. But when compaction is skipped or deferred, COMPACT_SKIPPED is
returned and the code still goes on to update these compaction events, which
is wrong in the sense that COMPACT[STALL|FAIL] is counted without compaction
even being tried. Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoids an
unnecessary trip into get_page_from_freelist() when compaction was not even
tried.

Signed-off-by: Charan Teja Reddy
---
 mm/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d..531f244 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	memalloc_noreclaim_restore(noreclaim_flag);
 	psi_memstall_leave();

+	if (*compact_result == COMPACT_SKIPPED)
+		return NULL;
 	/*
 	 * At least in one zone compaction wasn't deferred or skipped, so let's
 	 * count a compaction stall
[PATCH V3] mm/compaction: correct deferral logic for proactive compaction
should_proactive_compact_node() returns true when the sum of the weighted
fragmentation scores of all the zones in the node is greater than the
wmark_high of compaction, which then triggers the proactive compaction that
operates on the individual zones of the node. But proactive compaction runs on
a zone only when its weighted fragmentation score is greater than wmark_low
(= wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the zones
can exceed wmark_high while the individual weighted zone scores are still less
than wmark_low, resulting in an unnecessary trigger of proactive compaction
only for it to return without doing anything.

The issue with proactive compaction returning without even trying is its
deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores
across proactive compaction runs are the same, on the assumption that
compaction made no progress, when in reality it didn't even try. With a delay
of 500 msec between successive retries of proactive compaction, this can
result in a deferral of ~30 sec without proactive compaction even being tried.

Test scenario: compaction_proactiveness = 50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) with sizes 5GB
and 6GB respectively. After opening some apps on Android, the weighted
fragmentation scores of these zones are 47 and 49 respectively. Since the sum
of these fragmentation scores is above wmark_high, proactive compaction is
triggered, and since the individual weighted zone scores are below wmark_low,
it returns without trying the compaction. As a result, the weighted
fragmentation scores of the zones remain 47 and 49, which makes the existing
logic defer the compaction, thinking that no progress was made.
Fix this by checking just the zone fragmentation score, not the weighted one,
in __compact_finished(), and use the zone's weighted fragmentation score in
fragmentation_score_node(). In the test case above, if the weighted average is
above wmark_high, then the individual (unweighted) score of at least one zone
has to be above wmark_high. Thus it avoids the unnecessary triggers and
deferrals of proactive compaction.

Fix-suggested-by: Vlastimil Babka
Signed-off-by: Charan Teja Reddy
---
Changes in V3: Addressed suggestions from Vlastimil
Changes in V2: https://lore.kernel.org/patchwork/patch/1366862/
Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/

 mm/compaction.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..ccddb3a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1925,20 +1925,28 @@ static bool kswapd_is_running(pg_data_t *pgdat)

 /*
  * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
- * in the range [0, 100].
+ * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ */
+static unsigned int fragmentation_score_zone(struct zone *zone)
+{
+	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+}
+
+/*
+ * A weighted zone's fragmentation score is the external fragmentation
+ * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * returns a value in the range [0, 100].
  *
  * The scaling factor ensures that proactive compaction focuses on larger
  * zones like ZONE_NORMAL, rather than smaller, specialized zones like
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static unsigned int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
 {
 	unsigned long score;

-	score = zone->present_pages *
-			extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+	score = zone->present_pages * fragmentation_score_zone(zone);
 	return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
 }

@@ -1958,7 +1966,7 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 		struct zone *zone;

 		zone = &pgdat->node_zones[zoneid];
-		score += fragmentation_score_zone(zone);
+		score += fragmentation_score_zone_weighted(zone);
 	}

 	return score;
[PATCH V2] mm/compaction: correct deferral logic for proactive compaction
should_proactive_compact_node() returns true when the sum of the weighted
fragmentation scores of all the zones in the node is greater than the
wmark_high of compaction, which then triggers the proactive compaction that
operates on the individual zones of the node. But proactive compaction runs on
a zone only when its weighted fragmentation score is greater than wmark_low
(= wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the zones
can exceed wmark_high while the individual weighted zone scores are still less
than wmark_low, resulting in an unnecessary trigger of proactive compaction
only for it to return without doing anything.

The issue with proactive compaction returning without even trying is its
deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores
across proactive compaction runs are the same, on the assumption that
compaction made no progress, when in reality it didn't even try. With a delay
of 500 msec between successive retries of proactive compaction, this can
result in a deferral of ~30 sec without proactive compaction even being tried.

Test scenario: compaction_proactiveness = 50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) with sizes 5GB
and 6GB respectively. After opening some apps on Android, the weighted
fragmentation scores of these zones are 47 and 49 respectively. Since the sum
of these fragmentation scores is above wmark_high, proactive compaction is
triggered, and since the individual weighted zone scores are below wmark_low,
it returns without trying the compaction. As a result, the weighted
fragmentation scores of the zones remain 47 and 49, which makes the existing
logic defer the compaction, thinking that no progress was made.
Fix this by checking just the zone fragmentation score, not the weighted one,
in __compact_finished(), and use the zone's weighted fragmentation score in
fragmentation_score_node(). In the test case above, if the weighted average is
above wmark_high, then the individual (unweighted) score of at least one zone
has to be above wmark_high. Thus it avoids the unnecessary triggers and
deferrals of proactive compaction.

Fix-suggested-by: Vlastimil Babka
Signed-off-by: Charan Teja Reddy
---
Changes in V2: Addressed comments from Vlastimil
Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/

 mm/compaction.c | 19 +++++++++++++++----
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..1b98427 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1924,16 +1924,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
 }

 /*
- * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
- * in the range [0, 100].
+ * A weighted zone's fragmentation score is the external fragmentation
+ * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * returns a value in the range [0, 100].
  *
  * The scaling factor ensures that proactive compaction focuses on larger
  * zones like ZONE_NORMAL, rather than smaller, specialized zones like
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static unsigned int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
 {
 	unsigned long score;

@@ -1943,6 +1943,15 @@ static unsigned int fragmentation_score_zone(struct zone *zone)
 }

 /*
+ * A zone's fragmentation score is the external fragmentation wrt to the
+ * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ */
+static unsigned int fragmentation_score_zone(struct zone *zone)
+{
+	return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+}
+
+/*
  * The per-node proactive (background) compaction process is started by its
  * corresponding kcompactd thread when the node's fragmentation score
  * exceeds the high threshold. The compaction process remains active till
@@ -1958,7 +1967,7 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 		struct zone *zone;

 		zone = &pgdat->node_zones[zoneid];
-		score += fragmentation_score_zone(zone);
+		score += fragmentation_score_zone_weighted(zone);
 	}

 	return score;
[PATCH] mm/compaction: return proper state in should_proactive_compact_node
should_proactive_compact_node() returns true when the sum of the fragmentation
scores of all the zones in the node is greater than the wmark_high of
compaction, which then triggers the proactive compaction that operates on the
individual zones of the node. But proactive compaction runs on a zone only
when the fragmentation score of that zone is greater than wmark_low
(= wmark_high - 10).

This means that the sum of the fragmentation scores of all the zones can
exceed wmark_high while the individual zone scores are still less than
wmark_low, resulting in an unnecessary trigger of proactive compaction only
for it to return without doing anything.

Another issue with proactive compaction returning without even trying is its
deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if the scores
across proactive compaction runs are the same, on the assumption that
compaction made no progress, when in reality it didn't even try. With a delay
of 500 msec between successive retries of proactive compaction, this can
result in a deferral of ~30 sec without proactive compaction even being tried.

Test scenario: compaction_proactiveness = 50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) with sizes 5GB
and 6GB respectively. After opening some apps on Android, the fragmentation
scores of these zones are 47 and 49 respectively. Since the sum of these
fragmentation scores is above wmark_high, proactive compaction is triggered,
and since the individual zone scores are below wmark_low, it returns without
trying the compaction. As a result, the fragmentation scores of the zones
remain 47 and 49, which makes the existing logic defer the compaction,
thinking that no progress was made.

So, run proactive compaction on the node's zones only when at least one zone's
fragmentation score is greater than wmark_low. This avoids the unnecessary
deferrals and retries of the compaction.
Signed-off-by: Charan Teja Reddy
---
 mm/compaction.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..f7a772a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1964,6 +1964,26 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
 	return score;
 }

+/*
+ * Returns the maximum of the fragmentation scores of the zones in a node.
+ * This is used in taking the decision of whether to trigger the proactive
+ * compaction on the zones of this node.
+ */
+static unsigned int fragmentation_score_node_zones_max(pg_data_t *pgdat)
+{
+	int zoneid;
+	unsigned int max = 0;
+
+	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+		struct zone *zone;
+
+		zone = &pgdat->node_zones[zoneid];
+		max = max_t(unsigned int, fragmentation_score_zone(zone), max);
+	}
+
+	return max;
+}
+
 static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 {
 	unsigned int wmark_low;
@@ -1979,13 +1999,16 @@ static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)

 static bool should_proactive_compact_node(pg_data_t *pgdat)
 {
-	int wmark_high;
+	int wmark_low, wmark_high;

 	if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
 		return false;

 	wmark_high = fragmentation_score_wmark(pgdat, false);
-	return fragmentation_score_node(pgdat) > wmark_high;
+	wmark_low = fragmentation_score_wmark(pgdat, true);
+
+	return fragmentation_score_node(pgdat) > wmark_high &&
+		fragmentation_score_node_zones_max(pgdat) > wmark_low;
 }

 static enum compact_result __compact_finished(struct compact_control *cc)
[PATCH V2] dmabuf: fix use-after-free of dmabuf's file->f_inode
A use-after-free on the dmabuf's file->f_inode is observed in the race between
closing the dmabuf file and reading the dmabuf's debug info. Consider the
below scenario, where P1 is closing the dma_buf file and P2 is reading the
dma_buf's debug info in the system:

P1                                      P2
                                        dma_buf_debug_show()
dma_buf_put()
  __fput()
    file->f_op->release()
    dput()
      dentry_unlink_inode()
        iput(dentry->d_inode)
        (where the inode is freed)
                                        mutex_lock(&db_list.lock)
                                        read 'dma_buf->file->f_inode'
                                        (the same inode is freed by P1)
                                        mutex_unlock(&db_list.lock)
      dentry->d_op->d_release() -->
        dma_buf_release()
          .....
          mutex_lock(&db_list.lock)
          removes the dmabuf from the list
          mutex_unlock(&db_list.lock)

In the above scenario, when dma_buf_put() is called on a dma_buf, it first
frees the dma_buf's file->f_inode (= dentry->d_inode) and only then removes
this dma_buf from the system db_list. In between, P2, traversing the db_list,
tries to access this dma_buf's file->f_inode, which was freed by P1: a
use-after-free. Since __fput() calls f_op->release first and d_op->d_release
later, move the dma_buf's db_list removal from d_release() to f_op->release().
This ensures that the dma_buf's file->f_inode is not accessed after it is
released.
Cc: <stable@vger.kernel.org> # 5.4+
Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Acked-by: Christian König
Signed-off-by: Charan Teja Reddy
---
V2: Resending with stable tags and Acks
V1: https://lore.kernel.org/patchwork/patch/1360118/

 drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)

 	dmabuf->ops->release(dmabuf);

-	mutex_lock(&db_list.lock);
-	list_del(&dmabuf->list_node);
-	mutex_unlock(&db_list.lock);
-
 	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
 		dma_resv_fini(dmabuf->resv);

@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
 	kfree(dmabuf);
 }

+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+	struct dma_buf *dmabuf;
+
+	if (!is_dma_buf_file(file))
+		return -EINVAL;
+
+	dmabuf = file->private_data;
+
+	mutex_lock(&db_list.lock);
+	list_del(&dmabuf->list_node);
+	mutex_unlock(&db_list.lock);
+
+	return 0;
+}
+
 static const struct dentry_operations dma_buf_dentry_ops = {
 	.d_dname = dmabuffs_dname,
 	.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 }

 static const struct file_operations dma_buf_fops = {
+	.release	= dma_buf_file_release,
 	.mmap		= dma_buf_mmap_internal,
 	.llseek		= dma_buf_llseek,
 	.poll		= dma_buf_poll,
[PATCH] dmabuf: fix use-after-free of dmabuf's file->f_inode
A use-after-free on the dmabuf's file->f_inode is observed in the race between
closing the dmabuf file and reading the dmabuf's debug info. Consider the
below scenario, where P1 is closing the dma_buf file and P2 is reading the
dma_buf's debug info in the system:

P1                                      P2
                                        dma_buf_debug_show()
dma_buf_put()
  __fput()
    file->f_op->release()
    dput()
      dentry_unlink_inode()
        iput(dentry->d_inode)
        (where the inode is freed)
                                        mutex_lock(&db_list.lock)
                                        read 'dma_buf->file->f_inode'
                                        (the same inode is freed by P1)
                                        mutex_unlock(&db_list.lock)
      dentry->d_op->d_release() -->
        dma_buf_release()
          .....
          mutex_lock(&db_list.lock)
          removes the dmabuf from the list
          mutex_unlock(&db_list.lock)

In the above scenario, when dma_buf_put() is called on a dma_buf, it first
frees the dma_buf's file->f_inode (= dentry->d_inode) and only then removes
this dma_buf from the system db_list. In between, P2, traversing the db_list,
tries to access this dma_buf's file->f_inode, which was freed by P1: a
use-after-free. Since __fput() calls f_op->release first and d_op->d_release
later, move the dma_buf's db_list removal from d_release() to f_op->release().
This ensures that the dma_buf's file->f_inode is not accessed after it is
released.
Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Signed-off-by: Charan Teja Reddy
---
 drivers/dma-buf/dma-buf.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)
 
 	dmabuf->ops->release(dmabuf);
 
-	mutex_lock(&db_list.lock);
-	list_del(&dmabuf->list_node);
-	mutex_unlock(&db_list.lock);
-
 	if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
 		dma_resv_fini(dmabuf->resv);
 
@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
 	kfree(dmabuf);
 }
 
+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+	struct dma_buf *dmabuf;
+
+	if (!is_dma_buf_file(file))
+		return -EINVAL;
+
+	dmabuf = file->private_data;
+
+	mutex_lock(&db_list.lock);
+	list_del(&dmabuf->list_node);
+	mutex_unlock(&db_list.lock);
+
+	return 0;
+}
+
 static const struct dentry_operations dma_buf_dentry_ops = {
 	.d_dname = dmabuffs_dname,
 	.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 }
 
 static const struct file_operations dma_buf_fops = {
+	.release	= dma_buf_file_release,
 	.mmap		= dma_buf_mmap_internal,
 	.llseek		= dma_buf_llseek,
 	.poll		= dma_buf_poll,
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] mm: cma: improve pr_debug log in cma_release()
It is useful to print the 'count' of pages, along with the page pointer, passed to cma_release(), to debug cases of a mismatched count value passed between cma_alloc() and cma_release() from a code path.

As an example, consider the below scenario:

1) The CMA pool size is 4MB, and

2) A user erroneously allocates 2 pages but frees only 1 page in a loop from this CMA pool.

Step 2 eventually causes cma_alloc() to return NULL because of the -ENOMEM condition, and the current pr_debug log gives no information about this kind of allocation pattern, because the count value is not printed in cma_release(). The count value is already printed in the trace logs; just extend the same to the pr_debug log too.

Signed-off-by: Charan Teja Reddy
---
 mm/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..07c904b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -512,7 +512,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
 	if (!cma || !pages)
 		return false;
 
-	pr_debug("%s(page %p)\n", __func__, (void *)pages);
+	pr_debug("%s(page %p, count %zu)\n", __func__, (void *)pages, count);
 
 	pfn = page_to_pfn(pages);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
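The 2-alloc/1-free pattern described above can be modeled with a toy counter outside the kernel (a standalone sketch; `toy_cma_alloc()` and friends are invented for illustration and are not kernel APIs):

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of a CMA pool: only the free-page count is tracked. */
struct toy_cma_pool {
	size_t free_pages;
};

/* Returns the number of pages reserved, 0 for the -ENOMEM case. */
static size_t toy_cma_alloc(struct toy_cma_pool *p, size_t count)
{
	if (p->free_pages < count)
		return 0;		/* cma_alloc() returning NULL */
	p->free_pages -= count;
	return count;
}

static void toy_cma_release(struct toy_cma_pool *p, size_t count)
{
	p->free_pages += count;
}

/*
 * Allocate 2 pages / free 1 page per iteration: returns how many
 * iterations run before the pool is exhausted.
 */
static int iterations_until_enomem(size_t pool_pages)
{
	struct toy_cma_pool p = { .free_pages = pool_pages };
	int iters = 0;

	while (toy_cma_alloc(&p, 2) == 2) {
		toy_cma_release(&p, 1);
		iters++;
	}
	return iters;
}
```

With a 4MB pool of 4KB pages (1024 pages), the net loss of one page per iteration exhausts the pool after 1023 iterations, at which point the toy allocator fails just as cma_alloc() returns NULL; a count-aware pr_debug trail would show 2-page allocs paired with 1-page releases.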
[PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM
When pages fail to be isolated or migrated, the page owner information along with the page info is dumped. If there are continuous failures in migration (say the page is pinned) or isolation, the log buffer simply gets flooded with the page owner information. As the page info alone is sufficient most of the time to know the cause of the migration or isolation failure, place the page owner information under DEBUG_VM.

Signed-off-by: Charan Teja Reddy
---
 mm/memory_hotplug.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63b2e46..f48f30d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1326,7 +1326,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 
 		} else {
 			pr_warn("failed to isolate pfn %lx\n", pfn);
-			dump_page(page, "isolation failed");
+			__dump_page(page, "isolation failed");
+#if defined(CONFIG_DEBUG_VM)
+			dump_page_owner(page);
+#endif
 		}
 		put_page(page);
 	}
@@ -1357,7 +1360,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 			list_for_each_entry(page, &source, lru) {
 				pr_warn("migrating pfn %lx failed ret:%d ",
 				       page_to_pfn(page), ret);
-				dump_page(page, "migration failure");
+				__dump_page(page, "migration failure");
+#if defined(CONFIG_DEBUG_VM)
+				dump_page_owner(page);
+#endif
 			}
 			putback_movable_pages(&source);
 		}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] iommu: of: skip iommu_device_list traversal in of_iommu_xlate()
In of_iommu_xlate(), check whether the iommu device is enabled before traversing the iommu_device_list through iommu_ops_from_fwnode(). There is no point in traversing the iommu_device_list only to return NO_IOMMU because the iommu device node is disabled.

Signed-off-by: Charan Teja Reddy
---
 drivers/iommu/of_iommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index e505b91..225598c 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -94,9 +94,10 @@ static int of_iommu_xlate(struct device *dev,
 	struct fwnode_handle *fwnode = &iommu_spec->np->fwnode;
 	int ret;
 
+	if (!of_device_is_available(iommu_spec->np))
+		return NO_IOMMU;
 	ops = iommu_ops_from_fwnode(fwnode);
-	if ((ops && !ops->of_xlate) ||
-	    !of_device_is_available(iommu_spec->np))
+	if (ops && !ops->of_xlate)
 		return NO_IOMMU;
 
 	ret = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] dmabuf: fix NULL pointer dereference in dma_buf_release()
A NULL pointer dereference is observed when exporting the dmabuf fails to allocate the 'struct file': the allocated dentry corresponding to this file in the dmabuf fs is dropped, which ends up in dma_buf_release() accessing the uninitialized dentry->d_fsdata. Call stack on 5.4 is below:

 dma_buf_release+0x2c/0x254 drivers/dma-buf/dma-buf.c:88
 __dentry_kill+0x294/0x31c fs/dcache.c:584
 dentry_kill fs/dcache.c:673 [inline]
 dput+0x250/0x380 fs/dcache.c:859
 path_put+0x24/0x40 fs/namei.c:485
 alloc_file_pseudo+0x1a4/0x200 fs/file_table.c:235
 dma_buf_getfile drivers/dma-buf/dma-buf.c:473 [inline]
 dma_buf_export+0x25c/0x3ec drivers/dma-buf/dma-buf.c:585

Fix this by checking for a valid pointer in dentry->d_fsdata.

Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Cc: [5.7+]
Signed-off-by: Charan Teja Reddy
---
 drivers/dma-buf/dma-buf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 58564d82..844967f 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -59,6 +59,8 @@ static void dma_buf_release(struct dentry *dentry)
 	struct dma_buf *dmabuf;
 
 	dmabuf = dentry->d_fsdata;
+	if (unlikely(!dmabuf))
+		return;
 
 	BUG_ON(dmabuf->vmapping_counter);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()
The following race is observed with repeated online, offline and a delay between two successive onlines of memory blocks of the movable zone.

P1					P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to the
defaults, i.e., pcp->high = 0 and
pcp->batch = 1.

					Allocate pages from the
					movable zone.

Try to online the second memory
block in the movable zone; thus
it entered online_pages() but is
yet to call zone_pcp_update().

					This process enters the exit
					path and tries to release the
					order-0 pages to the pcp lists
					through free_unref_page_commit().
					As pcp->high = 0 and
					pcp->count = 1, it proceeds to
					call free_pcppages_bulk().

Update the pcp values; the new
values are, say, pcp->high = 378
and pcp->batch = 63.

					Read the pcp's batch value using
					READ_ONCE() and pass it to
					free_pcppages_bulk(); the values
					passed here are batch = 63,
					count = 1.

					Since the number of pages in the
					pcp lists is less than ->batch,
					it gets stuck in the
					while (list_empty(list)) loop
					with interrupts disabled: a core
					hang.

Avoid this by ensuring free_pcppages_bulk() is called with the proper count of pcp list pages.

The mentioned race is somewhat easily reproducible without [1], because the pcp's are not updated for the first memory block onlined, so there is a large enough race window for P2 between the alloc+free and the pcp struct values update through onlining of the second memory block. With [1], the race still exists but is much narrower, as the pcp struct values are updated for the first memory block online itself.
[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy
---
v1: https://patchwork.kernel.org/patch/11707637/

 mm/page_alloc.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..839039f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
 	struct page *page, *tmp;
 	LIST_HEAD(head);
 
+	/*
+	 * Ensure proper count is passed which otherwise would stuck in the
+	 * below while (list_empty(list)) loop.
+	 */
+	count = min(pcp->count, count);
 	while (count) {
 		struct list_head *list;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
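The arithmetic of the fix can be shown outside the kernel (hypothetical standalone helpers; `toy_drain()` and `bulk_free_count()` mirror only the count handling, not the real freelist walk):

```c
#include <assert.h>

/*
 * Toy model of the free_pcppages_bulk() drain loop (not kernel code):
 * 'pcp_count' is how many pages actually sit on the pcp lists and
 * 'count' is how many the caller asked to free. Returns the number of
 * pages freed, or -1 when the loop would spin forever on empty lists,
 * which is the reported hang.
 */
static int toy_drain(int pcp_count, int count)
{
	int freed = 0;

	while (count) {
		if (pcp_count == 0)
			return -1;	/* while (list_empty(list)) spins here */
		pcp_count--;
		count--;
		freed++;
	}
	return freed;
}

/* The fix: never ask for more pages than the lists actually hold. */
static int bulk_free_count(int pcp_count, int requested)
{
	return requested < pcp_count ? requested : pcp_count;
}
```

With the racy values from the commit message (count = 1 on the list, batch = 63 requested), the unclamped drain never terminates, while the clamped request frees the single page and returns.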
[PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()
The following race is observed with repeated online, offline and a delay between two successive onlines of memory blocks of the movable zone.

P1					P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to the
defaults, i.e., pcp->high = 0 and
pcp->batch = 1.

					Allocate pages from the
					movable zone.

Try to online the second memory
block in the movable zone; thus
it entered online_pages() but is
yet to call zone_pcp_update().

					This process enters the exit
					path and tries to release the
					order-0 pages to the pcp lists
					through free_unref_page_commit().
					As pcp->high = 0 and
					pcp->count = 1, it proceeds to
					call free_pcppages_bulk().

Update the pcp values; the new
values are, say, pcp->high = 378
and pcp->batch = 63.

					Read the pcp's batch value using
					READ_ONCE() and pass it to
					free_pcppages_bulk(); the values
					passed here are batch = 63,
					count = 1.

					Since the number of pages in the
					pcp lists is less than ->batch,
					it gets stuck in the
					while (list_empty(list)) loop
					with interrupts disabled: a core
					hang.

Avoid this by ensuring free_pcppages_bulk() is called with the proper count of pcp list pages.

The mentioned race is somewhat easily reproducible without [1], because the pcp's are not updated for the first memory block onlined, so there is a large enough race window for P2 between the alloc+free and the pcp struct values update through onlining of the second memory block. With [1], the race still exists but is much narrower, as the pcp struct values are updated for the first memory block online itself.
[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy
---
 mm/page_alloc.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..25e7e12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	struct zone *zone = page_zone(page);
 	struct per_cpu_pages *pcp;
 	int migratetype;
+	int high;
 
 	migratetype = get_pcppage_migratetype(page);
 	__count_vm_event(PGFREE);
@@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
 	pcp = &this_cpu_ptr(zone->pageset)->pcp;
 	list_add(&page->lru, &pcp->lists[migratetype]);
 	pcp->count++;
-	if (pcp->count >= pcp->high) {
-		unsigned long batch = READ_ONCE(pcp->batch);
+	high = READ_ONCE(pcp->high);
+	if (pcp->count >= high) {
+		int batch;
+
+		batch = READ_ONCE(pcp->batch);
+		/*
+		 * For non-default pcp struct values, high is always
+		 * greater than the batch. If high < batch then pass
+		 * proper count to free the pcp's list pages.
+		 */
+		if (unlikely(high < batch))
+			batch = min(pcp->count, batch);
+
 		free_pcppages_bulk(zone, batch, pcp);
 	}
 }
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH] mm, memory_hotplug: update pcp lists every time a memory block is onlined
When onlining the first memory block in a zone, the pcp lists are not updated, so the pcp struct keeps the default setting of ->high = 0, ->batch = 1. This means that until the second memory block in the zone (if there is one) is onlined, the pcp lists of this zone will not contain any pages, because pcp's ->count is always greater than ->high, and thus free_pcppages_bulk() is called to free batch-size (= 1) pages every time the system wants to add a page to the pcp list through free_unref_page(). In a word, the system gets no benefit from the pcp lists when there is a single onlineable memory block in a zone. Correct this by always updating the pcp lists when a memory block is onlined.

Signed-off-by: Charan Teja Reddy
---
 mm/memory_hotplug.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index dcdf327..7f62d69 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -854,8 +854,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
 	node_states_set_node(nid, &arg);
 	if (need_zonelists_rebuild)
 		build_all_zonelists(NULL);
-	else
-		zone_pcp_update(zone);
+	zone_pcp_update(zone);
 
 	init_per_zone_wmark_min();
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
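The effect of the default ->high = 0, ->batch = 1 setting can be sketched with a toy model of the free path (a standalone illustration; `struct toy_pcp` and its helpers are invented here and only mirror the high/batch/count bookkeeping):

```c
#include <assert.h>

/* Toy per-cpu pagelist: only the high/batch/count fields matter here. */
struct toy_pcp {
	int high;
	int batch;
	int count;
};

/*
 * Mirrors the shape of free_unref_page_commit(): cache one page on the
 * list, then drain 'batch' pages once count reaches high.
 */
static void toy_cache_one_page(struct toy_pcp *pcp)
{
	pcp->count++;
	if (pcp->count >= pcp->high) {
		int drain = pcp->batch < pcp->count ? pcp->batch : pcp->count;

		pcp->count -= drain;	/* stands in for free_pcppages_bulk() */
	}
}

/* Pages still cached after n order-0 frees with the given settings. */
static int cached_after_n_frees(int high, int batch, int n)
{
	struct toy_pcp pcp = { .high = high, .batch = batch, .count = 0 };
	int i;

	for (i = 0; i < n; i++)
		toy_cache_one_page(&pcp);
	return pcp.count;
}
```

With the default high = 0, batch = 1, every free immediately drains and the list caches nothing, whereas with updated values (say high = 378, batch = 63) the frees accumulate on the list as intended.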
[PATCH] mm, page_alloc: skip ->watermark_boost for atomic order-0 allocations
When boosting is enabled, the rate of atomic order-0 allocation failures is observed to be high, because the free levels in the system are checked with the ->watermark_boost offset. This is not a problem for sleepable allocations, but for atomic allocations it looks like a regression.

This problem is seen frequently on a system setup of an Android kernel running on Snapdragon hardware with 4GB RAM. When no extfrag event has occurred in the system, the ->watermark_boost factor is zero, and the watermark configuration is:

   _watermark = (
      [WMARK_MIN] = 1272,   --> ~5MB
      [WMARK_LOW] = 9067,   --> ~36MB
      [WMARK_HIGH] = 9385   --> ~38MB
   ),
   watermark_boost = 0

After launching some memory-hungry applications in Android that cause extfrag events, ->watermark_boost can reach its maximum, i.e. the default boost factor makes it 150% of the high watermark:

   _watermark = (
      [WMARK_MIN] = 1272,   --> ~5MB
      [WMARK_LOW] = 9067,   --> ~36MB
      [WMARK_HIGH] = 9385   --> ~38MB
   ),
   watermark_boost = 14077  --> ~57MB

With the default system configuration, ~2MB of free memory suffices for an atomic order-0 allocation to succeed. But boosting raises the min watermark to ~61MB, so for an atomic order-0 allocation to succeed the system should have a minimum of ~23MB of free memory (from the calculations in zone_watermark_ok(), min = 3/4(min/2)). Yet failures are observed even though the system has ~20MB of free memory. In testing, this is reproducible as early as the first 300 secs since boot, and with further low-RAM configurations (<2GB) it is observed as early as the first 150 secs since boot.

These failures can be avoided by excluding ->watermark_boost from the watermark calculations for atomic order-0 allocations.
Signed-off-by: Charan Teja Reddy
---
 mm/page_alloc.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d001d61..5193d7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3709,6 +3709,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
 		}
 
 		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+		/*
+		 * Allow GFP_ATOMIC order-0 allocations to exclude the
+		 * zone->watermark_boost in its watermark calculations.
+		 * We rely on the ALLOC_ flags set for GFP_ATOMIC
+		 * requests in gfp_to_alloc_flags() for this. Reason not to
+		 * use the GFP_ATOMIC directly is that we want to fall back
+		 * to slow path thus wake up kswapd.
+		 */
+		if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
+			     (alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
+			mark = zone->_watermark[WMARK_MIN];
+		}
 		if (!zone_watermark_fast(zone, order, mark,
 				       ac->highest_zoneidx, alloc_flags)) {
 			int ret;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
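The mark selection above can be reduced to two small helpers (a hedged sketch with invented names; the real zone_watermark_ok() additionally discounts reserves and walks the per-order free lists):

```c
#include <assert.h>

/* Simplified watermark check: free pages must stay above the mark. */
static int toy_wmark_ok(long free_pages, long mark)
{
	return free_pages > mark;
}

/*
 * Mark selection mirroring the patch: atomic order-0 requests check
 * against WMARK_MIN alone, everyone else sees the boosted mark.
 */
static long toy_pick_mark(long wmark_min, long boost, int order, int atomic)
{
	if (atomic && order == 0)
		return wmark_min;
	return wmark_min + boost;
}
```

Plugging in the numbers from the commit message (WMARK_MIN = 1272 pages, boost = 14077 pages, ~20MB free ≈ 5120 4KB pages): against the boosted mark the request fails, against WMARK_MIN alone it succeeds, which is exactly the observed regression and its fix.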
[PATCH v2] mm, page_alloc: reset the zone->watermark_boost early
Updating the zone watermarks by any means, like min_free_kbytes, watermark_scale_factor, etc., while ->watermark_boost is set, results in higher low and high watermarks than the user asked for.

Below are the steps pursued to reproduce the problem on a system setup of an Android kernel running on Snapdragon hardware.

1) The default settings of the system are as below:

   #cat /proc/sys/vm/min_free_kbytes = 5162
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
	Node 0, zone Normal
		min 797
		low 8340
		high 8539

2) Monitor zone->watermark_boost (by adding a debug print in the kernel), and whenever it is greater than zero, write the same min_free_kbytes value obtained from step 1:

   #echo 5162 > /proc/sys/vm/min_free_kbytes

3) Then read the zone watermarks in the system while ->watermark_boost is zero. This should show the same watermark values as step 1, but higher values are shown:

   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
	Node 0, zone Normal
		min 797
		low 21148
		high 21347

These higher values result from updating the zone watermarks using the macro min_wmark_pages(zone), which also adds zone->watermark_boost:

   #define min_wmark_pages(z) (z->_watermark[WMARK_MIN] + z->watermark_boost)

So the steps that lead to the issue are:

1) On an extfrag event, the watermarks are boosted by storing the required value in ->watermark_boost.

2) The user tries to update the zone watermark levels in the system through min_free_kbytes or watermark_scale_factor.

3) Later, when kswapd wakes up, it resets zone->watermark_boost to zero.

In step 2), the min_wmark_pages() macro is used to store the watermarks in the zone structure, so the values are always offset by the ->watermark_boost value. This can be avoided by resetting ->watermark_boost to zero before it is used.
Signed-off-by: Charan Teja Reddy
---
v2: Improve the commit message
v1: https://patchwork.kernel.org/patch/11540751/

 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cef05d3..d001d61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7761,9 +7761,9 @@ static void __setup_per_zone_wmarks(void)
 			mult_frac(zone_managed_pages(zone),
 				  watermark_scale_factor, 10000));
 
+		zone->watermark_boost = 0;
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
-		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
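Why the one-line move matters can be checked with plain arithmetic (a standalone sketch; `TOY_MIN_WMARK()` and `toy_low_after_update()` are invented stand-ins for min_wmark_pages() and __setup_per_zone_wmarks()):

```c
#include <assert.h>

/* min_wmark_pages() in the kernel adds the boost to WMARK_MIN. */
#define TOY_MIN_WMARK(min, boost)	((min) + (boost))

/*
 * Sketch of the low-watermark recomputation in __setup_per_zone_wmarks().
 * With reset_first == 0 (the pre-patch order, where the boost is cleared
 * only after the marks are derived), a stale boost leaks into the
 * recomputed low/high marks; with reset_first == 1 (the patched order)
 * it does not.
 */
static long toy_low_after_update(long min, long boost, long tmp,
				 int reset_first)
{
	if (reset_first)
		boost = 0;	/* patched order: clear before use */
	return TOY_MIN_WMARK(min, boost) + tmp;
}
```

With the numbers from the reproduction steps (min = 797, low = 8340, so tmp = 7543, and a stale boost of 12808 pages), the pre-patch order reproduces the bogus low = 21148 from the zoneinfo dump, while the patched order returns the expected 8340.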
[PATCH] mm, page_alloc: reset the zone->watermark_boost early
Updating the zone watermarks by any means, like extra_free_kbytes, min_free_kbytes, watermark_scale_factor, etc., while watermark_boost is set, results in higher low and high watermarks than the user asks for. This can be avoided by resetting zone->watermark_boost to zero early.

Signed-off-by: Charan Teja Reddy
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b265b09..822e262 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7746,9 +7746,9 @@ static void __setup_per_zone_wmarks(void)
 			mult_frac(zone_managed_pages(zone),
 				  watermark_scale_factor, 10000));
 
+		zone->watermark_boost = 0;
 		zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
 		zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
-		zone->watermark_boost = 0;
 
 		spin_unlock_irqrestore(&zone->lock, flags);
 	}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
[PATCH v2] dma-buf: fix use-after-free in dmabuffs_dname
The following race occurs while accessing the dmabuf object exported as a file:

P1				P2
dma_buf_release()		dmabuffs_dname()
				[say lsof reading /proc/<pid>/fd/<num>]

				read dmabuf stored in dentry->d_fsdata
Free the dmabuf object
				Start accessing the dmabuf structure

In the above, the dmabuf object freed in P1 is being accessed from P2, which results in a use-after-free. Below is the dump stack reported.

The dmabuf object is read from dentry->d_fsdata, but there is no binding between the dentry and the dmabuf, which means the dmabuf can be freed while it is being read from ->d_fsdata and in use. Reviews on patch V1 say that protecting the in-use dmabuf with an extra refcount is not a viable solution, as the exported dmabuf is already under the file's refcount and keeping multiple refcounts on the same object coordinated is not possible. Since the dmabuf is read in ->d_fsdata only to get the user-passed name, the name can be stored directly in d_fsdata, avoiding the read of the dmabuf altogether.
Call Trace:
 kasan_report+0x12/0x20
 __asan_report_load8_noabort+0x14/0x20
 dmabuffs_dname+0x4f4/0x560
 tomoyo_realpath_from_path+0x165/0x660
 tomoyo_get_realpath
 tomoyo_check_open_permission+0x2a3/0x3e0
 tomoyo_file_open
 tomoyo_file_open+0xa9/0xd0
 security_file_open+0x71/0x300
 do_dentry_open+0x37a/0x1380
 vfs_open+0xa0/0xd0
 path_openat+0x12ee/0x3490
 do_filp_open+0x192/0x260
 do_sys_openat2+0x5eb/0x7e0
 do_sys_open+0xf2/0x180

Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
Reported-by: syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
Cc: [5.3+]
Signed-off-by: Charan Teja Reddy
---
Changes in v2:
- Pass the user passed name in ->d_fsdata instead of dmabuf
- Improve the commit message

Changes in v1: (https://patchwork.kernel.org/patch/11514063/)

 drivers/dma-buf/dma-buf.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 01ce125..0071f7d 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -25,6 +25,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -40,15 +41,13 @@ struct dma_buf_list {
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
-	struct dma_buf *dmabuf;
 	char name[DMA_BUF_NAME_LEN];
 	size_t ret = 0;
 
-	dmabuf = dentry->d_fsdata;
-	dma_resv_lock(dmabuf->resv, NULL);
-	if (dmabuf->name)
-		ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
-	dma_resv_unlock(dmabuf->resv);
+	spin_lock(&dentry->d_lock);
+	if (dentry->d_fsdata)
+		ret = strlcpy(name, dentry->d_fsdata, DMA_BUF_NAME_LEN);
+	spin_unlock(&dentry->d_lock);
 
 	return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
 			     dentry->d_name.name, ret > 0 ? name : "");
@@ -80,12 +79,16 @@ static int dma_buf_fs_init_context(struct fs_context *fc)
 static int dma_buf_release(struct inode *inode, struct file *file)
 {
 	struct dma_buf *dmabuf;
+	struct dentry *dentry = file->f_path.dentry;
 
 	if (!is_dma_buf_file(file))
 		return -EINVAL;
 
 	dmabuf = file->private_data;
 
+	spin_lock(&dentry->d_lock);
+	dentry->d_fsdata = NULL;
+	spin_unlock(&dentry->d_lock);
 	BUG_ON(dmabuf->vmapping_counter);
 
 	/*
@@ -343,6 +346,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
 	}
 	kfree(dmabuf->name);
 	dmabuf->name = name;
+	dmabuf->file->f_path.dentry->d_fsdata = name;
 
 out_unlock:
 	dma_resv_unlock(dmabuf->resv);
@@ -446,7 +450,6 @@ static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
 		goto err_alloc_file;
 	file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
 	file->private_data = dmabuf;
-	file->f_path.dentry->d_fsdata = dmabuf;
 
 	return file;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation