[PATCH V2] sched,psi: fix the 'int' underflow for psi

2021-04-16 Thread Charan Teja Reddy
psi_group_cpu->tasks, an array of unsigned ints, stores the number of
tasks that could be stalled on a psi resource (io/mem/cpu).
Decrementing these counters at zero wraps them around, which in turn
leads to psi_group_cpu->state_mask being set with the respective
pressure state. This results in unnecessary time sampling for that
pressure state and thus spurious psi events, which can further lead to
wrong actions being taken in userland based on these events.
Though psi_bug is set under these conditions, that is only for debug
purposes. Fix this by decrementing the ->tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy 
---
 
Changes in V1: https://lore.kernel.org/patchwork/patch/1413894/
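
A standalone illustration (plain userspace C, not kernel code; the
variable names are made up) of why the guard matters: decrementing an
unsigned counter that is already zero wraps to UINT_MAX, so any
"tasks[t] non-zero => resource is stalled" style check afterwards sees
a huge, bogus task count.

#include <stdio.h>
#include <limits.h>

int main(void)
{
	unsigned int tasks = 0;		/* like one slot of groupc->tasks[] */

	tasks--;			/* wraps instead of going negative */
	printf("tasks after underflow: %u (UINT_MAX = %u)\n", tasks, UINT_MAX);
	printf("treated as stalled? %s\n", tasks ? "yes (spurious)" : "no");
	return 0;
}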

 kernel/sched/psi.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..651218d 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -711,14 +711,15 @@ static void psi_group_change(struct psi_group *group, int cpu,
for (t = 0, m = clear; m; m &= ~(1 << t), t++) {
if (!(m & (1 << t)))
continue;
-   if (groupc->tasks[t] == 0 && !psi_bug) {
+   if (groupc->tasks[t]) {
+   groupc->tasks[t]--;
+   } else if (!psi_bug) {
printk_deferred(KERN_ERR "psi: task underflow! cpu=%d t=%d tasks=[%u %u %u %u] clear=%x set=%x\n",
cpu, t, groupc->tasks[0],
groupc->tasks[1], groupc->tasks[2],
groupc->tasks[3], clear, set);
psi_bug = 1;
}
-   groupc->tasks[t]--;
}
 
for (t = 0; set; set &= ~(1 << t), t++)
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] sched,psi: fix the 'int' underflow for psi

2021-04-15 Thread Charan Teja Reddy
psi_group_cpu->tasks, an array of unsigned ints, stores the number of
tasks that could be stalled on a psi resource (io/mem/cpu).
Decrementing these counters at zero wraps them around, which in turn
leads to psi_group_cpu->state_mask being set with the respective
pressure state. This results in unnecessary time sampling for that
pressure state and thus spurious psi events, which can further lead to
wrong actions being taken in userland based on these events.
Though psi_bug is set under these conditions, that is only for debug
purposes. Fix this by decrementing the ->tasks count only when it is
non-zero.

Signed-off-by: Charan Teja Reddy 
---
 kernel/sched/psi.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 967732c..f925468 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -718,7 +718,8 @@ static void psi_group_change(struct psi_group *group, int cpu,
groupc->tasks[3], clear, set);
psi_bug = 1;
}
-   groupc->tasks[t]--;
+   if (groupc->tasks[t])
+   groupc->tasks[t]--;
}
 
for (t = 0; set; set &= ~(1 << t), t++)
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH RFC 0/1] mm: balancing the node zones occupancy

2021-02-18 Thread Charan Teja Reddy
I would like to start a discussion about balancing the occupancy of
memory zones in a node whose imbalance may be caused by migration of
pages to other zones during hotremove and then hotadding the same
memory. In this case there is a lot of free memory in the newly
hot-added memory, which can be filled up by the previously migrated
pages (migrated as part of the offline/hotremove), and thus may relieve
some pressure in the other zones of the node.

Say the system has 2 zones (Normal and Movable), where the Normal zone
is almost filled up by pages of the Movable zone as part of an offline
operation, and then we hot add memory as the Movable zone. At this
moment the Movable zone is almost empty and the Normal zone is almost
filled up (by the pages migrated as part of the previous
offline/hot-remove operation). At this point allocation requests from
the Normal zone may have to go through reclaim cycles and thus cause
some pressure. This problem is quite common on systems that
aggressively offline and online memory blocks, where the former step
migrates pages to the lower zones and the latter step does not reverse
that migration; as a result the offline operation may contribute to the
pressure in other zones of the system.

To overcome this problem, we can do the reverse of what the offline
operation did after onlining the memory block, i.e. **try to reverse
migrate the pages from other zones which were migrated as part of the
offline operation**. This may free up some of the pressure built up in
the other zones because of the offline operation. Since we are reverse
migrating the pages in the system, we can name this the "reverse
migration feature", or, since we are actually balancing the occupancy
of the zones in the system by migrating pages, we can name it
"balancing the system zones occupancy", or any other name...

We have tried the proof-of-concept code on Snapdragon systems with the
following configuration: a single memory node with just 2 zones, a 6GB
Normal zone and a 2GB Movable zone. The Movable zone is hot-added once
and thereafter offlined/onlined based on need.

We run the below unit test to evaluate this:
1) Completely fill up the already hot-added movable zone with anon
pages.

2) Offline this hot-added movable zone. At this point there is no
pressure in the normal zone yet, but this migration of pages can
contribute to it in the future.

3) Now fill up the normal zone such that there is 300MB or less left in
the system.

4) Now the user onlines the movable zone memory blocks in the
system.

5) Run a test that allocates 512MB of memory from the normal zone; in
doing so it tries to allocate higher-order pages first and then
gradually falls back to lower orders. I took the help of the ION system
heap allocation path, which tries to allocate memory in the available
orders: 9, 4 and 0.

6) Repeat the above steps for 10 iterations and below is the average of
the results.

We collected the time it takes to complete the test and the
distribution of anon pages (the ones participating in the test) across
the node's zones.
a) Without the reverse migration, it took an average of around 145msec
to complete the test.
b) With the reverse migration, it took an average of 120msec to
complete the test.

For the distribution of the anon pages in the system, we collected the
anon pages left in the individual zones before and after the test:

                    |         Base          |  Reverse Migration
                    | Beforetest  Aftertest | Beforetest  Aftertest
--------------------|-----------------------|-----------------------
Normal zone (Anon)  |                       |
  Active            |     499825      45756 |     481441     203124
  Inactive          |      46008     446687 |      51553      58602
  Free              |      80350     224252 |      84877 **440586**
Movable zone (Anon) |                       |
  Active            |       2224       2626 |       2239 **484622**
  Inactive          |          8          8 |          9       7663
--------------------|-----------------------|-----------------------

The above table shows that in the base case (left column) there remain
a lot of anon pages in the Normal zone which could be migrated back to
the Movable zone (which is almost totally free), thus freeing up some
space in the Normal zone. With the reverse migration (right column),
the anon pages are evenly distributed in the system and a lot of free
memory is left in the Normal zone thanks to the reverse migration.

The code shows the PoC by assuming just 2 zones (Normal and Movable) of
a single node in the system. The number of pages to be reverse migrated
is written from userspace to the sysctl interface registered by the
patch (vm.balance_node_occupancy_pages) while monitoring the memory
pressure events in the system, as sketched below.
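
A minimal userspace sketch of that interaction, assuming the sysctl
added by the PoC patch shows up as
/proc/sys/vm/balance_node_occupancy_pages (write-only, mode 0200); the
trigger policy and the page count below are placeholders only:

/* balance_trigger.c: ask the kernel to reverse-migrate N pages. */
#include <stdio.h>

static int request_reverse_migration(unsigned long nr_pages)
{
	FILE *f = fopen("/proc/sys/vm/balance_node_occupancy_pages", "w");

	if (!f)
		return -1;
	/* the PoC handler reads a page count and schedules the fill worker */
	fprintf(f, "%lu\n", nr_pages);
	fclose(f);
	return 0;
}

int main(void)
{
	/* e.g. after a memory pressure (PSI) event: move 128MB worth of pages */
	return request_reverse_migration(128UL * 1024 * 1024 / 4096);
}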

Charan Teja Reddy (1):
  mm: proof-of-concept for balancing node zones occupancy

 include/linux/migrate.h |   8 +-
 include/linux/mm.h  |   3 +
 include/linux/mmzon

[PATCH 1/1] mm: proof-of-concept for balancing node zones occupancy

2021-02-18 Thread Charan Teja Reddy
This is proof-of-concept code for balancing the occupancy of a node's
zones, where the imbalance may be caused by memory hotplug.

Signed-off-by: Charan Teja Reddy 
---
 include/linux/migrate.h |   8 +-
 include/linux/mm.h  |   3 +
 include/linux/mmzone.h  |   2 +
 kernel/sysctl.c |  11 ++
 mm/compaction.c |   4 +-
 mm/memory_hotplug.c | 265 
 6 files changed, 290 insertions(+), 3 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 4594838..b7dc259 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -53,6 +53,8 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
  struct page *newpage, struct page *page);
 extern int migrate_page_move_mapping(struct address_space *mapping,
struct page *newpage, struct page *page, int extra_count);
+extern void split_map_pages(struct list_head *list);
+extern unsigned long release_freepages(struct list_head *freelist);
 #else
 
 static inline void putback_movable_pages(struct list_head *l) {}
@@ -81,7 +83,11 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 {
return -ENOSYS;
 }
-
+static inline void split_map_pages(struct list_head *list) { }
+static inline unsigned long release_freepages(struct list_head *freelist)
+{
+   return 0;
+}
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
diff --git a/include/linux/mm.h b/include/linux/mm.h
index ecdf8a8..1014139 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2465,6 +2465,9 @@ extern int watermark_boost_factor;
 extern int watermark_scale_factor;
 extern bool arch_has_descending_max_zone_pfns(void);
 
+/* memory_hotplug.c */
+extern int balance_node_occupancy_pages;
+
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b593316..ce417c3 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -977,6 +977,8 @@ int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
 int numa_zonelist_order_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
+extern int sysctl_balance_node_occupancy_handler(struct ctl_table *tbl,
+   int write, void *buf, size_t *len, loff_t *pos);
 extern int percpu_pagelist_fraction;
 extern char numa_zonelist_order[];
 #define NUMA_ZONELIST_ORDER_LEN16
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c9fbdd8..4b95a90 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -3140,6 +3140,17 @@ static struct ctl_table vm_table[] = {
.extra2 = SYSCTL_ONE,
},
 #endif
+#ifdef CONFIG_MEMORY_HOTPLUG
+   {
+   .procname   = "balance_node_occupancy_pages",
+   .data   = &balance_node_occupancy_pages,
+   .maxlen = sizeof(balance_node_occupancy_pages),
+   .mode   = 0200,
+   .proc_handler   = sysctl_balance_node_occupancy_handler,
+   .extra1 = SYSCTL_ZERO,
+   },
+
+#endif
{ }
 };
 
diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccda..da3c015 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -68,7 +68,7 @@ static const unsigned int HPAGE_FRAG_CHECK_INTERVAL_MSEC = 500;
 #define COMPACTION_HPAGE_ORDER (PMD_SHIFT - PAGE_SHIFT)
 #endif
 
-static unsigned long release_freepages(struct list_head *freelist)
+unsigned long release_freepages(struct list_head *freelist)
 {
struct page *page, *next;
unsigned long high_pfn = 0;
@@ -84,7 +84,7 @@ static unsigned long release_freepages(struct list_head *freelist)
return high_pfn;
 }
 
-static void split_map_pages(struct list_head *list)
+void split_map_pages(struct list_head *list)
 {
unsigned int i, order, nr_pages;
struct page *page, *next;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index f9d57b9..2780c91 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -97,6 +97,271 @@ void mem_hotplug_done(void)
 
 u64 max_mem_size = U64_MAX;
 
+int balance_node_occupancy_pages;
+static atomic_t target_migrate_pages = ATOMIC_INIT(0);
+
+struct movable_zone_fill_control {
+   struct list_head freepages;
+   unsigned long start_pfn;
+   unsigned long end_pfn;
+   unsigned long nr_migrate_pages;
+   unsigned long nr_free_pages;
+   unsigned long limit;
+   int target;
+   struct zone *zone;
+};
+
+static void fill_movable_zone_fn(struct work_struct *work);
+static DECLARE_WORK(fill_movable_zone_work, fill_movable_zone_fn);
+static DEFINE_MUTEX(page_migrate_lock);
+
+static inline void reset_page_order(struct page *page)
+{
+   __ClearPageBuddy(page);
+   set_page_private(page, 0);
+}
+
+static int isolate_free_p

[PATCH V2] mm: compaction: update the COMPACT[STALL|FAIL] events properly

2021-02-12 Thread Charan Teja Reddy
By definition, the COMPACT[STALL|FAIL] events need to be counted only
when 'at least in one zone compaction wasn't deferred or skipped from
the direct compaction'. Yet when compaction is skipped or deferred,
COMPACT_SKIPPED is returned but these compaction events are still
updated, which is wrong in the sense that COMPACT[STALL|FAIL] is
counted without compaction even being tried.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoids
the unnecessary try into get_page_from_freelist() when compaction is
not even attempted.

There is a corner case where compaction is skipped but we still want to
count the COMPACTSTALL event: an IRQ came and freed a page, and that
page got captured via capture_control.

Signed-off-by: Charan Teja Reddy 
---

changes in V1: https://lore.kernel.org/patchwork/patch/1373665/
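
For reference, a simplified sketch (not a literal copy) of how the
accounting in __alloc_pages_direct_compact() behaves with this patch
applied; error handling and the captured-page preparation are elided:

	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags,
					       ac, prio, &page);
	memalloc_noreclaim_restore(noreclaim_flag);
	psi_memstall_leave();

	if (*compact_result == COMPACT_SKIPPED)	/* nothing was tried */
		return NULL;			/* no STALL/FAIL accounting */

	count_vm_event(COMPACTSTALL);		/* compaction really ran */

	if (!page)				/* no captured page */
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	if (page) {
		count_vm_event(COMPACTSUCCESS);
		return page;
	}
	count_vm_event(COMPACTFAIL);
	return NULL;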

 mm/compaction.c | 8 
 mm/page_alloc.c | 2 ++
 2 files changed, 10 insertions(+)

diff --git a/mm/compaction.c b/mm/compaction.c
index 190ccda..104ebef 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2487,6 +2487,14 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 */
WRITE_ONCE(current->capture_control, NULL);
*capture = READ_ONCE(capc.page);
+   /*
+* Technically, it is also possible that compaction is skipped but
+* the page is still captured out of luck(IRQ came and freed the page).
+* Returning COMPACT_SUCCESS in such cases helps in properly accounting
+* the COMPACT[STALL|FAIL] when compaction is skipped.
+*/
+   if (*capture)
+   ret = COMPACT_SUCCESS;
 
return ret;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d..531f244 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
memalloc_noreclaim_restore(noreclaim_flag);
psi_memstall_leave();
 
+   if (*compact_result == COMPACT_SKIPPED)
+   return NULL;
/*
 * At least in one zone compaction wasn't deferred or skipped, so let's
 * count a compaction stall
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm: page_alloc: update the COMPACT[STALL|FAIL] events properly

2021-02-01 Thread Charan Teja Reddy
By definition, the COMPACT[STALL|FAIL] events need to be counted only
when 'at least in one zone compaction wasn't deferred or skipped from
the direct compaction'. Yet when compaction is skipped or deferred,
COMPACT_SKIPPED is returned but these compaction events are still
updated, which is wrong in the sense that COMPACT[STALL|FAIL] is
counted without compaction even being tried.

Correct this by skipping the counting of these events when
COMPACT_SKIPPED is returned for compaction. This indirectly also avoids
the unnecessary try into get_page_from_freelist() when compaction is
not even attempted.

Signed-off-by: Charan Teja Reddy 
---
 mm/page_alloc.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d..531f244 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4152,6 +4152,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
memalloc_noreclaim_restore(noreclaim_flag);
psi_memstall_leave();
 
+   if (*compact_result == COMPACT_SKIPPED)
+   return NULL;
/*
 * At least in one zone compaction wasn't deferred or skipped, so let's
 * count a compaction stall
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH V3] mm/compaction: correct deferral logic for proactive compaction

2021-01-18 Thread Charan Teja Reddy
should_proactive_compact_node() returns true when the sum of the
weighted fragmentation scores of all the zones in the node is greater
than the compaction wmark_high, which then triggers the proactive
compaction that operates on the individual zones of the node. But
proactive compaction runs on a zone only when its weighted
fragmentation score is greater than wmark_low (= wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the
zones can exceed wmark_high while each individual weighted zone score
is still below wmark_low, which triggers proactive compaction
unnecessarily only for it to return without doing anything.

The issue with proactive compaction returning without even trying is
its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
the scores across a proactive compaction run are the same, on the
assumption that compaction didn't make any progress, when in reality it
didn't even try. With a 500msec delay between successive proactive
compaction retries, this can result in a deferral of ~30sec without the
proactive compaction ever being tried.

Test scenario: compaction_proactiveness=50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) of 5GB and
6GB respectively. After opening some apps on Android, the weighted
fragmentation scores of these zones are 47 and 49 respectively. The sum
of these scores is above wmark_high, which triggers proactive
compaction, but since the individual zones' weighted fragmentation
scores are below wmark_low, it returns without trying proactive
compaction. As a result the weighted fragmentation scores of the zones
remain 47 and 49, which makes the existing logic defer the compaction,
thinking that no progress was made across the compaction run.

Fix this by checking just the zone's raw fragmentation score, not the
weighted one, in __compact_finished(), and use the zone's weighted
fragmentation score in fragmentation_score_node(). In the test case
above, if the weighted average is above wmark_high, then the raw
(unweighted) score of at least one zone has to be above wmark_high,
since the node score is a present-pages-weighted average of the raw
zone scores and therefore cannot exceed the largest of them. This
avoids the unnecessary triggers and deferrals of proactive compaction.

Fix-suggested-by: Vlastimil Babka 
Signed-off-by: Charan Teja Reddy 
---

Changes in V3: Addressed suggestions from Vlastimil

Changes in V2: https://lore.kernel.org/patchwork/patch/1366862/

Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/
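
A quick standalone check of that claim with made-up numbers (plain
userspace C, not kernel code; the zone sizes and raw scores are
hypothetical): the node score computed as in fragmentation_score_node()
is a present-pages-weighted average of the per-zone raw scores, so it
can only exceed wmark_high if at least one raw zone score does.

#include <stdio.h>

int main(void)
{
	/* hypothetical: 5GB and 6GB zones, raw extfrag scores 90 and 30 */
	unsigned long present[2] = { 5UL << 18, 6UL << 18 };	/* 4K pages */
	unsigned int raw[2] = { 90, 30 };
	unsigned long node_present = present[0] + present[1];
	unsigned long node_score = 0;
	int i;

	/* mirrors fragmentation_score_zone_weighted() summed over zones */
	for (i = 0; i < 2; i++)
		node_score += present[i] * raw[i] / (node_present + 1);

	printf("node score = %lu, max raw zone score = %u\n",
	       node_score, raw[0] > raw[1] ? raw[0] : raw[1]);
	/* prints ~56 vs 90: the weighted node score never exceeds the max */
	return 0;
}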

 mm/compaction.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..ccddb3a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1925,20 +1925,28 @@ static bool kswapd_is_running(pg_data_t *pgdat)
 
 /*
  * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
- * in the range [0, 100].
+ * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ */
+static unsigned int fragmentation_score_zone(struct zone *zone)
+{
+   return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+}
+
+/*
+ * A weighted zone's fragmentation score is the external fragmentation
+ * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * returns a value in the range [0, 100].
  *
  * The scaling factor ensures that proactive compaction focuses on larger
  * zones like ZONE_NORMAL, rather than smaller, specialized zones like
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static unsigned int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
 {
unsigned long score;
 
-   score = zone->present_pages *
-   extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+   score = zone->present_pages * fragmentation_score_zone(zone);
return div64_ul(score, zone->zone_pgdat->node_present_pages + 1);
 }
 
@@ -1958,7 +1966,7 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
struct zone *zone;
 
zone = &pgdat->node_zones[zoneid];
-   score += fragmentation_score_zone(zone);
+   score += fragmentation_score_zone_weighted(zone);
}
 
return score;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH V2] mm/compaction: correct deferral logic for proactive compaction

2021-01-18 Thread Charan Teja Reddy
should_proactive_compact_node() returns true when the sum of the
weighted fragmentation scores of all the zones in the node is greater
than the compaction wmark_high, which then triggers the proactive
compaction that operates on the individual zones of the node. But
proactive compaction runs on a zone only when its weighted
fragmentation score is greater than wmark_low (= wmark_high - 10).

This means that the sum of the weighted fragmentation scores of all the
zones can exceed wmark_high while each individual weighted zone score
is still below wmark_low, which triggers proactive compaction
unnecessarily only for it to return without doing anything.

The issue with proactive compaction returning without even trying is
its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT if
the scores across a proactive compaction run are the same, on the
assumption that compaction didn't make any progress, when in reality it
didn't even try. With a 500msec delay between successive proactive
compaction retries, this can result in a deferral of ~30sec without the
proactive compaction ever being tried.

Test scenario: compaction_proactiveness=50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) of 5GB and
6GB respectively. After opening some apps on Android, the weighted
fragmentation scores of these zones are 47 and 49 respectively. The sum
of these scores is above wmark_high, which triggers proactive
compaction, but since the individual zones' weighted fragmentation
scores are below wmark_low, it returns without trying proactive
compaction. As a result the weighted fragmentation scores of the zones
remain 47 and 49, which makes the existing logic defer the compaction,
thinking that no progress was made across the compaction run.

Fix this by checking just the zone's raw fragmentation score, not the
weighted one, in __compact_finished(), and use the zone's weighted
fragmentation score in fragmentation_score_node(). In the test case
above, if the weighted average is above wmark_high, then the raw
(unweighted) score of at least one zone has to be above wmark_high,
since the node score is a present-pages-weighted average of the raw
zone scores and therefore cannot exceed the largest of them. This
avoids the unnecessary triggers and deferrals of proactive compaction.

Fix-suggested-by: Vlastimil Babka 
Signed-off-by: Charan Teja Reddy 
---

Changes in V2: Addressed comments from vlastimil

Changes in V1: https://lore.kernel.org/patchwork/patch/1364646/

 mm/compaction.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..1b98427 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1924,16 +1924,16 @@ static bool kswapd_is_running(pg_data_t *pgdat)
 }
 
 /*
- * A zone's fragmentation score is the external fragmentation wrt to the
- * COMPACTION_HPAGE_ORDER scaled by the zone's size. It returns a value
- * in the range [0, 100].
+ * A weighted zone's fragmentation score is the external fragmentation
+ * wrt to the COMPACTION_HPAGE_ORDER scaled by the zone's size. It
+ * returns a value in the range [0, 100].
  *
  * The scaling factor ensures that proactive compaction focuses on larger
  * zones like ZONE_NORMAL, rather than smaller, specialized zones like
  * ZONE_DMA32. For smaller zones, the score value remains close to zero,
  * and thus never exceeds the high threshold for proactive compaction.
  */
-static unsigned int fragmentation_score_zone(struct zone *zone)
+static unsigned int fragmentation_score_zone_weighted(struct zone *zone)
 {
unsigned long score;
 
@@ -1943,6 +1943,15 @@ static unsigned int fragmentation_score_zone(struct zone *zone)
 }
 
 /*
+ * A zone's fragmentation score is the external fragmentation wrt to the
+ * COMPACTION_HPAGE_ORDER. It returns a value in the range [0, 100].
+ */
+static unsigned int fragmentation_score_zone(struct zone *zone)
+{
+   return extfrag_for_order(zone, COMPACTION_HPAGE_ORDER);
+}
+
+/*
  * The per-node proactive (background) compaction process is started by its
  * corresponding kcompactd thread when the node's fragmentation score
  * exceeds the high threshold. The compaction process remains active till
@@ -1958,7 +1967,7 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
struct zone *zone;
 
zone = &pgdat->node_zones[zoneid];
-   score += fragmentation_score_zone(zone);
+   score += fragmentation_score_zone_weighted(zone);
}
 
return score;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm/compaction: return proper state in should_proactive_compact_node

2021-01-13 Thread Charan Teja Reddy
should_proactive_compact_node() returns true when the sum of the
fragmentation scores of all the zones in the node is greater than the
compaction wmark_high, which then triggers the proactive compaction
that operates on the individual zones of the node. But proactive
compaction runs on a zone only when its fragmentation score is greater
than wmark_low (= wmark_high - 10).

This means that the sum of the fragmentation scores of all the zones
can exceed wmark_high while each individual zone score is still below
wmark_low, which triggers proactive compaction unnecessarily only for
it to return without doing anything.

Another issue with proactive compaction returning without even trying
is its deferral. It is simply deferred for 1 << COMPACT_MAX_DEFER_SHIFT
if the scores across a proactive compaction run are the same, on the
assumption that compaction didn't make any progress, when in reality it
didn't even try. With a 500msec delay between successive proactive
compaction retries, this can result in a deferral of ~30sec without the
proactive compaction ever being tried.

Test scenario: compaction_proactiveness=50, thus wmark_low = 50 and
wmark_high = 60. The system has 2 zones (Normal and Movable) of 5GB and
6GB respectively. After opening some apps on Android, the fragmentation
scores of these zones are 47 and 49 respectively. The sum of these
scores is above wmark_high, which triggers proactive compaction, but
since the individual zone scores are below wmark_low, it returns
without trying the compaction. As a result the fragmentation scores of
the zones remain 47 and 49, which makes the existing logic defer the
compaction, thinking that no progress was made across the compaction
run.

So, run the proactive compaction on the node's zones only when at least
one of the zones' fragmentation scores is greater than wmark_low. This
avoids the unnecessary deferrals and retries of the compaction.

Signed-off-by: Charan Teja Reddy 
---
 mm/compaction.c | 27 +--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index e5acb97..f7a772a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1964,6 +1964,26 @@ static unsigned int fragmentation_score_node(pg_data_t *pgdat)
return score;
 }
 
+/*
+ * Returns the maximum of fragmentation scores of zones in a node. This is
+ * used in taking the decision of whether to trigger the proactive compaction
+ * on the zones of this node.
+ */
+static unsigned int fragmentation_score_node_zones_max(pg_data_t *pgdat)
+{
+   int zoneid;
+   unsigned int max = 0;
+
+   for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
+   struct zone *zone;
+
+   zone = &pgdat->node_zones[zoneid];
+   max = max_t(unsigned int, fragmentation_score_zone(zone), max);
+   }
+
+   return max;
+}
+
 static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 {
unsigned int wmark_low;
@@ -1979,13 +1999,16 @@ static unsigned int fragmentation_score_wmark(pg_data_t *pgdat, bool low)
 
 static bool should_proactive_compact_node(pg_data_t *pgdat)
 {
-   int wmark_high;
+   int wmark_low, wmark_high;
 
if (!sysctl_compaction_proactiveness || kswapd_is_running(pgdat))
return false;
 
wmark_high = fragmentation_score_wmark(pgdat, false);
-   return fragmentation_score_node(pgdat) > wmark_high;
+   wmark_low = fragmentation_score_wmark(pgdat, true);
+
+   return fragmentation_score_node(pgdat) > wmark_high &&
+   fragmentation_score_node_zones_max(pgdat) > wmark_low;
 }
 
 static enum compact_result __compact_finished(struct compact_control *cc)
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH V2] dmabuf: fix use-after-free of dmabuf's file->f_inode

2021-01-05 Thread Charan Teja Reddy
A 'use-after-free' is observed on the dmabuf's file->f_inode due to a
race between closing the dmabuf file and reading the dmabuf's debug
info.

Consider the below scenario where P1 is closing the dma_buf file
and P2 is reading the dma_buf's debug info in the system:

P1  P2
dma_buf_debug_show()
dma_buf_put()
  __fput()
file->f_op->release()
dput()

  dentry_unlink_inode()
iput(dentry->d_inode)
(where the inode is freed)
mutex_lock(&db_list.lock)
read 'dma_buf->file->f_inode'
(the same inode is freed by P1)
mutex_unlock(&db_list.lock)
  dentry->d_op->d_release()-->
dma_buf_release()
  .
  mutex_lock(&db_list.lock)
  removes the dmabuf from the list
  mutex_unlock(&db_list.lock)

In the above scenario, when dma_buf_put() is called on a dma_buf, it
first frees the dma_buf's file->f_inode (= dentry->d_inode) and only
then removes this dma_buf from the system db_list. In between, P2
traversing the db_list tries to access this dma_buf's file->f_inode
that was already freed by P1, which is a use-after-free.

Since __fput() calls f_op->release first and only later calls
d_op->d_release, move the dma_buf's db_list removal from d_release() to
f_op->release(). This ensures that the dma_buf's file->f_inode is not
accessed after it is released.

Cc: <stable@vger.kernel.org> # 5.4+
Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Acked-by: Christian König 
Signed-off-by: Charan Teja Reddy 
---
 V2: Resending with stable tags and Acks

 V1: https://lore.kernel.org/patchwork/patch/1360118/

 drivers/dma-buf/dma-buf.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)
 
dmabuf->ops->release(dmabuf);
 
-   mutex_lock(&db_list.lock);
-   list_del(&dmabuf->list_node);
-   mutex_unlock(&db_list.lock);
-
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
 
@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
kfree(dmabuf);
 }
 
+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+   struct dma_buf *dmabuf;
+
+   if (!is_dma_buf_file(file))
+   return -EINVAL;
+
+   dmabuf = file->private_data;
+
+   mutex_lock(&db_list.lock);
+   list_del(&dmabuf->list_node);
+   mutex_unlock(&db_list.lock);
+
+   return 0;
+}
+
 static const struct dentry_operations dma_buf_dentry_ops = {
.d_dname = dmabuffs_dname,
.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 }
 
 static const struct file_operations dma_buf_fops = {
+   .release= dma_buf_file_release,
.mmap   = dma_buf_mmap_internal,
.llseek = dma_buf_llseek,
.poll   = dma_buf_poll,
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] dmabuf: fix use-after-free of dmabuf's file->f_inode

2021-01-04 Thread Charan Teja Reddy
A 'use-after-free' is observed on the dmabuf's file->f_inode due to a
race between closing the dmabuf file and reading the dmabuf's debug
info.

Consider the below scenario where P1 is closing the dma_buf file
and P2 is reading the dma_buf's debug info in the system:

P1  P2
dma_buf_debug_show()
dma_buf_put()
  __fput()
file->f_op->release()
dput()

  dentry_unlink_inode()
iput(dentry->d_inode)
(where the inode is freed)
mutex_lock(&db_list.lock)
read 'dma_buf->file->f_inode'
(the same inode is freed by P1)
mutex_unlock(&db_list.lock)
  dentry->d_op->d_release()-->
dma_buf_release()
  .
  mutex_lock(&db_list.lock)
  removes the dmabuf from the list
  mutex_unlock(&db_list.lock)

In the above scenario, when dma_buf_put() is called on a dma_buf, it
first frees the dma_buf's file->f_inode (= dentry->d_inode) and only
then removes this dma_buf from the system db_list. In between, P2
traversing the db_list tries to access this dma_buf's file->f_inode
that was already freed by P1, which is a use-after-free.

Since __fput() calls f_op->release first and only later calls
d_op->d_release, move the dma_buf's db_list removal from d_release() to
f_op->release(). This ensures that the dma_buf's file->f_inode is not
accessed after it is released.

Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Signed-off-by: Charan Teja Reddy 
---
 drivers/dma-buf/dma-buf.c | 21 +
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 0eb80c1..a14dcbb 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -76,10 +76,6 @@ static void dma_buf_release(struct dentry *dentry)
 
dmabuf->ops->release(dmabuf);
 
-   mutex_lock(&db_list.lock);
-   list_del(&dmabuf->list_node);
-   mutex_unlock(&db_list.lock);
-
if (dmabuf->resv == (struct dma_resv *)&dmabuf[1])
dma_resv_fini(dmabuf->resv);
 
@@ -88,6 +84,22 @@ static void dma_buf_release(struct dentry *dentry)
kfree(dmabuf);
 }
 
+static int dma_buf_file_release(struct inode *inode, struct file *file)
+{
+   struct dma_buf *dmabuf;
+
+   if (!is_dma_buf_file(file))
+   return -EINVAL;
+
+   dmabuf = file->private_data;
+
+   mutex_lock(&db_list.lock);
+   list_del(&dmabuf->list_node);
+   mutex_unlock(&db_list.lock);
+
+   return 0;
+}
+
 static const struct dentry_operations dma_buf_dentry_ops = {
.d_dname = dmabuffs_dname,
.d_release = dma_buf_release,
@@ -413,6 +425,7 @@ static void dma_buf_show_fdinfo(struct seq_file *m, struct file *file)
 }
 
 static const struct file_operations dma_buf_fops = {
+   .release= dma_buf_file_release,
.mmap   = dma_buf_mmap_internal,
.llseek = dma_buf_llseek,
.poll   = dma_buf_poll,
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm: cma: improve pr_debug log in cma_release()

2020-11-25 Thread Charan Teja Reddy
It is useful to print the 'count' of pages, along with the pages
themselves, passed to cma_release(), to debug cases of a mismatched
count value being passed between cma_alloc() and cma_release() from a
code path.

As an example, consider the below scenario:
1) The CMA pool size is 4MB, and
2) The user makes the erroneous step of allocating 2 pages but freeing
only 1 page in a loop from this CMA pool.
Step 2 causes cma_alloc() to return NULL at some point because of the
resulting -ENOMEM condition.

The current pr_debug logs do not give any information about these kinds
of allocation patterns because the count value is not printed in
cma_release().

We already print the count value in the trace logs; just extend the
same to the pr_debug logs too.
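
A kernel-side sketch of the erroneous pattern from the scenario above
(assuming the usual cma_alloc()/cma_release() prototypes; the caller and
loop bound are hypothetical): two pages are allocated but only one is
released per iteration, so the pool leaks until cma_alloc() starts
failing.

#include <linux/cma.h>

static void leaky_cma_user(struct cma *cma)
{
	struct page *pages;
	int i;

	for (i = 0; i < 512; i++) {
		pages = cma_alloc(cma, 2, 0, false);	/* alloc 2 pages */
		if (!pages)
			break;	/* pool exhausted by the leaked pages */
		cma_release(cma, pages, 1);	/* BUG: frees only 1 of them */
	}
}

With the count included in the pr_debug() output, this alloc/release
mismatch becomes visible directly from the logs.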

Signed-off-by: Charan Teja Reddy 
---
 mm/cma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..07c904b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -512,7 +512,7 @@ bool cma_release(struct cma *cma, const struct page *pages, unsigned int count)
if (!cma || !pages)
return false;
 
-   pr_debug("%s(page %p)\n", __func__, (void *)pages);
+   pr_debug("%s(page %p, count %u)\n", __func__, (void *)pages, count);
 
pfn = page_to_pfn(pages);
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm: memory_hotplug: put migration failure information under DEBUG_VM

2020-11-23 Thread Charan Teja Reddy
When pages fail to be isolated or migrated, the page owner information
is dumped along with the page info. If there are continuous failures in
migration (say the page is pinned) or isolation, the log buffer simply
gets flooded with the page owner information. As most of the time the
page info is sufficient to know the cause of the isolation or migration
failure, place the page owner information under DEBUG_VM.

Signed-off-by: Charan Teja Reddy 
---
 mm/memory_hotplug.c | 10 --
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 63b2e46..f48f30d 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1326,7 +1326,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 
} else {
pr_warn("failed to isolate pfn %lx\n", pfn);
-   dump_page(page, "isolation failed");
+   __dump_page(page, "isolation failed");
+#if defined(CONFIG_DEBUG_VM)
+   dump_page_owner(page);
+#endif
}
put_page(page);
}
@@ -1357,7 +1360,10 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
list_for_each_entry(page, &source, lru) {
pr_warn("migrating pfn %lx failed ret:%d ",
   page_to_pfn(page), ret);
-   dump_page(page, "migration failure");
+   __dump_page(page, "migration failure");
+#if defined(CONFIG_DEBUG_VM)
+   dump_page_owner(page);
+#endif
}
putback_movable_pages(&source);
}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] iommu: of: skip iommu_device_list traversal in of_iommu_xlate()

2020-09-23 Thread Charan Teja Reddy
In of_iommu_xlate(), check whether the iommu device is enabled before
traversing the iommu_device_list through iommu_ops_from_fwnode(). There
is no point in traversing the iommu_device_list only to return NO_IOMMU
because the iommu device node is disabled.

Signed-off-by: Charan Teja Reddy 
---
 drivers/iommu/of_iommu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/of_iommu.c b/drivers/iommu/of_iommu.c
index e505b91..225598c 100644
--- a/drivers/iommu/of_iommu.c
+++ b/drivers/iommu/of_iommu.c
@@ -94,9 +94,10 @@ static int of_iommu_xlate(struct device *dev,
struct fwnode_handle *fwnode = &iommu_spec->np->fwnode;
int ret;
 
+   if (!of_device_is_available(iommu_spec->np))
+   return NO_IOMMU;
ops = iommu_ops_from_fwnode(fwnode);
-   if ((ops && !ops->of_xlate) ||
-   !of_device_is_available(iommu_spec->np))
+   if (ops && !ops->of_xlate)
return NO_IOMMU;
 
ret = iommu_fwspec_init(dev, &iommu_spec->np->fwnode, ops);
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] dmabuf: fix NULL pointer dereference in dma_buf_release()

2020-09-18 Thread Charan Teja Reddy
A NULL pointer dereference is observed when exporting a dmabuf fails to
allocate the 'struct file': the allocated dentry corresponding to this
file in the dmabuf fs is then dropped, which ends up in
dma_buf_release() accessing the uninitialized dentry->d_fsdata.

Call stack on 5.4 is below:
 dma_buf_release+0x2c/0x254 drivers/dma-buf/dma-buf.c:88
 __dentry_kill+0x294/0x31c fs/dcache.c:584
 dentry_kill fs/dcache.c:673 [inline]
 dput+0x250/0x380 fs/dcache.c:859
 path_put+0x24/0x40 fs/namei.c:485
 alloc_file_pseudo+0x1a4/0x200 fs/file_table.c:235
 dma_buf_getfile drivers/dma-buf/dma-buf.c:473 [inline]
 dma_buf_export+0x25c/0x3ec drivers/dma-buf/dma-buf.c:585

Fix this by checking for a valid pointer in dentry->d_fsdata.

Fixes: 4ab59c3c638c ("dma-buf: Move dma_buf_release() from fops to dentry_ops")
Cc: <stable@vger.kernel.org> [5.7+]
Signed-off-by: Charan Teja Reddy 
---
 drivers/dma-buf/dma-buf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 58564d82..844967f 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -59,6 +59,8 @@ static void dma_buf_release(struct dentry *dentry)
struct dma_buf *dmabuf;
 
dmabuf = dentry->d_fsdata;
+   if (unlikely(!dmabuf))
+   return;
 
BUG_ON(dmabuf->vmapping_counter);
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH V2] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-11 Thread Charan Teja Reddy
The following race is observed with repeated online and offline of
memory blocks of the movable zone, with a delay between two successive
onlines.

P1  P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.

Allocate the pages from the
movable zone.

Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.

Since the number of pages in
the pcp lists is less than
->batch, it gets stuck in the
while(list_empty(list)) loop
with interrupts disabled, thus
hanging the core.

Avoid this by ensuring free_pcppages_bulk() is called with a proper
count of pcp list pages.

The mentioned race is somewhat easily reproducible without [1] because
the pcp's are not updated when the first memory block is onlined, and
thus there is a large enough race window for P2 between the alloc+free
and the pcp struct values being updated through the onlining of the
second memory block.

With [1], the race still exists but the window is much narrower, as we
update the pcp struct values when the first memory block itself is
onlined.

[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy 
---

v1: https://patchwork.kernel.org/patch/11707637/
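
A standalone model (not the kernel code) of why a count larger than the
number of pages actually on the pcp lists hangs: the inner loop mirrors
the list-selection loop in free_pcppages_bulk(), which keeps rotating
through the pcp lists until it finds a non-empty one. With count = 63
but only one page queued, the second pass of the outer loop finds every
list empty and the inner do/while spins forever (with IRQs disabled in
the real code).

#define MIGRATE_PCPTYPES 3

static void model_free_pcppages_bulk(int count,
				     int list_len[MIGRATE_PCPTYPES])
{
	int migratetype = 0;

	while (count) {
		/* pick the next non-empty list: never exits once all are empty */
		do {
			migratetype = (migratetype + 1) % MIGRATE_PCPTYPES;
		} while (list_len[migratetype] == 0);

		list_len[migratetype]--;	/* "free" one page from it */
		count--;
	}
}

Clamping count to the number of pages actually present, as the patch
does, lets the outer loop terminate before all the lists go empty.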

 mm/page_alloc.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..839039f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1304,6 +1304,11 @@ static void free_pcppages_bulk(struct zone *zone, int count,
struct page *page, *tmp;
LIST_HEAD(head);
 
+   /*
+    * Ensure a proper count is passed, which otherwise would get stuck
+    * in the below while (list_empty(list)) loop.
+    */
+   count = min(pcp->count, count);
while (count) {
struct list_head *list;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm, page_alloc: fix core hung in free_pcppages_bulk()

2020-08-10 Thread Charan Teja Reddy
The following race is observed with repeated online and offline of
memory blocks of the movable zone, with a delay between two successive
onlines.

P1  P2

Online the first memory block in
the movable zone. The pcp struct
values are initialized to default
values,i.e., pcp->high = 0 &
pcp->batch = 1.

Allocate the pages from the
movable zone.

Try to Online the second memory
block in the movable zone thus it
entered the online_pages() but yet
to call zone_pcp_update().
This process is entered into
the exit path thus it tries
to release the order-0 pages
to pcp lists through
free_unref_page_commit().
As pcp->high = 0, pcp->count = 1
proceed to call the function
free_pcppages_bulk().
Update the pcp values thus the
new pcp values are like, say,
pcp->high = 378, pcp->batch = 63.
Read the pcp's batch value using
READ_ONCE() and pass the same to
free_pcppages_bulk(), pcp values
passed here are, batch = 63,
count = 1.

Since the number of pages in
the pcp lists is less than
->batch, it gets stuck in the
while(list_empty(list)) loop
with interrupts disabled, thus
hanging the core.

Avoid this by ensuring free_pcppages_bulk() is called with a proper
count of pcp list pages.

The mentioned race is somewhat easily reproducible without [1] because
the pcp's are not updated when the first memory block is onlined, and
thus there is a large enough race window for P2 between the alloc+free
and the pcp struct values being updated through the onlining of the
second memory block.

With [1], the race still exists but the window is much narrower, as we
update the pcp struct values when the first memory block itself is
onlined.

[1]: https://patchwork.kernel.org/patch/11696389/

Signed-off-by: Charan Teja Reddy 
---
 mm/page_alloc.c | 16 ++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e4896e6..25e7e12 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3106,6 +3106,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
int migratetype;
+   int high;
 
migratetype = get_pcppage_migratetype(page);
__count_vm_event(PGFREE);
@@ -3128,8 +3129,19 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
pcp = &this_cpu_ptr(zone->pageset)->pcp;
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
-   if (pcp->count >= pcp->high) {
-   unsigned long batch = READ_ONCE(pcp->batch);
+   high = READ_ONCE(pcp->high);
+   if (pcp->count >= high) {
+   int batch;
+
+   batch = READ_ONCE(pcp->batch);
+   /*
+* For non-default pcp struct values, high is always
+* greater than the batch. If high < batch then pass
+* proper count to free the pcp's list pages.
+*/
+   if (unlikely(high < batch))
+   batch = min(pcp->count, batch);
+
free_pcppages_bulk(zone, batch, pcp);
}
 }
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm, memory_hotplug: update pcp lists everytime onlining a memory block

2020-08-02 Thread Charan Teja Reddy
When onlining the first memory block in a zone, the pcp lists are not
updated, so the pcp struct keeps its default settings of ->high = 0 and
->batch = 1. This means that until a second memory block in the zone
(if there is one) is onlined, the pcp lists of this zone will not
contain any pages, because pcp->count is always greater than ->high and
thus free_pcppages_bulk() is called to free batch-size (= 1) pages
every time the system wants to add a page to a pcp list through
free_unref_page(). In a word, the system is not using the benefits
offered by the pcp lists when there is a single onlineable memory block
in a zone. Correct this by always updating the pcp lists when a memory
block is onlined.
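
For reference, a simplified excerpt of the relevant check in
free_unref_page_commit() showing why the default pcp values defeat the
pcp lists: with pcp->high = 0 and pcp->batch = 1, every single free
immediately exceeds ->high and is drained straight back to the buddy
allocator.

	list_add(&page->lru, &pcp->lists[migratetype]);
	pcp->count++;
	if (pcp->count >= pcp->high)	/* 1 >= 0: true until zone_pcp_update() */
		free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);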

Signed-off-by: Charan Teja Reddy 
---
 mm/memory_hotplug.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index dcdf327..7f62d69 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -854,8 +854,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages,
node_states_set_node(nid, );
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
-   else
-   zone_pcp_update(zone);
+   zone_pcp_update(zone);
 
init_per_zone_wmark_min();
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH] mm, page_alloc: skip ->watermark_boost for atomic order-0 allocations

2020-05-19 Thread Charan Teja Reddy
When boosting is enabled, a high rate of atomic order-0 allocation
failures is observed, because the free levels in the system are checked
against watermarks that include the ->watermark_boost offset. This is
not a problem for sleepable allocations, but for atomic allocations it
looks like a regression.

This problem is seen frequently on an Android kernel running on
Snapdragon hardware with 4GB of RAM. When no extfrag event has occurred
in the system, the ->watermark_boost factor is zero, and the watermark
configuration of the system is:
   _watermark = (
  [WMARK_MIN] = 1272, --> ~5MB
  [WMARK_LOW] = 9067, --> ~36MB
  [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 0

After launching some memory-hungry applications in Android that cause
enough extfrag events, ->watermark_boost is set to its maximum, i.e.
the default boost factor makes it 150% of the high watermark:
   _watermark = (
  [WMARK_MIN] = 1272, --> ~5MB
  [WMARK_LOW] = 9067, --> ~36MB
  [WMARK_HIGH] = 9385), --> ~38MB
   watermark_boost = 14077, -->~57MB

With the default system configuration, ~2MB of free memory suffices for
an atomic order-0 allocation to succeed. But boosting raises the min
watermark to ~61MB, so for an atomic order-0 allocation to be
successful the system should have a minimum of ~23MB of free memory
(from the calculations in zone_watermark_ok(): min = 3/4 * (min/2)).
Failures are observed even though the system has ~20MB of free memory.
In testing, this is reproducible as early as the first 300secs since
boot, and with lower-RAM configurations (<2GB) it is observed as early
as the first 150secs since boot.

These failures can be avoided by excluding the ->watermark_boost from
the watermark calculations for atomic order-0 allocations.
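
A back-of-the-envelope check (plain userspace C) of the numbers quoted
above, using the ALLOC_HIGH/ALLOC_HARDER reductions applied by
__zone_watermark_ok() (min -> min/2 -> min/2 - min/8, i.e. 3/8 of min):

#include <stdio.h>

int main(void)
{
	long wmark_min = 1272, boost = 14077;	/* pages, from the zoneinfo above */
	long min = wmark_min + boost;		/* boosted min watermark */

	min -= min / 2;				/* ALLOC_HIGH */
	min -= min / 4;				/* ALLOC_HARDER */
	printf("boosted min: ~%ldMB, after flags: ~%ldMB\n",
	       (wmark_min + boost) * 4 / 1024, min * 4 / 1024);
	/* roughly 59MB and 22MB, in line with the ~61MB / ~23MB above */
	return 0;
}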

Signed-off-by: Charan Teja Reddy 
---
 mm/page_alloc.c | 12 
 1 file changed, 12 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d001d61..5193d7e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3709,6 +3709,18 @@ static bool zone_allows_reclaim(struct zone *local_zone, struct zone *zone)
}
 
mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
+   /*
+* Allow GFP_ATOMIC order-0 allocations to exclude the
+* zone->watermark_boost in its watermark calculations.
+* We rely on the ALLOC_ flags set for GFP_ATOMIC
+* requests in gfp_to_alloc_flags() for this. Reason not to
+* use the GFP_ATOMIC directly is that we want to fall back
+* to slow path thus wake up kswapd.
+*/
+   if (unlikely(!order && !(alloc_flags & ALLOC_WMARK_MASK) &&
+(alloc_flags & (ALLOC_HARDER | ALLOC_HIGH)))) {
+   mark = zone->_watermark[WMARK_MIN];
+   }
if (!zone_watermark_fast(zone, order, mark,
   ac->highest_zoneidx, alloc_flags)) {
int ret;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation



[PATCH v2] mm, page_alloc: reset the zone->watermark_boost early

2020-05-14 Thread Charan Teja Reddy
Updating the zone watermarks by any means (min_free_kbytes,
watermark_scale_factor, etc.) while ->watermark_boost is set results in
higher low and high watermarks than the user asked for.

Below are the steps pursued to reproduce the problem on system setup
of Android kernel running on Snapdragon hardware.
1) Default settings of the system are as below:
   #cat /proc/sys/vm/min_free_kbytes = 5162
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone   Normal
min  797
low  8340
high 8539

2) Monitor zone->watermark_boost (by adding a debug print in the
kernel) and, whenever it is greater than zero, write back the same
min_free_kbytes value obtained in step 1.
   #echo 5162 > /proc/sys/vm/min_free_kbytes

3) Then read the zone watermarks in the system while ->watermark_boost
is zero. This should show the same watermark values as step 1, but
higher values than asked for are shown.
   #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
Node 0, zone   Normal
min  797
low  21148
high 21347

These higher values are there because the zone watermarks are updated
using the macro min_wmark_pages(zone), which also adds in
zone->watermark_boost:
#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
z->watermark_boost)

So the steps that lead to the issue are as below:
1) On an extfrag event, the watermarks are boosted by storing the
required value in ->watermark_boost.

2) The user tries to update the zone watermark levels in the system
through min_free_kbytes or watermark_scale_factor.

3) Later, when kswapd wakes up, it resets zone->watermark_boost to
zero.

In step 2) we use the min_wmark_pages() macro to store the watermarks
in the zone structure, thus the values are always offset by the
->watermark_boost value. This can be avoided by resetting
->watermark_boost to zero before it is used.

Signed-off-by: Charan Teja Reddy 
---

v2: Improve the commit message

v1: (https://patchwork.kernel.org/patch/11540751/)
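
A quick check (plain userspace C, not part of the patch) that the
inflated values in the commit message are exactly the boost being
folded in via min_wmark_pages() at the time of the write:

#include <stdio.h>

int main(void)
{
	long low_before = 8340, low_after = 21148;	/* pages, from /proc/zoneinfo */

	printf("implied watermark_boost: %ld pages (~%ldMB)\n",
	       low_after - low_before, (low_after - low_before) * 4 / 1024);
	/* ~12808 pages, i.e. ~50MB silently added to both low and high */
	return 0;
}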

 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cef05d3..d001d61 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7761,9 +7761,9 @@ static void __setup_per_zone_wmarks(void)
mult_frac(zone_managed_pages(zone),
  watermark_scale_factor, 1));
 
+   zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
-   zone->watermark_boost = 0;
 
spin_unlock_irqrestore(&zone->lock, flags);
}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


[PATCH] mm, page_alloc: reset the zone->watermark_boost early

2020-05-11 Thread Charan Teja Reddy
Updating the zone watermarks by any means (extra_free_kbytes,
min_free_kbytes, watermark_scale_factor, etc.) while watermark_boost is
set results in higher low and high watermarks than the user asks for.
This can be avoided by resetting zone->watermark_boost to zero early.

Signed-off-by: Charan Teja Reddy 
---
 mm/page_alloc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1b265b09..822e262 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7746,9 +7746,9 @@ static void __setup_per_zone_wmarks(void)
mult_frac(zone_managed_pages(zone),
  watermark_scale_factor, 1));
 
+   zone->watermark_boost = 0;
zone->_watermark[WMARK_LOW]  = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;
-   zone->watermark_boost = 0;
 
spin_unlock_irqrestore(&zone->lock, flags);
}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation


[PATCH v2] dma-buf: fix use-after-free in dmabuffs_dname

2020-05-08 Thread Charan Teja Reddy
The following race occurs while accessing the dmabuf object exported as
file:
P1  P2
dma_buf_release()  dmabuffs_dname()
   [say lsof reading /proc/<pid>/fd/]

   read dmabuf stored in dentry->d_fsdata
Free the dmabuf object
   Start accessing the dmabuf structure

In the above description, the dmabuf object freed in P1 is being
accessed from P2, which results in a use-after-free. Below is the dump
stack reported.

We read the dmabuf object stored in dentry->d_fsdata, but there is no
binding between the dentry and the dmabuf, which means that the dmabuf
can be freed while it is being read from ->d_fsdata and still in use.
Reviews on patch V1 said that protecting the in-use dmabuf with an
extra refcount is not a viable solution, as the exported dmabuf is
already under the file's refcount and keeping multiple refcounts on the
same object coordinated is not possible.

As we read the dmabuf in ->d_fsdata only to get the user-passed name,
we can directly store the name in d_fsdata and thus avoid reading the
dmabuf altogether.

Call Trace:
 kasan_report+0x12/0x20
 __asan_report_load8_noabort+0x14/0x20
 dmabuffs_dname+0x4f4/0x560
 tomoyo_realpath_from_path+0x165/0x660
 tomoyo_get_realpath
 tomoyo_check_open_permission+0x2a3/0x3e0
 tomoyo_file_open
 tomoyo_file_open+0xa9/0xd0
 security_file_open+0x71/0x300
 do_dentry_open+0x37a/0x1380
 vfs_open+0xa0/0xd0
 path_openat+0x12ee/0x3490
 do_filp_open+0x192/0x260
 do_sys_openat2+0x5eb/0x7e0
 do_sys_open+0xf2/0x180

Fixes: bb2bb9030425 ("dma-buf: add DMA_BUF_SET_NAME ioctls")
Reported-by: syzbot+3643a18836bce555b...@syzkaller.appspotmail.com
Cc: <stable@vger.kernel.org> [5.3+]
Signed-off-by: Charan Teja Reddy 
---

Changes in v2: 

- Pass the user passed name in ->d_fsdata instead of dmabuf
- Improve the commit message

Changes in v1: (https://patchwork.kernel.org/patch/11514063/)

 drivers/dma-buf/dma-buf.c | 17 ++---
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/drivers/dma-buf/dma-buf.c b/drivers/dma-buf/dma-buf.c
index 01ce125..0071f7d 100644
--- a/drivers/dma-buf/dma-buf.c
+++ b/drivers/dma-buf/dma-buf.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -40,15 +41,13 @@ struct dma_buf_list {
 
 static char *dmabuffs_dname(struct dentry *dentry, char *buffer, int buflen)
 {
-   struct dma_buf *dmabuf;
char name[DMA_BUF_NAME_LEN];
size_t ret = 0;
 
-   dmabuf = dentry->d_fsdata;
-   dma_resv_lock(dmabuf->resv, NULL);
-   if (dmabuf->name)
-   ret = strlcpy(name, dmabuf->name, DMA_BUF_NAME_LEN);
-   dma_resv_unlock(dmabuf->resv);
+   spin_lock(&dentry->d_lock);
+   if (dentry->d_fsdata)
+   ret = strlcpy(name, dentry->d_fsdata, DMA_BUF_NAME_LEN);
+   spin_unlock(&dentry->d_lock);
 
return dynamic_dname(dentry, buffer, buflen, "/%s:%s",
 dentry->d_name.name, ret > 0 ? name : "");
@@ -80,12 +79,16 @@ static int dma_buf_fs_init_context(struct fs_context *fc)
 static int dma_buf_release(struct inode *inode, struct file *file)
 {
struct dma_buf *dmabuf;
+   struct dentry *dentry = file->f_path.dentry;
 
if (!is_dma_buf_file(file))
return -EINVAL;
 
dmabuf = file->private_data;
 
+   spin_lock(&dentry->d_lock);
+   dentry->d_fsdata = NULL;
+   spin_unlock(&dentry->d_lock);
BUG_ON(dmabuf->vmapping_counter);
 
/*
@@ -343,6 +346,7 @@ static long dma_buf_set_name(struct dma_buf *dmabuf, const char __user *buf)
}
kfree(dmabuf->name);
dmabuf->name = name;
+   dmabuf->file->f_path.dentry->d_fsdata = name;
 
 out_unlock:
dma_resv_unlock(dmabuf->resv);
@@ -446,7 +450,6 @@ static struct file *dma_buf_getfile(struct dma_buf *dmabuf, int flags)
goto err_alloc_file;
file->f_flags = flags & (O_ACCMODE | O_NONBLOCK);
file->private_data = dmabuf;
-   file->f_path.dentry->d_fsdata = dmabuf;
 
return file;
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
member of the Code Aurora Forum, hosted by The Linux Foundation