Re: [PATCH v4 3/3] mm: fs: Invalidate BH LRU during page migration

2021-03-20 Thread Chris Goldsworthy

On 2021-03-20 12:54, Matthew Wilcox wrote:

On Sat, Mar 20, 2021 at 10:20:09AM -0700, Minchan Kim wrote:

> > Tested-by: Oliver Sang 
> > Reported-by: kernel test robot 
> > Signed-off-by: Chris Goldsworthy 
> > Signed-off-by: Minchan Kim 
>
> The signoff chain ordering might mean that Chris was the primary author, but
> there is no From: line for him.  Please clarify?

He wrote the first version, but the implementation is different since I
changed a lot. That's why I added his SoB even though the current
implementation is much different. So, maybe I am the primary author?


Hey Minchan, let's have you as the primary author.

Maybe Chris is Reported-by:?  And don't forget Laura Abbott as original
author of the patch Chris submitted.  I think she should be Reported-by:
as well, since there is no code from either of them in this version of
the patch.


Yes, let's have a Reported-by: from Laura. We can change my 
Signed-off-by to Reported-by: as well.


--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

2021-03-13 Thread Chris Goldsworthy

On 2021-03-11 14:41, Chris Goldsworthy wrote:

On 2021-03-10 08:14, Minchan Kim wrote:

LRU pagevec holds a refcount on its pages until the pagevec is drained.
This can prevent migration, since the refcount of the page is greater
than what the migration logic expects. To mitigate the issue, callers
of migrate_pages drain the LRU pagevec via migrate_prep or
lru_add_drain_all before the migrate_pages call.

However, that's not enough, because pages that enter the pagevec after
the draining call can still sit in the pagevec and keep preventing page
migration. Since some callers of migrate_pages have retry logic with
LRU draining, the page would migrate on the next trial, but it is still
fragile in that it doesn't close the fundamental race between pages
entering the pagevec and migration, so the migration failure could
cause a contiguous memory allocation failure in the end.

To close the race, this patch disables the LRU caches (i.e., pagevec)
during an ongoing migration until the migration is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (i.e., a fallback to sync
migration) using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
   printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
   dump_page(page, "fail to migrate");
}

The test repeatedly launched Android apps with a CMA allocation running
in the background every five seconds. The total CMA allocation count
was about 500 during the testing. With this patch, the dump_page count
was reduced from 400 to 30.

The new interface is also useful for memory hotplug, which currently
drains the LRU pcp caches after each migration failure. This is rather
suboptimal, as it has to disrupt other tasks running during the
operation. With the new interface the operation happens only once. This
is also in line with the pcp allocator caches, which are disabled for
the offlining as well.

Signed-off-by: Minchan Kim 
---
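
For context, a minimal sketch of how a migrate_pages() caller would pair
the new disable/enable interface (the helper names lru_cache_disable()
and lru_cache_enable() are assumed from this series; error handling is
elided):

	/* Sketch only: keep the per-CPU pagevecs empty for the whole attempt. */
	static int migrate_range_sketch(struct list_head *pages,
					new_page_t get_new_page,
					unsigned long private)
	{
		int ret;

		lru_cache_disable();	/* new pages bypass the pagevecs */
		ret = migrate_pages(pages, get_new_page, NULL, private,
				    MIGRATE_SYNC, MR_CONTIG_RANGE);
		lru_cache_enable();	/* restore normal pagevec caching */

		return ret;
	}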


Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.

Should have added:

Reviewed-by: Chris Goldsworthy 
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily

2021-03-11 Thread Chris Goldsworthy

On 2021-03-10 08:14, Minchan Kim wrote:

LRU pagevec holds a refcount on its pages until the pagevec is drained.
This can prevent migration, since the refcount of the page is greater
than what the migration logic expects. To mitigate the issue, callers
of migrate_pages drain the LRU pagevec via migrate_prep or
lru_add_drain_all before the migrate_pages call.

However, that's not enough, because pages that enter the pagevec after
the draining call can still sit in the pagevec and keep preventing page
migration. Since some callers of migrate_pages have retry logic with
LRU draining, the page would migrate on the next trial, but it is still
fragile in that it doesn't close the fundamental race between pages
entering the pagevec and migration, so the migration failure could
cause a contiguous memory allocation failure in the end.

To close the race, this patch disables the LRU caches (i.e., pagevec)
during an ongoing migration until the migration is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (i.e., a fallback to sync
migration) using the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
..
..

if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
   printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
   dump_page(page, "fail to migrate");
}

The test repeatedly launched Android apps with a CMA allocation running
in the background every five seconds. The total CMA allocation count
was about 500 during the testing. With this patch, the dump_page count
was reduced from 400 to 30.

The new interface is also useful for memory hotplug, which currently
drains the LRU pcp caches after each migration failure. This is rather
suboptimal, as it has to disrupt other tasks running during the
operation. With the new interface the operation happens only once. This
is also in line with the pcp allocator caches, which are disabled for
the offlining as well.

Signed-off-by: Minchan Kim 
---


Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2] [RFC] mm: fs: Invalidate BH LRU during page migration

2021-02-11 Thread Chris Goldsworthy

On 2021-02-11 06:09, Matthew Wilcox wrote:

On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:
+/* These are used to control the BH LRU invalidation during page migration */
+static struct cpumask lru_needs_invalidation;
+static bool bh_lru_disabled = false;


As I asked before, what protects this on an SMP system?



Sorry Matthew, I misconstrued your earlier question in V1 and thought
you had been referring to compile-time protection (so as to prevent
build breakages).  It is not protected, so I'll need to change this into
an atomic counter that is incremented by bh_lru_disable() and
decremented by bh_lru_enable() (such that if the counter is greater
than zero, we bail).
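
A minimal sketch of that counter-based approach (bh_lru_disable_count is
a name invented here for illustration, not something from the patch):

	/* Sketch only: SMP-safe enable/disable via an atomic counter. */
	static atomic_t bh_lru_disable_count = ATOMIC_INIT(0);

	void bh_lru_disable(void)
	{
		atomic_inc(&bh_lru_disable_count);
		invalidate_bh_lrus();	/* drop whatever is already cached */
	}

	void bh_lru_enable(void)
	{
		atomic_dec(&bh_lru_disable_count);
	}

	static bool bh_lru_disabled(void)
	{
		/* bh_lru_install() bails out while this is non-zero */
		return atomic_read(&bh_lru_disable_count) > 0;
	}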



@@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
 /*
  * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
- * Or, if already in the LRU it is moved to the front.
+ * Or, if already in the LRU it is moved to the front. Note that if LRU is
+ * disabled because of an ongoing page migration, we won't insert bh into the
+ * LRU.


And also, why do we need to do this?  The page LRU has no equivalent
mechanism to prevent new pages being added to the per-CPU LRU lists.
If a BH has just been used, isn't that a strong hint that this page is
a bad candidate for migration?


I had assumed up until now that pages in the page cache aren't an issue,
such that they're dropped during migration as needed.  Looking at
try_to_free_buffers() [1], I don't see any handling for the page cache.
I will need to do due diligence and follow up on this.


As for the question of necessity: if there is a case in which preventing
buffer_heads from being added to the BH LRU ensures that the containing
page can be migrated, then I would say that the change is justified,
since it adds another scenario in which migration is guaranteed (I will
follow up on this as well).


Regards,

Chris.

[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L3225

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH v2] [RFC] mm: fs: Invalidate BH LRU during page migration

2021-02-10 Thread Chris Goldsworthy
Pages containing buffer_heads that are in one of the per-CPU
buffer_head LRU caches will be pinned and thus cannot be migrated.
This can prevent CMA allocations from succeeding, which are often used
on platforms with co-processors (such as a DSP) that can only use
physically contiguous memory. It can also prevent memory
hot-unplugging from succeeding, which involves migrating at least
MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from 8 MiB to 1
GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration
starts and stop any buffer_head from being cached in the LRU caches,
until migration has finished.

Signed-off-by: Chris Goldsworthy 
Cc: Minchan Kim 
Cc: Matthew Wilcox 
Cc: Andrew Morton 
---
 fs/buffer.c | 54 +++--
 include/linux/buffer_head.h |  8 +++
 include/linux/migrate.h |  2 ++
 mm/migrate.c| 19 
 mm/page_alloc.c |  3 +++
 mm/swap.c   |  7 +-
 6 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..634e474 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1274,6 +1274,10 @@ struct bh_lru {
 
 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};
 
+/* These are used to control the BH LRU invalidation during page migration */
+static struct cpumask lru_needs_invalidation;
+static bool bh_lru_disabled = false;
+
 #ifdef CONFIG_SMP
 #define bh_lru_lock()  local_irq_disable()
 #define bh_lru_unlock()local_irq_enable()
@@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
 /*
  * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
- * Or, if already in the LRU it is moved to the front.
+ * Or, if already in the LRU it is moved to the front. Note that if LRU is
+ * disabled because of an ongoing page migration, we won't insert bh into the
+ * LRU.
  */
 static void bh_lru_install(struct buffer_head *bh)
 {
@@ -1303,6 +1309,9 @@ static void bh_lru_install(struct buffer_head *bh)
check_irqs_on();
bh_lru_lock();
 
+   if (bh_lru_disabled)
+   goto out;
+
	b = this_cpu_ptr(&bh_lrus);
for (i = 0; i < BH_LRU_SIZE; i++) {
swap(evictee, b->bhs[i]);
@@ -1313,6 +1322,7 @@ static void bh_lru_install(struct buffer_head *bh)
}
 
get_bh(bh);
+out:
bh_lru_unlock();
brelse(evictee);
 }
@@ -1328,6 +1338,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 
check_irqs_on();
bh_lru_lock();
+
+   if (bh_lru_disabled)
+   goto out;
+
for (i = 0; i < BH_LRU_SIZE; i++) {
struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);
 
@@ -1346,6 +1360,7 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
break;
}
}
+out:
bh_lru_unlock();
return ret;
 }
@@ -1446,7 +1461,7 @@ EXPORT_SYMBOL(__bread_gfp);
  * This doesn't race because it runs in each cpu either in irq
  * or with preempt disabled.
  */
-static void invalidate_bh_lru(void *arg)
+void invalidate_bh_lru(void *arg)
 {
	struct bh_lru *b = &get_cpu_var(bh_lrus);
int i;
@@ -1477,6 +1492,41 @@ void invalidate_bh_lrus(void)
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+bool need_bh_lru_invalidation(int cpu)
+{
+   return cpumask_test_cpu(cpu, &lru_needs_invalidation);
+}
+
+void bh_lru_disable(void)
+{
+   int cpu;
+
+   bh_lru_disabled = true;
+
+   /*
+* This barrier ensures that invocations of bh_lru_install()
+* after this barrier see that bh_lru_disabled == true (until
+* bh_lru_enable() is eventually called).
+*/
+   smp_mb();
+
+   /*
+* It's alright if someone comes along and hot-plugs a new CPU,
+* since we have that bh_lru_disabled == true.  The hot-remove
+* case is handled in buffer_exit_cpu_dead().
+*/
+   for_each_online_cpu(cpu) {
+   if (has_bh_in_lru(cpu, NULL))
+   cpumask_set_cpu(cpu, &lru_needs_invalidation);
+   }
+}
+
+void bh_lru_enable(void)
+{
+   bh_lru_disabled = false;
+   cpumask_clear(&lru_needs_invalidation);
+}
+
 void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94..78eb5ee 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -193,7 +193,11 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size,
  gfp_t gfp);
 struct buffer_head *__bread_gfp(struct block_device *,
sector_t block, unsigned size, gfp_t gfp);
+void invalidate_bh_lru(void *arg);
 void invalidate_bh_lrus(void);

[PATCH v2] [RFC] Invalidate BH LRU during page migration

2021-02-10 Thread Chris Goldsworthy
A page containing buffer_heads can be pinned if any of its constituent
buffer_heads belongs to the BH LRU cache [1], which can prevent that
page from being migrated. After going through several iterations of a
patch that attempts to solve this by removing BH entries inside of the
drop_buffers() function, which in the worst-case could be called for
each migrated page, Minchan Kim suggested that we invalidate the
entire BH LRU once, just before we start migrating pages.
Additionally, Matthew Wilcox suggested that we invalidate the BH LRU
inside of lru_add_drain_all(), so as to benefit other functions that
would be impacted by pinned pages [2].

V2: Respond to feedback provided by Andrew, Minchan and Matthew in [3].
As suggested by Minchan, we're now doing the invalidation of the LRUs
in a fashion similar to how the pagevecs are drained in
lru_add_drain_all().


[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L1238
[2] 
https://lore.kernel.org/linux-fsdevel/cover.1611642038.git.cgold...@codeaurora.org/
[3] https://lkml.org/lkml/2021/2/2/68
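
A compact sketch of the drain-style flow this revision adopts (helper
names taken from the accompanying patch; the migration step is elided):

	bh_lru_disable();	/* bh_lru_install() bails out from here on  */
	invalidate_bh_lrus();	/* conditional per-CPU drain of cached bhs  */
	/* ... isolate and migrate pages ... */
	bh_lru_enable();	/* resume normal BH LRU caching             */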

Chris Goldsworthy (1):
  [RFC] mm: fs: Invalidate BH LRU during page migration

 fs/buffer.c | 54 +++--
 include/linux/buffer_head.h |  8 +++
 include/linux/migrate.h |  2 ++
 mm/migrate.c| 19 
 mm/page_alloc.c |  3 +++
 mm/swap.c   |  7 +-
 6 files changed, 90 insertions(+), 3 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH] [RFC] mm: fs: Invalidate BH LRU during page migration

2021-02-01 Thread Chris Goldsworthy
Pages containing buffer_heads that are in the buffer_head LRU cache
will be pinned and thus cannot be migrated.  Correspondingly,
invalidate the BH LRU before a migration starts and stop any
buffer_head from being cached in the LRU, until migration has
finished.

Signed-off-by: Chris Goldsworthy 
Cc: Minchan Kim 
Cc: Matthew Wilcox 
---
 fs/buffer.c |  6 ++
 include/linux/buffer_head.h |  3 +++
 include/linux/migrate.h |  2 ++
 mm/migrate.c| 18 ++
 mm/page_alloc.c |  3 +++
 mm/swap.c   |  3 +++
 6 files changed, 35 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..39ec4ec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1289,6 +1289,8 @@ static inline void check_irqs_on(void)
 #endif
 }
 
+bool bh_migration_done = true;
+
 /*
  * Install a buffer_head into this cpu's LRU.  If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
@@ -1303,6 +1305,9 @@ static void bh_lru_install(struct buffer_head *bh)
check_irqs_on();
bh_lru_lock();
 
+   if (!bh_migration_done)
+   goto out;
+
b = this_cpu_ptr(_lrus);
for (i = 0; i < BH_LRU_SIZE; i++) {
swap(evictee, b->bhs[i]);
@@ -1313,6 +1318,7 @@ static void bh_lru_install(struct buffer_head *bh)
}
 
get_bh(bh);
+out:
bh_lru_unlock();
brelse(evictee);
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94..ae4eb6d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -193,6 +193,9 @@ void __breadahead_gfp(struct block_device *, sector_t 
block, unsigned int size,
  gfp_t gfp);
 struct buffer_head *__bread_gfp(struct block_device *,
sector_t block, unsigned size, gfp_t gfp);
+
+extern bool bh_migration_done;
+
 void invalidate_bh_lrus(void);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a38963..9e4a2dc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,7 @@ extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 
 extern void migrate_prep(void);
+extern void migrate_finish(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +68,7 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode)
{ return -EBUSY; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline int migrate_finish(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8a..08c981d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -64,6 +64,19 @@
  */
 void migrate_prep(void)
 {
+   bh_migration_done = false;
+
+   /*
+* This barrier ensures that callers of bh_lru_install() between
+* the barrier and the call to invalidate_bh_lrus() read
+*  bh_migration_done as false.
+*/
+   /*
+* TODO: Remove me? lru_add_drain_all() already has an smp_mb(),
+* but it would be good to ensure that the barrier isn't forgotten.
+*/
+   smp_mb();
+
/*
 * Clear the LRU lists so pages can be isolated.
 * Note that pages may be moved off the LRU after we have
@@ -73,6 +86,11 @@ void migrate_prep(void)
lru_add_drain_all();
 }
 
+void migrate_finish(void)
+{
+   bh_migration_done = true;
+}
+
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
 void migrate_prep_local(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778..e4cb959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
}
+
+   migrate_finish();
+
if (ret < 0) {
putback_movable_pages(>migratepages);
return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d..97efc49 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -759,6 +760,8 @@ void lru_add_drain_all(void)
if (WARN_ON(!mm_percpu_wq))
return;
 
+   invalidate_bh_lrus();
+
/*
 * Guarantee pagevec counter stores visible by this CPU are visible to
 * other CPUs befo

[RFC] Invalidate BH LRU during page migration

2021-02-01 Thread Chris Goldsworthy
A page containing buffer_heads can be pinned if any of its constituent
buffer_heads belongs to the BH LRU cache [1]. After going through
several iterations of a patch that attempts to solve this by removing
BH entries inside of the drop_buffers() function, which in
the worst-case could be called for each migrated page, Minchan Kim
suggested that we invalidate the entire BH LRU once, just before we
start migrating pages.  Additionally, Matthew Wilcox suggested that
we invalidate the BH LRU inside of lru_add_drain_all(), so as to
benefit other functions that would be impacted by pinned pages [2].

TODO:
 - It should be possible to remove the initial setting of
   bh_migration_done = false; in migrate_prep by passing this in as a
   parameter to invalidate_bh_lru(), but we'd still need a matching
   bh_migration_done = true; call. 
 - To really benefit other callers of lru_add_drain_all() other than
   __alloc_contig_migrate_range() in the CMA allocation path, we'd need
  to add additional calls of bh_migration_done = false;

[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L1238
[2] 
https://lore.kernel.org/linux-fsdevel/cover.1611642038.git.cgold...@codeaurora.org/


Chris Goldsworthy (1):
  [RFC] mm: fs: Invalidate BH LRU during page migration

 fs/buffer.c |  6 ++
 include/linux/buffer_head.h |  3 +++
 include/linux/migrate.h |  2 ++
 mm/migrate.c| 18 ++
 mm/page_alloc.c |  3 +++
 mm/swap.c   |  3 +++
 6 files changed, 35 insertions(+)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: [PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-28 Thread Chris Goldsworthy

On 2021-01-28 09:08, Minchan Kim wrote:

On Thu, Jan 28, 2021 at 12:28:37AM -0800, Chris Goldsworthy wrote:

On 2021-01-26 18:59, Matthew Wilcox wrote:
> On Tue, Jan 26, 2021 at 02:59:17PM -0800, Minchan Kim wrote:
> > The release of the buffer_head in the LRU is a great improvement from
> > the migration point of view.
> >
> > A question:

Hey guys,

> > Can't we invalidate (e.g., invalidate_bh_lrus) the bh_lru in
> > migrate_prep or elsewhere when migration finds a failure and is
> > about to retry?
> >
> > Migration has done it that way for other per-cpu stuff for a long
> > time, which would be more consistent with the others and might
> > sometimes be faster by reducing IPI calls per page.
> Should lru_add_drain_all() also handle draining the buffer lru for all
> callers?  A quick survey ...
>
> invalidate_bdev() already calls invalidate_bh_lrus()
> compact_nodes() would probably benefit from the BH LRU being invalidated
> POSIX_FADV_DONTNEED would benefit if the underlying filesystem uses BHs
> check_and_migrate_cma_pages() would benefit
> khugepaged_do_scan() doesn't need it today
> scan_get_next_rmap_item() looks like it only works on anon pages (?)
> so doesn't need it
> mem_cgroup_force_empty() probably needs it
> mem_cgroup_move_charge() ditto
> memfd_wait_for_pins() doesn't need it
> shake_page() might benefit
> offline_pages() would benefit
> alloc_contig_range() would benefit
>
> Seems like most would benefit and a few won't care.  I think I'd lean
> towards having lru_add_drain_all() call invalidate_bh_lrus(), just to
> simplify things.


Doing this sounds like a good idea.  We would still need a call to
invalidate_bh_lrus() inside of drop_buffers() in the event that we find
busy buffers, since they can be re-added to the BH LRU - I believe it
isn't until this point, when we acquire the private_lock for the
mapping, that a BH can't be added back into the BH LRU:

https://elixir.bootlin.com/linux/v5.10.10/source/fs/buffer.c#L3240


I am not sure it's a good deal, considering the IPI overhead per page
release in the worst case.

An idea is to disable the bh_lrus in migrate_prep and enable them when
migration is done (we'd need to introduce "migrate_done").
It's a similar approach to marking a pageblock MIGRATE_ISOLATE to
disable the pcp lists during the migration.


I'll try creating that mechanism then for the BH LRU, and will come
back with a patch that does the invalidate in lru_add_drain_all().

Thanks Matthew and Minchan for the feedback!
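
A rough sketch of that suggestion (names assumed here; the later RFC in
this thread spells it out in full):

	/* Sketch only: pair migrate_prep() with a new migrate_finish(). */
	void migrate_prep(void)
	{
		bh_lru_disable();	/* stop caching new buffer_heads */
		lru_add_drain_all();	/* also drains the BH LRUs       */
	}

	void migrate_finish(void)
	{
		bh_lru_enable();	/* resume BH LRU caching */
	}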

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-28 Thread Chris Goldsworthy

On 2021-01-26 18:59, Matthew Wilcox wrote:

On Tue, Jan 26, 2021 at 02:59:17PM -0800, Minchan Kim wrote:

The release of the buffer_head in the LRU is a great improvement from
the migration point of view.

A question:


Hey guys,

Can't we invalidate (e.g., invalidate_bh_lrus) the bh_lru in
migrate_prep or elsewhere when migration finds a failure and is about
to retry?

Migration has done it that way for other per-cpu stuff for a long
time, which would be more consistent with the others and might
sometimes be faster by reducing IPI calls per page.

Should lru_add_drain_all() also handle draining the buffer lru for all
callers?  A quick survey ...

invalidate_bdev() already calls invalidate_bh_lrus()
compact_nodes() would probably benefit from the BH LRU being invalidated

POSIX_FADV_DONTNEED would benefit if the underlying filesystem uses BHs
check_and_migrate_cma_pages() would benefit
khugepaged_do_scan() doesn't need it today
scan_get_next_rmap_item() looks like it only works on anon pages (?) so
doesn't need it
mem_cgroup_force_empty() probably needs it
mem_cgroup_move_charge() ditto
memfd_wait_for_pins() doesn't need it
shake_page() might benefit
offline_pages() would benefit
alloc_contig_range() would benefit

Seems like most would benefit and a few won't care.  I think I'd lean
towards having lru_add_drain_all() call invalidate_bh_lrus(), just to
simplify things.



Doing this sounds like a good idea.  We would still need a call to
invalidate_bh_lrus() inside of drop_buffers() in the event that we find
busy buffers, since they can be re-added to the BH LRU - I believe it
isn't until this point, when we acquire the private_lock for the
mapping, that a BH can't be added back into the BH LRU:

https://elixir.bootlin.com/linux/v5.10.10/source/fs/buffer.c#L3240

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-26 Thread Chris Goldsworthy
From: Laura Abbott 

When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratable for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indications to
prevent dropping.

Note: a bug reported by "kernel test robot" led to a switch from
using xas_for_each() to xa_for_each().

Signed-off-by: Laura Abbott 
Signed-off-by: Chris Goldsworthy 
Cc: Matthew Wilcox 
Reported-by: kernel test robot 
---
 fs/buffer.c | 79 +
 1 file changed, 74 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..27516a0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1471,12 +1472,55 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return false;
 }
 
+static void __evict_bhs_lru(void *arg)
+{
+   struct bh_lru *b = &get_cpu_var(bh_lrus);
+   struct xarray *busy_bhs = arg;
+   struct buffer_head *bh;
+   unsigned long i, xarray_index;
+
+   xa_for_each(busy_bhs, xarray_index, bh) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh) {
+   brelse(b->bhs[i]);
+   b->bhs[i] = NULL;
+   break;
+   }
+   }
+   }
+
+   put_cpu_var(bh_lrus);
+}
+
+static bool page_has_bhs_in_lru(int cpu, void *arg)
+{
+   struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+   struct xarray *busy_bhs = arg;
+   struct buffer_head *bh;
+   unsigned long i, xarray_index;
+
+   xa_for_each(busy_bhs, xarray_index, bh) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh)
+   return true;
+   }
+   }
+
+   return false;
+
+}
 void invalidate_bh_lrus(void)
 {
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+static void evict_bh_lrus(struct xarray *busy_bhs)
+{
+   on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru,
+busy_bhs, 1);
+}
+
 void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
@@ -3242,14 +3286,38 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh;
+   struct xarray busy_bhs;
+   int bh_count = 0;
+   int xa_ret, ret = 0;
+
+   xa_init(&busy_bhs);
 
bh = head;
do {
-   if (buffer_busy(bh))
-   goto failed;
+   if (buffer_busy(bh)) {
+   xa_ret = xa_err(xa_store(&busy_bhs, bh_count++,
+bh, GFP_ATOMIC));
+   if (xa_ret)
+   goto out;
+   }
bh = bh->b_this_page;
} while (bh != head);
 
+   if (bh_count) {
+   /*
+* Check if the busy failure was due to an outstanding
+* LRU reference
+*/
+   evict_bh_lrus(&busy_bhs);
+   do {
+   if (buffer_busy(bh))
+   goto out;
+
+   bh = bh->b_this_page;
+   } while (bh != head);
+   }
+
+   ret = 1;
do {
struct buffer_head *next = bh->b_this_page;
 
@@ -3259,9 +3327,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
} while (bh != head);
*buffers_to_free = head;
detach_page_private(page);
-   return 1;
-failed:
-   return 0;
+out:
+   xa_destroy(&busy_bhs);
+
+   return ret;
 }
 
 int try_to_free_buffers(struct page *page)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v4] Resolve LRU page-pinning issue for file-backed pages

2021-01-26 Thread Chris Goldsworthy
It is possible for file-backed pages to end up in a contiguous memory area
(CMA), such that the relevant page must be migrated using the .migratepage()
callback when its backing physical memory is selected for use in a CMA
allocation (through cma_alloc()).  However, if a set of address space
operations (AOPs) for a file-backed page lacks a migratepage() call-back,
fallback_migrate_page() will be used instead, which through
try_to_release_page() calls try_to_free_buffers() (which is called directly or
through a callback).  try_to_free_buffers() in turn calls
drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with a page
being busy. However, it is possible that the buffer_head is on an LRU list for
a CPU, such that we can try removing the buffer_head from that list, in order
to successfully release the page.  Do this.

v1: 
https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190
Original version of the patch (with updates made to account for changes in
on_each_cpu_cond()).

v2: https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/
Follow Matthew Wilcox's suggestion of reducing the number of calls to
on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of
on_each_cpu_cond(). To copy from his e-mail, we go from:

for_each_buffer
for_each_cpu
for_each_lru_entry

to:

for_each_cpu
for_each_buffer
for_each_lru_entry

This is done using xarrays, which I found to be the cleanest data structure to
use, though a pre-allocated array of page_size(page) / bh->b_size elements might
be more performant (see the sketch after the changelog below).

v3: Replace xas_for_each() with xa_for_each() to account for proper locking.

v4: Fix an iteration error.
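
A sketch of that pre-allocated-array alternative (illustrative only, not
part of the series):

	struct busy_bh_array {
		struct buffer_head **bhs;
		int nr;
	};

	/*
	 * Collect the page's busy buffer_heads into a fixed array sized
	 * page_size(page) / b_size instead of an xarray.
	 */
	static int collect_busy_bhs(struct page *page, struct busy_bh_array *busy)
	{
		struct buffer_head *head = page_buffers(page);
		struct buffer_head *bh = head;
		int max = page_size(page) / head->b_size;

		busy->bhs = kmalloc_array(max, sizeof(*busy->bhs), GFP_ATOMIC);
		if (!busy->bhs)
			return -ENOMEM;
		busy->nr = 0;

		do {
			if (buffer_busy(bh))
				busy->bhs[busy->nr++] = bh;
			bh = bh->b_this_page;
		} while (bh != head);

		return 0;
	}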

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c | 79 +
 1 file changed, 74 insertions(+), 5 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v3] Resolve LRU page-pinning issue for file-backed pages

2021-01-13 Thread Chris Goldsworthy
It is possible for file-backed pages to end up in a contiguous memory area
(CMA), such that the relevant page must be migrated using the .migratepage()
callback when its backing physical memory is selected for use in a CMA
allocation (through cma_alloc()).  However, if a set of address space
operations (AOPs) for a file-backed page lacks a migratepage() call-back,
fallback_migrate_page() will be used instead, which through
try_to_release_page() calls try_to_free_buffers() (which is called directly or
through a callback).  try_to_free_buffers() in turn calls
drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with a page
being busy. However, it is possible that the buffer_head is on an LRU list for
a CPU, such that we can try removing the buffer_head from that list, in order
to successfully release the page.  Do this.

v1: 
https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190
Original version of the patch (with updates made to account for changes in
on_each_cpu_cond()).

v2: https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/
Follow Matthew Wilcox's suggestion of reducing the number of calls to
on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of
on_each_cpu_cond(). To copy from his e-mail, we go from:

for_each_buffer
for_each_cpu
for_each_lru_entry

to:

for_each_cpu
for_each_buffer
for_each_lru_entry

This is done using xarrays, which I found to be the cleanest data structure to
use, though a pre-allocated array of page_size(page) / bh->b_size elements might
be more performant.

v3: Replace xas_for_each() with xa_for_each() to account for proper locking.

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c | 81 +
 1 file changed, 76 insertions(+), 5 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v3] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-13 Thread Chris Goldsworthy
From: Laura Abbott 

When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratable for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indications to
prevent dropping.

Note: a bug reported by "kernel test robot" led to a switch from
using xas_for_each() to xa_for_each().

Signed-off-by: Laura Abbott 
Signed-off-by: Chris Goldsworthy 
Cc: Matthew Wilcox 
Reported-by: kernel test robot 
---
 fs/buffer.c | 81 +
 1 file changed, 76 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..d2d1237 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1471,12 +1472,59 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return false;
 }
 
+static void __evict_bhs_lru(void *arg)
+{
+   struct bh_lru *b = &get_cpu_var(bh_lrus);
+   struct xarray *busy_bhs = arg;
+   struct buffer_head *bh;
+   unsigned long i, xarray_index;
+
+   xa_for_each(busy_bhs, xarray_index, bh) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh) {
+   brelse(b->bhs[i]);
+   b->bhs[i] = NULL;
+   break;
+   }
+   }
+
+   bh = bh->b_this_page;
+   }
+
+   put_cpu_var(bh_lrus);
+}
+
+static bool page_has_bhs_in_lru(int cpu, void *arg)
+{
+   struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+   struct xarray *busy_bhs = arg;
+   struct buffer_head *bh;
+   unsigned long i, xarray_index;
+
+   xa_for_each(busy_bhs, xarray_index, bh) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh)
+   return true;
+   }
+
+   bh = bh->b_this_page;
+   }
+
+   return false;
+
+}
 void invalidate_bh_lrus(void)
 {
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+static void evict_bh_lrus(struct xarray *busy_bhs)
+{
+   on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru,
+busy_bhs, 1);
+}
+
 void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
@@ -3242,14 +3290,36 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh;
+   struct xarray busy_bhs;
+   int bh_count = 0;
+   int xa_ret, ret = 0;
+
+   xa_init(&busy_bhs);
 
bh = head;
do {
-   if (buffer_busy(bh))
-   goto failed;
+   if (buffer_busy(bh)) {
+   xa_ret = xa_err(xa_store(&busy_bhs, bh_count++,
+bh, GFP_ATOMIC));
+   if (xa_ret)
+   goto out;
+   }
bh = bh->b_this_page;
} while (bh != head);
 
+   if (bh_count) {
+   /*
+* Check if the busy failure was due to an outstanding
+* LRU reference
+*/
+   evict_bh_lrus(&busy_bhs);
+   do {
+   if (buffer_busy(bh))
+   goto out;
+   } while (bh != head);
+   }
+
+   ret = 1;
do {
struct buffer_head *next = bh->b_this_page;
 
@@ -3259,9 +3329,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
} while (bh != head);
*buffers_to_free = head;
detach_page_private(page);
-   return 1;
-failed:
-   return 0;
+out:
+   xa_destroy(&busy_bhs);
+
+   return ret;
 }
 
 int try_to_free_buffers(struct page *page)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: [PATCH] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-05 Thread Chris Goldsworthy

On 2020-11-24 07:39, Matthew Wilcox wrote:

On Mon, Nov 23, 2020 at 10:49:38PM -0800, Chris Goldsworthy wrote:

+static void __evict_bh_lru(void *arg)
+{
+   struct bh_lru *b = &get_cpu_var(bh_lrus);
+   struct buffer_head *bh = arg;
+   int i;
+
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh) {
+   brelse(b->bhs[i]);
+   b->bhs[i] = NULL;
+   goto out;


That's an odd way to spell 'break' ...


+   }
+   }
+out:
+   put_cpu_var(bh_lrus);
+}


...

@@ -3245,8 +3281,15 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)


bh = head;
do {
-   if (buffer_busy(bh))
-   goto failed;
+   if (buffer_busy(bh)) {
+   /*
+* Check if the busy failure was due to an
+* outstanding LRU reference
+*/
+   evict_bh_lrus(bh);
+   if (buffer_busy(bh))
+   goto failed;


Hi Matthew,

Apologies for the delayed response.


We might be better off just calling invalidate_bh_lrus() -- we'd flush
the entire LRU, but we'd only need to do it once, not once per buffer 
head.


I'm concerned about emptying the cache, such that those who might
benefit from it would be adversely affected.

We could have a more complex 'evict' that iterates each busy buffer on a
page so transforming:
page so transforming:

for_each_buffer
for_each_cpu
for_each_lru_entry

to:

for_each_cpu
for_each_buffer
for_each_lru_entry

(and i suggest that way because it's more expensive to iterate the
buffers than it is to iterate the lru entries)


I've gone ahead and done this in a follow-up patch:

https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/

There might be room for improvement in the data structure being used to
track the used entries - using an xarray gives the cleanest code, but
pre-allocating an array to hold up to page_size(page) / bh->b_size
entries might be faster, although it would be a bit uglier to do in a
way that doesn't reduce the performance of the case when evict_bh_lru()
doesn't need to be called.

Regards,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH v2] fs/buffer.c: Revoke LRU when trying to drop buffers

2021-01-04 Thread Chris Goldsworthy
From: Laura Abbott 

When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratable for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indications to
prevent dropping.

Signed-off-by: Laura Abbott 
Signed-off-by: Chris Goldsworthy 
Cc: Matthew Wilcox 
---
 fs/buffer.c   | 85 +++
 fs/internal.h |  5 
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..536fb5b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -48,6 +48,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "internal.h"
 
@@ -1471,12 +1472,63 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return false;
 }
 
+static void __evict_bhs_lru(void *arg)
+{
+   struct bh_lru *b = &get_cpu_var(bh_lrus);
+   struct busy_bhs_container *busy_bhs = arg;
+   struct buffer_head *bh;
+   int i;
+
+   XA_STATE(xas, &busy_bhs->xarray, 0);
+
+   xas_for_each(&xas, bh, busy_bhs->size) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh) {
+   brelse(b->bhs[i]);
+   b->bhs[i] = NULL;
+   break;
+   }
+   }
+
+   bh = bh->b_this_page;
+   }
+
+   put_cpu_var(bh_lrus);
+}
+
+static bool page_has_bhs_in_lru(int cpu, void *arg)
+{
+   struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+   struct busy_bhs_container *busy_bhs = arg;
+   struct buffer_head *bh;
+   int i;
+
+   XA_STATE(xas, &busy_bhs->xarray, 0);
+
+   xas_for_each(&xas, bh, busy_bhs->size) {
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh)
+   return true;
+   }
+
+   bh = bh->b_this_page;
+   }
+
+   return false;
+
+}
 void invalidate_bh_lrus(void)
 {
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+static void evict_bh_lrus(struct busy_bhs_container *busy_bhs)
+{
+   on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru,
+busy_bhs, 1);
+}
+
 void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
@@ -3242,14 +3294,36 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
struct buffer_head *head = page_buffers(page);
struct buffer_head *bh;
+   struct busy_bhs_container busy_bhs;
+   int xa_ret, ret = 0;
+
+   xa_init(&busy_bhs.xarray);
+   busy_bhs.size = 0;
 
bh = head;
do {
-   if (buffer_busy(bh))
-   goto failed;
+   if (buffer_busy(bh)) {
+   xa_ret = xa_err(xa_store(&busy_bhs.xarray, busy_bhs.size++,
+bh, GFP_ATOMIC));
+   if (xa_ret)
+   goto out;
+   }
bh = bh->b_this_page;
} while (bh != head);
 
+   if (busy_bhs.size) {
+   /*
+* Check if the busy failure was due to an outstanding
+* LRU reference
+*/
+   evict_bh_lrus(&busy_bhs);
+   do {
+   if (buffer_busy(bh))
+   goto out;
+   } while (bh != head);
+   }
+
+   ret = 1;
do {
struct buffer_head *next = bh->b_this_page;
 
@@ -3259,9 +,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
} while (bh != head);
*buffers_to_free = head;
detach_page_private(page);
-   return 1;
-failed:
-   return 0;
+out:
+   xa_destroy(&busy_bhs.xarray);
+
+   return ret;
 }
 
 int try_to_free_buffers(struct page *page)
diff --git a/fs/internal.h b/fs/internal.h
index 77c50be..00f17c4 100644
--- a/fs/internal.h
+++ b/fs/internal.h
@@

[PATCH v2] Resolve LRU page-pinning issue for file-backed pages

2021-01-04 Thread Chris Goldsworthy
It is possible for file-backed pages to end up in a contiguous memory area
(CMA), such that the relevant page must be migrated using the .migratepage()
callback when its backing physical memory is selected for use in a CMA
allocation (through cma_alloc()).  However, if a set of address space
operations (AOPs) for a file-backed page lacks a migratepage() call-back,
fallback_migrate_page() will be used instead, which through
try_to_release_page() calls try_to_free_buffers() (which is called directly or
through a callback).  try_to_free_buffers() in turn calls
drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with a page
being busy. However, it is possible that the buffer_head is on an LRU list for
a CPU, such that we can try removing the buffer_head from that list, in order
to successfully release the page.  Do this.

v1: 
https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190
Original version of the patch (with updates made to account for changes in
on_each_cpu_cond()).

v2: Follow Matthew Wilcox's suggestion of reducing the number of calls to
on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of
on_each_cpu_cond(). To copy from his e-mail, we go from:

for_each_buffer
for_each_cpu
for_each_lru_entry

to:

for_each_cpu
for_each_buffer
for_each_lru_entry

This is done using xarrays, which I found to be the cleanest data structure to
use, though a pre-allocated array of page_size(page) / bh->b_size elements might
be more performant.

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c   | 85 +++
 fs/internal.h |  5 
 2 files changed, 85 insertions(+), 5 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH] fs/buffer.c: Revoke LRU when trying to drop buffers

2020-11-23 Thread Chris Goldsworthy
From: Laura Abbott 

When a buffer is added to the LRU list, a reference is taken which is
not dropped until the buffer is evicted from the LRU list. This is the
correct behavior, however this LRU reference will prevent the buffer
from being dropped. This means that the buffer can't actually be dropped
until it is selected for eviction. There's no bound on the time spent
on the LRU list, which means that the buffer may be undroppable for
very long periods of time. Given that migration involves dropping
buffers, the associated page is now unmigratable for long periods of
time as well. CMA relies on being able to migrate a specific range
of pages, so these types of failures make CMA significantly
less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out
the buffer, explicitly remove the buffer from the LRU list when trying
to drop it. There is still the possibility that the buffer
could be added back on the list, but that indicates the buffer is
still in use and would probably have other 'in use' indications to
prevent dropping.

Signed-off-by: Laura Abbott 
Signed-off-by: Chris Goldsworthy 
---
 fs/buffer.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 64564ac..1751f0b 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1471,12 +1471,48 @@ static bool has_bh_in_lru(int cpu, void *dummy)
return false;
 }
 
+static void __evict_bh_lru(void *arg)
+{
+   struct bh_lru *b = &get_cpu_var(bh_lrus);
+   struct buffer_head *bh = arg;
+   int i;
+
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh) {
+   brelse(b->bhs[i]);
+   b->bhs[i] = NULL;
+   goto out;
+   }
+   }
+out:
+   put_cpu_var(bh_lrus);
+}
+
+static bool bh_exists_in_lru(int cpu, void *arg)
+{
+   struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+   struct buffer_head *bh = arg;
+   int i;
+
+   for (i = 0; i < BH_LRU_SIZE; i++) {
+   if (b->bhs[i] == bh)
+   return true;
+   }
+
+   return false;
+
+}
 void invalidate_bh_lrus(void)
 {
on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);
 
+static void evict_bh_lrus(struct buffer_head *bh)
+{
+   on_each_cpu_cond(bh_exists_in_lru, __evict_bh_lru, bh, 1);
+}
+
 void set_bh_page(struct buffer_head *bh,
struct page *page, unsigned long offset)
 {
@@ -3245,8 +3281,15 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 
bh = head;
do {
-   if (buffer_busy(bh))
-   goto failed;
+   if (buffer_busy(bh)) {
+   /*
+* Check if the busy failure was due to an
+* outstanding LRU reference
+*/
+   evict_bh_lrus(bh);
+   if (buffer_busy(bh))
+   goto failed;
+   }
bh = bh->b_this_page;
} while (bh != head);
 
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH] Resolve LRU page-pinning issue for file-backed pages

2020-11-23 Thread Chris Goldsworthy
It is possible for file-backed pages to end up in a contiguous memory
area (CMA), such that the relevant page must be migrated using the
.migratepage() callback when its backing physical memory is selected
for use in a CMA allocation (through cma_alloc()).  However, if a set
of address space operations (AOPs) for a file-backed page lacks a
migratepage() call-back, fallback_migrate_page() will be used
instead, which through try_to_release_page() calls
try_to_free_buffers() (which is called directly or through a
callback).  try_to_free_buffers() in turn calls
drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with
a page being busy. However, it is possible that the buffer_head is on
an LRU list for a CPU, such that we can try removing the buffer_head
from that list, in order to successfully release the page.  Do this.
For reference ext4_journalled_aops is a set of AOPs that lacks the
migratepage() callback.

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c | 47 +--
 1 file changed, 45 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH 0/2] Increasing CMA Utilization with a GFP Flag

2020-11-02 Thread Chris Goldsworthy
The current approach to increasing CMA utilization introduced in
commit 16867664936e ("mm,page_alloc,cma: conditionally prefer cma
pageblocks for movable allocations") increases CMA utilization by
redirecting MIGRATE_MOVABLE allocations to a CMA region, when
greater than half of the free pages in a given zone are CMA pages.
The issue with this approach is that allocations of type
MIGRATE_MOVABLE can still succumb to pinning. To get around
this, one approach is to redirect to CMA areas only those allocations
that are known not to be victims of pinning.

To this end, this series brings in __GFP_CMA, which we mark with
allocations that we know are safe to be redirected to a CMA area.
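
As a small illustration of the intent (sketch only, mirroring the
highmem.h hunk in patch 1): a caller that knows its pages are safe from
long-term pinning opts in by adding __GFP_CMA to an otherwise movable
mask, e.g.:

	page = alloc_page(GFP_HIGHUSER | __GFP_MOVABLE | __GFP_CMA);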

Heesub Shin (1):
  cma: redirect page allocation to CMA

Vinayak Menon (1):
  zram: allow zram to allocate CMA pages

 drivers/block/zram/zram_drv.c |  5 +--
 include/linux/gfp.h   | 15 
 include/linux/highmem.h   |  4 ++-
 include/linux/mmzone.h|  4 +++
 mm/page_alloc.c   | 83 +++
 mm/zsmalloc.c |  4 +--
 6 files changed, 79 insertions(+), 36 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH 2/2] zram: allow zram to allocate CMA pages

2020-11-02 Thread Chris Goldsworthy
From: Vinayak Menon 

Though zram pages are movable, they aren't allowed to enter
MIGRATE_CMA pageblocks. zram is not seen to pin pages for
long, so this should not cause an issue. Moreover, allowing zram to
pick CMA pages can be helpful in cases seen where zram order-0
allocations fail when there are lots of free CMA pages, resulting
in kswapd or direct reclaim not making enough progress.

Signed-off-by: Vinayak Menon 
Signed-off-by: Chris Goldsworthy 
---
 drivers/block/zram/zram_drv.c | 5 +++--
 mm/zsmalloc.c | 4 ++--
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c
index 9a1e6ee..4b6b16d 100644
--- a/drivers/block/zram/zram_drv.c
+++ b/drivers/block/zram/zram_drv.c
@@ -1380,13 +1380,14 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec,
__GFP_KSWAPD_RECLAIM |
__GFP_NOWARN |
__GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   __GFP_MOVABLE |
+   __GFP_CMA);
if (!handle) {
zcomp_stream_put(zram->comp);
atomic64_inc(&zram->stats.writestall);
handle = zs_malloc(zram->mem_pool, comp_len,
GFP_NOIO | __GFP_HIGHMEM |
-   __GFP_MOVABLE);
+   __GFP_MOVABLE | __GFP_CMA);
if (handle)
goto compress_again;
return -ENOMEM;
diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index b03bee2..16ba318 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -351,7 +351,7 @@ static void destroy_cache(struct zs_pool *pool)
 static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp)
 {
return (unsigned long)kmem_cache_alloc(pool->handle_cachep,
-   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_CMA));
 }
 
 static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
@@ -362,7 +362,7 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle)
 static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags)
 {
return kmem_cache_alloc(pool->zspage_cachep,
-   flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE));
+   flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_CMA));
 }
 
 static void cache_free_zspage(struct zs_pool *pool, struct zspage *zspage)
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH 1/2] cma: redirect page allocation to CMA

2020-11-02 Thread Chris Goldsworthy
From: Heesub Shin 

CMA pages are designed to be used as fallback for movable allocations
and cannot be used for non-movable allocations. If CMA pages are
utilized poorly, non-movable allocations may end up getting starved if
all regular movable pages are allocated and the only pages left are
CMA. Always using CMA pages first creates unacceptable performance
problems. As a midway alternative, use CMA pages for certain
userspace allocations. The userspace pages can be migrated or dropped
quickly, which gives decent utilization.

Signed-off-by: Kyungmin Park 
Signed-off-by: Heesub Shin 
Signed-off-by: Vinayak Menon 
[cgold...@codeaurora.org: Place in bugfixes]
Signed-off-by: Chris Goldsworthy 
---
 include/linux/gfp.h | 15 +
 include/linux/highmem.h |  4 ++-
 include/linux/mmzone.h  |  4 +++
 mm/page_alloc.c | 83 +++--
 4 files changed, 74 insertions(+), 32 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c603237..e80b7d2 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -39,11 +39,21 @@ struct vm_area_struct;
 #define ___GFP_HARDWALL0x10u
 #define ___GFP_THISNODE0x20u
 #define ___GFP_ACCOUNT 0x40u
+#ifdef CONFIG_CMA
+#define ___GFP_CMA 0x80u
+#else
+#define ___GFP_CMA 0
+#endif
 #ifdef CONFIG_LOCKDEP
+#ifdef CONFIG_CMA
+#define ___GFP_NOLOCKDEP   0x100u
+#else
 #define ___GFP_NOLOCKDEP   0x80u
+#endif
 #else
 #define ___GFP_NOLOCKDEP   0
 #endif
+
 /* If the above are modified, __GFP_BITS_SHIFT may need updating */
 
 /*
@@ -57,6 +67,7 @@ struct vm_area_struct;
 #define __GFP_HIGHMEM  ((__force gfp_t)___GFP_HIGHMEM)
 #define __GFP_DMA32((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE  ((__force gfp_t)___GFP_MOVABLE)  /* ZONE_MOVABLE allowed */
+#define __GFP_CMA  ((__force gfp_t)___GFP_CMA)
 #define GFP_ZONEMASK   (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE)
 
 /**
@@ -224,7 +235,11 @@ struct vm_area_struct;
 #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP)
 
 /* Room for N __GFP_FOO bits */
+#ifdef CONFIG_CMA
+#define __GFP_BITS_SHIFT (24 + IS_ENABLED(CONFIG_LOCKDEP))
+#else
 #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP))
+#endif
 #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1))
 
 /**
diff --git a/include/linux/highmem.h b/include/linux/highmem.h
index 14e6202..35f052b 100644
--- a/include/linux/highmem.h
+++ b/include/linux/highmem.h
@@ -274,7 +274,9 @@ static inline struct page *
 alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma,
unsigned long vaddr)
 {
-   return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr);
+   return __alloc_zeroed_user_highpage(
+   __GFP_MOVABLE|__GFP_CMA, vma,
+   vaddr);
 }
 
 static inline void clear_highpage(struct page *page)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index fb3bf69..3f913be 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -471,6 +471,10 @@ struct zone {
struct pglist_data  *zone_pgdat;
struct per_cpu_pageset __percpu *pageset;
 
+#ifdef CONFIG_CMA
+   boolcma_alloc;
+#endif
+
 #ifndef CONFIG_SPARSEMEM
/*
 * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d772206..f938de7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2860,35 +2860,34 @@ __rmqueue(struct zone *zone, unsigned int order, int 
migratetype,
 {
struct page *page;
 
-#ifdef CONFIG_CMA
-   /*
-* Balance movable allocations between regular and CMA areas by
-* allocating from CMA when over half of the zone's free memory
-* is in the CMA area.
-*/
-   if (alloc_flags & ALLOC_CMA &&
-   zone_page_state(zone, NR_FREE_CMA_PAGES) >
-   zone_page_state(zone, NR_FREE_PAGES) / 2) {
-   page = __rmqueue_cma_fallback(zone, order);
-   if (page)
-   return page;
-   }
-#endif
 retry:
page = __rmqueue_smallest(zone, order, migratetype);
-   if (unlikely(!page)) {
-   if (alloc_flags & ALLOC_CMA)
-   page = __rmqueue_cma_fallback(zone, order);
 
-   if (!page && __rmqueue_fallback(zone, order, migratetype,
-   alloc_flags))
-   goto retry;
-   }
+   if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype,
+ alloc_flags))
+   goto retry;
 
trace_mm_page_alloc_zone_locked(page, order, migratetype);
return page;
 }
 
+static struct page *__rmqueue_cma(struct zone *zone, unsigned int order,
+   

Re: [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc

2020-10-09 Thread Chris Goldsworthy

On 2020-09-28 22:59, Christoph Hellwig wrote:

On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote:
CMA allocations will fail if 'pinned' pages are in a CMA area, since 
we
cannot migrate pinned pages. The _refcount of a struct page being 
greater
than _mapcount for that page can cause pinning for anonymous pages.  
This
is because try_to_unmap(), which (1) is called in the CMA allocation 
path,

and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  
This

implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, that _refcount will be greater than 0.  
Later
in the CMA allocation path in migrate_page_move_mapping(), we will 
have one

more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by 
either
incrementing _refcount or decrementing _mapcount), such that the 
process is
context switched out after modifying one refcount but before modifying 
the

other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside 
of

zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory.  Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we 
have
that _refcount > _mapcount.  _refcount can only be decremented after a 
TLB
flush is performed for the page - this doesn't occur until enough 
pages
have been batched together for flushing.  The flush can either occur 
inside
of zap_pte_range() (during the same invocation or a later one), or if 
there
aren't enough pages collected by the time we unmap all of the pages in 
a
process, the flush will occur in tlb_finish_mmu() in exit_mmap().  
After

the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which 
is

called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations 
indefinitely,
in the event that alloc_contig_range() returns -EBUSY after having 
scanned

a whole CMA-region bitmap.


And who is going to use this?  As-is this just seems to add code that
isn't actually used and thus not actually tested.  (In addition to being
a really bad idea, as discussed before.)


Hi Christoph,

That had slipped my mind - what we would have submitted is a modified 
drivers/dma-buf/heaps/cma_heap.c, which would have created a 
"linux,cma-nofail" heap that, when allocated from, passes __GFP_NOFAIL 
to cma_alloc().  But since this retry approach (finite and infinite) 
has effectively been nacked, I've gone back to the drawing board to find 
either (1) a lock-based approach to solving this (as posed by Andrew 
Morton here: https://lkml.org/lkml/2020/8/21/1490), or (2) using 
preempt_disable() calls.
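
For what it's worth, a rough sketch of what that heap could have looked 
like (illustrative only - the nofail member and helper below are made up 
for the example, and this assumes the gfp_t cma_alloc() signature from 
this series):

/*
 * Sketch only, never submitted: a per-heap "nofail" flag that the CMA
 * heap would consult when calling cma_alloc().  The struct member and
 * the heap name are illustrative, not an existing API.
 */
struct cma_heap {
	struct dma_heap *heap;
	struct cma *cma;
	bool nofail;		/* set for the "linux,cma-nofail" heap */
};

static struct page *cma_heap_alloc_pages(struct cma_heap *cma_heap,
					 size_t nr_pages, unsigned int align)
{
	gfp_t gfp = cma_heap->nofail ? __GFP_NOFAIL : 0;

	/* cma_alloc() retries indefinitely when __GFP_NOFAIL is set. */
	return cma_alloc(cma_heap->cma, nr_pages, align, gfp);
}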


Thanks,

Chris.


--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct 
device *dev, size_t count,

if (align > CONFIG_CMA_ALIGNMENT)
align = CONFIG_CMA_ALIGNMENT;

-   return cma_alloc(dev_get_cma_area(dev), count, align, no_warn);
+	return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? 
__GFP_NOWARN : 0);


Also don't add pointlessly overlong lines.


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: Question regarding ext4_journalled_aops: lack of migrate_page

2020-10-09 Thread Chris Goldsworthy

On 2020-10-08 23:42, Chris Goldsworthy wrote:

Hi there,

ext4_aops and ext4_da_aops both have a migratepage callback, whereas
ext4_journalled_aops lacks such a callback.  Why is this so?  I’m
asking this due to the following: when a page containing EXT4 journal
buffer heads ends up being migrated, fallback_migrate_page() is used,
which eventually calls try_to_free_buffers(), which will call
drop_buffers().  drop_buffers() can fail for a page if that page is on
the LRU list (see
https://elixir.bootlin.com/linux/v5.8.14/source/fs/buffer.c#L3225).
Now, if buffer_migrate_page() was supplied as the migratepage callback
for the journaled aops, this wouldn’t be problem since we ignore the
LRU lists altogether.

Resolving this issue will benefit CMA allocations, which might have to
migrate movable pages that were allocated from a CMA region (the
assumption is that these pages can be migrated once the memory backing
these pages is needed).

Thanks,

Chris.


An alternative solution to adding a migratepage callback is to 
actually remove the buffer heads from the LRU list in drop_buffers() - we 
tried upstreaming such a patch some time ago: 
https://lore.kernel.org/patchwork/patch/325406/ .  That would solve the 
problem and would avoid having to add a migratepage callback to any 
other file systems whose aops lack one.  It didn't 
get off the ground, however, so I'm picking up the line of questioning left 
in the previous thread, to determine why the ext4 journalled aops don't 
have a migratepage callback.
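
For illustration, supplying the callback would look something like the 
below (a sketch only, mirroring the callback that ext4_aops already uses 
- whether buffer_migrate_page() is actually safe for journalled data is 
exactly the open question here, so this is not a proposed patch):

/* Sketch: give the journalled aops the same migratepage callback that
 * ext4_aops already has, instead of falling back to
 * fallback_migrate_page() -> try_to_free_buffers().
 */
static const struct address_space_operations ext4_journalled_aops = {
	/* ... existing callbacks unchanged ... */
	.migratepage	= buffer_migrate_page,
};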


Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Question regarding ext4_journalled_aops: lack of migrate_page

2020-10-09 Thread Chris Goldsworthy

Hi there,

ext4_aops and ext4_da_aops both have a migratepage callback, whereas 
ext4_journalled_aops lacks such a callback.  Why is this so?  I’m asking 
this due to the following: when a page containing EXT4 journal buffer 
heads ends up being migrated, fallback_migrate_page() is used, which 
eventually calls try_to_free_buffers(), which will call drop_buffers().  
drop_buffers() can fail for a page if that page is on the LRU list (see 
https://elixir.bootlin.com/linux/v5.8.14/source/fs/buffer.c#L3225).   
Now, if buffer_migrate_page() was supplied as the migratepage callback 
for the journalled aops, this wouldn’t be a problem since we ignore the LRU 
lists altogether.


Resolving this issue will benefit CMA allocations, which might have to 
migrate movable pages that were allocated from a CMA region (the 
assumption is that these pages can be migrated once the memory backing 
these pages is needed).


Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [RFC][PATCH 5/6] dma-buf: system_heap: Add pagepool support to system heap

2020-10-01 Thread Chris Goldsworthy

On 2020-09-29 21:46, Chris Goldsworthy wrote:

On 2020-09-25 21:24, John Stultz wrote:

Reuse/abuse the pagepool code from the network code to speed
up allocation performance.

This is similar to the ION pagepool usage, but tries to
utilize generic code instead of a custom implementation.

Cc: Sumit Semwal 
Cc: Liam Mark 
Cc: Laura Abbott 
Cc: Brian Starkey 
Cc: Hridya Valsaraju 
Cc: Suren Baghdasaryan 
Cc: Sandeep Patil 
Cc: Ørjan Eide 
Cc: Robin Murphy 
Cc: Ezequiel Garcia 
Cc: Simon Ser 
Cc: James Jones 
Cc: linux-me...@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Signed-off-by: John Stultz 
---
 drivers/dma-buf/heaps/Kconfig   |  1 +
 drivers/dma-buf/heaps/system_heap.c | 32 
+

 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/heaps/Kconfig 
b/drivers/dma-buf/heaps/Kconfig

index a5eef06c4226..f13cde4321b1 100644
--- a/drivers/dma-buf/heaps/Kconfig
+++ b/drivers/dma-buf/heaps/Kconfig
@@ -1,6 +1,7 @@
 config DMABUF_HEAPS_SYSTEM
bool "DMA-BUF System Heap"
depends on DMABUF_HEAPS
+   select PAGE_POOL
help
 	  Choose this option to enable the system dmabuf heap. The system 
heap

  is backed by pages from the buddy allocator. If in doubt, say Y.
diff --git a/drivers/dma-buf/heaps/system_heap.c
b/drivers/dma-buf/heaps/system_heap.c
index 882a632e9bb7..9f57b4c8ae69 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 

 struct dma_heap *sys_heap;

@@ -46,6 +47,7 @@ struct dma_heap_attachment {
 static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, 
LOW_ORDER_GFP};

 static const unsigned int orders[] = {8, 4, 0};
 #define NUM_ORDERS ARRAY_SIZE(orders)
+struct page_pool *pools[NUM_ORDERS];

 static struct sg_table *dup_sg_table(struct sg_table *table)
 {
@@ -264,13 +266,17 @@ static void system_heap_dma_buf_release(struct
dma_buf *dmabuf)
struct system_heap_buffer *buffer = dmabuf->priv;
struct sg_table *table;
struct scatterlist *sg;
-   int i;
+   int i, j;

	table = &buffer->sg_table;
for_each_sg(table->sgl, sg, table->nents, i) {
struct page *page = sg_page(sg);

-   __free_pages(page, compound_order(page));
+   for (j = 0; j < NUM_ORDERS; j++) {
+   if (compound_order(page) == orders[j])
+   break;
+   }
+   page_pool_put_full_page(pools[j], page, false);
}
sg_free_table(table);
kfree(buffer);
@@ -300,8 +306,7 @@ static struct page
*alloc_largest_available(unsigned long size,
continue;
if (max_order < orders[i])
continue;
-
-   page = alloc_pages(order_flags[i], orders[i]);
+   page = page_pool_alloc_pages(pools[i], order_flags[i]);
if (!page)
continue;
return page;
@@ -406,6 +411,25 @@ static const struct dma_heap_ops system_heap_ops 
= {

 static int system_heap_create(void)
 {
struct dma_heap_export_info exp_info;
+   int i;
+
+   for (i = 0; i < NUM_ORDERS; i++) {
+   struct page_pool_params pp;
+
+   memset(&pp, 0, sizeof(pp));
+   pp.order = orders[i];
+   pp.dma_dir = DMA_BIDIRECTIONAL;


Hey John,

Correct me if I'm wrong, but I think that in order for pp.dma_dir to be 
used in either page_pool_alloc_pages() or page_pool_put_full_page(), we 
need to at least have PP_FLAG_DMA_MAP set (and to have 
page_pool_dma_sync_for_device() called, PP_FLAG_DMA_SYNC_DEV should also 
be set, I think).  I think you'd also need to have pp->dev set.  Are 
we setting dma_dir with the intention of doing the necessary CMOs before 
we start using the page?
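
Just to make the question concrete, I'd have expected something along 
these lines if the pool were meant to own the DMA mapping and syncs (a 
sketch only - "heap_dev" is a placeholder, since the system heap has no 
single backing device, which is part of what I'm asking about):

struct page_pool_params pp = {
	.flags	 = PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV,
	.order	 = orders[i],
	.dev	 = heap_dev,			/* placeholder device */
	.dma_dir = DMA_BIDIRECTIONAL,
	.max_len = PAGE_SIZE << orders[i],	/* sync length for DMA_SYNC_DEV */
};

pools[i] = page_pool_create(&pp);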


Thanks,

Chris.


+   pools[i] = page_pool_create(&pp);
+
+   if (IS_ERR(pools[i])) {
+   int j;
+
+   pr_err("%s: page pool creation failed!\n", __func__);
+   for (j = 0; j < i; j++)
+   page_pool_destroy(pools[j]);
+   return PTR_ERR(pools[i]);
+   }
+   }

exp_info.name = "system";
	exp_info.ops = &system_heap_ops;


This is cool, I didn't know about this pooling code under /net/core.
Nice and compact.


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [RFC][PATCH 5/6] dma-buf: system_heap: Add pagepool support to system heap

2020-09-29 Thread Chris Goldsworthy

On 2020-09-25 21:24, John Stultz wrote:

Reuse/abuse the pagepool code from the network code to speed
up allocation performance.

This is similar to the ION pagepool usage, but tries to
utilize generic code instead of a custom implementation.

Cc: Sumit Semwal 
Cc: Liam Mark 
Cc: Laura Abbott 
Cc: Brian Starkey 
Cc: Hridya Valsaraju 
Cc: Suren Baghdasaryan 
Cc: Sandeep Patil 
Cc: Ørjan Eide 
Cc: Robin Murphy 
Cc: Ezequiel Garcia 
Cc: Simon Ser 
Cc: James Jones 
Cc: linux-me...@vger.kernel.org
Cc: dri-de...@lists.freedesktop.org
Signed-off-by: John Stultz 
---
 drivers/dma-buf/heaps/Kconfig   |  1 +
 drivers/dma-buf/heaps/system_heap.c | 32 +
 2 files changed, 29 insertions(+), 4 deletions(-)

diff --git a/drivers/dma-buf/heaps/Kconfig 
b/drivers/dma-buf/heaps/Kconfig

index a5eef06c4226..f13cde4321b1 100644
--- a/drivers/dma-buf/heaps/Kconfig
+++ b/drivers/dma-buf/heaps/Kconfig
@@ -1,6 +1,7 @@
 config DMABUF_HEAPS_SYSTEM
bool "DMA-BUF System Heap"
depends on DMABUF_HEAPS
+   select PAGE_POOL
help
 	  Choose this option to enable the system dmabuf heap. The system 
heap

  is backed by pages from the buddy allocator. If in doubt, say Y.
diff --git a/drivers/dma-buf/heaps/system_heap.c
b/drivers/dma-buf/heaps/system_heap.c
index 882a632e9bb7..9f57b4c8ae69 100644
--- a/drivers/dma-buf/heaps/system_heap.c
+++ b/drivers/dma-buf/heaps/system_heap.c
@@ -20,6 +20,7 @@
 #include 
 #include 
 #include 
+#include 

 struct dma_heap *sys_heap;

@@ -46,6 +47,7 @@ struct dma_heap_attachment {
 static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, 
LOW_ORDER_GFP};

 static const unsigned int orders[] = {8, 4, 0};
 #define NUM_ORDERS ARRAY_SIZE(orders)
+struct page_pool *pools[NUM_ORDERS];

 static struct sg_table *dup_sg_table(struct sg_table *table)
 {
@@ -264,13 +266,17 @@ static void system_heap_dma_buf_release(struct
dma_buf *dmabuf)
struct system_heap_buffer *buffer = dmabuf->priv;
struct sg_table *table;
struct scatterlist *sg;
-   int i;
+   int i, j;

	table = &buffer->sg_table;
for_each_sg(table->sgl, sg, table->nents, i) {
struct page *page = sg_page(sg);

-   __free_pages(page, compound_order(page));
+   for (j = 0; j < NUM_ORDERS; j++) {
+   if (compound_order(page) == orders[j])
+   break;
+   }
+   page_pool_put_full_page(pools[j], page, false);
}
sg_free_table(table);
kfree(buffer);
@@ -300,8 +306,7 @@ static struct page
*alloc_largest_available(unsigned long size,
continue;
if (max_order < orders[i])
continue;
-
-   page = alloc_pages(order_flags[i], orders[i]);
+   page = page_pool_alloc_pages(pools[i], order_flags[i]);
if (!page)
continue;
return page;
@@ -406,6 +411,25 @@ static const struct dma_heap_ops system_heap_ops = 
{

 static int system_heap_create(void)
 {
struct dma_heap_export_info exp_info;
+   int i;
+
+   for (i = 0; i < NUM_ORDERS; i++) {
+   struct page_pool_params pp;
+
+   memset(&pp, 0, sizeof(pp));
+   pp.order = orders[i];
+   pp.dma_dir = DMA_BIDIRECTIONAL;
+   pools[i] = page_pool_create(&pp);
+
+   if (IS_ERR(pools[i])) {
+   int j;
+
+   pr_err("%s: page pool creation failed!\n", __func__);
+   for (j = 0; j < i; j++)
+   page_pool_destroy(pools[j]);
+   return PTR_ERR(pools[i]);
+   }
+   }

exp_info.name = "system";
	exp_info.ops = &system_heap_ops;


This is cool, I didn't know about this pooling code under /net/core.  
Nice and compact.


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


[PATCH v5] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-28 Thread Chris Goldsworthy
V1: Introduces a retry loop that attempts a CMA allocation a finite
number of times before giving up:
 
https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893

V2: Introduces an indefinite retry for CMA allocations.  David Hildenbrand
raised a page pinning example which precludes doing this infinite-retrying
for all CMA users:

https://lkml.org/lkml/2020/9/17/984

V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retries. Also introduces a config option for
controlling the duration of the sleep between retries:

https://www.spinics.net/lists/linux-mm/msg228778.html

V4: In response to comments by Minchan Kim, we make the sleep interruptible
and remove a defconfig option controlling how long the sleep lasts for (it
is now a fixed 100 ms):

https://lkml.org/lkml/2020/9/28/1144

V5: Add missing fatal_signal_pending() check

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 +--
 mm/cma.c   | 39 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 +--
 9 files changed, 43 insertions(+), 16 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v5] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-28 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, that _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _refcount or decrementing _mapcount), such that the process is
context switched out after modifying one refcount but before modifying the
other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory.  Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we have
that _refcount > _mapcount.  _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing.  The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if there
aren't enough pages collected by the time we unmap all of the pages in a
process, the flush will occur in tlb_finish_mmu() in exit_mmap().  After
the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely,
in the event that alloc_contig_range() returns -EBUSY after having scanned
a whole CMA-region bitmap.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 +--
 mm/cma.c   | 39 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 +--
 9 files changed, 43 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
 
return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-false);
+0);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
 
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
helper_buffer->heap = heap;
helper_buffer->size = len;
 
-   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
if (!cma_pages)
goto free_buf;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
 * anymore the system won't work anyway.
 */
if (order > 2)
-   page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+   page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
if (page) {
session->response = (char *)page_to_phys(page);
session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c 
b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct 
ion_buffer *buffer,
   

[PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-28 Thread Chris Goldsworthy
V1: Introduces a retry loop that attempts a CMA allocation a finite
number of times before giving up:
 
https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893

V2: Introduces an indefinite retry for CMA allocations.  David Hildenbrand
raised a page pinning example which precludes doing this infinite-retrying
for all CMA users:

https://lkml.org/lkml/2020/9/17/984

V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retries. Also introduces a config option for
controlling the duration of the sleep between retries:

https://www.spinics.net/lists/linux-mm/msg228778.html

V4: In response to comments by Minchan Kim, we make the sleep interruptible
and remove a defconfig option controlling how long the sleep lasts for (it
is now a fixed 100 ms).

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 ++--
 mm/cma.c   | 36 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 ++--
 9 files changed, 40 insertions(+), 16 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-28 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, that _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _refcount or decrementing _mapcount), such that the process is
context switched out after modifying one refcount but before modifying the
other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory.  Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we have
that _refcount > _mapcount.  _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing.  The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if there
aren't enough pages collected by the time we unmap all of the pages in a
process, the flush will occur in tlb_finish_mmu() in exit_mmap().  After
the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely,
in the event that alloc_contig_range() returns -EBUSY after having scanned
a whole CMA-region bitmap.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 ++--
 mm/cma.c   | 36 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 ++--
 9 files changed, 40 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
 
return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-false);
+0);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
 
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
helper_buffer->heap = heap;
helper_buffer->size = len;
 
-   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
if (!cma_pages)
goto free_buf;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
 * anymore the system won't work anyway.
 */
if (order > 2)
-   page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+   page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
if (page) {
session->response = (char *)page_to_phys(page);
session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c 
b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct 
ion_buffer *buffer,
   

Re: [PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-28 Thread Chris Goldsworthy

On 2020-09-27 12:23, Minchan Kim wrote:

On Wed, Sep 23, 2020 at 10:16:25PM -0700, Chris Goldsworthy wrote:
CMA allocations will fail if 'pinned' pages are in a CMA area, since 
we





+config CMA_RETRY_SLEEP_DURATION
+   int "Sleep duration between retries"
+   depends on CMA
+   default 100
+   help
+ Time to sleep for in milliseconds between the indefinite
+ retries of a CMA allocation that are induced by passing
+ __GFP_NOFAIL to cma_alloc().
+
+ If unsure, leave the value as "100".


What's the point of this new config? If we need it, how could an admin
set their value?
What's wrong with just using a simple default value?
IOW, I'd like to avoid introducing a new config if we don't see good
justification.


I thought that it would be useful for developers, but I guess it would 
be much better for this to be runtime configurable.  But, I don't think 
there's a strong reason to be able to configure the value - 100 ms has 
worked for us.  I'll scrap this option in a follow-up patch, and 
will use 100 ms as the sleeping time.



+
 config MEM_SOFT_DIRTY
bool "Track memory changes"
depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS
diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..4fbad2b 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 

 #include "cma.h"
@@ -403,13 +404,15 @@ static inline void cma_debug_show_areas(struct 
cma *cma) { }
  * @cma:   Contiguous memory region for which the allocation is 
performed.

  * @count: Requested number of pages.
  * @align: Requested alignment of pages (in PAGE_SIZE order).
- * @no_warn: Avoid printing message about failed allocation
+ * @gfp_mask: If __GFP_NOWARN is passed, suppress messages about 
failed

+ *   allocations. If __GFP_NOFAIL is passed, try doing the CMA
+ *   allocation indefinitely until the allocation succeeds.
  *
  * This function allocates part of contiguous memory on specific
  * contiguous memory area.
  */
 struct page *cma_alloc(struct cma *cma, size_t count, unsigned int 
align,

-  bool no_warn)
+  gfp_t gfp_mask)
 {
unsigned long mask, offset;
unsigned long pfn = -1;
@@ -442,8 +445,28 @@ struct page *cma_alloc(struct cma *cma, size_t 
count, unsigned int align,

bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY && gfp_mask & __GFP_NOFAIL) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(CONFIG_CMA_RETRY_SLEEP_DURATION);


Should it be uninterruptible, really?


Good point - I'll replace the msleep() with 
schedule_timeout_killable() in the follow-up patch.
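
Roughly, the retry path would then look like this (a sketch of the 
follow-up only, not the final patch - the fatal_signal_pending() check 
is the one later added in v5):

if (ret == -EBUSY && gfp_mask & __GFP_NOFAIL) {
	mutex_unlock(&cma->lock);

	/* Don't loop forever if the caller has been fatally signalled. */
	if (fatal_signal_pending(current))
		break;

	/* Page may be momentarily pinned; back off and rescan the bitmap. */
	start = 0;
	ret = -ENOMEM;
	schedule_timeout_killable(msecs_to_jiffies(100));
	continue;
}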


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-25 Thread Chris Goldsworthy

On 2020-09-25 05:18, David Hildenbrand wrote:

On 24.09.20 07:16, Chris Goldsworthy wrote:

-GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0));
+GFP_KERNEL | (gfp_mask & __GFP_NOWARN));


Right, we definitely don't want to pass the flag further down.

Alternative would be cma_alloc_nofail(). That helps avoid people 
passing

 stuff like GFP_USER and wondering why it doesn't have an effect.


But since we're doing a bitwise AND with __GFP_NOWARN, we're not passing 
any other values down - this makes it equivalent to the previous 
version, in that only __GFP_NOWARN can be passed to 
alloc_contig_range().


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-23 Thread Chris Goldsworthy
V1: Introduces a retry loop that attempts a CMA allocation a finite
number of times before giving up:
 
https://lkml.org/lkml/2020/8/5/1097
https://lkml.org/lkml/2020/8/11/893

V2: Introduces an indefinite retry for CMA allocations.  David Hildenbrand
raised a page pinning example which precludes doing this infinite-retrying
for all CMA users:

https://lkml.org/lkml/2020/9/17/984

V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in
__GFP_NOFAIL as an argument to indicate that a CMA allocation should be
retried indefinitely. This lets callers of cma_alloc() decide if they want
to perform indefinite retries. Also introduces a config option for
controlling the duration of the sleep between retries.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 ++--
 mm/Kconfig | 11 ++
 mm/cma.c   | 35 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-23 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, that _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

If a process ends up causing _refcount > _mapcount for a page (by either
incrementing _refcount or decrementing _mapcount), such that the process is
context switched out after modifying one refcount but before modifying the
other, the page will be temporarily pinned.

One example of where _refcount can be greater than _mapcount is inside of
zap_pte_range(), which is called for all the entries of a PMD when a
process is exiting, to unmap the process's memory.  Inside of
zap_pte_range(), after unmapping a page with page_remove_rmap(), we have
that _refcount > _mapcount.  _refcount can only be decremented after a TLB
flush is performed for the page - this doesn't occur until enough pages
have been batched together for flushing.  The flush can either occur inside
of zap_pte_range() (during the same invocation or a later one), or if there
aren't enough pages collected by the time we unmap all of the pages in a
process, the flush will occur in tlb_finish_mmu() in exit_mmap().  After
the flush has occurred, tlb_batch_pages_flush() will decrement the
references on the flushed pages.

Another such example like the above is inside of copy_one_pte(), which is
called during a fork. For PTEs for which pte_present(pte) == true,
copy_one_pte() will increment the _refcount field followed by the
_mapcount field of a page.

So, inside of cma_alloc(), add the option of letting users pass in
__GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely,
in the event that alloc_contig_range() returns -EBUSY after having scanned
a whole CMA-region bitmap.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 arch/powerpc/kvm/book3s_hv_builtin.c   |  2 +-
 drivers/dma-buf/heaps/cma_heap.c   |  2 +-
 drivers/s390/char/vmcp.c   |  2 +-
 drivers/staging/android/ion/ion_cma_heap.c |  2 +-
 include/linux/cma.h|  2 +-
 kernel/dma/contiguous.c|  4 ++--
 mm/Kconfig | 11 ++
 mm/cma.c   | 35 +-
 mm/cma_debug.c |  2 +-
 mm/hugetlb.c   |  4 ++--
 10 files changed, 50 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index 073617c..21c3f6a 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages)
VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT);
 
return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES),
-false);
+0);
 }
 EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma);
 
diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c
index 626cf7f..7657359 100644
--- a/drivers/dma-buf/heaps/cma_heap.c
+++ b/drivers/dma-buf/heaps/cma_heap.c
@@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap,
helper_buffer->heap = heap;
helper_buffer->size = len;
 
-   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false);
+   cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0);
if (!cma_pages)
goto free_buf;
 
diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c
index 9e06628..11c4e3b 100644
--- a/drivers/s390/char/vmcp.c
+++ b/drivers/s390/char/vmcp.c
@@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session)
 * anymore the system won't work anyway.
 */
if (order > 2)
-   page = cma_alloc(vmcp_cma, nr_pages, 0, false);
+   page = cma_alloc(vmcp_cma, nr_pages, 0, 0);
if (page) {
session->response = (char *)page_to_phys(page);
session->cma_alloc = 1;
diff --git a/drivers/staging/android/ion/ion_cma_heap.c 
b/drivers/staging/android/ion/ion_cma_heap.c
index bf65e67..128d3a5 100644
--- a/drivers/staging/android/ion/ion_cma_heap.c
+++ b/drivers/staging/android/ion/ion_cma_heap.c
@@ -39,7 +39,7 @@ static int ion_cma_alloc

Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-23 Thread Chris Goldsworthy

On 2020-09-17 10:54, Chris Goldsworthy wrote:

On 2020-09-15 00:53, David Hildenbrand wrote:

On 14.09.20 20:33, Chris Goldsworthy wrote:

On 2020-09-14 02:31, David Hildenbrand wrote:

On 11.09.20 21:17, Chris Goldsworthy wrote:


So, inside of cma_alloc(), instead of giving up when
alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap,
perform
retries indefinitely, with sleeps, to give the system an 
opportunity

to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
count, unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*



What about long-term pinnings? IIRC, that can happen easily e.g.,
with
vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case
...


V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

We're fine with doing indefinite retries, on the grounds that if 
there

is some long-term pinning that occurs when alloc_contig_range returns
-EBUSY, that it should be debugged and fixed.  Would it be possible 
to

make this infinite-retrying something that could be enabled or
disabled
by a defconfig option?


Two thoughts:

This means I strongly prefer something like [3] if feasible.


_Resending so that this ends up on LKML_

I can give [3] some further thought then.  Also, I realized [3] will 
not

completely solve the problem; it just reduces the window in which
_refcount > _mapcount (as mentioned in earlier threads, we encountered
the pinning when a task in copy_one_pte() or in the exit_mmap() path
gets context switched out).  If we were to try a sleeping-lock based
solution, do you think it would be permissible to add another lock to
struct page?


I have not been able to think of a clean way of introducing calls to 
preempt_disable() in exit_mmap(), which is the more problematic case.  
We would need to track state across multiple invocations of 
zap_pte_range() (which is called for each entry in a PMD when a 
process's memory is being unmapped), and would also need to extend this 
to tlb_finish_mmu(), which is called after all the process's memory has 
been unmapped: 
https://elixir.bootlin.com/linux/v5.8.10/source/mm/mmap.c#L3164.  As a 
follow-up to this patch, I'm submitting a patch that re-introduces the 
GFP mask for cma_alloc, that will perform indefinite retires if 
__GFP_NOFAIL is passed to the function.


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-17 Thread Chris Goldsworthy

On 2020-09-15 00:53, David Hildenbrand wrote:

On 14.09.20 20:33, Chris Goldsworthy wrote:

On 2020-09-14 02:31, David Hildenbrand wrote:

On 11.09.20 21:17, Chris Goldsworthy wrote:


So, inside of cma_alloc(), instead of giving up when
alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, 
perform

retries indefinitely, with sleeps, to give the system an opportunity
to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
count, unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*



What about long-term pinnings? IIRC, that can happen easily e.g., 
with

vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case 
...


V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

We're fine with doing indefinite retries, on the grounds that if there
is some long-term pinning that occurs when alloc_contig_range returns
-EBUSY, that it should be debugged and fixed.  Would it be possible to
make this infinite-retrying something that could be enabled or 
disabled

by a defconfig option?


Two thoughts:

This means I strongly prefer something like [3] if feasible.


I can give [3] some further thought then.  Also, I realized [3] will not 
completely solve the problem; it just reduces the window in which 
_refcount > _mapcount (as mentioned in earlier threads, we encountered 
the pinning when a task in copy_one_pte() or in the exit_mmap() path 
gets context switched out).  If we were to try a sleeping-lock based 
solution, do you think it would be permissible to add another lock to 
struct page?



2. The issue that I am having is that long-term pinnings are
(unfortunately) a real thing. It's not something to debug and fix as 
you
suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that 
VM

is running, all VM memory will be pinned. If memory falls onto a CMA
region your cma_alloc() will be stuck in an (endless, meaning until the
VM ended) loop. I am not sure if all cma users are fine with that -
especially, think about CMA being used for gigantic pages now.

Assume you want to start a new VM while the other one is running and 
use

some (new) gigantic pages for it. Suddenly you're trapped in an endless
loop in the kernel. That's nasty.



Thanks for providing this example.



If we want to stick to retrying forever, can't we use flags like
__GFP_NOFAIL to explicitly enable this new behavior for selected
cma_alloc() users that really can't fail/retry manually again?


This would work - we would just have to undo the work done by this patch 
/ re-introduce the GFP parameter for cma_alloc(): 
http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqtpjkij2939229392eucas1...@eucas1p2.samsung.com 
, and add support for __GFP_NOFAIL (and ignore any flag that is not one 
of __GFP_NOFAIL or __GFP_NOWARN).
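
Something as small as the following would do for the flag filtering (a 
sketch only - the helper name is made up for the example):

/* Honour only the two flags discussed above; silently drop the rest. */
static gfp_t cma_gfp_sanitize(gfp_t gfp_mask)
{
	return gfp_mask & (__GFP_NOFAIL | __GFP_NOWARN);
}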


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-17 Thread Chris Goldsworthy

On 2020-09-15 00:53, David Hildenbrand wrote:

On 14.09.20 20:33, Chris Goldsworthy wrote:

On 2020-09-14 02:31, David Hildenbrand wrote:

On 11.09.20 21:17, Chris Goldsworthy wrote:


So, inside of cma_alloc(), instead of giving up when
alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap,
perform
retries indefinitely, with sleeps, to give the system an opportunity
to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t
count, unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*



What about long-term pinnings? IIRC, that can happen easily e.g.,
with
vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case
...


V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

We're fine with doing indefinite retries, on the grounds that if there
is some long-term pinning that occurs when alloc_contig_range returns
-EBUSY, that it should be debugged and fixed.  Would it be possible to
make this infinite-retrying something that could be enabled or
disabled
by a defconfig option?


Two thoughts:

This means I strongly prefer something like [3] if feasible.


_Resending so that this ends up on LKML_

I can give [3] some further thought then.  Also, I realized [3] will not
completely solve the problem; it just reduces the window in which
_refcount > _mapcount (as mentioned in earlier threads, we encountered
the pinning when a task in copy_one_pte() or in the exit_mmap() path
gets context switched out).  If we were to try a sleeping-lock based
solution, do you think it would be permissible to add another lock to
struct page?


2. The issue that I am having is that long-term pinnings are
(unfortunately) a real thing. It's not something to debug and fix as
you
suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that
VM
is running, all VM memory will be pinned. If memory falls onto a CMA
region your cma_alloc() will be stuck in an (endless, meaning until the
VM ended) loop. I am not sure if all cma users are fine with that -
especially, think about CMA being used for gigantic pages now.

Assume you want to start a new VM while the other one is running and
use
some (new) gigantic pages for it. Suddenly you're trapped in an endless
loop in the kernel. That's nasty.



Thanks for providing this example.



If we want to stick to retrying forever, can't we use flags like
__GFP_NOFAIL to explicitly enable this new behavior for selected
cma_alloc() users that really can't fail/retry manually again?


This would work - we would just have to undo the work done by this patch
/ re-introduce the GFP parameter for cma_alloc():
http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqtpjkij2939229392eucas1...@eucas1p2.samsung.com
, and add support for __GFP_NOFAIL (and ignore any flag that is not one
of __GFP_NOFAIL or __GFP_NOWARN).

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-14 Thread Chris Goldsworthy

On 2020-09-14 11:33, Chris Goldsworthy wrote:

On 2020-09-14 02:31, David Hildenbrand wrote:

What about long-term pinnings? IIRC, that can happen easily e.g., with
vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case 
...


Hi David,

I've botched the threading, so there are discussions with respect to
the previous patch-set that is missing on this thread, which I will
summarize below:

V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

The version of the patch in [1] featured a finite number of retries,
which has been stable for our kernels. In [2], Andrew questioned
whether we could actually find a way of solving the problem on the
grounds that doing a finite number of retries doesn't actually fix the
problem (more importantly, in [4] Andrew indicated that he would
prefer not to merge the patch as it doesn't solve the issue).  In [3],
I suggest one actual fix for this, which is to use
preempt_disable/enable() to prevent context switches from occurring
during the periods in copy_one_pte() and exit_mmap() (I forgot to
mention this case in the commit text) in which _refcount > _mapcount
for a page - you would also need to prevent interrupts from occurring,
too, if we were to fully prevent the issue from occurring.  I think this
would be acceptable for the copy_one_pte() case, since _refcount
> _mapcount holds only briefly there.  For the exit_mmap() case, however, _refcount is 
greater than _mapcount whilst the page tables are being torn down for a process - 
that could be too long for disabling preemption / interrupts.

So, in [4], Andrew asks about two alternatives to see if they're
viable: (1) acquiring locks on the exit_mmap path and migration paths,
(2) retrying indefinitely.  In [5], I discuss how using locks could
increase the time it takes to perform a CMA allocation, such that a
retry approach would avoid increased CMA allocation times. I'm also
uncertain about how the locking scheme could be implemented
effectively without introducing a new per-page lock that will be used
specifically to solve this issue, and I'm not sure this would be
accepted.

We're fine with doing indefinite retries, on the grounds that if there
is some long-term pinning that occurs when alloc_contig_range returns
-EBUSY, that it should be debugged and fixed.  Would it be possible to
make this infinite-retrying something that could be enabled or
disabled by a defconfig option?

Thanks,

Chris.


Actually, if we were willing to have a defconfig option for enabling / 
disabling indefinite retries on the return of -EBUSY, would it be 
possible to restructure the patch to allow either (1) indefinite 
retrying, or (2) doing a fixed number of retries (as some people might 
want to tolerate CMA allocation failures in favor of making progress)?


--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-14 Thread Chris Goldsworthy

On 2020-09-11 14:42, Randy Dunlap wrote:

On 9/11/20 2:37 PM, Florian Fainelli wrote:


I am by no means an authoritative CMA person but this behavior does 
not seem acceptable, there is no doubt the existing one is sub-optimal 
under specific circumstances, but an indefinite retry, as well as a 
100ms sleep appear to be arbitrary at best. How about you introduce a 
parameter that allows the tuning of the number of retries and/or delay 
between retries?


Also:

You should send your patch to linux...@kvack.org -- that's where
most memory management type patches are reviewed.

You should also send your patch to someone who could actually merge
it into the kernel source tree -- assuming that's what you want to 
happen.


Try scripts/get_maintainer.pl on your patch to see what it says.

And if you are going to use a "cover letter" or "introductory email" 
before

the actual patch, the second (patch(es)) should be sent chained to the
first email.  git send-email should do this for you.


Hi Randy,

git send-email was not using the correct ID to generate a response to 
the cover letter, and I'm not able to fathom why.  This e-mail was 
actually just sent out to LKML as a test so I could figure out 
how to resolve the issue; I wasn't actually expecting anyone to read 
it.  The actual e-mail, with the correct maintainer (Andrew Morton) 
and mailing lists, as well as the summary of the discussion on the 
patches so far, may be found here: https://lkml.org/lkml/2020/8/11/893


Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora 
Forum,

a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-14 Thread Chris Goldsworthy

On 2020-09-11 14:37, Florian Fainelli wrote:

On 9/11/2020 1:54 PM, Chris Goldsworthy wrote:
CMA allocations will fail if 'pinned' pages are in a CMA area, since 
we
cannot migrate pinned pages. The _refcount of a struct page being 
greater
than _mapcount for that page can cause pinning for anonymous pages.  
This
is because try_to_unmap(), which (1) is called in the CMA allocation 
path,

and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  
This

implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0.  
Later
in the CMA allocation path in migrate_page_move_mapping(), we will 
have one

more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a 
page we
would not expect to be pinned is inside of copy_one_pte(), which is 
called
during a fork. For ptes for which pte_present(pte) == true, 
copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of 
a
page. If the process doing copy_one_pte() is context switched out 
after
incrementing _refcount but before incrementing _mapcount, then the 
page

will be temporarily pinned.

So, inside of cma_alloc(), instead of giving up when 
alloc_contig_range()

returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity 
to

unpin any pinned pages.


I am by no means an authoritative CMA person, but this behavior does
not seem acceptable.  There is no doubt the existing behavior is
sub-optimal under specific circumstances, but an indefinite retry, as
well as a 100ms sleep, appears arbitrary at best.  How about you
introduce a parameter that allows tuning the number of retries and/or
the delay between retries?



Apologies Florian, I messed up on the threading, and there are
discussions that aren't referenced here.  The original version of this
patch performed a finite number of retries.  Also, this e-mail was just
sent out to LKML so I could debug some issues I was facing with git
send-email.  The actual thread, which summarizes the discussions w.r.t.
this patch so far, is now here:
https://lkml.org/lkml/2020/9/14/1097


Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-14 Thread Chris Goldsworthy

On 2020-09-14 02:31, David Hildenbrand wrote:

On 11.09.20 21:17, Chris Goldsworthy wrote:


So, inside of cma_alloc(), instead of giving up when 
alloc_contig_range()

returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity 
to

unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t 
count, unsigned int align,

bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*



What about long-term pinnings? IIRC, that can happen easily e.g., with
vfio (and I remember there is a way via vmsplice).

Not convinced trying forever is a sane approach in the general case ...


Hi David,

I've botched the threading, so there are discussions with respect to the
previous patch-set that are missing from this thread, which I will
summarize below:


V1:
[1] https://lkml.org/lkml/2020/8/5/1097
[2] https://lkml.org/lkml/2020/8/6/1040
[3] https://lkml.org/lkml/2020/8/11/893
[4] https://lkml.org/lkml/2020/8/21/1490
[5] https://lkml.org/lkml/2020/9/11/1072

[1] is the version of the patch that featured a finite number of
retries, which has been stable for our kernels.  In [2], Andrew
questioned whether we could actually find a way of solving the problem,
on the grounds that doing a finite number of retries doesn't actually
fix the problem (more importantly, in [4] Andrew indicated that he would
prefer not to merge the patch as it doesn't solve the issue).  In [3], I
suggest one actual fix for this, which is to use
preempt_disable()/preempt_enable() to prevent context switches from
occurring during the periods in copy_one_pte() and exit_mmap() (I forgot
to mention this case in the commit text) in which _refcount > _mapcount
for a page - you would also need to disable interrupts to fully prevent
the issue from occurring.  I think this would be acceptable for the
copy_one_pte() case, since there _refcount > _mapcount for only a short
time.  For the exit_mmap() case, however, _refcount is greater than
_mapcount whilst the page tables are being torn down for a process -
that could be too long to keep preemption / interrupts disabled.
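
For reference, a rough sketch of the idea from [3] is below.  This is
not the actual copy_one_pte() code - copy_anon_page_refs() is a made-up
helper and the rmap call is shown for the anonymous, non-compound case -
but it shows the _refcount > _mapcount window being made
non-preemptible:

/* Illustration only; assumes kernel context (linux/mm.h, linux/rmap.h). */
static void copy_anon_page_refs(struct page *page)
{
	/*
	 * Keep the window where _refcount has been raised but _mapcount
	 * has not yet caught up non-preemptible, so a concurrent CMA
	 * migration never observes the page as transiently "pinned".
	 */
	preempt_disable();
	get_page(page);			/* _refcount++ */
	page_dup_rmap(page, false);	/* _mapcount++ for the new mapping */
	preempt_enable();
}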


So, in [4], Andrew asks about two alternatives to see if they're viable:
(1) acquiring locks on the exit_mmap path and migration paths, (2)
retrying indefinitely.  In [5], I discuss how using locks could increase
the time it takes to perform a CMA allocation, such that a retry
approach would avoid those increased allocation times.  I'm also
uncertain how the locking scheme could be implemented effectively
without introducing a new per-page lock used specifically to solve this
issue, and I'm not sure such a lock would be accepted.


We're fine with doing indefinite retries, on the grounds that any
long-term pinning that causes alloc_contig_range() to return -EBUSY
should be debugged and fixed.  Would it be possible to make this
indefinite retrying something that can be enabled or disabled by a
defconfig option?


Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

2020-09-11 Thread Chris Goldsworthy

On 2020-09-11 13:54, Chris Goldsworthy wrote:
[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY 
failures


On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're 
trying
to migrate away to make room for a CMA allocation, are temporarily 
pinned.
This temporary pinning can occur when a process that owns the pinned 
page
is being forked (the example is explained further in the commit text), 
or it is
exiting.  This patch addresses this issue by indefinitely retrying 
allocations

that fail due to a return of -EBUSY.

** This change log was re-sent due to threading issues **

Change log:

v1: We were performing retries of the allocation a fixed number of times.
Andrew Morton disliked this, as it didn't guarantee that the allocation
would succeed.

https://lkml.org/lkml/2020/8/5/1096
https://lkml.org/lkml/2020/8/21/1490

v2: To address this concern, we switched to retrying indefinitely, as
opposed to retrying the allocation a limited number of times.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

test
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

2020-09-11 Thread Chris Goldsworthy
[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text), or it is
exiting.  This patch addresses this issue by indefinitely retrying allocations
that fail due to a return of -EBUSY.

** This change log was re-sent due to threading issues **

Change log: 

v1: We were performing retries of the allocation a fixed number of times.
Andrew Morton disliked this, as it didn't guarantee that the allocation would
succeed.

https://lkml.org/lkml/2020/8/5/1096
https://lkml.org/lkml/2020/8/21/1490

v2: To address this concern, we switched to retrying indefinitely, as opposed to
retrying the allocation a limited number of times.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-11 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we
would not expect to be pinned is inside of copy_one_pte(), which is called
during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of a
page. If the process doing copy_one_pte() is context switched out after
incrementing _refcount but before incrementing _mapcount, then the page
will be temporarily pinned.
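
To make the failure mode concrete, here is a simplified sketch of the
kind of check that trips in this situation.  This is not the exact
migrate_page_move_mapping() code; try_migrate_anon_page() and
expected_count are illustrative names only:

/*
 * Simplified illustration of why a transiently elevated _refcount makes
 * migration - and therefore the CMA allocation - fail.
 */
static int try_migrate_anon_page(struct page *page, int expected_count)
{
	/*
	 * If someone raised _refcount (e.g. get_page() in copy_one_pte())
	 * without the matching _mapcount update being accounted for yet,
	 * the page looks pinned and cannot be migrated right now.
	 */
	if (page_count(page) != expected_count)
		return -EAGAIN;

	/* ... otherwise proceed with replacing the mapping ... */
	return 0;
}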

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 #include 
 
 #include "cma.h"
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

2020-09-11 Thread Chris Goldsworthy
[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text), or it is
exiting.  This patch addresses this issue by indefinitely retrying allocations
that fail due to a return of -EBUSY.

** This change log was re-sent due to threading issues **

Change log: 

v1: We were performing retries of the allocation a fixed number of times.
Andrew Morton disliked this, as it didn't guarantee that the allocation would
succeed.

https://lkml.org/lkml/2020/8/5/1096
https://lkml.org/lkml/2020/8/21/1490

v2: To address this concern, we switched to retrying indefinitely, as opposed to
retrying the allocation a limited number of times.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

2020-09-11 Thread Chris Goldsworthy
On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text), or it is
exiting.  This patch addresses this issue by indefinitely retrying allocations
that fail due to a return of -EBUSY.

** This change log was re-sent due to threading issues **

Change log: 

v1: We were performing retries of the allocation a fixed number of times.
Andrew Morton disliked this, as it didn't guarantee that the allocation would
succeed.

https://lkml.org/lkml/2020/8/5/1096
https://lkml.org/lkml/2020/8/21/1490

v2: To address this concern, we switched to retrying indefinitely, as opposed to
retrying the allocation a limited number of times.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-11 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we
would not expect to be pinned is inside of copy_one_pte(), which is called
during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of a
page. If the process doing copy_one_pte() is context switched out after
incrementing _refcount but before incrementing _mapcount, then the page
will be temporarily pinned.

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 #include 
 
 #include "cma.h"
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures

2020-09-11 Thread Chris Goldsworthy
On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text), or it is
exiting.  This patch addresses this issue by indefinitely retrying allocations
that fail due to a return of -EBUSY.

Change log: 

v1: We were performing retries of the allocation a fixed number of times.
Andrew Morton disliked this, as it didn't guarantee that the allocation would
succeed.

v2: To address this concern, we switched to retrying indefinitely, as opposed to
retrying the allocation a limited number of times.

Chris Goldsworthy (1):
  mm: cma: indefinitely retry allocations in cma_alloc

 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc

2020-09-11 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we
would not expect to be pinned is inside of copy_one_pte(), which is called
during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of a
page. If the process doing copy_one_pte() is context switched out after
incrementing _refcount but before incrementing _mapcount, then the page
will be temporarily pinned.

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries indefinitely, with sleeps, to give the system an opportunity to
unpin any pinned pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 #include 
 
 #include "cma.h"
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if (ret == -EBUSY) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   continue;
+   } else {
+   /*
+* ret == -ENOMEM - all bits in cma->bitmap are
+* set, so we break accordingly.
+*/
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



Re: cma_alloc(), add sleep-and-retry for temporary page pinning

2020-09-11 Thread Chris Goldsworthy

On 2020-08-21 15:01, Andrew Morton wrote:

On Tue, 11 Aug 2020 15:20:47 -0700 cgold...@codeaurora.org wrote:


One thing to stress is that there are other instances of CMA page
pinning that this patch isn't attempting to address.


Oh.  How severe are these?


Hey Andrew,

  - get_user_pages() will pin pages (I think that's a 100% guarantee but 
I'd need to double check that). There isn't a workaround for this as far 
as I know.
  - One issue we've encountered is when the pages for buffer heads are
stuck on an LRU list (see the call to buffer_busy() in drop_buffers(),
https://elixir.bootlin.com/linux/v5.8.8/source/fs/buffer.c#L3225).  We
deal with this by allowing the buffers to be dropped if the only reason
buffer_busy() returns true is that the page is on an LRU list (a rough
sketch of this idea follows below).
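
The sketch uses only in-tree helpers and assumes the caller holds the
page lock.  try_drop_page_buffers() is a made-up name and this is not
the downstream change itself; flushing the per-CPU BH LRUs before
retrying is simply the same idea in spirit as the BH-LRU invalidation
series above:

/*
 * Sketch only (assumes <linux/buffer_head.h>): if a page's buffers
 * cannot be freed, flush the per-CPU BH LRUs - which hold extra
 * buffer_head references - and try once more.
 */
static bool try_drop_page_buffers(struct page *page)
{
	if (try_to_free_buffers(page))
		return true;

	/* Drop per-CPU LRU references to buffer heads and retry. */
	invalidate_bh_lrus();
	return try_to_free_buffers(page);
}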



Well.  Why not wait infinitely?  Because there are other sources of CMA
page pinning, I guess.

Could we take a sleeping lock on the exit_mmap() path and on the
migration path?  So that the migration path automatically waits for
the exact amount of time?


Retrying indefinitely whilst alloc_contig_range() returns -EBUSY is 
actually a viable approach.  In terms of how long a CMA allocation 
takes, it is preferable to the lock-based method.  With our current 
method, we attempt allocations across an entire CMA region before 
sleeping and retrying.  With the lock-based method, we'd be sleeping on 
a per-page basis - this could lead to a situation where we spend a 
great deal of time waiting for a particular page A, lying in the subset 
of the CMA region we're allocating from, to be freed.  We could instead 
have given up on that subset of the CMA region (because page A is 
pinned), moved on to a different subset, and successfully allocated 
that subset whilst page A is still pinned.


I have a patch ready that does this indefinite retrying, which will be 
sent in reply to this e-mail.


Regards,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project


cma_alloc(), add sleep-and-retry for temporary page pinning

2020-08-05 Thread Chris Goldsworthy
On mobile devices, failure to allocate from a CMA area constitutes a
functional failure.  Sometimes during CMA allocations, we have observed
that pages in a CMA area allocated through alloc_pages(), that we're trying
to migrate away to make room for a CMA allocation, are temporarily pinned.
This temporary pinning can occur when a process that owns the pinned page
is being forked (the example is explained further in the commit text).
This patch addresses this issue by adding a sleep-and-retry loop in
cma_alloc().  There's another, similar example we know of that occurs
during exit_mmap() (in zap_pte_range() specifically), but I need to
determine whether it is still relevant today.



[PATCH] mm: cma: retry allocations in cma_alloc

2020-08-05 Thread Chris Goldsworthy
CMA allocations will fail if 'pinned' pages are in a CMA area, since we
cannot migrate pinned pages. The _refcount of a struct page being greater
than _mapcount for that page can cause pinning for anonymous pages.  This
is because try_to_unmap(), which (1) is called in the CMA allocation path,
and (2) decrements both _refcount and _mapcount for a page, will stop
unmapping a page from VMAs once the _mapcount for a page reaches 0.  This
implies that after try_to_unmap() has finished successfully for a page
where _refcount > _mapcount, _refcount will be greater than 0.  Later
in the CMA allocation path in migrate_page_move_mapping(), we will have one
more reference count than intended for anonymous pages, meaning the
allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we
would not expect to be pinned is inside of copy_one_pte(), which is called
during a fork. For ptes for which pte_present(pte) == true, copy_one_pte()
will increment the _refcount field followed by the  _mapcount field of a
page. If the process doing copy_one_pte() is context switched out after
incrementing _refcount but before incrementing _mapcount, then the page
will be temporarily pinned.

So, inside of cma_alloc(), instead of giving up when alloc_contig_range()
returns -EBUSY after having scanned a whole CMA-region bitmap, perform
retries with sleeps to give the system an opportunity to unpin any pinned
pages.

Signed-off-by: Chris Goldsworthy 
Co-developed-by: Susheel Khiani 
Signed-off-by: Susheel Khiani 
Co-developed-by: Vinayak Menon 
Signed-off-by: Vinayak Menon 
---
 mm/cma.c | 24 ++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..7b85fe6 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include 
 #include 
 #include 
+#include <linux/delay.h>
 #include 
 
 #include "cma.h"
@@ -418,6 +419,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
size_t i;
struct page *page = NULL;
int ret = -ENOMEM;
+   int num_attempts = 0;
+   int max_retries = 5;
 
if (!cma || !cma->count || !cma->bitmap)
return NULL;
@@ -442,8 +445,25 @@ struct page *cma_alloc(struct cma *cma, size_t count, 
unsigned int align,
bitmap_maxno, start, bitmap_count, mask,
offset);
if (bitmap_no >= bitmap_maxno) {
-   mutex_unlock(&cma->lock);
-   break;
+   if ((num_attempts < max_retries) && (ret == -EBUSY)) {
+   mutex_unlock(&cma->lock);
+
+   /*
+* Page may be momentarily pinned by some other
+* process which has been scheduled out, e.g.
+* in exit path, during unmap call, or process
+* fork and so cannot be freed there. Sleep
+* for 100ms and retry the allocation.
+*/
+   start = 0;
+   ret = -ENOMEM;
+   msleep(100);
+   num_attempts++;
+   continue;
+   } else {
+   mutex_unlock(&cma->lock);
+   break;
+   }
}
bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
/*
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH] of: reserved_mem: add missing of_node_put() for proper ref-counting

2019-10-19 Thread Chris Goldsworthy
Commit d698a388146c ("of: reserved-memory: ignore disabled memory-region
nodes") added an early return in of_reserved_mem_device_init_by_idx(), but
didn't call of_node_put() on a device_node whose ref-count was incremented
in the call to of_parse_phandle() preceding the early exit.

Fixes: d698a388146c ("of: reserved-memory: ignore disabled memory-region nodes")
Signed-off-by: Chris Goldsworthy 
To: Rob Herring 
Cc: devicet...@vger.kernel.org
Cc: sta...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/of/of_reserved_mem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 7989703..6bd610e 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -324,8 +324,10 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
if (!target)
return -ENODEV;
 
-   if (!of_device_is_available(target))
+   if (!of_device_is_available(target)) {
+   of_node_put(target);
return 0;
+   }
 
rmem = __find_rmem(target);
of_node_put(target);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project



[PATCH] of: reserved_mem: add missing of_node_put() for proper ref-counting

2019-09-26 Thread Chris Goldsworthy
Commit d698a388146c ("of: reserved-memory: ignore disabled memory-region
nodes") added an early return in of_reserved_mem_device_init_by_idx(), but
didn't call of_node_put() on a device_node whose ref-count was incremented
in the call to of_parse_phandle() preceding the early exit.

Fixes: d698a388146c ("of: reserved-memory: ignore disabled memory-region nodes")
Signed-off-by: Chris Goldsworthy 
To: Rob Herring 
Cc: devicet...@vger.kernel.org
Cc: sta...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/of/of_reserved_mem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 7989703..6bd610e 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -324,8 +324,10 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
if (!target)
return -ENODEV;
 
-   if (!of_device_is_available(target))
+   if (!of_device_is_available(target)) {
+   of_node_put(target);
return 0;
+   }
 
rmem = __find_rmem(target);
of_node_put(target);
-- 
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project