Re: [PATCH v4 3/3] mm: fs: Invalidate BH LRU during page migration
On 2021-03-20 12:54, Matthew Wilcox wrote:
On Sat, Mar 20, 2021 at 10:20:09AM -0700, Minchan Kim wrote:
> > Tested-by: Oliver Sang
> > Reported-by: kernel test robot
> > Signed-off-by: Chris Goldsworthy
> > Signed-off-by: Minchan Kim
>
> The signoff chain ordering might mean that Chris was the primary author,
> but there is no From: him. Please clarify?

He tried the first version, but it was a different implementation since I
changed a lot. That's why I added his SoB even though the current
implementation is much different. So, maybe I am the primary author?

Hey Minchan, let's have you as the primary author. Maybe Chris is
Reported-by:? And don't forget Laura Abbott as original author of the patch
Chris submitted. I think she should be Reported-by: as well, since there is
no code from either of them in this version of the patch.

Yes, let's have a Reported-by: from Laura. We can change my Signed-off-by to
Reported-by: as well.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily
On 2021-03-11 14:41, Chris Goldsworthy wrote:
On 2021-03-10 08:14, Minchan Kim wrote:

The LRU pagevec holds a refcount on pages until the pagevec is drained. It
can prevent migration, since the refcount of the page is greater than the
expectation in the migration logic. To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.

However, it's not enough, because pages coming into the pagevec after the
draining call can still stay in the pagevec, so they can keep preventing
page migration. Since some callers of migrate_pages have retry logic with
LRU draining, the page would migrate at the next trial, but it is still
fragile in that it doesn't close the fundamental race between upcoming LRU
pages entering the pagevec and migration, so the migration failure could
cause contiguous memory allocation failure in the end.

To close the race, this patch disables the LRU caches (i.e., pagevec)
during ongoing migration until migration is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (that is, a fallback to a sync
migration) with the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
	..
	..
	if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
		printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
		dump_page(page, "fail to migrate");
	}

The test was repeatedly launching Android apps with CMA allocation in the
background every five seconds. The total CMA allocation count was about 500
during the testing. With this patch, the dump_page count was reduced from
400 to 30.

The new interface is also useful for memory hotplug, which currently drains
the LRU pcp caches after each migration failure. This is rather suboptimal,
as it has to disrupt others running during the operation. With the new
interface the operation happens only once. This is also in line with the
pcp allocator caches, which are disabled for the offlining as well.
Signed-off-by: Minchan Kim
---

Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.

Should have added:

Reviewed-by: Chris Goldsworthy

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v3 2/3] mm: disable LRU pagevec during the migration temporarily
On 2021-03-10 08:14, Minchan Kim wrote:

The LRU pagevec holds a refcount on pages until the pagevec is drained. It
can prevent migration, since the refcount of the page is greater than the
expectation in the migration logic. To mitigate the issue, callers of
migrate_pages drain the LRU pagevec via migrate_prep or lru_add_drain_all
before the migrate_pages call.

However, it's not enough, because pages coming into the pagevec after the
draining call can still stay in the pagevec, so they can keep preventing
page migration. Since some callers of migrate_pages have retry logic with
LRU draining, the page would migrate at the next trial, but it is still
fragile in that it doesn't close the fundamental race between upcoming LRU
pages entering the pagevec and migration, so the migration failure could
cause contiguous memory allocation failure in the end.

To close the race, this patch disables the LRU caches (i.e., pagevec)
during ongoing migration until migration is done.

Since it's really hard to reproduce, I measured how many times
migrate_pages retried with force mode (that is, a fallback to a sync
migration) with the debug code below.

int migrate_pages(struct list_head *from, new_page_t get_new_page,
	..
	..
	if (rc && reason == MR_CONTIG_RANGE && pass > 2) {
		printk(KERN_ERR "pfn 0x%lx reason %d\n", page_to_pfn(page), rc);
		dump_page(page, "fail to migrate");
	}

The test was repeatedly launching Android apps with CMA allocation in the
background every five seconds. The total CMA allocation count was about 500
during the testing. With this patch, the dump_page count was reduced from
400 to 30.

The new interface is also useful for memory hotplug, which currently drains
the LRU pcp caches after each migration failure. This is rather suboptimal,
as it has to disrupt others running during the operation. With the new
interface the operation happens only once. This is also in line with the
pcp allocator caches, which are disabled for the offlining as well.
Signed-off-by: Minchan Kim
---

Hi Minchan,

This all looks good to me - feel free to add a Reviewed-by from me.

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v2] [RFC] mm: fs: Invalidate BH LRU during page migration
On 2021-02-11 06:09, Matthew Wilcox wrote:
On Wed, Feb 10, 2021 at 09:35:40PM -0800, Chris Goldsworthy wrote:

+/* These are used to control the BH LRU invalidation during page migration */
+static struct cpumask lru_needs_invalidation;
+static bool bh_lru_disabled = false;

As I asked before, what protects this on an SMP system?

Sorry Matthew, I misconstrued your earlier question in V1, and thought you
had been referring to compile-time protection (so as to prevent build
breakages). It is not protected, so I'll need to change this into an atomic
counter that is incremented and decremented by bh_lru_disable() and
bh_lru_enable() respectively (such that if the counter is greater than
zero, we bail).

@@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
 /*
  * Install a buffer_head into this cpu's LRU. If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
- * Or, if already in the LRU it is moved to the front.
+ * Or, if already in the LRU it is moved to the front. Note that if LRU is
+ * disabled because of an ongoing page migration, we won't insert bh into the
+ * LRU.

And also, why do we need to do this? The page LRU has no equivalent
mechanism to prevent new pages being added to the per-CPU LRU lists. If a
BH has just been used, isn't that a strong hint that this page is a bad
candidate for migration?

I had assumed up until now that pages in the page cache aren't an issue,
such that they're dropped during migration as needed. Looking at
try_to_free_buffers() [1], I don't see any handling for the page cache. I
will need to do due diligence and follow up on this.

As for the question on necessity: if there is a case in which preventing
buffer_heads from being added to the BH LRU ensures that the containing
page can be migrated, then I would say the change is justified, since it
adds another scenario in which migration is guaranteed (I will follow up on
this as well).

Regards,

Chris.
[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L3225

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v2] [RFC] mm: fs: Invalidate BH LRU during page migration
Pages containing buffer_heads that are in one of the per-CPU buffer_head LRU
caches will be pinned and thus cannot be migrated. This can prevent CMA
allocations from succeeding, which are often used on platforms with
co-processors (such as a DSP) that can only use physically contiguous memory.
It can also prevent memory hot-unplugging from succeeding, which involves
migrating at least MIN_MEMORY_BLOCK_SIZE bytes of memory, which ranges from
8 MiB to 1 GiB based on the architecture in use.

Correspondingly, invalidate the BH LRU caches before a migration starts, and
stop any buffer_head from being cached in the LRU caches until migration has
finished.

Signed-off-by: Chris Goldsworthy
Cc: Minchan Kim
Cc: Matthew Wilcox
Cc: Andrew Morton
---
 fs/buffer.c                 | 54 +++--
 include/linux/buffer_head.h |  8 +++
 include/linux/migrate.h     |  2 ++
 mm/migrate.c                | 19 
 mm/page_alloc.c             |  3 +++
 mm/swap.c                   |  7 +-
 6 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..634e474 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1274,6 +1274,10 @@ struct bh_lru {

 static DEFINE_PER_CPU(struct bh_lru, bh_lrus) = {{ NULL }};

+/* These are used to control the BH LRU invalidation during page migration */
+static struct cpumask lru_needs_invalidation;
+static bool bh_lru_disabled = false;
+
 #ifdef CONFIG_SMP
 #define bh_lru_lock()	local_irq_disable()
 #define bh_lru_unlock()	local_irq_enable()
@@ -1292,7 +1296,9 @@ static inline void check_irqs_on(void)
 /*
  * Install a buffer_head into this cpu's LRU. If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
- * Or, if already in the LRU it is moved to the front.
+ * Or, if already in the LRU it is moved to the front. Note that if LRU is
+ * disabled because of an ongoing page migration, we won't insert bh into the
+ * LRU.
  */
 static void bh_lru_install(struct buffer_head *bh)
 {
@@ -1303,6 +1309,9 @@ static void bh_lru_install(struct buffer_head *bh)
 	check_irqs_on();
 	bh_lru_lock();

+	if (bh_lru_disabled)
+		goto out;
+
 	b = this_cpu_ptr(&bh_lrus);
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		swap(evictee, b->bhs[i]);
@@ -1313,6 +1322,7 @@ static void bh_lru_install(struct buffer_head *bh)
 	}

 	get_bh(bh);
+out:
 	bh_lru_unlock();
 	brelse(evictee);
 }
@@ -1328,6 +1338,10 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)

 	check_irqs_on();
 	bh_lru_lock();
+
+	if (bh_lru_disabled)
+		goto out;
+
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		struct buffer_head *bh = __this_cpu_read(bh_lrus.bhs[i]);

@@ -1346,6 +1360,7 @@ lookup_bh_lru(struct block_device *bdev, sector_t block, unsigned size)
 			break;
 		}
 	}
+out:
 	bh_lru_unlock();
 	return ret;
 }
@@ -1446,7 +1461,7 @@ EXPORT_SYMBOL(__bread_gfp);
  * This doesn't race because it runs in each cpu either in irq
  * or with preempt disabled.
  */
-static void invalidate_bh_lru(void *arg)
+void invalidate_bh_lru(void *arg)
 {
 	struct bh_lru *b = &get_cpu_var(bh_lrus);
 	int i;
@@ -1477,6 +1492,41 @@ void invalidate_bh_lrus(void)
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

+bool need_bh_lru_invalidation(int cpu)
+{
+	return cpumask_test_cpu(cpu, &lru_needs_invalidation);
+}
+
+void bh_lru_disable(void)
+{
+	int cpu;
+
+	bh_lru_disabled = true;
+
+	/*
+	 * This barrier ensures that invocations of bh_lru_install()
+	 * after this barrier see that bh_lru_disabled == true (until
+	 * bh_lru_enable() is eventually called).
+	 */
+	smp_mb();
+
+	/*
+	 * It's alright if someone comes along and hot-plugs a new CPU,
+	 * since we have that bh_lru_disabled == true. The hot-remove
+	 * case is handled in buffer_exit_cpu_dead().
+	 */
+	for_each_online_cpu(cpu) {
+		if (has_bh_in_lru(cpu, NULL))
+			cpumask_set_cpu(cpu, &lru_needs_invalidation);
+	}
+}
+
+void bh_lru_enable(void)
+{
+	bh_lru_disabled = false;
+	cpumask_clear(&lru_needs_invalidation);
+}
+
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset)
 {
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94..78eb5ee 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -193,7 +193,11 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size, gfp_t gfp);
 struct buffer_head *__bread_gfp(struct block_device *, sector_t block, unsigned size, gfp_t gfp);
+void invalidate_bh_lru(void *arg);
 void invalidate_bh_lrus(void);
[PATCH v2] [RFC] Invalidate BH LRU during page migration
A page containing buffer_heads can be pinned if any of its constituent
buffer_heads belongs to the BH LRU cache [1], which can prevent that page
from being migrated. After going through several iterations of a patch that
attempts to solve this by removing BH entries inside of the drop_buffers()
function, which in the worst case could be called for each migrated page,
Minchan Kim suggested that we invalidate the entire BH LRU once, just before
we start migrating pages. Additionally, Matthew Wilcox suggested that we
invalidate the BH LRU inside of lru_add_drain_all(), so as to benefit other
functions that would be impacted by pinned pages [2].

V2: Respond to feedback provided by Andrew, Minchan and Matthew in [3]. As
suggested by Minchan, we now invalidate the LRUs in a fashion similar to how
the pagevecs are drained in lru_add_drain_all().

[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L1238
[2] https://lore.kernel.org/linux-fsdevel/cover.1611642038.git.cgold...@codeaurora.org/
[3] https://lkml.org/lkml/2021/2/2/68

Chris Goldsworthy (1):
  [RFC] mm: fs: Invalidate BH LRU during page migration

 fs/buffer.c                 | 54 +++--
 include/linux/buffer_head.h |  8 +++
 include/linux/migrate.h     |  2 ++
 mm/migrate.c                | 19 
 mm/page_alloc.c             |  3 +++
 mm/swap.c                   |  7 +-
 6 files changed, 90 insertions(+), 3 deletions(-)

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH] [RFC] mm: fs: Invalidate BH LRU during page migration
Pages containing buffer_heads that are in the buffer_head LRU cache will be
pinned and thus cannot be migrated. Correspondingly, invalidate the BH LRU
before a migration starts, and stop any buffer_head from being cached in the
LRU until migration has finished.

Signed-off-by: Chris Goldsworthy
Cc: Minchan Kim
Cc: Matthew Wilcox
---
 fs/buffer.c                 | 6 ++
 include/linux/buffer_head.h | 3 +++
 include/linux/migrate.h     | 2 ++
 mm/migrate.c                | 18 ++
 mm/page_alloc.c             | 3 +++
 mm/swap.c                   | 3 +++
 6 files changed, 35 insertions(+)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..39ec4ec 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1289,6 +1289,8 @@ static inline void check_irqs_on(void)
 #endif
 }

+bool bh_migration_done = true;
+
 /*
  * Install a buffer_head into this cpu's LRU. If not already in the LRU, it is
  * inserted at the front, and the buffer_head at the back if any is evicted.
@@ -1303,6 +1305,9 @@ static void bh_lru_install(struct buffer_head *bh)
 	check_irqs_on();
 	bh_lru_lock();

+	if (!bh_migration_done)
+		goto out;
+
 	b = this_cpu_ptr(&bh_lrus);
 	for (i = 0; i < BH_LRU_SIZE; i++) {
 		swap(evictee, b->bhs[i]);
@@ -1313,6 +1318,7 @@ static void bh_lru_install(struct buffer_head *bh)
 	}

 	get_bh(bh);
+out:
 	bh_lru_unlock();
 	brelse(evictee);
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 6b47f94..ae4eb6d 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -193,6 +193,9 @@ void __breadahead_gfp(struct block_device *, sector_t block, unsigned int size, gfp_t gfp);
 struct buffer_head *__bread_gfp(struct block_device *, sector_t block, unsigned size, gfp_t gfp);
+
+extern bool bh_migration_done;
+
 void invalidate_bh_lrus(void);
 struct buffer_head *alloc_buffer_head(gfp_t gfp_flags);
 void free_buffer_head(struct buffer_head * bh);
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 3a38963..9e4a2dc 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -46,6 +46,7 @@ extern int isolate_movable_page(struct page *page, isolate_mode_t mode);
 extern void putback_movable_page(struct page *page);
 extern void migrate_prep(void);
+extern void migrate_finish(void);
 extern void migrate_prep_local(void);
 extern void migrate_page_states(struct page *newpage, struct page *page);
 extern void migrate_page_copy(struct page *newpage, struct page *page);
@@ -67,6 +68,7 @@ static inline int isolate_movable_page(struct page *page, isolate_mode_t mode) { return -EBUSY; }
 static inline int migrate_prep(void) { return -ENOSYS; }
+static inline int migrate_finish(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
 static inline void migrate_page_states(struct page *newpage, struct page *page)
diff --git a/mm/migrate.c b/mm/migrate.c
index a69da8a..08c981d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -64,6 +64,19 @@
  */
 void migrate_prep(void)
 {
+	bh_migration_done = false;
+
+	/*
+	 * This barrier ensures that callers of bh_lru_install() between
+	 * the barrier and the call to invalidate_bh_lrus() read
+	 * bh_migration_done as false.
+	 */
+	/*
+	 * TODO: Remove me? lru_add_drain_all() already has an smp_mb(),
+	 * but it would be good to ensure that the barrier isn't forgotten.
+	 */
+	smp_mb();
+
 	/*
 	 * Clear the LRU lists so pages can be isolated.
 	 * Note that pages may be moved off the LRU after we have
@@ -73,6 +86,11 @@ void migrate_prep(void)
 	lru_add_drain_all();
 }

+void migrate_finish(void)
+{
+	bh_migration_done = true;
+}
+
 /* Do the necessary work of migrate_prep but not if it involves other CPUs */
 void migrate_prep_local(void)
 {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6446778..e4cb959 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8493,6 +8493,9 @@ static int __alloc_contig_migrate_range(struct compact_control *cc,
 		ret = migrate_pages(&cc->migratepages, alloc_migration_target,
 				NULL, (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);
 	}
+
+	migrate_finish();
+
 	if (ret < 0) {
 		putback_movable_pages(&cc->migratepages);
 		return ret;
diff --git a/mm/swap.c b/mm/swap.c
index 31b844d..97efc49 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -36,6 +36,7 @@
 #include
 #include
 #include
+#include

 #include "internal.h"

@@ -759,6 +760,8 @@ void lru_add_drain_all(void)
 	if (WARN_ON(!mm_percpu_wq))
 		return;

+	invalidate_bh_lrus();
+
 	/*
	 * Guarantee pagevec counter stores visible by this CPU are visible to
	 * other CPUs befo
[RFC] Invalidate BH LRU during page migration
A page containing buffer_heads can be pinned if any of its constituent
buffer_heads belongs to the BH LRU cache [1]. After going through several
iterations of a patch that attempts to solve this by removing BH entries
inside of the drop_buffers() function, which in the worst case could be
called for each migrated page, Minchan Kim suggested that we invalidate the
entire BH LRU once, just before we start migrating pages. Additionally,
Matthew Wilcox suggested that we invalidate the BH LRU inside of
lru_add_drain_all(), so as to benefit other functions that would be impacted
by pinned pages [2].

TODO:

- It should be possible to remove the initial setting of
  bh_migration_done = false; in migrate_prep by passing this in as a
  parameter to invalidate_bh_lru(), but we'd still need a matching
  bh_migration_done = true; call.

- To really benefit callers of lru_add_drain_all() other than
  __alloc_contig_migrate_range() in the CMA allocation path, we'd need to
  add additional calls of bh_migration_done = false;

[1] https://elixir.bootlin.com/linux/latest/source/fs/buffer.c#L1238
[2] https://lore.kernel.org/linux-fsdevel/cover.1611642038.git.cgold...@codeaurora.org/

Chris Goldsworthy (1):
  [RFC] mm: fs: Invalidate BH LRU during page migration

 fs/buffer.c                 | 6 ++
 include/linux/buffer_head.h | 3 +++
 include/linux/migrate.h     | 2 ++
 mm/migrate.c                | 18 ++
 mm/page_alloc.c             | 3 +++
 mm/swap.c                   | 3 +++
 6 files changed, 35 insertions(+)

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers
On 2021-01-28 09:08, Minchan Kim wrote: On Thu, Jan 28, 2021 at 12:28:37AM -0800, Chris Goldsworthy wrote: On 2021-01-26 18:59, Matthew Wilcox wrote: > On Tue, Jan 26, 2021 at 02:59:17PM -0800, Minchan Kim wrote: > > The release buffer_head in LRU is great improvement for migration > > point of view. > > > > A question: Hey guys, > > Can't we invalidate(e.g., invalidate_bh_lrus) bh_lru in migrate_prep > > or > > elsewhere when migration found the failure and is about to retry? > > > > Migration has done such a way for other per-cpu stuffs for a long > > time, > > which would be more consistent with others and might be faster > > sometimes > > with reducing IPI calls for page. > Should lru_add_drain_all() also handle draining the buffer lru for all > callers? A quick survey ... > > invalidate_bdev() already calls invalidate_bh_lrus() > compact_nodes() would probably benefit from the BH LRU being invalidated > POSIX_FADV_DONTNEED would benefit if the underlying filesystem uses BHs > check_and_migrate_cma_pages() would benefit > khugepaged_do_scan() doesn't need it today > scan_get_next_rmap_item() looks like it only works on anon pages (?) so >doesn't need it > mem_cgroup_force_empty() probably needs it > mem_cgroup_move_charge() ditto > memfd_wait_for_pins() doesn't need it > shake_page() might benefit > offline_pages() would benefit > alloc_contig_range() would benefit > > Seems like most would benefit and a few won't care. I think I'd lean > towards having lru_add_drain_all() call invalidate_bh_lrus(), just to > simplify things. Doing this sounds like a good idea. 
We would still need a call to invalidate_bh_lrus() inside of drop_buffers()
in the event that we find busy buffers, since they can be re-added back into
the BH LRU - I believe it isn't until this point that a BH can't be added
back into the BH LRU, when we acquire the private_lock for the mapping:
https://elixir.bootlin.com/linux/v5.10.10/source/fs/buffer.c#L3240

I am not sure it's a good deal, considering the IPI overhead per page
release in the worst case. An idea is to disable the bh_lrus in migrate_prep
and enable them when migration is done (we'd need to introduce
"migrate_done"). It's a similar approach to marking a pageblock
MIGRATE_ISOLATE to disable the pcp during the migration.

I'll try creating that mechanism then for the BH LRU, and will come back
with a patch that does the invalidate in lru_add_drain_all(). Thanks Matthew
and Minchan for the feedback!

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
Re: [PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers
On 2021-01-26 18:59, Matthew Wilcox wrote:
On Tue, Jan 26, 2021 at 02:59:17PM -0800, Minchan Kim wrote:

The release of the buffer_head in the LRU is a great improvement from the
migration point of view. Hey guys, a question: can't we invalidate (e.g.,
invalidate_bh_lrus) the bh_lru in migrate_prep or elsewhere when migration
finds a failure and is about to retry? Migration has done it that way for
other per-cpu stuff for a long time, which would be more consistent with
the others and might sometimes be faster by reducing IPI calls per page.

Should lru_add_drain_all() also handle draining the buffer LRU for all
callers? A quick survey ...

invalidate_bdev() already calls invalidate_bh_lrus()
compact_nodes() would probably benefit from the BH LRU being invalidated
POSIX_FADV_DONTNEED would benefit if the underlying filesystem uses BHs
check_and_migrate_cma_pages() would benefit
khugepaged_do_scan() doesn't need it today
scan_get_next_rmap_item() looks like it only works on anon pages (?) so
doesn't need it
mem_cgroup_force_empty() probably needs it
mem_cgroup_move_charge() ditto
memfd_wait_for_pins() doesn't need it
shake_page() might benefit
offline_pages() would benefit
alloc_contig_range() would benefit

Seems like most would benefit and a few won't care. I think I'd lean
towards having lru_add_drain_all() call invalidate_bh_lrus(), just to
simplify things.

Doing this sounds like a good idea. We would still need a call to
invalidate_bh_lrus() inside of drop_buffers() in the event that we find
busy buffers, since they can be re-added back into the BH LRU - I believe
it isn't until this point that a BH can't be added back into the BH LRU,
when we acquire the private_lock for the mapping:
https://elixir.bootlin.com/linux/v5.10.10/source/fs/buffer.c#L3240

Thanks,

Chris.

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v4] fs/buffer.c: Revoke LRU when trying to drop buffers
From: Laura Abbott

When a buffer is added to the LRU list, a reference is taken which is not
dropped until the buffer is evicted from the LRU list. This is the correct
behavior; however, this LRU reference will prevent the buffer from being
dropped. This means that the buffer can't actually be dropped until it is
selected for eviction. There's no bound on the time spent on the LRU list,
which means that the buffer may be undroppable for very long periods of
time. Given that migration involves dropping buffers, the associated page is
now unmigratable for long periods of time as well. CMA relies on being able
to migrate a specific range of pages, so these types of failures make CMA
significantly less reliable, especially under high filesystem usage.

Rather than waiting for the LRU algorithm to eventually kick out the buffer,
explicitly remove the buffer from the LRU list when trying to drop it. There
is still the possibility that the buffer could be added back onto the list,
but that indicates the buffer is still in use and would probably have other
'in use' indicators to prevent dropping.

Note: a bug reported by the "kernel test robot" led to a switch from using
xas_for_each() to xa_for_each().
Signed-off-by: Laura Abbott
Signed-off-by: Chris Goldsworthy
Cc: Matthew Wilcox
Reported-by: kernel test robot
---
 fs/buffer.c | 79 +
 1 file changed, 74 insertions(+), 5 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 96c7604..27516a0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -48,6 +48,7 @@
 #include
 #include
 #include
+#include

 #include "internal.h"

@@ -1471,12 +1472,55 @@ static bool has_bh_in_lru(int cpu, void *dummy)
 	return false;
 }

+static void __evict_bhs_lru(void *arg)
+{
+	struct bh_lru *b = &get_cpu_var(bh_lrus);
+	struct xarray *busy_bhs = arg;
+	struct buffer_head *bh;
+	unsigned long i, xarray_index;
+
+	xa_for_each(busy_bhs, xarray_index, bh) {
+		for (i = 0; i < BH_LRU_SIZE; i++) {
+			if (b->bhs[i] == bh) {
+				brelse(b->bhs[i]);
+				b->bhs[i] = NULL;
+				break;
+			}
+		}
+	}
+
+	put_cpu_var(bh_lrus);
+}
+
+static bool page_has_bhs_in_lru(int cpu, void *arg)
+{
+	struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu);
+	struct xarray *busy_bhs = arg;
+	struct buffer_head *bh;
+	unsigned long i, xarray_index;
+
+	xa_for_each(busy_bhs, xarray_index, bh) {
+		for (i = 0; i < BH_LRU_SIZE; i++) {
+			if (b->bhs[i] == bh)
+				return true;
+		}
+	}
+
+	return false;
+}
+
 void invalidate_bh_lrus(void)
 {
 	on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1);
 }
 EXPORT_SYMBOL_GPL(invalidate_bh_lrus);

+static void evict_bh_lrus(struct xarray *busy_bhs)
+{
+	on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru,
+			 busy_bhs, 1);
+}
+
 void set_bh_page(struct buffer_head *bh,
 		struct page *page, unsigned long offset)
 {
@@ -3242,14 +3286,38 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 {
 	struct buffer_head *head = page_buffers(page);
 	struct buffer_head *bh;
+	struct xarray busy_bhs;
+	int bh_count = 0;
+	int xa_ret, ret = 0;
+
+	xa_init(&busy_bhs);

 	bh = head;
 	do {
-		if (buffer_busy(bh))
-			goto failed;
+		if (buffer_busy(bh)) {
+			xa_ret = xa_err(xa_store(&busy_bhs, bh_count++,
+						 bh, GFP_ATOMIC));
+			if (xa_ret)
+				goto out;
+		}
 		bh = bh->b_this_page;
 	} while (bh != head);

+	if (bh_count) {
+		/*
+		 * Check if the busy failure was due to an outstanding
+		 * LRU reference
+		 */
+		evict_bh_lrus(&busy_bhs);
+		do {
+			if (buffer_busy(bh))
+				goto out;
+
+			bh = bh->b_this_page;
+		} while (bh != head);
+	}
+
+	ret = 1;
 	do {
 		struct buffer_head *next = bh->b_this_page;

@@ -3259,9 +3327,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free)
 	} while (bh != head);
 	*buffers_to_free = head;
 	detach_page_private(page);
-	return 1;
-failed:
-	return 0;
+out:
+	xa_destroy(&busy_bhs);
+
+	return ret;
 }

 int try_to_free_buffers(struct page *page)
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v4] Resolve LRU page-pinning issue for file-backed pages
It is possible for file-backed pages to end up in a contiguous memory area
(CMA), such that the relevant page must be migrated using the .migratepage()
callback when its backing physical memory is selected for use in a CMA
allocation (through cma_alloc()).

However, if a set of address space operations (AOPs) for a file-backed page
lacks a .migratepage() callback, fallback_migrate_page() will be used
instead, which through try_to_release_page() calls try_to_free_buffers()
(called directly or through a releasepage callback). try_to_free_buffers()
in turn calls drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with a page
being busy. However, it is possible that the buffer_head is on an LRU list
for a CPU, such that we can try removing the buffer_head from that list, in
order to successfully release the page. Do this.

v1: https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190

  Original version of the patch (with updates made to account for changes in
  on_each_cpu_cond()).

v2: https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/

  Follow Matthew Wilcox's suggestion of reducing the number of calls to
  on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of
  on_each_cpu_cond(). To copy from his e-mail, we go from:

  for_each_buffer
    for_each_cpu
      for_each_lru_entry

  to:

  for_each_cpu
    for_each_buffer
      for_each_lru_entry

  This is done using xarrays, which I found to be the cleanest data
  structure to use, though a pre-allocated array of
  page_size(page) / bh->b_size elements might be more performant.

v3: Replace xas_for_each() with xa_for_each() to account for proper locking.

v4: Fix an iteration error.

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c | 79 +
 1 file changed, 74 insertions(+), 5 deletions(-)

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v3] Resolve LRU page-pinning issue for file-backed pages
It is possible for file-backed pages to end up in a contiguous memory area
(CMA), such that the relevant page must be migrated using the .migratepage()
callback when its backing physical memory is selected for use in a CMA
allocation (through cma_alloc()).

However, if a set of address space operations (AOPs) for a file-backed page
lacks a .migratepage() callback, fallback_migrate_page() will be used
instead, which through try_to_release_page() calls try_to_free_buffers()
(called directly or through a releasepage callback). try_to_free_buffers()
in turn calls drop_buffers().

drop_buffers() itself can fail due to the buffer_head associated with a page
being busy. However, it is possible that the buffer_head is on an LRU list
for a CPU, such that we can try removing the buffer_head from that list, in
order to successfully release the page. Do this.

v1: https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190

  Original version of the patch (with updates made to account for changes in
  on_each_cpu_cond()).

v2: https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/

  Follow Matthew Wilcox's suggestion of reducing the number of calls to
  on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of
  on_each_cpu_cond(). To copy from his e-mail, we go from:

  for_each_buffer
    for_each_cpu
      for_each_lru_entry

  to:

  for_each_cpu
    for_each_buffer
      for_each_lru_entry

  This is done using xarrays, which I found to be the cleanest data
  structure to use, though a pre-allocated array of
  page_size(page) / bh->b_size elements might be more performant.

v3: Replace xas_for_each() with xa_for_each() to account for proper locking.

Laura Abbott (1):
  fs/buffer.c: Revoke LRU when trying to drop buffers

 fs/buffer.c | 81 +
 1 file changed, 76 insertions(+), 5 deletions(-)

--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum,
a Linux Foundation Collaborative Project
[PATCH v3] fs/buffer.c: Revoke LRU when trying to drop buffers
From: Laura Abbott When a buffer is added to the LRU list, a reference is taken which is not dropped until the buffer is evicted from the LRU list. This is the correct behavior; however, this LRU reference will prevent the buffer from being dropped. This means that the buffer can't actually be dropped until it is selected for eviction. There's no bound on the time spent on the LRU list, which means that the buffer may be undroppable for very long periods of time. Given that migration involves dropping buffers, the associated page is now unmigratable for long periods of time as well. CMA relies on being able to migrate a specific range of pages, so these types of failures make CMA significantly less reliable, especially under high filesystem usage. Rather than waiting for the LRU algorithm to eventually kick out the buffer, explicitly remove the buffer from the LRU list when trying to drop it. There is still the possibility that the buffer could be added back on the list, but that indicates the buffer is still in use and would probably have other 'in use' indications to prevent dropping. Note: a bug reported by "kernel test robot" led to a switch from using xas_for_each() to xa_for_each(). 
Signed-off-by: Laura Abbott Signed-off-by: Chris Goldsworthy Cc: Matthew Wilcox Reported-by: kernel test robot --- fs/buffer.c | 81 + 1 file changed, 76 insertions(+), 5 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 96c7604..d2d1237 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "internal.h" @@ -1471,12 +1472,59 @@ static bool has_bh_in_lru(int cpu, void *dummy) return false; } +static void __evict_bhs_lru(void *arg) +{ + struct bh_lru *b = _cpu_var(bh_lrus); + struct xarray *busy_bhs = arg; + struct buffer_head *bh; + unsigned long i, xarray_index; + + xa_for_each(busy_bhs, xarray_index, bh) { + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) { + brelse(b->bhs[i]); + b->bhs[i] = NULL; + break; + } + } + + bh = bh->b_this_page; + } + + put_cpu_var(bh_lrus); +} + +static bool page_has_bhs_in_lru(int cpu, void *arg) +{ + struct bh_lru *b = per_cpu_ptr(_lrus, cpu); + struct xarray *busy_bhs = arg; + struct buffer_head *bh; + unsigned long i, xarray_index; + + xa_for_each(busy_bhs, xarray_index, bh) { + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) + return true; + } + + bh = bh->b_this_page; + } + + return false; + +} void invalidate_bh_lrus(void) { on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1); } EXPORT_SYMBOL_GPL(invalidate_bh_lrus); +static void evict_bh_lrus(struct xarray *busy_bhs) +{ + on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru, +busy_bhs, 1); +} + void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { @@ -3242,14 +3290,36 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) { struct buffer_head *head = page_buffers(page); struct buffer_head *bh; + struct xarray busy_bhs; + int bh_count = 0; + int xa_ret, ret = 0; + + xa_init(_bhs); bh = head; do { - if (buffer_busy(bh)) - goto failed; + if (buffer_busy(bh)) { + xa_ret = xa_err(xa_store(_bhs, bh_count++, +bh, GFP_ATOMIC)); + if (xa_ret) + goto 
out; + } bh = bh->b_this_page; } while (bh != head); + if (bh_count) { + /* +* Check if the busy failure was due to an outstanding +* LRU reference +*/ + evict_bh_lrus(_bhs); + do { + if (buffer_busy(bh)) + goto out; + } while (bh != head); + } + + ret = 1; do { struct buffer_head *next = bh->b_this_page; @@ -3259,9 +3329,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) } while (bh != head); *buffers_to_free = head; detach_page_private(page); - return 1; -failed: - return 0; +out: + xa_destroy(_bhs); + + return ret; } int try_to_free_buffers(struct page *page) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH] fs/buffer.c: Revoke LRU when trying to drop buffers
On 2020-11-24 07:39, Matthew Wilcox wrote: On Mon, Nov 23, 2020 at 10:49:38PM -0800, Chris Goldsworthy wrote: +static void __evict_bh_lru(void *arg) +{ + struct bh_lru *b = &get_cpu_var(bh_lrus); + struct buffer_head *bh = arg; + int i; + + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) { + brelse(b->bhs[i]); + b->bhs[i] = NULL; + goto out; That's an odd way to spell 'break' ... + } + } +out: + put_cpu_var(bh_lrus); +} ... @@ -3245,8 +3281,15 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) bh = head; do { - if (buffer_busy(bh)) - goto failed; + if (buffer_busy(bh)) { + /* +* Check if the busy failure was due to an +* outstanding LRU reference +*/ + evict_bh_lrus(bh); + if (buffer_busy(bh)) + goto failed; Hi Matthew, Apologies for the delayed response. We might be better off just calling invalidate_bh_lrus() -- we'd flush the entire LRU, but we'd only need to do it once, not once per buffer head. I'm concerned about emptying the cache, such that those who might benefit from it would be left affected. We could have a more complex 'evict' that iterates each busy buffer on a page so transforming: for_each_buffer for_each_cpu for_each_lru_entry to: for_each_cpu for_each_buffer for_each_lru_entry (and I suggest that way because it's more expensive to iterate the buffers than it is to iterate the lru entries) I've gone ahead and done this in a follow-up patch: https://lore.kernel.org/lkml/cover.1609829465.git.cgold...@codeaurora.org/ There might be room for improvement in the data structure being used to track the used entries - using an xarray gives the cleanest code, but pre-allocating an array to hold up to page_size(page) / bh->b_size entries might be faster, although it would be a bit uglier to do in a way that doesn't reduce the performance of the case when evict_bh_lru() doesn't need to be called. Regards, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v2] fs/buffer.c: Revoke LRU when trying to drop buffers
From: Laura Abbott When a buffer is added to the LRU list, a reference is taken which is not dropped until the buffer is evicted from the LRU list. This is the correct behavior; however, this LRU reference will prevent the buffer from being dropped. This means that the buffer can't actually be dropped until it is selected for eviction. There's no bound on the time spent on the LRU list, which means that the buffer may be undroppable for very long periods of time. Given that migration involves dropping buffers, the associated page is now unmigratable for long periods of time as well. CMA relies on being able to migrate a specific range of pages, so these types of failures make CMA significantly less reliable, especially under high filesystem usage. Rather than waiting for the LRU algorithm to eventually kick out the buffer, explicitly remove the buffer from the LRU list when trying to drop it. There is still the possibility that the buffer could be added back on the list, but that indicates the buffer is still in use and would probably have other 'in use' indications to prevent dropping. 
Signed-off-by: Laura Abbott Signed-off-by: Chris Goldsworthy Cc: Matthew Wilcox --- fs/buffer.c | 85 +++ fs/internal.h | 5 2 files changed, 85 insertions(+), 5 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 96c7604..536fb5b 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -48,6 +48,7 @@ #include #include #include +#include #include "internal.h" @@ -1471,12 +1472,63 @@ static bool has_bh_in_lru(int cpu, void *dummy) return false; } +static void __evict_bhs_lru(void *arg) +{ + struct bh_lru *b = _cpu_var(bh_lrus); + struct busy_bhs_container *busy_bhs = arg; + struct buffer_head *bh; + int i; + + XA_STATE(xas, _bhs->xarray, 0); + + xas_for_each(, bh, busy_bhs->size) { + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) { + brelse(b->bhs[i]); + b->bhs[i] = NULL; + break; + } + } + + bh = bh->b_this_page; + } + + put_cpu_var(bh_lrus); +} + +static bool page_has_bhs_in_lru(int cpu, void *arg) +{ + struct bh_lru *b = per_cpu_ptr(_lrus, cpu); + struct busy_bhs_container *busy_bhs = arg; + struct buffer_head *bh; + int i; + + XA_STATE(xas, _bhs->xarray, 0); + + xas_for_each(, bh, busy_bhs->size) { + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) + return true; + } + + bh = bh->b_this_page; + } + + return false; + +} void invalidate_bh_lrus(void) { on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1); } EXPORT_SYMBOL_GPL(invalidate_bh_lrus); +static void evict_bh_lrus(struct busy_bhs_container *busy_bhs) +{ + on_each_cpu_cond(page_has_bhs_in_lru, __evict_bhs_lru, +busy_bhs, 1); +} + void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { @@ -3242,14 +3294,36 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) { struct buffer_head *head = page_buffers(page); struct buffer_head *bh; + struct busy_bhs_container busy_bhs; + int xa_ret, ret = 0; + + xa_init(_bhs.xarray); + busy_bhs.size = 0; bh = head; do { - if (buffer_busy(bh)) - goto failed; + if (buffer_busy(bh)) { + xa_ret = 
xa_err(xa_store(_bhs.xarray, busy_bhs.size++, +bh, GFP_ATOMIC)); + if (xa_ret) + goto out; + } bh = bh->b_this_page; } while (bh != head); + if (busy_bhs.size) { + /* +* Check if the busy failure was due to an outstanding +* LRU reference +*/ + evict_bh_lrus(_bhs); + do { + if (buffer_busy(bh)) + goto out; + } while (bh != head); + } + + ret = 1; do { struct buffer_head *next = bh->b_this_page; @@ -3259,9 +,10 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) } while (bh != head); *buffers_to_free = head; detach_page_private(page); - return 1; -failed: - return 0; +out: + xa_destroy(_bhs.xarray); + + return ret; } int try_to_free_buffers(struct page *page) diff --git a/fs/internal.h b/fs/internal.h index 77c50be..00f17c4 100644 --- a/fs/internal.h +++ b/fs/internal.h @@
[PATCH v2] Resolve LRU page-pinning issue for file-backed pages
It is possible for file-backed pages to end up in a contiguous memory area (CMA), such that the relevant page must be migrated using the .migratepage() callback when its backing physical memory is selected for use in a CMA allocation (through cma_alloc()). However, if the set of address space operations (AOPs) for a file-backed page lacks a migratepage() callback, fallback_migrate_page() is used instead, which through try_to_release_page() calls try_to_free_buffers() (either directly or through a releasepage() callback). try_to_free_buffers() in turn calls drop_buffers(). drop_buffers() itself can fail due to the buffer_head associated with a page being busy. However, it is possible that the buffer_head is merely on an LRU list for a CPU, in which case we can try removing the buffer_head from that list in order to successfully release the page. Do this. v1: https://lore.kernel.org/lkml/cover.1606194703.git.cgold...@codeaurora.org/T/#m3a44b5745054206665455625ccaf27379df8a190 Original version of the patch (with updates made to account for changes in on_each_cpu_cond()). v2: Follow Matthew Wilcox's suggestion of reducing the number of calls to on_each_cpu_cond(), by iterating over a page's busy buffer_heads inside of on_each_cpu_cond(). To copy from his e-mail, we go from: for_each_buffer for_each_cpu for_each_lru_entry to: for_each_cpu for_each_buffer for_each_lru_entry This is done using xarrays, which I found to be the cleanest data structure to use, though a pre-allocated array of page_size(page) / bh->b_size elements might be more performant. Laura Abbott (1): fs/buffer.c: Revoke LRU when trying to drop buffers fs/buffer.c | 85 +++ fs/internal.h | 5 2 files changed, 85 insertions(+), 5 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] fs/buffer.c: Revoke LRU when trying to drop buffers
From: Laura Abbott When a buffer is added to the LRU list, a reference is taken which is not dropped until the buffer is evicted from the LRU list. This is the correct behavior; however, this LRU reference will prevent the buffer from being dropped. This means that the buffer can't actually be dropped until it is selected for eviction. There's no bound on the time spent on the LRU list, which means that the buffer may be undroppable for very long periods of time. Given that migration involves dropping buffers, the associated page is now unmigratable for long periods of time as well. CMA relies on being able to migrate a specific range of pages, so these types of failures make CMA significantly less reliable, especially under high filesystem usage. Rather than waiting for the LRU algorithm to eventually kick out the buffer, explicitly remove the buffer from the LRU list when trying to drop it. There is still the possibility that the buffer could be added back on the list, but that indicates the buffer is still in use and would probably have other 'in use' indications to prevent dropping. 
Signed-off-by: Laura Abbott Signed-off-by: Chris Goldsworthy --- fs/buffer.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 64564ac..1751f0b 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -1471,12 +1471,48 @@ static bool has_bh_in_lru(int cpu, void *dummy) return false; } +static void __evict_bh_lru(void *arg) +{ + struct bh_lru *b = &get_cpu_var(bh_lrus); + struct buffer_head *bh = arg; + int i; + + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) { + brelse(b->bhs[i]); + b->bhs[i] = NULL; + goto out; + } + } +out: + put_cpu_var(bh_lrus); +} + +static bool bh_exists_in_lru(int cpu, void *arg) +{ + struct bh_lru *b = per_cpu_ptr(&bh_lrus, cpu); + struct buffer_head *bh = arg; + int i; + + for (i = 0; i < BH_LRU_SIZE; i++) { + if (b->bhs[i] == bh) + return true; + } + + return false; +} void invalidate_bh_lrus(void) { on_each_cpu_cond(has_bh_in_lru, invalidate_bh_lru, NULL, 1); } EXPORT_SYMBOL_GPL(invalidate_bh_lrus); +static void evict_bh_lrus(struct buffer_head *bh) +{ + on_each_cpu_cond(bh_exists_in_lru, __evict_bh_lru, bh, 1); +} + void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { @@ -3245,8 +3281,15 @@ drop_buffers(struct page *page, struct buffer_head **buffers_to_free) bh = head; do { - if (buffer_busy(bh)) - goto failed; + if (buffer_busy(bh)) { + /* +* Check if the busy failure was due to an +* outstanding LRU reference +*/ + evict_bh_lrus(bh); + if (buffer_busy(bh)) + goto failed; + } bh = bh->b_this_page; } while (bh != head); -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] Resolve LRU page-pinning issue for file-backed pages
It is possible for file-backed pages to end up in a contiguous memory area (CMA), such that the relevant page must be migrated using the .migratepage() callback when its backing physical memory is selected for use in a CMA allocation (through cma_alloc()). However, if the set of address space operations (AOPs) for a file-backed page lacks a migratepage() callback, fallback_migrate_page() is used instead, which through try_to_release_page() calls try_to_free_buffers() (either directly or through a releasepage() callback). try_to_free_buffers() in turn calls drop_buffers(). drop_buffers() itself can fail due to the buffer_head associated with a page being busy. However, it is possible that the buffer_head is merely on an LRU list for a CPU, in which case we can try removing the buffer_head from that list in order to successfully release the page. Do this. For reference, ext4_journalled_aops is a set of AOPs that lacks the migratepage() callback. Laura Abbott (1): fs/buffer.c: Revoke LRU when trying to drop buffers fs/buffer.c | 47 +-- 1 file changed, 45 insertions(+), 2 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH 0/2] Increasing CMA Utilization with a GFP Flag
The current approach to increasing CMA utilization introduced in commit 16867664936e ("mm,page_alloc,cma: conditionally prefer cma pageblocks for movable allocations") increases CMA utilization by redirecting MIGRATE_MOVABLE allocations to a CMA region when greater than half of the free pages in a given zone are CMA pages. The issue in this approach is that allocations with type MIGRATE_MOVABLE can still succumb to pinning. To get around this, one approach is to redirect to the CMA areas only those allocations that are known not to be victims of pinning. To this end, this series brings in __GFP_CMA, which marks allocations that we know are safe to be redirected to a CMA area. Heesub Shin (1): cma: redirect page allocation to CMA Vinayak Menon (1): zram: allow zram to allocate CMA pages drivers/block/zram/zram_drv.c | 5 +-- include/linux/gfp.h | 15 include/linux/highmem.h | 4 ++- include/linux/mmzone.h| 4 +++ mm/page_alloc.c | 83 +++ mm/zsmalloc.c | 4 +-- 6 files changed, 79 insertions(+), 36 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH 2/2] zram: allow zram to allocate CMA pages
From: Vinayak Menon Though zram pages are movable, they aren't allowed to enter MIGRATE_CMA pageblocks. zram is not seen to pin pages for long which can cause an issue. Moreover allowing zram to pick CMA pages can be helpful in cases seen where zram order 0 alloc fails when there are lots of free cma pages, resulting in kswapd or direct reclaim not making enough progress. Signed-off-by: Vinayak Menon Signed-off-by: Chris Goldsworthy --- drivers/block/zram/zram_drv.c | 5 +++-- mm/zsmalloc.c | 4 ++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/drivers/block/zram/zram_drv.c b/drivers/block/zram/zram_drv.c index 9a1e6ee..4b6b16d 100644 --- a/drivers/block/zram/zram_drv.c +++ b/drivers/block/zram/zram_drv.c @@ -1380,13 +1380,14 @@ static int __zram_bvec_write(struct zram *zram, struct bio_vec *bvec, __GFP_KSWAPD_RECLAIM | __GFP_NOWARN | __GFP_HIGHMEM | - __GFP_MOVABLE); + __GFP_MOVABLE | + __GFP_CMA); if (!handle) { zcomp_stream_put(zram->comp); atomic64_inc(>stats.writestall); handle = zs_malloc(zram->mem_pool, comp_len, GFP_NOIO | __GFP_HIGHMEM | - __GFP_MOVABLE); + __GFP_MOVABLE | __GFP_CMA); if (handle) goto compress_again; return -ENOMEM; diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c index b03bee2..16ba318 100644 --- a/mm/zsmalloc.c +++ b/mm/zsmalloc.c @@ -351,7 +351,7 @@ static void destroy_cache(struct zs_pool *pool) static unsigned long cache_alloc_handle(struct zs_pool *pool, gfp_t gfp) { return (unsigned long)kmem_cache_alloc(pool->handle_cachep, - gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); + gfp & ~(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_CMA)); } static void cache_free_handle(struct zs_pool *pool, unsigned long handle) @@ -362,7 +362,7 @@ static void cache_free_handle(struct zs_pool *pool, unsigned long handle) static struct zspage *cache_alloc_zspage(struct zs_pool *pool, gfp_t flags) { return kmem_cache_alloc(pool->zspage_cachep, - flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE)); + flags & ~(__GFP_HIGHMEM|__GFP_MOVABLE|__GFP_CMA)); } static void 
cache_free_zspage(struct zs_pool *pool, struct zspage *zspage) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH 1/2] cma: redirect page allocation to CMA
From: Heesub Shin CMA pages are designed to be used as fallback for movable allocations and cannot be used for non-movable allocations. If CMA pages are utilized poorly, non-movable allocations may end up getting starved if all regular movable pages are allocated and the only pages left are CMA. Always using CMA pages first creates unacceptable performance problems. As a midway alternative, use CMA pages for certain userspace allocations. The userspace pages can be migrated or dropped quickly which giving decent utilization. Signed-off-by: Kyungmin Park Signed-off-by: Heesub Shin Signed-off-by: Vinayak Menon [cgold...@codeaurora.org: Place in bugfixes] Signed-off-by: Chris Goldsworthy --- include/linux/gfp.h | 15 + include/linux/highmem.h | 4 ++- include/linux/mmzone.h | 4 +++ mm/page_alloc.c | 83 +++-- 4 files changed, 74 insertions(+), 32 deletions(-) diff --git a/include/linux/gfp.h b/include/linux/gfp.h index c603237..e80b7d2 100644 --- a/include/linux/gfp.h +++ b/include/linux/gfp.h @@ -39,11 +39,21 @@ struct vm_area_struct; #define ___GFP_HARDWALL0x10u #define ___GFP_THISNODE0x20u #define ___GFP_ACCOUNT 0x40u +#ifdef CONFIG_CMA +#define ___GFP_CMA 0x80u +#else +#define ___GFP_CMA 0 +#endif #ifdef CONFIG_LOCKDEP +#ifdef CONFIG_CMA +#define ___GFP_NOLOCKDEP 0x100u +#else #define ___GFP_NOLOCKDEP 0x80u +#endif #else #define ___GFP_NOLOCKDEP 0 #endif + /* If the above are modified, __GFP_BITS_SHIFT may need updating */ /* @@ -57,6 +67,7 @@ struct vm_area_struct; #define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM) #define __GFP_DMA32((__force gfp_t)___GFP_DMA32) #define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* ZONE_MOVABLE allowed */ +#define __GFP_CMA ((__force gfp_t)___GFP_CMA) #define GFP_ZONEMASK (__GFP_DMA|__GFP_HIGHMEM|__GFP_DMA32|__GFP_MOVABLE) /** @@ -224,7 +235,11 @@ struct vm_area_struct; #define __GFP_NOLOCKDEP ((__force gfp_t)___GFP_NOLOCKDEP) /* Room for N __GFP_FOO bits */ +#ifdef CONFIG_CMA +#define __GFP_BITS_SHIFT (24 + 
IS_ENABLED(CONFIG_LOCKDEP)) +#else #define __GFP_BITS_SHIFT (23 + IS_ENABLED(CONFIG_LOCKDEP)) +#endif #define __GFP_BITS_MASK ((__force gfp_t)((1 << __GFP_BITS_SHIFT) - 1)) /** diff --git a/include/linux/highmem.h b/include/linux/highmem.h index 14e6202..35f052b 100644 --- a/include/linux/highmem.h +++ b/include/linux/highmem.h @@ -274,7 +274,9 @@ static inline struct page * alloc_zeroed_user_highpage_movable(struct vm_area_struct *vma, unsigned long vaddr) { - return __alloc_zeroed_user_highpage(__GFP_MOVABLE, vma, vaddr); + return __alloc_zeroed_user_highpage( + __GFP_MOVABLE|__GFP_CMA, vma, + vaddr); } static inline void clear_highpage(struct page *page) diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index fb3bf69..3f913be 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -471,6 +471,10 @@ struct zone { struct pglist_data *zone_pgdat; struct per_cpu_pageset __percpu *pageset; +#ifdef CONFIG_CMA + boolcma_alloc; +#endif + #ifndef CONFIG_SPARSEMEM /* * Flags for a pageblock_nr_pages block. See pageblock-flags.h. diff --git a/mm/page_alloc.c b/mm/page_alloc.c index d772206..f938de7 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2860,35 +2860,34 @@ __rmqueue(struct zone *zone, unsigned int order, int migratetype, { struct page *page; -#ifdef CONFIG_CMA - /* -* Balance movable allocations between regular and CMA areas by -* allocating from CMA when over half of the zone's free memory -* is in the CMA area. 
-*/ - if (alloc_flags & ALLOC_CMA && - zone_page_state(zone, NR_FREE_CMA_PAGES) > - zone_page_state(zone, NR_FREE_PAGES) / 2) { - page = __rmqueue_cma_fallback(zone, order); - if (page) - return page; - } -#endif retry: page = __rmqueue_smallest(zone, order, migratetype); - if (unlikely(!page)) { - if (alloc_flags & ALLOC_CMA) - page = __rmqueue_cma_fallback(zone, order); - if (!page && __rmqueue_fallback(zone, order, migratetype, - alloc_flags)) - goto retry; - } + if (unlikely(!page) && __rmqueue_fallback(zone, order, migratetype, + alloc_flags)) + goto retry; trace_mm_page_alloc_zone_locked(page, order, migratetype); return page; } +static struct page *__rmqueue_cma(struct zone *zone, unsigned int order, +
Re: [PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-28 22:59, Christoph Hellwig wrote: On Mon, Sep 28, 2020 at 01:30:27PM -0700, Chris Goldsworthy wrote: CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. If a process ends up causing _refcount > _mapcount for a page (by either incrementing _refcount or decrementing _mapcount), such that the process is context switched out after modifying one refcount but before modifying the other, the page will be temporarily pinned. One example of where _refcount can be greater than _mapcount is inside of zap_pte_range(), which is called for all the entries of a PMD when a process is exiting, to unmap the process's memory. Inside of zap_pte_range(), after unmapping a page with page_remove_rmap(), we have that _refcount > _mapcount. _refcount can only be decremented after a TLB flush is performed for the page - this doesn't occur until enough pages have been batched together for flushing. The flush can either occur inside of zap_pte_range() (during the same invocation or a later one), or if there aren't enough pages collected by the time we unmap all of the pages in a process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After the flush has occurred, tlb_batch_pages_flush() will decrement the references on the flushed pages. 
Another example like the above is inside of copy_one_pte(), which is called during a fork. For PTEs for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. So, inside of cma_alloc(), add the option of letting users pass in __GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely, in the event that alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap. And who is going to use this? As-is this just seems to add code that isn't actually used and thus not actually tested. (In addition to being a really bad idea, as discussed before.) Hi Christoph, That had slipped my mind - what we would have submitted would have been a modified /drivers/dma-heap/heaps/cma_heap.c, which would have created a "linux,cma-nofail" heap that, when allocated from, passes GFP_NOFAIL to cma_alloc(). But, since this retry approach (finite and infinite) has effectively been nacked, I've gone back to the drawing board to find either (1) a lock-based approach to solving this (as posed by Andrew Morton here: https://lkml.org/lkml/2020/8/21/1490), or (2) using preempt_disable() calls. Thanks, Chris. --- a/kernel/dma/contiguous.c +++ b/kernel/dma/contiguous.c @@ -196,7 +196,7 @@ struct page *dma_alloc_from_contiguous(struct device *dev, size_t count, if (align > CONFIG_CMA_ALIGNMENT) align = CONFIG_CMA_ALIGNMENT; - return cma_alloc(dev_get_cma_area(dev), count, align, no_warn); + return cma_alloc(dev_get_cma_area(dev), count, align, no_warn ? __GFP_NOWARN : 0); Also don't add pointlessly overlong lines. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: Question regarding ext4_journalled_aops: lack of migrate_page
On 2020-10-08 23:42, Chris Goldsworthy wrote: Hi there, ext4_aops and ext4_da_aops both have a migratepage callback, whereas ext4_journalled_aops lacks such a callback. Why is this so? I’m asking this due to the following: when a page containing EXT4 journal buffer heads ends up being migrated, fallback_migrate_page() is used, which eventually calls try_to_free_buffers(), which will call drop_buffers(). drop_buffers() can fail for a page if that page's buffer_heads are on the LRU list (see https://elixir.bootlin.com/linux/v5.8.14/source/fs/buffer.c#L3225). Now, if buffer_migrate_page() were supplied as the migratepage callback for the journalled aops, this wouldn’t be a problem, since we ignore the LRU lists altogether. Resolving this issue will benefit CMA allocations, which might have to migrate movable pages that were allocated from a CMA region (the assumption is that these pages can be migrated once the memory backing these pages is needed). Thanks, Chris. An alternative solution to placing in a migratepage callback is to actually remove the buffer heads from an LRU list in drop_buffers() - we tried upstreaming such a patch some time ago: https://lore.kernel.org/patchwork/patch/325406/ . This would solve the problem, and would avoid having to add a migratepage callback for any other file systems that have aops without the callback. This didn't get off the ground, however, so I'm picking up the line of questioning left in the previous thread, to determine why EXT4 doesn't have a migratepage callback. Thanks, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Question regarding ext4_journalled_aops: lack of migrate_page
Hi there, ext4_aops and ext4_da_aops both have a migratepage callback, whereas ext4_journalled_aops lacks such a callback. Why is this so? I’m asking this due to the following: when a page containing EXT4 journal buffer heads ends up being migrated, fallback_migrate_page() is used, which eventually calls try_to_free_buffers(), which will call drop_buffers(). drop_buffers() can fail for a page if that page's buffer_heads are on the LRU list (see https://elixir.bootlin.com/linux/v5.8.14/source/fs/buffer.c#L3225). Now, if buffer_migrate_page() were supplied as the migratepage callback for the journalled aops, this wouldn’t be a problem, since we ignore the LRU lists altogether. Resolving this issue will benefit CMA allocations, which might have to migrate movable pages that were allocated from a CMA region (the assumption is that these pages can be migrated once the memory backing these pages is needed). Thanks, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [RFC][PATCH 5/6] dma-buf: system_heap: Add pagepool support to system heap
On 2020-09-29 21:46, Chris Goldsworthy wrote: On 2020-09-25 21:24, John Stultz wrote: Reuse/abuse the pagepool code from the network code to speed up allocation performance. This is similar to the ION pagepool usage, but tries to utilize generic code instead of a custom implementation. Cc: Sumit Semwal Cc: Liam Mark Cc: Laura Abbott Cc: Brian Starkey Cc: Hridya Valsaraju Cc: Suren Baghdasaryan Cc: Sandeep Patil Cc: Ørjan Eide Cc: Robin Murphy Cc: Ezequiel Garcia Cc: Simon Ser Cc: James Jones Cc: linux-me...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Signed-off-by: John Stultz --- drivers/dma-buf/heaps/Kconfig | 1 + drivers/dma-buf/heaps/system_heap.c | 32 + 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig index a5eef06c4226..f13cde4321b1 100644 --- a/drivers/dma-buf/heaps/Kconfig +++ b/drivers/dma-buf/heaps/Kconfig @@ -1,6 +1,7 @@ config DMABUF_HEAPS_SYSTEM bool "DMA-BUF System Heap" depends on DMABUF_HEAPS + select PAGE_POOL help Choose this option to enable the system dmabuf heap. The system heap is backed by pages from the buddy allocator. If in doubt, say Y. 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index 882a632e9bb7..9f57b4c8ae69 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -20,6 +20,7 @@ #include #include #include +#include struct dma_heap *sys_heap; @@ -46,6 +47,7 @@ struct dma_heap_attachment { static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, LOW_ORDER_GFP}; static const unsigned int orders[] = {8, 4, 0}; #define NUM_ORDERS ARRAY_SIZE(orders) +struct page_pool *pools[NUM_ORDERS]; static struct sg_table *dup_sg_table(struct sg_table *table) { @@ -264,13 +266,17 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf) struct system_heap_buffer *buffer = dmabuf->priv; struct sg_table *table; struct scatterlist *sg; - int i; + int i, j; table = &buffer->sg_table; for_each_sg(table->sgl, sg, table->nents, i) { struct page *page = sg_page(sg); - __free_pages(page, compound_order(page)); + for (j = 0; j < NUM_ORDERS; j++) { + if (compound_order(page) == orders[j]) + break; + } + page_pool_put_full_page(pools[j], page, false); } sg_free_table(table); kfree(buffer); @@ -300,8 +306,7 @@ static struct page *alloc_largest_available(unsigned long size, continue; if (max_order < orders[i]) continue; - - page = alloc_pages(order_flags[i], orders[i]); + page = page_pool_alloc_pages(pools[i], order_flags[i]); if (!page) continue; return page; @@ -406,6 +411,25 @@ static const struct dma_heap_ops system_heap_ops = { static int system_heap_create(void) { struct dma_heap_export_info exp_info; + int i; + + for (i = 0; i < NUM_ORDERS; i++) { + struct page_pool_params pp; + + memset(&pp, 0, sizeof(pp)); + pp.order = orders[i]; + pp.dma_dir = DMA_BIDIRECTIONAL; Hey John, Correct me if I'm wrong, but I think that in order for pp.dma_dir to be used in either page_pool_alloc_pages() or page_pool_put_full_page(), we need to at least have PP_FLAG_DMA_MAP set (to have page_pool_dma_sync_for_device() called; PP_FLAG_DMA_SYNC_DEV should also
be set, I think). I think you'd also need to have pp->dev set. Are we setting dma_dir with the intention of doing the necessary CMOs before we start using the page? Thanks, Chris. + pools[i] = page_pool_create(&pp); + + if (IS_ERR(pools[i])) { + int j; + + pr_err("%s: page pool creation failed!\n", __func__); + for (j = 0; j < i; j++) + page_pool_destroy(pools[j]); + return PTR_ERR(pools[i]); + } + } exp_info.name = "system"; exp_info.ops = &system_heap_ops; This is cool, I didn't know about this pooling code under /net/core. Nice and compact. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [RFC][PATCH 5/6] dma-buf: system_heap: Add pagepool support to system heap
On 2020-09-25 21:24, John Stultz wrote: Reuse/abuse the pagepool code from the network code to speed up allocation performance. This is similar to the ION pagepool usage, but tries to utilize generic code instead of a custom implementation. Cc: Sumit Semwal Cc: Liam Mark Cc: Laura Abbott Cc: Brian Starkey Cc: Hridya Valsaraju Cc: Suren Baghdasaryan Cc: Sandeep Patil Cc: Ørjan Eide Cc: Robin Murphy Cc: Ezequiel Garcia Cc: Simon Ser Cc: James Jones Cc: linux-me...@vger.kernel.org Cc: dri-de...@lists.freedesktop.org Signed-off-by: John Stultz --- drivers/dma-buf/heaps/Kconfig | 1 + drivers/dma-buf/heaps/system_heap.c | 32 + 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/drivers/dma-buf/heaps/Kconfig b/drivers/dma-buf/heaps/Kconfig index a5eef06c4226..f13cde4321b1 100644 --- a/drivers/dma-buf/heaps/Kconfig +++ b/drivers/dma-buf/heaps/Kconfig @@ -1,6 +1,7 @@ config DMABUF_HEAPS_SYSTEM bool "DMA-BUF System Heap" depends on DMABUF_HEAPS + select PAGE_POOL help Choose this option to enable the system dmabuf heap. The system heap is backed by pages from the buddy allocator. If in doubt, say Y. 
diff --git a/drivers/dma-buf/heaps/system_heap.c b/drivers/dma-buf/heaps/system_heap.c index 882a632e9bb7..9f57b4c8ae69 100644 --- a/drivers/dma-buf/heaps/system_heap.c +++ b/drivers/dma-buf/heaps/system_heap.c @@ -20,6 +20,7 @@ #include #include #include +#include struct dma_heap *sys_heap; @@ -46,6 +47,7 @@ struct dma_heap_attachment { static gfp_t order_flags[] = {HIGH_ORDER_GFP, LOW_ORDER_GFP, LOW_ORDER_GFP}; static const unsigned int orders[] = {8, 4, 0}; #define NUM_ORDERS ARRAY_SIZE(orders) +struct page_pool *pools[NUM_ORDERS]; static struct sg_table *dup_sg_table(struct sg_table *table) { @@ -264,13 +266,17 @@ static void system_heap_dma_buf_release(struct dma_buf *dmabuf) struct system_heap_buffer *buffer = dmabuf->priv; struct sg_table *table; struct scatterlist *sg; - int i; + int i, j; table = &buffer->sg_table; for_each_sg(table->sgl, sg, table->nents, i) { struct page *page = sg_page(sg); - __free_pages(page, compound_order(page)); + for (j = 0; j < NUM_ORDERS; j++) { + if (compound_order(page) == orders[j]) + break; + } + page_pool_put_full_page(pools[j], page, false); } sg_free_table(table); kfree(buffer); @@ -300,8 +306,7 @@ static struct page *alloc_largest_available(unsigned long size, continue; if (max_order < orders[i]) continue; - - page = alloc_pages(order_flags[i], orders[i]); + page = page_pool_alloc_pages(pools[i], order_flags[i]); if (!page) continue; return page; @@ -406,6 +411,25 @@ static const struct dma_heap_ops system_heap_ops = { static int system_heap_create(void) { struct dma_heap_export_info exp_info; + int i; + + for (i = 0; i < NUM_ORDERS; i++) { + struct page_pool_params pp; + + memset(&pp, 0, sizeof(pp)); + pp.order = orders[i]; + pp.dma_dir = DMA_BIDIRECTIONAL; + pools[i] = page_pool_create(&pp); + + if (IS_ERR(pools[i])) { + int j; + + pr_err("%s: page pool creation failed!\n", __func__); + for (j = 0; j < i; j++) + page_pool_destroy(pools[j]); + return PTR_ERR(pools[i]); + } + } exp_info.name = "system"; exp_info.ops = &system_heap_ops; This
is cool, I didn't know about this pooling code under /net/core. Nice and compact. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v5] mm: cma: indefinitely retry allocations in cma_alloc
V1: Introduces a retry loop that attempts a CMA allocation a finite number of times before giving up: https://lkml.org/lkml/2020/8/5/1097 https://lkml.org/lkml/2020/8/11/893 V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand raised a page pinning example which precludes doing this infinite retrying for all CMA users: https://lkml.org/lkml/2020/9/17/984 V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in __GFP_NOFAIL as an argument to indicate that a CMA allocation should be retried indefinitely. This lets callers of cma_alloc() decide if they want to perform indefinite retries. Also introduces a config option for controlling the duration of the sleep between retries: https://www.spinics.net/lists/linux-mm/msg228778.html V4: In response to comments by Minchan Kim, we make the sleep interruptible and remove a defconfig option controlling how long the sleep lasts for (it is now a fixed 100 ms): https://lkml.org/lkml/2020/9/28/1144 V5: Add missing fatal_signal_pending() check Chris Goldsworthy (1): mm: cma: indefinitely retry allocations in cma_alloc arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 +-- mm/cma.c | 39 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 +-- 9 files changed, 43 insertions(+), 16 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v5] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. If a process ends up causing _refcount > _mapcount for a page (by either incrementing _refcount or decrementing _mapcount), such that the process is context switched out after modifying one refcount but before modifying the other, the page will be temporarily pinned. One example of where _refcount can be greater than _mapcount is inside of zap_pte_range(), which is called for all the entries of a PMD when a process is exiting, to unmap the process's memory. Inside of zap_pte_range(), after unmapping a page with page_remove_rmap(), we have that _refcount > _mapcount. _refcount can only be decremented after a TLB flush is performed for the page - this doesn't occur until enough pages have been batched together for flushing. The flush can either occur inside of zap_pte_range() (during the same invocation or a later one), or if there aren't enough pages collected by the time we unmap all of the pages in a process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After the flush has occurred, tlb_batch_pages_flush() will decrement the references on the flushed pages. Another such example like the above is inside of copy_one_pte(), which is called during a fork.
For PTEs for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. So, inside of cma_alloc(), add the option of letting users pass in __GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely, in the event that alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap. Signed-off-by: Chris Goldsworthy Co-developed-by: Vinayak Menon Signed-off-by: Vinayak Menon --- arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 +-- mm/cma.c | 39 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 +-- 9 files changed, 43 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 073617c..21c3f6a 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages) VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT); return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES), -false); +0); } EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma); diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c index 626cf7f..7657359 100644 --- a/drivers/dma-buf/heaps/cma_heap.c +++ b/drivers/dma-buf/heaps/cma_heap.c @@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap, helper_buffer->heap = heap; helper_buffer->size = len; - cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false); + cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0); if (!cma_pages) goto free_buf; diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c index 9e06628..11c4e3b 100644 --- a/drivers/s390/char/vmcp.c +++ b/drivers/s390/char/vmcp.c @@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session) * anymore 
the system won't work anyway. */ if (order > 2) - page = cma_alloc(vmcp_cma, nr_pages, 0, false); + page = cma_alloc(vmcp_cma, nr_pages, 0, 0); if (page) { session->response = (char *)page_to_phys(page); session->cma_alloc = 1; diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c index bf65e67..128d3a5 100644 --- a/drivers/staging/android/ion/ion_cma_heap.c +++ b/drivers/staging/android/ion/ion_cma_heap.c @@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct ion_buffer *buffer,
[PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
V1: Introduces a retry loop that attempts a CMA allocation a finite number of times before giving up: https://lkml.org/lkml/2020/8/5/1097 https://lkml.org/lkml/2020/8/11/893 V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand raised a page pinning example which precludes doing this infinite retrying for all CMA users: https://lkml.org/lkml/2020/9/17/984 V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in __GFP_NOFAIL as an argument to indicate that a CMA allocation should be retried indefinitely. This lets callers of cma_alloc() decide if they want to perform indefinite retries. Also introduces a config option for controlling the duration of the sleep between retries: https://www.spinics.net/lists/linux-mm/msg228778.html V4: In response to comments by Minchan Kim, we make the sleep interruptible and remove a defconfig option controlling how long the sleep lasts for (it is now a fixed 100 ms). Chris Goldsworthy (1): mm: cma: indefinitely retry allocations in cma_alloc arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 ++-- mm/cma.c | 36 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 ++-- 9 files changed, 40 insertions(+), 16 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v4] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. If a process ends up causing _refcount > _mapcount for a page (by either incrementing _refcount or decrementing _mapcount), such that the process is context switched out after modifying one refcount but before modifying the other, the page will be temporarily pinned. One example of where _refcount can be greater than _mapcount is inside of zap_pte_range(), which is called for all the entries of a PMD when a process is exiting, to unmap the process's memory. Inside of zap_pte_range(), after unmapping a page with page_remove_rmap(), we have that _refcount > _mapcount. _refcount can only be decremented after a TLB flush is performed for the page - this doesn't occur until enough pages have been batched together for flushing. The flush can either occur inside of zap_pte_range() (during the same invocation or a later one), or if there aren't enough pages collected by the time we unmap all of the pages in a process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After the flush has occurred, tlb_batch_pages_flush() will decrement the references on the flushed pages. Another such example like the above is inside of copy_one_pte(), which is called during a fork.
For PTEs for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. So, inside of cma_alloc(), add the option of letting users pass in __GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely, in the event that alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap. Signed-off-by: Chris Goldsworthy Co-developed-by: Vinayak Menon Signed-off-by: Vinayak Menon --- arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 ++-- mm/cma.c | 36 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 ++-- 9 files changed, 40 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 073617c..21c3f6a 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages) VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT); return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES), -false); +0); } EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma); diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c index 626cf7f..7657359 100644 --- a/drivers/dma-buf/heaps/cma_heap.c +++ b/drivers/dma-buf/heaps/cma_heap.c @@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap, helper_buffer->heap = heap; helper_buffer->size = len; - cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false); + cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0); if (!cma_pages) goto free_buf; diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c index 9e06628..11c4e3b 100644 --- a/drivers/s390/char/vmcp.c +++ b/drivers/s390/char/vmcp.c @@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct vmcp_session *session) * 
anymore the system won't work anyway. */ if (order > 2) - page = cma_alloc(vmcp_cma, nr_pages, 0, false); + page = cma_alloc(vmcp_cma, nr_pages, 0, 0); if (page) { session->response = (char *)page_to_phys(page); session->cma_alloc = 1; diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c index bf65e67..128d3a5 100644 --- a/drivers/staging/android/ion/ion_cma_heap.c +++ b/drivers/staging/android/ion/ion_cma_heap.c @@ -39,7 +39,7 @@ static int ion_cma_allocate(struct ion_heap *heap, struct ion_buffer *buffer,
Re: [PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-27 12:23, Minchan Kim wrote: On Wed, Sep 23, 2020 at 10:16:25PM -0700, Chris Goldsworthy wrote: CMA allocations will fail if 'pinned' pages are in a CMA area, since we +config CMA_RETRY_SLEEP_DURATION + int "Sleep duration between retries" + depends on CMA + default 100 + help + Time to sleep for in milliseconds between the indefinite + retries of a CMA allocation that are induced by passing + __GFP_NOFAIL to cma_alloc(). + + If unsure, leave the value as "100". What's the point of this new config? If we need it, how could an admin set its value? What would be bad about just using a simple default value? IOW, I'd like to avoid introducing a new config if we don't see good justification. I thought that it would be useful for developers, but I guess it would be much better for this to be runtime configurable. But, I don't think there's a strong reason to be able to configure the value - 100 ms has worked for us. I'll scrap this option in a follow-up patch, and will use 100 ms as the sleeping time. + config MEM_SOFT_DIRTY bool "Track memory changes" depends on CHECKPOINT_RESTORE && HAVE_ARCH_SOFT_DIRTY && PROC_FS diff --git a/mm/cma.c b/mm/cma.c index 7f415d7..4fbad2b 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -32,6 +32,7 @@ #include #include #include +#include #include #include "cma.h" @@ -403,13 +404,15 @@ static inline void cma_debug_show_areas(struct cma *cma) { } * @cma: Contiguous memory region for which the allocation is performed. * @count: Requested number of pages. * @align: Requested alignment of pages (in PAGE_SIZE order). - * @no_warn: Avoid printing message about failed allocation + * @gfp_mask: If __GFP_NOWARN is passed, suppress messages about failed + * allocations. If __GFP_NOFAIL is passed, try doing the CMA + * allocation indefinitely until the allocation succeeds. * * This function allocates part of contiguous memory on specific * contiguous memory area.
 */ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, - bool no_warn) + gfp_t gfp_mask) { unsigned long mask, offset; unsigned long pfn = -1; @@ -442,8 +445,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, bitmap_maxno, start, bitmap_count, mask, offset); if (bitmap_no >= bitmap_maxno) { - mutex_unlock(&cma->lock); - break; + if (ret == -EBUSY && gfp_mask & __GFP_NOFAIL) { + mutex_unlock(&cma->lock); + + /* +* Page may be momentarily pinned by some other +* process which has been scheduled out, e.g. +* in exit path, during unmap call, or process +* fork and so cannot be freed there. Sleep +* for 100ms and retry the allocation. +*/ + start = 0; + ret = -ENOMEM; + msleep(CONFIG_CMA_RETRY_SLEEP_DURATION); Should it be uninterruptible, really? Good point - I'll replace the msleep() with schedule_timeout_killable() in the follow-up patch. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-25 05:18, David Hildenbrand wrote: On 24.09.20 07:16, Chris Goldsworthy wrote: -GFP_KERNEL | (no_warn ? __GFP_NOWARN : 0)); +GFP_KERNEL | (gfp_mask & __GFP_NOWARN)); Right, we definitely don't want to pass the flag further down. An alternative would be cma_alloc_nofail(). That helps avoid people passing stuff like GFP_USER and wondering why it doesn't have an effect. But since we're doing a bitwise AND with __GFP_NOWARN, we're not passing any other values down - this makes it equivalent to the previous version, in that only __GFP_NOWARN can be passed to alloc_contig_range(). -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
V1: Introduces a retry loop that attempts a CMA allocation a finite number of times before giving up: https://lkml.org/lkml/2020/8/5/1097 https://lkml.org/lkml/2020/8/11/893 V2: Introduces an indefinite retry for CMA allocations. David Hildenbrand raised a page pinning example which precludes doing this infinite retrying for all CMA users: https://lkml.org/lkml/2020/9/17/984 V3: Re-introduce a GFP mask argument for cma_alloc(), that can take in __GFP_NOFAIL as an argument to indicate that a CMA allocation should be retried indefinitely. This lets callers of cma_alloc() decide if they want to perform indefinite retries. Also introduces a config option for controlling the duration of the sleep between retries. Chris Goldsworthy (1): mm: cma: indefinitely retry allocations in cma_alloc arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 ++-- mm/Kconfig | 11 ++ mm/cma.c | 35 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 ++-- 10 files changed, 50 insertions(+), 16 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v3] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. If a process ends up causing _refcount > _mapcount for a page (by either incrementing _refcount or decrementing _mapcount), such that the process is context switched out after modifying one refcount but before modifying the other, the page will be temporarily pinned. One example of where _refcount can be greater than _mapcount is inside of zap_pte_range(), which is called for all the entries of a PMD when a process is exiting, to unmap the process's memory. Inside of zap_pte_range(), after unmapping a page with page_remove_rmap(), we have that _refcount > _mapcount. _refcount can only be decremented after a TLB flush is performed for the page - this doesn't occur until enough pages have been batched together for flushing. The flush can either occur inside of zap_pte_range() (during the same invocation or a later one), or if there aren't enough pages collected by the time we unmap all of the pages in a process, the flush will occur in tlb_finish_mmu() in exit_mmap(). After the flush has occurred, tlb_batch_pages_flush() will decrement the references on the flushed pages. Another such example like the above is inside of copy_one_pte(), which is called during a fork.
For PTEs for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. So, inside of cma_alloc(), add the option of letting users pass in __GFP_NOFAIL to indicate that we should retry CMA allocations indefinitely, in the event that alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap. Signed-off-by: Chris Goldsworthy Co-developed-by: Vinayak Menon Signed-off-by: Vinayak Menon --- arch/powerpc/kvm/book3s_hv_builtin.c | 2 +- drivers/dma-buf/heaps/cma_heap.c | 2 +- drivers/s390/char/vmcp.c | 2 +- drivers/staging/android/ion/ion_cma_heap.c | 2 +- include/linux/cma.h| 2 +- kernel/dma/contiguous.c| 4 ++-- mm/Kconfig | 11 ++ mm/cma.c | 35 +- mm/cma_debug.c | 2 +- mm/hugetlb.c | 4 ++-- 10 files changed, 50 insertions(+), 16 deletions(-) diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c index 073617c..21c3f6a 100644 --- a/arch/powerpc/kvm/book3s_hv_builtin.c +++ b/arch/powerpc/kvm/book3s_hv_builtin.c @@ -74,7 +74,7 @@ struct page *kvm_alloc_hpt_cma(unsigned long nr_pages) VM_BUG_ON(order_base_2(nr_pages) < KVM_CMA_CHUNK_ORDER - PAGE_SHIFT); return cma_alloc(kvm_cma, nr_pages, order_base_2(HPT_ALIGN_PAGES), -false); +0); } EXPORT_SYMBOL_GPL(kvm_alloc_hpt_cma); diff --git a/drivers/dma-buf/heaps/cma_heap.c b/drivers/dma-buf/heaps/cma_heap.c index 626cf7f..7657359 100644 --- a/drivers/dma-buf/heaps/cma_heap.c +++ b/drivers/dma-buf/heaps/cma_heap.c @@ -66,7 +66,7 @@ static int cma_heap_allocate(struct dma_heap *heap, helper_buffer->heap = heap; helper_buffer->size = len; - cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, false); + cma_pages = cma_alloc(cma_heap->cma, nr_pages, align, 0); if (!cma_pages) goto free_buf; diff --git a/drivers/s390/char/vmcp.c b/drivers/s390/char/vmcp.c index 9e06628..11c4e3b 100644 --- a/drivers/s390/char/vmcp.c +++ b/drivers/s390/char/vmcp.c @@ -70,7 +70,7 @@ static void vmcp_response_alloc(struct 
vmcp_session *session) * anymore the system won't work anyway. */ if (order > 2) - page = cma_alloc(vmcp_cma, nr_pages, 0, false); + page = cma_alloc(vmcp_cma, nr_pages, 0, 0); if (page) { session->response = (char *)page_to_phys(page); session->cma_alloc = 1; diff --git a/drivers/staging/android/ion/ion_cma_heap.c b/drivers/staging/android/ion/ion_cma_heap.c index bf65e67..128d3a5 100644 --- a/drivers/staging/android/ion/ion_cma_heap.c +++ b/drivers/staging/android/ion/ion_cma_heap.c @@ -39,7 +39,7 @@ static int ion_cma_alloc
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-17 10:54, Chris Goldsworthy wrote: On 2020-09-15 00:53, David Hildenbrand wrote: On 14.09.20 20:33, Chris Goldsworthy wrote: On 2020-09-14 02:31, David Hildenbrand wrote: On 11.09.20 21:17, Chris Goldsworthy wrote: So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries indefinitely, with sleeps, to give the system an opportunity to unpin any pinned pages. Signed-off-by: Chris Goldsworthy Co-developed-by: Vinayak Menon Signed-off-by: Vinayak Menon --- mm/cma.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 7f415d7..90bb505 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, bitmap_maxno, start, bitmap_count, mask, offset); if (bitmap_no >= bitmap_maxno) { - mutex_unlock(&cma->lock); - break; + if (ret == -EBUSY) { + mutex_unlock(&cma->lock); + + /* +* Page may be momentarily pinned by some other +* process which has been scheduled out, e.g. +* in exit path, during unmap call, or process +* fork and so cannot be freed there. Sleep +* for 100ms and retry the allocation. +*/ + start = 0; + ret = -ENOMEM; + msleep(100); + continue; + } else { + /* +* ret == -ENOMEM - all bits in cma->bitmap are +* set, so we break accordingly. +*/ + mutex_unlock(&cma->lock); + break; + } } bitmap_set(cma->bitmap, bitmap_no, bitmap_count); /* What about long-term pinnings? IIRC, that can happen easily e.g., with vfio (and I remember there is a way via vmsplice). Not convinced trying forever is a sane approach in the general case ...
V1: [1] https://lkml.org/lkml/2020/8/5/1097 [2] https://lkml.org/lkml/2020/8/6/1040 [3] https://lkml.org/lkml/2020/8/11/893 [4] https://lkml.org/lkml/2020/8/21/1490 [5] https://lkml.org/lkml/2020/9/11/1072 We're fine with doing indefinite retries, on the grounds that if there is some long-term pinning that occurs when alloc_contig_range returns -EBUSY, it should be debugged and fixed. Would it be possible to make this infinite-retrying something that could be enabled or disabled by a defconfig option? Two thoughts: This means I strongly prefer something like [3] if feasible. _Resending so that this ends up on LKML_ I can give [3] some further thought then. Also, I realized [3] will not completely solve the problem, it just reduces the window in which _refcount > _mapcount (as mentioned in earlier threads, we encountered the pinning when a task in copy_one_pte() or in the exit_mmap() path gets context switched out). If we were to try a sleeping-lock based solution, do you think it would be permissible to add another lock to struct page? I have not been able to think of a clean way of introducing calls to preempt_disable() in exit_mmap(), which is the more problematic case. We would need to track state across multiple invocations of zap_pte_range() (which is called for each entry in a PMD when a process's memory is being unmapped), and would also need to extend this to tlb_finish_mmu(), which is called after all the process's memory has been unmapped: https://elixir.bootlin.com/linux/v5.8.10/source/mm/mmap.c#L3164. As a follow-up to this patch, I'm submitting a patch that re-introduces the GFP mask for cma_alloc, which will perform indefinite retries if __GFP_NOFAIL is passed to the function. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-15 00:53, David Hildenbrand wrote: On 14.09.20 20:33, Chris Goldsworthy wrote: On 2020-09-14 02:31, David Hildenbrand wrote: On 11.09.20 21:17, Chris Goldsworthy wrote: So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries indefinitely, with sleeps, to give the system an opportunity to unpin any pinned pages. Signed-off-by: Chris Goldsworthy Co-developed-by: Vinayak Menon Signed-off-by: Vinayak Menon --- mm/cma.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) diff --git a/mm/cma.c b/mm/cma.c index 7f415d7..90bb505 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align, bitmap_maxno, start, bitmap_count, mask, offset); if (bitmap_no >= bitmap_maxno) { - mutex_unlock(&cma->lock); - break; + if (ret == -EBUSY) { + mutex_unlock(&cma->lock); + + /* +* Page may be momentarily pinned by some other +* process which has been scheduled out, e.g. +* in exit path, during unmap call, or process +* fork and so cannot be freed there. Sleep +* for 100ms and retry the allocation. +*/ + start = 0; + ret = -ENOMEM; + msleep(100); + continue; + } else { + /* +* ret == -ENOMEM - all bits in cma->bitmap are +* set, so we break accordingly. +*/ + mutex_unlock(&cma->lock); + break; + } } bitmap_set(cma->bitmap, bitmap_no, bitmap_count); /* What about long-term pinnings? IIRC, that can happen easily e.g., with vfio (and I remember there is a way via vmsplice). Not convinced trying forever is a sane approach in the general case ...
V1: [1] https://lkml.org/lkml/2020/8/5/1097 [2] https://lkml.org/lkml/2020/8/6/1040 [3] https://lkml.org/lkml/2020/8/11/893 [4] https://lkml.org/lkml/2020/8/21/1490 [5] https://lkml.org/lkml/2020/9/11/1072 We're fine with doing indefinite retries, on the grounds that if there is some long-term pinning that occurs when alloc_contig_range returns -EBUSY, it should be debugged and fixed. Would it be possible to make this infinite-retrying something that could be enabled or disabled by a defconfig option? Two thoughts: This means I strongly prefer something like [3] if feasible. I can give [3] some further thought then. Also, I realized [3] will not completely solve the problem, it just reduces the window in which _refcount > _mapcount (as mentioned in earlier threads, we encountered the pinning when a task in copy_one_pte() or in the exit_mmap() path gets context switched out). If we were to try a sleeping-lock based solution, do you think it would be permissible to add another lock to struct page? 2. The issue that I am having is that long-term pinnings are (unfortunately) a real thing. It's not something to debug and fix as you suggest. Like, run a VM with VFIO (e.g., PCI passthrough). While that VM is running, all VM memory will be pinned. If memory falls onto a CMA region your cma_alloc() will be stuck in an (endless, meaning until the VM ended) loop. I am not sure if all cma users are fine with that - especially, think about CMA being used for gigantic pages now. Assume you want to start a new VM while the other one is running and use some (new) gigantic pages for it. Suddenly you're trapped in an endless loop in the kernel. That's nasty. Thanks for providing this example. If we want to stick to retrying forever, can't we use flags like __GFP_NOFAIL to explicitly enable this new behavior for selected cma_alloc() users that really can't fail/retry manually again?
This would work, we would just have to undo the work done by this patch / re-introduce the GFP parameter for cma_alloc(): http://lkml.kernel.org/r/20180709122019eucas1p2340da484acfcc932537e6014f4fd2c29~-sqtpjkij2939229392eucas1...@eucas1p2.samsung.com , and add support for __GFP_NOFAIL (and ignore any flag that is not one of __GFP_NOFAIL or __GFP_NOWARN). -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-14 11:33, Chris Goldsworthy wrote: On 2020-09-14 02:31, David Hildenbrand wrote: What about long-term pinnings? IIRC, that can happen easily e.g., with vfio (and I remember there is a way via vmsplice). Not convinced trying forever is a sane approach in the general case ... Hi David, I've botched the threading, so there are discussions with respect to the previous patch-set that are missing from this thread, which I will summarize below: V1: [1] https://lkml.org/lkml/2020/8/5/1097 [2] https://lkml.org/lkml/2020/8/6/1040 [3] https://lkml.org/lkml/2020/8/11/893 [4] https://lkml.org/lkml/2020/8/21/1490 [5] https://lkml.org/lkml/2020/9/11/1072 [1] is the first version of the patch, which performed a finite number of retries and which has been stable for our kernels. In [2], Andrew questioned whether we could actually find a way of solving the problem, on the grounds that doing a finite number of retries doesn't actually fix the problem (more importantly, in [4] Andrew indicated that he would prefer not to merge the patch as it doesn't solve the issue). In [3], I suggest one actual fix for this, which is to use preempt_disable/enable() to prevent context switches from occurring during the periods in copy_one_pte() and exit_mmap() (I forgot to mention this case in the commit text) in which _refcount > _mapcount for a page - you would also need to prevent interrupts if we were to fully prevent the issue from occurring. I think this would be acceptable for the copy_one_pte() case, since _refcount exceeds _mapcount there for only a short time. For the exit_mmap() case, however, _refcount is greater than _mapcount whilst the page-tables are being torn down for a process - that could be too long for disabling preemption / interrupts. So, in [4], Andrew asks about two alternatives to see if they're viable: (1) acquiring locks on the exit_mmap path and migration paths, (2) retrying indefinitely. 
In [5], I discuss how using locks could increase the time it takes to perform a CMA allocation, such that a retry approach would avoid increased CMA allocation times. I'm also uncertain about how the locking scheme could be implemented effectively without introducing a new per-page lock that will be used specifically to solve this issue, and I'm not sure this would be accepted. We're fine with doing indefinite retries, on the grounds that if there is some long-term pinning that occurs when alloc_contig_range returns -EBUSY, it should be debugged and fixed. Would it be possible to make this infinite retrying something that could be enabled or disabled by a defconfig option? Thanks, Chris. Actually, if we were willing to have a defconfig option for enabling / disabling indefinite retries on the return of -EBUSY, would it be possible to re-structure the patch to allow either (1) indefinite retrying, or (2) a fixed number of retries (as some people might want to tolerate CMA allocation failures in favor of making progress)? -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-11 14:42, Randy Dunlap wrote: On 9/11/20 2:37 PM, Florian Fainelli wrote: I am by no means an authoritative CMA person but this behavior does not seem acceptable; there is no doubt the existing one is sub-optimal under specific circumstances, but an indefinite retry, as well as a 100ms sleep, appear to be arbitrary at best. How about you introduce a parameter that allows the tuning of the number of retries and/or delay between retries? Also: You should send your patch to linux...@kvack.org -- that's where most memory management type patches are reviewed. You should also send your patch to someone who could actually merge it into the kernel source tree -- assuming that's what you want to happen. Try scripts/get_maintainer.pl on your patch to see what it says. And if you are going to use a "cover letter" or "introductory email" before the actual patch, the second (patch(es)) should be sent chained to the first email. git send-email should do this for you. Hi Randy, git send-email was not using the correct ID to generate a response to the cover letter, and I'm not able to fathom why. This e-mail was actually just sent out to LKML as a test so I could figure out how to resolve the issue; I wasn't actually expecting anyone to read this. The actual e-mail, with the correct maintainer (Andrew Morton) and mailing lists, as well as the summary of the discussion on the patches so far, may be found here: https://lkml.org/lkml/2020/8/11/893 Thanks, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-11 14:37, Florian Fainelli wrote: On 9/11/2020 1:54 PM, Chris Goldsworthy wrote: CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path, in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. One example of where _refcount can be greater than _mapcount for a page we would not expect to be pinned is inside of copy_one_pte(), which is called during a fork. For ptes for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. If the process doing copy_one_pte() is context switched out after incrementing _refcount but before incrementing _mapcount, then the page will be temporarily pinned. So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries indefinitely, with sleeps, to give the system an opportunity to unpin any pinned pages. I am by no means an authoritative CMA person but this behavior does not seem acceptable; there is no doubt the existing one is sub-optimal under specific circumstances, but an indefinite retry, as well as a 100ms sleep, appear to be arbitrary at best. How about you introduce a parameter that allows the tuning of the number of retries and/or delay between retries? 
Apologies Florian, I messed up on the threading and there are discussions that aren't referenced here. The original version of this patch performed a finite number of retries. Also, this e-mail was just sent out to LKML so I could debug some issues I was facing with git send-email. The actual thread is now here, which summarizes the discussions w.r.t. this patch so far: https://lkml.org/lkml/2020/9/14/1097 Thanks, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
On 2020-09-14 02:31, David Hildenbrand wrote: On 11.09.20 21:17, Chris Goldsworthy wrote: So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries indefinitely, with sleeps, to give the system an opportunity to unpin any pinned pages.

Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
-			break;
+			if (ret == -EBUSY) {
+				mutex_unlock(&cma->lock);
+
+				/*
+				 * Page may be momentarily pinned by some other
+				 * process which has been scheduled out, e.g.
+				 * in exit path, during unmap call, or process
+				 * fork and so cannot be freed there. Sleep
+				 * for 100ms and retry the allocation.
+				 */
+				start = 0;
+				ret = -ENOMEM;
+				msleep(100);
+				continue;
+			} else {
+				/*
+				 * ret == -ENOMEM - all bits in cma->bitmap are
+				 * set, so we break accordingly.
+				 */
+				mutex_unlock(&cma->lock);
+				break;
+			}
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
 		/*

What about long-term pinnings? IIRC, that can happen easily e.g., with vfio (and I remember there is a way via vmsplice). Not convinced trying forever is a sane approach in the general case ...
Hi David, I've botched the threading, so there are discussions with respect to the previous patch-set that are missing from this thread, which I will summarize below: V1: [1] https://lkml.org/lkml/2020/8/5/1097 [2] https://lkml.org/lkml/2020/8/6/1040 [3] https://lkml.org/lkml/2020/8/11/893 [4] https://lkml.org/lkml/2020/8/21/1490 [5] https://lkml.org/lkml/2020/9/11/1072 [1] is the first version of the patch, which performed a finite number of retries and which has been stable for our kernels. In [2], Andrew questioned whether we could actually find a way of solving the problem, on the grounds that doing a finite number of retries doesn't actually fix the problem (more importantly, in [4] Andrew indicated that he would prefer not to merge the patch as it doesn't solve the issue). In [3], I suggest one actual fix for this, which is to use preempt_disable/enable() to prevent context switches from occurring during the periods in copy_one_pte() and exit_mmap() (I forgot to mention this case in the commit text) in which _refcount > _mapcount for a page - you would also need to prevent interrupts if we were to fully prevent the issue from occurring. I think this would be acceptable for the copy_one_pte() case, since _refcount exceeds _mapcount there for only a short time. For the exit_mmap() case, however, _refcount is greater than _mapcount whilst the page-tables are being torn down for a process - that could be too long for disabling preemption / interrupts. So, in [4], Andrew asks about two alternatives to see if they're viable: (1) acquiring locks on the exit_mmap path and migration paths, (2) retrying indefinitely. In [5], I discuss how using locks could increase the time it takes to perform a CMA allocation, such that a retry approach would avoid increased CMA allocation times. 
I'm also uncertain about how the locking scheme could be implemented effectively without introducing a new per-page lock that will be used specifically to solve this issue, and I'm not sure this would be accepted. We're fine with doing indefinite retries, on the grounds that if there is some long-term pinning that occurs when alloc_contig_range returns -EBUSY, it should be debugged and fixed. Would it be possible to make this infinite retrying something that could be enabled or disabled by a defconfig option? Thanks, Chris. -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: [PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures
On 2020-09-11 13:54, Chris Goldsworthy wrote: [PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures On mobile devices, failure to allocate from a CMA area constitutes a functional failure. Sometimes during CMA allocations, we have observed that pages in a CMA area allocated through alloc_pages(), that we're trying to migrate away to make room for a CMA allocation, are temporarily pinned. This temporary pinning can occur when a process that owns the pinned page is being forked (the example is explained further in the commit text), or it is exiting. This patch addresses this issue by indefinitely retrying allocations that fail due to a return of -EBUSY. ** This change log was re-sent due to threading issues ** Change log: v1: We were performing retries of the allocation a fixed number of times. Andrew Morton disliked this, as it didn't guarantee that the allocation would succeed. https://lkml.org/lkml/2020/8/5/1096 https://lkml.org/lkml/2020/8/21/1490 v2: To address this concern, we switched to retrying indefinitely, as opposed to retrying the allocation a limited number of times. Chris Goldsworthy (1): mm: cma: indefinitely retry allocations in cma_alloc mm/cma.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures
[PATCH v2] cma_alloc(), indefinitely retry allocations for -EBUSY failures On mobile devices, failure to allocate from a CMA area constitutes a functional failure. Sometimes during CMA allocations, we have observed that pages in a CMA area allocated through alloc_pages(), that we're trying to migrate away to make room for a CMA allocation, are temporarily pinned. This temporary pinning can occur when a process that owns the pinned page is being forked (the example is explained further in the commit text), or it is exiting. This patch addresses this issue by indefinitely retrying allocations that fail due to a return of -EBUSY. ** This change log was re-sent due to threading issues ** Change log: v1: We were performing retries of the allocation a fixed number of times. Andrew Morton disliked this, as it didn't guarantee that the allocation would succeed. https://lkml.org/lkml/2020/8/5/1096 https://lkml.org/lkml/2020/8/21/1490 v2: To address this concern, we switched to retrying indefinitely, as opposed to retrying the allocation a limited number of times. Chris Goldsworthy (1): mm: cma: indefinitely retry allocations in cma_alloc mm/cma.c | 25 +++-- 1 file changed, 23 insertions(+), 2 deletions(-) -- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH v2] mm: cma: indefinitely retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. The _refcount of a struct page being greater than _mapcount for that page can cause pinning for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path, and (2) decrements both _refcount and _mapcount for a page, will stop unmapping a page from VMAs once the _mapcount for a page reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will be greater than 0. Later in the CMA allocation path, in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page. One example of where _refcount can be greater than _mapcount for a page we would not expect to be pinned is inside of copy_one_pte(), which is called during a fork. For ptes for which pte_present(pte) == true, copy_one_pte() will increment the _refcount field followed by the _mapcount field of a page. If the process doing copy_one_pte() is context switched out after incrementing _refcount but before incrementing _mapcount, then the page will be temporarily pinned. So, inside of cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries indefinitely, with sleeps, to give the system an opportunity to unpin any pinned pages. 
Signed-off-by: Chris Goldsworthy
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 mm/cma.c | 25 +++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..90bb505 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include
 #include
 #include
+#include
 #include

 #include "cma.h"
@@ -442,8 +443,28 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
-			break;
+			if (ret == -EBUSY) {
+				mutex_unlock(&cma->lock);
+
+				/*
+				 * Page may be momentarily pinned by some other
+				 * process which has been scheduled out, e.g.
+				 * in exit path, during unmap call, or process
+				 * fork and so cannot be freed there. Sleep
+				 * for 100ms and retry the allocation.
+				 */
+				start = 0;
+				ret = -ENOMEM;
+				msleep(100);
+				continue;
+			} else {
+				/*
+				 * ret == -ENOMEM - all bits in cma->bitmap are
+				 * set, so we break accordingly.
+				 */
+				mutex_unlock(&cma->lock);
+				break;
+			}
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
 		/*
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
Re: cma_alloc(), add sleep-and-retry for temporary page pinning
On 2020-08-21 15:01, Andrew Morton wrote: On Tue, 11 Aug 2020 15:20:47 -0700 cgold...@codeaurora.org wrote: One thing to stress is that there are other instances of CMA page pinning that this patch isn't attempting to address. Oh. How severe are these?

Hey Andrew,
- get_user_pages() will pin pages (I think that's a 100% guarantee but I'd need to double check that). There isn't a workaround for this as far as I know.
- One issue we've encountered is when the pages for buffer heads are stuck on an LRU list (see the call to buffer_busy() in drop_buffers() https://elixir.bootlin.com/linux/v5.8.8/source/fs/buffer.c#L3225). We deal with this by allowing the buffers to be dropped if the reason buffer_busy() returns true is because the page is on an LRU list.

Well. Why not wait infinitely? Because there are other sources of CMA page pinning, I guess. Could we take a sleeping lock on the exit_mmap() path and on the migration path? So that the migration path automatically waits for the exact amount of time?

Retrying indefinitely whilst alloc_contig_range() returns -EBUSY is actually a viable approach. In terms of how long it takes to perform a CMA allocation, it is preferable to the lock-based method. With our current method, we attempt allocations across an entire CMA region before sleeping and retrying. With the lock-based method, we'd be sleeping on a per-page basis - this could lead to a situation where we spend a great deal of time waiting for a particular page A to be freed that lies in the subset of the CMA region we're allocating from. We could have instead given up on allocating that subset of the CMA region (because page A is pinned), moved on to a different subset of the CMA region, and successfully allocated that subset, whilst page A is still pinned. I have a patch ready that does this indefinite retrying, which will be sent in reply to this e-mail. Regards, Chris. 
-- The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
cma_alloc(), add sleep-and-retry for temporary page pinning
On mobile devices, failure to allocate from a CMA area constitutes a functional failure. Sometimes during CMA allocations, we have observed that pages in a CMA area allocated through alloc_pages(), which we're trying to migrate away to make room for a CMA allocation, are temporarily pinned. This temporary pinning can occur when a process that owns the pinned page is being forked (the example is explained further in the commit text). This patch addresses the issue by adding a sleep-and-retry loop to cma_alloc(). We know of another, similar example that occurs during exit_mmap() (in zap_pte_range() specifically), but I need to determine whether it is still relevant today.
[PATCH] mm: cma: retry allocations in cma_alloc
CMA allocations will fail if 'pinned' pages are in a CMA area, since we cannot migrate pinned pages. A struct page whose _refcount is greater than its _mapcount can be pinned in this way for anonymous pages. This is because try_to_unmap(), which (1) is called in the CMA allocation path and (2) decrements both _refcount and _mapcount for a page, stops unmapping a page from VMAs once the page's _mapcount reaches 0. This implies that after try_to_unmap() has finished successfully for a page where _refcount > _mapcount, _refcount will still be greater than 0. Later in the CMA allocation path, in migrate_page_move_mapping(), we will have one more reference count than intended for anonymous pages, meaning the allocation will fail for that page.

One example of where _refcount can be greater than _mapcount for a page we would not expect to be pinned is inside copy_one_pte(), which is called during a fork. For ptes for which pte_present(pte) == true, copy_one_pte() increments the _refcount field and then the _mapcount field of a page. If the process doing copy_one_pte() is context-switched out after incrementing _refcount but before incrementing _mapcount, the page will be temporarily pinned.

So, inside cma_alloc(), instead of giving up when alloc_contig_range() returns -EBUSY after having scanned a whole CMA-region bitmap, perform retries with sleeps to give the system an opportunity to unpin any pinned pages.
Signed-off-by: Chris Goldsworthy
Co-developed-by: Susheel Khiani
Signed-off-by: Susheel Khiani
Co-developed-by: Vinayak Menon
Signed-off-by: Vinayak Menon
---
 mm/cma.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/cma.c b/mm/cma.c
index 7f415d7..7b85fe6 100644
--- a/mm/cma.c
+++ b/mm/cma.c
@@ -32,6 +32,7 @@
 #include <linux/highmem.h>
 #include <linux/io.h>
 #include <linux/kmemleak.h>
+#include <linux/delay.h>
 #include <trace/events/cma.h>

 #include "cma.h"
@@ -418,6 +419,8 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 	size_t i;
 	struct page *page = NULL;
 	int ret = -ENOMEM;
+	int num_attempts = 0;
+	int max_retries = 5;

 	if (!cma || !cma->count || !cma->bitmap)
 		return NULL;
@@ -442,8 +445,25 @@ struct page *cma_alloc(struct cma *cma, size_t count, unsigned int align,
 				bitmap_maxno, start, bitmap_count, mask,
 				offset);
 		if (bitmap_no >= bitmap_maxno) {
-			mutex_unlock(&cma->lock);
-			break;
+			if ((num_attempts < max_retries) && (ret == -EBUSY)) {
+				mutex_unlock(&cma->lock);
+
+				/*
+				 * Page may be momentarily pinned by some other
+				 * process which has been scheduled out, e.g.
+				 * in exit path, during unmap call, or process
+				 * fork and so cannot be freed there. Sleep
+				 * for 100ms and retry the allocation.
+				 */
+				start = 0;
+				ret = -ENOMEM;
+				msleep(100);
+				num_attempts++;
+				continue;
+			} else {
+				mutex_unlock(&cma->lock);
+				break;
+			}
 		}
 		bitmap_set(cma->bitmap, bitmap_no, bitmap_count);
 		/*
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project
[PATCH] of: reserved_mem: add missing of_node_put() for proper ref-counting
Commit d698a388146c ("of: reserved-memory: ignore disabled memory-region nodes") added an early return in of_reserved_mem_device_init_by_idx(), but didn't call of_node_put() on a device_node whose ref-count was incremented in the call to of_parse_phandle() preceding the early exit.

Fixes: d698a388146c ("of: reserved-memory: ignore disabled memory-region nodes")
Signed-off-by: Chris Goldsworthy
To: Rob Herring
Cc: devicet...@vger.kernel.org
Cc: sta...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-...@vger.kernel.org
Cc: linux-arm-ker...@lists.infradead.org
---
 drivers/of/of_reserved_mem.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index 7989703..6bd610e 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -324,8 +324,10 @@ int of_reserved_mem_device_init_by_idx(struct device *dev,
 	if (!target)
 		return -ENODEV;

-	if (!of_device_is_available(target))
+	if (!of_device_is_available(target)) {
+		of_node_put(target);
 		return 0;
+	}

 	rmem = __find_rmem(target);
 	of_node_put(target);
--
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project