[PATCH 0/2] btrfs: fortification for GFP_NOFS allocations
Hi,
these two patches were sent as part of a larger RFC which aims at allowing GFP_NOFS allocations to fail, to help sort out memory reclaim issues bound to the current behavior (http://marc.info/?l=linux-mm&m=143876830616538&w=2).

It is clear that the move to the new GFP_NOFS behavior is a long term plan, but these patches should be good enough even with that change in place. It also seems that Chris wasn't opposed and would be willing to take them (http://marc.info/?l=linux-mm&m=143991792427165&w=2), so here we come. I have rephrased the changelogs to not refer to the patch which changes the NOFS behavior.

Just to clarify: these two patches allowed my particular test case (mentioned in the cover letter referenced above) to survive. That doesn't mean that failing GFP_NOFS allocations are OK now. I have seen some other places where a GFP_NOFS allocation is followed by BUG_ON(ALLOC_FAILED). I have not encountered them though.

Let me know if you would prefer other changes.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/2] btrfs: Prevent from early transaction abort
From: Michal Hocko

Btrfs relies on GFP_NOFS allocations when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER), so this is not a problem currently, but there is an attempt to move away from the default no-fail behavior and allow these allocations to fail more eagerly. This would lead to a premature transaction abort as follows:

[ 55.328093] Call Trace:
[ 55.328890] [] dump_stack+0x4f/0x7b
[ 55.330518] [] ? console_unlock+0x334/0x363
[ 55.332738] [] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [] pagecache_get_page+0x10e/0x20c
[ 55.336844] [] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [] ? get_parent_ip+0xe/0x3e
[ 55.349434] [] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [] vfs_fsync+0x1c/0x1e
[ 55.370654] [] do_fsync+0x34/0x4e
[ 55.372246] [] SyS_fsync+0x10/0x14
[ 55.373851] [] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] [ cut here ]
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [] dump_stack+0x4f/0x7b
[ 55.384357] [] ? down_trylock+0x2d/0x37
[ 55.384359] [] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [] vfs_fsync+0x1c/0x1e
[ 55.384593] [] do_fsync+0x34/0x4e
[ 55.384594] [] SyS_fsync+0x10/0x14
[ 55.384595] [] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by being explicit about the no-fail behavior of this allocation path and use __GFP_NOFAIL.
Signed-off-by: Michal Hocko
---
 fs/btrfs/extent_io.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c374e1e71e5f..f4d6eea975d7 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -4607,9 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start,
 {
 	struct extent_buffer *eb = NULL;
 
-	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS);
-	if (eb == NULL)
-		return NULL;
+	eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL);
 	eb->start = start;
 	eb->len = len;
 	eb->fs_info = fs_info;
@@ -4867,7 +4865,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info,
 		return NULL;
 
 	for (i = 0; i < num_pages; i++, index++) {
-		p = find_or_create_page(mapping, index, GFP_NOFS);
+		p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL);
 		if (!p)
 			goto free_eb;
 
--
2.5.0
[PATCH 2/2] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
From: Michal Hocko

alloc_btrfs_bio relies on a GFP_NOFS allocation when committing the transaction but this allocation context is rather weak wrt. reclaim capabilities. The page allocator currently tries hard to not fail these allocations if they are small (<=PAGE_ALLOC_COSTLY_ORDER) but it can still fail if the _current_ process is the OOM killer victim. Moreover there is an attempt to move away from the default no-fail behavior and allow these allocations to fail more eagerly. This would lead to:

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

which is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko
---
 fs/btrfs/volumes.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..42b9949dd71d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,9 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
-	if (!bbio)
-		return NULL;
+		GFP_NOFS|__GFP_NOFAIL);
 
 	atomic_set(&bbio->error, 0);
 	atomic_set(&bbio->refs, 1);
--
2.5.0
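For readers who want to see why the NULL check can go away once the no-fail behavior is explicit, here is a minimal userspace sketch. It is not kernel code: MODEL_GFP_NOFAIL, model_failures_left, model_try_alloc and model_alloc are all made-up names, and the "memory pressure" is simulated with a simple failure counter. It only illustrates the contract: a no-fail request keeps retrying and never hands NULL back to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for __GFP_NOFAIL; the value is arbitrary. */
#define MODEL_GFP_NOFAIL 0x1u

/* Simulated memory pressure: this many upcoming attempts will fail. */
static int model_failures_left;

/* One allocation attempt, which may fail under simulated pressure. */
static void *model_try_alloc(size_t size)
{
	if (model_failures_left > 0) {
		model_failures_left--;
		return NULL;
	}
	return calloc(1, size);
}

/*
 * Without MODEL_GFP_NOFAIL a failed attempt is reported to the caller.
 * With it, the request loops until an attempt succeeds, so the caller
 * never sees NULL and needs no error path.
 */
static void *model_alloc(size_t size, unsigned int flags)
{
	void *p;

	do {
		p = model_try_alloc(size);
	} while (!p && (flags & MODEL_GFP_NOFAIL));

	return p;
}
```

The price of the no-fail variant is visible in the loop: the caller can block for as long as the pressure lasts, which is why it should be reserved for paths that really cannot tolerate a failure.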
[RFC 0/8] Allow GFP_NOFS allocation to fail
Hi,
small GFP_NOFS allocations, like GFP_KERNEL ones, have traditionally not been failing even though their reclaim capabilities are restricted because the VM code cannot recurse into filesystems to clean dirty pages. At the same time these allocation requests are not allowed to trigger the OOM killer because that would lead to premature OOM killing during heavy fs metadata workloads.

This leaves the VM code in an unfortunate situation where a GFP_NOFS request loops inside the allocator, relying on somebody else to make progress on its behalf. This is prone to deadlocks when the request is holding resources which are necessary for other tasks to make progress and release memory (e.g. the OOM victim is blocked on a lock held by the NOFS request). Another drawback is that the caller of the allocator cannot define any fallback strategy because the request doesn't fail.

As the VM cannot do much about these requests we should face reality and allow those allocations to fail. Johannes has already posted a patch which does that (http://marc.info/?l=linux-mm&m=142726428514236&w=2) but the discussion died pretty quickly.

I was playing with this patch and xfs, ext[34] and btrfs for a while to see what the effect is under heavy memory pressure. As expected this led to some fallout.

My test consisted of a simple memory hog which allocates a lot of anonymous memory and writes to a fs, mainly to trigger fs activity on exit. In parallel there is a fs metadata load (multiple tasks creating thousands of empty files and directories). Everything runs in a VM with a small amount of memory to emulate an under-provisioned system. The metadata load alone is sufficient to invoke direct reclaim even without the memory hog.
The memory hog forks several tasks sharing the VM and OOM killer manages to kill it without locking up the system (this was based on the test case from Tetsuo Handa - http://www.spinics.net/lists/linux-fsdevel/msg82958.html - I just didn't want to kill my machine ;)). With all the patches applied none of the 4 filesystems gets aborted transactions and RO remount (well xfs didn't need any special treatment). This is obviously not sufficient to claim that failing GFP_NOFS is OK now but I think it is a good start for the further discussion. I would be grateful if FS people could have a look at those patches. I have simply used __GFP_NOFAIL in the critical paths. This might be not the best strategy but it sounds like a good first step. The first patch in the series also allows __GFP_NOFAIL allocations to access memory reserves when the system is OOM which should help those requests to make a forward progress - especially in combination with GFP_NOFS. The second patch tries to address a potential pre-mature OOM killer from the page fault path. I have posted it separately but it didn't get much traction. The third patch allows GFP_NOFS to fail and I believe it should see much more testing coverage. It would be really great if it could sit in the mmotm tree for few release cycles so that we can catch more fallouts. The rest are the FS specific patches to fortify allocations requests which are really needed to finish transactions without RO remounts. There might be more needed but my test case survives with these in place. They would obviously need some rewording if they are going to be applied even without Patch3 and I will do that if respective maintainers will take them. Ext3 and JBD are going away soon so they might be dropped but they have been in the tree while I was testing so I've kept them. Thoughts? Opinions? 
[RFC 2/8] mm: Allow GFP_IOFS for page_cache_read page cache allocation
From: Michal Hocko

page_cache_read has historically been using page_cache_alloc_cold to allocate a new page. This means that mapping_gfp_mask is used as the base for the gfp_mask. Many filesystems set this mask to GFP_NOFS to prevent fs recursion issues. page_cache_read is, however, not called from the fs layer directly so it doesn't need this protection normally.

ceph and ocfs2, which call filemap_fault from their fault handlers, seem to be OK because they are not taking any fs lock before invoking the generic implementation. xfs, which takes XFS_MMAPLOCK_SHARED, is safe from the reclaim recursion POV because this lock serializes truncate and punch hole with the page faults and it doesn't get involved in the reclaim.

The GFP_NOFS protection might even be harmful. There is a push to fail GFP_NOFS allocations rather than loop within the allocator indefinitely with a very limited reclaim ability. Once we start failing those requests the OOM killer might be triggered prematurely because the page cache allocation failure is propagated up the page fault path and ends up in pagefault_out_of_memory.

We cannot play with mapping_gfp_mask directly because that would be racy wrt. parallel page faults and it might interfere with other users who really rely on NOFS semantic from the stored gfp_mask. The mask is also inode proper so it would even be a layering violation. What we can do instead is to push the gfp_mask into struct vm_fault and allow the fs layer to overwrite it should the callback need to be called with a different allocation context.

Initialize the default to (mapping_gfp_mask | GFP_IOFS) because this should be safe from the page fault path normally. Why do we care about mapping_gfp_mask at all then? Because it doesn't hold only reclaim protection flags but might also contain zone and movability restrictions (GFP_DMA32, __GFP_MOVABLE and others), so we have to respect those.
Reported-by: Tetsuo Handa Signed-off-by: Michal Hocko --- include/linux/mm.h | 4 mm/filemap.c | 9 - mm/memory.c| 17 + 3 files changed, 25 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7f471789781a..962e37c7cd6a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -220,10 +220,14 @@ extern pgprot_t protection_map[16]; * ->fault function. The vma's ->fault is responsible for returning a bitmask * of VM_FAULT_xxx flags that give details about how the fault was handled. * + * MM layer fills up gfp_mask for page allocations but fault handler might + * alter it if its implementation requires a different allocation context. + * * pgoff should be used in favour of virtual_address, if possible. */ struct vm_fault { unsigned int flags; /* FAULT_FLAG_xxx flags */ + gfp_t gfp_mask; /* gfp mask to be used for allocations */ pgoff_t pgoff; /* Logical page offset based on vma */ void __user *virtual_address; /* Faulting virtual address */ diff --git a/mm/filemap.c b/mm/filemap.c index b63fb81df336..8a16a07bbe02 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -1774,19 +1774,18 @@ EXPORT_SYMBOL(generic_file_read_iter); * This adds the requested page to the page cache if it isn't already there, * and schedules an I/O to read in its contents from disk. 
*/ -static int page_cache_read(struct file *file, pgoff_t offset) +static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask) { struct address_space *mapping = file->f_mapping; struct page *page; int ret; do { - page = page_cache_alloc_cold(mapping); + page = __page_cache_alloc(gfp_mask|__GFP_COLD); if (!page) return -ENOMEM; - ret = add_to_page_cache_lru(page, mapping, offset, - GFP_KERNEL & mapping_gfp_mask(mapping)); + ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL & gfp_mask); if (ret == 0) ret = mapping->a_ops->readpage(file, page); else if (ret == -EEXIST) @@ -1969,7 +1968,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) * We're only likely to ever get here if MADV_RANDOM is in * effect. */ - error = page_cache_read(file, offset); + error = page_cache_read(file, offset, vmf->gfp_mask); /* * The page we want has now been added to the page cache. diff --git a/mm/memory.c b/mm/memory.c index 8a2fc9945b46..25ab29560dca 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -1949,6 +1949,20 @@ static inline void cow_user_page(struct page *dst, struct page *src, unsigned lo copy_user_highpage(dst, src, va, vma); } +static gfp_t __get_fault_gfp_mask(struct vm_area_struct *vma) +{ + struct file *vm_file = vma->vm_file; + + if (vm_file) +
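The default mask described in the changelog, (mapping_gfp_mask | GFP_IOFS), can be illustrated with a small userspace sketch. The bit values, the model_gfp_t type and model_fault_gfp_mask are hypothetical and exist only for this example; they are not the kernel's definitions and they do not reproduce the helper from the diff. The point is just that OR-ing in the IO and FS bits lifts the NOFS restriction while any zone or movability bits stored in the mapping mask are carried over unchanged.

```c
#include <assert.h>

typedef unsigned int model_gfp_t;

/* Hypothetical bit values, chosen only for this sketch. */
#define MODEL_GFP_IO      0x01u
#define MODEL_GFP_FS      0x02u
#define MODEL_GFP_DMA32   0x04u
#define MODEL_GFP_MOVABLE 0x08u
#define MODEL_GFP_IOFS    (MODEL_GFP_IO | MODEL_GFP_FS)

/*
 * Build the allocation mask for the page fault path: start from the
 * mask stored in the mapping and allow both IO and FS reclaim on top.
 * Bits already present (zone, movability) are preserved as they are.
 */
static model_gfp_t model_fault_gfp_mask(model_gfp_t mapping_mask)
{
	return mapping_mask | MODEL_GFP_IOFS;
}
```

For a mapping whose stored mask is "NOFS plus DMA32" (IO allowed, FS not), the result allows FS reclaim and still carries the DMA32 restriction.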
[RFC 3/8] mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM
From: Johannes Weiner

GFP_NOFS allocations are not allowed to invoke the OOM killer since their reclaim abilities are severely diminished. However, without the OOM killer available there is no hope of progress once the reclaimable pages have been exhausted.

Don't risk hanging these allocations. Leave it to the allocation site to implement the fallback policy for failing allocations.

Signed-off-by: Johannes Weiner
Signed-off-by: Michal Hocko
---
 mm/page_alloc.c | 9 +
 1 file changed, 1 insertion(+), 8 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ee69c338ca2a..024d45d51700 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2715,15 +2715,8 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 		if (ac->high_zoneidx < ZONE_NORMAL)
 			goto out;
 		/* The OOM killer does not compensate for IO-less reclaim */
-		if (!(gfp_mask & __GFP_FS)) {
-			/*
-			 * XXX: Page reclaim didn't yield anything,
-			 * and the OOM killer can't be invoked, but
-			 * keep looping as per tradition.
-			 */
-			*did_some_progress = 1;
+		if (!(gfp_mask & __GFP_FS))
 			goto out;
-		}
 		if (pm_suspended_storage())
 			goto out;
 		/* The OOM killer may not free memory on a specific node */
--
2.5.0
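The behavioral change of this hunk can be modelled in userspace roughly as follows. This is only a sketch under simplifying assumptions: MODEL_GFP_FS, struct oom_outcome, may_oom_old and may_oom_new are made-up names, every other check in __alloc_pages_may_oom is ignored, and the OOM killer is assumed to succeed whenever it is invoked. What it shows is the one thing the patch changes: a request without the FS bit used to report progress (so the allocator kept looping), and now reports none (so the allocation can fail).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for __GFP_FS; the value is arbitrary. */
#define MODEL_GFP_FS 0x1u

struct oom_outcome {
	bool oom_killer_invoked;  /* was the OOM killer reached? */
	bool did_some_progress;   /* true tells the allocator to retry */
};

/* Before the patch: NOFS skips the OOM killer but still claims progress. */
static struct oom_outcome may_oom_old(unsigned int gfp_mask)
{
	struct oom_outcome out = { false, false };

	if (!(gfp_mask & MODEL_GFP_FS)) {
		out.did_some_progress = true;  /* keep looping as per tradition */
		return out;
	}
	out.oom_killer_invoked = true;
	out.did_some_progress = true;      /* assumed: the kill frees memory */
	return out;
}

/* After the patch: NOFS skips the OOM killer and reports no progress. */
static struct oom_outcome may_oom_new(unsigned int gfp_mask)
{
	struct oom_outcome out = { false, false };

	if (!(gfp_mask & MODEL_GFP_FS))
		return out;
	out.oom_killer_invoked = true;
	out.did_some_progress = true;      /* assumed: the kill frees memory */
	return out;
}
```

Requests that do carry the FS bit behave the same in both variants; only the NOFS case changes, which is exactly why the fs-specific patches in this series are needed.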
[RFC 5/8] ext4: Do not fail journal due to block allocator
From: Michal Hocko

Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" the memory allocator doesn't endlessly loop to satisfy low-order allocations and instead fails them to allow callers to handle them gracefully. Some of the callers are not yet prepared for this behavior though. The ext4 block allocator relies solely on GFP_NOFS allocation requests and allocation failures lead to aborting the journal too easily:

[ 345.028333] oom-trash: page allocation failure: order:0, mode:0x50
[ 345.028336] CPU: 1 PID: 8334 Comm: oom-trash Tainted: GW 4.0.0-nofs3-6-gdfe9931f5f68 #588
[ 345.028337] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 345.028339] 880005a17708 81538a54 8107a40f
[ 345.028341] 0050 880005a17798 810fe854 00018000
[ 345.028342] 0046 81a52100 0246
[ 345.028343] Call Trace:
[ 345.028348] [] dump_stack+0x4f/0x7b
[ 345.028370] [] warn_alloc_failed+0x12a/0x13f
[ 345.028373] [] __alloc_pages_nodemask+0x7f3/0x8aa
[ 345.028375] [] pagecache_get_page+0x12a/0x1c9
[ 345.028390] [] ext4_mb_load_buddy+0x220/0x367 [ext4]
[ 345.028414] [] ext4_free_blocks+0x522/0xa4c [ext4]
[ 345.028425] [] ext4_ext_remove_space+0x833/0xf22 [ext4]
[ 345.028434] [] ext4_ext_truncate+0x8c/0xb0 [ext4]
[ 345.028441] [] ext4_truncate+0x20b/0x38d [ext4]
[ 345.028462] [] ext4_evict_inode+0x32b/0x4c1 [ext4]
[ 345.028464] [] evict+0xa0/0x148
[ 345.028466] [] iput+0x1a1/0x1f0
[ 345.028468] [] __dentry_kill+0x136/0x1a6
[ 345.028470] [] dput+0x21a/0x243
[ 345.028472] [] __fput+0x184/0x19b
[ 345.028473] [] fput+0xe/0x10
[ 345.028475] [] task_work_run+0x8a/0xa1
[ 345.028477] [] do_exit+0x3c6/0x8dc
[ 345.028482] [] do_group_exit+0x4d/0xb2
[ 345.028483] [] get_signal+0x5b1/0x5f5
[ 345.028488] [] do_signal+0x28/0x5d0
[...]
[ 345.028624] EXT4-fs error (device hdb1) in ext4_free_blocks:4879: Out of memory
[ 345.033097] Aborting journal on device hdb1-8.
[ 345.036339] EXT4-fs (hdb1): Remounting filesystem read-only
[ 345.036344] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.036766] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.038583] EXT4-fs error (device hdb1) in ext4_ext_remove_space:3048: Journal has aborted
[ 345.049115] EXT4-fs error (device hdb1) in ext4_ext_truncate:4669: Journal has aborted
[ 345.050434] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053064] EXT4-fs error (device hdb1) in ext4_truncate:3668: Journal has aborted
[ 345.053582] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted
[ 345.053946] EXT4-fs error (device hdb1) in ext4_orphan_del:2686: Journal has aborted
[ 345.055367] EXT4-fs error (device hdb1) in ext4_reserve_inode_write:4834: Journal has aborted

The failure is really premature because the GFP_NOFS allocation context is very restricted, especially under fs metadata heavy loads. Before we go with a more sophisticated solution, let's simply imitate the previous behavior of non-failing NOFS allocations and use __GFP_NOFAIL for the buddy block allocator.

I wasn't able to trigger the issue with this patch anymore.
Signed-off-by: Michal Hocko --- fs/ext4/mballoc.c | 12 1 file changed, 8 insertions(+), 4 deletions(-) diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c index 5b1613a54307..e6361622bfd5 100644 --- a/fs/ext4/mballoc.c +++ b/fs/ext4/mballoc.c @@ -992,7 +992,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb, block = group * 2; pnum = block / blocks_per_page; poff = block % blocks_per_page; - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, pnum, + GFP_NOFS|__GFP_NOFAIL); if (!page) return -ENOMEM; BUG_ON(page->mapping != inode->i_mapping); @@ -1006,7 +1007,8 @@ static int ext4_mb_get_buddy_page_lock(struct super_block *sb, block++; pnum = block / blocks_per_page; - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, pnum, + GFP_NOFS|__GFP_NOFAIL); if (!page) return -ENOMEM; BUG_ON(page->mapping != inode->i_mapping); @@ -1158,7 +1160,8 @@ ext4_mb_load_buddy(struct super_block *sb, ext4_group_t group, * wait for it to initialize. */ page_cache_release(page); - page = find_or_create_page(inode->i_mapping, pnum, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, pnum, + GFP_NOFS|__GFP_NOFAIL);
[RFC 4/8] jbd, jbd2: Do not fail journal because of frozen_buffer allocation failure
From: Michal Hocko

A journal transaction might fail prematurely because the frozen_buffer is allocated by a GFP_NOFS request:

[ 72.440013] do_get_write_access: OOM for frozen_buffer
[ 72.440014] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.440015] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4735: Out of memory
(...snipped)
[ 72.495559] do_get_write_access: OOM for frozen_buffer
[ 72.495560] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.496839] do_get_write_access: OOM for frozen_buffer
[ 72.496841] EXT4-fs: ext4_reserve_inode_write:4729: aborting transaction: Out of memory in __ext4_journal_get_write_access
[ 72.505766] Aborting journal on device sda1-8.
[ 72.505851] EXT4-fs (sda1): Remounting filesystem read-only

This wasn't a problem until "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" because small GFP_NOFS allocations never failed. This allocation seems essential for the journal and GFP_NOFS is too restrictive for the memory allocator, so let's use __GFP_NOFAIL here to emulate the previous behavior.

The jbd code has the very same issue so let's do the same there as well.
Signed-off-by: Michal Hocko --- fs/jbd/transaction.c | 11 +-- fs/jbd2/transaction.c | 14 +++--- 2 files changed, 4 insertions(+), 21 deletions(-) diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index 1695ba8334a2..bf7474deda2f 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -673,16 +673,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, jbd_unlock_bh_state(bh); frozen_buffer = jbd_alloc(jh2bh(jh)->b_size, -GFP_NOFS); - if (!frozen_buffer) { - printk(KERN_ERR - "%s: OOM for frozen_buffer\n", - __func__); - JBUFFER_TRACE(jh, "oom!"); - error = -ENOMEM; - jbd_lock_bh_state(bh); - goto done; - } +GFP_NOFS|__GFP_NOFAIL); goto repeat; } jh->b_frozen_data = frozen_buffer; diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c index ff2f2e6ad311..bff071e21553 100644 --- a/fs/jbd2/transaction.c +++ b/fs/jbd2/transaction.c @@ -923,16 +923,7 @@ do_get_write_access(handle_t *handle, struct journal_head *jh, jbd_unlock_bh_state(bh); frozen_buffer = jbd2_alloc(jh2bh(jh)->b_size, -GFP_NOFS); - if (!frozen_buffer) { - printk(KERN_ERR - "%s: OOM for frozen_buffer\n", - __func__); - JBUFFER_TRACE(jh, "oom!"); - error = -ENOMEM; - jbd_lock_bh_state(bh); - goto done; - } +GFP_NOFS|__GFP_NOFAIL); goto repeat; } jh->b_frozen_data = frozen_buffer; @@ -1157,7 +1148,8 @@ int jbd2_journal_get_undo_access(handle_t *handle, struct buffer_head *bh) repeat: if (!jh->b_committed_data) { - committed_data = jbd2_alloc(jh2bh(jh)->b_size, GFP_NOFS); + committed_data = jbd2_alloc(jh2bh(jh)->b_size, + GFP_NOFS|__GFP_NOFAIL); if (!committed_data) { printk(KERN_ERR "%s: No memory for committed data\n", __func__); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 6/8] ext3: Do not abort journal prematurely
From: Michal Hocko journal_get_undo_access is relying on GFP_NOFS allocation yet it is essential for the journal transaction: [ 83.256914] journal_get_undo_access: No memory for committed data [ 83.258022] EXT3-fs: ext3_free_blocks_sb: aborting transaction: Out of memory in __ext3_journal_get_undo_access [ 83.259785] EXT3-fs (hdb1): error in ext3_free_blocks_sb: Out of memory [ 83.267130] Aborting journal on device hdb1. [ 83.292308] EXT3-fs (hdb1): error: ext3_journal_start_sb: Detected aborted journal [ 83.293630] EXT3-fs (hdb1): error: remounting filesystem read-only Since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" these allocation requests are allowed to fail so we need to use __GFP_NOFAIL to imitate the previous behavior. Signed-off-by: Michal Hocko --- fs/jbd/transaction.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/jbd/transaction.c b/fs/jbd/transaction.c index bf7474deda2f..6c60376a29bc 100644 --- a/fs/jbd/transaction.c +++ b/fs/jbd/transaction.c @@ -887,7 +887,7 @@ int journal_get_undo_access(handle_t *handle, struct buffer_head *bh) repeat: if (!jh->b_committed_data) { - committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS); + committed_data = jbd_alloc(jh2bh(jh)->b_size, GFP_NOFS | __GFP_NOFAIL); if (!committed_data) { printk(KERN_ERR "%s: No memory for committed data\n", __func__); -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 8/8] btrfs: use __GFP_NOFAIL in alloc_btrfs_bio
From: Michal Hocko

alloc_btrfs_bio is relying on GFP_NOFS to allocate a bio but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" this is allowed to fail which can lead to

[ 37.928625] kernel BUG at fs/btrfs/extent_io.c:4045

This is clearly undesirable and the nofail behavior should be explicit if the allocation failure cannot be tolerated.

Signed-off-by: Michal Hocko
---
 fs/btrfs/volumes.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 53af23f2c087..57a99d19533d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -4914,7 +4914,7 @@ static struct btrfs_bio *alloc_btrfs_bio(int total_stripes, int real_stripes)
 		 * and the stripes
 		 */
 		sizeof(u64) * (total_stripes),
-		GFP_NOFS);
+		GFP_NOFS|__GFP_NOFAIL);
 	if (!bbio)
 		return NULL;
 
--
2.5.0
[RFC 7/8] btrfs: Prevent from early transaction abort
From: Michal Hocko

Btrfs relies on GFP_NOFS allocations when committing the transaction but since "mm: page_alloc: do not lock up GFP_NOFS allocations upon OOM" those allocations are allowed to fail which can lead to a premature transaction abort:

[ 55.328093] Call Trace:
[ 55.328890] [] dump_stack+0x4f/0x7b
[ 55.330518] [] ? console_unlock+0x334/0x363
[ 55.332738] [] __alloc_pages_nodemask+0x81d/0x8d4
[ 55.334910] [] pagecache_get_page+0x10e/0x20c
[ 55.336844] [] alloc_extent_buffer+0xd0/0x350 [btrfs]
[ 55.338973] [] btrfs_find_create_tree_block+0x15/0x17 [btrfs]
[ 55.341329] [] btrfs_alloc_tree_block+0x18c/0x405 [btrfs]
[ 55.343566] [] split_leaf+0x1e4/0x6a6 [btrfs]
[ 55.345577] [] btrfs_search_slot+0x671/0x831 [btrfs]
[ 55.347679] [] ? get_parent_ip+0xe/0x3e
[ 55.349434] [] btrfs_insert_empty_items+0x5d/0xa8 [btrfs]
[ 55.351681] [] __btrfs_run_delayed_refs+0x7a6/0xf35 [btrfs]
[ 55.353979] [] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.356212] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.358378] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.360626] [] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.362894] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.365221] [] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.367273] [] vfs_fsync_range+0x8f/0x9e
[ 55.369047] [] vfs_fsync+0x1c/0x1e
[ 55.370654] [] do_fsync+0x34/0x4e
[ 55.372246] [] SyS_fsync+0x10/0x14
[ 55.373851] [] system_call_fastpath+0x12/0x6f
[ 55.381070] BTRFS: error (device hdb1) in btrfs_run_delayed_refs:2821: errno=-12 Out of memory
[ 55.382431] BTRFS warning (device hdb1): Skipping commit of aborted transaction.
[ 55.382433] BTRFS warning (device hdb1): cleanup_transaction:1692: Aborting unused transaction(IO failure).
[ 55.384280] [ cut here ]
[ 55.384312] WARNING: CPU: 0 PID: 3010 at fs/btrfs/delayed-ref.c:438 btrfs_select_ref_head+0xd9/0xfe [btrfs]()
[...]
[ 55.384337] Call Trace:
[ 55.384353] [] dump_stack+0x4f/0x7b
[ 55.384357] [] ? down_trylock+0x2d/0x37
[ 55.384359] [] warn_slowpath_common+0xa1/0xbb
[ 55.384398] [] ? btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384400] [] warn_slowpath_null+0x1a/0x1c
[ 55.384423] [] btrfs_select_ref_head+0xd9/0xfe [btrfs]
[ 55.384446] [] ? __btrfs_run_delayed_refs+0xa2/0xf35 [btrfs]
[ 55.384455] [] __btrfs_run_delayed_refs+0xab/0xf35 [btrfs]
[ 55.384476] [] btrfs_run_delayed_refs+0x6e/0x226 [btrfs]
[ 55.384499] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384521] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384543] [] btrfs_commit_transaction+0x4c/0xaba [btrfs]
[ 55.384565] [] ? start_transaction+0x192/0x534 [btrfs]
[ 55.384588] [] btrfs_sync_file+0x29c/0x310 [btrfs]
[ 55.384591] [] vfs_fsync_range+0x8f/0x9e
[ 55.384592] [] vfs_fsync+0x1c/0x1e
[ 55.384593] [] do_fsync+0x34/0x4e
[ 55.384594] [] SyS_fsync+0x10/0x14
[ 55.384595] [] system_call_fastpath+0x12/0x6f
[...]
[ 55.384608] ---[ end trace c29799da1d4dd621 ]---
[ 55.437323] BTRFS info (device hdb1): forced readonly
[ 55.438815] BTRFS info (device hdb1): delayed_refs has NO entry

Fix this by reintroducing the no-fail behavior of this allocation path with the explicit __GFP_NOFAIL.
Signed-off-by: Michal Hocko --- fs/btrfs/extent_io.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index c374e1e71e5f..88fad7051e38 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4607,7 +4607,7 @@ __alloc_extent_buffer(struct btrfs_fs_info *fs_info, u64 start, { struct extent_buffer *eb = NULL; - eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS); + eb = kmem_cache_zalloc(extent_buffer_cache, GFP_NOFS|__GFP_NOFAIL); if (eb == NULL) return NULL; eb->start = start; @@ -4867,7 +4867,7 @@ struct extent_buffer *alloc_extent_buffer(struct btrfs_fs_info *fs_info, return NULL; for (i = 0; i < num_pages; i++, index++) { - p = find_or_create_page(mapping, index, GFP_NOFS); + p = find_or_create_page(mapping, index, GFP_NOFS|__GFP_NOFAIL); if (!p) goto free_eb; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC 1/8] mm, oom: Give __GFP_NOFAIL allocations access to memory reserves
From: Michal Hocko __GFP_NOFAIL is a big hammer used to ensure that the allocation request can never fail. This is a strong requirement and as such it also deserves a special treatment when the system is OOM. The primary problem here is that the allocation request might have come with some locks held and the oom victim might be blocked on the same locks. This is basically an OOM deadlock situation. This patch tries to reduce the risk of such a deadlocks by giving __GFP_NOFAIL allocations a special treatment and let them dive into memory reserves after oom killer invocation. This should help them to make a progress and release resources they are holding. The OOM victim should compensate for the reserves consumption. Signed-off-by: Michal Hocko --- mm/page_alloc.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 1f9ffbb087cb..ee69c338ca2a 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -2732,8 +2732,16 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, } /* Exhausted what can be done so it's blamo time */ if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false) - || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) + || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL)) { *did_some_progress = 1; + + if (gfp_mask & __GFP_NOFAIL) { + page = get_page_from_freelist(gfp_mask, order, + ALLOC_NO_WATERMARKS|ALLOC_CPUSET, ac); + WARN_ONCE(!page, "Unable to fullfil gfp_nofail allocation." + " Consider increasing min_free_kbytes.\n"); + } + } out: mutex_unlock(&oom_lock); return page; -- 2.5.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
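To illustrate what "diving into memory reserves" means here, a toy userspace model follows. It is a deliberate simplification and not how the kernel's watermark checks are implemented: struct model_zone, MODEL_ALLOC_NO_WATERMARKS and model_get_page are hypothetical names, and a single "min watermark" stands in for the real per-zone watermark logic. The sketch only shows that a normal request is refused once free memory is at or below the watermark, while a request allowed to ignore watermarks can keep taking pages until none are left.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for ALLOC_NO_WATERMARKS; the value is arbitrary. */
#define MODEL_ALLOC_NO_WATERMARKS 0x1u

/* Toy zone: pages at or below min_watermark are the "reserve". */
struct model_zone {
	unsigned long free_pages;
	unsigned long min_watermark;
};

/*
 * Try to take one page from the zone. Normal requests must leave the
 * reserve untouched; requests with MODEL_ALLOC_NO_WATERMARKS may use it
 * and fail only when the zone is completely empty.
 */
static bool model_get_page(struct model_zone *z, unsigned int alloc_flags)
{
	if (z->free_pages == 0)
		return false;

	if (!(alloc_flags & MODEL_ALLOC_NO_WATERMARKS) &&
	    z->free_pages <= z->min_watermark)
		return false;

	z->free_pages--;
	return true;
}
```

In this model the last failure case corresponds to the situation the WARN_ONCE in the patch complains about: even the reserves are gone and the no-fail request still cannot be satisfied.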