[PATCH] btrfs/150: add _scratch_dev_pool_get/put to run the test as expected

2018-02-19 Thread Misono, Tomohiro
btrfs/150 uses RAID1 profile and make SCRATCH_DEV fail for test.
However, if SCRATCH_DEV_POOL consists more than two devices, SCRATCH_DEV
may not be used for RAID1 pair and the tests may not run as expected.

Fix this by add _scratch_dev_pool_get/put like other tests (141, 143
etc.) do.

Signed-off-by: Tomohiro Misono 
---
 tests/btrfs/150 | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tests/btrfs/150 b/tests/btrfs/150
index 97041b6c..1e4586be 100755
--- a/tests/btrfs/150
+++ b/tests/btrfs/150
@@ -55,6 +55,7 @@ _supported_os Linux
 _require_scratch
 _require_fail_make_request
 _require_scratch_dev_pool 2
+_scratch_dev_pool_get 2
 
 SYSFS_BDEV=`_sysfs_dev $SCRATCH_DEV`
 enable_io_failure()
@@ -100,6 +101,7 @@ while [[ -z $result ]]; do
disable_io_failure
 done
 
+_scratch_dev_pool_put
 # success, all done
 status=0
 exit
-- 
2.14.3

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RESEND] btrfs: delete function btrfs_close_extra_devices()

2018-02-19 Thread David Sterba
On Thu, Feb 15, 2018 at 01:02:24PM +0800, Anand Jain wrote:
> btrfs_close_extra_devices() is not exactly about just closing the opened
> devices, but its also about free-ing the stale devices which may have
> scanned into the btrfs_fs_devices::dev_list. The way it picks devices to
> be freed is by going through the btrfs_fs_devices::dev_list and its seed
> devices, and finding for devices which do not have the flag
> BTRFS_DEV_STATE_IN_FS_METADATA nor if it is part of the replace target.
> 
> However, in the first place the way devices are scanned and added to the
> btrfs_fs_devices::dev_list have changed for a long time now. During scan
> when it finds matching fsid+uuid+devid it would add the device to
> btrfs_fs_devices::dev_list. A matched device with higher generation number
> overwrites the device with lower generation number during.
> 
> Further, the stale devices containing the stale fsid are removed at the
> time of the scan itself.
> 
> So there isn't any opportunity that btrfs_close_extra_devices() can free
> the stale device within the fsid which is being mounted.
> 
> Further about the btrfs_fs_devices::latest_bdev that
> the btrfs_close_extra_devices() function assigns, is already assigned by
> the function __btrfs_open_devices().
> 
> So as this function has no effect, delete it.

I think this is correct. Freeing stale devices as a side effect of mount
does not seem right anyway. I'm still not able to convince myself that
there's not an unexpected interaction of dev scan and dev replace, as
it relies on the state bits and other locks. If you have ideas were to
put some asserts or extra checks, please suggest.

I found only one place where an assert could go:

> - /*
> -  * keep the device that is marked to be the target device for the
> -  * dev_replace procedure
> -  */
> - btrfs_close_extra_devices(fs_devices, 0);
> -
> - if (!fs_devices->latest_bdev) {
> - btrfs_err(fs_info, "failed to read devices");
> - goto fail_tree_roots;
> - }

ASSERT(fs_devices->latest_bdev);

as latest_bdev could have been touched even by the no-op
btrfs_close_extra_devices. I'll add that.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 1/2] btrfs: add (the only possible) __exit annotation

2018-02-19 Thread David Sterba
Recently, the __init annotations have been added. There's unfortunatelly
only one case where we can add __exit, because most of the cleanup
helpers are also called from the __init phase.

As the __exit annotated functions get discarded completely for a
built-in code, we'd miss them from the init phase.

Signed-off-by: David Sterba 
---
 fs/btrfs/volumes.c | 2 +-
 fs/btrfs/volumes.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 2c0c08ec987a..797e7706e67b 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -279,7 +279,7 @@ static void btrfs_kobject_uevent(struct block_device *bdev,
_to_dev(bdev->bd_disk)->kobj);
 }
 
-void btrfs_cleanup_fs_uuids(void)
+void __exit btrfs_cleanup_fs_uuids(void)
 {
struct btrfs_fs_devices *fs_devices;
 
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index d110fb03ec0d..d28f5745fee2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -436,7 +436,7 @@ struct btrfs_device *btrfs_alloc_device(struct 
btrfs_fs_info *fs_info,
const u8 *uuid);
 int btrfs_rm_device(struct btrfs_fs_info *fs_info,
const char *device_path, u64 devid);
-void btrfs_cleanup_fs_uuids(void);
+void __exit btrfs_cleanup_fs_uuids(void);
 int btrfs_num_copies(struct btrfs_fs_info *fs_info, u64 logical, u64 len);
 int btrfs_grow_device(struct btrfs_trans_handle *trans,
  struct btrfs_device *device, u64 new_size);
-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 2/2] btrfs: add more __cold annotations

2018-02-19 Thread David Sterba
The __cold functions are placed to a special section, as they're
expected to be called rarely. This could help i-cache prefetches or help
compiler to decide which branches are more/less likely to be taken
without any other annotations needed.

Though we can't add more __exit annotations, it's still possible to add
__cold (that's also added with __exit). That way the following function
categories are tagged:

- printf wrappers, error messages
- exit helpers

Signed-off-by: David Sterba 
---
 fs/btrfs/backref.c   | 2 +-
 fs/btrfs/backref.h   | 2 +-
 fs/btrfs/compression.c   | 2 +-
 fs/btrfs/compression.h   | 2 +-
 fs/btrfs/ctree.h | 9 +
 fs/btrfs/delayed-inode.c | 2 +-
 fs/btrfs/delayed-inode.h | 2 +-
 fs/btrfs/delayed-ref.c   | 2 +-
 fs/btrfs/delayed-ref.h   | 2 +-
 fs/btrfs/disk-io.c   | 2 +-
 fs/btrfs/disk-io.h   | 2 +-
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/extent_map.c| 2 +-
 fs/btrfs/extent_map.h| 2 +-
 fs/btrfs/file.c  | 2 +-
 fs/btrfs/inode.c | 2 +-
 fs/btrfs/ordered-data.c  | 2 +-
 fs/btrfs/ordered-data.h  | 2 +-
 fs/btrfs/super.c | 2 +-
 fs/btrfs/sysfs.c | 2 +-
 fs/btrfs/tree-checker.c  | 3 +++
 22 files changed, 28 insertions(+), 24 deletions(-)

diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c
index f94b2d8c744a..4e89598ca878 100644
--- a/fs/btrfs/backref.c
+++ b/fs/btrfs/backref.c
@@ -170,7 +170,7 @@ int __init btrfs_prelim_ref_init(void)
return 0;
 }
 
-void btrfs_prelim_ref_exit(void)
+void __cold btrfs_prelim_ref_exit(void)
 {
kmem_cache_destroy(btrfs_prelim_ref_cache);
 }
diff --git a/fs/btrfs/backref.h b/fs/btrfs/backref.h
index 0c2fab8514ff..0a30028d5196 100644
--- a/fs/btrfs/backref.h
+++ b/fs/btrfs/backref.h
@@ -73,7 +73,7 @@ int btrfs_find_one_extref(struct btrfs_root *root, u64 
inode_objectid,
 int btrfs_check_shared(struct btrfs_root *root, u64 inum, u64 bytenr);
 
 int __init btrfs_prelim_ref_init(void);
-void btrfs_prelim_ref_exit(void);
+void __cold btrfs_prelim_ref_exit(void);
 
 struct prelim_ref {
struct rb_node rbnode;
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 07d049c0c20f..562c3e633403 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -1133,7 +1133,7 @@ int btrfs_decompress(int type, unsigned char *data_in, 
struct page *dest_page,
return ret;
 }
 
-void btrfs_exit_compress(void)
+void __cold btrfs_exit_compress(void)
 {
free_workspaces();
 }
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 677fa4aa0bd7..ce796557a918 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -76,7 +76,7 @@ struct compressed_bio {
 };
 
 void __init btrfs_init_compress(void);
-void btrfs_exit_compress(void);
+void __cold btrfs_exit_compress(void);
 
 int btrfs_compress_pages(unsigned int type_level, struct address_space 
*mapping,
 u64 start, struct page **pages,
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 773c0cecc608..0b67ee4dcb6e 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3202,7 +3202,7 @@ struct inode *btrfs_alloc_inode(struct super_block *sb);
 void btrfs_destroy_inode(struct inode *inode);
 int btrfs_drop_inode(struct inode *inode);
 int __init btrfs_init_cachep(void);
-void btrfs_destroy_cachep(void);
+void __cold btrfs_destroy_cachep(void);
 long btrfs_ioctl_trans_end(struct file *file);
 struct inode *btrfs_iget(struct super_block *s, struct btrfs_key *location,
 struct btrfs_root *root, int *was_new);
@@ -3253,7 +3253,7 @@ ssize_t btrfs_dedupe_file_range(struct file *src_file, 
u64 loff, u64 olen,
 
 /* file.c */
 int __init btrfs_auto_defrag_init(void);
-void btrfs_auto_defrag_exit(void);
+void __cold btrfs_auto_defrag_exit(void);
 int btrfs_add_inode_defrag(struct btrfs_trans_handle *trans,
   struct btrfs_inode *inode);
 int btrfs_run_defrag_inodes(struct btrfs_fs_info *fs_info);
@@ -3288,7 +3288,7 @@ int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
 
 /* sysfs.c */
 int __init btrfs_init_sysfs(void);
-void btrfs_exit_sysfs(void);
+void __cold btrfs_exit_sysfs(void);
 int btrfs_sysfs_add_mounted(struct btrfs_fs_info *fs_info);
 void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
 
@@ -3300,13 +3300,14 @@ int btrfs_parse_options(struct btrfs_fs_info *info, 
char *options,
unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
 
-static inline __printf(2, 3)
+static inline __printf(2, 3) __cold
 void btrfs_no_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...)
 {
 }
 
 #ifdef CONFIG_PRINTK
 __printf(2, 3)
+__cold
 void btrfs_printk(const struct btrfs_fs_info *fs_info, const char *fmt, ...);
 #else
 #define btrfs_printk(fs_info, fmt, args...) \
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 09939fc37f2a..d06bef16ebd5 100644
--- 

[PATCH 0/2] Function annotations

2018-02-19 Thread David Sterba
Add some more annotations, inspired by the __init.

The 2nd patches gives roughly -800 delta in object size, but nothing too
significant.

   textdata bss dec hex filename
1238987   75977   18520 1333484  1458ec pre/btrfs.ko
1238199   75977   18520 1332696  1455d8 post/btrfs.ko

David Sterba (2):
  btrfs: add (the only possible) __exit annotation
  btrfs: add more __cold annotations

 fs/btrfs/backref.c   | 2 +-
 fs/btrfs/backref.h   | 2 +-
 fs/btrfs/compression.c   | 2 +-
 fs/btrfs/compression.h   | 2 +-
 fs/btrfs/ctree.h | 9 +
 fs/btrfs/delayed-inode.c | 2 +-
 fs/btrfs/delayed-inode.h | 2 +-
 fs/btrfs/delayed-ref.c   | 2 +-
 fs/btrfs/delayed-ref.h   | 2 +-
 fs/btrfs/disk-io.c   | 2 +-
 fs/btrfs/disk-io.h   | 2 +-
 fs/btrfs/extent_io.c | 2 +-
 fs/btrfs/extent_io.h | 2 +-
 fs/btrfs/extent_map.c| 2 +-
 fs/btrfs/extent_map.h| 2 +-
 fs/btrfs/file.c  | 2 +-
 fs/btrfs/inode.c | 2 +-
 fs/btrfs/ordered-data.c  | 2 +-
 fs/btrfs/ordered-data.h  | 2 +-
 fs/btrfs/super.c | 2 +-
 fs/btrfs/sysfs.c | 2 +-
 fs/btrfs/tree-checker.c  | 3 +++
 fs/btrfs/volumes.c   | 2 +-
 fs/btrfs/volumes.h   | 2 +-
 24 files changed, 30 insertions(+), 26 deletions(-)

-- 
2.16.1

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use kvzalloc to allocate btrfs_fs_info

2018-02-19 Thread Nikolay Borisov


On 19.02.2018 17:33, David Sterba wrote:
> On Thu, Feb 15, 2018 at 10:59:47PM -0500, je...@suse.com wrote:
>> From: Jeff Mahoney 
>>
>> The srcu_struct in btrfs_fs_infoa scales in size with NR_CPUS.  On
>> kernels built with NR_CPUS=8192, this can result in kmalloc failures
>> that prevent mounting.
>>
>> There is work in progress to try to resolve this for every user of
>> srcu_struct but using kvzalloc will work around the failures until
>> that is complete.
> 
> Interesting, the subvol_srcu is the worst contirbutor of the fs_info
> size, on a config with NR_CPUS=512:
> 
> struct srcu_struct subvol_srcu;  /*  1064  3480 */
>   ...
>   /* size: 6496, cachelines: 102, members: 181 */
> 
> Using kvzalloc makes sense and is a minimal fix. In the longterm I'd
> rather allocate subvol_rcu dynamically so the whole fs_info is not
> vmalloced (in the rare case when kmalloc would not work). As this is
> unpredictable and almost invisible, I'm worried about some random
> effects (performance, virtual mappings), so it would be better to avoid
> them if possible.

The interesting bit is the "there is WIP trying to address this". Are
there any patches that have been sent to lkml?


> 
> Reviewed-by: David Sterba 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: use kvzalloc to allocate btrfs_fs_info

2018-02-19 Thread David Sterba
On Thu, Feb 15, 2018 at 10:59:47PM -0500, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> The srcu_struct in btrfs_fs_infoa scales in size with NR_CPUS.  On
> kernels built with NR_CPUS=8192, this can result in kmalloc failures
> that prevent mounting.
> 
> There is work in progress to try to resolve this for every user of
> srcu_struct but using kvzalloc will work around the failures until
> that is complete.

Interesting, the subvol_srcu is the worst contirbutor of the fs_info
size, on a config with NR_CPUS=512:

struct srcu_struct subvol_srcu;  /*  1064  3480 */
...
/* size: 6496, cachelines: 102, members: 181 */

Using kvzalloc makes sense and is a minimal fix. In the longterm I'd
rather allocate subvol_rcu dynamically so the whole fs_info is not
vmalloced (in the rare case when kmalloc would not work). As this is
unpredictable and almost invisible, I'm worried about some random
effects (performance, virtual mappings), so it would be better to avoid
them if possible.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: use ASSERT to report logical error in cow_file_range()

2018-02-19 Thread David Sterba
On Thu, Feb 15, 2018 at 06:07:59PM +0800, Anand Jain wrote:
> Use ASSERT to report logical error in cow_file_range(), also move
> it a bit closer to when the num_bytes is derived.
> 
> Signed-off-by: Anand Jain 
> ---
> v1->v2:
>ASSERT logic changed. Thanks Nikolay.
> 
>  fs/btrfs/inode.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 4b156e191592..260fd8139951 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -976,6 +976,7 @@ static noinline int cow_file_range(struct inode *inode,
>  
>   num_bytes = ALIGN(end - start + 1, blocksize);
>   num_bytes = max(blocksize,  num_bytes);
> + ASSERT(num_bytes <= btrfs_super_total_bytes(fs_info->super_copy));

I was looking how if this assert is valid and theoretically possible.
Yes it seems so, extent start of range could be (u64)-1 in some cases
and this must not enter cow_file_range. So the assert is the right way
to check the parameter mismatch.

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] btrfs: cow_file_range() num_bytes and disk_num_bytes are same

2018-02-19 Thread David Sterba
On Thu, Feb 15, 2018 at 10:38:00AM +0200, Nikolay Borisov wrote:
> 
> 
> On 15.02.2018 06:29, Anand Jain wrote:
> > This patch deletes local variable disk_num_bytes as its value
> > is same as num_bytes in the function cow_file_range().
> > 
> > Signed-off-by: Anand Jain 
> 
> Reviewed-by: Nikolay Borisov 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] btrfs: remove unused function btrfs_async_submit_limit()

2018-02-19 Thread David Sterba
On Thu, Feb 15, 2018 at 10:29:10AM +0200, Nikolay Borisov wrote:
> On 15.02.2018 05:36, Anand Jain wrote:
> > Commit [1] removed the need to use btrfs_async_submit_limit(), so
> > delete it.
> > 
> > [1]
> >  commit 736cd52e0c720103f52ab9da47b6cc3af6b083f6
> >   Btrfs: remove nr_async_submits and async_submit_draining
> > 
> > Signed-off-by: Anand Jain 
> 
> Reviewed-by: Nikolay Borisov 

Added to next, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] lockdep: Fix fs_reclaim warning.

2018-02-19 Thread Tetsuo Handa
Peter, are you OK with this patch?

Tetsuo Handa wrote:
> From 361d37a7d36978020dfb4c11ec1f4800937ccb68 Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa 
> Date: Thu, 8 Feb 2018 10:35:35 +0900
> Subject: [PATCH v2] lockdep: Fix fs_reclaim warning.
> 
> Dave Jones reported fs_reclaim lockdep warnings.
> 
>   
>   WARNING: possible recursive locking detected
>   4.15.0-rc9-backup-debug+ #1 Not tainted
>   
>   sshd/24800 is trying to acquire lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   but task is already holding lock:
>(fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   other info that might help us debug this:
>Possible unsafe locking scenario:
> 
>  CPU0
>  
> lock(fs_reclaim);
> lock(fs_reclaim);
> 
>*** DEADLOCK ***
> 
>May be due to missing lock nesting notation
> 
>   2 locks held by sshd/24800:
>#0:  (sk_lock-AF_INET6){+.+.}, at: [<1a069652>] 
> tcp_sendmsg+0x19/0x40
>#1:  (fs_reclaim){+.+.}, at: [<84f438c2>] 
> fs_reclaim_acquire.part.102+0x5/0x30
> 
>   stack backtrace:
>   CPU: 3 PID: 24800 Comm: sshd Not tainted 4.15.0-rc9-backup-debug+ #1
>   Call Trace:
>dump_stack+0xbc/0x13f
>__lock_acquire+0xa09/0x2040
>lock_acquire+0x12e/0x350
>fs_reclaim_acquire.part.102+0x29/0x30
>kmem_cache_alloc+0x3d/0x2c0
>alloc_extent_state+0xa7/0x410
>__clear_extent_bit+0x3ea/0x570
>try_release_extent_mapping+0x21a/0x260
>__btrfs_releasepage+0xb0/0x1c0
>btrfs_releasepage+0x161/0x170
>try_to_release_page+0x162/0x1c0
>shrink_page_list+0x1d5a/0x2fb0
>shrink_inactive_list+0x451/0x940
>shrink_node_memcg.constprop.88+0x4c9/0x5e0
>shrink_node+0x12d/0x260
>try_to_free_pages+0x418/0xaf0
>__alloc_pages_slowpath+0x976/0x1790
>__alloc_pages_nodemask+0x52c/0x5c0
>new_slab+0x374/0x3f0
>___slab_alloc.constprop.81+0x47e/0x5a0
>__slab_alloc.constprop.80+0x32/0x60
>__kmalloc_track_caller+0x267/0x310
>__kmalloc_reserve.isra.40+0x29/0x80
>__alloc_skb+0xee/0x390
>sk_stream_alloc_skb+0xb8/0x340
>tcp_sendmsg_locked+0x8e6/0x1d30
>tcp_sendmsg+0x27/0x40
>inet_sendmsg+0xd0/0x310
>sock_write_iter+0x17a/0x240
>__vfs_write+0x2ab/0x380
>vfs_write+0xfb/0x260
>SyS_write+0xb6/0x140
>do_syscall_64+0x1e5/0xc05
>entry_SYSCALL64_slow_path+0x25/0x25
> 
> This warning is caused by commit d92a8cfcb37ecd13 ("locking/lockdep: Rework
> FS_RECLAIM annotation") which replaced lockdep_set_current_reclaim_state()/
> lockdep_clear_current_reclaim_state() in __perform_reclaim() and
> lockdep_trace_alloc() in slab_pre_alloc_hook() with fs_reclaim_acquire()/
> fs_reclaim_release(). Since __kmalloc_reserve() from __alloc_skb() adds
> __GFP_NOMEMALLOC | __GFP_NOWARN to gfp_mask, and all reclaim path simply
> propagates __GFP_NOMEMALLOC, fs_reclaim_acquire() in slab_pre_alloc_hook()
> is trying to grab the 'fake' lock again when __perform_reclaim() already
> grabbed the 'fake' lock.
> 
> The
> 
>   /* this guy won't enter reclaim */
>   if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>   return false;
> 
> test which causes slab_pre_alloc_hook() to try to grab the 'fake' lock
> was added by commit cf40bd16fdad42c0 ("lockdep: annotate reclaim context
> (__GFP_NOFS)"). But that test is outdated because PF_MEMALLOC thread won't
> enter reclaim regardless of __GFP_NOMEMALLOC after commit 341ce06f69abfafa
> ("page allocator: calculate the alloc_flags for allocation only once")
> added the PF_MEMALLOC safeguard (
> 
>   /* Avoid recursion of direct reclaim */
>   if (p->flags & PF_MEMALLOC)
>   goto nopage;
> 
> in __alloc_pages_slowpath()).
> 
> Thus, let's fix outdated test by removing __GFP_NOMEMALLOC test and allow
> __need_fs_reclaim() to return false.
> 
> Reported-and-tested-by: Dave Jones 
> Signed-off-by: Tetsuo Handa 
> Cc: Peter Zijlstra 
> Cc: Nick Piggin 
> ---
>  mm/page_alloc.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 81e18ce..19fb76b 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3590,7 +3590,7 @@ static bool __need_fs_reclaim(gfp_t gfp_mask)
>   return false;
>  
>   /* this guy won't enter reclaim */
> - if ((current->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
> + if (current->flags & PF_MEMALLOC)
>   return false;
>  
>   /* We're only interested __GFP_FS allocations for now */
> -- 
> 1.8.3.1
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  

Re: fatal database corruption with btrfs "out of space" with ~50 GB left

2018-02-19 Thread Tomasz Chmielewski

On 2018-02-19 13:29, Anand Jain wrote:

On 02/14/2018 10:19 PM, Tomasz Chmielewski wrote:
Just FYI, how dangerous running btrfs can be - we had a fatal, 
unrecoverable MySQL corruption when btrfs decided to do one of these 
"I have ~50 GB left, so let's do out of space (and corrupt some files 
at the same time, ha ha!)".


 Thanks for reporting.


Running btrfs RAID-1 with kernel 4.14.


 Can you pls let us know..
 1. What tool cli/reported/identified that data is corrupted?


mysqld log - mysqld would refuse to start because of database 
corruption.


And, the database wouldn't start even when "innodb_force_recovery = " 
was set to a high/max value.



In the past, with lower kernel versions, we had a similar issue with 
mongod - it wouldn't start anymore due to some corruption which happened 
when we hit "out of space" (again, with dozens of GBs free space).




 2. Disk error stat using.. btrfs dev stat 
(dev stat is stored on disk)


# btrfs dev stat /var/lib/lxd
[/dev/sda3].write_io_errs0
[/dev/sda3].read_io_errs 0
[/dev/sda3].flush_io_errs0
[/dev/sda3].corruption_errs  0
[/dev/sda3].generation_errs  0
[/dev/sdb3].write_io_errs0
[/dev/sdb3].read_io_errs 0
[/dev/sdb3].flush_io_errs0
[/dev/sdb3].corruption_errs  0
[/dev/sdb3].generation_errs  0



 3. Wheather the disk was mounted as degraded any time before?


No. Everything healthy with the disks.


Tomasz Chmielewski
https://lxadm.com
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html