Re: [PATCH v2] btrfs: fix a potential hole-punching failure

2021-03-25 Thread Filipe Manana
On Thu, Mar 25, 2021 at 3:42 AM bingjingc  wrote:
>
> From: BingJing Chang 
>
> In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
> in a already existed hole."), existed holes can be skipped by calling
> find_first_non_hole() to adjust *start and *len. However, if the given
> len is invalid and large, when an EXTENT_MAP_HOLE extent is found, the
> *len will not be set to zero because (em->start + em->len) is less than
> (*start + *len). Then the ret will be 1 but the *len will not be set to
> 0. The propagated non-zero ret will result in fallocate failure.
>
> In the while-loop of btrfs_replace_file_extents(), len is not updated
> every time before it calls find_first_non_hole(). That is, after
> btrfs_drop_extents() successfully drops the last non-hole file extent,
> it may fail with -ENOSPC when attempting to drop a file extent item
> representing a hole. The problem can happen. After it calls
> find_first_non_hole(), the cur_offset will be adjusted to be larger
> than or equal to end. However, since the len is not set to zero. The
> break-loop condition (ret && !len) will not meet. After it leaves the
> while-loop, fallocate will return 1, which is an unexpected return
> value.
>
> We're not able to construct a reproducible way to let
> btrfs_drop_extents() fail with -ENOSPC after it drops the last non-hole
> file extent but with remaining holes left. However, it's quite easy to
> fix. We just need to update and check the len every time before we call
> find_first_non_hole(). To make the while loop more readable, we also
> pull the variable updates to the bottom of loop like this:
> while (cur_offset < end) {
> ...
> // update cur_offset & len
> // advance cur_offset & len in hole-punching case if needed
> }
>
> Reported-by: Robbie Ko 
> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a
> already existed hole.")
> Reviewed-by: Robbie Ko 
> Reviewed-by: Chung-Chiang Cheng 
> Signed-off-by: BingJing Chang 

Reviewed-by: Filipe Manana 

Looks good, thanks.

> ---
>  fs/btrfs/file.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e155f0..dccb017 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2735,8 +2735,6 @@ int btrfs_replace_file_extents(struct inode *inode, 
> struct btrfs_path *path,
> extent_info->file_offset += replace_len;
> }
>
> -   cur_offset = drop_args.drop_end;
> -
> ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
> if (ret)
> break;
> @@ -2756,7 +2754,9 @@ int btrfs_replace_file_extents(struct inode *inode, 
> struct btrfs_path *path,
> BUG_ON(ret);/* shouldn't happen */
> trans->block_rsv = rsv;
>
> -   if (!extent_info) {
> +   cur_offset = drop_args.drop_end;
> +   len = end - cur_offset;
> +   if (!extent_info && len) {
> ret = find_first_non_hole(BTRFS_I(inode), &cur_offset,
>   &len);
> if (unlikely(ret < 0))
> --
> 2.7.4
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: fix a potential hole-punching failure

2021-03-24 Thread Filipe Manana
On Wed, Mar 24, 2021 at 11:15 AM bingjingc  wrote:
>
> From: BingJing Chang 
>
> In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole in
> a already existed hole."), existed holes can be skipped by calling
> find_first_non_hole() to adjust *start and *len. However, if the given
> len is invalid and large, when an EXTENT_MAP_HOLE extent is found, the
> *len will not be set to zero because (em->start + em->len) is less than
> (*start + *len). Then the ret will be 1 but the *len will not be set to
> 0. The propagated non-zero ret will result in fallocate failure.
>
> In the while-loop of btrfs_replace_file_extents(), len is not updated
> every time before it calls find_first_non_hole(). That is, if the last
> file extent in the given hole-punching range has been dropped but
> btrfs_drop_extents() fails with -ENOSPC (btrfs_drop_extents() runs out
> of reserved space of the given transaction), the problem can happen.

This is not entirely clear. Dropping the last extent and still
returning ENOSPC is confusing.
I think you mean that it drops the last file extent item that does not
represent hole (disk_bytenr > 0), and after it there's only one file
extent item representing a hole (disk_bytenr == 0).
It fails with -ENOSPC when attempting to drop the file extent item
representing the hole, after successfully dropping the non-hole file
extent item.
Is that it?

> After it calls find_first_non_hole(), the cur_offset will be adjusted
> to be larger than or equal to end. However, since the len is not set to
> zero. The break-loop condition (ret && !len) will not meet. After it
> leaves the while-loop, uncleared ret will result in fallocate failure.

Ok, fallocate will return 1, an unexpected return value.

>
> We're not able to construct a reproducible way to let
> btrfs_drop_extents() fails with -ENOSPC after it drops the last file
> extent but with remaining holes. However, it's quite easy to fix. We
> just need to update and check the len every time before we call
> find_first_non_hole(). To make the while loop more readable, we also
> pull the variable updates to the bottom of loop like this:
> while (cur_offset < end) {
> ...
> // update cur_offset & len
> // advance cur_offset & len in hole-punching case if needed
> }
>
> Reported-by: Robbie Ko 
> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a
> already existed hole.")
> Reviewed-by: Robbie Ko 
> Reviewed-by: Chung-Chiang Cheng 
> Signed-off-by: BingJing Chang 

Looks good.
Please just update that paragraph to be more clear about what is going on.

Thanks.

> ---
>  fs/btrfs/file.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> index 0e155f0..dccb017 100644
> --- a/fs/btrfs/file.c
> +++ b/fs/btrfs/file.c
> @@ -2735,8 +2735,6 @@ int btrfs_replace_file_extents(struct inode *inode, 
> struct btrfs_path *path,
> extent_info->file_offset += replace_len;
> }
>
> -   cur_offset = drop_args.drop_end;
> -
> ret = btrfs_update_inode(trans, root, BTRFS_I(inode));
> if (ret)
> break;
> @@ -2756,7 +2754,9 @@ int btrfs_replace_file_extents(struct inode *inode, 
> struct btrfs_path *path,
> BUG_ON(ret);/* shouldn't happen */
> trans->block_rsv = rsv;
>
> -   if (!extent_info) {
> +   cur_offset = drop_args.drop_end;
> +   len = end - cur_offset;
> +   if (!extent_info && len) {
> ret = find_first_non_hole(BTRFS_I(inode), &cur_offset,
>   &len);
> if (unlikely(ret < 0))
> --
> 2.7.4
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] fs: btrfs: fix error return code of btrfs_recover_relocation()

2021-03-05 Thread Filipe Manana
On Fri, Mar 5, 2021 at 9:46 AM Jia-Ju Bai  wrote:
>
> When the list of reloc_roots is empty, no error return code of
> btrfs_recover_relocation() is assigned.
> To fix this bug, err is assigned with -ENOENT as error return code.

No, there isn't any such bug.

If there are no reloc roots, it means there's no relocation to resume,
in which case err is already 0 and we therefore return 0.
By setting err to -ENOENT, that will cause a mount failure on any fs
that does not have relocation to resume.

You could have tested this simply by doing:

$ mkfs.btrfs -f /dev/sdc
$ mount /dev/sdc /mnt/sdc
mount: /mnt/sdc: mount(2) system call failed: No such file or directory.

It's always a good idea to test patches, even if we are very
comfortable with the code they are touching...

Thanks.


>
> Reported-by: TOTE Robot 
> Signed-off-by: Jia-Ju Bai 
> ---
>  fs/btrfs/relocation.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 232d5da7b7be..631b672a852f 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -3817,8 +3817,10 @@ int btrfs_recover_relocation(struct btrfs_root *root)
> }
> btrfs_release_path(path);
>
> -   if (list_empty(&reloc_roots))
> +   if (list_empty(&reloc_roots)) {
> +   err = -ENOENT;
> goto out;
> +   }
>
> rc = alloc_reloc_control(fs_info);
> if (!rc) {
> --
> 2.17.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH RFC 4/6] btrfs: Check if the filesystem is has mixed type of devices

2021-02-10 Thread Filipe Manana
On Tue, Feb 9, 2021 at 9:32 PM Michal Rostecki  wrote:
>
> From: Michal Rostecki 
>
> Add the btrfs_check_mixed() function which checks if the filesystem has
> the mixed type of devices (non-rotational and rotational). This
> information is going to be used in roundrobin raid1 read policy.
>
> Signed-off-by: Michal Rostecki 
> ---
>  fs/btrfs/volumes.c | 44 ++--
>  fs/btrfs/volumes.h |  7 +++
>  2 files changed, 49 insertions(+), 2 deletions(-)
>
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 1ac364a2f105..1ad30a595722 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -617,6 +617,35 @@ static int btrfs_free_stale_devices(const char *path,
> return ret;
>  }
>
> +/*
> + * Checks if after adding the new device the filesystem is going to have 
> mixed
> + * types of devices (non-rotational and rotational).
> + *
> + * @fs_devices:  list of devices
> + * @new_device_rotating: if the new device is rotational
> + *
> + * Returns true if there are mixed types of devices, otherwise returns false.
> + */
> +static bool btrfs_check_mixed(struct btrfs_fs_devices *fs_devices,
> + bool new_device_rotating)
> +{
> +   struct btrfs_device *device, *prev_device;
> +
> +   list_for_each_entry(device, &fs_devices->devices, dev_list) {
> +   if (prev_device == NULL &&

Hum, prev_device is not initialized when we enter the first iteration
of the loop.

> +   device->rotating != new_device_rotating)
> +   return true;
> +   if (prev_device != NULL &&
> +   (device->rotating != prev_device->rotating ||

Here it's more dangerous, dereferencing an uninitialized pointer can
result in a crash.

With this fixed, it would be better to redo the benchmarks when using
mixed device types.

Thanks.

> +device->rotating != new_device_rotating))
> +   return true;
> +
> +   prev_device = device;
> +   }
> +
> +   return false;
> +}
> +
>  /*
>   * This is only used on mount, and we are protected from competing things
>   * messing with our fs_devices by the uuid_mutex, thus we do not need the
> @@ -629,6 +658,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
> *fs_devices,
> struct request_queue *q;
> struct block_device *bdev;
> struct btrfs_super_block *disk_super;
> +   bool rotating;
> u64 devid;
> int ret;
>
> @@ -669,8 +699,12 @@ static int btrfs_open_one_device(struct btrfs_fs_devices 
> *fs_devices,
> }
>
> q = bdev_get_queue(bdev);
> -   if (!blk_queue_nonrot(q))
> +   rotating = !blk_queue_nonrot(q);
> +   device->rotating = rotating;
> +   if (rotating)
> fs_devices->rotating = true;
> +   if (!fs_devices->mixed)
> +   fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
>
> device->bdev = bdev;
> clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, &device->dev_state);
> @@ -2418,6 +2452,7 @@ static int btrfs_prepare_sprout(struct btrfs_fs_info 
> *fs_info)
> fs_devices->open_devices = 0;
> fs_devices->missing_devices = 0;
> fs_devices->rotating = false;
> +   fs_devices->mixed = false;
> list_add(&seed_devices->seed_list, &fs_devices->seed_list);
>
> generate_random_uuid(fs_devices->fsid);
> @@ -2522,6 +2557,7 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
> int seeding_dev = 0;
> int ret = 0;
> bool locked = false;
> +   bool rotating;
>
> if (sb_rdonly(sb) && !fs_devices->seeding)
> return -EROFS;
> @@ -2621,8 +2657,12 @@ int btrfs_init_new_device(struct btrfs_fs_info 
> *fs_info, const char *device_path
>
> atomic64_add(device->total_bytes, &fs_info->free_chunk_space);
>
> -   if (!blk_queue_nonrot(q))
> +   rotating = !blk_queue_nonrot(q);
> +   device->rotating = rotating;
> +   if (rotating)
> fs_devices->rotating = true;
> +   if (!fs_devices->mixed)
> +   fs_devices->mixed = btrfs_check_mixed(fs_devices, rotating);
>
> orig_super_total_bytes = btrfs_super_total_bytes(fs_info->super_copy);
> btrfs_set_super_total_bytes(fs_info->super_copy,
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 6e544317a377..594f1207281c 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -147,6 +147,9 @@ struct btrfs_device {
> /* I/O stats for raid1 mirror selection */
> struct percpu_counter inflight;
> atomic_t last_offset;
> +
> +   /* If the device is rotational */
> +   bool rotating;
>  };
>
>  /*
> @@ -274,6 +277,10 @@ struct btrfs_fs_devices {
>  * nonrot flag set
>  */
> bool rotating;
> +   /* Set when we find or add both nonrot and rot disks in the
> 

Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice

2021-01-29 Thread Filipe Manana
On Fri, Jan 29, 2021 at 5:54 PM David Sterba  wrote:
>
> On Fri, Jan 29, 2021 at 05:15:21PM +, Michal Rostecki wrote:
> > On Fri, Jan 29, 2021 at 11:22:48AM -0500, Josef Bacik wrote:
> > > On 1/27/21 8:57 AM, Michal Rostecki wrote:
> > > > From: Michal Rostecki 
> > > >
> > > > Before this change, the btrfs_get_io_geometry() function was calling
> > > > btrfs_get_chunk_map() to get the extent mapping, necessary for
> > > > calculating the I/O geometry. It was using that extent mapping only
> > > > internally and freeing the pointer after its execution.
> > > >
> > > > That resulted in calling btrfs_get_chunk_map() de facto twice by the
> > > > __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
> > > > first and then calling btrfs_get_chunk_map() directly to get the extent
> > > > mapping, used by the rest of the function.
> > > >
> > > > This change fixes that by passing the extent mapping to the
> > > > btrfs_get_io_geometry() function as an argument.
> > > >
> > > > v2:
> > > > When btrfs_get_chunk_map() returns an error in btrfs_submit_direct():
> > > > - Use errno_to_blk_status(PTR_ERR(em)) as the status
> > > > - Set em to NULL
> > > >
> > > > Signed-off-by: Michal Rostecki 
> > >
> > > This panic'ed all of my test vms in their overnight xfstests runs, the 
> > > panic is this
> > >
> > > [ 2449.936502] BTRFS critical (device dm-7): mapping failed logical
> > > 1113825280 bio len 40960 len 24576
> > > [ 2449.937073] [ cut here ]
> > > [ 2449.937329] kernel BUG at fs/btrfs/volumes.c:6450!
> > > [ 2449.937604] invalid opcode:  [#1] SMP NOPTI
> > > [ 2449.937855] CPU: 0 PID: 259045 Comm: kworker/u5:0 Not tainted 
> > > 5.11.0-rc5+ #122
> > > [ 2449.938252] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> > > 1.13.0-2.fc32 04/01/2014
> > > [ 2449.938713] Workqueue: btrfs-worker-high btrfs_work_helper
> > > [ 2449.939016] RIP: 0010:btrfs_map_bio.cold+0x5a/0x5c
> > > [ 2449.939392] Code: 37 87 ff ff e8 ed d4 8a ff 48 83 c4 18 e9 b5 52 8b ff
> > > 49 89 c8 4c 89 fa 4c 89 f1 48 c7 c6 b0 c0 61 8b 48 89 ef e8 11 87 ff ff 
> > > <0f>
> > > 0b 4c 89 e7 e8 42 09 86 ff e9 fd 59 8b ff 49 8b 7a 50 44 89 f2
> > > [ 2449.940402] RSP: :9f24c1637d90 EFLAGS: 00010282
> > > [ 2449.940689] RAX: 0057 RBX: 90c78ff716b8 RCX: 
> > > 
> > > [ 2449.941080] RDX: 90c7fbc27ae0 RSI: 90c7fbc19110 RDI: 
> > > 90c7fbc19110
> > > [ 2449.941467] RBP: 90c7911d4000 R08:  R09: 
> > > 
> > > [ 2449.941853] R10: 9f24c1637b48 R11: 8b9723e8 R12: 
> > > 
> > > [ 2449.942243] R13:  R14: a000 R15: 
> > > 4263a000
> > > [ 2449.942632] FS:  () GS:90c7fbc0()
> > > knlGS:
> > > [ 2449.943072] CS:  0010 DS:  ES:  CR0: 80050033
> > > [ 2449.943386] CR2: 5575163c3080 CR3: 00010ad6c004 CR4: 
> > > 00370ef0
> > > [ 2449.943772] Call Trace:
> > > [ 2449.943915]  ? lock_release+0x1c3/0x290
> > > [ 2449.944135]  run_one_async_done+0x3a/0x60
> > > [ 2449.944360]  btrfs_work_helper+0x136/0x520
> > > [ 2449.944588]  process_one_work+0x26e/0x570
> > > [ 2449.944812]  worker_thread+0x55/0x3c0
> > > [ 2449.945016]  ? process_one_work+0x570/0x570
> > > [ 2449.945250]  kthread+0x137/0x150
> > > [ 2449.945430]  ? __kthread_bind_mask+0x60/0x60
> > > [ 2449.945666]  ret_from_fork+0x1f/0x30
> > >
> > > it happens when you run btrfs/060.  Please make sure to run xfstests 
> > > against
> > > patches before you submit them upstream.  Thanks,
> > >
> > > Josef
> >
> > Umm... I ran the xftests against v1 patch and didn't get that panic.
>
> It could have been caused by my fixups to v2 and I can reproduce the
> crash now too. I can't see any difference besides the u64/int switch and
> the 'goto out' removal in btrfs_bio_fits_in_stripe.

The problem is caused by what I mentioned earlier today:

The value of "logical" must be set at the start of each iteration of
the loop at btrfs_submit_direct(),
and not before starting the loop, since the start sector is
incremented at the end of each iteration of the loop.




-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2] btrfs: Avoid calling btrfs_get_chunk_map() twice

2021-01-28 Thread Filipe Manana
On Thu, Jan 28, 2021 at 12:01 AM Michal Rostecki  wrote:
>
> From: Michal Rostecki 
>
> Before this change, the btrfs_get_io_geometry() function was calling
> btrfs_get_chunk_map() to get the extent mapping, necessary for
> calculating the I/O geometry. It was using that extent mapping only
> internally and freeing the pointer after its execution.
>
> That resulted in calling btrfs_get_chunk_map() de facto twice by the
> __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
> first and then calling btrfs_get_chunk_map() directly to get the extent
> mapping, used by the rest of the function.
>
> This change fixes that by passing the extent mapping to the
> btrfs_get_io_geometry() function as an argument.
>
> v2:
> When btrfs_get_chunk_map() returns an error in btrfs_submit_direct():
> - Use errno_to_blk_status(PTR_ERR(em)) as the status
> - Set em to NULL
>
> Signed-off-by: Michal Rostecki 

Reviewed-by: Filipe Manana 

I think this is a fine optimization.
For very large filesystems, i.e. several thousands of allocated
chunks, not only this avoids searching two times the rbtree,
saving time, it may also help reducing contention on the lock that
protects the tree - thinking of writeback starting for multiple
inodes,
other tasks allocating or removing chunks, and anything else that
requires access to the rbtree.

Thanks, it looks good to me now.

> ---
>  fs/btrfs/inode.c   | 38 +-
>  fs/btrfs/volumes.c | 39 ---
>  fs/btrfs/volumes.h |  5 +++--
>  3 files changed, 48 insertions(+), 34 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0dbe1aaa0b71..e2ee3a9c1140 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
> size, struct bio *bio,
> struct inode *inode = page->mapping->host;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> u64 logical = bio->bi_iter.bi_sector << 9;
> +   struct extent_map *em;
> u64 length = 0;
> u64 map_length;
> -   int ret;
> +   int ret = 0;
> struct btrfs_io_geometry geom;
>
> if (bio_flags & EXTENT_BIO_COMPRESSED)
> @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, 
> size_t size, struct bio *bio,
>
> length = bio->bi_iter.bi_size;
> map_length = length;
> -   ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, 
> map_length,
> -   &geom);
> +   em = btrfs_get_chunk_map(fs_info, logical, map_length);
> +   if (IS_ERR(em))
> +   return PTR_ERR(em);
> +   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
> +   map_length, &geom);
> if (ret < 0)
> -   return ret;
> +   goto out;
>
> -   if (geom.len < length + size)
> -   return 1;
> -   return 0;
> +   if (geom.len < length + size) {
> +   ret = 1;
> +   goto out;
> +   }
> +out:
> +   free_extent_map(em);
> +   return ret;
>  }
>
>  /*
> @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
> u64 submit_len;
> int clone_offset = 0;
> int clone_len;
> +   int logical;
> int ret;
> blk_status_t status;
> struct btrfs_io_geometry geom;
> struct btrfs_dio_data *dio_data = iomap->private;
> +   struct extent_map *em;
>
> dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
> if (!dip) {
> @@ -7970,11 +7980,18 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
> }
>
> start_sector = dio_bio->bi_iter.bi_sector;
> +   logical = start_sector << 9;
> submit_len = dio_bio->bi_iter.bi_size;
>
> do {
> -   ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
> -   start_sector << 9, submit_len,
> +   em = btrfs_get_chunk_map(fs_info, logical, submit_len);
> +   if (IS_ERR(em)) {
> +   status = errno_to_blk_status(PTR_ERR(em));
> +   em = NULL;
> +   goto out_err;
> +   }
> +   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
> +   logical, submit_len,
> &geom);
>  

Re: [PATCH] btrfs: Avoid calling btrfs_get_chunk_map() twice

2021-01-27 Thread Filipe Manana
On Wed, Jan 27, 2021 at 9:59 AM Michal Rostecki  wrote:
>
> From: Michal Rostecki 
>
> Before this change, the btrfs_get_io_geometry() function was calling
> btrfs_get_chunk_map() to get the extent mapping, necessary for
> calculating the I/O geometry. It was using that extent mapping only
> internally and freeing the pointer after its execution.
>
> That resulted in calling btrfs_get_chunk_map() de facto twice by the
> __btrfs_map_block() function. It was calling btrfs_get_io_geometry()
> first and then calling btrfs_get_chunk_map() directly to get the extent
> mapping, used by the rest of the function.
>
> This change fixes that by passing the extent mapping to the
> btrfs_get_io_geometry() function as an argument.
>
> Fixes: 89b798ad1b42 ("btrfs: Use btrfs_get_io_geometry appropriately")

Generally we only use the Fixes tag for bug fixes or serious
performance regressions.
Have you seen here a serious performance regression?

> Signed-off-by: Michal Rostecki 
> ---
>  fs/btrfs/inode.c   | 37 -
>  fs/btrfs/volumes.c | 39 ---
>  fs/btrfs/volumes.h |  5 +++--
>  3 files changed, 47 insertions(+), 34 deletions(-)
>
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 0dbe1aaa0b71..a4ce8501ed4d 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -2183,9 +2183,10 @@ int btrfs_bio_fits_in_stripe(struct page *page, size_t 
> size, struct bio *bio,
> struct inode *inode = page->mapping->host;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> u64 logical = bio->bi_iter.bi_sector << 9;
> +   struct extent_map *em;
> u64 length = 0;
> u64 map_length;
> -   int ret;
> +   int ret = 0;
> struct btrfs_io_geometry geom;
>
> if (bio_flags & EXTENT_BIO_COMPRESSED)
> @@ -2193,14 +2194,21 @@ int btrfs_bio_fits_in_stripe(struct page *page, 
> size_t size, struct bio *bio,
>
> length = bio->bi_iter.bi_size;
> map_length = length;
> -   ret = btrfs_get_io_geometry(fs_info, btrfs_op(bio), logical, 
> map_length,
> -   &geom);
> +   em = btrfs_get_chunk_map(fs_info, logical, map_length);
> +   if (IS_ERR(em))
> +   return PTR_ERR(em);
> +   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(bio), logical,
> +   map_length, &geom);
> if (ret < 0)
> -   return ret;
> +   goto out;
>
> -   if (geom.len < length + size)
> -   return 1;
> -   return 0;
> +   if (geom.len < length + size) {
> +   ret = 1;
> +   goto out;
> +   }
> +out:
> +   free_extent_map(em);
> +   return ret;
>  }
>
>  /*
> @@ -7941,10 +7949,12 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
> u64 submit_len;
> int clone_offset = 0;
> int clone_len;
> +   int logical;
> int ret;
> blk_status_t status;
> struct btrfs_io_geometry geom;
> struct btrfs_dio_data *dio_data = iomap->private;
> +   struct extent_map *em;
>
> dip = btrfs_create_dio_private(dio_bio, inode, file_offset);
> if (!dip) {
> @@ -7970,11 +7980,17 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
> }
>
> start_sector = dio_bio->bi_iter.bi_sector;
> +   logical = start_sector << 9;
> submit_len = dio_bio->bi_iter.bi_size;
>
> do {
> -   ret = btrfs_get_io_geometry(fs_info, btrfs_op(dio_bio),
> -   start_sector << 9, submit_len,
> +   em = btrfs_get_chunk_map(fs_info, logical, submit_len);
> +   if (IS_ERR(em)) {
> +   status = errno_to_blk_status(ret);
> +   goto out_err;

em must be set to NULL before going to "out_err", otherwise we get a
crash due to an invalid memory access.

Also, status should be set to "errno_to_blk_status(PTR_ERR(em))". The
value of ret at this point is undefined.

Other than that, it looks good.

Thanks.

> +   }
> +   ret = btrfs_get_io_geometry(fs_info, em, btrfs_op(dio_bio),
> +   logical, submit_len,
> &geom);
> if (ret) {
> status = errno_to_blk_status(ret);
> @@ -8030,12 +8046,15 @@ static blk_qc_t btrfs_submit_direct(struct inode 
> *inode, struct iomap *iomap,
> clone_offset += clone_len;
> start_sector += clone_len >> 9;
> file_offset += clone_len;
> +
> +   free_extent_map(em);
> } while (submit_len > 0);
> return BLK_QC_T_NONE;
>
>  out_err:
> dip->dio_bio->bi_status = status;
> btrfs_dio_private_put(dip);
> +   free_extent_map(em);
> return BLK_QC_T_NONE;
>  }

Re: KASAN: null-ptr-deref Write in start_transaction

2021-01-08 Thread Filipe Manana
On Thu, Jan 7, 2021 at 1:13 PM syzbot
 wrote:
>
> syzbot suspects this issue was fixed by commit:
>
> commit f30bed83426c5cb9fce6cabb3f7cc5a9d5428fcc
> Author: Filipe Manana 
> Date:   Fri Nov 13 11:24:17 2020 +
>
> btrfs: remove unnecessary attempt to drop extent maps after adding inline 
> extent
>
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13ddc30b50
> start commit:   521b619a Merge tag 'linux-kselftest-kunit-fixes-5.10-rc3' ..
> git tree:   upstream
> kernel config:  https://syzkaller.appspot.com/x/.config?x=61033507391c77ff
> dashboard link: https://syzkaller.appspot.com/bug?extid=6700bca07dff187809c4
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=14a07ab250
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10fe69c650
>
> If the result looks correct, please mark the issue as fixed by replying with:
>
> #syz fix: btrfs: remove unnecessary attempt to drop extent maps after adding 
> inline extent

Nop, it can't be this change.

What should fix it should be the following commit:

commit ecfdc08b8cc65d737eebc26a1ee1875a097fd6a0
Author: Goldwyn Rodrigues 
Date:   Thu Sep 24 11:39:21 2020 -0500

btrfs: remove dio iomap DSYNC workaround

Thanks.


>
> For information about bisection process see: https://goo.gl/tpsmEJ#bisection



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-09 Thread Filipe Manana



On 09/11/20 08:44, Boqun Feng wrote:
> Hi Filipe,
> 
> On Thu, Nov 05, 2020 at 09:10:12AM +0800, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 07:54:40PM +, Filipe Manana wrote:
>> [...]
>>>
>>> Ok, so I ran 5.10-rc2 plus your two patches (the fix and the debug one):
>>>
>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>> index b71ad8d9f1c9..b31d4ad482c7 100644
>>> --- a/kernel/locking/lockdep.c
>>> +++ b/kernel/locking/lockdep.c
>>> @@ -539,8 +539,10 @@ static struct lock_trace *save_trace(void)
>>> LOCK_TRACE_SIZE_IN_LONGS;
>>>
>>> if (max_entries <= 0) {
>>> -   if (!debug_locks_off_graph_unlock())
>>> +   if (!debug_locks_off_graph_unlock()) {
>>> +   WARN_ON_ONCE(1);
>>> return NULL;
>>> +   }
>>>
>>> print_lockdep_off("BUG: MAX_STACK_TRACE_ENTRIES too low!");
>>> dump_stack();
>>> @@ -5465,7 +5467,7 @@ noinstr int lock_is_held_type(const struct
>>> lockdep_map *lock, int read)
>>> unsigned long flags;
>>> int ret = 0;
>>>
>>> -   if (unlikely(!lockdep_enabled()))
>>> +   if (unlikely(debug_locks && !lockdep_enabled()))
>>> return 1; /* avoid false negative lockdep_assert_held() */
>>>
>>> raw_local_irq_save(flags);
>>>
>>> With 3 runs of all fstests, the WARN_ON_ONCE(1) wasn't triggered.
>>> Unexpected, right?
>>>
>>
>> Kinda, that means we still don't know why lockdep was turned off.
>>
>>> Should I try something else?
>>>
>>
>> Thanks for trying this. Let me set up the reproducer on my side, and see
>> if I could get more information.
>>
> 
> I could hit this with btrfs/187, and when we hit it, lockdep will report
> the deadlock and turn off, and I think this is the root cause for your
> hitting the original problem, I will add some analysis after the lockdep
> splat.
> 
> [12295.973309] 
> [12295.974770] WARNING: possible recursive locking detected
> [12295.974770] 5.10.0-rc2-btrfs-next-71 #20 Not tainted
> [12295.974770] 
> [12295.974770] zsh/701247 is trying to acquire lock:
> [12295.974770] 92cef43480b8 (&eb->lock){}-{2:2}, at: 
> btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>but task is already holding lock:
> [12295.974770] 92cef434a038 (&eb->lock){}-{2:2}, at: 
> btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>other info that might help us debug this:
> [12295.974770]  Possible unsafe locking scenario:
> 
> [12295.974770]CPU0
> [12295.974770]
> [12295.974770]   lock(&eb->lock);
> [12295.974770]   lock(&eb->lock);
> [12295.974770] 
> *** DEADLOCK ***
> 
> [12295.974770]  May be due to missing lock nesting notation
> 
> [12295.974770] 2 locks held by zsh/701247:
> [12295.974770]  #0: 92cef3d315b0 (&sig->cred_guard_mutex){+.+.}-{3:3}, 
> at: bprm_execve+0xaa/0x920
> [12295.974770]  #1: 92cef434a038 (&eb->lock){}-{2:2}, at: 
> btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770] 
>stack backtrace:
> [12295.974770] CPU: 6 PID: 701247 Comm: zsh Not tainted 
> 5.10.0-rc2-btrfs-next-71 #20
> [12295.974770] Hardware name: Microsoft Corporation Virtual Machine/Virtual 
> Machine, BIOS Hyper-V UEFI Release v4.0 12/17/2019
> [12295.974770] Call Trace:
> [12295.974770]  dump_stack+0x8b/0xb0
> [12295.974770]  __lock_acquire.cold+0x175/0x2e9
> [12295.974770]  lock_acquire+0x15b/0x490
> [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  ? read_block_for_search+0xf4/0x350 [btrfs]
> [12295.974770]  _raw_read_lock+0x40/0xa0
> [12295.974770]  ? btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  btrfs_tree_read_lock_atomic+0x34/0x140 [btrfs]
> [12295.974770]  btrfs_search_slot+0x6ac/0xca0 [btrfs]
> [12295.974770]  btrfs_lookup_xattr+0x7d/0xd0 [btrfs]
> [12295.974770]  btrfs_getxattr+0x67/0x130 [btrfs]
> [12295.974770]  __vfs_getxattr+0x53/0x70
> [12295.974770]  get_vfs_caps_from_disk+0x68/0x1a0
> [12295.974770]  ? sched_clock_cpu+0x114/0x180
> [12295.974770]  cap_bprm_creds_from_file+0x181/0x6c0
> [12295.974770]  security_bprm_creds_from_file+0x2a/0x40
> [12295.974770]  begin_new_exec+0xf4/0xc40
>

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-04 Thread Filipe Manana



On 04/11/20 09:49, Filipe Manana wrote:
> 
> 
> On 04/11/20 03:44, Boqun Feng wrote:
>> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>>> On Tue, Nov 03, 2020 at 07:44:29PM +, Filipe Manana wrote:
>>>>
>>>>
>>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>>> Hi Filipe,
>>>>>
>>>>> On Mon, Oct 26, 2020 at 11:26:49AM +, Filipe Manana wrote:
>>>>>> Hello,
>>>>>>
>>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>>> following commit:
>>>>>>
>>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>>> Author: Peter Zijlstra 
>>>>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>>>>
>>>>>> lockdep: Fix lockdep recursion
>>>>>>
>>>>>> This happens very reliably when running all xfstests with lockdep
>>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>>> always have to run all tests for it to trigger, running those tests
>>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>>> for a couple hours).
>>>>>>
>>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>>
>>>>>> https://pastebin.com/qnLvf94E
>>>>>>
>>>>>
>>>>> In your dmesg, I see line:
>>>>>
>>>>>   [ 9304.920151] INFO: lockdep is turned off.
>>>>>
>>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>>> found and it's pointless for lockdep to continue to run.
>>>>>
>>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>>> dmesg so that I can have a look?
>>>>>
>>>>> This may be relevant because in commit 4d004099a66, we have
>>>>>
>>>>>   @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
>>>>> lockdep_map *lock, int read)
>>>>>   unsigned long flags;
>>>>>   int ret = 0;
>>>>>
>>>>>   -   if (unlikely(current->lockdep_recursion))
>>>>>   +   if (unlikely(!lockdep_enabled()))
>>>>>   return 1; /* avoid false negative lockdep_assert_held() 
>>>>> */
>>>>>
>>>>> before this commit lock_is_held_type() and its friends may return false
>>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>>> will always return true if debug_locks == 0. That could cause the
>>>>> behavior here.
>>>>>
>>>>> In case I'm correct, the following "fix" may be helpful. 
>>>>>
>>>>> Regards,
>>>>> Boqun
>>>>>
>>>>> --8
>>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>>> --- a/kernel/locking/lockdep.c
>>>>> +++ b/kernel/locking/lockdep.c
>>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct 
>>>>> lockdep_map *lock, int read)
>>>>>   unsigned long flags;
>>>>>   int ret = 0;
>>>>>  
>>>>> - if (unlikely(!lockdep_enabled()))
>>>>> + if (unlikely(debug_locks && !lockdep_enabled()))
>>>>>   return 1; /* avoid false negative lockdep_assert_held() */
>>>>>  
>>>>>   raw_local_irq_save(flags);
>>>>
>>>> Boqun, the patch fixes the problem for me!
>>>> You can have Tested-by: Filipe Manana 
>>>>
>>>
>>> Thanks. Although I think it still means that we have a lock issue when
>>> running xfstests (because we don't know why debug_locks gets cleared),
>>
>> I migh

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-04 Thread Filipe Manana



On 04/11/20 03:44, Boqun Feng wrote:
> On Wed, Nov 04, 2020 at 10:22:36AM +0800, Boqun Feng wrote:
>> On Tue, Nov 03, 2020 at 07:44:29PM +0000, Filipe Manana wrote:
>>>
>>>
>>> On 03/11/20 14:08, Boqun Feng wrote:
>>>> Hi Filipe,
>>>>
>>>> On Mon, Oct 26, 2020 at 11:26:49AM +, Filipe Manana wrote:
>>>>> Hello,
>>>>>
>>>>> I've recently started to hit a warning followed by tasks hanging after
>>>>> attempts to freeze a filesystem. A git bisection pointed to the
>>>>> following commit:
>>>>>
>>>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>>>> Author: Peter Zijlstra 
>>>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>>>
>>>>> lockdep: Fix lockdep recursion
>>>>>
>>>>> This happens very reliably when running all xfstests with lockdep
>>>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>>>> happen at either test generic/068 or test generic/390, and (oddly)
>>>>> always have to run all tests for it to trigger, running those tests
>>>>> individually on an infinite loop doesn't seem to trigger it (at least
>>>>> for a couple hours).
>>>>>
>>>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>>>> results later in several tasks hanging on a percpu rw_sem:
>>>>>
>>>>> https://pastebin.com/qnLvf94E
>>>>>
>>>>
>>>> In your dmesg, I see line:
>>>>
>>>>[ 9304.920151] INFO: lockdep is turned off.
>>>>
>>>> , that means debug_locks is 0, that usually happens when lockdep find a
>>>> problem (i.e. a deadlock) and it turns itself off, because a problem is
>>>> found and it's pointless for lockdep to continue to run.
>>>>
>>>> And I haven't found a lockdep splat in your dmesg, do you have a full
>>>> dmesg so that I can have a look?
>>>>
>>>> This may be relevant because in commit 4d004099a66, we have
>>>>
>>>>@@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
>>>> lockdep_map *lock, int read)
>>>>unsigned long flags;
>>>>int ret = 0;
>>>>
>>>>-   if (unlikely(current->lockdep_recursion))
>>>>+   if (unlikely(!lockdep_enabled()))
>>>>return 1; /* avoid false negative lockdep_assert_held() 
>>>> */
>>>>
>>>> before this commit lock_is_held_type() and its friends may return false
>>>> if debug_locks==0, after this commit lock_is_held_type() and its friends
>>>> will always return true if debug_locks == 0. That could cause the
>>>> behavior here.
>>>>
>>>> In case I'm correct, the following "fix" may be helpful. 
>>>>
>>>> Regards,
>>>> Boqun
>>>>
>>>> --8
>>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>>> index 3e99dfef8408..c0e27fb949ff 100644
>>>> --- a/kernel/locking/lockdep.c
>>>> +++ b/kernel/locking/lockdep.c
>>>> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct 
>>>> lockdep_map *lock, int read)
>>>>unsigned long flags;
>>>>int ret = 0;
>>>>  
>>>> -  if (unlikely(!lockdep_enabled()))
>>>> +  if (unlikely(debug_locks && !lockdep_enabled()))
>>>>return 1; /* avoid false negative lockdep_assert_held() */
>>>>  
>>>>raw_local_irq_save(flags);
>>>
>>> Boqun, the patch fixes the problem for me!
>>> You can have Tested-by: Filipe Manana 
>>>
>>
>> Thanks. Although I think it still means that we have a lock issue when
>> running xfstests (because we don't know why debug_locks gets cleared),
> 
> I might find a place where we could turn lockdep off silently:
> 
> in print_circular_bug(), we turn off lockdep via
> debug_locks_off_graph_unlock(), and then we try to save the trace for
> lockdep splat, however, if we use up the stack_trace buffer (i.e.
> nr_stack_trace_entries), save_trace() will return NULL and we return
> silently.
> 
> Filipe, in order to check whethter that happen

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Filipe Manana



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
> 
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra 
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>> lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
> 
> In your dmesg, I see line:
> 
>   [ 9304.920151] INFO: lockdep is turned off.
> 
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
> 
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?
> 
> This may be relevant because in commit 4d004099a66, we have
> 
>   @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
> lockdep_map *lock, int read)
>   unsigned long flags;
>   int ret = 0;
> 
>   -   if (unlikely(current->lockdep_recursion))
>   +   if (unlikely(!lockdep_enabled()))
>   return 1; /* avoid false negative lockdep_assert_held() 
> */
> 
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
> 
> In case I'm correct, the following "fix" may be helpful. 
> 
> Regards,
> Boqun
> 
> --8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map 
> *lock, int read)
>   unsigned long flags;
>   int ret = 0;
>  
> - if (unlikely(!lockdep_enabled()))
> + if (unlikely(debug_locks && !lockdep_enabled()))
>   return 1; /* avoid false negative lockdep_assert_held() */
>  
>   raw_local_irq_save(flags);

Boqun, the patch fixes the problem for me!
You can have Tested-by: Filipe Manana 

Thanks!

> 
> 
> 
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>> at level 2 despite not having not been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>> reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>> mnt = None
>> for mnt in for_each_mount(prog, dst = mnt_path):
>> pass
>>
>> if mnt is None:
>> sys.stderr.write(f'Error:

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Filipe Manana



On 03/11/20 14:08, Boqun Feng wrote:
> Hi Filipe,
> 
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra 
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>> lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
> 
> In your dmesg, I see line:
> 
>   [ 9304.920151] INFO: lockdep is turned off.
> 
> , that means debug_locks is 0, that usually happens when lockdep find a
> problem (i.e. a deadlock) and it turns itself off, because a problem is
> found and it's pointless for lockdep to continue to run.
> 
> And I haven't found a lockdep splat in your dmesg, do you have a full
> dmesg so that I can have a look?

Everytime I run into the issue it produces similar results. It always
starts with the WARN_ON() at __sb_start_write(), followed by several
"hung task" traces, and then some time after lockdep is turned off.

> 
> This may be relevant because in commit 4d004099a66, we have
> 
>   @@ -5056,13 +5081,13 @@ noinstr int lock_is_held_type(const struct 
> lockdep_map *lock, int read)
>   unsigned long flags;
>   int ret = 0;
> 
>   -   if (unlikely(current->lockdep_recursion))
>   +   if (unlikely(!lockdep_enabled()))
>   return 1; /* avoid false negative lockdep_assert_held() 
> */
> 
> before this commit lock_is_held_type() and its friends may return false
> if debug_locks==0, after this commit lock_is_held_type() and its friends
> will always return true if debug_locks == 0. That could cause the
> behavior here.
> 
> In case I'm correct, the following "fix" may be helpful. 

I'll try it and let you now.
Thanks!

> 
> Regards,
> Boqun
> 
> --8
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..c0e27fb949ff 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -5471,7 +5464,7 @@ noinstr int lock_is_held_type(const struct lockdep_map 
> *lock, int read)
>   unsigned long flags;
>   int ret = 0;
>  
> - if (unlikely(!lockdep_enabled()))
> + if (unlikely(debug_locks && !lockdep_enabled()))
>   return 1; /* avoid false negative lockdep_assert_held() */
>  
>   raw_local_irq_save(flags);
> 
> 
> 
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive, so this makes __sb_start_write() do a
>> percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
>> is expected to always succeed, because if the calling task is holding a
>> freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
>> the semaphore at level 2, since the freeze semaphores are always
>> acquired by increasing level order.
>>
>> But the try_lock fails, it triggers the warning at __sb_start_write(),
>> then its caller sb_start_pagefault() ignores the return value and
>> callers such as btrfs_page_mkwrite() make the assumption the freeze
>> semaphore was taken, proceed to do their stuff, and later call
>> sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
>> at level 2 despite not having not been able to down_read() the
>> semaphore. This obviously corrupts the semaphore's read_count state, and
>> later causes any task trying to down_write() it to hang forever.
>>
>> After such a hang I ran a drgn script to confirm it:
>>
>> $ cat dump_freeze_sems.py
>> import sys
>> import drgn
>> from drgn import NULL, Object, cast, container_of, execscript, \
>> reinterpret, sizeof
>> from drgn.helpers.linux import *
>>
>> mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'
>>
>>

Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-03 Thread Filipe Manana



On 03/11/20 10:15, Jan Kara wrote:
> On Mon 02-11-20 17:58:54, Filipe Manana wrote:
>>
>>
>> On 26/10/20 15:22, Peter Zijlstra wrote:
>>> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
>>>> On Mon, Oct 26, 2020 at 11:56:03AM +, Filipe Manana wrote:
>>>>>> That smells like the same issue reported here:
>>>>>>
>>>>>>   
>>>>>> https://lkml.kernel.org/r/20201022111700.gz2...@hirez.programming.kicks-ass.net
>>>>>>
>>>>>> Make sure you have commit:
>>>>>>
>>>>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>>>>
>>>>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>>>>
>>>>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
>>>>
>>>> Bummer :/
>>>>
>>>>> I'll try with that commit and let you know, however it's gonna take a
>>>>> few hours to build a kernel and run all fstests (on that test box it
>>>>> takes over 3 hours) to confirm that fixes the issue.
>>>>
>>>> *ouch*, 3 hours is painful. How long to make it sick with the current
>>>> kernel? quicker I would hope?
>>>>
>>>>> Thanks for the quick reply!
>>>>
>>>> Anyway, I don't think that commit can actually explain the issue :/
>>>>
>>>> The false positive on lockdep_assert_held() happens when the recursion
>>>> count is !0, however we _should_ be having IRQs disabled when
>>>> lockdep_recursion > 0, so that should never be observable.
>>>>
>>>> My hope was that DEBUG_PREEMPT would trigger on one of the
>>>> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
>>>> then be a clear violation.
>>>>
>>>> And you're seeing this on x86, right?
>>>>
>>>> Let me puzzle moar..
>>>
>>> So I might have an explanation for the Sparc64 fail, but that can't
>>> explain x86 :/
>>>
>>> I initially thought raw_cpu_read() was OK, since if it is !0 we have
>>> IRQs disabled and can't get migrated, so if we get migrated both CPUs
>>> must have 0 and it doesn't matter which 0 we read.
>>>
>>> And while that is true; it isn't the whole store, on pretty much all
>>> architectures (except x86) this can result in computing the address for
>>> one CPU, getting migrated, the old CPU continuing execution with another
>>> task (possibly setting recursion) and then the new CPU reading the value
>>> of the old CPU, which is no longer 0.
>>>
>>> I already fixed a bunch of that in:
>>>
>>>   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu 
>>> variables"")
>>>
>>> but clearly this one got crossed.
>>>
>>> Still, that leaves me puzzled over you seeing this on x86 :/
>>
>> Hi Peter,
>>
>> I still get the same issue with 5.10-rc2.
>> Is there any non-merged patch I should try, or anything I can help with?
> 
> BTW, I've just hit the same deadlock issue with ext4 on generic/390 so I
> confirm this isn't btrfs specific issue (as we already knew from the
> analysis but still it's good to have that confirmed).

Indeed, yesterday Darrick was mentioning on IRC that he has run into it
too with fstests on XFS (5.10-rc).

> 
>   Honza
> 


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-11-02 Thread Filipe Manana



On 26/10/20 15:22, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 01:55:24PM +0100, Peter Zijlstra wrote:
>> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>>> That smells like the same issue reported here:
>>>>
>>>>   
>>>> https://lkml.kernel.org/r/20201022111700.gz2...@hirez.programming.kicks-ass.net
>>>>
>>>> Make sure you have commit:
>>>>
>>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>>
>>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>>
>>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
>>
>> Bummer :/
>>
>>> I'll try with that commit and let you know, however it's gonna take a
>>> few hours to build a kernel and run all fstests (on that test box it
>>> takes over 3 hours) to confirm that fixes the issue.
>>
>> *ouch*, 3 hours is painful. How long to make it sick with the current
>> kernel? quicker I would hope?
>>
>>> Thanks for the quick reply!
>>
>> Anyway, I don't think that commit can actually explain the issue :/
>>
>> The false positive on lockdep_assert_held() happens when the recursion
>> count is !0, however we _should_ be having IRQs disabled when
>> lockdep_recursion > 0, so that should never be observable.
>>
>> My hope was that DEBUG_PREEMPT would trigger on one of the
>> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
>> then be a clear violation.
>>
>> And you're seeing this on x86, right?
>>
>> Let me puzzle moar..
> 
> So I might have an explanation for the Sparc64 fail, but that can't
> explain x86 :/
> 
> I initially thought raw_cpu_read() was OK, since if it is !0 we have
> IRQs disabled and can't get migrated, so if we get migrated both CPUs
> must have 0 and it doesn't matter which 0 we read.
> 
> And while that is true; it isn't the whole store, on pretty much all
> architectures (except x86) this can result in computing the address for
> one CPU, getting migrated, the old CPU continuing execution with another
> task (possibly setting recursion) and then the new CPU reading the value
> of the old CPU, which is no longer 0.
> 
> I already fixed a bunch of that in:
> 
>   baffd723e44d ("lockdep: Revert "lockdep: Use raw_cpu_*() for per-cpu 
> variables"")
> 
> but clearly this one got crossed.
> 
> Still, that leaves me puzzled over you seeing this on x86 :/

Hi Peter,

I still get the same issue with 5.10-rc2.
Is there any non-merged patch I should try, or anything I can help with?

Thanks.

> 
> Anatoly, could you try linus+tip/locking/urgent and the below on your
> Sparc, please?
> 
> ---
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 3e99dfef8408..a3041463e42d 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -84,7 +84,7 @@ static inline bool lockdep_enabled(void)
>   if (!debug_locks)
>   return false;
>  
> - if (raw_cpu_read(lockdep_recursion))
> + if (this_cpu_read(lockdep_recursion))
>   return false;
>  
>   if (current->lockdep_recursion)
> 


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-10-26 Thread Filipe Manana



On 26/10/20 12:55, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:56:03AM +0000, Filipe Manana wrote:
>>> That smells like the same issue reported here:
>>>
>>>   
>>> https://lkml.kernel.org/r/20201022111700.gz2...@hirez.programming.kicks-ass.net
>>>
>>> Make sure you have commit:
>>>
>>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>>
>>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
>>
>> Yes, CONFIG_DEBUG_PREEMPT is enabled.
> 
> Bummer :/
> 
>> I'll try with that commit and let you know, however it's gonna take a
>> few hours to build a kernel and run all fstests (on that test box it
>> takes over 3 hours) to confirm that fixes the issue.
> 
> *ouch*, 3 hours is painful. How long to make it sick with the current
> kernel? quicker I would hope?

If generic/068 triggers the bug, than it's about 1 hour. If that passes,
which rarely happens, then have to wait to get into generic/390, which
is over 2 hours.

It sucks that running those tests alone never trigger the issue, but
running all fstests (first btrfs specific ones, followed by the generic
ones) reliably triggers the bug, almost always at generic/068, when that
passes, it's triggered by generic/390. To confirm everything is ok, I
let all tests run (last generic is 612).


> 
>> Thanks for the quick reply!
> 
> Anyway, I don't think that commit can actually explain the issue :/
> 
> The false positive on lockdep_assert_held() happens when the recursion
> count is !0, however we _should_ be having IRQs disabled when
> lockdep_recursion > 0, so that should never be observable.
> 
> My hope was that DEBUG_PREEMPT would trigger on one of the
> __this_cpu_{inc,dec}(lockdep_recursion) instance, because that would
> then be a clear violation.
> 
> And you're seeing this on x86, right?

Right.
It's in a qemu vm on x86, with '-cpu host' passed to qemu and kvm enabled.

Thanks.


> 
> Let me puzzle moar..
> 


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-10-26 Thread Filipe Manana



On 26/10/20 11:55, Jan Kara wrote:
> On Mon 26-10-20 12:40:09, Peter Zijlstra wrote:
>> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>>> Hello,
>>>
>>> I've recently started to hit a warning followed by tasks hanging after
>>> attempts to freeze a filesystem. A git bisection pointed to the
>>> following commit:
>>>
>>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>>> Author: Peter Zijlstra 
>>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>>
>>> lockdep: Fix lockdep recursion
>>>
>>> This happens very reliably when running all xfstests with lockdep
>>> enabled, and the tested filesystem is btrfs (haven't tried other
>>> filesystems, but it shouldn't matter). The warning and task hangs always
>>> happen at either test generic/068 or test generic/390, and (oddly)
>>> always have to run all tests for it to trigger, running those tests
>>> individually on an infinite loop doesn't seem to trigger it (at least
>>> for a couple hours).
>>>
>>> The warning triggered is at fs/super.c:__sb_start_write() which always
>>> results later in several tasks hanging on a percpu rw_sem:
>>>
>>> https://pastebin.com/qnLvf94E
>>>
>>> What happens is percpu_rwsem_is_held() is apparently returning a false
>>> positive,
>>
>> That smells like the same issue reported here:
>>
>>   
>> https://lkml.kernel.org/r/20201022111700.gz2...@hirez.programming.kicks-ass.net
>>
>> Make sure you have commit:
>>
>>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
>>
>> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?
> 
> Hum, I am at 5.10-rc1 now and above mentioned commit doesn't appear to be
> there? Also googling for the title doesn't help...

Same here, can't find in Linus' tree neither through google.

> 
>   Honza
> 


Re: possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-10-26 Thread Filipe Manana



On 26/10/20 11:40, Peter Zijlstra wrote:
> On Mon, Oct 26, 2020 at 11:26:49AM +0000, Filipe Manana wrote:
>> Hello,
>>
>> I've recently started to hit a warning followed by tasks hanging after
>> attempts to freeze a filesystem. A git bisection pointed to the
>> following commit:
>>
>> commit 4d004099a668c41522242aa146a38cc4eb59cb1e
>> Author: Peter Zijlstra 
>> Date:   Fri Oct 2 11:04:21 2020 +0200
>>
>> lockdep: Fix lockdep recursion
>>
>> This happens very reliably when running all xfstests with lockdep
>> enabled, and the tested filesystem is btrfs (haven't tried other
>> filesystems, but it shouldn't matter). The warning and task hangs always
>> happen at either test generic/068 or test generic/390, and (oddly)
>> always have to run all tests for it to trigger, running those tests
>> individually on an infinite loop doesn't seem to trigger it (at least
>> for a couple hours).
>>
>> The warning triggered is at fs/super.c:__sb_start_write() which always
>> results later in several tasks hanging on a percpu rw_sem:
>>
>> https://pastebin.com/qnLvf94E
>>
>> What happens is percpu_rwsem_is_held() is apparently returning a false
>> positive,
> 
> That smells like the same issue reported here:
> 
>   
> https://lkml.kernel.org/r/20201022111700.gz2...@hirez.programming.kicks-ass.net
> 
> Make sure you have commit:
> 
>   f8e48a3dca06 ("lockdep: Fix preemption WARN for spurious IRQ-enable")
> 
> (in Linus' tree by now) and do you have CONFIG_DEBUG_PREEMPT enabled?

Yes, CONFIG_DEBUG_PREEMPT is enabled.
I'll try with that commit and let you know, however it's gonna take a
few hours to build a kernel and run all fstests (on that test box it
takes over 3 hours) to confirm that fixes the issue.

Thanks for the quick reply!

> 
> 
> 
> 


possible lockdep regression introduced by 4d004099a668 ("lockdep: Fix lockdep recursion")

2020-10-26 Thread Filipe Manana
Hello,

I've recently started to hit a warning followed by tasks hanging after
attempts to freeze a filesystem. A git bisection pointed to the
following commit:

commit 4d004099a668c41522242aa146a38cc4eb59cb1e
Author: Peter Zijlstra 
Date:   Fri Oct 2 11:04:21 2020 +0200

lockdep: Fix lockdep recursion

This happens very reliably when running all xfstests with lockdep
enabled, and the tested filesystem is btrfs (haven't tried other
filesystems, but it shouldn't matter). The warning and task hangs always
happen at either test generic/068 or test generic/390, and (oddly)
always have to run all tests for it to trigger, running those tests
individually on an infinite loop doesn't seem to trigger it (at least
for a couple hours).

The warning triggered is at fs/super.c:__sb_start_write() which always
results later in several tasks hanging on a percpu rw_sem:

https://pastebin.com/qnLvf94E

What happens is percpu_rwsem_is_held() is apparently returning a false
positive, so this makes __sb_start_write() do a
percpu_down_read_trylock() on a percpu_rw_sem at a higher level, which
is expected to always succeed, because if the calling task is holding a
freeze percpu_rw_sem at level 1, it's supposed to be able to try_lock
the semaphore at level 2, since the freeze semaphores are always
acquired by increasing level order.

But the try_lock fails, it triggers the warning at __sb_start_write(),
then its caller sb_start_pagefault() ignores the return value and
callers such as btrfs_page_mkwrite() make the assumption the freeze
semaphore was taken, proceed to do their stuff, and later call
sb_end_pagefault(), which which will do an up_read() on the percpu_rwsem
at level 2 despite not having not been able to down_read() the
semaphore. This obviously corrupts the semaphore's read_count state, and
later causes any task trying to down_write() it to hang forever.

After such a hang I ran a drgn script to confirm it:

$ cat dump_freeze_sems.py
import sys
import drgn
from drgn import NULL, Object, cast, container_of, execscript, \
reinterpret, sizeof
from drgn.helpers.linux import *

mnt_path = b'/home/fdmanana/btrfs-tests/scratch_1'

mnt = None
for mnt in for_each_mount(prog, dst = mnt_path):
pass

if mnt is None:
sys.stderr.write(f'Error: mount point {mnt_path} not found\n')
sys.exit(1)

def dump_sem(level_enum):
level = level_enum.value_()
sem = mnt.mnt.mnt_sb.s_writers.rw_sem[level - 1]
print(f'freeze semaphore at level {level}, {str(level_enum)}')
print(f'block {sem.block.counter.value_()}')
for i in for_each_possible_cpu(prog):
read_count = per_cpu_ptr(sem.read_count, i)
print(f'read_count at cpu {i} = {read_count}')
print()

# dump semaphore read counts for all freeze levels (fs.h)
dump_sem(prog['SB_FREEZE_WRITE'])
dump_sem(prog['SB_FREEZE_PAGEFAULT'])
dump_sem(prog['SB_FREEZE_FS'])


$ drgn dump_freeze_sems.py
freeze semaphore at level 1, (enum )SB_FREEZE_WRITE
block 1
read_count at cpu 0 = *(unsigned int *)0xc2ec3ee00c74 = 3
read_count at cpu 1 = *(unsigned int *)0xc2ec3f200c74 = 4294967293
read_count at cpu 2 = *(unsigned int *)0xc2ec3f600c74 = 3
read_count at cpu 3 = *(unsigned int *)0xc2ec3fa00c74 = 4294967293

freeze semaphore at level 2, (enum )SB_FREEZE_PAGEFAULT
block 1
read_count at cpu 0 = *(unsigned int *)0xc2ec3ee00c78 = 0
read_count at cpu 1 = *(unsigned int *)0xc2ec3f200c78 = 4294967295
read_count at cpu 2 = *(unsigned int *)0xc2ec3f600c78 = 0
read_count at cpu 3 = *(unsigned int *)0xc2ec3fa00c78 = 0

freeze semaphore at level 3, (enum )SB_FREEZE_FS
block 0
read_count at cpu 0 = *(unsigned int *)0xc2ec3ee00c7c = 0
read_count at cpu 1 = *(unsigned int *)0xc2ec3f200c7c = 0
read_count at cpu 2 = *(unsigned int *)0xc2ec3f600c7c = 0
read_count at cpu 3 = *(unsigned int *)0xc2ec3fa00c7c = 0

At levels 1 and 3, read_count sums to 0, so it's fine, but at level 2 it
sums to -1. The system remains like that for hours at least, with no
progress at all.

Is there a known regression with that lockdep commit?
Anything I can do to help debug it in case it's not obvious?

Thanks.


Re: [PATCH] fs: btrfs: Fix incorrect printf qualifier

2020-10-17 Thread Filipe Manana
On Wed, Oct 14, 2020 at 10:24 AM Pujin Shi  wrote:
>
> This patch addresses a compile warning:
> fs/btrfs/extent-tree.c: In function '__btrfs_free_extent':
> fs/btrfs/extent-tree.c:3187:4: warning: format '%lu' expects argument of type 
> 'long unsigned int', but argument 8 has type 'unsigned int' [-Wformat=]
>
> Fixes: 3b7b6ffa4f8f ("btrfs: extent-tree: kill BUG_ON() in 
> __btrfs_free_extent()")

Btw, that commit id does not exist in Linus' tree, should be 1c2a07f598d5 [1].

> Signed-off-by: Pujin Shi 

Other than that it looks good,

Reviewed-by: Filipe Manana 

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=1c2a07f598d526e39acbf1837b8d521fc0dab9c5

> ---
>  fs/btrfs/extent-tree.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index 3b21fee13e77..5fd60b13f4f8 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -3185,7 +3185,7 @@ static int __btrfs_free_extent(struct 
> btrfs_trans_handle *trans,
> struct btrfs_tree_block_info *bi;
> if (item_size < sizeof(*ei) + sizeof(*bi)) {
> btrfs_crit(info,
> -"invalid extent item size for key (%llu, %u, %llu) owner %llu, has %u expect 
> >= %lu",
> +"invalid extent item size for key (%llu, %u, %llu) owner %llu, has %u expect 
> >= %zu",
>key.objectid, key.type, key.offset,
>owner_objectid, item_size,
>sizeof(*ei) + sizeof(*bi));
> --
> 2.18.1
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH] btrfs: fix memdup.cocci warnings

2020-09-22 Thread Filipe Manana
On Tue, Sep 22, 2020 at 11:29 AM Julia Lawall  wrote:
>
> From: kernel test robot 
>
> fs/btrfs/send.c:3854:8-15: WARNING opportunity for kmemdup
>
>  Use kmemdup rather than duplicating its implementation
>
> Generated by: scripts/coccinelle/api/memdup.cocci
>
> Fixes: 28314eb24e6c ("btrfs: send, recompute reference path after 
> orphanization of a directory")
> Signed-off-by: kernel test robot 
> Signed-off-by: Julia Lawall 

Since this is not in Linus' tree yet, it can be folded in the original patch.
David, can you do that when you pick it?

Btw, isn't the Fixes tag meant only for bug fixes? This is a pure
cleanup afaics.

Thanks.

> ---
>
> tree:   https://git.kernel.org/pub/scm/linux/kernel/git/fdmanana/linux.git 
> misc-next
> head:   28314eb24e6cb8124d1e5da2ef2ccb90ec44cc06
> commit: 28314eb24e6cb8124d1e5da2ef2ccb90ec44cc06 [2/2] btrfs: send, recompute 
> reference path after orphanization of a directory
> :: branch date: 17 hours ago
> :: commit date: 17 hours ago
>
>
>  send.c |3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> --- a/fs/btrfs/send.c
> +++ b/fs/btrfs/send.c
> @@ -3851,10 +3851,9 @@ static int refresh_ref_path(struct send_
> char *name;
> int ret;
>
> -   name = kmalloc(ref->name_len, GFP_KERNEL);
> +   name = kmemdup(ref->name, ref->name_len, GFP_KERNEL);
> if (!name)
> return -ENOMEM;
> -   memcpy(name, ref->name, ref->name_len);
>
> fs_path_reset(ref->full_path);
> ret = get_cur_path(sctx, ref->dir, ref->dir_gen, ref->full_path);



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2] btrfs: block-group: Fix free-space bitmap threshould

2020-08-21 Thread Filipe Manana
On Fri, Aug 21, 2020 at 2:43 PM Marcos Paulo de Souza
 wrote:
>
> From: Marcos Paulo de Souza 
>
> [BUG]
> After commit 9afc66498a0b ("btrfs: block-group: refactor how we read one
> block group item"), cache->length is being assigned after calling
> btrfs_create_block_group_cache. This causes a problem since
> set_free_space_tree_thresholds is calculate the free-space threshould to
> decide is the free-space tree should convert from extents to bitmaps.
>
> The current code calls set_free_space_tree_thresholds with cache->length
> being 0, which then makes cache->bitmap_high_thresh being zero. This
> implies the system will always use bitmap instead of extents, which is
> not desired if the block group is not fragmented.
>
> This behavior can be seen by a test that expects to repair systems
> with FREE_SPACE_EXTENT and FREE_SPACE_BITMAP, but the current code only
> created FREE_SPACE_BITMAP.
>
> [FIX]
> Call set_free_space_tree_thresholds after setting cache->length.
>
> Link: https://github.com/kdave/btrfs-progs/issues/251
> Fixes: 9afc66498a0b ("btrfs: block-group: refactor how we read one block 
> group item")
> CC: sta...@vger.kernel.org # 5.8+
> Reviewed-by: Qu Wenruo 
> Signed-off-by: Marcos Paulo de Souza 
> ---
>
>  Changes from v1:
>  * Add a warning in set_free_space_tree_thresholds when bg->length is zero 
> (Qu)
>
>  fs/btrfs/block-group.c | 4 +++-
>  fs/btrfs/free-space-tree.c | 3 +++
>  2 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
> index 44fdfa2eeb2e..01e8ba1da1d3 100644
> --- a/fs/btrfs/block-group.c
> +++ b/fs/btrfs/block-group.c
> @@ -1798,7 +1798,6 @@ static struct btrfs_block_group 
> *btrfs_create_block_group_cache(
>
> cache->fs_info = fs_info;
> cache->full_stripe_len = btrfs_full_stripe_len(fs_info, start);
> -   set_free_space_tree_thresholds(cache);
>
> cache->discard_index = BTRFS_DISCARD_INDEX_UNUSED;
>
> @@ -1908,6 +1907,8 @@ static int read_one_block_group(struct btrfs_fs_info 
> *info,
>
> read_block_group_item(cache, path, key);
>
> +   set_free_space_tree_thresholds(cache);
> +
> if (need_clear) {
> /*
>  * When we mount with old space cache, we need to
> @@ -2128,6 +2129,7 @@ int btrfs_make_block_group(struct btrfs_trans_handle 
> *trans, u64 bytes_used,
> return -ENOMEM;
>
> cache->length = size;
> +   set_free_space_tree_thresholds(cache);
> cache->used = bytes_used;
> cache->flags = type;
> cache->last_byte_to_unpin = (u64)-1;
> diff --git a/fs/btrfs/free-space-tree.c b/fs/btrfs/free-space-tree.c
> index 8b1f5c8897b7..1d191fbc754b 100644
> --- a/fs/btrfs/free-space-tree.c
> +++ b/fs/btrfs/free-space-tree.c
> @@ -22,6 +22,9 @@ void set_free_space_tree_thresholds(struct 
> btrfs_block_group *cache)
> size_t bitmap_size;
> u64 num_bitmaps, total_bitmap_size;
>
> +   if (cache->length == 0)
> +   btrfs_warn(cache->fs_info, "block group length is zero");

This alone is not very useful.
With something like:

if (WARN_ON(cache->length) == 0)
   (and the message including the block group's logical address
too, the ->start field)

Such a bug is much easier to spot. If a test case from fstests
triggers it, it will be reported as a test failure.

Why not an ASSERT() instead? Though I don't have a strong preference
between the two for this case.
Either option will make it easy to spot with fstests.

As for the rest, the fix itself looks good to me.
You can later add,

Reviewed-by: Filipe Manana 

Thanks.

> +
> /*
>  * We convert to bitmaps when the disk space required for using 
> extents
>  * exceeds that required for using bitmaps.
> --
> 2.28.0
>


-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH v2 btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Filipe Manana
  btrfs_release_extent_buffer_pages
>   BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by introducing some spurious delays
> and by introducing probabilistic trylock_page failures.
>
> To fix it, we apply check_tree_ref at a point where it could not
> possibly be unset by a competing task: after io_pages has been
> incremented. All the codepaths that clear TREE_REF check for io, so they
> would not be able to clear it after this point until the io is done.
>
> stack trace, for reference:
> [1417839.424739] [ cut here ]
> [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
> [1417839.447024] invalid opcode:  [#1] SMP
> [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
> [1417839.517008] Code: ed e9 ...
> [1417839.558895] RSP: 0018:c90020bcf798 EFLAGS: 00010202
> [1417839.570816] RAX: 0002 RBX: 888102d6def0 RCX:
> 0028
> [1417839.586962] RDX: 0002 RSI: 8887f0296482 RDI:
> 888102d6def0
> [1417839.603108] RBP: 5664a000 R08: 0046 R09:
> 0238
> [1417839.619255] R10: 0028 R11: 5664af68 R12:
> 
> [1417839.635402] R13:  R14: 88875f573ad0 R15:
> 888797aafd90
> [1417839.651549] FS:  7f5a844fa700() GS:5f68()
> knlGS:
> [1417839.669810] CS:  0010 DS:  ES:  CR0: 80050033
> [1417839.682887] CR2: 7f7884541fe0 CR3: 00049f609002 CR4:
> 003606e0
> [1417839.699037] DR0:  DR1:  DR2:
> 
> [1417839.715187] DR3:  DR6: fffe0ff0 DR7:
> 0400
> [1417839.731320] Call Trace:
> [1417839.737103]  release_extent_buffer+0x39/0x90
> [1417839.746913]  read_block_for_search.isra.38+0x2a3/0x370
> [1417839.758645]  btrfs_search_slot+0x260/0x9b0
> [1417839.768054]  btrfs_lookup_file_extent+0x4a/0x70
> [1417839.778427]  btrfs_get_extent+0x15f/0x830
> [1417839.787665]  ? submit_extent_page+0xc4/0x1c0
> [1417839.797474]  ? __do_readpage+0x299/0x7a0
> [1417839.806515]  __do_readpage+0x33b/0x7a0
> [1417839.815171]  ? btrfs_releasepage+0x70/0x70
> [1417839.824597]  extent_readpages+0x28f/0x400
> [1417839.833836]  read_pages+0x6a/0x1c0
> [1417839.841729]  ? startup_64+0x2/0x30
> [1417839.849624]  __do_page_cache_readahead+0x13c/0x1a0
> [1417839.860590]  filemap_fault+0x6c7/0x990
> [1417839.869252]  ? xas_load+0x8/0x80
> [1417839.876756]  ? xas_find+0x150/0x190
> [1417839.884839]  ? filemap_map_pages+0x295/0x3b0
> [1417839.894652]  __do_fault+0x32/0x110
> [1417839.902540]  __handle_mm_fault+0xacd/0x1000
> [1417839.912156]  handle_mm_fault+0xaa/0x1c0
> [1417839.921004]  __do_page_fault+0x242/0x4b0
> [1417839.930044]  ? page_fault+0x8/0x30
> [1417839.937933]  page_fault+0x1e/0x30
> [1417839.945631] RIP: 0033:0x33c4bae
> [1417839.952927] Code: Bad RIP value.
> [1417839.960411] RSP: 002b:7f5a844f7350 EFLAGS: 00010206
> [1417839.972331] RAX: 006e RBX: 1614b3ff6a50398a RCX:
> 
> [1417839.988477] RDX:  RSI:  RDI:
> 0002
> [1417840.004626] RBP: 7f5a844f7420 R08: 006e R09:
> 7f5a94aeccb8
> [1417840.020784] R10: 7f5a844f7350 R11:  R12:
> 7f5a94aecc79
> [1417840.036932] R13: 7f5a94aecc78 R14: 7f5a94aecc90 R15:
> 7f5a94aecc40
>
> Signed-off-by: Boris Burkov 

Reviewed-by: Filipe Manana 

Looks good, thanks!

> ---
>  fs/btrfs/extent_io.c | 40 
>  1 file changed, 24 insertions(+), 16 deletions(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c59e07360083..95313bb7fe40 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -5086,25 +5086,28 @@ struct extent_buffer 
> *alloc_dummy_extent_buffer(struct btrfs_fs_info *fs_info,
>  static void check_buffer_tree_ref(struct extent_buffer *eb)
>  {
> int refs;
> -   /* the ref bit is tricky.  We have to make sure it is set
> -* if we have the buffer dirty.   Otherwise the
> -* code to free a buffer can end up dropping a dirty
> -* page
> +   /*
> +* The TREE_REF bit is first set when the extent_buffer is added
> +* to the radix tree. It is also reset, if unset, when a new reference
> +* is created by find_extent_buffer.
>  *
> -* Once the ref bit is set, it won't go away while the
> -* buffer is dirty or in writeback, and it also won't
> -* go away while we have the re

Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Filipe Manana
On Wed, Jun 17, 2020 at 7:19 PM Josef Bacik  wrote:
>
> On 6/17/20 2:11 PM, Filipe Manana wrote:
> > On Wed, Jun 17, 2020 at 6:43 PM Chris Mason  wrote:
> >>
> >> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
> >>
> >>> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:
> >>>
> >>>> ---
> >>>>   fs/btrfs/extent_io.c | 45
> >>>> 
> >>>>   1 file changed, 29 insertions(+), 16 deletions(-)
> >>>>
> >>>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >>>> index c59e07360083..f6758ebbb6a2 100644
> >>>> --- a/fs/btrfs/extent_io.c
> >>>> +++ b/fs/btrfs/extent_io.c
> >>>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >>>> write_one_eb(struct extent_buffer *eb,
> >>>>  clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >>>>  num_pages = num_extent_pages(eb);
> >>>>  atomic_set(&eb->io_pages, num_pages);
> >>>> +   /*
> >>>> +* It is possible for releasepage to clear the TREE_REF bit
> >>>> before we
> >>>> +* set io_pages. See check_buffer_tree_ref for a more
> >>>> detailed comment.
> >>>> +*/
> >>>> +   check_buffer_tree_ref(eb);
> >>>
> >>> This is a whole different case from the one described in the
> >>> changelog, as this is in the write path.
> >>> Why do we need this one?
> >>
> >> This was Josef’s idea, but I really like the symmetry.  You set
> >> io_pages, you do the tree_ref dance.  Everyone fiddling with the write
> >> back bit right now correctly clears writeback after doing the atomic_dec
> >> on io_pages, but the race is tiny and prone to getting exposed again by
> >> shifting code around.  Tree ref checks around io_pages are the most
> >> reliable way to prevent this bug from coming back again later.
> >
> > Ok, but that still doesn't answer my question.
> > Is there an actual race/problem this hunk solves?
> >
> > Before calling write_one_eb() we increment the ref on the eb and we
> > also call lock_extent_buffer_for_io(),
> > which clears the dirty bit and sets the writeback bit on the eb while
> > holding its ref_locks spin_lock.
> >
> > Even if we get to try_release_extent_buffer, it calls
> > extent_buffer_under_io(eb) while holding the ref_locks spin_lock,
> > so at any time it should return true, as either the dirty or the
> > writeback bit is set.
> >
> > Is this purely a safety guard that is being introduced?
> >
> > Can we at least describe in the changelog why we are adding this hunk
> > in the write path?
> > All it mentions is a race between reading and releasing pages, there's
> > nothing mentioned about races with writeback.
> >
>
> I think maybe we make that bit a separate patch, and leave the panic fix on 
> it's
> own.  I suggested this because I have lofty ideas of killing the refs_lock, 
> and
> this would at least keep us consistent in our usage TREE_REF to save us from
> freeing stuff that may be currently under IO.
>
> I'm super not happy with our reference handling coupled with releasepage, but
> these are the kind of hoops we're going to have to jump through until we have
> some better infrastructure in place to handle metadata.  Thanks,

Ok, so it's just a safety guard then.
Yes, either a separate patch or at the very least mention why we are
adding that in the change log.

Thanks.

>
> Josef



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Filipe Manana
On Wed, Jun 17, 2020 at 6:43 PM Chris Mason  wrote:
>
> On 17 Jun 2020, at 13:20, Filipe Manana wrote:
>
> > On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:
> >
> >> ---
> >>  fs/btrfs/extent_io.c | 45
> >> 
> >>  1 file changed, 29 insertions(+), 16 deletions(-)
> >>
> >> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> >> index c59e07360083..f6758ebbb6a2 100644
> >> --- a/fs/btrfs/extent_io.c
> >> +++ b/fs/btrfs/extent_io.c
> >> @@ -3927,6 +3927,11 @@ static noinline_for_stack int
> >> write_one_eb(struct extent_buffer *eb,
> >> clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
> >> num_pages = num_extent_pages(eb);
> >> atomic_set(&eb->io_pages, num_pages);
> >> +   /*
> >> +* It is possible for releasepage to clear the TREE_REF bit
> >> before we
> >> +* set io_pages. See check_buffer_tree_ref for a more
> >> detailed comment.
> >> +*/
> >> +   check_buffer_tree_ref(eb);
> >
> > This is a whole different case from the one described in the
> > changelog, as this is in the write path.
> > Why do we need this one?
>
> This was Josef’s idea, but I really like the symmetry.  You set
> io_pages, you do the tree_ref dance.  Everyone fiddling with the write
> back bit right now correctly clears writeback after doing the atomic_dec
> on io_pages, but the race is tiny and prone to getting exposed again by
> shifting code around.  Tree ref checks around io_pages are the most
> reliable way to prevent this bug from coming back again later.

Ok, but that still doesn't answer my question.
Is there an actual race/problem this hunk solves?

Before calling write_one_eb() we increment the ref on the eb and we
also call lock_extent_buffer_for_io(),
which clears the dirty bit and sets the writeback bit on the eb while
holding its ref_locks spin_lock.

Even if we get to try_release_extent_buffer, it calls
extent_buffer_under_io(eb) while holding the ref_locks spin_lock,
so at any time it should return true, as either the dirty or the
writeback bit is set.

Is this purely a safety guard that is being introduced?

Can we at least describe in the changelog why we are adding this hunk
in the write path?
All it mentions is a race between reading and releasing pages, there's
nothing mentioned about races with writeback.

Thanks.

>
> -chris



-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Filipe Manana
On Wed, Jun 17, 2020 at 6:20 PM Filipe Manana  wrote:
>
> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:
> >
> > Under somewhat convoluted conditions, it is possible to attempt to
> > release an extent_buffer that is under io, which triggers a BUG_ON in
> > btrfs_release_extent_buffer_pages.
> >
> > This relies on a few different factors. First, extent_buffer reads done
> > as readahead for searching use WAIT_NONE, so they free the local extent
> > buffer reference while the io is outstanding. However, they should still
> > be protected by TREE_REF. However, if the system is doing signficant
> > reclaim, and simultaneously heavily accessing the extent_buffers, it is
> > possible for releasepage to race with two concurrent readahead attempts
> > in a way that leaves TREE_REF unset when the readahead extent buffer is
> > released.
> >
> > Essentially, if two tasks race to allocate a new extent_buffer, but the
> > winner who attempts the first io is rebuffed by a page being locked
> > (likely by the reclaim itself) then the loser will still go ahead with
> > issuing the readahead. The loser's call to find_extent_buffer must also
> > race with the reclaim task reading the extent_buffer's refcount as 1 in
> > a way that allows the reclaim to re-clear the TREE_REF checked by
> > find_extent_buffer.
> >
> > The following represents an example execution demonstrating the race:
> >
> > CPU0 
> > CPU1   CPU2
> > reada_for_searchreada_for_search
> >   readahead_tree_block
> > readahead_tree_block
> > find_create_tree_block  
> > find_create_tree_block
> >   alloc_extent_buffer 
> > alloc_extent_buffer
> >   
> > find_extent_buffer // not found
> >   allocates 
> > eb
> >   lock pages
> >   associate 
> > pages to eb
> >   insert eb 
> > into radix tree
> >   set 
> > TREE_REF, refs == 2
> >   unlock 
> > pages
> >   
> > read_extent_buffer_pages // WAIT_NONE
> > not 
> > uptodate (brand new eb)
> > 
> > lock_page
> > if 
> > !trylock_page
> >   goto 
> > unlock_exit // not an error
> >   
> > free_extent_buffer
> > 
> > release_extent_buffer
> >   
> > atomic_dec_and_test refs to 1
> > find_extent_buffer // found
> > 
> > try_release_extent_buffer
> > 
> >   take refs_lock
> > 
> >   reads refs == 1; no io
> >   atomic_inc_not_zero refs to 2
> >   mark_buffer_accessed
> > check_buffer_tree_ref
> >   // not STALE, won't take refs_lock
> >   refs == 2; TREE_REF set // no action
> > read_extent_buffer_pages // WAIT_NONE
> > 
> >   clear TREE_REF
> > 
> >   release_extent_buffer
> > 
> > 

Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Filipe Manana
On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:
>
> Under somewhat convoluted conditions, it is possible to attempt to
> release an extent_buffer that is under io, which triggers a BUG_ON in
> btrfs_release_extent_buffer_pages.
>
> This relies on a few different factors. First, extent_buffer reads done
> as readahead for searching use WAIT_NONE, so they free the local extent
> buffer reference while the io is outstanding. However, they should still
> be protected by TREE_REF. However, if the system is doing signficant
> reclaim, and simultaneously heavily accessing the extent_buffers, it is
> possible for releasepage to race with two concurrent readahead attempts
> in a way that leaves TREE_REF unset when the readahead extent buffer is
> released.
>
> Essentially, if two tasks race to allocate a new extent_buffer, but the
> winner who attempts the first io is rebuffed by a page being locked
> (likely by the reclaim itself) then the loser will still go ahead with
> issuing the readahead. The loser's call to find_extent_buffer must also
> race with the reclaim task reading the extent_buffer's refcount as 1 in
> a way that allows the reclaim to re-clear the TREE_REF checked by
> find_extent_buffer.
>
> The following represents an example execution demonstrating the race:
>
> CPU0 CPU1 
>   CPU2
> reada_for_searchreada_for_search
>   readahead_tree_block
> readahead_tree_block
> find_create_tree_block  
> find_create_tree_block
>   alloc_extent_buffer 
> alloc_extent_buffer
>   
> find_extent_buffer // not found
>   allocates eb
>   lock pages
>   associate 
> pages to eb
>   insert eb 
> into radix tree
>   set 
> TREE_REF, refs == 2
>   unlock pages
>   
> read_extent_buffer_pages // WAIT_NONE
> not uptodate 
> (brand new eb)
>   
>   lock_page
> if 
> !trylock_page
>   goto 
> unlock_exit // not an error
>   
> free_extent_buffer
> 
> release_extent_buffer
>   
> atomic_dec_and_test refs to 1
> find_extent_buffer // found
>   
>   try_release_extent_buffer
>   
> take refs_lock
>   
> reads refs == 1; no io
>   atomic_inc_not_zero refs to 2
>   mark_buffer_accessed
> check_buffer_tree_ref
>   // not STALE, won't take refs_lock
>   refs == 2; TREE_REF set // no action
> read_extent_buffer_pages // WAIT_NONE
>   
> clear TREE_REF
>   
> release_extent_buffer
>   
>   atomic_dec_and_test refs to 1
>   
>   unlock_page
>   still not uptodate (CPU1 read failed on trylock_page)
>   locks pages
>   set io_pages > 0
>   submit io
>   return
> release_extent_buffer

Small detail, missing the call to free_extent_buffer() from
readahead_tree_block(), which is what ends calling
release_extent_buffer().

>   dec refs to 0
>   delete from radix tree
>   btrfs_release_extent_buffer_pages
> BUG_ON(io_pages > 0)!!!
>
> We observe this at a very low rate in production and were also able to
> reproduce it in a test environment by intr

Re: [GIT PULL] Btrfs updates for 5.8, part 2

2020-06-15 Thread Filipe Manana
On Mon, Jun 15, 2020 at 2:29 PM Christoph Hellwig  wrote:
>
> On Mon, Jun 15, 2020 at 02:57:01PM +0200, David Sterba wrote:
> > On Sun, Jun 14, 2020 at 09:50:17AM -0700, Linus Torvalds wrote:
> > > On Sun, Jun 14, 2020 at 4:56 AM David Sterba  wrote:
> > > >
> > > > Reverts are not great, but under current circumstances I don't see
> > > > better options.
> > >
> > > Pulled. Are people discussing how to make iomap work for everybody?
> > > It's a bit sad if we can't have the major filesystems move away from
> > > the old buffer head interfaces to a common more modern one..
> >
> > Yes, it's fixable and we definitely want to move to iomap. The direct to
> > buffered fallback would fix one of the problems, but this would also
> > mean that xfs would start doing that. Such change should be treated more
> > like a feature development than a bugfix, imposed by another filesystem,
> > and xfs people rightfully complained.
>
> We can trivially key that off a flag at least for 5.8.  I suspect the
> fallback actually is the right thing for XFS in the long run for that
> particular case.

We also have another regression (a deadlock) [1] introduced by the patchset.
I haven't looked into detail to figure out if it can be completely
solved in btrfs or if it would need a change on iomap.
Goldwyn was looking into it, but I don't know if he made any progress.

[1] 
https://lore.kernel.org/linux-btrfs/cal3q7h4f9iqjy3tgwzrwokwenannn7osthqzumej_vwx3we...@mail.gmail.com/#t

-- 
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”


Re: [btrfs] e678934cbe: reaim.jobs_per_min -30.7% regression

2020-06-12 Thread Filipe Manana



On 11/06/20 10:02, kernel test robot wrote:
> Greeting,
> 
> FYI, we noticed a -30.7% regression of reaim.jobs_per_min due to commit:


Hello,

In the future, can you please always CC linux-bt...@vger.kernel.org for
btrfs related reports?

Thanks.

> 
> 
> commit: e678934cbe5f026c2765a1da651e61daa5724fb3 ("btrfs: Remove unnecessary 
> check from join_running_log_trans")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> 
> in testcase: reaim
> on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 
> 256G memory
> with following parameters:
> 
>   runtime: 300s
>   nr_task: 100
>   disk: 1HDD
>   fs: btrfs
>   test: disk
>   cpufreq_governor: performance
>   ucode: 0x52c
> 
> test-description: REAIM is an updated and improved version of AIM 7 benchmark.
> test-url: https://sourceforge.net/projects/re-aim-7/
> 
> 
> 
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot 
> 
> 
> Details are as below:
> -->
> 
> 
> To reproduce:
> 
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml  # job file is attached in this email
> bin/lkp run job.yaml
> 
> =
> compiler/cpufreq_governor/disk/fs/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase/ucode:
>   
> gcc-9/performance/1HDD/btrfs/x86_64-rhel-7.6/100/debian-x86_64-20191114.cgz/300s/lkp-csl-2sp6/disk/reaim/0x52c
> 
> commit: 
>   32e534402a ("Btrfs: wake up inode cache waiters sooner to reduce waiting 
> time")
>   e678934cbe ("btrfs: Remove unnecessary check from join_running_log_trans")
> 
> 32e534402ad52e9f e678934cbe5f026c2765a1da651 
>  --- 
>fail:runs  %reproductionfail:runs
>| | |
>:4   25%   1:4 
> dmesg.WARNING:at#for_ip_swapgs_restore_regs_and_return_to_usermode/0x
>  %stddev %change %stddev
>  \  |\  
>   5.39+6.4%   5.73 ±  6%  reaim.child_utime
>   6882 ±  2% -30.7%   4771 ±  2%  reaim.jobs_per_min
>  68.82 ±  2% -30.7%  47.72 ±  2%  reaim.jobs_per_min_child
>  98.56-3.4%  95.25reaim.jti
>   6978 ±  2% -29.5%   4917 ±  2%  reaim.max_jobs_per_min
>  87.24 ±  2% +44.3% 125.92 ±  2%  reaim.parent_time
>   0.97 ±  5%+337.7%   4.26 ±  9%  reaim.std_dev_percent
>   0.84 ±  5%+504.3%   5.05 ±  7%  reaim.std_dev_time
> 357.77 ±  2%  +7.4% 384.27 ±  2%  reaim.time.elapsed_time
> 357.77 ±  2%  +7.4% 384.27 ±  2%  reaim.time.elapsed_time.max
> 276434   -25.6% 205672reaim.time.file_system_inputs
>   10740316   -20.1%8583748 ±  2%  reaim.time.file_system_outputs
>  63443   -13.8%  54688
> reaim.time.involuntary_context_switches
>6205791   -24.9%4661554reaim.time.minor_page_faults
>  84.25 ±  2% -34.4%  55.25 ± 11%  
> reaim.time.percent_of_cpu_this_job_got
> 281.73   -30.3% 196.49 ± 11%  reaim.time.system_time
>  21.59   -20.2%  17.23 ±  6%  reaim.time.user_time
>2253123   -10.4%2017718
> reaim.time.voluntary_context_switches
>  4   -25.0%  3reaim.workload
>  92.59+2.6%  95.01iostat.cpu.idle
>   6.37   -32.9%   4.27iostat.cpu.iowait
>   0.97 ±  4% -32.4%   0.65 ± 13%  iostat.cpu.system
>   6.41-2.14.29mpstat.cpu.all.iowait%
>   0.02 ±  6%  -0.00.01 ±  4%  mpstat.cpu.all.soft%
>   0.95 ±  4%  -0.30.64 ± 13%  mpstat.cpu.all.sys%
>   0.07 ±  2%  -0.00.06 ±  3%  mpstat.cpu.all.usr%
>3177396 ±  4% -17.2%2630192 ±  2%  numa-numastat.node0.local_node
>3208408 ±  3% -17.3%2653430 ±  3%  numa-numastat.node0.numa_hit
>3183435 ±  4% -21.0%2515397 ±  2%  numa-numastat.node1.local_node
>3183678 ±  4% -20.7%2523398 ±  2%  numa-numastat.node1.numa_hit
>  92.00+2.4%  94.25vmstat.cpu.id
> 382.00   -30.6% 265.25vmstat.io.bi
>  26136   -29.9%  18312vmstat.io.bo
>   6.00   -37.5%   3.75 ± 11%  vmstat.procs.b
>  20123 ±  2% -19.6%  16189vmstat.system.cs
>1044628   -21.1% 824233 ±  4%  meminfo.Active
> 758190   -28.5% 542058 ±  7%  meminfo.Active(file)
> 163791   +76.7% 289387 ±  6%  meminfo.Inactive
> 145927   +86.1% 271504 ±  6%  meminfo.Inactive(file)
> 

Re: [LKP] Re: 28307d938f ("percpu: make pcpu_alloc() aware of current gfp .."): BUG: kernel reboot-without-warning in boot stage

2020-06-02 Thread Filipe Manana



On 02/06/20 05:37, Li Zhijian wrote:
> Hi Filipe,
> 
> LKP checked blow dmesg as the indicator in this problem
> 
>>> [    0.144174] RAMDISK: [mem 0x7fa2e000-0x7fff]
>>> [    0.144559] ACPI: Early table checksum verification disabled
>>> [    0.144985] ACPI: RSDP 0x000F5850 14 (v00 BOCHS )
>>> [    0.145424] ACPI: RSDT 0xBFFE15C9 30 (v01 BOCHS 
>>> BXPCRSDT 0001 BXPC 0001)
>>> [    0.146051] ACPI: FACP 0xBFFE149D 74 (v01 BOCHS 
>>> BXPCFACP 0001 BXPC 0001)
>>> BUG: kernel reboot-without-warning in boot stage
>>>
> 
> And i have reproduced it with script in attachment. this issue is gone
> when i reverted this commit 28307d938f
> 
> Please note that
> - i reproduced it with kernel compiled by gcc-5
> - i failed to reproduce it in kernel compiled by gcc-7 so far

Odd.
Here I tested with:

$ gcc --version
gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
Copyright (C) 2016 Free Software Foundation, Inc.

> 
> :~/1787$ ./reproduce.sh obj/arch/x86/boot/bzImage
> qemu-system-x86_64 -enable-kvm -cpu Haswell,+smep,+smap -kernel
> obj/arch/x86/boot/bzImage -m 8192 -smp 2 -device e1000,netdev=net0
> -netdev user,id=net0,hostfwd=tcp::32032-:22 -boot order=nc -no-reboot
> -watchdog i6300esb -watchdog-action debug -rtc base=localtime -serial
> stdio -display none -monitor null -append root=/dev/ram0
> hung_task_panic=1 debug apic=debug sysrq_always_enabled
> rcupdate.rcu_cpu_stall_timeout=100 net.ifnames=0 printk.devkmsg=on
> panic=-1 softlockup_panic=1 nmi_watchdog=panic oops=panic load_ramdisk=2
> prompt_ramdisk=0 drbd.minor_count=8 systemd.log_level=err
> ignore_loglevel console=tty0 earlyprintk=ttyS0,115200
> console=ttyS0,115200 vga=normal rw rcuperf.shutdown=0 watchdog_thresh=60
> early console in setup code
> Wrong EFI loader signature.
> early console in extract_kernel
> input_data: 0x06f752d8
> input_len: 0x0130dd3c
> output: 0x0100
> output_len: 0x07200a48
> kernel_total_size: 0x06826000
> needed_size: 0x0740
> trampoline_32bit: 0x0009d000
> 
> Decompressing Linux... Parsing ELF... done.
> Booting the kernel.
> [    0.00] Linux version 5.7.0-rc4-00168-g28307d938fb2
> (lizhijian@shao2-debian) (gcc version 5.5.0 20171010 (Debian 5.5.0-12),
> GNU ld (GNU Binutils for Debian) 2.34) #2 SMP PREEMPT Tue Jun 2 11:23:59
> CST 2020
> [    0.00] Command line: root=/dev/ram0 hung_task_panic=1 debug
> apic=debug sysrq_always_enabled rcupdate.rcu_cpu_stall_timeout=100
> net.ifnames=0 printk.devkmsg=on panic=-1 softlockup_panic=1
> nmi_watchdog=panic oops=panic load_ramdisk=2 prompt_ramdisk=0
> drbd.minor_count=8 systemd.log_level=err ignore_loglevel console=tty0
> earlyprintk=ttyS0,115200 console=ttyS0,115200 vga=normal rw
> rcuperf.shutdown=0 watchdog_thresh=60
> [    0.00] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating
> point registers'
> [    0.00] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
> [    0.00] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
> [    0.00] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]: 256
> [    0.00] x86/fpu: Enabled xstate features 0x7, context size is 832
> bytes, using 'standard' format.
> [    0.00] BIOS-provided physical RAM map:
> [    0.00] BIOS-e820: [mem 0x-0x0009fbff]
> usable
> [    0.00] BIOS-e820: [mem 0x0009fc00-0x0009]
> reserved
> [    0.00] BIOS-e820: [mem 0x000f-0x000f]
> reserved
> [    0.00] BIOS-e820: [mem 0x0010-0xbffd]
> usable
> [    0.00] BIOS-e820: [mem 0xbffe-0xbfff]
> reserved
> [    0.00] BIOS-e820: [mem 0xfeffc000-0xfeff]
> reserved
> [    0.00] BIOS-e820: [mem 0xfffc-0x]
> reserved
> [    0.00] BIOS-e820: [mem 0x0001-0x00023fff]
> usable
> [    0.00] printk: debug: ignoring loglevel setting.
> [    0.00] printk: bootconsole [earlyser0] enabled
> [    0.00] NX (Execute Disable) protection: active
> [    0.00] SMBIOS 2.8 present.
> [    0.00] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.12.0-1 04/01/2014
> [    0.00] Hypervisor detected: KVM
> [    0.00] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [    0.02] kvm-clock: cpu 0, msr 7601001, primary cpu clock
> [    0.02] kvm-clock: using sched offset of 2661499940 cycles
> [    0.000603] clocksource: kvm-clock: mask: 0x
> max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
> [    0.002261] tsc: Detected 3407.998 MHz processor
> [    0.005351] e820: update [mem 0x-0x0fff] usable ==> reserved
> [    0.005986] e820: remove [mem 0x000a-0x000f] usable
> [    0.006535] last_pfn = 0x24 max_arch_pfn = 0x4
> [    0.007091] MTRR default type: write-back
> [    0.007477] MTRR fixed ranges enabled:
> [    0.007845]   0-9 write-back

Re: 28307d938f ("percpu: make pcpu_alloc() aware of current gfp .."): BUG: kernel reboot-without-warning in boot stage

2020-05-29 Thread Filipe Manana



On 29/05/20 08:16, kernel test robot wrote:
> Greetings,
> 
> 0day kernel testing robot got the below dmesg and the first bad commit is
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> 
> commit 28307d938fb2e4056ed4c982c06d1503d7719813
> Author: Filipe Manana 
> AuthorDate: Thu May 7 18:36:10 2020 -0700
> Commit: Linus Torvalds 
> CommitDate: Thu May 7 19:27:21 2020 -0700
> 
> percpu: make pcpu_alloc() aware of current gfp context
> 
> Since 5.7-rc1, on btrfs we have a percpu counter initialization for
> which we always pass a GFP_KERNEL gfp_t argument (this happens since
> commit 2992df73268f78 ("btrfs: Implement DREW lock")).
> 
> That is safe in some contextes but not on others where allowing fs
> reclaim could lead to a deadlock because we are either holding some
> btrfs lock needed for a transaction commit or holding a btrfs
> transaction handle open.  Because of that we surround the call to the
> function that initializes the percpu counter with a NOFS context using
> memalloc_nofs_save() (this is done at btrfs_init_fs_root()).
> 
> However it turns out that this is not enough to prevent a possible
> deadlock because percpu_alloc() determines if it is in an atomic context
> by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
> case) and it is not aware that a NOFS context is set.
> 
> Because percpu_alloc() thinks it is in a non atomic context it locks the
> pcpu_alloc_mutex.  This can result in a btrfs deadlock when
> pcpu_balance_workfn() is running, has acquired that mutex and is waiting
> for reclaim, while the btrfs task that called percpu_counter_init() (and
> therefore percpu_alloc()) is holding either the btrfs commit_root
> semaphore or a transaction handle (done fs/btrfs/backref.c:
> iterate_extent_inodes()), which prevents reclaim from finishing as an
> attempt to commit the current btrfs transaction will deadlock.
> 
> Lockdep reports this issue with the following trace:
> 
>   ==
>   WARNING: possible circular locking dependency detected
>   5.6.0-rc7-btrfs-next-77 #1 Not tainted
>   --
>   kswapd0/91 is trying to acquire lock:
>   8938a3b3fdc8 (&delayed_node->mutex){+.+.}, at: 
> __btrfs_release_delayed_node.part.0+0x3f/0x320 [btrfs]
> 
>   but task is already holding lock:
>   b4f0dbc0 (fs_reclaim){+.+.}, at: __fs_reclaim_acquire+0x5/0x30
> 
>   which lock already depends on the new lock.
> 
>   the existing dependency chain (in reverse order) is:
> 
>   -> #4 (fs_reclaim){+.+.}:
>  fs_reclaim_acquire.part.0+0x25/0x30
>  __kmalloc+0x5f/0x3a0
>  pcpu_create_chunk+0x19/0x230
>  pcpu_balance_workfn+0x56a/0x680
>  process_one_work+0x235/0x5f0
>  worker_thread+0x50/0x3b0
>  kthread+0x120/0x140
>  ret_from_fork+0x3a/0x50
> 
>   -> #3 (pcpu_alloc_mutex){+.+.}:
>  __mutex_lock+0xa9/0xaf0
>  pcpu_alloc+0x480/0x7c0
>  __percpu_counter_init+0x50/0xd0
>  btrfs_drew_lock_init+0x22/0x70 [btrfs]
>  btrfs_get_fs_root+0x29c/0x5c0 [btrfs]
>  resolve_indirect_refs+0x120/0xa30 [btrfs]
>  find_parent_nodes+0x50b/0xf30 [btrfs]
>  btrfs_find_all_leafs+0x60/0xb0 [btrfs]
>  iterate_extent_inodes+0x139/0x2f0 [btrfs]
>  iterate_inodes_from_logical+0xa1/0xe0 [btrfs]
>  btrfs_ioctl_logical_to_ino+0xb4/0x190 [btrfs]
>  btrfs_ioctl+0x165a/0x3130 [btrfs]
>  ksys_ioctl+0x87/0xc0
>  __x64_sys_ioctl+0x16/0x20
>  do_syscall_64+0x5c/0x260
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>   -> #2 (&fs_info->commit_root_sem){}:
>  down_write+0x38/0x70
>  btrfs_cache_block_group+0x2ec/0x500 [btrfs]
>  find_free_extent+0xc6a/0x1600 [btrfs]
>  btrfs_reserve_extent+0x9b/0x180 [btrfs]
>  btrfs_alloc_tree_block+0xc1/0x350 [btrfs]
>  alloc_tree_block_no_bg_flush+0x4a/0x60 [btrfs]
>  __btrfs_cow_block+0x122/0x5a0 [btrfs]
>  btrfs_cow_block+0x106/0x240 [btrfs]
>  commit_cowonly_roots+0x55/0x310 [btrfs]
>  btrfs_commit_transaction+0x509/0xb20 [btrfs]
>  sync_filesystem+0x74/0x90
>  generic_shutdown

Re: [PATCH] percpu: make pcpu_alloc() aware of current gfp context

2020-04-30 Thread Filipe Manana
On Thu, Apr 30, 2020 at 11:23 PM Dennis Zhou  wrote:
>
> On Thu, Apr 30, 2020 at 02:40:18PM -0700, Andrew Morton wrote:
> > On Thu, 30 Apr 2020 17:43:56 +0100 fdman...@kernel.org wrote:
> >
> > > From: Filipe Manana 
> > >
> > > Since 5.7-rc1, on btrfs we have a percpu counter initialization for which
> > > we always pass a GFP_KERNEL gfp_t argument (this happens since commit
> > > 2992df73268f78 ("btrfs: Implement DREW lock")).  That is safe in some
> > > contextes but not on others where allowing fs reclaim could lead to a
> > > deadlock because we are either holding some btrfs lock needed for a
> > > transaction commit or holding a btrfs transaction handle open.  Because
> > > of that we surround the call to the function that initializes the percpu
> > > counter with a NOFS context using memalloc_nofs_save() (this is done at
> > > btrfs_init_fs_root()).
> > >
> > > However it turns out that this is not enough to prevent a possible
> > > deadlock because percpu_alloc() determines if it is in an atomic context
> > > by looking exclusively at the gfp flags passed to it (GFP_KERNEL in this
> > > case) and it is not aware that a NOFS context is set.  Because it thinks
> > > it is in a non atomic context it locks the pcpu_alloc_mutex, which can
> > > result in a btrfs deadlock when pcpu_balance_workfn() is running, has
> > > acquired that mutex and is waiting for reclaim, while the btrfs task that
> > > called percpu_counter_init() (and therefore percpu_alloc()) is holding
> > > either the btrfs commit_root semaphore or a transaction handle (done at
> > > fs/btrfs/backref.c:iterate_extent_inodes()), which prevents reclaim from
> > > finishing as an attempt to commit the current btrfs transaction will
> > > deadlock.
> > >
> >
> > Patch looks good and seems sensible, thanks.
> >
>
> Acked-by: Dennis Zhou 
>
> > But why did btrfs use memalloc_nofs_save()/restore() rather than
> > s/GFP_KERNEL/GFP_NOFS/?
>
> I would also like to know.

For 2 reasons:

1) It's the preferred way to do it since
memalloc_nofs_save()/restore() was added (according to
Documentation/core-api/gfp_mask-from-fs-io.rst);

2) According to Documentation/core-api/gfp_mask-from-fs-io.rst,
passing GFP_NOFS to __vmalloc() doesn't work, so one has to use the
memalloc_nofs_save()/restore() API for that. And pcpu_alloc() calls
helpers that end up calling __vmalloc() (through pcpu_mem_zalloc()).

And that's it.

Thanks.

>
> Thanks,
> Dennis


Re: [LKP] [Btrfs] 7cc93f268e: WARNING:at_fs/btrfs/transaction.c:#cleanup_transaction[btrfs]

2018-11-05 Thread Filipe Manana



On 05/11/2018 05:50, kernel test robot wrote:
> FYI, we noticed the following commit (built with gcc-7):
> 
> commit: 7cc93f268e0ea46570963aa6e09c44abe3732efe ("Btrfs: fix deadlock on 
> tree root leaf when finding free extent")
> https://github.com/0day-ci/linux 
> UPDATE-20181023-032539/fdmanana-kernel-org/Btrfs-fix-deadlock-on-tree-root-leaf-when-finding-free-extent/20181022-173541

Hi,

You are testing an earlier version of the patch that introduced -ENOSPC
errors.

Commit 7cc93f268e0ea46570963aa6e09c44abe3732efe doesn't exist anymore
and got replaced by commit 931e9658e7b52b9e8aacf30e6d7dbfc57b7c67be (the
latest version of the patch without the -ENOSPC regression).

thanks

> 
> in testcase: xfstests
> with following parameters:
> 
>   disk: 6HDD
>   fs: btrfs
>   test: btrfs-group1
> 
> test-description: xfstests is a regression test suite for xfs and other files 
> ystems.
> test-url: git://git.kernel.org/pub/scm/fs/xfs/xfstests-dev.git
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu qemu64,+ssse3 -smp 2 -m 
> 4G
> 
> caused below changes (please refer to attached dmesg/kmsg for entire 
> log/backtrace):
> 
> 
> +--+---++
> | 
>  | v4.19 | 7cc93f268e |
> +--+---++
> | boot_successes  
>  | 5027  | 6  |
> | boot_failures   
>  | 1199  | 29 |
> | WARNING:stack_recursion 
>  | 772   ||
> | WARNING:at#for_ip_swapgs_restore_regs_and_return_to_usermode/0x 
>  | 687   ||
> | cpu_clock_throttled 
>  | 83||
> | WARNING:at#for_ip_interrupt_entry/0x
>  | 96||
> | WARNING:at_ip_xfs_inode_item_format/0x  
>  | 2 ||
> | WARNING:at_ip_fsnotify/0x   
>  | 82||
> | WARNING:at_ip__slab_free/0x 
>  | 17||
> | WARNING:at_ip__mutex_lock/0x
>  | 3 ||
> | WARNING:at_ip_perf_event_mmap_output/0x 
>  | 2 ||
> | WARNING:at_ip_native_sched_clock/0x 
>  | 5 ||
> | WARNING:at_ip_ip_finish_output2/0x  
>  | 3 ||
> | WARNING:at_drivers/gpu/drm/drm_vblank.c:#drm_wait_one_vblank[drm]   
>  | 20||
> | RIP:drm_wait_one_vblank[drm]
>  | 20||
> | WARNING:at_ip___perf_sw_event/0x
>  | 4 ||
> | PANIC:double_fault  
>  | 1 ||
> | WARNING:stack_going_in_the_wrong_direction?ip=double_fault/0x   
>  | 1 ||
> | RIP:error_entry 
>  | 1 ||
> | stack_segment:#[##] 
>  | 2 ||
> | RIP:kmem_cache_alloc
>  | 2 ||
> | Kernel_panic-not_syncing:Fatal_exception
>  | 4 | 3  |
> | general_protection_fault:#[##]  
>  | 1 ||
> | RIP:__rb_erase_color
>  | 1 ||
> | WARNING:at_ip_do_sys_poll/0x
>  | 3 ||
> | WARNING:at_ip__netif_receive_skb_core/0x
>  | 1 ||
> | WARNING:at_ip__x64_sys_io_submit/0x 
>  | 2 ||
> | WARNING:at_ip_generic_make_request/0x   
>  | 1 ||
> | WARNING:at_ip_smp_call_function_single/0x   
>  | 5 ||
> | WARNING:at_ip_do_filp_open/0x   
>  | 1 ||
> | WARNING:DIRECTORY_MODE_IS_ENABLED.SOME_ERRO

Re: [btrfs] WARNING: CPU: 0 PID: 6379 at fs/direct-io.c:293 dio_complete+0x1d4/0x220

2017-10-31 Thread Filipe Manana
On Mon, Oct 30, 2017 at 7:44 AM, Eryu Guan  wrote:
> Hi Fengguang,
>
> On Mon, Oct 30, 2017 at 08:20:21AM +0100, Fengguang Wu wrote:
>> CC fsdevel.
>>
>> On Sun, Oct 29, 2017 at 11:51:55PM +0100, Fengguang Wu wrote:
>> > Hi Linus,
>> >
>> > Up to now we see the below boot error/warnings when testing v4.14-rc6.
>> >
>> > They hit the RC release mainly due to various imperfections in 0day's
>> > auto bisection. So I manually list them here and CC the likely easy to
>> > debug ones to the corresponding maintainers in the followup emails.
>> >
>> > boot_successes: 4700
>> > boot_failures: 247
>>
>> [...]
>>
>> > WARNING:at_fs/direct-io.c:#dio_complete: 7
>> > WARNING:at_fs/iomap.c:#iomap_dio_complete: 3
>> > WARNING:at_fs/iomap.c:#iomap_dio_rw: 1
>>
>> The first warning happens on btrfs and is bisected to this commit.
>> The other 2 warnings happen on xfs.
>
> I noticed that the warnings are triggered by generic/095 and
> generic/208, they're known to generate such warnings and I think these
> warnings are kind of 'known issue', please see comments above
> _filter_aiodio_dmesg() in fstests/common/filter.

I've hit that warning (fs/direct-io.c:293 dio_complete+0xee/0x1ad)
once with test btrfs/070 (on a 4.14-rc6 kernel).
Only happened once and seems very hard to hit it here. Hadn't spent
time yet figuring out why it happens.

>
> Please make sure your local fstests contains the following 3 commits:
>
> ca93e26865ab common: move _filter_xfs_dmesg() to common/filter
> 5aa662733ab0 common: turn _filter_xfs_dmesg() into _filter_aiodio_dmesg()
> 228aee780f13 generic/036,208: whitelist [iomap_]dio_complete() WARNs
>
> we filtered out such warnings in fstests on purpose so the affected
> tests won't fail because of such dmesg warnings.
>
> Thanks,
> Eryu
>
>>
>> commit 332391a9935da939319e473b4680e173df75afcf
>> Author: Lukas Czerner 
>> AuthorDate: Thu Sep 21 08:16:29 2017 -0600
>> Commit: Jens Axboe 
>> CommitDate: Mon Sep 25 08:56:05 2017 -0600
>>
>>fs: Fix page cache inconsistency when mixing buffered and AIO DIO
>>
>>Currently when mixing buffered reads and asynchronous direct writes it
>>is possible to end up with the situation where we have stale data in the
>>page cache while the new data is already written to disk. This is
>>permanent until the affected pages are flushed away. Despite the fact
>>that mixing buffered and direct IO is ill-advised it does pose a thread
>>for a data integrity, is unexpected and should be fixed.
>>
>>Fix this by deferring completion of asynchronous direct writes to a
>>process context in the case that there are mapped pages to be found in
>>the inode. Later before the completion in dio_complete() invalidate
>>the pages in question. This ensures that after the completion the pages
>>in the written area are either unmapped, or populated with up-to-date
>>data. Also do the same for the iomap case which uses
>>iomap_dio_complete() instead.
>>
>>This has a side effect of deferring the completion to a process context
>>for every AIO DIO that happens on inode that has pages mapped. However
>>since the consensus is that this is ill-advised practice the performance
>>implication should not be a problem.
>>
>>This was based on proposal from Jeff Moyer, thanks!
>>
>>Reviewed-by: Jan Kara 
>>Reviewed-by: Darrick J. Wong 
>>Reviewed-by: Jeff Moyer 
>>Signed-off-by: Lukas Czerner 
>>Signed-off-by: Jens Axboe 
>> ---
>> fs/direct-io.c | 47 ++-
>> fs/iomap.c | 29 -
>> mm/filemap.c   |  6 ++
>> 3 files changed, 64 insertions(+), 18 deletions(-)
>>
>> The call traces are:
>>
>> [  334.461955] BTRFS info (device vdb): has skinny extents
>> [  334.463231] BTRFS info (device vdb): flagging fs with big metadata feature
>> [  334.469746] BTRFS info (device vdb): creating UUID tree
>> [  336.190840] [ cut here ]
>> [  336.193338] WARNING: CPU: 0 PID: 6379 at fs/direct-io.c:293 
>> dio_complete+0x1d4/0x220
>> [  336.196925] Modules linked in: btrfs xor zstd_decompress zstd_compress 
>> xxhash raid6_pq dm_mod rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver sr_mod 
>> cdrom sg ata_generic pata_acpi ppdev snd_pcm snd_timer snd soundcore 
>> serio_raw parport_pc pcspkr parport floppy ata_piix libata i2c_piix4 
>> ip_tables
>> [  336.203746] CPU: 0 PID: 6379 Comm: kworker/0:0 Not tainted 4.14.0-rc6 #1
>> [  336.204799] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>> 1.10.2-1 04/01/2014
>> [  336.207480] Workqueue: dio/vdb dio_aio_complete_work
>> [  336.208373] task: 880079094a80 task.stack: c995
>> [  336.210495] RIP: 0010:dio_complete+0x1d4/0x220
>> [  336.211288] RSP: 0018:c9953e20 EFLAGS: 00010286
>> [  336.213145] RAX: fff0 RBX: 8800aae8c780 RCX: 
>> c9953c70
>> [  336.214232] RDX: 8000 RSI: 02b4 RDI: 
>> 81c

Re: [PATCH v3 46/49] fs/btrfs: convert to bio_for_each_segment_all_sp()

2017-08-08 Thread Filipe Manana
On Tue, Aug 8, 2017 at 9:45 AM, Ming Lei  wrote:
> Cc: Chris Mason 
> Cc: Josef Bacik 
> Cc: David Sterba 
> Cc: linux-bt...@vger.kernel.org
> Signed-off-by: Ming Lei 

Can you please add some meaningful changelog? E.g., why is this
conversion needed.

> ---
>  fs/btrfs/compression.c |  3 ++-
>  fs/btrfs/disk-io.c |  3 ++-
>  fs/btrfs/extent_io.c   | 12 
>  fs/btrfs/inode.c   |  6 --
>  fs/btrfs/raid56.c  |  1 +
>  5 files changed, 17 insertions(+), 8 deletions(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 28746588f228..55f251a83d0b 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -147,13 +147,14 @@ static void end_compressed_bio_read(struct bio *bio)
> } else {
> int i;
> struct bio_vec *bvec;
> +   struct bvec_iter_all bia;
>
> /*
>  * we have verified the checksum already, set page
>  * checked so the end_io handlers know about it
>  */
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, cb->orig_bio, i)
> +   bio_for_each_segment_all_sp(bvec, cb->orig_bio, i, bia)
> SetPageChecked(bvec->bv_page);
>
> bio_endio(cb->orig_bio);
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 080e2ebb8aa0..a9cd75e6383d 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -963,9 +963,10 @@ static blk_status_t btree_csum_one_bio(struct bio *bio)
> struct bio_vec *bvec;
> struct btrfs_root *root;
> int i, ret = 0;
> +   struct bvec_iter_all bia;
>
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, bio, i) {
> +   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
> root = BTRFS_I(bvec->bv_page->mapping->host)->root;
> ret = csum_dirty_buffer(root->fs_info, bvec->bv_page);
> if (ret)
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index c8f6a8657bf2..4de9cfd1c385 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -2359,8 +2359,9 @@ static unsigned int get_bio_pages(struct bio *bio)
>  {
> unsigned i;
> struct bio_vec *bv;
> +   struct bvec_iter_all bia;
>
> -   bio_for_each_segment_all(bv, bio, i)
> +   bio_for_each_segment_all_sp(bv, bio, i, bia)
> ;
>
> return i;
> @@ -2463,9 +2464,10 @@ static void end_bio_extent_writepage(struct bio *bio)
> u64 start;
> u64 end;
> int i;
> +   struct bvec_iter_all bia;
>
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, bio, i) {
> +   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
> struct page *page = bvec->bv_page;
> struct inode *inode = page->mapping->host;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> @@ -2534,9 +2536,10 @@ static void end_bio_extent_readpage(struct bio *bio)
> int mirror;
> int ret;
> int i;
> +   struct bvec_iter_all bia;
>
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, bio, i) {
> +   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
> struct page *page = bvec->bv_page;
> struct inode *inode = page->mapping->host;
> struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
> @@ -3693,9 +3696,10 @@ static void end_bio_extent_buffer_writepage(struct bio 
> *bio)
> struct bio_vec *bvec;
> struct extent_buffer *eb;
> int i, done;
> +   struct bvec_iter_all bia;
>
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, bio, i) {
> +   bio_for_each_segment_all_sp(bvec, bio, i, bia) {
> struct page *page = bvec->bv_page;
>
> eb = (struct extent_buffer *)page->private;
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 084ed99dd308..eeb2ff662ec4 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -8047,6 +8047,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
> struct bio_vec *bvec;
> struct extent_io_tree *io_tree, *failure_tree;
> int i;
> +   struct bvec_iter_all bia;
>
> if (bio->bi_status)
> goto end;
> @@ -8064,7 +8065,7 @@ static void btrfs_retry_endio_nocsum(struct bio *bio)
>
> done->uptodate = 1;
> ASSERT(!bio_flagged(bio, BIO_CLONED));
> -   bio_for_each_segment_all(bvec, bio, i)
> +   bio_for_each_segment_all_sp(bvec, bio, i, bia)
> clean_io_failure(BTRFS_I(inode)->root->fs_info, failure_tree,
>  io_tree, done->start, bvec->bv_page,
>  btrfs_ino(BTRFS_I(inode)), 0);
> @@ -8143,6 +8144,7 @@ static void btrfs_retr

Re: S2ram + btrfs with recent kernel => soft lockup warning and hard hang

2017-02-13 Thread Filipe Manana


On 02/12/2017 03:45 PM, Bastien ROUCARIES wrote:
> Hi,
> 
> Since 3.8 s2ram on btrfs system lead to hang.
> 
> Could not save a trace will try to get a few picture but trace are similar to 
> :
> https://github.com/docker/docker/issues/15654
> and
> https://patchwork.kernel.org/patch/8352811/
> 
> It is 100% reproducible.

3.8 is very old, and so is 3.19 (which is what you're running as shown
in the github issue). Many fixes landed after 3.19. Just try a recent
kernel.

> 
> mount table
> /dev/mapper/portable2015--bastien--vg-HOME on /var/cache/pbuilder type
> btrfs 
> (rw,relatime,lazytime,compress=zlib,space_cache,subvolid=934,subvol=/pbuilder)
> /dev/mapper/portable2015--bastien--vg-HOME on /home type btrfs
> (rw,nosuid,nodev,relatime,lazytime,compress=zlib,space_cache,subvolid=933,subvol=/home)
> 
> btrfs is over dm-crypt
> 
> Bastien
> 


Re: [PATCH 0/1] btrfs lockdep annotation

2017-01-24 Thread Filipe Manana
On Tue, Jan 24, 2017 at 9:01 AM, Christian Borntraeger
 wrote:
> Chris,
>
> since my bug report about this did not result in any fix and since

It was fixed and the fix landed in 4.10-rc4:

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=781feef7e6befafd4d9787d1f7ada1f9ccd504e4

> this disables lockdep before the the code that I want to debug runs
> here is my attempt to fix it.
> Please double check if the subclass looks right. It seems to work
> for me but I do not know enough about btrfs to decide if this is
> right or not.
>
> Christian Borntraeger ():
>   btrfs: add lockdep annotation for btrfs_log_inode
>
>  fs/btrfs/tree-log.c| 2 +-
>
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"People will forget what you said,
 people will forget what you did,
 but people will never forget how you made them feel."


Re: [PATCH] btrfs:Change BUG_ON to new error path in __clear_extent_bit

2016-04-06 Thread Filipe Manana
On Wed, Apr 6, 2016 at 4:56 PM, Bastien Philbert
 wrote:
> This remove the unnessary BUG_ON if the allocation with
> alloc_extent_state_atomic fails due to this function
> failure not being unrecoverable. Instead we now change
> this BUG_ON into a new error path that jumps to the goto
> label, out from freeing previously allocated resources
> before returning the error code -ENOMEM to signal callers
> that the call to __clear_extent_bit failed due to a memory
> allocation failure.
>
> Signed-off-by: Bastien Philbert 
> ---
>  fs/btrfs/extent_io.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index d247fc0..4c87b77 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -682,7 +682,10 @@ hit_next:
>
> if (state->start < start) {
> prealloc = alloc_extent_state_atomic(prealloc);
> -   BUG_ON(!prealloc);
> +   if (!prealloc) {
> +   err = -ENOMEM;
> +   goto out;
> +   }

Why only in this particular place? We do this in other places in this function.

Second, setting err to -ENOMEM is not enough. Under the out label we
always return 0, so you're not propagating the error to callers.

Now, most importantly, you didn't check if callers handle errors from
this function (__clear_extent_bit()) at all. A failure in this
function is critical.
For example, it can cause a range in an inode's io tree to become
locked forever, blocking any other tasks that want to operate on the
range, and we won't ever know what happened.
So it's far from trivial to handle errors from this function and
that's why the BUG_ON is there.

If you really want to get rid of the BUG_ON() calls you need to make
sure all callers don't ignore the errors and that they deal with them
properly.


> err = split_state(tree, state, prealloc, start);
> if (err)
> extent_io_tree_panic(tree, err);
> --
> 2.5.0
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: [PATCH 2/2] btrfs: avoid overflowing f_bfree

2016-03-30 Thread Filipe Manana
On Wed, Mar 30, 2016 at 9:53 PM, Luis de Bethencourt
 wrote:
> Since mixed block groups accounting isn't byte-accurate and f_bree is an
> unsigned integer, it could overflow. Avoid this.
>
> Signed-off-by: Luis de Bethencourt 
> Suggested-by: David Sterba 
> ---
>  fs/btrfs/super.c | 5 +
>  1 file changed, 5 insertions(+)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index bdca79c..93376d0 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -2101,6 +2101,11 @@ static int btrfs_statfs(struct dentry *dentry, struct 
> kstatfs *buf)
> /* Account global block reserve as used, it's in logical size already 
> */
> spin_lock(&block_rsv->lock);
> buf->f_bfree -= block_rsv->size >> bits;

You forgot to remove the line above, didn't you?

> +   /* Mixed block groups accounting is not byte-accurate, avoid overflow 
> */
> +   if (buf->f_bfree >= block_rsv->size >> bits)
> +   buf->f_bfree -= block_rsv->size >> bits;
> +   else
> +   buf->f_bfree = 0;
> spin_unlock(&block_rsv->lock);
>
> buf->f_bavail = div_u64(total_free_data, factor);
> --
> 2.5.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."


Re: PAX: size overflow detected in function try_merge_map fs/btrfs/extent_map.c:238

2015-11-27 Thread Filipe Manana
On Fri, Nov 27, 2015 at 11:51 AM, Holger Hoffstätte
 wrote:
> On 11/27/15 12:20, Toralf Förster wrote:
>> On 11/27/2015 12:07 PM, Filipe Manana wrote:
>>> Try the following (also pasted at
>>> https://friendpaste.com/5O6o1cqWqJZDIKrH1YqG7y):
>>
>> Doesn't apply neither against the used 4.2.6 kernel nor aginst current git 
>> HEAD :
>>
>> t44 linux # patch -p1 --dry-run < 
>> /home/tfoerste/Downloads/5O6o1cqWqJZDIKrH1YqG7y.diff.patch
>> checking file fs/btrfs/extent_map.c
>> Hunk #1 FAILED at 235.
>> Hunk #2 FAILED at 252.
>> 2 out of 2 hunks FAILED
>>
>>
>> tfoerste@t44 ~/devel/linux $ patch -p1 --dry-run < 
>> ~/Downloads/5O6o1cqWqJZDIKrH1YqG7y.diff.patch
>> checking file fs/btrfs/extent_map.c
>> Hunk #1 FAILED at 235.
>> Hunk #2 FAILED at 252.
>> 2 out of 2 hunks FAILED
>>
>
> Toralf,
>
> try with --ignore-whitespace, that works for me. Seems the pastebin ate
> some formatting.

Indeed.
Try the following instead:  http://paste.opensuse.org/view/raw/58412382

thanks

>
> -h
>



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PAX: size overflow detected in function try_merge_map fs/btrfs/extent_map.c:238

2015-11-27 Thread Filipe Manana
On Fri, Nov 27, 2015 at 11:20 AM, Toralf Förster  wrote:
> On 11/27/2015 12:07 PM, Filipe Manana wrote:
>> Try the following (also pasted at
>> https://friendpaste.com/5O6o1cqWqJZDIKrH1YqG7y):
>
> Doesn't apply neither against the used 4.2.6 kernel nor aginst current git 
> HEAD :

Quite probable, this was against the integration branch for btrfs.
You should be able to apply it manually, it's a trivial change and
extent_map.c did not change in any significant way.

>
> t44 linux # patch -p1 --dry-run < 
> /home/tfoerste/Downloads/5O6o1cqWqJZDIKrH1YqG7y.diff.patch
> checking file fs/btrfs/extent_map.c
> Hunk #1 FAILED at 235.
> Hunk #2 FAILED at 252.
> 2 out of 2 hunks FAILED
>
>
> tfoerste@t44 ~/devel/linux $ patch -p1 --dry-run < 
> ~/Downloads/5O6o1cqWqJZDIKrH1YqG7y.diff.patch
> checking file fs/btrfs/extent_map.c
> Hunk #1 FAILED at 235.
> Hunk #2 FAILED at 252.
> 2 out of 2 hunks FAILED
>
> --
> Toralf, pgp: C4EACDDE 0076E94E



-- 
Filipe David Manana,

"Reasonable men adapt themselves to the world.
 Unreasonable men adapt the world to themselves.
 That's why all progress depends on unreasonable men."
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: PAX: size overflow detected in function try_merge_map fs/btrfs/extent_map.c:238

2015-11-27 Thread Filipe Manana
On Fri, Nov 27, 2015 at 10:47 AM, Toralf Förster  wrote:
> Happened today few times in a row at a stable 64 bit Gentoo hardened system:
>
>
>
> Nov 27 10:23:09 t44 kernel: [41619.519921] PAX: size overflow detected in 
> function try_merge_map fs/btrfs/extent_map.c:238 cicus.107_102 max, count: 
> 13, decl: block_len; num: 0; context: extent_map;
> Nov 27 10:23:09 t44 kernel: [41619.519929] CPU: 2 PID: 3361 Comm: 
> host_jskwgen Tainted: GW   4.2.6-hardened-r6 #3
> Nov 27 10:23:09 t44 kernel: [41619.519932] Hardware name: LENOVO 
> 20AQCTO1WW/20AQCTO1WW, BIOS GJET83WW (2.33 ) 03/09/2015
> Nov 27 10:23:09 t44 kernel: [41619.519934]  81831343  
> 8183132d c9000298b6e8
> Nov 27 10:23:09 t44 kernel: [41619.519939]  815ee0ea 88033e30eec8 
> 81831343 c9000298b718
> Nov 27 10:23:09 t44 kernel: [41619.519943]  811ade1b 8802fb611480 
> 88032b717510 88032b74fae0
> Nov 27 10:23:09 t44 kernel: [41619.519946] Call Trace:
> Nov 27 10:23:09 t44 kernel: [41619.519955]  [] 
> dump_stack+0x45/0x5d
> Nov 27 10:23:09 t44 kernel: [41619.519959]  [] 
> report_size_overflow+0x3b/0x50
> Nov 27 10:23:09 t44 kernel: [41619.519963]  [] 
> try_merge_map+0x1f1/0x310
> Nov 27 10:23:09 t44 kernel: [41619.519966]  [] 
> add_extent_mapping+0x132/0x1c0
> Nov 27 10:23:09 t44 kernel: [41619.519968]  [] 
> btrfs_get_extent+0x659/0xdd0
> Nov 27 10:23:09 t44 kernel: [41619.519972]  [] ? 
> kmem_cache_alloc+0x32/0x140
> Nov 27 10:23:09 t44 kernel: [41619.519975]  [] 
> __do_readpage+0x6f2/0xc30
> Nov 27 10:23:09 t44 kernel: [41619.519977]  [] ? 
> __set_extent_bit+0x14e/0x580
> Nov 27 10:23:09 t44 kernel: [41619.519979]  [] ? 
> btrfs_real_readdir+0x6f0/0x6f0
> Nov 27 10:23:09 t44 kernel: [41619.519983]  [] ? 
> _raw_spin_unlock_irq+0x19/0x30
> Nov 27 10:23:09 t44 kernel: [41619.519985]  [] ? 
> btrfs_lookup_ordered_extent+0xa2/0xe0
> Nov 27 10:23:09 t44 kernel: [41619.519987]  [] 
> __extent_read_full_page+0x1d6/0x210
> Nov 27 10:23:09 t44 kernel: [41619.519989]  [] ? 
> btrfs_real_readdir+0x6f0/0x6f0
> Nov 27 10:23:09 t44 kernel: [41619.519991]  [] ? 
> btrfs_real_readdir+0x6f0/0x6f0
> Nov 27 10:23:09 t44 kernel: [41619.519993]  [] 
> extent_read_full_page+0x4f/0x80
> Nov 27 10:23:09 t44 kernel: [41619.519997]  [] ? 
> lru_cache_add+0x19/0x30
> Nov 27 10:23:09 t44 kernel: [41619.51]  [] ? 
> inode_tree_add+0x150/0x150
> Nov 27 10:23:09 t44 kernel: [41619.52]  [] 
> btrfs_readpage+0x34/0x50
> Nov 27 10:23:09 t44 kernel: [41619.520002]  [] ? 
> inode_tree_add+0x150/0x150
> Nov 27 10:23:09 t44 kernel: [41619.520004]  [] 
> do_read_cache_page+0x99/0x1b0
> Nov 27 10:23:09 t44 kernel: [41619.520006]  [] ? 
> inode_tree_add+0x150/0x150
> Nov 27 10:23:09 t44 kernel: [41619.520008]  [] ? 
> inode_tree_add+0x150/0x150
> Nov 27 10:23:09 t44 kernel: [41619.520009]  [] 
> read_cache_page+0x38/0x50
> Nov 27 10:23:09 t44 kernel: [41619.520012]  [] 
> page_getlink.isra.48.constprop.51+0x3a/0xa0
> Nov 27 10:23:09 t44 kernel: [41619.520014]  [] 
> page_follow_link_light+0x2b/0x50
> Nov 27 10:23:09 t44 kernel: [41619.520016]  [] 
> trailing_symlink+0x27f/0x2b0
> Nov 27 10:23:09 t44 kernel: [41619.520019]  [] 
> path_openat+0x16b/0x1700
> Nov 27 10:23:09 t44 kernel: [41619.520021]  [] 
> do_filp_open+0x81/0xf0
> Nov 27 10:23:09 t44 kernel: [41619.520024]  [] 
> do_sys_open+0x133/0x280
> Nov 27 10:23:09 t44 kernel: [41619.520026]  [] 
> SyS_open+0x31/0x50
> Nov 27 10:23:09 t44 kernel: [41619.520028]  [] 
> entry_SYSCALL_64_fastpath+0x12/0x83

Try the following (also pasted at
https://friendpaste.com/5O6o1cqWqJZDIKrH1YqG7y):

diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c
index 6a98bdd..26b4c13 100644
--- a/fs/btrfs/extent_map.c
+++ b/fs/btrfs/extent_map.c
@@ -235,7 +235,8 @@ static void try_merge_map(struct extent_map_tree
*tree, struct extent_map *em)
em->start = merge->start;
em->orig_start = merge->orig_start;
em->len += merge->len;
-   em->block_len += merge->block_len;
+   if (em->block_start != EXTENT_MAP_HOLE)
+   em->block_len += merge->block_len;
em->block_start = merge->block_start;
em->mod_len = (em->mod_len + em->mod_start) -
merge->mod_start;
em->mod_start = merge->mod_start;
@@ -252,7 +253,8 @@ static void try_merge_map(struct extent_map_tree
*tree, struct extent_map *em)
merge = rb_entry(rb, struct extent_map, rb_node);
if (rb && mergable_maps(em, merge)) {
em->len += merge->len;
-   em->block_len += merge->block_len;
+   if (em->block_start != EXTENT_MAP_HOLE)
+   em->block_len += merge->block_len;
rb_erase(&merge->rb_node, &tree->map);
RB_CLEAR_NODE(&merge->rb_node);
em->mod_len = (merge->mod_start +

Re: [PATCH 3.2 38/52] Btrfs: fix race when listing an inode's xattrs

2015-11-26 Thread Filipe Manana


On 11/26/2015 12:39 AM, Ben Hutchings wrote:
> On Wed, 2015-11-25 at 23:11 +, Luis Henriques wrote:
>> On Tue, Nov 24, 2015 at 10:33:59PM +, Ben Hutchings wrote:
>>> 3.2.74-rc1 review patch.  If anyone has any objections, please let me know.
>>>
>>> ------
>>>
>>> From: Filipe Manana 
>>>
>>> commit f1cd1f0b7d1b5d4aaa5711e8f4e4898b0045cb6d upstream.
>>>
>>> When listing a inode's xattrs we have a time window where we race against
>>> a concurrent operation for adding a new hard link for our inode that makes
>>> us not return any xattr to user space. In order for this to happen, the
>>> first xattr of our inode needs to be at slot 0 of a leaf and the previous
>>> leaf must still have room for an inode ref (or extref) item, and this can
>>> happen because an inode's listxattrs callback does not lock the inode's
>>> i_mutex (nor does the VFS does it for us), but adding a hard link to an
>>> inode makes the VFS lock the inode's i_mutex before calling the inode's
>>> link callback.
>>>
>>> If we have the following leafs:
>>>
>>>Leaf X (has N items)Leaf Y
>>>
>>>  [ ... (257 INODE_ITEM 0) (257 INODE_REF 256) ]  [ (257 XATTR_ITEM 12345), 
>>> ... ]
>>>slot N - 2 slot N - 1  slot 0
>>>
>>> The race illustrated by the following sequence diagram is possible:
>>>
>>>CPU 1   CPU 2
>>>
>>>   btrfs_listxattr()
>>>
>>> searches for key (257 XATTR_ITEM 0)
>>>
>>> gets path with path->nodes[0] == leaf X
>>> and path->slots[0] == N
>>>
>>> because path->slots[0] is >=
>>> btrfs_header_nritems(leaf X), it calls
>>> btrfs_next_leaf()
>>>
>>> btrfs_next_leaf()
>>>   releases the path
>>>
>>>adds key (257 INODE_REF 
>>> 666)
>>>to the end of leaf X 
>>> (slot N),
>>>and leaf X now has N + 1 
>>> items
>>>
>>>   searches for the key (257 INODE_REF 256),
>>>   with path->keep_locks == 1, because that
>>>   is the last key it saw in leaf X before
>>>   releasing the path
>>>
>>>   ends up at leaf X again and it verifies
>>>   that the key (257 INODE_REF 256) is no
>>>   longer the last key in leaf X, so it
>>>   returns with path->nodes[0] == leaf X
>>>   and path->slots[0] == N, pointing to
>>>   the new item with key (257 INODE_REF 666)
>>>
>>> btrfs_listxattr's loop iteration sees that
>>> the type of the key pointed by the path is
>>> different from the type BTRFS_XATTR_ITEM_KEY
>>> and so it breaks the loop and stops looking
>>> for more xattr items
>>>   --> the application doesn't get any xattr
>>>   listed for our inode
>>>
>>> So fix this by breaking the loop only if the key's type is greater than
>>> BTRFS_XATTR_ITEM_KEY and skip the current key if its type is smaller.
>>>
>>> Signed-off-by: Filipe Manana 
>>> [bwh: Backported to 3.2: s/found_key\.type/btrfs_key_type(\&found_key)/]
>>
>> Actually, in my backport to 3.16 I decided to keep the usage of
>> 'found_key.type' instead, as the usage of btrfs_key_type() has been
>> dropped with commit 962a298f3511 ("btrfs: kill the key type accessor
>> helpers").
> [...]
> 
> OK, that makes sense.  btrfs in 3.2 is pretty inconsistent about using
> btrfs_key_type() anyway.

Using the type field directly, instead of the accessor, is perfectly
safe (the field is an u8 so no worries about endianness conversions
unlike, other field of struct btrfs_key which are u64s).

> 
> Ben.
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LKP] [Btrfs] 3a8b36f3780: -62.6% fileio.requests_per_sec

2015-04-01 Thread Filipe Manana


On 04/01/2015 05:59 AM, Huang Ying wrote:
> On Tue, 2015-03-31 at 15:59 +0100, Filipe Manana wrote:
> 
> [snip]
> 
>> From: Filipe Manana 
>> Date: Tue, 31 Mar 2015 14:16:52 +0100
>> Subject: [PATCH] Btrfs: avoid syncing log in the fast fsync path when not
>>  necessary
>>
>> Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path") added
>> a performance regression for that causes an unnecessary sync of the log
>> trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
>> against a file, without no writes or any metadata updates to the inode in
>> between them and if a transaction is committed before the second fsync is
>> called.
>>
>> Huang Ying reported this to lkml after a test sysbench test that measured
>> a -62% decrease of file io requests for that tests' workload.
>>
>> The test is:
>>
>>   echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
>>   echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
>>   echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
>>   echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
>>   mkfs -t btrfs /dev/sda2
>>   mount -t btrfs /dev/sda2 /fs/sda2
>>   cd /fs/sda2
>>   for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
>>   sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
>> --file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync 
>> \
>> --file-num=1024 run
>>
>> A test on kvm guest, running a debug kernel gave me the following results:
>>
>> Without 3a8b36f378060d: 16.01 reqs/sec
>> With 3a8b36f378060d: 3.39 reqs/sec
>> With 3a8b36f378060d and this patch: 16.04 reqs/sec
>>
>> Reported-by: Huang Ying 
> 
> I have tested your patch, the regression restored in our test.  Thanks!
> 
> Tested-by: Huang, Ying 

Thank you very much for testing it and the report.
I'll now send the exact same patch, with your Tested-by tag, to the
btrfs mailing list, so that Chris can pick it from patchwork and can
more easily be noticed by the btrfs community (for review, test, etc).

regards,
Filipe Manana

> 
> Best Regards,
> Huang, Ying
> 
> [snip]
> 
> 
> 
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [LKP] [Btrfs] 3a8b36f3780: -62.6% fileio.requests_per_sec

2015-03-31 Thread Filipe Manana


On 03/31/2015 09:32 AM, Huang Ying wrote:
> Hi, Filipe,
> 
> On Wed, 2015-03-18 at 10:05 +0000, Filipe Manana wrote:
> [snip]
> 
>> Hi, thanks for this.
>>
>> However this doesn't make sense to me.
>> This commit only touches btrfs' fsync handler and the test uses sysbench
>> without passing --file-fsync-freq to it, which means sysbench will never
>> do fsyncs according to its man page (default for fsync frequency is 0).
>>
>> Or maybe I missed something?
> 
> Sorry for late.
> 
> I checked source code of sysbench and found that the actual default
> value of --file-fsync-freq is 100 instead of 0 in man page, as in the
> following lines.
> 
>   {"file-fsync-freq", "do fsync() after this number of requests (0 - don't 
> use fsync())",
>SB_ARG_TYPE_INT, "100"},
> 
> I double checked that via a debug patch to sysbench too.

Ok, thanks for checking that.
What the 100 means is that an fsync is done after every 100 requests
(both writes and reads I assume).

The patch removed an optimization where we would not do any IO if no new
data was written to the file between 2 consecutive fsync requests and if
a btrfs transaction was committed between the 2 fsync requests as well
(by default it happens about every 30 seconds, changeable with -o
commit=xx). Which I think it's a rare/uncommon scenario.

With that optimization removed, the inode's metadata data is always
synced to disk

I've just tested on kvm guest with a debug kernel and got similar
decrease of file io requests as you reported.

The following brought back the performance for me (without reverting the
data loss fix from 3a8b36f37806 of course). Can you give it a try? 
Thanks.


From: Filipe Manana 
Date: Tue, 31 Mar 2015 14:16:52 +0100
Subject: [PATCH] Btrfs: avoid syncing log in the fast fsync path when not
 necessary

Commit 3a8b36f37806 ("Btrfs: fix data loss in the fast fsync path") added
a performance regression for that causes an unnecessary sync of the log
trees (fs/subvol and root log trees) when 2 consecutive fsyncs are done
against a file, without no writes or any metadata updates to the inode in
between them and if a transaction is committed before the second fsync is
called.

Huang Ying reported this to lkml after a test sysbench test that measured
a -62% decrease of file io requests for that tests' workload.

The test is:

  echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
  echo performance > /sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
  mkfs -t btrfs /dev/sda2
  mount -t btrfs /dev/sda2 /fs/sda2
  cd /fs/sda2
  for ((i = 0; i < 1024; i++)); do fallocate -l 67108864 testfile.$i; done
  sysbench --test=fileio --max-requests=0 --num-threads=4 --max-time=600 \
--file-test-mode=rndwr --file-total-size=68719476736 --file-io-mode=sync \
--file-num=1024 run

A test on kvm guest, running a debug kernel gave me the following results:

Without 3a8b36f378060d: 16.01 reqs/sec
With 3a8b36f378060d: 3.39 reqs/sec
With 3a8b36f378060d and this patch: 16.04 reqs/sec

Reported-by: Huang Ying 
Signed-off-by: Filipe Manana 
---
 fs/btrfs/file.c |  9 ++---
 fs/btrfs/ordered-data.c | 14 ++
 fs/btrfs/ordered-data.h |  3 +++
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 309dd57..379275c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -1878,6 +1878,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
struct btrfs_log_ctx ctx;
int ret = 0;
bool full_sync = 0;
+   const u64 len = end - start + 1;
 
trace_btrfs_sync_file(file, datasync);
 
@@ -1906,7 +1907,7 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 * all extents are persisted and the respective file extent
 * items are in the fs/subvol btree.
 */
-   ret = btrfs_wait_ordered_range(inode, start, end - start + 1);
+   ret = btrfs_wait_ordered_range(inode, start, len);
} else {
/*
 * Start any new ordered operations before starting to log the
@@ -1978,8 +1979,10 @@ int btrfs_sync_file(struct file *file, loff_t start, 
loff_t end, int datasync)
 */
smp_mb();
if (btrfs_inode_in_log(inode, root->fs_info->generation) ||
-   (full_sync && BTRFS_I(inode)->last_trans <=
-root->fs_info->last_trans_committed)) {
+   (BTRFS_I(inode)->last_trans <=
+root->fs_info->last_trans_committed &&
+(full_sync 

Re: [LKP] [Btrfs] 3a8b36f3780: -62.6% fileio.requests_per_sec

2015-03-18 Thread Filipe Manana


On 03/18/2015 08:20 AM, Huang Ying wrote:
> FYI, we noticed the below changes on
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
> commit 3a8b36f378060d20062a0918e99fae39ff077bf0 ("Btrfs: fix data loss in the 
> fast fsync path")
> 
> 
> testbox/testcase/testparams: 
> lkp-sb02/fileio/performance-600s-100%-1HDD-btrfs-64G-1024f-rndwr-sync
> 
> f5c0a122800c301e  3a8b36f378060d20062a0918e9  
>   --  
>  %stddev %change %stddev
>  \  |\  
>  45.33 ±  0% -62.6%  16.94 ±  0%  fileio.requests_per_sec
> 138983 ±  0% +15.1% 16 ±  0%  
> fileio.time.voluntary_context_switches
>  16035 ±  0% +13.0%  18124 ±  0%  
> fileio.time.involuntary_context_switches
>2504328 ±  0%  -7.2%2324488 ±  0%  fileio.time.file_system_outputs
>   1.35 ±  1%  +2.8%   1.38 ±  0%  turbostat.CorWatt
>   0.77 ±  6% +34.6%   1.03 ±  3%  turbostat.Pkg%pc3
>7224199 ± 22% -26.7%5298697 ± 12%  cpuidle.C1-SNB.time
>8377756 ±  1% +15.7%9690687 ±  4%  cpuidle.C3-SNB.time
>  16035 ±  0% +13.0%  18124 ±  0%  
> time.involuntary_context_switches
> 138983 ±  0% +15.1% 16 ±  0%  time.voluntary_context_switches
>  45941 ±  0% +11.0%  50983 ±  0%  softirqs.BLOCK
>  35635 ±  2% +13.7%  40524 ±  2%  softirqs.RCU
>  26255 ±  1% +10.5%  29017 ±  0%  softirqs.SCHED
>  50650 ±  2% +11.3%  56371 ±  0%  softirqs.TIMER
>   3448 ±  0%  +1.6%   3503 ±  0%  vmstat.io.bo
>   4010 ±  0%  +2.7%   4119 ±  0%  vmstat.system.cs
> 294711 ±  1% -17.1% 244365 ±  0%  meminfo.Active
> 275793 ±  2% -18.1% 225971 ±  0%  meminfo.Active(file)
>  53614 ±  6% +27.6%  68412 ± 15%  meminfo.DirectMap4k
>   3781 ±  0% -46.9%   2006 ±  0%  meminfo.Dirty
>  47786 ±  0% -14.7%  40780 ±  0%  meminfo.SReclaimable
>  66047 ±  0% -10.7%  58973 ±  0%  meminfo.Slab
>  68947 ±  2% -18.1%  56492 ±  0%  proc-vmstat.nr_active_file
> 337110 ±  0% -10.0% 303330 ±  0%  proc-vmstat.nr_dirtied
>944 ±  0% -46.9%501 ±  0%  proc-vmstat.nr_dirty
>  11946 ±  0% -14.7%  10195 ±  0%  proc-vmstat.nr_slab_reclaimable
> 335424 ±  0%  -9.7% 302754 ±  0%  proc-vmstat.nr_written
>  55839 ±  3% -15.0%  47438 ±  0%  proc-vmstat.pgactivate
>   1142 ±  5% -16.2%957 ± 17%  
> slabinfo.btrfs_delayed_ref_head.active_objs
>   1146 ±  5% -16.0%962 ± 17%  
> slabinfo.btrfs_delayed_ref_head.num_objs
>   1246 ±  6% -29.4%880 ± 15%  
> slabinfo.btrfs_delayed_tree_ref.active_objs
>   1246 ±  6% -29.4%880 ± 15%  
> slabinfo.btrfs_delayed_tree_ref.num_objs
>   2037 ±  2% +60.0%   3260 ±  1%  
> slabinfo.btrfs_extent_buffer.num_objs
>   2023 ±  2% +60.7%   3250 ±  1%  
> slabinfo.btrfs_extent_buffer.active_objs
>  13307 ±  0% -57.7%   5634 ±  0%  
> slabinfo.btrfs_extent_state.num_objs
>260 ±  0% -57.8%110 ±  0%  
> slabinfo.btrfs_extent_state.num_slabs
>  13292 ±  0% -57.6%   5634 ±  0%  
> slabinfo.btrfs_extent_state.active_objs
>260 ±  0% -57.8%110 ±  0%  
> slabinfo.btrfs_extent_state.active_slabs
>713 ±  1% -51.2%348 ±  1%  
> slabinfo.btrfs_ordered_extent.active_objs
>718 ±  1% -48.1%373 ±  1%  
> slabinfo.btrfs_ordered_extent.num_objs
>  26930 ±  0% -57.1%  11557 ±  0%  slabinfo.btrfs_path.num_objs
>961 ±  0% -57.1%412 ±  0%  slabinfo.btrfs_path.active_slabs
>961 ±  0% -57.1%412 ±  0%  slabinfo.btrfs_path.num_slabs
>  26930 ±  0% -57.1%  11557 ±  0%  slabinfo.btrfs_path.active_objs
>789 ±  4% -48.5%406 ±  0%  
> slabinfo.ext4_extent_status.num_objs
>789 ±  4% -48.5%406 ±  0%  
> slabinfo.ext4_extent_status.active_objs
>  26083 ±  0% -28.3%  18697 ±  0%  
> slabinfo.radix_tree_node.num_objs
>  26083 ±  0% -28.3%  18697 ±  0%  
> slabinfo.radix_tree_node.active_objs
>931 ±  0% -28.3%667 ±  0%  
> slabinfo.radix_tree_node.active_slabs
>931 ±  0% -28.3%667 ±  0%  
> slabinfo.radix_tree_node.num_slabs
>  4 ± 38%+129.4%  9 ± 31%  
> sched_debug.cfs_rq[0]:/.runnable_load_avg
> 17 ± 32% -54.9%  8 ± 45%  
> sched_debug.cfs_rq[3]:/.runnable_load_avg
>385 ± 14% -25.3%287 ± 17%  sched_debug.cfs_rq[3]:/.load
>  51947 ±  3% +15.4%  59938 ±  3%  
> sched_debug.cpu#0.nr_load_updates
> 200860 ±  5% +11.6% 224079 ±  4%  sched_debug.cpu#1.ttwu_local
>  47218 ±  2%  +7.4%  50701 ±  2%  
> sched_debug.cpu#1.nr_load_updates
>