from:"David Sterba"

Re: [PATCH v4 04/28] range: Add range_overlaps()

2024-10-09 Thread David Sterba

On Wed, Oct 09, 2024 at 05:45:10PM +0300, Andy Shevchenko wrote:
> On Tue, Oct 08, 2024 at 06:10:32PM +0200, David Sterba wrote:
> > On Mon, Oct 07, 2024 at 06:16:10PM -0500, Ira Weiny wrote:
> 
> ...
> 
> > > +static inline bool range_overlaps(struct range *r1, struct range *r2)
> > 
> > I've noticed only now, you can constify the arguments, but this applise
> > to other range_* functions so that can be done later in one go.
> 
> Frankly you may add the same to each new API being added to the file and
> the "one go" will never happen.

Yeah, but it's a minor issue for a 28 patchset, I don't know if there
are some other major things still to do so that a v5 is expected.

If anybody is interested, reviewing APIs and interfaces with focus on
some data structure and const is relatively easy, compile test is
typically enough. The hard part is to find the missing ones. There's no
compiler aid thad I'd know of (-Wsuggest-attribute=const is not for
parameters), so it's been reading a file top-down for me.

> So, I support your first part with
> constifying, but I think it would be rather done now to start that "one
> go" to happen.

Agreed, one patch on top is probably the least intrusive way.

Re: [PATCH v4 04/28] range: Add range_overlaps()

2024-10-08 Thread David Sterba

On Mon, Oct 07, 2024 at 06:16:10PM -0500, Ira Weiny wrote:
> --- a/include/linux/range.h
> +++ b/include/linux/range.h
> +/* True if any part of r1 overlaps r2 */
> +static inline bool range_overlaps(struct range *r1, struct range *r2)

I've noticed only now, you can constify the arguments, but this applise
to other range_* functions so that can be done later in one go.

> +{
> + return r1->start <= r2->end && r1->end >= r2->start;
> +}

Re: [PATCH] btrfs: Annotate structs with __counted_by()

2024-08-14 Thread David Sterba

On Wed, Aug 14, 2024 at 09:01:42PM +0200, Thorsten Blum wrote:
> On 14. Aug 2024, at 20:02, Nathan Chancellor  wrote:
> > On Mon, Aug 12, 2024 at 02:03:44PM +0200, Thorsten Blum wrote:
> >> On 12. Aug 2024, at 12:54, Johannes Thumshirn  
> >> wrote:
> >>> On 12.08.24 12:37, Thorsten Blum wrote:
>  Add the __counted_by compiler attribute to the flexible array member
>  stripes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
>  CONFIG_FORTIFY_SOURCE.
>  
>  Signed-off-by: Thorsten Blum 
>  ---
>  fs/btrfs/volumes.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>  
>  diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
>  index 37a09ebb34dd..f28fa318036b 100644
>  --- a/fs/btrfs/volumes.h
>  +++ b/fs/btrfs/volumes.h
>  @@ -551,7 +551,7 @@ struct btrfs_io_context {
>   * stripes[data_stripes + 1]: The Q stripe (only for RAID6).
>   */
>   u64 full_stripe_logical;
>  - struct btrfs_io_stripe stripes[];
>  + struct btrfs_io_stripe stripes[] __counted_by(num_stripes);
>  };
>  
>  struct btrfs_device_info {
>  @@ -591,7 +591,7 @@ struct btrfs_chunk_map {
>   int io_width;
>   int num_stripes;
>   int sub_stripes;
>  - struct btrfs_io_stripe stripes[];
>  + struct btrfs_io_stripe stripes[] __counted_by(num_stripes);
>  };
>  
>  #define btrfs_chunk_map_size(n) (sizeof(struct btrfs_chunk_map) + \
> >>> 
> >>> Looks good to me,
> >>> Reviewed-by: Johannes Thumshirn 
> >>> 
> >>> Out of curiosity, have you encountered any issues with this patch applied?
> >> 
> >> I only compile-tested it.
> > 
> > This change is now in next-20240814 and I see a UBSAN warning at runtime
> > as a result because the assignment of ->num_stripes happens after
> > accessing ->stripes[] (which breaks one of the requirements for using
> > __counted_by [1]), meaning that UBSAN thinks this is a zero sized array
> > due to bioc being allocated with kzalloc().
> > 
> >  [   24.992264] [ cut here ]
> >  [   25.009196] UBSAN: array-index-out-of-bounds in 
> > fs/btrfs/volumes.c:6602:11
> >  [   25.021963] index 1 is out of range for type 'struct btrfs_io_stripe[] 
> > __counted_by(num_stripes)' (aka 'struct btrfs_io_stripe[]')
> >  [   25.036463] CPU: 28 UID: 0 PID: 1171 Comm: systemd-random- Not tainted 
> > 6.11.0-rc3-next-20240814 #1
> >  [   25.048172] Hardware name: ADLINK Ampere Altra Developer 
> > Platform/Ampere Altra Developer Platform, BIOS TianoCore 2.04.100.11 (SYS: 
> > 2.06.20220308) 11/06/2
> >  [   25.064754] Call trace:
> >  [   25.069965]  dump_backtrace+0x114/0x19c
> >  [   25.076564]  show_stack+0x28/0x3c
> >  [   25.082642]  dump_stack_lvl+0x48/0x94
> >  [   25.089068]  __ubsan_handle_out_of_bounds+0x10c/0x184
> >  [   25.096883]  btrfs_map_block+0x540/0xb3c
> >  [   25.103570]  btrfs_submit_bio+0xf8/0x654
> >  [   25.110256]  write_one_eb+0x290/0x444
> >  [   25.116682]  btree_write_cache_pages+0x44c/0x5a8
> >  [   25.124063]  btree_writepages+0x2c/0x8c
> >  [   25.130662]  do_writepages+0x10c/0x34c
> >  [   25.137175]  filemap_fdatawrite_wbc+0x84/0xb0
> >  [   25.144295]  filemap_fdatawrite_range+0x74/0xac
> >  [   25.151589]  btrfs_write_marked_extents+0xa0/0x140
> >  [   25.159143]  btrfs_sync_log+0x298/0xa98
> >  [   25.165743]  btrfs_sync_file+0x440/0x608
> >  [   25.172429]  __arm64_sys_fsync+0x90/0xd4
> >  [   25.179115]  invoke_syscall+0x8c/0x11c
> >  [   25.185628]  el0_svc_common
> >  [   25.191185]  do_el0_svc+0x2c/0x3c
> >  [   25.197264]  el0_svc+0x48/0xf0
> >  [   25.203083]  el0t_64_sync_handler+0x98/0x108
> >  [   25.210118]  el0t_64_sync+0x19c/0x1a0
> >  [   25.216552] ---[ end trace ]---
> > 
> > The fix might be as simple as something like
> > 
> > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> > index 4a259bdaa21c..0cabc2ebde71 100644
> > --- a/fs/btrfs/volumes.c
> > +++ b/fs/btrfs/volumes.c
> > @@ -6561,6 +6561,7 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, 
> > enum btrfs_map_op op,
> > }
> > bioc->map_type = map->type;
> > 
> > + bioc->num_stripes = io_geom.num_stripes;
> > /*
> > * For RAID56 full map, we need to make sure the stripes[] follows the
> > * rule that data stripes are all ordered, then followed with P and Q
> > @@ -6621,7 +6622,6 @@ int btrfs_map_block(struct btrfs_fs_info *fs_info, 
> > enum btrfs_map_op op,
> > }
> > 
> > *bioc_ret = bioc;
> > - bioc->num_stripes = io_geom.num_stripes;
> > bioc->max_errors = io_geom.max_errors;
> > bioc->mirror_num = io_geom.mirror_num;
> > 
> > 
> > but I am not sure of the implications of this change on quick glance
> > with regards to error handling and such.
> > 
> > [1]: 
> > https://people.kernel.org/gustavoars/how-to-use-the-new-counted_by-attribute-in-c-and-linux
> > 
> > Cheers,
> > Nathan
> 
> My patch should probably be reverted as I somehow missed quite a few
> things and need more time to rework this properly.
> 
> Sorry about that and thanks for re

Re: [PATCH] btrfs: Annotate struct name_cache_entry with __counted_by()

2024-08-13 Thread David Sterba

On Tue, Aug 13, 2024 at 12:53:15PM +0200, Thorsten Blum wrote:
> Add the __counted_by compiler attribute to the flexible array member
> name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
> CONFIG_FORTIFY_SOURCE.
> 
> Signed-off-by: Thorsten Blum 

Thanks,

Reviewed-by: David Sterba

Re: [PATCH] btrfs: Annotate structs with __counted_by()

2024-08-12 Thread David Sterba

On Mon, Aug 12, 2024 at 12:36:20PM +0200, Thorsten Blum wrote:
> Add the __counted_by compiler attribute to the flexible array member
> stripes to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and
> CONFIG_FORTIFY_SOURCE.
> 
> Signed-off-by: Thorsten Blum 

Reviewed-by: David Sterba

Re: [PATCH 1/4] module: Add module_subinit{_noexit} and module_subeixt helper macros

2024-07-26 Thread David Sterba

On Fri, Jul 26, 2024 at 11:09:02AM -0700, Christoph Hellwig wrote:
> On Fri, Jul 26, 2024 at 01:58:00PM -0400, Theodore Ts'o wrote:
> > Yeah, that's my reaction as well.  This only saves 50 lines of code in
> > ext4, and that includes unrelated changes such as getting rid of "int
> > i" and putting the declaration into the for loop --- "for (int i =
> > ...").  Sure, that saves two lines of code, but yay?
> > 
> > If the ordering how the functions gets called is based on the magic
> > ordering in the Makefile, I'm not sure this actually makes the code
> > clearer, more robust, and easier to maintain for the long term.
> 
> So you two object to kernel initcalls for the same reason and would
> rather go back to calling everything explicitly?

No and not my call to do it for the kernel. Somebody probably had a
reason use the initcalls, there are probably practical reasons for that.
Quick grep shows there are thousands of initcalls scattered over the
whole code base, that does ask for some tricks because updating a single
file with explicit calls would be a nightmare. Unlike for a subsystem
inside one directory, like a filesystem.

Re: [PATCH 1/4] module: Add module_subinit{_noexit} and module_subeixt helper macros

2024-07-26 Thread David Sterba

On Fri, Jul 26, 2024 at 07:04:35AM -0700, Christoph Hellwig wrote:
> On Fri, Jul 26, 2024 at 04:54:59PM +0800, Youling Tang wrote:
> > Based on this patch, we may need to do these things with this
> >
> >
> > 1. Change the order of *.o in the Makefile (the same order as before the
> > change)
> 
> While we'll need to be careful, we don't need to match the exact
> order.  Most of the calls simply create slab caches / mempools and
> similar things and the order for those does not matter at all.
> 
> Of course the register_filesytem calls need to be last, and sysfs
> registration probably should be second to last, but for the vast
> amount of calls the order does not matter as long as it is unwound
> in reverse order.
> 
> > 2. We need to define module_subinit through the ifdef MODULE
> > distinction,
> 
> Yes.
> 
> > When one of the subinit runs in a module fails, it is difficult
> > to rollback execution of subexit.
> 
> By having both section in the same order, you an just walk the
> exit section backwards from the offset that failed.  Of course that
> only matters for the modular case as normal initcalls don't get
> unwound when built-in either.
> 
> > 4. The order in which subinit is called is not intuitively known
> > (although it can be found in the Makefile).
> 
> Link order through make file is already a well known concept due to
> it mattering for built-in code.

All of this sounds overengineered for something that is a simple array
and two helpers. The code is not finalized so I'll wait for the next
version but specific file order in makefile and linker tricks seems
fragile and I'm not sure I want this for btrfs.

Re: [PATCH v2 09/11] btrfs: convert to multigrain timestamps

2024-07-01 Thread David Sterba

On Mon, Jul 01, 2024 at 09:57:43AM -0400, Jeff Layton wrote:
> On Mon, 2024-07-01 at 09:49 -0400, Josef Bacik wrote:
> > On Mon, Jul 01, 2024 at 06:26:45AM -0400, Jeff Layton wrote:
> > > Enable multigrain timestamps, which should ensure that there is an
> > > apparent change to the timestamp whenever it has been written after
> > > being actively observed via getattr.
> > > 
> > > Beyond enabling the FS_MGTIME flag, this patch eliminates
> > > update_time_for_write, which goes to great pains to avoid in-memory
> > > stores. Just have it overwrite the timestamps unconditionally.
> > > 
> > > Signed-off-by: Jeff Layton 
> > > ---
> > >  fs/btrfs/file.c  | 25 -
> > >  fs/btrfs/super.c |  3 ++-
> > >  2 files changed, 6 insertions(+), 22 deletions(-)
> > > 
> > > diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
> > > index d90138683a0a..409628c0c3cc 100644
> > > --- a/fs/btrfs/file.c
> > > +++ b/fs/btrfs/file.c
> > > @@ -1120,26 +1120,6 @@ void btrfs_check_nocow_unlock(struct
> > > btrfs_inode *inode)
> > >   btrfs_drew_write_unlock(&inode->root->snapshot_lock);
> > >  }
> > >  
> > > -static void update_time_for_write(struct inode *inode)
> > > -{
> > > - struct timespec64 now, ts;
> > > -
> > > - if (IS_NOCMTIME(inode))
> > > - return;
> > > -
> > > - now = current_time(inode);
> > > - ts = inode_get_mtime(inode);
> > > - if (!timespec64_equal(&ts, &now))
> > > - inode_set_mtime_to_ts(inode, now);
> > > -
> > > - ts = inode_get_ctime(inode);
> > > - if (!timespec64_equal(&ts, &now))
> > > - inode_set_ctime_to_ts(inode, now);
> > > -
> > > - if (IS_I_VERSION(inode))
> > > - inode_inc_iversion(inode);
> > > -}
> > > -
> > >  static int btrfs_write_check(struct kiocb *iocb, struct iov_iter
> > > *from,
> > >    size_t count)
> > >  {
> > > @@ -1171,7 +1151,10 @@ static int btrfs_write_check(struct kiocb
> > > *iocb, struct iov_iter *from,
> > >    * need to start yet another transaction to update the
> > > inode as we will
> > >    * update the inode when we finish writing whatever data
> > > we write.
> > >    */
> > > - update_time_for_write(inode);
> > > + if (!IS_NOCMTIME(inode)) {
> > > + inode_set_mtime_to_ts(inode,
> > > inode_set_ctime_current(inode));
> > > + inode_inc_iversion(inode);
> > 
> > You've dropped the
> > 
> > if (IS_I_VERSION(inode))
> > 
> > check here, and it doesn't appear to be in inode_inc_iversion.  Is
> > there a
> > reason for this?  Thanks,
> > 
> 
> AFAICT, btrfs always sets SB_I_VERSION. Are there any cases where it
> isn't? If so, then I can put this check back. I'll make a note about it
> in the changelog if not.

Yes it's always set and I don't see anything in the generic code that
would unset it so it's safe to drop the IS_I_VERSION check.

The check was originally added in November 2012 by 6c760c072403f4
("Btrfs: do not call file_update_time in aio_write") and then moved a
few times. Enabling the super block flags was added in May 2012 by
0c4d2d95d06e92 ("Btrfs: use i_version instead of our own sequence") so
the check was not necessary from the beginning.

Re: [PATCH 44/82] btrfs: Refactor intentional wrap-around test

2024-01-23 Thread David Sterba

On Mon, Jan 22, 2024 at 04:27:19PM -0800, Kees Cook wrote:
> In an effort to separate intentional arithmetic wrap-around from
> unexpected wrap-around, we need to refactor places that depend on this
> kind of math. One of the most common code patterns of this is:
> 
>   VAR + value < VAR
> 
> Notably, this is considered "undefined behavior" for signed and pointer
> types, which the kernel works around by using the -fno-strict-overflow
> option in the build[1] (which used to just be -fwrapv). Regardless, we
> want to get the kernel source to the position where we can meaningfully
> instrument arithmetic wrap-around conditions and catch them when they
> are unexpected, regardless of whether they are signed[2], unsigned[3],
> or pointer[4] types.
> 
> Refactor open-coded wrap-around addition test to use add_would_overflow().
> This paves the way to enabling the wrap-around sanitizers in the future.
> 
> Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 
> [1]
> Link: https://github.com/KSPP/linux/issues/26 [2]
> Link: https://github.com/KSPP/linux/issues/27 [3]
> Link: https://github.com/KSPP/linux/issues/344 [4]
> Cc: Chris Mason 
> Cc: Josef Bacik 
> Cc: David Sterba 
> Cc: linux-btrfs@vger.kernel.org
> Signed-off-by: Kees Cook 

Acked-by: David Sterba

Re: [PATCH 13/82] btrfs: Refactor intentional wrap-around calculation

2024-01-22 Thread David Sterba

On Mon, Jan 22, 2024 at 04:26:48PM -0800, Kees Cook wrote:
> In an effort to separate intentional arithmetic wrap-around from
> unexpected wrap-around, we need to refactor places that depend on this
> kind of math. One of the most common code patterns of this is:
> 
>   VAR + value < VAR
> 
> Notably, this is considered "undefined behavior" for signed and pointer
> types, which the kernel works around by using the -fno-strict-overflow
> option in the build[1] (which used to just be -fwrapv). Regardless, we
> want to get the kernel source to the position where we can meaningfully
> instrument arithmetic wrap-around conditions and catch them when they
> are unexpected, regardless of whether they are signed[2], unsigned[3],
> or pointer[4] types.
> 
> Refactor open-coded unsigned wrap-around addition test to use
> check_add_overflow(), retaining the result for later usage (which removes
> the redundant open-coded addition). This paves the way to enabling the
> wrap-around sanitizer in the future.
> 
> Link: https://git.kernel.org/linus/68df3755e383e6fecf2354a67b08f92f18536594 
> [1]
> Link: https://github.com/KSPP/linux/issues/26 [2]
> Link: https://github.com/KSPP/linux/issues/27 [3]
> Link: https://github.com/KSPP/linux/issues/344 [4]
> Cc: Chris Mason 
> Cc: Josef Bacik 
> Cc: David Sterba 
> Cc: linux-btrfs@vger.kernel.org
> Signed-off-by: Kees Cook 

Acked-by: David Sterba

Re: [PATCH][next] btrfs: Add __counted_by for struct btrfs_delayed_item and use struct_size()

2023-10-09 Thread David Sterba

On Mon, Oct 09, 2023 at 02:44:54PM -0600, Gustavo A. R. Silva wrote:
> Prepare for the coming implementation by GCC and Clang of the __counted_by
> attribute. Flexible array members annotated with __counted_by can have
> their accesses bounds-checked at run-time via CONFIG_UBSAN_BOUNDS (for
> array indexing) and CONFIG_FORTIFY_SOURCE (for strcpy/memcpy-family
> functions).
> 
> While there, use struct_size() helper, instead of the open-coded
> version, to calculate the size for the allocation of the whole
> flexible structure, including of course, the flexible-array member.
> 
> This code was found with the help of Coccinelle, and audited and
> fixed manually.
> 
> Signed-off-by: Gustavo A. R. Silva 

Added to misc-next, thanks.

Re: [PATCH v5 1/3] btrfs: zoned: reset zones of relocated block groups

2021-04-20 Thread David Sterba

On Tue, Apr 20, 2021 at 07:40:37AM +, Johannes Thumshirn wrote:
> On 19/04/2021 19:13, David Sterba wrote:
> > On Mon, Apr 19, 2021 at 04:41:00PM +0900, Johannes Thumshirn wrote:
> >> When relocating a block group the freed up space is not discarded in one
> >> big block, but each extent is discarded on it's own with -odisard=sync.
> >>
> >> For a zoned filesystem we need to discard the whole block group at once,
> >> so btrfs_discard_extent() will translate the discard into a
> >> REQ_OP_ZONE_RESET operation, which then resets the device's zone.
> >>
> >> Link: 
> >> https://lore.kernel.org/linux-btrfs/459e2932c48e12e883dcfd3dda828d9da251d5b5.1617962110.git.johannes.thumsh...@wdc.com
> > 
> > The link points to the same patch in v3, is there something missing in
> > this patch that should be added to the changelog?
> > 
> 
> I included this link so one can look up the discussion between Filipe and
> me that lead to this decision. But I guess this is not really relevant. 

So what'd recommend and expect is to put a summary of the discussion to
the changelog, because the discussions can take many mails and the
contents is scattered, and link the first relevant mail in case the full
story could be interesting. Eventually the link could be in text with
some hint what to look for in the mail.

Re: [PATCH v2] btrfs: do more graceful error/warning for 32bit kernel

2021-04-20 Thread David Sterba

On Fri, Apr 09, 2021 at 01:24:20PM +0800, Qu Wenruo wrote:
> Due to the pagecache limit of 32bit systems, btrfs can't access metadata
> at or beyond (ULONG_MAX + 1) << PAGE_SHIFT.
> This is 16T for 4K page size while 256T for 64K page size.
> 
> And unlike other fses, btrfs uses internally mapped u64 address space for
> all of its metadata, this is more tricky than other fses.
> 
> Users can have a fs which doesn't have metadata beyond the boundary at
> mount time, but later balance can cause btrfs to create metadata beyond
> the boundary.
> 
> And modification to MM layer is unrealistic just for such minor use
> case.
> 
> To address such problem, this patch will introduce the following checks,
> much like how XFS handles such problem:
> 
> - Mount time rejection
>   This will reject any fs which has metadata chunk at or beyond the
>   boundary.
> 
> - Mount time early warning
>   If there is any metadata chunk beyond 5/8 of the boundary, we do an
>   early warning and hope the end user will see it.
> 
> - Runtime extent buffer rejection
>   If we're going to allocate an extent buffer at or beyond the boundary,
>   reject such request with -EOVERFLOW.
>   This is definitely going to cause problems like transaction abort, but
>   we have no better ways.
> 
> - Runtime extent buffer early warning
>   If an extent buffer beyond 5/8 of the max file size is allocated, do
>   an early warning.
> 
> Above error/warning message will only be outputted once for each fs to
> reduce dmesg flood.
> 
> Reported-by: Erik Jensen 
> Signed-off-by: Qu Wenruo 
> Reviewed-by: Josef Bacik 

I've added some references and updated wording what's the problem. Patch
moved from topic branch to misc-next, thanks.

Re: [PATCH v2] btrfs: zoned: fix unpaired block group unfreeze during device replace

2021-04-20 Thread David Sterba

On Wed, Apr 14, 2021 at 02:05:26PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When doing a device replace on a zoned filesystem, if we find a block
> group with ->to_copy == 0, we jump to the label 'done', which will result
> in later calling btrfs_unfreeze_block_group(), even though at this point
> we never called btrfs_freeze_block_group().
> 
> Since at this point we have neither turned the block group to RO mode nor
> made any progress, we don't need to jump to the label 'done'. So fix this
> by jumping instead to the label 'skip' and dropping our reference on the
> block group before the jump.
> 
> Fixes: 78ce9fc269af6e ("btrfs: zoned: mark block groups to copy for 
> device-replace")
> Signed-off-by: Filipe Manana 

Thanks, added to misc-next. It's a regression fix for 5.12 but IMHO not
critical enough for a pull request. 5.12 is about to be released in a
week, with CC: stable the will be published soon after.

Re: [PATCH] btrfs: fix race when picking most recent mod log operation for an old root

2021-04-20 Thread David Sterba

On Tue, Apr 20, 2021 at 10:55:44AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Commit dbcc7d57bffc0c ("btrfs: fix race when cloning extent buffer during
> rewind of an old root"), fixed a race when we need to rewind the extent
> buffer of an old root. It was caused by picking a new mod log operation
> for the extent buffer while getting a cloned extent buffer with an outdated
> number of items (off by -1), because we cloned the extent buffer without
> locking it first.
> 
> However there is still another similar race, but in the opposite direction.
> The cloned extent buffer has a number of items that does not match the
> number of tree mod log operations that are going to be replayed. This is
> because right after we got the last (most recent) tree mod log operation to
> replay and before locking and cloning the extent buffer, another task adds
> a new pointer to the extent buffer, which results in adding a new tree mod
> log operation and incrementing the number of items in the extent buffer.
> So after cloning we have mismatch between the number of items in the extent
> buffer and the number of mod log operations we are going to apply to it.
> This results in hitting a BUG_ON() that produces the following stack trace:
> 
>   [145427.689964][ T4811] [ cut here ]
>   [145427.692498][ T4811] kernel BUG at fs/btrfs/tree-mod-log.c:675!
>   [145427.694668][ T4811] invalid opcode:  [#1] SMP KASAN PTI
>   [145427.696379][ T4811] CPU: 3 PID: 4811 Comm: crawl_1215 Tainted: G
> W 5.12.0-7d1efdf501f8-misc-next+ #99
>   [145427.700221][ T4811] Hardware name: QEMU Standard PC (i440FX + PIIX, 
> 1996), BIOS 1.12.0-1 04/01/2014
>   [145427.703623][ T4811] RIP: 0010:tree_mod_log_rewind+0x3b1/0x3c0
>   [145427.706135][ T4811] Code: 05 48 8d 74 10 (...)
>   [145427.713034][ T4811] RSP: 0018:c90001027090 EFLAGS: 00010293
>   [145427.714996][ T4811] RAX:  RBX: 8880a8514600 RCX: 
> aa9e59b6
>   [145427.717158][ T4811] RDX: 0007 RSI: dc00 RDI: 
> 8880a851462c
>   [145427.720422][ T4811] RBP: c900010270e0 R08: 00c0 R09: 
> ed1004333417
>   [145427.723835][ T4811] R10: 88802199a0b7 R11: ed1004333416 R12: 
> 000e
>   [145427.727695][ T4811] R13: 888135af8748 R14: 88818766ff00 R15: 
> 8880a851462c
>   [145427.731636][ T4811] FS:  7f29acf62700() 
> GS:8881f220() knlGS:
>   [145427.736305][ T4811] CS:  0010 DS:  ES:  CR0: 80050033
>   [145427.739587][ T4811] CR2: 7f0e6013f718 CR3: 00010d42e003 CR4: 
> 00170ee0
>   [145427.743573][ T4811] Call Trace:
>   [145427.745117][ T4811]  btrfs_get_old_root+0x16a/0x5c0
>   [145427.747686][ T4811]  ? lock_downgrade+0x400/0x400
>   [145427.754189][ T4811]  btrfs_search_old_slot+0x192/0x520
>   [145427.758023][ T4811]  ? btrfs_search_slot+0x1090/0x1090
>   [145427.761014][ T4811]  ? free_extent_buffer.part.61+0xd7/0x140
>   [145427.765208][ T4811]  ? free_extent_buffer+0x13/0x20
>   [145427.770042][ T4811]  resolve_indirect_refs+0x3e9/0xfc0
>   [145427.773633][ T4811]  ? lock_downgrade+0x400/0x400
>   [145427.777323][ T4811]  ? __kasan_check_read+0x11/0x20
>   [145427.780539][ T4811]  ? add_prelim_ref.part.11+0x150/0x150
>   [145427.785722][ T4811]  ? lock_downgrade+0x400/0x400
>   [145427.791086][ T4811]  ? __kasan_check_read+0x11/0x20
>   [145427.796266][ T4811]  ? lock_acquired+0xbb/0x620
>   [145427.798764][ T4811]  ? __kasan_check_write+0x14/0x20
>   [145427.801118][ T4811]  ? do_raw_spin_unlock+0xa8/0x140
>   [145427.804491][ T4811]  ? rb_insert_color+0x340/0x360
>   [145427.808066][ T4811]  ? prelim_ref_insert+0x12d/0x430
>   [145427.811889][ T4811]  find_parent_nodes+0x5c3/0x1830
>   [145427.815498][ T4811]  ? stack_trace_save+0x87/0xb0
>   [145427.819210][ T4811]  ? resolve_indirect_refs+0xfc0/0xfc0
>   [145427.823254][ T4811]  ? fs_reclaim_acquire+0x67/0xf0
>   [145427.827220][ T4811]  ? __kasan_check_read+0x11/0x20
>   [145427.829080][ T4811]  ? lockdep_hardirqs_on_prepare+0x210/0x210
>   [145427.831237][ T4811]  ? fs_reclaim_acquire+0x67/0xf0
>   [145427.835061][ T4811]  ? __kasan_check_read+0x11/0x20
>   [145427.836508][ T4811]  ? ___might_sleep+0x10f/0x1e0
>   [145427.841389][ T4811]  ? __kasan_kmalloc+0x9d/0xd0
>   [145427.843054][ T4811]  ? trace_hardirqs_on+0x55/0x120
>   [145427.845533][ T4811]  btrfs_find_all_roots_safe+0x142/0x1e0
>   [145427.847325][ T4811]  ? find_parent_nodes+0x1830/0x1830
>   [145427.849318][ T4811]  ? trace_hardirqs_on+0x55/0x120
>   [145427.851210][ T4811]  ? ulist_free+0x1f/0x30
>   [145427.852809][ T4811]  ? btrfs_inode_flags_to_xflags+0x50/0x50
>   [145427.854654][ T4811]  iterate_extent_inodes+0x20e/0x580
>   [145427.856429][ T4811]  ? tree_backref_for_extent+0x230/0x230
>   [145427.858552][ T4811]  ? release_extent_buffer+0x225/0x280
>   [145427.862789][ T4811]  ? read_extent_buffer+0xdd/0x110
>   [145427.86509

Re: [PATCH] btrfs: fix metadata extent leak after failure to create subvolume

2021-04-20 Thread David Sterba

On Tue, Apr 20, 2021 at 10:55:12AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When creating a subvolume we allocate an extent buffer for its root node
> after starting a transaction. We setup a root item for the subvolume that
> points to that extent buffer and then attempt to insert the root item into
> the root tree - however if that fails, due to -ENOMEM for example, we do
> not free the extent buffer previously allocated and we do not abort the
> transaction (as at that point we did nothing that can not be undone).
> 
> This means that we effectively do not return the metadata extent back to
> the free space cache/tree and we leave a delayed reference for it which
> causes a metadata extent item to be added to the extent tree, in the next
> transaction commit, without having backreferences. When this happens
> 'btrfs check' reports the following:
> 
>   $ btrfs check /dev/sdi
>   Opening filesystem to check...
>   Checking filesystem on /dev/sdi
>   UUID: dce2cb9d-025f-4b05-a4bf-cee0ad3785eb
>   [1/7] checking root items
>   [2/7] checking extents
>   ref mismatch on [30425088 16384] extent item 1, found 0
>   backref 30425088 root 256 not referenced back 0x564a91c23d70
>   incorrect global backref count on 30425088 found 1 wanted 0
>   backpointer mismatch on [30425088 16384]
>   owner ref check failed [30425088 16384]
>   ERROR: errors found in extent allocation tree or chunk allocation
>   [3/7] checking free space cache
>   [4/7] checking fs roots
>   [5/7] checking only csums items (without verifying data)
>   [6/7] checking root refs
>   [7/7] checking quota groups skipped (not enabled on this FS)
>   found 212992 bytes used, error(s) found
>   total csum bytes: 0
>   total tree bytes: 131072
>   total fs tree bytes: 32768
>   total extent tree bytes: 16384
>   btree space waste bytes: 124669
>   file data blocks allocated: 65536
>referenced 65536
> 
> So fix this by freeing the metadata extent if btrfs_insert_root() returns
> an error.
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.

Re: [PATCH v2 0/2] btrfs: remove the inode_need_compress() call in

2021-04-19 Thread David Sterba

On Tue, Aug 04, 2020 at 03:25:46PM +0800, Qu Wenruo wrote:
> This is an attempt to remove the inode_need_compress() call in
> compress_file_extent().
> 
> As that compress_file_extent() can race with inode ioctl or bad
> compression ratio, to cause NULL pointer dereferecen for @pages, it's
> nature to try to remove that inode_need_compress() to remove the race
> completely.
> 
> However that's not that easy, we have the following problems:
> 
> - We still need to check @pages anyway
>   That @pages check is for kcalloc() failure, so what we really get is
>   just removing one indent from the if (inode_need_compress()).
>   Everything else is still the same (in fact, even worse, see below
>   problems)
> 
> - Behavior change
>   Before that change, every async_chunk does their check on
>   INODE_NO_COMPRESS flags.
>   If we hit any bad compression ratio, all incoming async_chunk will
>   fall back to plain text write.
> 
>   But if we remove that inode_need_compress() check, then we still try
>   to compress, and lead to potentially wasted CPU times.
> 
> - Still race between compression disable and NULL pointer dereferecen
>   There is a hidden race, mostly exposed by btrfs/071 test case, that we
>   have "compress_type = fs_info->compress_type", so we can still hit case
>   where that compress_type is NONE (caused by remount -o nocompress), and
>   then btrfs_compress_pages() will return -E2BIG, without modifying
>   @nr_pages
> 
>   Then later when we cleanup @pages, we try to access pages[i]->mapping,
>   triggering NULL pointer dereference.
> 
>   This will be address in the first patch though.
> 
> Changelog:
> v2:
> - Fix a bad commit merge
>   Which merges the commit message for the first patch, though the content is
>   still correct.
> 
> Qu Wenruo (2):
>   btrfs: prevent NULL pointer dereference in compress_file_range() when
> btrfs_compress_pages() hits default case

Patch 1 added to misc-next, due to this bug report
https://bugzilla.kernel.org/show_bug.cgi?id=212331

Re: [PATCH v2] fs/btrfs: Fix uninitialized variable

2021-04-19 Thread David Sterba

On Sat, Apr 17, 2021 at 04:36:16PM +0100, Khaled ROMDHANI wrote:
> As reported by the Coverity static analysis.
> The variable zone is not initialized which
> may causes a failed assertion.
> 
> Addresses-Coverity: ("Uninitialized variables")
> Signed-off-by: Khaled ROMDHANI 
> ---
> v2: add a default case as proposed by David Sterba
> ---
>  fs/btrfs/zoned.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index eeb3ebe11d7a..82527308d165 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -143,6 +143,9 @@ static inline u32 sb_zone_number(int shift, int mirror)
>   case 0: zone = 0; break;
>   case 1: zone = 1ULL << (BTRFS_SB_LOG_FIRST_SHIFT - shift); break;
>   case 2: zone = 1ULL << (BTRFS_SB_LOG_SECOND_SHIFT - shift); break;
> + default:
> + zone = 0;

Well yeah but this is not a valid case at all, we'd rather catch that as
an assertion failure than letting is silently continue.

Re: [PATCH v2] btrfs-progs: fi resize: fix false 0.00B sized output

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 10:08:50AM -0700, Boris Burkov wrote:
> On Mon, Apr 19, 2021 at 09:05:49PM +0800, Su Yue wrote:
> > @@ -1158,6 +1158,16 @@ static int check_resize_args(const char *amount, 
> > const char *path) {
> > }
> > old_size = di_args[dev_idx].total_bytes;
> >  
> > +   /* For target sizes without '+'/'-' sign prefix(e.g. 1:150g) */
> > +   if (mod == 0) {
> > +   new_size = diff;
> > +   diff = max(old_size, new_size) - min(old_size, 
> > new_size);
> > +   if (new_size > old_size)
> > +   mod = 1;
> > +   else if (new_size < old_size)
> > +   mod = -1;
> > +   }
> > +
> > if (mod < 0) {
> > if (diff > old_size) {
> > error("current size is %s which is smaller than 
> > %s",
> 
> This fix seems correct to me, but it feels a tiny bit over-complicated.
> Personally, I think it would be cleaner to do something like:
> 
> if (mod == 0) {
>   new_size = diff;
> } else if (mod < 0) {
>   // >0 check
>   new_size = old_size - diff
> } else {
>   // overflow check
>   new_size = old_size + diff
> }

Right, this looks much better and shares a lot of with the code that
follows the original fix.

Re: [PATCH v5 3/3] btrfs: zoned: automatically reclaim zones

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 04:41:02PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/zoned.h
> +++ b/fs/btrfs/zoned.h
> @@ -9,6 +9,8 @@
>  #include "disk-io.h"
>  #include "block-group.h"
>  
> +#define DEFAULT_RECLAIM_THRESH 75

This is not a .c local define so it needs a prefix and a comment what
the value means so something like

/*
 * Block groups filled more than this value (percents) will be scheduled
 * for background reclaim.
 */
#define BTRFS_DEFAULT_RECLAIM_THRESH75

Re: [PATCH v5 3/3] btrfs: zoned: automatically reclaim zones

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 04:41:02PM +0900, Johannes Thumshirn wrote:
> +void btrfs_reclaim_bgs_work(struct work_struct *work)
> +{
> + struct btrfs_fs_info *fs_info =
> + container_of(work, struct btrfs_fs_info, reclaim_bgs_work);
> + struct btrfs_block_group *bg;
> + struct btrfs_space_info *space_info;
> + int ret;
> +
> + if (!test_bit(BTRFS_FS_OPEN, &fs_info->flags))
> + return;
> +
> + if (!btrfs_exclop_start(fs_info, BTRFS_EXCLOP_BALANCE))
> + return;
> +
> + mutex_lock(&fs_info->reclaim_bgs_lock);
> + spin_lock(&fs_info->unused_bgs_lock);
> + while (!list_empty(&fs_info->reclaim_bgs)) {
> + bg = list_first_entry(&fs_info->reclaim_bgs,
> +   struct btrfs_block_group,
> +   bg_list);
> + list_del_init(&bg->bg_list);
> +
> + space_info = bg->space_info;
> + spin_unlock(&fs_info->unused_bgs_lock);
> +
> + /* Don't want to race with allocators so take the groups_sem */
> + down_write(&space_info->groups_sem);
> +
> + spin_lock(&bg->lock);
> + if (bg->reserved || bg->pinned || bg->ro) {
> + /*
> +  * We want to bail if we made new allocations or have
> +  * outstanding allocations in this block group.  We do
> +  * the ro check in case balance is currently acting on
> +  * this block group.
> +  */
> + spin_unlock(&bg->lock);
> + up_write(&space_info->groups_sem);
> + goto next;
> + }
> + spin_unlock(&bg->lock);
> +
> + /* Get out fast, in case we're unmounting the FS. */
> + if (btrfs_fs_closing(fs_info)) {
> + up_write(&space_info->groups_sem);
> + goto next;
> + }
> +
> + ret = inc_block_group_ro(bg, 0);
> + up_write(&space_info->groups_sem);
> + if (ret < 0)
> + goto next;
> +
> + btrfs_info(fs_info, "reclaiming chunk %llu", bg->start);

This could state the bg usage ratio, bg->used / bg->length .

> + trace_btrfs_reclaim_block_group(bg);
> + ret = btrfs_relocate_chunk(fs_info, bg->start);
> + if (ret)
> + btrfs_err(fs_info, "error relocating chunk %llu",
> +   bg->start);
> +
> +next:
> + btrfs_put_block_group(bg);
> + spin_lock(&fs_info->unused_bgs_lock);
> + }
> + spin_unlock(&fs_info->unused_bgs_lock);
> + mutex_unlock(&fs_info->reclaim_bgs_lock);
> + btrfs_exclop_finish(fs_info);
> +}

Re: [PATCH v5 1/3] btrfs: zoned: reset zones of relocated block groups

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 04:41:00PM +0900, Johannes Thumshirn wrote:
> When relocating a block group the freed up space is not discarded in one
> big block, but each extent is discarded on it's own with -odisard=sync.
> 
> For a zoned filesystem we need to discard the whole block group at once,
> so btrfs_discard_extent() will translate the discard into a
> REQ_OP_ZONE_RESET operation, which then resets the device's zone.
> 
> Link: 
> https://lore.kernel.org/linux-btrfs/459e2932c48e12e883dcfd3dda828d9da251d5b5.1617962110.git.johannes.thumsh...@wdc.com

The link points to the same patch in v3, is there something missing in
this patch that should be added to the changelog?

Re: [PATCH v2] btrfs-progs: fi resize: fix false 0.00B sized output

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 09:05:49PM +0800, Su Yue wrote:
> Resize to nums without sign prefix makes false output:
>  btrfs fi resize 1:150g /srv/extra
> Resize device id 1 (/dev/sdb1) from 298.09GiB to 0.00B
> 
> The resize operation would take effect though.
> 
> Fix it by handling the case if mod is 0 in check_resize_args().
> 
> Issue: #307
> Reported-by: Chris Murphy 
> Signed-off-by: Su Yue 

Thanks, added to devel.

Re: [PATCH 3/4] btrfs: zoned: fail mount if the device does not support zone append

2021-04-19 Thread David Sterba

On Mon, Apr 19, 2021 at 09:46:36AM +, Damien Le Moal wrote:
> On 2021/04/19 18:41, h...@infradead.org wrote:
> > On Mon, Apr 19, 2021 at 09:35:37AM +, Damien Le Moal wrote:
> >> This is only to avoid someone from running zoned-btrfs on top of dm-crypt.
> >> Without this patch, mount will be OK and file data writes will also 
> >> actually be
> >> OK. But all reads will miserably fail... I would rather have this patch in 
> >> than
> >> deal with the "bug reports" about btrfs failing to read files. No ?
> >>
> >> Note that like you, I dislike having to add such code. But it was my 
> >> oversight
> >> when I worked on getting dm-crypt to work on zoned drives. Zone append was
> >> overlooked at that time... My bad, really.
> > 
> > dm-crypt needs to stop pretending it supports zoned devices if it
> > doesn't.  Note that dm-crypt could fairly trivially support zone append
> > by doing the same kind of emulation that the sd driver does.
> 
> I am not so sure about the "trivial" but yes, it is feasible. Let me think 
> about
> something then. Whatever we do, performance with ZNS will no be great, for
> sure... But for SMR HDDs, we likely will not notice any difference in 
> performance.

So this needs to be fixed outside of btrfs. The fix in btrfs would make
sense in case we can't sync the dm-crypt and btrfs in a released kernel.
Having a mount check sounds like a better option to me than to fail
reads, we can revert it in a release once everything woks as expected.

Re: [PATCH] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-16 Thread David Sterba

On Fri, Apr 16, 2021 at 11:14:08AM -0700, Boris Burkov wrote:
> On Thu, Apr 15, 2021 at 01:30:11PM +0800, Qu Wenruo wrote:
> > +/*
> > + * The buffer size if strlen("4096 8192 16384 32768 65536"),
> > + * which is 28, then round up to 32.
> 
> I think there is a typo in this comment, because it doesn't quite parse.

Typo fixed.

> > + */
> > +#define SUPPORTED_SECTORSIZE_BUF_SIZE  32
> >  int btrfs_check_sectorsize(u32 sectorsize)
> >  {
> > +   bool sectorsize_checked = false;
> > u32 page_size = (u32)sysconf(_SC_PAGESIZE);
> >  
> > if (!is_power_of_2(sectorsize)) {
> > @@ -340,7 +349,32 @@ int btrfs_check_sectorsize(u32 sectorsize)
> >   sectorsize);
> > return -EINVAL;
> > }
> > -   if (page_size != sectorsize)
> > +   if (page_size == sectorsize) {
> > +   sectorsize_checked = true;
> > +   } else {
> > +   /*
> > +* Check if the sector size is supported
> > +*/
> > +   char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
> > +   char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
> > +   int fd;
> > +   int ret;
> > +
> > +   fd = open("/sys/fs/btrfs/features/supported_sectorsizes",
> > + O_RDONLY);
> > +   if (fd < 0)
> > +   goto out;
> > +   ret = read(fd, supported_buf, sizeof(supported_buf));
> > +   close(fd);
> > +   if (ret < 0)
> > +   goto out;
> > +   snprintf(sectorsize_buf, SUPPORTED_SECTORSIZE_BUF_SIZE,
> > +"%u", page_size);
> > +   if (strstr(supported_buf, sectorsize_buf))
> > +   sectorsize_checked = true;
> 
> Two comments here.
> 1: I think we should be checking sectorsize against the file rather than
> page_size.

What do you mean by 'against the file'?

I read it as comparing what system reports as page size, converted to
string and looked up in the sysfs file.

> 2: strstr seems too permissive, since it doesn't have a notion of
> tokens. If not for the power_of_2 check above, we would admit all kinds
> of silly things like 409. But even with it, we would permit "4" now and
> with your example from the comment, "8", "16", and "32".

In general you're right. Practically speaking, this will work. We know
what the kernel module puts to that file and if getconf returns some
absurd number for PAGE_SIZE other things will break. The code assumes
perfect match, in any other case it prints the warning. So even if
there are some funny values either in getconf or the sysfs file, it
leads to something noticealbe to the user. A silent error would be worse
and worth fixing, but that way around it works.

Re: [PATCH 0/1] btrfs-progs: libbtrfsutil: Relicense to LGPLv2.1+

2021-04-16 Thread David Sterba

On Tue, Mar 30, 2021 at 09:21:40AM -0400, Neal Gompa wrote:
> On Tue, Mar 23, 2021 at 11:53 AM David Sterba  wrote:
> >
> > On Wed, Mar 17, 2021 at 04:01:43PM -0400, Neal Gompa wrote:
> > > This is a patch requesting all substantial copyright owners to sign off
> > > on changing the license of the libbtrfsutil code to LGPLv2.1+ in order
> > > to resolve various concerns around the mixture of code in btrfs-progs
> > > with libbtrfsutil.
> > >
> > > Each significant (i.e. non-trivial) commit author has been CC'd to
> > > request their sign-off on this. Please reply to this to acknowledge
> > > whether or not this is acceptable for your code.
> >
> > Thanks! I think we have acks for all non-trivial contirbutions. For the
> > record, the trivial one are:
> >
> > * dbf60b488e3b ("libbtrfsutil: update btrfs_util_delete_subvolume docs")
> >   a comment update, clarifying usage
> >
> > * 2e67bf0ed69d ("btrfs-progs: docs: fix typos in READMEs, INSTALL and * 
> > CHANGES")
> > * b1d39a42a4ef ("btrfs-progs: fix typos in comments")
> >   treewide comment typo fixes
> >
> > * 01e35d9f53eb ("btrfs-progs: treewide: Fix missing declarations")
> >   code changes, but adding static or missing includes
> >
> > * 9fd37f951be6 ("btrfs-progs: complete the implementation RAID1C34 
> > definitions")
> >   copied definitions from kernel code
> >
> > I'm not sure about the commit adding pkg-config spec file, it's not code
> > but it's beyond what I'd consider trivial. I've added Sheng Mao to CC,
> > please acknowledge the change.
> >
> > * 4498fe1a2a7c ("libbtrfsutil: add pkg-config spec file")
> >
> > I'll update the changelog with all the reasons for relicensing that have
> > been brought up.
> 
> I consider that trivial because the only actual part in libbtrfsutil
> is the pc file, which is a trivial templated data file (I had one that
> I was preparing to send that looked exactly like it, but Sheng Mao
> beat me to the punch). The autofoo is part of the parent btrfs-progs
> project and is already correctly licensed anyway.

Fair enough. Then there's nothing left to addresss, thanks.

Re: [PATCH] btrfs-progs: Fix null pointer deref in balance_level

2021-04-16 Thread David Sterba

On Tue, Apr 06, 2021 at 04:55:03PM +0300, Nikolay Borisov wrote:
> In case the right buffer is emptied it's first set to null and
> subsequently it's dereferenced to get its size to pass to root_sub_used.
> This naturally leads to a null pointer dereference. The correct thing
> to do is to pass the stashed right->len in "blocksize".
> 
> Fixes #296

I'm using the "Issue: #123" format for that.

> Signed-off-by: Nikolay Borisov 

Added to devel, thanks.

Re: [PATCH] btrfs-progs: mkfs: only output the warning if the sectorsize is not supported

2021-04-16 Thread David Sterba

On Thu, Apr 15, 2021 at 01:30:11PM +0800, Qu Wenruo wrote:
> Currently mkfs.btrfs will output a warning message if the sectorsize is
> not the same as page size:
>   WARNING: the filesystem may not be mountable, sectorsize 4096 doesn't match 
> page size 65536
> 
> But since btrfs subpage support for 64K page size is comming, this
> output is populating the golden output of fstests, causing tons of false
> alerts.
> 
> This patch will make teach mkfs.btrfs to check
> /sys/fs/btrfs/features/supported_sectorsizes, and compare if the sector
> size is supported.
> 
> Then only output above warning message if the sector size is not
> supported.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  common/fsfeatures.c | 36 +++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/common/fsfeatures.c b/common/fsfeatures.c
> index 569208a9e5b1..13b775da9c72 100644
> --- a/common/fsfeatures.c
> +++ b/common/fsfeatures.c
> @@ -16,6 +16,8 @@
>  
>  #include "kerncompat.h"
>  #include 
> +#include 
> +#include 
>  #include 
>  #include 
>  #include "common/fsfeatures.h"
> @@ -327,8 +329,15 @@ u32 get_running_kernel_version(void)
>  
>   return version;
>  }
> +
> +/*
> + * The buffer size if strlen("4096 8192 16384 32768 65536"),
> + * which is 28, then round up to 32.
> + */
> +#define SUPPORTED_SECTORSIZE_BUF_SIZE32
>  int btrfs_check_sectorsize(u32 sectorsize)
>  {
> + bool sectorsize_checked = false;
>   u32 page_size = (u32)sysconf(_SC_PAGESIZE);
>  
>   if (!is_power_of_2(sectorsize)) {
> @@ -340,7 +349,32 @@ int btrfs_check_sectorsize(u32 sectorsize)
> sectorsize);
>   return -EINVAL;
>   }
> - if (page_size != sectorsize)
> + if (page_size == sectorsize) {
> + sectorsize_checked = true;
> + } else {
> + /*
> +  * Check if the sector size is supported
> +  */
> + char supported_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
> + char sectorsize_buf[SUPPORTED_SECTORSIZE_BUF_SIZE] = { 0 };
> + int fd;
> + int ret;
> +
> + fd = open("/sys/fs/btrfs/features/supported_sectorsizes",
> +   O_RDONLY);
> + if (fd < 0)
> + goto out;
> + ret = read(fd, supported_buf, sizeof(supported_buf));
> + close(fd);

There are some sysfs helpers, like sysfs_read_file and
sysfs_open_fsid_file that would avoid the boilerplate code. We don't
have a helper for the toplevel sysfs directory so that would need to be
added first.

Now that I look at it, the sysfs_read_file could actually do that in one
go, including open and close. I'll probably do that as a cleanup later
and apply your patch as it's a fix. Thanks.

Re: [PATCH 3/3] btrfs-progs: misc-tests: add test to ensure the restored image can be mounted

2021-04-16 Thread David Sterba

On Fri, Mar 26, 2021 at 08:50:47PM +0800, Qu Wenruo wrote:
> This new test case is to make sure the restored image file has been
> properly enlarged so that newer kernel won't complain.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  .../047-image-restore-mount/test.sh   | 19 +++
>  1 file changed, 19 insertions(+)
>  create mode 100755 tests/misc-tests/047-image-restore-mount/test.sh
> 
> diff --git a/tests/misc-tests/047-image-restore-mount/test.sh 
> b/tests/misc-tests/047-image-restore-mount/test.sh
> new file mode 100755
> index ..7f12afa2bab6
> --- /dev/null
> +++ b/tests/misc-tests/047-image-restore-mount/test.sh
> @@ -0,0 +1,19 @@
> +#!/bin/bash
> +# Verify that the restored image of an empty btrfs can still be mounted
^

I've seen that in patches and comments, the use of word 'btrfs' instead
of 'filesystem' sounds a bit inappropriate to me, so I change it
whenever I see it. It's perhaps matter of taste and style, one can write
it also as 'btrfs filesystem' but that may belong to some more polished
documentation, so you can go with just 'filesystem'.

Re: [PATCH 2/3] btrfs-progs: image: enlarge the output file if no tree modification is needed for restore

2021-04-16 Thread David Sterba

On Fri, Mar 26, 2021 at 08:50:46PM +0800, Qu Wenruo wrote:
> [BUG]
> If restoring dumpped image into a new file, under most cases kernel will
> reject it:
> 
>  # mkfs.btrfs -f /dev/test/test
>  # btrfs-image /dev/test/test /tmp/dump
>  # btrfs-image -r /tmp/dump ~/test.img
>  # mount ~/test.img /mnt/btrfs
>  mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, 
> missing codepage or helper program, or other error.
>  # dmesg -t | tail -n 7
>  loop0: detected capacity change from 10592 to 0
>  BTRFS info (device loop0): disk space caching is enabled
>  BTRFS info (device loop0): has skinny extents
>  BTRFS info (device loop0): flagging fs with big metadata feature
>  BTRFS error (device loop0): device total_bytes should be at most 5423104 but 
> found 10737418240
>  BTRFS error (device loop0): failed to read chunk tree: -22
>  BTRFS error (device loop0): open_ctree failed
> 
> [CAUSE]
> When btrfs-image restores an image into a file, and the source image
> contains only single device, then we don't need to modify the
> chunk/device tree, as we can reuse the existing chunk/dev tree without
> any problem.
> 
> This also means, for such restore, we also won't do any target file
> enlarge. This behavior itself is fine, as at that time, kernel won't
> check if the device is smaller than the device size recorded in device
> tree.
> 
> But later kernel commit 3a160a933111 ("btrfs: drop never met disk total
> bytes check in verify_one_dev_extent") introduces new check on device
> size at mount time, rejecting any loop file which is smaller than the
> original device size.
> 
> [FIX]
> Do extra file enlarge for single device restore.
> 
> Reported-by: Nikolay Borisov 
> Signed-off-by: Qu Wenruo 
> ---
>  image/main.c | 43 +++
>  1 file changed, 43 insertions(+)
> 
> diff --git a/image/main.c b/image/main.c
> index 24393188e5e3..9933f69d0fdb 100644
> --- a/image/main.c
> +++ b/image/main.c
> @@ -2706,6 +2706,49 @@ static int restore_metadump(const char *input, FILE 
> *out, int old_restore,
>   close_ctree(info->chunk_root);
>   if (ret)
>   goto out;
> + } else {
> + struct btrfs_root *root;
> + struct stat st;
> + u64 dev_size;
> +
> + if (!info) {
> + root = open_ctree_fd(fileno(out), target, 0, 0);
> + if (!root) {
> + error("open ctree failed in %s", target);
> + ret = -EIO;
> + goto out;
> + }
> +
> + info = root->fs_info;
> +
> + dev_size = btrfs_stack_device_total_bytes(
> + &info->super_copy->dev_item);
> + close_ctree(root);
> + info = NULL;
> + } else {
> + dev_size = btrfs_stack_device_total_bytes(
> + &info->super_copy->dev_item);
> + }
> +
> + /*
> +  * We don't need extra tree modification, but if the output is
> +  * a file, we need to enlarge the output file so that
> +  * newer kernel won't report error.
> +  */
> + ret = fstat(fileno(out), &st);
> + if (ret < 0) {
> + error("failed to stat result image: %m");
> + ret = -errno;
> + goto out;
> + }
> + if (S_ISREG(st.st_mode)) {
> + ret = ftruncate64(fileno(out), dev_size);

This truncates the file unconditionally, so if the file is larger than
required, I don't think it's necessary to do it.

Re: [PATCH] btrfs-progs: mark BUG() as unreachable

2021-04-16 Thread David Sterba

On Thu, Apr 01, 2021 at 05:41:00PM +0900, Naohiro Aota wrote:
> Marking BUG() unreachable helps us silence unnecessary warnings e.g.
> "warning: control reaches end of non-void function [-Wreturn-type]" like
> the code below.
> 
>int foo()
>{
>...
>   if (XXX)
>   return 0;
>   else if (YYY)
>   return 1;
>   else
>   BUG();
>}
> 
> Signed-off-by: Naohiro Aota 

Added to devel, thanks.

Re: [PATCH-next] fs/btrfs: Fix uninitialized variable

2021-04-16 Thread David Sterba

On Tue, Apr 13, 2021 at 02:06:04PM +0100, Khaled ROMDHANI wrote:
> The variable zone is not initialized. It
> may causes a failed assertion.

Failed assertion means the 2nd one checking that the result still fits
to 32bit type. That would mean that none of the cases were hit, but all
callers pass valid values.

It would be better to add a default: case to catch that explicitly,
though hitting that is considered 'will not happen'.

Re: [PATCH v4 3/3] btrfs: zoned: automatically reclaim zones

2021-04-16 Thread David Sterba

On Thu, Apr 15, 2021 at 10:58:35PM +0900, Johannes Thumshirn wrote:
> @@ -2974,6 +2982,9 @@ void btrfs_init_fs_info(struct btrfs_fs_info *fs_info)
>   fs_info->swapfile_pins = RB_ROOT;
>  
>   fs_info->send_in_progress = 0;
> +
> + fs_info->bg_reclaim_threshold = 75;

The value should be a named constant visible defined in zoned.h with a
comment what it means.

Re: [PATCH v4 3/3] btrfs: zoned: automatically reclaim zones

2021-04-16 Thread David Sterba

On Thu, Apr 15, 2021 at 10:58:35PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -954,10 +954,14 @@ struct btrfs_fs_info {
>   struct work_struct async_data_reclaim_work;
>   struct work_struct preempt_reclaim_work;
>  
> + /* Used to reclaim data space in the background */
> + struct work_struct reclaim_bgs_work;
> +
>   spinlock_t unused_bgs_lock;
>   struct list_head unused_bgs;
>   struct mutex unused_bg_unpin_mutex;
>   struct mutex reclaim_bgs_lock;
> + struct list_head reclaim_bgs;
>  
>   /* Cached block sizes */
>   u32 nodesize;
> @@ -998,6 +1002,8 @@ struct btrfs_fs_info {
>   spinlock_t treelog_bg_lock;
>   u64 treelog_bg;
>  
> + int bg_reclaim_threshold;

That's spreading related members over too many lines, please put them
together with comments what they mean if that's not obvious (like for
the work struct).

Re: [PATCH v4 2/3] btrfs: rename delete_unused_bgs_mutex

2021-04-16 Thread David Sterba

On Thu, Apr 15, 2021 at 10:58:34PM +0900, Johannes Thumshirn wrote:
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -957,7 +957,7 @@ struct btrfs_fs_info {
>   spinlock_t unused_bgs_lock;
>   struct list_head unused_bgs;
>   struct mutex unused_bg_unpin_mutex;
> - struct mutex delete_unused_bgs_mutex;
> + struct mutex reclaim_bgs_lock;

Please write a comment what the mutex protects.

Re: [PATCH 3/4] btrfs: zoned: fail mount if the device does not support zone append

2021-04-16 Thread David Sterba

On Fri, Apr 16, 2021 at 12:05:27PM +0900, Damien Le Moal wrote:
> From: Johannes Thumshirn 
> 
> For zoned btrfs, zone append is mandatory to write to a sequential write
> only zone, otherwise parallel writes to the same zone could result in
> unaligned write errors.
> 
> If a zoned block device does not support zone append (e.g. a dm-crypt
> zoned device using a non-NULL IV cypher), fail to mount.
> 
> Signed-off-by: Johannes Thumshirn 
> Signed-off-by: Damien Le Moal 

Added to misc-next, thanks. I'll queue it for 5.13, it's not an urgent
fix for 5.12 release but i'll tag it as stable so it'll apear in 5.12.x
later.

Re: [PATCH v2 3/3] blkid: support zone reset for wipefs

2021-04-16 Thread David Sterba

On Wed, Apr 14, 2021 at 03:57:42PM +0200, Karel Zak wrote:
> On Wed, Apr 14, 2021 at 10:33:39AM +0900, Naohiro Aota wrote:
> >  /**
> >   * blkid_do_wipe:
> >   * @pr: prober
> > @@ -1267,6 +1310,7 @@ int blkid_do_wipe(blkid_probe pr, int dryrun)
> > const char *off = NULL;
> > size_t len = 0;
> > uint64_t offset, magoff;
> > +   bool conventional;
> 
> BTW, nowhere in libblkid we use "bool". It would be probably better to include
>  to blkidP.h.

Pulling a whole new header just for one local variable that can be int
seems too much.

Re: [PATCH v2 2/3] blkid: add magic and probing for zoned btrfs

2021-04-16 Thread David Sterba

On Wed, Apr 14, 2021 at 10:33:38AM +0900, Naohiro Aota wrote:
> It also supports temporary magic ("!BHRfS_M") in zone #0. The mkfs.btrfs
> first writes the temporary superblock to the zone during the mkfs process.
> It will survive there until the zones are filled up and reset. So, we also
> need to detect this temporary magic.

> +   /*
> +* For zoned btrfs, we also need to detect a temporary superblock
> +* at zone #0. Mkfs.btrfs creates it in the initialize process.
> +* It persits until both zones are filled up then reset.
> +*/
> +   { .magic = "!BHRfS_M", .len = 8, .sboff = 0x40,
> + .is_zoned = 1, .zonenum = 0, .kboff_inzone = 0 },

Should we rather reset the zone twice so the initial superblock does not
have the temporary signature?

For the primary superblock at offset 64K and as a fallback the signature
should be valid for detection purposes (ie. not necessarily to get the
latest superblock).

Re: Dead fs on 2 Fedora systems: block=57084067840 write time tree block corruption detected

2021-04-16 Thread David Sterba

On Thu, Apr 15, 2021 at 11:06:32AM -0600, Chris Murphy wrote:
> First computer/file system:
> 
> (from the photo):
> 
> [   136.259984] BTRFS critical (device nvme0n1p8): corrupt leaf: root=257
> block=31259951104 slot=9 ino=3244515, name hash mismatch with key, have
> 0xF22F547D expect 0x92294C62
> 
> This is not obviously a bit flip. I'm not sure what's going on here.

Not a bitflip in the hash itself, but it's produced by hashing a file
name, and if that had a bitflip then the hashes would differ. We've seen
a that already, there could be traces of the bogus filename in logs.

> Second computer/file system:
> 
> [30177.298027] BTRFS critical (device nvme0n1p8): corrupt leaf: root=791
> block=57084067840 slot=64 ino=1537855, name hash mismatch with key, have
> 0xa461adfd expect 0xa461adf5

Yes that's a bitflip in the hash,

bin(0xa461adfd^0xa461adf5) = 0b1000

Re: [PATCH] btrfs: fix race between transaction aborts and fsyncs leading to use-after-free

2021-04-14 Thread David Sterba

On Mon, Apr 05, 2021 at 12:32:16PM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> There is a race between a task aborting a transaction during a commit,
> a task doing an fsync and the transaction kthread, which leads to an
> use-after-free of the log root tree. When this happens, it results in a
> stack trace like the following:
> 
> [99678.547335] BTRFS info (device dm-0): forced readonly
> [99678.547340] BTRFS warning (device dm-0): Skipping commit of aborted 
> transaction.
> [99678.547341] BTRFS: error (device dm-0) in cleanup_transaction:1958: 
> errno=-5 IO failure
> [99678.547373] BTRFS warning (device dm-0): lost page write due to IO error 
> on /dev/mapper/error-test (-5)
> [99678.547533] BTRFS warning (device dm-0): Skipping commit of aborted 
> transaction.
> [99678.548743] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 
> sector 0xa4e8 len 4096 err no 10
> [99678.549188] BTRFS error (device dm-0): error writing primary super block 
> to device 1
> [99678.551100] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 
> sector 0x12e000 len 4096 err no 10
> [99678.551149] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 
> sector 0x12e008 len 4096 err no 10
> [99678.551205] BTRFS warning (device dm-0): direct IO failed ino 261 rw 0,0 
> sector 0x12e010 len 4096 err no 10
> [99678.551401] BTRFS: error (device dm-0) in write_all_supers:4110: errno=-5 
> IO failure (1 errors while writing supers)
> [99678.565169] BTRFS: error (device dm-0) in btrfs_sync_log:3308: errno=-5 IO 
> failure
> [99678.566132] general protection fault, probably for non-canonical address 
> 0x6b6b6b6b6b6b6b68:  [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
> [99678.567526] CPU: 2 PID: 2458471 Comm: fsstress Not tainted 
> 5.12.0-rc5-btrfs-next-84 #1
> [99678.568531] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> [99678.569980] RIP: 0010:__mutex_lock+0x139/0xa40
> [99678.570556] Code: c0 74 19 (...)
> [99678.573752] RSP: 0018:9f18830d7b00 EFLAGS: 00010202
> [99678.574723] RAX: 6b6b6b6b6b6b6b68 RBX: 0001 RCX: 
> 0002
> [99678.576027] RDX: b9c54d13 RSI:  RDI: 
> 
> [99678.577314] RBP: 9f18830d7bc0 R08:  R09: 
> 
> [99678.578601] R10: 9f18830d7be0 R11: 0001 R12: 
> 8c6cd199c040
> [99678.579890] R13: 8c6c95821358 R14: fffb R15: 
> 8c6cbcf01358
> [99678.581282] FS:  7fa9140c2b80() GS:8c6fac60() 
> knlGS:
> [99678.582818] CS:  0010 DS:  ES:  CR0: 80050033
> [99678.583771] CR2: 7fa913d52000 CR3: 00013d2b4003 CR4: 
> 00370ee0
> [99678.584600] DR0:  DR1:  DR2: 
> 
> [99678.585425] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [99678.586247] Call Trace:
> [99678.586542]  ? __btrfs_handle_fs_error+0xde/0x146 [btrfs]
> [99678.587260]  ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
> [99678.587930]  ? btrfs_sync_log+0x7c1/0xf20 [btrfs]
> [99678.588573]  btrfs_sync_log+0x7c1/0xf20 [btrfs]
> [99678.589222]  btrfs_sync_file+0x40c/0x580 [btrfs]
> [99678.589947]  do_fsync+0x38/0x70
> [99678.590514]  __x64_sys_fsync+0x10/0x20
> [99678.591196]  do_syscall_64+0x33/0x80
> [99678.591829]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [99678.592744] RIP: 0033:0x7fa9142a55c3
> [99678.593403] Code: 8b 15 09 (...)
> [99678.596777] RSP: 002b:7fff26278d48 EFLAGS: 0246 ORIG_RAX: 
> 004a
> [99678.598143] RAX: ffda RBX: 563c83cb4560 RCX: 
> 7fa9142a55c3
> [99678.599450] RDX: 7fff26278cb0 RSI: 7fff26278cb0 RDI: 
> 0005
> [99678.600770] RBP: 0005 R08: 0001 R09: 
> 7fff26278d5c
> [99678.602067] R10:  R11: 0246 R12: 
> 0340
> [99678.603380] R13: 7fff26278de0 R14: 7fff26278d96 R15: 
> 563c83ca57c0
> [99678.604714] Modules linked in: btrfs dm_zero dm_snapshot dm_thin_pool (...)
> [99678.616646] ---[ end trace ee2f1b19327d791d ]---
> 
> The steps that lead to this crash are the following:
> 
> 1) We are at transaction N;
> 
> 2) We have two tasks with a transaction handle attached to transaction N.
>Task A and Task B. Task B is doing an fsync;
> 
> 3) Task B is at btrfs_sync_log(), and has saved fs_info->log_root_tree
>into a local variable named 'log_root_tree' at the top of
>btrfs_sync_log(). Task B is about to call write_all_supers(), but
>before that...
> 
> 4) Task A calls btrfs_commit_transaction(), and after it sets the
>transaction state to TRANS_STATE_COMMIT_START, an error happens before
>it waits for the transaction's 'num_writers' counter to reach a value
>of 1 (no one else attached to the transaction), so it jumps to the
>label "cleanup_transaction";
> 
> 5) Task A then calls cleanup_transaction(), wh

[GIT PULL] Btrfs fix for 5.12-rc7

2021-04-11 Thread David Sterba

From: David Sterba 

Hi,

here's one more patch that we'd like to get to 5.12 before release, it's
changing where and how the superblock is stored in the zoned mode. It is
an on-disk format change but so far there are no implications for users
as the proper mkfs support hasn't been merged and is waiting for the
kernel side to settle.

Until now, the superblocks were derived from the zone index, but zone
size can differ per device. This is changed to be based on fixed offset
values, to make it independent of the device zone size.

The work on that got a bit delayed, we discussed the exact locations to
support potential device sizes and usecases. (Partially delayed also due
to my vacation.) Having that in the same release where the zoned mode is
declared usable is highly desired, there are userspace projects that
need to be updated to recognize the feature.  Pushing that to the next
release would make things harder to test.

Please pull, thanks.



The following changes since commit c1d6abdac46ca8127274bea195d804e3f2cec7ee:

  btrfs: fix check_data_csum() error message for direct I/O (2021-03-18 
21:25:11 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-5.12-rc6-tag

for you to fetch changes up to 53b74fa990bf76f290aa5930abfcf37424a1a865:

  btrfs: zoned: move superblock logging zone location (2021-04-10 12:13:16 
+0200)


Naohiro Aota (1):
  btrfs: zoned: move superblock logging zone location

 fs/btrfs/zoned.c | 53 ++---
 1 file changed, 42 insertions(+), 11 deletions(-)

[PATCH v3] btrfs: zoned: move superblock logging zone location

2021-04-10 Thread David Sterba

From: Naohiro Aota 

Moves the location of the superblock logging zones. The new locations of
the logging zones are now determined based on fixed block addresses
instead of on fixed zone numbers.

The old placement method based on fixed zone numbers causes problems when
one needs to inspect a file system image without access to the drive zone
information. In such case, the super block locations cannot be reliably
determined as the zone size is unknown. By locating the superblock logging
zones using fixed addresses, we can scan a dumped file system image without
the zone information since a super block copy will always be present at or
after the fixed known locations.

Introduce the following three pairs of zones containing fixed offset
locations, regardless of the device zone size.

  - primary superblock: offset   0B (and the following zone)
  - first copy: offset 512G (and the following zone)
  - Second copy:offset   4T (4096G, and the following zone)

If a logging zone is outside of the disk capacity, we do not record the
superblock copy.

The first copy position is much larger than for a non-zoned filesystem,
which is at 64M.  This is to avoid overlapping with the log zones for
the primary superblock. This higher location is arbitrary but allows
supporting devices with very large zone sizes, plus some space around in
between.

Such large zone size is unrealistic and very unlikely to ever be seen in
real devices. Currently, SMR disks have a zone size of 256MB, and we are
expecting ZNS drives to be in the 1-4GB range, so this limit gives us
room to breathe. For now, we only allow zone sizes up to 8GB. The
maximum zone size that would still fit in the space is 256G.

The fixed location addresses are somewhat arbitrary, with the intent of
maintaining superblock reliability for smaller and larger devices, with
the preference for the latter. For this reason, there are two superblocks
under the first 1T. This should cover use cases for physical devices and
for emulated/device-mapper devices.

The superblock logging zones are reserved for superblock logging and
never used for data or metadata blocks. Note that we only reserve the
two zones per primary/copy actually used for superblock logging. We do
not reserve the ranges of zones possibly containing superblocks with the
largest supported zone size (0-16GB, 512G-528GB, 4096G-4112G).

The zones containing the fixed location offsets used to store
superblocks on a non-zoned volume are also reserved to avoid confusion.

Signed-off-by: Naohiro Aota 
Signed-off-by: David Sterba 
---

For context see replies under
https://lore.kernel.org/linux-btrfs/2f58edb74695825632c77349b000d31f16cb3226.1617870145.git.naohiro.a...@wdc.com/

 fs/btrfs/zoned.c | 53 ++--
 1 file changed, 42 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
index 1f972b75a9ab..eeb3ebe11d7a 100644
--- a/fs/btrfs/zoned.c
+++ b/fs/btrfs/zoned.c
@@ -21,9 +21,30 @@
 /* Pseudo write pointer value for conventional zone */
 #define WP_CONVENTIONAL ((u64)-2)
 
+/*
+ * Location of the first zone of superblock logging zone pairs.
+ *
+ * - primary superblock:0B (zone 0)
+ * - first copy:  512G (zone starting at that offset)
+ * - second copy:   4T (zone starting at that offset)
+ */
+#define BTRFS_SB_LOG_PRIMARY_OFFSET(0ULL)
+#define BTRFS_SB_LOG_FIRST_OFFSET  (512ULL * SZ_1G)
+#define BTRFS_SB_LOG_SECOND_OFFSET (4096ULL * SZ_1G)
+
+#define BTRFS_SB_LOG_FIRST_SHIFT   const_ilog2(BTRFS_SB_LOG_FIRST_OFFSET)
+#define BTRFS_SB_LOG_SECOND_SHIFT  const_ilog2(BTRFS_SB_LOG_SECOND_OFFSET)
+
 /* Number of superblock log zones */
 #define BTRFS_NR_SB_LOG_ZONES 2
 
+/*
+ * Maximum supported zone size. Currently, SMR disks have a zone size of
+ * 256MiB, and we are expecting ZNS drives to be in the 1-4GiB range. We do not
+ * expect the zone size to become larger than 8GiB in the near future.
+ */
+#define BTRFS_MAX_ZONE_SIZESZ_8G
+
 static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void 
*data)
 {
struct blk_zone *zones = data;
@@ -111,23 +132,22 @@ static int sb_write_pointer(struct block_device *bdev, 
struct blk_zone *zones,
 }
 
 /*
- * The following zones are reserved as the circular buffer on ZONED btrfs.
- *  - The primary superblock: zones 0 and 1
- *  - The first copy: zones 16 and 17
- *  - The second copy: zones 1024 or zone at 256GB which is minimum, and
- * the following one
+ * Get the first zone number of the superblock mirror
  */
 static inline u32 sb_zone_number(int shift, int mirror)
 {
-   ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
+   u64 zone;
 
+   ASSERT(mirror < BTRFS_SUPER_MIRROR_MAX);
switch (mirror) {
-   case 0: return 0;
-   case 1: return 16;
-   case 2: return min_t(u64, btrfs_sb_offset(mirror) >> shift, 1024);
+   case 0: zone = 0; break;
+   case 1:

Re: [PATCH v2] btrfs: zoned: move superblock logging zone location

2021-04-10 Thread David Sterba

On Fri, Apr 09, 2021 at 01:07:19PM +0200, David Sterba wrote:
> On Thu, Apr 08, 2021 at 05:25:28PM +0900, Naohiro Aota wrote:
> > This commit moves the location of the superblock logging zones. The new
> > locations of the logging zones are now determined based on fixed block
> > addresses instead of on fixed zone numbers.
> > 
> > The old placement method based on fixed zone numbers causes problems when
> > one needs to inspect a file system image without access to the drive zone
> > information. In such case, the super block locations cannot be reliably
> > determined as the zone size is unknown. By locating the superblock logging
> > zones using fixed addresses, we can scan a dumped file system image without
> > the zone information since a super block copy will always be present at or
> > after the fixed location.
> > 
> > This commit introduces the following three pairs of zones containing fixed
> > offset locations, regardless of the device zone size.
> > 
> >   - Primary superblock: zone starting at offset 0 and the following zone
> >   - First copy: zone containing offset 64GB and the following zone
> >   - Second copy: zone containing offset 256GB and the following zone
> > 
> > If a logging zone is outside of the disk capacity, we do not record the
> > superblock copy.
> > 
> > The first copy position is much larger than for a regular btrfs volume
> > (64M).  This increase is to avoid overlapping with the log zones for the
> > primary superblock. This higher location is arbitrary but allows supporting
> > devices with very large zone sizes, up to 32GB. Such large zone size is
> > unrealistic and very unlikely to ever be seen in real devices. Currently,
> > SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
> > in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
> > we only allow zone sizes up to 8GB, below this hard limit of 32GB.
> > 
> > The fixed location addresses are somewhat arbitrary, but with the intent of
> > maintaining superblock reliability even for smaller devices. For this
> > reason, the superblock fixed locations do not exceed 1TB.
> 
> Yeah, so how much should we adjust the offsets in favor of smaller
> devices instead of larger ones? I think this is going the wrong
> direction, the capacity will grow, the zones are supposed to be larger.

For the record, we had a group discussion about that and the consensus
is to do sb offsets at 0, 512G and 4T.

The rationale is to prefer large devices slightly more than smaller, but
are 2 superblocks available under 1T to cover small devices. Physical
devices' capacity grows and the 4T copy should be available on all of
the HDD type, while flash storage growth is a bit slower but still
projected to be up to 10T within the time we care about now.

The small devices are for usecases with not necessarily a physical
device, eg. emulated or on top of device mapper targets.

The superblock copy itself is not a frequently utilized feature but we
still want to keep it, for parity with non-zoned mode and "just in
case".

I'll update the patch changelog and code to reflect the changes and
resend to all people involved. The pull request is scheduled for
tomorrow.

Re: [PATCH v7 03/38] btrfs: handle errors from select_reloc_root()

2021-04-09 Thread David Sterba

On Thu, Mar 11, 2021 at 01:10:03PM -0500, Josef Bacik wrote:
> On 2/26/21 1:30 PM, David Sterba wrote:
> > On Wed, Dec 16, 2020 at 11:26:19AM -0500, Josef Bacik wrote:
> >> Currently select_reloc_root() doesn't return an error, but followup
> >> patches will make it possible for it to return an error.  We do have
> >> proper error recovery in do_relocation however, so handle the
> >> possibility of select_reloc_root() having an error properly instead of
> >> BUG_ON(!root).  I've also adjusted select_reloc_root() to return
> >> ERR_PTR(-ENOENT) if we don't find a root, instead of NULL, to make the
> >> error case easier to deal with.  I've replaced the BUG_ON(!root) with an
> >> ASSERT(0) for this case as it indicates we messed up the backref walking
> >> code, but it could also indicate corruption.
> >>
> >> Signed-off-by: Josef Bacik 
> >> ---
> >>   fs/btrfs/relocation.c | 15 ---
> >>   1 file changed, 12 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >> index 08ffaa93b78f..741068580455 100644
> >> --- a/fs/btrfs/relocation.c
> >> +++ b/fs/btrfs/relocation.c
> >> @@ -2024,8 +2024,14 @@ struct btrfs_root *select_reloc_root(struct 
> >> btrfs_trans_handle *trans,
> >>if (!next || next->level <= node->level)
> >>break;
> >>}
> >> -  if (!root)
> >> -  return NULL;
> >> +  if (!root) {
> >> +  /*
> >> +   * This can happen if there's fs corruption or if there's a bug
> >> +   * in the backref lookup code.
> >> +   */
> >> +  ASSERT(0);
> > 
> > You've added these assert(0) to several patches and I think it's wrong.
> > The asserts are supposed to verify invariants, you can hardly say that
> > we're expecting 0 to be true, so the construct serves as "please crash
> > here", which is no better than BUG().  It's been spreading, there are
> > like 25 now.
> 
> They are much better than a BUG_ON(), as they won't trip in production, 
> they'll 
> only trip for developers.

I'm not suggesting to use BUG_ON, only in rare cases (ideally). For
developers what should blow up are conditions that validate the
invariants, either in advance or after some action.

> And we need them in many of these cases, and this 
> example you've called out is a clear example of where we absolutely want to 
> differentiate, because we have 3 different failure modes that will return 
> -EUCLEAN.  If I add a ASSERT(ret != -EUCLEAN) to all of the callers then when 
> the developer hits a logic bug they'll have to go and manually add in their 
> own 
> assert to figure out which error condition failed.

The idea is to add more conditions to differentiate if it's corrupted fs
or if it's a development bug. But for testing I'd rather see a way we
can exercise the corruption path, eg. fuzzing or crafted images, up to
the point where EUCLEAN is encountered by some transaction abort or
normal error.

> Instead I've added 
> explanations in the comments for each assert and then have an assert for 
> every 
> error condition so that it's clear where things went wrong.

There can be exceptional cases where distinguishing can't be done easily
or at all so the comments are fine but I'd rather not encourage the
ASSERT(0) pattern instead of thinking hard if it's really the right way
to handle the errors. It too much resembles the BUG_ON() anti-pattern.

Re: [PATCH v8 00/39] Cleanup error handling in relocation

2021-04-09 Thread David Sterba

On Fri, Mar 12, 2021 at 03:24:55PM -0500, Josef Bacik wrote:
> v7->v8:
> - Reordered some of the patches, so that the callers of functions that added 
> new
>   error cases were fixed first, and then added the new error handler.
> - Moved a few of the ASSERT(0) to where they made sense.  Left the others that
>   are in functions that can have multiple cases of the same error that 
> indicates
>   a logic error.

With some updates in changelogs and comments moved from topic branch to
misc-next. I'm still not happy about the ASSERT(0), it should be a real
error or a real assert. We can't observe the same behaviour with asserts
enabled and disabled.

Re: [PATCH v2] btrfs: zoned: move superblock logging zone location

2021-04-09 Thread David Sterba

On Thu, Apr 08, 2021 at 05:25:28PM +0900, Naohiro Aota wrote:
> This commit moves the location of the superblock logging zones. The new
> locations of the logging zones are now determined based on fixed block
> addresses instead of on fixed zone numbers.
> 
> The old placement method based on fixed zone numbers causes problems when
> one needs to inspect a file system image without access to the drive zone
> information. In such case, the super block locations cannot be reliably
> determined as the zone size is unknown. By locating the superblock logging
> zones using fixed addresses, we can scan a dumped file system image without
> the zone information since a super block copy will always be present at or
> after the fixed location.
> 
> This commit introduces the following three pairs of zones containing fixed
> offset locations, regardless of the device zone size.
> 
>   - Primary superblock: zone starting at offset 0 and the following zone
>   - First copy: zone containing offset 64GB and the following zone
>   - Second copy: zone containing offset 256GB and the following zone
> 
> If a logging zone is outside of the disk capacity, we do not record the
> superblock copy.
> 
> The first copy position is much larger than for a regular btrfs volume
> (64M).  This increase is to avoid overlapping with the log zones for the
> primary superblock. This higher location is arbitrary but allows supporting
> devices with very large zone sizes, up to 32GB. Such large zone size is
> unrealistic and very unlikely to ever be seen in real devices. Currently,
> SMR disks have a zone size of 256MB, and we are expecting ZNS drives to be
> in the 1-4GB range, so this 32GB limit gives us room to breathe. For now,
> we only allow zone sizes up to 8GB, below this hard limit of 32GB.
> 
> The fixed location addresses are somewhat arbitrary, but with the intent of
> maintaining superblock reliability even for smaller devices. For this
> reason, the superblock fixed locations do not exceed 1TB.

Yeah, so how much should we adjust the offsets in favor of smaller
devices instead of larger ones? I think this is going the wrong
direction, the capacity will grow, the zones are supposed to be larger.

> The superblock logging zones are reserved for superblock logging and never
> used for data or metadata blocks. Note that we only reserve the two zones
> per primary/copy actually used for superblock logging. We do not reserve
> the ranges of zones possibly containing superblocks with the largest
> supported zone size (0-16GB, 64G-80GB, 256G-272GB).
> 
> The zones containing the fixed location offsets used to store superblocks
> in a regular btrfs volume (no zoned case) are also reserved to avoid
> confusion.
> 
> Signed-off-by: Naohiro Aota 
> ---
>  fs/btrfs/zoned.c | 43 +++
>  1 file changed, 35 insertions(+), 8 deletions(-)
> 
> diff --git a/fs/btrfs/zoned.c b/fs/btrfs/zoned.c
> index 1f972b75a9ab..a4b195fe08a0 100644
> --- a/fs/btrfs/zoned.c
> +++ b/fs/btrfs/zoned.c
> @@ -21,9 +21,28 @@
>  /* Pseudo write pointer value for conventional zone */
>  #define WP_CONVENTIONAL ((u64)-2)
>  
> +/*
> + * Location of the first zone of superblock logging zone pairs.
> + * - Primary superblock: the zone containing offset 0 (zone 0)
> + * - First superblock copy: the zone containing offset 64G
> + * - Second superblock copy: the zone containing offset 256G
> + */
> +#define BTRFS_PRIMARY_SB_LOG_ZONE 0ULL
> +#define BTRFS_FIRST_SB_LOG_ZONE (64ULL * SZ_1G)
> +#define BTRFS_SECOND_SB_LOG_ZONE (256ULL * SZ_1G)
> +#define BTRFS_FIRST_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_FIRST_SB_LOG_ZONE)
> +#define BTRFS_SECOND_SB_LOG_ZONE_SHIFT const_ilog2(BTRFS_SECOND_SB_LOG_ZONE)

This should be named like BTRFS_SB_LOG_* ie. "namespaces" or common
prefixes for the same class of valuesd.

> +
>  /* Number of superblock log zones */
>  #define BTRFS_NR_SB_LOG_ZONES 2
>  
> +/*
> + * Maximum size of zones. Currently, SMR disks have a zone size of 256MB,
> + * and we are expecting ZNS drives to be in the 1-4GB range. We do not
> + * expect the zone size to become larger than 8GB in the near future.
> + */
> +#define BTRFS_MAX_ZONE_SIZE SZ_8G
> +
>  static int copy_zone_info_cb(struct blk_zone *zone, unsigned int idx, void 
> *data)
>  {
>   struct blk_zone *zones = data;
> @@ -111,11 +130,8 @@ static int sb_write_pointer(struct block_device *bdev, 
> struct blk_zone *zones,
>  }
>  
>  /*
> - * The following zones are reserved as the circular buffer on ZONED btrfs.
> - *  - The primary superblock: zones 0 and 1
> - *  - The first copy: zones 16 and 17
> - *  - The second copy: zones 1024 or zone at 256GB which is minimum, and
> - * the following one
> + * Get the zone number of the first zone of a pair of contiguous zones used
> + * for superblock logging.
>   */
>  static inline u32 sb_zone_number(int shift, int mirror)
>  {
> @@ -123,8 +139,8 @@ static inline u32 sb_zone_number(int shift, int mirror)

Re: [PATCH v2] btrfs: zoned: move superblock logging zone location

2021-04-09 Thread David Sterba

On Thu, Apr 08, 2021 at 10:57:32AM -0400, Josef Bacik wrote:
> On 4/8/21 4:25 AM, Naohiro Aota wrote:
> > confusion.
> > 
> > Signed-off-by: Naohiro Aota 
> 
> Thanks Naohiro, this makes it much easier to understand,
> 
> Reviewed-by: Josef Bacik 
> 
> Dave, I know you're on vacation, this needs to go to Linus for this cycle so 
> we 
> don't have two different SB slots for two different kernels.  I don't want to 
> disturb your vacation, so I'll put this in a pull request tomorrow to Linus.  
> If 
> you already had plans to do this let me know and I'll hold off.  Thanks,

We don't even have mkfs support for zoned filesystem merged so that the
format is not final was not an issue and that's why I did not hurry
pushing it in that cycle and I still rather not do that.

The design-level problem is the backup copy offset, that's been
discussed in some previous iteration - whether it's right to place all
copies under the first 1T or if we should spread them. For me this is
still not settled, I haven't read through all the replies and given it
enough thought.

That could lead to another change of the offsets that I'm sure we want
to avoid.

Re: [PATCH] btrfs: Use readahead_batch_length

2021-04-06 Thread David Sterba

On Sun, Mar 21, 2021 at 09:03:11PM +, Matthew Wilcox (Oracle) wrote:
> Implement readahead_batch_length() to determine the number of bytes in
> the current batch of readahead pages and use it in btrfs.
> 
> Signed-off-by: Matthew Wilcox (Oracle) 

Thanks, I'll take it through my tree as btrfs is probably the only user
of the new helper. The MM list hasn't been CCed, I've added it now but I
think the patch is trivial enough and does not need another ack, so it's
just for the record.

> ---
>  fs/btrfs/extent_io.c| 6 ++
>  include/linux/pagemap.h | 9 +
>  2 files changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> index e9837562f7d6..97ac4ddb2857 100644
> --- a/fs/btrfs/extent_io.c
> +++ b/fs/btrfs/extent_io.c
> @@ -4875,10 +4875,8 @@ void extent_readahead(struct readahead_control *rac)
>   int nr;
>  
>   while ((nr = readahead_page_batch(rac, pagepool))) {
> - u64 contig_start = page_offset(pagepool[0]);
> - u64 contig_end = page_offset(pagepool[nr - 1]) + PAGE_SIZE - 1;
> -
> - ASSERT(contig_start + nr * PAGE_SIZE - 1 == contig_end);
> + u64 contig_start = readahead_pos(rac);
> + u64 contig_end = contig_start + readahead_batch_length(rac) - 1;
>  
>   contiguous_readpages(pagepool, nr, contig_start, contig_end,
>   &em_cached, &bio, &bio_flags, &prev_em_start);
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index 2cbfd4c36026..92939afd4944 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -1174,6 +1174,15 @@ static inline unsigned int readahead_count(struct 
> readahead_control *rac)
>   return rac->_nr_pages;
>  }
>  
> +/**
> + * readahead_batch_length - The number of bytes in the current batch.
> + * @rac: The readahead request.
> + */
> +static inline loff_t readahead_batch_length(struct readahead_control *rac)
> +{
> + return rac->_batch_count * PAGE_SIZE;
> +}
> +
>  static inline unsigned long dir_pages(struct inode *inode)
>  {
>   return (unsigned long)(inode->i_size + PAGE_SIZE - 1) >>

--

Re: [PATCH] fs: btrfs: Remove repeated struct declaration

2021-04-06 Thread David Sterba

On Thu, Apr 01, 2021 at 04:03:39PM +0800, Wan Jiabing wrote:
> struct btrfs_inode is declared twice. One is declared at 67th line.
> The blew declaration is not needed. Remove the duplicate.
> struct btrfs_fs_info should be declared in the forward declarations.
> Move it to the forward declarations.

I've reworded the changelog a bit, patch added to misc-next, thanks.

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-06 Thread David Sterba

On Tue, Apr 06, 2021 at 10:31:58AM +0800, Anand Jain wrote:
> On 05/04/2021 14:14, Qu Wenruo wrote:
> > 
> > 
> > On 2021/4/3 下午7:08, David Sterba wrote:
> >> On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >>> This patchset can be fetched from the following github repo, along with
> >>> the full subpage RW support:
> >>> https://github.com/adam900710/linux/tree/subpage
> >>>
> >>> This patchset is for metadata read write support.
> >>
> >>> Qu Wenruo (13):
> >>>    btrfs: add sysfs interface for supported sectorsize
> >>>    btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>>    btrfs: remove unnecessary variable shadowing in 
> >>> btrfs_invalidatepage()
> >>>    btrfs: refactor how we iterate ordered extent in
> >>>  btrfs_invalidatepage()
> >>>    btrfs: introduce helpers for subpage dirty status
> >>>    btrfs: introduce helpers for subpage writeback status
> >>>    btrfs: allow btree_set_page_dirty() to do more sanity check on 
> >>> subpage
> >>>  metadata
> >>>    btrfs: support subpage metadata csum calculation at write time
> >>>    btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>>    btrfs: make the page uptodate assert to be subpage compatible
> >>>    btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>>    btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>>  compatible
> >>>    btrfs: add subpage overview comments
> >>
> >> Moved from topic branch to misc-next.
> >>
> > 
> > Note sure if it's too late, but I inserted the last comment patch into
> > the wrong location.
> > 
> > In fact, there are 4 more patches to make
> 
> 
> > subpage metadata RW really work:
> 
>   I took some time to go through these patches, which are lined up for
>   integration.
> 
>   With this set of patches that are being integrated, we don't yet
>   support RW mount of filesystem if PAGESIZE > sectorsize as a whole.
>   Subpage metadata RW support, how is it to be used in the production?
>   OR How is this supposed to be tested?

What gets merged to misc-next is incrementally adding support for the
whole subpage feature. This would quite hard to get in in one go so it's
been split to patchsets with known limitations. Qu lists what works and
what does not in the cover letter.

With known missing functionality it obviously can't be used for
production, just for testing. There are likely patches in Qu's
development branches and more patches still to be written, so even the
testing is partial with known failures or bugs.

>   OR should you just cleanup the title as preparatory patches to support
>   subpage RW? It is confusing.

I think the title says what the patchset does, adding rw support for
metadata, in a sense it's still preparatory, yes.

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-06 Thread David Sterba

On Mon, Apr 05, 2021 at 02:14:34PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/4/3 下午7:08, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> This patchset can be fetched from the following github repo, along with
> >> the full subpage RW support:
> >> https://github.com/adam900710/linux/tree/subpage
> >>
> >> This patchset is for metadata read write support.
> >
> >> Qu Wenruo (13):
> >>btrfs: add sysfs interface for supported sectorsize
> >>btrfs: use min() to replace open-code in btrfs_invalidatepage()
> >>btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
> >>btrfs: refactor how we iterate ordered extent in
> >>  btrfs_invalidatepage()
> >>btrfs: introduce helpers for subpage dirty status
> >>btrfs: introduce helpers for subpage writeback status
> >>btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> >>  metadata
> >>btrfs: support subpage metadata csum calculation at write time
> >>btrfs: make alloc_extent_buffer() check subpage dirty bitmap
> >>btrfs: make the page uptodate assert to be subpage compatible
> >>btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
> >>btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> >>  compatible
> >>btrfs: add subpage overview comments
> >
> > Moved from topic branch to misc-next.
> >
> 
> Note sure if it's too late, but I inserted the last comment patch into
> the wrong location.

Not late yet but getting very close to the pre-merge window code freeze.

> In fact, there are 4 more patches to make subpage metadata RW really work:
>   btrfs: make lock_extent_buffer_for_io() to be subpage compatible
>   btrfs: introduce submit_eb_subpage() to submit a subpage metadata page
>   btrfs: introduce end_bio_subpage_eb_writepage() function
>   btrfs: introduce write_one_subpage_eb() function
> 
> Those 4 patches should be before the final comment patch.
> 
> Should I just send the 4 patches in a separate series?

As they've been posted now, I'll add them to for-next and reorder before
the last patch with comment, after some testing.

> Sorry for the bad split, it looks like multi-series patches indeed has
> such problem...

Yeah, but so far it's been all fixable given the scope of the whole
subpage support.

Re: [PATCH v8 28/39] btrfs: handle btrfs_search_slot failure in replace_path

2021-04-06 Thread David Sterba

On Fri, Mar 12, 2021 at 03:25:23PM -0500, Josef Bacik wrote:
> This can fail for any number of reasons, why bring the whole box down
> with it?

I've written something more relevant for the code change.

> Reviewed-by: Qu Wenruo 
> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/relocation.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index 592b2d156626..6e8d89e4733a 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -1322,7 +1322,8 @@ int replace_path(struct btrfs_trans_handle *trans, 
> struct reloc_control *rc,
>   path->lowest_level = level;
>   ret = btrfs_search_slot(trans, src, &key, path, 0, 1);
>   path->lowest_level = 0;
> - BUG_ON(ret);
> + if (ret)
> + break;

As replace_path returns positive values, ie. the level, search failing
to find the key could return 1 which would be wrongly interpreted if
returned as-is. The usual pattern is to switch that to -ENOENT, like

--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -1323,8 +1323,11 @@ int replace_path(struct btrfs_trans_handle *trans, 
struct reloc_control *rc,
path->lowest_level = level;
ret = btrfs_search_slot(trans, src, &key, path, 0, 1);
path->lowest_level = 0;
-   if (ret < 0)
+   if (ret) {
+   if (ret > 0)
+   ret = -ENOENT;
break;
+   }
 
/*
 * Info qgroup to trace both subtrees.
---

Re: [PATCH v2] btrfs: fix lockdep warning while mounting sprout fs

2021-04-06 Thread David Sterba

On Mon, Apr 05, 2021 at 05:18:32PM +0800, Su Yue wrote:
> 
> On Mon 05 Apr 2021 at 16:38, Anand Jain  
> wrote:
> 
> > Ping again.
> >
> It's already queued in misc-next.

> commit 441737bb30f83914bb8517f52088c0130138d74b (misc-next)
> Author: Anand Jain 
> Date:   Fri Jul 17 18:05:25 2020 +0800

No it's not, you must have checked some very old snapshot of misc-next,
I don't even have 441737bb30f83914bb8517f52088c0130138d74b in my stale
commit objects so it's been 'git gc'ed already.

Re: [PATCH -next] btrfs: integrity-checker: use DEFINE_MUTEX() for mutex lock

2021-04-06 Thread David Sterba

On Mon, Apr 05, 2021 at 06:14:46PM +0800, Zheng Yongjun wrote:
> mutex lock can be initialized automatically with DEFINE_MUTEX()
> rather than explicitly calling mutex_init().
> 
> Reported-by: Hulk Robot 
> Signed-off-by: Zheng Yongjun 

You've sent this patch already and haven't provided the answer why the
changes should be done.
https://lore.kernel.org/linux-btrfs/20210104144559.gf6...@twin.jikos.cz/

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-03 Thread David Sterba

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> This patchset can be fetched from the following github repo, along with
> the full subpage RW support:
> https://github.com/adam900710/linux/tree/subpage
> 
> This patchset is for metadata read write support.

> Qu Wenruo (13):
>   btrfs: add sysfs interface for supported sectorsize
>   btrfs: use min() to replace open-code in btrfs_invalidatepage()
>   btrfs: remove unnecessary variable shadowing in btrfs_invalidatepage()
>   btrfs: refactor how we iterate ordered extent in
> btrfs_invalidatepage()
>   btrfs: introduce helpers for subpage dirty status
>   btrfs: introduce helpers for subpage writeback status
>   btrfs: allow btree_set_page_dirty() to do more sanity check on subpage
> metadata
>   btrfs: support subpage metadata csum calculation at write time
>   btrfs: make alloc_extent_buffer() check subpage dirty bitmap
>   btrfs: make the page uptodate assert to be subpage compatible
>   btrfs: make set/clear_extent_buffer_dirty() to be subpage compatible
>   btrfs: make set_btree_ioerr() accept extent buffer and to be subpage
> compatible
>   btrfs: add subpage overview comments

Moved from topic branch to misc-next.

Re: [PATCH v3 05/13] btrfs: introduce helpers for subpage dirty status

2021-04-01 Thread David Sterba

On Thu, Mar 25, 2021 at 03:14:37PM +0800, Qu Wenruo wrote:
> --- a/fs/btrfs/subpage.h
> +++ b/fs/btrfs/subpage.h
> @@ -20,6 +20,7 @@ struct btrfs_subpage {
>   spinlock_t lock;
>   u16 uptodate_bitmap;
>   u16 error_bitmap;
> + u16 dirty_bitmap;
>   union {
>   /*
>* Structures only used by metadata
> @@ -87,5 +88,19 @@ bool btrfs_page_test_##name(const struct btrfs_fs_info 
> *fs_info,   \
>  
>  DECLARE_BTRFS_SUBPAGE_OPS(uptodate);
>  DECLARE_BTRFS_SUBPAGE_OPS(error);
> +DECLARE_BTRFS_SUBPAGE_OPS(dirty);
> +
> +/*
> + * Extra clear_and_test function for subpage dirty bitmap.
> + *
> + * Return true if we're the last bits in the dirty_bitmap and clear the
> + * dirty_bitmap.
> + * Return false otherwise.
> + *
> + * NOTE: Callers should manually clear page dirty for true case, as we have
> + * extra handling for tree blocks.
> + */

I've moved the function comment to subpage.c

> +bool btrfs_subpage_clear_and_test_dirty(const struct btrfs_fs_info *fs_info,
> + struct page *page, u64 start, u32 len);
>  
>  #endif
> -- 
> 2.30.1

Re: [PATCH] btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags

2021-04-01 Thread David Sterba

On Wed, Mar 31, 2021 at 11:07:26AM +, Filipe Manana wrote:
> On Mon, Mar 29, 2021 at 7:49 PM David Sterba  wrote:
> >
> > On Tue, Mar 23, 2021 at 06:39:49PM +, fdman...@kernel.org wrote:
> > > From: Filipe Manana 
> > >
> > > If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
> > > file that has the S_SYNC attribute set, we totally ignore that and do not
> > > durably persist the reflink changes. Since a reflink can change the data
> > > readable from a file (and mtime/ctime, or a file size), it makes sense to
> > > durably persist (fsync) the source and destination files/ranges.
> > >
> > > This was previously discussed at:
> > >
> > > https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
> > >
> > > The recently introduced test case generic/628, from fstests, exercises
> > > these scenarios and currently fails without this change.
> > >
> > > So make sure we fsync the source and destination files/ranges when either
> > > of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
> > > just like XFS already does.
> > >
> > > Signed-off-by: Filipe Manana 
> >
> > Added to misc-next, thanks.
> 
> Can you squash the following diff into it?
> 
> https://pastebin.com/raw/ARSSDDxd

Squashed in, thanks.

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-04-01 Thread David Sterba

On Thu, Apr 01, 2021 at 01:36:56PM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/30 上午2:53, David Sterba wrote:
> > On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> >> v3:
> >> - Rename the sysfs to supported_sectorsizes
> >>
> >> - Rebased to latest misc-next branch
> >>This removes 2 cleanup patches.
> >>
> >> - Add new overview comment for subpage metadata
> >
> > V3 is now in for-next, targeting merge for 5.13. Please post any fixups
> > as replies to the individual patches, I'll fold them in, rather a full
> > series resend. Thanks.
> >
> Is it possible to drop patch "[PATCH v3 04/13] btrfs: refactor how we
> iterate ordered extent in btrfs_invalidatepage()"?

Dropped, there were no conflicts in the followup patches.

> Since in the series, there are no other patches touching it, dropping it
> should not involve too much hassle.
> 
> The problem here is, how we handle ordered extent really belongs to the
> data write path.
> 
> Furthermore, after all the data RW related testing, it turns out that
> the ordered extent code has several problems:
> 
> - Separate indicators for ordered extent
>We use PagePriavte2 to indicate whether we have pending ordered extent
>io.
>But it is not properly integrated into ordered extent code, nor really
>properly documented.
> 
> - Complex call sites requirement
>For endio we don't care whether we finished the ordered extent, while
>for invalidatepage, we don't really need to bother if we finished all
>the ordered extents in the range.
> 
>Thus we really don't need to bother who finished the ordered extents,
>but just want to mark the io finished for the range.
> 
> - Lack subpage compatibility
>That's why I'm here complaining, especially due to the PagePrivate2
>usage.
>It needs to be converted to a new bitmap.
> 
> There will be a refactor on the btrfs_dec_test_*_ordered_pending()
> functions soon, and obvious the existing call sites will all be gone.
> 
> Thus that fourth patch makes no sense.

Ok, thanks for the explanation.

Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize

2021-04-01 Thread David Sterba

On Thu, Mar 25, 2021 at 03:14:33PM +0800, Qu Wenruo wrote:
> Add extra sysfs interface features/supported_ro_sectorsize and
> features/supported_rw_sectorsize to indicate subpage support.
> 
> Currently for supported_rw_sectorsize all architectures only have their
> PAGE_SIZE listed.
> 
> While for supported_ro_sectorsize, for systems with 64K page size, 4K
> sectorsize is also supported.
> 
> This new sysfs interface would help mkfs.btrfs to do more accurate
> warning.

I've reworded the changelog to reflect the code status, ie. just one
file and that the read-only support could do more than it's advertised
in the sysfs file and that this will change.

Re: [PATCH] btrfs: Remove useless call "zlib_inflateEnd"

2021-04-01 Thread David Sterba

On Tue, Mar 30, 2021 at 06:17:01PM +0800, Jiapeng Chong wrote:
> Fix the following whitescan warning:
> 
> Calling "zlib_inflateEnd(&workspace->strm)" is only useful for its
> return value, which is ignored.

According to the zlib API documentation in include/linux/zlib.h

301 extern int zlib_deflateEnd (z_streamp strm);
302 /*
303  All dynamically allocated data structures for this stream are freed.
304This function discards any unprocessed input and does not flush any
305pending output.
306
307  deflateEnd returns Z_OK if success, Z_STREAM_ERROR if the
308stream state was inconsistent, Z_DATA_ERROR if the stream was freed
309prematurely (some input or output was discarded). In the error case,
310msg may be set but then points to a static string (which must not be
311deallocated).
312 */

The first paragraph says it could free data, so the call needs to be
there. The return value could have some meaning as it could point out to
some inconsistency in zlib internal state but just deleting is IMO
wrong.

Re: [PATCH] btrfs: fix exhaustion of the system chunk array due to concurrent allocations

2021-04-01 Thread David Sterba

On Wed, Mar 31, 2021 at 11:55:50AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> When we are running out of space for updating the chunk tree, that is,
> when we are low on available space in the system space info, if we have
> many task concurrently allocating block groups, via fallocate for example,
> many of them can end up all allocating new system chunks when only one is
> needed. In extreme cases this can lead to exhaustion of the system chunk
> array, which has a size limit of 2048 bytes, and results in a transaction
> abort with errno -EFBIG, producing a trace in dmesg like the following,
> which was triggered on a PowerPC machine with a node/leaf size of 64K:
> 
> [ 1359.518899] [ cut here ]
> [ 1359.518980] BTRFS: Transaction aborted (error -27)
> [ 1359.519135] WARNING: CPU: 3 PID: 16463 at ../fs/btrfs/block-group.c:1968 
> btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
> [ 1359.519152] Modules linked in: (...)
> [ 1359.519239] Supported: Yes, External
> [ 1359.519252] CPU: 3 PID: 16463 Comm: stress-ng Tainted: G   X   
>  5.3.18-47-default #1 SLE15-SP3
> [ 1359.519274] NIP:  c00800e36fe8 LR: c00800e36fe4 CTR: 
> 006de8e8
> [ 1359.519293] REGS: c0056890b700 TRAP: 0700   Tainted: G   X 
> (5.3.18-47-default)
> [ 1359.519317] MSR:  8282b033   CR: 
> 48008222  XER: 0007
> [ 1359.519356] CFAR: c013e170 IRQMASK: 0
> [ 1359.519356] GPR00: c00800e36fe4 c0056890b990 c00800e83200 
> 0026
> [ 1359.519356] GPR04:   d52a3b027651 
> 0007
> [ 1359.519356] GPR08: 0003 0001 0007 
> 
> [ 1359.519356] GPR12: 8000 c0063fe44600 1015e028 
> 1015dfd0
> [ 1359.519356] GPR16: 404f 0001 0001 
> dd1e287a
> [ 1359.519356] GPR20: 0001 c00637c9a000 ffe5 
> 
> [ 1359.519356] GPR24: 0004  0100 
> ffc0
> [ 1359.519356] GPR28: c00637c9a000 c00630e09230 c00630e091d8 
> c00562188b08
> [ 1359.519561] NIP [c00800e36fe8] 
> btrfs_create_pending_block_groups+0x340/0x3c0 [btrfs]
> [ 1359.519613] LR [c00800e36fe4] 
> btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs]
> [ 1359.519626] Call Trace:
> [ 1359.519671] [c0056890b990] [c00800e36fe4] 
> btrfs_create_pending_block_groups+0x33c/0x3c0 [btrfs] (unreliable)
> [ 1359.519729] [c0056890ba90] [c00800d68d44] 
> __btrfs_end_transaction+0xbc/0x2f0 [btrfs]
> [ 1359.519782] [c0056890bae0] [c00800e309ac] 
> btrfs_alloc_data_chunk_ondemand+0x154/0x610 [btrfs]
> [ 1359.519844] [c0056890bba0] [c00800d8a0fc] 
> btrfs_fallocate+0xe4/0x10e0 [btrfs]
> [ 1359.519891] [c0056890bd00] [c04a23b4] vfs_fallocate+0x174/0x350
> [ 1359.519929] [c0056890bd50] [c04a3cf8] ksys_fallocate+0x68/0xf0
> [ 1359.519957] [c0056890bda0] [c04a3da8] sys_fallocate+0x28/0x40
> [ 1359.519988] [c0056890bdc0] [c0038968] 
> system_call_exception+0xe8/0x170
> [ 1359.520021] [c0056890be20] [c000cb70] 
> system_call_common+0xf0/0x278
> [ 1359.520037] Instruction dump:
> [ 1359.520049] 7d0049ad 40c2fff4 7c0004ac 71490004 40820024 2f83fffb 419e0048 
> 3c62
> [ 1359.520082] e863bcb8 7ec4b378 48010d91 e8410018 <0fe0> 3c82 
> e884bcc8 7ec6b378
> [ 1359.520122] ---[ end trace d6c186e151022e20 ]---
> 
> The following steps explain how we can end up in this situation:
> 
> 1) Task A is at check_system_chunk(), either because it is allocating a
>new data or metadata block group, at btrfs_chunk_alloc(), or because
>it is removing a block group or turning a block group RO. It does not
>matter why;
> 
> 2) Task A sees that there is not enough free space in the system
>space_info object, that is 'left' is < 'thresh'. And at this point
>the system space_info has a value of 0 for its 'bytes_may_use'
>counter;
> 
> 3) As a consequence task A calls btrfs_alloc_chunk() in order to allocate
>a new system block group (chunk) and then reserves 'thresh' bytes in
>the chunk block reserve with the call to btrfs_block_rsv_add(). This
>changes the chunk block reserve's 'reserved' and 'size' counters by an
>amount of 'thresh', and changes the 'bytes_may_use' counter of the
>system space_info object from 0 to 'thresh'.
> 
>Also during its call to btrfs_alloc_chunk(), we end up increasing the
>value of the 'total_bytes' counter of the system space_info object by
>8MiB (the size of a system chunk stripe). This happens through the
>call chain:
> 
>btrfs_alloc_chunk()
>create_chunk()
>btrfs_make_block_group()
>btrfs_update_space_info()
> 
> 4) After it finishes the first phase of the block group allocation, at
>btrfs_chunk_alloc(

Re: [PATCH] btrfs: improve btree readahead for full send operations

2021-04-01 Thread David Sterba

On Wed, Mar 31, 2021 at 11:56:21AM +0100, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> Currently a full send operation uses the standard btree readahead when
> iterating over the subvolume/snapshot btree, which despite bringing good
> performance benefits, it could be improved in a few aspects for use cases
> such as full send operations, which are guaranteed to visit every node
> and leaf of a btree, in ascending and sequential order. The limitations
> of that standard btree readahead implementation are the following:
> 
> 1) It only triggers readahead for for leaves that are physically close
>to the leaf being read, within a 64K range;
> 
> 2) It only triggers readahead for the next or previous leaves if the
>leaf being read is not currently in memory;
> 
> 3) It never triggers readahead for nodes.
> 
> So add a new readahead mode that addresses all these points and use it
> for full send operations.
> 
> The following test script was used to measure the improvement on a box
> using an average, consumer grade, spinning disk and with 16Gb of ram:
> 
>   $ cat test.sh
>   #!/bin/bash
> 
>   DEV=/dev/sdj
>   MNT=/mnt/sdj
>   MKFS_OPTIONS="--nodesize 16384" # default, just to be explicit
>   MOUNT_OPTIONS="-o max_inline=2048"  # default, just to be explicit
> 
>   mkfs.btrfs -f $MKFS_OPTIONS $DEV > /dev/null
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   # Create files with inline data to make it easier and faster to create
>   # large btrees.
>   add_files()
>   {
>   local total=$1
>   local start_offset=$2
>   local number_jobs=$3
>   local total_per_job=$(($total / $number_jobs))
> 
>   echo "Creating $total new files using $number_jobs jobs"
>   for ((n = 0; n < $number_jobs; n++)); do
>   (
>   local start_num=$(($start_offset + $n * $total_per_job))
>   for ((i = 1; i <= $total_per_job; i++)); do
>   local file_num=$((start_num + $i))
>   local file_path="$MNT/file_${file_num}"
>   xfs_io -f -c "pwrite -S 0xab 0 2000" $file_path > /dev/null
>   if [ $? -ne 0 ]; then
>   echo "Failed creating file $file_path"
>   break
>   fi
>   done
>   ) &
>   worker_pids[$n]=$!
>   done
> 
>   wait ${worker_pids[@]}
> 
>   sync
>   echo
>   echo "btree node/leaf count: $(btrfs inspect-internal dump-tree -t 5 
> $DEV | egrep '^(node|leaf) ' | wc -l)"
>   }
> 
>   initial_file_count=50
>   add_files $initial_file_count 0 4
> 
>   echo
>   echo "Creating first snapshot..."
>   btrfs subvolume snapshot -r $MNT $MNT/snap1
> 
>   echo
>   echo "Adding more files..."
>   add_files $((initial_file_count / 4)) $initial_file_count 4
> 
>   echo
>   echo "Updating 1/50th of the initial files..."
>   for ((i = 1; i < $initial_file_count; i += 50)); do
>   xfs_io -c "pwrite -S 0xcd 0 20" $MNT/file_$i > /dev/null
>   done
> 
>   echo
>   echo "Creating second snapshot..."
>   btrfs subvolume snapshot -r $MNT $MNT/snap2
> 
>   umount $MNT
> 
>   echo 3 > /proc/sys/vm/drop_caches
>   blockdev --flushbufs $DEV &> /dev/null
>   hdparm -F $DEV &> /dev/null
> 
>   mount $MOUNT_OPTIONS $DEV $MNT
> 
>   echo
>   echo "Testing full send..."
>   start=$(date +%s)
>   btrfs send $MNT/snap1 > /dev/null
>   end=$(date +%s)
>   echo
>   echo "Full send took $((end - start)) seconds"
> 
>   umount $MNT
> 
> The durations of the full send operation in seconds were the following:
> 
> Before this change:  217 seconds
> After this change:   205 seconds (-5.7%)
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.

Re: [PATCH v3 00/13] btrfs: support read-write for subpage metadata

2021-03-29 Thread David Sterba

On Thu, Mar 25, 2021 at 03:14:32PM +0800, Qu Wenruo wrote:
> v3:
> - Rename the sysfs to supported_sectorsizes
> 
> - Rebased to latest misc-next branch
>   This removes 2 cleanup patches.
> 
> - Add new overview comment for subpage metadata

V3 is now in for-next, targeting merge for 5.13. Please post any fixups
as replies to the individual patches, I'll fold them in, rather a full
series resend. Thanks.

Re: [PATCH] btrfs: make reflinks respect O_SYNC O_DSYNC and S_SYNC flags

2021-03-29 Thread David Sterba

On Tue, Mar 23, 2021 at 06:39:49PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> If we reflink to or from a file opened with O_SYNC/O_DSYNC or to/from a
> file that has the S_SYNC attribute set, we totally ignore that and do not
> durably persist the reflink changes. Since a reflink can change the data
> readable from a file (and mtime/ctime, or a file size), it makes sense to
> durably persist (fsync) the source and destination files/ranges.
> 
> This was previously discussed at:
> 
> https://lore.kernel.org/linux-btrfs/20200903035225.GJ6090@magnolia/
> 
> The recently introduced test case generic/628, from fstests, exercises
> these scenarios and currently fails without this change.
> 
> So make sure we fsync the source and destination files/ranges when either
> of them was opened with O_SYNC/O_DSYNC or has the S_SYNC attribute set,
> just like XFS already does.
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.

Re: [PATCH] [v2] btrfs: zoned: bail out in btrfs_alloc_chunk for bad input

2021-03-29 Thread David Sterba

On Tue, Mar 23, 2021 at 03:31:19PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> gcc complains that the ctl->max_chunk_size member might be used
> uninitialized when none of the three conditions for initializing it in
> init_alloc_chunk_ctl_policy_zoned() are true:
> 
> In function ‘init_alloc_chunk_ctl_policy_zoned’,
> inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
> inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
> include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used 
> uninitialized [-Werror=maybe-uninitialized]
>  4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
>   |   ^~~
> fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
> fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
>  5316 | struct alloc_chunk_ctl ctl;
>   |^~~
> 
> If we ever get into this condition, something is seriously
> wrong, so the same logic as in init_alloc_chunk_ctl_policy_regular()
> and a few other places should be applied. This avoids both further
> data corruption, and the compile-time warning.
> 
> Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator")
> Link: https://lore.kernel.org/lkml/20210323132343.gf7...@twin.jikos.cz/
> Suggested-by: David Sterba 
> Signed-off-by: Arnd Bergmann 

Added to misc-next, thanks.

Re: [PATCH v2] btrfs: fix a potential hole-punching failure

2021-03-29 Thread David Sterba

On Thu, Mar 25, 2021 at 09:56:22AM +0800, bingjingc wrote:
> From: BingJing Chang 
> 
> In commit d77815461f04 ("btrfs: Avoid trucating page or punching hole
> in a already existed hole."), existed holes can be skipped by calling
> find_first_non_hole() to adjust *start and *len. However, if the given
> len is invalid and large, when an EXTENT_MAP_HOLE extent is found, the
> *len will not be set to zero because (em->start + em->len) is less than
> (*start + *len). Then the ret will be 1 but the *len will not be set to
> 0. The propagated non-zero ret will result in fallocate failure.
> 
> In the while-loop of btrfs_replace_file_extents(), len is not updated
> every time before it calls find_first_non_hole(). That is, after
> btrfs_drop_extents() successfully drops the last non-hole file extent,
> it may fail with -ENOSPC when attempting to drop a file extent item
> representing a hole. The problem can happen. After it calls
> find_first_non_hole(), the cur_offset will be adjusted to be larger
> than or equal to end. However, since the len is not set to zero. The
> break-loop condition (ret && !len) will not meet. After it leaves the
> while-loop, fallocate will return 1, which is an unexpected return
> value.
> 
> We're not able to construct a reproducible way to let
> btrfs_drop_extents() fail with -ENOSPC after it drops the last non-hole
> file extent but with remaining holes left. However, it's quite easy to
> fix. We just need to update and check the len every time before we call
> find_first_non_hole(). To make the while loop more readable, we also
> pull the variable updates to the bottom of loop like this:
> while (cur_offset < end) {
> ...
> // update cur_offset & len
> // advance cur_offset & len in hole-punching case if needed
> }
> 
> Reported-by: Robbie Ko 
> Fixes: d77815461f04 ("btrfs: Avoid trucating page or punching hole in a
> already existed hole.")
> Reviewed-by: Robbie Ko 
> Reviewed-by: Chung-Chiang Cheng 
> Signed-off-by: BingJing Chang 

Thanks, added to misc-next.

Re: [PATCH] btrfs: zoned: move log tree node allocation out of log_root_tree->log_mutex

2021-03-29 Thread David Sterba

On Wed, Mar 24, 2021 at 11:23:11PM +0900, Naohiro Aota wrote:
> Commit 6e37d2459941 ("btrfs: zoned: fix deadlock on log sync") pointed out
> a deadlock warning and removed
> mutex_{lock,unlock}(&fs_info->tree_root->log_mutex). While it looks like it
> always cause a deadlock, we didn't see actual deadlock in fstests runs. The
> reason is log_root_tree->log_mutex != fs_info->tree_root->log_mutex, not
> taking the same lock. So, the warning was actually a false-positive.
> 
> Since btrfs_alloc_log_tree_node() is protected only by
> fs_info->tree_root->log_mutex, we can (and should) move the code out of the
> lock scope of log_root_tree->log_mutex and silence the warning.
> 
> Cc: Filipe Manana 
> Cc: Johannes Thumshirn 
> Signed-off-by: Naohiro Aota 

Added to misc-next, thanks.

Re: [PATCH] btrfs: use percpu_read_positive instead of sum_positive for need_preempt

2021-03-29 Thread David Sterba

On Wed, Mar 24, 2021 at 09:44:21AM -0400, Josef Bacik wrote:
> Looking at perf data for a fio workload I noticed that we were spending
> a pretty large chunk of time (around 5%) doing percpu_counter_sum() in
> need_preemptive_reclaim.  This is silly, as we only want to know if we
> have more ordered than delalloc to see if we should be counting the
> delayed items in our threshold calculation.  Change this to
> percpu_read_positive() to avoid the overhead.
> 
> I ran this through fsperf to validate the changes, obviously the latency
> numbers in dbench and fio are quite jittery, so take them as you wish,
> but overall the improvements on throughput, iops, and bw are all
> positive.  Each test was run two times, the given value is the average
> of both runs for their respective column.
> 
> btrfs ssd normal test results
> 
> bufferedrandwrite16g results
>  metric baseline   current  diff
> ==
> write_io_kbytes 16777216   16777216 0.00%
> read_clat_ns_p99   0  0 0.00%
> write_bw_bytes  1.04e+08   1.05e+08 1.12%
> read_iops  0  0 0.00%
> write_clat_ns_p50  13888  11840   -14.75%
> read_io_kbytes 0  0 0.00%
> read_io_bytes  0  0 0.00%
> write_clat_ns_p99  35008  29312   -16.27%
> read_bw_bytes  0  0 0.00%
> elapsed  170167-1.76%
> write_lat_ns_min 4221.503762.50   -10.87%
> sys_cpu39.65  35.37   -10.79%
> write_lat_ns_max2.67e+10   2.50e+10-6.63%
> read_lat_ns_min0  0 0.00%
> write_iops  25270.10   25553.43 1.12%
> read_lat_ns_max0  0 0.00%
> read_clat_ns_p50   0  0 0.00%
> 
> dbench60 results
>   metric baseline   current diff
> ==
> qpathinfo   11.12 12.7314.52%
> throughput 416.09445.66 7.11%
> flush 3485.63   1887.55   -45.85%
> qfileinfo0.70  1.92   173.86%
> ntcreatex  992.60695.76   -29.91%
> qfsinfo  2.43  3.7152.48%
> close1.67  3.1488.09%
> sfileinfo   66.54105.2058.10%
> rename 809.23619.59   -23.43%
> find16.88 15.46-8.41%
> unlink 820.54670.86   -18.24%
> writex3375.20   2637.91   -21.84%
> deltree386.33449.9816.48%
> readx3.43  3.41-0.60%
> mkdir0.05  0.03   -38.46%
> lockx0.26  0.26-0.76%
> unlockx  0.81  0.32   -60.33%
> 
> dio4kbs16threads results
>  metric  baseline   current   diff
> 
> write_io_kbytes 5249676   3357150   -36.05%
> read_clat_ns_p99  0 0 0.00%
> write_bw_bytes  89583501.50   57291192.50   -36.05%
> read_iops 0 0 0.00%
> write_clat_ns_p50242688263680 8.65%
> read_io_kbytes0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99  15826944  36732928   132.09%
> read_bw_bytes 0 0 0.00%
> elapsed  6161 0.00%
> write_lat_ns_min  42704 42095-1.43%
> sys_cpu5.27  3.45   -34.52%
> write_lat_ns_max   7.43e+08  9.27e+0824.71%
> read_lat_ns_min   0 0 0.00%
> write_iops 21870.97  13987.11   -36.05%
> read_lat_ns_max   0 0 0.00%
> read_clat_ns_p50  0 0 0.00%
> 
> randwrite2xram results
>  metric  baseline   current   diff
> 
> write_io_kbytes24831972  2887626216.29%
> read_clat_ns_p99  0 0 0.00%
> write_bw_bytes  83745273.50   92182192.5010.07%
> read_iops 0 0 0.00%
> write_clat_ns_p50 13952 11648   -16.51%
> read_io_kbytes0 0 0.00%
> read_io_bytes 0 0 0.00%
> write_clat_ns_p99 50176 52992 5.61%
> read_bw_bytes 0 0 0.00%
> elapsed 314   332 5.73%
> write_lat_ns_min5920.50  5127   -13.40%
> sys_cpu7.82  7.35-6.07%
> write_lat_ns_max   5.27e+10  3.88e+10   -26.44%
> read_lat_ns_min   0 0 0.00%
> write_iops 20445.62  22505.4210.07%
> read_lat_ns_max   0 0 0.00%
> read_clat_ns_p50  0

Re: [PATCH v3 01/13] btrfs: add sysfs interface for supported sectorsize

2021-03-29 Thread David Sterba

On Thu, Mar 25, 2021 at 10:41:43PM +0800, Anand Jain wrote:
> On 25/03/2021 15:14, Qu Wenruo wrote:
> > +static ssize_t supported_sectorsizes_show(struct kobject *kobj,
> > + struct kobj_attribute *a,
> > + char *buf)
> > +{
> > +   ssize_t ret = 0;
> > +
> > +   /* Only support sectorsize == PAGE_SIZE yet */
> > +   ret += scnprintf(buf + ret, PAGE_SIZE - ret, "%lu\n",
> > +PAGE_SIZE);
> > +   return ret;
> > +}
> 
>ret can be removed completely here.

You mean to do 'return scnprintf(...)' ? For now it's just a single
value returned but with further support there will be a pattern like is
eg. in supported_checksums_show, so it's ok as a preparatory work.

Re: [PATCH] btrfs: update outdated comment at btrfs_replace_file_extents()

2021-03-29 Thread David Sterba

On Fri, Mar 26, 2021 at 01:14:41PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> There is a comment at btrfs_replace_file_extents() that mentions that we
> set the full sync flag on an inode when cloning into a file with a size
> greater than or equals to 16MiB, through try_release_extent_mapping() when
> we truncate the page cache after replacing file extents during a clone
> operation.
> 
> That is not true anymore since commit 5e548b32018d96 ("btrfs: do not set
> the full sync flag on the inode during page release"), so update the
> comment to remove that part and rephrase it slightly to make it more
> clear why the full sync flag is set at btrfs_replace_file_extents().
> 
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.

Re: [PATCH] btrfs: Fix a typo

2021-03-26 Thread David Sterba

On Fri, Mar 26, 2021 at 06:29:32AM +0530, Bhaskar Chowdhury wrote:
> 
> s/reponsible/responsible/

So this is exactly what I don't want to see happen - one patch per typo.
I tried to explain it in the previous patch, please either fix all typos
under fs/btrfs or don't bother.

Re: [PATCH] btrfs: fixed rudimentary typos

2021-03-25 Thread David Sterba

On Thu, Mar 25, 2021 at 11:40:04PM +0530, Bhaskar Chowdhury wrote:
> On 13:49 Thu 25 Mar 2021, David Sterba wrote:
> >On Thu, Mar 25, 2021 at 09:51:13AM +0530, Bhaskar Chowdhury wrote:
> >>
> >> s/contaning/containing
> >> s/clearning/clearing/
> >
> >Have hou scanned the whole subdirectory for typos? We do typo fixing
> >about once a year in one big patch and won't fix them one by one.
> 
> Once a year You must be kidding! that is not good whatever the workflow
> you have .

No kidding. It's even worse, we get that every two years.

* 2016 0132761017e012ab4dc8584d679503f2ba26ca86
  33 files changed, 106 insertions(+), 105 deletions(-)

* 2018 52042d8e82ff50d40e76a275ac0b97aa663328b0
  25 files changed, 70 insertions(+), 69 deletions(-)

You can see the diffstat touches nearly all the files, almost hundred of
fixed typos per patch. Now compare that to sending 70-100 individual
patches. Time spent on any patch is not zero and for such trivial
changes it's not justified so the workflow is to do that in batches.
If you care about fixing typos in fs/btrfs/, please fix them all. I've
found about 50.

[GIT PULL] Btrfs fixes for 5.12-rc5

2021-03-25 Thread David Sterba

From: David Sterba 

Hi,

there are few fixes for issues that have some user visibility and are
simple enough for this time of development cycle.

Please pull thanks.

- a few fixes for rescue= mount option, adding more checks for missing
  trees

- fix sleeping in atomic context on qgroup deletion

- fix subvolume deletion on mount

- fix build with M= syntax

- fix checksum mismatch error message for direct io


The following changes since commit 485df75554257e883d0ce39bb886e8212349748e:

  btrfs: always pin deleted leaves when there are active tree mod log users 
(2021-03-16 20:32:22 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-5.12-rc4-tag

for you to fetch changes up to c1d6abdac46ca8127274bea195d804e3f2cec7ee:

  btrfs: fix check_data_csum() error message for direct I/O (2021-03-18 
21:25:11 +0100)


David Sterba (1):
  btrfs: fix build when using M=fs/btrfs

Filipe Manana (2):
  btrfs: fix subvolume/snapshot deletion not triggered on mount
  btrfs: fix sleep while in non-sleep context during qgroup removal

Johannes Thumshirn (1):
  btrfs: zoned: remove outdated WARN_ON in direct IO

Josef Bacik (3):
  btrfs: do not initialize dev stats if we have no dev_root
  btrfs: initialize device::fs_info always
  btrfs: do not initialize dev replace for bad dev root

Omar Sandoval (1):
  btrfs: fix check_data_csum() error message for direct I/O

 fs/btrfs/Makefile  | 10 ++
 fs/btrfs/dev-replace.c |  3 +++
 fs/btrfs/disk-io.c | 19 +--
 fs/btrfs/inode.c   | 18 +-
 fs/btrfs/qgroup.c  | 12 ++--
 fs/btrfs/volumes.c |  3 +++
 6 files changed, 48 insertions(+), 17 deletions(-)

Re: [PATCH] btrfs: fixed rudimentary typos

2021-03-25 Thread David Sterba

On Thu, Mar 25, 2021 at 09:51:13AM +0530, Bhaskar Chowdhury wrote:
> 
> s/contaning/containing
> s/clearning/clearing/

Have hou scanned the whole subdirectory for typos? We do typo fixing
about once a year in one big patch and won't fix them one by one.

Re: [PATCH AUTOSEL 5.11 26/44] btrfs: track qgroup released data in own variable in insert_prealloc_file_extent

2021-03-25 Thread David Sterba

On Thu, Mar 25, 2021 at 07:24:41AM -0400, Sasha Levin wrote:
> From: Qu Wenruo 
> 
> [ Upstream commit fbf48bb0b197e6894a04c714728c952af7153bf3 ]
> 
> There is a piece of weird code in insert_prealloc_file_extent(), which
> looks like:
> 
>   ret = btrfs_qgroup_release_data(inode, file_offset, len);
>   if (ret < 0)
>   return ERR_PTR(ret);
>   if (trans) {
>   ret = insert_reserved_file_extent(trans, inode,
> file_offset, &stack_fi,
> true, ret);
>   ...
>   }
>   extent_info.is_new_extent = true;
>   extent_info.qgroup_reserved = ret;
>   ...
> 
> Note how the variable @ret is abused here, and if anyone is adding code
> just after btrfs_qgroup_release_data() call, it's super easy to
> overwrite the @ret and cause tons of qgroup related bugs.
> 
> Fix such abuse by introducing new variable @qgroup_released, so that we
> won't reuse the existing variable @ret.
> 
> Signed-off-by: Qu Wenruo 
> Reviewed-by: David Sterba 
> Signed-off-by: David Sterba 
> Signed-off-by: Sasha Levin 

This patch is a preparatory work and does not make sense for backport
standalone. Either this one plus
https://lore.kernel.org/linux-btrfs/20210303104152.105877-2-...@suse.com/
or neither. And IIRC it does not apply directly and needs some
additional review before it can be backported to older code base, so it
has no CC: stable tags.

Btrfs progs release 5.11.1

2021-03-24 Thread David Sterba

Hi,

btrfs-progs version 5.11.1 have been released. This is a bugfix release.

Changelog:

  * properly format checksums when a mismatch is reported
  * check: fix false alert on tree block crossing 64K page boundary
  * convert:
* refuse to convert filesystem with 'needs_recovery'
* update documentation to require fsck before conversion
  * balance convert: fix raid56 warning when converting other profiles
  * fi resize: improved summary
  * other
* build: fix checks and autoconf defines
* fix symlink paths for CI support scripts
* updated tests

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

David Sterba (10):
  btrfs-progs: ci: fix symlink paths of helper scripts
  btrfs-progs: tests: extend mkfs/001 to look for warnings after mkfs
  btrfs-progs: mkfs: enumerate all supported checksum algorithms in help 
text
  btrfs-progs: convert: refuse to convert filesystem with 'needs_recovery' 
set
  btrfs-progs: tests: add crafted image to test needs_recovery
  btrfs-progs: docs: change wording about required fsck
  btrfs-progs: balance: fix raid56 warning for other profiles
  btrfs-progs: tests: verify that ext4 supports extra_isize
  btrfs-progs: update CHANGES for 5.11.1
  Btrfs progs v5.11.1

Dāvis Mosāns (1):
  btrfs-progs: fix checksum output for "checksum verify failed"

Heiko Becker (1):
  btrfs-progs: build: Use PKG_CONFIG instead of pkg-config

Jaak Ristioja (1):
  btrfs-progs: check: fix typos in code comment and 1 typo in warning

Pierre Labastie (3):
  btrfs-progs: tests: fix quoting of sudo helper in 
misc/046-seed-multi-mount
  btrfs-progs: build: fix the test for EXT4_EPOCH_MASK
  btrfs-progs: build: do not use AC_DEFINE twice for a FIEMAP define

Qu Wenruo (1):
  btrfs-progs: fix false alert on tree block crossing 64K page boundary

Sidong Yang (1):
  btrfs-progs: fi resize: make output more readable

Re: [PATCH 0/1] btrfs-progs: build system - do not use AC_DEFINE twice

2021-03-23 Thread David Sterba

On Sat, Mar 20, 2021 at 10:27:27AM +0100, pierre.labas...@neuf.fr wrote:
> Note that I doubt this check is needed in configure:
> HAVE_OWN_FIEMAP_EXTENT_DEFINE is used only once in cmds/filesystem-du.c
> in:
> #if !defined(FIEMAP_EXTENT_SHARED) && (HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE 
> == 1)
> but HAVE_OWN_FIEMAP_EXTENT_SHARED_DEFINE is set to 1 by configure only if
> FIEMAP_EXTENT_SHARED is not defined in the kernel headers.
> 
> If you agree, I'll send a patch to completely remove this check.

The explanation why we want the configure-time check is in
https://github.com/kdave/btrfs-progs/commit/cf8fd1a70884db0b31e312d07806
ie. to support old distros. There's a runtime check for version in case
the fiemap flag is not supported in cmd_filesystem_du.

Re: [PATCH 1/1] btrfs-progs: build system - do not use AC_DEFINE twice

2021-03-23 Thread David Sterba

On Sat, Mar 20, 2021 at 10:27:28AM +0100, pierre.labas...@neuf.fr wrote:
> From: Pierre Labastie 
> 
> Autoheader uses the AC_DEFINE macros (and a few others) to populate
> the config.h.in file. The autotools documentation does not tell
> what happens if AC_DEFINE is used twice for the same identifier.
> 
> This patch prevents using AC_DEFINE twice for
> HAVE_OWN_FIEMAP_EXTENT_DEFINE, preserving the logic (using the
> fact that an undefined identifier in a preprocessor directive is
> taken as zero).
> 
> Signed-off-by: Pierre Labastie 

Added to devel, thanks.

Re: [PATCH 0/1] btrfs-progs: libbtrfsutil: Relicense to LGPLv2.1+

2021-03-23 Thread David Sterba

On Wed, Mar 17, 2021 at 04:01:43PM -0400, Neal Gompa wrote:
> This is a patch requesting all substantial copyright owners to sign off
> on changing the license of the libbtrfsutil code to LGPLv2.1+ in order
> to resolve various concerns around the mixture of code in btrfs-progs
> with libbtrfsutil.
> 
> Each significant (i.e. non-trivial) commit author has been CC'd to
> request their sign-off on this. Please reply to this to acknowledge
> whether or not this is acceptable for your code.

Thanks! I think we have acks for all non-trivial contirbutions. For the
record, the trivial one are:

* dbf60b488e3b ("libbtrfsutil: update btrfs_util_delete_subvolume docs")
  a comment update, clarifying usage

* 2e67bf0ed69d ("btrfs-progs: docs: fix typos in READMEs, INSTALL and * 
CHANGES")
* b1d39a42a4ef ("btrfs-progs: fix typos in comments")
  treewide comment typo fixes

* 01e35d9f53eb ("btrfs-progs: treewide: Fix missing declarations")
  code changes, but adding static or missing includes

* 9fd37f951be6 ("btrfs-progs: complete the implementation RAID1C34 definitions")
  copied definitions from kernel code

I'm not sure about the commit adding pkg-config spec file, it's not code
but it's beyond what I'd consider trivial. I've added Sheng Mao to CC,
please acknowledge the change.

* 4498fe1a2a7c ("libbtrfsutil: add pkg-config spec file")

I'll update the changelog with all the reasons for relicensing that have
been brought up.

Re: [PATCH] btrfs-progs: qgroup: remove outdated comment

2021-03-23 Thread David Sterba

On Mon, Mar 22, 2021 at 02:03:16PM +, Sidong Yang wrote:
> This comment is outdated and this patch remove
> it to avoid confusion.

Even if it's just a comment removal, changelog should say why it's
outated, I have no idea and would have to do the research myself.

Re: [PATCH] btrfs: zoned: fix uninitialized max_chunk_size

2021-03-23 Thread David Sterba

On Tue, Mar 23, 2021 at 01:46:19PM +0100, Arnd Bergmann wrote:
> From: Arnd Bergmann 
> 
> The ctl->max_chunk_size member might be used uninitialized
> when none of the three conditions for initializing it in
> init_alloc_chunk_ctl_policy_zoned() are true:
> 
> In function ‘init_alloc_chunk_ctl_policy_zoned’,
> inlined from ‘init_alloc_chunk_ctl’ at fs/btrfs/volumes.c:5023:3,
> inlined from ‘btrfs_alloc_chunk’ at fs/btrfs/volumes.c:5340:2:
> include/linux/compiler-gcc.h:48:45: error: ‘ctl.max_chunk_size’ may be used 
> uninitialized [-Werror=maybe-uninitialized]
>  4998 | ctl->max_chunk_size = min(limit, ctl->max_chunk_size);
>   |   ^~~
> fs/btrfs/volumes.c: In function ‘btrfs_alloc_chunk’:
> fs/btrfs/volumes.c:5316:32: note: ‘ctl’ declared here
>  5316 | struct alloc_chunk_ctl ctl;
>   |^~~
> 
> Initialize it to UINT_MAX and rely on the min() expression to limit
> it.
> 
> Fixes: 1cd6121f2a38 ("btrfs: zoned: implement zoned chunk allocator")
> Signed-off-by: Arnd Bergmann 
> ---
> Note that the -Wmaybe-unintialized warning is globally disabled
> by default. For some reason I got this warning anyway when building
> this specific file with gcc-11.

The warning catches a theoretical case but this would not happen in
pracitce.  There are three bits to check and that covers all valid
options, but there should be a final else {} like is in
init_alloc_chunk_ctl_policy_regular that does not let the function
continue as that would mean there are worse problems.

btrfs_alloc_chunk
  init_alloc_chunk_ctl
init_alloc_chunk_ctl_policy_zoned

and btrfs_alloc_chunk validates the ctl->flags against
BTRFS_BLOCK_GROUP_TYPE_MASK, which is exactly the tree branches.

> ---
>  fs/btrfs/volumes.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index bc3b33efddc5..b42b423b6a10 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -4980,6 +4980,7 @@ static void init_alloc_chunk_ctl_policy_zoned(
>   u64 type = ctl->type;
>  
>   ctl->max_stripe_size = zone_size;
> + ctl->max_chunk_size = UINT_MAX;

This would allow the min() work but otherwise is not an expected to
happen at all.

>   if (type & BTRFS_BLOCK_GROUP_DATA) {
>   ctl->max_chunk_size = round_down(BTRFS_MAX_DATA_CHUNK_SIZE,
>zone_size);
> -- 
> 2.29.2

Re: APFS and BTRFS

2021-03-22 Thread David Sterba

On Sat, Mar 20, 2021 at 11:22:18AM -0400, Forrest Aldrich wrote:
> I have a really large (3+ TB) volume I am copying to a BTRFS volume 
> (both over USB) which is painfully slow.  In fact, it will probably take 
> days to complete.   My goal is to use BTRFS on that larger USB volume 
> (it's an 18TB drive)
> 
> I don't suppose there is a way to convert APFS to BTRFS :)   I know, but 
> I thought I would ask as it seems like others may have a similar query.

Convert relies on libraries providing the API to access the source
filesystems, I don't see anything like that for APFS so it would have to
be written from scratch. The linux-apfs project does not seem to be
active, last update 2 years ago. If it's just days of copying, I think
it's the better and safer option.

Re: [PATCH v2] btrfs-progs: common: make sure that qgroup id is in range

2021-03-18 Thread David Sterba

On Thu, Mar 18, 2021 at 02:22:20AM +, Sidong Yang wrote:
> On Wed, Mar 17, 2021 at 07:36:47PM +0100, David Sterba wrote:
> > On Tue, Mar 16, 2021 at 01:27:46PM +, Sidong Yang wrote:
> > > When user assign qgroup with qgroup id that is too big to exceeds
> > > range and invade level value, and it works without any error. but
> > > this action would be make undefined error. this code make sure that
> > > qgroup id doesn't exceed range(0 ~ 2^48-1).
> > 
> > Should the level be also validate? The function parse_qgroupid does not
> > do full validation, so eg 0//0 would be parsed as a path and not as a
> > typo, level larger than 64K will be silently clamped.
> 
> I agree. 0//0 would be parsed as path but it failed in
> btrfs_util_is_subvolume() and goes to err. I understand that upper 16
> bits of qgroupid is for level. so, The valid llevel range is [0~2^16-1].
> But I can't get it that level larger than 64K will be clampled. 

The way the level gets stored into the final qgroup id is level << 48.
For example invalid values 7/281474976710779 would be stored
as subvol id 123 and level 4465, where the u64 is 0x1171007b

> one more question about that, I see that the ioctl calls just store the
> qgroupid without any opeartion with level. is the level meaningless in
> kernel?

The quota groups are hierarchical and the level denotes the level, where
0/subvolid is the lowest always attached to a subvolume and the higher
levels are artificial and may contain any qgroups and do the whole
accounting on the subtree. The original design doc .pdf can be found in
https://git.kernel.org/pub/scm/linux/kernel/git/arne/qgroups-doc.git/tree/

Re: [PATCH v8 04/10] btrfs: fix check_data_csum() error message for direct I/O

2021-03-18 Thread David Sterba

On Thu, Mar 18, 2021 at 07:47:29AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/18 上午2:33, Omar Sandoval wrote:
> > On Wed, Mar 17, 2021 at 07:21:46PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2021/3/17 上午3:43, Omar Sandoval wrote:
> >>> From: Omar Sandoval 
> >>>
> >>> Commit 1dae796aabf6 ("btrfs: inode: sink parameter start and len to
> >>> check_data_csum()") replaced the start parameter to check_data_csum()
> >>> with page_offset(), but page_offset() is not meaningful for direct I/O
> >>> pages. Bring back the start parameter.
> >>
> >> So direct IO pages doesn't have page::index set at all?
> >
> > No, they don't. Usually you do direct I/O into an anonymous page, but I
> > suppose you could even do direct I/O into a page mmap'd from another
> > file or filesystem. In either case, the index isn't meaningful for the
> > file you're doing direct I/O from.
> >
> >> Any reproducer? I'd like to try to reproduce it first.
> >
> > The easiest way to see this issue is to apply this patch:
> >
> > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> > index 2a92211439e8..a962b3026573 100644
> > --- a/fs/btrfs/inode.c
> > +++ b/fs/btrfs/inode.c
> > @@ -3114,6 +3114,9 @@ static int check_data_csum(struct inode *inode, 
> > struct btrfs_io_bio *io_bio,
> > u8 *csum_expected;
> > u8 csum[BTRFS_CSUM_SIZE];
> >
> > +   WARN_ONCE(page_offset(page) + pgoff != start,
> > + "page offset %llu != start %llu\n",
> > + page_offset(page) + pgoff, start);
> > ASSERT(pgoff + len <= PAGE_SIZE);
> >
> > offset_sectors = bio_offset >> fs_info->sectorsize_bits;
> >
> > Run this simple test:
> >
> > $ dd if=/dev/zero of=foo bs=4k count=1024
> > 1024+0 records in
> > 1024+0 records out
> > 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.00456495 s, 919 MB/s
> > $ sync
> > $ dd if=foo of=/dev/null iflag=direct bs=4k
> > 1024+0 records in
> > 1024+0 records out
> > 4194304 bytes (4.2 MB, 4.0 MiB) copied, 0.163079 s, 25.7 MB/s
> >
> > And you'll get a warning like:
> >
> > [   84.896486] [ cut here ]
> > [   84.897370] page offset 94199157981184 != start 0
> > [   84.898128] WARNING: CPU: 1 PID: 459 at fs/btrfs/inode.c:3119 
> > check_data_csum+0x189/0x260 [btrfs]
> > [   84.899547] Modules linked in: btrfs blake2b_generic xor pata_acpi 
> > ata_piix libata scsi_mod raid6_pq virtio_net net_failover virtio_rng 
> > libcrc32c rng_core failover
> > [   84.901742] CPU: 1 PID: 459 Comm: kworker/u56:2 Not tainted 
> > 5.12.0-rc3-00060-ge0cd3910d8cb-dirty #139
> > [   84.903205] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
> > [   84.904875] Workqueue: btrfs-endio btrfs_work_helper [btrfs]
> > [   84.905749] RIP: 0010:check_data_csum+0x189/0x260 [btrfs]
> > [   84.906576] Code: 57 11 00 00 0f 85 03 ff ff ff 4c 89 ca 48 c7 c7 50 ba 
> > 35 c0 4c 89 44 24 10 48 89 44 24 08 c6 05 04 57 11 00 01 e8 22 e0 cf d4 
> > <0f> 0b 4c 8b 44 24 10 48 8b 44 24 08 e9 d2 fe ff ff 41 8b 45 00 4d
> > [   84.909288] RSP: 0018:b6e9c164bb98 EFLAGS: 00010282
> > [   84.910061] RAX:  RBX: e96b84a05f40 RCX: 
> > 0001
> > [   84.911109] RDX: 8001 RSI: 9573d067 RDI: 
> > 
> > [   84.912149] RBP:  R08:  R09: 
> > c000dfff
> > [   84.913197] R10: 0001 R11: b6e9c164b9c0 R12: 
> > 
> > [   84.914247] R13: 9d32a28c8dc0 R14: 9d32ac495e10 R15: 
> > 0004
> > [   84.915304] FS:  () GS:9d399f64() 
> > knlGS:
> > [   84.916478] CS:  0010 DS:  ES:  CR0: 80050033
> > [   84.917340] CR2: 55ad52f97120 CR3: 0001292f4002 CR4: 
> > 00370ee0
> > [   84.918435] DR0:  DR1:  DR2: 
> > 
> > [   84.919473] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [   84.920515] Call Trace:
> > [   84.920884]  ? find_busiest_group+0x41/0x380
> > [   84.921518]  ? load_balance+0x176/0xc10
> > [   84.922082]  ? kvm_sched_clock_read+0x5/0x10
> > [   84.922711]  ? sched_clock+0x5/0x10
> > [   84.923236]  btrfs_end_dio_bio+0x2fb/0x310 [btrfs]
> > [   84.923982]  end_workqueue_fn+0x29/0x40 [btrfs]
> > [   84.924698]  btrfs_work_helper+0xc1/0x350 [btrfs]
> > [   84.925435]  process_one_work+0x1c8/0x390
> > [   84.926025]  ? process_one_work+0x390/0x390
> > [   84.926650]  worker_thread+0x30/0x370
> > [   84.927209]  ? process_one_work+0x390/0x390
> > [   84.927875]  kthread+0x13d/0x160
> > [   84.928466]  ? kthread_park+0x80/0x80
> > [   84.929008]  ret_from_fork+0x22/0x30
> > [   84.929543] ---[ end trace 4f87c4a13fa476d4 ]---
> >
> >>>
> >>> Fixes: 265d4ac03fdf ("btrfs: sink parameter start and len to 
> >>> check_data_csum")
> >>> Signed-off-by: Omar Sandoval 
> >>> ---
> >>>fs/btrfs/inode.c | 14 +-
> >>>1 file changed, 9 insertions(+), 5 deletions(

[GIT PULL] Btrfs fixes for 5.12-rc4

2021-03-18 Thread David Sterba

From: David Sterba 

Hi,

there are still regressions being found and fixed in the zoned mode and
subpage code, the rest are fixes for bugs reported by users. Please
pull, thanks.

Regressions:

- subpage block support:
  - readahead works on the proper block size
  - fix last page zeroing

- zoned mode:
  - linked list corruption for tree log

Fixes:

- qgroup leak after falloc faiulre

- tree mod log and backref resolving
  - extent buffer cloning race when resolving backrefs
  - pin deleted leaves with active tree mod log users

- drop debugging flag from slab cache


The following changes since commit badae9c86979c459bd7d895d6d7ddc7a01131ff7:

  btrfs: zoned: do not account freed region of read-only block group as 
zone_unusable (2021-03-04 16:16:58 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-5.12-rc3-tag

for you to fetch changes up to 485df75554257e883d0ce39bb886e8212349748e:

  btrfs: always pin deleted leaves when there are active tree mod log users 
(2021-03-16 20:32:22 +0100)


David Sterba (1):
  btrfs: fix slab cache flags for free space tree bitmap

Filipe Manana (3):
  btrfs: zoned: fix linked list corruption after log root tree allocation 
failure
  btrfs: fix race when cloning extent buffer during rewind of an old root
  btrfs: always pin deleted leaves when there are active tree mod log users

Qu Wenruo (5):
  btrfs: fix wrong offset to zero out range beyond i_size
  btrfs: track qgroup released data in own variable in 
insert_prealloc_file_extent
  btrfs: fix qgroup data rsv leak caused by falloc failure
  btrfs: subpage: fix wild pointer access during metadata read failure
  btrfs: subpage: make readahead work properly

 fs/btrfs/ctree.c   |  2 ++
 fs/btrfs/extent-tree.c | 23 ++-
 fs/btrfs/extent_io.c   | 33 +++--
 fs/btrfs/inode.c   | 37 ++---
 fs/btrfs/reada.c   | 35 ++-
 fs/btrfs/tree-log.c|  8 
 6 files changed, 103 insertions(+), 35 deletions(-)

Re: [PATCH] btrfs: fix sleep while in non-sleep context during qgroup removal

2021-03-18 Thread David Sterba

On Thu, Mar 18, 2021 at 11:22:05AM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> While removing a qgroup's sysfs entry we end up taking the kernfs_mutex,
> through kobject_del(), while holding the fs_info->qgroup_lock spinlock,
> producing the following trace:
> 
> [  821.843637] BUG: sleeping function called from invalid context at 
> kernel/locking/mutex.c:281
> [  821.843641] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 28214, 
> name: podman
> [  821.843644] CPU: 3 PID: 28214 Comm: podman Tainted: GW 
> 5.11.6 #15
> [  821.843646] Hardware name: Dell Inc. PowerEdge R330/084XW4, BIOS 2.11.0 
> 12/08/2020
> [  821.843647] Call Trace:
> [  821.843650]  dump_stack+0xa1/0xfb
> [  821.843656]  ___might_sleep+0x144/0x160
> [  821.843659]  mutex_lock+0x17/0x40
> [  821.843662]  kernfs_remove_by_name_ns+0x1f/0x80
> [  821.843666]  sysfs_remove_group+0x7d/0xe0
> [  821.843668]  sysfs_remove_groups+0x28/0x40
> [  821.843670]  kobject_del+0x2a/0x80
> [  821.843672]  btrfs_sysfs_del_one_qgroup+0x2b/0x40 [btrfs]
> [  821.843685]  __del_qgroup_rb+0x12/0x150 [btrfs]
> [  821.843696]  btrfs_remove_qgroup+0x288/0x2a0 [btrfs]
> [  821.843707]  btrfs_ioctl+0x3129/0x36a0 [btrfs]
> [  821.843717]  ? __mod_lruvec_page_state+0x5e/0xb0
> [  821.843719]  ? page_add_new_anon_rmap+0xbc/0x150
> [  821.843723]  ? kfree+0x1b4/0x300
> [  821.843725]  ? mntput_no_expire+0x55/0x330
> [  821.843728]  __x64_sys_ioctl+0x5a/0xa0
> [  821.843731]  do_syscall_64+0x33/0x70
> [  821.843733]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  821.843736] RIP: 0033:0x4cd3fb
> [  821.843739] Code: fa ff eb bd e8 86 8b fa ff e9 61 ff ff ff cc e8 fb 55 fa 
> ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 
> 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
> [  821.843741] RSP: 002b:00c000906b20 EFLAGS: 0206 ORIG_RAX: 
> 0010
> [  821.843744] RAX: ffda RBX: 00c5 RCX: 
> 004cd3fb
> [  821.843745] RDX: 00c000906b98 RSI: 4010942a RDI: 
> 000f
> [  821.843747] RBP: 00c000907cd0 R08: 00c000622901 R09: 
> 
> [  821.843748] R10: 00c000d992c0 R11: 0206 R12: 
> 012d
> [  821.843749] R13: 012c R14: 0200 R15: 
> 0049
> 
> Fix this by removing the qgroup sysfs entry while not holding the spinlock,
> since the spinlock is only meant for protection of the qgroup rbtree.
> 
> Reported-by: Stuart Shelton 
> Link: 
> https://lore.kernel.org/linux-btrfs/7a5485bb-0628-419d-a4d3-27b1af47e...@gmail.com/
> Signed-off-by: Filipe Manana 

Added to misc-next, with the Fixes: tag, thanks.

Re: [PATCH 0/1] btrfs-progs: libbtrfsutil: Relicense to LGPLv2.1+

2021-03-18 Thread David Sterba

On Thu, Mar 18, 2021 at 08:32:30AM +0800, Qu Wenruo wrote:
> 
> 
> On 2021/3/18 上午4:01, Neal Gompa wrote:
> > This is a patch requesting all substantial copyright owners to sign off
> > on changing the license of the libbtrfsutil code to LGPLv2.1+ in order
> > to resolve various concerns around the mixture of code in btrfs-progs
> > with libbtrfsutil.
> >
> > Each significant (i.e. non-trivial) commit author has been CC'd to
> > request their sign-off on this. Please reply to this to acknowledge
> > whether or not this is acceptable for your code.
> 
> I'm a little surprised that I got CCed, as I can't remember any
> contribution from me to libbtrfstuil.

$ git log --oneline --author=wqu libbtrfsutil
f01d2b85581e libbtrfsutil: Convert to designated initialization for 
SubvolumeIterator_type
425c950cc6fe libbtrfsutil: Convert to designated initialization for 
QgroupInherit_type
a528cbeead1c libbtrfsutil: Convert to designated initialization for 
BtrfsUtilError_type

Re: [PATCH 0/3] Handle bad dev_root properly with rescue=all

2021-03-18 Thread David Sterba

On Thu, Mar 11, 2021 at 11:23:13AM -0500, Josef Bacik wrote:
[...]

> rescue=all working without panicing the machine,
> and I verified everything by
> using btrfs-corrupt-block to corrupt a dev root of a file system.  Thanks,

We need to have such testing part of some suite but it depends on the
utilities that we don't want to ship by default. Last time we discussed
how to make btrfs-corrupt-block or similar available in source form in
fstests it was not pretty, like having half of the btrfs-progs sources
providing code to read and modify the data structures.

So, what if: we have such tests in the btrfs-progs testsuite. As they're
potentially dangerous, not run by default, but available for easy setup
and test in a VM. The testsuite can be exported into a tarball, with the
binaries included. Or simply run it built from git, that works too just
needs more dependencies installed.

I was thinking about collecting all the stress tests into one place,
fstests already has some but my idea is to provide stress testing of
btrfs-specific features, more at once. Or require some 3rd party tools
and data sources to provide the load.

It would be some infrastructure duplication with fstests, but lots of
that is already done in btrfs-progs and I'd rather go with some
duplication instead of development-time-only testing.

Re: [PATCH v2] btrfs-progs: common: make sure that qgroup id is in range

2021-03-17 Thread David Sterba

On Tue, Mar 16, 2021 at 01:27:46PM +, Sidong Yang wrote:
> When user assign qgroup with qgroup id that is too big to exceeds
> range and invade level value, and it works without any error. but
> this action would be make undefined error. this code make sure that
> qgroup id doesn't exceed range(0 ~ 2^48-1).

Should the level be also validate? The function parse_qgroupid does not
do full validation, so eg 0//0 would be parsed as a path and not as a
typo, level larger than 64K will be silently clamped.

Re: [PATCH] btrfs: fix subvolume/snapshot deletion not triggered on mount

2021-03-17 Thread David Sterba

On Tue, Mar 16, 2021 at 04:53:46PM +, fdman...@kernel.org wrote:
> From: Filipe Manana 
> 
> During the mount procedure we are calling btrfs_orphan_cleanup() against
> the root tree, which will find all orphans items in this tree. When an
> orphan item corresponds to a deleted subvolume/snapshot (instead of an
> inode space cache), it must not delete the orphan item, because that will
> cause btrfs_find_orphan_roots() to not find the orphan item and therefore
> not add the corresponding subvolume root to the list of dead roots, which
> results in the subvolume's tree never being deleted by the cleanup thread.
> 
> The same applies to the remount from RO to RW path.
> 
> Fix this by making btrfs_find_orphan_roots() run before calling
> btrfs_orphan_cleanup() against the root tree.
> 
> A test case for fstests will follow soon.
> 
> Reported-by: Robbie Ko 
> Link: 
> https://lore.kernel.org/linux-btrfs/b19f4310-35e0-606e-1eea-2dd84d28c...@synology.com/
> Fixes: 638331fa56caea ("btrfs: fix transaction leak and crash after cleaning 
> up orphans on RO mount")
> Signed-off-by: Filipe Manana 

Added to misc-next, thanks.

Re: [PATCH v2] btrfs: Use immediate assignment when referencing cc-option

2021-03-17 Thread David Sterba

On Wed, Mar 17, 2021 at 09:08:48AM -0700, Victor Erminpour wrote:
> Calling cc-option will use KBUILD_CFLAGS, which when lazy setting
> subdir-ccflags-y produces the following build error:
> 
> scripts/Makefile.lib:10: *** Recursive variable `KBUILD_CFLAGS' \
>   references itself (eventually).  Stop.
> 
> Use single := assignment for subdir-ccflags-y. The cc-option calls
> are done right away and we don't end up with KBUILD_CFLAGS
> referencing itself.
> 
> Signed-off-by: Victor Erminpour 
> ---
>  fs/btrfs/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index b634c42115ea..583ca2028e08 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -1,16 +1,16 @@
>  # SPDX-License-Identifier: GPL-2.0
>  
>  # Subset of W=1 warnings
> +subdir-ccflags-y := $(call cc-option, -Wunused-but-set-variable) \
> + $(call cc-option, -Wunused-const-variable) \
> + $(call cc-option, -Wpacked-not-aligned) \
> + $(call cc-option, -Wstringop-truncation)

That's still overwriting the value unconditionally, the idea was a
separate variable

https://lore.kernel.org/linux-btrfs/20210317104313.17028-1-dste...@suse.com

Re: [PATCH] btrfs: zoned: automatically reclaim zones

2021-03-17 Thread David Sterba

On Wed, Mar 17, 2021 at 07:38:11PM +0900, Johannes Thumshirn wrote:
> When a file gets deleted on a zoned file system, the space freed is no
> returned back into the block group's free space, but is migrated to
> zone_unusable.
> 
> As this zone_unusable space is behind the current write pointer it is not
> possible to use it for new allocations. In the current implementation a
> zone is reset once all of the block group's space is accounted as zone
> unusable.
> 
> This behaviour can lead to premature ENOSPC errors on a busy file system.
> 
> Instead of only reclaiming the zone once it is completely unusable,
> kick off a reclaim job once the amount of unusable bytes exceeds a user
> configurable threshold between 51% and 100%.
> 
> Similar to reclaiming unused block groups, these dirty block groups are
> added to a to_reclaim list and then on a transaction commit, the reclaim
> process is triggered but after we deleted unused block groups, which will
> free space for the relocation process.
> 
> Signed-off-by: Johannes Thumshirn 

I'll add it as topic branch to for-next but I haven't reviewed it and
first thing I see missing is lack of mentioning the sysfs tunable in the
changelog.

Re: [PATCH] btrfs: zoned: remove outdated WARN_ON in direct IO

2021-03-17 Thread David Sterba

On Wed, Mar 17, 2021 at 01:22:11PM +, Johannes Thumshirn wrote:
> On 17/03/2021 14:20, Johannes Thumshirn wrote:
> > On 17/03/2021 11:54, David Sterba wrote:
> >> On Wed, Mar 17, 2021 at 05:57:31PM +0900, Johannes Thumshirn wrote:
> >>> In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if
> >>> we're submitting a DIO write on a zoned filesystem but are not using
> >>> REQ_OP_ZONE_APPEND to submit the IO to the block device.
> >>>
> >>> This is a left over from a previous version where btrfs_dio_iomap_begin()
> >>> didn't use btrfs_use_zone_append() to check for sequential write only
> >>> zones.
> >>
> >> I can't identify the patch where this got changed. I've landed on
> >> 544d24f9de73 ("btrfs: zoned: enable zone append writing for direct IO")
> >> but that adds the btrfs_use_zone_append, the append flag and also the
> >> warning.
> >>
> > 
> > It is an oversight from the development phase. In v11 (I think) I've added
> > 08f455593fff ("btrfs: zoned: cache if block group is on a sequential zone")
> > and forgot to remove the WARN_ON_ONCE() for 544d24f9de73 ("btrfs: zoned: 
> > enable zone append writing for direct IO").
> > 
> > When developing auto relocation I got hit by the WARN as a block groups
> > where relocated to conventional zone and the dio code calls
> > btrfs_use_zone_append() introduced by 08f455593fff to check if it can use
> > zone append (a.k.a. if it's a sequential zone) or not and sets the 
> > appropriate
> > flags for iomap.
> > 
> > I've never hit it in testing before, as I was relying on emulation to test
> > the conventional zones code but this one case wasn't hit, because on 
> > emulation
> > fs_info->max_zone_append_size is 0 and the WARN doesn't trigger either.
> 
> I just realized that explanation should have gone into the commit message, do 
> you
> want me to resend with a more elaborate commit message?

I'll append the explanation above to the changelog, no need to resend
unless you want to add/adjust something. Thanks.

Re: Fwd: btrfs-progs: libbtrfsutil is under LGPL-3.0 and statically liked into btrfs

2021-03-17 Thread David Sterba

On Wed, Mar 17, 2021 at 12:53:00PM +0100, Adam Borowski wrote:
> This is https://bugs.debian.org/985400
> 
> - Forwarded message from Claudius Heine  -
> 
> Dear Maintainer,
> 
> I looked into the licenses of the btrfs-progs project and found that the
> libbtrfsutils library is licensed under LGPL-3.0-or-later [1]
> 
> The `copyright` file does not show this this.
> 
> IANAL, but I think since `btrfs` (under GPL-2.0-only [2]) links to 
> `libbtrfsutil`
> statically this might cause a license conflict. See [3]. This would be the 
> part
> that might require upstream fixing.

Thanks for bringing it up.

> [1] https://github.com/kdave/btrfs-progs/blob/master/libbtrfsutil/btrfsutil.h
> [2] https://github.com/kdave/btrfs-progs/blob/master/btrfs.c
> [3] http://gplv3.fsf.org/dd3-faq#gpl-compat-matrix

As explained in that link

 "Use a library" means that you're not copying any source directly, but
 instead interacting with it through linking, importing, or other
 typical mechanisms that bind the sources together when you compile or
 run the code.

the static link applies and the licenses do not allow that. So what are
the options:

- relicense libbtrfsutil to LGPL v2.1 or later
- relicense libbtrfsutil to LGPL v2.1 only

There was another request to relicense it
https://lore.kernel.org/linux-btrfs/b927ca28-e280-4d79-184f-b72867dbd...@denx.de/

I'm not aware of any objections to relicensing, it hasn't happend in
5.11 due to time reasons but I think 5.12 is feasible.

If there's anybody willing to drive the process please let me know. The
mpv project did relicensing as well and we can draw some inspiration
from there https://github.com/mpv-player/mpv/issues/2033 . It took them
like 5 years but the number of contributors we'd need to ask is small and
most of them are known so it should not take more than a month.

Re: [PATCH 0/3] Handle bad dev_root properly with rescue=all

2021-03-17 Thread David Sterba

On Thu, Mar 11, 2021 at 11:23:13AM -0500, Josef Bacik wrote:
> Hello,
> 
> My recent debugging session with Neal's broken filesystem uncovered a glaring
> hole in my rescue=all patches, they don't deal with a NULL dev_root properly.
> In testing I only ever tested corrupting the extent tree or the csum tree, 
> since
> those are the most problematic.  The following 3 fixes allowed Neal to get
> rescue=all working without panicing the machine, and I verified everything by
> using btrfs-corrupt-block to corrupt a dev root of a file system.  Thanks,

When rescue= is set lots of things can't work, I was wondering if we
should add messages once the "if (!dev_root)" cases are hit but I don't
think we need it, it should be clear enough from the other rescue=
related messages.

Re: [PATCH 1/3] btrfs: init devices always

2021-03-17 Thread David Sterba

On Fri, Mar 12, 2021 at 01:57:32PM +0800, Anand Jain wrote:
> 
> 
> On 12/3/21 1:52 pm, Anand Jain wrote:
> > On 12/3/21 12:23 am, Josef Bacik wrote:
> >> Neal reported a panic trying to use -o rescue=all
> >>
> >> BUG: kernel NULL pointer dereference, address: 0030
> >> PGD 0 P4D 0
> >> Oops:  [#1] SMP NOPTI
> >> CPU: 0 PID: 696 Comm: mount Tainted: G    W 5.12.0-rc2+ #296
> >> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 
> >> 04/01/2014
> >> RIP: 0010:btrfs_device_init_dev_stats+0x1d/0x200
> >> RSP: 0018:afaec1483bb8 EFLAGS: 00010286
> >> RAX:  RBX: 9a5715bcb298 RCX: 0070
> >> RDX: 9a5703248000 RSI: 9a57052ea150 RDI: 9a5715bca400
> >> RBP: 9a57052ea150 R08: 0070 R09: 9a57052ea150
> >> R10: 000130faf0741c10 R11:  R12: 9a570370
> >> R13:  R14: 9a5715bcb278 R15: 9a57052ea150
> >> FS:  7f600d122c40() GS:9a577bc0() 
> >> knlGS:
> >> CS:  0010 DS:  ES:  CR0: 80050033
> >> CR2: 0030 CR3: 000112a46005 CR4: 00370ef0
> >> Call Trace:
> >>   ? btrfs_init_dev_stats+0x1f/0xf0
> >>   ? kmem_cache_alloc+0xef/0x1f0
> >>   btrfs_init_dev_stats+0x5f/0xf0
> >>   open_ctree+0x10cb/0x1720
> >>   btrfs_mount_root.cold+0x12/0xea
> >>   legacy_get_tree+0x27/0x40
> >>   vfs_get_tree+0x25/0xb0
> >>   vfs_kern_mount.part.0+0x71/0xb0
> >>   btrfs_mount+0x10d/0x380
> >>   legacy_get_tree+0x27/0x40
> >>   vfs_get_tree+0x25/0xb0
> >>   path_mount+0x433/0xa00
> >>   __x64_sys_mount+0xe3/0x120
> >>   do_syscall_64+0x33/0x40
> >>   entry_SYSCALL_64_after_hwframe+0x44/0xae
> >>
> >> This happens because when we call btrfs_init_dev_stats we do
> >> device->fs_info->dev_root.  However device->fs_info isn't init'ed
> >> because we were only calling btrfs_init_devices_late() if we properly
> >> read the device root. 
> > 
> > 
> >> However we don't actually need the device root to
> >> init the devices, this function simply assigns the devices their
> >> ->fs_info pointer properly, so this needs to be done unconditionally
> >> always so that we can properly deref device->fs_info in rescue cases.
> 
> 
> >   btrfs_device_init_dev_stats() calls btrfs_search_slot() leading
> >   to btrfs_search_slot_get_root(), and does de-reference root (dev_root)
> >   to get fs_info.
> 
>   Never mind. patch 2/3 handles it. Spoke too early.
>   Maybe can reorder the patches during integration.

I think swapping the order makes sense, though all the patches fix some
sort of crash, the number should be decreasing.

Re: [PATCH] btrfs: zoned: remove outdated WARN_ON in direct IO

2021-03-17 Thread David Sterba

On Wed, Mar 17, 2021 at 05:57:31PM +0900, Johannes Thumshirn wrote:
> In btrfs_submit_direct() there's a WAN_ON_ONCE() that will trigger if
> we're submitting a DIO write on a zoned filesystem but are not using
> REQ_OP_ZONE_APPEND to submit the IO to the block device.
> 
> This is a left over from a previous version where btrfs_dio_iomap_begin()
> didn't use btrfs_use_zone_append() to check for sequential write only
> zones.

I can't identify the patch where this got changed. I've landed on
544d24f9de73 ("btrfs: zoned: enable zone append writing for direct IO")
but that adds the btrfs_use_zone_append, the append flag and also the
warning.

[PATCH] btrfs: fix build when using M=fs/btrfs

2021-03-17 Thread David Sterba

There are people building the module with M= that's supposed to be used
for external modules. This got broken in e9aa7c285d20 ("btrfs: enable
W=1 checks for btrfs").

  $ make M=fs/btrfs
  scripts/Makefile.lib:10: *** Recursive variable 'KBUILD_CFLAGS' references 
itself (eventually).  Stop.
  make: *** [Makefile:1755: modules] Error 2

There's a difference compared to 'make fs/btrfs/btrfs.ko' which needs
to rebuild a few more things and also the dependency modules need to be
available. It could fail with eg.

  WARNING: Symbol version dump "Module.symvers" is missing.
   Modules may not have dependencies or modversions.

In some environments it's more convenient to rebuild just the btrfs
module by M= so let's make it work.

The problem is with recursive variable evaluation in += so the
conditional C options are stored in a temporary variable to avoid the
recursion.

Signed-off-by: David Sterba 
---
 fs/btrfs/Makefile | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index b634c42115ea..b4fb997eda16 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -7,10 +7,12 @@ subdir-ccflags-y += -Wmissing-format-attribute
 subdir-ccflags-y += -Wmissing-prototypes
 subdir-ccflags-y += -Wold-style-definition
 subdir-ccflags-y += -Wmissing-include-dirs
-subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable)
-subdir-ccflags-y += $(call cc-option, -Wunused-const-variable)
-subdir-ccflags-y += $(call cc-option, -Wpacked-not-aligned)
-subdir-ccflags-y += $(call cc-option, -Wstringop-truncation)
+condflags := \
+   $(call cc-option, -Wunused-but-set-variable)\
+   $(call cc-option, -Wunused-const-variable)  \
+   $(call cc-option, -Wpacked-not-aligned) \
+   $(call cc-option, -Wstringop-truncation)
+subdir-ccflags-y += $(condflags)
 # The following turn off the warnings enabled by -Wextra
 subdir-ccflags-y += -Wno-missing-field-initializers
 subdir-ccflags-y += -Wno-sign-compare
-- 
2.29.2

[PATCH 5.4] btrfs: scrub: Don't check free space before marking a block group RO

2021-03-17 Thread David Sterba

From: Qu Wenruo 

[ Upstream commit b12de52896c0e8213f70e3a168fde9e6eee95909 ]

[BUG]
When running btrfs/072 with only one online CPU, it has a pretty high
chance to fail:

  btrfs/072 12s ... _check_dmesg: something found in dmesg (see 
xfstests-dev/results//btrfs/072.dmesg)
  - output mismatch (see xfstests-dev/results//btrfs/072.out.bad)
  --- tests/btrfs/072.out 2019-10-22 15:18:14.008965340 +0800
  +++ /xfstests-dev/results//btrfs/072.out.bad  2019-11-14 
15:56:45.877152240 +0800
  @@ -1,2 +1,3 @@
   QA output created by 072
   Silence is golden
  +Scrub find errors in "-m dup -d single" test
  ...

And with the following call trace:

  BTRFS info (device dm-5): scrub: started on devid 1
  [ cut here ]
  BTRFS: Transaction aborted (error -27)
  WARNING: CPU: 0 PID: 55087 at fs/btrfs/block-group.c:1890 
btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
  CPU: 0 PID: 55087 Comm: btrfs Tainted: GW  O  5.4.0-rc1-custom+ 
#13
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:btrfs_create_pending_block_groups+0x3e6/0x470 [btrfs]
  Call Trace:
   __btrfs_end_transaction+0xdb/0x310 [btrfs]
   btrfs_end_transaction+0x10/0x20 [btrfs]
   btrfs_inc_block_group_ro+0x1c9/0x210 [btrfs]
   scrub_enumerate_chunks+0x264/0x940 [btrfs]
   btrfs_scrub_dev+0x45c/0x8f0 [btrfs]
   btrfs_ioctl+0x31a1/0x3fb0 [btrfs]
   do_vfs_ioctl+0x636/0xaa0
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe
  ---[ end trace 166c865cec7688e7 ]---

[CAUSE]
The error number -27 is -EFBIG, returned from the following call chain:
btrfs_end_transaction()
|- __btrfs_end_transaction()
   |- btrfs_create_pending_block_groups()
  |- btrfs_finish_chunk_alloc()
 |- btrfs_add_system_chunk()

This happens because we have used up all space of
btrfs_super_block::sys_chunk_array.

The root cause is, we have the following bad loop of creating tons of
system chunks:

1. The only SYSTEM chunk is being scrubbed
   It's very common to have only one SYSTEM chunk.
2. New SYSTEM bg will be allocated
   As btrfs_inc_block_group_ro() will check if we have enough space
   after marking current bg RO. If not, then allocate a new chunk.
3. New SYSTEM bg is still empty, will be reclaimed
   During the reclaim, we will mark it RO again.
4. That newly allocated empty SYSTEM bg get scrubbed
   We go back to step 2, as the bg is already mark RO but still not
   cleaned up yet.

If the cleaner kthread doesn't get executed fast enough (e.g. only one
CPU), then we will get more and more empty SYSTEM chunks, using up all
the space of btrfs_super_block::sys_chunk_array.

[FIX]
Since scrub/dev-replace doesn't always need to allocate new extent,
especially chunk tree extent, so we don't really need to do chunk
pre-allocation.

To break above spiral, here we introduce a new parameter to
btrfs_inc_block_group(), @do_chunk_alloc, which indicates whether we
need extra chunk pre-allocation.

For relocation, we pass @do_chunk_alloc=true, while for scrub, we pass
@do_chunk_alloc=false.
This should keep unnecessary empty chunks from popping up for scrub.

Also, since there are two parameters for btrfs_inc_block_group_ro(),
add more comment for it.

Reviewed-by: Filipe Manana 
Signed-off-by: Qu Wenruo 
Signed-off-by: David Sterba 
---

There's a report for 5.4 and the patch applies with a minor fixup
without dependencies.

https://bugzilla.kernel.org/show_bug.cgi?id=210447

 fs/btrfs/block-group.c | 48 +++---
 fs/btrfs/block-group.h |  3 ++-
 fs/btrfs/relocation.c  |  2 +-
 fs/btrfs/scrub.c   | 21 +-
 4 files changed, 54 insertions(+), 20 deletions(-)

diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c
index 08ca9441270d..a352c1704042 100644
--- a/fs/btrfs/block-group.c
+++ b/fs/btrfs/block-group.c
@@ -2048,8 +2048,17 @@ static u64 update_block_group_flags(struct btrfs_fs_info 
*fs_info, u64 flags)
return flags;
 }
 
-int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache)
-
+/*
+ * Mark one block group RO, can be called several times for the same block
+ * group.
+ *
+ * @cache: the destination block group
+ * @do_chunk_alloc:whether need to do chunk pre-allocation, this is to
+ * ensure we still have some free space after marking this
+ * block group RO.
+ */
+int btrfs_inc_block_group_ro(struct btrfs_block_group_cache *cache,
+bool do_chunk_alloc)
 {
struct btrfs_fs_info *fs_info = cache->fs_info;
struct btrfs_trans_handle *trans;
@@ -2079,25 +2088,29 @@ int btrfs_inc_block_group_ro(struct 
btrfs_block_group_cache *cache)
goto again;
}
 
-   /*
-* if we are changing raid levels, try to allocate a corresponding
-

Re: [PATCH] btrfs: Use immediate assignment when referencing cc-option

2021-03-17 Thread David Sterba

On Tue, Mar 16, 2021 at 03:46:10PM -0700, Victor Erminpour wrote:
> Calling cc-option will use KBUILD_CFLAGS, which when lazy setting
> subdir-ccflags-y produces the following build error:
> 
> scripts/Makefile.lib:10: *** Recursive variable `KBUILD_CFLAGS' \
>   references itself (eventually).  Stop.
> 
> Use := assignment to subdir-ccflags-y when referencing cc-option.
> This causes make to also evaluate += immediately, cc-option
> calls are done right away and we don't end up with KBUILD_CFLAGS
> referencing itself.
> 
> Signed-off-by: Victor Erminpour 
> ---
>  fs/btrfs/Makefile | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
> index b634c42115ea..3dba1336fa95 100644
> --- a/fs/btrfs/Makefile
> +++ b/fs/btrfs/Makefile
> @@ -7,10 +7,10 @@ subdir-ccflags-y += -Wmissing-format-attribute
>  subdir-ccflags-y += -Wmissing-prototypes
>  subdir-ccflags-y += -Wold-style-definition
>  subdir-ccflags-y += -Wmissing-include-dirs
> -subdir-ccflags-y += $(call cc-option, -Wunused-but-set-variable)
> -subdir-ccflags-y += $(call cc-option, -Wunused-const-variable)
> -subdir-ccflags-y += $(call cc-option, -Wpacked-not-aligned)
> -subdir-ccflags-y += $(call cc-option, -Wstringop-truncation)
> +subdir-ccflags-y := $(call cc-option, -Wunused-but-set-variable)
> +subdir-ccflags-y := $(call cc-option, -Wunused-const-variable)
> +subdir-ccflags-y := $(call cc-option, -Wpacked-not-aligned)
> +subdir-ccflags-y := $(call cc-option, -Wstringop-truncation)

But this overwrites all the previously accumulated values from += so
effectively there's only the last one, no? That's not what we want.

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 2893 matches

Mail list logo