On Tue 22 Dec 2020 at 11:49, Naohiro Aota
wrote:
For a zone append write, the device decides the location the
data is
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered
extent maps
to a contiguous region on disk,
This is the 3/3 patch to enable tree-log on ZONED mode.
The allocation order of nodes of "fs_info->log_root_tree" and nodes of
"root->log_root" is not the same as the order in which they are written.
So, writing them causes unaligned write errors.
This patch reorders the allocation of them by delaying alloc
This is the 2/3 patch to enable tree-log on ZONED mode.
Since we can start more than one log transaction per subvolume
simultaneously, nodes from multiple transactions can be allocated
interleaved. Such mixed allocation results in non-sequential writes at the
time of log transaction commit. The n
This is a preparation for the next patch. This commit splits
alloc_log_tree() into a part that allocates the tree structure (remains in
alloc_log_tree()) and a part that allocates the tree node (moved into
btrfs_alloc_log_tree_node()). The latter part is also exported to be used
in the next patch.
Reviewed-by: Josef Bacik
This is the 1/3 patch to enable tree log on ZONED mode.
The tree-log feature does not work on ZONED mode as is. Blocks for a
tree-log tree are allocated mixed with other metadata blocks, and btrfs
writes and syncs the tree-log blocks to devices at the time of fsync(),
which is different timing fro
From: Johannes Thumshirn
In zoned mode, cache if a block-group is on a sequential write only zone.
On sequential write only zones, we can use REQ_OP_ZONE_APPEND for writing
of data, therefore provide btrfs_use_zone_append() to figure out if I/O is
targeting a sequential write only zone and we can
To serialize allocation and submit_bio, we introduced a mutex around them.
As a result, preallocation must be completely disabled to avoid a deadlock.
Since the current relocation process relies on preallocation to move file
data extents, it must be handled in another way. In ZONED mode, we just truncat
When btrfs finds a checksum error and the file system has a mirror of the
damaged data, btrfs reads the correct data from the mirror and writes it
to the damaged blocks. This repair, however, violates the sequential
write required rule.
We can consider three methods to repair an IO failure
btrfs_rmap_block currently reverse-maps the physical addresses on all
devices to the corresponding logical addresses.
This commit extends the function to match to a specified device. The old
functionality of querying all devices is left intact by specifying NULL as
target device.
We pass block_de
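The device-filtered reverse mapping described above can be sketched in user space as follows. This is only an illustration under stated assumptions: struct sim_device, struct sim_stripe and sim_rmap() are hypothetical names, not the kernel's data structures or the signature of btrfs_rmap_block(). The one behaviour taken from the text is that a NULL target keeps the old "query all devices" semantics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; these are not the kernel's structures. */
struct sim_device {
    int id;
};

struct sim_stripe {
    const struct sim_device *dev; /* device holding this stripe */
    uint64_t physical;            /* start of the stripe on dev */
    uint64_t logical;             /* logical address it maps to */
    uint64_t len;                 /* stripe length in bytes */
};

/*
 * Reverse-map a physical address to logical addresses.
 * target == NULL matches every device (the old behaviour); otherwise
 * only stripes on target are considered. Stores at most max_out
 * results in logical_out and returns how many were stored.
 */
static size_t sim_rmap(const struct sim_stripe *stripes, size_t count,
                       const struct sim_device *target, uint64_t physical,
                       uint64_t *logical_out, size_t max_out)
{
    size_t found = 0;

    assert(logical_out != NULL || max_out == 0);
    for (size_t i = 0; i < count; i++) {
        const struct sim_stripe *s = &stripes[i];

        if (target != NULL && s->dev != target)
            continue;
        if (physical < s->physical || physical - s->physical >= s->len)
            continue;
        if (found == max_out)
            break;
        logical_out[found++] = s->logical + (physical - s->physical);
    }
    return found;
}
```

Passing NULL for target returns one logical address per device that has a stripe covering the physical address; passing a specific device narrows the result to that device only.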
When truncating a file, file buffers which have already been allocated but
not yet written may be truncated. Truncating these buffers could cause
breakage of a sequential write pattern in a block group if the truncated
blocks are for example followed by blocks allocated to another file. To
avoid t
In ZONED mode, btrfs uses a per-FS zoned_meta_io_lock to serialize the
metadata write IOs.
Even with this serialization, write bios sent from btree_write_cache_pages
can be reordered by async checksum workers as these workers are per CPU and
not per zone.
To preserve write BIO ordering, we can disable
This is the 4/4 patch to implement device-replace on ZONED mode.
Even after the copying is done, the write pointers of the source device and
the destination device may not be synchronized. For example, when the last
allocated extent is freed before device-replace process, the extent is not
copied, lea
We cannot use zone append for writing metadata, because the B-tree nodes
have references to each other using the logical address. Without knowing
the address in advance, we cannot construct the tree in the first place.
So we need to serialize write IOs for metadata.
We cannot add a mutex around al
For a zone append write, the device decides the location the data is
written to. Therefore we cannot ensure that two bios are written
consecutively on the device. In order to ensure that an ordered extent maps
to a contiguous region on disk, we need to maintain a "one bio == one
ordered extent" rule
If more than one IO is issued for one file extent, these IOs can be written
to separate regions on a device. Since we cannot map one file extent to
such a separate area, we need to follow the "one IO == one ordered extent"
rule.
The Normal buffered, uncompressed, not pre-allocated write path (used
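The reason behind the "one IO == one ordered extent" rule above can be shown with a small user-space model of zone append. This is a sketch, not kernel code: struct sim_zone, sim_zone_append() and sim_is_contiguous() are hypothetical names. The only property modelled is the one the text states, namely that the device, not the submitter, picks the write location.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one sequential zone; wp is the write pointer. */
struct sim_zone {
    uint64_t capacity;
    uint64_t wp;
};

/*
 * Model of a zone append: the caller names only the zone, the "device"
 * places the data at its current write pointer and reports that position.
 * Returns 0 on success, -1 if the zone cannot hold len more bytes.
 */
static int sim_zone_append(struct sim_zone *zone, uint64_t len,
                           uint64_t *written_at)
{
    assert(zone->wp <= zone->capacity);
    if (len > zone->capacity - zone->wp)
        return -1;
    *written_at = zone->wp;
    zone->wp += len;
    return 0;
}

/* True if the second IO landed directly behind the first one. */
static bool sim_is_contiguous(uint64_t first_pos, uint64_t first_len,
                              uint64_t second_pos)
{
    return first_pos + first_len == second_pos;
}
```

If a file extent is split into two appends and an append for another file is completed in between, sim_is_contiguous() reports false for the two halves, which is exactly the situation a single file extent cannot describe.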
This is the 3/4 patch to implement device-replace on ZONED mode.
This commit implements copying. So, it tracks the write pointer during the
device replace process. Since device-replace's copying is smart enough to
copy only used extents on the source device, we have to fill the gap to
honor the sequential write rule in the
A zoned device has its own hardware restrictions, e.g. max_zone_append_size
when using REQ_OP_ZONE_APPEND. To follow the restrictions, use
bio_add_zone_append_page() instead of bio_add_page(). We need the target
device to use bio_add_zone_append_page(), so this commit reads the chunk
information to memoiz
ZONED btrfs uses REQ_OP_ZONE_APPEND bios for writing to actual devices. Let
btrfs_end_bio() and btrfs_op be aware of it.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/disk-io.c | 4 ++--
fs/btrfs/inode.c | 10 +-
fs/btrfs/volumes.c | 8
fs/btrfs/volumes.
This commit enables zone append writing for zoned btrfs. When using zone
append, a bio is issued to the start of a target zone and the device
decides where to place it inside the zone. Upon completion the device reports
the actual written position back to the host.
Three parts are necessary to enable zo
This is the 2/4 patch to implement device-replace for ZONED mode.
In zoned mode, a block group must be either copied (from the source device
to the destination device) or cloned (to both devices).
This commit implements the cloning part. If a block group targeted by an IO
is marked to copy, we sho
As with buffered IO, enable zone append writing for direct IO when it is
used on a zoned block device.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/inode.c | 17 +
1 file changed, 17 insertions(+)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6b5
This is the 1/4 patch to support device-replace in ZONED mode.
We have two types of I/Os during the device-replace process. One is an I/O
to "copy" (by the scrub functions) all the device extents on the source
device to the destination device. The other one is an I/O to "clone" (by
handle_ops_on_
This commit extracts the part that adds a page to a bio from
submit_extent_page(). The page is added only when the bio_flags are the
same, the page is contiguous, and it fits in the same stripe as the pages
already in the bio.
The condition checks are reordered to allow an early return and avoid the
possibly heavy btrfs_bio_fits_in_st
For a ZONED volume, a block group maps to a zone of the device. For
deleted unused block groups, the zone of the block group can be reset to
rewind the zone write pointer to the start of the zone.
Reviewed-by: Josef Bacik
Signed-off-by: Naohiro Aota
---
fs/btrfs/block-group.c | 8 ++--
fs
From: Johannes Thumshirn
A following patch will add another caller of
btrfs_lookup_ordered_extent() from a bio endio context.
btrfs_lookup_ordered_extent() uses spin_lock_irq() which unconditionally
disables interrupts. Change this to spin_lock_irqsave() so interrupts
aren't disabled and re-enab
This final patch adds the ZONED incompat flag to
BTRFS_FEATURE_INCOMPAT_SUPP and enables btrfs to mount a ZONED flagged file
system.
Signed-off-by: Naohiro Aota
Reviewed-by: Josef Bacik
---
fs/btrfs/ctree.h | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/btrfs/ctree.h b/f
Since the allocation info of a tree log node is not recorded in the extent
tree, calculate_alloc_pointer() cannot detect the node, so the pointer can
be over a tree node.
Replaying the log calls btrfs_remove_free_space() for each node in the log
tree. So, advance the pointer after the node.
Reviewed
This commit implements a sequential extent allocator for the ZONED mode.
This allocator just needs to check if there is enough space in the block
group. Therefore the allocator never manages bitmaps or clusters. Also add
ASSERTs to the corresponding functions.
Actually, with zone append writing, it
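The space check described for the sequential extent allocator can be sketched as below. This is a simplified user-space model, not the patch's code: struct sim_block_group and sim_alloc_sequential() are hypothetical, and the alloc_offset field name is borrowed from another patch in this series on the assumption that it plays the same role here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified block group: one allocation pointer. */
struct sim_block_group {
    uint64_t start;        /* logical start address */
    uint64_t length;       /* size in bytes */
    uint64_t alloc_offset; /* next allocatable byte, relative to start */
};

/*
 * Sequential allocation: the only check is whether num_bytes still fit
 * behind alloc_offset. No bitmaps or clusters are consulted.
 * Returns 0 and stores the logical address, or -1 when out of space.
 */
static int sim_alloc_sequential(struct sim_block_group *bg,
                                uint64_t num_bytes, uint64_t *logical)
{
    assert(bg->alloc_offset <= bg->length);
    if (num_bytes > bg->length - bg->alloc_offset)
        return -1;
    *logical = bg->start + bg->alloc_offset;
    bg->alloc_offset += num_bytes;
    return 0;
}
```

Because space is only ever handed out at the allocation pointer, successive allocations are adjacent by construction, which is what keeps the resulting writes sequential.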
Tree manipulating operations like merging nodes often release
once-allocated tree nodes. Btrfs cleans such nodes so that pages in the
node are not uselessly written out. On ZONED volumes, however, such
optimization blocks the following IOs as the cancellation of the write out
of the freed blocks br
In zoned btrfs a region that was once written then freed is not usable
until we reset the underlying zone. So we need to distinguish such
unusable space from usable free space.
Therefore we need to introduce the "zone_unusable" field to the block
group structure, and "bytes_zone_unusable" to the
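The distinction between usable free space and written-then-freed space can be modelled with a few counters. This is a sketch under assumptions: only the zone_unusable field name comes from the text above, while struct sim_bg_space, the other fields and all three helpers are hypothetical. The reset step borrows the idea, from another patch in this series, that an unused block group's zone can be reset.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-block-group space counters. */
struct sim_bg_space {
    uint64_t length;        /* total bytes in the block group */
    uint64_t alloc_offset;  /* bytes handed out so far */
    uint64_t used;          /* bytes still referenced */
    uint64_t zone_unusable; /* written, then freed: unusable until reset */
};

/* Only the space behind the allocation pointer is really free. */
static uint64_t sim_free_bytes(const struct sim_bg_space *bg)
{
    return bg->length - bg->alloc_offset;
}

/* Freeing does not give space back; it only becomes zone_unusable. */
static void sim_free_extent(struct sim_bg_space *bg, uint64_t num_bytes)
{
    assert(num_bytes <= bg->used);
    bg->used -= num_bytes;
    bg->zone_unusable += num_bytes;
}

/* Resetting the zone is only possible once nothing is used any more. */
static int sim_reset_zone(struct sim_bg_space *bg)
{
    if (bg->used != 0)
        return -1;
    bg->alloc_offset = 0;
    bg->zone_unusable = 0;
    return 0;
}
```

In this model used + zone_unusable always equals alloc_offset, so the unusable bytes are never double counted as free.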
Add a check in verify_one_dev_extent() to check if a device extent on a
zoned block device is aligned to the respective zone boundary.
Signed-off-by: Naohiro Aota
Reviewed-by: Anand Jain
Reviewed-by: Josef Bacik
---
fs/btrfs/volumes.c | 14 ++
1 file changed, 14 insertions(+)
diff
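The alignment check added to verify_one_dev_extent() can be pictured with the following user-space sketch. It is not the code from the patch: sim_dev_extent_zone_aligned() is a hypothetical helper, and checking both the start offset and the length against the zone size is an assumption about what "aligned to the respective zone boundary" covers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Returns true if a device extent starts on a zone boundary and spans
 * a whole number of zones. zone_size is in bytes and must be non-zero.
 */
static bool sim_dev_extent_zone_aligned(uint64_t physical_offset,
                                        uint64_t physical_len,
                                        uint64_t zone_size)
{
    assert(zone_size != 0);
    return physical_offset % zone_size == 0 &&
           physical_len % zone_size == 0;
}
```

With a 256 MiB zone, for example, an extent starting 4 KiB past a zone boundary is rejected even if its length is a whole zone.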
Zoned btrfs must allocate blocks at the zones' write pointer. The device's
write pointer position can be mapped to a logical address within a block
group. This commit adds "alloc_offset" to track the logical address.
This logical address is populated in btrfs_load_block_group_zone_info()
from writ
The implementation of fitrim depends on space cache, which is not used
and disabled for zoned btrfs' extent allocator. So the current code does
not work with zoned btrfs. In the future, we can implement fitrim for zoned
btrfs by enabling space cache (but, only for fitrim) or scanning the exten
Conventional zones do not have a write pointer, so we cannot use it to
determine the allocation offset if a block group contains a conventional
zone.
But instead, we can consider the end of the last allocated extent in the
block group as an allocation offset.
For new block group, we cannot calcul
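Taking the end of the last allocated extent as the allocation offset can be sketched as follows. This is an illustration only: struct sim_extent and sim_alloc_offset_from_extents() are hypothetical, and the extents are passed in as a plain array instead of being read from the extent tree. Returning 0 when there are no extents is a choice made for this sketch, not something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocated extent, in logical addresses. */
struct sim_extent {
    uint64_t start;
    uint64_t len;
};

/*
 * For a block group without a write pointer, derive the allocation
 * offset from the highest end address of any allocated extent.
 * The result is relative to bg_start.
 */
static uint64_t sim_alloc_offset_from_extents(uint64_t bg_start,
                                              uint64_t bg_len,
                                              const struct sim_extent *extents,
                                              size_t count)
{
    uint64_t last_end = bg_start;

    for (size_t i = 0; i < count; i++) {
        uint64_t end = extents[i].start + extents[i].len;

        /* Every extent must lie inside the block group. */
        assert(extents[i].start >= bg_start && end <= bg_start + bg_len);
        if (end > last_end)
            last_end = end;
    }
    return last_end - bg_start;
}
```

The extents do not need to be sorted; only the maximum end address matters.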
From: Johannes Thumshirn
Emulate zoned btrfs mode on non-zoned devices. This is done by "slicing
up" the block-device into static sized chunks and faking a conventional zone
on each of them. The emulated zone size is determined from the size of
device extent.
This is mainly aimed at testing parts
From: Johannes Thumshirn
Run zoned btrfs mode on non-zoned devices. This is done by "slicing
up" the block-device into static sized chunks and faking a conventional zone
on each of them. The emulated zone size is determined from the size of
device extent.
This is mainly aimed at testing parts of t
This commit implements a zoned chunk/dev_extent allocator. The zoned
allocator aligns the device extents to zone boundaries, so that a zone
reset affects only the device extent and does not change the state of
blocks in the neighbor device extents.
Also, it checks that a region allocation is not o
This is a preparation patch to implement zone emulation on a regular device.
To emulate zoned mode on a regular (non-zoned) device, we need to decide on
an emulated zone size. Instead of making it a compile-time static value,
we'll make it configurable at mkfs time. Since we have one zone == one device
From: Johannes Thumshirn
Don't set the zoned flag in fs_info when encountering the
BTRFS_FEATURE_INCOMPAT_ZONED on mount. The zoned flag in fs_info is in a
union together with the zone_size, so setting it too early will result in
setting an incorrect zone_size as well.
Once the correct zone_size
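The side effect of setting the flag too early follows directly from how a C union shares storage, as this sketch shows. It is a stand-in, not the kernel structure: struct sim_fs_info and both helpers are hypothetical, and giving both union members the type uint64_t is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant part of fs_info. */
struct sim_fs_info {
    union {
        uint64_t zone_size; /* real zone size once it is known */
        uint64_t zoned;     /* non-zero means zoned mode */
    };
};

/*
 * Setting the flag before the zone size is known: both members share
 * the same storage, so zone_size now reads back as 1.
 */
static uint64_t sim_set_zoned_early(struct sim_fs_info *info)
{
    info->zoned = 1;
    return info->zone_size;
}

/* Storing the real zone size also makes the flag read as true. */
static void sim_set_zone_size(struct sim_fs_info *info, uint64_t zone_size)
{
    assert(zone_size != 0);
    info->zone_size = zone_size;
}
```

In this model the bogus zone size of 1 is visible to any code that runs between the early flag assignment and the point where the real size is stored.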
From: Johannes Thumshirn
Since we have no write pointer in conventional zones, we cannot determine
the allocation offset from it. Instead, we set the allocation offset after
the highest addressed extent. This is done by reading the extent tree in
btrfs_load_block_group_zone_info().
However, this
The zoned btrfs puts a superblock at the beginning of SB logging zones
if the zone is conventional. This difference causes a chicken-and-egg
problem for emulated zoned mode. Since the device is a regular
(non-zoned) device, we cannot know if the btrfs is regular or emulated
zoned while we read the
From: Johannes Thumshirn
Add bio_add_zone_append_page(), a wrapper around bio_add_hw_page() which
is intended to be used by file systems that directly add pages to a bio
instead of using bio_iov_iter_get_pages().
Cc: Jens Axboe
Reviewed-by: Christoph Hellwig
Signed-off-by: Johannes Thumshirn
A ZONE_APPEND bio must follow hardware restrictions (e.g. not exceeding
max_zone_append_sectors) so that it is not split. bio_iov_iter_get_pages
builds such a restricted bio using __bio_iov_append_get_pages if bio_op(bio) ==
REQ_OP_ZONE_APPEND.
To utilize it, we need to set the bio_op before calling
bio_iov
This series adds zoned block device support to btrfs. Some of the patches
in the previous series are already merged as preparation patches.
This series is also available on github.
Kernel https://github.com/naota/linux/tree/btrfs-zoned-v12
Userland https://github.com/naota/btrfs-progs/tree/btrfs
On Thu, Jan 14, 2021 at 05:37:29PM +0100, David Sterba wrote:
> Hi,
>
> On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> > I was looking through the mount options and being a madman with strong
> > opinions I can't help thinking that a lot of them do not really belong
> > as mount op
David Sterba wrote:
Hi,
On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
I was looking through the mount options and being a madman with strong
opinions I can't help thinking that a lot of them do not really belong
as mount options at all, but should rather be properties set on the
su
On Wed, Dec 16, 2020 at 03:42:40AM +, Sidong Yang wrote:
> This patch makes the output of the filesystem-resize command more readable
> and gives detailed information to users. This patch provides more information
> about filesystem like below.
>
> Before:
> Resize '/mnt' of '1:-1G'
>
> After:
> Resize
On Wed, Jan 13, 2021 at 01:58:18PM +0800, Xing Zhengjun wrote:
>
>
> On 1/12/2021 11:45 PM, David Sterba wrote:
> > On Tue, Jan 12, 2021 at 11:36:14PM +0800, kernel test robot wrote:
> >> Greeting,
> >>
> >> FYI, we noticed a -18.3% regression of fio.write_iops due to commit:
> >>
> >>
> >> commi
Hi Neal,
On 14.01.21 20:38, Neal Gompa wrote:
On Thu, Dec 10, 2020 at 7:18 AM Stefano Babic wrote:
Hi David,
On 10.12.20 12:27, David Sterba wrote:
On Tue, Dec 08, 2020 at 01:00:01PM -0800, Omar Sandoval wrote:
On Tue, Dec 08, 2020 at 10:49:10AM +0100, Stefano Babic wrote:
Hi,
I hope I a
Hi David,
On 14.01.21 19:47, David Sterba wrote:
On Thu, Dec 10, 2020 at 01:03:04PM +0100, Stefano Babic wrote:
I read this, thanks.
I was quite confused about the license for libbtrfsutil due to both
"COPYING" and "COPYING.LESSER" in the library path. COPYING reports
GPLv3. But headers in fil
On Thu, Dec 10, 2020 at 7:18 AM Stefano Babic wrote:
>
> Hi David,
>
> On 10.12.20 12:27, David Sterba wrote:
> > On Tue, Dec 08, 2020 at 01:00:01PM -0800, Omar Sandoval wrote:
> >> On Tue, Dec 08, 2020 at 10:49:10AM +0100, Stefano Babic wrote:
> >>> Hi,
> >>>
> >>> I hope I am not OT. I ask about
A weird KASAN problem that Zygo reported could have been easily caught
if we checked for basic things in our backref freeing code. We have two
methods of freeing a backref node
- btrfs_backref_free_node: this just is kfree() essentially.
- btrfs_backref_drop_node: this actually unlinks the node a
While testing my error handling patches, I added an error injection site
at btrfs_inc_extent_ref, to validate the error handling I added was
doing the correct thing. However I hit a pretty ugly corruption while
doing this check, with the following error injection stack trace
btrfs_inc_extent_ref
b
While doing error injection testing with my relocation patches I hit the
following ASSERT()
assertion failed: list_empty(&block_group->dirty_list), in
fs/btrfs/block-group.c:3356
[ cut here ]
kernel BUG at fs/btrfs/ctree.h:3357!
invalid opcode: [#1] SMP NOPTI
CPU: 0 P
The backref code is looking for a reloc_root that corresponds to the
given fs root. However any number of things could have gone wrong while
initializing that reloc_root, like ENOMEM while trying to allocate the
root itself, or EIO while trying to write the root item. This would
result in no corr
v1->v2:
- Rebased onto misc-next, dropping everything that's been merged so far.
- Fixed "btrfs: splice remaining dirty_bg's onto the transaction dirty bg list"
to handle the btrfs_alloc_path() failure and cleaned up the error handling as
a result of that change.
- dropped "btrfs: don't clear r
When recovering a relocation, if we run into a reloc root that has 0
refs we simply add it to the reloc_control->reloc_roots list, and then
clean it up later. The problem with this is __del_reloc_root() doesn't
do anything if the root isn't in the radix tree, which in this case it
won't be because
On Tue, Dec 29, 2020 at 09:34:51AM +, Stéphane Lesimple wrote:
> December 29, 2020 1:32 AM, "Qu Wenruo" wrote:
>
> > There are cases where v1 free space cache is still left while user has
> > already enabled v2 cache.
> >
> > In that case, we still want to force v1 space cache cleanup in
> >
On Thu, Dec 10, 2020 at 01:03:04PM +0100, Stefano Babic wrote:
> I read this, thanks.
>
> I was quite confused about the license for libbtrfsutil due to both
> "COPYING" and "COPYING.LESSER" in the library path. COPYING reports
> GPLv3. But headers in file set LGPLv3, sure, and btrfs.h is GPLv2.
>
On Fri, Nov 27, 2020 at 04:30:32PM -0300, Marcos Paulo de Souza wrote:
> From: Marcos Paulo de Souza
>
> In this fourth iteration, only patch 0002 was changed. Previously the variable
> full_path, which is passed by the user, was being overwritten in the inode
> loop.
> Now we create a temp var t
On Sat, Dec 26, 2020 at 02:46:06PM -0700, shng...@gmail.com wrote:
> From: Sheng Mao
>
> To use the optimized CRC implementation, the input buffer must be
> unsigned long aligned. btrfs receive calculates checksum based on
> read_buf, including btrfs_cmd_header (with zero-ed CRC field)
> and command co
On Fri, Jan 08, 2021 at 05:44:35PM +0100, David Sterba wrote:
> On Wed, Dec 16, 2020 at 11:22:04AM -0500, Josef Bacik wrote:
> > Hello,
> >
> > A lot of these were in previous versions of the relocation error handling
> > patches. I added a few since the last go around. All of these do not rely
On Mon, Jan 11, 2021 at 11:23:15PM +0100, David Sterba wrote:
> On Wed, Dec 16, 2020 at 11:22:15AM -0500, Josef Bacik wrote:
> > --- a/fs/btrfs/relocation.c
> > +++ b/fs/btrfs/relocation.c
> > @@ -98,6 +98,7 @@ struct tree_block {
> > u64 bytenr;
> > }; /* Use rb_simple_node for sea
Hi,
On Thu, Jan 14, 2021 at 03:12:26AM +0100, waxhead wrote:
> I was looking through the mount options and being a madman with strong
> opinions I can't help thinking that a lot of them do not really belong
> as mount options at all, but should rather be properties set on the
> subvolume - fo
Hi,
On Sun, Dec 27, 2020 at 04:07:44PM +, Mark Harmstone wrote:
> I'm the creator of the Windows Btrfs driver. During the course of development,
> it's become apparent that for 100% compatibility with NTFS there'd need to be
> some minor changes to the disk format. Examples: Windows' LZNT1 com
On Mon, Jan 11, 2021 at 02:47:26PM -0500, Josef Bacik wrote:
On 12/21/20 10:48 PM, Naohiro Aota wrote:
We cannot use log-structured superblock writing in conventional zones since
there is no write pointer to determine the last written superblock
position. So, we write a superblock at a static lo
On Thu, Jan 14, 2021 at 01:13:50AM -0700, Chris Murphy wrote:
> Hi,
>
> It looks like this didn't make it to 5.10.7. I see the PR for
> 5.11-rc4. Is it likely it'll make it into 5.10.8?
5.10.8 is feasible I think, a minor diff fixup is needed to apply to
5.10.x, I'll send it.
Hi,
It looks like this didn't make it to 5.10.7. I see the PR for
5.11-rc4. Is it likely it'll make it into 5.10.8?
e076ab2a2ca70a0270232067cd49f76cd92efe64
btrfs: shrink delalloc pages instead of full inodes
Thanks,
--
Chris Murphy