Hi, this update brings updates of space handling, performance improvements or bug fixes. The subpage block size and zoned mode features have reached state where they're usable but with limitations.
The branch merges cleanly on top of current master, there are some minor conflicts reported by linux-next in iov_iter or kmap conversion. Please pull, thanks. - Performance or related - do not block on deleted block group mutex in the cleaner, avoids some long stalls - improved flushing: make it work better with ticket space reservations and avoid excessive transaction commits in some scenarios, slightly improves throughput for random write load - preemptive background flushing: separate the logic from ticket reservations, improve the accounting and decisions when to flush in low space conditions - less lock contention related to running delayed refs, let just one thread do the flushing when there are many inside transaction commit - dbench workload improvements: avoid unnecessary work when logging inodes, fewer fallbacks to transaction commit and thus less waiting for it (+7% throughput, -20% latency) - Core - subpage block size - currently read-only support - refactor and generalize code where sectorsize is assumed to be page size, add the subpage handling everywhere - the read-write support is on the way, page sizes are still limited to 4K or 64K - zoned mode, first working version but with limitations - SMR/ZBC/ZNS friendly allocation mode, utilizing the "no fixed location for structures" and chunked allocation - superblock as the only fixed data structure needs special handling, uses 2 consecutive zones as a ring buffer - tree-log support with a dedicated block group to avoid unordered writes - emulated zones on non-zoned devices - not yet working - all non-single block group profiles, requires more zone write pointer synchronization between the multiple block groups - fitrim due to dependency on space cache, can be implemented - Fixes - ref-verify: proper tree owner and node level tracking - fix pinned byte accounting, causing some early ENOSPC now more likely due to other changes in delayed refs - Other - error handling fixes and improvements - more error injection points - more function documentation - more and updated tracepoints - subset of W=1 checked by default - update comments to allow more automatic kdoc parameter checks ---------------------------------------------------------------- The following changes since commit e0756cfc7d7cd08c98a53b6009c091a3f6a50be6: Merge tag 'trace-v5.11-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace (2021-02-08 11:32:39 -0800) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-5.12-tag for you to fetch changes up to 9d294a685fbcb256ce8c5f7fd88a7596d0f52a8a: btrfs: zoned: enable to mount ZONED incompat flag (2021-02-09 02:52:24 +0100) ---------------------------------------------------------------- Abaci Team (1): btrfs: simplify condition in __btrfs_run_delayed_items Filipe Manana (10): btrfs: send: remove stale code when checking for shared extents btrfs: remove wrong comment for can_nocow_extent() btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit btrfs: fix extent buffer leak on failure to copy root Johannes Thumshirn (7): block: add bio_add_zone_append_page btrfs: release path before calling to btrfs_load_block_group_zone_info btrfs: zoned: do not load fs_info::zoned from incompat flag btrfs: zoned: allow zoned filesystems on non-zoned block devices btrfs: zoned: check if bio spans across an ordered extent btrfs: zoned: cache if block group is on a sequential zone btrfs: save irq flags when looking up an ordered extent Josef Bacik (34): btrfs: fix error handling in commit_fs_roots btrfs: allow error injection for btrfs_search_slot and btrfs_cow_block btrfs: noinline btrfs_should_cancel_balance btrfs: ref-verify: pass down tree block level when building refs btrfs: ref-verify: make sure owner is set for all refs btrfs: keep track of the root owner for relocation reads btrfs: do not cleanup upper nodes in btrfs_backref_cleanup_node btrfs: handle space_info::total_bytes_pinned inside the delayed ref itself btrfs: account for new extents being deleted in total_bytes_pinned btrfs: fix reloc root leak with 0 ref reloc roots on recovery btrfs: splice remaining dirty_bg's onto the transaction dirty bg list btrfs: do not warn if we can't find the reloc root when looking up backref btrfs: add asserts for deleting backref cache nodes btrfs: abort the transaction if we fail to inc ref in btrfs_copy_root btrfs: do not block on deleted bgs mutex in the cleaner btrfs: only let one thread pre-flush delayed refs in commit btrfs: delayed refs pre-flushing should only run the heads we have btrfs: only run delayed refs once before committing btrfs: move delayed ref flushing for qgroup into qgroup helper btrfs: remove bogus BUG_ON in alloc_reserved_tree_block btrfs: stop running all delayed refs during snapshot btrfs: run delayed refs less often in commit_cowonly_roots btrfs: make flush_space take a enum btrfs_flush_state instead of int btrfs: add a trace point for reserve tickets btrfs: track ordered bytes instead of just dio ordered bytes btrfs: introduce a FORCE_COMMIT_TRANS flush operation btrfs: improve preemptive background space flushing btrfs: rename need_do_async_reclaim btrfs: check reclaim_size in need_preemptive_reclaim btrfs: rework btrfs_calc_reclaim_metadata_size btrfs: simplify the logic in need_preemptive_flushing btrfs: implement space clamping for preemptive flushing btrfs: adjust the flush trace point to include the source btrfs: add a trace class for dumping the current ENOSPC state Michal Rostecki (1): btrfs: let callers of btrfs_get_io_geometry pass the em Naohiro Aota (36): iomap: support REQ_OP_ZONE_APPEND btrfs: zoned: defer loading zone info after opening trees btrfs: zoned: use regular super block location on zone emulation btrfs: zoned: disallow fitrim on zoned filesystems btrfs: zoned: implement zoned chunk allocator btrfs: zoned: verify device extent is aligned to zone btrfs: zoned: load zone's allocation offset btrfs: zoned: calculate allocation offset for conventional zones btrfs: zoned: track unusable bytes for zones btrfs: zoned: implement sequential extent allocation btrfs: zoned: redirty released extent buffers btrfs: zoned: advance allocation pointer after tree log node btrfs: zoned: reset zones of unused block groups btrfs: factor out helper adding a page to bio btrfs: zoned: use bio_add_zone_append_page btrfs: zoned: handle REQ_OP_ZONE_APPEND as writing btrfs: zoned: split ordered extent when bio is sent btrfs: extend btrfs_rmap_block for specifying a device btrfs: zoned: use ZONE_APPEND write for zoned mode btrfs: zoned: enable zone append writing for direct IO btrfs: zoned: introduce dedicated data write path for zoned filesystems btrfs: zoned: serialize metadata IO btrfs: zoned: wait for existing extents before truncating btrfs: zoned: do not use async metadata checksum on zoned filesystems btrfs: zoned: mark block groups to copy for device-replace btrfs: zoned: implement cloning for zoned device-replace btrfs: zoned: implement copying for zoned device-replace btrfs: zoned: support dev-replace in zoned filesystems btrfs: zoned: enable relocation on a zoned filesystem btrfs: zoned: relocate block group to repair IO failure in zoned filesystems btrfs: split alloc_log_tree() btrfs: zoned: extend zoned allocator to use dedicated tree-log block group btrfs: zoned: serialize log transaction on zoned filesystems btrfs: zoned: reorder log node allocation on zoned filesystem btrfs: zoned: deal with holes writing out tree-log pages btrfs: zoned: enable to mount ZONED incompat flag Nigel Christian (1): btrfs: remove repeated word in struct member comment Nikolay Borisov (24): btrfs: cleanup local variables in btrfs_file_write_iter btrfs: rename btrfs_find_highest_objectid to btrfs_init_root_free_objectid btrfs: rename btrfs_find_free_objectid to btrfs_get_free_objectid btrfs: rename btrfs_root::highest_objectid to free_objectid btrfs: make btrfs_root::free_objectid hold the next available objectid btrfs: remove new_dirid argument from btrfs_create_subvol_root btrfs: consolidate btrfs_previous_item ret val handling in btrfs_shrink_device btrfs: make btrfs_start_delalloc_root's nr argument a long btrfs: remove always true condition in btrfs_start_delalloc_roots btrfs: document modified parameter of add_extent_mapping btrfs: fix parameter description of btrfs_add_extent_mapping btrfs: fix function description formats in file-item.c btrfs: fix parameter description in delayed-ref.c functions btrfs: improve parameter description for __btrfs_write_out_cache btrfs: document now parameter of peek_discard_list btrfs: document fs_info in btrfs_rmap_block btrfs: fix description format of fs_info of btrfs_wait_on_delayed_iputs btrfs: document btrfs_check_shared parameters btrfs: fix parameter description of btrfs_inode_rsv_release/btrfs_delalloc_release_space btrfs: fix parameter description in space-info.c btrfs: fix parameter description for functions in extent_io.c btrfs: zoned: remove unused variable in btrfs_sb_log_location_bdev lib/zstd: convert constants to defines btrfs: enable W=1 checks for btrfs Qu Wenruo (27): btrfs: make btrfs_dio_private::bytes u32 btrfs: refactor btrfs_dec_test_* functions for ordered extents btrfs: rename parameter offset to disk_bytenr in submit_extent_page btrfs: refactor __extent_writepage_io() to improve readability btrfs: update comment for btrfs_dirty_pages btrfs: introduce helper to grab an existing extent buffer from a page btrfs: rework the order of btrfs_ordered_extent::flags btrfs: fix double accounting of ordered extent for subpage case in btrfs_invalidapge btrfs: merge PAGE_CLEAR_DIRTY and PAGE_SET_WRITEBACK to PAGE_START_WRITEBACK btrfs: set UNMAPPED bit early in btrfs_clone_extent_buffer() for subpage support btrfs: introduce the skeleton of btrfs_subpage structure btrfs: make attach_extent_buffer_page() handle subpage case btrfs: make grab_extent_buffer_from_page() handle subpage case btrfs: support subpage for extent buffer page release btrfs: attach private to dummy extent buffer pages btrfs: introduce helpers for subpage uptodate status btrfs: introduce helpers for subpage error status btrfs: support subpage in set/clear_extent_buffer_uptodate() btrfs: support subpage in btrfs_clone_extent_buffer btrfs: support subpage in try_release_extent_buffer() btrfs: introduce read_extent_buffer_subpage() btrfs: support subpage in endio_readpage_update_page_status() btrfs: introduce subpage metadata validation check btrfs: introduce btrfs_subpage for data inodes btrfs: integrate page status update for data read path into begin/end_page_read btrfs: allow read-only mount of 4K sector size fs on 64K page system btrfs: explain page locking and readahead in read_extent_buffer_pages() Roman Anasal (1): btrfs: send: use struct send_ctx *sctx for btrfs_compare_trees and changed_cb Yang Li (1): btrfs: remove redundant NULL check before kvfree Zhihao Cheng (1): btrfs: clarify error returns values in __load_free_space_cache block/bio.c | 33 ++ fs/btrfs/Makefile | 19 +- fs/btrfs/backref.c | 17 +- fs/btrfs/backref.h | 9 +- fs/btrfs/block-group.c | 178 +++++--- fs/btrfs/block-group.h | 21 +- fs/btrfs/btrfs_inode.h | 3 +- fs/btrfs/compression.c | 10 +- fs/btrfs/ctree.c | 9 +- fs/btrfs/ctree.h | 19 +- fs/btrfs/delalloc-space.c | 29 +- fs/btrfs/delayed-inode.c | 2 +- fs/btrfs/delayed-ref.c | 79 ++-- fs/btrfs/delayed-ref.h | 28 +- fs/btrfs/dev-replace.c | 186 +++++++- fs/btrfs/dev-replace.h | 3 + fs/btrfs/discard.c | 6 +- fs/btrfs/disk-io.c | 183 ++++++-- fs/btrfs/disk-io.h | 6 +- fs/btrfs/extent-tree.c | 361 ++++++++++------ fs/btrfs/extent_io.c | 791 +++++++++++++++++++++++++++------- fs/btrfs/extent_io.h | 17 +- fs/btrfs/extent_map.c | 18 +- fs/btrfs/file-item.c | 22 +- fs/btrfs/file.c | 58 ++- fs/btrfs/free-space-cache.c | 123 +++++- fs/btrfs/free-space-cache.h | 2 + fs/btrfs/inode.c | 336 ++++++++++++--- fs/btrfs/ioctl.c | 29 +- fs/btrfs/ordered-data.c | 224 +++++++--- fs/btrfs/ordered-data.h | 57 ++- fs/btrfs/raid56.c | 3 +- fs/btrfs/ref-verify.c | 43 +- fs/btrfs/reflink.c | 5 +- fs/btrfs/relocation.c | 99 ++++- fs/btrfs/scrub.c | 143 +++++++ fs/btrfs/send.c | 31 +- fs/btrfs/space-info.c | 365 ++++++++++++---- fs/btrfs/space-info.h | 25 +- fs/btrfs/subpage.c | 278 ++++++++++++ fs/btrfs/subpage.h | 91 ++++ fs/btrfs/super.c | 8 +- fs/btrfs/sysfs.c | 2 + fs/btrfs/tests/extent-map-tests.c | 2 +- fs/btrfs/transaction.c | 152 ++++--- fs/btrfs/transaction.h | 5 + fs/btrfs/tree-log.c | 288 ++++++------- fs/btrfs/volumes.c | 364 +++++++++++++--- fs/btrfs/volumes.h | 8 +- fs/btrfs/zoned.c | 873 +++++++++++++++++++++++++++++++++++++- fs/btrfs/zoned.h | 157 ++++++- fs/iomap/direct-io.c | 43 +- include/linux/bio.h | 2 + include/linux/iomap.h | 1 + include/linux/zstd.h | 8 +- include/trace/events/btrfs.h | 111 ++++- 56 files changed, 4874 insertions(+), 1111 deletions(-) create mode 100644 fs/btrfs/subpage.c create mode 100644 fs/btrfs/subpage.h