[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.10

Has two last minute fixes. The highest priority here is a regression fix
for the decompression code, but we also fixed up a problem with the 32 bit
compat ioctls.

The decompression bug could hand back the wrong data on big reads when
zlib was used. I have a larger cleanup to make the math here less error
prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
    Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
    btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

 fs/btrfs/compression.c | 39
 fs/btrfs/ioctl.c       |  6
 2 files changed, 28 insertions(+), 17 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing
corner cases.

David Sterba (46) commits (+235/-313):
    btrfs: remove unused parameter from btrfs_subvolume_release_metadata (+6/-11)
    btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
    btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
    btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
    btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
    btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
    btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
    btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
    btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
    btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
    btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
    btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
    btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
    btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
    btrfs: remove unused parameter from read_block_for_search (+5/-5)
    btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
    btrfs: remove unused parameter from init_first_rw_device (+3/-5)
    btrfs: make space cache inode readahead failure nonfatal (+3/-7)
    btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
    btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
    btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
    btrfs: remove unused parameter from submit_extent_page (+3/-9)
    btrfs: remove unused parameter from clean_tree_block (+17/-19)
    btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
    btrfs: remove unused parameter from __add_inline_refs (+2/-3)
    btrfs: remove unused parameter from add_pending_csums (+2/-4)
    btrfs: remove unused parameter from update_nr_written (+4/-4)
    btrfs: remove unused parameter from __push_leaf_right (+2/-3)
    btrfs: remove unused parameter from check_async_write (+2/-2)
    btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
    btrfs: remove unused parameter from __push_leaf_left (+2/-3)
    btrfs: remove unused parameter from write_dev_supers (+3/-3)
    btrfs: remove unused parameter from __add_inode_ref (+1/-2)
    btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
    btrfs: remove unused parameter from create_snapshot (+2/-2)
    btrfs: ulist: make the finalization function public (+2/-1)
    btrfs: remove unused parameter from tree_move_down (+2/-2)
    btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
    btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
    btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
    btrfs: remove unused parameter from split_item (+2/-3)
    btrfs: merge two superblock writing helpers (+4/-11)
    btrfs: qgroups: opencode qgroup_free helper (+9/-9)
    btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
    btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
    btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
    btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
    btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
    btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
    btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
    btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
    btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
    btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
    btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
    btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
    btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
    btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
    btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
    btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
    btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
    btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
    btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
    btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
    btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
    btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
    btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.8 branch has some fixes for btrfs send/recv and fsync from
Filipe and Robbie Ko:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8

Bonus points to Filipe for already having xfstests in place for many of
these.

Filipe Manana (8) commits (+172/-52):
    Btrfs: improve performance on fsync against new inode after rename/unlink (+95/-9)
    Btrfs: send, avoid incorrect leaf accesses when sending utimes operations (+2/-0)
    Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve() (+0/-30)
    Btrfs: be more precise on errors when getting an inode from disk (+18/-9)
    Btrfs: incremental send, fix invalid paths for rename operations (+2/-1)
    Btrfs: send, add missing error check for calls to path_loop() (+2/-0)
    Btrfs: add missing check for writeback errors on fsync (+8/-0)
    Btrfs: send, don't bug on inconsistent snapshots (+45/-3)

Robbie Ko (4) commits (+111/-7):
    Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations (+11/-1)
    Btrfs: send, fix warning due to late freeing of orphan_dir_info structures (+4/-0)
    Btrfs: send, fix failure to move directories with the same name around (+95/-5)
    Btrfs: incremental send, fix premature rmdir operations (+1/-1)

Total: (12) commits (+283/-59)

 fs/btrfs/delayed-ref.c |  27
 fs/btrfs/delayed-ref.h |   3
 fs/btrfs/file.c        |   8
 fs/btrfs/inode.c       |  46
 fs/btrfs/send.c        | 173
 fs/btrfs/tree-log.c    |  85
 6 files changed, 283 insertions(+), 59 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

Please pull my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8

We've queued up a few different fixes in here. These range from enospc
corners to fsync and quota fixes, and a few targeted at error handling for
corrupt metadata/fuzzing.

Liu Bo (5) commits (+60/-2):
    Btrfs: detect corruption when non-root leaf has zero item (+22/-1)
    Btrfs: add ASSERT for block group's memory leak (+5/-0)
    Btrfs: clarify do_chunk_alloc()'s return value (+9/-0)
    Btrfs: fix memory leak of reloc_root (+8/-1)
    Btrfs: check btree node's nritems (+16/-0)

Qu Wenruo (4) commits (+191/-53):
    btrfs: relocation: Fix leaking qgroups numbers on data extents (+103/-6)
    btrfs: qgroup: Fix qgroup incorrectness caused by log replay (+16/-0)
    btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() (+71/-47)
    btrfs: backref: Fix soft lockup in __merge_refs function (+1/-0)

Wang Xiaoguang (4) commits (+161/-108):
    btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() (+6/-4)
    btrfs: divide btrfs_update_reserved_bytes() into two functions (+57/-40)
    btrfs: update btrfs_space_info's bytes_may_use timely (+73/-63)
    btrfs: fix fsfreeze hang caused by delayed iputs deal (+25/-1)

Jeff Mahoney (3) commits (+45/-18):
    btrfs: don't create or leak aliased root while cleaning up orphans (+22/-11)
    btrfs: waiting on qgroup rescan should not always be interruptible (+13/-6)
    btrfs: properly track when rescan worker is running (+10/-1)

Filipe Manana (1) commits (+8/-4):
    Btrfs: fix lockdep warning on deadlock against an inode's log mutex

Anand Jain (1) commits (+19/-8):
    btrfs: do not background blkdev_put()

Alex Lyakas (1) commits (+1/-1):
    btrfs: flush_space: treat return value of do_chunk_alloc properly

Josef Bacik (1) commits (+1/-0):
    Btrfs: fix em leak in find_first_block_group

Total: (20) commits

 fs/btrfs/backref.c     |   1
 fs/btrfs/ctree.h       |   5
 fs/btrfs/delayed-ref.c |   7
 fs/btrfs/disk-io.c     |  56
 fs/btrfs/disk-io.h     |   2
 fs/btrfs/extent-tree.c | 185
 fs/btrfs/extent_io.h   |   1
 fs/btrfs/file.c        |  28
 fs/btrfs/inode-map.c   |   3
 fs/btrfs/inode.c       |  37
 fs/btrfs/ioctl.c       |   2
 fs/btrfs/qgroup.c      |  62
 fs/btrfs/qgroup.h      |  36
 fs/btrfs/relocation.c  | 126
 fs/btrfs/root-tree.c   |  27
 fs/btrfs/super.c       |  16
 fs/btrfs/transaction.c |   7
 fs/btrfs/tree-log.c    |  21
 fs/btrfs/tree-log.h    |   5
 fs/btrfs/volumes.c     |  27
 20 files changed, 473 insertions(+), 181 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.7 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7

Has some fixes and some new self tests for btrfs. The self tests are
usually disabled in the .config file (unless you're doing btrfs dev work),
and this bunch is meant to find problems with the 64K page size patches.

Jeff has a patch to help people see if they are using the hardware assist
crc32c module, which really helps us nail down problems when people ask
why crcs are using so much CPU.

Otherwise, it's small fixes.

Feifei Xu (8) commits (+475/-361):
    Btrfs: test_check_exists: Fix infinite loop when searching for free space entries (+2/-2)
    Btrfs: self-tests: Execute page straddling test only when nodesize < PAGE_SIZE (+30/-19)
    Btrfs: self-tests: Use macros instead of constants and add missing newline (+31/-18)
    Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes (+32/-22)
    Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system (+11/-1)
    Btrfs: Fix integer overflow when calculating bytes_per_bitmap (+7/-7)
    Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize (+7/-1)
    Btrfs: self-tests: Support non-4k page size (+355/-291)

Liu Bo (3) commits (+104/-15):
    Btrfs: clear uptodate flags of pages in sys_array eb (+2/-0)
    Btrfs: add validadtion checks for chunk loading (+67/-15)
    Btrfs: add more validation checks for superblock (+35/-0)

Josef Bacik (1) commits (+1/-0):
    Btrfs: end transaction if we abort when creating uuid root

Jeff Mahoney (1) commits (+9/-2):
    btrfs: advertise which crc32c implementation is being used at module load

Vinson Lee (1) commits (+1/-1):
    btrfs: Use __u64 in exported linux/btrfs.h.

Total: (14) commits (+590/-379)

 fs/btrfs/ctree.c                       |   6
 fs/btrfs/disk-io.c                     |  20
 fs/btrfs/disk-io.h                     |   2
 fs/btrfs/extent_io.c                   |  10
 fs/btrfs/extent_io.h                   |   4
 fs/btrfs/free-space-cache.c            |  18
 fs/btrfs/hash.c                        |   5
 fs/btrfs/hash.h                        |   1
 fs/btrfs/super.c                       |  57
 fs/btrfs/tests/btrfs-tests.c           |   6
 fs/btrfs/tests/btrfs-tests.h           |  27
 fs/btrfs/tests/extent-buffer-tests.c   |  13
 fs/btrfs/tests/extent-io-tests.c       |  86
 fs/btrfs/tests/free-space-tests.c      |  76
 fs/btrfs/tests/free-space-tree-tests.c |  30
 fs/btrfs/tests/inode-tests.c           | 344
 fs/btrfs/tests/qgroup-tests.c          | 111
 fs/btrfs/volumes.c                     | 109
 include/uapi/linux/btrfs.h             |   2
 19 files changed, 569 insertions(+), 358 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.7 branch has some fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7

I realized as I was prepping this pull that my tip commit still had
Facebook task numbers and other internal metadata in it. So I had to
reword the description, which is why it is only a few hours old. Only the
description changed since testing.

The important part of this pull is Filipe's set of fixes for btrfs device
replacement. Filipe fixed a few issues seen on the list and a number he
found on his own.

Filipe Manana (8) commits (+93/-19):
    Btrfs: fix race setting block group back to RW mode during device replace (+5/-5)
    Btrfs: fix unprotected assignment of the left cursor for device replace (+4/-0)
    Btrfs: fix race setting block group readonly during device replace (+46/-2)
    Btrfs: fix race between device replace and block group removal (+11/-0)
    Btrfs: fix race between device replace and chunk allocation (+9/-12)
    Btrfs: fix race between readahead and device replace/removal (+2/-0)
    Btrfs: fix race between device replace and read repair (+10/-0)
    Btrfs: fix race between device replace and discard (+6/-0)

Chris Mason (1) commits (+12/-1):
    Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent

Total: (9) commits (+105/-20)

 fs/btrfs/extent-tree.c  |  6
 fs/btrfs/extent_io.c    | 10
 fs/btrfs/inode.c        | 13
 fs/btrfs/ordered-data.c |  6
 fs/btrfs/ordered-data.h |  2
 fs/btrfs/reada.c        |  2
 fs/btrfs/scrub.c        | 50
 fs/btrfs/volumes.c      | 32
 8 files changed, 103 insertions(+), 18 deletions(-)
[GIT PULL 1/2] Btrfs
Hi Linus,

I have a two part pull this time because one of the patches Dave Sterba
collected needed to be against v4.7-rc2 or higher (we used rc4). I try to
make my for-linus-xx branch testable on top of the last major so we can
hand fixes to people on the list more easily, so I've split this pull in
two.

My for-linus-4.7 branch has some fixes and two performance improvements
that we've been testing for some time.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7

Josef's two performance fixes are most notable. The transid tracking
patch makes a big improvement on pretty much every workload.

Josef Bacik (2) commits (+38/-27):
    Btrfs: don't do nocow check unless we have to (+22/-22)
    Btrfs: track transid for delayed ref flushing (+16/-5)

Liu Bo (1) commits (+11/-2):
    Btrfs: fix error handling in map_private_extent_buffer

Chris Mason (1) commits (+11/-9):
    btrfs: fix deadlock in delayed_ref_async_start

Wei Yongjun (1) commits (+1/-1):
    Btrfs: fix error return code in btrfs_init_test_fs()

Chandan Rajendra (1) commits (+4/-6):
    Btrfs: Force stripesize to the value of sectorsize

Wang Xiaoguang (1) commits (+2/-1):
    btrfs: fix disk_i_size update bug when fallocate() fails

Total: (7) commits (+67/-46)

 fs/btrfs/ctree.c             |  6
 fs/btrfs/ctree.h             |  2
 fs/btrfs/disk-io.c           |  6
 fs/btrfs/extent-tree.c       | 15
 fs/btrfs/extent_io.c         |  7
 fs/btrfs/file.c              | 44
 fs/btrfs/inode.c             |  1
 fs/btrfs/ordered-data.c      |  3
 fs/btrfs/tests/btrfs-tests.c |  2
 fs/btrfs/transaction.c       |  3
 fs/btrfs/volumes.c           |  4
 11 files changed, 57 insertions(+), 36 deletions(-)
[GIT PULL 2/2] Btrfs
Hi Linus,

Btrfs part two was supposed to be a single patch on top of v4.7-rc4.
Somehow I didn't notice that my part2 branch repeated a few of the patches
in part 1 when I set it up earlier this week. Cherry-picking gone wrong as
I folded a fix into Dave Sterba's original integration.

I've been testing the git-merged result of part1, part2 and your master
for a while, but I just rebased part2 so it didn't include any duplicates.
I ran git diff to verify the merged result of today's pull is exactly the
same as the one I've been testing.

My for-linus-4.7-part2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.7-part2

Has one patch from Omar to bring iterate_shared back to btrfs. We have a
tree of work we queue up for directory items and it doesn't lend itself
well to shared access. While we're cleaning it up, Omar has changed things
to use an exclusive lock when there are delayed items.

Omar Sandoval (1) commits (+34/-13):
    Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

Total: (1) commits (+34/-13)

 fs/btrfs/delayed-inode.c | 27
 fs/btrfs/delayed-inode.h | 10
 fs/btrfs/inode.c         | 10
 3 files changed, 34 insertions(+), 13 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

We've got a fix in my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.5

Filipe nailed down a problem where tree log replay would do some work that
orphan code wasn't expecting to be done yet, leading to BUG_ON.

Filipe Manana (1) commits (+9/-1):
    Btrfs: fix loading of orphan roots leading to BUG_ON

Total: (1) commits (+9/-1)

 fs/btrfs/root-tree.c | 10
 1 file changed, 9 insertions(+), 1 deletion(-)
Re: [GIT PULL] Btrfs
On Mon, Mar 21, 2016 at 06:16:54PM -0700, Linus Torvalds wrote:
> On Mon, Mar 21, 2016 at 5:24 PM, Chris Mason wrote:
> >
> > I waited an extra day to send this one out because I hit a crash late
> > last week with CONFIG_DEBUG_PAGEALLOC enabled (fixed in the top commit).
>
> Hmm. If that commit helps, it will spit out a warning.
>
> So is it actually fixed, or just hacked around to the point where you
> don't get a page fault?
>
> That WARN_ON_ONCE kind of implies it's a "this happens, but we don't know
> why".

Hi Linus,

    while (bio_index < bio->bi_vcnt) {
        count = find some crcs
        ...
        while (count--) {
            ...
            page_bytes_left -= root->sectorsize;
            if (!page_bytes_left) {
                bio_index++;
                /*
                 * make sure we're still inside the
                 * bio before we update page_bytes_left
                 */
                if (bio_index >= bio->bi_vcnt) {
                    WARN_ON_ONCE(count);
                    goto done;
                }
                bvec++;
                page_bytes_left = bvec->bv_len;
                ^ this was the line that crashed before
            }
        }
    }
    done:
        cleanup;
        return;

What should be happening here is we'll goto done when count is zero and
we've walked past the end of the bio. IOW, both the outer and inner loops
are doing the right tests and the right math, but the inner loop is
improperly accessing a bogus bvec->bv_len because it didn't realize the
outer loop was now completely done.

I don't see a way for it to happen when count != 0, and I ran xfstests on
a few machines to try and triple check that. If there are new bugs hiding
here, we'll have EIOs returned up to userland because this function didn't
properly fetch the crcs. If anyone reported the EIOs, they would send in
the WARN_ON output too, so we'd know right away not to blame their
hardware.

I also ran for days with heavy read/write loads without seeing the crc
errors. I didn't have the WARN_ON, or CONFIG_DEBUG_PAGEALLOC on that box,
but if other things were wrong, we'd have done a lot worse than poke into
bvec->bv_len, and the crc errors would have stopped the test.

-chris
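
For readers who want to poke at the bounds check outside the kernel, here is a minimal userspace sketch of the same walk. It is not the btrfs code: struct seg and walk_segments() are hypothetical stand-ins for the bvec array and the inner loop, and it assumes every segment length is a nonzero multiple of the sector size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio_vec: only the length matters here. */
struct seg {
    unsigned int len;
};

/*
 * Consume `count` sectors from `nsegs` segments, `sectorsize` bytes at a
 * time. Returns the number of sectors that were still wanted when the walk
 * ran off the end of the array. 0 is the expected result; a nonzero value
 * is the case the WARN_ON_ONCE(count) in the mail is there to flag.
 */
static unsigned int walk_segments(const struct seg *segs, size_t nsegs,
                                  unsigned int sectorsize, unsigned int count)
{
    size_t idx = 0;
    unsigned int bytes_left;

    /* Assumed preconditions of this sketch, not taken from the kernel. */
    assert(nsegs > 0 && sectorsize > 0);
    assert(segs[0].len > 0 && segs[0].len % sectorsize == 0);
    bytes_left = segs[0].len;

    while (count--) {
        bytes_left -= sectorsize;
        if (!bytes_left) {
            idx++;
            /* Bounds check first: segs[idx] may not exist. */
            if (idx >= nsegs)
                return count;
            assert(segs[idx].len > 0 &&
                   segs[idx].len % sectorsize == 0);
            bytes_left = segs[idx].len;
        }
    }
    return 0;
}
```

With two segments of 4096 and 8192 bytes and a 4096 byte sector, asking for exactly three sectors ends the walk on the bounds check with nothing left over, which is the "goto done when count is zero" case. Asking for four leaves one sector unconsumed, which is the situation the warning would report.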
Re: [GIT PULL] Btrfs
On Mon, Mar 21, 2016 at 10:15:33PM -0400, Chris Mason wrote:
> On Mon, Mar 21, 2016 at 06:16:54PM -0700, Linus Torvalds wrote:
> > On Mon, Mar 21, 2016 at 5:24 PM, Chris Mason wrote:
> > >
> > > I waited an extra day to send this one out because I hit a crash late
> > > last week with CONFIG_DEBUG_PAGEALLOC enabled (fixed in the top commit).
> >
> > Hmm. If that commit helps, it will spit out a warning.
> >
> > So is it actually fixed, or just hacked around to the point where you
> > don't get a page fault?

Hmmm, rereading my answer I realized I didn't actually answer. I really
think this is fixed. I left the warning only because I originally expected
something much more exotic.

-chris
Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
On Thu, Mar 17, 2016 at 02:49:06PM -0600, Andreas Dilger wrote:
> On Mar 17, 2016, at 12:35 PM, Chris Mason wrote:
> >
> > On Thu, Mar 17, 2016 at 10:47:29AM -0700, Linus Torvalds wrote:
> >> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum wrote:
> >>>
> >>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
> >>> it was one of the problems Sage had using xfs in his BlueStore
> >>> implementation and was a big part of why it moved to pure userspace.
> >>> FileStore might use NO_HIDE_STALE in some places but it would be
> >>> pretty limited. When it came up at Linux FAST we were discussing how
> >>> it and similar things had been problems for us in the past and it
> >>> would've been nice if they were upstream.
> >>
> >> Hmm.
> >>
> >> So to me it really sounds like somebody should cook up a patch, but we
> >> shouldn't put it in the upstream kernel until we get numbers and
> >> actual "yes, we'd use this" from outside of google.
> >
> > We haven't had internal tiers yelling at us for fallocate performance,
> > so I'm unlikely to suggest it, just because it's a potential
> > privacy leak we'd have to educate people about. What I'd be more likely
> > to use is code inside the filesystem like this:
> >
> > somefs_fallocate() {
> >     if (trim_can_really_zero(my_device)) {
> >         trim
> >         allocate a regular extent
> >         return
> >     } else {
> >         do normal fallocate
> >     }
> > }
>
> We were discussing almost this very same thing in the ext4 concall today.
>
> Ted initially didn't think it was worthwhile to implement, but after looking
> at the whitelist for SATA SSDs it seems that there are enough devices on the
> market that support the ATA_HORKAGE_ZERO_AFTER_TRIM to make this approach
> worthwhile to implement.

We'll end up with people complaining it makes fallocate slower because of
the trims, so it's not a perfect solution. But I much prefer it to
fallocate-stale.

>
> Also, if the ext4 extent size was limited it might even be possible to do
> this efficiently enough with write_same on HDD devices.
>
> > Then the out of tree patch (for google or whoever) becomes a hack to
> > flip trim_can_really_zero on a given block device. The rest of us can
> > use explicit interfaces from the hardware when deciding what we want
> > preallocation to mean.
>
> This might be a bit trickier, since this would affect all zero/trim
> operations, not just ones for uninitialized data extents.

Thinking more, my guess is that google will just keep doing what they are
already doing ;) But there could be a flag in sysfs dedicated to
trim-for-fallocate so admins can see what their devices are reporting.
readonly in mainline, if someone wants to patch it in their large data
center it wouldn't be hard.

-chris
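
The somefs_fallocate() pseudocode quoted above boils down to one decision per device. A minimal sketch of that decision in plain C follows, purely as an illustration: struct blkdev_caps, enum prealloc_mode and pick_prealloc_mode() are hypothetical names, not kernel interfaces, and "normal fallocate" is assumed here to mean the usual unwritten-extent preallocation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-device capability; not a real kernel structure. */
struct blkdev_caps {
    bool trim_zeroes;   /* device guarantees reads of trimmed blocks return zeros */
};

enum prealloc_mode {
    PREALLOC_TRIM_THEN_REGULAR, /* trim the range, then allocate a regular extent */
    PREALLOC_UNWRITTEN          /* assumed "normal fallocate": unwritten extent */
};

/* Pick how preallocation should behave for a device, as in the pseudocode. */
static enum prealloc_mode pick_prealloc_mode(const struct blkdev_caps *caps)
{
    assert(caps != NULL);
    if (caps->trim_zeroes)
        return PREALLOC_TRIM_THEN_REGULAR;
    return PREALLOC_UNWRITTEN;
}
```

In this picture the out-of-tree hack mentioned in the mail would amount to forcing trim_zeroes on for a given device, and a read-only sysfs flag would simply expose its value.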
[GIT PULL] Btrfs
    (+5/-1)
    Btrfs: change how we update the global block rsv (+20/-14)
    Btrfs: fix truncate_space_check (+10/-1)

Qu Wenruo (3) commits (+68/-19):
    btrfs: Introduce new mount option usebackuproot to replace recovery (+25/-11)
    btrfs: Introduce new mount option to disable tree log replay (+40/-7)
    btrfs: Introduce new mount option alias for nologreplay (+3/-1)

Byongho Lee (2) commits (+1/-6):
    btrfs: simplify expression in btrfs_calc_trans_metadata_size() (+1/-2)
    btrfs: remove redundant error check (+0/-4)

Anand Jain (2) commits (+22/-10):
    btrfs: rename btrfs_print_info to btrfs_print_mod_info (+2/-2)
    btrfs: move btrfs_compression_type to compression.h (+20/-8)

Kinglong Mee (2) commits (+18/-40):
    btrfs: fix memory leak of fs_info in block group cache (+1/-6)
    btrfs: drop null testing before destroy functions (+17/-34)

Deepa Dinamani (1) commits (+26/-22):
    btrfs: Replace CURRENT_TIME by current_fs_time()

Arnd Bergmann (1) commits (+2/-2):
    btrfs: avoid uninitialized variable warning

Dave Jones (1) commits (+3/-6):
    btrfs: remove open-coded swap() in backref.c:__merge_refs

Liu Bo (1) commits (+105/-84):
    Btrfs: fix lockdep deadlock warning due to dev_replace

Adam Buchbinder (1) commits (+16/-16):
    btrfs: Fix misspellings in comments.

Satoru Takeuchi (1) commits (+3/-0):
    Btrfs: Show a warning message if one of objectid reaches its highest value

Ashish Samant (1) commits (+6/-1):
    btrfs: Print Warning only if ENOSPC_DEBUG is enabled

Sudip Mukherjee (1) commits (+1/-1):
    btrfs: fix build warning

Chris Mason (1) commits (+10/-0):
    btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums

Rasmus Villemoes (1) commits (+3/-6):
    btrfs: use kbasename in btrfsic_mount

Dan Carpenter (1) commits (+1/-1):
    btrfs: scrub: silence an uninitialized variable warning

Total: (82) commits (+1142/-970)

 Documentation/filesystems/btrfs.txt    | 261
 fs/btrfs/backref.c                     |  12
 fs/btrfs/check-integrity.c             |  12
 fs/btrfs/compression.h                 |   9
 fs/btrfs/ctree.c                       |  36
 fs/btrfs/ctree.h                       |  87
 fs/btrfs/delayed-inode.c               |  10
 fs/btrfs/delayed-ref.c                 |  12
 fs/btrfs/dev-replace.c                 | 134
 fs/btrfs/dev-replace.h                 |   7
 fs/btrfs/disk-io.c                     |  71
 fs/btrfs/extent-tree.c                 |  40
 fs/btrfs/extent_io.c                   |  40
 fs/btrfs/extent_io.h                   |   5
 fs/btrfs/extent_map.c                  |   8
 fs/btrfs/file-item.c                   | 103
 fs/btrfs/file.c                        | 158
 fs/btrfs/inode-map.c                   |   3
 fs/btrfs/inode.c                       | 326
 fs/btrfs/ioctl.c                       |  35
 fs/btrfs/ordered-data.c                |   6
 fs/btrfs/print-tree.c                  |  23
 fs/btrfs/props.c                       |   1
 fs/btrfs/reada.c                       | 268
 fs/btrfs/root-tree.c                   |   2
 fs/btrfs/scrub.c                       |  32
 fs/btrfs/send.c                        |  37
 fs/btrfs/super.c                       |  52
 fs/btrfs/tests/btrfs-tests.c           |   6
 fs/btrfs/tests/free-space-tree-tests.c |   1
 fs/btrfs/tests/inode-tests.c           |   1
 fs/btrfs/transaction.c                 |  13
 fs/btrfs/tree-log.c                    | 102
 fs/btrfs/tree-log.h                    |   2
 fs/btrfs/volumes.c                     |  51
 fs/btrfs/xattr.c                       |  67
 36 files changed, 1102 insertions(+), 931 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.5 branch has a btrfs DIO error passing fix. I know how much
you love DIO, so I'm going to suggest against reading it. We'll follow up
with a patch to drop the error arg from dio_end_io in the next merge
window.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.5

Filipe Manana (1) commits (+2/-0):
    Btrfs: fix direct IO requests not reporting IO error to user space

Total: (1) commits (+2/-0)

 fs/btrfs/inode.c | 2
 1 file changed, 2 insertions(+)
[GIT PULL] Btrfs
Hi Linus,

Please pull my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.5

This has a few fixes from Filipe, along with a readdir fix from Dave that
we've been testing for some time.

Filipe Manana (4) commits (+115/-68):
    Btrfs: remove no longer used function extent_read_full_page_nolock() (+12/-42)
    Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl (+6/-4)
    Btrfs: fix page reading in extent_same ioctl leading to csum errors (+21/-8)
    Btrfs: fix invalid page accesses in extent_same (dedup) ioctl (+76/-14)

David Sterba (1) commits (+16/-3):
    btrfs: properly set the termination value of ctx->pos in readdir

Total: (5) commits (+131/-71)

 fs/btrfs/backref.c       |  10
 fs/btrfs/compression.c   |   6
 fs/btrfs/delayed-inode.c |   3
 fs/btrfs/delayed-inode.h |   2
 fs/btrfs/extent_io.c     |  45
 fs/btrfs/extent_io.h     |   3
 fs/btrfs/inode.c         |  14
 fs/btrfs/ioctl.c         | 119
 8 files changed, 131 insertions(+), 71 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.3 branch has a few fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.3

This is an assorted set I've been queuing up:

Jeff Mahoney tracked down a tricky one where we ended up starting IO on
the wrong mapping for special files in btrfs_evict_inode. A few people
reported this one on the list.

Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion. Chandan killed off an accounting bug during DIO that
led to WARN_ONs as we freed inodes.

Filipe Manana (3) commits (+58/-16):
    Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock (+0/-4)
    Btrfs: don't initialize a space info as full to prevent ENOSPC (+1/-4)
    Btrfs: fix read corruption of compressed and shared extents (+57/-8)

Josef Bacik (1) commits (+37/-2):
    Btrfs: keep dropped roots in cache until transaction commit

Jeff Mahoney (1) commits (+2/-1):
    btrfs: skip waiting on ordered range for special files

chandan (1) commits (+21/-23):
    Btrfs: Direct I/O: Fix space accounting

Total: (6) commits (+118/-42)

 fs/btrfs/btrfs_inode.h |  2
 fs/btrfs/disk-io.c     |  2
 fs/btrfs/extent-tree.c |  7
 fs/btrfs/extent_io.c   | 65
 fs/btrfs/inode.c       | 45
 fs/btrfs/super.c       |  2
 fs/btrfs/transaction.c | 32
 fs/btrfs/transaction.h |  5
 8 files changed, 118 insertions(+), 42 deletions(-)
Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
>
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original. I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > >
> > > > Thanks for the data Chris. So I guess we are fine with what's currently
> > > > in, right?
> > >
> > > Looks like it works well to me.
> >
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
>
> I ran some tests in the background while reading other email.
>
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
>
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
>
> http://permalink.gmane.org/gmane.linux.kernel/1768786
>
> The test:
>
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> # ./fs_mark -D 1 -S0 -n 1 -s 4096 -L 120 -d
> /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3
> -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> # Directories: Time based hash between directories across 1
> subdirectories with 180 seconds per subdirectory.
> # File names: 40 bytes long, (16 initial bytes of time stamp with 24
> random bytes at end of name)
> # Files info: size 4096 bytes, written with an IO size of 16384 bytes
> per write
> # App overhead is time in microseconds spent in the test not doing file
> writing related system calls.
>
> FSUse%        Count         Size    Files/sec     App Overhead
>      0            8         4096     106938.0           543310
>      0           16         4096     102922.7           476362
>      0           24         4096     107182.9           538206
>      0           32         4096     107871.7           619821
>      0           40         4096      99255.6           622021
>      0           48         4096     103217.8           609943
>      0           56         4096      96544.2           640988
>      0           64         4096     100347.3           676237
>      0           72         4096      87534.8           483495
>      0           80         4096      72577.5          2556920
>      0           88         4096      97569.0           646996
>
>

I think too many variables have changed here.

My numbers:

FSUse%        Count         Size    Files/sec     App Overhead
     0           16         4096     356407.1          1458461
     0           32         4096     368755.1          1030047
     0           48         4096     358736.8           992123
     0           64         4096     361912.5          1009566
     0           80         4096     342851.4          1004152
     0           96         4096     358357.2           996014
     0          112         4096     338025.8          1004412
     0          128         4096
Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > >
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > # ./fs_mark -D 1 -S0 -n 1 -s 4096 -L 120 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/2 -d /mnt/scratch/3 -d /mnt/scratch/4 -d /mnt/scratch/5 -d /mnt/scratch/6 -d /mnt/scratch/7
> > > # Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > # Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > # Directories: Time based hash between directories across 1 subdirectories with 180 seconds per subdirectory.
> > > # File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> > > # Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> > > # App overhead is time in microseconds spent in the test not doing file writing related system calls.
> > >
> > > FSUse%        Count         Size    Files/sec     App Overhead
> > >      0            8         4096     106938.0           543310
> > >      0           16         4096     102922.7           476362
> > >      0           24         4096     107182.9           538206
> > >      0           32         4096     107871.7           619821
> > >      0           40         4096      99255.6           622021
> > >      0           48         4096     103217.8           609943
> > >      0           56         4096      96544.2           640988
> > >      0           64         4096     100347.3           676237
> > >      0           72         4096      87534.8           483495
> > >      0           80         4096      72577.5          2556920
> > >      0           88         4096      97569.0           646996
> > >
> >
> > I think too many variables have changed here.
> >
> > My numbers:
> >
> > FSUse%        Count         Size    Files/sec     App Overhead
> >      0           16         4096     356407.1          1458461
> >      0           32         4096     368755.1          1030047
> >      0           48         4096     358736.8           992123
> >      0           64         4096     361912.5          1009566
> >      0           80         4096     342851.4          1004152
> >
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
>
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory. i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound

It does end up IO bound on my rig, just because we do eventually hit the
dirty limits. Otherwise there would be zero benefits in fs_mark from
any patches vs plain v4.2

But I setup a run last night with a dirty_ratio_bytes at 3G and
dirty_background_ratio_bytes at 1.5G.

There is definitely variation, but nothing like what you saw:

FSUse%        Count         Size    Files/sec     App Overhead
     0           16         4096     317427.9          1524951
     0           32         4096     319723.9          1023874
     0           48         4096     336696.4          1053884
     0           64         4096     257113.1          1190851
     0           80         4096     257644.2          1198054
     0           96         4096     254896.6          1225610
     0          112         4096     241052.6          1203227
     0          128         4096     214961.2          1386236
     0          144         4096     239985.7          1264659
     0          160         4096     232174.3          1310018
     0          176         4096     250477.9          1227289
     0          192         4096     221500.9          1276223
     0          208         4096     235212.1          1284989
     0          224         4096     238580.2          1257260
     0          240         4096     224182.6          1326821
     0          256         4096     234628.7          1236402
     0          272         4096     244675.3          1228400
     0          288         4096     234364.0          1268408
     0
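For anyone wanting to try similar thresholds, the byte values can be worked out as below. This is only a sketch, not the commands used for the run above: it assumes the knobs meant are the vm.dirty_bytes and vm.dirty_background_bytes sysctls and that "G" means GiB, and the helper name is made up for illustration.

```python
GIB = 1024 ** 3


def dirty_sysctl_settings(dirty_gib, background_gib):
    """Return sysctl-style 'key = value' lines for byte-based dirty limits.

    Assumption: the thresholds map to vm.dirty_bytes (hard limit) and
    vm.dirty_background_bytes (where background writeback starts).
    """
    dirty = int(dirty_gib * GIB)
    background = int(background_gib * GIB)
    # Sanity check of this helper only: the background threshold is
    # expected to sit below the hard limit.
    if background >= dirty:
        raise ValueError("background threshold should be below the hard limit")
    return [
        f"vm.dirty_background_bytes = {background}",
        f"vm.dirty_bytes = {dirty}",
    ]


if __name__ == "__main__":
    for line in dirty_sysctl_settings(3, 1.5):
        print(line)
```

The printed lines could be pasted into a sysctl configuration file or applied one at a time with sysctl -w as root.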
Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner wrote:
> >>
> >> Dave, if you're testing my current -git, the other performance issue
> >> might still be the spinlock thing.
> >
> > I have the fix as the first commit in my local tree - it'll remain
> > there until I get a conflict after an update. :)
>
> Ok. I'm happy to report that you should get a conflict now, and that
> the spinlock code should work well for your virtualized case again.
>
> No updates on the plugging thing yet, I'll wait a bit and follow this
> thread and see if somebody comes up with any explanations or theories
> in the hope that we might not need to revert (or at least have a more
> targeted change).

Playing around with the plug a little, most of the unplugs are coming
from the cond_resched_lock(). Not really sure why we are doing the
cond_resched() there, we should be doing it before we retake the lock
instead.

This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
files/sec up to 213K. Average IO size is the same as 4.3-rc1.

It probably won't help Dave, since most of his unplugs should have been
from the cond_resched_lock() too.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work. But we do want to
+			 * reschedule. So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+
 		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock(). Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K. Average IO size is the same as 4.3-rc1.
>
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
>
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
>
> I wonder if there is something more serious wrong with the kblockd_workqueue.

I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound. So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).

They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times. There are no reads at
all, so all of these waits are for buffered writes.

People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it. We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.

I know it's painfully hand wavy. I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me. I'll
feel better about the patch if Dave confirms any gains.

-chris
Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
>
> Yup.

Huh, that does make much more sense, thanks Linus. I'm wondering where
else I've assumed that cond_resched() unplugged.

>
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.

Ok, I'll fix up the description and comments and send out.

-chris
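The behaviour discussed in this thread can be pictured with a toy model. This is not kernel code: ToyPlug, write_inodes and their parameters are invented names, and the only thing modelled is the point made above, that a plain reschedule leaves queued requests sitting in the plug unless the task flushes them first.

```python
class ToyPlug:
    """Stand-in for a per-task block plug: requests queue up until flushed."""

    def __init__(self):
        self.pending = []  # queued in the plug, not yet sent to the device
        self.issued = []   # handed to the device

    def submit(self, req):
        self.pending.append(req)

    def flush(self):
        self.issued.extend(self.pending)
        self.pending.clear()


def write_inodes(inodes, resched_at, flush_before_resched):
    """Return how many requests were stuck in the plug at each reschedule.

    resched_at: set of loop indexes at which the task gives up the CPU.
    flush_before_resched: True models flushing the plug before yielding.
    """
    plug = ToyPlug()
    stuck = []
    for i, inode in enumerate(inodes):
        plug.submit(inode)
        if i in resched_at:
            if flush_before_resched:
                plug.flush()
            # Whatever is still pending here sits idle while we are off-CPU.
            stuck.append(len(plug.pending))
    plug.flush()  # final unplug once the loop is done
    return stuck


if __name__ == "__main__":
    print(write_inodes(range(6), {1, 4}, flush_before_resched=False))
    print(write_inodes(range(6), {1, 4}, flush_before_resched=True))
```

In the model, the unflushed run accumulates more and more idle requests at each reschedule point, while the flushing run never yields the CPU with IO still queued.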
Re: ext4: performance regression introduced by the cgroup writeback support
On Wed, Sep 23, 2015 at 01:49:31PM +, Dexuan Cui wrote:
> Hi all,
> Since some point between July and Sep, I have been suffered from a strange
> "very slow write" issue and on Sep 9 I reported it to LKML (but got no
> reply): https://lkml.org/lkml/2015/9/9/290
>
> The issue is: under high CPU and disk I/O pressure, *some* processes can
> suffer from a very slow write speed (e.g., <1MB/s or even only 20KB/s), while
> the normal write speed should be at least dozens of MB/s.
>
> I think I identified the commit which introduced the regression:
> ext4: implement cgroup writeback support
> (https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=001e4a8775f6e8ad52a89e0072f09aee47d5d252)
>
> This commit is already in the mainline tree, so I can reproduce the issue
> there too:
> With the latest mainline, I can reproduce the issue; after I revert the
> patch, I can't reproduce the issue.
>
> When the issue happens:
> 1. the read speed is pretty normal, e.g.. it's still >100MB/s.
> 2. 'top' shows both the 'user' and 'sys' utilization is about 0%, but the
>    IO-wait is always about 100%.
> 3. 'iotop' shows the read speed is 0 (this is correct because there is indeed
>    no read request) and the write speed is pretty slow (the average is <1MB/s
>    or even 20KB/s).
> 4. when the issue happens, sometimes any new process suffers from the slow
>    write issue, but sometimes it looks not all the new processes suffers from
>    the issue.
> 5. The " WARNING: CPU: 7 PID: 6782 at fs/inode.c:390 ihold+0x30/0x40() " in
>    my Sep-9 mail may be another different issue.
> 6. To reproduce the issue, I need to run my workload for enough long time
>    (see the below).
>
> My workload is simple: I just repeatedly build the kernel source ("make
> clean; make -j16"). My kernel config is attached FYI.
>
> I can reproduce the issue on a physical machine: e.g., in my kernel building
> test with my .config, it took only ~5 minutes in the first 176 runs, but
> since the 177th run, it could take from 10 hours to 5 minutes - very unstable.
>
> It looks it's easier to reproduce the issue in a Hyper-V VM: usually I can
> reproduce the issue within the first 10 or 20 runs.
>
> Any idea?

Are you using cgroups? That patch really shouldn't impact load unless
there are actual IO controls in place.

-chris
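One quick way to start answering the cgroup question is to look at /proc/<pid>/cgroup for an affected process. The sketch below parses that file's "hierarchy-ID:controller-list:path" lines and reports the entries whose path is not the root. The function name and the sample text are illustrative only, and sitting in a non-root cgroup does not by itself mean IO controls are configured.

```python
def non_root_cgroups(proc_cgroup_text):
    """Parse /proc/<pid>/cgroup text.

    Returns (controllers, path) pairs for every hierarchy in which the
    process sits outside the root cgroup.
    """
    found = []
    for line in proc_cgroup_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is: hierarchy-ID:controller-list:cgroup-path
        _hierarchy, controllers, path = line.split(":", 2)
        if path != "/":
            found.append((controllers, path))
    return found


if __name__ == "__main__":
    # Made-up sample; on a real system read /proc/<pid>/cgroup instead.
    sample = "4:blkio:/build\n3:cpu,cpuacct:/\n0::/user.slice\n"
    for controllers, path in non_root_cgroups(sample):
        print(f"{controllers or '(unified)'}: {path}")
```

If a blkio (or io) entry shows up with a non-root path, the next step would be checking whether any limits or weights are set on that group.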
Re: [PATCH 2/2] ext4: implement cgroup writeback support
On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote:
> On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > >
> > > Signed-off-by: Tejun Heo
> > > Cc: "Theodore Ts'o"
> > > Cc: Andreas Dilger
> > > Cc: linux-e...@vger.kernel.org
> >
> > Thanks, applied.
>
> Hi, this patch introduces a regression - a major one, I'd say.
>
> Symptoms: copy a bunch of file, run sync, then run 'reboot', and after
> you boot up the copied files are corrupted. So basically the
> user-visible symptom is that 'sync' does not work.

Hi Artem,

Are you doing a hard shutdown (reboot -nf)? If you're doing a friendly
shutdown, is the FS unmounting cleanly?

>
> I quite an effort to bisect it, but it led me to this patch.

I bet it was a long bisect. Trying to see if the same patch to btrfs
has similar impacts.

-chris
Re: [PATCH 2/2] ext4: implement cgroup writeback support
On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote:
> Hi
>
> $ sync
> $ reboot

If this is the case, it should be possible to reproduce with:

cp a bunch of stuff to /ext4
unmount /ext4
mount ext4
compare data

If you're not getting a clean unmount of the test FS during the reboot,
it's a different test.

Trying to reproduce here, so far it's clean. Could you please double
check for failed unmounts?

Thanks,
Chris
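The "compare data" step above could be done with checksums along these lines. This is only a sketch under my own assumptions: the helper names are invented, and the umount/mount in the middle is left to be done by hand as root.

```python
import hashlib
import os


def checksum_tree(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    sums = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            digest = hashlib.sha256()
            with open(full, "rb") as f:
                # Read in 1 MiB chunks so large files do not fill memory.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    digest.update(chunk)
            sums[os.path.relpath(full, root)] = digest.hexdigest()
    return sums


def diff_checksums(before, after):
    """Return sorted relative paths that are missing or differ in `after`."""
    return sorted(p for p in before if after.get(p) != before[p])
```

Usage would be: take checksum_tree() of the source directory, copy it to the test filesystem, unmount and mount again, take checksum_tree() of the copy and pass both results to diff_checksums(). An empty list means the data survived the remount.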
[GIT PULL] Btrfs
Hi Linus,

Please pull the fixes from my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.2

Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes.

Zhao Lei (2) commits (+4/-2):
    btrfs: Avoid NULL pointer dereference of free_extent_buffer when read_tree_block() fail (+2/-1)
    btrfs: Fix lockdep warning of btrfs_run_delayed_iputs() (+2/-1)

Anand Jain (1) commits (+1/-1):
    btrfs: its btrfs_err() instead of btrfs_error()

Filipe Manana (1) commits (+18/-0):
    Btrfs: fix quick exhaustion of the system array in the superblock

Total: (4) commits (+23/-3)

 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/disk-io.c     |  3 ++-
 fs/btrfs/extent-tree.c | 18 ++
 fs/btrfs/transaction.c |  3 ++-
 4 files changed, 23 insertions(+), 3 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

Please pull my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.2

Outside of our usual batch of fixes, this integrates the subvolume quota
updates that Qu Wenruo from Fujitsu has been working on for a few
releases now. He gets an extra gold star for making btrfs smaller this
time, and fixing a number of quota corners in the process.

Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal to
btrfs and it's mostly cleanups and fixes. Anand also attached some of
our sysfs objects to our internal device management structs instead of
an object off the super block. It will make device management easier
overall and it's a better fit for how the sysfs files are used. None of
the existing sysfs files are moved around.

Thanks for all the fixes everyone:

Anand Jain (28) commits (+304/-115):
    Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to btrfs_fs_devices (+56/-43)
    Btrfs: sysfs: fix, btrfs_release_super_kobj() should to clean up the kobject data (+2/-0)
    Btrfs: sysfs: introduce function btrfs_sysfs_add_fsid() to create sysfs fsid (+14/-1)
    Btrfs: sysfs: fix, fs_info kobject_unregister has init_completion() twice (+0/-1)
    Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info (+10/-10)
    Btrfs: sysfs: rename __btrfs_sysfs_remove_one to btrfs_sysfs_remove_fsid (+4/-4)
    Btrfs: sysfs: fix, kobject pointer clean up needed after kobject release (+1/-0)
    Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info (+6/-7)
    Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue (+1/-1)
    Btrfc: sysfs: fix, check if device_dir_kobj is init before destroy (+6/-4)
    Btrfs: sysfs: provide framework to remove all fsid sysfs kobject (+16/-1)
    Btrfs: sysfs: separate device kobject and its attribute creation (+15/-6)
    Btrfs: sysfs: add support to show replacing target in the sysfs (+7/-1)
    Btrfs: check error before reporting missing device and add uuid (+2/-1)
    Btrfs: sysfs: add pointer to access fs_info from fs_devices (+25/-0)
    Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static (+2/-1)
    Btrfs: sysfs: let default_attrs be separate from the kset (+8/-4)
    Btrfs: sysfs: separate kobject and attribute creation (+19/-14)
    Btrfs: sysfs: make btrfs_sysfs_add_device() non static (+1/-0)
    Btrfs: sysfs: make btrfs_sysfs_add_fsid() non static (+3/-1)
    Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids (+5/-0)
    Btrfs: Check if kobject is initialized before put (+5/-3)
    Btrfs: sysfs: add support to add parent for fsid (+2/-2)
    Btrfs: sysfs: reorder the kobject creations (+13/-10)
    Btrfs: sysfs: fix, undo sysfs device links (+17/-0)
    Btrfs: log when missing device is created (+2/-0)
    lib: export symbol kobject_move() (+1/-0)
    Btrfs: free the stale device (+61/-0)

Qu Wenruo (19) commits (+879/-1542):
    btrfs: extent-tree: Use ref_node to replace unneeded parameters in __inc_extent_ref() and __free_extent() (+21/-21)
    btrfs: qgroup: Make snapshot accounting work with new extent-oriented (+33/-20)
    btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. (+40/-0)
    btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. (+89/-27)
    btrfs: delayed-ref: Use list to replace the ref_root in ref_head. (+114/-123)
    btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read. (+54/-41)
    btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. (+28/-100)
    btrfs: qgroup: Record possible quota-related extent for qgroup. (+95/-7)
    btrfs: backref: Don't merge refs which are not for same block. (+3/-3)
    btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. (+3/-972)
    btrfs: backref: Add special time_seq == (u64)-1 case for (+29/-6)
    btrfs: qgroup: Add function qgroup_update_counters(). (+120/-0)
    btrfs: qgroup: Add new function to record old_roots. (+29/-0)
    btrfs: delayed-ref: Cleanup the unneeded functions. (+0/-174)
    btrfs: qgroup: Add new qgroup calculation function (+118/-0)
    btrfs: qgroup: Add function qgroup_update_refcnt(). (+58/-0)
    btrfs: qgroup: Switch rescan to new mechanism. (+7/-36)
    btrfs: ulist: Add ulist_del() function. (+37/-11)
    btrfs: Fix superblock csum type check. (+1/-1)

Filipe Manana (14) commits (+340/-76):
    Btrfs: incremental send, check if orphanized dir inode needs delayed rename (+37/-19)
    Btrfs: fix necessary chunk tree space calculation when allocating a chunk (+7/-12)
    Btrfs: wake up extent state waiters on unlock through clear_extent_bits (+6/-1)
    Btrfs: incremental send, fix clone operations for compressed extents (+17/-1)
    Btrfs: incremental send, don't delay directory renames unnecessarily (+46/-2)
    Btrfs: fix chunk allocation regression leading to transaction abort (+19/-3)
    Btrfs: fix mute
linux-next conflict resolution branch for btrfs
Hi Stephen,

There are a few conflicts for btrfs in linux-next this time. They are
small, but I pushed out the merge commit I'm using here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next-merge

-chris
Re: linux-next conflict resolution branch for btrfs
On Fri, Aug 21, 2015 at 10:45:24AM +1000, Stephen Rothwell wrote:
> Hi Chris,
>
> On Thu, 20 Aug 2015 13:39:18 -0400 Chris Mason wrote:
> >
> > There are a few conflicts for btrfs in linux-next this time. They are
> > small, but I pushed out the merge commit I'm using here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> > next-merge
>
> Thanks for that. It seems to have merged OK but maybe it conflicts
> with something later in linux-next. Unfortunately see my other email
> about a build problem. I will keep this example merge in mind for
> later.

Ok, sorry about that one. We probably want the ifdefs up in Tejun's
code, but I'll talk with him about it today and get it fixed up.

Thanks,
Chris
Re: linux-next conflict resolution branch for btrfs
On Fri, Aug 21, 2015 at 10:45:24AM +1000, Stephen Rothwell wrote:
> Hi Chris,
>
> On Thu, 20 Aug 2015 13:39:18 -0400 Chris Mason wrote:
> >
> > There are a few conflicts for btrfs in linux-next this time. They are
> > small, but I pushed out the merge commit I'm using here:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git
> > next-merge
>
> Thanks for that. It seems to have merged OK but maybe it conflicts
> with something later in linux-next. Unfortunately see my other email
> about a build problem. I will keep this example merge in mind for
> later.

Ok, I put the ifdefs in btrfs. Really what I need to do is change
bio_clone to do this work, but that means making sure it's the right
thing for dm/md first.

I also added ifdefs for bio->bi_ioc in fs/btrfs/volumes.c, but another
commit in linux-next actually deletes the whole function from btrfs.

I've redone the example merge:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next-merge

-chris
[GIT PULL] Btrfs
Hi Linus,

Please pull my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.2

This is an assortment of fixes. Most of the commits are from Filipe
(fsync, the inode allocation cache and a few others). Mark kicked in a
series fixing corners in the extent sharing ioctls, and everyone else
fixed up assorted other problems.

Filipe Manana (9) commits (+375/-36):
    Btrfs: fix race between caching kthread and returning inode to inode cache (+11/-4)
    Btrfs: fix memory corruption on failure to submit bio for direct IO (+52/-18)
    Btrfs: fix crash on close_ctree() if cleaner starts new transaction (+29/-0)
    Btrfs: fix fsync after truncate when no_holes feature is enabled (+108/-0)
    Btrfs: fix race between balance and unused block group deletion (+58/-6)
    Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() (+3/-2)
    Btrfs: use kmem_cache_free when freeing entry in inode cache (+1/-1)
    Btrfs: fix fsync xattr loss in the fast fsync path (+104/-0)
    Btrfs: fix fsync data loss after append write (+9/-5)

Mark Fasheh (4) commits (+193/-58):
    btrfs: fix deadlock with extent-same and readpage (+117/-31)
    btrfs: don't update mtime/ctime on deduped inodes (+14/-10)
    btrfs: pass unaligned length to btrfs_cmp_data() (+2/-1)
    btrfs: allow dedupe of same inode (+60/-16)

Liu Bo (2) commits (+15/-6):
    Btrfs: fix hang when failing to submit bio of directIO (+0/-3)
    Btrfs: fix warning of bytes_may_use (+15/-3)

Zhao Lei (2) commits (+21/-20):
    btrfs: cleanup noused initialization of dev in btrfs_end_bio() (+1/-1)
    btrfs: add error handling for scrub_workers_get() (+20/-19)

Yang Dongsheng (1) commits (+41/-8):
    btrfs: qgroup: allow user to clear the limitation on qgroup

Shilong Wang (1) commits (+1/-1):
    Btrfs: fix wrong check for btrfs_force_chunk_alloc()

Total: (19) commits (+646/-129)

 fs/btrfs/btrfs_inode.h  |   2 +
 fs/btrfs/ctree.h        |   1 +
 fs/btrfs/disk-io.c      |  41 +++-
 fs/btrfs/extent-tree.c  |   3 +
 fs/btrfs/inode-map.c    |  17 +++-
 fs/btrfs/inode.c        |  89 --
 fs/btrfs/ioctl.c        | 241 +---
 fs/btrfs/ordered-data.c |   5 +
 fs/btrfs/qgroup.c       |  49 --
 fs/btrfs/relocation.c   |   2 +-
 fs/btrfs/scrub.c        |  39
 fs/btrfs/tree-log.c     | 226 -
 fs/btrfs/volumes.c      |  50 --
 13 files changed, 641 insertions(+), 124 deletions(-)
Re: new oops in 4.4.0-rc4
On Thu, Dec 10, 2015 at 10:36:17AM -0600, Jon Christopherson wrote:
> Hello,
>
> I noticed this new oops since running 4.4.0-rc4. Happens shortly after boot
> and pretty much kills the system:
>
>[ 177.774250] [ cut here ]
>[ 177.774256] kernel BUG at /data0/Source/mainline/mm/page-writeback.c:2654!
>[ 177.774258] invalid opcode: [#1] SMP
>[ 177.774261] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter ip_tables x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep nfsd auth_rpcgss nfs_acl binfmt_misc nfs lockd grace sunrpc fscache xfs libcrc32c snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi nvidia_modeset(POE) eeepc_wmi mxm_wmi asus_wmi sparse_keymap intel_rapl iosf_mbi x86_pkg_temp_thermal intel_powerclamp dm_multipath nvidia(POE) btusb kvm_intel btrtl nls_iso8859_1 kvm btbcm irqbypass snd_hda_intel wl(POE) btintel hid_logitech_hidpp joydev bluetooth serio_raw snd_hda_codec snd_hda_core snd_seq_midi cfg80211 snd_seq_midi_event snd_hwdep snd_rawmidi lpc_ich snd_pcm drm snd_seq snd_seq_device snd_timer 8250_fintek snd mei_me mei soundcore wmi mac_hid parport_pc shpchp ppdev msr nct6775 hwmon_vid coretemp lp parport btrfs xor raid6_pq drbg ansi_cprng dm_crypt dm_mirror dm_region_hash dm_log hid_generic hid_logitech_dj usbhid hid crct10dif_pclmul crc32_pclmul ahci aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse libahci video
>[ 177.774357] CPU: 5 PID: 5158 Comm: thunderbird Tainted: PW OE 4.4.0-121-generic #201512100930
>[ 177.774360] Hardware name: System manufacturer System Product Name/P8P67 DELUXE, BIOS 3602 10/31/2012
>[ 177.774362] task: 88040b6d ti: 8803af864000 task.ti: 8803af864000
>[ 177.774364] RIP: 0010:[] [] clear_page_dirty_for_io+0xe1/0x1a0

Dave Jones sent in a report about this with trinity too, I'm digging in
today. Since you can trigger this reliably, what was the last
known-good kernel for you?

-chris
[PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR
We have two reports of frequent crashes in btrfs where asserts in
clear_page_dirty_for_io() were triggering on missing page locks.

The crashes were much easier to trigger when processes were catching
ctrl-c's, and after much debugging it really looked like lock_page was a
noop.

This recent commit looks pretty suspect to me, and I confirmed that we
were exiting __wait_on_bit_lock() with -EINTR when it was called with
TASK_UNINTERRUPTIBLE.

commit 68985633bccb6066bf1803e316fbc6c1f5b796d6
Author: Peter Zijlstra
Date:   Tue Dec 1 14:04:04 2015 +0100

    sched/wait: Fix signal handling in bit wait helpers

The patch below is mostly untested, and probably not the right solution.
Dave's trinity run doesn't explode immediately anymore, and I wanted to
get this out for discussion. A quick look on the list doesn't show
anyone else has tracked this down, sorry if it's a dup.

Reported-by: Dave Jones
Reported-by: Jon Christopherson
Signed-off-by: Chris Mason

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f10bd87..12f69df 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -434,6 +434,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		ret = action(&q->key);
 		if (!ret)
 			continue;
+		if (ret == -EINTR && mode == TASK_UNINTERRUPTIBLE)
+			continue;
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
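To make the failure mode concrete, here is a small user-space model of the loop touched by the diff. It is a sketch, not the kernel function: the lock is a dict flag, action() is a stand-in for the sleep callback, the check for whether the bit is currently held is my assumption about code outside the diff context, and the names and constant values are illustrative only.

```python
EINTR = 4
TASK_INTERRUPTIBLE = 1
TASK_UNINTERRUPTIBLE = 2


def wait_on_bit_lock(lock, action, mode, patched):
    """Toy version of the wait loop.

    lock:    dict with a boolean 'held' flag standing in for the lock bit
    action:  callable that "sleeps" and returns 0 or -EINTR
    patched: apply the extra check from the diff
    Returns 0 with the lock taken, or a negative error WITHOUT the lock.
    """
    while True:
        if lock["held"]:
            ret = action()
            ignore = patched and ret == -EINTR and mode == TASK_UNINTERRUPTIBLE
            if ret != 0 and not ignore:
                return ret  # caller carries on without holding the lock
        # test_and_set_bit(): take the lock if free, otherwise go around again
        if not lock["held"]:
            lock["held"] = True
            return 0


def make_action(lock, release_after):
    """Sleep callback: a signal is always pending, and the other holder
    drops the lock on call number `release_after`."""
    calls = {"n": 0}

    def action():
        calls["n"] += 1
        if calls["n"] >= release_after:
            lock["held"] = False
        return -EINTR

    return action
```

With patched=False and TASK_UNINTERRUPTIBLE the model returns -EINTR on the first sleep while the other holder still owns the lock. A caller that cannot report the error, as lock_page() appears to be here, then proceeds as if it held the lock. With patched=True the loop keeps waiting and returns 0 only once the lock has been taken.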
[GIT PULL] Btrfs
Hi Linus,

We have some fixes queued up in my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.5

Dave had a small collection of fixes to the new free space tree code,
one of which was keeping our sysfs files more up to date with feature
bits as different things get enabled (lzo, raid5/6, etc).

I should have kept the sysfs stuff for rc3, since we always manage to
trip over something. This time it was GFP_KERNEL from somewhere that is
NOFS only. Instead of rebasing it out I've put a revert in, and we'll
fix it properly for rc3.

Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
use-after-free in our tracepoints that Dave Jones reported.

David Sterba (10) commits (+90/-20):
    btrfs: sysfs: check initialization state before updating features (+3/-0)
    btrfs: sysfs: introduce helper for syncing bits with sysfs files (+33/-0)
    btrfs: synchronize incompat feature bits with sysfs files (+17/-0)
    btrfs: sysfs: fix typo in compat_ro attribute definition (+1/-1)
    Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" (+0/-1)
    btrfs: add free space tree to the cow-only list (+2/-1)
    btrfs: tweak free space tree bitmap allocation (+16/-2)
    btrfs: sysfs: add free-space-tree bit attribute (+2/-0)
    btrfs: add free space tree to lockdep classes (+1/-0)
    btrfs: tests: switch to GFP_KERNEL (+15/-15)

Chris Mason (2) commits (+1/-18):
    Revert "btrfs: synchronize incompat feature bits with sysfs files" (+0/-17)
    btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc (+1/-1)

Filipe Manana (1) commits (+39/-11):
    Btrfs: fix race between fsync and lockless direct IO writes

Qu Wenruo (1) commits (+1/-1):
    btrfs: async-thread: Fix a use-after-free error for trace

Total: (14) commits (+131/-50)

 fs/btrfs/async-thread.c          |  2 +-
 fs/btrfs/disk-io.c               |  2 +-
 fs/btrfs/free-space-tree.c       | 18 --
 fs/btrfs/inode.c                 | 36
 fs/btrfs/relocation.c            |  3 ++-
 fs/btrfs/sysfs.c                 | 35 +++
 fs/btrfs/sysfs.h                 |  5 -
 fs/btrfs/tests/btrfs-tests.c     | 10 +-
 fs/btrfs/tests/extent-io-tests.c | 12 ++--
 fs/btrfs/tests/inode-tests.c     |  8
 fs/btrfs/tree-log.c              | 14 +++---
 11 files changed, 113 insertions(+), 32 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

A couple of small fixes in my for-linus-4.4 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.4

Chris Mason (2) commits (+19/-7):
    Btrfs: check for empty bitmap list in setup_cluster_bitmaps (+5/-3)
    Btrfs: check prepare_uptodate_page() error code earlier (+14/-4)

Filipe Manana (2) commits (+9/-7):
    Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list (+8/-5)
    Btrfs: fix transaction handle leak in balance (+1/-2)

Holger Hoffstätte (1) commits (+1/-1):
    btrfs: fix misleading warning when space cache failed to load

Total: (5) commits (+29/-15)

 fs/btrfs/extent-tree.c      | 10 +++---
 fs/btrfs/file.c             | 18 ++
 fs/btrfs/free-space-cache.c | 10 ++
 fs/btrfs/transaction.c      |  1 -
 fs/btrfs/transaction.h      |  2 +-
 fs/btrfs/volumes.c          |  3 +--
 6 files changed, 29 insertions(+), 15 deletions(-)
Re: [PATCH v4 0/9] Update to zstd-1.4.6
On 2 Oct 2020, at 2:54, Christoph Hellwig wrote: On Wed, Sep 30, 2020 at 08:05:45PM +, Nick Terrell wrote: On Sep 29, 2020, at 11:53 PM, Christoph Hellwig wrote: As you keep resend this I keep retelling you that should not do it. Please provide a proper Linux API, and switch to that. Versioned APIs have absolutely no business in the Linux kernel. The API is not versioned. We provide a stable ABI for a large section of our API, and the parts that aren’t ABI stable don’t change in semantics, and undergo long deprecation periods before being removed. The change of callers is a one-time change to transition from the existing API in the kernel, which was never upstream's API, to upstream's API. Again, please transition it to a sane kernel API. We don't have an "upstream" in this case. The upstream is the zstd project where all this code originates, and where the active development takes place. As Eric Biggers pointed out, it also receives a lot of Q/A separate from the kernel. I think we gain a great deal by leveraging the testing and documentation of the zstd project in the kernel interfaces we use. We lose some consistency with the kernel coding style, but we gain the ability to search for docs, issues, and fixes directly against the zstd project and git repo. -chris
Re: [PATCH 10/12] btrfs: flag files as supporting buffered async reads
On 26 May 2020, at 15:51, Jens Axboe wrote: > btrfs uses generic_file_read_iter(), which already supports this. > > Signed-off-by: Jens Axboe Really looking forward to this! Acked-by: Chris Mason
[PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
Hi everyone, We’re validating a new kernel in the fleet, and compared with v5.2, performance is ~2-3% lower for some of our workloads. After some digging, Johannes found that our involuntary context switch rate was ~2x higher, and we were leaving a CPU idle a higher percentage of the time, even though the workload was trying to saturate the system. We were able to reproduce the problem with schbench, and Johannes bisected down to: commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912 Author: Vincent Guittot Date: Fri Oct 18 15:26:31 2019 +0200 sched/fair: Rework load_balance() Our working theory is the load balancing changes are leaving processes behind busy CPUs instead of moving them onto idle ones. I made a few schbench modifications to make this easier to demonstrate: https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/ My VM has 40 cpus (20 cores, 2 threads per core), and my schbench command line is: schbench -t 20 -r 0 -c 100 -s 1000 -i 30 -z 120 This has two message threads, and 20 workers per message thread. Once woken up, the workers think for a full second, which means you’ll have some long latencies if you’re stuck behind one of these workers in the runqueue. The message thread does a little bit of work and then sleeps, so we end up with 40 threads hammering full blast on the CPU and 2 threads popping in and out of idle. schbench times the delay from when a message thread wakes a worker to when the worker runs. 
On a good kernel, the output looks like this:

Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
	50.0th: 155 (1653 samples)
	75.0th: 189 (808 samples)
	90.0th: 216 (501 samples)
	95.0th: 227 (163 samples)
	*99.0th: 256 (123 samples)
	99.5th: 1510 (16 samples)
	99.9th: 3132 (13 samples)
	min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
	50.0th: 147 (2261 samples)
	75.0th: 182 (1116 samples)
	90.0th: 205 (671 samples)
	95.0th: 224 (215 samples)
	*99.0th: 12240 (173 samples) <-- much higher p99 and up
	99.5th: 12752 (22 samples)
	99.9th: 13104 (18 samples)
	min=21, max=13172

Since the idea is to fully load the machine with schbench, use schbench -t , and make sure the box doesn’t have other stuff running in the background. I used a VM because it ended up giving more consistent results on our kernel test machines, which have some periodic noise running in the background.

We’ve tried a few different approaches, but don’t quite have a solid fix yet. I thought I’d kick off the discussion with my most useful hunks so far:

diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris
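For readers who want to see how a percentile table like the ones above relates to raw wakeup latencies, here is a minimal sketch in C. It uses a plain nearest-rank percentile over a sorted array; this is an assumption made for illustration, not a description of how schbench itself computes its output, and the names `cmp_usec` and `latency_percentile` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

/* qsort comparator for latencies stored as unsigned long microseconds. */
int cmp_usec(const void *a, const void *b)
{
	unsigned long x = *(const unsigned long *)a;
	unsigned long y = *(const unsigned long *)b;

	return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile (illustrative only, not schbench's method).
 * 'permille' is the percentile in tenths of a percent: 500 for p50,
 * 990 for p99, 999 for p99.9. The array is sorted in place.
 * Returns 0 when there are no samples.
 */
unsigned long latency_percentile(unsigned long *samples, size_t n,
				 unsigned int permille)
{
	size_t rank;

	if (n == 0)
		return 0;

	qsort(samples, n, sizeof(*samples), cmp_usec);

	/* Round up so that at least permille/1000 of samples are <= result. */
	rank = ((size_t)permille * n + 999) / 1000;
	if (rank == 0)
		rank = 1;
	if (rank > n)
		rank = n;

	return samples[rank - 1];
}
```

With this definition a small number of multi-millisecond wakeups leaves the median untouched and only shows up at p99 and above, which matches the shape of the regression reported in this thread.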
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 4:39, Vincent Guittot wrote: Hi Chris On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote: Hi everyone, We’re validating a new kernel in the fleet, and compared with v5.2, Which version are you using ? several improvements have been added since v5.5 and the rework of load_balance We’re validating v5.6, but all of the numbers referenced in this patch are against v5.9. I usually try to back port my way to victory on this kind of thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt this benchmark. performance is ~2-3% lower for some of our workloads. After some digging, Johannes found that our involuntary context switch rate was ~2x higher, and we were leaving a CPU idle a higher percentage of the time, even though the workload was trying to saturate the system. We were able to reproduce the problem with schbench, and Johannes bisected down to: commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912 Author: Vincent Guittot Date: Fri Oct 18 15:26:31 2019 +0200 sched/fair: Rework load_balance() Our working theory is the load balancing changes are leaving processes behind busy CPUs instead of moving them onto idle ones. I made a few schbench modifications to make this easier to demonstrate: https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/ My VM has 40 cpus (20 cores, 2 threads per core), and my schbench command line is: What is the topology ? are they all part of the same LLC ? We’ve seen the regression on both single socket and dual socket bare metal intel systems. On the VM I reproduced with, I saw similar latencies with and without siblings configured into the topology. -chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 10:24, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
>> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
>>> Hi Chris
>>>
>>> On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote:
>>>> Hi everyone,
>>>>
>>>> We’re validating a new kernel in the fleet, and compared with v5.2,
>>>
>>> Which version are you using ? several improvements have been added since v5.5 and the rework of load_balance
>>
>> We’re validating v5.6, but all of the numbers referenced in this patch are against v5.9. I usually try to back port my way to victory on this kind of thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt this benchmark.
>
> ok. Thanks for the confirmation
>
> I have been able to reproduce the problem on my setup.

Thanks for taking a look! Can I ask what parameters you used on schbench, and what kind of results you saw? Mostly I’m trying to make sure it’s a useful tool, but also the patch didn’t change things here.

> Could you try the fix below ?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 * emptying busiest.
>  	 */
>  	if (local->group_type == group_has_spare) {
> -		if (busiest->group_type > group_fully_busy) {
> +		if ((busiest->group_type > group_fully_busy) &&
> +		    (busiest->group_weight > 1)) {
>  			/*
>  			 * If busiest is overloaded, try to fill spare
>  			 * capacity. This might end up creating spare capacity
>
> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1), we should try to spread tasks on cpus instead of trying to fill spare capacity.

With this patch on top of v5.9, my latencies are unchanged. I’m building against current Linus now just in case I’m missing other fixes.

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 11:05, Chris Mason wrote:

> On 26 Oct 2020, at 10:24, Vincent Guittot wrote:
>> Could you try the fix below ?
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>  	 * emptying busiest.
>>  	 */
>>  	if (local->group_type == group_has_spare) {
>> -		if (busiest->group_type > group_fully_busy) {
>> +		if ((busiest->group_type > group_fully_busy) &&
>> +		    (busiest->group_weight > 1)) {
>>  			/*
>>  			 * If busiest is overloaded, try to fill spare
>>  			 * capacity. This might end up creating spare capacity
>>
>> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1), we should try to spread tasks on cpus instead of trying to fill spare capacity.
>
> With this patch on top of v5.9, my latencies are unchanged. I’m building against current Linus now just in case I’m missing other fixes.

I reran things to make sure nothing changed on my test box this weekend:

5.4.0-rc1-9-gfcf0553db6f4 (last good kernel)

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
	50.0th: 180 (502 samples)
	75.0th: 227 (251 samples)
	90.0th: 268 (147 samples)
	95.0th: 300 (50 samples)
	*99.0th: 338 (41 samples)
	99.5th: 344 (4 samples)
	99.9th: 1186 (5 samples)
	min=25, max=1185

5.4.0-rc1-00010-g0b0695f2b34a (first bad kernel)

Latency percentiles (usec) runtime 150 (s) (960 total samples)
	50.0th: 166 (488 samples)
	75.0th: 210 (232 samples)
	90.0th: 254 (145 samples)
	95.0th: 299 (47 samples)
	*99.0th: 12688 (39 samples)
	99.5th: 13008 (5 samples)
	99.9th: 13104 (4 samples)
	min=24, max=13100

3650b228f83adda7e5ee532e2b90429c03f7b9ec (v5.10-rc1) + your patch

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
	50.0th: 169 (505 samples)
	75.0th: 210 (246 samples)
	90.0th: 267 (151 samples)
	95.0th: 305 (48 samples)
	*99.0th: 12656 (40 samples)
	99.5th: 12944 (5 samples)
	99.9th: 13168 (5 samples)
	min=44, max=13155

-chris
Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"
On 26 Oct 2020, at 12:20, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
>> On Mon, 26 Oct 2020 16:42:14 +0100 Vincent Guittot wrote:
>>> On Mon, 26 Oct 2020 at 16:04, Rik van Riel wrote:
>>>> Could utilization estimates be off, either lagging or simply having a wrong estimate for a task, resulting in no task getting pulled sometimes, while doing a migrate_task imbalance always moves over something?
>>>
>>> task and cpu utilization are not always up to fully synced and may lag a bit which explains that sometimes LB can fail to migrate for a small diff
>>
>> OK, running with this little snippet below, I see latencies improve back to near where they used to be:
>>
>> Latency percentiles (usec) runtime 150 (s)
>> 	50.0th: 13
>> 	75.0th: 31
>> 	90.0th: 69
>> 	95.0th: 90
>> 	*99.0th: 761
>> 	99.5th: 2268
>> 	99.9th: 9104
>> 	min=1, max=16158
>>
>> I suspect the right/cleaner approach might be to use migrate_task more in !CPU_NOT_IDLE cases?
>>
>> Running a task to an idle CPU immediately, instead of refusing to have the load balancer move it, improves latencies for fairly obvious reasons.
>>
>> I am not entirely clear on why the load balancer should need to be any more conservative about moving tasks than the wakeup path is in eg. select_idle_sibling.
>
> what you are suggesting is something like:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4978964e75e5..3b6fbf33abc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  	 * emptying busiest.
>  	 */
>  	if (local->group_type == group_has_spare) {
> -		if (busiest->group_type > group_fully_busy) {
> +		if ((busiest->group_type > group_fully_busy) &&
> +		    !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
>  			/*
>  			 * If busiest is overloaded, try to fill spare
>  			 * capacity. This might end up creating spare capacity
>
> which also fixes the problem for me and aligns LB with wakeup path regarding the migration in the LLC

Vincent’s patch on top of 5.10-rc1 looks pretty great:

Latency percentiles (usec) runtime 90 (s) (3320 total samples)
	50.0th: 161 (1687 samples)
	75.0th: 200 (817 samples)
	90.0th: 228 (488 samples)
	95.0th: 254 (164 samples)
	*99.0th: 314 (131 samples)
	99.5th: 330 (17 samples)
	99.9th: 356 (13 samples)
	min=29, max=358

Next we test in prod, which probably won’t have answers until tomorrow. Thanks again Vincent!

-chris
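The condition changed by the hunk quoted in this message can be modelled outside the kernel. The sketch below is a simplified stand-alone model, not kernel code: the enum ordering is assumed to mirror `enum group_type` in kernel/sched/fair.c around v5.9, while `MODEL_SD_SHARE_PKG_RESOURCES`, `fills_spare_capacity()` and the `patched` switch are hypothetical names introduced only for illustration.

```c
#include <stdbool.h>

/* Assumed to mirror the ordering of the kernel's enum group_type (v5.9 era). */
enum group_type {
	group_has_spare = 0,
	group_fully_busy,
	group_misfit_task,
	group_asym_packing,
	group_imbalanced,
	group_overloaded,
};

/* Hypothetical stand-in; the real flag value is defined by the kernel. */
#define MODEL_SD_SHARE_PKG_RESOURCES 0x1u

/*
 * Returns true when the modelled calculate_imbalance() would take the
 * "fill spare capacity" branch, and false when it would fall through to
 * the later paths that balance by task or idle-CPU counts.
 *
 * patched == false models the original condition;
 * patched == true models the hunk quoted in the thread, which skips the
 * branch for domains whose CPUs share a last-level cache.
 */
bool fills_spare_capacity(enum group_type local, enum group_type busiest,
			  unsigned int sd_flags, bool patched)
{
	if (local != group_has_spare)
		return false;
	if (busiest <= group_fully_busy)
		return false;
	if (patched && (sd_flags & MODEL_SD_SHARE_PKG_RESOURCES))
		return false;
	return true;
}
```

In this model an overloaded busiest group inside a shared-cache domain takes the utilization-based branch before the patch and skips it afterwards, which is the behavioural difference the thread credits for bringing p99 back down.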
Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper
On 10 Nov 2020, at 13:39, Christoph Hellwig wrote: On Mon, Nov 09, 2020 at 02:01:41PM -0500, Chris Mason wrote: You do consistently ask for a shim layer, but you haven’t explained what we gain by diverging from the documented and tested API of the upstream zstd project. It’s an important discussion given that we hope to regularly update the kernel side as they make improvements in zstd. An API that looks like every other kernel API, and doesn't cause endless amount of churn because someone decided they need a new API flavor of the day. Btw, I'm not asking for a shim layer - that was the compromise we ended up with. If zstd folks can't maintain a sane code base maybe we should just drop this childish churning code base from the tree. I think APIs change based on the needs of the project. We do this all the time in the kernel, and we don’t think twice about updating users of the API as needed. The zstd changes look awkward and large today because it’s a long time period, but we’ve all been pretty vocal in the past about the importance of being able to advance APIs. -chris
Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper
On 6 Nov 2020, at 13:38, Christoph Hellwig wrote: You just keep resending this crap, don't you? Haven't you been told multiple times to provide a proper kernel API by now? You do consistently ask for a shim layer, but you haven’t explained what we gain by diverging from the documented and tested API of the upstream zstd project. It’s an important discussion given that we hope to regularly update the kernel side as they make improvements in zstd. The only benefit described so far seems to be camelcase related, but if there are problems in the API beyond that, I haven’t seen you describe them. I don’t think the camelcase alone justifies the added costs of the shim. -chris
Re: [PATCH] mm : fix pte _PAGE_DIRTY bit when fallback migrate page
On 16 Jul 2020, at 6:15, Robbie Ko wrote: Kirill A. Shutemov wrote on 2020/7/15 at 4:11 PM: On Wed, Jul 15, 2020 at 10:45:39AM +0800, Robbie Ko wrote: Kirill A. Shutemov wrote on 2020/7/14 at 6:19 PM: On Tue, Jul 14, 2020 at 11:46:12AM +0200, Vlastimil Babka wrote: On 7/13/20 3:57 AM, Robbie Ko wrote: Vlastimil Babka wrote on 2020/7/10 at 11:31 PM: On 7/9/20 4:48 AM, robbieko wrote: From: Robbie Ko When a migrate page occurs, we first create a migration entry to replace the original pte, and then go to fallback_migrate_page to execute a writeout if the migratepage is not supported. In the writeout, we will clear the dirty bit of the page and use page_mkclean to clear the dirty bit along with the corresponding pte, but page_mkclean does not support migration entry. I don't follow the scenario. When we establish migration entries with try_to_unmap(), it transfers dirty bit from PTE to the page. Sorry, I mean is _PAGE_RW with pte_write When we establish migration entries with try_to_unmap(), we create a migration entry, and if pte_write we set it to SWP_MIGRATION_WRITE, which will replace the migration entry with the original pte. When migratepage, we go to fallback_migrate_page to execute a writeout if the migratepage is not supported. In the writeout, we call clear_page_dirty_for_io to clear the dirty bit of the page and use page_mkclean to clear pte _PAGE_RW with pte_wrprotect in page_mkclean_one. However, page_mkclean_one does not support migration entries, so the migration entry is still SWP_MIGRATION_WRITE. In writeout, then we call remove_migration_ptes to remove the migration entry, because it is still SWP_MIGRATION_WRITE so set _PAGE_RW to pte via pte_mkwrite. Therefore, subsequent mmap write will not trigger page_mkwrite to cause data loss. Hm, okay. Folks, is there any good reason why try_to_unmap(TTU_MIGRATION) should not clear PTE (make the PTE none) for file page? This, I'm not sure. 
But I think that for the fs that support migratepage, when migratepage is finished, the page should still be dirty, and the pte should still have _PAGE_RW, when the next mmap write occurs, we don't need to trigger the page_mkwrite again. I don’t know the page migration code well, but you’ll need this one as well on the 4.4 kernel you mentioned: commit 25f3c5021985e885292980d04a1423fd83c967bb Author: Chris Mason Date: Tue Jan 21 11:51:42 2020 -0500 Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker And this one as well: commit 7703bdd8d23e6ef057af3253958a793ec6066b28 Author: Chris Mason Date: Wed Jun 20 07:56:11 2018 -0700 Btrfs: don't clean dirty pages during buffered writes With those two in place, we haven’t found lost data from the migration code, but we did see the fallback migration helper dirtying pages without going through page_mkwrite, which triggers the suboptimal btrfs fixup worker code path. This isn’t a yea or nay on the patch, just additional info. -chris
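The sequence Robbie describes in the quoted text can be walked through with a toy state machine. The sketch below follows only the description given in this thread; it is not derived from the mm code, every `model_*` name is hypothetical, and the `handle_migration_entries` switch stands for "writeout also downgrades a write migration entry", which is an assumed behaviour for comparison and not necessarily what the proposed patch does.

```c
#include <stdbool.h>

/* Toy model of one file page mapped by one PTE. */
struct map_model {
	bool page_dirty;         /* page-level dirty flag */
	bool pte_writable;       /* stands in for _PAGE_RW */
	bool is_migration_entry; /* PTE currently replaced by a migration entry */
	bool migration_write;    /* stands in for SWP_MIGRATION_WRITE */
};

/* try_to_unmap() for migration: remember writability in the entry. */
void model_unmap_for_migration(struct map_model *m)
{
	m->is_migration_entry = true;
	m->migration_write = m->pte_writable;
	m->pte_writable = false;
}

/*
 * Fallback writeout: the page is cleaned and a present PTE would be
 * write-protected. Per the thread, a migration entry is skipped unless
 * the (hypothetical) handle_migration_entries switch is set.
 */
void model_writeout(struct map_model *m, bool handle_migration_entries)
{
	m->page_dirty = false;
	if (!m->is_migration_entry)
		m->pte_writable = false;
	else if (handle_migration_entries)
		m->migration_write = false;
}

/* remove_migration_ptes(): restore the PTE from the entry. */
void model_remove_migration_entry(struct map_model *m)
{
	m->pte_writable = m->migration_write;
	m->is_migration_entry = false;
	m->migration_write = false;
}

/*
 * A later store through the mapping. Returns true if it faults, which is
 * where the filesystem's page_mkwrite would get a chance to run.
 */
bool model_mmap_write(struct map_model *m)
{
	if (m->pte_writable)
		return false; /* no fault, so no page_mkwrite notification */
	m->pte_writable = true;
	m->page_dirty = true;
	return true;
}
```

Running unmap, writeout and restore with the switch off leaves a clean page behind a writable PTE, so the next store does not fault. That is the state the thread identifies as the source of the missing page_mkwrite call.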
Re: [PATCH] CodingStyle: Inclusive Terminology
On 5 Jul 2020, at 0:55, Willy Tarreau wrote:

> On Sat, Jul 04, 2020 at 01:02:51PM -0700, Dan Williams wrote:
>> +Non-inclusive terminology has that same distracting effect which is why
>> +it is a style issue for Linux, it injures developer efficiency.
>
> I'm personally thinking that for a non-native speaker it's already difficult to find the best term to describe something, but having to apply an extra level of filtering on the found words to figure whether they are allowed by the language police is even more difficult.

Since our discussions are public, we’ve always had to deal with comments from people outside the community on a range of topics. But inside the kernel, it’s just a group of developers trying to help each other produce the best quality of code. We’ve got a long history together and in general I think we’re pretty good at assuming good intent.

> *This* injures developers efficiency. What could improve developers efficiency is to take care of removing *all* idiomatic or cultural words then. For example I've been participating to projects using the term "blueprint", I didn't understand what that meant. It was once explained to me and given that it had no logical reason for being called this way, I now forgot. If we follow your reasoning, Such words should be banned for exactly the same reasons. Same for colors that probably don't mean anything to those born blind.
>
> For example if in my local culture we eat tomatoes at starters and apples for dessert, it could be convenient for me to use "tomato" and "apple" as list elements to name the pointers leading to the beginning and the end of the list, and it might sound obvious to many people, but not at all for many others.
>
> Maybe instead of providing an explicit list of a few words it should simply say that terms that take their roots in the non-technical world and whose meaning can only be understood based on history or local culture ought to be avoided, because *that* actually is the real root cause of the problem you're trying to address.

I’d definitely agree that it’s a good goal to keep out non-technical terms. Even though we already try, every subsystem has its own set of patterns that reflect the most frequent contributors.

-chris
Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology
On 6 Jul 2020, at 10:06, Laurent Pinchart wrote: Hi Chris, On Mon, Jul 06, 2020 at 12:45:34PM +, Chris Mason via Ksummit-discuss wrote: On 5 Jul 2020, at 0:55, Willy Tarreau wrote: Maybe instead of providing an explicit list of a few words it should simply say that terms that take their roots in the non-technical world and whose meaning can only be understood based on history or local culture ought to be avoided, because *that* actually is the real root cause of the problem you're trying to address. I’d definitely agree that it’s a good goal to keep out non-technical terms. Even though we already try, every subsystem has its own set of patterns that reflect the most frequent contributors. That's an interesting point, because to me, it's the exact opposite. One of the intellectual rewards I find in working with the kernel is that our community is international and multicultural, allowing me to learn about other cultures. Aiming for the lowest common denominator seems to me to be closer to erasing cultural differences than including them. I hadn’t thought of it from this angle, but I do agree with you. I think the cultural side comes through more in discussions and in-person conferences than it does from the code itself. I do try to avoid local idioms or culture references unless I’m explaining them as part of a discussion or a personal story, mostly because I’ve gotten feedback from coworkers who had a hard time following my bad (ok, terrible) jokes or sarcasm. One internal example is commands that take —clowntown as an argument. It’s pretty therapeutic to type when you’re grumpy about tooling, but a lot of people probably have to look it up before it makes sense. -chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 10:46, Christoph Hellwig wrote: On Wed, Sep 16, 2020 at 10:43:04AM -0400, Chris Mason wrote: Otherwise we just end up with drift and kernel-specific bugs that are harder to debug. To the extent those APIs make us contort the kernel code, I’m sure Nick is interested in improving things in both places. Seriously, we do not care elsewhere. Why would zlib be any different? Is the zlib upstream active? Or trying to sync active development with the kernel? I’d suggest the same path for them if they were. There are probably 1000 constructive ways to have that conversation. Please choose one of those instead of being an asshole. I think you are the asshole here by ignoring the practices we are using elsewhere and think your employers pet project is somehow special. It is not, and claiming so is everything but constructive. I’m happy to advocate for more constructive discussion for anyone’s project. I tend to pick threads where I have context and I know the people involved. The kernel best practices are pragmatic. As one of many users of any established-non-kernel project, there’s a compromise between the APIs they are using for a broad base of users and us. I’m sure they are interested in improving life for all of their users, while also improving maintainability for us. -chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 4:49, Christoph Hellwig wrote: On Tue, Sep 15, 2020 at 08:42:59PM -0700, Nick Terrell wrote: From: Nick Terrell Move away from the compatibility wrapper to the zstd-1.4.6 API. This code is functionally equivalent. Again, please use sensible names And no one gives a fuck if this bad API is "zstd-1.4.6" as the Linux kernel uses its own APIs, not some random mess from a badly written userspace package. Hi Christoph, It’s not completely clear what you’re asking for here. If the API matches what’s in zstd-1.4.6, that seems like a reasonable way to label it. That’s what the upstream is for this code. I’m also not sure why we’re taking extra time to shit on the zstd userspace package. Can we please be constructive or at least actionable? -chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 17 Sep 2020, at 6:04, Christoph Hellwig wrote: On Wed, Sep 16, 2020 at 09:35:51PM -0400, Rik van Riel wrote: One possibility is to have a kernel wrapper on top of the zstd API to make it more ergonomic. I personally don’t really see the value in it, since it adds another layer of indirection between zstd and the caller, but it could be done. Zstd would not be the first part of the kernel to come from somewhere else, and have wrappers when it gets integrated into the kernel. There certainly is precedence there. It would be interesting to know what Christoph's preference is. Yes, I think kernel wrappers would be a pretty sensible step forward. That also avoid the need to do strange upgrades to a new version, and instead we can just change APIs on a as-needed basis. When we add wrappers, we end up creating a kernel specific API that doesn’t match the upstream zstd docs, and it doesn’t leverage as much of the zstd fuzzing and testing. So we’re actually making kernel zstd slightly less usable in hopes that our kernel specific part of the API is familiar enough to us that it makes zstd more usable. There’s no way to compare the two until the wrappers are done, but given the code today I’d prefer that we focus on making it really easy to track upstream. I really understand Christoph’s side here, but I’d rather ride a camel with the group than go it alone. I’d also much rather spend time on any problems where the structure of the zstd APIs don’t fit the kernel’s needs. The btrfs streaming compression/decompression looks pretty clean to me, but I think Johannes mentioned some possibilities to improve things for zswap (optimizations for page-at-a-time). If there are places where the zstd memory management or error handling don’t fit naturally into the kernel, that would also be higher on my list. Fixing those are probably going to be much easier if we’re close to the zstd upstream, again so that we can leverage testing and long term code maintenance done there. -chris
Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API
On 16 Sep 2020, at 10:30, Christoph Hellwig wrote: On Wed, Sep 16, 2020 at 10:20:52AM -0400, Chris Mason wrote: It’s not completely clear what you’re asking for here. If the API matches what’s in zstd-1.4.6, that seems like a reasonable way to label it. That’s what the upstream is for this code. I’m also not sure why we’re taking extra time to shit on the zstd userspace package. Can we please be constructive or at least actionable? Because it really doesn't matter that these crappy APIs he is introducing match anything, especially not something done as horribly as the zstd API. We'll need to do this properly, and claiming compliance to some version of this lousy API is completely irrelevant for the kernel. If the underlying goal is to closely follow the upstream of another project, we’re much better off using those APIs as provided. Otherwise we just end up with drift and kernel-specific bugs that are harder to debug. To the extent those APIs make us contort the kernel code, I’m sure Nick is interested in improving things in both places. There are probably 1000 constructive ways to have that conversation. Please choose one of those instead of being an asshole. -chris
Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use
I'm being pretty liberal with chopping down quoted material to help emphasize a particular opinion about how to bootstrap existing out-of-tree projects into the kernel. My goal here is to talk more about the process and less about the technical details, so please forgive me if I've ignored or changed the technical meaning of anything below.

On 30 May 2019, at 12:15, Kris Van Hees wrote:

> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> ... I believe that the discussion that has been going on in other emails has shown that while introducing a program type that provides a generic (abstracted) context is a different approach from what has been done so far, it is a new use case that provides for additional ways in which BPF can be used.

[ ... ]

> Yes and no. It depends on what you are trying to do with the BPF program that is attached to the different events. From a tracing perspective, providing a single BPF program with an abstract context would ...

[ ... ]

> In this model kprobe/ksys_write and tracepoint/syscalls/sys_enter_write are equivalent for most tracing purposes ...

[ ... ]

> I agree with what you are saying but I am presenting an additional use case

[ ... ]

>> All that aside the kernel support for shared libraries is an awesome feature to have and a bunch of folks want to see it happen, but it's not a blocker for 'dtrace to bpf' user space work. libbpf can be taught to do this 'pseudo shared library' feature while 'dtrace to bpf' side doesn't need to do anything special.

[ ... ]

This thread intermixes some abstract conceptual changes with smaller technical improvements, and in general it follows a familiar pattern other out-of-tree projects have hit while trying to adapt the kernel to their existing code. Just from this one email, I quoted the abstract models with use cases etc, and this is often where the discussions side track into less productive areas.

> So you are basically saying that I should redesign DTrace?

In your place, I would have removed features and adapted dtrace as much as possible to require the absolute minimum of kernel patches, or even better, no patches at all. I'd document all of the features that worked as expected, and underline anything either missing or suboptimal that needed additional kernel changes. Then I'd focus on expanding the community of people using dtrace against the mainline kernel, and work through the series features and improvements one by one upstream over time.

Your current approach relies on an all-or-nothing landing of patches upstream, and this consistently leads to conflict every time a project tries it. A more incremental approach will require bigger changes on the dtrace application side, but over time it'll be much easier to justify your kernel changes. You won't have to talk in abstract models, and you'll have many more concrete examples of people asking for dtrace features against mainline.

Most importantly, you'll make dtrace available on more kernels than just the absolute latest mainline, and removing dependencies makes the project much easier for new users to try.

-chris
Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"
On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache behaviour and balance, resulting in major performance regressions when combining workloads such as large file copies and kernel compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have discovered that the problem was present earlier than 4.19.3. Mr Gushins commit made it more visible. I'm going back to work after two days off, so I might not be able to respond inside your working hours, but I'll keep checking in on this as I get a chance."

I don't think the reverts are sufficient.

>> Roman beat me to suggesting Rik's followup. We hit a different problem in prod with small slabs, and have a lot of instrumentation on Rik's code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve the underlying problem. Solving the underlying problem does not require changing core reclaim algorithms and upsetting a page reclaim/shrinker balance that has been stable and worked well for just about everyone for years.

Things are definitely breaking down in non-specialized workloads, and have been for a long time.

-chris
Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race
On 17 Jun 2020, at 13:20, Filipe Manana wrote:

> On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov wrote:
>> ---
>>  fs/btrfs/extent_io.c | 45
>>  1 file changed, 29 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
>> index c59e07360083..f6758ebbb6a2 100644
>> --- a/fs/btrfs/extent_io.c
>> +++ b/fs/btrfs/extent_io.c
>> @@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
>>  	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
>>  	num_pages = num_extent_pages(eb);
>>  	atomic_set(&eb->io_pages, num_pages);
>> +	/*
>> +	 * It is possible for releasepage to clear the TREE_REF bit before we
>> +	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
>> +	 */
>> +	check_buffer_tree_ref(eb);
>
> This is a whole different case from the one described in the changelog, as this is in the write path. Why do we need this one?

This was Josef’s idea, but I really like the symmetry. You set io_pages, you do the tree_ref dance. Everyone fiddling with the write back bit right now correctly clears writeback after doing the atomic_dec on io_pages, but the race is tiny and prone to getting exposed again by shifting code around.

Tree ref checks around io_pages are the most reliable way to prevent this bug from coming back again later.

-chris
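The "you set io_pages, you do the tree_ref dance" rule from this reply can be sketched as a small user-space model with C11 atomics. This is a simplified illustration under stated assumptions: `struct eb_model`, `model_check_tree_ref()` and `model_start_io()` are hypothetical names, the real btrfs helper also involves locking that is left out here, and nothing below is taken from the btrfs sources.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Simplified stand-in for an extent buffer; not the btrfs structure. */
struct eb_model {
	atomic_int refs;      /* reference count on the buffer */
	atomic_int io_pages;  /* pages with IO still in flight */
	atomic_bool tree_ref; /* stands in for the TREE_REF bit */
};

/*
 * Make sure the tree reference is held. Only the caller that flips the
 * flag from false to true takes the extra reference, so calling this
 * repeatedly is harmless.
 */
void model_check_tree_ref(struct eb_model *eb)
{
	if (!atomic_exchange(&eb->tree_ref, true))
		atomic_fetch_add(&eb->refs, 1);
}

/*
 * Pair every io_pages update with a tree ref check, so a release path
 * that cleared the flag just before IO started cannot leave the buffer
 * without its reference while IO is pending.
 */
void model_start_io(struct eb_model *eb, int num_pages)
{
	atomic_store(&eb->io_pages, num_pages);
	model_check_tree_ref(eb);
}
```

Keeping the check inside the same helper that sets the page count is one way to express the symmetry argument: later code movement cannot separate the two steps without touching this function.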
Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
On 04/25/2017 04:49 PM, Tejun Heo wrote:
> On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:
>> Will try that too. I can't see why HT would change it because I see
>> single CPU queues misevaluated. Just in case, you need to tune the
>> test params so that it doesn't load the machine too much and that
>> there are some non-CPU intensive workloads going on to perturb things
>> a bit. Anyways, I'm gonna try disabling HT.
>
> It's finickier but after changing the duty cycle a bit, it reproduces
> w/ HT off. I think the trick is setting the number of threads to the
> number of logical CPUs and tune -s/-c so that p99 starts climbing up.
> The following is from the root cgroup.

Since it's only measuring wakeup latency, schbench is best at exposing problems when the machine is just barely below saturated. At saturation, everyone has to wait for the CPUs, and if we're relatively idle there's always a CPU to be found.

There's schbench -a to try and find this magic tipping point, but I haven't found a great way to automate for every kind of machine yet (sorry).

-chris
[GIT PULL] Btrfs
Hi Linus, We have one more for btrfs: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11 This is dropping a new WARN_ON from rc1 that ended up making more noise than we really want. The larger fix for the underflow got delayed a bit and it's better for now to put it under CONFIG_BTRFS_DEBUG. David Sterba (1) commits (+7/-4): btrfs: qgroup: move noisy underflow warning to debugging build Total: (1) commits (+7/-4) fs/btrfs/qgroup.c | 11 +++ 1 file changed, 7 insertions(+), 4 deletions(-)
[GIT PULL] Btrfs
Hi Linus, We have 3 small fixes queued up in my for-linus-4.11 branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11 Goldwyn Rodrigues (1) commits (+7/-7): btrfs: Change qgroup_meta_rsv to 64bit Dan Carpenter (1) commits (+6/-1): Btrfs: fix an integer overflow check Liu Bo (1) commits (+31/-21): Btrfs: bring back repair during read Total: (3) commits (+44/-29) fs/btrfs/ctree.h | 2 +- fs/btrfs/disk-io.c | 2 +- fs/btrfs/extent_io.c | 46 -- fs/btrfs/inode.c | 6 +++--- fs/btrfs/qgroup.c| 10 +- fs/btrfs/send.c | 7 ++- 6 files changed, 44 insertions(+), 29 deletions(-)
[GIT PULL] Btrfs
Hi Linus,

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch we're testing/sending by hand for this release.

Liu Bo (3) commits (+11/-13):
    Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
    Btrfs: fix potential use-after-free for cloned bio (+1/-1)
    Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
    btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

 fs/btrfs/inode.c   | 22 ++
 fs/btrfs/super.c   | 3 +++
 fs/btrfs/volumes.c | 2 +-
 3 files changed, 14 insertions(+), 13 deletions(-)
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 04:30 AM, Eric Biggers wrote:
> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>> The memory reported is the amount of memory the compressor requests.
>>
>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>> |----------|-----------|----------|-------|---------|----------|----------|
>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>
> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).

I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.

-chris
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:00 PM, Eric Biggers wrote:
> On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:
>> On 08/10/2017 04:30 AM, Eric Biggers wrote:
>>> On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:
>>>> The memory reported is the amount of memory the compressor requests.
>>>>
>>>> | Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
>>>> |----------|-----------|----------|-------|---------|----------|----------|
>>>> | none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
>>>> | zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
>>>> | zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
>>>> | zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
>>>> | zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
>>>> | zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
>>>> | zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
>>>> | zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
>>>> | zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
>>>> | zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
>>>> | zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
>>>
>>> These benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes).
>>
>> I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view.
>>
>> It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work.
>
> I am not surprised --- Zstandard is closer to the state of the art, both format-wise and implementation-wise, than the other choices in BTRFS. My point is that benchmarks need to account for how much data is compressed at a time. This is a common mistake when comparing different compression algorithms; the algorithm name and compression level do not tell the whole story. The dictionary size is extremely significant. No one is going to compress or decompress a 200 MB file as a single stream in kernel mode, so it does not make sense to justify adding Zstandard *to the kernel* based on such a benchmark. It is going to be divided into chunks.
>
> How big are the chunks in BTRFS? I thought that it compressed only one page (4 KiB) at a time, but I hope that has been, or is being, improved; 32 KiB - 128 KiB should be a better amount. (And if the amount of data compressed at a time happens to be different between the different algorithms, note that BTRFS benchmarks are likely to be measuring that as much as the algorithms themselves.)

Btrfs hooks the compression code into the delayed allocation mechanism we use to gather large extents for COW. So if you write 100MB to a file, we'll have 100MB to compress at a time (within the limits of the amount of pages we allow to collect before forcing it down).

But we want to balance how much memory you might need to uncompress during random reads. So we have an artificial limit of 128KB that we send at a time to the compression code. It's easy to change this, it's just a tradeoff made to limit the cost of reading small bits.

It's the same for zlib, lzo and the new zstd patch.

-chris
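The 128KB limit described above can be pictured as a loop that hands the compressor one bounded chunk at a time. This is a minimal sketch, not the btrfs code: `TOY_MAX_CHUNK`, `toy_chunk_fn` and `toy_for_each_chunk` are names invented for illustration, and only the 128KB figure comes from the mail.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-call limit, taken from the 128KB figure in the mail. */
#define TOY_MAX_CHUNK ((size_t)128 * 1024)

/* Hypothetical callback that would compress bytes [offset, offset + len). */
typedef void (*toy_chunk_fn)(size_t offset, size_t len, void *arg);

/*
 * Walk a delalloc-style range of `total` bytes and hand it out in pieces
 * of at most TOY_MAX_CHUNK bytes. Returns how many chunks were produced.
 */
static size_t toy_for_each_chunk(size_t total, toy_chunk_fn fn, void *arg)
{
	size_t off = 0;
	size_t nr_chunks = 0;

	while (off < total) {
		size_t len = total - off;

		if (len > TOY_MAX_CHUNK)
			len = TOY_MAX_CHUNK;
		if (fn)
			fn(off, len, arg);
		off += len;
		nr_chunks++;
	}
	return nr_chunks;
}
```

With these assumptions a 100 MiB write becomes 800 independently compressed chunks, so a small random read only has to decompress the one chunk that covers it rather than the whole extent.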
Re: [PATCH v5 2/5] lib: Add zstd modules
On 08/10/2017 03:25 PM, Hugo Mills wrote: On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote: On 08/10/2017 04:30 AM, Eric Biggers wrote: Theses benchmarks are misleading because they compress the whole file as a single stream without resetting the dictionary, which isn't how data will typically be compressed in kernel mode. With filesystem compression the data has to be divided into small chunks that can each be decompressed independently. That eliminates one of the primary advantages of Zstandard (support for large dictionary sizes). I did btrfs benchmarks of kernel trees and other normal data sets as well. The numbers were in line with what Nick is posting here. zstd is a big win over both lzo and zlib from a btrfs point of view. It's true Nick's patches only support a single compression level in btrfs, but that's because btrfs doesn't have a way to pass in the compression ratio. It could easily be a mount option, it was just outside the scope of Nick's initial work. Could we please not add more mount options? I get that they're easy to implement, but it's a very blunt instrument. What we tend to see (with both nodatacow and compress) is people using the mount options, then asking for exceptions, discovering that they can't do that, and then falling back to doing it with attributes or btrfs properties. Could we just start with btrfs properties this time round, and cut out the mount option part of this cycle. In the long run, it'd be great to see most of the btrfs-specific mount options get deprecated and ultimately removed entirely, in favour of attributes/properties, where feasible. It's a good point, and as was commented later down I'd just do mount -o compress=zstd:3 or something. But I do prefer properties in general for this. My big point was just that next step is outside of Nick's scope. -chris
[GIT PULL] zstd support (lib, btrfs, squashfs)
Hi Linus, Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. I have it in my zstd branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Nick has a number of benchmarks for the main zstd code in his lib/zstd commit: I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark: sudo modprobe zstd_compress_test sudo mknod zstd_compress_test c 245 0 sudo cp silesia.tar zstd_compress_test The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests. 
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
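For readers who want to check the derived columns of the compression table, here is a small sketch of the arithmetic. It assumes MB means 10^6 bytes and that the input is the 211,988,480 B `silesia.tar`; the `bench_*` helper names are made up for illustration. Under those assumptions the zstd -1 row (ratio 2.878, 203.05 MB/s, 224.56 adjusted MB/s) and the 2119.88 MB/s of the "none" row come out as listed.

```c
#include <assert.h>

/* Size of silesia.tar as stated in the benchmark description. */
#define BENCH_INPUT_BYTES 211988480.0

/* Compression ratio: uncompressed size over compressed size. */
static double bench_ratio(double compressed_bytes)
{
	assert(compressed_bytes > 0.0);
	return BENCH_INPUT_BYTES / compressed_bytes;
}

/* Throughput in MB/s (MB taken as 10^6 bytes), copy time included. */
static double bench_mbps(double seconds)
{
	assert(seconds > 0.0);
	return BENCH_INPUT_BYTES / seconds / 1e6;
}

/* Adjusted throughput: subtract the time of the "none" (copy only) run. */
static double bench_adj_mbps(double seconds, double none_seconds)
{
	assert(seconds > none_seconds);
	return BENCH_INPUT_BYTES / (seconds - none_seconds) / 1e6;
}
```

The adjusted column is the more useful one for comparing algorithms, since it removes the cost of copying the data in from userland, which is the same for every method.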
Nick Terrell (4) commits (+14578/-12): crypto: Add zstd support (+356/-0) btrfs: Add zstd support (+468/-12) lib: Add zstd modules (+13014/-0) lib: Add xxhash module (+740/-0) Sean Purcell (1) commits (+178/-0): squashfs: Add zstd support Total: (5) commits (+14756/-12)
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On 09/08/2017 03:33 PM, Chris Mason wrote: Hi Linus, Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. I have it in my zstd branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd There's a trivial conflict with the main btrfs pull that Dave Sterba just sent. His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. My idea was that you'd take our main btrfs pull first and this one second, but the conflicts are small enough it's not a big deal. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Just to clarify, we've been testing the kernel side of this here at FB, but our zstd use in prod is limited to the application side. -chris
Re: [GIT PULL] zstd support (lib, btrfs, squashfs)
On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote: On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote: crypto/Kconfig |9 + crypto/Makefile|1 + crypto/testmgr.c | 10 + crypto/testmgr.h | 71 + crypto/zstd.c | 265 Is there anyone going to use zstd through the crypto API? If not then I don't see the point in adding it at this point. Especially as the compression API is still in a state of flux. That part was requested by intel, but I'm happy to leave it out for another time. The rest of the patch series doesn't depend on it at all. -chris
[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)
Hi Linus, Nick Terrell's patch series to add zstd support to the kernel has been floating around for a while. After talking with Dave Sterba, Herbert and Phillip, we decided to send the whole thing in as one pull request. Herbert had asked about the crypto patch when we discussed the pull, but I didn't realize he really meant not-right-now. I've rebased it out of this branch, and none of the other patches depended on it. I have things in my zstd-minimal branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal There's a trivial conflict with the main btrfs pull from last week. Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and I've put the sample resolution in a branch named zstd-4.14-merge. zstd is a big win in speed over zlib and in compression ratio over lzo, and the compression team here at FB has gotten great results using it in production. Nick will continue to update the kernel side with new improvements from the open source zstd userland code. Nick has a number of benchmarks for the main zstd code in his lib/zstd commit: I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM. The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor, 16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is 211,988,480 B large. Run the following commands for the benchmark: sudo modprobe zstd_compress_test sudo mknod zstd_compress_test c 245 0 sudo cp silesia.tar zstd_compress_test The time is reported by the time of the userland `cp`. The MB/s is computed with 1,536,217,008 B / time(buffer size, hash) which includes the time to copy from userland. The Adjusted MB/s is computed with 1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)). The memory reported is the amount of memory the compressor requests. 
| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine. The benchmark file is located in the upstream zstd repo under `contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is the amount of memory required to decompress data compressed with the given compression level. If you know the maximum size of your input, you can reduce the memory usage of decompression irrespective of the compression level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and the gains are very similar to the core benchmarks Nick ran.
Nick Terrell (3) commits (+14222/-12): btrfs: Add zstd support (+468/-12) lib: Add zstd modules (+13014/-0) lib: Add xxhash module (+740/-0) Sean Purcell (1) commits (+178/-0): squashfs: Add zstd support Total: (4) commits (+14400/-12) fs/btrfs/Kconfig |2 + fs/btrfs/Makefile |2 +- fs/btrfs/compression.c |1 + fs/btrfs/compression.h |6 +- fs/btrfs/ctree.h |1 + fs/btrfs/disk-io.c |2 + fs/btrfs/ioctl.c |6 +- fs/btrfs/props.c |6 + fs/btrfs/super.c | 12 +- fs/btrfs/sysfs.c |2 + fs/btrfs/zstd.c| 432 ++ fs/squashfs/Kconfig| 14 +
[GIT PULL] Btrfs
ained from bdev_get_queue (+3/-4) btrfs: check if the device is flush capable (+4/-0) btrfs: delete unused member nobarriers (+0/-4) Edmund Nadolski (2) commits (+25/-20): btrfs: provide enumeration for __merge_refs mode argument (+13/-10) btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10) Goldwyn Rodrigues (2) commits (+24/-3): btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1) btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2) Chris Mason (1) commits (+2/-2): btrfs: fix the gfp_mask for the reada_zones radix tree Adam Borowski (1) commits (+9/-3): btrfs: fix a bogus warning when converting only data or metadata Deepa Dinamani (1) commits (+2/-1): btrfs: Use ktime_get_real_ts for root ctime Dan Carpenter (1) commits (+15/-26): Btrfs: handle only applicable errors returned by btrfs_get_extent Dmitry V. Levin (1) commits (+2/-0): MAINTAINERS: add btrfs file entries for include directories Hans van Kranenburg (1) commits (+5/-5): Btrfs: consistent usage of types in balance_args Total: (71) commits MAINTAINERS | 2 + fs/btrfs/backref.c | 41 ++- fs/btrfs/btrfs_inode.h | 7 + fs/btrfs/compression.c | 18 +- fs/btrfs/ctree.c | 20 +- fs/btrfs/ctree.h | 34 +- fs/btrfs/delayed-inode.c | 46 +-- fs/btrfs/delayed-inode.h | 6 +- fs/btrfs/delayed-ref.c | 8 +- fs/btrfs/delayed-ref.h | 8 +- fs/btrfs/dev-replace.c | 9 +- fs/btrfs/disk-io.c | 13 +- fs/btrfs/disk-io.h | 4 +- fs/btrfs/extent-tree.c | 35 +- fs/btrfs/extent_io.c | 59 +-- fs/btrfs/extent_io.h | 8 +- fs/btrfs/extent_map.c| 10 +- fs/btrfs/extent_map.h| 3 +- fs/btrfs/file.c | 82 - fs/btrfs/free-space-cache.c | 2 +- fs/btrfs/inode.c | 289 +++ fs/btrfs/ioctl.c | 33 +- fs/btrfs/ordered-data.c | 20 +- fs/btrfs/ordered-data.h | 2 +- fs/btrfs/qgroup.c| 102 ++ fs/btrfs/qgroup.h| 51 ++- fs/btrfs/raid56.c| 38 +- fs/btrfs/reada.c | 37 +- fs/btrfs/root-tree.c | 3 +- fs/btrfs/scrub.c | 331 +++-- fs/btrfs/send.c | 23 +- fs/btrfs/super.c | 3 +- fs/btrfs/tests/btrfs-tests.c | 1 - fs/btrfs/transaction.c | 48 ++- 
fs/btrfs/transaction.h | 6 +- fs/btrfs/tree-log.c | 2 +- fs/btrfs/volumes.c | 854 +++ fs/btrfs/volumes.h | 8 +- include/trace/events/btrfs.h | 187 +- include/uapi/linux/btrfs.h | 10 +- 40 files changed, 1629 insertions(+), 834 deletions(-)
Re: [GIT PULL] Btrfs
On 05/09/2017 01:56 PM, Chris Mason wrote: > Hi Linus, > > My for-linus-4.12 branch: > > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git > for-linus-4.12 I hit send too soon, sorry. There's a trivial conflict with our WARN_ON fix that went into 4.11. I pushed the resolution to for-linus-4.12-merged. diff --cc fs/btrfs/qgroup.c index afbea61,3f75b5c..deffbeb --- a/fs/btrfs/qgroup.c +++ b/fs/btrfs/qgroup.c @@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str qgroup->excl += sign * num_bytes; qgroup->excl_cmpr += sign * num_bytes; if (sign > 0) { + trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes); - if (WARN_ON(qgroup->reserved < num_bytes)) + if (qgroup->reserved < num_bytes) report_reserved_underflow(fs_info, qgroup, num_bytes); else qgroup->reserved -= num_bytes; @@@ -1103,7 -1057,9 +1060,9 @@@ WARN_ON(sign < 0 && qgroup->excl < num_bytes); qgroup->excl += sign * num_bytes; if (sign > 0) { + trace_qgroup_update_reserve(fs_info, qgroup, + -(s64)num_bytes); - if (WARN_ON(qgroup->reserved < num_bytes)) + if (qgroup->reserved < num_bytes) report_reserved_underflow(fs_info, qgroup, num_bytes); else @@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b qg = unode_aux_to_qgroup(unode); + trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes); - if (WARN_ON(qg->reserved < num_bytes)) + if (qg->reserved < num_bytes) report_reserved_underflow(fs_info, qg, num_bytes); else qg->reserved -= num_bytes;
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 05/17/2017 06:53 AM, Peter Zijlstra wrote: On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote: sched/fair, cpumask: Export for_each_cpu_wrap() -static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped) -{ - next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1); -} OK, so this patch fixed an actual bug in the for_each_cpu_wrap() implementation. The above 'n+1' should be 'n', and the effect is that it'll skip over CPUs, potentially resulting in an iteration that only sees every other CPU (for a fully contiguous mask). This in turn causes hackbench to further suffer from the regression introduced by commit: 4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive") So its well past time to fix this. Where the old scheme was a cliff-edge throttle on idle scanning, this introduces a more gradual approach. Instead of stopping to scan entirely, we limit how many CPUs we scan. Initial benchmarks show that it mostly recovers hackbench while not hurting anything else, except Mason's schbench, but not as bad as the old thing. It also appears to recover the tbench high-end, which also suffered like hackbench. I'm also hoping it will fix/preserve kitsunyan's interactivity issue. Please test.. We'll get some tests going here too. -chris
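The "only sees every other CPU" effect Peter describes can be reproduced with a toy wrap-around walk over a bitmask. This is a userspace model of the failure mode (an extra +1 on top of a helper that already searches past n), not the kernel's for_each_cpu_wrap(); `toy_next_after`, `toy_count_visited` and the `bias` parameter are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* First set bit strictly after position n, or nbits if there is none. */
static int toy_next_after(unsigned long mask, int nbits, int n)
{
	assert(n >= -1 && nbits <= 32);
	for (int i = n + 1; i < nbits; i++)
		if ((mask >> i) & 1UL)
			return i;
	return nbits;
}

/*
 * Walk the mask once, starting at `start` and wrapping around.
 * bias == 0 models the correct step; bias == 1 models the extra "+1".
 * Returns how many CPUs the walk actually visits.
 */
static int toy_count_visited(unsigned long mask, int nbits, int start, int bias)
{
	int visited = 0;
	int n = start - 1;   /* "previous" CPU, so the first step can land on start */
	bool wrapped = false;

	for (;;) {
		int next = toy_next_after(mask, nbits, n + bias);

		if (next >= nbits) {
			if (wrapped)
				break;
			wrapped = true;  /* fell off the end: restart from bit 0 */
			n = -1;
			continue;
		}
		if (wrapped && next >= start)
			break;           /* back where we began, walk is complete */
		visited++;
		n = next;
	}
	return visited;
}
```

On a fully contiguous 8-CPU mask the correct step visits all eight CPUs, while the double increment visits only four of them, which matches the description of an iteration that skips every other CPU.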
Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()
On 06/06/2017 05:21 AM, Peter Zijlstra wrote:
> On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:
>> On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:
>>> On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:
>>>> Please test..
>>>
>>> Results are still coming in but things do look better with your
>>> patch applied.
>>>
>>> It does look like there's a regression when running hackbench in
>>> process mode and when the CPUs are not fully utilised, e.g. check
>>> this out:
>>
>> This turned out to be a false positive; your patch improves things as
>> far as I can see.
>
> Hooray, I'll move it to a part of the queue intended for merging.

It's a little late, but Roman Gushchin helped get some runs of this with our production workload. The patch is ever so slightly better.

Thanks!

-chris
[GIT PULL] Btrfs
Hi Linus, My for-linus-4.12 branch has some fixes that Dave Sterba collected: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.12 We've been hitting an early enospc problem on production machines that Omar tracked down to an old int->u64 mistake. I waited a bit on this pull to make sure it was really the problem from production, but it's on ~2100 hosts now and I think we're good. Omar also noticed a commit in the queue would make new early ENOSPC problems. I pulled that out for now, which is why the top three commits are younger than the rest. Otherwise these are all fixes, some explaining very old bugs that we've been poking at for a while. Jeff Mahoney (2) commits (+4/-3): btrfs: fix race with relocation recovery and fs_root setup (+3/-3) btrfs: fix memory leak in update_space_info failure path (+1/-0) Liu Bo (1) commits (+1/-1): Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io Colin Ian King (1) commits (+1/-1): btrfs: fix incorrect error return ret being passed to mapping_set_error Omar Sandoval (1) commits (+2/-2): Btrfs: fix delalloc accounting leak caused by u32 overflow Qu Wenruo (1) commits (+122/-2): btrfs: fiemap: Cache and merge fiemap extent before submit it to user David Sterba (1) commits (+2/-2): btrfs: use correct types for page indices in btrfs_page_exists_in_range Jan Kara (1) commits (+6/-4): btrfs: Make flush bios explicitely sync Su Yue (1) commits (+1/-1): btrfs: tree-log.c: Wrong printk information about namelen Total: (9) commits (+139/-16) fs/btrfs/ctree.h | 4 +- fs/btrfs/dir-item.c| 2 +- fs/btrfs/disk-io.c | 10 ++-- fs/btrfs/extent-tree.c | 7 +-- fs/btrfs/extent_io.c | 126 +++-- fs/btrfs/inode.c | 6 +-- 6 files changed, 139 insertions(+), 16 deletions(-)
Re: [PATCH] btrfs: always write superblocks synchronously
On 05/03/2017 04:36 AM, Jan Kara wrote: On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote: Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as synchronous" removed REQ_SYNC flag from WRITE_FUA implementation. Since REQ_FUA and REQ_FLUSH flags are stripped from submitted IO when the disk doesn't have volatile write cache and thus effectively make the write async. This was seen to cause performance hits up to 90% regression in disk IO related benchmarks such as reaim and dbench[1]. Fix the problem by making sure the first superblock write is also treated as synchronous since they can block progress of the journalling (commit, log syncs) machinery and thus the whole filesystem. Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous) Cc: stable Cc: Jan Kara Signed-off-by: Davidlohr Bueso I wasn't patient enough and already sent the fix as part of my series fixing other filesystems [1]. It also fixes one more place in btrfs that needs REQ_SYNC to return to the original behavior. Thanks guys. -chris
Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone, The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors. The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague. Anyone is eligible to stand for election, simply send your nomination to: tech-board-discuss at lists.linux-foundation.org Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here: https://goo.gl/ADVFtT The deadline for receiving nominations is up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting. Chris Mason, TAB Chair [1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: Moving ndctl development into the kernel tree?
On 07/22/2017 02:49 PM, Dan Williams wrote: On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams wrote: [ adding Chris ] On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams wrote: On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar wrote: * Dan Williams wrote: [...] * Like perf, ndctl borrows the sub-command architecture and option parsing from git. So, this code could be refactored into something shared / generic, i.e. the bits in tools/perf/util/. Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command and options parsing code as well, and it's already sharing it with perf, via the tools/lib/subcmd/ library. ndctl could use that as well. Ah, nice, that refactoring happened about a year after ndctl was born. Which brings up the next question about what to do with the git history, but I'd want to know if ndctl is even welcome upstream before digging any deeper. I suspect this would be similar to what Chris did to merge btrfs while retaining the standalone history. Chris, any pointers on what worked well and what if anything you would do differently? I.e. I'm looking to use git filter-branch to rewrite ndctl history as if it had always been in tools/ndctl in the kernel tree. I found this old thread https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend using an older kernel as the branch base. So it wasn't as painful as I thought it would be, I just used the script Linus recommended in that thread. Here is what I came up with merging the last ndctl release on top of v4.9, and then applying the pending development patches re-filtered to tools/ndctl: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl ...the next thing would be to rework the versioning to use the kernel version and switch to using tools/lib/subcmd/. I'd like to say I figured it all out back then, but the truth is that Linus held my hand the whole way. My memory of it is that his script worked really well, I just ran that and verified the results. -chris
Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations
Hello everyone, Quick update on the TAB elections. We have 6 nominations so far: Jon Corbet, Greg Kroah-Hartman, Shuah Khan, Steve Rostedt, Ted Tso, Tim Bird. The elections are coming soon, please feel free to contact me if you have any questions about the TAB. - The Linux Foundation Technical Advisory Board (TAB) serves as the interface between the kernel development community and the Foundation. The TAB advises the Foundation on kernel-related matters, helps member companies learn to work with the community, and works to resolve community-related problems before they get out of hand. The board has ten members, one of whom sits on the LF board of directors. The election to select five TAB members will be held at the 2017 Kernel Summit in Prague, Czech Republic. The elections will take place at the conference center on Wednesday Oct 25th, shortly before the evening reception. The election will be open to all attendees of all of the Linux Foundation events taking place that week in Prague. Anyone is eligible to stand for election; simply send your nomination to: tech-board-discuss at lists.linux-foundation.org Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference here: https://goo.gl/ADVFtT Nominations will be accepted up until the beginning of the election event. Any statements for the online document need to be sent by Monday Oct 23rd. Please get your nomination in early so everyone has a chance to review the nominations before voting. Chris Mason, TAB Chair [1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/29/2017 12:05 PM, Tejun Heo wrote: On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote: Hello, On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote: What has happened with this patch set? No idea. cc'ing Chris directly. Chris, if the patchset looks good, can you please route them through the btrfs tree? lol looking at the patchset again, I'm not sure that's obviously the right tree. It can either be cgroup, block or btrfs. If no one objects, I'll just route them through cgroup. We'll have to coordinate a bit during the next merge window but I don't have a problem with these going in through cgroup. Dave, does this sound good to you? I'd like to include my patch to do all crcs inline (instead of handing off to helper threads) when io controls are in place. By the merge window we should have some good data on how much it's all helping. -chris
Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
On 11/30/2017 12:23 PM, David Sterba wrote: On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote: On 11/29/2017 12:05 PM, Tejun Heo wrote: On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote: Hello, On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote: What has happened with this patch set? No idea. cc'ing Chris directly. Chris, if the patchset looks good, can you please route them through the btrfs tree? lol looking at the patchset again, I'm not sure that's obviously the right tree. It can either be cgroup, block or btrfs. If no one objects, I'll just route them through cgroup. We'll have to coordinate a bit during the next merge window but I don't have a problem with these going in through cgroup. Dave, does this sound good to you? There are only minor changes to btrfs code so cgroup tree would be better. I'd like to include my patch to do all crcs inline (instead of handing off to helper threads) when io controls are in place. By the merge window we should have some good data on how much it's all helping. Are there any problems in sight if the inline crc and cgroup changes go separately? I assume there's a runtime dependency, not a code dependency, so it could be sorted by the right merge order. The feature is just more useful with the inline crcs. Without them we end up with kworkers doing both high and low prio submissions and it all boils down to the speed of the lowest priority. -chris
Re: btrfs bio linked list corruption.
On 10/13/2016 02:16 PM, Dave Jones wrote: On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote: > On 10/12/2016 10:40 AM, Dave Jones wrote: > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote: > > > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote: > > > > > > > > > > > > On 10/11/2016 10:45 AM, Dave Jones wrote: > > > > > This is from Linus' current tree, with Al's iovec fixups on top. > > > > > > > > > > [ cut here ] > > > > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0 > > > > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80). > > > > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13 > > > > > c9d87458 8d32007c c9d874a8 > > > > > c9d87498 8d07a6c1 00210246 88050388e880 > > > > > > I hit this again overnight, it's the same trace, the only difference > > > being slightly different addresses in the list pointers: > > > > > > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00). > > > > > > I'm actually a little surprised that ->next was the same across two > > > reboots on two different kernel builds. That might be a sign this is > > > more repeatable than I'd thought, even if it does take hours of runtime > > > right now to trigger it. I'll try and narrow the scope of what trinity > > > is doing to see if I can make it happen faster. > > > > .. and of course the first thing that happens is a completely different > > btrfs trace.. > > > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs] > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14 > > c900019076a8 b731ff3c > > c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98 > > 0801 880501cfa2a8 008a 008a > > This isn't even IO. Uuug. We're going to need a fast enough test > that we can bisect. Progress... I've found that this combination of syscalls.. 
./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2 hits one of these two bugs in a few minutes runtime. Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync. Mix them together though, and something goes awry. Hasn't triggered here yet. I'll leave it running though. -chris
[GIT PULL] Btrfs
Hi Linus, My for-linus-4.9 branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.9 Has some fixes from Omar and Dave Sterba for our new free space tree. This isn't heavily used yet, but as we move toward making it the new default we wanted to nail down an endian bug. Omar Sandoval (5) commits (+259/-145): Btrfs: expand free space tree sanity tests to catch endianness bug (+96/-68) Btrfs: fix extent buffer bitmap tests on big-endian systems (+51/-36) Btrfs: fix free space tree bitmaps on big-endian systems (+76/-27) Btrfs: fix mount -o clear_cache,space_cache=v2 (+12/-12) Btrfs: catch invalid free space trees (+24/-2) David Sterba (2) commits (+13/-12): btrfs: tests: uninline member definitions in free_space_extent (+2/-1) btrfs: tests: constify free space extent specs (+11/-11) Total: (7) commits (+272/-157) fs/btrfs/ctree.h | 3 +- fs/btrfs/disk-io.c | 33 +++--- fs/btrfs/extent_io.c | 64 +++ fs/btrfs/extent_io.h | 22 fs/btrfs/free-space-tree.c | 19 ++-- fs/btrfs/tests/extent-io-tests.c | 87 --- fs/btrfs/tests/free-space-tree-tests.c | 189 +++-- include/uapi/linux/btrfs.h | 12 ++- 8 files changed, 272 insertions(+), 157 deletions(-)
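As a rough illustration of the kind of endianness problem these fixes target (a sketch under assumptions, not the actual btrfs code): if a bitmap is handled in memory as native machine words but written out byte for byte, the same logical bit lands in a different byte depending on the CPU's byte order. The helper names `word_bitmap_bytes` and `byte_bitmap_bytes` and the 64-bit word size are hypothetical choices made for this example only.

```python
import struct


def word_bitmap_bytes(bit, byteorder):
    """Serialize a one-word bitmap with `bit` set, using the given byte order.

    Hypothetical helper: models a bitmap kept as a 64-bit word and written
    out as raw bytes, which makes the on-disk layout depend on the CPU.
    """
    fmt = "<Q" if byteorder == "little" else ">Q"
    return struct.pack(fmt, 1 << bit)


def byte_bitmap_bytes(bit, nbytes=8):
    """Serialize the same bitmap one byte at a time, independent of the CPU."""
    buf = bytearray(nbytes)
    buf[bit // 8] |= 1 << (bit % 8)
    return bytes(buf)


little = word_bitmap_bytes(0, "little")
big = word_bitmap_bytes(0, "big")
portable = byte_bitmap_bytes(0)

# Bit 0 sits in the first byte on little-endian and the last byte on big-endian.
print(little.hex())
print(big.hex())
print(portable.hex())
```

Treating the bitmap as an array of bytes gives one layout on every machine, which is why a byte-oriented view is the usual way to keep such a format portable.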
Linux Foundation Technical Advisory Board Elections and Nomination process
Hello everyone, The elections for five of the ten members of the Linux Foundation Technical Advisory Board (TAB) are held every year[1]. This year the election will be at the 2016 Kernel Summit in Santa Fe, NM. The elections will take place at the conference center on Wednesday Nov 2nd, shortly before the evening Kernel Summit/Plumbers reception. The elections will be open to all attendees of both the Kernel Summit and the Linux Plumbers Conference. Anyone is eligible to stand for election; simply send your nomination to: tech-board-discuss at lists.linux-foundation.org Just before the election, everyone will have a chance to introduce themselves and briefly talk about why they would like to participate on the Technical Advisory Board. This year, we're encouraging everyone to include those details along with their nomination, which we will compile into an online document for quick reference. Nominations will be accepted up until the beginning of the event where the election is held. Any statements for the online document need to be sent by Friday Oct 28th. (Please remember, if you're not going to be present, that things go wrong with both networks and mailing lists, so get your nomination in early.) Chris Mason, TAB Chair [1] TAB members sit for a term of two years, and half of the board is up for election every year. Five of the seats are up for election now. The other five are halfway through their term and will be up for election next year.
Re: btrfs bio linked list corruption.
On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote: On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote: > > > > .. and of course the first thing that happens is a completely different > > > > btrfs trace.. > > > > > > > > > > > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs] > > > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14 > > > > c900019076a8 b731ff3c > > > > c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98 > > > > 0801 880501cfa2a8 008a 008a > > > > > > This isn't even IO. Uuug. We're going to need a fast enough test > > > that we can bisect. > > > > Progress... > > I've found that this combination of syscalls.. > > > > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2 > > > > hits one of these two bugs in a few minutes runtime. > > > > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync. > > Mix them together though, and something goes awry. > > > Hasn't triggered here yet. I'll leave it running though. The hits keep coming.. BUG: Bad page state in process kworker/u8:12 pfn:4988fa page:ea0012623e80 count:0 mapcount:0 mapping:8804450456e0 index:0x9 Hmpf, I've had this running since Friday without failing. Can you send me your .config please? -chris
Re: lockdep warning in btrfs in 4.8-rc3
On 09/08/2016 08:50 PM, Dave Jones wrote: On Thu, Sep 08, 2016 at 08:58:48AM -0400, Chris Mason wrote: > On 09/08/2016 07:50 AM, Christian Borntraeger wrote: > > On 09/08/2016 01:48 PM, Christian Borntraeger wrote: > >> Chris, > >> > >> with 4.8-rc3 I get the following on an s390 box: > > > > Sorry for the noise, just saw the fix in your pull request. > > > > The lockdep splat is still there, we'll need to annotate this one a little. Here's another one (unrelated?) that I've not seen before today: WARNING: CPU: 1 PID: 10664 at kernel/locking/lockdep.c:704 register_lock_class+0x33f/0x510 CPU: 1 PID: 10664 Comm: kworker/u8:5 Not tainted 4.8.0-rc5-think+ #2 Workqueue: writeback wb_workfn (flush-btrfs-1) 0097 b97fbad3 88013b8c3770 a63d3ab1 a6bf1792 a60df22f 88013b8c37b0 a60897a0 02c0b97fbad3 a6bf1792 Call Trace: [] dump_stack+0x6c/0x9b [] ? register_lock_class+0x33f/0x510 [] __warn+0x110/0x130 [] warn_slowpath_null+0x2c/0x40 [] register_lock_class+0x33f/0x510 [] ? bio_add_page+0x7e/0x120 [] __lock_acquire.isra.32+0x5b/0x8c0 [] lock_acquire+0x58/0x70 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs] [] _raw_write_lock+0x38/0x70 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs] [] btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs] [] lock_extent_buffer_for_io+0x28/0x2e0 [btrfs] [] btree_write_cache_pages+0x231/0x550 [btrfs] [] ? btree_set_page_dirty+0x20/0x20 [btrfs] [] btree_writepages+0x74/0x90 [btrfs] [] do_writepages+0x3e/0x80 [] __writeback_single_inode+0x42/0x220 [] writeback_sb_inodes+0x351/0x730 [] ? __wb_update_bandwidth+0x1c1/0x2b0 [] wb_writeback+0x138/0x2a0 [] wb_workfn+0x10e/0x340 [] ? __lock_acquire.isra.32+0x1cf/0x8c0 [] process_one_work+0x24f/0x5d0 [] ? process_one_work+0x1e0/0x5d0 [] worker_thread+0x53/0x5b0 [] ? process_one_work+0x5d0/0x5d0 [] kthread+0x120/0x140 [] ? finish_task_switch+0x6a/0x200 [] ret_from_fork+0x1f/0x40 [] ? kthread_create_on_node+0x270/0x270 ---[ end trace 7b39395c07435bf1 ]--- 700 /* 701 * Huh! same key, different name? 
Did someone trample 702 * on some memory? We're most confused. 703 */ 704 WARN_ON_ONCE(class->name != lock->name); That seems kinda scary. There was a trinity run going on at the same time, so this _might_ be a random scribble from something unrelated to btrfs, but just in case.. IWBNI that code printed out both cases so I could see if this was corruption or two unrelated keys. I'll make it do that in case it happens again. I haven't seen this one before, if you could make it happen again, that would be great ;) -chris Dave
[GIT PULL] Btrfs
Hi Linus, We have three fixes in my for-linus-4.8 branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8 I'm not proud of how long it took me to track down that one liner in btrfs_sync_log(), but the good news is the patches I was trying to blame for these problems were actually fine (sorry Filipe). Wang Xiaoguang (2) commits (+16/-8): btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress (+7/-5) btrfs: do not decrease bytes_may_use when replaying extents (+9/-3) Chris Mason (1) commits (+1/-0): Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns Total: (3) commits (+17/-8) fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 23 +++ fs/btrfs/tree-log.c| 1 + 3 files changed, 17 insertions(+), 8 deletions(-)
Re: bio linked list corruption.
On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote: On 10/18/2016 04:42 PM, Dave Jones wrote: So Chris had me do a run on ext4 just for giggles. It took a while, but eventually this fell out... WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0 list_add corruption. prev->next should be next (e8c05648), but was c928bcd8. (prev=880503a145c0). CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1 c9a6b7b8 81320e3c c9a6b808 c9a6b7f8 8107a711 00210246 8805039f1740 880503a145c0 e8c05648 e8a05600 880502c39548 Call Trace: [] dump_stack+0x4f/0x73 [] __warn+0xc1/0xe0 [] warn_slowpath_fmt+0x5a/0x80 [] __list_add+0x89/0xb0 [] blk_sq_make_request+0x2f8/0x350 [] ? generic_make_request+0xec/0x240 [] generic_make_request+0xf9/0x240 [] submit_bio+0x78/0x150 [] ? __find_get_block+0x126/0x130 [] submit_bh_wbc+0x16f/0x1e0 [] ? __end_buffer_read_notouch+0x20/0x20 [] ll_rw_block+0xa8/0xb0 [] __breadahead+0x3f/0x70 [] __ext4_get_inode_loc+0x37c/0x3d0 [] ext4_iget+0x8d/0xb90 [] ? d_alloc_parallel+0x329/0x700 [] ext4_iget_normal+0x2a/0x30 [] ext4_lookup+0x136/0x250 [] lookup_slow+0x12d/0x220 [] walk_component+0x1e7/0x310 [] ? path_init+0x4d8/0x520 [] path_lookupat+0x62/0x120 [] ? getname_flags+0x32/0x180 [] filename_lookup+0xa8/0x130 [] ? strncpy_from_user+0x46/0x170 [] ? getname_flags+0x4e/0x180 [] user_path_at_empty+0x31/0x40 [] vfs_fstatat+0x61/0xc0 [] ? __lock_acquire.isra.32+0x1cf/0x8c0 [] SYSC_newstat+0x2e/0x60 [] ? __this_cpu_preempt_check+0x13/0x20 [] SyS_newstat+0x9/0x10 [] do_syscall_64+0x5c/0x170 [] entry_SYSCALL64_slow_path+0x25/0x25 So this one isn't a btrfs specific problem as I first thought. This sometimes reproduces within minutes, sometimes hours, which makes it a pain to bisect. It only started showing up this merge window though. Chinner reported the same thing on XFS, I'll look into it asap. Jens, not sure if you saw the whole thread. This has triggered bad page state errors, and also corrupted a btrfs list. 
It hurts me to say, but it might not actually be your fault. -chris
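For readers unfamiliar with the "prev->next should be next" warning quoted in this thread, the check behind it can be sketched roughly as follows. This is a simplified Python model of a debug-checked doubly linked list insert, not the kernel's code; the names `Node` and `checked_list_add` are hypothetical.

```python
class Node:
    """Minimal circular doubly linked list node (stand-in for a list head)."""

    def __init__(self, name):
        self.name = name
        self.prev = self
        self.next = self


def checked_list_add(new, prev, nxt):
    """Insert `new` between `prev` and `nxt`, refusing if the links disagree.

    Loosely modeled on the sanity checks behind the warning; not kernel code.
    """
    if nxt.prev is not prev:
        raise RuntimeError("list_add corruption. next->prev should be prev")
    if prev.next is not nxt:
        raise RuntimeError("list_add corruption. prev->next should be next")
    nxt.prev = new
    new.next = nxt
    new.prev = prev
    prev.next = new


head = Node("head")
a = Node("a")
checked_list_add(a, head, head.next)  # fine: the list is consistent

# Simulate a stray write that redirects head.next somewhere unexpected.
stray = Node("stray")
head.next = stray
try:
    checked_list_add(Node("b"), head, a)
    detected = False
except RuntimeError as err:
    detected = True
    message = str(err)
```

The point of the check is that the insert itself is innocent: it only notices that some earlier write left `prev.next` pointing at something other than the expected neighbour, which is consistent with the suspicion in the thread that the corruption originates elsewhere.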
Re: bio linked list corruption.
On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote: On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason wrote: Jens, not sure if you saw the whole thread. This has triggered bad page state errors, and also corrupted a btrfs list. It hurts me to say, but it might not actually be your fault. Where is that thread, and what is the "this" that triggers problems? Looking at the "->mq_list" users, I'm not seeing any changes there in the last year or so. So I don't think it's the list itself. Seems to be the whole thing: http://www.gossamer-threads.com/lists/linux/kernel/2545792 My guess is xattr, but I don't have a good reason for that. -chris
Re: bio linked list corruption.
On Tue, Oct 18, 2016 at 05:10:56PM -0700, Linus Torvalds wrote: On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason wrote: Seems to be the whole thing: Ahh. On lkml, so I do have it in my mailbox, but Dave changed the subject line when he tested on ext4 rather than btrfs.. Anyway, the corrupted address is somewhat interesting. As Dave Jones said, he saw list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80). list_add corruption. prev->next should be next (e8c05648), but was c928bcd8. (prev=880503a145c0). and Dave Chinner reports list_add corruption. prev->next should be next (e8c02808), but was c90005f6bda8. (prev=88013363bb80). and it's worth noting that the "but was" is a remarkably consistent vmalloc address (the c9000.. pattern gives it away). In fact, it's identical across two boots for DaveJ in the low 14 bits, and fairly high up in those low 14 bits (0x3cd8). DaveC has a different address, but it's also in the vmalloc space, and also looks like it is fairly high up in 14 bits (0x3da8). So in both cases it's almost certainly a stack address with a fairly empty stack. The differences are presumably due to different kernel configurations and/or just different filesystems calling the same function that does the same bad thing but now at different depths in the stack. Adding Andy to the cc, because this *might* be triggered by the vmalloc stack code itself. Maybe the re-use of stacks showing some problem? Maybe Chris (who can't see the problem) doesn't have CONFIG_VMAP_STACK enabled? CONFIG_VMAP_STACK=y, but maybe I just need to hammer on process creation more. I'm testing in a hugely stripped down VM, so Dave might have more background stuff going on. -chris
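The "low 14 bits" observation in this message can be checked with a little arithmetic. The sketch below assumes a 16 KiB (2**14 byte) stack, as the 14-bit figure implies, and uses the addresses exactly as they appear in this archive, where the upper digits are truncated; only the low bits matter for the mask. The dictionary labels are added here for readability.

```python
STACK_BITS = 14  # assumption: 16 KiB stacks, i.e. 2**14 bytes
MASK = (1 << STACK_BITS) - 1

# "but was" values from the three reports quoted above (as archived).
reported = {
    "DaveJ (btrfs)": 0xC967FCD8,
    "DaveJ (ext4)": 0xC928BCD8,
    "DaveC (XFS)": 0xC90005F6BDA8,
}

# Offset of each address within a 16 KiB-aligned region.
offsets = {who: addr & MASK for who, addr in reported.items()}

for who, off in offsets.items():
    print(f"{who}: offset {off:#x} within a {MASK + 1:#x}-byte region")
```

Both of Dave Jones's addresses reduce to the same offset, 0x3cd8, and Dave Chinner's to 0x3da8, close to the top of a 0x4000-byte region, which matches the reasoning that these look like addresses near the top of a mostly empty stack.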
[GIT PULL] Btrfs
Hi Linus, We have a few small fixes queued up in my for-linus-4.8 branch: git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.8 I'm still prepping a set of fixes for btrfs fsync, just nailing down a hard to trigger memory corruption. For now, these are tested and ready: Josef Bacik (1) commits (+5/-3): Btrfs: kill invalid ASSERT() in process_all_refs() Liu Bo (1) commits (+5/-3): Btrfs: fix endless loop in balancing block groups Wang Xiaoguang (1) commits (+5/-5): btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket() Total: (3) commits (+15/-11) fs/btrfs/extent-tree.c | 10 +- fs/btrfs/relocation.c | 8 +--- fs/btrfs/send.c| 8 +--- 3 files changed, 15 insertions(+), 11 deletions(-)
Linux Plumbers call for organizers
Each year, the Linux Foundation's Technical Advisory Board (TAB) seeks an organizing committee for the annual Linux Plumbers Conference; that process has now begun for the 2017 event. This is your chance to put your stamp on one of our community's most important gatherings. LPC 2017 will take place September 13-15, and will be colocated with Open Source Summit NA (formerly LinuxCon NA) at the JW Marriott in Downtown Los Angeles CA. Interested groups should have, at a minimum, an events coordinator, a treasurer, a microconference chair, and a chairperson. This group must be able to take the initiative to handle conference-specific details (including social events, the miniconf program, and more) while working with the Linux Foundation to ensure that logistics work smoothly. The process for putting in an application to run the Linux Plumbers Conference is documented here: https://wiki.linuxfoundation.org/en/LPC Applications should be in by October 1st; the TAB will then announce a decision by (at the latest) November 11th. If you're interested in submitting a proposal, but are concerned that you don't know enough about how previous Plumbers conferences have been run, then fear not! The TAB will support the selected organizing committee with additional volunteers with past Plumbers organizing experience. Above all we are looking for a capable and enthusiastic group who we can work with to make the 2017 Linux Plumbers Conference a great success. If you have any questions about the submission process, please email the TAB at tech-bo...@lists.linux-foundation.org
Re: [Documentation] State of CPU controller in cgroup v2
On 08/16/2016 10:07 AM, Peter Zijlstra wrote: On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote: [ That, and a disturbing number of emotional outbursts against systemd, which has nothing to do with any of this. ] Oh, so I'm entirely dreaming this then: https://github.com/systemd/systemd/pull/3905 Completely unrelated. Also, the argument there seems unfair at best, you don't need cpu-v2 for buffered write control, you only need memcg and block co-mounted. This isn't systemd dictating cgroups2 or systemd trying to get rid of v1. But systemd is a common user of cgroups, and we do use it here in production. We're just sending patches upstream for the tools we're using. It's better than keeping them private, or reinventing a completely different tool that does almost the same thing. -chris