[GIT PULL] Btrfs

2017-02-11 Thread Chris Mason
Hi Linus,

My for-linus-4.10 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.10

Has two last minute fixes.  The highest priority here is a regression 
fix for the decompression code, but we also fixed up a problem with the 
32 bit compat ioctls.

The decompression bug could hand back the wrong data on big reads when 
zlib was used.  I have a larger cleanup to make the math here less error 
prone, but at this stage in the release Omar's patch is the best choice.

Omar Sandoval (1) commits (+24/-15):
 Btrfs: fix btrfs_decompress_buf2page()

Jeff Mahoney (1) commits (+4/-2):
 btrfs: fix btrfs_compat_ioctl failures on non-compat ioctls

Total: (2) commits (+28/-17)

  fs/btrfs/compression.c | 39 ---
  fs/btrfs/ioctl.c   |  6 --
  2 files changed, 28 insertions(+), 17 deletions(-)


[GIT PULL] Btrfs

2017-02-24 Thread Chris Mason
Hi Linus,

My for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Has a series of fixes and cleanups that Dave Sterba has been collecting:

There is a pretty big variety here, cleaning up internal APIs and fixing 
corner cases.

David Sterba (46) commits (+235/-313):
 btrfs: remove unused parameter from btrfs_subvolume_release_metadata 
(+6/-11)
 btrfs: remove pointless rcu protection from btrfs_qgroup_inherit (+0/-2)
 btrfs: check quota status earlier and don't do unnecessary frees (+3/-2)
 btrfs: remove unused parameter from btrfs_prepare_extent_commit (+3/-5)
 btrfs: remove unnecessary mutex lock in qgroup_account_snapshot (+1/-5)
 btrfs: embed extent_changeset::range_changed to the structure (+11/-17)
 btrfs: remove unused parameter from cleanup_write_cache_enospc (+2/-3)
 btrfs: remove unused parameters from __btrfs_write_out_cache (+3/-8)
 btrfs: remove unused parameter from clone_copy_inline_extent (+2/-3)
 btrfs: remove unused parameter from extent_write_cache_pages (+2/-4)
 btrfs: remove unused parameter from tree_move_next_or_upnext (+2/-4)
 btrfs: remove unused parameter from btrfs_check_super_valid (+3/-5)
 btrfs: remove unused logic of limiting async delalloc pages (+0/-7)
 btrfs: fix over-80 lines introduced by previous cleanups (+74/-63)
 btrfs: remove unused parameter from read_block_for_search (+5/-5)
 btrfs: remove unused parameter from adjust_slots_upwards (+2/-3)
 btrfs: remove unused parameter from init_first_rw_device (+3/-5)
 btrfs: make space cache inode readahead failure nonfatal (+3/-7)
 btrfs: remove unused parameters from scrub_setup_wr_ctx (+3/-7)
 btrfs: remove unused parameter from __btrfs_alloc_chunk (+4/-6)
 btrfs: add wrapper for counting BTRFS_MAX_EXTENT_SIZE (+23/-31)
 btrfs: remove unused parameter from submit_extent_page (+3/-9)
 btrfs: remove unused parameter from clean_tree_block (+17/-19)
 btrfs: use GFP_KERNEL in btrfs_add/del_qgroup_relation (+2/-2)
 btrfs: remove unused parameter from __add_inline_refs (+2/-3)
 btrfs: remove unused parameter from add_pending_csums (+2/-4)
 btrfs: remove unused parameter from update_nr_written (+4/-4)
 btrfs: remove unused parameter from __push_leaf_right (+2/-3)
 btrfs: remove unused parameter from check_async_write (+2/-2)
 btrfs: remove unused parameter from btrfs_fill_super (+2/-3)
 btrfs: remove unused parameter from __push_leaf_left (+2/-3)
 btrfs: remove unused parameter from write_dev_supers (+3/-3)
 btrfs: remove unused parameter from __add_inode_ref (+1/-2)
 btrfs: remove unused parameters from btrfs_cmp_data (+2/-3)
 btrfs: remove unused parameter from create_snapshot (+2/-2)
 btrfs: ulist: make the finalization function public (+2/-1)
 btrfs: remove unused parameter from tree_move_down (+2/-2)
 btrfs: ulist: rename ulist_fini to ulist_release (+10/-10)
 btrfs: qgroups: make __del_qgroup_relation static (+1/-1)
 btrfs: use GFP_KERNEL in btrfs_read_qgroup_config (+1/-1)
 btrfs: remove unused parameter from split_item (+2/-3)
 btrfs: merge two superblock writing helpers (+4/-11)
 btrfs: qgroups: opencode qgroup_free helper (+9/-9)
 btrfs: use GFP_KERNEL in btrfs_quota_enable (+1/-1)
 btrfs: use GFP_KERNEL in create_snapshot (+2/-2)
 btrfs: remove unused ulist members (+0/-7)

Nikolay Borisov (36) commits (+476/-480):
 btrfs: Make btrfs_delayed_inode_reserve_metadata take btrfs_inode (+8/-8)
 btrfs: Make btrfs_inode_delayed_dir_index_count take btrfs_inode (+5/-5)
 btrfs: Make btrfs_commit_inode_delayed_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_commit_inode_delayed_inode take btrfs_inode (+6/-6)
 btrfs: Make btrfs_get_or_create_delayed_node take btrfs_inode (+5/-6)
 btrfs: Make btrfs_kill_delayed_inode_items take btrfs_inode (+4/-4)
 btrfs: Make btrfs_delayed_delete_inode_ref take btrfs_inode (+5/-5)
 btrfs: Make btrfs_delete_delayed_dir_index take btrfs_inode (+6/-6)
 btrfs: Make btrfs_insert_delayed_dir_index take btrfs_inode (+5/-5)
 btrfs: Make btrfs_check_ref_name_override take btrfs_inode (+4/-5)
 btrfs: Make btrfs_record_snapshot_destroy take btrfs_inode (+6/-6)
 btrfs: Make btrfs_must_commit_transaction take btrfs_inode (+9/-9)
 btrfs: Make btrfs_del_dir_entries_in_log take btrfs_inode (+7/-7)
 btrfs: Make btrfs_log_changed_extents take btrfs_inode (+11/-11)
 btrfs: Make btrfs_record_unlink_dir take btrfs_inode (+14/-14)
 btrfs: Make btrfs_remove_delayed_node take btrfs_inode (+5/-5)
 btrfs: Make btrfs_get_logged_extents take btrfs_inode (+4/-4)
 btrfs: Make btrfs_log_trailing_hole take btrfs_inode (+4/-4)
 btrfs: Make btrfs_get_delayed_node take btrfs_inode (+8/-9)
 btrfs: Make btrfs_ino take a struct btrfs_inode (+151/-151)
 btrfs: Make log_directory_changes take btrfs_inode (+5/-6)
   

[GIT PULL] Btrfs

2016-08-10 Thread Chris Mason
Hi Linus,

My for-linus-4.8 branch has some fixes for btrfs send/recv and fsync
from Filipe and Robbie Ko:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

Bonus points to Filipe for already having xfstests in place for many of
these.

Filipe Manana (8) commits (+172/-52):
Btrfs: improve performance on fsync against new inode after rename/unlink 
(+95/-9)
Btrfs: send, avoid incorrect leaf accesses when sending utimes operations 
(+2/-0)
Btrfs: remove unused function btrfs_add_delayed_qgroup_reserve() (+0/-30)
Btrfs: be more precise on errors when getting an inode from disk (+18/-9)
Btrfs: incremental send, fix invalid paths for rename operations (+2/-1)
Btrfs: send, add missing error check for calls to path_loop() (+2/-0)
Btrfs: add missing check for writeback errors on fsync (+8/-0)
Btrfs: send, don't bug on inconsistent snapshots (+45/-3)

Robbie Ko (4) commits (+111/-7):
Btrfs: send, fix invalid leaf accesses due to incorrect utimes operations 
(+11/-1)
Btrfs: send, fix warning due to late freeing of orphan_dir_info structures 
(+4/-0)
Btrfs: send, fix failure to move directories with the same name around 
(+95/-5)
Btrfs: incremental send, fix premature rmdir operations (+1/-1)

Total: (12) commits (+283/-59)

 fs/btrfs/delayed-ref.c |  27 
 fs/btrfs/delayed-ref.h |   3 -
 fs/btrfs/file.c|   8 +++
 fs/btrfs/inode.c   |  46 ++---
 fs/btrfs/send.c| 173 +
 fs/btrfs/tree-log.c|  85 +---
 6 files changed, 283 insertions(+), 59 deletions(-)


[GIT PULL] Btrfs

2016-08-26 Thread Chris Mason
Hi Linus,

Please pull my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

We've queued up a few different fixes in here.  These range from enospc
corners to fsync and quota fixes, and a few targeted at
error handling for corrupt metadata/fuzzing.

Liu Bo (5) commits (+60/-2):
Btrfs: detect corruption when non-root leaf has zero item (+22/-1)
Btrfs: add ASSERT for block group's memory leak (+5/-0)
Btrfs: clarify do_chunk_alloc()'s return value (+9/-0)
Btrfs: fix memory leak of reloc_root (+8/-1)
Btrfs: check btree node's nritems (+16/-0)

Qu Wenruo (4) commits (+191/-53):
btrfs: relocation: Fix leaking qgroups numbers on data extents (+103/-6)
btrfs: qgroup: Fix qgroup incorrectness caused by log replay (+16/-0)
btrfs: qgroup: Refactor btrfs_qgroup_insert_dirty_extent() (+71/-47)
btrfs: backref: Fix soft lockup in __merge_refs function (+1/-0)

Wang Xiaoguang (4) commits (+161/-108):
btrfs: use correct offset for reloc_inode in prealloc_file_extent_cluster() 
(+6/-4)
btrfs: divide btrfs_update_reserved_bytes() into two functions (+57/-40)
btrfs: update btrfs_space_info's bytes_may_use timely (+73/-63)
btrfs: fix fsfreeze hang caused by delayed iputs deal (+25/-1)

Jeff Mahoney (3) commits (+45/-18):
btrfs: don't create or leak aliased root while cleaning up orphans (+22/-11)
btrfs: waiting on qgroup rescan should not always be interruptible (+13/-6)
btrfs: properly track when rescan worker is running (+10/-1)

Filipe Manana (1) commits (+8/-4):
Btrfs: fix lockdep warning on deadlock against an inode's log mutex

Anand Jain (1) commits (+19/-8):
btrfs: do not background blkdev_put()

Alex Lyakas (1) commits (+1/-1):
btrfs: flush_space: treat return value of do_chunk_alloc properly

Josef Bacik (1) commits (+1/-0):
Btrfs: fix em leak in find_first_block_group

Total: (20) commits

 fs/btrfs/backref.c |   1 +
 fs/btrfs/ctree.h   |   5 +-
 fs/btrfs/delayed-ref.c |   7 +-
 fs/btrfs/disk-io.c |  56 +--
 fs/btrfs/disk-io.h |   2 +
 fs/btrfs/extent-tree.c | 185 +++--
 fs/btrfs/extent_io.h   |   1 +
 fs/btrfs/file.c|  28 
 fs/btrfs/inode-map.c   |   3 +-
 fs/btrfs/inode.c   |  37 +++---
 fs/btrfs/ioctl.c   |   2 +-
 fs/btrfs/qgroup.c  |  62 ++---
 fs/btrfs/qgroup.h  |  36 --
 fs/btrfs/relocation.c  | 126 ++---
 fs/btrfs/root-tree.c   |  27 +---
 fs/btrfs/super.c   |  16 +
 fs/btrfs/transaction.c |   7 +-
 fs/btrfs/tree-log.c|  21 +-
 fs/btrfs/tree-log.h|   5 +-
 fs/btrfs/volumes.c |  27 +---
 20 files changed, 473 insertions(+), 181 deletions(-)


[GIT PULL] Btrfs

2016-06-10 Thread Chris Mason
Hi Linus,

My for-linus-4.7 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7

Has some fixes and some new self tests for btrfs.  The self tests are
usually disabled in the .config file (unless you're doing btrfs dev
work), and this bunch is meant to find problems with the 64K page
size patches.

Jeff has a patch to help people see if they are using the hardware
assist crc32c module, which really helps us nail down problems when
people ask why crcs are using so much CPU.

Otherwise, it's small fixes.

Feifei Xu (8) commits (+475/-361):
Btrfs: test_check_exists: Fix infinite loop when searching for free space 
entries (+2/-2)
Btrfs: self-tests: Execute page straddling test only when nodesize < 
PAGE_SIZE (+30/-19)
Btrfs: self-tests: Use macros instead of constants and add missing newline 
(+31/-18)
Btrfs: self-tests: Support testing all possible sectorsizes and nodesizes 
(+32/-22)
Btrfs: self-tests: Fix extent buffer bitmap test fail on BE system (+11/-1)
Btrfs: Fix integer overflow when calculating bytes_per_bitmap (+7/-7)
Btrfs: self-tests: Fix test_bitmaps fail on 64k sectorsize (+7/-1)
Btrfs: self-tests: Support non-4k page size (+355/-291)

Liu Bo (3) commits (+104/-15):
Btrfs: clear uptodate flags of pages in sys_array eb (+2/-0)
Btrfs: add validadtion checks for chunk loading (+67/-15)
Btrfs: add more validation checks for superblock (+35/-0)

Josef Bacik (1) commits (+1/-0):
Btrfs: end transaction if we abort when creating uuid root

Jeff Mahoney (1) commits (+9/-2):
btrfs: advertise which crc32c implementation is being used at module load

Vinson Lee (1) commits (+1/-1):
btrfs: Use __u64 in exported linux/btrfs.h.

Total: (14) commits (+590/-379)

 fs/btrfs/ctree.c   |   6 +-
 fs/btrfs/disk-io.c |  20 +-
 fs/btrfs/disk-io.h |   2 +-
 fs/btrfs/extent_io.c   |  10 +-
 fs/btrfs/extent_io.h   |   4 +-
 fs/btrfs/free-space-cache.c|  18 +-
 fs/btrfs/hash.c|   5 +
 fs/btrfs/hash.h|   1 +
 fs/btrfs/super.c   |  57 --
 fs/btrfs/tests/btrfs-tests.c   |   6 +-
 fs/btrfs/tests/btrfs-tests.h   |  27 +--
 fs/btrfs/tests/extent-buffer-tests.c   |  13 +-
 fs/btrfs/tests/extent-io-tests.c   |  86 ++---
 fs/btrfs/tests/free-space-tests.c  |  76 +---
 fs/btrfs/tests/free-space-tree-tests.c |  30 +--
 fs/btrfs/tests/inode-tests.c   | 344 ++---
 fs/btrfs/tests/qgroup-tests.c  | 111 ++-
 fs/btrfs/volumes.c | 109 +--
 include/uapi/linux/btrfs.h |   2 +-
 19 files changed, 569 insertions(+), 358 deletions(-)


[GIT PULL] Btrfs

2016-06-03 Thread Chris Mason
Hi Linus,

My for-linus-4.7 branch has some fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7

I realized as I was prepping this pull that my tip commit still had
Facebook task numbers and other internal metadata in it.  So I had to
reword the description, which is why it is only a few hours old.  Only
the description changed since testing.

The important part of this pull is Filipe's set of fixes for btrfs device
replacement.  Filipe fixed a few issues seen on the list and a number
he found on his own.

Filipe Manana (8) commits (+93/-19):
Btrfs: fix race setting block group back to RW mode during device replace 
(+5/-5)
Btrfs: fix unprotected assignment of the left cursor for device replace 
(+4/-0)
Btrfs: fix race setting block group readonly during device replace (+46/-2)
Btrfs: fix race between device replace and block group removal (+11/-0)
Btrfs: fix race between device replace and chunk allocation (+9/-12)
Btrfs: fix race between readahead and device replace/removal (+2/-0)
Btrfs: fix race between device replace and read repair (+10/-0)
Btrfs: fix race between device replace and discard (+6/-0)

Chris Mason (1) commits (+12/-1):
Btrfs: deal with duplciates during extent_map insertion in btrfs_get_extent

Total: (9) commits (+105/-20)

 fs/btrfs/extent-tree.c  |  6 ++
 fs/btrfs/extent_io.c| 10 ++
 fs/btrfs/inode.c| 13 -
 fs/btrfs/ordered-data.c |  6 +-
 fs/btrfs/ordered-data.h |  2 +-
 fs/btrfs/reada.c|  2 ++
 fs/btrfs/scrub.c| 50 ++---
 fs/btrfs/volumes.c  | 32 +++
 8 files changed, 103 insertions(+), 18 deletions(-)


[GIT PULL 1/2] Btrfs

2016-06-25 Thread Chris Mason
Hi Linus,

I have a two part pull this time because one of the patches Dave Sterba
collected needed to be against v4.7-rc2 or higher (we used rc4).  I try
to make my for-linus-xx branch testable on top of the last major release
so we can hand fixes to people on the list more easily, so I've split
this pull in two.

My for-linus-4.7 branch has some fixes and two performance improvements
that we've been testing for some time.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7

Josef's two performance fixes are most notable.  The transid tracking
patch makes a big improvement on pretty much every workload.

Josef Bacik (2) commits (+38/-27):
Btrfs: don't do nocow check unless we have to (+22/-22)
Btrfs: track transid for delayed ref flushing (+16/-5)

Liu Bo (1) commits (+11/-2):
Btrfs: fix error handling in map_private_extent_buffer

Chris Mason (1) commits (+11/-9):
btrfs: fix deadlock in delayed_ref_async_start

Wei Yongjun (1) commits (+1/-1):
Btrfs: fix error return code in btrfs_init_test_fs()

Chandan Rajendra (1) commits (+4/-6):
Btrfs: Force stripesize to the value of sectorsize

Wang Xiaoguang (1) commits (+2/-1):
btrfs: fix disk_i_size update bug when fallocate() fails

Total: (7) commits (+67/-46)

 fs/btrfs/ctree.c |  6 +-
 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  6 ++
 fs/btrfs/extent-tree.c   | 15 +--
 fs/btrfs/extent_io.c |  7 ++-
 fs/btrfs/file.c  | 44 ++--
 fs/btrfs/inode.c |  1 +
 fs/btrfs/ordered-data.c  |  3 ++-
 fs/btrfs/tests/btrfs-tests.c |  2 +-
 fs/btrfs/transaction.c   |  3 ++-
 fs/btrfs/volumes.c   |  4 ++--
 11 files changed, 57 insertions(+), 36 deletions(-)


[GIT PULL 2/2] Btrfs

2016-06-25 Thread Chris Mason
Hi Linus,

Btrfs part two was supposed to be a single patch on top of v4.7-rc4.
Somehow I didn't notice that my part2 branch repeated a few of the
patches in part 1 when I set it up earlier this week.  Cherry-picking
gone wrong as I folded a fix into Dave Sterba's original integration.

I've been testing the git-merged result of part1, part2 and your
master for a while, but I just rebased part2 so it didn't include
any duplicates.  I ran git diff to verify the merged result of
today's pull is exactly the same as the one I've been testing.

My for-linus-4.7-part2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.7-part2

Has one patch from Omar to bring iterate_shared back to btrfs.  We have
a tree of work we queue up for directory items and it doesn't
lend itself well to shared access.  While we're cleaning it up, Omar
has changed things to use an exclusive lock when there are delayed
items.
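
The locking change described above can be pictured with a toy model. This is only a sketch under stated assumptions, not Omar's patch: struct toy_dir, enum lock_mode and both functions are hypothetical names, and the lock is a plain state variable rather than a real reader/writer lock. It shows readdir starting with shared access and switching to an exclusive lock only when delayed items are queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy lock state; a real implementation would use a reader/writer lock. */
enum lock_mode { LOCK_NONE, LOCK_SHARED, LOCK_EXCLUSIVE };

/* Hypothetical directory state, not the btrfs inode. */
struct toy_dir {
    enum lock_mode mode;
    size_t nr_delayed_items;  /* queued directory items not yet flushed */
};

/*
 * Called with the directory locked shared. If delayed items exist, drop the
 * shared lock and retake it exclusive. There is no atomic upgrade in this
 * model, so the lock is briefly released in between. Returns true if the
 * caller now holds the lock exclusively.
 */
bool readdir_lock_for_delayed(struct toy_dir *dir)
{
    assert(dir != NULL && dir->mode == LOCK_SHARED);

    if (dir->nr_delayed_items == 0)
        return false;            /* shared access is enough */

    dir->mode = LOCK_NONE;       /* unlock shared */
    dir->mode = LOCK_EXCLUSIVE;  /* lock exclusive */
    return true;
}

/* Release whichever mode readdir ended up holding. */
void readdir_unlock(struct toy_dir *dir)
{
    assert(dir != NULL && dir->mode != LOCK_NONE);
    dir->mode = LOCK_NONE;
}
```

Directories without delayed items keep the cheaper shared path, so only the case that cannot tolerate concurrent readers pays for exclusivity.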

Omar Sandoval (1) commits (+34/-13):
Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

Total: (1) commits (+34/-13)

 fs/btrfs/delayed-inode.c | 27 ++-
 fs/btrfs/delayed-inode.h | 10 ++
 fs/btrfs/inode.c | 10 ++
 3 files changed, 34 insertions(+), 13 deletions(-)


[GIT PULL] Btrfs

2016-03-04 Thread Chris Mason
Hi Linus,

We've got a fix in my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.5

Filipe nailed down a problem where tree log replay would do some work
that orphan code wasn't expecting to be done yet, leading to BUG_ON.

Filipe Manana (1) commits (+9/-1):
Btrfs: fix loading of orphan roots leading to BUG_ON

Total: (1) commits (+9/-1)

 fs/btrfs/root-tree.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)


Re: [GIT PULL] Btrfs

2016-03-21 Thread Chris Mason
On Mon, Mar 21, 2016 at 06:16:54PM -0700, Linus Torvalds wrote:
> On Mon, Mar 21, 2016 at 5:24 PM, Chris Mason  wrote:
> >
> > I waited an extra day to send this one out because I hit a crash late
> > last week with CONFIG_DEBUG_PAGEALLOC enabled (fixed in the top commit).
> 
> Hmm. If that commit helps, it will spit out a warning.
> 
> So is it actually fixed, or just hacked around to the point where you
> don't get a page fault?
> 
> That WARN_ON_ONCE kind of implies it's a "this happens, but we don't know 
> why".

Hi Linus,

    while (bio_index < bio->bi_vcnt) {
        count = find some crcs
        ...
        while (count--) {
            ...
            page_bytes_left -= root->sectorsize;
            if (!page_bytes_left) {
                bio_index++;
                /*
                 * make sure we're still inside the
                 * bio before we update page_bytes_left
                 */
                if (bio_index >= bio->bi_vcnt) {
                    WARN_ON_ONCE(count);
                    goto done;
                }
                bvec++;
                page_bytes_left = bvec->bv_len;
                ^ this was the line that crashed
                  before
            }

        }
    }

done:
    cleanup;
    return;

What should be happening here is we'll goto done when count is zero and
we've walked past the end of the bio.  IOW, both the outer and inner
loops are doing the right tests and the right math, but the inner loop
is improperly accessing a bogus bvec->bv_len because it didn't realize
the outer loop was now completely done.
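
As a rough illustration of the check described above, here is a minimal, self-contained C sketch. It is not the btrfs code: struct seg, walk_sums() and the batches array are hypothetical stand-ins for the bvec array, the lookup loop and "count = find some crcs", and it assumes every segment length is a non-zero multiple of sectorsize.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a bio segment; only the length matters here. */
struct seg {
    unsigned int len;
};

/*
 * Walk nsegs segments, consuming sectorsize bytes per checksum.
 * batches[] plays the role of "count = find some crcs".
 * Assumes each segment length is a non-zero multiple of sectorsize.
 * Returns the number of checksums consumed, or -1 if a batch still had
 * checksums left after the last segment (the WARN_ON_ONCE(count) case).
 */
int walk_sums(const struct seg *segs, size_t nsegs,
              const unsigned int *batches, size_t nbatches,
              unsigned int sectorsize)
{
    size_t idx = 0, b = 0;
    unsigned int bytes_left;
    int consumed = 0;

    assert(sectorsize > 0);
    if (nsegs == 0)
        return 0;
    bytes_left = segs[0].len;

    while (idx < nsegs && b < nbatches) {
        unsigned int count = batches[b++];

        while (count--) {
            consumed++;
            bytes_left -= sectorsize;
            if (bytes_left == 0) {
                idx++;
                /* Bounds check first: segs[idx] may not exist. */
                if (idx >= nsegs)
                    return count ? -1 : consumed;
                bytes_left = segs[idx].len;
            }
        }
    }
    return consumed;
}
```

With segments of 4096 and 8192 bytes and a 4096-byte sector size, a batch of three checksums ends exactly at the last segment and returns 3. A batch of four runs out of segments with one checksum left over and returns -1, which is the situation the warning is meant to flag.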

I don't see a way for it to happen when count != 0, and I ran xfstests
on a few machines to try and triple check that.  If there are new bugs
hiding here, we'll have EIOs returned up to userland because this
function didn't properly fetch the crcs.  If anyone reported the EIOs,
they would send in the WARN_ON output too, so we'd know right away not
to blame their hardware.

I also ran for days with heavy read/write loads without seeing the crc
errors.  I didn't have the WARN_ON, or CONFIG_DEBUG_PAGEALLOC on that
box, but if other things were wrong, we'd have done a lot worse than poke
into bvec->bv_len, and the crc errors would have stopped the test.

-chris



Re: [GIT PULL] Btrfs

2016-03-21 Thread Chris Mason
On Mon, Mar 21, 2016 at 10:15:33PM -0400, Chris Mason wrote:
> On Mon, Mar 21, 2016 at 06:16:54PM -0700, Linus Torvalds wrote:
> > On Mon, Mar 21, 2016 at 5:24 PM, Chris Mason  wrote:
> > >
> > > I waited an extra day to send this one out because I hit a crash late
> > > last week with CONFIG_DEBUG_PAGEALLOC enabled (fixed in the top commit).
> > 
> > Hmm. If that commit helps, it will spit out a warning.
> > 
> > So is it actually fixed, or just hacked around to the point where you
> > don't get a page fault?

Hmmm, rereading my answer I realized I didn't actually answer.  I really
think this is fixed.  I left the warning only because I originally
expected something much more exotic.

-chris


Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks

2016-03-19 Thread Chris Mason
On Thu, Mar 17, 2016 at 02:49:06PM -0600, Andreas Dilger wrote:
> On Mar 17, 2016, at 12:35 PM, Chris Mason  wrote:
> > 
> > On Thu, Mar 17, 2016 at 10:47:29AM -0700, Linus Torvalds wrote:
> >> On Wed, Mar 16, 2016 at 10:18 PM, Gregory Farnum  wrote:
> >>> 
> >>> So we've not asked for NO_HIDE_STALE on the mailing lists, but I think
> >>> it was one of the problems Sage had using xfs in his BlueStore
> >>> implementation and was a big part of why it moved to pure userspace.
> >>> FileStore might use NO_HIDE_STALE in some places but it would be
> >>> pretty limited. When it came up at Linux FAST we were discussing how
> >>> it and similar things had been problems for us in the past and it
> >>> would've been nice if they were upstream.
> >> 
> >> Hmm.
> >> 
> >> So to me it really sounds like somebody should cook up a patch, but we
> >> shouldn't put it in the upstream kernel until we get numbers and
> >> actual "yes, we'd use this" from outside of google.
> > 
> > We haven't had internal tiers yelling at us for fallocate performance,
> > so I'm unlikely to suggest it, just because it's a potential
> > privacy leak we'd have to educate people about.  What I'd be more likely
> > to use is code inside the filesystem like this:
> > 
> > somefs_fallocate() {
> >     if (trim_can_really_zero(my_device)) {
> >         trim
> >         allocate a regular extent
> >         return
> >     } else {
> >         do normal fallocate
> >     }
> > }
> 
> We were discussing almost this very same thing in the ext4 concall today.
> 
> Ted initially didn't think it was worthwhile to implement, but after looking
> at the whitelist for SATA SSDs it seems that there are enough devices on the
> market that support the ATA_HORKAGE_ZERO_AFTER_TRIM to make this approach
> worthwhile to implement.

We'll end up with people complaining it makes fallocate slower because
of the trims, so it's not a perfect solution.  But I much prefer it to
fallocate-stale.
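
To make the quoted idea a bit more concrete, here is a small C sketch of the decision only. Everything in it is an assumption for illustration: struct blockdev_caps, enum prealloc_strategy and choose_prealloc_strategy() are made-up names, not kernel interfaces, and the real work (issuing the trim, allocating extents) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device description; not a real kernel structure. */
struct blockdev_caps {
    bool supports_trim;
    bool trim_zeroes_data;  /* trimmed blocks are guaranteed to read back as zeroes */
};

enum prealloc_strategy {
    PREALLOC_TRIM_REGULAR_EXTENT,  /* trim, then allocate a regular extent */
    PREALLOC_NORMAL_FALLOCATE      /* normal fallocate path */
};

/* True only when a trim is known to leave zeroes behind. */
static bool trim_can_really_zero(const struct blockdev_caps *dev)
{
    return dev->supports_trim && dev->trim_zeroes_data;
}

/* Pick how a filesystem could satisfy a preallocation request. */
enum prealloc_strategy choose_prealloc_strategy(const struct blockdev_caps *dev)
{
    assert(dev != NULL);

    if (trim_can_really_zero(dev))
        return PREALLOC_TRIM_REGULAR_EXTENT;
    return PREALLOC_NORMAL_FALLOCATE;
}
```

Stale data is never exposed on either path: the trim path relies on the device returning zeroes, and every other device falls back to the normal fallocate path.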

> 
> Also, if the ext4 extent size was limited it might even be possible to do
> this efficiently enough with write_same on HDD devices.
> 
> > Then the out of tree patch (for google or whoever) becomes a hack to
> > flip trim_can_really_zero on a given block device.  The rest of us can
> > use explicit interfaces from the hardware when deciding what we want
> > preallocation to mean.
> 
> This might be a bit trickier, since this would affect all zero/trim
> operations, not just ones for uninitialized data extents.

Thinking more, my guess is that google will just keep doing what they
are already doing ;)  But there could be a flag in sysfs dedicated to
trim-for-fallocate so admins can see what their devices are reporting.
It would be read-only in mainline; if someone wants to patch it in their
large data center, it wouldn't be hard.

-chris


[GIT PULL] Btrfs

2016-03-21 Thread Chris Mason
 (+5/-1)
Btrfs: change how we update the global block rsv (+20/-14)
Btrfs: fix truncate_space_check (+10/-1)

Qu Wenruo (3) commits (+68/-19):
btrfs: Introduce new mount option usebackuproot to replace recovery 
(+25/-11)
btrfs: Introduce new mount option to disable tree log replay (+40/-7)
btrfs: Introduce new mount option alias for nologreplay (+3/-1)

Byongho Lee (2) commits (+1/-6):
btrfs: simplify expression in btrfs_calc_trans_metadata_size() (+1/-2)
btrfs: remove redundant error check (+0/-4)

Anand Jain (2) commits (+22/-10):
btrfs: rename btrfs_print_info to btrfs_print_mod_info (+2/-2)
btrfs: move btrfs_compression_type to compression.h (+20/-8)

Kinglong Mee (2) commits (+18/-40):
btrfs: fix memory leak of fs_info in block group cache (+1/-6)
btrfs: drop null testing before destroy functions (+17/-34)

Deepa Dinamani (1) commits (+26/-22):
btrfs: Replace CURRENT_TIME by current_fs_time()

Arnd Bergmann (1) commits (+2/-2):
btrfs: avoid uninitialized variable warning

Dave Jones (1) commits (+3/-6):
btrfs: remove open-coded swap() in backref.c:__merge_refs

Liu Bo (1) commits (+105/-84):
Btrfs: fix lockdep deadlock warning due to dev_replace

Adam Buchbinder (1) commits (+16/-16):
btrfs: Fix misspellings in comments.

Satoru Takeuchi (1) commits (+3/-0):
Btrfs: Show a warning message if one of objectid reaches its highest value

Ashish Samant (1) commits (+6/-1):
btrfs: Print Warning only if ENOSPC_DEBUG is enabled

Sudip Mukherjee (1) commits (+1/-1):
btrfs: fix build warning

Chris Mason (1) commits (+10/-0):
btrfs: make sure we stay inside the bvec during __btrfs_lookup_bio_sums

Rasmus Villemoes (1) commits (+3/-6):
btrfs: use kbasename in btrfsic_mount

Dan Carpenter (1) commits (+1/-1):
btrfs: scrub: silence an uninitialized variable warning

Total: (82) commits (+1142/-970)

 Documentation/filesystems/btrfs.txt| 261 ++
 fs/btrfs/backref.c |  12 +-
 fs/btrfs/check-integrity.c |  12 +-
 fs/btrfs/compression.h |   9 +
 fs/btrfs/ctree.c   |  36 ++--
 fs/btrfs/ctree.h   |  87 ++---
 fs/btrfs/delayed-inode.c   |  10 +-
 fs/btrfs/delayed-ref.c |  12 +-
 fs/btrfs/dev-replace.c | 134 +++---
 fs/btrfs/dev-replace.h |   7 +-
 fs/btrfs/disk-io.c |  71 ---
 fs/btrfs/extent-tree.c |  40 ++--
 fs/btrfs/extent_io.c   |  40 ++--
 fs/btrfs/extent_io.h   |   5 +-
 fs/btrfs/extent_map.c  |   8 +-
 fs/btrfs/file-item.c   | 103 +++
 fs/btrfs/file.c| 158 +---
 fs/btrfs/inode-map.c   |   3 +
 fs/btrfs/inode.c   | 326 +++--
 fs/btrfs/ioctl.c   |  35 ++--
 fs/btrfs/ordered-data.c|   6 +-
 fs/btrfs/print-tree.c  |  23 ++-
 fs/btrfs/props.c   |   1 +
 fs/btrfs/reada.c   | 268 +--
 fs/btrfs/root-tree.c   |   2 +-
 fs/btrfs/scrub.c   |  32 ++--
 fs/btrfs/send.c|  37 ++--
 fs/btrfs/super.c   |  52 --
 fs/btrfs/tests/btrfs-tests.c   |   6 -
 fs/btrfs/tests/free-space-tree-tests.c |   1 +
 fs/btrfs/tests/inode-tests.c   |   1 +
 fs/btrfs/transaction.c |  13 +-
 fs/btrfs/tree-log.c| 102 +--
 fs/btrfs/tree-log.h|   2 +
 fs/btrfs/volumes.c |  51 +++---
 fs/btrfs/xattr.c   |  67 ---
 36 files changed, 1102 insertions(+), 931 deletions(-)


[GIT PULL] Btrfs

2016-02-19 Thread Chris Mason
Hi Linus,

My for-linus-4.5 branch has a btrfs DIO error passing fix.  I know how
much you love DIO, so I'm going to suggest against reading it.  We'll
follow up with a patch to drop the error arg from dio_end_io in the
next merge window.

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.5

Filipe Manana (1) commits (+2/-0):
Btrfs: fix direct IO requests not reporting IO error to user space

Total: (1) commits (+2/-0)

 fs/btrfs/inode.c | 2 ++
 1 file changed, 2 insertions(+)


[GIT PULL] Btrfs

2016-02-12 Thread Chris Mason
Hi Linus,

Please pull my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.5

This has a few fixes from Filipe, along with a readdir fix from Dave
that we've been testing for some time.

Filipe Manana (4) commits (+115/-68):
Btrfs: remove no longer used function extent_read_full_page_nolock() 
(+12/-42)
Btrfs: fix hang on extent buffer lock caused by the inode_paths ioctl 
(+6/-4)
Btrfs: fix page reading in extent_same ioctl leading to csum errors (+21/-8)
Btrfs: fix invalid page accesses in extent_same (dedup) ioctl (+76/-14)

David Sterba (1) commits (+16/-3):
btrfs: properly set the termination value of ctx->pos in readdir

Total: (5) commits (+131/-71)

 fs/btrfs/backref.c   |  10 ++--
 fs/btrfs/compression.c   |   6 +--
 fs/btrfs/delayed-inode.c |   3 +-
 fs/btrfs/delayed-inode.h |   2 +-
 fs/btrfs/extent_io.c |  45 +-
 fs/btrfs/extent_io.h |   3 --
 fs/btrfs/inode.c |  14 +-
 fs/btrfs/ioctl.c | 119 ++-
 8 files changed, 131 insertions(+), 71 deletions(-)


[GIT PULL] Btrfs

2015-09-25 Thread Chris Mason
Hi Linus,

My for-linus-4.3 branch has a few fixes:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.3

This is an assorted set I've been queuing up:

Jeff Mahoney tracked down a tricky one where we ended up starting IO on
the wrong mapping for special files in btrfs_evict_inode.  A few people
reported this one on the list.

Filipe found (and provided a test for) a difficult bug in reading
compressed extents, and Josef fixed up some quota record keeping with
snapshot deletion.  Chandan killed off an accounting bug during DIO that
led to WARN_ONs as we freed inodes.

Filipe Manana (3) commits (+58/-16):
Btrfs: remove unnecessary locking of cleaner_mutex to avoid deadlock (+0/-4)
Btrfs: don't initialize a space info as full to prevent ENOSPC (+1/-4)
Btrfs: fix read corruption of compressed and shared extents (+57/-8)

Josef Bacik (1) commits (+37/-2):
Btrfs: keep dropped roots in cache until transaction commit

Jeff Mahoney (1) commits (+2/-1):
btrfs: skip waiting on ordered range for special files

chandan (1) commits (+21/-23):
Btrfs: Direct I/O: Fix space accounting

Total: (6) commits (+118/-42)

 fs/btrfs/btrfs_inode.h |  2 --
 fs/btrfs/disk-io.c |  2 --
 fs/btrfs/extent-tree.c |  7 ++
 fs/btrfs/extent_io.c   | 65 +++---
 fs/btrfs/inode.c   | 45 +-
 fs/btrfs/super.c   |  2 --
 fs/btrfs/transaction.c | 32 +
 fs/btrfs/transaction.h |  5 +++-
 8 files changed, 118 insertions(+), 42 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

2015-09-16 Thread Chris Mason
On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> [cc Tejun]
> 
> On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > On Wed, Sep 16, 2015 at 04:00:12PM -0400, Chris Mason wrote:
> > > On Wed, Sep 16, 2015 at 09:58:06PM +0200, Jan Kara wrote:
> > > > On Wed 16-09-15 11:16:21, Chris Mason wrote:
> > > > > Short version, Linus' patch still gives bigger IOs and similar perf to
> > > > > Dave's original.  I should have done the blktrace runs for 60 seconds
> > > > > instead of 30, I suspect that would even out the average sizes between
> > > > > the three patches.
> > > > 
> > > > Thanks for the data Chris. So I guess we are fine with what's currently
> > > > in, right?
> > > 
> > > Looks like it works well to me.
> > 
> > Graph looks good, though I'll confirm it on my test rig once I get
> > out from under the pile of email and other stuff that is queued up
> > after being away for a week...
> 
> I ran some tests in the background while reading other email.
> 
> TL;DR: Results look really bad - not only is the plugging
> problematic, baseline writeback performance has regressed
> significantly. We need to revert the plugging changes until the
> underlying writeback performance regressions are sorted out.
> 
> In more detail, these tests were run on my usual 16p/16GB RAM
> performance test VM with storage set up as described here:
> 
> http://permalink.gmane.org/gmane.linux.kernel/1768786
> 
> The test:
> 
> $ ~/tests/fsmark-10-4-test-xfs.sh
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=131072, version=2
>          =                       sectsz=512   sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
> 
> #  ./fs_mark  -D  1  -S0  -n  1  -s  4096  -L  120  -d  /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  /mnt/scratch/6  -d  /mnt/scratch/7
> #   Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> #   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> #   Directories:  Time based hash between directories across 1 subdirectories with 180 seconds per subdirectory.
> #   File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
> #   Files info: size 4096 bytes, written with an IO size of 16384 bytes per write
> #   App overhead is time in microseconds spent in the test not doing file writing related system calls.
> 
> FSUse%  Count   Size  Files/sec  App Overhead
>      0      8   4096   106938.0        543310
>      0     16   4096   102922.7        476362
>      0     24   4096   107182.9        538206
>      0     32   4096   107871.7        619821
>      0     40   4096    99255.6        622021
>      0     48   4096   103217.8        609943
>      0     56   4096    96544.2        640988
>      0     64   4096   100347.3        676237
>      0     72   4096    87534.8        483495
>      0     80   4096    72577.5       2556920
>      0     88   4096    97569.0        646996
> 
> 

I think too many variables have changed here.

My numbers:

FSUse%  Count   Size  Files/sec  App Overhead
     0     16   4096   356407.1       1458461
     0     32   4096   368755.1       1030047
     0     48   4096   358736.8        992123
     0     64   4096   361912.5       1009566
     0     80   4096   342851.4       1004152
     0     96   4096   358357.2        996014
     0    112   4096   338025.8       1004412
 0  128 4096

Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

2015-09-17 Thread Chris Mason
On Thu, Sep 17, 2015 at 02:30:08PM +1000, Dave Chinner wrote:
> On Wed, Sep 16, 2015 at 11:48:59PM -0400, Chris Mason wrote:
> > On Thu, Sep 17, 2015 at 10:37:38AM +1000, Dave Chinner wrote:
> > > [cc Tejun]
> > > 
> > > On Thu, Sep 17, 2015 at 08:07:04AM +1000, Dave Chinner wrote:
> > > #  ./fs_mark  -D  1  -S0  -n  1  -s  4096  -L  120  -d  
> > > /mnt/scratch/0  -d  /mnt/scratch/1  -d  /mnt/scratch/2  -d  
> > > /mnt/scratch/3  -d  /mnt/scratch/4  -d  /mnt/scratch/5  -d  
> > > /mnt/scratch/6  -d  /mnt/scratch/7
> > > #   Version 3.3, 8 thread(s) starting at Thu Sep 17 08:08:36 2015
> > > #   Sync method: NO SYNC: Test does not issue sync() or fsync() calls.
> > > #   Directories:  Time based hash between directories across 1 
> > > subdirectories with 180 seconds per subdirectory.
> > > #   File names: 40 bytes long, (16 initial bytes of time stamp with 
> > > 24 random bytes at end of name)
> > > #   Files info: size 4096 bytes, written with an IO size of 16384 
> > > bytes per write
> > > #   App overhead is time in microseconds spent in the test not doing 
> > > file writing related system calls.
> > > 
> > > FSUse%  Count   Size  Files/sec  App Overhead
> > >      0      8   4096   106938.0        543310
> > >      0     16   4096   102922.7        476362
> > >      0     24   4096   107182.9        538206
> > >      0     32   4096   107871.7        619821
> > >      0     40   4096    99255.6        622021
> > >      0     48   4096   103217.8        609943
> > >      0     56   4096    96544.2        640988
> > >      0     64   4096   100347.3        676237
> > >      0     72   4096    87534.8        483495
> > >      0     80   4096    72577.5       2556920
> > >      0     88   4096    97569.0        646996
> > > 
> > > 
> > 
> > I think too many variables have changed here.
> > 
> > My numbers:
> > 
> > FSUse%  Count   Size  Files/sec  App Overhead
> >      0     16   4096   356407.1       1458461
> >      0     32   4096   368755.1       1030047
> >      0     48   4096   358736.8        992123
> >      0     64   4096   361912.5       1009566
> >      0     80   4096   342851.4       1004152
> 
> 
> 
> > I can push the dirty threshold lower to try and make sure we end up in
> > the hard dirty limits but none of this is going to be related to the
> > plugging patch.
> 
> The point of this test is to drive writeback as hard as possible,
> not to measure how fast we can create files in memory.  i.e. if the
> test isn't pushing the dirty limits on your machines, then it really
> isn't putting a meaningful load on writeback, and so the plugging
> won't make significant difference because writeback isn't IO
> bound

It does end up IO bound on my rig, just because we do eventually hit the
dirty limits.  Otherwise there would be zero benefits in fs_mark from
any patches vs plain v4.2.

But I set up a run last night with dirty_bytes at 3G and
dirty_background_bytes at 1.5G.

There is definitely variation, but nothing like what you saw:

FSUse%  Count   Size  Files/sec  App Overhead
     0     16   4096   317427.9       1524951
     0     32   4096   319723.9       1023874
     0     48   4096   336696.4       1053884
     0     64   4096   257113.1       1190851
     0     80   4096   257644.2       1198054
     0     96   4096   254896.6       1225610
     0    112   4096   241052.6       1203227
     0    128   4096   214961.2       1386236
     0    144   4096   239985.7       1264659
     0    160   4096   232174.3       1310018
     0    176   4096   250477.9       1227289
     0    192   4096   221500.9       1276223
     0    208   4096   235212.1       1284989
     0    224   4096   238580.2       1257260
     0    240   4096   224182.6       1326821
     0    256   4096   234628.7       1236402
     0    272   4096   244675.3       1228400
     0    288   4096   234364.0       1268408
 0   

Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

2015-09-17 Thread Chris Mason
On Thu, Sep 17, 2015 at 12:39:51PM -0700, Linus Torvalds wrote:
> On Wed, Sep 16, 2015 at 7:14 PM, Dave Chinner  wrote:
> >>
> >> Dave, if you're testing my current -git, the other performance issue
> >> might still be the spinlock thing.
> >
> > I have the fix as the first commit in my local tree - it'll remain
> > there until I get a conflict after an update. :)
> 
> Ok. I'm happy to report that you should get a conflict now, and that
> the spinlock code should work well for your virtualized case again.
> 
> No updates on the plugging thing yet, I'll wait a bit and follow this
> thread and see if somebody comes up with any explanations or theories
> in the hope that we might not need to revert (or at least have a more
> targeted change).

Playing around with the plug a little, most of the unplugs are coming
from the cond_resched_lock().  Not really sure why we are doing the
cond_resched() there, we should be doing it before we retake the lock
instead.

This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
files/sec up to 213K.  Average IO size is the same as 4.3-rc1.

It probably won't help Dave, since most of his unplugs should have been
from the cond_resched_lock() too.

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 587ac08..05ed541 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -1481,6 +1481,19 @@ static long writeback_sb_inodes(struct super_block *sb,
 		wbc_detach_inode(&wbc);
 		work->nr_pages -= write_chunk - wbc.nr_to_write;
 		wrote += write_chunk - wbc.nr_to_write;
+
+		if (need_resched()) {
+			/*
+			 * we're plugged and don't want to hand off to kblockd
+			 * for the actual unplug work.  But we do want to
+			 * reschedule.  So flush our plug and then
+			 * schedule away
+			 */
+			blk_flush_plug(current);
+			cond_resched();
+		}
+
+
 		spin_lock(&wb->list_lock);
 		spin_lock(&inode->i_lock);
 		if (!(inode->i_state & I_DIRTY_ALL))
@@ -1488,7 +1501,7 @@ static long writeback_sb_inodes(struct super_block *sb,
 		requeue_inode(inode, wb, &wbc);
 		inode_sync_complete(inode);
 		spin_unlock(&inode->i_lock);
-		cond_resched_lock(&wb->list_lock);
+
 		/*
 		 * bail out to wb_writeback() often enough to check
 		 * background threshold and other termination conditions.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

2015-09-17 Thread Chris Mason
On Thu, Sep 17, 2015 at 04:08:19PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 3:42 PM, Chris Mason  wrote:
> >
> > Playing around with the plug a little, most of the unplugs are coming
> > from the cond_resched_lock().  Not really sure why we are doing the
> > cond_resched() there, we should be doing it before we retake the lock
> > instead.
> >
> > This patch takes my box (with dirty thresholds at 1.5GB/3GB) from 195K
> > files/sec up to 213K.  Average IO size is the same as 4.3-rc1.
> 
> Ok, so at least for you, part of the problem really ends up being that
> there's a mix of the "synchronous" unplugging (by the actual explicit
> "blk_finish_plug(&plug);") and the writeback that is handed off to
> kblockd_workqueue.
> 
> I'm not seeing why that should be an issue. Sure, there's some CPU
> overhead to context switching, but I don't see that it should be that
> big of a deal.
> 
> I wonder if there is something more serious wrong with the kblockd_workqueue.

I'm driving the box pretty hard, it's right on the line between CPU
bound and IO bound.  So I've got 32 fs_mark processes banging away and
32 CPUs (16 really, with hyperthreading).

They are popping in and out of balance_dirty_pages() so I have high CPU
utilization alternating with high IO wait times.  There are no reads at
all, so all of these waits are for buffered writes.

People in balance_dirty_pages are indirectly waiting on the unplug, so
maybe the context switch overhead on a loaded box is enough to explain
it.  We've definitely gotten more than 9% by inlining small synchronous
items in btrfs in the past, but those were more explicitly synchronous.

I know it's painfully hand wavy.  I don't see any other users of the
kblockd workqueues, and the perf profiles don't jump out at me.  I'll
feel better about the patch if Dave confirms any gains.

-chris



Re: [PATCH] fs-writeback: drop wb->list_lock during blk_finish_plug()

2015-09-18 Thread Chris Mason
On Thu, Sep 17, 2015 at 11:04:03PM -0700, Linus Torvalds wrote:
> On Thu, Sep 17, 2015 at 10:40 PM, Dave Chinner  wrote:
> >
> > Ok, makes sense - the plug is not being flushed as we switch away,
> > but Chris' patch makes it do that.
> 
> Yup.

Huh, that does make much more sense, thanks Linus.  I'm wondering where
else I've assumed that cond_resched() unplugged.

> 
> And I actually think Chris' patch is better than the one I sent out
> (but maybe the scheduler people should take a look at the behavior of
> cond_resched()), I just wanted you to test that to verify the
> behavior.

Ok, I'll fix up the description and comments and send out.

-chris
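
The exchange above turns on one point: a voluntary cond_resched() does not
flush the task's block plug, so queued IO can sit behind a task that has
given up the CPU.  The following is only a toy user-space model of that
behavior, not kernel code.  Plug, writeback_loop and the resched_every
parameter are hypothetical names made up for illustration, and "every Nth
request" stands in for need_resched().

```python
class Plug:
    """Toy stand-in for a per-task block plug: batches requests until flushed."""

    def __init__(self):
        self.pending = []    # requests held back while plugged
        self.submitted = []  # batches actually handed to the "device"

    def queue(self, request):
        self.pending.append(request)

    def flush(self):
        # Models an explicit blk_flush_plug(): submit everything as one batch.
        if self.pending:
            self.submitted.append(list(self.pending))
            self.pending.clear()


def cond_resched(plug):
    """Voluntary preemption point: in this model it does NOT touch the plug."""
    return None


def blocking_sleep(plug):
    """Task really goes to sleep: in this model the plug is flushed first."""
    plug.flush()


def writeback_loop(plug, requests, resched_every, flush_before_resched):
    """Queue requests; every `resched_every` requests pretend need_resched() is true.

    Returns how many times the task yielded the CPU with IO still held back.
    """
    stalled = 0
    for i, request in enumerate(requests, start=1):
        plug.queue(request)
        if i % resched_every == 0:       # stand-in for need_resched()
            if flush_before_resched:
                plug.flush()             # the patch: flush, then yield
            if plug.pending:
                stalled += 1
            cond_resched(plug)
    return stalled
```

With flush_before_resched=False the loop yields with requests still pending
and nothing submitted; with True each yield is preceded by one batched
submission, which is the shape of the change in the patch.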


Re: ext4: performance regression introduced by the cgroup writeback support

2015-09-23 Thread Chris Mason
On Wed, Sep 23, 2015 at 01:49:31PM +0000, Dexuan Cui wrote:
> Hi all,
> Since some point between July and Sep, I have been suffering from a strange 
> "very slow write" issue and on Sep 9 I reported it to LKML (but got no 
> reply): https://lkml.org/lkml/2015/9/9/290
> 
> The issue is: under high CPU and disk I/O pressure, *some* processes can 
> suffer from a very slow write speed (e.g., <1MB/s or even only 20KB/s), while 
> the normal write speed should be at least dozens of MB/s.
> 
> I think I identified the commit which introduced the regression:
> ext4: implement cgroup writeback support 
> (https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=001e4a8775f6e8ad52a89e0072f09aee47d5d252)
> 
> This commit is already in the mainline tree, so I can reproduce the issue 
> there too:
> With the latest mainline,  I can reproduce the issue; after I revert the 
> patch, I can't reproduce the issue.
> 
> When the issue happens:
> 1. the read speed is pretty normal, e.g., it's still >100MB/s.
> 2. 'top' shows both the 'user' and 'sys' utilization is about 0%, but the 
> IO-wait is always about 100%.
> 3. 'iotop' shows the read speed is 0 (this is correct because there is indeed 
> no read request)  and the write speed is pretty slow (the average is <1MB/s 
> or even 20KB/s).
> 4. when the issue happens, sometimes any new process suffers from the slow 
> write issue, but sometimes it looks like not all new processes suffer from 
> the issue.
> 5. The " WARNING: CPU: 7 PID: 6782 at fs/inode.c:390 ihold+0x30/0x40() " in 
> my Sep-9 mail may be a separate issue.
> 6. To reproduce the issue, I need to run my workload for a long enough time 
> (see below).
> 
> My workload is simple: I just repeatedly build the kernel source ("make 
> clean; make -j16"). My kernel config is attached FYI.
> 
> I can reproduce the issue on a physical machine: e.g., in my kernel building 
> test with my .config, it took only ~5 minutes in the first 176 runs, but 
> since the 177th run, it could take from 10 hours to 5 minutes - very unstable.
> 
> It looks like it's easier to reproduce the issue in a Hyper-V VM: usually I can 
> reproduce the issue within the first 10 or 20 runs.
> 
> Any idea?

Are you using cgroups?  That patch really shouldn't impact load unless
there are actual IO controls in place.

-chris



Re: [PATCH 2/2] ext4: implement cgroup writeback support

2015-09-23 Thread Chris Mason
On Wed, Sep 23, 2015 at 03:49:12PM +0300, Artem Bityutskiy wrote:
> On Tue, 2015-07-21 at 23:56 -0400, Theodore Ts'o wrote:
> > > v2: Updated for MS_CGROUPWB -> SB_I_CGROUPWB.
> > > 
> > > Signed-off-by: Tejun Heo 
> > > Cc: "Theodore Ts'o" 
> > > Cc: Andreas Dilger 
> > > Cc: linux-e...@vger.kernel.org
> > 
> > Thanks, applied.
> 
> Hi, this patch introduces a regression - a major one, I'd say.
> 
> Symptoms: copy a bunch of files, run sync, then run 'reboot', and after
> you boot up the copied files are corrupted. So basically the
> user-visible symptom is that 'sync' does not work.

Hi Artem,

Are you doing a hard shutdown (reboot -nf)?  If you're doing a friendly
shutdown, is the FS unmounting cleanly?

> 
> It took quite an effort to bisect it, but it led me to this patch.

I bet it was a long bisect.  Trying to see if the same patch to btrfs
has similar impacts.

-chris


Re: [PATCH 2/2] ext4: implement cgroup writeback support

2015-09-23 Thread Chris Mason
On Wed, Sep 23, 2015 at 08:41:25PM +0300, Artem Bityutskiy wrote:
>Hi
> 
>$ sync
>$ reboot

If this is the case, it should be possible to reproduce with:

cp a bunch of stuff to /ext4
unmount /ext4
mount ext4
compare data

If you're not getting a clean unmount of the test FS during the reboot,
it's a different test.  Trying to reproduce here, so far it's clean.
Could you please double check for failed unmounts?

Thanks,
Chris
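
The "compare data" step in the procedure above can be done by checksumming
the source tree before the copy and the copied tree after the remount.  This
is a rough sketch under stated assumptions: tree_digests, changed_files and
demo are hypothetical helpers, and demo() only exercises the comparison on
temporary directories.  It does not mount or unmount anything.

```python
import hashlib
import os
import shutil
import tempfile


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB chunks so large files are not loaded whole.
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            digests[os.path.relpath(path, root)] = h.hexdigest()
    return digests


def changed_files(before, after):
    """Return sorted relative paths that are missing, new, or differ in content."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


def demo():
    """Exercise the comparison on temporary directories (no mount involved)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        dst = os.path.join(tmp, "dst")
        os.mkdir(src)
        for name, data in (("a.bin", b"A" * 4096), ("b.bin", b"B" * 4096)):
            with open(os.path.join(src, name), "wb") as f:
                f.write(data)
        shutil.copytree(src, dst)
        expected = tree_digests(src)
        clean = changed_files(expected, tree_digests(dst))
        # Simulate the reported symptom: a copied file comes back with bad data.
        with open(os.path.join(dst, "b.bin"), "wb") as f:
            f.write(b"\0" * 4096)
        corrupted = changed_files(expected, tree_digests(dst))
        return clean, corrupted
```

In a real run the digests of the source would be recorded first, the files
copied to the ext4 mount, the filesystem unmounted and mounted again by
hand, and tree_digests() then run on the mount point.  An empty list from
changed_files() means the data survived the remount.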


[GIT PULL] Btrfs

2015-07-31 Thread Chris Mason
Hi Linus,

Please pull the fixes from my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.2

Filipe fixed up a hard to trigger ENOSPC regression from our merge
window pull, and we have a few other smaller fixes.

Zhao Lei (2) commits (+4/-2):
btrfs: Avoid NULL pointer dereference of free_extent_buffer when 
read_tree_block() fail (+2/-1)
btrfs: Fix lockdep warning of btrfs_run_delayed_iputs() (+2/-1)

Anand Jain (1) commits (+1/-1):
btrfs: its btrfs_err() instead of btrfs_error()

Filipe Manana (1) commits (+18/-0):
Btrfs: fix quick exhaustion of the system array in the superblock

Total: (4) commits (+23/-3)

 fs/btrfs/dev-replace.c |  2 +-
 fs/btrfs/disk-io.c |  3 ++-
 fs/btrfs/extent-tree.c | 18 ++
 fs/btrfs/transaction.c |  3 ++-
 4 files changed, 23 insertions(+), 3 deletions(-)



[GIT PULL] Btrfs

2015-06-29 Thread Chris Mason
Hi Linus,

Please pull my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.2

Outside of our usual batch of fixes, this integrates the subvolume quota
updates that Qu Wenruo from Fujitsu has been working on for a few
releases now.  He gets an extra gold star for making btrfs smaller this
time, and fixing a number of quota corners in the process.

Dave Sterba tested and integrated Anand Jain's sysfs improvements.
Outside of exporting a symbol (ack'd by Greg) these are all internal to
btrfs and it's mostly cleanups and fixes.  Anand also attached some of
our sysfs objects to our internal device management structs instead of
an object off the super block.  It will make device management easier
overall and it's a better fit for how the sysfs files are used.  None of
the existing sysfs files are moved around.

Thanks for all the fixes everyone:

Anand Jain (28) commits (+304/-115):
Btrfs: sysfs: move super_kobj and device_dir_kobj from fs_info to 
btrfs_fs_devices (+56/-43)
Btrfs: sysfs: fix, btrfs_release_super_kobj() should to clean up the 
kobject data (+2/-0)
Btrfs: sysfs: introduce function btrfs_sysfs_add_fsid() to create sysfs 
fsid (+14/-1)
Btrfs: sysfs: fix, fs_info kobject_unregister has init_completion() twice 
(+0/-1)
Btrfs: sysfs btrfs_kobj_rm_device() pass fs_devices instead of fs_info 
(+10/-10)
Btrfs: sysfs: rename __btrfs_sysfs_remove_one to btrfs_sysfs_remove_fsid 
(+4/-4)
Btrfs: sysfs: fix, kobject pointer clean up needed after kobject release 
(+1/-0)
Btrfs: sysfs btrfs_kobj_add_device() pass fs_devices instead of fs_info 
(+6/-7)
Btrfs: sysfs: don't fail seeding for the sake of sysfs kobject issue (+1/-1)
Btrfc: sysfs: fix, check if device_dir_kobj is init before destroy (+6/-4)
Btrfs: sysfs: provide framework to remove all fsid sysfs kobject (+16/-1)
Btrfs: sysfs: separate device kobject and its attribute creation (+15/-6)
Btrfs: sysfs: add support to show replacing target in the sysfs (+7/-1)
Btrfs: check error before reporting missing device and add uuid (+2/-1)
Btrfs: sysfs: add pointer to access fs_info from fs_devices (+25/-0)
Btrfs: sysfs: btrfs_sysfs_remove_fsid() make it non static (+2/-1)
Btrfs: sysfs: let default_attrs be separate from the kset (+8/-4)
Btrfs: sysfs: separate kobject and attribute creation (+19/-14)
Btrfs: sysfs: make btrfs_sysfs_add_device() non static (+1/-0)
Btrfs: sysfs: make btrfs_sysfs_add_fsid() non static (+3/-1)
Btrfs: introduce btrfs_get_fs_uuids to get fs_uuids (+5/-0)
Btrfs: Check if kobject is initialized before put (+5/-3)
Btrfs: sysfs: add support to add parent for fsid (+2/-2)
Btrfs: sysfs: reorder the kobject creations (+13/-10)
Btrfs: sysfs: fix, undo sysfs device links (+17/-0)
Btrfs: log when missing device is created (+2/-0)
lib: export symbol kobject_move() (+1/-0)
Btrfs: free the stale device (+61/-0)

Qu Wenruo (19) commits (+879/-1542):
btrfs: extent-tree: Use ref_node to replace unneeded parameters in 
__inc_extent_ref() and __free_extent() (+21/-21)
btrfs: qgroup: Make snapshot accounting work with new extent-oriented 
(+33/-20)
btrfs: qgroup: Add the ability to skip given qgroup for old/new_roots. 
(+40/-0)
btrfs: qgroup: Switch self test to extent-oriented qgroup mechanism. 
(+89/-27)
btrfs: delayed-ref: Use list to replace the ref_root in ref_head. 
(+114/-123)
btrfs: qgroup: Cleanup open-coded old/new_refcnt update and read. (+54/-41)
btrfs: qgroup: Switch to new extent-oriented qgroup mechanism. (+28/-100)
btrfs: qgroup: Record possible quota-related extent for qgroup. (+95/-7)
btrfs: backref: Don't merge refs which are not for same block. (+3/-3)
btrfs: qgroup: Cleanup the old ref_node-oriented mechanism. (+3/-972)
btrfs: backref: Add special time_seq == (u64)-1 case for (+29/-6)
btrfs: qgroup: Add function qgroup_update_counters(). (+120/-0)
btrfs: qgroup: Add new function to record old_roots. (+29/-0)
btrfs: delayed-ref: Cleanup the unneeded functions. (+0/-174)
btrfs: qgroup: Add new qgroup calculation function (+118/-0)
btrfs: qgroup: Add function qgroup_update_refcnt(). (+58/-0)
btrfs: qgroup: Switch rescan to new mechanism. (+7/-36)
btrfs: ulist: Add ulist_del() function. (+37/-11)
btrfs: Fix superblock csum type check. (+1/-1)

Filipe Manana (14) commits (+340/-76):
Btrfs: incremental send, check if orphanized dir inode needs delayed rename 
(+37/-19)
Btrfs: fix necessary chunk tree space calculation when allocating a chunk 
(+7/-12)
Btrfs: wake up extent state waiters on unlock through clear_extent_bits 
(+6/-1)
Btrfs: incremental send, fix clone operations for compressed extents 
(+17/-1)
Btrfs: incremental send, don't delay directory renames unnecessarily 
(+46/-2)
Btrfs: fix chunk allocation regression leading to transaction abort (+19/-3)
Btrfs: fix mute

linux-next conflict resolution branch for btrfs

2015-08-20 Thread Chris Mason
Hi Stephen,

There are a few conflicts for btrfs in linux-next this time.  They are
small, but I pushed out the merge commit I'm using here:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next-merge

-chris



Re: linux-next conflict resolution branch for btrfs

2015-08-21 Thread Chris Mason
On Fri, Aug 21, 2015 at 10:45:24AM +1000, Stephen Rothwell wrote:
> Hi Chris,
> 
> On Thu, 20 Aug 2015 13:39:18 -0400 Chris Mason  wrote:
> >
> > There are a few conflicts for btrfs in linux-next this time.  They are
> > small, but I pushed out the merge commit I'm using here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> > next-merge
> 
> Thanks for that.  It seems to have merged OK but maybe it conflicts
> with something later in linux-next.  Unfortunately see my other email
> about a build problem.  I will keep this example merge in mind for
> later.

Ok, sorry about that one.  We probably want the ifdefs up in Tejun's
code, but I'll talk with him about it today and get it fixed up.

Thanks,
Chris


Re: linux-next conflict resolution branch for btrfs

2015-08-21 Thread Chris Mason
On Fri, Aug 21, 2015 at 10:45:24AM +1000, Stephen Rothwell wrote:
> Hi Chris,
> 
> On Thu, 20 Aug 2015 13:39:18 -0400 Chris Mason  wrote:
> >
> > There are a few conflicts for btrfs in linux-next this time.  They are
> > small, but I pushed out the merge commit I'm using here:
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> > next-merge
> 
> Thanks for that.  It seems to have merged OK but maybe it conflicts
> with something later in linux-next.  Unfortunately see my other email
> about a build problem.  I will keep this example merge in mind for
> later.

Ok, I put the ifdefs in btrfs.  Really what I need to do is change
bio_clone to do this work, but that means making sure it's the right
thing for dm/md first.

I also added ifdefs for bio->bi_ioc in fs/btrfs/volumes.c, but
another commit in linux-next actually deletes the whole function from
btrfs.  I've redone the example merge:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next-merge

-chris


[GIT PULL] Btrfs

2015-07-10 Thread Chris Mason
Hi Linus,

Please pull my for-linus-4.2 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.2

This is an assortment of fixes.  Most of the commits are from Filipe
(fsync, the inode allocation cache and a few others).  Mark kicked in a
series fixing corners in the extent sharing ioctls, and everyone else
fixed up assorted other problems.

Filipe Manana (9) commits (+375/-36):
Btrfs: fix race between caching kthread and returning inode to inode cache 
(+11/-4)
Btrfs: fix memory corruption on failure to submit bio for direct IO 
(+52/-18)
Btrfs: fix crash on close_ctree() if cleaner starts new transaction (+29/-0)
Btrfs: fix fsync after truncate when no_holes feature is enabled (+108/-0)
Btrfs: fix race between balance and unused block group deletion (+58/-6)
Btrfs: fix a comment in inode.c:evict_inode_truncate_pages() (+3/-2)
Btrfs: use kmem_cache_free when freeing entry in inode cache (+1/-1)
Btrfs: fix fsync xattr loss in the fast fsync path (+104/-0)
Btrfs: fix fsync data loss after append write (+9/-5)

Mark Fasheh (4) commits (+193/-58):
btrfs: fix deadlock with extent-same and readpage (+117/-31)
btrfs: don't update mtime/ctime on deduped inodes (+14/-10)
btrfs: pass unaligned length to btrfs_cmp_data() (+2/-1)
btrfs: allow dedupe of same inode (+60/-16)

Liu Bo (2) commits (+15/-6):
Btrfs: fix hang when failing to submit bio of directIO (+0/-3)
Btrfs: fix warning of bytes_may_use (+15/-3)

Zhao Lei (2) commits (+21/-20):
btrfs: cleanup noused initialization of dev in btrfs_end_bio() (+1/-1)
btrfs: add error handling for scrub_workers_get() (+20/-19)

Yang Dongsheng (1) commits (+41/-8):
btrfs: qgroup: allow user to clear the limitation on qgroup

Shilong Wang (1) commits (+1/-1):
Btrfs: fix wrong check for btrfs_force_chunk_alloc()

Total: (19) commits (+646/-129)

 fs/btrfs/btrfs_inode.h  |   2 +
 fs/btrfs/ctree.h|   1 +
 fs/btrfs/disk-io.c  |  41 +++-
 fs/btrfs/extent-tree.c  |   3 +
 fs/btrfs/inode-map.c|  17 +++-
 fs/btrfs/inode.c|  89 --
 fs/btrfs/ioctl.c| 241 +---
 fs/btrfs/ordered-data.c |   5 +
 fs/btrfs/qgroup.c   |  49 --
 fs/btrfs/relocation.c   |   2 +-
 fs/btrfs/scrub.c|  39 
 fs/btrfs/tree-log.c | 226 -
 fs/btrfs/volumes.c  |  50 --
 13 files changed, 641 insertions(+), 124 deletions(-)


Re: new oops in 4.4.0-rc4

2015-12-11 Thread Chris Mason
On Thu, Dec 10, 2015 at 10:36:17AM -0600, Jon Christopherson wrote:
> Hello,
> 
> I noticed this new oops since running 4.4.0-rc4. Happens shortly after boot
> and pretty much kills the system:
> 
> > [  177.774250] ------------[ cut here ]------------
> >[  177.774256] kernel BUG at /data0/Source/mainline/mm/page-writeback.c:2654!
> >[  177.774258] invalid opcode:  [#1] SMP
> >[  177.774261] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE 
> >nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 
> >nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp 
> >bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables 
> >iptable_filter ip_tables x_tables ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad 
> >ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm 
> >bnep nfsd auth_rpcgss nfs_acl binfmt_misc nfs lockd grace sunrpc fscache xfs 
> >libcrc32c snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi 
> >nvidia_modeset(POE) eeepc_wmi mxm_wmi asus_wmi sparse_keymap intel_rapl 
> >iosf_mbi x86_pkg_temp_thermal intel_powerclamp dm_multipath nvidia(POE) 
> >btusb kvm_intel btrtl nls_iso8859_1 kvm btbcm irqbypass snd_hd
> a_intel wl(POE) btintel hid_logitech_hidpp joydev bluetooth serio_raw 
> snd_hda_codec snd_hda_core snd_seq_midi cfg80211 snd_seq_midi_event snd_hwdep 
> snd_rawmidi lpc_ich snd_pcm drm snd_seq snd_seq_dev
> ice snd_timer 8250_fintek snd mei_me mei soundcore wmi mac_hid parport_pc 
> shpchp ppdev msr nct6775 hwmon_vid coretemp lp parport btrfs xor raid6_pq 
> drbg ansi_cprng dm_crypt dm_mirror dm_region_hash dm_log hid_generic 
> hid_logitech_dj usbhid hid crct10dif_pclmul crc32_pclmul ahci aesni_intel 
> aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse libahci video
> >[  177.774357] CPU: 5 PID: 5158 Comm: thunderbird Tainted: PW  OE   
> >4.4.0-121-generic #201512100930
> >[  177.774360] Hardware name: System manufacturer System Product Name/P8P67 
> >DELUXE, BIOS 3602 10/31/2012
> >[  177.774362] task: 88040b6d ti: 8803af864000 task.ti: 
> >8803af864000
> >[  177.774364] RIP: 0010:[]  [] 
> >clear_page_dirty_for_io+0xe1/0x1a0

Dave Jones sent in a report about this with trinity too, I'm digging in
today.  Since you can trigger this reliably, what was the last
known-good kernel for you?

-chris


[PATCH] lock_page() doesn't lock if __wait_on_bit_lock returns -EINTR

2015-12-12 Thread Chris Mason
We have two reports of frequent crashes in btrfs where asserts in
clear_page_dirty_for_io() were triggering on missing page locks.

The crashes were much easier to trigger when processes were catching
ctrl-c's, and after much debugging it really looked like lock_page was a
noop.

This recent commit looks pretty suspect to me, and I confirmed that we
were exiting __wait_on_bit_lock() with -EINTR when it was called with
TASK_UNINTERRUPTIBLE.

commit 68985633bccb6066bf1803e316fbc6c1f5b796d6
Author: Peter Zijlstra 
Date:   Tue Dec 1 14:04:04 2015 +0100

sched/wait: Fix signal handling in bit wait helpers

The patch below is mostly untested, and probably not the right solution.
Dave's trinity run doesn't explode immediately anymore, and I wanted to
get this out for discussion.  A quick look on the list doesn't show
anyone else has tracked this down, sorry if it's a dup.

Reported-by: Dave Jones
Reported-by: Jon Christopherson 
Signed-off-by: Chris Mason 

diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c
index f10bd87..12f69df 100644
--- a/kernel/sched/wait.c
+++ b/kernel/sched/wait.c
@@ -434,6 +434,8 @@ __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
 		ret = action(&q->key);
 		if (!ret)
 			continue;
+		if (ret == -EINTR && mode == TASK_UNINTERRUPTIBLE)
+			continue;
 		abort_exclusive_wait(wq, &q->wait, mode, &q->key);
 		return ret;
 	} while (test_and_set_bit(q->key.bit_nr, q->key.flags));
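
To make the failure mode concrete, here is a small user-space model of the
retry loop touched by the patch.  It is only a sketch of the control flow,
not the kernel implementation: Page, make_action and the `patched` switch
are hypothetical, and the scripted action results stand in for a signal
arriving while another task holds the page lock.

```python
EINTR = 4
TASK_INTERRUPTIBLE = 1
TASK_UNINTERRUPTIBLE = 2


class Page:
    def __init__(self):
        self.locked = True  # someone else holds the page lock at the start


def make_action(page, results):
    """Build a wait action: each call returns the next scripted result.

    A result of 0 means "the holder unlocked the page and woke us";
    a negative result means "woken by a signal, page still locked".
    """
    calls = []

    def action():
        ret = results[len(calls)]
        calls.append(ret)
        if ret == 0:
            page.locked = False
        return ret

    return action, calls


def wait_on_bit_lock(page, action, mode, patched):
    """Return 0 only after this caller has set page.locked itself."""
    while True:
        if page.locked:
            ret = action()
            if ret != 0:
                retry = (patched and ret == -EINTR
                         and mode == TASK_UNINTERRUPTIBLE)
                if not retry:
                    return ret  # caller does NOT own the lock here
        # Stand-in for test_and_set_bit(): take the lock if it was free.
        was_locked = page.locked
        page.locked = True
        if not was_locked:
            return 0
```

Without the extra check, an uninterruptible waiter gets -EINTR back while
the other holder still owns the bit.  A caller that ignores the return
value then carries on as if it held the lock, which matches the subject
line of this mail.  With the check, the same scripted sequence waits again
and returns 0 only once the bit has really been taken.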


[GIT PULL] Btrfs

2016-01-29 Thread Chris Mason
Hi Linus,

We have some fixes queued up in my for-linus-4.5 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.5

Dave had a small collection of fixes to the new free space tree code,
one of which was keeping our sysfs files more up to date with feature
bits as different things get enabled (lzo, raid5/6, etc).

I should have kept the sysfs stuff for rc3, since we always manage to
trip over something.  This time it was GFP_KERNEL from somewhere that is
NOFS only. Instead of rebasing it out I've put a revert in, and we'll
fix it properly for rc3.

Otherwise, Filipe fixed a btrfs DIO race and Qu Wenruo fixed up a
use-after-free in our tracepoints that Dave Jones reported.

David Sterba (10) commits (+90/-20):
btrfs: sysfs: check initialization state before updating features (+3/-0)
btrfs: sysfs: introduce helper for syncing bits with sysfs files (+33/-0)
btrfs: synchronize incompat feature bits with sysfs files (+17/-0)
btrfs: sysfs: fix typo in compat_ro attribute definition (+1/-1)
Revert "btrfs: clear PF_NOFREEZE in cleaner_kthread()" (+0/-1)
btrfs: add free space tree to the cow-only list (+2/-1)
btrfs: tweak free space tree bitmap allocation (+16/-2)
btrfs: sysfs: add free-space-tree bit attribute (+2/-0)
btrfs: add free space tree to lockdep classes (+1/-0)
btrfs: tests: switch to GFP_KERNEL (+15/-15)

Chris Mason (2) commits (+1/-18):
Revert "btrfs: synchronize incompat feature bits with sysfs files" (+0/-17)
btrfs: don't use GFP_HIGHMEM for free-space-tree bitmap kzalloc (+1/-1)

Filipe Manana (1) commits (+39/-11):
Btrfs: fix race between fsync and lockless direct IO writes

Qu Wenruo (1) commits (+1/-1):
btrfs: async-thread: Fix a use-after-free error for trace

Total: (14) commits (+131/-50)

 fs/btrfs/async-thread.c          |  2
 fs/btrfs/disk-io.c               |  2
 fs/btrfs/free-space-tree.c       | 18
 fs/btrfs/inode.c                 | 36
 fs/btrfs/relocation.c            |  3
 fs/btrfs/sysfs.c                 | 35
 fs/btrfs/sysfs.h                 |  5
 fs/btrfs/tests/btrfs-tests.c     | 10
 fs/btrfs/tests/extent-io-tests.c | 12
 fs/btrfs/tests/inode-tests.c     |  8
 fs/btrfs/tree-log.c              | 14
 11 files changed, 113 insertions(+), 32 deletions(-)


[GIT PULL] Btrfs

2015-12-18 Thread Chris Mason
Hi Linus,

A couple of small fixes in my for-linus-4.4 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.4

Chris Mason (2) commits (+19/-7):
Btrfs: check for empty bitmap list in setup_cluster_bitmaps (+5/-3)
Btrfs: check prepare_uptodate_page() error code earlier (+14/-4)

Filipe Manana (2) commits (+9/-7):
Btrfs: fix unprotected list move from unused_bgs to deleted_bgs list (+8/-5)
Btrfs: fix transaction handle leak in balance (+1/-2)

Holger Hoffstätte (1) commits (+1/-1):
btrfs: fix misleading warning when space cache failed to load

Total: (5) commits (+29/-15)

 fs/btrfs/extent-tree.c      | 10
 fs/btrfs/file.c             | 18
 fs/btrfs/free-space-cache.c | 10
 fs/btrfs/transaction.c      |  1
 fs/btrfs/transaction.h      |  2
 fs/btrfs/volumes.c          |  3
 6 files changed, 29 insertions(+), 15 deletions(-)


Re: [PATCH v4 0/9] Update to zstd-1.4.6

2020-10-02 Thread Chris Mason

On 2 Oct 2020, at 2:54, Christoph Hellwig wrote:

> On Wed, Sep 30, 2020 at 08:05:45PM +0000, Nick Terrell wrote:
>> On Sep 29, 2020, at 11:53 PM, Christoph Hellwig wrote:
>>
>>> As you keep resend this I keep retelling you that should not do it.
>>> Please provide a proper Linux API, and switch to that.  Versioned APIs
>>> have absolutely no business in the Linux kernel.
>>
>> The API is not versioned. We provide a stable ABI for a large section of our API,
>> and the parts that aren’t ABI stable don’t change in semantics, and undergo long
>> deprecation periods before being removed.
>>
>> The change of callers is a one-time change to transition from the existing API
>> in the kernel, which was never upstream's API, to upstream's API.
>
> Again, please transition it to a sane kernel API.  We don't have an
> "upstream" in this case.


The upstream is the zstd project where all this code originates, and 
where the active development takes place.  As Eric Biggers pointed out, 
it also receives a lot of Q/A separate from the kernel.  I think we gain 
a great deal by leveraging the testing and documentation of the zstd 
project in the kernel interfaces we use.


We lose some consistency with the kernel coding style, but we gain the 
ability to search for docs, issues, and fixes directly against the zstd 
project and git repo.


-chris


Re: [PATCH 10/12] btrfs: flag files as supporting buffered async reads

2020-05-26 Thread Chris Mason
On 26 May 2020, at 15:51, Jens Axboe wrote:

> btrfs uses generic_file_read_iter(), which already supports this.
>
> Signed-off-by: Jens Axboe 

Really looking forward to this!

Acked-by: Chris Mason 


[PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-23 Thread Chris Mason

Hi everyone,

We’re validating a new kernel in the fleet, and compared with v5.2, 
performance is ~2-3% lower for some of our workloads.  After some 
digging, Johannes found that our involuntary context switch rate was ~2x 
higher, and we were leaving a CPU idle a higher percentage of the time, 
even though the workload was trying to saturate the system.


We were able to reproduce the problem with schbench, and Johannes 
bisected down to:


commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
Author: Vincent Guittot 
Date:   Fri Oct 18 15:26:31 2019 +0200

sched/fair: Rework load_balance()

Our working theory is the load balancing changes are leaving processes 
behind busy CPUs instead of moving them onto idle ones.  I made a few 
schbench modifications to make this easier to demonstrate:


https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/

My VM has 40 cpus (20 cores, 2 threads per core), and my schbench 
command line is:


schbench -t 20 -r 0 -c 100 -s 1000 -i 30 -z 120

This has two message threads, and 20 workers per message thread.  Once 
woken up, the workers think for a full second, which means you’ll have 
some long latencies if you’re stuck behind one of these workers in the 
runqueue.  The message thread does a little bit of work and then sleeps, 
so we end up with 40 threads hammering full blast on the CPU and 2 
threads popping in and out of idle.


schbench times the delay from when a message thread wakes a worker to 
when the worker runs.  On a good kernel, the output looks like this:


Latency percentiles (usec) runtime 1290 (s) (3280 total samples)
50.0th: 155 (1653 samples)
75.0th: 189 (808 samples)
90.0th: 216 (501 samples)
95.0th: 227 (163 samples)
*99.0th: 256 (123 samples)
99.5th: 1510 (16 samples)
99.9th: 3132 (13 samples)
min=21, max=3286

With 0b0695f2b34a, we get this:

Latency percentiles (usec) runtime 1440 (s) (4480 total samples)
50.0th: 147 (2261 samples)
75.0th: 182 (1116 samples)
90.0th: 205 (671 samples)
95.0th: 224 (215 samples)
*99.0th: 12240 (173 samples) <—— much higher p99 and up
99.5th: 12752 (22 samples)
99.9th: 13104 (18 samples)
min=21, max=13172
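
The two runs above differ almost only from the 99th percentile up, which is what a small number of very slow wakeups looks like. A minimal sketch of that effect, assuming a nearest-rank percentile over raw samples; schbench's own bookkeeping may well differ, and the function name and permille argument are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* qsort() comparator for unsigned latencies in microseconds. */
static int cmp_uint(const void *a, const void *b)
{
    unsigned int x = *(const unsigned int *)a;
    unsigned int y = *(const unsigned int *)b;
    return (x > y) - (x < y);
}

/*
 * Nearest-rank percentile: the smallest sample such that at least
 * permille/1000 of all samples are less than or equal to it.
 * Sorts the caller's array in place. permille = 990 means p99.
 */
static unsigned int percentile_usec(unsigned int *samples, size_t n,
                                    unsigned int permille)
{
    assert(n > 0 && permille > 0 && permille <= 1000);
    qsort(samples, n, sizeof(samples[0]), cmp_uint);
    size_t rank = ((size_t)permille * n + 999) / 1000; /* ceiling division */
    return samples[rank - 1];
}
```

With 100 samples where 98 wakeups take 200 usec and 2 take 12000 usec, p50 and p95 stay at 200 while p99 jumps to 12000, the same shape as the 0b0695f2b34a output.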

Since the idea is to fully load the machine with schbench, use schbench 
-t , and make sure the box doesn’t have other stuff 
running in the background.  I used a VM because it ended up giving more 
consistent results on our kernel test machines, which have some periodic 
noise running in the background.


We’ve tried a few different approaches, but don’t quite have a solid 
fix yet.  I thought I’d kick off the discussion with my most useful 
hunks so far:


diff a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c

-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 4:39, Vincent Guittot wrote:

> Hi Chris
>
> On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote:
>>
>> Hi everyone,
>>
>> We’re validating a new kernel in the fleet, and compared with v5.2,
>
> Which version are you using ?
> several improvements have been added since v5.5 and the rework of load_balance

We’re validating v5.6, but all of the numbers referenced in this patch 
are against v5.9.  I usually try to back port my way to victory on this 
kind of thing, but mainline seems to behave exactly the same as 
0b0695f2b34a wrt this benchmark.

>> performance is ~2-3% lower for some of our workloads.  After some
>> digging, Johannes found that our involuntary context switch rate was ~2x
>> higher, and we were leaving a CPU idle a higher percentage of the time,
>> even though the workload was trying to saturate the system.
>>
>> We were able to reproduce the problem with schbench, and Johannes
>> bisected down to:
>>
>> commit 0b0695f2b34a4afa3f6e9aa1ff0e5336d8dad912
>> Author: Vincent Guittot
>> Date:   Fri Oct 18 15:26:31 2019 +0200
>>
>>  sched/fair: Rework load_balance()
>>
>> Our working theory is the load balancing changes are leaving processes
>> behind busy CPUs instead of moving them onto idle ones.  I made a few
>> schbench modifications to make this easier to demonstrate:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git/
>>
>> My VM has 40 cpus (20 cores, 2 threads per core), and my schbench
>> command line is:
>
> What is the topology ? are they all part of the same LLC ?


We’ve seen the regression on both single socket and dual socket bare 
metal intel systems.  On the VM I reproduced with, I saw similar 
latencies with and without siblings configured into the topology.


-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason




On 26 Oct 2020, at 10:24, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 08:45:27 (-0400), Chris Mason wrote:
>> On 26 Oct 2020, at 4:39, Vincent Guittot wrote:
>>
>>> Hi Chris
>>>
>>> On Sat, 24 Oct 2020 at 01:49, Chris Mason wrote:
>>>>
>>>> Hi everyone,
>>>>
>>>> We’re validating a new kernel in the fleet, and compared with v5.2,
>>>
>>> Which version are you using ?
>>> several improvements have been added since v5.5 and the rework of
>>> load_balance
>>
>> We’re validating v5.6, but all of the numbers referenced in this patch are
>> against v5.9.  I usually try to back port my way to victory on this kind of
>> thing, but mainline seems to behave exactly the same as 0b0695f2b34a wrt
>> this benchmark.
>
> ok. Thanks for the confirmation
>
> I have been able to reproduce the problem on my setup.


Thanks for taking a look!  Can I ask what parameters you used on 
schbench, and what kind of results you saw?  Mostly I’m trying to make 
sure it’s a useful tool, but also the patch didn’t change things 
here.




> Could you try the fix below ?
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>          * emptying busiest.
>          */
>         if (local->group_type == group_has_spare) {
> -               if (busiest->group_type > group_fully_busy) {
> +               if ((busiest->group_type > group_fully_busy) &&
> +                   (busiest->group_weight > 1)) {
>                         /*
>                          * If busiest is overloaded, try to fill spare
>                          * capacity. This might end up creating spare capacity
>
> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1),
> we should try to spread tasks on cpus instead of trying to fill spare capacity.


With this patch on top of v5.9, my latencies are unchanged.  I’m 
building against current Linus now just in case I’m missing other 
fixes.


-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 11:05, Chris Mason wrote:

> On 26 Oct 2020, at 10:24, Vincent Guittot wrote:
>
>> Could you try the fix below ?
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -9049,7 +9049,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>          * emptying busiest.
>>          */
>>         if (local->group_type == group_has_spare) {
>> -               if (busiest->group_type > group_fully_busy) {
>> +               if ((busiest->group_type > group_fully_busy) &&
>> +                   (busiest->group_weight > 1)) {
>>                         /*
>>                          * If busiest is overloaded, try to fill spare
>>                          * capacity. This might end up creating spare capacity
>>
>> When we calculate an imbalance at the smallest level, ie between CPUs (group_weight == 1),
>> we should try to spread tasks on cpus instead of trying to fill spare capacity.
>
> With this patch on top of v5.9, my latencies are unchanged.  I’m
> building against current Linus now just in case I’m missing other
> fixes.




I reran things to make sure nothing changed on my test box this 
weekend:


5.4.0-rc1-9-gfcf0553db6f4 (last good kernel)
Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 180 (502 samples)
75.0th: 227 (251 samples)
90.0th: 268 (147 samples)
95.0th: 300 (50 samples)
*99.0th: 338 (41 samples)
99.5th: 344 (4 samples)
99.9th: 1186 (5 samples)
min=25, max=1185

5.4.0-rc1-00010-g0b0695f2b34a (first bad kernel)
Latency percentiles (usec) runtime 150 (s) (960 total samples)
50.0th: 166 (488 samples)
75.0th: 210 (232 samples)
90.0th: 254 (145 samples)
95.0th: 299 (47 samples)
*99.0th: 12688 (39 samples)
99.5th: 13008 (5 samples)
99.9th: 13104 (4 samples)
min=24, max=13100

3650b228f83adda7e5ee532e2b90429c03f7b9ec (v5.10-rc1) + your patch

Latency percentiles (usec) runtime 30 (s) (1000 total samples)
50.0th: 169 (505 samples)
75.0th: 210 (246 samples)
90.0th: 267 (151 samples)
95.0th: 305 (48 samples)
*99.0th: 12656 (40 samples)
99.5th: 12944 (5 samples)
99.9th: 13168 (5 samples)
min=44, max=13155

-chris


Re: [PATCH] fix scheduler regression from "sched/fair: Rework load_balance()"

2020-10-26 Thread Chris Mason

On 26 Oct 2020, at 12:20, Vincent Guittot wrote:

> On Monday 26 Oct 2020 at 12:04:45 (-0400), Rik van Riel wrote:
>> On Mon, 26 Oct 2020 16:42:14 +0100
>> Vincent Guittot wrote:
>>> On Mon, 26 Oct 2020 at 16:04, Rik van Riel wrote:
>>>
>>>> Could utilization estimates be off, either lagging or
>>>> simply having a wrong estimate for a task, resulting
>>>> in no task getting pulled sometimes, while doing a
>>>> migrate_task imbalance always moves over something?
>>>
>>> task and cpu utilization are not always up to fully synced and may lag
>>> a bit which explains that sometimes LB can fail to migrate for a small
>>> diff
>>
>> OK, running with this little snippet below, I see latencies
>> improve back to near where they used to be:
>>
>> Latency percentiles (usec) runtime 150 (s)
>> 50.0th: 13
>> 75.0th: 31
>> 90.0th: 69
>> 95.0th: 90
>> *99.0th: 761
>> 99.5th: 2268
>> 99.9th: 9104
>> min=1, max=16158
>>
>> I suspect the right/cleaner approach might be to use
>> migrate_task more in !CPU_NOT_IDLE cases?
>>
>> Running a task to an idle CPU immediately, instead of refusing
>> to have the load balancer move it, improves latencies for fairly
>> obvious reasons.
>>
>> I am not entirely clear on why the load balancer should need to
>> be any more conservative about moving tasks than the wakeup
>> path is in eg. select_idle_sibling.
>
> what you are suggesting is something like:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4978964e75e5..3b6fbf33abc2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -9156,7 +9156,8 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>          * emptying busiest.
>          */
>         if (local->group_type == group_has_spare) {
> -               if (busiest->group_type > group_fully_busy) {
> +               if ((busiest->group_type > group_fully_busy) &&
> +                   !(env->sd->flags & SD_SHARE_PKG_RESOURCES)) {
>                         /*
>                          * If busiest is overloaded, try to fill spare
>                          * capacity. This might end up creating spare capacity
>
> which also fixes the problem for me and aligns LB with wakeup path regarding the migration
> in the LLC
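
The condition in the hunk above can be tried out in isolation. The sketch below is a simplified model, not the kernel function: it assumes that taking the "fill spare capacity" branch means a utilization-based migration and that falling through means moving whole tasks, which is one reading of the discussion in this thread. The flag value, the `migration_kind` enum, and the `patched` switch are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Group types ordered from least to most loaded, as the comparison requires. */
enum group_type {
    group_has_spare = 0,
    group_fully_busy,
    group_misfit_task,
    group_asym_packing,
    group_imbalanced,
    group_overloaded
};

/* Hypothetical bit value; the real SD_SHARE_PKG_RESOURCES lives in the kernel. */
#define SKETCH_SD_SHARE_PKG_RESOURCES 0x1u

/* Hypothetical labels for the two outcomes discussed in the thread. */
enum migration_kind { MIGRATE_UTIL, MIGRATE_TASK };

/*
 * Decision made when the local group has spare capacity.
 * patched == false: original condition from the "-" line.
 * patched == true:  condition from the "+" lines, which skips the
 *                   utilization-based path inside a shared LLC.
 */
static enum migration_kind pick_migration(enum group_type local,
                                          enum group_type busiest,
                                          unsigned int sd_flags,
                                          bool patched)
{
    assert(local == group_has_spare);

    bool fill_spare = busiest > group_fully_busy;
    if (patched)
        fill_spare = fill_spare &&
                     !(sd_flags & SKETCH_SD_SHARE_PKG_RESOURCES);

    return fill_spare ? MIGRATE_UTIL : MIGRATE_TASK;
}
```

In this model an overloaded busiest group inside a domain that shares package resources switches from MIGRATE_UTIL to MIGRATE_TASK once the extra check is applied, while domains without the flag behave as before.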


Vincent’s patch on top of 5.10-rc1 looks pretty great:

Latency percentiles (usec) runtime 90 (s) (3320 total samples)
50.0th: 161 (1687 samples)
75.0th: 200 (817 samples)
90.0th: 228 (488 samples)
95.0th: 254 (164 samples)
*99.0th: 314 (131 samples)
99.5th: 330 (17 samples)
99.9th: 356 (13 samples)
min=29, max=358

Next we test in prod, which probably won’t have answers until 
tomorrow.  Thanks again Vincent!


-chris


Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper

2020-11-10 Thread Chris Mason

On 10 Nov 2020, at 13:39, Christoph Hellwig wrote:

> On Mon, Nov 09, 2020 at 02:01:41PM -0500, Chris Mason wrote:
>> You do consistently ask for a shim layer, but you haven’t explained what
>> we gain by diverging from the documented and tested API of the upstream zstd
>> project.  It’s an important discussion given that we hope to regularly
>> update the kernel side as they make improvements in zstd.
>
> An API that looks like every other kernel API, and doesn't cause endless
> amount of churn because someone decided they need a new API flavor of
> the day.  Btw, I'm not asking for a shim layer - that was the compromise
> we ended up with.
>
> If zstd folks can't maintain a sane code base maybe we should just drop
> this childish churning code base from the tree.


I think APIs change based on the needs of the project.  We do this all 
the time in the kernel, and we don’t think twice about updating users 
of the API as needed.  The zstd changes look awkward and large today 
because it’s a long time period, but we’ve all been pretty vocal in 
the past about the importance of being able to advance APIs.


-chris


Re: [PATCH v5 1/9] lib: zstd: Add zstd compatibility wrapper

2020-11-09 Thread Chris Mason




On 6 Nov 2020, at 13:38, Christoph Hellwig wrote:

> You just keep resending this crap, don't you?  Haven't you been told
> multiple times to provide a proper kernel API by now?


You do consistently ask for a shim layer, but you haven’t explained 
what we gain by diverging from the documented and tested API of the 
upstream zstd project.  It’s an important discussion given that we 
hope to regularly update the kernel side as they make improvements in 
zstd.


The only benefit described so far seems to be camelcase related, but if 
there are problems in the API beyond that, I haven’t seen you describe 
them.  I don’t think the camelcase alone justifies the added costs of 
the shim.


-chris


Re: [PATCH] mm : fix pte _PAGE_DIRTY bit when fallback migrate page

2020-07-17 Thread Chris Mason

On 16 Jul 2020, at 6:15, Robbie Ko wrote:

> Kirill A. Shutemov wrote on 2020/7/15 at 4:11 PM:
>> On Wed, Jul 15, 2020 at 10:45:39AM +0800, Robbie Ko wrote:
>>> Kirill A. Shutemov wrote on 2020/7/14 at 6:19 PM:
>>>> On Tue, Jul 14, 2020 at 11:46:12AM +0200, Vlastimil Babka wrote:
>>>>> On 7/13/20 3:57 AM, Robbie Ko wrote:
>>>>>> Vlastimil Babka wrote on 2020/7/10 at 11:31 PM:
>>>>>>> On 7/9/20 4:48 AM, robbieko wrote:
>>>>>>>> From: Robbie Ko
>>>>>>>>
>>>>>>>> When a migrate page occurs, we first create a migration entry
>>>>>>>> to replace the original pte, and then go to fallback_migrate_page
>>>>>>>> to execute a writeout if the migratepage is not supported.
>>>>>>>>
>>>>>>>> In the writeout, we will clear the dirty bit of the page and use
>>>>>>>> page_mkclean to clear the dirty bit along with the corresponding pte,
>>>>>>>> but page_mkclean does not support migration entry.
>>>>
>>>> I don't follow the scenario.
>>>>
>>>> When we establish migration entries with try_to_unmap(), it transfers
>>>> dirty bit from PTE to the page.
>>>
>>> Sorry, I mean is _PAGE_RW with pte_write
>>>
>>> When we establish migration entries with try_to_unmap(),
>>> we create a migration entry, and if pte_write we set it to SWP_MIGRATION_WRITE,
>>> which will replace the migration entry with the original pte.
>>>
>>> When migratepage,  we go to fallback_migrate_page to execute a writeout
>>> if the migratepage is not supported.
>>>
>>> In the writeout, we call clear_page_dirty_for_io to  clear the dirty bit of the page
>>> and use page_mkclean to clear pte _PAGE_RW with pte_wrprotect in page_mkclean_one.
>>>
>>> However, page_mkclean_one does not support migration entries, so the
>>> migration entry is still SWP_MIGRATION_WRITE.
>>>
>>> In writeout, then we call remove_migration_ptes to remove the migration entry,
>>> because it is still SWP_MIGRATION_WRITE so set _PAGE_RW to pte via pte_mkwrite.
>>>
>>> Therefore, subsequent mmap write will not trigger page_mkwrite to cause data loss.
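
The sequence described above can be followed with a toy state machine. This is only a model of the steps as stated in the thread: the struct, the `toy_` helpers, and the `mkclean_sees_entries` what-if switch are hypothetical, and the switch is not meant to represent the proposed patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of one mapped file page; not kernel data structures. */
struct toy_mapping {
    bool pte_present;   /* a real PTE, not a migration entry, is installed */
    bool pte_write;     /* _PAGE_RW on the installed PTE */
    bool entry_write;   /* migration entry is SWP_MIGRATION_WRITE */
    bool page_dirty;    /* dirty flag on the page itself */
};

/* try_to_unmap(): replace the PTE with a migration entry that remembers RW. */
static void toy_try_to_unmap(struct toy_mapping *m)
{
    assert(m->pte_present);
    m->entry_write = m->pte_write;
    m->pte_present = false;
    m->pte_write = false;
}

/* writeout(): clear the page dirty bit, then write-protect mappings. */
static void toy_writeout(struct toy_mapping *m, bool mkclean_sees_entries)
{
    m->page_dirty = false;            /* clear_page_dirty_for_io() */
    if (m->pte_present)
        m->pte_write = false;         /* page_mkclean_one() on a real PTE */
    else if (mkclean_sees_entries)
        m->entry_write = false;       /* what-if: entries handled too */
}

/* remove_migration_ptes(): reinstall the PTE from the migration entry. */
static void toy_remove_migration_ptes(struct toy_mapping *m)
{
    m->pte_present = true;
    m->pte_write = m->entry_write;    /* pte_mkwrite() if entry was WRITE */
}

/* Returns true if the next mmap write would fault and reach page_mkwrite. */
static bool toy_next_write_faults(bool mkclean_sees_entries)
{
    struct toy_mapping m = {
        .pte_present = true, .pte_write = true,
        .entry_write = false, .page_dirty = true,
    };
    toy_try_to_unmap(&m);
    toy_writeout(&m, mkclean_sees_entries);
    toy_remove_migration_ptes(&m);
    return !m.pte_write;
}
```

With the behaviour described in the thread, the page comes back clean but writable, so the next store does not fault and the filesystem is never told the page was dirtied again.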

>> Hm, okay.
>>
>> Folks, is there any good reason why try_to_unmap(TTU_MIGRATION) should not
>> clear PTE (make the PTE none) for file page?
>
> This, I'm not sure.
> But I think that for the fs that support migratepage, when migratepage is finished,
> the page should still be dirty, and the pte should still have _PAGE_RW,
> when the next mmap write occurs, we don't need to trigger the page_mkwrite again.


I don’t know the page migration code well, but you’ll need this one 
as well on the 4.4 kernel you mentioned:


commit 25f3c5021985e885292980d04a1423fd83c967bb
Author: Chris Mason 
Date:   Tue Jan 21 11:51:42 2020 -0500

Btrfs: keep pages dirty when using btrfs_writepage_fixup_worker

And this one as well:

commit 7703bdd8d23e6ef057af3253958a793ec6066b28
Author: Chris Mason 
Date:   Wed Jun 20 07:56:11 2018 -0700

Btrfs: don't clean dirty pages during buffered writes

With those two in place, we haven’t found lost data from the migration 
code, but we did see the fallback migration helper dirtying pages 
without going through page_mkwrite, which triggers the suboptimal btrfs 
fixup worker code path.  This isn’t a yea or nay on the patch, just 
additional info.


-chris


Re: [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread Chris Mason
On 5 Jul 2020, at 0:55, Willy Tarreau wrote:

> On Sat, Jul 04, 2020 at 01:02:51PM -0700, Dan Williams wrote:
>> +Non-inclusive terminology has that same distracting effect which is why
>> +it is a style issue for Linux, it injures developer efficiency.
>
> I'm personally thinking that for a non-native speaker it's already
> difficult to find the best term to describe something, but having to
> apply an extra level of filtering on the found words to figure whether
> they are allowed by the language police is even more difficult.

Since our discussions are public, we’ve always had to deal with 
comments from people outside the community on a range of topics.  But 
inside the kernel, it’s just a group of developers trying to help each 
other produce the best quality of code.  We’ve got a long history 
together and in general I think we’re pretty good at assuming good 
intent.

> *This*
> injures developers efficiency. What could improve developers efficiency
> is to take care of removing *all* idiomatic or cultural words then. For
> example I've been participating to projects using the term "blueprint",
> I didn't understand what that meant. It was once explained to me and
> given that it had no logical reason for being called this way, I now
> forgot. If we follow your reasoning, Such words should be banned for
> exactly the same reasons. Same for colors that probably don't mean
> anything to those born blind.
>
> For example if in my local culture we eat tomatoes at starters and
> apples for dessert, it could be convenient for me to use "tomato" and
> "apple" as list elements to name the pointers leading to the beginning
> and the end of the list, and it might sound obvious to many people, but
> not at all for many others.
>
> Maybe instead of providing an explicit list of a few words it should
> simply say that terms that take their roots in the non-technical world
> and whose meaning can only be understood based on history or local
> culture ought to be avoided, because *that* actually is the real
> root cause of the problem you're trying to address.

I’d definitely agree that it’s a good goal to keep out non-technical 
terms.  Even though we already try, every subsystem has its own set of 
patterns that reflect the most frequent contributors.

-chris

Re: [Ksummit-discuss] [PATCH] CodingStyle: Inclusive Terminology

2020-07-06 Thread Chris Mason




On 6 Jul 2020, at 10:06, Laurent Pinchart wrote:

> Hi Chris,
>
> On Mon, Jul 06, 2020 at 12:45:34PM +0000, Chris Mason via Ksummit-discuss wrote:
>> On 5 Jul 2020, at 0:55, Willy Tarreau wrote:
>>
>>> Maybe instead of providing an explicit list of a few words it should
>>> simply say that terms that take their roots in the non-technical world
>>> and whose meaning can only be understood based on history or local
>>> culture ought to be avoided, because *that* actually is the real
>>> root cause of the problem you're trying to address.
>>
>> I’d definitely agree that it’s a good goal to keep out non-technical
>> terms.  Even though we already try, every subsystem has its own set of
>> patterns that reflect the most frequent contributors.
>
> That's an interesting point, because to me, it's the exact opposite. One
> of the intellectual rewards I find in working with the kernel is that
> our community is international and multicultural, allowing me to learn
> about other cultures. Aiming for the lowest common denominator seems to
> me to be closer to erasing cultural differences than including them.


I hadn’t thought of it from this angle, but I do agree with you.  I 
think the cultural side comes through more in discussions and in-person 
conferences than it does from the code itself.


I do try to avoid local idioms or culture references unless I’m 
explaining them as part of a discussion or a personal story, mostly 
because I’ve gotten feedback from coworkers who had a hard time 
following my bad (ok, terrible) jokes or sarcasm.  One internal example 
is commands that take --clowntown as an argument.  It’s pretty 
therapeutic to type when you’re grumpy about tooling, but a lot of 
people probably have to look it up before it makes sense.


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 10:46, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 10:43:04AM -0400, Chris Mason wrote:
>> Otherwise we just end up with drift and kernel-specific bugs that are harder
>> to debug.  To the extent those APIs make us contort the kernel code, I’m
>> sure Nick is interested in improving things in both places.
>
> Seriously, we do not care elsewhere.  Why would zlib be any different?


Is the zlib upstream active?  Or trying to sync active development with 
the kernel?  I’d suggest the same path for them if they were.




>> There are probably 1000 constructive ways to have that conversation.  Please
>> choose one of those instead of being an asshole.
>
> I think you are the asshole here by ignoring the practices we are using
> elsewhere and think your employers pet project is somehow special.  It
> is not, and claiming so is everything but constructive.


I’m happy to advocate for more constructive discussion for anyone’s 
project.  I tend to pick threads where I have context and I know the 
people involved.


The kernel best practices are pragmatic.  As one of many users of any 
established-non-kernel project, there’s a compromise between the APIs 
they are using for a broad base of users and us.  I’m sure they are 
interested in improving life for all of their users, while also 
improving maintainability for us.


-chris



Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 4:49, Christoph Hellwig wrote:

> On Tue, Sep 15, 2020 at 08:42:59PM -0700, Nick Terrell wrote:
>> From: Nick Terrell
>>
>> Move away from the compatibility wrapper to the zstd-1.4.6 API. This
>> code is functionally equivalent.
>
> Again, please use sensible names.  And no one gives a fuck if this bad
> API is "zstd-1.4.6" as the Linux kernel uses its own APIs, not some
> random mess from a badly written userspace package.


Hi Christoph,

It’s not completely clear what you’re asking for here.  If the API 
matches what’s in zstd-1.4.6, that seems like a reasonable way to 
label it.  That’s what the upstream is for this code.


I’m also not sure why we’re taking extra time to shit on the zstd 
userspace package.  Can we please be constructive or at least 
actionable?


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-17 Thread Chris Mason

On 17 Sep 2020, at 6:04, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 09:35:51PM -0400, Rik van Riel wrote:
>>> One possibility is to have a kernel wrapper on top of the zstd API to make it
>>> more ergonomic. I personally don’t really see the value in it, since it adds
>>> another layer of indirection between zstd and the caller, but it
>>> could be done.
>>
>> Zstd would not be the first part of the kernel to
>> come from somewhere else, and have wrappers when
>> it gets integrated into the kernel. There certainly
>> is precedence there.
>>
>> It would be interesting to know what Christoph's
>> preference is.
>
> Yes, I think kernel wrappers would be a pretty sensible step forward.
> That also avoid the need to do strange upgrades to a new version,
> and instead we can just change APIs on a as-needed basis.


When we add wrappers, we end up creating a kernel specific API that 
doesn’t match the upstream zstd docs, and it doesn’t leverage as 
much of the zstd fuzzing and testing.


So we’re actually making kernel zstd slightly less usable in hopes 
that our kernel specific part of the API is familiar enough to us that 
it makes zstd more usable.  There’s no way to compare the two until 
the wrappers are done, but given the code today I’d prefer that we 
focus on making it really easy to track upstream.  I really understand 
Christoph’s side here, but I’d rather ride a camel with the group 
than go it alone.


I’d also much rather spend time on any problems where the structure of 
the zstd APIs doesn’t fit the kernel’s needs.  The btrfs streaming 
compression/decompression looks pretty clean to me, but I think Johannes 
mentioned some possibilities to improve things for zswap (optimizations 
for page-at-a-time).  If there are places where the zstd memory 
management or error handling don’t fit naturally into the kernel, that 
would also be higher on my list.


Fixing those is probably going to be much easier if we’re close to 
the zstd upstream, again so that we can leverage testing and long term 
code maintenance done there.


-chris


Re: [PATCH 5/9] btrfs: zstd: Switch to the zstd-1.4.6 API

2020-09-16 Thread Chris Mason

On 16 Sep 2020, at 10:30, Christoph Hellwig wrote:

> On Wed, Sep 16, 2020 at 10:20:52AM -0400, Chris Mason wrote:
>> It’s not completely clear what you’re asking for here.  If the API
>> matches what’s in zstd-1.4.6, that seems like a reasonable way to label
>> it.  That’s what the upstream is for this code.
>>
>> I’m also not sure why we’re taking extra time to shit on the zstd
>> userspace package.  Can we please be constructive or at least actionable?
>
> Because it really doesn't matter that these crappy APIs he is
> introducing match anything, especially not something done as horribly
> as the zstd API.  We'll need to do this properly, and claiming
> compliance to some version of this lousy API is completely irrelevant
> for the kernel.


If the underlying goal is to closely follow the upstream of another 
project, we’re much better off using those APIs as provided.


Otherwise we just end up with drift and kernel-specific bugs that are 
harder to debug.  To the extent those APIs make us contort the kernel 
code, I’m sure Nick is interested in improving things in both places.


There are probably 1000 constructive ways to have that conversation.  
Please choose one of those instead of being an asshole.


-chris


Re: [RFC PATCH 00/11] bpf, trace, dtrace: DTrace BPF program type implementation and sample use

2019-05-31 Thread Chris Mason

I'm being pretty liberal with chopping down quoted material to help 
emphasize a particular opinion about how to bootstrap existing 
out-of-tree projects into the kernel.  My goal here is to talk more 
about the process and less about the technical details, so please 
forgive me if I've ignored or changed the technical meaning of anything 
below.

On 30 May 2019, at 12:15, Kris Van Hees wrote:

> On Thu, May 23, 2019 at 01:28:44PM -0700, Alexei Starovoitov wrote:
>
> ... I believe that the discussion that has been going on in other
> emails has shown that while introducing a program type that provides a
> generic (abstracted) context is a different approach from what has been done
> so far, it is a new use case that provides for additional ways in which BPF
> can be used.
>

[ ... ]

>
> Yes and no.  It depends on what you are trying to do with the BPF program that
> is attached to the different events.  From a tracing perspective, providing a
> single BPF program with an abstract context would ...

[ ... ]

>
> In this model kprobe/ksys_write and tracepoint/syscalls/sys_enter_write are
> equivalent for most tracing purposes ...

[ ... ]

>
> I agree with what you are saying but I am presenting an additional use case

[ ... ]

>>
>> All that aside the kernel support for shared libraries is an awesome
>> feature to have and a bunch of folks want to see it happen, but
>> it's not a blocker for 'dtrace to bpf' user space work.
>> libbpf can be taught to do this 'pseudo shared library' feature
>> while 'dtrace to bpf' side doesn't need to do anything special.

[ ... ]

This thread intermixes some abstract conceptual changes with smaller 
technical improvements, and in general it follows a familiar pattern 
other out-of-tree projects have hit while trying to adapt the kernel to 
their existing code.  Just from this one email, I quoted the abstract 
models with use cases etc., and this is often where the discussions 
sidetrack into less productive areas.

>
> So you are basically saying that I should redesign DTrace?

In your place, I would have removed features and adapted dtrace as much 
as possible to require the absolute minimum of kernel patches, or even 
better, no patches at all.  I'd document all of the features that worked 
as expected, and underline anything either missing or suboptimal that 
needed additional kernel changes.  Then I'd focus on expanding the 
community of people using dtrace against the mainline kernel, and work 
through the series' features and improvements one by one upstream over 
time.

Your current approach relies on an all-or-nothing landing of patches 
upstream, and this consistently leads to conflict every time a project 
tries it.  A more incremental approach will require bigger changes on 
the dtrace application side, but over time it'll be much easier to 
justify your kernel changes.  You won't have to talk in abstract models, 
and you'll have many more concrete examples of people asking for dtrace 
features against mainline.  Most importantly, you'll make dtrace 
available on more kernels than just the absolute latest mainline, and 
removing dependencies makes the project much easier for new users to 
try.

-chris


Re: [PATCH 1/2] Revert "mm: don't reclaim inodes with many attached pages"

2019-01-31 Thread Chris Mason
On 30 Jan 2019, at 20:34, Dave Chinner wrote:

> On Wed, Jan 30, 2019 at 12:21:07PM +0000, Chris Mason wrote:
>>
>>
>> On 29 Jan 2019, at 23:17, Dave Chinner wrote:
>>
>>> From: Dave Chinner 
>>>
>>> This reverts commit a76cf1a474d7dbcd9336b5f5afb0162baa142cf0.
>>>
>>> This change causes serious changes to page cache and inode cache
>>> behaviour and balance, resulting in major performance regressions
>>> when combining workloads such as large file copies and kernel
>>> compiles.
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=202441
>>
>> I'm a little confused by the latest comment in the bz:
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=202441#c24
>
> Which says the first patch that changed the shrinker behaviour is
> the underlying cause of the regression.
>
>> Are these reverts sufficient?
>
> I think so.

Based on the latest comment:

"If I had been less strict in my testing I probably would have 
discovered that the problem was present earlier than 4.19.3. Mr Gushins 
commit made it more visible.
I'm going back to work after two days off, so I might not be able to 
respond inside your working hours, but I'll keep checking in on this as 
I get a chance."

I don't think the reverts are sufficient.

>
>> Roman beat me to suggesting Rik's followup.  We hit a different 
>> problem
>> in prod with small slabs, and have a lot of instrumentation on Rik's
>> code helping.
>
> I think that's just another nasty, expedient hack that doesn't solve
> the underlying problem. Solving the underlying problem does not
> require changing core reclaim algorithms and upsetting a page
> reclaim/shrinker balance that has been stable and worked well for
> just about everyone for years.
>

Things are definitely breaking down in non-specialized workloads, and 
have been for a long time.

-chris


Re: [PATCH btrfs/for-next] btrfs: fix fatal extent_buffer readahead vs releasepage race

2020-06-17 Thread Chris Mason

On 17 Jun 2020, at 13:20, Filipe Manana wrote:


On Wed, Jun 17, 2020 at 5:32 PM Boris Burkov  wrote:


---
 fs/btrfs/extent_io.c | 45 


 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index c59e07360083..f6758ebbb6a2 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3927,6 +3927,11 @@ static noinline_for_stack int write_one_eb(struct extent_buffer *eb,
 	clear_bit(EXTENT_BUFFER_WRITE_ERR, &eb->bflags);
 	num_pages = num_extent_pages(eb);
 	atomic_set(&eb->io_pages, num_pages);
+	/*
+	 * It is possible for releasepage to clear the TREE_REF bit before we
+	 * set io_pages. See check_buffer_tree_ref for a more detailed comment.
+	 */
+	check_buffer_tree_ref(eb);


This is a whole different case from the one described in the
changelog, as this is in the write path.
Why do we need this one?


This was Josef’s idea, but I really like the symmetry.  You set 
io_pages, you do the tree_ref dance.  Everyone fiddling with the 
writeback bit right now correctly clears writeback after doing the 
atomic_dec on io_pages, but the race is tiny and prone to getting exposed 
again by shifting code around.  Tree ref checks around io_pages are the 
most reliable way to prevent this bug from coming back later.


-chris


Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg

2017-04-25 Thread Chris Mason



On 04/25/2017 04:49 PM, Tejun Heo wrote:

On Tue, Apr 25, 2017 at 11:49:41AM -0700, Tejun Heo wrote:

Will try that too.  I can't see why HT would change it because I see
single CPU queues misevaluated.  Just in case, you need to tune the
test params so that it doesn't load the machine too much and that
there are some non-CPU intensive workloads going on to perturb things
a bit.  Anyways, I'm gonna try disabling HT.


It's finickier but after changing the duty cycle a bit, it reproduces
w/ HT off.  I think the trick is setting the number of threads to the
number of logical CPUs and tune -s/-c so that p99 starts climbing up.
The following is from the root cgroup.


Since it's only measuring wakeup latency, schbench is best at exposing 
problems when the machine is just barely below saturated.  At 
saturation, everyone has to wait for the CPUs, and if we're relatively 
idle there's always a CPU to be found.


There's schbench -a to try and find this magic tipping point, but I 
haven't found a great way to automate for every kind of machine yet (sorry).


-chris


[GIT PULL] Btrfs

2017-04-27 Thread Chris Mason

Hi Linus,

We have one more for btrfs:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

This is dropping a new WARN_ON from rc1 that ended up making more noise 
than we really want.  The larger fix for the underflow got delayed a bit 
and it's better for now to put it under CONFIG_BTRFS_DEBUG.
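
The pattern described here, checking for an underflow before subtracting and keeping the loud report for debug builds only, can be sketched in a few lines of C.  This is a minimal illustration and not the btrfs code: the struct, the function and the SKETCH_DEBUG macro (standing in for CONFIG_BTRFS_DEBUG) are all made up, and leaving the counter untouched on underflow is an assumption of the sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for a qgroup; only the field used here. */
struct qgroup_sketch {
	uint64_t reserved;
};

/*
 * Release num_bytes from the reservation.  Returns 0 on success and -1
 * if the release would underflow, in which case reserved is left alone.
 * The noisy report is compiled in only when SKETCH_DEBUG is defined.
 */
static int qgroup_release_sketch(struct qgroup_sketch *qg, uint64_t num_bytes)
{
	assert(qg != NULL);

	if (qg->reserved < num_bytes) {
#ifdef SKETCH_DEBUG
		fprintf(stderr, "reserved underflow: have %llu, to free %llu\n",
			(unsigned long long)qg->reserved,
			(unsigned long long)num_bytes);
#endif
		return -1;
	}
	qg->reserved -= num_bytes;
	return 0;
}
```

Starting from a reservation of 100, releasing 40 leaves 60, and a later attempt to release 61 returns -1 instead of wrapping the unsigned counter.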


David Sterba (1) commits (+7/-4):
   btrfs: qgroup: move noisy underflow warning to debugging build

Total: (1) commits (+7/-4)

fs/btrfs/qgroup.c | 11 +++
1 file changed, 7 insertions(+), 4 deletions(-)


[GIT PULL] Btrfs

2017-03-31 Thread Chris Mason
Hi Linus,

We have 3 small fixes queued up in my for-linus-4.11 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

Goldwyn Rodrigues (1) commits (+7/-7):
btrfs: Change qgroup_meta_rsv to 64bit

Dan Carpenter (1) commits (+6/-1):
Btrfs: fix an integer overflow check

Liu Bo (1) commits (+31/-21):
Btrfs: bring back repair during read

Total: (3) commits (+44/-29)

 fs/btrfs/ctree.h |  2 +-
 fs/btrfs/disk-io.c   |  2 +-
 fs/btrfs/extent_io.c | 46 --
 fs/btrfs/inode.c |  6 +++---
 fs/btrfs/qgroup.c| 10 +-
 fs/btrfs/send.c  |  7 ++-
 6 files changed, 44 insertions(+), 29 deletions(-)


[GIT PULL] Btrfs

2017-04-14 Thread Chris Mason

Hi Linus,

Dave Sterba collected a few more fixes for the last rc:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.11

These aren't marked for stable, but I'm putting them in with a batch 
we're testing/sending by hand for this release.


Liu Bo (3) commits (+11/-13):
   Btrfs: fix invalid dereference in btrfs_retry_endio (+4/-10)
   Btrfs: fix potential use-after-free for cloned bio (+1/-1)
   Btrfs: fix segmentation fault when doing dio read (+6/-2)

Adam Borowski (1) commits (+3/-0):
   btrfs: drop the nossd flag when remounting with -o ssd

Total: (4) commits (+14/-13)

fs/btrfs/inode.c   | 22 ++
fs/btrfs/super.c   |  3 +++
fs/btrfs/volumes.c |  2 +-
3 files changed, 14 insertions(+), 13 deletions(-)


Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as 
well.  The numbers were in line with what Nick is posting here.  zstd is 
a big win over both lzo and zlib from a btrfs point of view.


It's true Nick's patches only support a single compression level in 
btrfs, but that's because btrfs doesn't have a way to pass in the 
compression ratio.  It could easily be a mount option, it was just 
outside the scope of Nick's initial work.


-chris





Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-10 Thread Chris Mason

On 08/10/2017 03:00 PM, Eric Biggers wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:

On Wed, Aug 09, 2017 at 07:35:53PM -0700, Nick Terrell wrote:



The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |



These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.



I am not surprised --- Zstandard is closer to the state of the art, both
format-wise and implementation-wise, than the other choices in BTRFS.  My point
is that benchmarks need to account for how much data is compressed at a time.
This is a common mistake when comparing different compression algorithms; the
algorithm name and compression level do not tell the whole story.  The
dictionary size is extremely significant.  No one is going to compress or
decompress a 200 MB file as a single stream in kernel mode, so it does not make
sense to justify adding Zstandard *to the kernel* based on such a benchmark.  It
is going to be divided into chunks.  How big are the chunks in BTRFS?  I thought
that it compressed only one page (4 KiB) at a time, but I hope that has been, or
is being, improved; 32 KiB - 128 KiB should be a better amount.  (And if the
amount of data compressed at a time happens to be different between the
different algorithms, note that BTRFS benchmarks are likely to be measuring that
as much as the algorithms themselves.)


Btrfs hooks the compression code into the delayed allocation mechanism 
we use to gather large extents for COW.  So if you write 100MB to a 
file, we'll have 100MB to compress at a time (within the limits of the 
number of pages we allow to collect before forcing it down).


But we want to balance how much memory you might need to uncompress 
during random reads.  So we have an artificial limit of 128KB that we 
send at a time to the compression code.  It's easy to change this, it's 
just a tradeoff made to limit the cost of reading small bits.


It's the same for zlib, lzo and the new zstd patch.
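
To put rough numbers on the tradeoff, here is a small C sketch of the arithmetic only.  It assumes the 128KB cap mentioned above and nothing else about btrfs; the macro and function names are invented for the example, and real extent layout is ignored.

```c
#include <assert.h>
#include <stdint.h>

/* 128KB cap taken from the message; the name is made up for this sketch. */
#define SKETCH_MAX_CHUNK (128u * 1024u)

/* Number of independently compressed chunks needed for len bytes. */
static uint64_t sketch_chunk_count(uint64_t len)
{
	return (len + SKETCH_MAX_CHUNK - 1) / SKETCH_MAX_CHUNK;
}

/*
 * Number of chunks a read of len bytes at offset has to decompress:
 * every chunk the byte range overlaps, even if only partially.
 */
static uint64_t sketch_chunks_touched(uint64_t offset, uint64_t len)
{
	uint64_t first, last;

	assert(offset + len >= offset);	/* no wraparound in the range */
	if (len == 0)
		return 0;
	first = offset / SKETCH_MAX_CHUNK;
	last = (offset + len - 1) / SKETCH_MAX_CHUNK;
	return last - first + 1;
}
```

Under these assumptions a 100MB write (taken as 100 * 1024 * 1024 bytes) becomes 800 chunks, while a 4KB random read decompresses one chunk, or two if it happens to straddle a chunk boundary.  Raising the cap gives the compressor more data per call but makes each small read more expensive.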

-chris



Re: [PATCH v5 2/5] lib: Add zstd modules

2017-08-11 Thread Chris Mason



On 08/10/2017 03:25 PM, Hugo Mills wrote:

On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote:

On 08/10/2017 04:30 AM, Eric Biggers wrote:


These benchmarks are misleading because they compress the whole file as a
single stream without resetting the dictionary, which isn't how data will
typically be compressed in kernel mode.  With filesystem compression the data
has to be divided into small chunks that can each be decompressed independently.
That eliminates one of the primary advantages of Zstandard (support for large
dictionary sizes).


I did btrfs benchmarks of kernel trees and other normal data sets as
well.  The numbers were in line with what Nick is posting here.
zstd is a big win over both lzo and zlib from a btrfs point of view.

It's true Nick's patches only support a single compression level in
btrfs, but that's because btrfs doesn't have a way to pass in the
compression ratio.  It could easily be a mount option, it was just
outside the scope of Nick's initial work.


Could we please not add more mount options? I get that they're easy
to implement, but it's a very blunt instrument. What we tend to see
(with both nodatacow and compress) is people using the mount options,
then asking for exceptions, discovering that they can't do that, and
then falling back to doing it with attributes or btrfs properties.
Could we just start with btrfs properties this time round, and cut out
the mount option part of this cycle.

In the long run, it'd be great to see most of the btrfs-specific
mount options get deprecated and ultimately removed entirely, in
favour of attributes/properties, where feasible.



It's a good point, and as was commented later down I'd just do mount -o 
compress=zstd:3 or something.


But I do prefer properties in general for this.  My big point was just 
that the next step is outside of Nick's scope.
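
For illustration only, a value of the shape "zstd:3" could be split into an algorithm name and a level along these lines.  This is a userspace sketch with invented names; it is not the btrfs option parser, and the accepted level range (1 to 99) is an arbitrary assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical result type for the sketch. */
struct compress_opt_sketch {
	char algo[16];	/* e.g. "zstd" */
	int level;	/* 0 means "not given, use the default" */
};

/* Parse "algo" or "algo:level".  Returns 0 on success, -1 if malformed. */
static int parse_compress_opt_sketch(const char *val,
				     struct compress_opt_sketch *out)
{
	const char *colon;
	size_t name_len;
	char *end;
	long level;

	assert(val != NULL && out != NULL);

	colon = strchr(val, ':');
	name_len = colon ? (size_t)(colon - val) : strlen(val);
	if (name_len == 0 || name_len >= sizeof(out->algo))
		return -1;

	memcpy(out->algo, val, name_len);
	out->algo[name_len] = '\0';
	out->level = 0;
	if (!colon)
		return 0;

	/* Everything after the colon must be a decimal level. */
	level = strtol(colon + 1, &end, 10);
	if (end == colon + 1 || *end != '\0' || level < 1 || level > 99)
		return -1;
	out->level = (int)level;
	return 0;
}
```

With this sketch "zstd:3" yields algo "zstd" and level 3, a bare "zlib" yields level 0 (default), and inputs such as "zstd:" or ":3" are rejected.  The same split would work whether the string arrives as a mount option or as a property value.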


-chris



[GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the 
open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |
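
As a reading aid, the derived columns of the compression table can be recomputed with a few lines of C.  This sketch assumes the throughput figures divide the 211,988,480 B size of silesia.tar by the measured time, with MB meaning 10^6 bytes; that assumption reproduces the zstd -1 row, and the function names are invented for the example.

```c
#include <assert.h>

/* Throughput in MB/s (MB = 10^6 bytes), including the userland copy. */
static double sketch_mb_per_s(double bytes, double seconds)
{
	assert(seconds > 0.0);
	return bytes / seconds / 1e6;
}

/* Throughput with the time of the "none" run subtracted out. */
static double sketch_adj_mb_per_s(double bytes, double seconds,
				  double none_seconds)
{
	assert(seconds > none_seconds);
	return bytes / (seconds - none_seconds) / 1e6;
}

/* Compression ratio: original size over compressed size. */
static double sketch_ratio(double original_bytes, double compressed_bytes)
{
	assert(compressed_bytes > 0.0);
	return original_bytes / compressed_bytes;
}
```

For zstd -1 (1.044 s, 73645762 B, with 0.100 s for the "none" run) these give roughly 203.05 MB/s, 224.56 adjusted MB/s and a ratio of 2.878.  Because the published times are rounded to three decimals, other rows may differ from the table in the last digit.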

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (4) commits (+14578/-12):  
crypto: Add zstd support (+356/-0)  
btrfs: Add zstd support (+468/-12)  
lib: Add zstd modules (+13014/-0)   
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0): 
squashfs: Add zstd support  

Total: (5) commits (+14756/-12)

Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason



On 09/08/2017 03:33 PM, Chris Mason wrote:

Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert and
Phillip, we decided to send the whole thing in as one pull request.

I have it in my zstd branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd

There's a trivial conflict with the main btrfs pull that Dave Sterba just
sent.  His pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.  My
idea was that you'd take our main btrfs pull first and this one second,
but the conflicts are small enough it's not a big deal.

zstd is a big win in speed over zlib and in compression ratio over lzo, and
the compression team here at FB has gotten great results using it in production.
Nick will continue to update the kernel side with new improvements from the
open source zstd userland code.


Just to clarify, we've been testing the kernel side of this here at FB, 
but our zstd use in prod is limited to the application side.


-chris


Re: [GIT PULL] zstd support (lib, btrfs, squashfs)

2017-09-08 Thread Chris Mason

On Sat, Sep 09, 2017 at 09:35:59AM +0800, Herbert Xu wrote:

On Fri, Sep 08, 2017 at 03:33:05PM -0400, Chris Mason wrote:


 crypto/Kconfig |9 +
 crypto/Makefile|1 +
 crypto/testmgr.c   |   10 +
 crypto/testmgr.h   |   71 +
 crypto/zstd.c  |  265 


Is there anyone going to use zstd through the crypto API? If not
then I don't see the point in adding it at this point.  Especially
as the compression API is still in a state of flux.


That part was requested by Intel, but I'm happy to leave it out for 
another time.  The rest of the patch series doesn't depend on it at all.


-chris


[GIT PULL v2] zstd support (lib, btrfs, squashfs, nocrypto)

2017-09-11 Thread Chris Mason
Hi Linus,

Nick Terrell's patch series to add zstd support to the kernel has been
floating around for a while.  After talking with Dave Sterba, Herbert
and Phillip, we decided to send the whole thing in as one pull request.

Herbert had asked about the crypto patch when we discussed the pull, but
I didn't realize he really meant not-right-now.  I've rebased it out of
this branch, and none of the other patches depended on it.

I have things in my zstd-minimal branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git zstd-minimal

There's a trivial conflict with the main btrfs pull from last week.
Dave's pull deletes BTRFS_COMPRESS_LAST in fs/btrfs/compression.h, and
I've put the sample resolution in a branch named zstd-4.14-merge.

zstd is a big win in speed over zlib and in compression ratio over lzo,
and the compression team here at FB has gotten great results using it in
production.  Nick will continue to update the kernel side with new
improvements from the open source zstd userland code.

Nick has a number of benchmarks for the main zstd code in his lib/zstd
commit:


I ran the benchmarks on a Ubuntu 14.04 VM with 2 cores and 4 GiB of RAM.
The VM is running on a MacBook Pro with a 3.1 GHz Intel Core i7 processor,
16 GB of RAM, and a SSD. I benchmarked using `silesia.tar` [3], which is
211,988,480 B large. Run the following commands for the benchmark:

sudo modprobe zstd_compress_test
sudo mknod zstd_compress_test c 245 0
sudo cp silesia.tar zstd_compress_test

The time is reported by the time of the userland `cp`.
The MB/s is computed with

1,536,217,008 B / time(buffer size, hash)

which includes the time to copy from userland.
The Adjusted MB/s is computed with

1,536,217,088 B / (time(buffer size, hash) - time(buffer size, none)).

The memory reported is the amount of memory the compressor requests.

| Method   | Size (B)  | Time (s) | Ratio | MB/s    | Adj MB/s | Mem (MB) |
|----------|-----------|----------|-------|---------|----------|----------|
| none     | 211988480 |    0.100 |     1 | 2119.88 |        - |        - |
| zstd -1  |  73645762 |    1.044 | 2.878 |  203.05 |   224.56 |     1.23 |
| zstd -3  |  66988878 |    1.761 | 3.165 |  120.38 |   127.63 |     2.47 |
| zstd -5  |  65001259 |    2.563 | 3.261 |   82.71 |    86.07 |     2.86 |
| zstd -10 |  60165346 |   13.242 | 3.523 |   16.01 |    16.13 |    13.22 |
| zstd -15 |  58009756 |   47.601 | 3.654 |    4.45 |     4.46 |    21.61 |
| zstd -19 |  54014593 |  102.835 | 3.925 |    2.06 |     2.06 |    60.15 |
| zlib -1  |  77260026 |    2.895 | 2.744 |   73.23 |    75.85 |     0.27 |
| zlib -3  |  72972206 |    4.116 | 2.905 |   51.50 |    52.79 |     0.27 |
| zlib -6  |  68190360 |    9.633 | 3.109 |   22.01 |    22.24 |     0.27 |
| zlib -9  |  67613382 |   22.554 | 3.135 |    9.40 |     9.44 |     0.27 |

I benchmarked zstd decompression using the same method on the same machine.
The benchmark file is located in the upstream zstd repo under
`contrib/linux-kernel/zstd_decompress_test.c` [4]. The memory reported is
the amount of memory required to decompress data compressed with the given
compression level. If you know the maximum size of your input, you can
reduce the memory usage of decompression irrespective of the compression
level.

| Method   | Time (s) | MB/s    | Adjusted MB/s | Memory (MB) |
|----------|----------|---------|---------------|-------------|
| none     |    0.025 | 8479.54 |             - |           - |
| zstd -1  |    0.358 |  592.15 |        636.60 |        0.84 |
| zstd -3  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -5  |    0.396 |  535.32 |        571.40 |        1.46 |
| zstd -10 |    0.374 |  566.81 |        607.42 |        2.51 |
| zstd -15 |    0.379 |  559.34 |        598.84 |        4.61 |
| zstd -19 |    0.412 |  514.54 |        547.77 |        8.80 |
| zlib -1  |    0.940 |  225.52 |        231.68 |        0.04 |
| zlib -3  |    0.883 |  240.08 |        247.07 |        0.04 |
| zlib -6  |    0.844 |  251.17 |        258.84 |        0.04 |
| zlib -9  |    0.837 |  253.27 |        287.64 |        0.04 |

===

I ran a long series of tests and benchmarks on the btrfs side and
the gains are very similar to the core benchmarks Nick ran.

Nick Terrell (3) commits (+14222/-12):
btrfs: Add zstd support (+468/-12)
lib: Add zstd modules (+13014/-0)
lib: Add xxhash module (+740/-0)

Sean Purcell (1) commits (+178/-0):
squashfs: Add zstd support

Total: (4) commits (+14400/-12)

 fs/btrfs/Kconfig   |2 +
 fs/btrfs/Makefile  |2 +-
 fs/btrfs/compression.c |1 +
 fs/btrfs/compression.h |6 +-
 fs/btrfs/ctree.h   |1 +
 fs/btrfs/disk-io.c |2 +
 fs/btrfs/ioctl.c   |6 +-
 fs/btrfs/props.c   |6 +
 fs/btrfs/super.c   |   12 +-
 fs/btrfs/sysfs.c   |2 +
 fs/btrfs/zstd.c|  432 ++
 fs/squashfs/Kconfig|   14 +
 

[GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
ained from bdev_get_queue (+3/-4)
btrfs: check if the device is flush capable (+4/-0)
btrfs: delete unused member nobarriers (+0/-4)

Edmund Nadolski (2) commits (+25/-20):
btrfs: provide enumeration for __merge_refs mode argument (+13/-10)
btrfs: replace hardcoded value with SEQ_LAST macro (+12/-10)

Goldwyn Rodrigues (2) commits (+24/-3):
btrfs: qgroups: Retry after commit on getting EDQUOT (+23/-1)
btrfs: No need to check !(flags & MS_RDONLY) twice (+1/-2)

Chris Mason (1) commits (+2/-2):
btrfs: fix the gfp_mask for the reada_zones radix tree

Adam Borowski (1) commits (+9/-3):
btrfs: fix a bogus warning when converting only data or metadata

Deepa Dinamani (1) commits (+2/-1):
btrfs: Use ktime_get_real_ts for root ctime

Dan Carpenter (1) commits (+15/-26):
Btrfs: handle only applicable errors returned by btrfs_get_extent

Dmitry V. Levin (1) commits (+2/-0):
MAINTAINERS: add btrfs file entries for include directories

Hans van Kranenburg (1) commits (+5/-5):
Btrfs: consistent usage of types in balance_args

Total: (71) commits

 MAINTAINERS  |   2 +
 fs/btrfs/backref.c   |  41 ++-
 fs/btrfs/btrfs_inode.h   |   7 +
 fs/btrfs/compression.c   |  18 +-
 fs/btrfs/ctree.c |  20 +-
 fs/btrfs/ctree.h |  34 +-
 fs/btrfs/delayed-inode.c |  46 +--
 fs/btrfs/delayed-inode.h |   6 +-
 fs/btrfs/delayed-ref.c   |   8 +-
 fs/btrfs/delayed-ref.h   |   8 +-
 fs/btrfs/dev-replace.c   |   9 +-
 fs/btrfs/disk-io.c   |  13 +-
 fs/btrfs/disk-io.h   |   4 +-
 fs/btrfs/extent-tree.c   |  35 +-
 fs/btrfs/extent_io.c |  59 +--
 fs/btrfs/extent_io.h |   8 +-
 fs/btrfs/extent_map.c|  10 +-
 fs/btrfs/extent_map.h|   3 +-
 fs/btrfs/file.c  |  82 -
 fs/btrfs/free-space-cache.c  |   2 +-
 fs/btrfs/inode.c | 289 +++
 fs/btrfs/ioctl.c |  33 +-
 fs/btrfs/ordered-data.c  |  20 +-
 fs/btrfs/ordered-data.h  |   2 +-
 fs/btrfs/qgroup.c| 102 ++
 fs/btrfs/qgroup.h|  51 ++-
 fs/btrfs/raid56.c|  38 +-
 fs/btrfs/reada.c |  37 +-
 fs/btrfs/root-tree.c |   3 +-
 fs/btrfs/scrub.c | 331 +++--
 fs/btrfs/send.c  |  23 +-
 fs/btrfs/super.c |   3 +-
 fs/btrfs/tests/btrfs-tests.c |   1 -
 fs/btrfs/transaction.c   |  48 ++-
 fs/btrfs/transaction.h   |   6 +-
 fs/btrfs/tree-log.c  |   2 +-
 fs/btrfs/volumes.c   | 854 +++
 fs/btrfs/volumes.h   |   8 +-
 include/trace/events/btrfs.h | 187 +-
 include/uapi/linux/btrfs.h   |  10 +-
 40 files changed, 1629 insertions(+), 834 deletions(-)


Re: [GIT PULL] Btrfs

2017-05-09 Thread Chris Mason
On 05/09/2017 01:56 PM, Chris Mason wrote:
> Hi Linus,
> 
> My for-linus-4.12 branch:
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
> for-linus-4.12

I hit send too soon, sorry.  There's a trivial conflict with our WARN_ON
fix that went into 4.11.  I pushed the resolution to
for-linus-4.12-merged.

diff --cc fs/btrfs/qgroup.c
index afbea61,3f75b5c..deffbeb
--- a/fs/btrfs/qgroup.c
+++ b/fs/btrfs/qgroup.c
@@@ -1078,7 -1031,8 +1034,8 @@@ static int __qgroup_excl_accounting(str
qgroup->excl += sign * num_bytes;
qgroup->excl_cmpr += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup, -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup, num_bytes);
else
qgroup->reserved -= num_bytes;
@@@ -1103,7 -1057,9 +1060,9 @@@
WARN_ON(sign < 0 && qgroup->excl < num_bytes);
qgroup->excl += sign * num_bytes;
if (sign > 0) {
+   trace_qgroup_update_reserve(fs_info, qgroup,
+   -(s64)num_bytes);
 -  if (WARN_ON(qgroup->reserved < num_bytes))
 +  if (qgroup->reserved < num_bytes)
report_reserved_underflow(fs_info, qgroup,
  num_bytes);
else
@@@ -2472,7 -2451,8 +2454,8 @@@ void btrfs_qgroup_free_refroot(struct b
  
qg = unode_aux_to_qgroup(unode);
  
+   trace_qgroup_update_reserve(fs_info, qg, -(s64)num_bytes);
 -  if (WARN_ON(qg->reserved < num_bytes))
 +  if (qg->reserved < num_bytes)
report_reserved_underflow(fs_info, qg, num_bytes);
else
qg->reserved -= num_bytes;


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-05-17 Thread Chris Mason

On 05/17/2017 06:53 AM, Peter Zijlstra wrote:

On Mon, May 15, 2017 at 02:03:11AM -0700, tip-bot for Peter Zijlstra wrote:

sched/fair, cpumask: Export for_each_cpu_wrap()



-static int cpumask_next_wrap(int n, const struct cpumask *mask, int start, int *wrapped)
-{



-   next = find_next_bit(cpumask_bits(mask), nr_cpumask_bits, n+1);



-}


OK, so this patch fixed an actual bug in the for_each_cpu_wrap()
implementation. The above 'n+1' should be 'n', and the effect is that
it'll skip over CPUs, potentially resulting in an iteration that only
sees every other CPU (for a fully contiguous mask).
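
The "every other CPU" effect is easy to reproduce in isolation.  The following C sketch is not the kernel's cpumask code and does not model the wrap-around; it only shows, with invented names and an 8-bit mask, what happens when a scan that already steps past the current CPU also starts its search one bit too far.

```c
#include <assert.h>

#define SKETCH_NR_CPUS 8

/* First set bit in mask at position >= from, or SKETCH_NR_CPUS if none. */
static int sketch_find_next_bit(unsigned int mask, int from)
{
	assert(from >= 0);
	for (int i = from; i < SKETCH_NR_CPUS; i++)
		if (mask & (1u << i))
			return i;
	return SKETCH_NR_CPUS;
}

/*
 * Count the CPUs a scan visits.  Each pass continues the search at
 * cpu + 1 + extra: extra = 0 is the intended walk, extra = 1 models
 * an off-by-one that advances the cursor twice.
 */
static int sketch_count_visited(unsigned int mask, int extra)
{
	int visited = 0;
	int cpu = sketch_find_next_bit(mask, 0);

	while (cpu < SKETCH_NR_CPUS) {
		visited++;
		cpu = sketch_find_next_bit(mask, cpu + 1 + extra);
	}
	return visited;
}
```

For a fully contiguous mask of 8 CPUs the intended walk visits all 8, while the double-advancing variant visits only CPUs 0, 2, 4 and 6.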

This in turn causes hackbench to further suffer from the regression
introduced by commit:

  4c77b18cf8b7 ("sched/fair: Make select_idle_cpu() more aggressive")

So it's well past time to fix this.

Where the old scheme was a cliff-edge throttle on idle scanning, this
introduces a more gradual approach. Instead of stopping to scan
entirely, we limit how many CPUs we scan.

Initial benchmarks show that it mostly recovers hackbench while not
hurting anything else, except Mason's schbench, but not as bad as the
old thing.

It also appears to recover the tbench high-end, which also suffered like
hackbench.

I'm also hoping it will fix/preserve kitsunyan's interactivity issue.

Please test..


We'll get some tests going here too.

-chris


Re: hackbench vs select_idle_sibling; was: [tip:sched/core] sched/fair, cpumask: Export for_each_cpu_wrap()

2017-06-09 Thread Chris Mason

On 06/06/2017 05:21 AM, Peter Zijlstra wrote:

On Mon, Jun 05, 2017 at 02:00:21PM +0100, Matt Fleming wrote:

On Fri, 19 May, at 04:00:35PM, Matt Fleming wrote:

On Wed, 17 May, at 12:53:50PM, Peter Zijlstra wrote:


Please test..


Results are still coming in but things do look better with your patch
applied.

It does look like there's a regression when running hackbench in
process mode and when the CPUs are not fully utilised, e.g. check this
out:


This turned out to be a false positive; your patch improves things as
far as I can see.


Hooray, I'll move it to a part of the queue intended for merging.


It's a little late, but Roman Gushchin helped get some runs of this with 
our production workload.  The patch is ever so slightly better.


Thanks!

-chris



[GIT PULL] Btrfs

2017-06-10 Thread Chris Mason
Hi Linus,

My for-linus-4.12 branch has some fixes that Dave Sterba collected:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.12

We've been hitting an early enospc problem on production machines that
Omar tracked down to an old int->u64 mistake.  I waited a bit on
this pull to make sure it was really the problem from production,
but it's on ~2100 hosts now and I think we're good.
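
The message does not show the fix itself. As a generic illustration of how a narrow integer type can silently break byte accounting, here is a small sketch; the function names and the extent-counting example are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size extents needed to cover `bytes`, computed in 64 bits.
 * (Assumes bytes + extent_size does not itself overflow 64 bits.) */
static uint64_t sk_count_extents(uint64_t bytes, uint64_t extent_size)
{
    assert(extent_size > 0);
    return (bytes + extent_size - 1) / extent_size;
}

/* Same computation, but the byte count passes through a 32-bit variable. */
static uint64_t sk_count_extents_truncated(uint64_t bytes, uint64_t extent_size)
{
    uint32_t narrow = (uint32_t)bytes;   /* silently drops bits 32..63 */

    return sk_count_extents(narrow, extent_size);
}
```

For a length of 4 GiB + 4 KiB and 4 KiB extents, the 64-bit version counts 1048577 extents while the truncated one counts 1, so whatever was reserved for the difference is never accounted for correctly.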

Omar also noticed a commit in the queue would make new early ENOSPC
problems.  I pulled that out for now, which is why the top three commits
are younger than the rest.

Otherwise these are all fixes, some explaining very old bugs that we've
been poking at for a while.

Jeff Mahoney (2) commits (+4/-3):
btrfs: fix race with relocation recovery and fs_root setup (+3/-3)
btrfs: fix memory leak in update_space_info failure path (+1/-0)

Liu Bo (1) commits (+1/-1):
Btrfs: clear EXTENT_DEFRAG bits in finish_ordered_io

Colin Ian King (1) commits (+1/-1):
btrfs: fix incorrect error return ret being passed to mapping_set_error

Omar Sandoval (1) commits (+2/-2):
Btrfs: fix delalloc accounting leak caused by u32 overflow

Qu Wenruo (1) commits (+122/-2):
btrfs: fiemap: Cache and merge fiemap extent before submit it to user

David Sterba (1) commits (+2/-2):
btrfs: use correct types for page indices in btrfs_page_exists_in_range

Jan Kara (1) commits (+6/-4):
btrfs: Make flush bios explicitely sync

Su Yue (1) commits (+1/-1):
btrfs: tree-log.c: Wrong printk information about namelen

Total: (9) commits (+139/-16)

 fs/btrfs/ctree.h   |   4 +-
 fs/btrfs/dir-item.c|   2 +-
 fs/btrfs/disk-io.c |  10 ++--
 fs/btrfs/extent-tree.c |   7 +--
 fs/btrfs/extent_io.c   | 126 +++--
 fs/btrfs/inode.c   |   6 +--
 6 files changed, 139 insertions(+), 16 deletions(-)


Re: [PATCH] btrfs: always write superblocks synchronously

2017-05-03 Thread Chris Mason



On 05/03/2017 04:36 AM, Jan Kara wrote:

On Tue 02-05-17 09:28:13, Davidlohr Bueso wrote:

Commit b685d3d65ac7 "block: treat REQ_FUA and REQ_PREFLUSH as
synchronous" removed the REQ_SYNC flag from the WRITE_FUA implementation.
However, the REQ_FUA and REQ_FLUSH flags are stripped from submitted IO
when the disk doesn't have a volatile write cache, which effectively
makes the write async. This was seen to cause regressions of up to 90%
in disk IO related benchmarks such as reaim and dbench[1].

Fix the problem by making sure the first superblock write is also
treated as synchronous since it can block progress of the
journalling (commit, log syncs) machinery and thus the whole filesystem.
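
The flag interaction described above can be modelled in a few lines. This is a sketch under stated assumptions: the flag names mirror the ones in the message, but the numeric values and both helper functions are made up for illustration and are not the block layer's definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative flag values only; not the kernel's. */
#define SK_REQ_SYNC     0x1u
#define SK_REQ_FUA      0x2u
#define SK_REQ_PREFLUSH 0x4u

static_assert((SK_REQ_SYNC & (SK_REQ_FUA | SK_REQ_PREFLUSH)) == 0,
              "stripping cache flags must not clear the sync bit");

/* Without a volatile write cache, the cache-control flags are dropped. */
static unsigned int sk_strip_cache_flags(unsigned int flags, bool volatile_cache)
{
    if (!volatile_cache)
        flags &= ~(SK_REQ_FUA | SK_REQ_PREFLUSH);
    return flags;
}

/* A write counts as synchronous if any of the three flags is still set. */
static bool sk_write_is_sync(unsigned int flags)
{
    return (flags & (SK_REQ_SYNC | SK_REQ_FUA | SK_REQ_PREFLUSH)) != 0;
}
```

A write tagged only with the FUA flag stops being synchronous once the flag is stripped, whereas one that also carries an explicit sync flag stays synchronous on any device.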





Fixes: b685d3d65ac (block: treat REQ_FUA and REQ_PREFLUSH as synchronous)
Cc: stable 
Cc: Jan Kara 
Signed-off-by: Davidlohr Bueso 


I wasn't patient enough and already sent the fix as part of my series
fixing other filesystems [1]. It also fixes one more place in btrfs that
needs REQ_SYNC to return to the original behavior.




Thanks guys.

-chris



Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-09 Thread Chris Mason
Hello everyone,

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  
The election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.



Re: Moving ndctl development into the kernel tree?

2017-07-25 Thread Chris Mason

On 07/22/2017 02:49 PM, Dan Williams wrote:

On Fri, Jul 21, 2017 at 7:52 PM, Dan Williams  wrote:

[ adding Chris ]

On Fri, Jul 21, 2017 at 4:44 PM, Dan Williams  wrote:

On Fri, Jul 21, 2017 at 3:58 PM, Ingo Molnar  wrote:


* Dan Williams  wrote:


[...]

* Like perf, ndctl borrows the sub-command architecture and option
parsing from git. So, this code could be refactored into something
shared / generic, i.e. the bits in tools/perf/util/.


Just as a side note, stacktool (tools/stacktool/) is using the Git sub-command
and options parsing code as well, and it's already sharing it with perf, via the
tools/lib/subcmd/ library.

ndctl could use that as well.
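
For readers unfamiliar with the git-style sub-command layout mentioned here, a minimal sketch of the idea follows. It does not use the tools/lib/subcmd/ API; the table, the handler names and the return convention are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*sk_cmd_fn)(int argc, const char **argv);

struct sk_subcmd {
    const char *name;
    sk_cmd_fn fn;
};

/* Hypothetical handlers: each reports how many operands follow its name. */
static int sk_cmd_list(int argc, const char **argv)   { (void)argv; return argc - 1; }
static int sk_cmd_create(int argc, const char **argv) { (void)argv; return argc - 1; }

static const struct sk_subcmd sk_commands[] = {
    { "list",   sk_cmd_list },
    { "create", sk_cmd_create },
};

/* Look up argv[1] in the table and hand the remaining arguments to it. */
static int sk_run_subcmd(int argc, const char **argv)
{
    assert(argv != NULL);

    if (argc < 2)
        return -1;                      /* no sub-command given */

    for (size_t i = 0; i < sizeof(sk_commands) / sizeof(sk_commands[0]); i++) {
        if (strcmp(argv[1], sk_commands[i].name) == 0)
            return sk_commands[i].fn(argc - 1, argv + 1);
    }
    return -1;                          /* unknown sub-command */
}
```

Each handler sees its own name as argv[0], which is what lets the same option parser be reused for every sub-command.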


Ah, nice, that refactoring happened about a year after ndctl was born.
Which brings up the next question about what to do with the git
history, but I'd want to know if ndctl is even welcome upstream before
digging any deeper.


I suspect this would be similar to what Chris did to merge btrfs while
retaining the standalone history. Chris, any pointers on what worked
well and what if anything you would do differently? I.e. I'm looking
to use git filter-branch to rewrite ndctl history as if it had always
been in tools/ndctl in the kernel tree. I found this old thread
https://lkml.org/lkml/2008/10/30/523 and it seems to also recommend
using an older kernel as the branch base.


So it wasn't as painful as I thought it would be, I just used the
script Linus recommended in that thread. Here is what I came up with
merging the last ndctl release on top of v4.9, and then applying the
pending development patches re-filtered to tools/ndctl:

 
https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=for-4.14/ndctl

...the next thing would be to rework the versioning to use the kernel
version and switch to using tools/lib/subcmd/.



I'd like to say I figured it all out back then, but the truth is that 
Linus held my hand the whole way.  My memory of it is that his script 
worked really well, I just ran that and verified the results.


-chris


Reminder v2: Linux Foundation Technical Advisory Board Elections -- Call for nominations

2017-10-22 Thread Chris Mason

Hello everyone,

Quick update on the TAB elections, we have 6 nominations so far:

Jon Corbet
Greg Kroah-Hartman
Shuah Khan
Steve Rostedt
Ted Tso
Tim Bird

The elections are coming soon, please feel free to contact me if you 
have any questions about the TAB.


-

The Linux Foundation Technical Advisory Board (TAB) serves as the
interface between the kernel development community and the Foundation.
The TAB advises the Foundation on kernel-related matters, helps member
companies learn to work with the community, and works to resolve
community-related problems before they get out of hand.  The board has
ten members, one of whom sits on the LF board of directors.  The
election to select five TAB members will be held at the 2017 Kernel
Summit in Prague, Czech Republic.  The elections will take place at the
conference center on Wednesday Oct 25th, shortly before the evening
reception.

The election will be open to all attendees of all of the Linux
Foundation events taking place that week in Prague.  Anyone is eligible
to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce
themselves and briefly talk about why they would like to participate on
the Technical Advisory Board.  This year, we're encouraging everyone to
include those details along with their nomination, which we will compile
into an online document for quick reference here:

https://goo.gl/ADVFtT

The deadline for receiving nominations is up until the beginning of the
election event.  Any statements for the online document need to be sent
by Monday Oct 23rd.  Please get your nomination in early so everyone has
a chance to review the nominations before voting.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.  The
other five are halfway through their term and will be up for election
next year.


Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-29 Thread Chris Mason

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't 
have a problem with these going in through cgroup.  Dave, does this sound 
good to you?


I'd like to include my patch to do all crcs inline (instead of handing 
off to helper threads) when io controls are in place.  By the merge 
window we should have some good data on how much it's all helping.


-chris



Re: [PATCHSET v2] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup

2017-11-30 Thread Chris Mason



On 11/30/2017 12:23 PM, David Sterba wrote:

On Wed, Nov 29, 2017 at 01:38:26PM -0500, Chris Mason wrote:

On 11/29/2017 12:05 PM, Tejun Heo wrote:

On Wed, Nov 29, 2017 at 09:03:30AM -0800, Tejun Heo wrote:

Hello,

On Wed, Nov 29, 2017 at 05:56:08PM +0100, Jan Kara wrote:

What has happened with this patch set?


No idea.  cc'ing Chris directly.  Chris, if the patchset looks good,
can you please route them through the btrfs tree?


lol looking at the patchset again, I'm not sure that's obviously the
right tree.  It can either be cgroup, block or btrfs.  If no one
objects, I'll just route them through cgroup.


We'll have to coordinate a bit during the next merge window but I don't
have a problem with these going in through cgroup.  Dave, does this sound
good to you?


There are only minor changes to btrfs code so cgroup tree would be
better.


I'd like to include my patch to do all crcs inline (instead of handing
off to helper threads) when io controls are in place.  By the merge
window we should have some good data on how much it's all helping.


Are there any problems in sight if the inline crc and cgroup changes go
separately? I assume there's a runtime dependency, not a code
dependency, so it could be sorted by the right merge order.



The feature is just more useful with the inline crcs.  Without them we 
end up with kworkers doing both high and low prio submissions and it all 
boils down to the speed of the lowest priority.
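
One way to picture the problem is head-of-line blocking in a shared worker. The toy model below is an assumption made for illustration, not a description of the btrfs worker threads: a single worker drains items in FIFO order, and each item has a cost that is larger when its submitter is throttled.

```c
#include <assert.h>
#include <stddef.h>

/* Completion time of item `idx` when one shared worker drains cost[] in order. */
static unsigned long sk_fifo_completion(const unsigned long *cost, size_t n, size_t idx)
{
    assert(cost != NULL && idx < n);

    unsigned long t = 0;

    for (size_t i = 0; i <= idx; i++)
        t += cost[i];           /* everything queued ahead must finish first */
    return t;
}

/* Completion time when each submitter does its own work inline. */
static unsigned long sk_inline_completion(const unsigned long *cost, size_t n, size_t idx)
{
    assert(cost != NULL && idx < n);
    return cost[idx];
}
```

With two slow low-priority items of cost 100 queued ahead of a high-priority item of cost 1, the shared worker finishes the high-priority item at time 201, while the inline model finishes it at time 1.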


-chris



Re: btrfs bio linked list corruption.

2016-10-13 Thread Chris Mason

On 10/13/2016 02:16 PM, Dave Jones wrote:

On Wed, Oct 12, 2016 at 10:42:46AM -0400, Chris Mason wrote:
 > On 10/12/2016 10:40 AM, Dave Jones wrote:
 > > On Wed, Oct 12, 2016 at 09:47:17AM -0400, Dave Jones wrote:
 > >  > On Tue, Oct 11, 2016 at 11:54:09AM -0400, Chris Mason wrote:
 > >  >  >
 > >  >  >
 > >  >  > On 10/11/2016 10:45 AM, Dave Jones wrote:
 > >  >  > > This is from Linus' current tree, with Al's iovec fixups on top.
 > >  >  > >
 > >  >  > > [ cut here ]
 > >  >  > > WARNING: CPU: 1 PID: 3673 at lib/list_debug.c:33 __list_add+0x89/0xb0
 > >  >  > > list_add corruption. prev->next should be next (e8806648), but was c967fcd8. (prev=880503878b80).
 > >  >  > > CPU: 1 PID: 3673 Comm: trinity-c0 Not tainted 4.8.0-think+ #13
 > >  >  > >  c9d87458 8d32007c c9d874a8
 > >  >  > >  c9d87498 8d07a6c1 00210246 88050388e880
 > >  >
 > >  > I hit this again overnight, it's the same trace, the only difference
 > >  > being slightly different addresses in the list pointers:
 > >  >
 > >  > [42572.777196] list_add corruption. prev->next should be next (e8806648), but was c9647cd8. (prev=880503a0ba00).
 > >  >
 > >  > I'm actually a little surprised that ->next was the same across two
 > >  > reboots on two different kernel builds.  That might be a sign this is
 > >  > more repeatable than I'd thought, even if it does take hours of runtime
 > >  > right now to trigger it.  I'll try and narrow the scope of what trinity
 > >  > is doing to see if I can make it happen faster.
 > >
 > > .. and of course the first thing that happens is a completely different
 > > btrfs trace..
 > >
 > >
 > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
 > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
 > >  c900019076a8 b731ff3c  
 > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
 > >  0801 880501cfa2a8 008a 008a
 >
 > This isn't even IO.  Uuug.  We're going to need a fast enough test
 > that we can bisect.

Progress...
I've found that this combination of syscalls..

./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2

hits one of these two bugs in a few minutes runtime.

Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
Mix them together though, and something goes awry.



Hasn't triggered here yet.  I'll leave it running though.

-chris


[GIT PULL] Btrfs

2016-10-14 Thread Chris Mason
Hi Linus,

My for-linus-4.9 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.9

Has some fixes from Omar and Dave Sterba for our new free space tree.
This isn't heavily used yet, but as we move toward making it the new
default we wanted to nail down an endian bug.
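
The pull request does not spell out the bug, so the following is a generic sketch of the usual endian-safe technique rather than the btrfs fix: address an on-disk bitmap byte by byte, so that a given bit number always lands in the same byte regardless of CPU byte order. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set bit `nr` in a bitmap of `nbits` bits, laid out little-endian by byte. */
static void sk_le_bitmap_set(uint8_t *map, size_t nbits, size_t nr)
{
    assert(map != NULL && nr < nbits);
    map[nr / 8] |= (uint8_t)(1u << (nr % 8));
}

/* Return 1 if bit `nr` is set, 0 otherwise. */
static int sk_le_bitmap_test(const uint8_t *map, size_t nbits, size_t nr)
{
    assert(map != NULL && nr < nbits);
    return (map[nr / 8] >> (nr % 8)) & 1;
}
```

Bit 9 always ends up as 0x02 in byte 1 here. If the same buffer were instead accessed as native 64-bit words, a big-endian machine would place that bit in a different byte, which is the class of mismatch sanity tests on both byte orders are meant to catch.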

Omar Sandoval (5) commits (+259/-145):
Btrfs: expand free space tree sanity tests to catch endianness bug (+96/-68)
Btrfs: fix extent buffer bitmap tests on big-endian systems (+51/-36)
Btrfs: fix free space tree bitmaps on big-endian systems (+76/-27)
Btrfs: fix mount -o clear_cache,space_cache=v2 (+12/-12)
Btrfs: catch invalid free space trees (+24/-2)

David Sterba (2) commits (+13/-12):
btrfs: tests: uninline member definitions in free_space_extent (+2/-1)
btrfs: tests: constify free space extent specs (+11/-11)

Total: (7) commits (+272/-157)

 fs/btrfs/ctree.h   |   3 +-
 fs/btrfs/disk-io.c |  33 +++---
 fs/btrfs/extent_io.c   |  64 +++
 fs/btrfs/extent_io.h   |  22 
 fs/btrfs/free-space-tree.c |  19 ++--
 fs/btrfs/tests/extent-io-tests.c   |  87 ---
 fs/btrfs/tests/free-space-tree-tests.c | 189 +++--
 include/uapi/linux/btrfs.h |  12 ++-
 8 files changed, 272 insertions(+), 157 deletions(-)


Linux Foundation Technical Advisory Board Elections and Nomination process

2016-10-14 Thread Chris Mason

Hello everyone,

The elections for five of the ten members of the Linux Foundation 
Technical Advisory Board (TAB) are held every year[1]. This year the
election will be at the 2016 Kernel Summit in Santa Fe, NM.

The elections will take place at the conference center on Wednesday Nov 
2nd, shortly before the evening Kernel Summit/Plumbers reception.  The 
elections will be open to all attendees of both the Kernel Summit and 
the Linux Plumbers.


Anyone is eligible to stand for election; simply send your nomination to:

tech-board-discuss at lists.linux-foundation.org

Just before the election, everyone will have a chance to introduce 
themselves and briefly talk about why they would like to participate on 
the Technical Advisory Board.   This year, we're encouraging everyone to 
include those details along with their nomination, which we will compile 
into an online document for quick reference.


The deadline for receiving nominations is up until the beginning of
the event where the election is held.  Any statements for the online
document need to be sent by Friday Oct 28th.  Please remember, if
you're not going to be present, that things go wrong with both networks
and mailing lists, so get your nomination in early.

Chris Mason, TAB Chair

[1] TAB members sit for a term of two years, and half of the board is up
for election every year. Five of the seats are up for election now.
The other five are halfway through their term and will be up for
election next year.


Re: btrfs bio linked list corruption.

2016-10-17 Thread Chris Mason

On Sat, Oct 15, 2016 at 08:42:40PM -0400, Dave Jones wrote:

On Thu, Oct 13, 2016 at 05:18:46PM -0400, Chris Mason wrote:

> >  > > .. and of course the first thing that happens is a completely different
> >  > > btrfs trace..
> >  > >
> >  > >
> >  > > WARNING: CPU: 1 PID: 21706 at fs/btrfs/transaction.c:489 start_transaction+0x40a/0x440 [btrfs]
> >  > > CPU: 1 PID: 21706 Comm: trinity-c16 Not tainted 4.8.0-think+ #14
> >  > >  c900019076a8 b731ff3c  
> >  > >  c900019076e8 b707a6c1 01e9f5806ce0 8804f74c4d98
> >  > >  0801 880501cfa2a8 008a 008a
> >  >
> >  > This isn't even IO.  Uuug.  We're going to need a fast enough test
> >  > that we can bisect.
> >
> > Progress...
> > I've found that this combination of syscalls..
> >
> > ./trinity -C64 -q -l off -a64 --enable-fds=testfile -c fsync -c fsetxattr -c lremovexattr -c pwritev2
> >
> > hits one of these two bugs in a few minutes runtime.
> >
> > Just the xattr syscalls + fsync isn't enough, neither is just pwrite + fsync.
> > Mix them together though, and something goes awry.
> >
> Hasn't triggered here yet.  I'll leave it running though.

The hits keep coming..

BUG: Bad page state in process kworker/u8:12  pfn:4988fa
page:ea0012623e80 count:0 mapcount:0 mapping:8804450456e0 index:0x9


Hmpf, I've had this running since Friday without failing.  Can you send 
me your .config please?


-chris


Re: lockdep warning in btrfs in 4.8-rc3

2016-09-09 Thread Chris Mason

On 09/08/2016 08:50 PM, Dave Jones wrote:

On Thu, Sep 08, 2016 at 08:58:48AM -0400, Chris Mason wrote:
 > On 09/08/2016 07:50 AM, Christian Borntraeger wrote:
 > > On 09/08/2016 01:48 PM, Christian Borntraeger wrote:
 > >> Chris,
 > >>
 > >> with 4.8-rc3 I get the following on an s390 box:
 > >
 > > Sorry for the noise, just saw the fix in your pull request.
 > >
 >
 > The lockdep splat is still there, we'll need to annotate this one a little.

Here's another one (unrelated?) that I've not seen before today:

WARNING: CPU: 1 PID: 10664 at kernel/locking/lockdep.c:704 register_lock_class+0x33f/0x510
CPU: 1 PID: 10664 Comm: kworker/u8:5 Not tainted 4.8.0-rc5-think+ #2
Workqueue: writeback wb_workfn (flush-btrfs-1)
 0097 b97fbad3 88013b8c3770 a63d3ab1
   a6bf1792 a60df22f
 88013b8c37b0 a60897a0 02c0b97fbad3 a6bf1792
Call Trace:
 [] dump_stack+0x6c/0x9b
 [] ? register_lock_class+0x33f/0x510
 [] __warn+0x110/0x130
 [] warn_slowpath_null+0x2c/0x40
 [] register_lock_class+0x33f/0x510
 [] ? bio_add_page+0x7e/0x120
 [] __lock_acquire.isra.32+0x5b/0x8c0
 [] lock_acquire+0x58/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] _raw_write_lock+0x38/0x70
 [] ? btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] btrfs_try_tree_write_lock+0x4a/0xb0 [btrfs]
 [] lock_extent_buffer_for_io+0x28/0x2e0 [btrfs]
 [] btree_write_cache_pages+0x231/0x550 [btrfs]
 [] ? btree_set_page_dirty+0x20/0x20 [btrfs]
 [] btree_writepages+0x74/0x90 [btrfs]
 [] do_writepages+0x3e/0x80
 [] __writeback_single_inode+0x42/0x220
 [] writeback_sb_inodes+0x351/0x730
 [] ? __wb_update_bandwidth+0x1c1/0x2b0
 [] wb_writeback+0x138/0x2a0
 [] wb_workfn+0x10e/0x340
 [] ? __lock_acquire.isra.32+0x1cf/0x8c0
 [] process_one_work+0x24f/0x5d0
 [] ? process_one_work+0x1e0/0x5d0
 [] worker_thread+0x53/0x5b0
 [] ? process_one_work+0x5d0/0x5d0
 [] kthread+0x120/0x140
 [] ? finish_task_switch+0x6a/0x200
 [] ret_from_fork+0x1f/0x40
 [] ? kthread_create_on_node+0x270/0x270
---[ end trace 7b39395c07435bf1 ]---


 700                 /*
 701                  * Huh! same key, different name? Did someone trample
 702                  * on some memory? We're most confused.
 703                  */
 704                 WARN_ON_ONCE(class->name != lock->name);

That seems kinda scary. There was a trinity run going on at the same time,
so this _might_ be a random scribble from something unrelated to btrfs,
but just in case..

IWBNI that code printed out both cases so I could see if this was
corruption or two unrelated keys. I'll make it do that in case it
happens again.
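
A sketch of what "printing out both cases" could look like, written as plain userspace C rather than as a lockdep patch. The function name, the message text and the buffer-based interface are hypothetical; only the pointer comparison comes from the quoted check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Compare the two name pointers the way the quoted check does and, on a
 * mismatch, format both names so the two cases can be told apart.
 * Returns 1 on mismatch, 0 otherwise.
 */
static int sk_report_name_mismatch(const char *class_name, const char *lock_name,
                                   char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    buf[0] = '\0';
    if (class_name == lock_name)        /* pointer comparison, not strcmp() */
        return 0;

    snprintf(buf, len, "same key, different name: class=\"%s\" lock=\"%s\"",
             class_name ? class_name : "(null)",
             lock_name ? lock_name : "(null)");
    return 1;
}
```

If one of the printed names comes out as garbage, that points at memory corruption; if both are readable but different, it points at two unrelated users sharing a key.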



I haven't seen this one before, if you could make it happen again, that 
would be great ;)


-chris



Dave



[GIT PULL] Btrfs

2016-09-09 Thread Chris Mason
Hi Linus,

We have three fixes in my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

I'm not proud of how long it took me to track down that one liner in
btrfs_sync_log(), but the good news is the patches I was trying to blame
for these problems were actually fine (sorry Filipe).

Wang Xiaoguang (2) commits (+16/-8):
btrfs: introduce tickets_id to determine whether asynchronous metadata reclaim work makes progress (+7/-5)
btrfs: do not decrease bytes_may_use when replaying extents (+9/-3)

Chris Mason (1) commits (+1/-0):
Btrfs: remove root_log_ctx from ctx list before btrfs_sync_log returns

Total: (3) commits (+17/-8)

 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/extent-tree.c | 23 +++
 fs/btrfs/tree-log.c|  1 +
 3 files changed, 17 insertions(+), 8 deletions(-)


Re: bio linked list corruption.

2016-10-18 Thread Chris Mason

On Tue, Oct 18, 2016 at 05:12:41PM -0600, Jens Axboe wrote:

On 10/18/2016 04:42 PM, Dave Jones wrote:

So Chris had me do a run on ext4 just for giggles. It took a while, but
eventually this fell out...


WARNING: CPU: 3 PID: 21324 at lib/list_debug.c:33 __list_add+0x89/0xb0
list_add corruption. prev->next should be next (e8c05648), but was c928bcd8. (prev=880503a145c0).
CPU: 3 PID: 21324 Comm: modprobe Not tainted 4.9.0-rc1-think+ #1
c9a6b7b8 81320e3c c9a6b808 
c9a6b7f8 8107a711 00210246 8805039f1740
880503a145c0 e8c05648 e8a05600 880502c39548
Call Trace:
[] dump_stack+0x4f/0x73
[] __warn+0xc1/0xe0
[] warn_slowpath_fmt+0x5a/0x80
[] __list_add+0x89/0xb0
[] blk_sq_make_request+0x2f8/0x350
[] ? generic_make_request+0xec/0x240
[] generic_make_request+0xf9/0x240
[] submit_bio+0x78/0x150
[] ? __find_get_block+0x126/0x130
[] submit_bh_wbc+0x16f/0x1e0
[] ? __end_buffer_read_notouch+0x20/0x20
[] ll_rw_block+0xa8/0xb0
[] __breadahead+0x3f/0x70
[] __ext4_get_inode_loc+0x37c/0x3d0
[] ext4_iget+0x8d/0xb90
[] ? d_alloc_parallel+0x329/0x700
[] ext4_iget_normal+0x2a/0x30
[] ext4_lookup+0x136/0x250
[] lookup_slow+0x12d/0x220
[] walk_component+0x1e7/0x310
[] ? path_init+0x4d8/0x520
[] path_lookupat+0x62/0x120
[] ? getname_flags+0x32/0x180
[] filename_lookup+0xa8/0x130
[] ? strncpy_from_user+0x46/0x170
[] ? getname_flags+0x4e/0x180
[] user_path_at_empty+0x31/0x40
[] vfs_fstatat+0x61/0xc0
[] ? __lock_acquire.isra.32+0x1cf/0x8c0
[] SYSC_newstat+0x2e/0x60
[] ? __this_cpu_preempt_check+0x13/0x20
[] SyS_newstat+0x9/0x10
[] do_syscall_64+0x5c/0x170
[] entry_SYSCALL64_slow_path+0x25/0x25

So this one isn't a btrfs specific problem as I first thought.

This sometimes reproduces within minutes, sometimes hours, which makes
it a pain to bisect.  It only started showing up this merge window though.


Chinner reported the same thing on XFS, I'll look into it asap.


Jens, not sure if you saw the whole thread.  This has triggered bad page 
state errors, and also corrupted a btrfs list.  It hurts me to say, but 
it might not actually be your fault.


-chris


Re: bio linked list corruption.

2016-10-18 Thread Chris Mason

On Tue, Oct 18, 2016 at 04:39:22PM -0700, Linus Torvalds wrote:

On Tue, Oct 18, 2016 at 4:31 PM, Chris Mason  wrote:


Jens, not sure if you saw the whole thread.  This has triggered bad page
state errors, and also corrupted a btrfs list.  It hurts me to say, but it
might not actually be your fault.


Where is that thread, and what is the "this" that triggers problems?

Looking at the "->mq_list" users, I'm not seeing any changes there in
the last year or so. So I don't think it's the list itself.


Seems to be the whole thing:

http://www.gossamer-threads.com/lists/linux/kernel/2545792

My guess is xattr, but I don't have a good reason for that.

-chris


Re: bio linked list corruption.

2016-10-18 Thread Chris Mason

On Tue, Oct 18, 2016 at 05:10:56PM -0700, Linus Torvalds wrote:

On Tue, Oct 18, 2016 at 4:42 PM, Chris Mason  wrote:


Seems to be the whole thing:


Ahh. On lkml, so I do have it in my mailbox, but Dave changed the
subject line when he tested on ext4 rather than btrfs..

Anyway, the corrupted address is somewhat interesting. As Dave Jones
said, he saw

 list_add corruption. prev->next should be next (e8806648),
but was c967fcd8. (prev=880503878b80).
 list_add corruption. prev->next should be next (e8c05648),
but was c928bcd8. (prev=880503a145c0).

and Dave Chinner reports

 list_add corruption. prev->next should be next (e8c02808),
but was c90005f6bda8. (prev=88013363bb80).

and it's worth noting that the "but was" is a remarkably consistent
vmalloc address (the c9000.. pattern gives it away). In fact, it's
identical across two boots for DaveJ in the low 14 bits, and fairly
high up in those low 14 bits (0x3cd8).

DaveC has a different address, but it's also in the vmalloc space, and
also looks like it is fairly high up in 14 bits (0x3da8). So in both
cases it's almost certainly a stack address with a fairly empty stack.
The differences are presumably due to different kernel configurations
and/or just different filesystems calling the same function that does
the same bad thing but now at different depths in the stack.
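
The arithmetic behind that observation can be checked directly. The sketch below assumes a 16 KiB stack (hence 14 bits of offset); the macro and function names are hypothetical, and the addresses in the usage note are copied as they appear in this archive, where the upper digits look truncated. Only the low 14 bits matter for the check.

```c
#include <assert.h>
#include <stdint.h>

#define SK_STACK_SIZE 0x4000u   /* 16 KiB: an assumption matching "low 14 bits" */

static_assert((SK_STACK_SIZE & (SK_STACK_SIZE - 1)) == 0,
              "stack size must be a power of two for the mask to work");

/* Offset of an address within its (size-aligned) stack. */
static unsigned int sk_stack_offset(uint64_t addr)
{
    return (unsigned int)(addr & (SK_STACK_SIZE - 1));
}
```

Both of DaveJ's values, c967fcd8 and c928bcd8, give offset 0x3cd8, and DaveC's c90005f6bda8 gives 0x3da8. Stacks grow downward, so offsets this close to 0x4000 mean only a few hundred bytes of stack were in use.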

Adding Andy to the cc, because this *might* be triggered by the
vmalloc stack code itself. Maybe the re-use of stacks showing some
problem? Maybe Chris (who can't see the problem) doesn't have
CONFIG_VMAP_STACK enabled?


CONFIG_VMAP_STACK=y, but maybe I just need to hammer on process creation 
more.  I'm testing in a hugely stripped down VM, so Dave might have more 
background stuff going on.


-chris


[GIT PULL] Btrfs

2016-09-03 Thread Chris Mason
Hi Linus,

We have a few small fixes queued up in my for-linus-4.8 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git 
for-linus-4.8

I'm still prepping a set of fixes for btrfs fsync, just nailing
down a hard to trigger memory corruption.  For now, these are tested and
ready:

Josef Bacik (1) commits (+5/-3):
Btrfs: kill invalid ASSERT() in process_all_refs()

Liu Bo (1) commits (+5/-3):
Btrfs: fix endless loop in balancing block groups

Wang Xiaoguang (1) commits (+5/-5):
btrfs: fix one bug that process may endlessly wait for ticket in wait_reserve_ticket()

Total: (3) commits (+15/-11)

 fs/btrfs/extent-tree.c | 10 +-
 fs/btrfs/relocation.c  |  8 +---
 fs/btrfs/send.c|  8 +---
 3 files changed, 15 insertions(+), 11 deletions(-)


Linux Plumbers call for organizers

2016-09-05 Thread Chris Mason


Each year, the Linux Foundation's Technical Advisory Board (TAB) seeks 
an organizing committee for the annual Linux Plumbers Conference; that 
process has now begun for the 2017 event.  This is your chance to put 
your stamp on one of our community's most important gatherings.


LPC 2017 will take place September 13-15, and will be colocated with 
Open Source Summit NA (formerly LinuxCon NA) at the JW Marriott in 
Downtown Los Angeles CA.


Interested groups should have, at a minimum, an events coordinator, a 
treasurer, a microconference chair, and a chairperson.  This group must 
be able to take the initiative to handle conference-specific details 
(including social events, the miniconf program, and more) while working 
with the Linux Foundation to ensure that logistics work smoothly.


The process for putting in an application to run the Linux Plumbers 
Conference is documented here:


https://wiki.linuxfoundation.org/en/LPC

Applications should be in by October 1st; the TAB then will announce a 
decision by (at the latest) November 11th.


If you're interested in submitting a proposal, but are concerned that 
you don't know enough about how previous Plumbers conferences have been run, then 
fear not! The TAB will support the selected organizing committee with 
additional volunteers with past Plumbers organizing experience. Above 
all we are looking for a capable and enthusiastic group who we can work 
with to make the 2017 Linux Plumbers Conference a great success.


If you have any questions about the submission process, please email the 
TAB at tech-bo...@lists.linux-foundation.org


Re: [Documentation] State of CPU controller in cgroup v2

2016-08-16 Thread Chris Mason



On 08/16/2016 10:07 AM, Peter Zijlstra wrote:

On Wed, Aug 10, 2016 at 06:09:44PM -0400, Johannes Weiner wrote:


[ That, and a disturbing number of emotional outbursts against
  systemd, which has nothing to do with any of this. ]


Oh, so I'm entirely dreaming this then:

  https://github.com/systemd/systemd/pull/3905

Completely unrelated.

Also, the argument there seems unfair at best, you don't need cpu-v2 for
buffered write control, you only need memcg and block co-mounted.



This isn't systemd dictating cgroups2 or systemd trying to get rid of 
v1.  But systemd is a common user of cgroups, and we do use it here in 
production.


We're just sending patches upstream for the tools we're using.  It's 
better than keeping them private, or reinventing a completely different 
tool that does almost the same thing.


-chris

