Re: USB upgrade fun
At Sun, 08 Oct 2017 17:58:10 +0800, Kai Hendry wrote:
> Hi there,
>
> My /mnt/raid1 suddenly became full, somewhat expectedly, so I bought 2
> new USB 4TB hard drives (one WD, one Seagate) to upgrade to.
>
> After adding sde and sdd I started to see errors in dmesg [2].
> https://s.natalian.org/2017-10-07/raid1-newdisks.txt
> [2] https://s.natalian.org/2017-10-07/btrfs-errors.txt

These messages are harmless. Qu is tackling this problem.

> Sidenote: I've since learnt that removing a drive actually deletes the
> contents of the drive? I don't want that. I was hoping to put that drive
> into cold storage. How do I remove a drive without losing data from a
> RAID1 configuration?

Let me clarify what you said. Are you worried about losing filesystem data
on the removed device, in this case /dev/sdc1? To be more specific, if
/mnt/raid1/file is on /dev/sdc1, would you lose this file by removing the
device?

If so, don't worry. When removing /dev/sdc1, the filesystem data on this
device is moved to the other devices, /dev/sdb1, /dev/sdd1, or /dev/sde1.

Just FYI, `btrfs replace /dev/sdc1 /dev/sdd1 /mnt/raid1` is more suitable
in your case.

> I assumed it had perhaps to do with the USB bus on my NUC5CPYB being
> maxed out, and to expedite the sync, I tried to remove one of the older
> 2TB drives, sdc1. However the load went crazy and my system became
> completely unstable. I shut down the machine and after an hour I hard
> powered it down since it seemed to hang (it's headless).

That is because all data on /dev/sdc1, in this case a total of 1.81TiB
(data) + 6.00GiB (metadata) + 32MiB (system), has to be moved to the
remaining devices.

> After a reboot it failed, namely because "nofail" wasn't in my fstab and
> systemd is pedantic by default. After managing to get it booting into my
> system without /mnt/raid1 I faced these "open ctree failed" issues.
> After running btrfs check on all the drives and getting nowhere, I
> decided to unplug the new drives and I discovered that when I take out
> the new 4TB WD drive, I could mount it with -o degraded.
>
> dmesg errors with the WD include "My Passport" Wrong diagnostic page;
> asked for 1 got 8 "Failed to get diagnostic page 0xffea" which
> raised my suspicions. The model number btw is WDBYFT0040BYI-WESN
>
> Anyway, I'm back up and running with 2x2TB (one of them didn't finish
> removing, I don't know which) & 1x4TB.
>
> [1] https://s.natalian.org/2017-10-08/usb-btrfs.txt
>
> I've decided to send the WD back for a refund. I've decided I want to
> keep the 2x2TB and RAID1 with the new 1x4TB disk. So 4TB total. btrfs
> complains now of "Some devices missing" [1]. How do I fix this
> situation?

Probably `btrfs device remove missing /mnt/raid1` works.

Thanks,
Satoru

> Any tips on how to name these individual disks? hdparm -I isn't a lot to
> go on.
>
> [hendry@nuc ~]$ sudo hdparm -I /dev/sdb1 | grep Model
> Model Number: ST4000LM024-2AN17V
> [hendry@nuc ~]$ sudo hdparm -I /dev/sdc1 | grep Model
> Model Number: ST2000LM003 HN-M201RAD
> [hendry@nuc ~]$ sudo hdparm -I /dev/sdd1 | grep Model
> Model Number: ST2000LM005 HN-M201AAD
>
> Ok, thanks. Hope you can guide me,

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
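The cost Satoru points out, that `btrfs device remove` must re-replicate everything that lived on the removed device onto the remaining ones (which is why removing sdc1 meant moving ~1.81TiB, while `btrfs replace` copies the old device to the new one directly), can be sketched as a toy model. This is plain Python, not btrfs code; the device names and sizes are hypothetical:

```python
def bytes_to_relocate(per_device_used, removed):
    """Bytes that must be rewritten elsewhere when `removed` is deleted."""
    return per_device_used[removed]


def can_remove(per_device_used, per_device_free, removed):
    """A RAID1 array needs >= 2 devices afterwards, plus enough free
    space on the remaining devices to absorb the removed device's chunks."""
    remaining = [d for d in per_device_used if d != removed]
    free_left = sum(per_device_free[d] for d in remaining)
    return len(remaining) >= 2 and \
        free_left >= bytes_to_relocate(per_device_used, removed)
```

The model ignores chunk granularity and the RAID1 rule that both copies of a chunk must sit on distinct devices, but it captures why the removal kept all four USB drives busy for hours.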
[PATCHSET] cgroup, writeback, btrfs: make sure btrfs issues metadata IOs from the root cgroup
Hello,

btrfs has different ways to issue metadata IOs and may end up issuing
metadata or otherwise shared IOs from a non-root cgroup, which can lead to
priority inversion and ineffective IO isolation. This patchset makes sure
that btrfs issues all metadata and shared IOs from the root cgroup by
exempting btree_inodes from cgroup writeback and explicitly associating
shared IOs with the root cgroup.

This patchset contains the following three patches.

 [PATCH 1/3] cgroup, writeback: replace SB_I_CGROUPWB with per-inode S_CGROUPWB
 [PATCH 2/3] cgroup, writeback: implement submit_bh_blkcg_css()
 [PATCH 3/3] btrfs: ensure that metadata and flush are issued from the root cgroup

and is also available in the following git branch.

 git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-cgroup-btrfs-metadata

diffstat follows. Thanks.

 fs/block_dev.c              |  3 +--
 fs/btrfs/check-integrity.c  |  2 +-
 fs/btrfs/disk-io.c          |  4
 fs/btrfs/ioctl.c            |  6 +-
 fs/btrfs/super.c            |  1 -
 fs/buffer.c                 | 12
 fs/ext2/inode.c             |  3 ++-
 fs/ext2/super.c             |  1 -
 fs/ext4/inode.c             |  5 -
 fs/ext4/super.c             |  2 --
 fs/fs-writeback.c           |  1 +
 include/linux/backing-dev.h |  2 +-
 include/linux/buffer_head.h | 11 +++
 include/linux/fs.h          |  3 ++-
 include/linux/writeback.h   |  6 --
 15 files changed, 48 insertions(+), 14 deletions(-)

--
tejun
[PATCH 2/3] cgroup, writeback: implement submit_bh_blkcg_css()
Add wbc->blkcg_css so that the blkcg_css association can be specified
independently, and implement submit_bh_blkcg_css() using it. This will be
used to override cgroup membership on specific buffer_heads.

Signed-off-by: Tejun Heo
Cc: Jan Kara
Cc: Jens Axboe
---
 fs/buffer.c                 | 12 ++++++++++++
 fs/fs-writeback.c           |  1 +
 include/linux/buffer_head.h | 11 +++++++++++
 include/linux/writeback.h   |  6 ++++--
 4 files changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 170df85..fac4f9a 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -3143,6 +3143,18 @@ int submit_bh(int op, int op_flags, struct buffer_head *bh)
 }
 EXPORT_SYMBOL(submit_bh);

+#ifdef CONFIG_CGROUP_WRITEBACK
+int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
+                        struct cgroup_subsys_state *blkcg_css)
+{
+       struct writeback_control wbc = { .blkcg_css = blkcg_css };
+
+       /* @wbc is used just to override the bio's blkcg_css */
+       return submit_bh_wbc(op, op_flags, bh, 0, &wbc);
+}
+EXPORT_SYMBOL(submit_bh_blkcg_css);
+#endif
+
 /**
  * ll_rw_block: low-level access to block devices (DEPRECATED)
  * @op: whether to %READ or %WRITE
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 245c430..cd882e6 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -538,6 +538,7 @@ void wbc_attach_and_unlock_inode(struct writeback_control *wbc,
 	}

 	wbc->wb = inode_to_wb(inode);
+	wbc->blkcg_css = wbc->wb->blkcg_css;
 	wbc->inode = inode;

 	wbc->wb_id = wbc->wb->memcg_css->id;
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index c8dae55..abb4dd4 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -13,6 +13,7 @@
 #include
 #include
 #include
+#include

 #ifdef CONFIG_BLOCK

@@ -205,6 +206,16 @@ int bh_submit_read(struct buffer_head *bh);
 loff_t page_cache_seek_hole_data(struct inode *inode, loff_t offset,
 				 loff_t length, int whence);

+#ifdef CONFIG_CGROUP_WRITEBACK
+int submit_bh_blkcg_css(int op, int op_flags, struct buffer_head *bh,
+                        struct cgroup_subsys_state *blkcg_css);
+#else
+static inline int submit_bh_blkcg_css(int op, int op_flags,
+                                      struct buffer_head *bh,
+                                      struct cgroup_subsys_state *blkcg_css)
+{
+       return submit_bh(op, op_flags, bh);
+}
+#endif
+
 extern int buffer_heads_over_limit;

 /*
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index d581579..81e5946 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -91,6 +91,8 @@ struct writeback_control {
 	unsigned for_sync:1;		/* sync(2) WB_SYNC_ALL writeback */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback *wb;	/* wb this writeback is issued under */
+	struct cgroup_subsys_state *blkcg_css; /* usually wb->blkcg_css but
+						  may be overridden */
 	struct inode *inode;		/* inode being written out */

 	/* foreign inode detection, see wbc_detach_inode() */
@@ -277,8 +279,8 @@ static inline void wbc_init_bio(struct writeback_control *wbc, struct bio *bio)
 	 * behind a slow cgroup.  Ultimately, we want pageout() to kick off
 	 * regular writeback instead of writing things out itself.
 	 */
-	if (wbc->wb)
-		bio_associate_blkcg(bio, wbc->wb->blkcg_css);
+	if (wbc->blkcg_css)
+		bio_associate_blkcg(bio, wbc->blkcg_css);
 }

 #else	/* CONFIG_CGROUP_WRITEBACK */
--
2.9.5
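The mechanism in this patch (a wbc-carried blkcg_css that defaults to the wb's css but can be overridden per-submission) can be illustrated with a toy model. This is plain Python, not kernel code; the bio is modeled as a dict and the css values are arbitrary strings:

```python
class WritebackControl:
    def __init__(self, blkcg_css=None):
        self.blkcg_css = blkcg_css

    def attach_to_wb(self, wb_blkcg_css):
        # mirrors wbc_attach_and_unlock_inode():
        # wbc->blkcg_css = wb->blkcg_css
        self.blkcg_css = wb_blkcg_css


def wbc_init_bio(wbc, bio):
    # mirrors the patched wbc_init_bio(): associate the bio with
    # wbc->blkcg_css instead of reaching into wbc->wb directly
    if wbc.blkcg_css is not None:
        bio["blkcg_css"] = wbc.blkcg_css


def submit_bh_blkcg_css(bio, blkcg_css):
    # mirrors submit_bh_blkcg_css(): the throwaway wbc exists only to
    # carry the css override down to the bio association step
    wbc = WritebackControl(blkcg_css=blkcg_css)
    wbc_init_bio(wbc, bio)
    return bio
```

The point of the indirection is that normal writeback and the explicit override share one code path: both just set `wbc->blkcg_css`, and `wbc_init_bio()` no longer needs to know where the css came from.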
[PATCH 1/3] cgroup, writeback: replace SB_I_CGROUPWB with per-inode S_CGROUPWB
Currently, filesystems can indicate cgroup writeback support per
superblock; however, depending on the filesystem, especially if inodes are
used to carry metadata, it can be useful to indicate cgroup writeback
support per inode.

This patch replaces the superblock flag SB_I_CGROUPWB with the per-inode
S_CGROUPWB, so that cgroup writeback can be enabled selectively.

* block_dev sets the new flag in bdget() when initializing a new inode.

* ext2/4 set the new flag in the ext?_set_inode_flags() functions.

* btrfs sets the new flag in the btrfs_update_iflags() function. Note
  that this automatically excludes btree_inode, which doesn't use
  btrfs_update_iflags() during initialization. This is an intended
  behavior change.

Signed-off-by: Tejun Heo
Cc: Jan Kara
Cc: Jens Axboe
Cc: Chris Mason
Cc: Josef Bacik
Cc: linux-btrfs@vger.kernel.org
Cc: "Theodore Ts'o"
Cc: Andreas Dilger
Cc: linux-e...@vger.kernel.org
---
 fs/block_dev.c              | 3 +--
 fs/btrfs/ioctl.c            | 4 +++-
 fs/btrfs/super.c            | 1 -
 fs/ext2/inode.c             | 3 ++-
 fs/ext2/super.c             | 1 -
 fs/ext4/inode.c             | 5 -
 fs/ext4/super.c             | 2 --
 include/linux/backing-dev.h | 2 +-
 include/linux/fs.h          | 3 ++-
 9 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index 93d088f..ff9c282 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -800,8 +800,6 @@ static struct dentry *bd_mount(struct file_system_type *fs_type,
 	struct dentry *dent;

 	dent = mount_pseudo(fs_type, "bdev:", &bdev_sops, NULL, BDEVFS_MAGIC);
-	if (!IS_ERR(dent))
-		dent->d_sb->s_iflags |= SB_I_CGROUPWB;
 	return dent;
 }

@@ -893,6 +891,7 @@ struct block_device *bdget(dev_t dev)
 		inode->i_mode = S_IFBLK;
 		inode->i_rdev = dev;
 		inode->i_bdev = bdev;
+		inode->i_flags |= S_CGROUPWB;
 		inode->i_data.a_ops = &def_blk_aops;
 		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
 		spin_lock(&bdev_lock);
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 6c7a49f..117cc63 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -150,9 +150,11 @@ void btrfs_update_iflags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (ip->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
+	new_fl |= S_CGROUPWB;

 	set_mask_bits(&inode->i_flags,
-		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC,
+		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
+		      S_CGROUPWB,
 		      new_fl);
 }

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 35a128a..46a1488 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1136,7 +1136,6 @@ static int btrfs_fill_super(struct super_block *sb,
 	sb->s_flags |= MS_POSIXACL;
 #endif
 	sb->s_flags |= MS_I_VERSION;
-	sb->s_iflags |= SB_I_CGROUPWB;

 	err = super_setup_bdi(sb);
 	if (err) {
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 4dca6f3..3c5d822 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1372,7 +1372,7 @@ void ext2_set_inode_flags(struct inode *inode)
 	unsigned int flags = EXT2_I(inode)->i_flags;

 	inode->i_flags &= ~(S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME |
-				S_DIRSYNC | S_DAX);
+				S_DIRSYNC | S_DAX | S_CGROUPWB);
 	if (flags & EXT2_SYNC_FL)
 		inode->i_flags |= S_SYNC;
 	if (flags & EXT2_APPEND_FL)
@@ -1385,6 +1385,7 @@ void ext2_set_inode_flags(struct inode *inode)
 		inode->i_flags |= S_DIRSYNC;
 	if (test_opt(inode->i_sb, DAX) && S_ISREG(inode->i_mode))
 		inode->i_flags |= S_DAX;
+	inode->i_flags |= S_CGROUPWB;
 }

 struct inode *ext2_iget (struct super_block *sb, unsigned long ino)
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 1458706..e6ba669e 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -922,7 +922,6 @@ static int ext2_fill_super(struct super_block *sb, void *data, int silent)
 	sb->s_flags = (sb->s_flags & ~MS_POSIXACL) |
 		((EXT2_SB(sb)->s_mount_opt & EXT2_MOUNT_POSIX_ACL) ?
 		 MS_POSIXACL : 0);
-	sb->s_iflags |= SB_I_CGROUPWB;

 	if (le32_to_cpu(es->s_rev_level) == EXT2_GOOD_OLD_REV &&
 	    (EXT2_HAS_COMPAT_FEATURE(sb, ~0U) ||
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875..344f12b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4591,8 +4591,11 @@ void ext4_set_inode_flags(struct inode *inode)
 	    !ext4_should_journal_data(inode) && !ext4_has_inline_data(inode) &&
 	    !ext4_encrypted_inode(inode))
 		new_fl |= S_DAX;
+	if (test_opt(inode->i_sb, DATA_FLAGS) !=
[PATCH 3/3] btrfs: ensure that metadata and flush are issued from the root cgroup
Issuing metadata or otherwise shared IOs from a !root cgroup can lead to
priority inversion. This patch ensures that those IOs are always issued
from the root cgroup.

This patch updates btrfs_update_iflags() to not set S_CGROUPWB on
btree_inodes. This isn't strictly necessary as those inodes don't call the
function during init; however, this serves as documentation and prevents
possible future mistakes. If this isn't desirable, please feel free to
drop the section.

Signed-off-by: Tejun Heo
Cc: Chris Mason
Cc: Josef Bacik
---
 fs/btrfs/check-integrity.c | 2 +-
 fs/btrfs/disk-io.c         | 4 ++++
 fs/btrfs/ioctl.c           | 4 +++-
 3 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/check-integrity.c b/fs/btrfs/check-integrity.c
index 7d5a9b5..058dea6 100644
--- a/fs/btrfs/check-integrity.c
+++ b/fs/btrfs/check-integrity.c
@@ -2741,7 +2741,7 @@ int btrfsic_submit_bh(int op, int op_flags, struct buffer_head *bh)
 	struct btrfsic_dev_state *dev_state;

 	if (!btrfsic_is_initialized)
-		return submit_bh(op, op_flags, bh);
+		return submit_bh_blkcg_css(op, op_flags, bh, blkcg_root_css);
 	mutex_lock(&btrfsic_mutex);
 	/* since btrfsic_submit_bh() might also be called before
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab84..fe8bbe1 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1025,6 +1025,8 @@ static blk_status_t btree_submit_bio_hook(void *private_data, struct bio *bio,
 	int async = check_async_write(bio_flags);
 	blk_status_t ret;

+	bio_associate_blkcg(bio, blkcg_root_css);
+
 	if (bio_op(bio) != REQ_OP_WRITE) {
 		/*
 		 * called for a read, do the setup so that checksum validation
@@ -3512,6 +3514,8 @@ static void write_dev_flush(struct btrfs_device *device)
 		return;

 	bio_reset(bio);
+	bio_associate_blkcg(bio, blkcg_root_css);
+
 	bio->bi_end_io = btrfs_end_empty_barrier;
 	bio_set_dev(bio, device->bdev);
 	bio->bi_opf = REQ_OP_WRITE | REQ_SYNC | REQ_PREFLUSH;
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 117cc63..8a7db6c 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -150,7 +150,9 @@ void btrfs_update_iflags(struct inode *inode)
 		new_fl |= S_NOATIME;
 	if (ip->flags & BTRFS_INODE_DIRSYNC)
 		new_fl |= S_DIRSYNC;
-	new_fl |= S_CGROUPWB;
+	/* btree_inodes are always in the root cgroup */
+	if (btrfs_ino(ip) != BTRFS_BTREE_INODE_OBJECTID)
+		new_fl |= S_CGROUPWB;

 	set_mask_bits(&inode->i_flags,
 		      S_SYNC | S_APPEND | S_IMMUTABLE | S_NOATIME | S_DIRSYNC |
 		      S_CGROUPWB,
 		      new_fl);
--
2.9.5
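The btree-inode exemption in the btrfs_update_iflags() hunk is compact, so a toy model may help. This is plain Python, not kernel code: set_mask_bits() clears the managed bits and ORs in the freshly computed ones, and the btree inode never gets S_CGROUPWB, so its metadata writeback stays in the root cgroup. The flag values below are arbitrary stand-ins, not the kernel's:

```python
S_SYNC, S_NOATIME, S_DIRSYNC, S_CGROUPWB = 0x1, 0x2, 0x4, 0x8
MANAGED_FLAGS = S_SYNC | S_NOATIME | S_DIRSYNC | S_CGROUPWB
BTRFS_BTREE_INODE_OBJECTID = 1


def set_mask_bits(value, mask, new_bits):
    """Like the kernel helper: clear `mask`, then set `new_bits`."""
    return (value & ~mask) | new_bits


def update_iflags(i_flags, ino, dirsync=False):
    """Model of the patched btrfs_update_iflags()."""
    new_fl = S_DIRSYNC if dirsync else 0
    if ino != BTRFS_BTREE_INODE_OBJECTID:
        new_fl |= S_CGROUPWB  # regular inodes opt into cgroup writeback
    return set_mask_bits(i_flags, MANAGED_FLAGS, new_fl)
```

Because the whole managed mask is cleared first, a previously set S_CGROUPWB is also dropped again if the inode turns out to be the btree inode, which is the "documentation / future mistakes" safety the changelog mentions.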
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Tue, Oct 10, 2017 at 08:09:20AM +1100, Dave Chinner wrote:
> On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote:
> > On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote:
> > > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> > > > > Integrating into fstests means it will be immediately available to
> > > > > all fs developers, it'll run on everything that everyone already has
> > > > > setup for filesystem testing, and it will have familiar mkfs/mount
> > > > > option setup behaviour so there's no new hoops for everyone to jump
> > > > > through to run it...
> > > >
> > > > TBF I specifically made it as easy as possible because I know we all
> > > > hate trying to learn new shit.
> > >
> > > Yeah, it's also hard to get people to change their workflows to add
> > > a whole new test harness into them. It's easy if it's just a new
> > > command to an existing workflow :P
> >
> > Agreed, so if you probably won't run this outside of fstests then I'll
> > add it to xfstests. I envision this tool as being run by maintainers to
> > verify their pull requests haven't regressed since the last set of
> > patches, as well as by anybody trying to fix performance problems. So
> > it's way more important to me that you, Ted, and all the various btrfs
> > maintainers will run it than anybody else.
> >
> > > > I figured this was different enough to warrant a separate project,
> > > > especially since I'm going to add block device jobs so Jens can
> > > > test block layer things. If we all agree we'd rather see this in
> > > > fstests then I'm happy to do that too. Thanks,
> > >
> > > I'm not fussed either way - it's a good discussion to have, though.
> > >
> > > If I want to add tests (e.g. my time-honoured fsmark tests), where
> > > should I send patches?
> >
> > I beat you to that! I wanted to avoid adding fs_mark to the suite
> > because it means parsing another different set of outputs, so I added a
> > new ioengine to fio for this
> >
> > http://www.spinics.net/lists/fio/msg06367.html
> >
> > and added a fio job to do 500k files
> >
> > https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio
> >
> > The test is disabled by default for now because obviously the fio
> > support hasn't landed yet.
>
> That seems misguided. fio is good, but it's not a universal
> solution.
>
> > I'd _like_ to expand fio for cases we come up with that aren't
> > possible, as there's already a ton of measurements that are taken,
> > especially around latencies.
>
> To be properly useful it needs to support more than just fio to run
> tests. Indeed, it's largely useless to me if that's all it can do or
> it's a major pain to add support for different tools like fsmark.
>
> e.g. my typical perf regression test that you see, the concurrent
> fsmark create workload, is actually a lot more. It does:
>
> 	fsmark to create 50m zero length files
> 	umount,
> 	run parallel xfs_repair (excellent mmap_sem/page fault punisher)
> 	mount
> 	run parallel find -ctime (readdir + lookup traversal)
> 	unmount, mount
> 	run parallel ls -R (readdir + dtype traversal)
> 	unmount, mount
> 	parallel rm -rf of 50m files
>
> I have variants that use small 4k files or large files rather than
> empty files, that use different fsync patterns to stress the
> log, use grep -R to traverse the data as well as
> the directory/inode structure instead of find, etc.
>
> > That said I'm not opposed to throwing new stuff in there, it just
> > means we have to add stuff to parse the output and store it in the
> > database in a consistent way, which seems like more of a pain than
> > just making fio do what we need it to. Thanks,
>
> fio is not going to be able to replace the sort of perf tests I run
> from week to week. If that's all it's going to do then it's not
> directly useful to me...

Agreed, I'm just going to add this stuff to fstests since I'd like to be
able to use the _require stuff to make sure we only run stuff we have
support for. I'll wire that up this week and send patches along. Thanks,

Josef
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 09:00:51AM -0400, Josef Bacik wrote:
> On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote:
> > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote:
> > > > Integrating into fstests means it will be immediately available to
> > > > all fs developers, it'll run on everything that everyone already has
> > > > setup for filesystem testing, and it will have familiar mkfs/mount
> > > > option setup behaviour so there's no new hoops for everyone to jump
> > > > through to run it...
> > >
> > > TBF I specifically made it as easy as possible because I know we all
> > > hate trying to learn new shit.
> >
> > Yeah, it's also hard to get people to change their workflows to add
> > a whole new test harness into them. It's easy if it's just a new
> > command to an existing workflow :P
>
> Agreed, so if you probably won't run this outside of fstests then I'll
> add it to xfstests. I envision this tool as being run by maintainers to
> verify their pull requests haven't regressed since the last set of
> patches, as well as by anybody trying to fix performance problems. So
> it's way more important to me that you, Ted, and all the various btrfs
> maintainers will run it than anybody else.
>
> > > I figured this was different enough to warrant a separate project,
> > > especially since I'm going to add block device jobs so Jens can test
> > > block layer things. If we all agree we'd rather see this in fstests
> > > then I'm happy to do that too. Thanks,
> >
> > I'm not fussed either way - it's a good discussion to have, though.
> >
> > If I want to add tests (e.g. my time-honoured fsmark tests), where
> > should I send patches?
>
> I beat you to that! I wanted to avoid adding fs_mark to the suite
> because it means parsing another different set of outputs, so I added a
> new ioengine to fio for this
>
> http://www.spinics.net/lists/fio/msg06367.html
>
> and added a fio job to do 500k files
>
> https://github.com/josefbacik/fsperf/blob/master/tests/500kemptyfiles.fio
>
> The test is disabled by default for now because obviously the fio
> support hasn't landed yet.

That seems misguided. fio is good, but it's not a universal
solution.

> I'd _like_ to expand fio for cases we come up with that aren't possible,
> as there's already a ton of measurements that are taken, especially
> around latencies.

To be properly useful it needs to support more than just fio to run
tests. Indeed, it's largely useless to me if that's all it can do or
it's a major pain to add support for different tools like fsmark.

e.g. my typical perf regression test that you see, the concurrent
fsmark create workload, is actually a lot more. It does:

	fsmark to create 50m zero length files
	umount,
	run parallel xfs_repair (excellent mmap_sem/page fault punisher)
	mount
	run parallel find -ctime (readdir + lookup traversal)
	unmount, mount
	run parallel ls -R (readdir + dtype traversal)
	unmount, mount
	parallel rm -rf of 50m files

I have variants that use small 4k files or large files rather than
empty files, that use different fsync patterns to stress the
log, use grep -R to traverse the data as well as
the directory/inode structure instead of find, etc.

> That said I'm not opposed to throwing new stuff in there, it just
> means we have to add stuff to parse the output and store it in the
> database in a consistent way, which seems like more of a pain than just
> making fio do what we need it to. Thanks,

fio is not going to be able to replace the sort of perf tests I run
from week to week. If that's all it's going to do then it's not
directly useful to me...

Cheers,

Dave.
--
Dave Chinner
da...@fromorbit.com
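Dave's multi-phase workflow is the kind of thing a harness would want to express as data rather than as a single fio job. A sketch of that idea (not fsperf or fstests code; the commands, flags, and paths below are illustrative only, and dry_run=True keeps the sketch runnable without a scratch filesystem):

```python
import subprocess

# One (label, argv) pair per phase, in the order described above. A real
# harness would unmount/mount between phases to drop caches, and would
# parse each tool's output into a common result store.
PHASES = [
    ("create", ["fs_mark", "-d", "/mnt/scratch", "-n", "100000", "-s", "0"]),
    ("repair", ["xfs_repair", "-n", "/dev/scratch"]),
    ("lookup", ["find", "/mnt/scratch", "-ctime", "1"]),
    ("readdir", ["ls", "-R", "/mnt/scratch"]),
    ("unlink", ["rm", "-rf", "/mnt/scratch/0"]),
]


def run_phases(phases, dry_run=True):
    """Run each phase in order; return the labels that were (or would be)
    executed, so the caller can attach timings per phase."""
    done = []
    for label, argv in phases:
        if not dry_run:
            subprocess.run(argv, check=True)
        done.append(label)
    return done
```

Structuring phases as data is what makes "add stuff to parse the output and store it in the database in a consistent way" tractable: each phase can carry its own parser without the runner caring which tool produced the output.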
[PATCH] Btrfs: avoid losing data raid profile when deleting a device
We've avoided losing data's raid profile when doing balance, but it turns
out that deleting a device could also result in the same problem. This
fixes the problem by creating an empty data chunk before relocating the
data chunk.

Metadata/system chunks are supposed to have non-zero bytes all the time,
so their raid profile is persistent.

Reported-by: James Alandt
Signed-off-by: Liu Bo
---
 fs/btrfs/volumes.c | 87 ++
 1 file changed, 68 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 4a72c45..3f48bcd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -144,6 +144,8 @@ static int __btrfs_map_block(struct btrfs_fs_info *fs_info, u64 logical,
 			     u64 *length,
 			     struct btrfs_bio **bbio_ret,
 			     int mirror_num, int need_raid_map);
+static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
+				      u64 chunk_offset);

 DEFINE_MUTEX(uuid_mutex);
 static LIST_HEAD(fs_uuids);
@@ -3476,7 +3478,6 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	u32 count_meta = 0;
 	u32 count_sys = 0;
 	int chunk_reserved = 0;
-	u64 bytes_used = 0;

 	/* step one make some room on all the devices */
 	devices = &fs_info->fs_devices->devices;
@@ -3635,28 +3636,22 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 			goto loop;
 		}

-		ASSERT(fs_info->data_sinfo);
-		spin_lock(&fs_info->data_sinfo->lock);
-		bytes_used = fs_info->data_sinfo->bytes_used;
-		spin_unlock(&fs_info->data_sinfo->lock);
-
-		if ((chunk_type & BTRFS_BLOCK_GROUP_DATA) &&
-		    !chunk_reserved && !bytes_used) {
-			trans = btrfs_start_transaction(chunk_root, 0);
-			if (IS_ERR(trans)) {
-				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
-				ret = PTR_ERR(trans);
-				goto error;
-			}
-
-			ret = btrfs_force_chunk_alloc(trans, fs_info,
-						      BTRFS_BLOCK_GROUP_DATA);
-			btrfs_end_transaction(trans);
+		if (!chunk_reserved) {
+			/*
+			 * We may be relocating the only data chunk we have,
+			 * which could potentially end up with losing data's
+			 * raid profile, so lets allocate an empty one in
+			 * advance.
+			 */
+			ret = btrfs_may_alloc_data_chunk(fs_info,
+							 found_key.offset);
 			if (ret < 0) {
 				mutex_unlock(&fs_info->delete_unused_bgs_mutex);
 				goto error;
+			} else if (ret == 1) {
+				chunk_reserved = 1;
 			}
-			chunk_reserved = 1;
 		}

 		ret = btrfs_relocate_chunk(fs_info, found_key.offset);
@@ -4327,6 +4322,48 @@ int btrfs_check_uuid_tree(struct btrfs_fs_info *fs_info)
 }

 /*
+ * return 1 : allocate a data chunk successfully,
+ * return <0: errors during allocating a data chunk,
+ * return 0 : no need to allocate a data chunk.
+ */
+static int btrfs_may_alloc_data_chunk(struct btrfs_fs_info *fs_info,
+				      u64 chunk_offset)
+{
+	struct btrfs_block_group_cache *cache;
+	u64 bytes_used;
+	u64 chunk_type;
+
+	cache = btrfs_lookup_block_group(fs_info, chunk_offset);
+	ASSERT(cache);
+	chunk_type = cache->flags;
+	btrfs_put_block_group(cache);
+
+	if (chunk_type & BTRFS_BLOCK_GROUP_DATA) {
+		spin_lock(&fs_info->data_sinfo->lock);
+		bytes_used = fs_info->data_sinfo->bytes_used;
+		spin_unlock(&fs_info->data_sinfo->lock);
+
+		if (!bytes_used) {
+			struct btrfs_trans_handle *trans;
+			int ret;
+
+			trans = btrfs_join_transaction(fs_info->tree_root);
+			if (IS_ERR(trans))
+				return PTR_ERR(trans);
+
+			ret = btrfs_force_chunk_alloc(trans, fs_info,
+						      BTRFS_BLOCK_GROUP_DATA);
+			btrfs_end_transaction(trans);
+			if (ret < 0)
+				return ret;
+
+			return 1;
+		}
+	}
+	return 0;
+}
+
+/*
 * shrinking a device means finding all of the
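The return-value contract documented in the new helper's comment (1 = allocated a data chunk, <0 = error, 0 = nothing to do) can be modeled in a few lines. This is plain Python, not the kernel code; `force_alloc` is a callable standing in for btrfs_force_chunk_alloc(), and the block-group flag value is a stand-in:

```python
BTRFS_BLOCK_GROUP_DATA = 0x1


def may_alloc_data_chunk(chunk_type, data_bytes_used, force_alloc):
    """Allocate an empty data chunk only when relocating a data chunk
    while no data space is in use (the case where the raid profile
    would otherwise be lost)."""
    if chunk_type & BTRFS_BLOCK_GROUP_DATA and data_bytes_used == 0:
        ret = force_alloc()
        if ret < 0:
            return ret  # propagate allocation errors
        return 1        # new empty data chunk reserved
    return 0            # metadata/system chunk, or data still in use
```

The caller then maps 1 to "mark chunk_reserved", <0 to the error path, and 0 to "carry on", mirroring the `if (ret < 0) ... else if (ret == 1)` shape of the hunk in __btrfs_balance().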
[PATCH] Fstest: btrfs/151: test if device delete ends up with losing raid profile
Currently, running 'btrfs device delete' can end up losing the data raid
profile (if any); this test reproduces the problem.

The fix is
  "Btrfs: avoid losing data raid profile when deleting a device"

Signed-off-by: Liu Bo
---
 tests/btrfs/151     | 73 +
 tests/btrfs/151.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 76 insertions(+)
 create mode 100755 tests/btrfs/151
 create mode 100644 tests/btrfs/151.out

diff --git a/tests/btrfs/151 b/tests/btrfs/151
new file mode 100755
index 000..1866cb6
--- /dev/null
+++ b/tests/btrfs/151
@@ -0,0 +1,73 @@
+#! /bin/bash
+# FS QA Test 151
+#
+# Test if the data chunk's raid profile is lost after 'btrfs device
+# remove'.
+#
+# The fix is
+#   Btrfs: avoid losing data raid profile when deleting a device
+#
+#---
+# Copyright (c) 2017 Oracle.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+_require_scratch_dev_pool 3
+
+# create raid1 for data
+_scratch_pool_mkfs "-d raid1 -b 1G" >> $seqres.full 2>&1
+
+# we need an empty data chunk, so nospace_cache is required.
+_scratch_mount -onospace_cache
+
+# if data chunk is empty, 'btrfs device remove' can change raid1 to
+# single.
+$BTRFS_UTIL_PROG device delete 2 $SCRATCH_MNT >> $seqres.full
+
+df_ret=`$BTRFS_UTIL_PROG filesystem df $SCRATCH_MNT`
+echo $df_ret | $AWK_PROG -F ':' '/Data,/ {print $1}'
+
+# success, all done
+status=0
+exit
diff --git a/tests/btrfs/151.out b/tests/btrfs/151.out
new file mode 100644
index 000..0a1de06
--- /dev/null
+++ b/tests/btrfs/151.out
@@ -0,0 +1,2 @@
+QA output created by 151
+Data, RAID1
diff --git a/tests/btrfs/group b/tests/btrfs/group
index e73bb1b..a7ff7b0 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -153,3 +153,4 @@
 148 auto quick rw
 149 auto quick send compress
 150 auto quick dangerous
+151 auto quick
--
2.5.0
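The golden-output check hinges on the awk pipeline (`$AWK_PROG -F ':' '/Data,/ {print $1}'`), which keeps only the "Data, <profile>" prefix of `btrfs filesystem df` output. For readers less fluent in awk, a plain Python equivalent; the sample lines are made-up output, and startswith() is a simplification of awk's `/Data,/` anywhere-match:

```python
def data_profile(df_output):
    """Return the 'Data, <profile>' field of `btrfs filesystem df`
    output, i.e. everything before the first colon on the Data line."""
    for line in df_output.splitlines():
        if line.startswith("Data,"):
            return line.split(":", 1)[0]
    return None


SAMPLE_DF = """\
Data, RAID1: total=1.00GiB, used=512.00KiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=1.00GiB, used=112.00KiB
"""
```

If the device delete silently converted the data profile, the extracted field would read "Data, single" instead of "Data, RAID1" and the test's output comparison would fail, which is exactly the regression being checked.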
Re: Why do full balance and deduplication reduce available free space?
On Mon, 02 Oct 2017 22:19:32 +0200, Niccolò Belli wrote:
> On 2017-10-02 21:35, Kai Krakow wrote:
> > Besides defragging removing the reflinks, duperemove will unshare
> > your snapshots when used in this way: If it sees duplicate blocks
> > within the subvolumes you give it, it will potentially unshare
> > blocks from the snapshots while rewriting extents.
> >
> > BTW, you should be able to use duperemove with read-only snapshots
> > if used in read-only-open mode. But I'd rather suggest to use bees
> > instead: It works at whole-volume level, walking extents instead of
> > files. That way it is much faster, doesn't reprocess already
> > deduplicated extents, and it works with read-only snapshots.
> >
> > Until my patch it didn't like mixed nodatasum/datasum workloads.
> > Currently this is fixed by just leaving nocow data alone, as users
> > probably set nocow for exactly the reason to not fragment extents
> > and relocate blocks.
>
> "Bad Btrfs Feature Interactions: btrfs read-only snapshots (never
> tested, probably wouldn't work well)"
>
> Unfortunately it seems that bees doesn't support read-only snapshots,
> so it's a no-go.
>
> P.S.
> I tried duperemove with -A, but besides taking much longer it didn't
> improve the situation.
> Are you sure that the culprit is duperemove? AFAIK it shouldn't
> unshare extents...

Unsharing of extents depends... If an extent is shared between a r/o and
a r/w snapshot, rewriting the extent for deduplication ends up in a
shared extent again, but it is no longer reflinked with the original r/o
snapshot. At least if btrfs doesn't allow changing extents that are part
of a r/o snapshot... which you all tell me is the case...

And then, there's unsharing of metadata by the deduplication process
itself.

Both effects should be minimal, tho. But since chunks are allocated in
1GB sizes, the allocation may jump by a whole 1GB even if only a few
extra MB are needed. A metadata rebalance may fix this.

--
Regards,
Kai
Replies to list-only preferred.
-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
WARNING: ... at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1be/0x1d0 [btrfs]
Hi,

I updated the kernel from 4.11.10 to 4.13.4 and since then I get the following message in the kernel journal when calling 'scrub' or 'balance'. I use Fedora 26 with btrfs-progs v4.9.1. What does this mean and (more importantly) what can I do?

Bye,
Frank

BTRFS info (device dm-7): relocating block group 44050690342912 flags system|raid1
BTRFS info (device dm-7): found 117 extents
------------[ cut here ]------------
WARNING: CPU: 3 PID: 22095 at fs/btrfs/ctree.h:1559 btrfs_update_device+0x1be/0x1d0 [btrfs]
Modules linked in: rpcsec_gss_krb5 veth xt_nat xt_addrtype br_netfilter xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun nf_conntrack_sane xt_CT ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack libcrc32c iptable_mangle iptable_raw iptable_security ebtable_filter ebtables ip6table_filter ip6_tables ftsteutates btrfs xor raid6_pq tda18212 cxd2841er tpm_crb intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm iTCO_wdt irqbypass iTCO_vendor_support intel_cstate intel_uncore mei_wdt intel_rapl_perf ppdev hci_uart ddbridge dvb_core btbcm btqca btintel mei_me i2c_i801 shpchp bluetooth joydev mei intel_pch_thermal wmi intel_lpss_acpi intel_lpss pinctrl_sunrisepoint fujitsu_laptop parport_pc sparse_keymap ecdh_generic parport tpm_tis pinctrl_intel tpm_tis_core rfkill tpm acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc dm_crypt i915 crct10dif_pclmul i2c_algo_bit crc32_pclmul drm_kms_helper crc32c_intel e1000e drm ghash_clmulni_intel serio_raw ptp pps_core video i2c_hid
CPU: 3 PID: 22095 Comm: btrfs Tainted: G W 4.13.4-200.fc26.x86_64 #1
Hardware name: FUJITSU D3417-B1/D3417-B1, BIOS V5.0.0.11 R1.12.0 for D3417-B1x 02/09/2016
RIP: 0010:btrfs_update_device+0x1be/0x1d0 [btrfs]
[register and stack dump garbled in the archive; leading address digits lost]
Call Trace:
 btrfs_remove_chunk+0x365/0x870 [btrfs]
 btrfs_relocate_chunk+0x7e/0xd0 [btrfs]
 btrfs_balance+0xc07/0x1390 [btrfs]
 btrfs_ioctl_balance+0x319/0x380 [btrfs]
 btrfs_ioctl+0x9d5/0x24a0 [btrfs]
 ? lru_cache_add+0x3a/0x80
 ? lru_cache_add_active_or_unevictable+0x4c/0xf0
 ? __handle_mm_fault+0x939/0x10b0
 do_vfs_ioctl+0xa5/0x600
 ? do_brk_flags+0x230/0x360
 ? do_vfs_ioctl+0xa5/0x600
 SyS_ioctl+0x79/0x90
 entry_SYSCALL_64_fastpath+0x1a/0xa5
RIP: 0033:0x7f10b15e65e7
[userspace register and code dump garbled in the archive]
---[ end trace d1e1c8aff99bfeb8 ]---
Re: [PATCH] btrfs: prefix sysfs attribute struct names
On Sun, Oct 08, 2017 at 10:30:58PM +0200, Hans van Kranenburg wrote:
> Currently struct names for sysfs are generated only based on the
> attribute names. This means that attribute names cannot be reused in
> multiple places throughout the complete btrfs sysfs hierarchy.
>
> E.g. allocation/data/total_bytes and allocation/data/single/total_bytes
> result in the same struct name btrfs_attr_total_bytes. A workaround for
> this case was made in the past by ad hoc creating an extra macro
> wrapper, BTRFS_RAID_ATTR, that inserts some extra text in the struct
> name.
>
> Instead of polluting sysfs.h with such extra macro definitions,
> and only doing so when there are collisions, use a prefix which gets
> inserted in the struct name, so we keep everything nicely grouped
> together by default.
>
> Current collections of attributes are:
> * (the toplevel, empty prefix)
> * allocation
> * space_info
> * raid
> * features
>
> Signed-off-by: Hans van Kranenburg

Reviewed-by: David Sterba

> @@ -277,7 +277,8 @@ static ssize_t raid_bytes_show(struct kobject *kobj,
>
>  	down_read(&sinfo->groups_sem);
>  	list_for_each_entry(block_group, &sinfo->block_groups[index], list) {
> -		if (&attr->attr == BTRFS_RAID_ATTR_PTR(total_bytes))
> +		if (&attr->attr == \

the \ is not needed here, only in macro definitions that must be on one logical line

> +		    BTRFS_ATTR_PTR(raid, total_bytes))
>  			val += block_group->key.offset;
>  		else
>  			val += btrfs_block_group_used(&block_group->item);
> @@ -331,19 +332,20 @@ SPACE_INFO_ATTR(bytes_may_use);
>  SPACE_INFO_ATTR(bytes_readonly);
>  SPACE_INFO_ATTR(disk_used);
>  SPACE_INFO_ATTR(disk_total);
> -BTRFS_ATTR(total_bytes_pinned, btrfs_space_info_show_total_bytes_pinned);
> +BTRFS_ATTR(space_info, total_bytes_pinned, \

same here, will be fixed at commit time.

> +	   btrfs_space_info_show_total_bytes_pinned);
Re: [PATCH] btrfs: Fix bool initialization/comparison
On Sat, Oct 07, 2017 at 04:02:21PM +0200, Thomas Meyer wrote:
> Bool initializations should use true and false. Bool tests don't need
> comparisons.
>
> Signed-off-by: Thomas Meyer

Reviewed-by: David Sterba
Re: [PATCH v2] Btrfs: heuristic add shannon entropy calculation
On Sun, Oct 08, 2017 at 04:11:59PM +0300, Timofey Titovets wrote:
> The byte distribution check in the heuristic will filter edge data
> cases and sometimes fails to classify input data.
>
> Let's fix that by adding a Shannon entropy calculation,
> which will cover classification of most other data types.
>
> As Shannon entropy needs log2 with some precision to work,
> let's use ilog2(N) and, to increase precision,
> do ilog2(pow(N, 4)).
>
> The Shannon entropy is slightly changed to avoid signed numbers
> and divisions.
>
> Changes:
> v1 -> v2:
>   - Replace log2_lshift4 with ilog2
>   - To compensate accuracy errors of ilog2:
>     @ENTROPY_LVL_ACEPTABLE 70 -> 65
>     @ENTROPY_LVL_HIGH 85 -> 80
>   - Drop usage of u64 from shannon calculation
>
> Signed-off-by: Timofey Titovets
> ---
>  fs/btrfs/compression.c | 83 +-
>  1 file changed, 82 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 0060bc4ae5f2..8efbce5633b5 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -34,6 +34,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include "ctree.h"
>  #include "disk-io.h"
>  #include "transaction.h"
> @@ -1224,6 +1225,60 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
>  	return 1;
>  }
>
> +/*
> + * Shannon entropy calculation
> + *
> + * Pure byte distribution analysis fails to determine
> + * compressibility of data. Try calculating entropy to
> + * estimate the average minimum number of bits needed
> + * to encode a sample of data.
> + *
> + * For convenience, return a percentage of needed bits,
> + * instead of the amount of bits directly.
> + *
> + * @ENTROPY_LVL_ACEPTABLE - below that threshold the sample has low byte
> + * entropy and can be compressible with high probability
> + *
> + * @ENTROPY_LVL_HIGH - data are not compressible with high probability
> + *
> + * Use of ilog2() decreases precision, so to see an ~correct answer
> + * the LVLs are decreased by 5.
> + */
> +
> +#define ENTROPY_LVL_ACEPTABLE 65
> +#define ENTROPY_LVL_HIGH 80
> +
> +/*
> + * To increase precision in shannon_entropy calculation,
> + * let's do pow(n, M) to save more digits after the decimal point
> + *
> + * Max int bit length is 64
> + * ilog2(MAX_SAMPLE_SIZE) -> 13
> + * 13*4 = 52 < 64 -> M = 4
> + * So use pow(n, 4)
> + */
> +static inline u32 ilog2_w(u64 n)
> +{
> +	return ilog2(n * n * n * n);
> +}
> +
> +static u32 shannon_entropy(struct heuristic_ws *ws)
> +{
> +	const u32 entropy_max = 8*ilog2_w(2);
> +	u32 entropy_sum = 0;
> +	u32 p, p_base, sz_base;
> +	u32 i;
> +
> +	sz_base = ilog2_w(ws->sample_size);
> +	for (i = 0; i < BUCKET_SIZE && ws->bucket[i].count > 0; i++) {
> +		p = ws->bucket[i].count;
> +		p_base = ilog2_w(p);
> +		entropy_sum += p * (sz_base - p_base);
> +	}
> +
> +	entropy_sum /= ws->sample_size;
> +	return entropy_sum * 100 / entropy_max;
> +}

That's much better :) I'll add the patch to the rest of the heuristics.
Re: [PATCH 1/1] Btrfs: heuristic add shannon entropy calculation
On Sat, Oct 07, 2017 at 03:08:26AM +0300, Timofey Titovets wrote:
> > I hear the compiler scream, trying to optimize all the ifs. And I think
> > the CPU branch prediction would have a hard time, there's practically no
> > expected pattern and the performance could be worse compared to the
> > formula calculation.
> >
> > What's the reason you opencode it like that? Could you please post how
> > it would be implemented in C? There are potentially optimized functions
> > to do integer log2 (ilog2).
>
> That is just:
> log2_lshift4(n) { return ilog2(pow(n, 16)); }
> But that will cause an overflow.
>
> log2(8192) -> 13 - 13 bits long
> 13 * 4 = 52, 52 < 64
>
> That also restricts the sample size to 15 bits
>
> So we can get approximately the same behaviour
> by doing ilog2(pow(n, 4));
>
> https://pastebin.com/VryvqkCy - some numbers
> TL;DR version:
> pow(n, 4)
> - count100 728300
> - error %0.0% 1.1%3.0%
> pow(n, 16)
> - count 188431148314
> - error % 0.0% 1.0%2.8%
>
> (I sampled the Win10.iso image as just a big file with different data.)
> I generate the error value by:
> if (shannon_e_f > 0)
>     error = 100 - (shannon_e_i * 1.0 * 100 / shannon_e_f);
>
> In fact that causes errors in the entropy level like:
> TRUE -> Integer
> 83 -> 80
> 87 -> 84
> 32 -> 28
>
> In theory it's possible to just make the entropy level markers more aggressive,
> like: 70 -> 65 and 85 -> 80
>
> That will cover the precision errors.

With the v2 patch and direct calculation, I'm ok with leaving some imprecision there. This is fine-tuning that can be done in parallel.

> or make a byte-level BigInt in kernel (bigint_log2, bigint_lshift & etc) %)
> (But this seems inexcusable to me + I'm unsure about performance + I'm
> afraid of Big/Little Endian)

I think we need to keep it fast and not anything too complex to calculate, unless there's a significant win in the heuristic results.

> > Code like that could be microbenchmarked so we can see actual numbers,
> > so far I'm just guessing.
> I mentioned in the cover letter https://github.com/Nefelim4ag/companalyze
> (it implements the heuristic with no optimizations, all code runs, to get
> RAW counters)
> Over in-memory badly compressible data:
> 8188. BSize: 131072, RepPattern: 0, BSet: 256, BCSet: 220, ShanEi%:
> 99| 99 ~0.0%, RndDist: 0, out: -3
>
> With -O0 - ~1100 MiB/s
> With -O2 - ~2500 MiB/s
>
> With (I found it in the kernel and added pow(n, 4))
> static int ilog2(uint64_t v)
> {
>     int l = 0;
>     v = v*v*v*v;
>     while ((1UL << l) < v)
>         l++;

The while() implementation is generic and works on all architectures, but x86_64 has an instruction for that and will be even faster:

ilog2 -> http://elixir.free-electrons.com/linux/latest/source/include/linux/log2.h#L26
fls (find last set bit) -> http://elixir.free-electrons.com/linux/latest/source/arch/x86/include/asm/bitops.h#L450
bsrl

>     return l;
> }
>
> With -O0 - ~1150 MiB/s
> With -O2 - ~2500 MiB/s
>
> So my solution looks worse in code than in performance
> (it may be that because most requests for badly compressible data go
> directly to the lookup table)

Ok, thanks for collecting the numbers.
[PATCH v2 2/3] btrfs: rename page offset parameter in submit_extent_page
We're going to remove sector_t and will use 'offset', so this patch frees the name.

Reviewed-by: Liu Bo
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 12ab19a4b93e..ea9ab934794b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2762,7 +2762,7 @@ static int merge_bio(struct extent_io_tree *tree, struct page *page,
 static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 			      struct writeback_control *wbc,
 			      struct page *page, sector_t sector,
-			      size_t size, unsigned long offset,
+			      size_t size, unsigned long pg_offset,
 			      struct block_device *bdev,
 			      struct bio **bio_ret,
 			      bio_end_io_t end_io_func,
@@ -2786,8 +2786,8 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 		if (prev_bio_flags != bio_flags || !contig || force_bio_submit ||
-		    merge_bio(tree, page, offset, page_size, bio, bio_flags) ||
-		    bio_add_page(bio, page, page_size, offset) < page_size) {
+		    merge_bio(tree, page, pg_offset, page_size, bio, bio_flags) ||
+		    bio_add_page(bio, page, page_size, pg_offset) < page_size) {
 			ret = submit_one_bio(bio, mirror_num, prev_bio_flags);
 			if (ret < 0) {
 				*bio_ret = NULL;
@@ -2802,7 +2802,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 	}
 	bio = btrfs_bio_alloc(bdev, sector << 9);
-	bio_add_page(bio, page, page_size, offset);
+	bio_add_page(bio, page, page_size, pg_offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
 	bio->bi_write_hint = page->mapping->host->i_write_hint;
-- 
2.14.0
[PATCH v2 1/3] btrfs: scrub: get rid of sector_t
The use of sector_t is not necessary, it's just for a warning. Switch to u64, rename the variable, and use byte units instead of 512b, ie. dropping the >> 9 shifts. The messages are adjusted as well.

Reviewed-by: Liu Bo
Signed-off-by: David Sterba
---
 fs/btrfs/scrub.c | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index e3f6c49e5c4d..cd1b791d9706 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -231,7 +231,7 @@ struct scrub_warning {
 	struct btrfs_path *path;
 	u64 extent_item_size;
 	const char *errstr;
-	sector_t sector;
+	u64 physical;
 	u64 logical;
 	struct btrfs_device *dev;
 };
@@ -797,10 +797,10 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 	 */
 	for (i = 0; i < ipath->fspath->elem_cnt; ++i)
 		btrfs_warn_in_rcu(fs_info,
-"%s at logical %llu on dev %s, sector %llu, root %llu, inode %llu, offset %llu, length %llu, links %u (path: %s)",
+"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu, length %llu, links %u (path: %s)",
 				  swarn->errstr, swarn->logical,
 				  rcu_str_deref(swarn->dev->name),
-				  (unsigned long long)swarn->sector,
+				  swarn->physical,
 				  root, inum, offset,
 				  min(isize - offset, (u64)PAGE_SIZE), nlink,
 				  (char *)(unsigned long)ipath->fspath->val[i]);
@@ -810,10 +810,10 @@ static int scrub_print_warning_inode(u64 inum, u64 offset, u64 root,
 err:
 	btrfs_warn_in_rcu(fs_info,
-		"%s at logical %llu on dev %s, sector %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
+		"%s at logical %llu on dev %s, physical %llu, root %llu, inode %llu, offset %llu: path resolving failed with ret=%d",
 			  swarn->errstr, swarn->logical,
 			  rcu_str_deref(swarn->dev->name),
-			  (unsigned long long)swarn->sector,
+			  swarn->physical,
 			  root, inum, offset, ret);

 	free_ipath(ipath);
@@ -845,7 +845,7 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 	if (!path)
 		return;

-	swarn.sector = (sblock->pagev[0]->physical) >> 9;
+	swarn.physical = sblock->pagev[0]->physical;
 	swarn.logical = sblock->pagev[0]->logical;
 	swarn.errstr = errstr;
 	swarn.dev = NULL;
@@ -868,10 +868,10 @@ static void scrub_print_warning(const char *errstr, struct scrub_block *sblock)
 					item_size, &ref_root, &ref_level);
 			btrfs_warn_in_rcu(fs_info,
-		"%s at logical %llu on dev %s, sector %llu: metadata %s (level %d) in tree %llu",
+"%s at logical %llu on dev %s, physical %llu: metadata %s (level %d) in tree %llu",
 				errstr, swarn.logical,
 				rcu_str_deref(dev->name),
-				(unsigned long long)swarn.sector,
+				swarn.physical,
 				ref_level ? "node" : "leaf",
 				ret < 0 ? -1 : ref_level,
 				ret < 0 ? -1 : ref_root);
-- 
2.14.0
[PATCH v2 3/3] btrfs: get rid of sector_t and use u64 offset in submit_extent_page
The use of sector_t in the callchain of submit_extent_page is not necessary. Switch to u64, rename the variable, and use byte units instead of 512b, ie. dropping the >> 9 shifts and avoiding the con(tro)versions of sector_t.

Reviewed-by: Liu Bo
Signed-off-by: David Sterba
---
 fs/btrfs/extent_io.c | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index ea9ab934794b..8d0834ee792b 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -2761,7 +2761,7 @@ static int merge_bio(struct extent_io_tree *tree, struct page *page,
  */
 static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 			      struct writeback_control *wbc,
-			      struct page *page, sector_t sector,
+			      struct page *page, u64 offset,
 			      size_t size, unsigned long pg_offset,
 			      struct block_device *bdev,
 			      struct bio **bio_ret,
@@ -2776,6 +2776,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 	int contig = 0;
 	int old_compressed = prev_bio_flags & EXTENT_BIO_COMPRESSED;
 	size_t page_size = min_t(size_t, size, PAGE_SIZE);
+	sector_t sector = offset >> 9;

 	if (bio_ret && *bio_ret) {
 		bio = *bio_ret;
@@ -2801,7 +2802,7 @@ static int submit_extent_page(unsigned int opf, struct extent_io_tree *tree,
 		}
 	}

-	bio = btrfs_bio_alloc(bdev, sector << 9);
+	bio = btrfs_bio_alloc(bdev, offset);
 	bio_add_page(bio, page, page_size, pg_offset);
 	bio->bi_end_io = end_io_func;
 	bio->bi_private = tree;
@@ -2892,7 +2893,6 @@ static int __do_readpage(struct extent_io_tree *tree,
 	u64 last_byte = i_size_read(inode);
 	u64 block_start;
 	u64 cur_end;
-	sector_t sector;
 	struct extent_map *em;
 	struct block_device *bdev;
 	int ret = 0;
@@ -2928,6 +2928,7 @@ static int __do_readpage(struct extent_io_tree *tree,
 	}
 	while (cur <= end) {
 		bool force_bio_submit = false;
+		u64 offset;

 		if (cur >= last_byte) {
 			char *userpage;
@@ -2967,9 +2968,9 @@ static int __do_readpage(struct extent_io_tree *tree,
 		iosize = ALIGN(iosize, blocksize);
 		if (this_bio_flag & EXTENT_BIO_COMPRESSED) {
 			disk_io_size = em->block_len;
-			sector = em->block_start >> 9;
+			offset = em->block_start;
 		} else {
-			sector = (em->block_start + extent_offset) >> 9;
+			offset = em->block_start + extent_offset;
 			disk_io_size = iosize;
 		}
 		bdev = em->bdev;
@@ -3062,8 +3063,8 @@ static int __do_readpage(struct extent_io_tree *tree,
 		}

 		ret = submit_extent_page(REQ_OP_READ | read_flags, tree, NULL,
-					 page, sector, disk_io_size, pg_offset,
-					 bdev, bio,
+					 page, offset, disk_io_size,
+					 pg_offset, bdev, bio,
 					 end_bio_extent_readpage, mirror_num,
 					 *bio_flags,
 					 this_bio_flag,
@@ -3324,7 +3325,6 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 	u64 extent_offset;
 	u64 block_start;
 	u64 iosize;
-	sector_t sector;
 	struct extent_map *em;
 	struct block_device *bdev;
 	size_t pg_offset = 0;
@@ -3367,6 +3367,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,

 	while (cur <= end) {
 		u64 em_end;
+		u64 offset;

 		if (cur >= i_size) {
 			if (tree->ops && tree->ops->writepage_end_io_hook)
@@ -3388,7 +3389,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		BUG_ON(end < cur);
 		iosize = min(em_end - cur, end - cur + 1);
 		iosize = ALIGN(iosize, blocksize);
-		sector = (em->block_start + extent_offset) >> 9;
+		offset = em->block_start + extent_offset;
 		bdev = em->bdev;
 		block_start = em->block_start;
 		compressed = test_bit(EXTENT_FLAG_COMPRESSED, &em->flags);
@@ -3431,7 +3432,7 @@ static noinline_for_stack int __extent_writepage_io(struct inode *inode,
 		}

 		ret = submit_extent_page(REQ_OP_WRITE | write_flags, tree, wbc,
-					 page, sector, iosize, pg_offset,
+
[PATCH v2 0/3] Cleanup sector_t
v2:
- update missing message s/sector/physical/
- updated changelogs where the semantics changed
- added reviewed-by

David Sterba (3):
  btrfs: scrub: get rid of sector_t
  btrfs: rename page offset parameter in submit_extent_page
  btrfs: get rid of sector_t and use u64 offset in submit_extent_page

 fs/btrfs/extent_io.c | 31 ++++++++++++++++---------------
 fs/btrfs/scrub.c     | 16 ++++++++--------
 2 files changed, 24 insertions(+), 23 deletions(-)

-- 
2.14.0
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 08:54:16AM -0400, Josef Bacik wrote:
> I purposefully used as little as possible, just json and sqlite, and I
> tried to use as few python3 isms as possible. Any rpm based systems
> should have these libraries already installed, I agree that using any of
> the PyPI stuff is a pain and is a non-starter for me as I want to make it
> as easy as possible to get running.
>
> Where do you fall on the question of including it in xfstests? I expect
> that the perf stuff would be run more as maintainers put their pull
> requests together to make sure things haven't regressed. To that end I
> was going to wire up xfstests-bld to run this as well. Whatever you and
> Dave prefer is what I'll do, I'll use it wherever it ends up so you two
> are the ones that get to decide.

I'm currently using Python 2.7 mainly because the LTM subsystem in gce-xfstests was implemented using that version of Python, and my initial efforts to port it to Python 3 were... not smooth. (Because it was doing authentication, I got bit by the Python 2 vs Python 3 "bytes vs. strings vs. unicode" transition especially hard.)

So I'm going to be annoyed by needing to package Python 2.7 and Python 3.5 in my test VM's, but I can deal, and this will probably be the forcing function for me to figure out how to make that jump.

To be honest, the bigger issue I'm going to have to figure out is how to manage the state in the sqlite database across disposable VM's running in parallel. And the assumption being made with having a static sqlite database on a test machine is that the hardware capabilities of the machine are static, and that's not true with a VM, whether it's running via Qemu or in some cloud environment.

I'm not going to care that much about Python 3 not being available on enterprise distro's, but I could see that being annoying for some folks. I'll let them worry about that.
The main thing I think we'll need to worry about once we let Python into xfstests is to be pretty careful about specifying what version of Python the scripts need to be portable against (Python 3.3? 3.4? 3.5?) and what versions of python packages get imported. The bottom line is that I'm personally supportive of adding Python and fsperf to xfstests. We just need to be careful about the portability concerns, not just now, but any time we check in new Python code. And having some documented Python style guidelines, and adding unit tests so we can notice potential portability breakages would be a good idea. - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/16] btrfs: Use pagevec_lookup_range_tag()
We want only pages from the given range in btree_write_cache_pages() and extent_write_cache_pages(). Use pagevec_lookup_range_tag() instead of pagevec_lookup_tag() and remove the unnecessary code.

CC: linux-btrfs@vger.kernel.org
CC: David Sterba
Reviewed-by: David Sterba
Reviewed-by: Daniel Jordan
Signed-off-by: Jan Kara
---
 fs/btrfs/extent_io.c | 19 ++++---------------
 1 file changed, 4 insertions(+), 15 deletions(-)

diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 970190cd347e..a4eb6c988f27 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -3818,8 +3818,8 @@ int btree_write_cache_pages(struct address_space *mapping,
 	if (wbc->sync_mode == WB_SYNC_ALL)
 		tag_pages_for_writeback(mapping, index, end);
 	while (!done && !nr_to_write_done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+	       (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+			tag, PAGEVEC_SIZE))) {
 		unsigned i;

 		scanned = 1;
@@ -3829,11 +3829,6 @@ int btree_write_cache_pages(struct address_space *mapping,
 			if (!PagePrivate(page))
 				continue;

-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				break;
-			}
-
 			spin_lock(&mapping->private_lock);
 			if (!PagePrivate(page)) {
 				spin_unlock(&mapping->private_lock);
@@ -3965,8 +3960,8 @@ static int extent_write_cache_pages(struct address_space *mapping,
 		tag_pages_for_writeback(mapping, index, end);
 	done_index = index;
 	while (!done && !nr_to_write_done && (index <= end) &&
-	       (nr_pages = pagevec_lookup_tag(&pvec, mapping, &index, tag,
-			min(end - index, (pgoff_t)PAGEVEC_SIZE-1) + 1))) {
+	       (nr_pages = pagevec_lookup_range_tag(&pvec, mapping, &index, end,
+			tag, PAGEVEC_SIZE))) {
 		unsigned i;

 		scanned = 1;
@@ -3991,12 +3986,6 @@ static int extent_write_cache_pages(struct address_space *mapping,
 				continue;
 			}

-			if (!wbc->range_cyclic && page->index > end) {
-				done = 1;
-				unlock_page(page);
-				continue;
-			}
-
 			if (wbc->sync_mode != WB_SYNC_NONE) {
 				if (PageWriteback(page))
 					flush_fn(data);
-- 
2.12.3
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote:
> One thing that comes up a lot every LSF is the fact that we have no
> general way of doing performance testing. Every fs developer has a set
> of scripts or things that they run with varying degrees of consistency,
> but nothing central that we all use. I for one am getting tired of
> finding regressions when we are deploying new kernels internally, so I
> wired this thing up to try and address this need.
>
> We all hate convoluted setups, the more brain power we have to put in to
> setting something up the less likely we are to use it, so I took the
> xfstests approach of making it relatively simple to get running and
> relatively easy to add new tests. For right now the only thing this
> framework does is run fio scripts. I chose fio because it already
> gathers loads of performance data about its runs. We have everything we
> need there, latency, bandwidth, cpu time, and all broken down by reads,
> writes, and trims. I figure most of us are familiar enough with fio and
> how it works to make it relatively easy to add new tests to the
> framework.
>
> I've posted my code up on github, you can get it here
>
> https://github.com/josefbacik/fsperf

Let me propose an existing framework that is capable of what is in fsperf, and much more. It's Mel Gorman's mmtests http://github.com/gormanm/mmtests . I've been using it for a year or so, built a few scripts on top of that to help me set up configs for specific machines or run tests in sequences.

What are the capabilities regarding filesystem tests:

* create and mount filesystems (based on configs)
* start various workloads, that are possibly adapted to the machine (cpu, memory); there are many types, we'd be interested in those touching filesystems
* gather system statistics - cpu, memory, IO, latency; there are scripts that understand the output of various benchmarking tools (fio, dbench, ffsb, tiobench, bonnie, fs_mark, iozone, blogbench, ...)
* export the results into plain text or html, with tables and graphs
* it is already able to compare results of several runs, with statistical indicators

The testsuite is actively used and maintained, which means that the efforts are mostly on the configuration side. From the user's perspective this means spending time on the setup, and the rest will work as expected. Ie. you don't have to start debugging the suite because there are some version mismatches.

> All (well most) of the results from fio are stored in a local sqlite
> database. Right now the comparison stuff is very crude, it simply checks
> against the previous run and it only checks a few of the keys by
> default. You can check latency if you want, but while writing this
> stuff up it seemed that latency was too variable from run to run to be
> useful in a "did my thing regress or improve" sort of way.
>
> The configuration is brain dead simple, the README has examples. All
> you need to do is make your local.cfg, run ./setup and then run ./fsperf
> and you are good to go.
>
> The plan is to add lots of workloads as we discover regressions and
> such. We don't want anything that takes too long to run otherwise
> people won't run this, so the existing tests don't take much longer than
> a few minutes each.

Sorry, this is IMO the wrong approach to benchmarking. Can you exercise the filesystem enough in a few minutes? Can you write at least twice the memory size of data to the filesystem? Everything works fine when it's from caches and the filesystem is fresh. With that you can simply start using phoronix-test-suite and be done, with the same quality of results we all roll eyes about.

Targeted tests using fio are fine and I understand the need to keep it minimal. mmtests has support for fio and any jobfile can be used, internally implemented with the 'fio --cmdline' option that will transform it to a shell variable that's passed to fio in the end.

As proposed in the thread, why not use xfstests?
It could be suitable for the configs, mkfs/mount and running, but I think it would need a lot of work to enhance the result gathering and presentation. Essentially duplicating mmtests from that side.

I was positively surprised by various performance monitors that I was not primarily interested in, like memory allocations or context switches. This gives deeper insights into the system and may help in analyzing the benchmark results.

Side note: you can run xfstests from mmtests, ie. the machine/options configuration is shared.

I'm willing to write more about the actual usage of mmtests, but at this point I'm proposing the whole framework for consideration.
[PATCH 3.2 24/74] btrfs: Don't clear SGID when inheriting ACLs
3.2.94-rc1 review patch. If anyone has any objections, please let me know.

--

From: Jan Kara

commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream.

When a new directory 'DIR1' is created in a directory 'DIR0' with the SGID bit set, DIR1 is expected to have the SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in the SGID bit on 'DIR1' getting cleared if the user is not a member of the owning group.

Fix the problem by moving posix_acl_update_mode() out of __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be called when inheriting ACLs, which is what we want as it prevents SGID bit clearing, and the mode has been properly set by posix_acl_create() anyway.

Fixes: 073931017b49d9458aa351605b43a7e34598caef
CC: linux-btrfs@vger.kernel.org
CC: David Sterba
Signed-off-by: Jan Kara
Signed-off-by: David Sterba
[bwh: Backported to 3.2: move the call to posix_acl_update_mode() into btrfs_xattr_acl_set()]
Signed-off-by: Ben Hutchings
---
--- a/fs/btrfs/acl.c
+++ b/fs/btrfs/acl.c
@@ -117,12 +117,6 @@ static int btrfs_set_acl(struct btrfs_tr
 	switch (type) {
 	case ACL_TYPE_ACCESS:
 		name = POSIX_ACL_XATTR_ACCESS;
-		if (acl) {
-			ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
-			if (ret)
-				return ret;
-		}
-		ret = 0;
 		break;
 	case ACL_TYPE_DEFAULT:
 		if (!S_ISDIR(inode->i_mode))
@@ -161,11 +155,12 @@ static int btrfs_xattr_acl_set(struct de
 	int ret;
 	struct posix_acl *acl = NULL;
+	struct inode *inode = dentry->d_inode;

-	if (!inode_owner_or_capable(dentry->d_inode))
+	if (!inode_owner_or_capable(inode))
 		return -EPERM;

-	if (!IS_POSIXACL(dentry->d_inode))
+	if (!IS_POSIXACL(inode))
 		return -EOPNOTSUPP;

 	if (value) {
@@ -180,7 +175,12 @@ static int btrfs_xattr_acl_set(struct de
 		}
 	}

-	ret = btrfs_set_acl(NULL, dentry->d_inode, acl, type);
+	if (type == ACL_TYPE_ACCESS && acl) {
+		ret = posix_acl_update_mode(inode, &inode->i_mode, &acl);
+		if (ret)
+			goto out;
+	}
+	ret = btrfs_set_acl(NULL, inode, acl, type);
 out:
 	posix_acl_release(acl);
[PATCH 3.16 075/192] btrfs: Don't clear SGID when inheriting ACLs
3.16.49-rc1 review patch. If anyone has any objections, please let me know. -- From: Jan Kara -- commit b7f8a09f8097db776b8d160862540e4fc1f51296 upstream. When new directory 'DIR1' is created in a directory 'DIR0' with SGID bit set, DIR1 is expected to have SGID bit set (and owning group equal to the owning group of 'DIR0'). However when 'DIR0' also has some default ACLs that 'DIR1' inherits, setting these ACLs will result in SGID bit on 'DIR1' to get cleared if user is not member of the owning group. Fix the problem by moving posix_acl_update_mode() out of __btrfs_set_acl() into btrfs_set_acl(). That way the function will not be called when inheriting ACLs which is what we want as it prevents SGID bit clearing and the mode has been properly set by posix_acl_create() anyway. Fixes: 073931017b49d9458aa351605b43a7e34598caef CC: linux-btrfs@vger.kernel.org CC: David Sterba Signed-off-by: Jan Kara Signed-off-by: David Sterba [bwh: Backported to 3.16: adjust context] Signed-off-by: Ben Hutchings --- fs/btrfs/acl.c | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) --- a/fs/btrfs/acl.c +++ b/fs/btrfs/acl.c @@ -82,12 -82,6 @@ static int __btrfs_set_acl(struct btrfs_ switch (type) { case ACL_TYPE_ACCESS: name = POSIX_ACL_XATTR_ACCESS; - if (acl) { - ret = posix_acl_update_mode(inode, &inode->i_mode, &acl); - if (ret) - return ret; - } - ret = 0; break; case ACL_TYPE_DEFAULT: if (!S_ISDIR(inode->i_mode)) @@ -123,6 +117,13 @@ out: int btrfs_set_acl(struct inode *inode, struct posix_acl *acl, int type) { + int ret; + + if (type == ACL_TYPE_ACCESS && acl) { + ret = posix_acl_update_mode(inode, &inode->i_mode, &acl); + if (ret) + return ret; + } return __btrfs_set_acl(NULL, inode, acl, type); }
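[Editor's note] The ordering problem the commit message describes can be modeled outside the kernel. The Python sketch below is NOT the kernel code; the helper imitates only the SGID rule attributed to posix_acl_update_mode() (clear SGID when the caller is not in the owning group, ignoring the CAP_FSETID case the real helper also checks), to show why running it during ACL inheritance strips a bit that posix_acl_create() had already set correctly.

```python
import stat

def update_mode_like_posix_acl_update_mode(mode, caller_in_owning_group):
    """Imitates only the SGID rule described in the commit message:
    clear SGID when the caller is not a member of the owning group
    (the real helper also exempts CAP_FSETID holders)."""
    if not caller_in_owning_group:
        mode &= ~stat.S_ISGID
    return mode

# Mode that posix_acl_create() would pick for a new directory created
# under an SGID parent by a user who is NOT in the owning group:
inherited_mode = 0o2755  # rwxr-sr-x with SGID inherited from DIR0

# Old ordering: the ACL-setting path ran the mode update even for
# inherited ACLs, so the non-member creator lost the SGID bit.
buggy = update_mode_like_posix_acl_update_mode(inherited_mode,
                                               caller_in_owning_group=False)

# New ordering: inheritance goes through __btrfs_set_acl() directly and
# skips the mode update, keeping the mode posix_acl_create() computed.
fixed = inherited_mode

print(oct(buggy))  # SGID cleared (the bug)
print(oct(fixed))  # SGID preserved (the fix)
```

The fix, in other words, confines the mode update to the explicit setxattr path, where it is genuinely wanted.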
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote: > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > > On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote: > > > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > > > > Hello, > > > > > > > > One thing that comes up a lot every LSF is the fact that we have no > > > > general way > > > > that we do performance testing. Every fs developer has a set of > > > > scripts or > > > > things that they run with varying degrees of consistency, but nothing > > > > central > > > > that we all use. I for one am getting tired of finding regressions > > > > when we are > > > > deploying new kernels internally, so I wired this thing up to try and > > > > address > > > > this need. > > > > > > > > We all hate convoluted setups, the more brain power we have to put in > > > > to setting > > > > something up the less likely we are to use it, so I took the xfstests > > > > approach > > > > of making it relatively simple to get running and relatively easy to > > > > add new > > > > tests. For right now the only thing this framework does is run fio > > > > scripts. I > > > > chose fio because it already gathers loads of performance data about > > > > it's runs. > > > > We have everything we need there, latency, bandwidth, cpu time, and all > > > > broken > > > > down by reads, writes, and trims. I figure most of us are familiar > > > > enough with > > > > fio and how it works to make it relatively easy to add new tests to the > > > > framework. > > > > > > > > I've posted my code up on github, you can get it here > > > > > > > > https://github.com/josefbacik/fsperf > > > > > > > > All (well most) of the results from fio are stored in a local sqlite > > > > database. > > > > Right now the comparison stuff is very crude, it simply checks against > > > > the > > > > previous run and it only checks a few of the keys by default. 
You can > > > > check > > > > latency if you want, but while writing this stuff up it seemed that > > > > latency was > > > > too variable from run to run to be useful in a "did my thing regress or > > > > improve" > > > > sort of way. > > > > > > > > The configuration is brain dead simple, the README has examples. All > > > > you need > > > > to do is make your local.cfg, run ./setup and then run ./fsperf and you > > > > are good > > > > to go. > > > > > > Why re-invent the test infrastructure? Why not just make it a > > > tests/perf subdir in fstests? > > > > > > > Probably should have led with that shouldn't I have? There's nothing > > keeping me > > from doing it, but I didn't want to try and shoehorn in a python thing into > > fstests. I need python to do the sqlite and the json parsing to dump into > > the > > sqlite database. > > > > Now if you (and others) are not opposed to this being dropped into > > tests/perf > > then I'll work that up. But it's definitely going to need to be done in > > python. > > I know you yourself have said you aren't opposed to using python in the > > past, so > > if that's still the case then I can definitely wire it all up. > > I have no problems with people using python for stuff like this but, > OTOH, I'm not the fstests maintainer anymore :P > > > > > The plan is to add lots of workloads as we discover regressions and > > > > such. We > > > > don't want anything that takes too long to run otherwise people won't > > > > run this, > > > > so the existing tests don't take much longer than a few minutes each. > > > > I will be > > > > adding some more comparison options so you can compare against averages > > > > of all > > > > previous runs and such. > > > > > > Yup, that fits exactly into what fstests is for... 
:P > > > > > > Integrating into fstests means it will be immediately available to > > > all fs developers, it'll run on everything that everyone already has > > > setup for filesystem testing, and it will have familiar mkfs/mount > > > option setup behaviour so there's no new hoops for everyone to jump > > > through to run it... > > > > > > > TBF I specifically made it as easy as possible because I know we all hate > > trying > > to learn new shit. > > Yeah, it's also hard to get people to change their workflows to add > a whole new test harness into them. It's easy if it's just a new > command to an existing workflow :P > Agreed, so if you probably won't run this outside of fstests then I'll add it to xfstests. I envision this tool as being run by maintainers to verify their pull requests haven't regressed since the last set of patches, as well as by anybody trying to fix performance problems. So it's way more important to me that you, Ted, and all the various btrfs maintainers will run it than anybody else. > > I figured this was different enough to warrant a separate > > project, especially since I'm going to add block device jobs so Jens can > > test > > block layer things. If we all agree we'd rather
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Sun, Oct 08, 2017 at 11:43:35PM -0400, Theodore Ts'o wrote: > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > > > > Probably should have led with that shouldn't I have? There's nothing > > keeping me > > from doing it, but I didn't want to try and shoehorn in a python thing into > > fstests. I need python to do the sqlite and the json parsing to dump into > > the > > sqlite database. > > What version of python are you using? From inspection it looks like > some variant of python 3.x (you're using print as a function w/o using > "from __future import print_function") but it's not immediately > obvious from the top-level fsperf shell script what version of python > your scripts are dependant upon. > > This could potentially be a bit of a portability issue across various > distributions --- RHEL doesn't ship with Python 3.x at all, and on > Debian you need to use python3 to get python 3.x, since > /usr/bin/python still points at Python 2.7 by default. So I could see > this as a potential issue for xfstests. > > I'm currently using Python 2.7 in my wrapper scripts for, among other > things, to parse xUnit XML format and create nice summaries like this: > > ext4/4k: 337 tests, 6 failures, 21 skipped, 3814 seconds > Failures: generic/232 generic/233 generic/361 generic/388 > generic/451 generic/459 > > So I'm not opposed to python, but I will note that if you are using > modules from the Python Package Index, and they are modules which are > not packaged by your distribution (so you're using pip or easy_install > to pull them off the network), it does make doing hermetic builds from > trusted sources to be a bit trickier. > > If you have a secops team who wants to know the provenance of software > which get thrown in production data centers (and automated downloading > from random external sources w/o any code review makes them break out > in hives), use of PyPI adds a new wrinkle. 
It's not impossible to > solve, by any means, but it's something to consider. > I purposefully used as little as possible, just json and sqlite, and I tried to use as few python3-isms as possible. Any rpm-based system should have these libraries already installed. I agree that using any of the PyPI stuff is a pain; it's a non-starter for me since I want to make it as easy as possible to get running. Where do you fall on the question of including it in xfstests? I expect that the perf stuff would be run more as maintainers put their pull requests together to make sure things haven't regressed. To that end I was going to wire up xfstests-bld to run this as well. Whatever you and Dave prefer is what I'll do, I'll use it wherever it ends up so you two are the ones that get to decide. Thanks, Josef
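[Editor's note] For illustration, Josef's json + sqlite combination could look roughly like the sketch below. The schema and helper names are invented, not fsperf's actual code; only the shape of the fio fields ("jobs", "jobname", "read"/"write" bandwidth) follows fio's real --output-format=json output.

```python
import json
import sqlite3

def record_run(db, fio_json_text):
    """Parse fio's JSON output and store a couple of summary keys per job."""
    results = json.loads(fio_json_text)
    db.execute("CREATE TABLE IF NOT EXISTS runs "
               "(job TEXT, read_bw INTEGER, write_bw INTEGER)")
    for job in results["jobs"]:
        db.execute("INSERT INTO runs VALUES (?, ?, ?)",
                   (job["jobname"], job["read"]["bw"], job["write"]["bw"]))
    db.commit()

def regressed(db, job, threshold=0.95):
    """Crude previous-run comparison: flag a >5% write-bandwidth drop."""
    rows = db.execute("SELECT write_bw FROM runs WHERE job = ? "
                      "ORDER BY rowid DESC LIMIT 2", (job,)).fetchall()
    if len(rows) < 2:
        return False  # nothing to compare against yet
    latest, previous = rows[0][0], rows[1][0]
    return latest < previous * threshold

# Tiny fabricated stand-in for fio's real (much larger) JSON output:
sample = ('{"jobs": [{"jobname": "randwrite",'
          ' "read": {"bw": 0}, "write": {"bw": 52000}}]}')
db = sqlite3.connect(":memory:")
record_run(db, sample)                            # baseline run
record_run(db, sample.replace("52000", "41000"))  # slower second run
print(regressed(db, "randwrite"))  # True
```

Both modules ship in the Python standard library, which is exactly why no PyPI dependency is needed.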
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 02:54:34PM +0800, Eryu Guan wrote: > I have no problem either if python is really needed, after all this is a > very useful infrastructure improvement. But the python version problem > brought up by Ted made me a bit nervous, we need to work that round > carefully. > > OTOH, I'm just curious, what is the specific reason that python is a > hard requirement? If we can use perl, that'll be much easier for > fstests. Note that perl has its own versioning issues (but it's been easier given that Perl has been frozen so long waiting for Godot^H^H^H^H^H Perl6). It's for similar reasons that Python 2.7 is nice and stable. Python 2 is frozen while Python 3 caught up. The only downside is that Python 2.7 is now deprecated, and will stop getting security updates from Python Core in 2020. I'll note that Python 3.6 has some nice features that aren't in Python 3.5 --- but Debian Stable only has Python 3.5, and Debian Unstable has Python 3.6. Which is why I said, "it looks like you're using some variant of Python 3.x", but it wasn't obvious what version exactly was required by fsperf. This version instability in Python and Perl is why Larry McVoy ended up using Tcl for Bitkeeper, by the way. That was the only thing that was guaranteed to work everywhere, exactly the same, without random changes added by Perl/Python innovations... It's a little easier for me with gce/kvm-xfstests, since I'm using Debian Stable images/chroots, and I'm not even going to _pretend_ that I care about cross-distro portability for the Test Appliance VM's that xfstests-bld creates. But I suspect this will be more of an issue with xfstests. Also, the same issues around versioning / "DLL hell" and provenance of various high-level packages/modules exist with Perl just as much as with Python. Just substitute CPAN with PyPI. 
And again, some of the popular Perl packages have been packaged by the distros to solve the versioning / provenance problem, but exactly _which_ packages / modules are packaged varies from distro to distro. (Hopefully the most popular ones will be packaged by both Red Hat, SuSE and Debian derivatives, but you'll have to check for each package / module you want to use.) One way of solving the problem is just including those Perl / Python package modules in the sources of xfstests itself; that way you're not depending on which version of a particular module / package is available on a distro, and you're also not randomly downloading software over the network and hoping it works / hasn't been taken over by some hostile state power. (I'd be much happier if PyPI or CPAN used SHA checksums of what you expect to be downloaded; even if you use Easy_Install's requirements.txt, you're still trusting PyPI to give you what it thinks is Junitparser 1.0.0.) This is why I've dropped my own copy of junitparser into the git repo of xfstests-bld. It's the "ship your own version of the DLL" solution to the "DLL hell" problem, but it was the best I could come up with, especially since Debian hadn't packaged the Junitparser python module. I also figured it was much better at lowering the blood pressure of the friendly local secops types. :-) - Ted
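[Editor's note] Ted's checksum point can be implemented with nothing but the stdlib (pip's hash-checking mode, requirements.txt entries with --hash=sha256:..., does the same thing natively). A minimal sketch with hypothetical artifact contents:

```python
import hashlib

def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Accept a downloaded artifact only if it matches a digest that was
    recorded when the code was originally reviewed/vendored."""
    return hashlib.sha256(data).hexdigest() == expected_sha256

# Hypothetical artifact bytes; in practice this would be the tarball you
# reviewed before committing its digest alongside your sources.
payload = b"junitparser-1.0.0 tarball contents"
pinned = hashlib.sha256(payload).hexdigest()  # recorded at review time

print(verify_artifact(payload, pinned))         # untampered: accepted
print(verify_artifact(payload + b"!", pinned))  # modified upstream: rejected
```

Vendoring the module into the repo, as Ted did with junitparser, sidesteps the download entirely; pinning digests is the next best thing when you must fetch from the network.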
Re: [PATCH v8 2/2] btrfs: check device for critical errors and mark failed
On 2017-10-06 19:33, Liu Bo wrote: On Thu, Oct 05, 2017 at 07:07:44AM -0400, Austin S. Hemmelgarn wrote: On 2017-10-04 16:11, Liu Bo wrote: On Tue, Oct 03, 2017 at 11:59:20PM +0800, Anand Jain wrote: From: Anand Jain -- Write and flush errors are critical errors, upon which the device fd must be closed and marked as failed. Can we defer the job of closing device to umount? We can go mark the device failed and skip it while doing read/write, and umount can do the cleanup work. That way we don't need a dedicated thread looping around to detect a rare situation. If BTRFS doesn't close the device, then it's 100% guaranteed if it reconnects that it will show up under a different device node. It would also mean that the device node stays visible when there is in fact no device connected to it, which is a pain from a monitoring perspective. I see, you're assuming that these errors are due to disconnection of disks, it could be bad sectors (although almost impossible from enterprise hard disks) or some other errors across the stack. And we need to handle any of these cases. If BTRFS is going to claim handling of device failures, it needs to handle all types of device failure it reasonably can without making assumptions about what type of failure occurred. I do agree that cleanup needs to be done if disk got disconnected, but not doing cleanup here, a udev rule is needed to handle such an event. And why exactly does it need to involve userspace beyond notifying that something bad happened? We know the failure happened, why should we tell userspace about it and require userspace to tell us to handle it? It's not like this is a performance critical path since it won't happen very often, and the primary concern is data safety, not performance, so why exactly do we need to push something that isn't a policy decision (and arguably isn't a decision, if we're not using a device, we shouldn't be holding it open, period) out to userspace? 
Looking at it a bit differently though, every time I've seen any write or flush errors, either the connection to the device was unstable, or the storage device was failing, and neither case is something that should require userspace to make a decision about how to handle things. To be entirely honest though, from a system administrator's perspective, I would actually expect a device getting disconnected (or otherwise disappearing off the bus) to end up in a 'Missing' state, not a 'Failed' state. I see no reason to treat it any differently than if it disappeared before the volume was mounted, other than the requirement to actually handle it. That type of handling _will_ require userspace involvement (because I seriously doubt there will ever be proper notification from the block layer to holders that a device is gone), but that's also largely orthogonal to this patch set (there's no reason that we need to handle write or flush errors the same way).
Re: "BTRFS error (device vda1): couldn't get super buffer head for bytenr x"
On Mon, Oct 9, 2017 at 1:22 AM, Nick Gilmour wrote: >> I don't see a 'btrfs filesystem resize' command in your sequence. Did >> you actually resize the file system before you resized the underlying >> (virtual) block device? > > > OK. I guess, this is it. I didn't do any 'btrfs filesystem resize' . The > guides I was following didn't mention something like that. I was assuming > that if it works for other FS like this it should also work for BTRFS. Literally all other file systems require the fs be resized first when shrinking, before the underlying partition or virtual block device is reduced. There is nothing unique about Btrfs in this respect. Some tools will use fsadm in the sequence to make sure the file system is properly resized. So the question is whether any of the resize tools you used expect to use fsadm. And if they do or did, whether and why there wasn't an error if fsadm doesn't support Btrfs before the block device was resized anyway. I don't know offhand if Btrfs is supported by fsadm; the man page does not list it as one of the supported file systems. > # btrfs filesystem resize -350g /home ? > > Should it be before or after I shrink the .img? When shrinking, the fs resize must happen first. Then the block device can be reduced to match. When growing, the block device is extended, then the fs is resized. -- Chris Murphy
Re: [ANNOUNCE] fsperf: a simple fs/block performance testing framework
On Mon, Oct 09, 2017 at 04:17:31PM +1100, Dave Chinner wrote: > On Sun, Oct 08, 2017 at 10:25:10PM -0400, Josef Bacik wrote: > > On Mon, Oct 09, 2017 at 11:51:37AM +1100, Dave Chinner wrote: > > > On Fri, Oct 06, 2017 at 05:09:57PM -0400, Josef Bacik wrote: > > > > Hello, > > > > > > > > One thing that comes up a lot every LSF is the fact that we have no > > > > general way > > > > that we do performance testing. Every fs developer has a set of > > > > scripts or > > > > things that they run with varying degrees of consistency, but nothing > > > > central > > > > that we all use. I for one am getting tired of finding regressions > > > > when we are > > > > deploying new kernels internally, so I wired this thing up to try and > > > > address > > > > this need. > > > > > > > > We all hate convoluted setups, the more brain power we have to put in > > > > to setting > > > > something up the less likely we are to use it, so I took the xfstests > > > > approach > > > > of making it relatively simple to get running and relatively easy to > > > > add new > > > > tests. For right now the only thing this framework does is run fio > > > > scripts. I > > > > chose fio because it already gathers loads of performance data about > > > > it's runs. > > > > We have everything we need there, latency, bandwidth, cpu time, and all > > > > broken > > > > down by reads, writes, and trims. I figure most of us are familiar > > > > enough with > > > > fio and how it works to make it relatively easy to add new tests to the > > > > framework. > > > > > > > > I've posted my code up on github, you can get it here > > > > > > > > https://github.com/josefbacik/fsperf > > > > > > > > All (well most) of the results from fio are stored in a local sqlite > > > > database. > > > > Right now the comparison stuff is very crude, it simply checks against > > > > the > > > > previous run and it only checks a few of the keys by default. 
You can > > > > check > > > > latency if you want, but while writing this stuff up it seemed that > > > > latency was > > > > too variable from run to run to be useful in a "did my thing regress or > > > > improve" > > > > sort of way. > > > > > > > > The configuration is brain dead simple, the README has examples. All > > > > you need > > > > to do is make your local.cfg, run ./setup and then run ./fsperf and you > > > > are good > > > > to go. > > > > > > Why re-invent the test infrastructure? Why not just make it a > > > tests/perf subdir in fstests? > > > > > > > Probably should have led with that shouldn't I have? There's nothing > > keeping me > > from doing it, but I didn't want to try and shoehorn in a python thing into > > fstests. I need python to do the sqlite and the json parsing to dump into > > the > > sqlite database. > > > > Now if you (and others) are not opposed to this being dropped into > > tests/perf > > then I'll work that up. But it's definitely going to need to be done in > > python. > > I know you yourself have said you aren't opposed to using python in the > > past, so > > if that's still the case then I can definitely wire it all up. > > I have no problems with people using python for stuff like this but, > OTOH, I'm not the fstests maintainer anymore :P I have no problem either if python is really needed, after all this is a very useful infrastructure improvement. But the python version problem brought up by Ted made me a bit nervous, we need to work that round carefully. OTOH, I'm just curious, what is the specific reason that python is a hard requirement? If we can use perl, that'll be much easier for fstests. BTW, opinions from key fs developers/fstests users, like you, are also very important and welcomed :) Thanks, Eryu > > > > > The plan is to add lots of workloads as we discover regressions and > > > > such. 
We > > > > don't want anything that takes too long to run otherwise people won't > > > > run this, > > > > so the existing tests don't take much longer than a few minutes each. > > > > I will be > > > > adding some more comparison options so you can compare against averages > > > > of all > > > > previous runs and such. > > > > > > Yup, that fits exactly into what fstests is for... :P > > > > > > Integrating into fstests means it will be immediately available to > > > all fs developers, it'll run on everything that everyone already has > > > setup for filesystem testing, and it will have familiar mkfs/mount > > > option setup behaviour so there's no new hoops for everyone to jump > > > through to run it... > > > > > > > TBF I specifically made it as easy as possible because I know we all hate > > trying > > to learn new shit. > > Yeah, it's also hard to get people to change their workflows to add > a whole new test harness into them. It's easy if it's just a new > command to an existing workflow :P > > > I figured this was different enough to warrant a separate > > project, especially since I'm going to add block