Re: price to pay for nocow file bit?
Josef Bacik posted on Wed, 07 Jan 2015 15:10:06 -0500 as excerpted:

>> Does this have any effect on functionality? As I understood snapshots
>> still work fine for files marked like that, and so do reflinks. Any
>> drawback functionality-wise? Apparently file compression support is
>> lost if the bit is set? (which I can live with too, journal files are
>> internally compressed anyway)
>
> Yeah, no compression, no checksums. If you do a reflink then you'll COW
> once and then the new copy will be nocow, so it'll be fine. Same goes
> for snapshots. So you'll likely incur some fragmentation, but less than
> before; I'd measure to make sure, if it's that big of a deal.
>
>> What about performance? Do any operations get substantially slower by
>> setting this bit? For example, what happens if I take a snapshot of
>> files with this bit set and then modify the file, does this result in a
>> full (and hence slow) copy of the file on that occasion?
>
> Performance is the same.

The otherwise-nocow on-snapshot "cow1" is per-block (4096-byte, AFAIK),
so there's some fragmentation, but far less than with full COW.

The "perfect storm" situation is people doing automated per-minute
snapshots or similar (some people go to extremes with snapper or the
like...), in which case setting nocow often doesn't help a whole lot,
depending on how active the file-writing is, of course.

But for something like append-plus-pointer-update-pattern log files with
something like per-day snapshotting, nocow should at least in theory help
quite a bit, since the write frequency, and thus the prevented COWs,
should be MUCH higher than the daily snapshot and thus the forced
block-cow1s.

FWIW, I'm running systemd on btrfs here, but I use syslog-ng for my
non-volatile logs and have Storage=volatile in journald.conf, using
journald only for the current session, where unit status including the
last 10 messages makes troubleshooting /so/ much easier. =:^)

Once past the current session, text logs are more useful to me, which is
where syslog-ng comes in. Each to its strength, and keeping the journals
from wearing the SSDs[1] is a very nice bonus. =:^)

---
[1] I can and do filter what syslog-ng writes, but couldn't find a way to
filter journald's writes, only queries/reads. That alone saves writes for
repeated noise I'm filtering out with syslog before it's ever written,
that journald would still be writing if I let it write non-volatile.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
Martin Steigerwald posted on Wed, 07 Jan 2015 20:08:50 +0100 as excerpted:

> No BTRFS developers commented yet on this, neither in this thread nor in
> the bug report at kernel.org I made.

Just a quick general note on this point...

There has in the past (and I believe referenced on the wiki) been dev
comment to the effect that on the list they tend to find particular
reports/threads and work on them until they either fix the issue or (when
not urgent) decide it must wait for something else first. During the time
they're busy pursuing such a report, they don't read others on the list
very closely, and such list-only bug reports may thus get dropped on the
floor and never worked on.

The recommendation, then, is to report it to the list, and if it's not
picked up right away and you plan on being around in a few weeks/months
when they potentially get to it, file a bug on it, so it doesn't get
dropped on the floor.

With the bugzilla.kernel.org report you've followed the recommendation,
but the implication is that you won't necessarily get any comment right
away, only later, when they're not immediately busy looking at some other
bug. So lack of b.k.o comment in the immediate term doesn't mean they're
ignoring the bug or don't value it; it just means they're hot on the
trail of something else ATM and it might take some time to get that
"first comment" engagement.

But the recommendation is to file the bugzilla report precisely so it
does /not/ get lost, and you've done that, so... you've done your part
there and now comes the enforced-patience bit of waiting for that
engagement. If it takes a while, I would keep the bug updated every
kernel release or so, with a comment updating status.

(Meanwhile, I've seen no indication of such issues here. Most of my btrfs
are 8-24 GiB each, all SSD, mostly dual-device btrfs raid1 for both data
and metadata. Maybe I don't run those full enough, however. I do have
three mixed-bg-mode sub-GiB btrfs, with one of them, a 256 MiB
single-device dup-mode btrfs used as /boot, that tends to run reasonably
full, but I've not seen a problem like that there, either. My use-case
probably simply doesn't hit the problem.)

--
Duncan - List replies preferred. No HTML msgs.
Re: [PATCH v3 0/3] Btrfs: Enhancment for qgroup.
On 01/07/2015 08:49 AM, Satoru Takeuchi wrote:
> Hi Yang,
>
> On 2015/01/05 15:16, Dongsheng Yang wrote:
>> Hi Josef and others,
>>
>> This patch set is about enhancing qgroup.
>>
>> [1/3]: fix a bug about qgroup leak when we exceed the quota limit.
>>        It is reviewed by Josef.
>> [2/3]: introduce a new accounter in qgroup to close a window where
>>        the user can exceed the limit set by qgroup. It "looks good"
>>        to Josef.
>> [3/3]: a new patch to fix a bug reported by Satoru.
>
> I tested your patchset v3. Although it's far better
> than the patchset v2, there is still one problem in this patchset.
> When I wrote 1.5GiB to a subvolume with a 1.0GiB limit,
> 1.0GiB - 139 blocks (in this case, 1KiB/block) was written.
>
> I consider that the user should be able to write exactly 1.0GiB in
> this case.

Hi Satoru,

Yes, currently the user cannot write the full 1.0GiB in this case,
because qgroup accounts data and metadata together. I have posted an idea
in this thread to split it into three modes: data, metadata and both.
That is TODO issue c) below: qgroup accounts the size of data and
metadata together, but to users, the data size is the most useful.

But you mentioned that the result is different each time. Hmmm, there
must be something wrong in it. I need some more investigation to answer
this question.

Thanx a lot for your test!

Yang

> * Test result
>
> ===
> + mkfs.btrfs -f /dev/vdb
> Btrfs v3.17
> See http://btrfs.wiki.kernel.org for more information.
>
> Turning ON incompat feature 'extref': increased hardlink limit per
> file to 65536
> fs created label (null) on /dev/vdb
>         nodesize 16384 leafsize 16384 sectorsize 4096 size 30.00GiB
> + mount /dev/vdb /root/btrfs-auto-test/
> + ret=0
> + btrfs quota enable /root/btrfs-auto-test/
> + btrfs subvolume create /root/btrfs-auto-test//sub
> Create subvolume '/root/btrfs-auto-test/sub'
> + btrfs qgroup limit 1G /root/btrfs-auto-test//sub
> + dd if=/dev/zero of=/root/btrfs-auto-test//sub/file bs=1024 count=150
> dd: error writing '/root/btrfs-auto-test//sub/file': Disk quota exceeded
> 1048438+0 records in   # Tried to write 1GiB - 138 KiB
> 1048437+0 records out  # Succeeded in writing 1GiB - 139 KiB
> 1073599488 bytes (1.1 GB) copied, 19.0247 s, 56.4 MB/s
> ===
>
> * note
>
> I tried to run the reproducer five times and the result is
> a bit different each time.
>
> =====
> #  Written
> -----
> 1  1GiB - 139 KiB
> 2  1GiB - 139 KiB
> 3  1GiB - 145 KiB
> 4  1GiB - 135 KiB
> 5  1GiB - 135 KiB
> =====
>
> So I consider it's a problem that comes from timing.
>
> If I changed the block size from 1KiB to 1MiB,
> the difference in bytes got larger.
>
> #  Written
> 1  1GiB - 1 MiB
> 2  1GiB - 1 MiB
> 3  1GiB - 1 MiB
> 4  1GiB - 1 MiB
> 5  1GiB - 1 MiB
>
> Thanks,
> Satoru
>
>> BTW, I have some other plans about qgroup on my TODO list:
>>
>> Kernel:
>> a). adjust the accounters in the parent qgroup when we move
>>     a child qgroup.
>>     Currently, when we move a qgroup, the parent qgroup is not
>>     updated at the same time. This will cause some wrong numbers
>>     in qgroup.
>>
>> b). add an ioctl to show the qgroup info.
>>     The command "btrfs qgroup show" shows the qgroup info read from
>>     the qgroup tree. But there is some information in memory which
>>     is not yet synced to disk, so it can show some outdated numbers.
>>
>> c). limit and account size in 3 modes: data, metadata and both.
>>     qgroup accounts the size of both data and metadata together,
>>     but to a user, the data size is the most useful.
>>
>> d). remove a subvolume-related qgroup when the subvolume is deleted
>>     and there is no other reference to it.
>>
>> user-tool:
>> a). Add the units B/K/M/G to btrfs qgroup show.
>> b). get the information via ioctl rather than reading it from the
>>     btree. Will keep the old way as a fallback for compatibility.
>>
>> Any comment and suggestion is welcome. :)
>>
>> Yang
>>
>> Dongsheng Yang (3):
>>    Btrfs: qgroup: free reserved in exceeding quota.
>>    Btrfs: qgroup: Introduce a may_use to account
>>      space_info->bytes_may_use.
>>    Btrfs: qgroup, Account data space in more proper timings.
>>
>>   fs/btrfs/extent-tree.c | 41 +++---
>>   fs/btrfs/file.c        |  9 ---
>>   fs/btrfs/inode.c       | 18 -
>>   fs/btrfs/qgroup.c      | 68 +++---
>>   fs/btrfs/qgroup.h      |  4 +++
>>   5 files changed, 117 insertions(+), 23 deletions(-)
Re: [PATCH] btrfs: introduce shrinker for rb_tree that keeps valid btrfs_devices
[ping]

On Wed, 2014-12-10 at 15:39 +0800, Gui Hecheng wrote:
> The following patch:
>         btrfs: remove empty fs_devices to prevent memory runout
>
> introduces @valid_dev_root, aiming at recording @btrfs_device objects
> that have corresponding block devices with btrfs.
> But if a block device is broken or unplugged, no one tells the
> @valid_dev_root to clean up the "dead" objects.
>
> To recycle the memory occupied by those "dead"s, we can rely on
> a shrinker. The shrinker's scan function traverses the
> @valid_dev_root and tries to open the devices one by one; if it fails
> or encounters a non-btrfs device, it removes the "dead" @btrfs_device.
>
> A special case to deal with is a block device that is unplugged and
> replugged: it then appears with a new @bdev->bd_dev as devnum.
> In this case, we should remove the older entry, since we already have
> a new one for that block device.
>
> Signed-off-by: Gui Hecheng
> ---
>  fs/btrfs/super.c   | 10 ++++
>  fs/btrfs/volumes.c | 74 +-
>  fs/btrfs/volumes.h |  4 +++
>  3 files changed, 87 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
> index ee09a56..29069af 100644
> --- a/fs/btrfs/super.c
> +++ b/fs/btrfs/super.c
> @@ -1987,6 +1987,12 @@ static struct miscdevice btrfs_misc = {
>  	.fops = &btrfs_ctl_fops
>  };
>
> +static struct shrinker btrfs_valid_dev_shrinker = {
> +	.scan_objects = btrfs_valid_dev_scan,
> +	.count_objects = btrfs_valid_dev_count,
> +	.seeks = DEFAULT_SEEKS,
> +};
> +
>  MODULE_ALIAS_MISCDEV(BTRFS_MINOR);
>  MODULE_ALIAS("devname:btrfs-control");
>
> @@ -2100,6 +2106,8 @@ static int __init init_btrfs_fs(void)
>
>  	btrfs_init_lockdep();
>
> +	register_shrinker(&btrfs_valid_dev_shrinker);
> +
>  	btrfs_print_info();
>
>  	err = btrfs_run_sanity_tests();
> @@ -2113,6 +2121,7 @@ static int __init init_btrfs_fs(void)
>  	return 0;
>
>  unregister_ioctl:
> +	unregister_shrinker(&btrfs_valid_dev_shrinker);
>  	btrfs_interface_exit();
>  free_end_io_wq:
>  	btrfs_end_io_wq_exit();
> @@ -2153,6 +2162,7 @@ static void __exit exit_btrfs_fs(void)
>  	btrfs_interface_exit();
>  	btrfs_end_io_wq_exit();
>  	unregister_filesystem(&btrfs_fs_type);
> +	unregister_shrinker(&btrfs_valid_dev_shrinker);
>  	btrfs_exit_sysfs();
>  	btrfs_cleanup_valid_dev_root();
>  	btrfs_cleanup_fs_uuids();
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index 7093cce..62f37b1 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -54,6 +54,7 @@ static void btrfs_dev_stat_print_on_load(struct btrfs_device *device);
>  DEFINE_MUTEX(uuid_mutex);
>  static LIST_HEAD(fs_uuids);
>  static struct rb_root valid_dev_root = RB_ROOT;
> +static atomic_long_t unopened_dev_count = ATOMIC_LONG_INIT(0);
>
>  static struct btrfs_device *insert_valid_device(struct btrfs_device *new_dev)
>  {
> @@ -130,6 +131,8 @@ static void free_invalid_device(struct btrfs_device *invalid_dev)
>  {
>  	struct btrfs_fs_devices *old_fs;
>
> +	atomic_long_dec(&unopened_dev_count);
> +
>  	old_fs = invalid_dev->fs_devices;
>  	mutex_lock(&old_fs->device_list_mutex);
>  	list_del(&invalid_dev->dev_list);
> @@ -615,6 +618,7 @@ static noinline int device_list_add(const char *path,
>  		list_add_rcu(&device->dev_list, &fs_devices->devices);
>  		fs_devices->num_devices++;
>  		mutex_unlock(&fs_devices->device_list_mutex);
> +		atomic_long_inc(&unopened_dev_count);
>
>  		ret = 1;
>  		device->fs_devices = fs_devices;
> @@ -788,6 +792,7 @@ again:
>  			blkdev_put(device->bdev, device->mode);
>  			device->bdev = NULL;
>  			fs_devices->open_devices--;
> +			atomic_long_inc(&unopened_dev_count);
>  		}
>  		if (device->writeable) {
>  			list_del_init(&device->dev_alloc_list);
> @@ -850,8 +855,10 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
>  		struct btrfs_device *new_device;
>  		struct rcu_string *name;
>
> -		if (device->bdev)
> +		if (device->bdev) {
>  			fs_devices->open_devices--;
> +			atomic_long_inc(&unopened_dev_count);
> +		}
>
>  		if (device->writeable &&
>  		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
> @@ -981,6 +988,7 @@ static int __btrfs_open_devices(struct btrfs_fs_devices *fs_devices,
>  			fs_devices->rotating = 1;
>
>  		fs_devices->open_devices++;
> +		atomic_long_dec(&unopened_dev_count);
>  		if (device->writeable &&
>  		    device->devid != BTRFS_DEV_REPLACE_DEVID) {
>  			fs_devices->rw_devices++;
> @@ -6828,3 +6836,67 @@ vo
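For what it's worth, the invariant the patch maintains — @unopened_dev_count goes up when a device becomes known or is closed, and down when it is opened or freed, so the shrinker's count_objects() can report how many candidates exist — can be sketched in userspace with C11 atomics. The function names mirror the patch, but this harness is hypothetical, not kernel code:

```c
#include <stdatomic.h>

/* Userspace sketch of the patch's bookkeeping: the shrinker's
 * count_objects() callback reports how many known devices are
 * currently unopened (and therefore reclaim candidates). */
static atomic_long unopened_dev_count = 0;

static void device_list_add(void)     /* new device becomes known */
{
    atomic_fetch_add(&unopened_dev_count, 1);
}

static void device_open(void)         /* device opened for use */
{
    atomic_fetch_sub(&unopened_dev_count, 1);
}

static void device_close(void)        /* device closed, reclaimable again */
{
    atomic_fetch_add(&unopened_dev_count, 1);
}

static void free_invalid_device(void) /* "dead" object reclaimed */
{
    atomic_fetch_sub(&unopened_dev_count, 1);
}

/* Equivalent of the shrinker's .count_objects callback. */
static long btrfs_valid_dev_count(void)
{
    return atomic_load(&unopened_dev_count);
}
```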
[PATCH] btrfs-progs: doc: fix format of btrfs-replace
Current 'man btrfs-replace' is as follows:

    ... ...
    -f
        force using and overwriting even if it looks like containing a
        valid btrfs filesystem. A valid filesystem is assumed if a btrfs
        superblock is found which contains a correct checksum. Devices
        which are currently mounted are never allowed to be used as
        the .
        -B no background replace.
    ... ...

The format of the 'B' option is wrong. So, fix it.

Signed-off-by: Tsutomu Itoh
---
NOTE: This patch is based on the v3.18.x branch.

 Documentation/btrfs-replace.txt | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Documentation/btrfs-replace.txt b/Documentation/btrfs-replace.txt
index 7402484..e8eac2c 100644
--- a/Documentation/btrfs-replace.txt
+++ b/Documentation/btrfs-replace.txt
@@ -52,6 +52,7 @@ containing a valid btrfs filesystem.
 A valid filesystem is assumed if a btrfs superblock is found which
 contains a correct checksum. Devices which are currently mounted are
 never allowed to be used as the .
++
 -B
 no background replace.
--
2.2.1
Re: kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!
On 2015-01-07 15:58, Satoru Takeuchi wrote:
>> Create subvolume './subvolume'
>> # dd if=/dev/urandom of=bigfile.img bs=64k
>
> Is it really this command? I consider it will fill up the whole
> /dev/vdb.

It normally would fill the fs if left for long, but I pressed ctrl+c
after about 6 GB.

> And is it not subvolume/bigfile.img but bigfile.img?

If I recall correctly, it was not inside the subvolume.

>> (...)
>> 3127377920 bytes (3.1 GB) copied, 194.641 s, 16.1 MB/s
>
> If bigfile.img is just under /mnt/test, I can't understand why this
> command succeeded in writing more than 3 GiB.

The previous command wrote 6 GB, this one wrote 3.1 GB - there was still
plenty of free space.

>> (...)
>> # dd if=/dev/urandom of=bigfile3.img bs=64k
>> ^C3617580+0 records in
>> 3617579+0 records out
>> 237081657344 bytes (237 GB) copied, 14796 s, 16.0 MB/s
>
> Same here.

This one was also left running for long, followed by ctrl+c (note the ^C
in my pasted output). We didn't fill the fs 100% in any of these cases.

# df -h
Filesystem  Size  Used Avail Use% Mounted on
(...)
/dev/vdb    256G  230G   25G  91% /mnt/test

# btrfs qgroup show /mnt/test
qgroupid rfer         excl
0/5      16384        16384
0/257    245960245248 245960245248

# ls -l
total 240451584
-rw-r--r-- 1 root root   3127377920 Dec 19 20:06 bigfile2.img
-rw-r--r-- 1 root root 237081657344 Dec 20 00:15 bigfile3.img
-rw-r--r-- 1 root root   6013386752 Dec 19 20:02 bigfile.img

# rm bigfile3.img
# sync
# dmesg
(...)
[   95.055420] BTRFS: device fsid 97f98279-21e7-4822-89be-3aed9dc05f2c devid 1 transid 3 /dev/vdb
[  118.446509] BTRFS info (device vdb): disk space caching is enabled
[  118.446518] BTRFS: flagging fs with big metadata feature
[  118.452176] BTRFS: creating UUID tree
[  575.189412] BTRFS info (device vdb): qgroup scan completed
[15948.234826] ------------[ cut here ]------------
[15948.234883] kernel BUG at /home/apw/COD/linux/fs/btrfs/inode.c:3123!
[15948.234906] invalid opcode: [#1] SMP
[15948.234925] Modules linked in: nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables nf_log_ipv4 nf_log_common xt_LOG ipt_REJECT nf_reject_ipv4 xt_tcpudp nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iptable_filter ip_tables x_tables dm_crypt btrfs xor crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel ppdev aes_x86_64 lrw raid6_pq gf128mul glue_helper ablk_helper cryptd serio_raw mac_hid pvpanic 8250_fintek parport_pc i2c_piix4 lp parport psmouse qxl ttm floppy drm_kms_helper drm
[15948.235172] CPU: 0 PID: 3274 Comm: btrfs-cleaner Not tainted 3.18.1-031801-generic #201412170637
[15948.235193] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[15948.235222] task: 880036708a00 ti: 88007b97c000 task.ti: 88007b97c000
[15948.235240] RIP: 0010:[] [] btrfs_orphan_add+0x1a9/0x1c0 [btrfs]
[15948.235305] RSP: 0018:88007b97fc98 EFLAGS: 00010286
[15948.235318] RAX: ffe4 RBX: 88007b80a800 RCX:
[15948.235333] RDX: 219e RSI: 0004 RDI: 880079418138
[15948.235349] RBP: 88007b97fcd8 R08: 88007fc1cae0 R09: 88007ad272d0
[15948.235366] R10: R11: 0010 R12: 88007a2d9500
[15948.235381] R13: 8800027d60e0 R14: 88007b80ac58 R15: 0001
[15948.235401] FS: () GS:88007fc0() knlGS:
[15948.235418] CS: 0010 DS: ES: CR0: 80050033
[15948.235432] CR2: 7f0489ff CR3: 7a5e CR4: 001407f0
[15948.235464] Stack:
[15948.235473] 88007b97fcd8 c0497acf 88007b809800 88003c207400
[15948.235498] 88007b809800 88007ad272d0 88007a2d9500 0001
[15948.235521] 88007b97fd58 c04412e0 880079418000 0004c0427fea
[15948.235551] Call Trace:
[15948.235601] [] ? lookup_free_space_inode+0x4f/0x100 [btrfs]
[15948.235642] [] btrfs_remove_block_group+0x140/0x490 [btrfs]
[15948.235693] [] btrfs_remove_chunk+0x245/0x380 [btrfs]
[15948.235731] [] btrfs_delete_unused_bgs+0x236/0x270 [btrfs]
[15948.235771] [] cleaner_kthread+0x12c/0x190 [btrfs]
[15948.235806] [] ? btrfs_destroy_all_delalloc_inodes+0x120/0x120 [btrfs]
[15948.235844] [] kthread+0xc9/0xe0
[15948.235872] [] ? flush_kthread_worker+0x90/0x90
[15948.235900] [] ret_from_fork+0x7c/0xb0
[15948.235919] [] ? flush_kthread_worker+0x90/0x90
[15948.235933] Code: e8 7d a1 fc ff 8b 45 c8 e9 6d ff ff ff 0f 1f 44 00 00 f0 41 80 65 80 fd 4c 89 ef 89 45 c8 e8 cf 20 fe ff 8b 45 c8 e9 48 ff ff ff <0f> 0b 4c 89 f7 45 31 f6 e8 8a a2 35 c1 e9 f9 fe ff ff 0f 1f 44
[15948.236017] RIP [] btrfs_orphan_add+0x1a9/0x1c0 [btrfs]
[15948.236017] RSP
[15948.761942] ---[ end trace 0ccd21c265dce56b ]---

# ls
bigfile2.img  bigfile.img
# touch 1
(...never returned...)

Tomasz
Re: [PATCH] btrfs-progs: Fix a copy-n-paste bug in btrfs_read_fs_root().
On 2015/01/07 18:23, Qu Wenruo wrote:
> Signed-off-by: Qu Wenruo

Reviewed-by: Satoru Takeuchi

> ---
>  disk-io.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/disk-io.c b/disk-io.c
> index 2bf8586..b853f66 100644
> --- a/disk-io.c
> +++ b/disk-io.c
> @@ -693,7 +693,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
>  	if (location->objectid == BTRFS_CSUM_TREE_OBJECTID)
>  		return fs_info->csum_root;
>  	if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID)
> -		return fs_info->csum_root;
> +		return fs_info->quota_root;
>
>  	BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
>  	       location->offset != (u64)-1);
Re: price to pay for nocow file bit?
On 01/07/2015 04:05 PM, Goffredo Baroncelli wrote:
>>> I am trying to understand the pros and cons of turning this bit on,
>>> before I can make this change. So far I see one big pro, but I wonder
>>> if there's any major con I should think about?
>>
>> Nope, there's no real con other than you don't get csums, but that
>> doesn't really matter for you. Thanks,
>
> In a btrfs RAID setup, in case of a corrupted sector, is BTRFS able to
> rebuild the sector? I suppose not; if so, this has to be added to the
> cons, I think.

It won't know it's corrupted, but it can rebuild if, say, you yank a
drive and add a new one. RAID5/RAID6 would catch corruption, of course.

Thanks,

Josef
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
On Wed, Jan 07, 2015 at 08:08:50PM +0100, Martin Steigerwald wrote:
> Am Dienstag, 6. Januar 2015, 15:03:23 schrieb Zygo Blaxell:
> > ext3 has a related problem when it's nearly full: it will try to
> > search gigabytes of block allocation bitmaps looking for a free
> > block, which can result in a single 'mkdir' call spending 45 minutes
> > reading a large, slow, 99.5%-full filesystem.
>
> Ok, that's for bitmap access. Ext4 uses extents.

...and the problem doesn't happen to the same degree on ext4 as it did
on ext3.

> > So far I've found that problems start when space drops below 1GB free
> > (although it can go as low as 400MB) and problems stop when space gets
> > above 1GB free, even without resizing or balancing the filesystem.
> > I've adjusted free space monitoring thresholds accordingly for now,
> > and it seems to be keeping things working so far.
>
> Just to see whether we are on the same terms: You talk about space that
> BTRFS has not yet reserved for chunks, i.e. the difference between size
> and used in btrfs fi sh, right?

The number I look at for this issue is statvfs() f_bavail (i.e. the
"Available" column of /bin/df).

Before the empty-chunk-deallocation code, most of my filesystems would
quickly reach a steady state where all space was allocated to chunks, and
they stayed that way unless I had to downsize them. Now there is free
(non-chunk) space on most of my filesystems.

I'll try monitoring btrfs fi df and btrfs fi show under the failing
conditions and see if there are interesting correlations.
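The f_bavail threshold check described above can be sketched in a few lines of C (the helper names are mine, and nothing here is btrfs-specific — it is just what df computes):

```c
#include <stdint.h>
#include <sys/statvfs.h>

/* "Available" bytes as /bin/df reports them: f_bavail fragments of
 * f_frsize bytes each. */
static uint64_t avail_bytes(const struct statvfs *st)
{
    return (uint64_t)st->f_bavail * st->f_frsize;
}

/* Monitoring sketch for the ~1 GiB danger zone discussed above.
 * Returns 1 if available space is below the threshold, 0 if not,
 * -1 if the filesystem could not be queried. */
static int below_threshold(const char *path, uint64_t threshold)
{
    struct statvfs st;
    if (statvfs(path, &st) != 0)
        return -1;   /* treat errors as "unknown" */
    return avail_bytes(&st) < threshold ? 1 : 0;
}
```

A Nagios-style check would call `below_threshold(mountpoint, 1ULL << 30)` and alert on 1.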
Re: price to pay for nocow file bit?
>> I am trying to understand the pros and cons of turning this bit
>> on, before I can make this change. So far I see one big pro, but I
>> wonder if there's any major con I should think about?
>
> Nope, there's no real con other than you don't get csums, but that
> doesn't really matter for you. Thanks,

In a btrfs RAID setup, in case of a corrupted sector, is BTRFS able to
rebuild the sector? I suppose not; if so, this has to be added to the
cons, I think.

From my tests [1][2] I was unable to measure a significant difference
between doing a defrag and setting chattr -C on the log directory. Did
you get other results? If so, I am interested to know more.

BR
G.Baroncelli

[1] http://kreijack.blogspot.it/2014/06/btrfs-and-systemd-journal.html
[2] http://lists.freedesktop.org/archives/systemd-devel/2014-June/020141.html
Re: price to pay for nocow file bit?
On 01/07/2015 12:43 PM, Lennart Poettering wrote:
> Heya!
>
> Currently, systemd-journald's disk access patterns (appending to the
> end of files, then updating a few pointers in the front) result in
> awfully fragmented journal files on btrfs, which has a pretty negative
> effect on performance when accessing them.

I've been wondering if mount -o autodefrag would deal with this problem,
but I haven't had the chance to look into it.

> Now, to improve things a bit, I yesterday made a change to journald, to
> issue the btrfs defrag ioctl when a journal file is rotated, i.e. when
> we know that no further writes will ever be done on the file.
>
> However, I wonder now if I should go one step further even, and use the
> equivalent of "chattr -C" (i.e. nocow) on all journal files. I am
> wondering what price I would precisely have to pay for that. Judging by
> this earlier thread:
>
> http://www.spinics.net/lists/linux-btrfs/msg33134.html
>
> it's mostly about data integrity, which is something I can live with,
> given the conservative write patterns of journald, and the fact that we
> do our own checksumming and careful data validation. I mean, if btrfs
> in this mode provides no worse data integrity semantics than ext4 I am
> fully fine with losing this feature for these files.

Yup, it's no worse than ext4.

> Hence I am mostly interested in what else is lost if this flag is
> turned on by default for all journal files journald creates:
>
> Does this have any effect on functionality? As I understood snapshots
> still work fine for files marked like that, and so do reflinks. Any
> drawback functionality-wise? Apparently file compression support is
> lost if the bit is set? (which I can live with too, journal files are
> internally compressed anyway)

Yeah, no compression, no checksums. If you do a reflink then you'll COW
once and then the new copy will be nocow, so it'll be fine. Same goes for
snapshots. So you'll likely incur some fragmentation, but less than
before; I'd measure to make sure, if it's that big of a deal.

> What about performance? Do any operations get substantially slower by
> setting this bit? For example, what happens if I take a snapshot of
> files with this bit set and then modify the file, does this result in a
> full (and hence slow) copy of the file on that occasion?

Performance is the same.

> I am trying to understand the pros and cons of turning this bit on,
> before I can make this change. So far I see one big pro, but I wonder
> if there's any major con I should think about?

Nope, there's no real con other than you don't get csums, but that
doesn't really matter for you.

Thanks,

Josef
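For reference, the defrag-on-rotate change discussed above boils down to a single ioctl. A hedged userspace sketch (it assumes the kernel UAPI header linux/btrfs.h is installed; a NULL argument to BTRFS_IOC_DEFRAG asks the kernel to defragment the whole file with default settings — this is a sketch of the idea, not journald's actual code):

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/btrfs.h>   /* BTRFS_IOC_DEFRAG */

/* Ask btrfs to defragment one file, as one might on journal rotation.
 * Returns 0 on success, an errno value otherwise -- e.g. ENOTTY when
 * the file does not live on btrfs. */
static int try_defrag(const char *path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0)
        return errno;
    int ret = ioctl(fd, BTRFS_IOC_DEFRAG, NULL);
    int err = (ret < 0) ? errno : 0;
    close(fd);
    return err;
}
```

On non-btrfs filesystems the call fails cleanly with ENOTTY, so a caller can issue it unconditionally and ignore that error.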
Re: BTRFS free space handling still needs more work: Hangs again (no complete lockups, "just" tasks stuck for some time)
Am Dienstag, 6. Januar 2015, 15:03:23 schrieb Zygo Blaxell:
> On Mon, Dec 29, 2014 at 10:32:00AM +0100, Martin Steigerwald wrote:
> > Am Sonntag, 28. Dezember 2014, 21:07:05 schrieb Zygo Blaxell:
> > > On Sat, Dec 27, 2014 at 08:23:59PM +0100, Martin Steigerwald wrote:
[…]
> > > > Zygo, what are the characteristics of your filesystem? Do you use
> > > > compress=lzo and skinny metadata as well? How are the chunks
> > > > allocated? What kind of data do you have on it?
> > >
> > > compress-force (default zlib), no skinny-metadata. Chunks are
> > > d=single, m=dup. Data is a mix of various desktop applications, most
> > > active file sizes from a few hundred K to a few MB, maybe 300k-400k
> > > files. No database or VM workloads. Filesystem is 100GB and is
> > > usually between 98 and 99% full (about 1-2GB free).
> > >
> > > I have another filesystem which has similar problems when it's
> > > 99.99% full (it's 13TB, so 0.01% is 1.3GB). That filesystem is RAID1
> > > with skinny-metadata and no-holes.
> > >
> > > On various filesystems I have the above CPU-burning problem, a bunch
> > > of irreproducible random crashes, and a hang with a kernel stack
> > > that goes through SyS_unlinkat and btrfs_evict_inode.
> >
> > Zygo, thanks. That desktop filesystem sounds a bit similar to my use
> > case, with the interesting difference that you have no databases or
> > VMs on it.
> >
> > That said, I use the Windows XP VM rarely, but using it was what made
> > the issue so visible for me. Is your desktop filesystem on SSD?
>
> No, but I recently stumbled across the same symptoms on an 8GB SD card
> on kernel 3.12.24 (Raspberry Pi). When the filesystem hit over ~97%
> full, all accesses were blocked for several minutes. I was able to
> work around it by adjusting the threshold on a garbage-collector daemon
> (i.e. deleting a lot of expendable files) to keep usage below 90%.
> I didn't try to balance the filesystem, and didn't seem to need to.

Interesting.

> ext3 has a related problem when it's nearly full: it will try to search
> gigabytes of block allocation bitmaps searching for a free block, which
> can result in a single 'mkdir' call spending 45 minutes reading a large
> slow 99.5% full filesystem.

Ok, that's for bitmap access. Ext4 uses extents. BTRFS can use bitmaps as
well, but also supports extents and I think uses them for most use cases.

> I'd expect a btrfs filesystem that was nearly full to have a small tree
> of cached free space extents and be able to search it quickly even if
> the result is negative (i.e. there's no free space). It seems to be
> doing something else... :-P

Yeah :)

> > Do you have the chance to extend one of the affected filesystems to
> > check my theory that this does not happen as long as BTRFS can still
> > allocate new data chunks? If it's right, your FS should be fluent
> > again as long as you see more than 1 GiB free
> >
> > Label: none  uuid: 53bdf47c-4298-45bc-a30f-8a310c274069
> >         Total devices 2 FS bytes used 512.00KiB
> >         devid 1 size 10.00GiB used 6.53GiB path /dev/mapper/sata-btrfsraid1
> >         devid 2 size 10.00GiB used 6.53GiB path /dev/mapper/msata-btrfsraid1
> >
> > between "size" and "used" in btrfs fi sh. I suggest going with at
> > least 2-3 GiB, as BTRFS may allocate just one chunk so quickly that
> > you do not have the chance to recognize the difference.
>
> So far I've found that problems start when space drops below 1GB free
> (although it can go as low as 400MB) and problems stop when space gets
> above 1GB free, even without resizing or balancing the filesystem.
> I've adjusted free space monitoring thresholds accordingly for now,
> and it seems to be keeping things working so far.

Just to see whether we are on the same terms: You talk about space that
BTRFS has not yet reserved for chunks, i.e. the difference between size
and used in btrfs fi sh, right?

No BTRFS developers have commented on this yet, neither in this thread
nor in the bug report at kernel.org I made.

> > Well, and if that works for you, we are back to my recommendation:
> >
> > More so than with other filesystems, give BTRFS plenty of free space
> > to operate with. At best as much that you always have a minimum of
> > 2-3 GiB unused device space left for chunk reservation. One could even
> > do some Nagios/Icinga monitoring plugin for that :)

Thanks,
--
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7
ssd mode on rotational media
What issues would arise if ssd mode is activated because the underlying block device reports a rotational flag of zero? This happens for me when running btrfs on bcache. Would it be beneficial to pass the nossd mount option?

Thanks,
Kyle
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
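For reference, btrfs picks ssd mode when the device's sysfs rotational flag (/sys/block/<dev>/queue/rotational) is 0, which bcache devices report. The autodetection can be overridden per mount. A hypothetical fstab fragment (device path and mount point are illustrative, not from this thread):

```
# /etc/fstab entry forcing btrfs to skip ssd-mode heuristics even though
# bcache reports rotational=0 (device and mountpoint are examples only):
/dev/bcache0  /mnt/data  btrfs  defaults,nossd  0 0
```

The same `nossd` option can be applied to a live filesystem with a remount.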
price to pay for nocow file bit?
Heya!

Currently, systemd-journald's disk access patterns (appending to the end of files, then updating a few pointers at the front) result in awfully fragmented journal files on btrfs, which has a pretty negative effect on performance when accessing them.

Now, to improve things a bit, yesterday I made a change to journald to issue the btrfs defrag ioctl when a journal file is rotated, i.e. when we know that no further writes will ever be done on the file.

However, I wonder now if I should go one step further and use the equivalent of "chattr +C" (i.e. nocow) on all journal files. I am wondering what price precisely I would have to pay for that. Judging by this earlier thread:

http://www.spinics.net/lists/linux-btrfs/msg33134.html

it's mostly about data integrity, which is something I can live with, given the conservative write patterns of journald and the fact that we do our own checksumming and careful data validation. I mean, if btrfs in this mode provides no worse data integrity semantics than ext4, I am fully fine with losing this feature for these files.

Hence I am mostly interested in what else is lost if this flag is turned on by default for all journal files journald creates:

Does this have any effect on functionality? As I understand it, snapshots still work fine for files marked like that, and so do reflinks. Any drawback functionality-wise? Apparently file compression support is lost if the bit is set? (Which I can live with too; journal files are internally compressed anyway.)

What about performance? Do any operations get substantially slower by setting this bit? For example, what happens if I take a snapshot of files with this bit set and then modify the file? Does this result in a full (and hence slow) copy of the file on that occasion?

I am trying to understand the pros and cons of turning this bit on before I make this change. So far I see one big pro, but I wonder if there's any major con I should think about?
Thanks,

Lennart
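For concreteness, the bit in question is FS_NOCOW_FL, toggled through the FS_IOC_GETFLAGS/FS_IOC_SETFLAGS ioctls that chattr uses under the hood. A minimal sketch follows; the constants match include/uapi/linux/fs.h on 64-bit Linux, and the actual ioctl only succeeds on filesystems that support nocow (such as btrfs), ideally on still-empty files:

```python
# Sketch of setting the nocow bit the way "chattr +C" does, via
# FS_IOC_GETFLAGS / FS_IOC_SETFLAGS on a file descriptor.
import fcntl
import struct

FS_IOC_GETFLAGS = 0x80086601  # _IOR('f', 1, long) on 64-bit Linux
FS_IOC_SETFLAGS = 0x40086602  # _IOW('f', 2, long) on 64-bit Linux
FS_NOCOW_FL = 0x00800000      # chattr's 'C' attribute

def with_nocow(flags: int) -> int:
    """Pure helper: return the inode flag word with the nocow bit set."""
    return flags | FS_NOCOW_FL

def set_nocow(fd: int) -> None:
    """Set nocow on an (ideally empty) open file.

    Raises OSError on filesystems that do not support the flag
    (e.g. ext4 or tmpfs)."""
    buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
    (flags,) = struct.unpack("l", buf)
    fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", with_nocow(flags)))
```

Setting the flag on the containing directory instead makes newly created files inherit it, which matches the "set it before the first write" constraint journald would face.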
Re: btrfs_inode_item's otime?
On Wed, Jan 07, 2015 at 02:57:35PM +0100, Lennart Poettering wrote:
> Exposing this as an xattr sounds great to me too.

NAK - exposing random stat data as xattrs only creates problems. Given that we don't seem to be able to get a new stat format anytime soon, we should add a generic ioctl to expose it, reading it from struct kstat, which all filesystems that support this attribute should fill out. And there are quite a lot of them.
Re: btrfs_inode_item's otime?
On Tue, 06.01.15 19:26, David Sterba (dste...@suse.cz) wrote:

> > (Of course, even without xstat(), I think it would be good to have an
> > unprivileged ioctl to query the otime in btrfs... the TREE_SEARCH
> > ioctl after all requires privileges...)
>
> Adding this interface is a different question. I do not like to add
> ioctls that do too specialized things that would normally fit into a
> generic interface, like the xstat example. We could use the object
> properties instead (i.e. export the otime as an extended attribute),
> but the work on that has stalled and it's not ready to just simply add
> the otime in advance.

Exposing this as an xattr sounds great to me too.

Lennart

--
Lennart Poettering, Red Hat
Re: Data recovery after RBD I/O error
On 2015-01-06 23:11, Jérôme Poulin wrote:

On Mon, Jan 5, 2015 at 6:59 AM, Austin S Hemmelgarn wrote:

Secondly, I would highly recommend not using ANY non-cluster-aware FS on top of a clustered block device like RBD

For my use-case, this is just a single server using the RBD device. No clustering involved on the BTRFS side of things.

My only point is that there isn't anything in BTRFS to handle it accidentally being multiply mounted. Ext*, for example, aren't clustered, but do have an optional feature to prevent multiple mounting.

However, it was really useful to take snapshots (just like LVM) before modifying the filesystem in any way.

Have you tried Ceph's built-in snapshot support? I don't remember how to use it, but I do know it is there (at least, it is in the most recent versions), and it is a bit more like LVM's snapshots than BTRFS's are.
Re: BTRFS_IOC_TREE_SEARCH ioctl
On Mon, 05.01.15 19:14, Nehemiah Dacres (vivacar...@gmail.com) wrote:

> Is libbtrfs documented or even stable yet? What stage of development is
> it in anyway? Is there a design spec yet?

Note that the code we use in systemd is not based on libbtrfs; we just call the ioctls directly.

Lennart

--
Lennart Poettering, Red Hat
Re: [PATCH] btrfs: get the accurate value of used_bytes in btrfs_get_block_group_info().
On 01/07/2015 05:22 PM, Qu Wenruo wrote:

Hi Satoru-san,

Hi Dongsheng,

On 2015/01/05 20:19, Dongsheng Yang wrote:

Ping.

The BTRFS_IOC_SPACE_INFO ioctl currently does not report data that is used but not yet synced. btrfs fi df will therefore give users wrong numbers before a sync. This patch solves that problem.

On 10/27/2014 08:38 PM, Dongsheng Yang wrote:

Reproducer:
# mkfs.btrfs -f -b 20G /dev/sdb
# mount /dev/sdb /mnt/test
# fallocate -l 17G /mnt/test/largefile
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=6.00GiB  <- only 6G, but actually it should be 17G.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

I tried to reproduce your problem with 3.19-rc1. However, this problem doesn't happen. Could you also try to reproduce it with the upstream kernel?

I can still reproduce it in 3.18, but it seems to have been fixed in 3.19-rc1 already by another patch, so this patch is no longer needed.

Oops, my fault. I forgot to test it with upstream. :( Satoru and Qu, thanks a lot.

Yang

Thanks,
Qu

* Detailed test script (named "yang-test.sh" here):
===
#!/bin/bash -x
PART1=/dev/vdb
MNT_PNT=./mnt

mkfs.btrfs -f -b 20G ${PART1}
mount ${PART1} ${MNT_PNT}
fallocate -l 17G ${MNT_PNT}/largefile
btrfs fi df ${MNT_PNT}
sync
btrfs fi df ${MNT_PNT}
umount ${MNT_PNT}
===

Result:
===
# ./yang-test.sh
+ PART1=/dev/vdb
+ MNT_PNT=./mnt
+ mkfs.btrfs -f -b 20G /dev/vdb
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
	nodesize 16384 leafsize 16384 sectorsize 4096 size 20.00GiB
+ mount /dev/vdb ./mnt
+ fallocate -l 17G ./mnt/largefile
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # Used 17GiB properly
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ sync
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # (of course) used 17GiB too
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ umount ./mnt
===

Although I ran this test five times, the results were the same.

Thanks,
Satoru

# sync
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=17.00GiB  <- After sync, it is as expected.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

The value of 6.00GiB is actually calculated in btrfs_get_block_group_info() by adding the @block_group->item->used of each group together. This does not take the bytes in cache into account. This patch adds the values of @pinned, @reserved and @bytes_super from struct btrfs_block_group_cache to make sure we get an accurate @used_bytes.
Reported-by: Qu Wenruo
Signed-off-by: Dongsheng Yang
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 33c80f5..bc2aaeb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3892,6 +3892,10 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 		space->total_bytes += block_group->key.offset;
 		space->used_bytes +=
 			btrfs_block_group_used(&block_group->item);
+		/* Add bytes-info in cache */
+		space->used_bytes += block_group->pinned;
+		space->used_bytes += block_group->reserved;
+		space->used_bytes += block_group->bytes_super;
 	}
 }
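The accounting change in the patch is small enough to mirror in a few lines. The following sketch contrasts the old and patched sums; the field names follow the patch, and the example numbers are purely illustrative (not taken from the reproducer):

```python
# Sketch of the used_bytes accounting before and after the patch: the old
# code summed only each block group's on-disk item->used, while the patch
# also counts pinned, reserved, and bytes_super, which is where
# not-yet-synced data is accounted.

def used_bytes_old(block_groups):
    return sum(bg["item_used"] for bg in block_groups)

def used_bytes_patched(block_groups):
    return sum(bg["item_used"] + bg["pinned"] + bg["reserved"]
               + bg["bytes_super"]
               for bg in block_groups)

# Illustrative numbers only: 6 GiB already reflected in the item, 11 GiB
# still reserved in cache before sync (cf. the 6 GiB vs 17 GiB reproducer).
GIB = 2**30
groups = [{"item_used": 6 * GIB, "pinned": 0,
           "reserved": 11 * GIB, "bytes_super": 0}]
```

With these numbers, the old sum reports 6 GiB while the patched sum reports the full 17 GiB, matching the before/after `btrfs fi df` output in the thread.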
Re: [PATCH] btrfs: get the accurate value of used_bytes in btrfs_get_block_group_info().
Hi Satoru-san,

Hi Dongsheng,

On 2015/01/05 20:19, Dongsheng Yang wrote:

Ping.

The BTRFS_IOC_SPACE_INFO ioctl currently does not report data that is used but not yet synced. btrfs fi df will therefore give users wrong numbers before a sync. This patch solves that problem.

On 10/27/2014 08:38 PM, Dongsheng Yang wrote:

Reproducer:
# mkfs.btrfs -f -b 20G /dev/sdb
# mount /dev/sdb /mnt/test
# fallocate -l 17G /mnt/test/largefile
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=6.00GiB  <- only 6G, but actually it should be 17G.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

I tried to reproduce your problem with 3.19-rc1. However, this problem doesn't happen. Could you also try to reproduce it with the upstream kernel?

I can still reproduce it in 3.18, but it seems to have been fixed in 3.19-rc1 already by another patch, so this patch is no longer needed.

Thanks,
Qu

* Detailed test script (named "yang-test.sh" here):
===
#!/bin/bash -x
PART1=/dev/vdb
MNT_PNT=./mnt

mkfs.btrfs -f -b 20G ${PART1}
mount ${PART1} ${MNT_PNT}
fallocate -l 17G ${MNT_PNT}/largefile
btrfs fi df ${MNT_PNT}
sync
btrfs fi df ${MNT_PNT}
umount ${MNT_PNT}
===

Result:
===
# ./yang-test.sh
+ PART1=/dev/vdb
+ MNT_PNT=./mnt
+ mkfs.btrfs -f -b 20G /dev/vdb
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
	nodesize 16384 leafsize 16384 sectorsize 4096 size 20.00GiB
+ mount /dev/vdb ./mnt
+ fallocate -l 17G ./mnt/largefile
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # Used 17GiB properly
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ sync
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # (of course) used 17GiB too
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ umount ./mnt
===

Although I ran this test five times, the results were the same.

Thanks,
Satoru

# sync
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=17.00GiB  <- After sync, it is as expected.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

The value of 6.00GiB is actually calculated in btrfs_get_block_group_info() by adding the @block_group->item->used of each group together. This does not take the bytes in cache into account. This patch adds the values of @pinned, @reserved and @bytes_super from struct btrfs_block_group_cache to make sure we get an accurate @used_bytes.
Reported-by: Qu Wenruo
Signed-off-by: Dongsheng Yang
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 33c80f5..bc2aaeb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3892,6 +3892,10 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 		space->total_bytes += block_group->key.offset;
 		space->used_bytes +=
 			btrfs_block_group_used(&block_group->item);
+		/* Add bytes-info in cache */
+		space->used_bytes += block_group->pinned;
+		space->used_bytes += block_group->reserved;
+		space->used_bytes += block_group->bytes_super;
 	}
 }
[PATCH] btrfs-progs: Fix a copy-n-paste bug in btrfs_read_fs_root().
Signed-off-by: Qu Wenruo
---
 disk-io.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/disk-io.c b/disk-io.c
index 2bf8586..b853f66 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -693,7 +693,7 @@ struct btrfs_root *btrfs_read_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid == BTRFS_CSUM_TREE_OBJECTID)
 		return fs_info->csum_root;
 	if (location->objectid == BTRFS_QUOTA_TREE_OBJECTID)
-		return fs_info->csum_root;
+		return fs_info->quota_root;
 
 	BUG_ON(location->objectid == BTRFS_TREE_RELOC_OBJECTID ||
 	       location->offset != (u64)-1);
--
2.2.1
[RFC PATCH] Btrfs: use asynchronous submit for large DIO io in single profile
Commit 1ae399382512 ("Btrfs: do not use async submit for small DIO io's") benefits small DIO io's. However, if we're using the SINGLE profile, this also affects large DIO io's, since in that case map_length is (chunk_length - bio's offset_in_chunk), which is fairly large, so it's very likely to be larger than a large bio's size, which avoids asynchronous submit. For instance, if we have a 512k bio, the effort of calculating (512k/4k=128) checksums will be taken on by the DIO task.

This adds a limit, 'BTRFS_STRIPE_LEN', to decide if a bio is small enough to avoid asynchronous submit. Still, in this case we don't need to split the bio and can submit it directly.

Signed-off-by: Liu Bo
---
 fs/btrfs/inode.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e687bb0..c640d7e 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7792,6 +7792,7 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	int nr_pages = 0;
 	int ret;
 	int async_submit = 0;
+	u64 alloc_profile;
 
 	map_length = orig_bio->bi_iter.bi_size;
 	ret = btrfs_map_block(root->fs_info, rw, start_sector << 9,
@@ -7799,15 +7800,26 @@ static int btrfs_submit_direct_hook(int rw, struct btrfs_dio_private *dip,
 	if (ret)
 		return -EIO;
 
+	alloc_profile = btrfs_get_alloc_profile(root, 1);
+
 	if (map_length >= orig_bio->bi_iter.bi_size) {
 		bio = orig_bio;
 		dip->flags |= BTRFS_DIO_ORIG_BIO_SUBMITTED;
+
+		/*
+		 * In the case of 'single' profile, the above check is very
+		 * likely to be true as map_length is (chunk_length - offset),
+		 * so checking BTRFS_STRIPE_LEN here.
+		 */
+		if ((alloc_profile & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0 &&
+		    orig_bio->bi_iter.bi_size >= BTRFS_STRIPE_LEN)
+			async_submit = 1;
+
 		goto submit;
 	}
 
 	/* async crcs make it difficult to collect full stripe writes.
 	 */
-	if (btrfs_get_alloc_profile(root, 1) &
-	    (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
+	if (alloc_profile & (BTRFS_BLOCK_GROUP_RAID5 | BTRFS_BLOCK_GROUP_RAID6))
 		async_submit = 0;
 	else
 		async_submit = 1;
--
1.8.1.4
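The submit decision the patch describes can be sketched in a few lines. In this sketch BTRFS_STRIPE_LEN is 64 KiB (1 << 16 in the kernel); the profile bit values are stand-ins for the kernel's block-group masks, not the real constants:

```python
# Sketch of the DIO submit decision after this patch: with a striped
# profile the bio may already be split small, but with the SINGLE profile
# map_length is huge, so large bios additionally get compared against
# BTRFS_STRIPE_LEN to decide whether to checksum asynchronously.

BTRFS_STRIPE_LEN = 64 * 1024  # 1 << 16 in the kernel

# Stand-in profile bits (values illustrative, not the kernel's):
RAID5, RAID6, RAID1 = 0x1, 0x2, 0x4
PROFILE_MASK = RAID5 | RAID6 | RAID1  # profile == 0 means SINGLE here

def async_submit(profile: int, bio_size: int, map_length: int) -> bool:
    if map_length >= bio_size:
        # Whole bio fits one mapping; only go async for large
        # SINGLE-profile bios, per the patch.
        return (profile & PROFILE_MASK) == 0 and bio_size >= BTRFS_STRIPE_LEN
    # Bio must be split; async crcs make full stripe writes hard with
    # parity RAID, so stay synchronous there.
    return not (profile & (RAID5 | RAID6))
```

For the 512k-bio example from the commit message on a SINGLE profile, this now chooses asynchronous submit, moving the 128 checksum calculations off the DIO task.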
Re: [PATCH] btrfs: get the accurate value of used_bytes in btrfs_get_block_group_info().
Hi Dongsheng,

On 2015/01/05 20:19, Dongsheng Yang wrote:

Ping.

The BTRFS_IOC_SPACE_INFO ioctl currently does not report data that is used but not yet synced. btrfs fi df will therefore give users wrong numbers before a sync. This patch solves that problem.

On 10/27/2014 08:38 PM, Dongsheng Yang wrote:

Reproducer:
# mkfs.btrfs -f -b 20G /dev/sdb
# mount /dev/sdb /mnt/test
# fallocate -l 17G /mnt/test/largefile
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=6.00GiB  <- only 6G, but actually it should be 17G.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

I tried to reproduce your problem with 3.19-rc1. However, this problem doesn't happen. Could you also try to reproduce it with the upstream kernel?

* Detailed test script (named "yang-test.sh" here):
===
#!/bin/bash -x
PART1=/dev/vdb
MNT_PNT=./mnt

mkfs.btrfs -f -b 20G ${PART1}
mount ${PART1} ${MNT_PNT}
fallocate -l 17G ${MNT_PNT}/largefile
btrfs fi df ${MNT_PNT}
sync
btrfs fi df ${MNT_PNT}
umount ${MNT_PNT}
===

Result:
===
# ./yang-test.sh
+ PART1=/dev/vdb
+ MNT_PNT=./mnt
+ mkfs.btrfs -f -b 20G /dev/vdb
Btrfs v3.17
See http://btrfs.wiki.kernel.org for more information.
Turning ON incompat feature 'extref': increased hardlink limit per file to 65536
fs created label (null) on /dev/vdb
	nodesize 16384 leafsize 16384 sectorsize 4096 size 20.00GiB
+ mount /dev/vdb ./mnt
+ fallocate -l 17G ./mnt/largefile
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # Used 17GiB properly
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ sync
+ btrfs fi df ./mnt
Data, single: total=17.01GiB, used=17.00GiB  # (of course) used 17GiB too
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B
+ umount ./mnt
===

Although I ran this test five times, the results were the same.

Thanks,
Satoru

# sync
# btrfs fi df /mnt/test
Data, single: total=17.49GiB, used=17.00GiB  <- After sync, it is as expected.
System, DUP: total=8.00MiB, used=16.00KiB
System, single: total=4.00MiB, used=0.00B
Metadata, DUP: total=1.00GiB, used=112.00KiB
Metadata, single: total=8.00MiB, used=0.00B
GlobalReserve, single: total=16.00MiB, used=0.00B

The value of 6.00GiB is actually calculated in btrfs_get_block_group_info() by adding the @block_group->item->used of each group together. This does not take the bytes in cache into account. This patch adds the values of @pinned, @reserved and @bytes_super from struct btrfs_block_group_cache to make sure we get an accurate @used_bytes.
Reported-by: Qu Wenruo
Signed-off-by: Dongsheng Yang
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 33c80f5..bc2aaeb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3892,6 +3892,10 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 		space->total_bytes += block_group->key.offset;
 		space->used_bytes +=
 			btrfs_block_group_used(&block_group->item);
+		/* Add bytes-info in cache */
+		space->used_bytes += block_group->pinned;
+		space->used_bytes += block_group->reserved;
+		space->used_bytes += block_group->bytes_super;
 	}
 }