Re: Ongoing Btrfs stability issues
> It's hosted on an EBS volume; we don't use ephemeral storage at all.
> The EBS volumes are all SSD.

I have recently done some SSD corruption experiments on a small set of
workloads, so I thought I would share my experience.

When creating a btrfs filesystem on an SSD, mkfs.btrfs disables the
metadata duplication (DUP) option by default. This renders btrfs scrub
ineffective for metadata, as there is no redundant copy to restore
corrupted metadata from. So if any errors occur during a read operation
on an SSD, the read fails as an uncorrectable error, whereas on an HDD
the corruption would be repaired on the fly by btrfs scrub when the
checksum error is detected.

Could you confirm whether metadata DUP is enabled on your systems by
running the following command:

$ btrfs fi df /mnt    # /mnt is the mount point
Data, single: total=8.00MiB, used=64.00KiB
System, single: total=4.00MiB, used=16.00KiB
Metadata, single: total=168.00MiB, used=112.00KiB
GlobalReserve, single: total=16.00MiB, used=0.00B

If metadata is "single" in your case as well (and not DUP), that may be
why btrfs scrub is not correcting bit rot on the fly, causing the
reliability issues. A couple of such bugs observed specifically on SSDs
are reported here:

https://bugzilla.kernel.org/show_bug.cgi?id=198463
https://bugzilla.kernel.org/show_bug.cgi?id=198807

These do not occur on HDDs, and I believe they should not occur when the
filesystem is mounted with the nossd option.
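If it does turn out to be "single", one possible remedy is to keep the
metadata in DUP. The commands below are only a sketch; the device and
mount point are placeholders, and DUP metadata on SSD costs extra space
and extra writes:

$ mkfs.btrfs -m dup /dev/sdX                  # enable DUP metadata at creation time
$ btrfs balance start -mconvert=dup /mnt      # convert metadata on an existing filesystem
$ btrfs balance start -sconvert=dup -f /mnt   # optionally convert system chunks too (-f is required with -s)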
On Fri, Feb 16, 2018 at 10:03 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
> excerpted:
>
>> This will probably sound like an odd question, but does BTRFS think
>> your storage devices are SSD's or not? [...]
>
> According to the wiki, 4.14 does indeed have the ssd changes.
>
> According to the bug, he's running 4.13.x on one server and 4.14.x on
> two. So upgrading the one to 4.14.x should mean all will have that fix.
>
> However, without a full balance it /will/ take some time to settle down
> (again, assuming btrfs was using ssd mode), so the lingering effect
> could still be creating problems on the 4.14 kernel servers for the
> moment.
>
> --
> Duncan - List replies preferred. No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman

--
Shehbaz Jaffer
Re: Ongoing Btrfs stability issues
Austin S. Hemmelgarn posted on Fri, 16 Feb 2018 14:44:07 -0500 as
excerpted:

> This will probably sound like an odd question, but does BTRFS think your
> storage devices are SSD's or not? Based on what you're saying, it
> sounds like you're running into issues resulting from the
> over-aggressive SSD 'optimizations' that were done by BTRFS until very
> recently.
>
> You can verify if this is what's causing your problems or not by either
> upgrading to a recent mainline kernel version (I know the changes are in
> 4.15, I don't remember for certain if they're in 4.14 or not, but I
> think they are), or by adding 'nossd' to your mount options, and then
> seeing if you still have the problems or not (I suspect this is only
> part of it, and thus changing this will reduce the issues, but not
> completely eliminate them). Make sure and run a full balance after
> changing either item, as the aforementioned 'optimizations' have an
> impact on how data is organized on-disk (which is ultimately what causes
> the issues), so they will have a lingering effect if you don't balance
> everything.

According to the wiki, 4.14 does indeed have the ssd changes.

According to the bug, he's running 4.13.x on one server and 4.14.x on
two. So upgrading the one to 4.14.x should mean all will have that fix.

However, without a full balance it /will/ take some time to settle down
(again, assuming btrfs was using ssd mode), so the lingering effect could
still be creating problems on the 4.14 kernel servers for the moment.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: Status of FST and mount times
On 2018年02月16日 22:12, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information. Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of newly rsync'd homedir data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE      624.00KiB 0(    38) 1(     1)
>>> EXTENT_TREE    327.31MiB 0( 20881) 1(    66) 2(     1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk | grep CHUNK_ITEM | wc -l
>
> Since yesterday I've doubled the size by copying the homedir dataset in
> again. Here are new stats:
>
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
>
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454

OK, this explains everything.

There are too many chunks. That means at mount time you need to search
for a block group item 3454 times. Even though each search only needs to
iterate about 3 tree blocks, multiplied by 3454 it is still a lot of
work.

Although some tree blocks, like the root node and level 1 nodes, can be
cached, we still need to read about 3500 tree blocks. If the filesystem
was created with a 16K nodesize, that means roughly 54M of random reads
in 16K blocks. No wonder it takes some time.

Normally I would expect about 1G for each data and metadata chunk. If
there is nothing special going on, it means your filesystem is already
larger than 3T. If your used space is way smaller than 3.5T (less than
30%), then your chunk usage is pretty low, and in that case a balance to
reduce the number of chunks (block groups) would reduce the mount time.

My personal estimate is that mount time is O(n log n), so if you are able
to cut the chunk count in half, you could reduce the mount time by
roughly 60%.
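As a concrete sketch of that kind of cleanup (the usage threshold and the
mount point below are only examples, not tuned recommendations):

$ btrfs balance start -dusage=25 -musage=25 /mnt/btrfs   # rewrite chunks that are less than 25% full
$ btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l   # then re-check the chunk count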
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
>
> The old mean mount time was 4.319s. It now takes 11.537s for the
> doubled dataset. Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more. Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar? That would definitely be
> in terms of multiple minutes at that point.
>
>>> Taking 100 snapshots (no changes between snapshots however) of the
>>> above subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
>
> Good to know, thanks!
>
>>> Snapshot creation and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
>
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.

No, snapshot deletion is always deferred to the background. -C only
ensures that even if a power loss happens right after the command
returns, you won't see the snapshot anywhere; the actual deletion still
happens in the background afterwards.

Thanks,
Qu
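P.S. For anyone who needs to wait for that background cleanup to actually
finish (for example before timing umount), a sketch of the usual sequence
is the following; the mount point and snapshot name are placeholders:

$ btrfs subvolume delete -C /mnt/btrfs/snap-001
$ btrfs subvolume sync /mnt/btrfs   # blocks until deleted subvolumes are fully cleaned up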
Re: [PATCH] Fix NULL pointer exception in find_bio_stripe()
On Fri, Feb 16, 2018 at 07:51:38PM +0000, Dmitriy Gorokh wrote:
> On detaching of a disk which is a part of a RAID6 filesystem, the
> following kernel OOPS may happen:
>
> [... full kernel oops trace trimmed here; it appears in the patch
> submission reproduced below ...]
> This is reproducible in a cycle where a series of writes is followed by
> a SCSI device delete command. The test may take up to a few minutes.
>
> Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace
> bi_bdev with a gendisk pointer and partitions index")
> ---
>  fs/btrfs/raid56.c | 1 +
>  1 file changed, 1 insertion(+)

This is not the correct way to submit patches for inclusion in the stable
kernel tree.

Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.
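For reference, the usual way to nominate a fix like this under those rules
is either to add a tag along the following lines to the sign-off area of
the patch before it is merged, or to mail the upstream commit ID to
stable@vger.kernel.org once it is in Linus' tree (the version annotation
here is only an example):

    Cc: stable@vger.kernel.org # 4.14.x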
[PATCH] Fix NULL pointer exception in find_bio_stripe()
On detaching of a disk which is a part of a RAID6 filesystem, the
following kernel OOPS may happen:

[63122.680461] BTRFS error (device sdo): bdev /dev/sdo errs: wr 0, rd 0, flush 1, corrupt 0, gen 0
[63122.719584] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.719587] BTRFS error (device sdo): bdev /dev/sdo errs: wr 1, rd 0, flush 1, corrupt 0, gen 0
[63122.803516] BTRFS warning (device sdo): lost page write due to IO error on /dev/sdo
[63122.803519] BTRFS error (device sdo): bdev /dev/sdo errs: wr 2, rd 0, flush 1, corrupt 0, gen 0
[63122.863902] BTRFS critical (device sdo): fatal error on device /dev/sdo
[63122.935338] BUG: unable to handle kernel NULL pointer dereference at 0080
[63122.946554] IP: fail_bio_stripe+0x58/0xa0 [btrfs]
[63122.958185] PGD 9ecda067 P4D 9ecda067 PUD b2b37067 PMD 0
[63122.971202] Oops: [#1] SMP
[63122.990786] Modules linked in: libcrc32c dlm configfs cpufreq_userspace cpufreq_powersave cpufreq_conservative softdog nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc bonding ipmi_devintf ipmi_msghandler joydev snd_intel8x0 snd_ac97_codec snd_pcm snd_timer snd psmouse evdev parport_pc soundcore serio_raw battery pcspkr video ac97_bus ac parport ohci_pci ohci_hcd i2c_piix4 button crc32c_generic crc32c_intel btrfs xor zstd_decompress zstd_compress xxhash raid6_pq dm_mod dax raid1 md_mod hid_generic usbhid hid xhci_pci xhci_hcd ehci_pci ehci_hcd usbcore sg sd_mod sr_mod cdrom ata_generic ahci libahci ata_piix libata e1000 scsi_mod [last unloaded: scst]
[63123.006760] CPU: 0 PID: 3979 Comm: kworker/u8:9 Tainted: G W 4.14.2-16-scst34x+ #8
[63123.007091] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[63123.007402] Workqueue: btrfs-worker btrfs_worker_helper [btrfs]
[63123.007595] task: 880036ea4040 task.stack: c90006384000
[63123.007796] RIP: 0010:fail_bio_stripe+0x58/0xa0 [btrfs]
[63123.007968] RSP: 0018:c90006387ad8 EFLAGS: 00010287
[63123.008140] RAX: 0002 RBX: 88004beaa0b8 RCX: 8800b2bd5690
[63123.008359] RDX: RSI: 88007bb43500 RDI: 88004beaa000
[63123.008621] RBP: c90006387ae8 R08: 9910 R09: 8800b2bd5600
[63123.008840] R10: 0004 R11: 0001 R12: 88007bb43500
[63123.009059] R13: fffb R14: 880036fc5180 R15: 0004
[63123.009278] FS: () GS:8800b700() knlGS:
[63123.009564] CS: 0010 DS: ES: CR0: 80050033
[63123.009748] CR2: 0080 CR3: b0866000 CR4: 000406f0
[63123.009969] Call Trace:
[63123.010085]  raid_write_end_io+0x7e/0x80 [btrfs]
[63123.010251]  bio_endio+0xa1/0x120
[63123.010378]  generic_make_request+0x218/0x270
[63123.010921]  submit_bio+0x66/0x130
[63123.011073]  finish_rmw+0x3fc/0x5b0 [btrfs]
[63123.011245]  full_stripe_write+0x96/0xc0 [btrfs]
[63123.011428]  raid56_parity_write+0x117/0x170 [btrfs]
[63123.011604]  btrfs_map_bio+0x2ec/0x320 [btrfs]
[63123.011759]  ? ___cache_free+0x1c5/0x300
[63123.011909]  __btrfs_submit_bio_done+0x26/0x50 [btrfs]
[63123.012087]  run_one_async_done+0x9c/0xc0 [btrfs]
[63123.012257]  normal_work_helper+0x19e/0x300 [btrfs]
[63123.012429]  btrfs_worker_helper+0x12/0x20 [btrfs]
[63123.012656]  process_one_work+0x14d/0x350
[63123.012888]  worker_thread+0x4d/0x3a0
[63123.013026]  ? _raw_spin_unlock_irqrestore+0x15/0x20
[63123.013192]  kthread+0x109/0x140
[63123.013315]  ? process_scheduled_works+0x40/0x40
[63123.013472]  ? kthread_stop+0x110/0x110
[63123.013610]  ret_from_fork+0x25/0x30
[63123.013741] Code: 7e 43 31 c0 48 63 d0 48 8d 14 52 49 8d 4c d1 60 48 8b 51 08 49 39 d0 72 1f 4c 63 1b 4c 01 da 49 39 d0 73 14 48 8b 11 48 8b 52 68 <48> 8b 8a 80 00 00 00 48 39 4e 08 74 14 83 c0 01 44 39 d0 75 c4
[63123.014469] RIP: fail_bio_stripe+0x58/0xa0 [btrfs] RSP: c90006387ad8
[63123.014678] CR2: 0080
[63123.016590] ---[ end trace a295ea7259c17880 ]---

This is reproducible in a cycle where a series of writes is followed by a
SCSI device delete command. The test may take up to a few minutes.

Fixes: commit 74d46992e0d9dee7f1f376de0d56d31614c8a17a ("block: replace bi_bdev with a gendisk pointer and partitions index")
---
 fs/btrfs/raid56.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index dec0907dfb8a..fcfc20de2df3 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1370,6 +1370,7 @@ static int find_bio_stripe(struct btrfs_raid_bio *rbio,
 		stripe_start = stripe->physical;
 		if (physical >= stripe_start &&
 		    physical < stripe_start + rbio->stripe_len &&
+		    stripe->dev->bdev &&
 		    bio->bi_disk == stripe->dev->bdev->bd_disk &&
 		    bio->bi_partno == stripe->dev->bdev->bd_partno) {
 			return i;
-- 
2.14.2
Re: Ongoing Btrfs stability issues
On 2018-02-15 11:18, Alex Adriaanse wrote:
> We've been using Btrfs in production on AWS EC2 with EBS devices for
> over 2 years. There is so much I love about Btrfs: CoW snapshots,
> compression, subvolumes, flexibility, the tools, etc. However, lack of
> stability has been a serious ongoing issue for us, and we're getting to
> the point that it's becoming hard to justify continuing to use it unless
> we make some changes that will get it stable.
>
> The instability manifests itself mostly in the form of the VM completely
> crashing, I/O operations freezing, or the filesystem going into readonly
> mode. We've spent an enormous amount of time trying to recover corrupted
> filesystems, and the time that servers were down as a result of Btrfs
> instability has accumulated to many days.
>
> We've made many changes to try to improve Btrfs stability: upgrading to
> newer kernels, setting up nightly balances, setting up monitoring to
> ensure our filesystems stay under 70% utilization, etc. This has
> definitely helped quite a bit, but even with these things in place it's
> still unstable. Take https://bugzilla.kernel.org/show_bug.cgi?id=198787
> for example, which I created yesterday: we've had 4 VMs (out of 20) go
> down over the past week alone because of Btrfs errors. Thankfully, no
> data was lost, but I did have to copy everything over to a new
> filesystem.
>
> Many of our VMs that run Btrfs have a high rate of I/O (both read/write;
> I/O utilization is often pegged at 100%). The filesystems that get
> little I/O seem pretty stable, but the ones that undergo a lot of I/O
> activity are the ones that suffer from the most instability problems.
>
> We run the following balances on every filesystem every night:
>
>     btrfs balance start -dusage=10
>     btrfs balance start -dusage=20
>     btrfs balance start -dusage=40,limit=100

I would suggest changing this to eliminate the balance with '-dusage=10'
(it's redundant with the '-dusage=20' one unless your filesystem is in
pathologically bad shape), and adding equivalent filters for balancing
metadata (which generally goes pretty fast). Unless you've got a huge
filesystem, you can also cut down on that limit filter. 100 data chunks
that are 40% full is up to 40GB of data to move on a normally sized
filesystem, or potentially up to 200GB if you've got a really big
filesystem (I forget at what point BTRFS starts scaling up chunk sizes,
but I'm pretty sure it's in the TB range).

> We also use the following btrfs-snap cronjobs to implement rotating
> snapshots, with short-term snapshots taking place every 15 minutes and
> less frequent ones being retained for up to 3 days:
>
>     0 1-23 * * * /opt/btrfs-snap/btrfs-snap -r 23
>     15,30,45 * * * * /opt/btrfs-snap/btrfs-snap -r 15m 3
>     0 0 * * * /opt/btrfs-snap/btrfs-snap -r daily 3
>
> Our filesystems are mounted with the "compress=lzo" option.
>
> Are we doing something wrong? Are there things we should change to
> improve stability? I wouldn't be surprised if eliminating snapshots
> would stabilize things, but if we do that we might as well be using a
> filesystem like XFS. Are there fixes queued up that will solve the
> problems listed in the Bugzilla ticket referenced above? Or is our
> I/O-intensive workload just not a good fit for Btrfs?

This will probably sound like an odd question, but does BTRFS think your
storage devices are SSD's or not? Based on what you're saying, it sounds
like you're running into issues resulting from the over-aggressive SSD
'optimizations' that were done by BTRFS until very recently.
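A quick way to see what BTRFS decided is to look at the active mount
options and at the device's rotational flag; the mount point and device
name below are placeholders:

$ findmnt -no OPTIONS /mnt | tr ',' '\n' | grep -E '^(ssd|ssd_spread|nossd)$'
$ cat /sys/block/xvdf/queue/rotational    # 0 = treated as SSD, 1 = rotational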
You can verify if this is what's causing your problems or not by either
upgrading to a recent mainline kernel version (I know the changes are in
4.15, I don't remember for certain if they're in 4.14 or not, but I think
they are), or by adding 'nossd' to your mount options, and then seeing if
you still have the problems or not (I suspect this is only part of it,
and thus changing this will reduce the issues, but not completely
eliminate them). Make sure and run a full balance after changing either
item, as the aforementioned 'optimizations' have an impact on how data is
organized on-disk (which is ultimately what causes the issues), so they
will have a lingering effect if you don't balance everything.

'autodefrag' is the other mount option that I would try toggling (turn it
off if you've got it on, or on if you've got it off). I doubt it will
have much impact, but it does change how things end up on disk.

In addition to all that, make sure your monitoring isn't just looking at
the regular `df` command's output; it's woefully insufficient for
monitoring space usage on BTRFS. If you want to check things properly,
you want to be looking at the data in /sys/fs/btrfs/<UUID>/allocation,
more specifically checking the following percentages:

1. The sum of the values in /sys/fs/btrfs/<UUID>/allocation relative to
   the sum total of the size of the block devices for the filesystem.
2. The ratio of /sys/fs/btrfs/<UUID>/allocation/data/bytes_u
Btrfs progs release 4.15.1
Hi,

btrfs-progs version 4.15.1 has been released. This is a minor update with
build fixes, cleanups and test enhancements.

Changes:
  * build
    * fix build on musl
    * support asciidoctor for doc generation
  * cleanups
    * sync some code with kernel
    * check: move code to own directory, split to more files
  * tests
    * more build tests in travis
    * tests now pass with asan and ubsan
    * testsuite can be exported and used separately

Tarballs: https://www.kernel.org/pub/linux/kernel/people/kdave/btrfs-progs/
Git: git://git.kernel.org/pub/scm/linux/kernel/git/kdave/btrfs-progs.git

Shortlog:

Anand Jain (1):
      btrfs-progs: print-tree: fix INODE_ITEM sequence and flags

David Sterba (20):
      btrfs-progs: check: rename files after moving code
      btrfs-progs: docs: fix manual page title format
      btrfs-progs: build: add support for asciidoctor doc generator
      btrfs-progs: convert: fix build on musl
      btrfs-progs: mkfs: fix build on musl
      btrfs-progs: ci: change clone depth to 1
      btrfs-progs: ci: add support scripts for docker build
      btrfs-progs: ci: add dockerfile for a musl build test
      btrfs-progs: ci: enable musl build tests in docker
      btrfs-progs: ci: use helper script for default build commands
      btrfs-progs: ci: replace inline shell commands with scripts
      btrfs-progs: rework testsuite export
      btrfs-progs: tests: update README.md
      btrfs-progs: tests: unify test drivers, make ready for extenral testsuite
      btrfs-progs: test: update clean-test.sh after the TEST_TOP update
      btrfs-progs: tests: document exported testsuite
      btrfs-progs: reorder tests in make target
      btrfs-progs: let callers of btrfs_show_qgroups free the buffers
      btrfs-progs: update CHANGES for v4.15.1
      Btrfs progs v4.15.1

Gu Jinxiang (12):
      btrfs-progs: Use fs_info instead of root for BTRFS_LEAF_DATA_SIZE
      btrfs-progs: Use fs_info instead of root for BTRFS_NODEPTRS_PER_BLOCK
      btrfs-progs: Sync code with kernel for BTRFS_MAX_INLINE_DATA_SIZE
      btrfs-progs: Use fs_info instead of root for BTRFS_MAX_XATTR_SIZE
      btrfs-progs: do clean up for redundancy value assignment
      btrfs-progs: remove no longer used btrfs_alloc_extent
      btrfs-progs: Cleanup use of root in leaf_data_end
      btrfs-progs: add prerequisite mkfs.btrfs for test-cli
      btrfs-progs: add prerequisite btrfs-image for test-fuzz
      btrfs-progs: add prerequisite btrfs-convert for test-misc
      btrfs-progs: Add make testsuite command for export tests
      btrfs-progs: introduce TEST_TOP and INTERNAL_BIN for tests

Qu Wenruo (22):
      btrfs-progs: tests: chang tree-reloc-tree test number from 027 to 015
      btrfs-progs: Move cmds-check.c to check/main.c
      btrfs-progs: check: Move original mode definitions to check/original.h
      btrfs-progs: check: Move definitions of lowmem mode to check/lowmem.h
      btrfs-progs: check: Move node_refs structure to check/common.h
      btrfs-progs: check: Export check global variables to check/common.h
      btrfs-progs: check: Move imode_to_type function to check/common.h
      btrfs-progs: check: Move fs_root_objectid function to check/common.h
      btrfs-progs: check: Move count_csum_range function to check/common.c
      btrfs-progs: check: Move __create_inode_item function to check/common.c
      btrfs-progs: check: Move link_inode_to_lostfound function to common.c
      btrfs-progs: check: Move check_dev_size_alignment to check/common.c
      btrfs-progs: check: move reada_walk_down to check/common.c
      btrfs-progs: check: Move check_child_node to check/common.c
      btrfs-progs: check: Move reset_cached_block_groups to check/common.c
      btrfs-progs: check: Move lowmem check code to its own check/lowmem.[ch]
      btrfs-progs: check/lowmem: Cleanup unnecessary _v2 suffixes
      btrfs-progs: check: Cleanup all checkpatch error and warning
      btrfs-progs: fsck-tests: Cleanup the restored image for 028
      btrfs-progs: btrfs-progs: Fix read beyond boundary bug in build_roots_info_cache()
      btrfs-progs: mkfs/rootdir: Fix memory leak in traverse_directory()
      btrfs-progs: convert/ext2: Fix memory leak caused by handled ext2_filsys

Su Yue (1):
      btrfs-progs: tests common: remove meaningless colon in extract_image()
[GIT PULL] Btrfs fixes for 4.16-rc1
Hi,

we have a few assorted fixes, some of them show up during fstests so I
gave them more testing. Please pull, thanks.

The following changes since commit 3acbcbfc8f06d4ade2aab2ebba0a2542a05ce90c:

  btrfs: drop devid as device_list_add() arg (2018-01-29 19:31:16 +0100)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git for-4.16-rc1-tag

for you to fetch changes up to fd649f10c3d21ee9d7542c609f29978bdf73ab94:

  btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device (2018-02-05 17:15:14 +0100)

Filipe Manana (1):
      Btrfs: fix null pointer dereference when replacing missing device

Liu Bo (6):
      Btrfs: fix deadlock in run_delalloc_nocow
      Btrfs: fix crash due to not cleaning up tree log block's dirty bits
      Btrfs: fix extent state leak from tree log
      Btrfs: fix btrfs_evict_inode to handle abnormal inodes correctly
      Btrfs: fix use-after-free on root->orphan_block_rsv
      Btrfs: fix unexpected -EEXIST when creating new inode

Nikolay Borisov (2):
      btrfs: Ignore errors from btrfs_qgroup_trace_extent_post
      btrfs: Fix use-after-free when cleaning up fs_devs with a single stale device

Zygo Blaxell (1):
      btrfs: remove spurious WARN_ON(ref->count < 0) in find_parent_nodes

 fs/btrfs/backref.c     | 11 ++-
 fs/btrfs/delayed-ref.c |  3 ++-
 fs/btrfs/extent-tree.c |  4
 fs/btrfs/inode.c       | 41 ++---
 fs/btrfs/qgroup.c      |  9 +++--
 fs/btrfs/tree-log.c    | 32 ++--
 fs/btrfs/volumes.c     |  1 +
 7 files changed, 80 insertions(+), 21 deletions(-)
Re: Status of FST and mount times
On 02/16/2018 09:42 AM, Ellis H. Wilson III wrote:
> On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:
>> Well, imagine you have a big tree (an actual real life tree outside)
>> and you need to pick things (e.g. apples) which are hanging everywhere.
>>
>> So, what you need to do is climb the tree, climb on a branch all the
>> way to the end where the first apple is... climb back, climb up a bit,
>> go onto the next branch to the end for the next apple... etc etc
>>
>> The bigger the tree is, the longer it keeps you busy, because the
>> apples will be semi-evenly distributed around the full tree, and
>> they're always hanging at the end of the branch.
>>
>> The speed with which you can climb around (random read disk access IO
>> speed for btrfs, because your disk cache is empty when first mounting)
>> determines how quickly you're done.
>>
>> So, yes.
>
> Thanks Hans. I will say multiple minutes (by the looks of things, I'll
> end up near to an hour for 60TB if this non-linear scaling continues) to
> mount a filesystem is undesirable, but I won't offer that criticism
> without thinking constructively for a moment:
>
> Help me out by referencing the tree in question if you don't mind, so I
> can better understand the point of picking all these "apples" (I would
> guess for capacity reporting via df, but maybe there's more).
>
> Typical disclaimer that I haven't yet grokked the various inner-workings
> of BTRFS, so this is quite possibly a terrible or unapproachable idea:
> On umount, you must already have whatever metadata you were doing the
> tree walk on mount for in-memory (otherwise you would have been able to
> lazily do the treewalk after a quick mount). Therefore, could we not
> stash this metadata at or associated with, say, the root of the
> subvolumes? This way you can always determine on mount quickly if the
> cache is still valid (i.e., no situation like: remount with old btrfs,
> change stuff, umount with old btrfs, remount with new btrfs, pain). I
> would guess generation would be sufficient to determine if the cached
> metadata is valid for the given root block.
>
> This would scale with the number of subvolumes (but not snapshots), and
> would be reasonably quick I think.

I see on 02/13 Qu commented regarding a similar idea, except he proposed
perhaps a richer version of my above suggestion (making the block group
items into their own tree). The concern was that it would be a lot of
work since it modifies the on-disk format. That's a reasonable worry.

I will get a new kernel, expand my array to around 36TB, and will
generate a plot of mount times against extents going up to at least 30TB
in increments of 0.5TB. If this proves to reach absurd mount time delays
(to be specific, anything above around 60s is untenable for our use), we
may very well be sufficiently motivated to implement the above
improvement and submit it for consideration.

Accordingly, if anybody has additional and/or more specific thoughts on
the optimization, I am all ears.

Best,

ellis
Re: Status of FST and mount times
On 02/16/2018 09:20 AM, Hans van Kranenburg wrote:
> Well, imagine you have a big tree (an actual real life tree outside) and
> you need to pick things (e.g. apples) which are hanging everywhere.
>
> So, what you need to do is climb the tree, climb on a branch all the way
> to the end where the first apple is... climb back, climb up a bit, go
> onto the next branch to the end for the next apple... etc etc
>
> The bigger the tree is, the longer it keeps you busy, because the apples
> will be semi-evenly distributed around the full tree, and they're always
> hanging at the end of the branch.
>
> The speed with which you can climb around (random read disk access IO
> speed for btrfs, because your disk cache is empty when first mounting)
> determines how quickly you're done.
>
> So, yes.

Thanks Hans. I will say multiple minutes (by the looks of things, I'll
end up near to an hour for 60TB if this non-linear scaling continues) to
mount a filesystem is undesirable, but I won't offer that criticism
without thinking constructively for a moment:

Help me out by referencing the tree in question if you don't mind, so I
can better understand the point of picking all these "apples" (I would
guess for capacity reporting via df, but maybe there's more).

Typical disclaimer that I haven't yet grokked the various inner-workings
of BTRFS, so this is quite possibly a terrible or unapproachable idea:
On umount, you must already have whatever metadata you were doing the
tree walk on mount for in-memory (otherwise you would have been able to
lazily do the treewalk after a quick mount). Therefore, could we not
stash this metadata at or associated with, say, the root of the
subvolumes? This way you can always determine on mount quickly if the
cache is still valid (i.e., no situation like: remount with old btrfs,
change stuff, umount with old btrfs, remount with new btrfs, pain). I
would guess generation would be sufficient to determine if the cached
metadata is valid for the given root block.

This would scale with the number of subvolumes (but not snapshots), and
would be reasonably quick I think.

Thoughts?

ellis
Re: Status of FST and mount times
On 02/16/2018 03:12 PM, Ellis H. Wilson III wrote:
> On 02/15/2018 08:55 PM, Qu Wenruo wrote:
>> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>>> Very helpful information. Thank you Qu and Hans!
>>>
>>> I have about 1.7TB of newly rsync'd homedir data on a single
>>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>>
>>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>>> total bytes 6001175126016
>>> bytes used 1832557875200
>>>
>>> Hans' (very cool) tool reports:
>>> ROOT_TREE      624.00KiB 0(    38) 1(     1)
>>> EXTENT_TREE    327.31MiB 0( 20881) 1(    66) 2(     1)
>>
>> Extent tree is not so large, a little unexpected to see such slow mount.
>>
>> BTW, how many chunks do you have?
>>
>> It could be checked by:
>>
>> # btrfs-debug-tree -t chunk | grep CHUNK_ITEM | wc -l
>
> Since yesterday I've doubled the size by copying the homedir dataset in
> again. Here are new stats:
>
> extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
> total bytes 6001175126016
> bytes used 3663525969920
>
> $ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
> 3454
>
> $ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
> ROOT_TREE           1.14MiB 0(    72) 1(     1)
> EXTENT_TREE       644.27MiB 0( 41101) 1(   131) 2(     1)
> CHUNK_TREE        384.00KiB 0(    23) 1(     1)
> DEV_TREE          272.00KiB 0(    16) 1(     1)
> FS_TREE            11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
> CSUM_TREE           3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
> QUOTA_TREE            0.00B
> UUID_TREE          16.00KiB 0(     1)
> FREE_SPACE_TREE       0.00B
> DATA_RELOC_TREE    16.00KiB 0(     1)
>
> The old mean mount time was 4.319s. It now takes 11.537s for the
> doubled dataset. Again please realize this is on an old version of
> BTRFS (4.5.5), so perhaps newer ones will perform better, but I'd still
> like to understand this delay more. Should I expect this to scale in
> this way all the way up to my proposed 60-80TB filesystem so long as the
> file size distribution stays roughly similar? That would definitely be
> in terms of multiple minutes at that point.

Well, imagine you have a big tree (an actual real life tree outside) and
you need to pick things (e.g. apples) which are hanging everywhere.

So, what you need to do is climb the tree, climb on a branch all the way
to the end where the first apple is... climb back, climb up a bit, go
onto the next branch to the end for the next apple... etc etc

The bigger the tree is, the longer it keeps you busy, because the apples
will be semi-evenly distributed around the full tree, and they're always
hanging at the end of the branch.

The speed with which you can climb around (random read disk access IO
speed for btrfs, because your disk cache is empty when first mounting)
determines how quickly you're done.

So, yes.

>>> Taking 100 snapshots (no changes between snapshots however) of the
>>> above subvolume doesn't appear to impact mount/umount time.
>>
>> 100 unmodified snapshots won't affect mount time.
>>
>> It needs new extents, which can be created by overwriting extents in
>> snapshots.
>> So it won't really cause much difference if all these snapshots are all
>> unmodified.
>
> Good to know, thanks!
>
>>> Snapshot creation and deletion both operate at between 0.25s to 0.5s.
>>
>> IIRC snapshot deletion is delayed, so the real work doesn't happen when
>> "btrfs sub del" returns.
>
> I was using btrfs sub del -C for the deletions, so I believe (if that
> command truly waits for the subvolume to be utterly gone) it captures
> the entirety of the snapshot.
>
> Best,
>
> ellis

-- 
Hans van Kranenburg
Re: Status of FST and mount times
On 02/15/2018 08:55 PM, Qu Wenruo wrote:
> On 2018年02月16日 00:30, Ellis H. Wilson III wrote:
>> Very helpful information. Thank you Qu and Hans!
>>
>> I have about 1.7TB of newly rsync'd homedir data on a single
>> enterprise 7200rpm HDD and the following output for btrfs-debug:
>>
>> extent tree key (EXTENT_TREE ROOT_ITEM 0) 543384862720 level 2
>> total bytes 6001175126016
>> bytes used 1832557875200
>>
>> Hans' (very cool) tool reports:
>> ROOT_TREE      624.00KiB 0(    38) 1(     1)
>> EXTENT_TREE    327.31MiB 0( 20881) 1(    66) 2(     1)
>
> Extent tree is not so large, a little unexpected to see such slow mount.
>
> BTW, how many chunks do you have?
>
> It could be checked by:
>
> # btrfs-debug-tree -t chunk | grep CHUNK_ITEM | wc -l

Since yesterday I've doubled the size by copying the homedir dataset in
again. Here are new stats:

extent tree key (EXTENT_TREE ROOT_ITEM 0) 385990656 level 2
total bytes 6001175126016
bytes used 3663525969920

$ sudo btrfs-debug-tree -t chunk /dev/sdb | grep CHUNK_ITEM | wc -l
3454

$ sudo ./show_metadata_tree_sizes.py /mnt/btrfs/
ROOT_TREE            1.14MiB 0(    72) 1(     1)
EXTENT_TREE        644.27MiB 0( 41101) 1(   131) 2(     1)
CHUNK_TREE         384.00KiB 0(    23) 1(     1)
DEV_TREE           272.00KiB 0(    16) 1(     1)
FS_TREE             11.55GiB 0(754442) 1(  2179) 2(     5) 3(     2)
CSUM_TREE            3.50GiB 0(228593) 1(   791) 2(     2) 3(     1)
QUOTA_TREE             0.00B
UUID_TREE           16.00KiB 0(     1)
FREE_SPACE_TREE        0.00B
DATA_RELOC_TREE     16.00KiB 0(     1)

The old mean mount time was 4.319s. It now takes 11.537s for the doubled
dataset. Again please realize this is on an old version of BTRFS (4.5.5),
so perhaps newer ones will perform better, but I'd still like to
understand this delay more. Should I expect this to scale in this way all
the way up to my proposed 60-80TB filesystem so long as the file size
distribution stays roughly similar? That would definitely be in terms of
multiple minutes at that point.

>> Taking 100 snapshots (no changes between snapshots however) of the
>> above subvolume doesn't appear to impact mount/umount time.
>
> 100 unmodified snapshots won't affect mount time.
>
> It needs new extents, which can be created by overwriting extents in
> snapshots.
> So it won't really cause much difference if all these snapshots are all
> unmodified.

Good to know, thanks!

>> Snapshot creation and deletion both operate at between 0.25s to 0.5s.
>
> IIRC snapshot deletion is delayed, so the real work doesn't happen when
> "btrfs sub del" returns.

I was using btrfs sub del -C for the deletions, so I believe (if that
command truly waits for the subvolume to be utterly gone) it captures
the entirety of the snapshot.

Best,

ellis
Re: btrfs send/receive in reverse possible?
On Fri, Feb 16, 2018 at 10:43:54AM +0800, Sampson Fung wrote:
> I have snapshot A on Drive_A.
> I send snapshot A to an empty Drive_B. Then keep Drive_A as backup.
> I use Drive_B as active.
> I create new snapshot B on Drive_B.
>
> Can I use btrfs send/receive to send incremental differences back to
> Drive_A?
> What is the correct way of doing this?

You can't do it with the existing tools -- it needs a change to the send
stream format.

Here's a write-up of what's going on behind the scenes, and what needs to
change:

https://www.spinics.net/lists/linux-btrfs/msg44089.html

Hugo.

-- 
Hugo Mills             | I can't foretell the future, I just work there.
hugo@... carfax.org.uk | http://carfax.org.uk/ | PGP: E2AB1DE4
                       | The Doctor