Re: Heavy nocow'd VM image fragmentation
Larkin Lowrey posted on Sun, 26 Oct 2014 12:20:45 -0500 as excerpted:

> One unusual property of my setup is I have my fs on top of bcache. More
> specifically, the stack is md raid6 -> bcache -> lvm -> btrfs. When the
> fs mounts it has mount option 'ssd' due to the fact that bcache sets
> /sys/block/bcache0/queue/rotational to 0. Is there any reason why
> either the 'ssd' mount option or being backed by bcache could be
> responsible?

Bcache... Some kernel cycles ago btrfs on bcache had known issues, but I don't recall the details. I /think/ that was fixed, but if you don't know what I'm referring to, I'd suggest looking back in the btrfs list archives (and, assuming there's a bcache list, there too) to see what it was, whether it was fixed, and (presumably on the bcache list) the current status.

... Actually, I just did a bcache keyword search in my archive and see you on a thread saying it was working fine for you, so never mind. Looks like you are aware of that thread, and actually know more about the status than I do...

I don't believe the ssd mount option /should/ be triggering fragmentation; I use it here on real ssd, but as I said, I don't have that sort of large-internal-write-pattern file to worry about, and I have autodefrag set too, plus compress=lzo, so filefrag's reports aren't trustworthy here anyway.

But what I DO know is that there's a nossd mount option available if the detection's going whacky and it's adding the ssd mount option inappropriately. That has been there for a couple of kernel cycles now. See the btrfs (5) manpage for the mount options. So you could try the nossd mount option and see if it makes a difference.

Meanwhile, that's quite a stack you have there. Before I switched to btrfs and btrfs raid, I was running mdraid here, and for a period ran lvm on top of mdraid. But as an admin I decided that was simply too complex a setup for me to be confident in my own ability to properly handle disaster recovery. And because I could feed the appropriate root-on-mdraid parameters directly to the kernel and didn't need an initr* for it, while I did for lvm, I kept mdraid, and actually had a few chances to practice disaster recovery on mdraid over time, becoming quite comfortable with it.

But not only do you have that, you have bcache thrown in too, and in place of the traditional reiserfs I was using (and still use on my second backups and media partitions on spinning rust, as I've had very good results with reiserfs since data=ordered became the default, even thru various hardware issues... I'll avoid the stories), you're using btrfs, which has its own raid modes, altho I suppose you're not using them.

So that is indeed quite a stack. If you're comfortable with your ability to properly handle disaster recovery at all those levels, wow, you definitely have my respect. Or do you just have it all backed up, and figure that if it blows up and disaster recovery isn't going to be trivial, you simply rebuild and restore from backup? I guess with btrfs not yet fully stable and mature that's the best idea at its level anyway, and if you have it backed up for that, then you have it backed up for the others and /can/ simply rebuild your stack and restore from backup, should you need to.

-- Duncan
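As an aside on the 'ssd' autodetection discussed above: btrfs keys it off a single sysfs flag, which is easy to inspect before reaching for nossd. A minimal standalone C sketch; the bcache0 path is an assumption taken from the stack described in this thread, so substitute your own device:

#include <stdio.h>

int main(void)
{
	/* btrfs auto-adds the 'ssd' mount option when this flag reads 0;
	 * bcache forces it to 0 regardless of the backing disks */
	const char *path = "/sys/block/bcache0/queue/rotational";
	FILE *f = fopen(path, "r");
	int rotational;

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%d", &rotational) != 1) {
		fclose(f);
		fprintf(stderr, "unexpected contents in %s\n", path);
		return 1;
	}
	fclose(f);
	printf("rotational=%d => 'ssd' will%s be auto-added\n",
	       rotational, rotational ? " not" : "");
	return 0;
}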
Re: Poll: time to switch skinny-metadata on by default?
Zygo Blaxell posted on Mon, 27 Oct 2014 00:39:25 -0400 as excerpted:

> One thing that may be significant is _when_ those 3 hanging filesystems
> are hanging: when using rsync to update local files. These machines
> are using the traditional rsync copy-then-rename method rather than
> --inplace updates. There's no problem copying data into an empty
> directory with rsync, but as soon as I start updating existing data,
> some process (not necessarily rsync) using the filesystem gets stuck
> within 36 hours, and stays stuck for days. If I don't run rsync on the
> skinny filesystems, they'll run for a week or more without incident--
> and if I then start running rsync again, they hang later the same day.

Limited counterpoint here: My packages partition is btrfs with skinny-metadata (skinny extents in dmesg), and the main gentoo tree on it gets regularly rsynced against gentoo servers. In fact, my sync script does that *AND* a git-pull on three overlays, in parallel with the rsync, so all three git-pulls and the rsync are happening at once. No problems with that here. =:^)

However, I suspect other factors in my setup avoid whatever's triggering it for Zygo.

* The filesystem is btrfs raid1 mode data/metadata.
* Only 24 GiB in size (show says 19.78 GiB used, df says 15.84 of 18 GiB data used, 969 MiB of 1.75 GiB metadata used).
* Relatively fast SSD, ssd auto-detected and added as a mount option.
* I set the skinny-metadata option (and extref and no-holes) at mkfs.btrfs time, while Zygo converted and presumably has both fat and skinny metadata.

FWIW I've been spared all the rsync-triggered issues people have reported over time. I'm guessing I don't hit the same race conditions because with the small filesystem my overhead is lower, and with the ssd I simply don't have the same bottlenecks. So I'd not expect to hit this problem here either, and that I'm not hitting it doesn't prove much, except that with reasonably fast ssds and smaller filesystems, whatever race conditions people seem to so commonly trigger with rsync elsewhere simply don't seem to happen here.

So as I said, limited counterpoint, but offered FWIW.

-- Duncan
Re: Poll: time to switch skinny-metadata on by default?
Marc Joliet posted on Mon, 27 Oct 2014 02:24:15 +0100 as excerpted:

> On Sat, 25 Oct 2014 14:35:33 -0600, Chris Murphy
> li...@colorremedies.com wrote:
>
>> On Oct 25, 2014, at 2:33 PM, Chris Murphy li...@colorremedies.com
>> wrote:
>>
>>> On Oct 25, 2014, at 6:24 AM, Marc Joliet mar...@gmx.de wrote:
>>>
>>>> First of all: does grub2 support booting from a btrfs file system
>>>> with skinny-metadata, or is it irrelevant?
>>>
>>> Seems plausible if older kernels don't understand skinny-metadata,
>>> that GRUB2 won't either.
>>
>> So I just tested it with grub2-2.02-0.8.fc21 and it works. I'm
>> surprised, actually. I don't understand the nature of the
>> incompatibility with older kernels. Can they not mount a Btrfs volume
>> even as ro? If so then I'd expect GRUB to have a problem, so I'm going
>> to guess that maybe a 3.9 or older kernel could ro mount a Btrfs
>> volume with skinny extents and the incompatibility is writing.
>
> That sounds plausible, though I hope for a definitive answer. (FWIW, I
> originally asked because I couldn't find any commits to grub2 related
> to skinny metadata; the updates to the btrfs driver were fairly
> sparse.)

FWIW I have three /boot partitions, one on each of my main drives. All three are gpt with a reserved BIOS partition that grub2 installs its monolithic grub2core into, but they have dedicated /boot partitions as well, for the grub2 config and additional grub2 modules, kernels, etc. The third one is reiserfs on spinning rust, but the other two are btrfs on ssd.

Last time I updated I thought I switched them to skinny-metadata, but just checking dmesg while mounting them now, the second one (first backup) is skinny-metadata, but my working /boot is still fat-metadata.

I did test the backup (with the skinny-metadata) after I did the mkfs and restore, and it booted to grub2 and from grub2 to my main system just fine, so grub2 with skinny-metadata *CAN* work. But because it's my backup, I don't update it with new kernels as frequently as I do my working /boot, nor do I boot from it that often. So while I can be sure grub2 /can/ work with skinny-metadata, I do not yet know at this point whether it does so /reliably/.

And of course, to the extent that grub2 works differently on MBR and/or on GPT when it doesn't have a reserved BIOS partition to put the monolithic grub2core in, I haven't tested that. Tho in theory that should install in slack-space if available, and the filesystem shouldn't affect that at all. But I know reiserfs used to screw up grub1 very occasionally (maybe 0.5-1% of new kernel installations; it did it I think twice in about 7 years, and I run git kernels so update them reasonably frequently) on my old MBR setup without much slack-space to spare, and I'd have to reinstall grub1.

So that's a qualified "skinny-metadata shouldn't affect grub2", as I've booted using grub2 on a btrfs with a skinny-metadata /boot. But I've simply not tested it enough to know whether it's reliable over time as the filesystem updates and changes, or not.

-- Duncan
Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
On Mon, Oct 27, 2014 at 08:18:12AM +0800, Qu Wenruo wrote:
> -------- Original Message --------
> Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to
> reduce ENOSPC caused by unbalanced data/metadata allocation.
> From: Liu Bo <bo.li@oracle.com>
> To: Qu Wenruo <quwen...@cn.fujitsu.com>
> Date: October 24, 2014, 19:06
>
>> On Thu, Oct 23, 2014 at 10:37:51AM +0800, Qu Wenruo wrote:
>>> When btrfs allocates a chunk, it will try to alloc up to 1G for data
>>> and 256M for metadata, or 10% of all the writeable space if there is
>>> enough
>>
>> 10G for data,
>>
>>> 	if (type & BTRFS_BLOCK_GROUP_DATA) {
>>> 		max_stripe_size = 1024 * 1024 * 1024;
>>> 		max_chunk_size = 10 * max_stripe_size;
>
> Oh, sorry, 10G is right.
> Any other comments?
>
> Thanks,
> Qu

...

thanks,
-liubo

>>> space for the stripe on device.
>>>
>>> However, when we run out of space, this allocation may cause
>>> unbalanced chunk allocation. For example, if there is only 1G of
>>> unallocated space and a request to allocate a DATA chunk is sent, all
>>> the space will be allocated as a data chunk, making a later metadata
>>> chunk allocation request impossible to satisfy, which will cause
>>> ENOSPC. This is one of the common complaints from end users about why
>>> ENOSPC happens while there is still available space.
>>
>> Okay, I don't think this is the common case. AFAIK, most ENOSPC is
>> caused by our runtime worst-case metadata reservation problem.
>>
>> btrfs has been inclined to create a fairly large metadata chunk (1G)
>> in its initial mkfs stage, and a 256M metadata chunk is also a very
>> large one.
>>
>> As for your example below: yes, we don't have space for metadata
>> allocation, but do we really need to allocate a new one? Or am I
>> missing something?
>>
>> thanks,
>> -liubo
>>
>>> This patch will try not to alloc a chunk which is more than half of
>>> the unallocated space, making the last space more balanced at a small
>>> cost of more fragmented chunks in the last 1G.
>>>
>>> Some easy example:
>>> Preallocate 17.5G on a 20G empty btrfs fs:
>>>
>>> [Before]
>>> # btrfs fi show /mnt/test
>>> Label: none  uuid: da8741b1-5d47-4245-9e94-bfccea34e91e
>>> 	Total devices 1 FS bytes used 17.50GiB
>>> 	devid    1 size 20.00GiB used 20.00GiB path /dev/sdb
>>> All space is allocated. No space left for later metadata allocation.
>>>
>>> [After]
>>> # btrfs fi show /mnt/test
>>> Label: none  uuid: e6935aeb-a232-4140-84f9-80aab1f23d56
>>> 	Total devices 1 FS bytes used 17.50GiB
>>> 	devid    1 size 20.00GiB used 19.77GiB path /dev/sdb
>>> About 230M is still available for later metadata allocation.
>>>
>>> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
>>> ---
>>>  fs/btrfs/volumes.c | 18 ++++++++++++++++++
>>>  1 file changed, 18 insertions(+)
>>>
>>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>>> index d47289c..fa8de79 100644
>>> --- a/fs/btrfs/volumes.c
>>> +++ b/fs/btrfs/volumes.c
>>> @@ -4240,6 +4240,7 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>>  	int ret;
>>>  	u64 max_stripe_size;
>>>  	u64 max_chunk_size;
>>> +	u64 total_avail_space = 0;
>>>  	u64 stripe_size;
>>>  	u64 num_bytes;
>>>  	u64 raid_stripe_len = BTRFS_STRIPE_LEN;
>>> @@ -4352,10 +4353,27 @@ static int __btrfs_alloc_chunk(struct btrfs_trans_handle *trans,
>>>  		devices_info[ndevs].max_avail = max_avail;
>>>  		devices_info[ndevs].total_avail = total_avail;
>>>  		devices_info[ndevs].dev = device;
>>> +		total_avail_space += total_avail;
>>>  		++ndevs;
>>>  	}
>>>
>>>  	/*
>>> +	 * Try not to occupy more than half of the unallocated space.
>>> +	 * When running short of space, allocating all the space to
>>> +	 * data/metadata will cause ENOSPC to be triggered more easily.
>>> +	 *
>>> +	 * And since the minimum chunk size is 16M, the half-half split
>>> +	 * would cause 16M to be allocated from 20M of available space,
>>> +	 * with the rest 4M never used. In that case (16~32M), allocate
>>> +	 * all of it directly.
>>> +	 */
>>> +	if (total_avail_space < 32 * 1024 * 1024 &&
>>> +	    total_avail_space > 16 * 1024 * 1024)
>>> +		max_chunk_size = total_avail_space;
>>> +	else
>>> +		max_chunk_size = min(total_avail_space / 2,
>>> +				     max_chunk_size);
>>> +
>>> +	/*
>>>  	 * now sort the devices by hole size / available space
>>>  	 */
>>>  	sort(devices_info, ndevs, sizeof(struct btrfs_device_info),
>>> --
>>> 2.1.2
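To see the allocation cap from the patch in isolation: below is a minimal userspace model of the same rule. The function name cap_chunk_size and the standalone framing are illustrative only; the real change lives inside __btrfs_alloc_chunk() as shown in the diff above.

#include <stdint.h>
#include <stdio.h>

/*
 * Userspace model of the patch's heuristic: cap a new chunk at half of
 * the remaining unallocated space, except in the 16M-32M window, where
 * halving would strand a remainder smaller than the 16M minimum chunk.
 */
static uint64_t cap_chunk_size(uint64_t total_avail, uint64_t max_chunk)
{
	const uint64_t MIN_CHUNK = 16ULL * 1024 * 1024;

	if (total_avail > MIN_CHUNK && total_avail < 2 * MIN_CHUNK)
		return total_avail;        /* allocate it all */
	if (total_avail / 2 < max_chunk)
		return total_avail / 2;    /* leave half for the other chunk type */
	return max_chunk;
}

int main(void)
{
	/* the ENOSPC example from the patch: 1G unallocated, DATA chunk requested */
	uint64_t avail = 1024ULL * 1024 * 1024;

	printf("data chunk capped at %llu MiB\n",
	       (unsigned long long)(cap_chunk_size(avail, 10ULL << 30) >> 20));
	/* prints 512: half of the last 1G stays free for metadata */
	return 0;
}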
Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to reduce ENOSPC caused by unbalanced data/metadata allocation.
-------- Original Message --------
Subject: Re: [PATCH] btrfs: Enhance btrfs chunk allocation algorithm to
reduce ENOSPC caused by unbalanced data/metadata allocation.
From: Liu Bo <bo.li@oracle.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: October 27, 2014, 16:14

> On Mon, Oct 27, 2014 at 08:18:12AM +0800, Qu Wenruo wrote:
> [snip: the quoted patch and the 1G/10G correction, shown in full in the
> previous message]
>
> Okay, I don't think this is the common case. AFAIK, most ENOSPC is
> caused by our runtime worst-case metadata reservation problem.
>
> btrfs has been inclined to create a fairly large metadata chunk (1G) in
> its initial mkfs stage, and a 256M metadata chunk is also a very large
> one.
>
> As for your example below: yes, we don't have space for metadata
> allocation, but do we really need to allocate a new one? Or am I
> missing something?
>
> thanks,
> -liubo

Yes, that's true, this is not the common cause, but at least this patch may make the percentage reported by the 'df' command get as close to 100% as possible before hitting ENOSPC under normal operations (if not using balance).

And some cases like the one in the following mail may be improved by the patch:
https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36097.html

I understand that most cases with a lot of free data space and no metadata space are caused by creating and then deleting large files, but if the last gigabytes can be allocated more carefully, at least the available bytes reported by 'df' should shrink before we hit ENOSPC.

What do you think about it?

Thanks,
Qu

> [snip: the patch itself, quoted in full in the previous message]
[PATCH] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/extent-tree.c | 8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0d599ba..9141b2b 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -710,7 +710,7 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
 	rcu_read_unlock();
 }
 
-/* simple helper to search for an existing extent at a given offset */
+/* simple helper to search for an existing data extent at a given offset */
 int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
 {
 	int ret;
@@ -726,12 +726,6 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
 	key.type = BTRFS_EXTENT_ITEM_KEY;
 	ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
 				0, 0);
-	if (ret > 0) {
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid == start &&
-		    key.type == BTRFS_METADATA_ITEM_KEY)
-			ret = 0;
-	}
 	btrfs_free_path(path);
 	return ret;
 }
-- 
1.9.1
[PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns 0 (not found);

2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key;

4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such a delayed ref had just run and completed after we released our path in step 3), before doing the second search.

Fix this by removing completely the path release and re-search logic. This is safe, because if we search for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well.

This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c).

Signed-off-by: Filipe Manana <fdman...@suse.com>
---
 fs/btrfs/extent-tree.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9141b2b..2cedd06 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -780,7 +780,6 @@ search_again:
 	else
 		key.type = BTRFS_EXTENT_ITEM_KEY;
 
-again:
 	ret = btrfs_search_slot(trans, root->fs_info->extent_root,
 				&key, path, 0, 0);
 	if (ret < 0)
@@ -796,13 +795,6 @@ again:
 			    key.offset == root->nodesize)
 				ret = 0;
 		}
-		if (ret) {
-			key.objectid = bytenr;
-			key.type = BTRFS_EXTENT_ITEM_KEY;
-			key.offset = root->nodesize;
-			btrfs_release_path(path);
-			goto again;
-		}
 	}
 
 	if (ret == 0) {
-- 
1.9.1
BTRFS balance segfault, where to go from here
Hello Folks,

I used to have an array of 4x4TB drives with BTRFS in raid10.

The kernel version is: 3.13-0.bpo.1-amd64
The BTRFS version is: v3.14.1

When it was reaching 80% in space I added another 4TB drive to the array with:

 btrfs device add /dev/sdf /mnt/backup

And started the balancing to the new drive:

 btrfs filesystem balance /mnt/backup

This ran for 5-6 hours before it segfaulted with a "not enough free space" message. Now my configuration looks like this:

 btrfs fi show /mnt/backup
 Label: 'backup'  uuid: ...
 	Total devices 5 FS bytes used 5.93TiB
 	devid    1 size 3.64TiB used 2.82TiB path /dev/sdd
 	devid    2 size 3.64TiB used 2.82TiB path /dev/sdc
 	devid    3 size 3.64TiB used 2.81TiB path /dev/sdb
 	devid    4 size 3.64TiB used 2.82TiB path /dev/sde
 	devid    5 size 3.64TiB used 638.50GiB path /dev/sdf

After this crash happened during the balancing (logs are attached at the end), the system remounted my /mnt/backup share as RO. At this point I started to really worry. I umounted and remounted it manually. At the beginning it ran some self checks, which took like 5 mins; then, as iotop showed, it continued with the balancing, which failed again the same way. For next time, after mount I immediately put the balancing on pause (which helped).

My question is where to go from here? What I'm going to do right now is copy the most important data to another, separate XFS drive. What I'm planning to do is:

1. Upgrade the kernel
2. Upgrade BTRFS
3. Continue the balancing.

Could someone please also explain how exactly the raid10 setup works with an ODD number of drives with btrfs? Raid10 should be a stripe of mirrors. So is this sdf drive mirrored or striped or what? Some btrfs gurus could tell me whether I should be worried about data loss because of this or not. Would I need even more free space just to add a 5th drive? If so, how much more?
Kernel logs
---
Oct 24 17:25:44 backup kernel: [29396.873750] btrfs: relocating block group 5162588438528 flags 65
Oct 24 17:26:09 backup kernel: [29421.594524] btrfs: found 13126 extents
Oct 24 17:26:38 backup kernel: [29450.769228] btrfs: found 13126 extents
Oct 24 17:26:39 backup kernel: [29451.345198] btrfs: relocating block group 5161514696704 flags 68
Oct 24 17:31:33 backup kernel: [29745.776810] BTRFS debug (device sdb): run_one_delayed_ref returned -28
Oct 24 17:31:33 backup kernel: [29745.776818] ------------[ cut here ]------------
Oct 24 17:31:33 backup kernel: [29745.776847] WARNING: CPU: 1 PID: 1807 at /build/linux-t5aGFh/linux-3.13.10/fs/btrfs/super.c:254 __btrfs_abort_transaction+0x5a/0x140 [btrfs]()
Oct 24 17:31:33 backup kernel: [29745.776849] btrfs: Transaction aborted (error -28)
Oct 24 17:31:33 backup kernel: [29745.776851] Modules linked in: xen_gntdev xen_evtchn xenfs xen_privcmd nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache sunrpc 8021q garp mrp bridge stp llc loop iTCO_wdt iTCO_vendor_support lpc_ich radeon mfd_core processor evdev ttm drm_kms_helper drm i2c_algo_bit coretemp rng_core serio_raw pcspkr i2c_i801 i2c_core i3000_edac thermal_sys button shpchp edac_core ext4 crc16 mbcache jbd2 btrfs xor raid6_pq crc32c libcrc32c dm_mod xen_pciback sg sd_mod sr_mod crc_t10dif cdrom crct10dif_common ata_generic ahci ata_piix libahci 3w_9xxx libata scsi_mod ehci_pci uhci_hcd ehci_hcd e1000e ptp pps_core usbcore usb_common
Oct 24 17:31:33 backup kernel: [29745.776902] CPU: 1 PID: 1807 Comm: btrfs-transacti Not tainted 3.13-0.bpo.1-amd64 #1 Debian 3.13.10-1~bpo70+1
Oct 24 17:31:33 backup kernel: [29745.776905] Hardware name: Supermicro PDSM4+/PDSM4+, BIOS 6.00 02/05/2007
Oct 24 17:31:33 backup kernel: [29745.776907]  0000000000000000 ffffffffa0257130 ffffffff814d16c9 ffff88006a7f3cc8
Oct 24 17:31:33 backup kernel: [29745.776911]  ffffffff81060967 ffffffffffffffe4 ffff880004282800 ffff88003b813ec0
Oct 24 17:31:33 backup kernel: [29745.776914]  0000000000000aaa ffffffffa0253b60 ffffffff81060a55 ffffffffa0257260
Oct 24 17:31:33 backup kernel: [29745.776918] Call Trace:
Oct 24 17:31:33 backup kernel: [29745.776926]  [<ffffffff814d16c9>] ? dump_stack+0x41/0x51
Oct 24 17:31:33 backup kernel: [29745.776931]  [<ffffffff81060967>] ? warn_slowpath_common+0x87/0xc0
Oct 24 17:31:33 backup kernel: [29745.776935]  [<ffffffff81060a55>] ? warn_slowpath_fmt+0x45/0x50
Oct 24 17:31:33 backup kernel: [29745.776946]  [<ffffffffa01b73ca>] ? __btrfs_abort_transaction+0x5a/0x140 [btrfs]
Oct 24 17:31:33 backup kernel: [29745.776959]  [<ffffffffa01d2e72>] ? btrfs_run_delayed_refs+0x372/0x530 [btrfs]
Oct 24 17:31:33 backup kernel: [29745.776974]  [<ffffffffa01fa8c3>] ? btrfs_run_ordered_operations+0x213/0x2b0 [btrfs]
Oct 24 17:31:33 backup kernel: [29745.776988]  [<ffffffffa01e2fea>] ? btrfs_commit_transaction+0x5a/0x990 [btrfs]
Oct 24 17:31:33 backup kernel: [29745.777001]  [<ffffffffa01e1345>] ? transaction_kthread+0x1c5/0x240 [btrfs]
Oct 24 17:31:33 backup kernel:
Re: [PATCH] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
On Mon, 27 Oct 2014 09:16:55 +0000, Filipe Manana wrote:
> If we couldn't find our extent item, we accessed the current slot
> (path->slots[0]) to check if it corresponds to an equivalent skinny
> metadata item. However this slot could be beyond our last item in the
> leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case
> we shouldn't process it.
>
> Since btrfs_lookup_extent() is only used to find extent items for data
> extents, fix this by removing completely the logic that looks up for an
> equivalent skinny metadata item, since it can not exist.

I think we also need a better function name, such as btrfs_lookup_data_extent.

Thanks
Miao

> Signed-off-by: Filipe Manana <fdman...@suse.com>
> [snip]
[PATCH v2] Btrfs: fix invalid leaf slot access in btrfs_lookup_extent()
If we couldn't find our extent item, we accessed the current slot (path->slots[0]) to check if it corresponds to an equivalent skinny metadata item. However this slot could be beyond our last item in the leaf (i.e. path->slots[0] >= btrfs_header_nritems(leaf)), in which case we shouldn't process it.

Since btrfs_lookup_extent() is only used to find extent items for data extents, fix this by removing completely the logic that looks up for an equivalent skinny metadata item, since it can not exist.

Signed-off-by: Filipe Manana <fdman...@suse.com>
---

V2: Renamed btrfs_lookup_extent() to btrfs_lookup_data_extent().

 fs/btrfs/ctree.h       |  2 +-
 fs/btrfs/extent-tree.c | 10 ++--------
 fs/btrfs/tree-log.c    |  2 +-
 3 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd8b275..b72b358 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3276,7 +3276,7 @@ int btrfs_run_delayed_refs(struct btrfs_trans_handle *trans,
 			   struct btrfs_root *root, unsigned long count);
 int btrfs_async_run_delayed_refs(struct btrfs_root *root,
 				 unsigned long count, int wait);
-int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len);
+int btrfs_lookup_data_extent(struct btrfs_root *root, u64 start, u64 len);
 int btrfs_lookup_extent_info(struct btrfs_trans_handle *trans,
 			     struct btrfs_root *root, u64 bytenr,
 			     u64 offset, int metadata, u64 *refs, u64 *flags);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 0d599ba..87c0b46f 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -710,8 +710,8 @@ void btrfs_clear_space_info_full(struct btrfs_fs_info *info)
 	rcu_read_unlock();
 }
 
-/* simple helper to search for an existing extent at a given offset */
-int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
+/* simple helper to search for an existing data extent at a given offset */
+int btrfs_lookup_data_extent(struct btrfs_root *root, u64 start, u64 len)
 {
 	int ret;
 	struct btrfs_key key;
@@ -726,12 +726,6 @@ int btrfs_lookup_extent(struct btrfs_root *root, u64 start, u64 len)
 	key.type = BTRFS_EXTENT_ITEM_KEY;
 	ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
 				0, 0);
-	if (ret > 0) {
-		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
-		if (key.objectid == start &&
-		    key.type == BTRFS_METADATA_ITEM_KEY)
-			ret = 0;
-	}
 	btrfs_free_path(path);
 	return ret;
 }
diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c
index 2b26dad..6d58d72 100644
--- a/fs/btrfs/tree-log.c
+++ b/fs/btrfs/tree-log.c
@@ -672,7 +672,7 @@ static noinline int replay_one_extent(struct btrfs_trans_handle *trans,
 			 * is this extent already allocated in the extent
 			 * allocation tree?  If so, just add a reference
 			 */
-			ret = btrfs_lookup_extent(root, ins.objectid,
+			ret = btrfs_lookup_data_extent(root, ins.objectid,
 						ins.offset);
 			if (ret == 0) {
 				ret = btrfs_inc_extent_ref(trans, root,
-- 
1.9.1
Re: suspicious number of devices: 72057594037927936
On Mon, Oct 27, 2014 at 10:34 AM, Christian Kujau li...@nerdbynature.de wrote:
> (somehow this message did not make it to the list)
>
> Hi,
>
> after upgrading from linux 3.17.0 to 3.18.0-rc2, I cannot mount my
> btrfs partition any more. It's just one btrfs partition, no raid, no
> compression, no fancy mount options:
>
>  # mount -t btrfs -o ro /dev/sda6 /usr/local/
>  mount: wrong fs type, bad option, bad superblock on /dev/sda6, [...]
>
>  BTRFS: suspicious number of devices: 72057594037927936
>  BTRFS: super offset mismatch 1099511627776 != 65536
>  BTRFS: superblock contains fatal errors
>  BTRFS: open_ctree failed
>
> The only thing fancy may be the machine: PowerBook G4 (powerpc 32 bit),
> running Debian/Linux (stable). The message comes from the newly added
> check in fs/btrfs/disk-io.c:
>
>  if (sb->num_devices > (1UL << 31))
>  	printk(KERN_WARNING "BTRFS: suspicious number of devices: %llu\n",
>  	       sb->num_devices);
>
> And 72057594037927936 is 2^56, so maybe there's an endianness problem
> here?

Sounds like you need to revert this patch: https://patchwork.kernel.org/patch/5004701/ (which ignored endianness) or go back to an older kernel (don't use 3.17 or 3.17.1 however, due to other serious issues; the latest 3.16.x should be safe).

There's a v2 of that patch that fixes the endianness issue, but it didn't make it to 3.18-rc1/2 (https://patchwork.kernel.org/patch/5082701/).

regards

> Some details below, please let me know what other details may be
> needed. Going back to 3.17 now...
>
> Thanks,
> Christian.
>
>  # file -Ls /dev/sda6
>  /dev/sda6: sticky BTRFS Filesystem sectorsize 4096, nodesize 4096, leafsize 4096)
>
>  # btrfsck /dev/sda6
>  checking extents
>  checking fs roots
>  checking root refs
>  found 2035929088 bytes used err is 0
>  total csum bytes: 1886920
>  total tree bytes: 102936576
>  total fs tree bytes: 94441472
>  btree space waste bytes: 30875964
>  file data blocks allocated: 1932992512
>   referenced 1932849152
>  Btrfs Btrfs v0.19
>  # echo $?
>  0
>
> --
> BOFH excuse #427: network down, IP packets delivered via UPS

-- Filipe David Manana
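The 2^56 value itself is consistent with the endianness theory: the superblock stores num_devices as a little-endian u64, this filesystem has exactly one device, and a big-endian CPU that reads the field without converting it sees the byte-swapped value. A minimal standalone illustration, using GCC's __builtin_bswap64 in place of the kernel's le64_to_cpu helpers:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	/* on-disk (little-endian) value for a single-device filesystem */
	uint64_t le_num_devices = 1;

	/* what a big-endian CPU sees if it skips the le64_to_cpu() conversion */
	uint64_t misread = __builtin_bswap64(le_num_devices);

	/* prints 72057594037927936, i.e. 2^56 */
	printf("%llu\n", (unsigned long long)misread);
	return 0;
}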
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, 27 Oct 2014 09:19:52 +0000, Filipe Manana wrote:
> We have a race that can lead us to miss skinny extent items in the
> function btrfs_lookup_extent_info() when the skinny metadata feature is
> enabled. So basically the sequence of steps is:
>
> 1) We search in the extent tree for the skinny extent, which returns 0
>    (not found);
>
> 2) We check the previous item in the returned leaf for a non-skinny
>    extent, and we don't find it;
>
> 3) Because we didn't find the non-skinny extent in step 2), we release
>    our path to search the extent tree again, but this time for a
>    non-skinny extent key;
>
> 4) Right after we released our path in step 3), a skinny extent was
>    inserted in the extent tree (delayed refs were run) - our second
>    extent tree search will miss it, because it's not looking for a
>    skinny extent;
>
> 5) After the second search returned (with ret > 0), we look for any
>    delayed ref for our extent's bytenr (and we do it while holding a
>    read lock on the leaf), but we won't find any, as such a delayed ref
>    had just run and completed after we released our path in step 3),
>    before doing the second search.
>
> Fix this by removing completely the path release and re-search logic.
> This is safe, because if we search for a metadata item and we don't
> find it, we have the guarantee that the returned leaf is the one where
> the item would be inserted, and so path->slots[0] > 0 and
> path->slots[0] - 1 must be the slot where the non-skinny extent item is
> if it exists. The only case where path->slots[0] is

I think this analysis is wrong if there are some independent shared ref metadata items for a tree block, just like:

+------------------------+-------------+-------------+
| tree block extent item | shared ref1 | shared ref2 |
+------------------------+-------------+-------------+

Thanks
Miao

> zero is when there are no smaller keys in the tree (i.e. no left
> siblings for our leaf), in which case the re-search logic isn't needed
> as well.
>
> This race has been present since the introduction of skinny metadata
> (change 3173a18f70554fe7880bb2d85c7da566e364eb3c).
>
> [snip]
Re: Heavy nocow'd VM image fragmentation
On 2014-10-26 13:20, Larkin Lowrey wrote:
> On 10/24/2014 10:28 PM, Duncan wrote:
>> Robert White posted on Fri, 24 Oct 2014 19:41:32 -0700 as excerpted:
>>> On 10/24/2014 04:49 AM, Marc MERLIN wrote:
>>>> On Thu, Oct 23, 2014 at 06:04:43PM -0500, Larkin Lowrey wrote:
>>>>> I have a 240GB VirtualBox vdi image that is showing heavy
>>>>> fragmentation (filefrag). The file was created in a dir that was
>>>>> chattr +C'd, the file was created via fallocate, and the contents
>>>>> of the original image were copied into the file via dd. I verified
>>>>> that the image was +C.
>>>> To be honest, I have the same problem, and it's vexing:
>>> If I understand correctly, when you take a snapshot the file goes
>>> into what I call 1COW mode.
>> Yes, but the OP said he hadn't snapshotted since creating the file,
>> and MM's a regular that actually wrote much of the wiki documentation
>> on raid56 modes, so he'd better know about the snapshotting problem
>> too. So that can't be it. There's apparently a bug in some recent
>> code, and it's not honoring the NOCOW even in normal operation, when
>> it should be.
>>
>> (FWIW I'm not running any VMs or large DBs here, so I don't have nocow
>> set on anything and can and do use autodefrag on all my btrfs. So I
>> can't say one way or the other, personally.)
>
> Correct, there were no snapshots during VM usage when the fragmentation
> occurred.
>
> One unusual property of my setup is I have my fs on top of bcache. More
> specifically, the stack is md raid6 -> bcache -> lvm -> btrfs. When the
> fs mounts it has mount option 'ssd' due to the fact that bcache sets
> /sys/block/bcache0/queue/rotational to 0. Is there any reason why
> either the 'ssd' mount option or being backed by bcache could be
> responsible?

Two things:

First, regarding your question, the ssd mount option shouldn't be responsible for this, because it is supposed to spread out allocation only at the chunk level, not the block level, but some recent commit may have changed that. Are you using any kind of compression in btrfs? If so, then filefrag won't report the number of fragments correctly (it currently reports the number of compressed blocks in the file instead), and in fact, if you are using compression in btrfs, I would expect the number of compressed blocks to go up as you use more space in the VM image; long runs of zero bytes compress well, other stuff (especially on-disk structures from encapsulated filesystems) doesn't. You might consider putting the vm images directly on the LVM layer instead; that tends to get much better performance in my experience than storing them on a filesystem.

Secondly, I'd recommend switching from using bcache under LVM to using dm-cache on top of LVM, as it makes it much easier to recover from the various failure modes, and also to deal with a corrupted cache, due to the fact that dm-cache doesn't put any metadata on the backing device. It takes longer to shut down when in write-back mode, and isn't SSD-optimized, but it has also been much more reliable in my experience.
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
> On Mon, 27 Oct 2014 09:19:52 +0000, Filipe Manana wrote:
>> [snip: the race description and fix from the patch]
>
> I think this analysis is wrong if there are some independent shared ref
> metadata items for a tree block, just like:
>
> +------------------------+-------------+-------------+
> | tree block extent item | shared ref1 | shared ref2 |
> +------------------------+-------------+-------------+

Why does that matter? Can you elaborate on why it's not correct?

We're looking for the extent item only in btrfs_lookup_extent_info(), and running a delayed ref, independently of it being inlined/shared, implies inserting a new extent item or updating an existing extent item (updating the ref count).

thanks

-- Filipe David Manana
[PATCH] btrfs: get the accurate value of used_bytes in btrfs_get_block_group_info().
Reproducer:

 # mkfs.btrfs -f -b 20G /dev/sdb
 # mount /dev/sdb /mnt/test
 # fallocate -l 17G /mnt/test/largefile
 # btrfs fi df /mnt/test
 Data, single: total=17.49GiB, used=6.00GiB   <- only 6G, but actually it should be 17G.
 System, DUP: total=8.00MiB, used=16.00KiB
 System, single: total=4.00MiB, used=0.00B
 Metadata, DUP: total=1.00GiB, used=112.00KiB
 Metadata, single: total=8.00MiB, used=0.00B
 GlobalReserve, single: total=16.00MiB, used=0.00B
 # sync
 # btrfs fi df /mnt/test
 Data, single: total=17.49GiB, used=17.00GiB  <- After sync, it is as expected.
 System, DUP: total=8.00MiB, used=16.00KiB
 System, single: total=4.00MiB, used=0.00B
 Metadata, DUP: total=1.00GiB, used=112.00KiB
 Metadata, single: total=8.00MiB, used=0.00B
 GlobalReserve, single: total=16.00MiB, used=0.00B

The value of 6.00GiB is actually calculated in btrfs_get_block_group_info() by adding up @block_group->item.used for each block group. Calculated this way, it does not consider the bytes in cache.

This patch adds the values of @pinned, @reserved and @bytes_super in struct btrfs_block_group_cache to make sure we get an accurate @used_bytes.

Reported-by: Qu Wenruo <quwen...@cn.fujitsu.com>
Signed-off-by: Dongsheng Yang <yangds.f...@cn.fujitsu.com>
---
 fs/btrfs/ioctl.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 33c80f5..bc2aaeb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3892,6 +3892,10 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
 		space->total_bytes += block_group->key.offset;
 		space->used_bytes +=
 			btrfs_block_group_used(&block_group->item);
+		/* Add bytes-info in cache */
+		space->used_bytes += block_group->pinned;
+		space->used_bytes += block_group->reserved;
+		space->used_bytes += block_group->bytes_super;
 	}
 }
-- 
1.8.4.2
nocow and compression
Hi,

I created a filesystem and mounted it with compress-force=lzo. Then I did:

 # df -h .
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/loop0      100M  4.1M   96M   5% /mnt
 # yes Hello World | dd of=/mnt/test iflag=fullblock bs=1M count=20 status=none
 yes: standard output: Broken pipe
 yes: write error
 # sync; ls -l ; df -h .
 total 20480
 -rw-r--r-- 1 root root 20971520 Oct 27 13:48 test
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/loop0      100M  4.7M   96M   5% /mnt

So far so good ...

 # touch test2; chattr +C test2
 # dd if=test of=test2 conv=notrunc bs=1M iflag=fullblock oflag=append status=none
 # sync; ls -l ; df -h .
 total 40960
 -rw-r--r-- 1 root root 20971520 Oct 27 13:51 test
 -rw-r--r-- 1 root root 20971520 Oct 27 13:51 test2
 Filesystem      Size  Used Avail Use% Mounted on
 /dev/loop0      100M   25M   76M  25% /mnt

Oops, no compression. Is this intended?

Marc
Re: nocow and compression
On Monday, 27 October 2014, 13:59:24, Swâmi Petaramesh wrote:
> On Monday, 27 October 2014, 13:56:07, Marc Dietrich wrote:
>> oops, no compression. Is this intended?
>
> « Compression does not work for NOCOW files » is clearly stated in
> https://btrfs.wiki.kernel.org/index.php/Compression#How_does_compression_interact_with_direct_IO_or_COW.3F

Ah, sorry, I somehow overlooked this.

Thanks

Marc
Re: nocow and compression
As far as I understand it, NOCOW means that modified parts of files are rewritten in place, whereas compression causes compressed blocks of variable sizes to be created (depending on their compression ratio).

Changing a block in a file will most probably change its compressed size, and then you see why it cannot be rewritten in place...

Somebody correct me if I'm wrong ;-)

On Monday, 27 October 2014, 14:06:36, Marc Dietrich wrote:
>> « Compression does not work for NOCOW files » is clearly stated in
>> https://btrfs.wiki.kernel.org/index.php/Compression#How_does_compression_interact_with_direct_IO_or_COW.3F
>
> ah, sorry, I somehow overlooked this.
>
> Thanks
>
> Marc

-- Swâmi Petaramesh
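Since the NOCOW attribute is what rules out compression for a file, and chattr +C only takes effect reliably on empty files (hence the touch-then-chattr sequence in the transcript above), it can be worth checking that the flag actually stuck. A minimal sketch of such a check, using the same FS_IOC_GETFLAGS ioctl that lsattr is built on; this illustration is mine, not from the thread:

#include <fcntl.h>
#include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_NOCOW_FL */
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	int fd, attr = 0;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <file>\n", argv[0]);
		return 1;
	}
	fd = open(argv[1], O_RDONLY);
	if (fd < 0) {
		perror(argv[1]);
		return 1;
	}
	if (ioctl(fd, FS_IOC_GETFLAGS, &attr) < 0) {
		perror("FS_IOC_GETFLAGS");
		close(fd);
		return 1;
	}
	close(fd);
	/* a set NOCOW flag means btrfs will neither COW nor compress the file */
	printf("%s: NOCOW %s\n", argv[1],
	       (attr & FS_NOCOW_FL) ? "set" : "not set");
	return 0;
}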
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie mi...@cn.fujitsu.com wrote:
> I think this analysis is wrong if there are some independent shared ref
> metadata items for a tree block, just like:
>
> +------------------------+-------------+-------------+
> | tree block extent item | shared ref1 | shared ref2 |
> +------------------------+-------------+-------------+

Trying to guess what's in your mind. Is the concern that if after a non-skinny extent item we have non-inlined references, the assumption that path->slots[0] - 1 points to the extent item would be wrong when searching for a skinny extent?

That wouldn't be the case, because BTRFS_EXTENT_ITEM_KEY == 168 and BTRFS_METADATA_ITEM_KEY == 169, with BTRFS_SHARED_BLOCK_REF_KEY == 182. So in the presence of such non-inlined shared tree block reference items, searching for a skinny extent item leaves us at a slot that points to the first non-inlined ref (regardless of its type, since they're all > 169), and therefore path->slots[0] - 1 is the non-skinny extent item.

thanks.

-- Filipe David Manana
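A small standalone sketch of this ordering argument; the key-type constants are the on-disk values cited above, and the comparison mimics how btrfs orders keys lexicographically by (objectid, type, offset). The key_cmp name and main() framing are illustrative only:

#include <stdint.h>
#include <stdio.h>

/* item type values from the btrfs on-disk format */
#define BTRFS_EXTENT_ITEM_KEY      168
#define BTRFS_METADATA_ITEM_KEY    169
#define BTRFS_SHARED_BLOCK_REF_KEY 182

struct key { uint64_t objectid; uint8_t type; uint64_t offset; };

/* btrfs orders keys by (objectid, type, offset), in that priority */
static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

int main(void)
{
	uint64_t bytenr = 4096;
	struct key extent_item = { bytenr, BTRFS_EXTENT_ITEM_KEY, 16384 };
	struct key skinny      = { bytenr, BTRFS_METADATA_ITEM_KEY, 0 };
	struct key shared_ref  = { bytenr, BTRFS_SHARED_BLOCK_REF_KEY, 0 };

	/* the non-skinny extent item sorts before the skinny key, and shared
	 * refs sort after it, so a failed skinny search lands one slot past
	 * the extent item regardless of how many shared refs follow */
	printf("%d %d\n", key_cmp(&extent_item, &skinny),
	       key_cmp(&skinny, &shared_ref)); /* prints: -1 -1 */
	return 0;
}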
btrfs unmountable: read block failed check_tree_block; Couldn't read tree root
Hi!

My btrfs system partition went readonly. After a reboot it doesn't mount anymore. The system was openSUSE 13.1 Tumbleweed (kernel 3.17.??). Now I'm on the openSUSE 13.2-RC1 rescue system (kernel 3.16.3). I dumped (dd) the whole 250 GB SSD to some USB file and tried some btrfs tools on another copy per loopback device. But everything failed with:

 kernel: BTRFS: failed to read tree root on dm-2

See http://pastebin.com/raw.php?i=dPnU6nzg.

Any hints where to go from here?

Ciao
Ansgar

--
Ansgar Hockmann-Stolle, Universität Osnabrück, Rechenzentrum
Albrechtstraße 28, 49076 Osnabrück, Deutschland, Raum 31/E77B
+49 541 969-2749 (fax -2470), http://www.home.uos.de/anshockm
Re: Problem converting data raid0 to raid1: enospc errors during balance
On Oct 26, 2014, at 7:40 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
> BTW what's the output of the 'df' command?

Jasper,

What do you get for the conventional df command when this btrfs volume is mounted?

Thanks.

Chris Murphy
Re: suspicious number of devices: 72057594037927936
On Mon, Oct 27, 2014 at 10:57:59AM +0000, Filipe David Manana wrote:
>> The only thing fancy may be the machine: PowerBook G4 (powerpc 32
>> bit), running Debian/Linux (stable). The message comes from the newly
>> added check in fs/btrfs/disk-io.c:
>>
>>  if (sb->num_devices > (1UL << 31))
>>  	printk(KERN_WARNING "BTRFS: suspicious number of devices: %llu\n",
>>  	       sb->num_devices);
>>
>> And 72057594037927936 is 2^56, so maybe there's an endianness problem
>> here?
>
> Sounds like you need to revert this patch:
> https://patchwork.kernel.org/patch/5004701/ (which ignored endianness)
> or go back to an older kernel (don't use 3.17 or 3.17.1 however, due to
> other serious issues; the latest 3.16.x should be safe).
>
> There's a v2 of that patch that fixes the endianness issue, but it
> didn't make it to 3.18-rc1/2
> (https://patchwork.kernel.org/patch/5082701/)

Yeah sorry, I sent the v2 too late. Here's an incremental that applies on top of the current 3.18-rc:

https://patchwork.kernel.org/patch/5160651/
RE: Problem converting data raid0 to raid1: enospc errors during balance
Hej guys!

Thanks for your input on the issue this far.

To my knowledge, raid1 in btrfs means 2 copies of each piece of data, independent of the number of disks used. So 4 x 2.73TB would result in a total storage of roughly 5.5TB, right? Shouldn't this be more than enough?

Btw, here is the output for df: http://paste.debian.net/128932/

> Date: Mon, 27 Oct 2014 12:49:15 +0800
> From: quwen...@cn.fujitsu.com
> To: li...@colorremedies.com
> CC: jverb...@hotmail.com; linux-btrfs@vger.kernel.org
> Subject: Re: Problem converting data raid0 to raid1: enospc errors
> during balance
>
> -------- Original Message --------
> Subject: Re: Problem converting data raid0 to raid1: enospc errors
> during balance
> From: Chris Murphy li...@colorremedies.com
> To: Qu Wenruo quwen...@cn.fujitsu.com
> Date: October 27, 2014, 12:40
>
>> On Oct 26, 2014, at 7:40 PM, Qu Wenruo quwen...@cn.fujitsu.com wrote:
>>> Hi,
>>>
>>> Although I'm not completely sure, it seems that you really ran out of
>>> space. [1]
>>>
>>> Your array won't hold raid1 for 1.97T of data.
>>> Your array used up 1.97T of raid0 data; it takes 1.97T for raid0.
>>> But if converted to raid1, it will occupy 1.97T x 2 = 3.94T.
>>> Your array is only 2.73T, too small to contain the data.
>>
>> I'm not understanding. The btrfs fi show, shows 4x 2.73TiB devices, so
>> that seems like it's a 10+TiB array. There's 2.04TiB raid0 data
>> chunks, so roughly 500GiB per device, yet 1.94TiB is reported used per
>> device by fi show. Confusing.
>>
>> Also it's still very confusing: Data, RAID1: total=2.85TiB,
>> used=790.46GiB. Does this mean 2.85TiB out of 10TiB is allocated, or
>> is it twice that due to raid1? I can't ever remember this presentation
>> detail, so again the secret decoder ring where the UI doesn't
>> expressly tell us what's going on is going to continue to be a source
>> of confusion for users.
>>
>> Chris Murphy
>
> Oh, I misread the output.
> That turns strange now...
>
> BTW what's the output of the 'df' command?
>
> Thanks,
> Qu
Re: Problem converting data raid0 to raid1: enospc errors during balance
On Oct 27, 2014, at 9:56 AM, Jasper Verberk jverb...@hotmail.com wrote: These are the results of a normal df: http://paste.debian.net/128932/ The mountpoint is /data. OK so this is with the new computation in kernel 3.17 (which I think contains a bug by counting free space twice); so now it shows available blocks based on the loss due to mirroring or parity. So 1K-blocks 5860533168 = 5.45TiB. If you boot an older kernel my expectation is this shows up as 10.91TiB. In any case, df says there's 1.77TiB worth of data, so there should be plenty of space. Somewhere there's a bug. Either 'btrfs fi df' is insufficiently communicating whether the desired operation can be done, or there's actual kernel confusion about how much space is available to do the conversion. I wonder what happens if you go back to kernel 3.16 and try to do the conversion? Chris Murphy
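(A hedged aside if the conversion gets retried: balance accepts the 'soft' filter modifier, which skips chunks that already have the target profile, so a partially completed conversion can be resumed without redoing already-converted chunks. Using the /data mountpoint from the df output above, that would look something like:

    $ btrfs balance start -dconvert=raid1,soft /data

)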
Re: BTRFS balance segfault, where to go from here
On Oct 27, 2014, at 3:26 AM, Stephan Alz stephan...@gmx.com wrote: My question is where to go from here? What I'm going to do right now is copy the most important data to another, separate XFS drive. What I'm planning to do after that is: 1) upgrade the kernel, 2) upgrade BTRFS, 3) continue the balancing.

Definitely upgrade the kernel and see how that goes; there have been many, many changes since 3.13. I would upgrade the user space tools also, but that's not as important. FYI you can mount with the skip_balance mount option to inhibit resuming the balance; sometimes pausing the balance isn't fast enough when there are balance problems.

Could someone please also explain how exactly the raid10 setup works with an ODD number of drives in btrfs? Raid10 should be a stripe of mirrors. Is this sdf drive now mirrored, or striped, or what?

I have no idea honestly. Btrfs is very tolerant of adding odd numbers and sizes of devices, but things get a bit nutty in actual operation sometimes. This might be one of them, because traditionally raid10 is always an even number of drives; odd numbers just don't make sense. But Btrfs allows the addition; I think the expectation is you'd have added two before doing the balance, though.

Could some btrfs guru tell me whether I should be worried about data loss because of this or not?

Anything is possible, so hopefully you have backups. My expectation is that in the worst-case scenario the fs gets confused and you can't mount rw anymore, in which case you won't be able to make it an even-drive raid10. But even in that case, mounted ro, you can update your backups, blow away the Btrfs volume and start from scratch with an even number of drives, right?

Would I need even more free space just to add a 5th drive? If so how much more?

Gonna guess you'd need to add a drive that's at least 2.83TiB in size if you want to keep it raid10. Chris Murphy
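(A sketch of the skip_balance workflow just mentioned, with placeholder device and mountpoint names:

    $ mount -o skip_balance /dev/sdX /mnt
    $ btrfs balance cancel /mnt        # or: btrfs balance resume /mnt

i.e. mount without letting the interrupted balance restart, then either cancel it, or resume it once the kernel has been upgraded.)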
Re: suspicious number of devices: 72057594037927936
On Mon, 27 Oct 2014 at 16:35, David Sterba wrote: Yeah sorry, I sent the v2 too late, here's an incremental that applies on top of current 3.18-rc https://patchwork.kernel.org/patch/5160651/ Yup, that fixes it. Thank you! If it's needed: Tested-by: Christian Kujau li...@nerdbynature.de @Filipe: and thanks for warning me about 3.17 - I used 3.17.0 since it came out and compiled kernels on the btrfs partition and haven't had any issues. But it wasn't used very often, so whatever the serious issues were, I haven't experienced any. Christian. -- BOFH excuse #98: The vendor put the bug there.
Re: suspicious number of devices: 72057594037927936
On Mon, Oct 27, 2014 at 11:21:13AM -0700, Christian Kujau wrote: On Mon, 27 Oct 2014 at 16:35, David Sterba wrote: Yeah sorry, I sent the v2 too late, here's an incremental that applies on top of current 3.18-rc https://patchwork.kernel.org/patch/5160651/ Yup, that fixes it. Thank you! If it's needed: Tested-by: Christian Kujau li...@nerdbynature.de @Filipe: and thanks for warning me about 3.17 - I used 3.17.0 since it came out and compiled kernels on the btrfs partition and haven't had any issues. But it wasn't used very often, so whatever the serious issues were, I haven't experienced any. If you make read-only snapshots, there's a good chance of metadata corruption. It's fixed in 3.17.2. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Our so-called leaders speak/with words they try to jail ya/ --- They subjugate the meek/but it's the rhetoric of failure.
RAID1 fails to recover chunk tree
Revisit of a previous issue. Setup: a single 640GB drive with BTRFS and compression. This was not a system drive, just a place to put random junk. Made a RAID1 with another drive of just the metadata. Was in that state for less than 12 hours-ish, removed the second drive and now cannot get to any data on the original drive. Data remained single while only metadata was RAID1. The single-drive btrfs was made on Ubuntu with kernel 3.13.0 and tools 3.12.

$ sudo mount -o degraded /dev/sdc1 /media/Data/
mount: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so

$ dmesg | tail
[45353.869448] KBD BUG in ../../../../../../../../drivers/2d/lnx/fgl/drm/kernel/gal.c at line: 304!
[45353.901511] KBD BUG in ../../../../../../../../drivers/2d/lnx/fgl/drm/kernel/gal.c at line: 304!
[45353.901666] KBD BUG in ../../../../../../../../drivers/2d/lnx/fgl/drm/kernel/gal.c at line: 304!
[45354.148488] KBD BUG in ../../../../../../../../drivers/2d/lnx/fgl/drm/kernel/gal.c at line: 304!
[45354.148573] KBD BUG in ../../../../../../../../drivers/2d/lnx/fgl/drm/kernel/gal.c at line: 304!
[46241.155350] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[46241.155923] btrfs: allowing degraded mounts
[46241.155927] btrfs: disk space caching is enabled
[46241.159436] btrfs: failed to read chunk root on sdc1
[46241.177815] btrfs: open_ctree failed

$ btrfs-show-super /dev/sdc1
superblock: bytenr=65536, device=/dev/sdc1
---------------------------------------------------------
csum 0x93bcb1b5 [match]
bytenr 65536
flags 0x1
magic _BHRfS_M [match]
fsid bd78815a-802b-43e2-8387-fc6ab4237d67
label
generation 60944
root 909586694144
sys_array_size 97
chunk_root_generation 60938
root_level 1
chunk_root 911673917440
chunk_root_level 1
log_root 0
log_root_transid 0
log_root_level 0
total_bytes 1115871535104
bytes_used 321833435136
sectorsize 4096
nodesize 4096
leafsize 4096
stripesize 4096
root_dir 6
num_devices 2
compat_flags 0x0
compat_ro_flags 0x0
incompat_flags 0x9
csum_type 0
csum_size 4
cache_generation 60944
uuid_tree_generation 60944
dev_item.uuid d82b2027-17b6-4513-a86d-9227a42d7ed1
dev_item.fsid bd78815a-802b-43e2-8387-fc6ab4237d67 [match]
dev_item.type 0
dev_item.total_bytes 615763673088
dev_item.bytes_used 324270030848
dev_item.io_align 4096
dev_item.io_width 4096
dev_item.sector_size 4096
dev_item.devid 1
dev_item.dev_group 0
dev_item.seek_speed 0
dev_item.bandwidth 0
dev_item.generation 0

$ sudo btrfs device add -f /dev/sdh1 /dev/sdc1
ERROR: error adding the device '/dev/sdh1' - Inappropriate ioctl for device

$ sudo btrfs device delete missing /dev/sdc1
ERROR: error removing the device 'missing' - Inappropriate ioctl for device

$ sudo mount -o degraded,defaults,compress=lzo /dev/sdc1 /media/Data/
mount: wrong fs type, bad option, bad superblock on /dev/sdc1, missing codepage or helper program, or other error
In some cases useful info is found in syslog - try dmesg | tail or so

$ dmesg | tail
[106991.655384] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[106991.665066] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[107019.954397] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[107019.962009] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[107070.124927] btrfs: device fsid bd78815a-802b-43e2-8387-fc6ab4237d67 devid 1 transid 60944 /dev/sdc1
[107070.126475] btrfs: allowing degraded mounts
[107070.126479] btrfs: use lzo compression
[107070.126480] btrfs: disk space caching is enabled
[107070.127254] btrfs: failed to read chunk root on sdc1
[107070.142983] btrfs: open_ctree failed

$ sudo btrfs rescue super-recover -v /dev/sdc1
All Devices:
Device: id = 1, name = /dev/sdc1
Before Recovering:
[All good supers]:
device name = /dev/sdc1 superblock bytenr = 65536
device name = /dev/sdc1 superblock bytenr = 67108864
device name = /dev/sdc1 superblock bytenr = 274877906944
[All bad supers]:
All supers are valid, no need to recover

$ btrfs rescue chunk-recover -v /dev/sdc1
[snipped]
Chunk: start = 860100755456, len = 1073741824, type = 1, num_stripes = 1
Stripes list:
[ 0] Stripe: devid = 1,
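(One hedged thing not yet tried in that transcript: the kernel's 'recovery' mount option, IIRC available since 3.2, asks btrfs to fall back to an older usable tree root. With the chunk root itself unreadable it may well still fail, but it's cheap to test against the dd copy:

    $ sudo mount -o degraded,recovery /dev/sdc1 /media/Data/

)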
Re: [PATCH] Btrfs: fix snapshot inconsistency after a file write followed by truncate
On Tue, Oct 21, 2014 at 6:12 AM, Filipe Manana fdman...@suse.com wrote: If right after starting the snapshot creation ioctl we perform a write against a file followed by a truncate, with both operations increasing the file's size, we can get a snapshot tree that reflects a state of the source subvolume's tree where the file truncation happened but the write operation didn't. This leaves a gap between 2 file extent items of the inode, which makes btrfs' fsck complain about it. For example, if we perform the following file operations:

$ mkfs.btrfs -f /dev/vdd
$ mount /dev/vdd /mnt
$ xfs_io -f \
      -c "pwrite -S 0xaa -b 32K 0 32K" \
      -c "fsync" \
      -c "pwrite -S 0xbb -b 32770 16K 32770" \
      -c "truncate 90123" \
      /mnt/foobar

and the snapshot creation ioctl was just called before the second write, we often can get the following inode items in the snapshot's btree:

item 120 key (257 INODE_ITEM 0) itemoff 7987 itemsize 160
    inode generation 146 transid 7 size 90123 block group 0 mode 100600 links 1 uid 0 gid 0 rdev 0 flags 0x0
item 121 key (257 INODE_REF 256) itemoff 7967 itemsize 20
    inode ref index 282 namelen 10 name: foobar
item 122 key (257 EXTENT_DATA 0) itemoff 7914 itemsize 53
    extent data disk byte 1104855040 nr 32768
    extent data offset 0 nr 32768 ram 32768
    extent compression 0
item 123 key (257 EXTENT_DATA 53248) itemoff 7861 itemsize 53
    extent data disk byte 0 nr 0
    extent data offset 0 nr 40960 ram 40960
    extent compression 0

There's a file range, corresponding to the interval [32K; ALIGN(16K + 32770, 4096)[, for which there's no file extent item covering it. This is because the file write and file truncate operations both happened right after the snapshot creation ioctl called btrfs_start_delalloc_inodes(), which means we didn't start and wait for the ordered extent that matches the write and, in btrfs_setsize(), we were able to call btrfs_cont_expand() before being able to commit the current transaction in the snapshot creation ioctl. So this made it possible to insert the hole file extent item in the source subvolume (which represents the region added by the truncate) right before the transaction commit from the snapshot creation ioctl. Btrfs' fsck tool complains about such cases with a message like the following:

root 331 inode 257 errors 100, file extent discount

From a user perspective, the expectation when a snapshot is created while those file operations are being performed is that the snapshot will have a file where either: 1) it is empty; 2) only the first write was captured; 3) only the 2 writes were captured; or 4) both writes and the truncation were captured. But it should never capture a state where only the first write and the truncation were captured (since the second write was performed before the truncation). A test case for xfstests follows.

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/inode.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 0d41741..c28b78f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -4622,6 +4622,9 @@ static int btrfs_setsize(struct inode *inode, struct iattr *attr)
 	}
 
 	if (newsize > oldsize) {
+		ret = btrfs_wait_ordered_range(inode, 0, (u64)-1);
+		if (ret)
+			return ret;

Expanding truncates aren't my favorite operation, but we don't want them to imply fsync.
I'm holding off on this one while I work out the rest of the vacation backlog ;) -chris
Re: btrfs unmountable: read block failed check_tree_block; Couldn't read tree root
Am 27.10.14 um 14:23 schrieb Ansgar Hockmann-Stolle: Hi! My btrfs system partition went readonly. After reboot it doesn't mount anymore. System was openSUSE 13.1 Tumbleweed (kernel 3.17.??). Now I'm on openSUSE 13.2-RC1 rescue (kernel 3.16.3). I dumped (dd) the whole 250 GB SSD to some USB file and tried some btrfs tools on another copy per loopback device. But everything failed with:

kernel: BTRFS: failed to read tree root on dm-2

See http://pastebin.com/raw.php?i=dPnU6nzg. Any hints where to go from here?

After an offlist hint (thanks Tom!) I compiled the latest btrfs-progs 3.17 and tried some more ...

linux:~/bin # ./btrfs --version
Btrfs v3.17
linux:~/bin # ./btrfs-find-root /dev/sda3
Super think's the tree root is at 1015238656, chunk root 20971520
Well block 239718400 seems great, but generation doesn't match, have=661931, want=663595 level 0
Well block 239722496 seems great, but generation doesn't match, have=661931, want=663595 level 0
Well block 320098304 seems great, but generation doesn't match, have=662233, want=663595 level 0
Well block 879341568 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879345664 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879382528 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879398912 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879403008 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879423488 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879435776 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 880095232 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 881504256 seems great, but generation doesn't match, have=663228, want=663595 level 0
Well block 881512448 seems great, but generation doesn't match, have=663228, want=663595 level 0
Well block 936271872 seems great, but generation doesn't match, have=663397, want=663595 level 0
Well block 1004490752 seems great, but generation doesn't match, have=663571, want=663595 level 0
Well block 1007804416 seems great, but generation doesn't match, have=663572, want=663595 level 0
Well block 1012031488 seems great, but generation doesn't match, have=663575, want=663595 level 0
Well block 1012396032 seems great, but generation doesn't match, have=663575, want=663595 level 0
Well block 1012633600 seems great, but generation doesn't match, have=663586, want=663595 level 0
Well block 1012871168 seems great, but generation doesn't match, have=663585, want=663595 level 0
Well block 1015201792 seems great, but generation doesn't match, have=663588, want=663595 level 0
Well block 1015836672 seems great, but generation doesn't match, have=663596, want=663595 level 1
Well block 44132536320 seems great, but generation doesn't match, have=658774, want=663595 level 0
Well block 44178280448 seems great, but generation doesn't match, have=658774, want=663595 level 0
Well block 87443644416 seems great, but generation doesn't match, have=661349, want=663595 level 0
Well block 87514079232 seems great, but generation doesn't match, have=651051, want=663595 level 0
Well block 87517679616 seems great, but generation doesn't match, have=661349, want=663595 level 0
Well block 98697822208 seems great, but generation doesn't match, have=643548, want=663595 level 0
Well block 103285026816 seems great, but generation doesn't match, have=661672, want=663595 level 0
Well block 103309553664 seems great, but generation doesn't match, have=661674, want=663595 level 0
Well block 103523430400 seems great, but generation doesn't match, have=661767, want=663595 level 0
No more metdata to scan, exiting

This line I found interesting because have is want + 1:

Well block 1015836672 seems great, but generation doesn't match, have=663596, want=663595 level 1

And here the tail of btrfs rescue chunk-recover (full output at http://pastebin.com/raw.php?i=1D5VgDxv):

[..]
Total Chunks: 234
Heathy: 231
Bad: 3
Orphan Block Groups:
Orphan Device Extents:
Couldn't map the block 1015238656
btrfs: volumes.c:1140: btrfs_num_copies: Assertion `!(ce->start > logical || ce->start + ce->size < logical)' failed.
Aborted

Sadly btrfs check --repair keeps refusing to do its job:

linux:~ # btrfs check --repair /dev/sda3
enabling repair mode
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
read block failed check_tree_block
Couldn't read tree root
Checking filesystem on /dev/sda3
UUID: 1af256b5-b1ad-443b-aeee-a6853e70b7e2
Critical roots corrupted, unable to fsck the FS
Segmentation fault

Any more hints? Ciao
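(A hedged next step from that find-root output: block 1015836672 is a level-1 candidate only one generation past what the super wants, so it may be worth pointing btrfs restore at it with -t before attempting any repair that writes. Roughly, with /mnt/recovery as a placeholder target on a different filesystem:

    $ btrfs restore -v -t 1015836672 /dev/sda3 /mnt/recovery

restore only reads the damaged device, so it can't make things worse; see the restore discussion in the next message.)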
Re: btrfs unmountable: read block failed check_tree_block; Couldn't read tree root
Ansgar Hockmann-Stolle posted on Mon, 27 Oct 2014 14:23:19 +0100 as excerpted: Hi! My btrfs system partition went readonly. After reboot it doesn't mount anymore. System was openSUSE 13.1 Tumbleweed (kernel 3.17.??). Now I'm on openSUSE 13.2-RC1 rescue (kernel 3.16.3). I dumped (dd) the whole 250 GB SSD to some USB file and tried some btrfs tools on another copy per loopback device. But everything failed with: kernel: BTRFS: failed to read tree root on dm-2 See http://pastebin.com/raw.php?i=dPnU6nzg. Any hints where to go from here?

Good job posting initial problem information. =:^) A lot of folks take 2-3 rounds of request and reply before that much info is available on the problem. While others may be able to assist you in restoring that filesystem to working condition, my focus is more on recovering what can be recovered from it and doing a fresh mkfs. System partition, 250 GB, looks to be just under 231 GiB based on the total bytes from btrfs-show-super. How recent is your backup, and/or, being a system partition, is it simply the distro installation, possibly without too much customization, thus easily reinstalled? IOW, if you were to call that partition a total loss and simply mkfs it, would you lose anything real valuable that's not backed up? (Of course, the standard lecture at this point is that if it's not backed up, by definition you didn't consider it valuable enough to be worth the hassle, so by definition it's not valuable and you can simply blow it away, but...) If you're in good shape in that regard, that's what I'd probably do at this point, keeping the dd image you made in case someone's interested in tracking the problem down and making btrfs handle that case. If there are important files on there that you don't have backed up, or if you have a backup but it's older than you'd like and you want to try to recover current versions of what you can (the situation I was in a few months ago), then btrfs restore is what you're interested in. Restore works on an /unmounted/ (and potentially unmountable, as here) filesystem, letting you retrieve files from it and copy them to other filesystems. It does NOT write anything to the damaged filesystem itself, so no worries about making the problem worse. There's a page on the wiki describing how to use btrfs restore along with btrfs-find-root in some detail, definitely more than is in the manpages or than I want to go into here: https://btrfs.wiki.kernel.org/index.php/Restore

Some useful hints that weren't originally clear to me as I used that page here:

* Generation and transid are the same thing, a sequentially increasing number that updates every time the root tree is written. The generation recorded in your superblocks (from btrfs-show-super) is 663595, so the idea would be to try that generation/transid first, falling back one to 663594 if 95 isn't usable, then 93, then... etc. The lower the number, the further back in history you're going, so obviously you want the closest to 663595 that you can get that still gives you access to a (nearly) whole filesystem, or at least the parts of it you are interested in.

* That page was written before restore's -D/--dry-run option was available. This option can be quite helpful, and I recommend using it to see what would actually be restored at each generation and associated tree root (bytenr/byte-number). Tho (with -v/verbose) the list of files restored will normally be too long to go thru in detail, you can either scan it or pipe the output to wc -l to get a general idea of how many files would be restored.

* Restore's -l/list-tree-roots option isn't listed on the page either. btrfs restore -l -t bytenr can be quite useful, giving you a nice list of trees available for the generation corresponding to that bytenr (as found using btrfs-find-root). This is where the page's advice to pick the latest tree root with all, or as many as possible, of the filesystem trees in it comes in, since this lets you easily see which trees each root has available.

* I don't use snapshots or subvolumes here, while I understand OpenSuSE uses them rather heavily (via snapper). Thus I have no direct experience with restore's snapshot-related options. Presumably you can either ignore the snapshots (the apparent default) or restore them either in general (using -s) or selectively (using -r, with the appropriate snapshot rootid).

* It's worth noting that restore simply lets you retrieve files. It does *NOT* retrieve file ownership or permissions, with the restored files all being owned by the user you ran btrfs restore under (presumably root), with $UMASK permissions. You'll have to restore ownership and permissions manually. When I used restore here I had a backup, but the backup was old. So I hacked up a bash scriptlet with a for loop that went thru all the restored files recursively, comparing them against the old backup. If the file existed in
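(Condensing those hints into one hedged sequence -- the bytenr and paths here are placeholders, not values from Ansgar's system:

    # btrfs-find-root /dev/sdX                           # list candidate tree roots with generations
    # btrfs restore -l -t <bytenr> /dev/sdX              # list the trees under one candidate root
    # btrfs restore -v -D -t <bytenr> /dev/sdX /mnt/tgt  # dry run: what would be restored?
    # btrfs restore -v -t <bytenr> /dev/sdX /mnt/tgt     # real run, copying to another filesystem

then fix up ownership and permissions by hand afterward, as noted above.)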
Re: Does btrfs-restore report missing/corrupt files?
On 10/26/2014 12:59 AM, Christian Tschabuschnig wrote: Hello, currently I am trying to recover a btrfs filesystem which had a few subvolumes. When running # btrfs restore -sx /dev/xxx . one subvolume gets restored. Important aside: The one time I had to resort to btrfs restore I didn't get the contents of _many_ of the really small files. My _guess_ is that those were the files small enough to reside entirely within the original filesystem's metadata. You should mount the filesystem read-only and recursively copy the hierarchy to another file system as well as doing a restore. The two results can then be folded together, or at least the former might help you find some of what the latter might miss. I could be totally wrong, or restore could have been improved since then, but it was what seemed to be happening.
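(Background on that guess, hedged: btrfs can store sufficiently small files as inline extents (BTRFS_FILE_EXTENT_INLINE) directly in the metadata b-tree instead of in separate data extents, and the max_inline mount option caps how large such inlined files may be. If a restore pass mishandled inline extents, it would be exactly the smallest files that come back empty, which matches the symptom. On a filesystem one controls, inlining of new writes can be turned off with something like:

    $ mount -o max_inline=0 /dev/sdX /mnt

though that affects only files written afterward, not existing ones.)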
Re: btrfs unmountable: read block failed check_tree_block; Couldn't read tree root
Original Message Subject: Re: btrfs unmountable: read block failed check_tree_block; Couldn't read tree root From: Qu Wenruo quwen...@cn.fujitsu.com To: Ansgar Hockmann-Stolle ansgar.hockmann-sto...@uni-osnabrueck.de, linux-btrfs@vger.kernel.org Date: Oct 28, 2014 09:05

Original Message Subject: Re: btrfs unmountable: read block failed check_tree_block; Couldn't read tree root From: Ansgar Hockmann-Stolle ansgar.hockmann-sto...@uni-osnabrueck.de To: linux-btrfs@vger.kernel.org Date: Oct 28, 2014 07:03

[Ansgar's btrfs-find-root and chunk-recover output quoted in full; identical to his message above, trimmed here.]

After looking into the 3 bad chunks, it turns
Re: [PATCH] Btrfs: fix race that makes btrfs_lookup_extent_info miss skinny extent items
On Mon, 27 Oct 2014 13:44:22 +0000, Filipe David Manana wrote: On Mon, Oct 27, 2014 at 12:11 PM, Filipe David Manana fdman...@gmail.com wrote: On Mon, Oct 27, 2014 at 11:08 AM, Miao Xie mi...@cn.fujitsu.com wrote: On Mon, 27 Oct 2014 09:19:52 +0000, Filipe Manana wrote:

[Filipe's patch description:] We have a race that can lead us to miss skinny extent items in the function btrfs_lookup_extent_info() when the skinny metadata feature is enabled. So basically the sequence of steps is:

1) We search in the extent tree for the skinny extent, which returns 0 (not found);

2) We check the previous item in the returned leaf for a non-skinny extent, and we don't find it;

3) Because we didn't find the non-skinny extent in step 2), we release our path to search the extent tree again, but this time for a non-skinny extent key;

4) Right after we released our path in step 3), a skinny extent was inserted in the extent tree (delayed refs were run) - our second extent tree search will miss it, because it's not looking for a skinny extent;

5) After the second search returned (with ret > 0), we look for any delayed ref for our extent's bytenr (and we do it while holding a read lock on the leaf), but we won't find any, as such delayed ref had just run and completed after we released our path in step 3), before doing the second search.

Fix this by removing completely the path release and re-search logic. This is safe, because if we search for a metadata item and we don't find it, we have the guarantee that the returned leaf is the one where the item would be inserted, and so path->slots[0] > 0 and path->slots[0] - 1 must be the slot where the non-skinny extent item is if it exists. The only case where path->slots[0] is

[Miao Xie:] I think this analysis is wrong if there are some independent shared ref metadata for a tree block, just like:

+------------------------+-------------+-------------+
| tree block extent item | shared ref1 | shared ref2 |
+------------------------+-------------+-------------+

[Filipe:] Trying to guess what's in your mind. Is the concern that, if after a non-skinny extent item we have non-inlined references, the assumption that path->slots[0] - 1 points to the extent item would be wrong when searching for a skinny extent? That wouldn't be the case because BTRFS_EXTENT_ITEM_KEY == 168 and BTRFS_METADATA_ITEM_KEY == 169, with BTRFS_SHARED_BLOCK_REF_KEY == 182. So in the presence of such non-inlined shared tree block reference items, searching for a skinny extent item leaves us at a slot that points to the first non-inlined ref (regardless of its type, since they're all > 169), and therefore path->slots[0] - 1 is the non-skinny extent item.

[Miao:] You are right. I forgot to check the value of the key type. Sorry. This patch seems good to me. Reviewed-by: Miao Xie mi...@cn.fujitsu.com

[Filipe:] thanks.

[Filipe, from the earlier reply:] Why does that matter? Can you elaborate why it's not correct? We're looking for the extent item only in btrfs_lookup_extent_info(), and running a delayed ref, independently of it being inlined/shared, implies inserting a new extent item or updating an existing extent item (updating the ref count). thanks

[Patch description, continued:] zero is when there are no smaller keys in the tree (i.e. no left siblings for our leaf), in which case the re-search logic isn't needed as well. This race has been present since the introduction of skinny metadata (change 3173a18f70554fe7880bb2d85c7da566e364eb3c).

Signed-off-by: Filipe Manana fdman...@suse.com
---
 fs/btrfs/extent-tree.c | 8 --------
 1 file changed, 8 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 9141b2b..2cedd06 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -780,7 +780,6 @@ search_again:
 	else
 		key.type = BTRFS_EXTENT_ITEM_KEY;
 
-again:
 	ret = btrfs_search_slot(trans, root->fs_info->extent_root, key, path, 0, 0);
 	if (ret < 0)
@@ -796,13 +795,6 @@ again:
 		    key.offset == root->nodesize)
 			ret = 0;
 	}
-	if (ret) {
-		key.objectid = bytenr;
-		key.type = BTRFS_EXTENT_ITEM_KEY;
-		key.offset = root->nodesize;
-		btrfs_release_path(path);
-		goto again;
-	}
 }
 if (ret == 0) {

-- 
Filipe David Manana, Reasonable men adapt themselves to the world. Unreasonable men adapt the world to themselves. That's why all progress depends on unreasonable men.
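(To make the key-ordering argument above concrete, here's a minimal kernel-style sketch of the search-then-check-previous-slot pattern under discussion -- a simplified illustration, not the literal btrfs_lookup_extent_info() code; bytenr, level, path and extent_root are assumed to come from the caller:

	struct btrfs_key key;
	int ret;

	key.objectid = bytenr;
	key.type = BTRFS_METADATA_ITEM_KEY;	/* 169: skinny extent item */
	key.offset = level;

	ret = btrfs_search_slot(NULL, extent_root, &key, path, 0, 0);
	if (ret > 0 && path->slots[0] > 0) {
		/*
		 * Not found: btrfs_search_slot() leaves the path at the slot
		 * where the key would be inserted, so slot - 1 holds the
		 * largest key smaller than ours.  A non-skinny extent item
		 * (type 168) for the same bytenr sorts immediately before
		 * the skinny key, while shared ref items (182) sort after,
		 * so slot - 1 is the only place the old-style item can be.
		 */
		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1);
		if (key.objectid == bytenr &&
		    key.type == BTRFS_EXTENT_ITEM_KEY)
			ret = 0;	/* found the non-skinny form */
	}

)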
Re: Btrfs-progs release 3.17
On Thu, 2014-10-23 at 15:23 +0200, Petr Janecek wrote: Hello Gui, Oh, it seems that there are btrfs with missing devs that are bringing troubles to the @open_ctree_... function. what do you mean by missing devs? I have no degraded fs. Ah, sorry, I'm too focused on the problem that Anand's script pointed out. Ignore this missing-devs remark. The time btrfs fi sh spends scanning disks of a filesystem seems to be proportional to the amount of data stored on them: on a completely idle system, of ~20s total time it spends 10s scanning each of /mnt/b and /mnt/b0, and almost no time on /mnt/b3 (which is the biggest):

Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sdm    5.5T  2.4T  2.1T   54%   /mnt/b
/dev/sda    5.5T  2.5T  3.1T   45%   /mnt/b0
/dev/sde    7.3T  90G   5.4T    2%   /mnt/b3

For your original problems:

o error messages: The concurrency problem exists, as Anand said. As you said, running balance + cp led to such messages, so I think there is some unintentional redundant work over the mounted devices when dealing with unmounted ones. I'll try to [...]

o stalling: This may be due to the concurrency problem as well. After the first problem is handled, let's see what happens.

Thanks, Gui

Thanks, Petr
Re: Btrfs-progs release 3.17
On Thu, 2014-10-23 at 21:36 +0800, Anand Jain wrote: There is no point in re-creating so much of the btrfs kernel's logic in user space. It's just unnecessary when the kernel is already doing it. Use some interface to get info from the kernel after a device is registered (not necessarily mounted), so progs can be as sleek as possible. To me it started as just one more bug; now we have fixed so many. It all needs one good interface from the kernel which provides everything needed from the kernel.

Oh, the interface from the kernel you described is really interesting. But how do we store the seed/sprout relationships so that we can fetch them correctly for an unmounted btrfs? -Gui

On 10/23/14 16:52, Gui Hecheng wrote: On Thu, 2014-10-23 at 16:13 +0800, Anand Jain wrote: Some of the disks on my system were missing and I was able to hit this issue:

Check tree block failed, want=12582912, have=0
read block failed check_tree_block
Couldn't read chunk root
warning devid 2 not found already
Check tree block failed, want=143360, have=0
read block failed check_tree_block
Couldn't read chunk root
warning, device 4 is missing
warning, device 3 is missing
warning, device 2 is missing
warning, device 1 is missing

Did a bisect and it leads to the following patch:

commit 915902c5002485fb13d27c4b699a73fb66cc0f09
btrfs-progs: fix device missing of btrfs fi show with seed devices

Also this patch stalls ~2sec in the cmd btrfs fi show, on my system with 48 disks. Also a simple test case hits some warnings...

mkfs.btrfs -draid1 -mraid1 /dev/sdb /dev/sdc
mount /dev/sdb /btrfs
fillfs /btrfs 100
umount /btrfs
wipefs -a /dev/sdb
modprobe -r btrfs
modprobe btrfs
mount -o degraded /dev/sdb /btrfs
btrfs fi show

Label: none uuid: 9844cd05-1c8c-473e-a84b-bac95aab7bc9
Total devices 2 FS bytes used 1.59MiB
devid 2 size 967.87MiB used 104.75MiB path /dev/sdc
*** Some devices missing
warning, device 1 is missing
warning, device 1 is missing
warning devid 1 not found already

Hi Anand and Petr, Oh, it seems that there are btrfs with missing devs that are bringing troubles to the @open_ctree_... function. This should be a missing case of the patch above, which should only take effect when seed devices are present. I will try my best to follow this case; suggestions are welcome. Thanks! -Gui

On 10/23/14 14:57, Petr Janecek wrote: Hello, You have mentioned two issues when balance and fi show run concurrently -- my mail was a bit chaotic, but I get the stalls even on an idle system. Today I got

parent transid verify failed on 1559973888000 wanted 1819 found 1821
parent transid verify failed on 1559973888000 wanted 1819 found 1821
parent transid verify failed on 1559973888000 wanted 1819 found 1821
parent transid verify failed on 1559973888000 wanted 1819 found 1821
Ignoring transid failure
leaf parent key incorrect 1559973888000

from 'btrfs fi sh' while I was just copying something, no balance running. [...]

[PATCH 1/1] btrfs-progs: code optimize cmd_scan_dev() use btrfs_register_one_device()
[PATCH 1/2] btrfs-progs: introduce btrfs_register_all_device()
[PATCH 2/2] btrfs-progs: optimize btrfs_scan_lblkid() for multiple calls

If you could, pls.. now on 3.17 apply the above 3 patches and see if you see any better performance for the stalling issue.

No perceptible change: takes ~40 seconds both before and after applying. Old version: 1 sec.

Can you do the same steps on 3.16 and report what you observe?

So many rejects -- do you have older versions of these patches?
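(For context on the btrfs_register_one_device() mentioned in those patch titles, here's a hedged, simplified sketch of what registering a device with the kernel looks like from user space -- error handling trimmed, and the real btrfs-progs code differs in detail:

	#include <fcntl.h>
	#include <string.h>
	#include <unistd.h>
	#include <sys/ioctl.h>
	#include <linux/btrfs.h>	/* BTRFS_IOC_SCAN_DEV, struct btrfs_ioctl_vol_args */

	/* Ask the kernel to scan/register one device via /dev/btrfs-control,
	 * so user space need not re-parse superblocks itself. */
	static int register_one_device(const char *devpath)
	{
		struct btrfs_ioctl_vol_args args;
		int fd, ret;

		fd = open("/dev/btrfs-control", O_RDWR);
		if (fd < 0)
			return -1;
		memset(&args, 0, sizeof(args));
		strncpy(args.name, devpath, BTRFS_PATH_NAME_MAX);
		ret = ioctl(fd, BTRFS_IOC_SCAN_DEV, &args);
		close(fd);
		return ret;
	}

This is the same BTRFS_IOC_SCAN_DEV ioctl that 'btrfs device scan' issues, and it is the kind of kernel-side interface Anand is arguing progs should lean on instead of re-implementing the logic in user space.)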
Thanks, Petr