Re: btrfs restore fails because of NO SPACE
Wolf Bublitz posted on Fri, 20 May 2016 19:38:55 +0200 as excerpted:

> Hallo,
>
> I have a confusing problem with my btrfs raid. Currently I am using the
> following setup:
>
> btrfs fi show
> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>         Total devices 2 FS bytes used 2.56TiB
>         devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>         devid 4 size 2.73TiB used 2.71TiB path /dev/sdb
>
> As you can see both disks are full.
>
> Actually I cannot mount my raid, even with recovery options enabled:
>
> mount /dev/sda /mnt/Data -t btrfs
> -o nospace_cache,clear_cache,enospc_debug,nodatacow
> [generic mount error]
>
> dmesg shows:
>
> [ 1066.243813] BTRFS error (device sda): parent transid verify
> failed on 9676657786880 wanted 242139 found 0
> [...]
> [ 1066.273234] BTRFS: failed to read chunk root on sda

None of those options are likely to help, there. What /might/ help is the
"usebackuproot" mount option, if your kernel is reasonably current, or
the "recovery" mount option, if it's a bit older.

Btrfs mount options are documented in the btrfs (5) manpage (not the
btrfs (8) manpage, specify the 5), tho again, usebackuproot will only
appear in the manpage if you're running a reasonably current btrfs-progs
version (recovery should be listed in both new and old, but it's listed
as deprecated in new, referring readers to the usebackuproot entry).
Alternatively to the manpage, you can check the mount options listing on
the wiki.

> After spending some time with Google I found a possible solution for my
> problem by running:
>
> btrfs restore -v /dev/sda /mnt/Data
>
> Actually this operation fails silently (computer freezes). After
> examining the kernel logs I found out that the operation fails because
> of „NO SPACE LEFT ON DEVICE“. Can anybody please give me a solution
> for this problem?

You don't explicitly say what you expect btrfs restore to do, but given
the specific command you use, I suspect that you misunderstand what it
does, and it's actually working, but you are running out of space as a
result of using restore incorrectly, because of that misunderstanding.

What btrfs restore does is provide you a read-only method to try to
recover your files from a filesystem that won't mount, by rewriting what
it can recover to an entirely different location on an entirely
different, mounted, filesystem, which of course must contain enough
space to hold a new copy of all those restored files.

And if the filesystem in question isn't mounted to its usual mountpoint,
/mnt/Data, that means you're trying to write all the recovered files to
whatever filesystem actually contains the mountpoint itself, almost
certainly the root filesystem (/), in this case.

And I'll place money on a bet that whatever your root filesystem is, it
doesn't have the terabytes of free space that are likely to be necessary
to restore all of that multi-device, multi-TB-per-device btrfs!
Otherwise, you /likely/ wouldn't be running the separate btrfs in the
first place, but storing it on your main filesystem, instead.

So when btrfs restore runs out of room on / ... everything freezes.

IOW, in order to successfully use btrfs restore, you have to have a
filesystem with enough free space available mounted somewhere, in order
to write the files to it that btrfs restore is restoring! If you don't,
yes, you /will/ run into problems! =:^)

That said, it is possible to use pattern matching to tell btrfs restore
to restore only, say, one directory at a time. By using that, if you
don't have enough space for everything but are willing to give up some
of what was stored, you can tell btrfs restore to restore only the
vitally important stuff that will fit, and not bother trying to restore
the less important stuff that won't fit. Again, see the btrfs-restore
manpage, for that and other details.

And, unless you want those restored files all written as root, you'll
probably want to use the restore metadata option as well, to restore
timestamps and owner/perms information. Similarly, there's an option to
restore symlinks as well, without which they'll be missing. So you
probably do want to check that manpage. Just sayin'. =:^)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
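For concreteness, a sketch of the workflow described above. The device
names, mountpoint and regex are illustrative only, and the exact option
spellings should be checked against the btrfs-restore manpage for the
btrfs-progs version in use:

    # Mount a *separate* filesystem with enough free space to hold the
    # copies (an external disk here; the device name is made up):
    mount /dev/sdc1 /mnt/rescue

    # Dry run: list what restore would recover, without writing anything:
    btrfs restore -v -D /dev/sda /mnt/rescue

    # Restore everything, including owner/perms/timestamps (-m) and
    # symlinks (-S, in newer btrfs-progs):
    btrfs restore -v -m -S /dev/sda /mnt/rescue

    # Or restore only one important subtree, via pattern matching;
    # btrfs restore's regex must match each path component in turn:
    btrfs restore -v -m --path-regex '^/(|home(|/important(|/.*)))$' \
            /dev/sda /mnt/rescue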
Re: btrfs restore fails because of NO SPACE
Chris Murphy posted on Fri, 20 May 2016 15:53:07 -0600 as excerpted:

>> btrfs fi show
>> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>>         Total devices 2 FS bytes used 2.56TiB
>>         devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>>         devid 4 size 2.73TiB used 2.71TiB path /dev/sdb
>
> OK so why does it only list two devices? This is a three drive or four
> drive raid10? This is Btrfs raid10 specifically? I'm confused now about
> the setup and why btrfs fi show isn't saying there are missing devices,
> there is no such thing as two drive btrfs raid10.

I'm confused about where you got that it was a raid10? I don't see it in
anything he posted, that got here to gmane, at least. In fact, I see
only his initial thread-root post, and it doesn't mention raid10 at all
that I can see.

So given that fi show says two devices, none missing, I'd say it can't
be a raid10, and further, given that he didn't specify the raid type,
the btrfs default for a two-device btrfs must be assumed, which is raid1
system and metadata, single mode data.

But I think that's beside the point in terms of the original question.
I think the problem is with his understanding of restore. See the reply
directly to his post, that I'll be making after this one.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
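As an aside, on a filesystem that still mounts, the allocation profiles
inferred above can be confirmed directly. The output shape below is
illustrative of two-device defaults of that era, not captured from this
system:

    btrfs filesystem df /mnt/Data
    # Data, single: total=..., used=...
    # System, RAID1: total=..., used=...
    # Metadata, RAID1: total=..., used=...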
Re: Hot data tracking / hybrid storage
>>>> Converting to bcache protective superblocks is a one-time procedure
>>>> which can be done online. The bcache devices act as normal HDD if
>>>> not attached to a caching SSD. It's really less pain than you may
>>>> think. And it's a solution available now.
>>>>
>>>> Converting back later is easy: Just detach the HDDs from the SSDs
>>>> and use them for some other purpose if you feel so later. Having the
>>>> bcache protective superblock still in place doesn't hurt then.
>>>> Bcache is a no-op without caching device attached.
>>>
>>> No, bcache is _almost_ a no-op without a caching device. From a
>>> userspace perspective, it does nothing, but it is still another layer
>>> of indirection in the kernel, which does have a small impact on
>>> performance. The same is true of using LVM with a single volume
>>> taking up the entire partition: it looks almost no different from
>>> just using the partition, but it will perform worse than using the
>>> partition directly. I've actually done profiling of both to figure
>>> out base values for the overhead, and while bcache with no cache
>>> device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing
>>> storage is).
>>>
>>> You also lose the ability to mount that filesystem directly on a
>>> kernel without bcache support (this may or may not be an issue for
>>> you).
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDDs use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of course,
>> also first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5-minute action and not a GBs- or TBs-copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
>
> My argument about the overhead was not about the superblock, it was
> about the bcache layer itself. It isn't practical to just access the
> data directly if you plan on adding a cache device, because then you
> couldn't do so online unless you're going through bcache. This extra
> layer of indirection in the kernel does add overhead, regardless of the
> on-disk format.

Yes, sorry, I took a shortcut in the discussion and jumped to a method
for avoiding this 0.5-2% slowdown that you mention (or a kernel crashing
in bcache code due to a corrupt SB on a backing device or corrupted
caching device contents). I am actually a bit surprised that there is a
measurable slowdown, considering that it is basically just one 8KiB
offset on a certain layer in the kernel stack, but I haven't looked at
that code.
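To make the sector arithmetic concrete, here is a sketch of the removal
direction just described. The sysfs paths are the usual bcache knobs and
the device names are made up; verify both against the kernel's bcache
documentation before editing any partition table:

    # The bcache superblock occupies 8KiB = 16 x 512-byte sectors, so
    # the filesystem sits 16 sectors into the partition.
    # 1) Flush/detach the backing device from its cache set, then stop it:
    echo 1 > /sys/block/bcache0/bcache/detach
    echo 1 > /sys/block/bcache0/bcache/stop
    # 2) In gdisk, set the alignment to 16 (it refuses otherwise), then
    #    move the partition's start sector from 2048 to 2064; the
    #    filesystem now begins at the partition start and mounts without
    #    the bcache module.
    # Adding bcache is the reverse: move the start from 2048 to 2032 and
    # write a backing superblock there (bcache's default data offset is
    # 16 sectors, which lands the data back at sector 2048):
    make-bcache -B /dev/sdX1
    partprobe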
> Secondarily, having a HDD with just one partition is not a typical use
> case, and that argument about the slack space resulting from the 1M
> alignment only holds true if you're using an MBR instead of a GPT
> layout (or for that matter, almost any other partition table format),
> and you're not booting from that disk (because GRUB embeds itself
> there). It's also fully possible to have an MBR formatted disk which
> doesn't have any spare space there too (which is how most flash drives
> get formatted).

I don't know other tables than MBR and GPT, but this bcache SB
'insertion' works with both. Indeed, if GRUB is involved, it can get
complicated; I have avoided that.

If there is less than 8KiB slack space on a HDD, I would worry about
alignment/performance first; then there is likely a reason to fully
rewrite the HDD with a standard 1M alignment. If there are more
partitions, and there is a partition in front of the one you would like
to be bcached, I would personally shrink it by 8KiB (like NTFS or swap
or ext4) if that saves me terabytes of data transfers.

> This also doesn't change the fact that without careful initial
> formatting (it is possible on some filesystems to embed the bcache SB
> at the beginning of the FS itself, many of them have some reserved
> space at the beginning of the partition for bootloaders, and this space
> doesn't have to exist when mounting the FS) or manual alteration of the
> partition, it's not possible to mount the FS on a system without bcache
> support.

If we consider a non-bootable single HDD btrfs FS, are you then
suggesting that the bcache SB could be placed in the
Re: btrfs restore fails because of NO SPACE
> btrfs fi show
> Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
>         Total devices 2 FS bytes used 2.56TiB
>         devid 1 size 2.73TiB used 2.71TiB path /dev/sda
>         devid 4 size 2.73TiB used 2.71TiB path /dev/sdb

OK so why does it only list two devices? This is a three drive or four
drive raid10? This is Btrfs raid10 specifically? I'm confused now about
the setup and why btrfs fi show isn't saying there are missing devices;
there is no such thing as two drive btrfs raid10.

Chris Murphy
Re: Hot data tracking / hybrid storage
On Fri, May 20, 2016 at 7:59 PM, Austin S. Hemmelgarn wrote:
> On 2016-05-20 13:02, Ferry Toth wrote:
>> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
>> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
>> partitions are in the same pool, which is in btrfs RAID10 format.
>> /boot is in subvolume @boot.
>
> If you have GRUB installed on all 4, then you don't actually have the
> full 2047 sectors between the MBR and the partition free, as GRUB is
> embedded in that space. I forget exactly how much space it takes up,
> but I know it's not the whole 1023.5K. I would not suggest risking
> usage of the final 8k there though. You could however convert to raid1
> temporarily, and then for each device, delete it, reformat for bcache,
> then re-add it to the FS. This may take a while, but should be safe
> (of course, it's only an option if you're already using a kernel with
> bcache support).

There is more than enough space in that 2047-sector area for inserting a
bcache SB, but initially I also found it risky and was not so sure. I
anyhow don't want GRUB in the MBR, but in the filesystem/OS partition
that it should boot, otherwise multi-OS on the same SSD or HDD gets into
trouble.

For the described system, assuming a few minutes offline or
'maintenance' mode is acceptable, I personally would just shrink the
swap by 8KiB, lower its end-sector by 16 and also lower the start-sector
of the btrfs partition by 16, and then add bcache. The location of GRUB
should not matter, actually.

>> In this configuration nothing would beat btrfs if I could just add 2
>> SSD's to the pool that would be clever enough to be paired in RAID1
>> and would be preferred for small (<1GB) file writes. Then balance
>> should be able to move not often used files to the HDD.
>>
>> None of the methods mentioned here sound easy or quick to do, or even
>> well tested.

I agree that all the methods are actually quite complicated, especially
if compared to ZFS and its tools. Adding an ARC is as simple and easy as
you want and describe. The statement I wanted to make is that adding
bcache for a (btrfs) filesystem can be done without touching the FS
itself, provided that one can allow some offline time for the FS.

> It really depends on what you're used to. I would consider most of the
> options easy, but one of the areas I'm strongest with is storage
> management, and I've repaired damaged filesystems and partition tables
> by hand with a hex editor before, so I'm not necessarily a typical
> user. If I was going to suggest something specifically, it would be
> dm-cache, because it requires no modification to the backing store at
> all, but that would require running on LVM if you want it to be easy to
> set up (it's possible to do it without LVM, but you need something to
> call dmsetup before mounting the filesystem, which is not easy to
> configure correctly), and if you're on an enterprise distro, it may not
> be supported.
>
> If you wanted to, it's possible, and not all that difficult, to convert
> a BTRFS system to BTRFS on top of LVM online, but you would probably
> have to split out the boot subvolume to a separate partition (depends
> on which distro you're on, some have working LVM support in GRUB, some
> don't). If you're on a distro which does have LVM support in GRUB, the
> procedure would be:
> 1. Convert the BTRFS array to raid1. This lets you run with only 3
>    disks instead of 4.
> 2. Delete one of the disks from the array.
> 3. Convert the disk you deleted from the array to a LVM PV and add it
>    to a VG.
> 4. Create a new logical volume occupying almost all of the PV you just
>    added (having a little slack space is usually a good thing).
> 5. Use btrfs replace to add the LV to the BTRFS array while deleting
>    one of the others.
> 6. Repeat steps 3-5 for each disk, but stop at step 4 when you have
>    exactly one disk that isn't on LVM (so for four disks, stop at step
>    four when you have 2 with BTRFS+LVM, one with just the LVM logical
>    volume, and one with just BTRFS).
> 7. Reinstall GRUB (it should pull in LVM support now).
> 8. Use BTRFS replace to move the final BTRFS disk to the empty LVM
>    volume.
> 9. Convert the now empty final disk to LVM using steps 3-4.
> 10. Add the LV to the BTRFS array and rebalance to raid10.
> 11. Reinstall GRUB again (just to be certain).
>
> I've done essentially the same thing on numerous occasions when
> reprovisioning for various reasons, and it's actually one of the things
> outside of the xfstests that I check with my regression testing
> (including simulating a couple of the common failure modes). It takes a
> while (especially for big arrays with lots of data), but it works, and
> is relatively safe (you are guaranteed to be able to rebuild a raid1
> array of 3 disks from just 2, so losing the disk in the process of
> copying it will not result in data loss unless you hit a kernel bug).
[PATCH] Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes
From: Omar Sandoval

Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
backed out the conversion to ->iterate_shared() for Btrfs because the
delayed inode handling in btrfs_real_readdir() is racy. However, we can
still do readdir in parallel if there are no delayed nodes. This is a
temporary fix which upgrades the shared inode lock to an exclusive lock
only when we have delayed items until we come up with a more complete
solution.

While we're here, rename the btrfs_{get,put}_delayed_items functions to
make it very clear that they're just for readdir.

Tested with xfstests and by doing a parallel kernel build:

	while make tinyconfig && make -j4 && git clean -dqfx; do
		:
	done

along with a bunch of parallel finds in another shell:

	while true; do
		for ((i=0; i<4; i++)); do
			find . >/dev/null &
		done
		wait
	done

Signed-off-by: Omar Sandoval
---
 fs/btrfs/delayed-inode.c | 27 ++-
 fs/btrfs/delayed-inode.h | 10 ++
 fs/btrfs/inode.c         | 10 ++
 3 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 6cef0062f929..d60cd17ea66b 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -1606,15 +1606,23 @@ int btrfs_inode_delayed_dir_index_count(struct inode *inode)
 	return 0;
 }
 
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list)
+bool btrfs_readdir_get_delayed_items(struct inode *inode,
+				     struct list_head *ins_list,
+				     struct list_head *del_list)
 {
 	struct btrfs_delayed_node *delayed_node;
 	struct btrfs_delayed_item *item;
 
 	delayed_node = btrfs_get_delayed_node(inode);
 	if (!delayed_node)
-		return;
+		return false;
+
+	/*
+	 * We can only do one readdir with delayed items at a time because of
+	 * item->readdir_list.
+	 */
+	inode_unlock_shared(inode);
+	inode_lock(inode);
 
 	mutex_lock(&delayed_node->mutex);
 	item = __btrfs_first_delayed_insertion_item(delayed_node);
@@ -1641,10 +1649,13 @@ void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
 	 * requeue or dequeue this delayed node.
 	 */
 	atomic_dec(&delayed_node->refs);
+
+	return true;
 }
 
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list)
+void btrfs_readdir_put_delayed_items(struct inode *inode,
+				     struct list_head *ins_list,
+				     struct list_head *del_list)
 {
 	struct btrfs_delayed_item *curr, *next;
 
@@ -1659,6 +1670,12 @@ void btrfs_put_delayed_items(struct list_head *ins_list,
 		if (atomic_dec_and_test(&curr->refs))
 			kfree(curr);
 	}
+
+	/*
+	 * The VFS is going to do up_read(), so we need to downgrade back to a
+	 * read lock.
+	 */
+	downgrade_write(&inode->i_rwsem);
 }
 
 int btrfs_should_delete_dir_index(struct list_head *del_list,
diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h
index 0167853c84ae..2495b3d4075f 100644
--- a/fs/btrfs/delayed-inode.h
+++ b/fs/btrfs/delayed-inode.h
@@ -137,10 +137,12 @@ void btrfs_kill_all_delayed_nodes(struct btrfs_root *root);
 void btrfs_destroy_delayed_inodes(struct btrfs_root *root);
 
 /* Used for readdir() */
-void btrfs_get_delayed_items(struct inode *inode, struct list_head *ins_list,
-			     struct list_head *del_list);
-void btrfs_put_delayed_items(struct list_head *ins_list,
-			     struct list_head *del_list);
+bool btrfs_readdir_get_delayed_items(struct inode *inode,
+				     struct list_head *ins_list,
+				     struct list_head *del_list);
+void btrfs_readdir_put_delayed_items(struct inode *inode,
+				     struct list_head *ins_list,
+				     struct list_head *del_list);
 int btrfs_should_delete_dir_index(struct list_head *del_list,
 				  u64 index);
 int btrfs_readdir_delayed_dir_index(struct dir_context *ctx,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 6b7fe291a174..6ab6ca195f2f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5733,6 +5733,7 @@ static int btrfs_real_readdir(struct file *file, struct dir_context *ctx)
 	int name_len;
 	int is_curr = 0;	/* ctx->pos points to the current index? */
 	bool emitted;
+	bool put = false;
 
 	/* FIXME, use a real flag for deciding about the key type */
 	if (root->fs_info->tree_root == root)
@@ -5750,7 +5751,8 @@
Re: btrfs restore fails because of NO SPACE
What versions for kernel and btrfs-progs?

Have you tried only '-o ro,recovery'? What kernel messages do you get
for this? Failure to read the chunk tree is usually a bad sign.

If you have a recent enough btrfs-progs, try 'btrfs check' on the volume
without --repair and post the results; recent would be 4.4.1 or better.

Also a good idea is to post btrfs-show-super -f; there's a scant
possibility there's more than one backup chunk root, and it may be
possible to explicitly use an older one (but 'btrfs check --chunk-root'
is only in v4.5 of btrfs-progs).

Chris Murphy
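Putting those suggestions into concrete commands — all read-only and
non-destructive as given, with the device name taken from the original
report:

    # Recovery-oriented read-only mount attempt:
    mount -t btrfs -o ro,recovery /dev/sda /mnt/Data

    # Read-only consistency check (no --repair); btrfs-progs >= 4.4.1:
    btrfs check /dev/sda

    # Dump all superblocks, including the backup roots:
    btrfs-show-super -f /dev/sda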
btrfs restore fails because of NO SPACE
Hallo,

I have a confusing problem with my btrfs raid. Currently I am using the
following setup:

btrfs fi show
Label: none  uuid: 93000933-e46d-403b-80d7-60475855e3f3
        Total devices 2 FS bytes used 2.56TiB
        devid 1 size 2.73TiB used 2.71TiB path /dev/sda
        devid 4 size 2.73TiB used 2.71TiB path /dev/sdb

As you can see both disks are full.

Actually I cannot mount my raid, even with recovery options enabled:

mount /dev/sda /mnt/Data -t btrfs -o nospace_cache,clear_cache,enospc_debug,nodatacow
mount: wrong fs type, bad option, bad superblock on /dev/sda,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail or so.

dmesg shows:

[ 1066.221696] BTRFS info (device sda): disabling disk space caching
[ 1066.227990] BTRFS info (device sda): force clearing of disk cache
[ 1066.234331] BTRFS info (device sda): setting nodatacow, compression disabled
[ 1066.243813] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.253672] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.263450] BTRFS error (device sda): parent transid verify failed on 9676657786880 wanted 242139 found 0
[ 1066.273234] BTRFS: failed to read chunk root on sda
[ 1066.279675] BTRFS warning (device sda): page private not zero on page 9676657786880
[ 1066.287482] BTRFS warning (device sda): page private not zero on page 9676657790976
[ 1066.295361] BTRFS warning (device sda): page private not zero on page 9676657795072
[ 1066.303204] BTRFS warning (device sda): page private not zero on page 9676657799168
[ 1066.369266] BTRFS: open_ctree failed

After spending some time with Google I found a possible solution for my
problem by running:

btrfs restore -v /dev/sda /mnt/Data

Actually this operation fails silently (the computer freezes). After
examining the kernel logs I found out that the operation fails because
of „NO SPACE LEFT ON DEVICE“. Can anybody please give me a solution for
this problem?

Greetings

Wolf Bublitz
Re: Hot data tracking / hybrid storage
On 2016-05-20 13:02, Ferry Toth wrote:
> We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
> then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
> partitions are in the same pool, which is in btrfs RAID10 format. /boot
> is in subvolume @boot.

If you have GRUB installed on all 4, then you don't actually have the
full 2047 sectors between the MBR and the partition free, as GRUB is
embedded in that space. I forget exactly how much space it takes up, but
I know it's not the whole 1023.5K. I would not suggest risking usage of
the final 8k there though. You could however convert to raid1
temporarily, and then for each device, delete it, reformat for bcache,
then re-add it to the FS. This may take a while, but should be safe (of
course, it's only an option if you're already using a kernel with bcache
support).

> In this configuration nothing would beat btrfs if I could just add 2
> SSD's to the pool that would be clever enough to be paired in RAID1 and
> would be preferred for small (<1GB) file writes. Then balance should be
> able to move not often used files to the HDD.
>
> None of the methods mentioned here sound easy or quick to do, or even
> well tested.

It really depends on what you're used to. I would consider most of the
options easy, but one of the areas I'm strongest with is storage
management, and I've repaired damaged filesystems and partition tables
by hand with a hex editor before, so I'm not necessarily a typical user.
If I was going to suggest something specifically, it would be dm-cache,
because it requires no modification to the backing store at all, but
that would require running on LVM if you want it to be easy to set up
(it's possible to do it without LVM, but you need something to call
dmsetup before mounting the filesystem, which is not easy to configure
correctly), and if you're on an enterprise distro, it may not be
supported.

If you wanted to, it's possible, and not all that difficult, to convert
a BTRFS system to BTRFS on top of LVM online, but you would probably
have to split out the boot subvolume to a separate partition (depends on
which distro you're on, some have working LVM support in GRUB, some
don't). If you're on a distro which does have LVM support in GRUB, the
procedure would be (see the command sketch after this message):

1. Convert the BTRFS array to raid1. This lets you run with only 3 disks
   instead of 4.
2. Delete one of the disks from the array.
3. Convert the disk you deleted from the array to a LVM PV and add it to
   a VG.
4. Create a new logical volume occupying almost all of the PV you just
   added (having a little slack space is usually a good thing).
5. Use btrfs replace to add the LV to the BTRFS array while deleting one
   of the others.
6. Repeat steps 3-5 for each disk, but stop at step 4 when you have
   exactly one disk that isn't on LVM (so for four disks, stop at step
   four when you have 2 with BTRFS+LVM, one with just the LVM logical
   volume, and one with just BTRFS).
7. Reinstall GRUB (it should pull in LVM support now).
8. Use BTRFS replace to move the final BTRFS disk to the empty LVM
   volume.
9. Convert the now empty final disk to LVM using steps 3-4.
10. Add the LV to the BTRFS array and rebalance to raid10.
11. Reinstall GRUB again (just to be certain).

I've done essentially the same thing on numerous occasions when
reprovisioning for various reasons, and it's actually one of the things
outside of the xfstests that I check with my regression testing
(including simulating a couple of the common failure modes). It takes a
while (especially for big arrays with lots of data), but it works, and
is relatively safe (you are guaranteed to be able to rebuild a raid1
array of 3 disks from just 2, so losing the disk in the process of
copying it will not result in data loss unless you hit a kernel bug).
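A hedged sketch of steps 1-5 for the first disk. The device names,
mountpoint and VG/LV names are made up for illustration; adjust them to
the real layout:

    # Step 1: convert to raid1 so the array can run on 3 of 4 disks:
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt

    # Step 2: remove one disk from the array:
    btrfs device delete /dev/sdd3 /mnt

    # Steps 3-4: turn the freed partition into an LVM PV/VG/LV:
    pvcreate /dev/sdd3
    vgcreate vg0 /dev/sdd3              # vgextend vg0 ... for later disks
    lvcreate -n btrfs0 -l 95%FREE vg0   # leave a little slack space

    # Step 5: swap the LV in while removing another raw member:
    btrfs replace start /dev/sdc3 /dev/vg0/btrfs0 /mnt

    # ...repeat per disk, reinstall GRUB, then rebalance back (step 10):
    btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt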
Re: Hot data tracking / hybrid storage
On Fri, 20 May 2016 08:03:12 -0400, Austin S. Hemmelgarn wrote:

> On 2016-05-19 19:23, Henk Slager wrote:
>> On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn wrote:
>>> On 2016-05-19 14:09, Kai Krakow wrote:
>>>> On Wed, 18 May 2016 22:44:55 +0000 (UTC), Ferry Toth wrote:
>>>>> On Tue, 17 May 2016 20:33:35 +0200, Kai Krakow wrote:
>>>>
>>>> Bcache is actually low maintenance, no knobs to turn. Converting to
>>>> bcache protective superblocks is a one-time procedure which can be
>>>> done online. The bcache devices act as normal HDD if not attached to
>>>> a caching SSD. It's really less pain than you may think. And it's a
>>>> solution available now.
>>>>
>>>> Converting back later is easy: Just detach the HDDs from the SSDs
>>>> and use them for some other purpose if you feel so later. Having the
>>>> bcache protective superblock still in place doesn't hurt then.
>>>> Bcache is a no-op without caching device attached.
>>>
>>> No, bcache is _almost_ a no-op without a caching device. From a
>>> userspace perspective, it does nothing, but it is still another layer
>>> of indirection in the kernel, which does have a small impact on
>>> performance. The same is true of using LVM with a single volume
>>> taking up the entire partition: it looks almost no different from
>>> just using the partition, but it will perform worse than using the
>>> partition directly. I've actually done profiling of both to figure
>>> out base values for the overhead, and while bcache with no cache
>>> device is not as bad as the LVM example, it can still be a roughly
>>> 0.5-2% slowdown (it gets more noticeable the faster your backing
>>> storage is).
>>>
>>> You also lose the ability to mount that filesystem directly on a
>>> kernel without bcache support (this may or may not be an issue for
>>> you).
>>
>> The bcache (protective) superblock is in an 8KiB block in front of the
>> file system device. In case the current, non-bcached HDDs use modern
>> partitioning, you can do a 5-minute remove or add of bcache, without
>> moving/copying filesystem data. So in case you have a bcache-formatted
>> HDD that had just 1 primary partition (512 byte logical sectors), the
>> partition start is at sector 2048 and the filesystem start is at 2064.
>> Hard removing bcache (so making sure the module is not
>> needed/loaded/used the next boot) can be done by changing the
>> start-sector of the partition from 2048 to 2064. In gdisk one has to
>> change the alignment to 16 first, otherwise it refuses. And of course,
>> also first flush+stop+de-register bcache for the HDD.
>>
>> The other way around is also possible, i.e. changing the start-sector
>> from 2048 to 2032. So that makes adding bcache to an existing
>> filesystem a 5-minute action and not a GBs- or TBs-copy action. It is
>> not online of course, but just one reboot is needed (or just umount,
>> gdisk, partprobe, add bcache etc).
>> For RAID setups, one could just do 1 HDD first.
>
> My argument about the overhead was not about the superblock, it was
> about the bcache layer itself. It isn't practical to just access the
> data directly if you plan on adding a cache device, because then you
> couldn't do so online unless you're going through bcache. This extra
> layer of indirection in the kernel does add overhead, regardless of the
> on-disk format.
>
> Secondarily, having a HDD with just one partition is not a typical use
> case, and that argument about the slack space resulting from the 1M
> alignment only holds true if you're using an MBR instead of a GPT
> layout (or for that matter, almost any other partition table format),
> and you're not booting from that disk (because GRUB embeds itself
> there). It's also fully possible to have an MBR formatted disk which
> doesn't have any spare space there too (which is how most flash drives
> get formatted).

We have 4 1TB drives in MBR, 1MB free at the beginning, grub on all 4,
then 8GB swap, then all the rest btrfs (no LVM used). The 4 btrfs
partitions are in the same pool, which is in btrfs RAID10 format. /boot
is in subvolume @boot.

In this configuration nothing would beat btrfs if I could just add 2
SSD's to the pool that would be clever enough to be paired in RAID1 and
would be preferred for small (<1GB) file writes. Then balance should be
able to move not often used files to the HDD.

None of the methods mentioned here sound easy or quick to do, or even
well tested.

> This also doesn't change the fact that without careful initial
> formatting (it is possible on some filesystems to embed the bcache SB
> at the beginning of the FS itself, many of them have some reserved
> space at the beginning of the partition for bootloaders, and this space
> doesn't have to exist when mounting the FS) or manual alteration of the
> partition, it's not possible to mount the FS on a system without bcache
> support.
>>
>> There is also a tool doing the conversion in-place
Re: Amount of scrubbed data goes from 15.90GiB to 26.66GiB after defragment -r -v -clzo on a fs always mounted with compress=lzo
On Friday 13 May 2016 08:11:27 CEST, Duncan wrote:
> In theory the various btrfs dedup solutions out there should work as
> well, while letting you keep the snapshots (at least to the extent
> they're either writable snapshots so can be reflink modified

Unfortunately, as you said, dedup doesn't work with read-only snapshots
(I only use read-only snapshots with snapper) :(

Does bedup's dedup-syscall branch
(https://github.com/g2p/bedup/tree/wip/dedup-syscall), which uses the
new batch deduplication ioctl merged in Linux 3.12, fix this?
Unfortunately the latest commit is from September :(
Re: [PATCH 6/6] Btrfs: fix race between device replace and chunk allocation
On Fri, May 20, 2016 at 4:30 PM, Josef Bacik wrote:
> On Fri, May 20, 2016 at 12:45 AM, wrote:
>> From: Filipe Manana
>>
>> While iterating and copying extents from the source device, the device
>> replace code keeps adjusting a left cursor that is used to make sure
>> that once we finish processing a device extent, any future writes to
>> extents from the corresponding block group will get into both the
>> source and target devices. This left cursor is also used for resuming
>> the device replace operation at mount time.
>>
>> However using this left cursor to decide whether writes go into both
>> devices or only the source device is not enough to guarantee we don't
>> miss copying extents into the target device. There are two cases where
>> the current approach fails. The first one is related to when there are
>> holes in the device and they get allocated for new block groups while
>> the device replace operation is iterating the device extents (more on
>> this explained below). The second one is that when that loop over the
>> device extents finishes, we start delalloc, wait for all ordered
>> extents and then commit the current transaction, and we might have got
>> new block groups allocated that are now using a device extent that has
>> an offset greater than or equal to the value of the left cursor, in
>> which case writes to extents belonging to these new block groups will
>> get issued only to the source device.
>>
>> For the first case where the current approach of using a left cursor
>> fails, consider the source device currently has the following layout:
>>
>>   [ extent bg A ]  [ hole, unallocated space ]  [ extent bg B ]
>>   3Gb              4Gb                          5Gb
>>
>> While we are iterating the device extents from the source device using
>> the commit root of the device tree, the following happens:
>>
>>   CPU 1                                     CPU 2
>>
>>   scrub_enumerate_chunks()
>>     --> searches the device tree for
>>         extents belonging to the source
>>         device using the device tree's
>>         commit root
>>     --> 1st iteration finds extent
>>         belonging to block group A
>>     --> sets block group A to RO mode
>>         (btrfs_inc_block_group_ro)
>>     --> sets cursor left to
>>         found_key.offset, which is 3Gb
>>     --> scrub_chunk() starts
>>         copies all allocated extents from
>>         block group A's stripe at the
>>         source device into the target
>>         device
>>
>>                                             btrfs_alloc_chunk()
>>                                               --> allocates device
>>                                                   extent in the range
>>                                                   [4Gb, 5Gb[ from the
>>                                                   source device for a
>>                                                   new block group C
>>
>>                                             extent allocated from block
>>                                             group C for a direct IO,
>>                                             buffered write or btree
>>                                             node/leaf
>>
>>                                             extent is written to,
>>                                             perhaps in response to a
>>                                             writepages() call from the
>>                                             VM or directly through
>>                                             direct IO
>>
>>                                             the write is made only
>>                                             against the source device
>>                                             and not against the target
>>                                             device, because the
>>                                             extent's offset is in the
>>                                             interval [4Gb, 5Gb[ which
>>                                             is larger than the value
>>                                             of cursor_left (3Gb)
>>
>>     --> scrub_chunks() finishes
>>     --> updates left cursor from 3Gb to
>>         4Gb
>>     --> btrfs_dec_block_group_ro() sets
>>         block group A back to RW mode
>>
>>     --> 2nd iteration finds extent
>>         belonging to block group B - it
>>         did not find the new extent in the
>>         range [4Gb, 5Gb[ for block group C
>>         because we are using the device
>>         tree's commit root or even because
>>         the block group's items are not
>>         all yet inserted in the respective
>>         btrees, that is, the block group
>>         is still attached to some
Re: [PATCH 6/6] Btrfs: fix race between device replace and chunk allocation
On Fri, May 20, 2016 at 12:45 AM, wrote:
> From: Filipe Manana
>
> While iterating and copying extents from the source device, the device
> replace code keeps adjusting a left cursor that is used to make sure
> that once we finish processing a device extent, any future writes to
> extents from the corresponding block group will get into both the
> source and target devices. This left cursor is also used for resuming
> the device replace operation at mount time.
>
> However using this left cursor to decide whether writes go into both
> devices or only the source device is not enough to guarantee we don't
> miss copying extents into the target device. There are two cases where
> the current approach fails. The first one is related to when there are
> holes in the device and they get allocated for new block groups while
> the device replace operation is iterating the device extents (more on
> this explained below). The second one is that when that loop over the
> device extents finishes, we start delalloc, wait for all ordered
> extents and then commit the current transaction, and we might have got
> new block groups allocated that are now using a device extent that has
> an offset greater than or equal to the value of the left cursor, in
> which case writes to extents belonging to these new block groups will
> get issued only to the source device.
>
> For the first case where the current approach of using a left cursor
> fails, consider the source device currently has the following layout:
>
>   [ extent bg A ]  [ hole, unallocated space ]  [ extent bg B ]
>   3Gb              4Gb                          5Gb
>
> While we are iterating the device extents from the source device using
> the commit root of the device tree, the following happens:
>
>   CPU 1                                     CPU 2
>
>   scrub_enumerate_chunks()
>     --> searches the device tree for
>         extents belonging to the source
>         device using the device tree's
>         commit root
>     --> 1st iteration finds extent
>         belonging to block group A
>     --> sets block group A to RO mode
>         (btrfs_inc_block_group_ro)
>     --> sets cursor left to
>         found_key.offset, which is 3Gb
>     --> scrub_chunk() starts
>         copies all allocated extents from
>         block group A's stripe at the
>         source device into the target
>         device
>
>                                             btrfs_alloc_chunk()
>                                               --> allocates device
>                                                   extent in the range
>                                                   [4Gb, 5Gb[ from the
>                                                   source device for a
>                                                   new block group C
>
>                                             extent allocated from block
>                                             group C for a direct IO,
>                                             buffered write or btree
>                                             node/leaf
>
>                                             extent is written to,
>                                             perhaps in response to a
>                                             writepages() call from the
>                                             VM or directly through
>                                             direct IO
>
>                                             the write is made only
>                                             against the source device
>                                             and not against the target
>                                             device, because the
>                                             extent's offset is in the
>                                             interval [4Gb, 5Gb[ which
>                                             is larger than the value
>                                             of cursor_left (3Gb)
>
>     --> scrub_chunks() finishes
>     --> updates left cursor from 3Gb to
>         4Gb
>     --> btrfs_dec_block_group_ro() sets
>         block group A back to RW mode
>
>     --> 2nd iteration finds extent
>         belonging to block group B - it
>         did not find the new extent in the
>         range [4Gb, 5Gb[ for block group C
>         because we are using the device
>         tree's commit root or even because
>         the block group's items are not
>         all yet inserted in the respective
>         btrees, that is, the block group
>         is still attached to some
>         transaction handle's new_bgs list
>         and
>         btrfs_create_pending_block_groups()
>         was not called yet against that
>         transaction handle, so the device
>         extent items
Re: [PATCH 5/6] Btrfs: fix race setting block group back to RW mode during device replace
On Fri, May 20, 2016 at 12:45 AM, wrote:
> From: Filipe Manana
>
> After it finishes processing a device extent, the device replace code
> sets back the block group to RW mode and then after that it sets the
> left cursor to match the logical end address of the block group, so
> that future writes into extents belonging to the block group go to both
> the source (old) and target (new) devices. However from the moment we
> turn the block group back to RW mode we have a short time window, that
> lasts until we update the left cursor's value, where extents can be
> allocated from the block group and written to, in which case they will
> not be copied/written to the target (new) device. Fix this by updating
> the left cursor's value before turning the block group back to RW mode.
>
> Signed-off-by: Filipe Manana

Reviewed-by: Josef Bacik

Thanks,

Josef
Re: [PATCH 4/6] Btrfs: fix unprotected assignment of the left cursor for device replace
On Fri, May 20, 2016 at 12:44 AM, wrote:
> From: Filipe Manana
>
> We were assigning new values to fields of the device replace object
> without holding the respective lock after processing each device
> extent. This is important for the left cursor field which can be
> accessed by a concurrent task running __btrfs_map_block (which,
> correctly, takes the device replace lock).
> So change these fields while holding the device replace lock.
>
> Signed-off-by: Filipe Manana

Eesh, thanks,

Reviewed-by: Josef Bacik
Re: [PATCH 3/6] Btrfs: fix race setting block group readonly during device replace
On Fri, May 20, 2016 at 12:44 AM, wrote:
> From: Filipe Manana
>
> When we do a device replace, for each device extent we find from the
> source device, we set the corresponding block group to readonly mode to
> prevent writes into it from happening while we are copying the device
> extent from the source to the target device. However just before we set
> the block group to readonly mode, some concurrent task might have
> already allocated an extent from it or decided it could perform a nocow
> write into one of its extents, which can make the device replace
> process miss copying an extent, since it uses the extent tree's commit
> root to search for extents and only once it finishes searching for all
> extents belonging to the block group does it set the left cursor to the
> logical end address of the block group - this is a problem if the
> respective ordered extents finish while we are searching for extents
> using the extent tree's commit root and no transaction commit happens
> while we are iterating the tree, since it's the delayed references
> created by the ordered extents (when they complete) that insert the
> extent items into the extent tree (using the non-commit root of
> course).
> Example:
>
>   CPU 1                                     CPU 2
>
>   btrfs_dev_replace_start()
>     btrfs_scrub_dev()
>       scrub_enumerate_chunks()
>         --> finds device extent belonging
>             to block group X
>
>                                             starts buffered write
>                                             against some inode
>
>                                             writepages is run against
>                                             that inode forcing delalloc
>                                             to run
>
>                                             btrfs_writepages()
>                                              extent_writepages()
>                                               extent_write_cache_pages()
>                                                __extent_writepage()
>                                                 writepage_delalloc()
>                                                  run_delalloc_range()
>                                                   cow_file_range()
>                                                    btrfs_reserve_extent()
>                                                     --> allocates an
>                                                         extent from
>                                                         block group X
>                                                         (which is not
>                                                         yet in RO mode)
>                                                    btrfs_add_ordered_extent()
>                                                     --> creates ordered
>                                                         extent Y
>                                             flush_epd_write_bio()
>                                               --> bio against the
>                                                   extent from block
>                                                   group X is submitted
>
>   btrfs_inc_block_group_ro(bg X)
>     --> sets block group X to readonly
>
>   scrub_chunk(bg X)
>     scrub_stripe(device extent from srcdev)
>       --> keeps searching for extent items
>           belonging to the block group
>           using the extent tree's commit
>           root
>       --> it never blocks due to
>           fs_info->scrub_pause_req as no
>           one tries to commit transaction N
>       --> copies all extents found from
>           the source device into the
>           target device
>       --> finishes search loop
>
>                                             bio completes
>
>                                             ordered extent Y completes
>                                             and creates delayed data
>                                             reference which will add an
>                                             extent item to the extent
>                                             tree when run (typically at
>                                             transaction commit time)
>
>                                               --> so the task doing the
>                                                   scrub/device replace
>                                                   at CPU 1 misses this
Re: [PATCH 2/6] Btrfs: fix race between device replace and block group removal
On Fri, May 20, 2016 at 12:44 AM, wrote:
> From: Filipe Manana
>
> When it's finishing, the device replace code iterates all extent maps
> representing block groups and for each one that has a stripe that
> refers to the source device, it replaces its device with the target
> device. However when it replaces the source device with the target
> device, the target device still has an ID of 0ULL
> (BTRFS_DEV_REPLACE_DEVID); only after that is its ID changed to match
> the one from the source device. This leads to races with the chunk
> removal code that can temporarily see a device with an ID of 0ULL and
> then attempt to use that ID to remove items from the device tree and
> fail, causing a transaction abort:
>
> [ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
> [ 9238.594377] ------------[ cut here ]------------
> [ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
> [ 9238.594403] BTRFS: Transaction aborted (error 1)
> [ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core serio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
> [ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
> [ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [ 9238.594421] 88017f1dbc60 8126b42c 88017f1dbcb0
> [ 9238.594422] 88017f1dbca0 81052b14 0ad37f1dbd18
> [ 9238.594423] 0001 88018068a558 88005c4b9c00 880233f60db0
> [ 9238.594424] Call Trace:
> [ 9238.594428] [] dump_stack+0x67/0x90
> [ 9238.594430] [] __warn+0xc2/0xdd
> [ 9238.594432] [] warn_slowpath_fmt+0x4b/0x53
> [ 9238.594434] [] ? kmem_cache_free+0x128/0x188
> [ 9238.594450] [] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
> [ 9238.594452] [] ? arch_local_irq_save+0x9/0xc
> [ 9238.594464] [] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
> [ 9238.594476] [] cleaner_kthread+0x1ad/0x1c7 [btrfs]
> [ 9238.594489] [] ? btree_invalidatepage+0x8e/0x8e [btrfs]
> [ 9238.594490] [] kthread+0xd4/0xdc
> [ 9238.594494] [] ret_from_fork+0x22/0x40
> [ 9238.594495] [] ? kthread_stop+0x286/0x286
> [ 9238.594496] ---[ end trace 183efbe50275f059 ]---
>
> The sequence of steps leading to this is like the following:
>
>   CPU 1                                     CPU 2
>
>   btrfs_dev_replace_finishing()
>
>     at this point
>     dev_replace->tgtdev->devid ==
>     BTRFS_DEV_REPLACE_DEVID (0ULL)
>
>     ...
>
>     btrfs_start_transaction()
>     btrfs_commit_transaction()
>
>                                             btrfs_delete_unused_bgs()
>                                               btrfs_remove_chunk()
>
>                                                 looks up for the extent
>                                                 map corresponding to
>                                                 the chunk
>
>                                                 lock_chunks()
>                                                 (chunk_mutex)
>                                                 check_system_chunk()
>                                                 unlock_chunks()
>                                                 (chunk_mutex)
>
>     locks fs_info->chunk_mutex
>
>     btrfs_dev_replace_update_device_in_mapping_tree()
>       --> iterates fs_info->mapping_tree
>           and replaces the device in every
>           extent map's map->stripes[] with
>           dev_replace->tgtdev, which still
>           has an id of 0ULL
>           (BTRFS_DEV_REPLACE_DEVID)
>
>                                                 iterates over all
>                                                 stripes from the extent
>                                                 map
>
>                                                 --> calls
>                                                     btrfs_free_dev_extent()
>                                                     passing it the
>                                                     target device that
>                                                     still has an ID of
>                                                     0ULL
>
>                                                 --> btrfs_free_dev_extent()
>                                                     fails
>                                                     --> aborts current
>                                                         transaction
>
>     finishes setting up the target device,
>     namely it sets tgtdev->devid to the
>     value of srcdev->devid (which is
>     necessarily > 0)
>
>     frees the srcdev
>
>     unlocks fs_info->chunk_mutex
>
> So fix this by taking the device list mutex while processing the
> stripes for the chunk's extent map. This is similar to the race between
> device replace and block group creation that was fixed by commit
> 50460e37186a ("Btrfs: fix
Re: [PATCH 1/6] Btrfs: fix race between readahead and device replace/removal
On Fri, May 20, 2016 at 12:44 AM, wrote:
> From: Filipe Manana
>
> The list of devices is protected by the device_list_mutex and the
> device replace code, in its finishing phase, correctly takes that mutex
> before removing the source device from that list. However the readahead
> code was iterating that list without acquiring the respective mutex,
> leading to crashes later on due to invalid memory accesses:
>
> [125671.831036] general protection fault: [#1] PREEMPT SMP
> [125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser
> [125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1
> [125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
> [125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
> [125671.834973] task: 8801ac520540 ti: 8801ac918000 task.ti: 8801ac918000
> [125671.834973] RIP: 0010:[] [] __radix_tree_lookup+0x6a/0x105
> [125671.834973] RSP: 0018:8801ac91bc28 EFLAGS: 00010206
> [125671.834973] RAX: RBX: 6b6b6b6b6b6b6b6a RCX:
> [125671.834973] RDX: RSI: 000c1bff RDI: 88002ebd62a8
> [125671.834973] RBP: 8801ac91bc70 R08: 0001 R09:
> [125671.834973] R10: 8801ac91bc70 R11: R12: 88002ebd62a8
> [125671.834973] R13: R14: R15: 000c1bff
> [125671.834973] FS: () GS:88023fd4() knlGS:
> [125671.834973] CS: 0010 DS: ES: CR0: 80050033
> [125671.834973] CR2: 0073cae4 CR3: b7723000 CR4: 06e0
> [125671.834973] Stack:
> [125671.834973] 8801422d5600 8802286bbc00
> [125671.834973] 0001 8802286bbc00 000c1bff
> [125671.834973] 88002e639eb8 8801ac91bc80 81270541 8801ac91bcb0
> [125671.834973] Call Trace:
> [125671.834973] [] radix_tree_lookup+0xd/0xf
> [125671.834973] [] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
> [125671.834973] [] reada_pick_zone+0x29/0x103 [btrfs]
> [125671.834973] [] reada_start_machine_worker+0x129/0x2d3 [btrfs]
> [125671.834973] [] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
> [125671.834973] [] btrfs_readahead_helper+0xe/0x10 [btrfs]
> [125671.834973] [] process_one_work+0x271/0x4e9
> [125671.834973] [] worker_thread+0x1eb/0x2c9
> [125671.834973] [] ? rescuer_thread+0x2b3/0x2b3
> [125671.834973] [] kthread+0xd4/0xdc
> [125671.834973] [] ret_from_fork+0x22/0x40
> [125671.834973] [] ? kthread_stop+0x286/0x286
>
> So fix this by taking the device_list_mutex in the readahead code. We
> can't use here the lighter approach of using a rcu_read_lock() and
> rcu_read_unlock() pair together with a list_for_each_entry_rcu() call
> because we end up doing calls to sleeping functions (kzalloc()) in the
> respective code path.
>
> Signed-off-by: Filipe Manana

I think it might be time to change this to a rwsem as well, as we use it
in a bunch of places that are read only, like statfs and readahead. But
this works for now.

Reviewed-by: Josef Bacik

Thanks,

Josef
[PATCH 4/6] Btrfs: fix unprotected assignment of the left cursor for device replace
From: Filipe Manana

We were assigning new values to fields of the device replace object
without holding the respective lock after processing each device extent.
This is important for the left cursor field which can be accessed by a
concurrent task running __btrfs_map_block (which, correctly, takes the
device replace lock).
So change these fields while holding the device replace lock.

Signed-off-by: Filipe Manana
---
 fs/btrfs/scrub.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a181b52..a58e0ae 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3640,9 +3640,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			break;
 		}
 
+		btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
 		dev_replace->cursor_right = found_key.offset + length;
 		dev_replace->cursor_left = found_key.offset;
 		dev_replace->item_needs_writeback = 1;
+		btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
 
 		ret = scrub_chunk(sctx, scrub_dev, chunk_offset, length,
 				  found_key.offset, cache, is_dev_replace);
@@ -3716,8 +3718,10 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			break;
 		}
 
+		btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
 		dev_replace->cursor_left = dev_replace->cursor_right;
 		dev_replace->item_needs_writeback = 1;
+		btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
 skip:
 		key.offset = found_key.offset + length;
 		btrfs_release_path(path);
--
2.7.0.rc3
[PATCH 5/6] Btrfs: fix race setting block group back to RW mode during device replace
From: Filipe Manana

After it finishes processing a device extent, the device replace code
sets back the block group to RW mode and then after that it sets the
left cursor to match the logical end address of the block group, so that
future writes into extents belonging to the block group go to both the
source (old) and target (new) devices. However from the moment we turn
the block group back to RW mode we have a short time window, that lasts
until we update the left cursor's value, where extents can be allocated
from the block group and written to, in which case they will not be
copied/written to the target (new) device. Fix this by updating the left
cursor's value before turning the block group back to RW mode.

Signed-off-by: Filipe Manana
---
 fs/btrfs/scrub.c | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c
index a58e0ae..c4c09a8 100644
--- a/fs/btrfs/scrub.c
+++ b/fs/btrfs/scrub.c
@@ -3680,6 +3680,11 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 
 		scrub_pause_off(fs_info);
 
+		btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
+		dev_replace->cursor_left = dev_replace->cursor_right;
+		dev_replace->item_needs_writeback = 1;
+		btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
+
 		if (ro_set)
 			btrfs_dec_block_group_ro(root, cache);
 
@@ -3717,11 +3722,6 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx,
 			ret = -ENOMEM;
 			break;
 		}
-
-		btrfs_dev_replace_lock(&fs_info->dev_replace, 1);
-		dev_replace->cursor_left = dev_replace->cursor_right;
-		dev_replace->item_needs_writeback = 1;
-		btrfs_dev_replace_unlock(&fs_info->dev_replace, 1);
 skip:
 		key.offset = found_key.offset + length;
 		btrfs_release_path(path);
--
2.7.0.rc3
[PATCH 6/6] Btrfs: fix race between device replace and chunk allocation
From: Filipe Manana

While iterating and copying extents from the source device, the device
replace code keeps adjusting a left cursor that is used to make sure
that once we finish processing a device extent, any future writes to
extents from the corresponding block group will get into both the source
and target devices. This left cursor is also used for resuming the
device replace operation at mount time.

However using this left cursor to decide whether writes go into both
devices or only the source device is not enough to guarantee we don't
miss copying extents into the target device. There are two cases where
the current approach fails. The first one is related to when there are
holes in the device and they get allocated for new block groups while
the device replace operation is iterating the device extents (more on
this explained below). The second one is that when that loop over the
device extents finishes, we start delalloc, wait for all ordered extents
and then commit the current transaction, and we might have got new block
groups allocated that are now using a device extent that has an offset
greater than or equal to the value of the left cursor, in which case
writes to extents belonging to these new block groups will get issued
only to the source device.

For the first case where the current approach of using a left cursor
fails, consider the source device currently has the following layout:

  [ extent bg A ]  [ hole, unallocated space ]  [ extent bg B ]
  3Gb              4Gb                          5Gb

While we are iterating the device extents from the source device using
the commit root of the device tree, the following happens:

  CPU 1                                     CPU 2

  scrub_enumerate_chunks()
    --> searches the device tree for
        extents belonging to the source
        device using the device tree's
        commit root
    --> 1st iteration finds extent
        belonging to block group A
    --> sets block group A to RO mode
        (btrfs_inc_block_group_ro)
    --> sets cursor left to
        found_key.offset, which is 3Gb
    --> scrub_chunk() starts
        copies all allocated extents from
        block group A's stripe at the
        source device into the target
        device

                                            btrfs_alloc_chunk()
                                              --> allocates device
                                                  extent in the range
                                                  [4Gb, 5Gb[ from the
                                                  source device for a
                                                  new block group C

                                            extent allocated from block
                                            group C for a direct IO,
                                            buffered write or btree
                                            node/leaf

                                            extent is written to,
                                            perhaps in response to a
                                            writepages() call from the
                                            VM or directly through
                                            direct IO

                                            the write is made only
                                            against the source device
                                            and not against the target
                                            device, because the extent's
                                            offset is in the interval
                                            [4Gb, 5Gb[ which is larger
                                            than the value of
                                            cursor_left (3Gb)

    --> scrub_chunks() finishes
    --> updates left cursor from 3Gb to
        4Gb
    --> btrfs_dec_block_group_ro() sets
        block group A back to RW mode

    --> 2nd iteration finds extent
        belonging to block group B - it did
        not find the new extent in the
        range [4Gb, 5Gb[ for block group C
        because we are using the device
        tree's commit root or even because
        the block group's items are not all
        yet inserted in the respective
        btrees, that is, the block group is
        still attached to some transaction
        handle's new_bgs list and
        btrfs_create_pending_block_groups()
        was not called yet against that
        transaction handle, so the device
        extent items were not yet inserted
        into the devices tree

    --> so we end up not copying anything
        from the newly allocated device
        extent from the source device to
        the target device

So fix this by making
[PATCH 3/6] Btrfs: fix race setting block group readonly during device replace
From: Filipe Manana

When we do a device replace, for each device extent we find from the
source device, we set the corresponding block group to readonly mode to
prevent writes into it from happening while we are copying the device
extent from the source to the target device. However just before we set
the block group to readonly mode, some concurrent task might have already
allocated an extent from it, or decided it could perform a nocow write
into one of its extents, which can cause the device replace process to
miss copying an extent, since it uses the extent tree's commit root to
search for extents and only sets the left cursor to the logical end
address of the block group once it finishes searching for all extents
belonging to that block group. This is a problem if the respective ordered
extents finish while we are searching for extents using the extent tree's
commit root and no transaction commit happens while we are iterating the
tree, since it's the delayed references created by the ordered extents
(when they complete) that insert the extent items into the extent tree
(using the non-commit root of course). Example:

  CPU 1                                     CPU 2

 btrfs_dev_replace_start()
   btrfs_scrub_dev()
     scrub_enumerate_chunks()
       --> finds device extent belonging
           to block group X

                                            starts buffered write against
                                            some inode

                                            writepages is run against that
                                            inode, forcing delalloc to run

                                            btrfs_writepages()
                                             extent_writepages()
                                              extent_write_cache_pages()
                                               __extent_writepage()
                                                writepage_delalloc()
                                                 run_delalloc_range()
                                                  cow_file_range()
                                                   btrfs_reserve_extent()
                                                     --> allocates an extent
                                                         from block group X
                                                         (which is not yet
                                                         in RO mode)
                                                   btrfs_add_ordered_extent()
                                                     --> creates ordered
                                                         extent Y
                                            flush_epd_write_bio()
                                              --> bio against the extent
                                                  from block group X is
                                                  submitted

       btrfs_inc_block_group_ro(bg X)
         --> sets block group X to readonly

       scrub_chunk(bg X)
         scrub_stripe(device extent from srcdev)
           --> keeps searching for extent
               items belonging to the block
               group using the extent
               tree's commit root
           --> it never blocks due to
               fs_info->scrub_pause_req as
               no one tries to commit
               transaction N
           --> copies all extents found
               from the source device into
               the target device
           --> finishes search loop

                                            bio completes

                                            ordered extent Y completes and
                                            creates delayed data reference
                                            which will add an extent item
                                            to the extent tree when run
                                            (typically at transaction
                                            commit time)

                                            --> so the task doing the
                                                scrub/device replace at
                                                CPU 1 misses this and does
                                                not copy this extent into
                                                the new/target device

       btrfs_dec_block_group_ro(bg X)
         --> turns block group X back to RW
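The essence of the race is that the readonly flag only gates *future*
allocations. A toy userspace model of that ordering (hypothetical names,
not btrfs code):

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* Toy model of a block group's writable flag, not kernel code. */
struct toy_block_group {
	pthread_mutex_t lock;
	int ro;			/* set during device replace */
	uint64_t reserved;	/* bytes handed out to in-flight writes */
};

/* Allocator side: succeeds only while the group is still writable. */
static int toy_reserve_extent(struct toy_block_group *bg, uint64_t len)
{
	int ok = 0;

	pthread_mutex_lock(&bg->lock);
	if (!bg->ro) {
		bg->reserved += len;
		ok = 1;
	}
	pthread_mutex_unlock(&bg->lock);
	return ok;
}

/* Replace side: flipping ro stops future reservations, but a
 * reservation that won the lock a moment earlier may still have a bio
 * in flight -- exactly the window the commit message describes. */
static void toy_set_readonly(struct toy_block_group *bg)
{
	pthread_mutex_lock(&bg->lock);
	bg->ro = 1;
	pthread_mutex_unlock(&bg->lock);
}

int main(void)
{
	struct toy_block_group bg = {
		.lock = PTHREAD_MUTEX_INITIALIZER, .ro = 0, .reserved = 0,
	};

	int won = toy_reserve_extent(&bg, 4096);  /* writer wins... */
	toy_set_readonly(&bg);                    /* ...before RO is set */
	printf("reservation before RO succeeded: %d\n", won);
	return 0;
}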
[PATCH 2/6] Btrfs: fix race between device replace and block group removal
From: Filipe Manana

When it's finishing, the device replace code iterates all extent maps
representing block groups and, for each one that has a stripe that refers
to the source device, replaces its device with the target device. However
when it replaces the source device with the target device, the target
device still has an ID of 0ULL (BTRFS_DEV_REPLACE_DEVID); only afterwards
is its ID changed to match the one from the source device. This leads to
races with the chunk removal code, which can temporarily see a device with
an ID of 0ULL, attempt to use that ID to remove items from the device tree
and fail, causing a transaction abort:

[ 9238.594364] BTRFS info (device sdf): dev_replace from /dev/sdf (devid 3) to /dev/sde finished
[ 9238.594377] ------------[ cut here ]------------
[ 9238.594402] WARNING: CPU: 14 PID: 21566 at fs/btrfs/volumes.c:2771 btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594403] BTRFS: Transaction aborted (error 1)
[ 9238.594416] Modules linked in: btrfs crc32c_generic acpi_cpufreq xor tpm_tis tpm raid6_pq ppdev parport_pc processor psmouse parport i2c_piix4 evdev sg i2c_core serio_raw pcspkr button loop autofs4 ext4 crc16 jbd2 mbcache sr_mod cdrom sd_mod ata_generic virtio_scsi ata_piix virtio_pci libata virtio_ring virtio e1000 scsi_mod floppy [last unloaded: btrfs]
[ 9238.594418] CPU: 14 PID: 21566 Comm: btrfs-cleaner Not tainted 4.6.0-rc7-btrfs-next-29+ #1
[ 9238.594419] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[ 9238.594421] 88017f1dbc60 8126b42c 88017f1dbcb0
[ 9238.594422] 88017f1dbca0 81052b14 0ad37f1dbd18
[ 9238.594423] 0001 88018068a558 88005c4b9c00 880233f60db0
[ 9238.594424] Call Trace:
[ 9238.594428] [] dump_stack+0x67/0x90
[ 9238.594430] [] __warn+0xc2/0xdd
[ 9238.594432] [] warn_slowpath_fmt+0x4b/0x53
[ 9238.594434] [] ? kmem_cache_free+0x128/0x188
[ 9238.594450] [] btrfs_remove_chunk+0x2e5/0x793 [btrfs]
[ 9238.594452] [] ? arch_local_irq_save+0x9/0xc
[ 9238.594464] [] btrfs_delete_unused_bgs+0x317/0x382 [btrfs]
[ 9238.594476] [] cleaner_kthread+0x1ad/0x1c7 [btrfs]
[ 9238.594489] [] ? btree_invalidatepage+0x8e/0x8e [btrfs]
[ 9238.594490] [] kthread+0xd4/0xdc
[ 9238.594494] [] ret_from_fork+0x22/0x40
[ 9238.594495] [] ? kthread_stop+0x286/0x286
[ 9238.594496] ---[ end trace 183efbe50275f059 ]---

The sequence of steps leading to this is like the following:

  CPU 1                                     CPU 2

 btrfs_dev_replace_finishing()

   at this point
   dev_replace->tgtdev->devid ==
   BTRFS_DEV_REPLACE_DEVID (0ULL)

   ...

   btrfs_start_transaction()
   btrfs_commit_transaction()

                                            btrfs_delete_unused_bgs()
                                              btrfs_remove_chunk()
                                                looks up the extent map
                                                corresponding to the chunk

                                                lock_chunks() (chunk_mutex)
                                                check_system_chunk()
                                                unlock_chunks() (chunk_mutex)

   locks fs_info->chunk_mutex

   btrfs_dev_replace_update_device_in_mapping_tree()
     --> iterates fs_info->mapping_tree
         and replaces the device in every
         extent map's map->stripes[] with
         dev_replace->tgtdev, which still
         has an id of 0ULL
         (BTRFS_DEV_REPLACE_DEVID)

                                            iterates over all stripes from
                                            the extent map
                                              --> calls
                                                  btrfs_free_dev_extent()
                                                  passing it the target
                                                  device that still has an
                                                  ID of 0ULL
                                              --> btrfs_free_dev_extent()
                                                  fails
                                              --> aborts current transaction

   finishes setting up the target device,
   namely it sets tgtdev->devid to the
   value of srcdev->devid (which is
   necessarily > 0)

   frees the srcdev

   unlocks fs_info->chunk_mutex

So fix this by taking the device list mutex while processing the stripes
for the chunk's extent map.
This is similar to the race between device replace and block group
creation that was fixed by commit 50460e37186a ("Btrfs: fix race when
finishing dev replace leading to transaction abort").

Signed-off-by: Filipe Manana
---
 fs/btrfs/volumes.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index bd0f45f..683e2bd 100644
---
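The patch hunk itself is cut off above; purely as an illustration of the
locking pattern the message describes, here is a simplified userspace
sketch (toy types and names, not the actual diff):

#include <pthread.h>
#include <stdint.h>

#define TOY_REPLACE_DEVID 0ULL	/* stand-in for BTRFS_DEV_REPLACE_DEVID */

struct toy_device { uint64_t devid; };
struct toy_map {
	int num_stripes;
	struct toy_device *stripes[4];
};

static pthread_mutex_t device_list_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Chunk-removal side: holding the device list mutex across the whole
 * stripe walk means the replace-finishing code (which takes the same
 * mutex to swap devices and fix up the devid) can never expose a device
 * that still carries devid 0 halfway through this loop. */
static int toy_remove_chunk(struct toy_map *map)
{
	int ret = 0;

	pthread_mutex_lock(&device_list_mutex);
	for (int i = 0; i < map->num_stripes; i++) {
		/* freeing the device extent would fail for devid 0 */
		if (map->stripes[i]->devid == TOY_REPLACE_DEVID) {
			ret = -1;	/* would abort the transaction */
			break;
		}
	}
	pthread_mutex_unlock(&device_list_mutex);
	return ret;
}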
[PATCH 1/6] Btrfs: fix race between readahead and device replace/removal
From: Filipe Manana

The list of devices is protected by the device_list_mutex, and the device
replace code, in its finishing phase, correctly takes that mutex before
removing the source device from that list. However the readahead code was
iterating that list without acquiring the respective mutex, leading to
crashes later on due to invalid memory accesses:

[125671.831036] general protection fault: [#1] PREEMPT SMP
[125671.832129] Modules linked in: btrfs dm_flakey dm_mod crc32c_generic xor raid6_pq acpi_cpufreq tpm_tis tpm ppdev evdev parport_pc psmouse sg parport processor ser
[125671.834973] CPU: 10 PID: 19603 Comm: kworker/u32:19 Tainted: G W 4.6.0-rc7-btrfs-next-29+ #1
[125671.834973] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS by qemu-project.org 04/01/2014
[125671.834973] Workqueue: btrfs-readahead btrfs_readahead_helper [btrfs]
[125671.834973] task: 8801ac520540 ti: 8801ac918000 task.ti: 8801ac918000
[125671.834973] RIP: 0010:[] [] __radix_tree_lookup+0x6a/0x105
[125671.834973] RSP: 0018:8801ac91bc28 EFLAGS: 00010206
[125671.834973] RAX: RBX: 6b6b6b6b6b6b6b6a RCX:
[125671.834973] RDX: RSI: 000c1bff RDI: 88002ebd62a8
[125671.834973] RBP: 8801ac91bc70 R08: 0001 R09:
[125671.834973] R10: 8801ac91bc70 R11: R12: 88002ebd62a8
[125671.834973] R13: R14: R15: 000c1bff
[125671.834973] FS: () GS:88023fd4() knlGS:
[125671.834973] CS: 0010 DS: ES: CR0: 80050033
[125671.834973] CR2: 0073cae4 CR3: b7723000 CR4: 06e0
[125671.834973] Stack:
[125671.834973] 8801422d5600 8802286bbc00
[125671.834973] 0001 8802286bbc00 000c1bff
[125671.834973] 88002e639eb8 8801ac91bc80 81270541 8801ac91bcb0
[125671.834973] Call Trace:
[125671.834973] [] radix_tree_lookup+0xd/0xf
[125671.834973] [] reada_peer_zones_set_lock+0x3e/0x60 [btrfs]
[125671.834973] [] reada_pick_zone+0x29/0x103 [btrfs]
[125671.834973] [] reada_start_machine_worker+0x129/0x2d3 [btrfs]
[125671.834973] [] btrfs_scrubparity_helper+0x185/0x3aa [btrfs]
[125671.834973] [] btrfs_readahead_helper+0xe/0x10 [btrfs]
[125671.834973] [] process_one_work+0x271/0x4e9
[125671.834973] [] worker_thread+0x1eb/0x2c9
[125671.834973] [] ? rescuer_thread+0x2b3/0x2b3
[125671.834973] [] kthread+0xd4/0xdc
[125671.834973] [] ret_from_fork+0x22/0x40
[125671.834973] [] ? kthread_stop+0x286/0x286

So fix this by taking the device_list_mutex in the readahead code. We
can't use the lighter approach of an rcu_read_lock()/rcu_read_unlock()
pair together with a list_for_each_entry_rcu() call here, because we end
up calling sleeping functions (kzalloc()) in the respective code path.

Signed-off-by: Filipe Manana
---
 fs/btrfs/reada.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/reada.c b/fs/btrfs/reada.c
index 298631ea..8428db7 100644
--- a/fs/btrfs/reada.c
+++ b/fs/btrfs/reada.c
@@ -761,12 +761,14 @@ static void __reada_start_machine(struct btrfs_fs_info *fs_info)

 	do {
 		enqueued = 0;
+		mutex_lock(&fs_devices->device_list_mutex);
 		list_for_each_entry(device, &fs_devices->devices, dev_list) {
 			if (atomic_read(&device->reada_in_flight) <
 			    MAX_IN_FLIGHT)
 				enqueued += reada_start_machine_dev(fs_info,
 								    device);
 		}
+		mutex_unlock(&fs_devices->device_list_mutex);
 		total += enqueued;
 	} while (enqueued && total < 10000);
--
2.7.0.rc3
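For illustration, a toy userspace model of the pattern this patch applies:
iterate the device list only under the same mutex that protects removal.
The names here are hypothetical, not the kernel's:

#include <pthread.h>
#include <stddef.h>

#define TOY_MAX_IN_FLIGHT 8	/* stand-in for MAX_IN_FLIGHT */

struct toy_device {
	struct toy_device *next;
	int reada_in_flight;
};

static pthread_mutex_t device_list_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct toy_device *device_list;

/* Readahead side: walk the device list under the same mutex the
 * replace code takes before unlinking a device, so no entry can be
 * freed under our feet.  An RCU read section would also make the walk
 * safe, but the per-device work may allocate memory and sleep, which
 * is not allowed under rcu_read_lock() -- hence the mutex, as the
 * commit message explains. */
static void toy_start_readahead(void)
{
	pthread_mutex_lock(&device_list_mutex);
	for (struct toy_device *d = device_list; d; d = d->next) {
		if (d->reada_in_flight < TOY_MAX_IN_FLIGHT)
			d->reada_in_flight++;	/* enqueue work for d */
	}
	pthread_mutex_unlock(&device_list_mutex);
}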
kernel 4.5.5 & space_cache=v2 early enospc, forced read-only
Just trying space_cache=v2 on my big backup btrfs, mounted via
space_cache=v2,enospc_debug,nofail,noatime,compress=zlib. Looks like
something got confused during an rsync, which then quickly propagated up
to forcing the fs read-only in the long stack traces below. I'll be happy
to test the new ENOSPC ticket system patches once they seem ready, if it
will help.

btrfs fi usage:

Overall:
    Device size:               249.22TiB
    Device allocated:          211.45TiB
    Device unallocated:         37.77TiB
    Device missing:                0.00B
    Used:                      210.90TiB
    Free (estimated):           38.31TiB  (min: 19.43TiB)
    Data ratio:                     1.00
    Metadata ratio:                 2.00
    Global reserve:            512.00MiB  (used: 0.00B)

Data,single: Size:210.92TiB, Used:210.37TiB
   /dev/sdb   25.23TiB
   /dev/sdc   25.22TiB
   /dev/sdd   35.23TiB
   /dev/sde   35.23TiB
   /dev/sdf   22.50TiB
   /dev/sdg   22.50TiB
   /dev/sdh   22.50TiB
   /dev/sdi   22.50TiB

Metadata,RAID10: Size:274.00GiB, Used:272.47GiB
   /dev/sdb   34.25GiB
   /dev/sdc   34.25GiB
   /dev/sdd   34.25GiB
   /dev/sde   34.25GiB
   /dev/sdf   34.25GiB
   /dev/sdg   34.25GiB
   /dev/sdh   34.25GiB
   /dev/sdi   34.25GiB

System,RAID10: Size:64.00MiB, Used:28.09MiB
   /dev/sdb    8.00MiB
   /dev/sdc    8.00MiB
   /dev/sdd    8.00MiB
   /dev/sde    8.00MiB
   /dev/sdf    8.00MiB
   /dev/sdg    8.00MiB
   /dev/sdh    8.00MiB
   /dev/sdi    8.00MiB

Unallocated:
   /dev/sdb    4.75TiB
   /dev/sdc    4.75TiB
   /dev/sdd    4.75TiB
   /dev/sde    4.75TiB
   /dev/sdf    4.75TiB
   /dev/sdg    4.75TiB
   /dev/sdh    4.75TiB
   /dev/sdi    4.75TiB

[20581.396634] WARNING: CPU: 6 PID: 4639 at fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]()
[20581.396684] BTRFS: block rsv returned -28
[20581.396686] Modules linked in: ipmi_si mpt3sas raid_class scsi_transport_sas dell_rbu nfsv3 nfsv4 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc ext2 intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_generic hmac drbg aesni_intel aes_x86_64 glue_helper lrw gf128mul joydev ablk_helper cryptd iTCO_wdt iTCO_vendor_support evdev ipmi_devintf dcdbas serio_raw pcspkr lpc_ich ipmi_msghandler mfd_core i7core_edac edac_core acpi_power_meter button processor loop autofs4 ext4 crc16 mbcache jbd2 btrfs xor raid6_pq hid_generic usbhid hid sg sd_mod crc32c_intel psmouse uhci_hcd ehci_pci ehci_hcd megaraid_sas ixgbe mdio usbcore ptp usb_common pps_core scsi_mod bnx2 [last unloaded: ipmi_si]
[20581.397148] CPU: 6 PID: 4639 Comm: kworker/u65:6 Tainted: G I 4.5.5 #1
[20581.397260] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[20581.397293] 0006 811f4e82 88081851fa80 0009
[20581.397346] 810459cb a016f717 88081e8c4150 88081851fad8
[20581.397400] 88041e3d2800 8806c7022840 81045a23 a01e5f8c
[20581.397453] Call Trace:
[20581.397481] [] ? dump_stack+0x46/0x59
[20581.397512] [] ? warn_slowpath_common+0x94/0xa9
[20581.397558] [] ? btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]
[20581.397589] [] ? warn_slowpath_fmt+0x43/0x4b
[20581.397632] [] ? unlock_up+0x89/0x103 [btrfs]
[20581.397678] [] ? btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]
[20581.397728] [] ? btrfs_tree_read_unlock+0x5c/0x5e [btrfs]
[20581.397774] [] ? __btrfs_cow_block+0xda/0x45f [btrfs]
[20581.397819] [] ? btrfs_cow_block+0xdd/0x144 [btrfs]
[20581.397863] [] ? btrfs_search_slot+0x285/0x6d7 [btrfs]
[20581.397911] [] ? btrfs_lookup_csum+0x39/0xcc [btrfs]
[20581.397959] [] ? btrfs_csum_file_blocks+0x6a/0x4e0 [btrfs]
[20581.398010] [] ? add_pending_csums.isra.42+0x42/0x5b [btrfs]
[20581.398075] [] ? btrfs_finish_ordered_io+0x331/0x4cb [btrfs]
[20581.398141] [] ? normal_work_helper+0xe2/0x21f [btrfs]
[20581.398174] [] ? process_one_work+0x177/0x2a9
[20581.398203] [] ? worker_thread+0x1e9/0x292
[20581.398232] [] ? rescuer_thread+0x2a5/0x2a5
[20581.398262] [] ? kthread+0xa7/0xaf
[20581.398289] [] ? kthread_parkme+0x16/0x16
[20581.398321] [] ? ret_from_fork+0x3f/0x70
[20581.398349] [] ? kthread_parkme+0x16/0x16
[20581.398377] ---[ end trace 9a28cf840837b232 ]---
[20655.152170] use_block_rsv: 773 callbacks suppressed
[20655.152263] WARNING: CPU: 11 PID: 4814 at fs/btrfs/extent-tree.c:7964 btrfs_alloc_tree_block+0xed/0x3e6 [btrfs]()
[20655.152313] BTRFS: block rsv returned -28
[20655.152808] CPU: 11 PID: 4814 Comm: kworker/u66:7 Tainted: G W I 4.5.5 #1
[20655.152921] Workqueue: btrfs-endio-write btrfs_endio_write_helper [btrfs]
[20655.152953] 0006 811f4e82 8801ad383a80 0009
[20655.153006] 810459cb a016f717 88081e8c4150 8801ad383ad8
[20655.153060] 88041e3d2800
Re: Hot data tracking / hybrid storage
On 2016-05-19 19:23, Henk Slager wrote:
On Thu, May 19, 2016 at 8:51 PM, Austin S. Hemmelgarn wrote:
On 2016-05-19 14:09, Kai Krakow wrote:
On Wed, 18 May 2016 22:44:55 +0000 (UTC), Ferry Toth wrote:
On Tue, 17 May 2016 20:33:35 +0200, Kai Krakow wrote:

Bcache is actually low maintenance, no knobs to turn. Converting to
bcache protective superblocks is a one-time procedure which can be done
online. The bcache devices act as normal HDDs if not attached to a
caching SSD. It's really less pain than you may think, and it's a
solution available now. Converting back later is easy: just detach the
HDDs from the SSDs and use them for some other purpose if you feel like
it later. Having the bcache protective superblock still in place doesn't
hurt then. Bcache is a no-op without a caching device attached.

No, bcache is _almost_ a no-op without a caching device. From a userspace
perspective, it does nothing, but it is still another layer of
indirection in the kernel, which does have a small impact on performance.
The same is true of using LVM with a single volume taking up the entire
partition: it looks almost no different from just using the partition,
but it will perform worse than using the partition directly. I've
actually done profiling of both to figure out base values for the
overhead, and while bcache with no cache device is not as bad as the LVM
example, it can still be a roughly 0.5-2% slowdown (it gets more
noticeable the faster your backing storage is). You also lose the ability
to mount that filesystem directly on a kernel without bcache support
(this may or may not be an issue for you).

The bcache (protective) superblock is in an 8KiB block in front of the
filesystem device. In case the current, non-bcached HDDs use modern
partitioning, you can do a 5-minute remove or add of bcache, without
moving/copying filesystem data. So in case you have a bcache-formatted
HDD that had just 1 primary partition (512-byte logical sectors), the
partition start is at sector 2048 and the filesystem start is at 2064.
Hard-removing bcache (making sure the module is not needed/loaded/used on
the next boot) can be done by changing the start sector of the partition
from 2048 to 2064. In gdisk one has to change the alignment to 16 first,
otherwise it refuses. And of course, also first flush+stop+de-register
bcache for the HDD. The other way around is also possible, i.e. changing
the start sector from 2048 to 2032. So that makes adding bcache to an
existing filesystem a 5-minute action and not a GBs- or TBs-copy action.
It is not online of course, but just one reboot is needed (or just
umount, gdisk, partprobe, add bcache, etc). For RAID setups, one could
just do 1 HDD first.

My argument about the overhead was not about the superblock, it was about
the bcache layer itself. It isn't practical to just access the data
directly if you plan on adding a cache device, because then you couldn't
do so online unless you're going through bcache. This extra layer of
indirection in the kernel does add overhead, regardless of the on-disk
format. Secondarily, having a HDD with just one partition is not a
typical use case, and the argument about the slack space resulting from
the 1MiB alignment only holds true if you're using an MBR instead of a
GPT layout (or, for that matter, almost any other partition table
format), and you're not booting from that disk (because GRUB embeds
itself there). It's also fully possible to have an MBR-formatted disk
which doesn't have any spare space there either (which is how most flash
drives get formatted). This also doesn't change the fact that without
careful initial formatting (on some filesystems it is possible to embed
the bcache SB at the beginning of the FS itself, since many of them have
some reserved space at the beginning of the partition for bootloaders,
and this space doesn't have to exist when mounting the FS) or manual
alteration of the partition, it's not possible to mount the FS on a
system without bcache support.

There is also a tool doing the conversion in-place (I haven't used it
myself, my python(s) had trouble; I could do the partition table edit
much faster/easier): https://github.com/g2p/blocks#bcache-conversion

I actually hadn't known about this tool, thanks for mentioning it.
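For reference, the sector arithmetic behind Henk's 2048/2064/2032 numbers,
assuming 512-byte logical sectors and the 8KiB bcache superblock he
describes (just a sanity-check program, not a conversion tool):

#include <stdio.h>

int main(void)
{
	const unsigned long sector_bytes = 512;      /* logical sector */
	const unsigned long sb_bytes = 8 * 1024;     /* bcache superblock */
	const unsigned long part_start = 2048;       /* common 1MiB alignment */
	const unsigned long sb_sectors = sb_bytes / sector_bytes; /* = 16 */

	/* The filesystem inside a bcache backing partition starts 16
	 * sectors past the partition start, so moving the start there
	 * strips bcache; moving it 16 sectors earlier makes room to add
	 * a superblock in front of an existing filesystem. */
	printf("strip bcache: move start %lu -> %lu\n",
	       part_start, part_start + sb_sectors);  /* 2048 -> 2064 */
	printf("add bcache:   move start %lu -> %lu\n",
	       part_start, part_start - sb_sectors);  /* 2048 -> 2032 */
	return 0;
}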
Re: Hot data tracking / hybrid storage
On 2016-05-19 17:01, Kai Krakow wrote:
On Thu, 19 May 2016 14:51:01 -0400, "Austin S. Hemmelgarn" wrote:

For a point of reference, I've got a pair of 250GB Crucial MX100's (they
cost less than 0.50 USD per GB when I got them and provide essentially
the same power-loss protections that the high-end Intel SSDs do) which
have seen more than 2.5TB of data writes over their lifetime, combined
from at least three different filesystem formats (BTRFS, FAT32, and
ext4), swap space, and LVM management, and the wear-leveling indicator on
each still says they have 100% life remaining; the similar 500GB one I
just recently upgraded in my laptop had seen over 50TB of writes and was
still saying 95% life remaining (and had been for months).

Correction, I hadn't checked recently: the 250G ones have seen about
6.336TB of writes (I hadn't checked for multiple months) and report 90%
remaining life, with about 240 days of power-on time. This overall
equates to about 775MB of writes per hour, and assuming similar write
rates for the remaining life of the SSD, I can still expect roughly 9
years of service from these, which means about 10 years of life given my
usage. That is well beyond what I typically get from a traditional hard
disk at the same price, and far exceeds the typical usable life of most
desktops, laptops, and even some workstation computers. And you have to
also keep in mind, this 775MB/hour of writes is coming from a system that
is running:

* BOINC distributed computing applications (regularly downloading big
  files, and almost constantly writing data)
* Dropbox
* Software builds for almost a dozen different systems (I use Gentoo, so
  _everything_ is built locally)
* Regression testing for BTRFS
* Basic network services (DHCP, DNS, and similar things)
* A tor entry node
* A local mail server (store and forward only, I just use it for
  monitoring messages)

And all of that (except the BTRFS regression testing) is running 24/7,
and that's just the local VMs; it doesn't include the file sharing or SAN
services. Root filesystems for all of these VMs are on the SSDs, as is
the host's root filesystem and swap partition, and many of the data
partitions. And I haven't really done any write optimization, and it's
still less than 1GB/hour of writes to the SSD. The typical user
(including many types of server systems) will be writing much less than
that most of the time.

The smaller Crucials are much worse at that: the MX100 128GB version I
had was specified for 85TB of writes, which I hit after about 12 months
(97% lifetime used according to smartctl) due to excessive write
patterns. I'm not sure how long it would have lasted, but I decided to
swap it for a Samsung 500GB drive and reconfigure my system for much
lighter write patterns. What should I say: I liked the Crucial more.
First, it has an easy lifetime counter in smartctl; Samsung doesn't. And
it had power-loss protection, which Samsung doesn't explicitly mention
(tho I think it has it). At least, according to endurance tests, my
Samsung SSD should take about 1 PB of writes. I've already written 7 TB
if I can trust the smartctl raw value. But I think you cannot compare
specification values to a real endurance test...

I think it says 150TBW for the 500GB 850 EVO. The point was more that
wear-out is less of an issue for a lot of people than many individuals
make it out to be, not me trying to make Crucial sound like an amazing
brand. Yes, one of the Crucial MX100's may not last as long as a Samsung
EVO in a busy mail server or something similar, but for a majority of
people, they will probably outlast the usefulness of the computer.
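As a sanity check of the lifetime arithmetic in this exchange, the figures
quoted above (6.336TB written, 90% life remaining, ~775MB/h) can be
plugged into a plain rate extrapolation, which comes out close to the
nine-year estimate; the inputs are the message's numbers, not measurements:

#include <stdio.h>

int main(void)
{
	const double written_tb = 6.336;   /* host writes so far */
	const double life_left  = 0.90;    /* wear indicator remaining */
	const double rate_mb_h  = 775.0;   /* quoted write rate */

	/* 10% of the rated endurance consumed by 6.336TB implies a
	 * ~63TB rating; extrapolate the remaining 57TB at the quoted
	 * rate (decimal units, as SMART write totals usually are). */
	double rated_tb = written_tb / (1.0 - life_left);
	double left_mb = (rated_tb - written_tb) * 1000.0 * 1000.0;
	double years = left_mb / rate_mb_h / (24.0 * 365.0);

	printf("implied endurance rating: %.1f TB\n", rated_tb);
	printf("life left at %.0f MB/h: %.1f years\n", rate_mb_h, years);
	return 0;
}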
Re: sharing page cache pages between multiple mappings
On Fri, May 20, 2016 at 1:48 AM, Dave Chinner wrote:
> On Thu, May 19, 2016 at 12:17:14PM +0200, Miklos Szeredi wrote:
>> On Thu, May 19, 2016 at 11:05 AM, Michal Hocko wrote:
>> > On Thu 19-05-16 10:20:13, Miklos Szeredi wrote:
>> >> Has anyone thought about sharing pages between multiple files?
>> >>
>> >> The obvious application is for COW filesystems where there are
>> >> logically distinct files that physically share data and could easily
>> >> share the cache as well if there was infrastructure for it.
>> >
>> > FYI this has been discussed at LSFMM this year[1]. I wasn't at the
>> > session so cannot tell you any details but the LWN article covers it
>> > at least briefly.
>>
>> Cool, so it's not such a crazy idea.
>
> Oh, it most certainly is crazy. :P
>
>> Darrick, would you mind briefly sharing your ideas regarding this?
>
> The current line of thought is that we'll only attempt this in XFS on
> inodes that are known to share underlying physical extents, i.e. files
> that have blocks that have been reflinked or deduped. That way we can
> overload the breaking of reflink blocks (via copy on write) with
> unsharing the pages in the page cache for that inode, i.e. shared pages
> can propagate upwards in overlay if it uses reflink for copy-up, and
> writes will then break the sharing with the underlying source without
> overlay having to do anything special.
>
> Right now I'm not sure what mechanism we will use - we want to support
> files that have a mix of private and shared pages, so that implies we
> are not going to be sharing mappings but sharing pages instead.
> However, we've been looking at this as being completely encapsulated
> within the filesystem because it's tightly linked to changes in the
> physical layout of the filesystem, not as general "share this mapping
> between two unrelated inodes" infrastructure. That may change as we dig
> deeper into it...
>
>> The use case I have is fixing overlayfs weird behavior. The following
>> may result in "buf" not matching "data":
>>
>> int fr = open("foo", O_RDONLY);
>> int fw = open("foo", O_RDWR);
>> write(fw, data, sizeof(data));
>> read(fr, buf, sizeof(data));
>>
>> The reason is that "foo" is on a read-only layer, and opening it for
>> read-write triggers copy-up into a read-write layer. However the old,
>> read-only open still refers to the unmodified file.
>>
>> Fixing this properly requires that when opening a file, we don't
>> delegate operations fully to the underlying file, but rather allow
>> sharing of pages from the underlying file until the file is copied up.
>> At that point we switch to sharing pages with the read-write copy.
>
> Unless I'm missing something here (quite possible!), I'm not sure we
> can fix that problem with page cache sharing or reflink. It implies we
> are sharing pages in a downwards direction - private overlay
> pages/mappings from multiple inodes would need to be shared with a
> single underlying shared read-only inode, and I lack the imagination to
> see how that works...

Indeed, reflink doesn't make this work. We could reflink-up on any open
(or on lookup), not just on write; it's a trivial change in overlayfs.
Drawback is slower first open/lookup and space used by duplicate trees
even without modification on the overlay. Not sure if that's a problem in
practice.

I'll think about the generic downwards sharing. For overlayfs it doesn't
need to be per-page, so that might make it a somewhat simpler problem.

Thanks,
Miklos
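For anyone wanting to try the snippet quoted above as-is, here is a
self-contained version of the reproducer; the file name "foo" and the
setup (an overlayfs mount whose lower layer provides the file) are as
described in the quoted text:

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* On ordinary filesystems both descriptors see the same inode and the
 * final check passes; on (unfixed) overlayfs the O_RDWR open triggers
 * a copy-up and the read-only descriptor keeps reading the old lower
 * file, so buf does not match data. */
int main(void)
{
	char data[] = "new contents";
	char buf[sizeof(data)];

	int fr = open("foo", O_RDONLY);
	int fw = open("foo", O_RDWR);
	if (fr < 0 || fw < 0) {
		perror("open");
		return 1;
	}
	if (write(fw, data, sizeof(data)) != (ssize_t)sizeof(data)) {
		perror("write");
		return 1;
	}
	if (read(fr, buf, sizeof(data)) != (ssize_t)sizeof(data)) {
		perror("read");
		return 1;
	}
	printf(memcmp(buf, data, sizeof(data)) == 0 ?
	       "buf matches data\n" : "buf does NOT match data\n");
	return 0;
}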