Re: Copying related snapshots to another server with btrfs send/receive?
On Mon, May 05, 2014 at 03:24:45AM +, Duncan wrote:
> *However*: snapshotting a read-only snapshot and making the new one
> writable is easy enough[1]. Just keep the originals read-only so they
> can be used as parents/clones, and make a second, writable snapshot of
> the first, to do your writable stuff in.
>
> ---
> [1] Snapshotting a snapshot: I'm getting a metaphorical flashing light

I already snapshot ro snapshots as rw snapshots and that works fine.
I actually rely on this in my script:
http://marc.merlins.org/perso/btrfs/post_2014-03-22_Btrfs-Tips_-Doing-Fast-Incremental-Backups-With-Btrfs-Send-and-Receive.html
(skip to the bottom)

# We make a read-write snapshot in case you want to use it for a chroot
# and some testing with a writeable filesystem or want to boot from a
# last good known snapshot.
btrfs subvolume snapshot $src_newsnap $src_newsnaprw
$ssh btrfs subvolume snapshot $dest_pool/$src_newsnap $dest_pool/$src_newsnaprw

Marc
--
A mouse is a device used to point at the xterm you want to type in - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
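The pattern discussed in this message (leave the read-only snapshot alone so it can serve as a send/receive parent, and take a second, writable snapshot of it for any work) can be sketched as a small command builder. This is only an illustration and not part of the script linked above: the function name rw_snapshot_commands and its parameters are hypothetical, and it only builds `btrfs subvolume snapshot` argument lists without running anything.

```python
def rw_snapshot_commands(ro_snap, rw_snap, dest_pool=None, ssh_prefix=None):
    """Build (but do not run) the commands that clone a read-only snapshot
    into a writable one, locally and optionally on a backup host.

    All names are hypothetical. ssh_prefix plays the role of $ssh in the
    shell snippet, for example ["ssh", "backuphost"].
    """
    # Local writable snapshot; the read-only original is left untouched.
    commands = [["btrfs", "subvolume", "snapshot", ro_snap, rw_snap]]
    if dest_pool is not None and ssh_prefix is not None:
        # Same operation on the receiving side, under the destination pool.
        commands.append(list(ssh_prefix) + [
            "btrfs", "subvolume", "snapshot",
            f"{dest_pool}/{ro_snap}", f"{dest_pool}/{rw_snap}",
        ])
    return commands


if __name__ == "__main__":
    for cmd in rw_snapshot_commands("root_ro.1", "root_rw.1",
                                    "/mnt/pool", ["ssh", "backuphost"]):
        print(" ".join(cmd))
```

Returning argument lists rather than running them keeps the sketch safe to try on a machine without btrfs; passing each list to subprocess.run would be the obvious next step.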
Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
Thanks for working on this.
I am running some tests and will let you know.

Anand

On 05/06/2014 02:33 PM, Qu Wenruo wrote:
> Old btrfs can't find a missing btrfs device since there is no
> mechanism for the block layer to inform the fs layer.
>
> But we can use a workaround that only checks the status (using
> request_queue->queue_flags) of every device in a btrfs filesystem when
> calling the dev_info/rm_dev ioctls, since other ioctls do not really
> care about missing devices.
>
> Cc: Anand Jain <anand.j...@oracle.com>
> Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
> ---
>  fs/btrfs/ioctl.c   |  1 +
>  fs/btrfs/volumes.c | 25 ++++++++++++++++++++++++-
>  fs/btrfs/volumes.h |  2 ++
>  3 files changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index 0401397..7680a40 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2606,6 +2606,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
>  		goto out;
>  	}
>  
> +	btrfs_check_dev_missing(root, dev, 1);
>  	di_args->devid = dev->devid;
>  	di_args->bytes_used = dev->bytes_used;
>  	di_args->total_bytes = dev->total_bytes;
> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
> index d241130a..c7d7908 100644
> --- a/fs/btrfs/volumes.c
> +++ b/fs/btrfs/volumes.c
> @@ -1548,9 +1548,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
>  		 * is held.
>  		 */
>  		list_for_each_entry(tmp, devices, dev_list) {
> +			btrfs_check_dev_missing(root, tmp, 0);
>  			if (tmp->in_fs_metadata &&
>  			    !tmp->is_tgtdev_for_dev_replace &&
> -			    !tmp->bdev) {
> +			    (!tmp->bdev || tmp->missing)) {
>  				device = tmp;
>  				break;
>  			}
> @@ -6300,3 +6301,25 @@ int btrfs_scratch_superblock(struct btrfs_device *device)
>  
>  	return 0;
>  }
> +
> +/* If need_lock is set, uuid_mutex will be used */
> +int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
> +			    int need_lock)
> +{
> +	struct request_queue *q;
> +
> +	if (unlikely(!dev || !dev->bdev || !dev->bdev->bd_queue))
> +		return -ENOENT;
> +	q = dev->bdev->bd_queue;
> +
> +	if (need_lock)
> +		mutex_lock(&uuid_mutex);
> +	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
> +	    test_bit(QUEUE_FLAG_DYING, &q->queue_flags)) {
> +		dev->missing = 1;
> +		root->fs_info->fs_devices->missing_devices++;
> +	}
> +	if (need_lock)
> +		mutex_unlock(&uuid_mutex);
> +	return 0;
> +}
> diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
> index 80754f9..47a44af 100644
> --- a/fs/btrfs/volumes.h
> +++ b/fs/btrfs/volumes.h
> @@ -356,6 +356,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
>  int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
>  				struct btrfs_root *extent_root,
>  				u64 chunk_offset, u64 chunk_size);
> +int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
> +			    int need_lock);
>  static inline void btrfs_dev_stat_inc(struct btrfs_device *dev,
>  				      int index)
>  {
Re: How does btrfs fi show show full?
On Tue, May 06, 2014 at 08:10:00PM +, Duncan wrote:
> Marc MERLIN posted on Sun, 04 May 2014 22:50:29 -0700 as excerpted:
>
> > In the second FS:
> > Label: btrfs_pool1 uuid: [...]
> >     Total devices 1 FS bytes used 442.17GiB
> >     devid    1 size 865.01GiB used 751.04GiB path [...]
> >
> > The difference is huge between 'Total used' and 'devid used'.
> >
> > Is btrfs going to fix this on its own, or likely not and I'm stuck
> > doing a full balance (without filters since I'm balancing data and
> > not metadata)?
> >
> > If that helps.
> > legolas:~# btrfs fi df /mnt/btrfs_pool1
> > Data, single: total=734.01GiB, used=435.29GiB
> > System, DUP: total=8.00MiB, used=96.00KiB
> > System, single: total=4.00MiB, used=0.00
> > Metadata, DUP: total=8.50GiB, used=6.74GiB
> > Metadata, single: total=8.00MiB, used=0.00
>
> Definitely helps. The spread is in data.
>
> Try
>
> btrfs balance start -dusage=20 /mnt/btrfs_pool1

So, I had already tried -dusage=50 yesterday, and things now look reasonable:
Label: btrfs_pool1 uuid: 4850ee22-bf32-4131-a841-02abdb4a5ba6
    Total devices 1 FS bytes used 443.22GiB
    devid    1 size 865.01GiB used 514.04GiB path /dev/mapper/cryptroot

> something like -dusage=50 or -dusage=80, likely MUCH faster, but will
> return fewer chunks to unallocated, as well. Still, your spread between
> (fewer) data-total and data-used is high enough, I expect -dusage=20
> will give you pretty good results.

So, on
http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
I wrote:

    In the case above, because the filesystem is only 55% full, I can
    ask balance to rewrite all chunks that are more than 55% full:
    legolas:~# btrfs balance start -dusage=55 /mnt/btrfs_pool1

Did I get this right? I'm not sure I did, since it seems the bigger the
-dusage number, the more work balance has to do.

If I asked for -dusage=85, would it do all chunks that are more than 15% full?

So, do I need to change the text above to say "more than 45% full"?

More generally, does it not make sense to just use the same percentage
in -dusage as the percentage of the total filesystem that is full?

Thanks,
Marc
Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
-------- Original Message --------
Subject: Re: [RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
From: Anand Jain <anand.j...@oracle.com>
To: Qu Wenruo <quwen...@cn.fujitsu.com>
Date: 2014-05-07 16:00

> Thanks for working on this.
> I am running some tests will let you know.
>
> Anand

Thanks for your tests.
So far I have only checked removing the device through the
scsi_device/X:X:X:X/device/delete interface, so if you have other
device removal tests, that would be even better.

Thanks,
Qu

> On 05/06/2014 02:33 PM, Qu Wenruo wrote:
> [snip]
Re: raid0 vs single, and should we allow -mdup by default on SSDs?
Hi Chris and other devs,

Does it really make sense to turn off -mdup on SSDs? I would argue that
it does not. In my case dmcrypt protected me from that, so I'm happy,
but even if I didn't use it, I'd want the protection of -mdup, even if
the protection might only be partial.

On Tue, May 06, 2014 at 05:16:08PM +, Duncan wrote:
> Single only stripes in such extremely large (1 GiB data, quarter-GiB
> metadata, per strip) chunks that it doesn't matter for speed, and then
> only as a result of its chunk allocation policy. If one can define such
> large strips as striping, which it is in a way, but not really in the
> practical sense.

Oh good, I didn't know it was that big.

> The effect of a lost device, then, is more or less random, tho for single
> metadata the effect is likely to be quite large up to total loss, due to
> the damage to the tree. It's not out of thin air that the multi-device

Yes. I totally use either -mdup or -mraid1.

> That contrasts with raid0, where the striping is at sizes well under a
> chunk (memory page size or 4 MiB on x86/amd64 data I believe, tho the
> fact that files under the 16 MiB node size may actually be entirely
> folded into metadata and not have a data extent allocation at all skews
> things for up to the 16 MiB metadata node size), so the definition of
> "small file likely to be recovered" is **MUCH** smaller on raid0 than
> on single.

Great to know, I'll use -m raid1 -d single next time.

> Effectively, raid0 data you're only (relatively) likely to recover files
> smaller than 16 MiB, while single data, it's files smaller than 1 GiB.

Thanks much for that.

On Tue, May 06, 2014 at 07:05:52PM +, Duncan wrote:
> 1) In order to do that, btrfs (I guess mkfs.btrfs in this case) must be
> able to detect that the device *IS* ssd. Depending on the SSD, the
> kernel version, and whether the btrfs is being created direct on
> bare-metal device or on some device layered (lvm or dmcrypt or whatever)
> on top of the bare metal, btrfs may or may not successfully detect that.
> Obviously in your case[1] the ssd wasn't detected.

Indeed. I also found out why my SSD has -mdup:
it's on top of dmcrypt, so btrfs failed to see it was an SSD and gave me
-mdup. Good, that's what I wanted anyway :)

> I believe I've seen you mention using dmcrypt or the like, however, which
> probably doesn't pass whatever is used for ssd protection on thru, thus
> explaining btrfs not seeing it and having to specify it yourself, if you
> wish.

You guessed correctly, congrats.

> 2) The only reason I happen to know about the SSD metadata single-device
> single mode default exception (where metadata otherwise defaults to dup
> mode on single-device, and to raid1 mode on multi-device regardless of
> the media), is as a result of I believe Chris Mason commenting on it in
> an on-list reply. The reasoning given in that reply was not the
> erase-block reason I've seen someone else mention here (and which doesn't
> quite make sense to me, since I don't know why that would make a
> difference), but rather:

Yes. I personally don't think it's a good idea. Basically when having 2
copies, they could still end up on the same erase block, making them
less redundant.
My answer to that is 'so what?' There are plenty of other times where
dup would be useful on an SSD. I really don't see the point of trying to
turn it off by default just because maybe in one case it would not offer
extra protection.

> Some SSD firmware does automatic deduplication and compression. On these
> devices, DUP-mode would almost certainly be stored as a single internal
> data block with two external address references anyway, so it would
> actually be single in any case, and defaulting to single (a) doesn't hide
> that fact, and (b) reduces overhead that's justified for safety
> otherwise, but if the firmware is doing an end run around that safety
> anyway, might as well just shortcut the overhead as well.

If some SSDs do this, let's not punish those who have SSDs that don't.
> However, while the btrfs default will apply to all (detected) ssds, not
> all ssds have firmware that does this internal deduplication!

Exactly.

On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
> Well, assuming that by -d linear you meant -d single. Btrfs doesn't call
> it linear, tho at the data safety level, btrfs single is actually quite
> comparable to mdadm linear. =:^)

Yes, I meant single, sorry :)
(aka linear for mdadm)

> > At the time I used -m raid1 -d raid0, but it sounds like, for slightly
> > extra recoverability, I should have used -m raid1 -d linear (and yes,
> > I understand that one should not consider a -d linear recoverable when
> > a drive went missing).
>
> That appears to be a very good use of either -d raid0 or -d single, yes.
> And since you're apparently not streaming such high resolution video that
> you NEED the raid0, single does indeed give you a somewhat better chance
> at recovery.

zoneminder saves 'video' as a stream of independent small jpegs,
Re: raid0 vs single, and should we allow -mdup by default on SSDs?
On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
> On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
> > That appears to be a very good use of either -d raid0 or -d single, yes.
> > And since you're apparently not streaming such high resolution video that
> > you NEED the raid0, single does indeed give you a somewhat better chance
> > at recovery.
>
> zoneminder saves 'video' as a stream of independent small jpegs, so I'm
> good.
> Actually come to think of it they're so small that they probably all
> ended up in the raid1 metadata. That also means that I'm not getting
> twice the storage space like I planned to. Oh well...

   There's a mount option to change the threshold at which files are
inlined in metadata: max_inline=<bytes>. You could play with that for
this particular use-case.

   Hugo.

--
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
  --- I am but mad north-north-west: when the wind is southerly, I ---
                     know a hawk from a handsaw.
Re: raid0 vs single, and should we allow -mdup by default on SSDs?
On Wed, May 07, 2014 at 09:29:41AM +0100, Hugo Mills wrote:
> On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
> > On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
> > > That appears to be a very good use of either -d raid0 or -d single, yes.
> > > And since you're apparently not streaming such high resolution video
> > > that you NEED the raid0, single does indeed give you a somewhat better
> > > chance at recovery.
> >
> > zoneminder saves 'video' as a stream of independent small jpegs, so I'm
> > good.
> > Actually come to think of it they're so small that they probably all
> > ended up in the raid1 metadata. That also means that I'm not getting
> > twice the storage space like I planned to. Oh well...
>
>    There's a mount option to change the threshold at which files are
> inlined in metadata: max_inline=<bytes>. You could play with that for
> this particular use-case.

Oh cool, thank you.

Marc
Re: How does Suse do live filesystem revert with btrfs?
On Tue, May 06, 2014 at 04:26:48PM +, Duncan wrote:
> Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
>
> > On Mon, May 05, 2014 at 01:36:39AM +0100, Hugo Mills wrote:
> >>    I'm guessing it involves reflink copies of files from the snapshot
> >> back to the original, and then restarting affected services. That's
> >> about the only other thing that I can think of, but it's got loads of
> >> race conditions in it (albeit difficult to hit in most cases, I
> >> suspect).
> >
> > Aaah, right, you can use a script to see the file differences between
> > two snapshots, and then restore that with reflink if you can truly get a
> > list of all changed files.
> > However, that is indeed not atomic at all, even if faster than rsync.
>
> Would send/receive help in such a script?

Not really, you still end up with a new snapshot that you can't live
switch to.

It's really either
1) reboot
2) use cp --reflink to copy a list of changed files (as well as rm to
delete the ones that were removed).

I'm currently using btrfs-diff (below), which shows changed files but
does not show deleted files.

Is there something better that would show me which files changed and how
between 2 snapshots?

btrfs-diff:
-----
#!/bin/bash

usage() { echo "$@" >&2; echo "Usage: $0 <older-snapshot> <newer-snapshot>" >&2; exit 1; }

[ $# -eq 2 ] || usage "Incorrect invocation";
SNAPSHOT_OLD=$1;
SNAPSHOT_NEW=$2;

[ -d "$SNAPSHOT_OLD" ] || usage "$SNAPSHOT_OLD does not exist";
[ -d "$SNAPSHOT_NEW" ] || usage "$SNAPSHOT_NEW does not exist";

# Ask for changes newer than a generation the old snapshot has not reached,
# so that only the "transid marker was N" line is printed.
OLD_TRANSID=`btrfs subvolume find-new "$SNAPSHOT_OLD" 999`
OLD_TRANSID=${OLD_TRANSID#transid marker was }
[ -n "$OLD_TRANSID" -a "$OLD_TRANSID" -gt 0 ] || usage "Failed to find generation for $SNAPSHOT_OLD"

# List everything in the new snapshot changed since that generation,
# drop the trailing marker line and keep only the path column.
btrfs subvolume find-new "$SNAPSHOT_NEW" $OLD_TRANSID | sed '$d' | cut -f17- -d' ' | sort | uniq
-----

Thanks,
Marc
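The gap mentioned in this message (find-new reports changed files but not deleted ones) can be covered, slowly, by comparing the two snapshot trees path by path. What follows is a minimal sketch under that assumption, not a btrfs-aware tool: it walks both directories with the standard library only, the function names are hypothetical, and on large snapshots it will be far slower than anything based on btrfs metadata.

```python
import os


def list_relative_paths(root):
    """Return the set of file and directory paths under root, relative to root."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths


def deleted_between(old_snapshot, new_snapshot):
    """Paths that exist in old_snapshot but no longer exist in new_snapshot."""
    old_paths = list_relative_paths(old_snapshot)
    new_paths = list_relative_paths(new_snapshot)
    return sorted(old_paths - new_paths)


if __name__ == "__main__":
    import sys
    for path in deleted_between(sys.argv[1], sys.argv[2]):
        print(path)
```

Combined with the output of btrfs-diff, this would give a list of files to copy with cp --reflink and a list of files to rm; like the reflink approach itself, it is not atomic.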
Re: btrfs on software RAID0
On Tue, May 06, 2014 at 09:02:46AM +0200, john terragon wrote:
> just one last doubt: why do you use --align-payload=1024? (or 8912)
> Cryptsetup man says that the default for the payload alignment is 2048
> (512-byte sectors). So, it's already aligned by default to 4K-byte
> physical sectors (if that was your concern). Am I missing something?

With 4K sectors, I agree that 2048 would be better. What I was trying to
do there is avoid write amplification.

After reading
http://wiki.drewhess.com/wiki/Creating_an_encrypted_filesystem_on_a_partition
I went with
mdadm --create /dev/md8 --level=5 --raid-devices=5 /dev/sd[abdef]1 --chunk=256 --bitmap=/boot/bitmap-md8
which I believe required me to use
cryptsetup luksFormat --align-payload=1024 -s 256 -c aes-xts-plain64 /dev/md8
(that was with 5 drives, or 4 drives with data).

Would you agree with the math?
If so, for 4K sector sizes, if we have to use align-payload=1024, in
turn I'd have to use --chunk=512.

Does that sound right?

Thanks,
Marc
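The arithmetic behind this question can be written out explicitly. The sketch below assumes that --align-payload is counted in 512-byte sectors (as the quoted man page text says), that mdadm --chunk is in KiB, and that the goal is for the LUKS payload to start on a full-stripe boundary of the RAID5 array; whether full-stripe alignment is the right target is exactly what is being asked in the thread, so treat this as a worked calculation and not as an answer. The function names are hypothetical, and the real payload offset also depends on the LUKS header size (cryptsetup luksDump reports it).

```python
SECTOR_BYTES = 512  # --align-payload is expressed in 512-byte sectors


def alignment_bytes(align_payload_sectors):
    """Boundary, in bytes, that the payload offset is rounded up to."""
    return align_payload_sectors * SECTOR_BYTES


def full_stripe_bytes(chunk_kib, data_disks):
    """One chunk on every data disk (parity excluded)."""
    return chunk_kib * 1024 * data_disks


def guarantees_stripe_alignment(align_payload_sectors, chunk_kib, data_disks):
    """True if every multiple of the alignment boundary is also a
    multiple of the full stripe size."""
    return alignment_bytes(align_payload_sectors) % full_stripe_bytes(chunk_kib, data_disks) == 0


if __name__ == "__main__":
    # 5-device RAID5 from the message: 4 data disks, 256 KiB chunks.
    for sectors in (1024, 2048):
        print(sectors, "sectors =", alignment_bytes(sectors) // 1024, "KiB boundary,",
              "full stripe =", full_stripe_bytes(256, 4) // 1024, "KiB,",
              "stripe-aligned:", guarantees_stripe_alignment(sectors, 256, 4))
```

With these assumptions, 1024 sectors is a 512 KiB boundary, which is a multiple of the 256 KiB chunk but not of the 1024 KiB full stripe, while the 2048-sector default is exactly one full stripe.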
Re: How does btrfs fi show show full?
On 2014/05/07 09:59 AM, Marc MERLIN wrote:
> [snip]
> Did I get this right? I'm not sure I did, since it seems the bigger the
> -dusage number, the more work balance has to do.
>
> If I asked -dusage=85, it would do all chunks that are more than 15%
> full?

-dusage=85 balances all chunks that are up to 85% full. The higher the
number, the more work that needs to be done.

> So, do I need to change the text above to say more than 45% full ?
>
> More generally, does it not make sense to just use the same percentage
> in -dusage than the percentage of total filesystem full?
>
> Thanks,
> Marc

Separately, Duncan has made me realise my "halfway up" algorithm is not
very good - it was probably just "good enough" at the time and worked
well enough that I wasn't prompted to analyse it further.

Doing a simulation with randomly-semi-filled chunks, df at 55%, and
chunk utilisation at 86%, -dusage=55 balances 30% of the chunks, almost
perfectly bringing chunk utilisation down to 56%. In my algorithm I
would have used -dusage=70, which in my simulation would have balanced
34% of the chunks - but bringing chunk utilisation down to 55% - a bit
of wasted effort and unnecessary SSD wear.

I think now that I need to experiment with a much lower -dusage value
and perhaps to repeat the balance with the df value (55 in the example)
if the chunk usage is still too high. Getting an optimal first value
algorithmically might prove a challenge - I might just end up picking
some arbitrary percentage point below the df value.

Pathological use-cases still apply however (for example if all chunks
except one are exactly 54% full). The up-side is that if the algorithm
is applied regularly (as in scripted and scheduled) then the situation
will always be that the majority of chunks are going to be relatively
full, avoiding the pathological use-case.

--
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97
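The relationship described in this message (a higher -dusage value selects more chunks and therefore means more work) can be illustrated with a toy model. This is a sketch only and not the simulation mentioned above: the chunk fill percentages are made up, the function names are hypothetical, the selected data is assumed to repack perfectly into full chunks, and treating the threshold as exclusive (chunks strictly below N percent) is an assumption the thread does not settle.

```python
def chunks_selected(chunk_usage_pct, dusage):
    """Chunks a -dusage=N filter would pick in this model: those below N percent full."""
    return [usage for usage in chunk_usage_pct if usage < dusage]


def balance_estimate(chunk_usage_pct, dusage):
    """Return (chunks_rewritten, chunks_freed), assuming perfect repacking."""
    selected = chunks_selected(chunk_usage_pct, dusage)
    # Data from the selected chunks is packed into as few full chunks as possible.
    needed = (sum(selected) + 99) // 100
    return len(selected), len(selected) - needed


if __name__ == "__main__":
    chunks = [10, 20, 30, 50, 70, 95]  # made-up per-chunk fill percentages
    for dusage in (20, 55, 85):
        rewritten, freed = balance_estimate(chunks, dusage)
        print(f"-dusage={dusage}: rewrites {rewritten} chunks, frees {freed}")
```

In this model, raising the threshold from 55 to 85 rewrites one more chunk and frees one more, which is the trade-off between work done and space returned that the thread is about.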
Re: Using mount -o bind vs mount -o subvol=vol
On Mon, May 05, 2014 at 02:12:30AM +, Duncan wrote:
> Marc MERLIN posted on Sat, 03 May 2014 17:47:32 -0700 as excerpted:

Just as an FYI, like (likely) most subscribers, I do prefer Cc on
replies. Without that, I'm much less likely to see your message in a
timely manner, or at all if I'm behind on Email.

> TL;DR: Put simply, with certain sometimes major exceptions, IMO
> subvolumes are /mostly/ a solution looking for a problem. In the
> /general/ case, I don't see the point and personally STRONGLY prefer
> multiple independent partitions for their much stronger data safety and
> mounting/backup flexibility. That's why I use independent partitions,
> here.

I'm a partitions guy, but now that I have subvolumes which can be
snapshotted/backed up independently, I'm much happier with a single
shared pool.
Look at a btrfs pool like an LVM pool, except more flexible.

To each their own I guess.

> 1) Multiple subvolumes on a common filesystem share the filesystem tree-
> and super-structure. If something happens to that filesystem, you had
> all your data eggs in that one basket and the bottom just dropped out of
> it! If you can't recover, kiss **ALL** those data eggs goodbye!

Backups :)
(and having your booting filesystem on a different pool from your data
pool).

> 3) Filesystem size and time to complete whole-filesystem operations such
> as balance, scrub and check are directly related; the larger the
> filesystem, the longer such operations take. There are reports here of
> balances taking days on multi-terabyte filesystems, and double-digit
> hours isn't unusual at all.

True, but if I have a 10TB array, I'm not going to cut it into 10 1TB
arrays just for that.

> Now ask yourself, how likely are you to routinely run a scrub or balance
> as preventive maintenance if you know it's going to take the entire day
> to finish? Here, the times are literally so trivial that I can and do
> run a full filesystem rebalance to time it and make this point, and
> maintenance such as scrub or balance simply ceases to be an issue.
It runs nightly from cron on my laptop. 1TB filesystem on SSD, no sweat.

> 4) Many distros are using btrfs subvolumes on a single btrfs storage
> pool the way they formerly used LVM volume groups, as a common storage
> pool allowing them the flexibility to (re)allocate space to whatever lvm
> volume or btrfs subvolume needs it.

Yep.

> OTOH, for users and distros with a pretty good idea of what their
> allocations are going to look like, generally due to the experience
> they've gained over the years, that extra flexibility isn't a big benefit

You and me yes, most other people no. And to be honest, I've been doing
this for 20 years, and my guesses are not always right 10 years later on
a machine that's still running :) (of which I have several)

> 6) Subvolumes can be used to control snapshotting since snapshots stop
> at subvolume boundaries. In the presence of point #5 storage pools, and
> given the reality of btrfs NOCOW attribute behavior when mixed with
> snapshots, subvolumes become an important tool for limiting snapshot
> coverage area, in particular, for demarcating areas that should NOT be
> snapshotted when the filesystem or parent subvolume is snapshotted, due
> for instance to the horrible interaction between large
> heavy-internal-rewrite files and COW, which means they should be set
> NOCOW, coupled with the horrible interaction between NOCOW on such files
> and snapshotting.

Yep.

> Similarly, subvolumes and their boundaries can be used to set borders for
> frequency or timing of snapshotting, say snapshotting the general
> root/system tree before updates, while snapshotting /home hourly.

Yep.

> Point #6 is, I'd argue, one of the few legitimate use-cases for
> subvolumes as opposed to independent filesystems, and it actually loses
> relevancy if #4 is subsumed to point #1 and #3, already. However, given
> the reality of popular distro btrfs layouts and usage, #4 is in practice
> overruling all the others in many distro-default btrfs deployments today,
> and #6 then becomes relevant.
subvolumes are also used as units of backup for btrfs send.

> So my vote would be, for example (modified slightly for posting from my
> own mounts):
>
> mount /dev/sda5 /
> mount /dev/sda4 /var/log
> mount /dev/sda6 /home

On my laptop:
/dev/mapper/cryptroot on / type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /usr type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /var type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /home type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /tmp type btrfs (rw,noexec,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/cryptroot on /var/local/nobckd2 type btrfs (rw,noatime,compress=lzo,ssd,discard,space_cache)
/dev/mapper/disk2 on /var/local/space type btrfs (rw,noatime,compress=lzo,discard,space_cache)
/dev/mapper/cryptroot
[PATCH 2/3] Crypto: xxhash: add tests
Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 crypto/testmgr.c | 10 ++++++++++
 crypto/testmgr.h | 33 +++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+)

diff --git a/crypto/testmgr.c b/crypto/testmgr.c
index dc3cf35..27ba702 100644
--- a/crypto/testmgr.c
+++ b/crypto/testmgr.c
@@ -3153,6 +3153,16 @@ static const struct alg_test_desc alg_test_descs[] = {
 			}
 		}
 	}, {
+		.alg = "xxh32",
+		.test = alg_test_hash,
+		.fips_allowed = 1,
+		.suite = {
+			.hash = {
+				.vecs = xxh32_tv_template,
+				.count = XXH32_TEST_VECTORS
+			}
+		}
+	}, {
 		.alg = "zlib",
 		.test = alg_test_pcomp,
 		.fips_allowed = 1,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 3db83db..8e56884 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -26660,6 +26660,39 @@ static struct hash_testvec michael_mic_tv_template[] = {
 	}
 };
 
+#define XXH32_TEST_VECTORS 3
+
+static struct hash_testvec xxh32_tv_template[] = {
+	{
+		.plaintext = "\x9e",
+		.psize = 1,
+		.digest = "\xe5\xbe\x5c\xb8",
+	},
+	{
+		.plaintext = "\x9e\xff\x1f\x4b\x5e\x53\x2f\xdd"
+			     "\xb5\x54\x4d\x2a\x95\x2b",
+		.psize = 14,
+		.digest = "\xb4\x0a\xaa\xe5",
+	},
+	{
+		.plaintext = "\x9e\xff\x1f\x4b\x5e\x53\x2f\xdd"
+			     "\xb5\x54\x4d\x2a\x95\x2b\x57\xae"
+			     "\x5d\xba\x74\xe9\xd3\xa6\x4c\x98"
+			     "\x30\x60\xc0\x80\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00\x00\x00\x00"
+			     "\x00\x00\x00\x00\x00",
+		.psize = 101,
+		.digest = "\x12\xa4\x1a\x1f",
+	}
+};
+
 /*
  * CRC32C test vectors
  */
--
1.8.1.4
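The three test vectors in this patch can be cross-checked outside the kernel. The sketch below is a plain Python rendering of the 32-bit xxHash algorithm written for this purpose, not the kernel's crypto module; it assumes a seed of 0 and that the .digest bytes are the 32-bit result stored little-endian, neither of which the patch states explicitly. The names xxh32, SANITY and check_vectors are chosen here for illustration.

```python
import struct

# xxHash32 prime constants
P1, P2, P3, P4, P5 = 2654435761, 2246822519, 3266489917, 668265263, 374761393
MASK = 0xFFFFFFFF


def _rotl(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK


def _round(acc, lane):
    """Mix one little-endian 32-bit lane into an accumulator."""
    acc = (acc + lane * P2) & MASK
    return (_rotl(acc, 13) * P1) & MASK


def xxh32(data, seed=0):
    n, i = len(data), 0
    if n >= 16:
        # Four accumulators consume the input in 16-byte stripes.
        v = [(seed + P1 + P2) & MASK, (seed + P2) & MASK, seed & MASK, (seed - P1) & MASK]
        while i + 16 <= n:
            lanes = struct.unpack_from("<4I", data, i)
            v = [_round(acc, lane) for acc, lane in zip(v, lanes)]
            i += 16
        h = (_rotl(v[0], 1) + _rotl(v[1], 7) + _rotl(v[2], 12) + _rotl(v[3], 18)) & MASK
    else:
        h = (seed + P5) & MASK
    h = (h + n) & MASK
    # Remaining 4-byte words, then remaining single bytes.
    while i + 4 <= n:
        (lane,) = struct.unpack_from("<I", data, i)
        h = (_rotl((h + lane * P3) & MASK, 17) * P4) & MASK
        i += 4
    while i < n:
        h = (_rotl((h + data[i] * P5) & MASK, 11) * P1) & MASK
        i += 1
    # Final avalanche.
    h ^= h >> 15
    h = (h * P2) & MASK
    h ^= h >> 13
    h = (h * P3) & MASK
    h ^= h >> 16
    return h


# The 101-byte plaintext from the patch: 28 listed bytes followed by 73 zero bytes.
SANITY = bytes.fromhex("9eff1f4b5e532fddb5544d2a952b57ae5dba74e9d3a64c983060c080") + bytes(73)
# (psize, digest bytes as hex) copied from xxh32_tv_template
VECTORS = [(1, "e5be5cb8"), (14, "b40aaae5"), (101, "12a41a1f")]


def check_vectors():
    """Compare the little-endian encoding of each hash with the patch's digest."""
    return all(struct.pack("<I", xxh32(SANITY[:size])).hex() == digest
               for size, digest in VECTORS)


if __name__ == "__main__":
    print("vectors match:", check_vectors())
```

The shorter vectors are prefixes of the 101-byte one, which is why a single buffer is sliced for all three checks.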
[PATCH 3/3] Btrfs: add another checksum algorithm xxhash
xxHash is an extremely fast non-cryptographic hash algorithm, working at
speeds close to RAM limits.[1]

xxhash is also a 32-bit hash, the same size as crc32.

This modifies btrfs's checksum API a bit and adopts xxhash as an
alternative checksum algorithm.

Note: We need to update the btrfs-progs side as well to set it up.

[1]: https://code.google.com/p/xxhash/

Signed-off-by: Liu Bo <bo.li@oracle.com>
---
 fs/btrfs/Kconfig            |  22
 fs/btrfs/compression.c      |   6 +--
 fs/btrfs/ctree.h            |  12 +++--
 fs/btrfs/dir-item.c         |  10 ++--
 fs/btrfs/disk-io.c          | 126
 fs/btrfs/disk-io.h          |   2 -
 fs/btrfs/extent-tree.c      |  43 ++-
 fs/btrfs/file-item.c        |   9 ++--
 fs/btrfs/free-space-cache.c |  15 +++--
 fs/btrfs/hash.c             |  75 --
 fs/btrfs/hash.h             |  22
 fs/btrfs/inode-item.c       |   6 +--
 fs/btrfs/inode.c            |  16 +++--
 fs/btrfs/props.c            |  37 +++--
 fs/btrfs/props.h            |   3 +-
 fs/btrfs/scrub.c            |  70 +++-
 fs/btrfs/send.c             |   7 ++-
 fs/btrfs/super.c            |   9 ++--
 fs/btrfs/tree-log.c         |   2 +-
 19 files changed, 331 insertions(+), 161 deletions(-)

diff --git a/fs/btrfs/Kconfig b/fs/btrfs/Kconfig
index a66768e..ef45456 100644
--- a/fs/btrfs/Kconfig
+++ b/fs/btrfs/Kconfig
@@ -2,6 +2,7 @@ config BTRFS_FS
 	tristate "Btrfs filesystem support"
 	select CRYPTO
 	select CRYPTO_CRC32C
+	select CRYPTO_XXH32
 	select ZLIB_INFLATE
 	select ZLIB_DEFLATE
 	select LZO_COMPRESS
@@ -88,3 +89,24 @@ config BTRFS_ASSERT
 	  any of the assertions trip.  This is meant for btrfs developers only.
 
 	  If unsure, say N.
+
+choice
+	prompt "choose checksum algorithm"
+	default BTRFS_CRC32C
+	help
+	  This option allows to select a checksum algorithm
+
+config BTRFS_CRC32C
+	depends on CRYPTO_CRC32C
+	bool "BTRFS_CRC32C"
+	help
+	  crc32c
+
+config BTRFS_XXH32
+	depends on CRYPTO_XXH32
+	bool "BTRFS_XXH32"
+	help
+	  xxhash
+
+endchoice
+
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index d43c544..889b0f1 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -41,6 +41,7 @@
 #include "compression.h"
 #include "extent_io.h"
 #include "extent_map.h"
+#include "hash.h"
 
 struct compressed_bio {
 	/* number of bios pending for this compressed extent */
@@ -114,17 +115,16 @@ static int check_compressed_csum(struct inode *inode,
 	char *kaddr;
 	u32 csum;
 	u32 *cb_sum = &cb->sums;
+	struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info;
 
 	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM)
 		return 0;
 
 	for (i = 0; i < cb->nr_pages; i++) {
 		page = cb->compressed_pages[i];
-		csum = ~(u32)0;
 
 		kaddr = kmap_atomic(page);
-		csum = btrfs_csum_data(kaddr, csum, PAGE_CACHE_SIZE);
-		btrfs_csum_final(csum, (char *)&csum);
+		btrfs_csum_data(fs_info, kaddr, PAGE_CACHE_SIZE, (char *)&csum);
 		kunmap_atomic(kaddr);
 
 		if (csum != *cb_sum) {
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba6b885..cbb6533 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -176,12 +176,16 @@ struct btrfs_ordered_sum;
 /* 32 bytes in various csum fields */
 #define BTRFS_CSUM_SIZE 32
 
-/* csum types */
+/*
+ * csum types,
+ * - 4 bytes for CRC32(crc32c)
+ * - 4 bytes for XXH32(xxhash)
+ */
 #define BTRFS_CSUM_TYPE_CRC32	0
+#define BTRFS_CSUM_TYPE_XXH32	1
 
-static int btrfs_csum_sizes[] = { 4, 0 };
+static int btrfs_csum_sizes[] = { 4, 4, 0 };
 
-/* four bytes for CRC32 */
 #define BTRFS_EMPTY_DIR_SIZE 0
 
 /* spefic to btrfs_map_block(), therefore not in include/linux/blk_types.h */
@@ -1688,6 +1692,8 @@ struct btrfs_fs_info {
 
 	struct semaphore uuid_tree_rescan_sem;
 	unsigned int update_uuid_tree_gen:1;
+
+	struct crypto_shash *tfm;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index a0691df..1332858 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -87,7 +87,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
 
 	key.objectid = objectid;
 	btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
-	key.offset = btrfs_name_hash(name, name_len);
+	key.offset = btrfs_name_hash(root->fs_info, name, name_len);
 
 	data_size = sizeof(*dir_item) + name_len + data_len;
 	dir_item = insert_with_overflow(trans, root, path, &key, data_size,
@@ -138,7 +138,7 @@ int btrfs_insert_dir_item(struct btrfs_trans_handle *trans, struct btrfs_root
 
 	key.objectid = btrfs_ino(dir);
 	btrfs_set_key_type(&key, BTRFS_DIR_ITEM_KEY);
-	key.offset = btrfs_name_hash(name, name_len);
[PATCH] Btrfs-progs: add xxhash
From: root root@localhost.localdomain

Signed-off-by: Liu Bo bo.li@oracle.com
---
 Makefile  |   4 +-
 crc32c.h  |   4 +-
 disk-io.c |   2 +-
 hash.h    |   2 +-
 xxhash.c  | 448 +
 xxhash.h  | 171 +++
 6 files changed, 626 insertions(+), 5 deletions(-)
 create mode 100644 xxhash.c
 create mode 100644 xxhash.h

diff --git a/Makefile b/Makefile
index 369df6c..1d70bc9 100644
--- a/Makefile
+++ b/Makefile
@@ -16,10 +16,10 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
 	       cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \
 	       cmds-property.o cmds-dedup.o
 libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \
-		   uuid-tree.o
+		   uuid-tree.o xxhash.o
 libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \
 	       crc32c.h list.h kerncompat.h radix-tree.h extent-cache.h \
-	       extent_io.h ioctl.h ctree.h btrfsck.h
+	       extent_io.h ioctl.h ctree.h btrfsck.h xxhash.h
 TESTS = fsck-tests.sh

 INSTALL = install
diff --git a/crc32c.h b/crc32c.h
index c552ef6..6dd0ce2 100644
--- a/crc32c.h
+++ b/crc32c.h
@@ -25,9 +25,11 @@
 #include <btrfs/kerncompat.h>
 #endif /* BTRFS_FLAT_INCLUDES */

+#include "xxhash.h"
+
 u32 crc32c_le(u32 seed, unsigned char const *data, size_t length);
 void crc32c_optimization_init(void);

-#define crc32c(seed, data, length) crc32c_le(seed, (unsigned char const *)data, length)
+#define crc32c(seed, data, length) XXH32(data, length, 0)
 #define btrfs_crc32c crc32c
 #endif
diff --git a/disk-io.c b/disk-io.c
index 19b95a7..2c72f7f 100644
--- a/disk-io.c
+++ b/disk-io.c
@@ -67,7 +67,7 @@ u32 btrfs_csum_data(struct btrfs_root *root, char *data, u32 seed, size_t len)

 void btrfs_csum_final(u32 crc, char *result)
 {
-	*(__le32 *)result = ~cpu_to_le32(crc);
+	*(__le32 *)result = cpu_to_le32(crc);
 }

 static int __csum_tree_block_size(struct extent_buffer *buf, u16 csum_size,
diff --git a/hash.h b/hash.h
index c0b88a1..2d1a71d 100644
--- a/hash.h
+++ b/hash.h
@@ -22,6 +22,6 @@

 static inline u64 btrfs_name_hash(const char *name, int len)
 {
-	return ~(crc32c((u32)(~0), name, len));
+	return crc32c((u32)(~0), name, len);
 }
 #endif
diff --git a/xxhash.c b/xxhash.c
new file mode 100644
index 0000000..f855a58
--- /dev/null
+++ b/xxhash.c
@@ -0,0 +1,448 @@
+/*
+xxHash - Fast Hash algorithm
+Copyright (C) 2012-2014, Yann Collet.
+BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions are
+met:
+
+* Redistributions of source code must retain the above copyright
+notice, this list of conditions and the following disclaimer.
+* Redistributions in binary form must reproduce the above
+copyright notice, this list of conditions and the following disclaimer
+in the documentation and/or other materials provided with the
+distribution.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+You can contact the author at :
+- xxHash source repository : http://code.google.com/p/xxhash/
+*/
+
+
+//**
+// Tuning parameters
+//**
+// Unaligned memory access is automatically enabled for common CPU, such as x86.
+// For others CPU, the compiler will be more cautious, and insert extra code to ensure aligned access is respected.
+// If you know your target CPU supports unaligned memory access, you want to force this option manually to improve performance.
+// You can also enable this parameter if you know your input data will always be aligned (boundaries of 4, for U32).
+#if defined(__ARM_FEATURE_UNALIGNED) || defined(__i386) || defined(_M_IX86) || defined(__x86_64__) || defined(_M_X64)
+# define XXH_USE_UNALIGNED_ACCESS 1
+#endif
+
+// XXH_ACCEPT_NULL_INPUT_POINTER :
+// If the input pointer is a null pointer, xxHash default behavior is to trigger a memory access error, since it is a bad pointer.
+// When this option is enabled, xxHash output for null input pointers will be the same as a null-length
[PATCH 1/3] Crypto: add xxhash algorithm
This will be used in btrfs, and maybe in others in the future.

Signed-off-by: Liu Bo bo.li@oracle.com
---
 crypto/Kconfig          |   7 +
 crypto/Makefile         |   1 +
 crypto/xxhash.c         | 383
 include/crypto/xxhash.h | 209 ++
 4 files changed, 600 insertions(+)
 create mode 100644 crypto/xxhash.c
 create mode 100644 include/crypto/xxhash.h

diff --git a/crypto/Kconfig b/crypto/Kconfig
index ce4012a..2e56de0 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -622,6 +622,13 @@ config CRYPTO_GHASH_CLMUL_NI_INTEL
 	  GHASH is message digest algorithm for GCM (Galois/Counter Mode).
 	  The implementation is accelerated by CLMUL-NI of Intel.

+config CRYPTO_XXH32
+	tristate "XXHASH digest algorithm"
+	select CRYPTO_HASH
+	help
+	  xxHash - Fast Hash Algorithm
+	  source repository : http://code.google.com/p/xxhash/
+
 comment "Ciphers"

 config CRYPTO_AES
diff --git a/crypto/Makefile b/crypto/Makefile
index 38e64231..7c3f363 100644
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -97,6 +97,7 @@ obj-$(CONFIG_CRYPTO_GHASH) += ghash-generic.o
 obj-$(CONFIG_CRYPTO_USER_API) += af_alg.o
 obj-$(CONFIG_CRYPTO_USER_API_HASH) += algif_hash.o
 obj-$(CONFIG_CRYPTO_USER_API_SKCIPHER) += algif_skcipher.o
+obj-$(CONFIG_CRYPTO_XXH32) += xxhash.o

 #
 # generic algorithms and the async_tx api
diff --git a/crypto/xxhash.c b/crypto/xxhash.c
new file mode 100644
index 0000000..b84c7cf
--- /dev/null
+++ b/crypto/xxhash.c
@@ -0,0 +1,383 @@
+/*
+ * xxHash - Fast Hash algorithm
+ * Copyright (C) 2012-2014, Yann Collet.
+ * BSD 2-Clause License (http://www.opensource.org/licenses/bsd-license.php)
+
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are met:
+
+ * Redistributions of source code must retain the above copyright notice,
+ * this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following disclaimer
+ * in the documentation and/or other materials provided with the distribution.
+
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+
+ * You can contact the author at :
+ * xxHash source repository : http://code.google.com/p/xxhash/
+ */
+
+#include <crypto/internal/hash.h>
+#include <crypto/xxhash.h>
+#include <linux/string.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/types.h>
+#include <linux/slab.h>
+
+static inline u32 XXH_readLE32_align(const u32 * ptr, XXH_endianess endian,
+				     XXH_alignment align)
+{
+	if (align == XXH_unaligned)
+		return endian ==
+		    XXH_littleEndian ? A32(ptr) : XXH_swap32(A32(ptr));
+	else
+		return endian == XXH_littleEndian ? *ptr : XXH_swap32(*ptr);
+}
+
+static inline u32 XXH_readLE32(const u32 * ptr, XXH_endianess endian)
+{
+	return XXH_readLE32_align(ptr, endian, XXH_unaligned);
+}
+
+/* Simple Hash Functions */
+static inline u32 XXH32_endian_align(const void *input, int len, u32 seed,
+				     XXH_endianess endian,
+				     XXH_alignment align)
+{
+	const u8 *p = (const u8 *)input;
+	const u8 *const bEnd = p + len;
+	u32 h32;
+
+#ifdef XXH_ACCEPT_NULL_INPUT_POINTER
+	if (p == NULL) {
+		len = 0;
+		p = (const u8 *)(size_t) 16;
+	}
+#endif
+	if (len >= 16) {
+		const u8 *const limit = bEnd - 16;
+		u32 v1 = seed + PRIME32_1 + PRIME32_2;
+		u32 v2 = seed + PRIME32_2;
+		u32 v3 = seed + 0;
+		u32 v4 = seed - PRIME32_1;
+		u32 tmp;
+
+		do {
+			tmp = XXH_readLE32_align((const u32 *)p, endian, align);
+			v1 += tmp * PRIME32_2;
+			v1 = XXH_rotl32(v1, 13);
+			v1 *= PRIME32_1;
+			p += 4;
+
+			tmp = XXH_readLE32_align((const u32 *)p, endian, align);
+
[RFC PATCH 0/3] Btrfs: add xxhash algorithm
xxHash is an extremely fast non-cryptographic hash algorithm, working at
speeds close to RAM limits.[1]

xxhash is a 32-bit hash, the same size as crc32.

Here is the hash comparison extracted from the link[1]:
(single thread, Windows Seven 32 bits, using Open Source's SMHasher on a
Core 2 Duo @3GHz)

Name     Speed      Q.Score  Author
xxHash   5.4 GB/s   10
CRC32    0.43 GB/s  9

This patch set adds xxhash to the Linux kernel, then modifies btrfs's
checksum API a bit and adopts xxhash as an alternative checksum algorithm.

At this very first RFC stage, I have only run xfstests to make sure it
works. A set of performance tests will follow.

Note: We need to update the btrfs-progs side as well to set it up; I attach
a hacky patch just for users to play with ;-)

[1]: https://code.google.com/p/xxhash/

Liu Bo (3):
  Crypto: add xxhash algorithm
  Crypto: xxhash: add tests
  Btrfs: add another checksum algorithm xxhash

 crypto/Kconfig              |   7 +
 crypto/Makefile             |   1 +
 crypto/testmgr.c            |  10 ++
 crypto/testmgr.h            |  33
 crypto/xxhash.c             | 383
 fs/btrfs/Kconfig            |  22 +++
 fs/btrfs/compression.c      |   6 +-
 fs/btrfs/ctree.h            |  12 +-
 fs/btrfs/dir-item.c         |  10 +-
 fs/btrfs/disk-io.c          | 126 ---
 fs/btrfs/disk-io.h          |   2 -
 fs/btrfs/extent-tree.c      |  43 +++--
 fs/btrfs/file-item.c        |   9 +-
 fs/btrfs/free-space-cache.c |  15 +-
 fs/btrfs/hash.c             |  75 +++--
 fs/btrfs/hash.h             |  22 ++-
 fs/btrfs/inode-item.c       |   6 +-
 fs/btrfs/inode.c            |  16 +-
 fs/btrfs/props.c            |  37 -
 fs/btrfs/props.h            |   3 +-
 fs/btrfs/scrub.c            |  70 ++--
 fs/btrfs/send.c             |   7 +-
 fs/btrfs/super.c            |   9 +-
 fs/btrfs/tree-log.c         |   2 +-
 include/crypto/xxhash.h     | 209
 25 files changed, 974 insertions(+), 161 deletions(-)
 create mode 100644 crypto/xxhash.c
 create mode 100644 include/crypto/xxhash.h

--
1.8.1.4
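For readers who want to see what the 32-bit xxHash computation looks like outside the kernel, here is a minimal pure-Python sketch of XXH32. It follows the publicly documented algorithm (four accumulators over 16-byte stripes, a tail loop, then a final avalanche) and is not taken from this patch set; the function names `xxh32`, `_rotl32` and `_round` are illustrative choices. It is far slower than the C code and is only meant to make the structure visible.

```python
import struct

# XXH32 constants from the public xxHash specification.
PRIME32_1 = 2654435761
PRIME32_2 = 2246822519
PRIME32_3 = 3266489917
PRIME32_4 = 668265263
PRIME32_5 = 374761393
MASK = 0xFFFFFFFF


def _rotl32(x, r):
    """Rotate a 32-bit value left by r bits."""
    return ((x << r) | (x >> (32 - r))) & MASK


def _round(acc, lane):
    """Mix one little-endian 32-bit lane into an accumulator."""
    acc = (acc + lane * PRIME32_2) & MASK
    return (_rotl32(acc, 13) * PRIME32_1) & MASK


def xxh32(data, seed=0):
    """Return the 32-bit xxHash of `data` (bytes) as an integer."""
    n = len(data)
    p = 0
    if n >= 16:
        v1 = (seed + PRIME32_1 + PRIME32_2) & MASK
        v2 = (seed + PRIME32_2) & MASK
        v3 = seed & MASK
        v4 = (seed - PRIME32_1) & MASK
        # Consume the input in 16-byte stripes, one lane per accumulator.
        while p <= n - 16:
            l1, l2, l3, l4 = struct.unpack_from("<4I", data, p)
            v1 = _round(v1, l1)
            v2 = _round(v2, l2)
            v3 = _round(v3, l3)
            v4 = _round(v4, l4)
            p += 16
        h = (_rotl32(v1, 1) + _rotl32(v2, 7)
             + _rotl32(v3, 12) + _rotl32(v4, 18)) & MASK
    else:
        h = (seed + PRIME32_5) & MASK

    h = (h + n) & MASK

    # Tail: remaining 4-byte words, then remaining single bytes.
    while p + 4 <= n:
        (lane,) = struct.unpack_from("<I", data, p)
        h = (h + lane * PRIME32_3) & MASK
        h = (_rotl32(h, 17) * PRIME32_4) & MASK
        p += 4
    while p < n:
        h = (h + data[p] * PRIME32_5) & MASK
        h = (_rotl32(h, 11) * PRIME32_1) & MASK
        p += 1

    # Final avalanche so that every input bit affects every output bit.
    h ^= h >> 15
    h = (h * PRIME32_2) & MASK
    h ^= h >> 13
    h = (h * PRIME32_3) & MASK
    h ^= h >> 16
    return h


if __name__ == "__main__":
    print(hex(xxh32(b"")))
```

Note that the result is always a 4-byte value, which is why the patch can reuse the existing 4-byte checksum slot that crc32c occupies.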
Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm
On Wed, May 07, 2014 at 06:56:29PM +0800, Liu Bo wrote:
> xxHash is an extremely fast non-cryptographic Hash algorithm, working at
> speeds close to RAM limits.[1]
>
> And xxhash is 32-bits hash, same as crc32.
>
> Here is the hash comparsion extracted from the link[1]:
> (single thread, Windows Seven 32 bits, using Open Source's SMHasher on a
> Core 2 Duo @3GHz)
>
> Name     Speed      Q.Score  Author
> xxHash   5.4 GB/s   10
> CRC32    0.43 GB/s  9

Core 2 Duo is an awfully old CPU. Since 2008, Intel CPUs have had a crc32
instruction, which hugely speeds up CRC operations.

--
Tomasz Torcz               God, root, what's the difference?
xmpp: zdzich...@chrome.pl  God is more forgiving.
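As background to the point about hardware support: the SSE4.2 crc32 instruction computes CRC-32C (the Castagnoli polynomial), which is the same crc32c variant btrfs uses. The following is a minimal table-driven Python sketch of that checksum, useful only as a reference for what the hardware accelerates; the names `crc32c` and `_make_table` are illustrative, and this software loop says nothing about the relative speed of the two algorithms on a modern CPU.

```python
def _make_table():
    """Build the 256-entry lookup table for reflected CRC-32C."""
    poly = 0x82F63B78  # Castagnoli polynomial, bit-reflected form
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ poly if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data, crc=0):
    """Return the CRC-32C of `data` (bytes), optionally continuing from `crc`."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = _TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(crc32c(b"123456789")))
```

A fair benchmark for this RFC would therefore compare xxhash against the hardware-accelerated crc32c path rather than against a table-driven implementation like this one.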
Re: Using noCow with snapshots ?
Russell Coker posted on Wed, 07 May 2014 15:36:15 +1000 as excerpted:

> How could BTRFS and a database fight about data recovery?
>
> BTRFS offers similar guarantees about data durability etc to other
> journalled filesystems and only differs by having checksums so that
> while a snapshot might have half the data that was written by an app
> you at least know that the half will be consistent.
>
> If you had database files on a separate subvol to the database log then
> you would be at risk of having problems making a any sort of consistent
> snapshot (the Debian approach of /var/log/mysql and /var/lib/mysql is a
> bad idea). But there would be no difference with LVM snapshots in that
> regard.

Race conditions having to do with unsynced checkpoints, primarily. And
it's actually the btrfs checksumming that seems to create the problem.

The symptom being reported (tho I can say I've not seen further reports
recently, maybe it's fixed now) was that the checksummed values btrfs
restored as correct were considered corrupted by the database or vm.

If the checksums checked out after btrfs did its replay (as they did or
btrfs would error on access), but the databases and VMs were still
reporting corruption, then the explanation that was left was that the
btrfs replay and checksum validation was screwing up the application's
own checksumming validation, which could be explained if the two were
sufficiently out of sync that btrfs fixing its own view was actually
breaking the view as seen by the data validating app.

Tho as I said I've not seen that sort of report in several kernel cycles
now. But I'm not sure whether that's because the issues have been fixed
or for some other reason (maybe everybody experiencing the problem gave
up and switched to some other filesystem now, and the message is out
there well enough that new people see it before they experience and
report the same thing, or similar but everybody's switched to NOCOW now
and knows not to do snapshotting on the NOCOW files, or...).

Regardless, NOCOW and not doing snapshotting (because it triggers COW
anyway) on gig-plus internal-write files remains a very good idea.

(Also, quotas and quota sequence numbers play into the combinational
explosion problem along with snapshot-aware-defrag, too. See the writeup
on that that Dave wrote while he was on paternity leave.)

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
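For reference, a minimal Python sketch of the NOCOW setup recommended above. It assumes the usual approach of running `chattr +C` on an empty directory, so that files created inside it afterwards inherit the attribute; the flag is not reliably effective on files that already contain data. The helper names `nocow_commands` and `apply_nocow` and any path passed to them are hypothetical.

```python
import subprocess


def nocow_commands(directory):
    """Return the commands that create `directory` and mark it NOCOW."""
    return [
        ["mkdir", "-p", directory],
        ["chattr", "+C", directory],  # files created inside inherit NOCOW
        ["lsattr", "-d", directory],  # display the resulting attributes
    ]


def apply_nocow(directory):
    """Run the commands; raises CalledProcessError if any of them fails."""
    for cmd in nocow_commands(directory):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print instead of executing, so the sketch is safe to run anywhere.
    for cmd in nocow_commands("/var/lib/example-db"):
        print(" ".join(cmd))
```

Existing database or VM image files would need to be copied into such a directory (not moved within the same filesystem) to pick up the attribute.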
[PATCH] Btrfs-progs: check, fix csum check in the presence of non-inlined refs
When we have non-inlined extent references, we were failing to find the
corresponding extent item for an existing csum item in the csum tree.

Reproducer:

    mkfs.btrfs -f /dev/sdd
    mount /dev/sdd /mnt
    xfs_io -f -c "falloc 780366 135302" /mnt/foo
    xfs_io -c "falloc 327680 151552" /mnt/foo
    xfs_io -c "pwrite -S 0xff -b 131072 0 131072" /mnt/foo
    sync

    for i in `seq 1 40`; do btrfs subvolume snapshot /mnt /mnt/snap$i ; done

    umount /mnt
    btrfs check /dev/sdd

The check command exited with status 1 and the following output:

    Checking filesystem on /dev/sdd
    UUID: 2416ab5f-9d71-457e-bb13-a27d4f6b399a
    checking extents
    checking free space cache
    checking fs roots
    checking csums
    There are no extents for csum range 12980224-12984320
    Csum exists for 12980224-12984320 but there is no extent record
    found 1388544 bytes used err is 1
    total csum bytes: 132
    total tree bytes: 704512
    total fs tree bytes: 573440
    total extent tree bytes: 16384
    btree space waste bytes: 564479
    file data blocks allocated: 19341312
     referenced 14606336
    Btrfs v3.14.1-94-g80597e7

After this change it no longer erroneously reports a missing extent for
the csum item and exits with a status of 0.

Also added missing btrfs_prev_leaf() return value checks, as we were
ignoring errors and non-existence of left siblings completely.

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 cmds-check.c | 38 +++---
 1 file changed, 27 insertions(+), 11 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 103efc5..18612c8 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -3650,8 +3650,7 @@ static int check_extent_exists(struct btrfs_root *root, u64 bytenr,

 	key.objectid = bytenr;
 	key.type = BTRFS_EXTENT_ITEM_KEY;
-	key.offset = 0;
-
+	key.offset = (u64)-1;

 again:
 	ret = btrfs_search_slot(NULL, root->fs_info->extent_root, &key, path,
@@ -3661,10 +3660,17 @@ again:
 		btrfs_free_path(path);
 		return ret;
 	} else if (ret) {
-		if (path->slots[0])
+		if (path->slots[0] > 0) {
 			path->slots[0]--;
-		else
-			btrfs_prev_leaf(root, path);
+		} else {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0) {
+				goto out;
+			} else if (ret > 0) {
+				ret = 0;
+				goto out;
+			}
+		}
 	}

 	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
@@ -3674,13 +3680,22 @@ again:
 	 * bytenr, so walk back one more just in case.  Dear future traveler,
 	 * first congrats on mastering time travel.  Now if it's not too much
 	 * trouble could you go back to 2006 and tell Chris to make the
-	 * BLOCK_GROUP_ITEM_KEY lower than the EXTENT_ITEM_KEY please?
+	 * BLOCK_GROUP_ITEM_KEY (and BTRFS_*_REF_KEY) lower than the
+	 * EXTENT_ITEM_KEY please?
 	 */
-	if (key.type == BTRFS_BLOCK_GROUP_ITEM_KEY) {
-		if (path->slots[0])
+	while (key.type > BTRFS_EXTENT_ITEM_KEY) {
+		if (path->slots[0] > 0) {
 			path->slots[0]--;
-		else
-			btrfs_prev_leaf(root, path);
+		} else {
+			ret = btrfs_prev_leaf(root, path);
+			if (ret < 0) {
+				goto out;
+			} else if (ret > 0) {
+				ret = 0;
+				goto out;
+			}
+		}
+		btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
 	}

 	while (num_bytes) {
@@ -3752,7 +3767,8 @@ again:
 	}
 	ret = 0;

-	if (num_bytes) {
+out:
+	if (num_bytes && !ret) {
 		fprintf(stderr, "There are no extents for csum range "
 			"%Lu-%Lu\n", bytenr, bytenr+num_bytes);
 		ret = 1;
--
1.9.1
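To illustrate why the lookup has to walk back past more than one item, here is a toy Python model of the extent tree's key ordering. It is a simplification and not the btrfs-progs code: the tree is a sorted list of `(objectid, type, offset)` tuples, `bisect` stands in for `btrfs_search_slot()`, and the function names and sample keys are made up. The key type numbers are the on-disk values; keyed (non-inlined) references sort after the extent item that owns them because their type is larger.

```python
import bisect

# On-disk key type values; only their relative order matters here.
EXTENT_ITEM_KEY = 168
EXTENT_DATA_REF_KEY = 178
SHARED_DATA_REF_KEY = 184
BLOCK_GROUP_ITEM_KEY = 192
U64_MAX = 2**64 - 1


def landing_key_old(keys, bytenr):
    """Old behaviour, simplified: search with offset 0, step back once."""
    slot = bisect.bisect_left(keys, (bytenr, EXTENT_ITEM_KEY, 0)) - 1
    return keys[slot] if slot >= 0 else None


def find_extent_item(keys, bytenr):
    """New behaviour, simplified: search with offset (u64)-1, then walk
    back until the key is no longer of a type above EXTENT_ITEM_KEY."""
    slot = bisect.bisect_left(keys, (bytenr, EXTENT_ITEM_KEY, U64_MAX)) - 1
    while slot >= 0 and keys[slot][1] > EXTENT_ITEM_KEY:
        slot -= 1
    return keys[slot] if slot >= 0 else None


if __name__ == "__main__":
    # One 8 KiB extent at 4096 with two non-inlined refs, then another extent.
    keys = sorted([
        (0, BLOCK_GROUP_ITEM_KEY, 1 << 30),
        (4096, EXTENT_ITEM_KEY, 8192),
        (4096, EXTENT_DATA_REF_KEY, 111),
        (4096, SHARED_DATA_REF_KEY, 222),
        (16384, EXTENT_ITEM_KEY, 4096),
    ])
    # A csum range starting at 8192 lies inside the extent at 4096.
    print(landing_key_old(keys, 8192))   # lands on a ref key
    print(find_extent_item(keys, 8192))  # reaches the owning extent item
```

In this model the old lookup stops on a reference key and never sees the extent item behind it, which matches the false "There are no extents for csum range" report in the reproducer once snapshots force references out of line.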
btrfs snapshot sizes
So have others found a good way to have an idea about how much space is
taken by each snapshot?

I've tried quota trees, but I'm not sure how to read the output, or if
it's correct (including the negative numbers some have mentioned).

Are there other options? I think the main problem is that the shared data
field is not working, making it harder to know which blocks are only used
in a given snapshot.

subvol                                    group    total     unshared
-------------------------------------------------------------------------------
backup/debian32                           0/262    403.84G   -5.46G
backup/debian32_daily_20140504_00:03:01   0/3660   446.45G    0.00G
backup/debian32_daily_20140505_00:03:01   0/3687   431.11G    0.00G
backup/debian32_daily_20140506_00:03:00   0/3705   420.83G    0.00G
backup/debian32_daily_20140507_00:03:01   0/3724   411.87G    0.00G
backup/debian32_weekly_20140504_00:04:01  0/3675   446.45G    0.00G
backup/debian64                           0/263    855.97G   -1.50G
backup/debian64_daily_20140504_00:03:01   0/3662   860.19G    0.00G
backup/debian64_daily_20140505_00:03:01   0/3690   859.32G    0.00G
backup/debian64_daily_20140506_00:03:00   0/3707   858.15G    0.00G
backup/debian64_daily_20140507_00:03:01   0/3726   857.47G    0.00G
backup/debian64_weekly_20140504_00:04:01  0/3676   860.19G    0.00G
backup/ubuntu                             0/264    360.28G    0.00G
backup/ubuntu_daily_20140504_00:03:01     0/3664   364.53G    0.00G
backup/ubuntu_daily_20140505_00:03:01     0/3692   362.44G    0.00G
backup/ubuntu_daily_20140506_00:03:00     0/3709   360.91G    0.00G
backup/ubuntu_daily_20140507_00:03:01     0/3727   360.33G    0.00G
backup/ubuntu_weekly_20140504_00:04:01    0/3677   364.53G    0.00G

Thanks,
Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
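As a way of reading such a table, here is a small Python sketch of what the two size columns are meant to express, assuming "total" corresponds to the bytes a subvolume references and "unshared" to the bytes referenced by that subvolume alone. The data model (a dict mapping extent ids to sizes per snapshot) and the function name `snapshot_sizes` are made up for illustration; real qgroup accounting works on the extent tree, not on sets like these.

```python
from collections import Counter


def snapshot_sizes(snapshots):
    """Map each snapshot name to (referenced_bytes, exclusive_bytes).

    `snapshots` maps a name to a dict of {extent_id: size_in_bytes}.
    An extent counts as exclusive when only one snapshot references it.
    """
    refcount = Counter()
    for extents in snapshots.values():
        refcount.update(extents.keys())

    sizes = {}
    for name, extents in snapshots.items():
        referenced = sum(extents.values())
        exclusive = sum(size for ext, size in extents.items()
                        if refcount[ext] == 1)
        sizes[name] = (referenced, exclusive)
    return sizes


if __name__ == "__main__":
    # "daily" was taken when the volume held extents A and B; afterwards
    # B was rewritten as C in the live volume.
    snapshots = {
        "live":  {"A": 100, "C": 70},
        "daily": {"A": 100, "B": 50},
    }
    for name, (ref, excl) in sorted(snapshot_sizes(snapshots).items()):
        print(name, ref, excl)
```

By construction an exclusive figure computed this way lies between zero and the referenced total, so negative values in the last column, or zeros on snapshots that differ from their source, point at an accounting problem rather than at real usage.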
Re: How does Suse do live filesystem revert with btrfs?
Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:

> On Tue, May 06, 2014 at 04:26:48PM +0000, Duncan wrote:
>> Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
>>
>>> Aaah, right, you can use a script to see the file differences between
>>> two snapshots, and then restore that with reflink if you can truly get
>>> a list of all changed files.
>>> However, that is indeed not atomic at all, even if faster than rsync.
>>
>> Would send/receive help in such a script?
>
> Not really, you still end up with a new snapshot that you can't live
> switch to.
>
> It's really either
> 1) reboot
> 2) use cp --reflink to copy a list of changed files (as well as rm to
>    delete the ones that were removed).

What I meant was... use send/receive locally, in place of the cp
--reflink.

But now that I think of it, at least in the normal sense that wouldn't
work, since send is like diff and receive like patch, but what would be
needed would actually be an option similar to patch --reverse.

With something like that, you could (in theory, in practice it'd be racy
if other running apps were writing to it too) reverse the live subvolume
to the state of the snapshot.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: How does Suse do live filesystem revert with btrfs?
On Wed, May 07, 2014 at 11:35:52AM +0000, Duncan wrote:
> Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:
>
>> On Tue, May 06, 2014 at 04:26:48PM +0000, Duncan wrote:
>>> Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:
>>>
>>>> Aaah, right, you can use a script to see the file differences between
>>>> two snapshots, and then restore that with reflink if you can truly
>>>> get a list of all changed files.
>>>> However, that is indeed not atomic at all, even if faster than rsync.
>>>
>>> Would send/receive help in such a script?
>>
>> Not really, you still end up with a new snapshot that you can't live
>> switch to.
>>
>> It's really either
>> 1) reboot
>> 2) use cp --reflink to copy a list of changed files (as well as rm to
>>    delete the ones that were removed).
>
> What I meant was... use send/receive locally, in place of the cp
> --reflink.

This won't work since it can only work on another read-only subvolume.

But you could use btrfs send -p to get a list of changes between 2
snapshots, decode that (without btrfs receive) just to spit out the names
of the files that changed or got deleted.

It would be wasteful since it would cause all the changed blocks to be
read on the source, but still better than nothing.

Really, we'd just need a btrfs --send --dry-run -v -p vol1 vol2 which
would spit out a list of the file ops it would do.

That'd be enough to simply grep out the deletes, do them locally and then
use cp --reflink on everything else.

Marc
--
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
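The "list the changes, delete locally, reflink the rest" plan can be sketched without decoding a send stream at all, at the cost of walking and comparing both trees. The Python below is such a sketch under that assumption: it compares a live directory with a snapshot directory and returns which files would have to be removed and which would have to be copied back. The function names `list_files` and `plan_revert` are hypothetical, and it only handles regular files (no ownership, xattrs, empty directories or special files).

```python
import filecmp
import os


def list_files(root):
    """Return the set of file paths under `root`, relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def plan_revert(live, snapshot):
    """Return (to_delete, to_copy) needed to make `live` match `snapshot`.

    to_delete: files present in `live` but not in `snapshot`.
    to_copy:   files missing from `live` or whose contents differ.
    """
    live_files = list_files(live)
    snap_files = list_files(snapshot)

    to_delete = sorted(live_files - snap_files)
    to_copy = sorted(
        rel for rel in snap_files
        if rel not in live_files
        or not filecmp.cmp(os.path.join(live, rel),
                           os.path.join(snapshot, rel), shallow=False)
    )
    return to_delete, to_copy


if __name__ == "__main__":
    import sys
    deletes, copies = plan_revert(sys.argv[1], sys.argv[2])
    for rel in deletes:
        print("rm", rel)
    for rel in copies:
        print("cp --reflink=always", rel)
```

Each entry in the copy list would then be restored with `cp --reflink=always` from the snapshot path to the live path. As noted in the thread, the result is not atomic, and a byte-by-byte comparison like this reads far more data than a send-stream based listing would.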
[PATCH] btrfs/035: update clone test to expect EOPNOTSUPP
With kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, the first
clone-range overwrite attempt now fails with EOPNOTSUPP, rather than
tripping a Btrfs BUG_ON().

This test now trips a new Btrfs bug, in which EIO is returned for
subsequent reads following the second clone range ioctl.

Signed-off-by: David Disseldorp dd...@suse.de
---
 tests/btrfs/035     | 11 +++
 tests/btrfs/035.out |  5 +
 2 files changed, 16 insertions(+)

diff --git a/tests/btrfs/035 b/tests/btrfs/035
index 6808179..c9530f6 100755
--- a/tests/btrfs/035
+++ b/tests/btrfs/035
@@ -57,21 +57,32 @@ src_str=aa
 echo -n $src_str > $SCRATCH_MNT/src

 $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone1
+cat $SCRATCH_MNT/src.clone1
+echo

 src_str=bbcc

 echo -n $src_str > $SCRATCH_MNT/src

 $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
+cat $SCRATCH_MNT/src.clone2
+echo

+# Prior to kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, this clone
+# resulted in a BUG_ON in __btrfs_drop_extents(). The kernel now returns
+# EOPNOTSUPP up to userspace.
 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
 echo "attempting ioctl (src.clone1 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
+cat $SCRATCH_MNT/src
+echo

 snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
 echo "attempting ioctl (src.clone2 src)"
 $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
 	$SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
+# BUG: subsequent access attempts currently result in EIO...
+cat $SCRATCH_MNT/src

 status=0 ; exit
diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
index f86cadf..0ea2c4f 100644
--- a/tests/btrfs/035.out
+++ b/tests/btrfs/035.out
@@ -1,3 +1,8 @@
 QA output created by 035
+aa
+bbcc
 attempting ioctl (src.clone1 src)
+clone failed: Operation not supported
+bbcc
 attempting ioctl (src.clone2 src)
+bbcc
--
1.8.4.5
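For context on what the cloner program in this test does, here is a hedged Python sketch of issuing the btrfs clone-range ioctl and treating EOPNOTSUPP as an expected outcome. The ioctl number is derived from the kernel header definition `_IOW(0x94, 13, struct btrfs_ioctl_clone_range_args)` using the `_IOW` bit layout of x86 and most other architectures, which is an assumption that does not hold everywhere (powerpc and mips encode the direction bits differently). The function name `clone_range` and its boolean return convention are illustrative, not part of xfstests.

```python
import errno
import fcntl
import struct

# struct btrfs_ioctl_clone_range_args:
#   __s64 src_fd; __u64 src_offset; __u64 src_length; __u64 dest_offset;
CLONE_RANGE_ARGS = struct.Struct("=qQQQ")


def _iow(magic, nr, size):
    """Encode a write-direction ioctl number (common Linux bit layout)."""
    return (1 << 30) | (size << 16) | (magic << 8) | nr


BTRFS_IOC_CLONE_RANGE = _iow(0x94, 13, CLONE_RANGE_ARGS.size)


def clone_range(src_fd, dst_fd, src_offset, length, dst_offset):
    """Clone `length` bytes from src_fd into dst_fd.

    Returns True on success and False if the kernel answers EOPNOTSUPP;
    any other error is raised as OSError.
    """
    args = CLONE_RANGE_ARGS.pack(src_fd, src_offset, length, dst_offset)
    try:
        fcntl.ioctl(dst_fd, BTRFS_IOC_CLONE_RANGE, args)
    except OSError as exc:
        if exc.errno == errno.EOPNOTSUPP:
            return False
        raise
    return True


if __name__ == "__main__":
    print(hex(BTRFS_IOC_CLONE_RANGE))
```

With the kernel commit referenced above, the first overwrite attempt in the test would correspond to the `False` branch here instead of a kernel BUG_ON.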
Re: Using mount -o bind vs mount -o subvol=vol
Marc MERLIN posted on Wed, 07 May 2014 03:55:51 -0700 as excerpted:

> subvolumes are also used as units of backup for btrfs send.

Hmm, yes. Thanks. I don't use send/receive here so forgot about that.

>> So my vote would be, for example (modified slightly for posting from my
>> own mounts):
>>
>> mount /dev/sda5 /
>> mount /dev/sda4 /var/log
>> mount /dev/sda6 /home
>
> On my laptop:
> [snip]

FWIW, those were examples. I actually have more.

> But to each their own :)

Indeed.

--
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
Re: How does btrfs fi show show full?
On Wed, 7 May 2014 04:30:30 -0700
Marc MERLIN m...@merlins.org wrote:

>> -dusage=85 balances all chunks that up to 85% full. The higher the
>> number, the more work that needs to be done.
>
> Aah, right. I see why it's more work. =20 only makes is process the few
> chunks that are up to 20% full which won't be many if your FS is almost
> full.

It's actually even less work than you imply. Balance only has to rewrite
the actual content, not the empty space in the chunk. So 20% full means
it's only writing 20% of the (possible/full) content, thus only taking
20% of the time to rewrite that chunk that it'd take to rewrite a full
chunk.

Which is why a usage=5 or 20 goes so fast, even if the system's actually
mostly empty but is all allocated. With a 20% full chunk it's rewriting
five chunks into one; at 5%, it's rewriting 20 chunks into one. That goes
pretty fast, even if there's a bunch of them to write!

--
Duncan - No HTML messages please, as they are filtered as spam.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman
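The arithmetic in this explanation can be written down as a small Python sketch. It models each data chunk only by how full it is, in percent, and treats `-dusage=N` as selecting chunks that are at most N percent full, as the thread describes it; the exact boundary behaviour in the kernel is not something this sketch asserts. The function name `balance_estimate` is made up.

```python
def balance_estimate(chunk_usage_pct, max_usage_pct):
    """Estimate the work done by a usage-filtered balance.

    chunk_usage_pct: list of integers, percent full for each data chunk.
    Returns (chunks_selected, data_rewritten, chunks_after), where
    data_rewritten is expressed in units of one full chunk.
    """
    selected = [u for u in chunk_usage_pct if u <= max_usage_pct]
    total_pct = sum(selected)
    # Ceiling division: the rewritten data is packed into as few chunks
    # as possible.
    chunks_after = -(-total_pct // 100)
    return len(selected), total_pct / 100, chunks_after


if __name__ == "__main__":
    # Five chunks at 20% plus ten nearly full chunks that are left alone.
    print(balance_estimate([20] * 5 + [95] * 10, 20))
    # Twenty chunks at 5%.
    print(balance_estimate([5] * 20, 5))
```

In both cases only one chunk's worth of data is rewritten, while four or nineteen chunks' worth of allocated space is returned, which is why low usage filters finish quickly.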
Re: btrfs issues in 3.14
On Wed, May 7, 2014 at 9:35 AM, Kenny MacDermid kenny.macder...@gmail.com wrote:
> On Tue, May 6, 2014 at 11:22 PM, Liu Bo bo.li@oracle.com wrote:
>> What does sysrq+w say when the hang happens?
>
> The whole system isn't hung, I may have explained that wrong. The system
> will hang if I try to shutdown, and the process will hang if I try to
> kill -9 it. It looks like the browser is in this state currently so I
> did an 'echo w > /proc/sysrq-trigger' and have attached the full dmesg
> with the browser issues and the output.

I had to hard reboot to clear that issue, and I decided to do another
'btrfs check' while /home was unmounted. It generated the following
output:

checking extents
checking free space cache
Wanted bytes 45056, found 32768 for off 63805808640
Wanted bytes 90016, found 32768 for off 63805808640
cache appears valid but isnt 62843256832
Checking filesystem on //dev/mapper/home
UUID: 9a60a25f-eeb4-494c-b1af-ebd8e4f79b6b
found 13672418478 bytes used err is -22
total csum bytes: 72089212
total tree bytes: 906100736
total fs tree bytes: 808370176
total extent tree bytes: 18153472
btree space waste bytes: 116247440
file data blocks allocated: 101046853632
 referenced 73680674816
Btrfs v3.14.1

This is on the new filesystem. I redid the dmcrypt and the lvm lv when I
recreated the filesystem as well, so it's less than a week old.

Before rebuilding, the old one was telling me:

Checking filesystem on /dev/mapper/home
UUID: 4f5d7a10-d003-48a7-a901-bf22d534888f
free space inode generation (0) did not match free space cache generation (115200)
found 29963117667 bytes used err is 1
total csum bytes: 63740440
total tree bytes: 745504768
total fs tree bytes: 624951296
total extent tree bytes: 36749312
btree space waste bytes: 119018687
file data blocks allocated: 181026942976
 referenced 73759866880
Btrfs v0.20-rc1-358-g194aa4a-dirty

and

checking extents
checking free space cache
checking fs roots
root 257 inode 29647 errors 200, dir isize wrong
root 257 inode 391917 errors 200, dir isize wrong
root 257 inode 497392 errors 410, odd dir item, nbytes wrong
Checking filesystem on /dev/mapper/home
UUID: 4f5d7a10-d003-48a7-a901-bf22d534888f
free space inode generation (0) did not match free space cache generation (115200)
found 31310902624 bytes used err is 1
total csum bytes: 63579480
total tree bytes: 743342080
total fs tree bytes: 623198208
total extent tree bytes: 36601856
btree space waste bytes: 118906643
file data blocks allocated: 180831965184
 referenced 73631731712
Btrfs v3.14
Re: Please review and comment, dealing with btrfs full issues
On Tue, May 06, 2014 at 05:43:24PM +0100, Hugo Mills wrote:
> > > So in my case when I hit that case, I had to use dusage=0 to
> > > recover. Anything above that just didn't work.
> >
> > I suspect when using more than zero the first chunk it wanted to
> > balance wasn't empty - and it had nowhere to put it. Then when you
> > did dusage=0, it didn't need a destination for the data. That is
> > actually an interesting workaround for that case.
>
>    I've actually looked into implementing a smallest=n filter that
> would taken only the n least-full chunks (by fraction) and balance
> those. However, it's not entirely trivial to do efficiently with the
> current filtering code.

I've prototyped something similar, to limit the number of balanced
chunks by a number.

To achieve n least-full chunks would be an iterative process of
increasing the usage filter and limiting the number of chunks until the
desired N is reached.

N=n
F=0
while (N > 0) {
	balance -dusage=F,limit=N
	N -= number of balanced chunks
	F++
}

The patch is in branch dev/balance-limit in my git repos. We can then
implement the n-least-full as a synthetic filter from userspace.
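The loop above can be simulated in a few lines of Python to see which chunks it would end up selecting. This is a model only: chunks are represented by their usage in percent, one "balance pass" just picks up to `limit` not-yet-balanced chunks whose usage is at most the current filter value, and the effect of the rewritten data landing in new chunks is ignored. The function name `n_least_full` and the stop condition at 100% (so the loop ends when fewer than n chunks exist) are additions for the sketch, not part of the pseudocode.

```python
def n_least_full(chunk_usage_pct, n):
    """Simulate repeated `balance -dusage=F,limit=N` passes.

    Returns the (chunk_index, usage) pairs in the order they would be
    balanced, stopping once n chunks were handled or F exceeds 100.
    """
    remaining = dict(enumerate(chunk_usage_pct))  # index -> percent full
    picked = []
    need = n
    usage = 0
    while need > 0 and remaining and usage <= 100:
        # One pass: at most `need` chunks at or below the usage filter.
        batch = [idx for idx, pct in remaining.items() if pct <= usage][:need]
        for idx in batch:
            picked.append((idx, remaining.pop(idx)))
        need -= len(batch)
        usage += 1
    return picked


if __name__ == "__main__":
    print(n_least_full([90, 0, 35, 5, 35, 60], 3))
```

Each iteration of the `while` loop corresponds to one full pass over the chunk list, so a filesystem whose least-full chunks are all fairly full needs many passes before anything is selected.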
Smallest-n balance filter (was Re: Please review and comment, dealing with btrfs full issues)
On Wed, May 07, 2014 at 04:09:27PM +0200, David Sterba wrote:

On Tue, May 06, 2014 at 05:43:24PM +0100, Hugo Mills wrote:

So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work.

I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case.

I've actually looked into implementing a smallest=n filter that would take only the n least-full chunks (by fraction) and balance those. However, it's not entirely trivial to do efficiently with the current filtering code.

I've prototyped something similar, to cap the number of balanced chunks at a given count. To achieve n least-full chunks would be an iterative process of increasing the usage filter and limiting the number of chunks until the desired N is reached.

N=n
F=0
while (N > 0) {
	balance -dusage=F,limit=N
	N -= number of balanced chunks
	F++
}

The patch is in branch dev/balance-limit in my git repos. We can then implement the n-least-full as a synthetic filter from userspace.

This is inefficient, because we've got an O(m) pass through all the chunks for every call. If we reduce the number of calls by increasing the increment of F (F+=3, say), then we risk overbalancing, or missing out on smaller chunks we could have balanced earlier. From a practical point of view, it may make little difference, but the computer scientist in me is going "ew".

The other method, for small n only, would be to construct the list first, an O(m log n) operation for a filesystem of size m, requiring O(n) storage, and then iterate over just those chunks. The problem with that is the storage requirements, and keeping track of the state of the list for restart purposes.

[actually, there's probably an O(m) algorithm to get the n smallest items, but those are a bit complicated]

   Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
  PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
   --- A diverse working environment: Di longer you vork here, di ---
                                 verse it gets.
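The "construct the list first" idea, an O(m log n) selection with O(n) storage, is commonly done with a bounded max-heap. The following is a minimal sketch of that general technique, not code from any btrfs tree: the function names are made up for illustration, and it tracks only usage values where a real implementation would presumably track chunk identifiers and handle restarts.

```c
#include <assert.h>
#include <stddef.h>

/* Restore the max-heap property downwards from index i. */
static void sift_down(unsigned *heap, size_t len, size_t i)
{
    for (;;) {
        size_t largest = i, l = 2 * i + 1, r = 2 * i + 2;
        unsigned tmp;

        if (l < len && heap[l] > heap[largest])
            largest = l;
        if (r < len && heap[r] > heap[largest])
            largest = r;
        if (largest == i)
            return;
        tmp = heap[i];
        heap[i] = heap[largest];
        heap[largest] = tmp;
        i = largest;
    }
}

/* Restore the max-heap property upwards from index i. */
static void sift_up(unsigned *heap, size_t i)
{
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        unsigned tmp;

        if (heap[parent] >= heap[i])
            return;
        tmp = heap[i];
        heap[i] = heap[parent];
        heap[parent] = tmp;
        i = parent;
    }
}

/*
 * Collect the n smallest usage values into `out`, kept as a max-heap so
 * that out[0] is always the largest value retained so far. One pass over
 * m chunks costs O(m log n) time and O(n) storage.
 * Returns how many values were stored, i.e. min(m, n).
 */
size_t n_least_full(const unsigned *usage, size_t m, unsigned *out, size_t n)
{
    size_t len = 0;

    assert(usage != NULL && out != NULL);

    for (size_t i = 0; i < m; i++) {
        if (len < n) {
            out[len] = usage[i];
            sift_up(out, len);
            len++;
        } else if (n > 0 && usage[i] < out[0]) {
            out[0] = usage[i];          /* evict the current maximum */
            sift_down(out, len, 0);
        }
    }
    return len;
}
```

The resulting array is in heap order, not sorted; sorting the n retained entries afterwards costs a further O(n log n) if an ordering is needed.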
Re: Smallest-n balance filter (was Re: Please review and comment, dealing with btrfs full issues)
On Wed, May 07, 2014 at 03:23:01PM +0100, Hugo Mills wrote:

N=n
F=0
while (N > 0) {
	balance -dusage=F,limit=N
	N -= number of balanced chunks
	F++
}

The patch is in branch dev/balance-limit in my git repos. We can then implement the n-least-full as a synthetic filter from userspace.

This is inefficient, because we've got an O(m) pass through all the chunks for every call. If we reduce the number of calls by increasing the increment of F (F+=3, say), then we risk overbalancing, or missing out on smaller chunks we could have balanced earlier. From a practical point of view, it may make little difference, but the computer scientist in me is going "ew".

I'm trying to find the practical way, no doubt about the inefficiencies. The +1 increment was meant to outline the idea, I'm usually using the sequence 0, 1, 5, 10, [etc +10]. I think we can afford some inaccuracy, I as a user would not mind if there's some overbalancing (within a sane margin).

The other method, for small n only, would be to construct the list first, an O(m log n) operation for a filesystem of size m, requiring O(n) storage, and then iterate over just those chunks.

The size of the filesystem matters, but the scanning phase of balance uses in-memory structures and this should not be that bad for terabyte-sized filesystems (i.e. the number of block groups will be some thousands). Possibly we can stop looking for new chunks in the first phase of balance when there are already N candidate chunks found, and process them.

The problem with that is the storage requirements, and keeping track of the state of the list for restart purposes. [actually, there's probably an O(m) algorithm to get the n smallest items, but those are a bit complicated]

If the filesystem is under load, the chunks' usage may increase or decrease over time and as we know, balance takes time, so the chunk-todo-list may look different when the next one is about to be processed. But yeah, this could be a cheaper check to skip a given chunk if it's out of the filter criteria than going through the whole list again.
Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items
On Sun, Feb 09, 2014 at 11:45:12PM +0000, Filipe David Borba Manana wrote:

This is an extension to my previous commit titled: Btrfs: faster file extent item replace operations (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND).

(I'm tracking a bug which is very hard to reproduce and the stack seems to point to this area.)

Even though I know that this has been merged, I still have to say that this makes the code nearly unmaintainable. __btrfs_drop_extents() has already been one of the most complex functions since it was written, but now it's become even more complex!

I'm not sure whether the performance gain justifies that kind of complexity. To be honest, try to ask yourself how much time you'll spend re-understanding the code and all the details.

thanks,
-liubo

Signed-off-by: Filipe David Borba Manana fdman...@gmail.com
---
 fs/btrfs/file.c | 52 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 30 insertions(+), 22 deletions(-)

diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0165b86..006af2f 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -720,7 +720,7 @@ int __btrfs_drop_extents(struct btrfs_trans_handle *trans,
 	if (drop_cache)
 		btrfs_drop_extent_cache(inode, start, end - 1, 0);
 
-	if (start >= BTRFS_I(inode)->disk_i_size)
+	if (start >= BTRFS_I(inode)->disk_i_size && !replace_extent)
 		modify_tree = 0;
 
 	while (1) {
@@ -938,34 +938,42 @@ next_slot:
 		 * Set path->slots[0] to first slot, so that after the delete
 		 * if items are move off from our leaf to its immediate left or
 		 * right neighbor leafs, we end up with a correct and adjusted
-		 * path->slots[0] for our insertion.
+		 * path->slots[0] for our insertion (if replace_extent != 0).
 		 */
 		path->slots[0] = del_slot;
 		ret = btrfs_del_items(trans, root, path, del_slot, del_nr);
 		if (ret)
 			btrfs_abort_transaction(trans, root, ret);
+	}
 
-		leaf = path->nodes[0];
-		/*
-		 * leaf eb has flag EXTENT_BUFFER_STALE if it was deleted (that
-		 * is, its contents got pushed to its neighbors), in which case
-		 * it means path->locks[0] == 0
-		 */
-		if (!ret && replace_extent && leafs_visited == 1 &&
-		    path->locks[0] &&
-		    btrfs_leaf_free_space(root, leaf) >=
-		    sizeof(struct btrfs_item) + extent_item_size) {
-
-			key.objectid = ino;
-			key.type = BTRFS_EXTENT_DATA_KEY;
-			key.offset = start;
-			setup_items_for_insert(root, path, &key,
-					       &extent_item_size,
-					       extent_item_size,
-					       sizeof(struct btrfs_item) +
-					       extent_item_size, 1);
-			*key_inserted = 1;
+	leaf = path->nodes[0];
+	/*
+	 * If btrfs_del_items() was called, it might have deleted a leaf, in
+	 * which case it unlocked our path, so check path->locks[0] matches a
+	 * write lock.
+	 */
+	if (!ret && replace_extent && leafs_visited == 1 &&
+	    (path->locks[0] == BTRFS_WRITE_LOCK_BLOCKING ||
+	     path->locks[0] == BTRFS_WRITE_LOCK) &&
+	    btrfs_leaf_free_space(root, leaf) >=
+	    sizeof(struct btrfs_item) + extent_item_size) {
+
+		key.objectid = ino;
+		key.type = BTRFS_EXTENT_DATA_KEY;
+		key.offset = start;
+		if (!del_nr && path->slots[0] < btrfs_header_nritems(leaf)) {
+			struct btrfs_key slot_key;
+
+			btrfs_item_key_to_cpu(leaf, &slot_key, path->slots[0]);
+			if (btrfs_comp_cpu_keys(&key, &slot_key) > 0)
+				path->slots[0]++;
 		}
+		setup_items_for_insert(root, path, &key,
+				       &extent_item_size,
+				       extent_item_size,
+				       sizeof(struct btrfs_item) +
+
Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items
On 05/07/2014 11:21 AM, Liu Bo wrote:

On Sun, Feb 09, 2014 at 11:45:12PM +0000, Filipe David Borba Manana wrote:

This is an extension to my previous commit titled: Btrfs: faster file extent item replace operations (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND).

(I'm tracking a bug which is very hard to reproduce and the stack seems to point to this area.)

Even though I know that this has been merged, I still have to say that this makes the code nearly unmaintainable. __btrfs_drop_extents() has already been one of the most complex functions since it was written, but now it's become even more complex!

I'm not sure whether the performance gain justifies that kind of complexity. To be honest, try to ask yourself how much time you'll spend re-understanding the code and all the details.

It's just a complex operation anyway, so really it's going to suck no matter what. What I would like to see is some sanity tests committed that test the various corner cases of btrfs_drop_extents, so when we make these sorts of changes we can be sure we're not breaking anything.

So in fact that's the new requirement: whoever wants to touch btrfs_drop_extents next has to make sanity tests for it first, and then they can do what they want. This includes cleaning it up.
Thanks,

Josef
[PATCH] btrfs-progs: balance filter: add limit of processed chunks
Add more control to the balance behaviour. Usage filter may not be fine-grained enough and can lead to moving too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow working with a specific chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov idryo...@gmail.com
CC: Hugo Mills h...@carfax.org.uk
Signed-off-by: David Sterba dste...@suse.cz
---
 cmds-balance.c | 14 ++++++++++++++
 ioctl.h        |  4 +++-
 volumes.h      |  1 +
 3 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/cmds-balance.c b/cmds-balance.c
index 8a743ecabd33..5de51bd463c4 100644
--- a/cmds-balance.c
+++ b/cmds-balance.c
@@ -218,6 +218,18 @@ static int parse_filters(char *filters, struct btrfs_balance_args *args)
 			args->flags |= BTRFS_BALANCE_ARGS_CONVERT;
 		} else if (!strcmp(this_char, "soft")) {
 			args->flags |= BTRFS_BALANCE_ARGS_SOFT;
+		} else if (!strcmp(this_char, "limit")) {
+			if (!value || !*value) {
+				fprintf(stderr,
+					"the limit filter requires an argument\n");
+				return 1;
+			}
+			if (parse_u64(value, &args->limit)) {
+				fprintf(stderr, "Invalid limit argument: %s\n",
+				       value);
+				return 1;
+			}
+			args->flags |= BTRFS_BALANCE_ARGS_LIMIT;
 		} else {
 			fprintf(stderr, "Unrecognized balance option '%s'\n",
 				this_char);
@@ -252,6 +264,8 @@ static void dump_balance_args(struct btrfs_balance_args *args)
 		printf(", vrange=%llu..%llu",
 		       (unsigned long long)args->vstart,
 		       (unsigned long long)args->vend);
+	if (args->flags & BTRFS_BALANCE_ARGS_LIMIT)
+		printf(", limit=%llu", (unsigned long long)args->limit);
 
 	printf("\n");
 }
diff --git a/ioctl.h b/ioctl.h
index 9627e8d1bac6..f0fc06086c3e 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -194,7 +194,9 @@ struct btrfs_balance_args {
 
 	__u64 flags;
 
-	__u64 unused[8];
+	__u64 limit;
+
+	__u64 unused[7];
 } __attribute__ ((__packed__));
 
 struct btrfs_balance_progress {
diff --git a/volumes.h b/volumes.h
index b1ff3d04f931..8405aef2cc0a 100644
--- a/volumes.h
+++ b/volumes.h
@@ -130,6 +130,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DEVID	(1ULL << 2)
 #define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
+#define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
-- 
1.9.0
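As a rough, standalone illustration of what parsing a `limit=N` token involves, here is a sketch that does not use the btrfs-progs helpers. `struct balance_args_sketch`, `SKETCH_ARGS_LIMIT` and `parse_limit_filter()` are hypothetical names; the real code splits key and value beforehand and calls `parse_u64()`, whose exact behaviour is not shown in the patch, so `strtoull()` is used here as an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, trimmed-down stand-in for struct btrfs_balance_args. */
struct balance_args_sketch {
    unsigned long long flags;
    unsigned long long limit;
};

/* Same bit position as BTRFS_BALANCE_ARGS_LIMIT in the patch. */
#define SKETCH_ARGS_LIMIT (1ULL << 5)

/*
 * Parse a single "limit=N" token. Returns 0 on success and 1 on error,
 * following the return convention visible in the patch.
 */
int parse_limit_filter(const char *token, struct balance_args_sketch *args)
{
    const char *value;
    char *end;
    unsigned long long v;

    assert(token != NULL && args != NULL);

    if (strncmp(token, "limit=", 6) != 0)
        return 1;               /* not a limit filter, or no argument */
    value = token + 6;
    if (*value < '0' || *value > '9')
        return 1;               /* reject empty, signed or non-numeric input */

    errno = 0;
    v = strtoull(value, &end, 10);
    if (errno != 0 || *end != '\0')
        return 1;               /* overflow or trailing garbage */

    args->limit = v;
    args->flags |= SKETCH_ARGS_LIMIT;
    return 0;
}
```

Checking the first character explicitly matters because `strtoull()` accepts a leading minus sign and would otherwise turn `-1` into a very large limit.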
[PATCH] btrfs: balance filter: add limit of processed chunks
This started as a debugging helper, to watch the effects of converting between raid levels on multiple devices, but could be useful standalone.

In my case the usage filter was not fine-grained enough and led to converting too many chunks at once. Another example use is in connection with drange+devid or vrange filters that allow working with a specific chunk or even with a chunk on a given device.

The limit filter applies last, the value of 0 means no limiting.

CC: Ilya Dryomov idryo...@gmail.com
CC: Hugo Mills h...@carfax.org.uk
Signed-off-by: David Sterba dste...@suse.cz
---
The name 'limit' should resemble the meaning from SQL SELECT. Though it may not be that useful on its own, we can use it as a building block for more complex filters.

 fs/btrfs/ctree.h           |  7 ++++++-
 fs/btrfs/volumes.c         | 18 ++++++++++++++++++
 fs/btrfs/volumes.h         |  1 +
 include/uapi/linux/btrfs.h |  3 ++-
 4 files changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba6b88528dc7..e6f899dc5e47 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -840,7 +840,10 @@ struct btrfs_disk_balance_args {
 	/* BTRFS_BALANCE_ARGS_* */
 	__le64 flags;
 
-	__le64 unused[8];
+	/* BTRFS_BALANCE_ARGS_LIMIT value */
+	__le64 limit;
+
+	__le64 unused[7];
 } __attribute__ ((__packed__));
 
 /*
@@ -2897,6 +2900,7 @@ btrfs_disk_balance_args_to_cpu(struct btrfs_balance_args *cpu,
 	cpu->vend = le64_to_cpu(disk->vend);
 	cpu->target = le64_to_cpu(disk->target);
 	cpu->flags = le64_to_cpu(disk->flags);
+	cpu->limit = le64_to_cpu(disk->limit);
 }
 
 static inline void
@@ -2914,6 +2918,7 @@ btrfs_cpu_balance_args_to_disk(struct btrfs_disk_balance_args *disk,
 	disk->vend = cpu_to_le64(cpu->vend);
 	disk->target = cpu_to_le64(cpu->target);
 	disk->flags = cpu_to_le64(cpu->flags);
+	disk->limit = cpu_to_le64(cpu->limit);
 }
 
 /* struct btrfs_super_block */
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 49d7fab73360..3b761a456acd 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -2922,6 +2922,16 @@ static int should_balance_chunk(struct btrfs_root *root,
 		return 0;
 	}
 
+	/*
+	 * limited by count, must be the last filter
+	 */
+	if ((bargs->flags & BTRFS_BALANCE_ARGS_LIMIT)) {
+		if (bargs->limit == 0)
+			return 0;
+		else
+			bargs->limit--;
+	}
+
 	return 1;
 }
 
@@ -2944,6 +2954,9 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	int ret;
 	int enospc_errors = 0;
 	bool counting = true;
+	u64 limit_data = bctl->data.limit;
+	u64 limit_meta = bctl->meta.limit;
+	u64 limit_sys = bctl->sys.limit;
 
 	/* step one make some room on all the devices */
 	devices = &fs_info->fs_devices->devices;
@@ -2982,6 +2995,11 @@ static int __btrfs_balance(struct btrfs_fs_info *fs_info)
 	memset(&bctl->stat, 0, sizeof(bctl->stat));
 	spin_unlock(&fs_info->balance_lock);
 again:
+	if (!counting) {
+		bctl->data.limit = limit_data;
+		bctl->meta.limit = limit_meta;
+		bctl->sys.limit = limit_sys;
+	}
 	key.objectid = BTRFS_FIRST_CHUNK_TREE_OBJECTID;
 	key.offset = (u64)-1;
 	key.type = BTRFS_CHUNK_ITEM_KEY;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 80754f9dd3df..1a15bbeb65e2 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -255,6 +255,7 @@ struct map_lookup {
 #define BTRFS_BALANCE_ARGS_DEVID	(1ULL << 2)
 #define BTRFS_BALANCE_ARGS_DRANGE	(1ULL << 3)
 #define BTRFS_BALANCE_ARGS_VRANGE	(1ULL << 4)
+#define BTRFS_BALANCE_ARGS_LIMIT	(1ULL << 5)
 
 /*
  * Profile changing flags.  When SOFT is set we won't relocate chunk if
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d69092fbdb..901a3c563f60 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -211,7 +211,8 @@ struct btrfs_balance_args {
 
 	__u64 flags;
 
-	__u64 unused[8];
+	__u64 limit;		/* limit number of processed chunks */
+	__u64 unused[7];
 } __attribute__ ((__packed__));
 
 /* report balance progress to userspace */
-- 
1.7.9
[PATCH] btrfs: retrieve more info from FS_INFO ioctl
Provide the basic information about filesystem through the ioctl:
* b-tree node size (same as leaf size)
* sector size
* expected alignment of CLONE_RANGE and EXTENT_SAME ioctl arguments

Backward compatibility: if the values are 0, kernel does not provide this information, the applications should ignore them.

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/ioctl.c           | 4 ++++
 include/uapi/linux/btrfs.h | 6 +++++-
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 2ad7de94efef..74530f226e50 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2574,6 +2574,10 @@ static long btrfs_ioctl_fs_info(struct btrfs_root *root, void __user *arg)
 	}
 	mutex_unlock(&fs_devices->device_list_mutex);
 
+	fi_args->nodesize = root->fs_info->super_copy->nodesize;
+	fi_args->sectorsize = root->fs_info->super_copy->sectorsize;
+	fi_args->clone_alignment = root->fs_info->super_copy->sectorsize;
+
 	if (copy_to_user(arg, fi_args, sizeof(*fi_args)))
 		ret = -EFAULT;
 
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d69092fbdb..aad9391e0a6d 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -181,7 +181,11 @@ struct btrfs_ioctl_fs_info_args {
 	__u64 max_id;				/* out */
 	__u64 num_devices;			/* out */
 	__u8 fsid[BTRFS_FSID_SIZE];		/* out */
-	__u64 reserved[124];			/* pad to 1k */
+	__u32 nodesize;				/* out */
+	__u32 sectorsize;			/* out */
+	__u32 clone_alignment;			/* out */
+	__u32 reserved32;
+	__u64 reserved[122];			/* pad to 1k */
 };
 
 struct btrfs_ioctl_feature_flags {
-- 
1.7.9
[PATCH] btrfs: export more from FS_INFO to sysfs
Similar to the FS_INFO updates, export the basic filesystem info through sysfs: node size, sector size and clone alignment.

Signed-off-by: David Sterba dste...@suse.cz
---
 fs/btrfs/sysfs.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 files changed, 40 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index c5eb2143dc66..ba2a645dee07 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -396,8 +396,48 @@ static ssize_t btrfs_label_store(struct kobject *kobj,
 }
 BTRFS_ATTR_RW(label, 0644, btrfs_label_show, btrfs_label_store);
 
+static ssize_t btrfs_no_store(struct kobject *kobj,
+			     struct kobj_attribute *a,
+			     const char *buf, size_t len)
+{
+	return -EPERM;
+}
+
+static ssize_t btrfs_nodesize_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->nodesize);
+}
+
+BTRFS_ATTR_RW(nodesize, 0444, btrfs_nodesize_show, btrfs_no_store);
+
+static ssize_t btrfs_sectorsize_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize);
+}
+
+BTRFS_ATTR_RW(sectorsize, 0444, btrfs_sectorsize_show, btrfs_no_store);
+
+static ssize_t btrfs_clone_alignment_show(struct kobject *kobj,
+				struct kobj_attribute *a, char *buf)
+{
+	struct btrfs_fs_info *fs_info = to_fs_info(kobj);
+
+	return snprintf(buf, PAGE_SIZE, "%u\n", fs_info->super_copy->sectorsize);
+}
+
+BTRFS_ATTR_RW(clone_alignment, 0444, btrfs_clone_alignment_show, btrfs_no_store);
+
 static struct attribute *btrfs_attrs[] = {
 	BTRFS_ATTR_PTR(label),
+	BTRFS_ATTR_PTR(nodesize),
+	BTRFS_ATTR_PTR(sectorsize),
+	BTRFS_ATTR_PTR(clone_alignment),
 	NULL,
 };
 
-- 
1.7.9
Re: Back from leave
Hi back, On Mon, May 05, 2014 at 10:28:13AM -0400, Josef Bacik wrote: I had way too much email so I just deleted it all, if there was something you wanted my specific attention on then bounce it back at me and I'll look at it. Thanks, it would be really great if you resurrect btrfs-next. Most of the current patches have been merged to 3.15 so for now it's IMHO ok to do a hard reset to linus/master. Please push anything you've already queued so we can let you know about the rest. thanks.
Re: Back from leave
On 05/07/2014 12:38 PM, David Sterba wrote: Hi back, On Mon, May 05, 2014 at 10:28:13AM -0400, Josef Bacik wrote: I had way too much email so I just deleted it all, if there was something you wanted my specific attention on then bounce it back at me and I'll look at it. Thanks, it would be really great if you resurrect btrfs-next. Most of the current patches have been merged to 3.15 so for now it's IMHO ok to do a hard reset to linus/master. Please push anything you've already queued so we can let you know about the rest. I've got them all queued up here, but I'm having trouble getting through an overnight stress.sh run (hangs). As soon as I nail down the problem I'll push out to my linux-next queue. At least for the next release, trying to help Josef focus on qgroups and other work he's had queued up. -chris
Re: How does Suse do live filesystem revert with btrfs?
On 05/07/2014 01:39 PM, Marc MERLIN wrote:

On Wed, May 07, 2014 at 11:35:52AM +0000, Duncan wrote:

Marc MERLIN posted on Wed, 07 May 2014 01:56:12 -0700 as excerpted:

On Tue, May 06, 2014 at 04:26:48PM +0000, Duncan wrote:

Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted:

Aaah, right, you can use a script to see the file differences between two snapshots, and then restore that with reflink if you can truly get a list of all changed files. However, that is indeed not atomic at all, even if faster than rsync.

Would send/receive help in such a script?

Not really, you still end up with a new snapshot that you can't live switch to. It's really either 1) reboot 2) use cp --reflink to copy a list of changed files (as well as rm to delete the ones that were removed).

What I meant was... use send/receive locally, in place of the cp --reflink.

This won't work since it can only work on another read-only subvolume. But you could use btrfs send -p to get a list of changes between 2 snapshots, decode that (without btrfs receive) just to spit out the names of the files that changed or got deleted. It would be wasteful since it would cause all the changed blocks to be read on the source, but still better than nothing.

Really, we'd just need a btrfs --send --dry-run -v -p vol1 vol2 which would spit out a list of the file ops it would do. That'd be enough to simply grep out the deletes, do them locally and then use cp --reflink on everything else.

What happens to the already opened files? I suppose that a process which has already opened a file sees the old one, whereas a new open could see the new one.

If this is acceptable, why not do mount --bind /snapshot /, or use pivot_root(2), or an overlay filesystem? Maybe we also need to move the other already mounted filesystems (like /proc, /sys)...

Marc

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [PATCH] Btrfs: faster/more efficient insertion of file extent items
On Wed, May 7, 2014 at 4:21 PM, Liu Bo bo.li@oracle.com wrote:

On Sun, Feb 09, 2014 at 11:45:12PM +0000, Filipe David Borba Manana wrote:

This is an extension to my previous commit titled: Btrfs: faster file extent item replace operations (hash 1acae57b161ef1282f565ef907f72aeed0eb71d9)

Instead of inserting the new file extent item if we deleted existing file extent items covering our target file range, also allow to insert the new file extent item if we didn't find any existing items to delete and replace_extent != 0, since in this case our caller would do another tree search to insert the new file extent item anyway, therefore just combine the two tree searches into a single one, saving cpu time, reducing lock contention and reducing btree node/leaf COW operations.

This covers the case where applications keep doing tail append writes to files, which for example is the case of Apache CouchDB (its database and view index files are always open with O_APPEND).

(I'm tracking a bug which is very hard to reproduce and the stack seems to point to this area.)

Even though I know that this has been merged, I still have to say that this makes the code nearly unmaintainable. __btrfs_drop_extents() has already been one of the most complex functions since it was written, but now it's become even more complex!

I'm not sure whether the performance gain justifies that kind of complexity. To be honest, try to ask yourself how much time you'll spend re-understanding the code and all the details.

The changes (this and the previous one mentioned in the change log) essentially only add an if statement at the end of the function, which has useful comments describing its purpose. It didn't change the logic in the big while loop, which is/was basically the whole function, that does the work of processing extent items and deleting them. Therefore I disagree that it added such a huge amount of complexity.

Thanks, and sorry for the debugging frustration you are going through.
[PATCH 1/3] btrfs-progs: print qgroup excl as unsigned
It's unsigned in the structure definition.

Reviewed-by: Mark Fasheh mfas...@suse.de
---
 print-tree.c | 12 ++++++------
 qgroup.c     |  4 ++--
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/print-tree.c b/print-tree.c
index 7263b09..adef94a 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -884,18 +884,18 @@ void btrfs_print_leaf(struct btrfs_root *root, struct extent_buffer *l)
 			qg_info = btrfs_item_ptr(l, i,
 					struct btrfs_qgroup_info_item);
 			printf("\t\tgeneration %llu\n"
-			     "\t\treferenced %lld referenced compressed %lld\n"
-			     "\t\texclusive %lld exclusive compressed %lld\n",
+			     "\t\treferenced %llu referenced compressed %llu\n"
+			     "\t\texclusive %llu exclusive compressed %llu\n",
 			       (unsigned long long)
 			       btrfs_qgroup_info_generation(l, qg_info),
-			       (long long)
+			       (unsigned long long)
 			       btrfs_qgroup_info_referenced(l, qg_info),
-			       (long long)
+			       (unsigned long long)
 			       btrfs_qgroup_info_referenced_compressed(l,
 								       qg_info),
-			       (long long)
+			       (unsigned long long)
 			       btrfs_qgroup_info_exclusive(l, qg_info),
-			       (long long)
+			       (unsigned long long)
 			       btrfs_qgroup_info_exclusive_compressed(l,
 								      qg_info));
 			break;
diff --git a/qgroup.c b/qgroup.c
index 94d1feb..368b262 100644
--- a/qgroup.c
+++ b/qgroup.c
@@ -203,11 +203,11 @@ static void print_qgroup_column(struct btrfs_qgroup *qgroup,
 		print_qgroup_column_add_blank(BTRFS_QGROUP_QGROUPID, len);
 		break;
 	case BTRFS_QGROUP_RFER:
-		len = printf("%lld", qgroup->rfer);
+		len = printf("%llu", qgroup->rfer);
 		print_qgroup_column_add_blank(BTRFS_QGROUP_RFER, len);
 		break;
 	case BTRFS_QGROUP_EXCL:
-		len = printf("%lld", qgroup->excl);
+		len = printf("%llu", qgroup->excl);
 		print_qgroup_column_add_blank(BTRFS_QGROUP_EXCL, len);
 		break;
 	case BTRFS_QGROUP_PARENT:
-- 
1.8.4
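The visible effect of switching `%lld` to `%llu` shows up only for counters that have wrapped below zero. A small sketch of the two renderings follows; the helper names are made up for illustration, and the signed variant relies on the usual two's-complement wraparound when converting an out-of-range value, which is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Old behaviour: render the 64-bit counter as signed. Converting a value
 * above LLONG_MAX is implementation-defined; common compilers wrap.
 */
int format_count_signed(char *buf, size_t len, uint64_t count)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%lld", (long long)count);
}

/* New behaviour: render the counter as the unsigned value it is. */
int format_count_unsigned(char *buf, size_t len, uint64_t count)
{
    assert(buf != NULL);
    return snprintf(buf, len, "%llu", (unsigned long long)count);
}
```

For a counter that underflowed by 1429504 bytes, the stored value is 2^64 - 1429504 = 18446744073708122112, which the signed format shows as -1429504.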
[PATCH 0/3] btrfs-progs: add quota group verify to btrfsck
Hi, The following 3 patches add support to btrfsck to check the counts in subvolume quota groups. With these patches a user can run btrfsck against a volume and if quota is enabled, qgroup data will be checked against the actual space used on disk. I also added a --qgroup-report option that will run the qgroup checker (only) and print out a full report of all qgroups. The patches can be pulled from the following branch: git://github.com/markfasheh/btrfs-progs-patches.git qgroup-verify The patches can also be viewed: https://github.com/markfasheh/btrfs-progs-patches/tree/qgroup-verify The first two patches set up for qgroups: - The change in patch #1 is optional. It corrects the print of qgroup bytes to be %llu as they are unsigned values. This means however that corrupted groups will no longer show a negative value but instead an unrealistically large one. It's my opinion that '-1' and '18446744073709551615' both look pretty obviously broken when put in 'qgroup show' output so I'm going for correctness. Here's the difference in output: qgroupid rfer excl 0/5 16384 16384 0/2574109430784 -1429504 qgroupid rfer excl 0/5 16384 16384 0/2574109430784 18446744073708122112 - Patch 2 imports the ulist code from kernel. Any qgroup code that deals with resolving refs to roots needs this so that it can insert into a 'list' that guarantees unique items. - Patch 3 adds the actual code to do the work of adding up referenced and exclusive bytecounts. This involves walking the extent tree and recording refs. We then resolve implied refs by walking down from each interior node. Finally, shared ref roots are found and each extent is accounted to any roots that reference it. 
Here's what it looks like now if you run btrfsck against a filesystem with a
couple corrupted qgroups:

Checking filesystem on /dev/vdb2
UUID: 8203ca66-9858-4e3f-b447-5bbaacf79c02
checking extents
checking free space cache
checking fs roots
checking csums
checking root refs
checking quota groups
Counts for qgroup id: 257 are different
our:    referenced 4124762112 referenced compressed 4124762112
disk:   referenced 4109430784 referenced compressed 4109430784
diff:   referenced 15331328 referenced compressed 15331328
our:    exclusive 901120 exclusive compressed 901120
disk:   exclusive 18446744073708122112 exclusive compressed 18446744073708122112
diff:   exclusive 2330624 exclusive compressed 2330624
Counts for qgroup id: 280 are different
our:    referenced 3750768640 referenced compressed 3750768640
disk:   referenced 3750768640 referenced compressed 3750768640
our:    exclusive 14749696 exclusive compressed 14749696
disk:   exclusive 11882496 exclusive compressed 11882496
diff:   exclusive 2867200 exclusive compressed 2867200
found 1009512957 bytes used err is 0
total csum bytes: 3955388
total tree bytes: 346292224
total fs tree bytes: 331939840
total extent tree bytes: 9338880
btree space waste bytes: 48141929
file data blocks allocated: 6477553664
 referenced 6062055424
Btrfs v3.14.1-3-gc8c1814

There's a minor issue in that we'll also print out qgroups for deleted
subvolumes as they still persist on disk (not shown here). I'm pretty sure
we can fix that with a followup patch to just check them against existing
subvolumes when we initially read our qgroup info from disk.

Thanks,
	--Mark
[PATCH 2/3] btrfs-progs: import ulist
qgroup-verify.c wants this for walking root refs. Signed-off-by: Mark Fasheh mfas...@suse.de --- Makefile | 3 +- kerncompat.h | 2 +- ulist.c | 253 +++ ulist.h | 66 4 files changed, 322 insertions(+), 2 deletions(-) create mode 100644 ulist.c create mode 100644 ulist.h diff --git a/Makefile b/Makefile index da05197..202013e 100644 --- a/Makefile +++ b/Makefile @@ -9,7 +9,8 @@ CFLAGS = -g -O1 -fno-strict-aliasing objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \ extent-cache.o extent_io.o volumes.o utils.o repair.o \ - qgroup.o raid6.o free-space-cache.o list_sort.o props.o + qgroup.o raid6.o free-space-cache.o list_sort.o props.o \ + ulist.o cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ diff --git a/kerncompat.h b/kerncompat.h index f370cd8..652275e 100644 --- a/kerncompat.h +++ b/kerncompat.h @@ -235,7 +235,7 @@ static inline long IS_ERR(const void *ptr) #define BUG_ON(c) assert(!(c)) #define WARN_ON(c) assert(!(c)) - +#defineASSERT(c) assert(c) #define container_of(ptr, type, member) ({ \ const typeof( ((type *)0)-member ) *__mptr = (ptr);\ diff --git a/ulist.c b/ulist.c new file mode 100644 index 000..60fdc09 --- /dev/null +++ b/ulist.c @@ -0,0 +1,253 @@ +/* + * Copyright (C) 2011 STRATO AG + * written by Arne Jansen sensi...@gmx.net + * Distributed under the GNU GPL license version 2. + */ + +//#include linux/slab.h +#include stdlib.h +#include kerncompat.h +#include ulist.h +#include ctree.h + +/* + * ulist is a generic data structure to hold a collection of unique u64 + * values. The only operations it supports is adding to the list and + * enumerating it. + * It is possible to store an auxiliary value along with the key. + * + * A sample usage for ulists is the enumeration of directed graphs without + * visiting a node twice. 
The pseudo-code could look like this: + * + * ulist = ulist_alloc(); + * ulist_add(ulist, root); + * ULIST_ITER_INIT(uiter); + * + * while ((elem = ulist_next(ulist, uiter)) { + * for (all child nodes n in elem) + * ulist_add(ulist, n); + * do something useful with the node; + * } + * ulist_free(ulist); + * + * This assumes the graph nodes are adressable by u64. This stems from the + * usage for tree enumeration in btrfs, where the logical addresses are + * 64 bit. + * + * It is also useful for tree enumeration which could be done elegantly + * recursively, but is not possible due to kernel stack limitations. The + * loop would be similar to the above. + */ + +/** + * ulist_init - freshly initialize a ulist + * @ulist: the ulist to initialize + * + * Note: don't use this function to init an already used ulist, use + * ulist_reinit instead. + */ +void ulist_init(struct ulist *ulist) +{ + INIT_LIST_HEAD(ulist-nodes); + ulist-root = RB_ROOT; + ulist-nnodes = 0; +} + +/** + * ulist_fini - free up additionally allocated memory for the ulist + * @ulist: the ulist from which to free the additional memory + * + * This is useful in cases where the base 'struct ulist' has been statically + * allocated. + */ +static void ulist_fini(struct ulist *ulist) +{ + struct ulist_node *node; + struct ulist_node *next; + + list_for_each_entry_safe(node, next, ulist-nodes, list) { + kfree(node); + } + ulist-root = RB_ROOT; + INIT_LIST_HEAD(ulist-nodes); +} + +/** + * ulist_reinit - prepare a ulist for reuse + * @ulist: ulist to be reused + * + * Free up all additional memory allocated for the list elements and reinit + * the ulist. + */ +void ulist_reinit(struct ulist *ulist) +{ + ulist_fini(ulist); + ulist_init(ulist); +} + +/** + * ulist_alloc - dynamically allocate a ulist + * @gfp_mask: allocation flags to for base allocation + * + * The allocated ulist will be returned in an initialized state. 
+ */ +struct ulist *ulist_alloc(gfp_t gfp_mask) +{ + struct ulist *ulist = kmalloc(sizeof(*ulist), gfp_mask); + + if (!ulist) + return NULL; + + ulist_init(ulist); + + return ulist; +} + +/** + * ulist_free - free dynamically allocated ulist + * @ulist: ulist to free + * + * It is not necessary to call ulist_fini before. + */ +void ulist_free(struct ulist *ulist) +{ + if (!ulist) + return; + ulist_fini(ulist); + kfree(ulist); +} + +static struct ulist_node *ulist_rbtree_search(struct ulist *ulist, u64 val) +{ + struct rb_node *n = ulist-root.rb_node; + struct ulist_node *u = NULL; + + while (n) { + u = rb_entry(n, struct ulist_node,
[PATCH 3/3] btrfs-progs: add quota group verify code
This patch adds functionality (in qgroup-verify.c) to compute bytecounts in
subvolume quota groups. The original groups are read in and stored in memory
so that after we compute our own bytecounts, we can compare them with those
on disk. A print function is provided to do this comparison and show the
results on the console.

A 'qgroup check' pass is added to btrfsck. If any subvolume quota groups
differ from what we compute, the differences for them are printed. We also
provide an option '--qgroup-report' which will run only the quota check code
and print a report on all quota groups. Other than making it possible to
verify that our qgroup changes work correctly, this mode can also be used in
xfstests for automated checking after qgroup tests.

This patch does not address the following:

- compressed counts are identical to non compressed, because kernel doesn't
  make the distinction yet. Adding the code to verify compressed counts
  shouldn't be hard at all though once kernel can do this.

- It is only concerned with subvolume quota groups (like most of
  btrfs-progs).

Signed-off-by: Mark Fasheh mfas...@suse.de --- Makefile|2 +- cmds-check.c| 24 ++ ctree.h | 10 + disk-io.c | 16 +- print-tree.c|2 +- print-tree.h|1 + qgroup-verify.c | 1085 +++ qgroup-verify.h | 25 ++ 8 files changed, 1161 insertions(+), 4 deletions(-) create mode 100644 qgroup-verify.c create mode 100644 qgroup-verify.h diff --git a/Makefile b/Makefile index 202013e..51e5264 100644 --- a/Makefile +++ b/Makefile @@ -10,7 +10,7 @@ objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ root-tree.o dir-item.o file-item.o inode-item.o inode-map.o \ extent-cache.o extent_io.o volumes.o utils.o repair.o \ qgroup.o raid6.o free-space-cache.o list_sort.o props.o \ - ulist.o + ulist.o qgroup-verify.o cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o cmds-send.o cmds-receive.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ diff --git a/cmds-check.c b/cmds-check.c index d195e7a..5401ad9 100644 --- a/cmds-check.c +++ b/cmds-check.c @@ -38,6 +38,7 @@ #include commands.h #include free-space-cache.h #include btrfsck.h +#include qgroup-verify.h static u64 bytes_used = 0; static u64 total_csum_bytes = 0; @@ -6427,6 +6428,7 @@ static struct option long_options[] = { { init-csum-tree, 0, NULL, 0 }, { init-extent-tree, 0, NULL, 0 }, { backup, 0, NULL, 0 }, + { qgroup-report, 0, NULL, 'Q' }, { NULL, 0, NULL, 0} }; @@ -6439,6 +6441,7 @@ const char * const cmd_check_usage[] = { --repairtry to repair the filesystem, --init-csum-treecreate a new CRC tree, --init-extent-tree create a new extent tree, + --qgroup-report print a report on qgroup consistency, NULL }; @@ -6453,6 +6456,7 @@ int cmd_check(int argc, char **argv) u64 num; int option_index = 0; int init_csum_tree = 0; + int qgroup_report = 0; enum btrfs_open_ctree_flags ctree_flags = OPEN_CTREE_PARTIAL | OPEN_CTREE_EXCLUSIVE; @@ -6479,6 +6483,9 @@ int cmd_check(int argc, char **argv) printf(using SB copy %llu, bytenr %llu\n, num, (unsigned long 
long)bytenr); break; + case 'Q': + qgroup_report = 1; + break; case '?': case 'h': usage(cmd_check_usage); @@ -6526,6 +6533,14 @@ int cmd_check(int argc, char **argv) root = info-fs_root; uuid_unparse(info-super_copy-fsid, uuidbuf); + if (qgroup_report) { + printf(Print quota groups for %s\nUUID: %s\n, argv[optind], + uuidbuf); + ret = qgroup_verify_all(info); + if (ret == 0) + print_qgroup_report(1); + goto close_out; + } printf(Checking filesystem on %s\nUUID: %s\n, argv[optind], uuidbuf); if (!extent_buffer_uptodate(info-tree_root-node) || @@ -6629,11 +6644,20 @@ int cmd_check(int argc, char **argv) free(bad); } + if (info-quota_enabled) { + int err; + fprintf(stderr, checking quota groups\n); + err = qgroup_verify_all(info); + if (err) + goto out; + } + if (!list_empty(root-fs_info-recow_ebs)) { fprintf(stderr, Transid errors in file system\n); ret = 1; } out: + print_qgroup_report(0); if (found_old_backref) {
Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm
On Wed, May 07, 2014 at 01:08:06PM +0200, Tomasz Torcz wrote:
> On Wed, May 07, 2014 at 06:56:29PM +0800, Liu Bo wrote:
> > xxHash is an extremely fast non-cryptographic Hash algorithm, working at
> > speeds close to RAM limits.[1]
> >
> > And xxhash is 32-bits hash, same as crc32.
> >
> > Here is the hash comparison extracted from the link[1]:
> > (single thread, Windows Seven 32 bits, using Open Source's SMHasher on a
> > Core 2 Duo @3GHz)
> >
> > Name    Speed      Q.Score  Author
> > xxHash  5.4 GB/s   10
> > CRC32   0.43 GB/s  9
>
> Core 2 Duo is awfully old CPU. Since 2008, Intel CPUs have crc32
> instruction, hugely speeding up CRC operations.

Just for kicks I (sloppily) benchmarked a few of the kernel's hash
implementations on a Core i5-3320M CPU @3.3GHz:

xxhash:               6.0GB/s
crc32c-intel:         11.5GB/s
crc32c (no hw accel): 1.8GB/s

--D

> --
> Tomasz Torcz                God, root, what's the difference?
> xmpp: zdzich...@chrome.pl   God is more forgiving.
[PATCH] xfstests: fix flink test
I don't have flink support in my xfsprogs, but it doesn't fail with "command
not found" or whatever, it fails because I don't have the -T option. So fix
_require_xfs_io_command to check for an invalid option and not run. This way
I get notrun instead of a failure. Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
 common/rc | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/common/rc b/common/rc
index 5c13db5..4fa7e63 100644
--- a/common/rc
+++ b/common/rc
@@ -1258,6 +1258,8 @@ _require_xfs_io_command()
 		_notrun "xfs_io $command support is missing"
 	echo $testio | grep -q "Operation not supported" && \
 		_notrun "xfs_io $command failed (old kernel/wrong fs?)"
+	echo $testio | grep -q "invalid option" && \
+		_notrun "xfs_io $command support is missing"
 }
 
 # Check that a fs has enough free space (in 1024b blocks)
-- 
1.8.3.1
Re: [RFC PATCH 0/3] Btrfs: add xxhash algorithm
On Wed, May 7, 2014 at 1:50 PM, Darrick J. Wong darrick.w...@oracle.com wrote:
> Just for kicks I (sloppily) benchmarked a few of the kernel's hash
> implementations on a Core i5-3320M CPU @3.3GHz:
>
> xxhash:               6.0GB/s
> crc32c-intel:         11.5GB/s
> crc32c (no hw accel): 1.8GB/s

CRC also usually has the very mild data recovery advantage that if your
error is just a bitflip you can correct it using the crc in a
computationally efficient manner, potentially enabling fancy recovery
tools... so if it were merely equal in speed you'd still probably prefer to
use a CRC.
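As a rough illustration of the bitflip-recovery idea, the sketch below repairs one flipped bit in a buffer given the CRC-32C it is supposed to have. Everything here is an assumption made for illustration: the function names are hypothetical, the CRC is a plain bitwise implementation rather than the kernel's, and the search simply tries every bit position. A real recovery tool would more likely exploit the linearity of CRCs to map the checksum difference straight to a bit position instead of brute-forcing it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
uint32_t crc32c(const uint8_t *data, size_t len)
{
	uint32_t crc = 0xFFFFFFFFu;

	for (size_t i = 0; i < len; i++) {
		crc ^= data[i];
		for (int k = 0; k < 8; k++) {
			/* Shift right, XOR in the polynomial if the low bit was set. */
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
		}
	}
	return ~crc;
}

/*
 * Hypothetical helper: try to repair a single flipped bit in buf so that
 * its CRC-32C equals 'want'. Returns the index of the repaired bit,
 * -1 if buf already matches, or -2 if no single-bit flip explains the
 * mismatch (buf is left unchanged in that case).
 */
long fix_single_bitflip(uint8_t *buf, size_t len, uint32_t want)
{
	assert(buf != NULL);

	if (crc32c(buf, len) == want)
		return -1;

	for (size_t bit = 0; bit < len * 8; bit++) {
		uint8_t mask = (uint8_t)(1u << (bit % 8));

		buf[bit / 8] ^= mask;		/* tentatively flip this bit */
		if (crc32c(buf, len) == want)
			return (long)bit;	/* keep the repaired buffer */
		buf[bit / 8] ^= mask;		/* undo and try the next bit */
	}
	return -2;
}
```

The brute-force loop recomputes the checksum once per bit, so it is quadratic in the buffer size. That is acceptable for a sketch, but it is not the computationally efficient method the message alludes to.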
Re: [PATCH v2 1/3] xfstests/btrfs: add qgroup rescan stress test
On 03/09/2014 11:44 PM, Wang Shilong wrote: Test flow is to run fsstress after triggering quota rescan. the ruler is simple, we just remove all files and directories, sync filesystem and see if qgroup's ref and excl are nodesize. Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com --- v1-v2: switch into new helper _run_btrfs_util_prog() --- tests/btrfs/041 | 76 + tests/btrfs/041.out | 3 +++ tests/btrfs/group | 1 + 3 files changed, 80 insertions(+) create mode 100644 tests/btrfs/041 create mode 100644 tests/btrfs/041.out diff --git a/tests/btrfs/041 b/tests/btrfs/041 new file mode 100644 index 000..92bd080 --- /dev/null +++ b/tests/btrfs/041 @@ -0,0 +1,76 @@ +#! /bin/bash +# FSQA Test No. btrfs/041 +# +# Quota rescan stress test, we run fsstress and quota rescan concurrently +# +#--- +# Copyright (C) 2014 Fujitsu. All rights reserved. +# +# This program is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public License as +# published by the Free Software Foundation. +# +# This program is distributed in the hope that it would be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program; if not, write the Free Software Foundation, +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA +# +#--- +# + +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo QA output created by $seq + +here=`pwd` +tmp=/tmp/$$ +status=1 + +_cleanup() +{ + cd / + rm -f $tmp.* +} +trap _cleanup; exit \$status 0 1 2 3 15 + +# get standard environment, filters and checks +. ./common/rc +. 
./common/filter
+
+# real QA test starts here
+_need_to_be_root
+_supported_fs btrfs
+_supported_os Linux
+_require_scratch
+
+rm -f $seqres.full
+
+run_check _scratch_mkfs -b 1g --nodesize 4096
+run_check _scratch_mount
+

Add -o nospace_cache here please, otherwise I don't get the same output.

+# -w ensures that the only ops are ones which cause write I/O
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
+	$FSSTRESS_AVOID /dev/null
+
+_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
+	$SCRATCH_MNT/snap1 $seqres.full 21

_run_btrfs_util_prog will already redirect to $seqres.full, you don't need
this part.

+
+run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
+	$FSSTRESS_AVOID /dev/null
+
+_run_btrfs_util_prog quota enable $SCRATCH_MNT
+_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
+
+#ignore removing subvolume errors
+rm -rf $SCRATCH_MNT/* /dev/null
+
+_run_btrfs_util_prog filesystem sync $SCRATCH_MNT $seqres.full 21

Same here.

+_run_btrfs_util_prog qgroup show $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
+	| $AWK_PROG '{print $1 $2 $3 }'
+

You can't use _run_btrfs_util_prog here, it will eat the output. You need to
use $BTRFS_UTIL_PROG instead.

Fix these up and resend, this is a really important test and I needed it to
make sure my qgroups patch was right (which it is now.) Thanks,

Josef
[PATCH] Btrfs: add sanity tests for new qgroup accounting code
This exercises the various parts of the new qgroup accounting code. We do some basic stuff and do some things with the shared refs to make sure all that code works. I had to add a bunch of infrastructure because I needed to be able to insert items into a fake tree without having to do all the hard work myself, hopefully this will be usefull in the future. Thanks, Signed-off-by: Josef Bacik jba...@fb.com --- fs/btrfs/Makefile | 2 +- fs/btrfs/backref.c| 4 + fs/btrfs/ctree.c | 4 + fs/btrfs/ctree.h | 3 + fs/btrfs/disk-io.c| 18 +- fs/btrfs/disk-io.h| 1 + fs/btrfs/extent-tree.c| 17 ++ fs/btrfs/extent_io.c | 47 + fs/btrfs/extent_io.h | 2 + fs/btrfs/qgroup.c | 23 +++ fs/btrfs/super.c | 3 + fs/btrfs/tests/btrfs-tests.c | 96 + fs/btrfs/tests/btrfs-tests.h | 9 + fs/btrfs/tests/inode-tests.c | 35 +--- fs/btrfs/tests/qgroup-tests.c | 468 ++ fs/btrfs/transaction.h| 1 + 16 files changed, 696 insertions(+), 37 deletions(-) create mode 100644 fs/btrfs/tests/qgroup-tests.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index ae837d2..b566ef3 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -17,4 +17,4 @@ btrfs-$(CONFIG_BTRFS_FS_REF_VERIFY) += ref-verify.o btrfs-$(CONFIG_BTRFS_FS_RUN_SANITY_TESTS) += tests/free-space-tests.o \ tests/extent-buffer-tests.o tests/btrfs-tests.o \ - tests/extent-io-tests.o tests/inode-tests.o + tests/extent-io-tests.o tests/inode-tests.o tests/qgroup-tests.o diff --git a/fs/btrfs/backref.c b/fs/btrfs/backref.c index 10db21f..f09aa18 100644 --- a/fs/btrfs/backref.c +++ b/fs/btrfs/backref.c @@ -900,7 +900,11 @@ again: goto out; BUG_ON(ret == 0); +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS + if (trans likely(trans-type != __TRANS_DUMMY)) { +#else if (trans) { +#endif /* * look if there are updates for this ref queued and lock the * head diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c index 208a84d..aa849e0 100644 --- a/fs/btrfs/ctree.c +++ b/fs/btrfs/ctree.c @@ -1503,6 +1503,10 @@ static inline int should_cow_block(struct btrfs_trans_handle 
*trans, struct btrfs_root *root, struct extent_buffer *buf) { +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS + if (unlikely(root-dummy_root)) + return 0; +#endif /* ensure we can see the force_cow */ smp_rmb(); diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 33a1b27..96dae25 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1781,6 +1781,7 @@ struct btrfs_root { int in_radix; #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS int dummy_root; + u64 alloc_bytenr; #endif u64 defrag_trans_start; struct btrfs_key defrag_progress; @@ -4096,6 +4097,8 @@ static inline int btrfs_defrag_cancelled(struct btrfs_fs_info *fs_info) /* Sanity test specific functions */ #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS void btrfs_test_destroy_inode(struct inode *inode); +int btrfs_verify_qgroup_counts(struct btrfs_fs_info *fs_info, u64 qgroupid, + u64 rfer, u64 excl); #endif #endif diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index d965f51..009baaa 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1114,6 +1114,11 @@ struct extent_buffer *btrfs_find_tree_block(struct btrfs_root *root, struct extent_buffer *btrfs_find_create_tree_block(struct btrfs_root *root, u64 bytenr, u32 blocksize) { +#ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS + if (unlikely(root-dummy_root)) + return alloc_test_extent_buffer(root-fs_info, bytenr, + blocksize); +#endif return alloc_extent_buffer(root-fs_info, bytenr, blocksize); } @@ -1296,6 +1301,7 @@ struct btrfs_root *btrfs_alloc_dummy_root(void) return ERR_PTR(-ENOMEM); __setup_root(4096, 4096, 4096, 4096, root, NULL, 1); root-dummy_root = 1; + root-alloc_bytenr = 0; return root; } @@ -2095,7 +2101,7 @@ static void free_root_pointers(struct btrfs_fs_info *info, int chunk_root) free_root_extent_buffers(info-chunk_root); } -static void del_fs_roots(struct btrfs_fs_info *fs_info) +void btrfs_free_fs_roots(struct btrfs_fs_info *fs_info) { int ret; struct btrfs_root *gang[8]; @@ -2984,7 +2990,7 @@ fail_qgroup: fail_trans_kthread: 
kthread_stop(fs_info-transaction_kthread); btrfs_cleanup_transaction(fs_info-tree_root); - del_fs_roots(fs_info); + btrfs_free_fs_roots(fs_info); fail_cleaner: kthread_stop(fs_info-cleaner_kthread); @@ -3519,8 +3525,10 @@ void
Re: raid0 vs single, and should we allow -mdup by default on SSDs?
On Wed, May 7, 2014 at 3:52 AM, Marc MERLIN m...@merlins.org wrote:
> On Wed, May 07, 2014 at 09:29:41AM +0100, Hugo Mills wrote:
> > On Wed, May 07, 2014 at 01:18:40AM -0700, Marc MERLIN wrote:
> > > On Tue, May 06, 2014 at 07:39:12PM +, Duncan wrote:
> > > > That appears to be a very good use of either -d raid0 or -d single,
> > > > yes. And since you're apparently not streaming such high resolution
> > > > video that you NEED the raid0, single does indeed give you a somewhat
> > > > better chance at recovery.
> > >
> > > zoneminder saves 'video' as a stream of independent small jpegs, so I'm
> > > good. Actually come to think of it they're so small that they probably
> > > all ended up in the raid1 metadata. That also means that I'm not
> > > getting twice the storage space like I planned to. Oh well...
> >
> > There's a mount option to change the threshold at which files are
> > inlined in metadata: max_inline=<bytes>. You could play with that for
> > this particular use-case.
>
> Oh cool, thank you.

Since each non-inlined file will occupy a minimum of 4k, you may find that
inlining will still save space even if it is duplicated. Even if they are
duplicated in the metadata under RAID1, inlining a bunch of 256 byte files
will still be more space efficient than storing them as regular files. But
if most of the files are in the 2k-3k range, it may be more efficient to
store them as regular files.
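The trade-off can be put into numbers with a back-of-the-envelope sketch. The two helpers below are hypothetical and rest on simplifying assumptions: metadata is stored twice (RAID1), data is stored once (-d single), the block size is 4096 bytes, and per-item metadata overhead is ignored.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes consumed when a file's contents are inlined in metadata that is
 * kept in 'copies' copies (2 for RAID1). Item headers are ignored.
 */
uint64_t inline_cost(uint64_t file_size, unsigned copies)
{
	return file_size * copies;
}

/*
 * Bytes consumed when the file gets its own data extent, stored once and
 * rounded up to whole blocks.
 */
uint64_t extent_cost(uint64_t file_size, uint64_t block_size)
{
	assert(block_size > 0);
	return (file_size + block_size - 1) / block_size * block_size;
}
```

Under these assumptions a 256 byte file costs 512 bytes inlined versus 4096 bytes as a regular file, while a 3072 byte file costs 6144 bytes inlined versus 4096 bytes as a regular file. The break-even point is at half the block size, which matches the 2k-3k remark above.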
URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
In a moment of irony, my laptop's boot SSD's btrfs fileysstem crashed last night with my btrfs talk slides still open on it. It went read only overnight but did not crash. Please tell me ASAP if you need anything off the filesystem before I recover it since I'm travelling, and need to bring my laptop back up to a working state ASAP (I'll save the irony of showing up at my talk with Err, I can't give my btrfs talk, btrfs crashed on my laptop). I'm not interested in partial recovery, I have hourly backups on my secondary drive on my laptop (thankfully) and was able to boot from that drive (double thankfully). Good thing I plan ahead :) If there is something you'd like me to try to recover the filesystem or to get more data off it to diagnose the bug, please let me know ASAP. Otherwise, I'll just wipe it and recover from my disk backup, but obviously this is bad. Details: My system didn't crash, but the filesystem went read only, and of course couldn't syslog the error. Thankfully I was saved by remote syslog which did work: kernel: [545039.443412] [ cut here ] kernel: [545039.443429] WARNING: CPU: 2 PID: 556 at fs/btrfs/inode.c:4927 btrfs_invalidate_inode kernel: [545039.443432] Modules linked in: e1000e iwlmvm mac80211 iwlwifi cfg80211 xhci_hcd usb_storage rndis_host cdc_ether btusb uvcvideo usbnet ehci_pci ehci_hcd usbcore usb_common tun sg nls_utf8 nls_cp437 vfat fat rpcsec_gss_krb5 nfsv4 ctr ccm ipt_MASQUERADE ipt_REJECT xt_tcpudp xt_conntrack xt_LOG iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables ppdev cpufreq_powersave cpufreq_userspace cpufreq_conservative cpufreq_stats rfcomm bnep autofs4 binfmt_misc uinput nfsd auth_rpcgss nfs_acl nfs lockd fscache sunrpc configs parport_pc lp parport input_polldev loop firewire_sbp2 firewire_core crc_itu_t ecryptfs videobuf2_vmalloc videobuf2_memops videobuf2_core videodev bluetooth 6lowpan_iphc 
media joydev arc4 snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_pcm_oss snd_mixer_oss thinkpad_acpi x86_pkg_temp_thermal s kernel: nd_pcm intel_powerclamp nvram coretemp snd_seq_midi snd_seq_midi_event kvm_intel snd_rawmidi kvm crct10dif_pclmul snd_seq crc32_pclmul rtsx_pci_ms iTCO_wdt iTCO_vendor_support ghash_clmulni_intel snd_seq_device memstick rtsx_pci_sdmmc snd_timer lpc_ich pcspkr microcode psmouse i2c_i801 serio_raw snd rtsx_pci soundcore tpm_tis rfkill tpm ac battery intel_smartconnect wmi evdev processor sata_sil24 r8169 mii fuse fan raid456 multipath mmc_block mmc_core dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_crypt dm_mod async_raid6_recov async_pq async_xor async_memcpy async_tx blowfish_x86_64 blowfish_common ecb xts crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd ptp pps_core thermal [last unloaded: e1000e] kernel: [545039.443693] CPU: 2 PID: 556 Comm: btrfs-transacti Tainted: G W3.14.0-amd64-i915-preempt-20140216 #2 kernel: [545039.443697] Hardware name: LENOVO 20BECT0/20BECT0, BIOS GMET28WW (1.08 ) 09/18/2013 kernel: [545039.443701] 8800cd9f3d80 8160a06d kernel: [545039.443718] 8800cd9f3db8 81050025 81234676 88040665c000 kernel: [545039.443727] 8800cd9f3e30 880406f708b8 880402181000 8800cd9f3dc8 kernel: [545039.443735] Call Trace: kernel: [545039.443746] [8160a06d] dump_stack+0x4e/0x7a kernel: [545039.443754] [81050025] warn_slowpath_common+0x7f/0x98 kernel: [545039.443761] [81234676] ? btrfs_invalidate_inodes+0x2f/0x12e kernel: [545039.443768] [810500ec] warn_slowpath_null+0x1a/0x1c kernel: [545039.443775] [81234676] btrfs_invalidate_inodes+0x2f/0x12e kernel: [545039.443784] [81227ac3] btrfs_cleanup_transaction+0x3b2/0x43f kernel: [545039.443792] [81227c92] transaction_kthread+0x142/0x1ab kernel: [545039.443799] [81227b50] ? 
btrfs_cleanup_transaction+0x43f/0x43f kernel: [545039.443807] [8106bc62] kthread+0xae/0xb6 kernel: [545039.443815] [8106bbb4] ? __kthread_parkme+0x61/0x61 kernel: [545039.443822] [8161637c] ret_from_fork+0x7c/0xb0 kernel: [545039.443829] [8106bbb4] ? __kthread_parkme+0x61/0x61 kernel: [545039.443834] ---[ end trace 3c290eaa69000df6 ]--- Now, if I try to mount it, I get: [ 17.234587] BTRFS: device label btrfs_pool1 devid 1 transid 415424 /dev/mapper/cryptroot [ 17.236873] BTRFS info (device dm-0): disk space caching is enabled [ 17.243687] BTRFS: bad tree block start 10983188636980216968 828930883584 [ 17.245986] BTRFS: bad tree block start 12509109177217855588 828930883584 [ 17.248174] BTRFS: failed to read tree root on dm-0 [ 17.325141] BTRFS: open_ctree failed mount -o ro,recovery gives: [ 412.572216] BTRFS: device label
Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
On 05/07/2014 07:39 PM, Marc MERLIN wrote:
> In a moment of irony, my laptop's boot SSD's btrfs filesystem crashed last
> night with my btrfs talk slides still open on it.
> It went read only overnight but did not crash.
>
> Please tell me ASAP if you need anything off the filesystem before I
> recover it since I'm travelling, and need to bring my laptop back up to a
> working state ASAP (I'll save the irony of showing up at my talk with "Err,
> I can't give my btrfs talk, btrfs crashed on my laptop").
>
> I'm not interested in partial recovery, I have hourly backups on my
> secondary drive on my laptop (thankfully) and was able to boot from that
> drive (double thankfully). Good thing I plan ahead :)
>
> If there is something you'd like me to try to recover the filesystem or to
> get more data off it to diagnose the bug, please let me know ASAP.
> Otherwise, I'll just wipe it and recover from my disk backup, but obviously
> this is bad.

Hi Marc,

Looks like you're on 3.14, did this have the fixes from my git tree that
went into 3.15-rc?

For now I'd say that if you can make a dd image of the FS, please do so.
Otherwise, I don't want to suck down your time right before the trip.

-chris
Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
On Wed, May 07, 2014 at 08:38:38PM -0400, Chris Mason wrote:
> Looks like you're on 3.14, did this have the fixes from my git tree that
> went into 3.15-rc?

You're correct, it's running 3.14.0. Considering that it's my main laptop
that I kind of need to work, I avoid rc kernels if possible :)
But if I had known that 3.14 had corruption problems, I'd have re-thought
that :)
(besides my report, were there other ones I missed? Is 3.14.0 something to
avoid for now?)
(yes, I know 3.14.3 is out now, I should upgrade)

> For now I'd say that if you can make a dd image of the FS, please do so.
> Otherwise, I don't want to suck down your time right before the trip.

A full dd image is not practical, it's 1TB and I have nowhere to put it.
I could do an image if you'd like, and upload it when I have proper internet
(I'm thinking it's likely going to be a 1GB upload)

(by the way, I'm already on the trip, and I have 1h before my next plane and
a bit of time tonight (in 10h my time, that is) to upload stuff or more logs
if that helps.)

But more importantly, I have my main file server at home running 3.14.0 too.
Is there a risk of known corruption, or nothing known yet?

Or if you'd like output of fsck in dry-run mode, I can do that too.

Thanks,
Marc
[PATCH] Btrfs-progs: fsck: add an option to check data csums
This patch adds an option '--check-data-csum' to verify data csums.
fsck won't check data csums unless users specify this option explicitly.

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 Documentation/btrfs-check.txt |   2 +
 cmds-check.c                  | 122 --
 2 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/Documentation/btrfs-check.txt b/Documentation/btrfs-check.txt
index 485a49c..bc10755 100644
--- a/Documentation/btrfs-check.txt
+++ b/Documentation/btrfs-check.txt
@@ -30,6 +30,8 @@ try to repair the filesystem.
 create a new CRC tree.
 --init-extent-tree::
 create a new extent tree.
+--check-data-csum::
+check data csums.
 
 EXIT STATUS
 -----------
diff --git a/cmds-check.c b/cmds-check.c
index 103efc5..b53d49c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -53,6 +53,7 @@ static LIST_HEAD(delete_items);
 static int repair = 0;
 static int no_holes = 0;
 static int init_extent_tree = 0;
+static int check_data_csum = 0;
 
 struct extent_backref {
 	struct list_head list;
@@ -3634,6 +3635,106 @@ static int check_space_cache(struct btrfs_root *root)
 	return error ? -EINVAL : 0;
 }
 
+static int read_extent_data(struct btrfs_root *root, char *data,
+			u64 logical, u64 len, int mirror)
+{
+	u64 offset = 0;
+	struct btrfs_multi_bio *multi = NULL;
+	struct btrfs_fs_info *info = root->fs_info;
+	struct btrfs_device *device;
+	int ret = 0;
+	u64 read_len;
+	unsigned long bytes_left = len;
+
+	while (bytes_left) {
+		read_len = bytes_left;
+		device = NULL;
+		ret = btrfs_map_block(&info->mapping_tree, READ,
+				logical + offset, &read_len, &multi,
+				mirror, NULL);
+		if (ret) {
+			fprintf(stderr, "Couldn't map the block %llu\n",
+					logical + offset);
+			goto error;
+		}
+		device = multi->stripes[0].dev;
+
+		if (device->fd == 0)
+			goto error;
+
+		if (read_len > root->sectorsize)
+			read_len = root->sectorsize;
+		if (read_len > bytes_left)
+			read_len = bytes_left;
+
+		ret = pread64(device->fd, data + offset, read_len,
+			      multi->stripes[0].physical);
+		if (ret != read_len)
+			goto error;
+		offset += read_len;
+		bytes_left -= read_len;
+		kfree(multi);
+		multi = NULL;
+	}
+	return 0;
+error:
+	kfree(multi);
+	return -EIO;
+}
+
+static int check_extent_csums(struct btrfs_root *root, u64 bytenr,
+			u64 num_bytes, unsigned long leaf_offset,
+			struct extent_buffer *eb) {
+
+	u64 offset = 0;
+	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
+	char *data;
+	u32 crc;
+	unsigned long tmp;
+	char result[csum_size];
+	char out[csum_size];
+	int ret = 0;
+	__s64 cmp;
+	int mirror;
+	int num_copies = btrfs_num_copies(&root->fs_info->mapping_tree,
+			bytenr, num_bytes);
+
+	BUG_ON(num_bytes % root->sectorsize);
+	data = malloc(root->sectorsize);
+	if (!data)
+		return -ENOMEM;
+
+	while (offset < num_bytes) {
+		mirror = 0;
+again:
+		ret = read_extent_data(root, data, bytenr + offset,
+				root->sectorsize, mirror);
+		if (ret)
+			goto out;
+
+		crc = ~(u32)0;
+		crc = btrfs_csum_data(NULL, (char *)data, crc,
+				      root->sectorsize);
+		btrfs_csum_final(crc, result);
+
+		tmp = leaf_offset + offset / root->sectorsize * csum_size;
+		read_extent_buffer(eb, out, tmp, csum_size);
+		cmp = memcmp(out, result, csum_size);
+		if (cmp) {
+			fprintf(stderr, "mirror: %d range bytenr: %llu, len: %d checksum mismatch\n",
+				mirror, bytenr + offset, root->sectorsize);
+			if (mirror < num_copies - 1) {
+				mirror += 1;
+				goto again;
+			}
+		}
+		offset += root->sectorsize;
+	}
+out:
+	free(data);
+	return ret;
+}
+
 static int check_extent_exists(struct btrfs_root *root, u64 bytenr,
 			       u64 num_bytes)
 {
@@ -3771,6 +3872,8 @@ static int check_csums(struct btrfs_root *root)
 	u16 csum_size = btrfs_super_csum_size(root->fs_info->super_copy);
 	int errors = 0;
 	int ret;
+	u64 data_len;
+	unsigned long
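The per-sector comparison that check_extent_csums() performs can be sketched outside of btrfs-progs as follows. This is only an illustration, not the patch's code: it assumes the data checksum is plain CRC-32C with a seed of ~0 and a final bit inversion (which is what the `crc = ~(u32)0` seed and the btrfs_csum_final() call suggest), the names crc32c() and first_bad_sector() are made up here, and the mirror retry from the patch is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78. */
static uint32_t crc32c(const void *buf, size_t len)
{
	const unsigned char *p = buf;
	uint32_t crc = ~(uint32_t)0;	/* assumed seed, as in the patch */

	while (len--) {
		crc ^= *p++;
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return ~crc;			/* assumed final inversion */
}

/*
 * Checksum every sector of data[] and compare it with expected[].
 * Returns the index of the first mismatching sector, or -1 if all match.
 * (Hypothetical helper; the patch reads expected sums from the csum leaf.)
 */
static long first_bad_sector(const unsigned char *data, size_t len,
			     size_t sectorsize, const uint32_t *expected)
{
	assert(sectorsize != 0 && len % sectorsize == 0);

	for (size_t i = 0; i < len / sectorsize; i++)
		if (crc32c(data + i * sectorsize, sectorsize) != expected[i])
			return (long)i;
	return -1;
}
```

A caller would pass one expected checksum per sector; a return value of -1 means every sector matched.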
Re: URGENT: my laptop's boot ssd btrfs crashed, what do you need off it?
On Wed, May 07, 2014 at 05:43:44PM -0700, Marc MERLIN wrote:
> A full dd image is not practical, it's 1TB and I have nowhere to put it.
> I could do an image if you'd like, and upload it when I have proper
> internet (I'm thinking it's likely going to be a 1GB upload)
>
> In the meantime, here is fsck output:
> legolas:/boot/grub# btrfsck /dev/mapper/disk1 2>&1 | tee /tmp/fsck
> Check tree block failed, want=828930883584, have=10983188636980216968
> Check tree block failed, want=828930883584, have=10983188636980216968
> Check tree block failed, want=828930883584, have=12509109177217855588
> Check tree block failed, want=828930883584, have=12509109177217855588
> Check tree block failed, want=828930883584, have=12509109177217855588
> read block failed check_tree_block
> Couldn't read tree root
> Critical roots corrupted, unable to fsck the FS
> Checking filesystem on /dev/mapper/disk1
> UUID: 4850ee22-bf32-4131-a841-02abdb4a5ba6
>
> Let me know if I should try --init-csum-tree and/or --init-extent-tree

legolas:/# /sbin/btrfs-find-root /dev/mapper/disk1
Super think's the tree root is at 828930883584, chunk root 20979712
Well block 12585312256 seems great, but generation doesn't match, have=410782, want=415424 level 0
(...)
Well block 82629248 seems great, but generation doesn't match, have=415420, want=415424 level 0
Found tree root at 828930887680 gen 415424 level 0
legolas:/#

I noted that:
828930887680 - 828930883584 = 4096
So I have a root tree that's bigger than what super is looking for?
Could that be my problem?

Can btrfs restore be used to navigate the filesystem and look for files
and patterns without dumping the entire filesystem, which I don't have
room for?

In the meantime, I didn't get it to work anyway:
legolas:/var/local/space/nobck# btrfs restore -t 828930887680 /dev/mapper/disk1 restore
Couldn't setup extent tree
Couldn't read fs root: -2
extent buffer leak: start 828930887680 len 4096

Now, even if that worked,
https://btrfs.wiki.kernel.org/index.php/Restore#Advanced_usage
says I can use -r to only restore a subvolume, but I don't know its
objectid. How would I do this?

(I don't actually really need the data, I'm just trying to learn what I
would do if I did)

Thanks,
Marc
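The offset arithmetic in this message can be double-checked with a few lines of C. This is only a sketch: the helper name blocks_between() is made up for illustration, and the 4096-byte block size is taken from the "len 4096" in the restore output rather than read from the superblock. It shows that the root found by btrfs-find-root sits exactly one such block after the address the superblock points at; it says nothing about why.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of whole blocks between two byte addresses.
 * Hypothetical helper: b must not be below a.
 */
static uint64_t blocks_between(uint64_t a, uint64_t b, uint64_t block_size)
{
	assert(block_size != 0 && b >= a);
	return (b - a) / block_size;
}
```

Calling blocks_between(828930883584, 828930887680, 4096) compares the superblock's tree root address with the one btrfs-find-root reported.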
Re: [PATCH] btrfs-progs: doc: link btrfsck to btrfs-check
-------- Original Message --------
Subject: Re: [PATCH] btrfs-progs: doc: link btrfsck to btrfs-check
From: David Sterba dste...@suse.cz
To: Qu Wenruo quwen...@cn.fujitsu.com
Date: 2014-04-18 22:48

> On Thu, Apr 17, 2014 at 08:47:28AM +0800, Qu Wenruo wrote:
>>> @@ -73,6 +74,7 @@ install: install-man
>>>  install-man: man
>>>  	$(INSTALL) -d -m 755 $(DESTDIR)$(man8dir)
>>>  	$(INSTALL) -m 644 $(GZ_MAN8) $(DESTDIR)$(man8dir)
>>> +	$(LNS) btrfs-check.txt $(DESTDIR)$(man8dir)
>> Shouldn't the source of the soft link be btrfs-check.8.gz?

Forgot to mention that the dest is also wrong.
With only the directory given as the destination, the link is created
under the same name as its source, so
$(DESTDIR)$(man8dir)/btrfs-check.8.gz would become an infinite loop
(a link pointing to itself).

The correct one should be like the following:
+	$(LNS) btrfs-check.8.gz $(DESTDIR)$(man8dir)/btrfsck.8.gz

Thanks,
Qu

>>> @@ -47,4 +49,3 @@ SEE ALSO
>>>  `mkfs.btrfs`(8),
>>>  `btrfs-scrub`(8),
>>>  `btrfs-rescue`(8)
>>> -`btrfsck`(8)
>> Sorry to bother you, but 'btrfs-scrub', 'btrfs-rescue' and
>> 'btrfs-restore' also seem to mention 'btrfsck', and the 'btrfsck'
>> reference may need to be removed there too.
>
> Thanks for catching them, I'll fix it up.
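The "link pointing to itself" failure mode can be reproduced in isolation. The sketch below is not part of the patch or of the btrfs-progs Makefile: open_self_symlink() is a hypothetical helper that creates, in a scratch directory under /tmp, a symlink whose relative target is its own name, and reports what open() says about it. On Linux the expected error is ELOOP.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Create <tmpdir>/<name> as a symlink whose target is also <name>, then
 * try to open it.  Returns the errno from open(), 0 if the open
 * unexpectedly succeeded, or -1 if the scratch setup failed.
 */
static int open_self_symlink(const char *name)
{
	char dir[] = "/tmp/selflinkXXXXXX";
	char path[PATH_MAX];
	int fd, err = 0;

	assert(name != NULL && name[0] != '\0');
	if (mkdtemp(dir) == NULL)
		return -1;
	snprintf(path, sizeof(path), "%s/%s", dir, name);

	/* A relative target is resolved from the link's own directory. */
	if (symlink(name, path) != 0) {
		rmdir(dir);
		return -1;
	}

	fd = open(path, O_RDONLY);
	if (fd < 0)
		err = errno;
	else
		close(fd);

	unlink(path);
	rmdir(dir);
	return err;
}
```

Giving the link a different name from its target, as in the corrected $(LNS) line, avoids the loop.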
Re: [PATCH V2 10/10] Btrfs: reclaim the reserved metadata space at background
On Mon, 10 Mar 2014 09:35:13 -0400, Josef Bacik wrote:
> On 03/06/2014 12:55 AM, Miao Xie wrote:
>> Before applying this patch, the task had to reclaim the metadata
>> space by itself if the metadata space was not enough. And When the
>> task started the space reclamation, all the other tasks which wanted
>> to reserve the metadata space were blocked. At some cases, they would
>> be blocked for a long time, it made the performance fluctuate wildly.
>>
>> So we introduce the background metadata space reclamation, when the
>> space is about to be exhausted, we insert a reclaim work into the
>> workqueue, the worker of the workqueue helps us to reclaim the
>> reserved space at the background. By this way, the tasks needn't
>> reclaim the space by themselves at most cases, and even if the tasks
>> have to reclaim the space or are blocked for the space reclamation,
>> they will get enough space more quickly.
>>
>> We needn't worry about the early enospc problem because all the
>> reclaim work is serialized by the lock.
>>
>> Signed-off-by: Miao Xie mi...@cn.fujitsu.com
>
> This causes generic/015 to fail with early enospc, I'm kicking this
> patch out, I'll take the rest. Thanks,

It is not an early enospc problem. This test checks whether the space
of a file is released immediately after the file is deleted. In fact,
the result of the test is unstable: the kernel may be syncing the file
data when we delete the file, and if so the space of the file will not
be released immediately. That case is rare, though, because the fs in
this test is just 50MB while the memory of most machines is much larger
(maybe 1GB). There are not many dirty pages, so the background flusher
may not be woken up immediately, nobody holds the inode of the test file
after we delete it, and its space can be released immediately.

After applying this patch, we flush the dirty pages because our
background metadata space reclaimer finds that the metadata space is
about to be used up (< 5% of the total metadata size) and needs to flush
dirty pages to reclaim some delalloc metadata space. That is, this patch
makes the rare case above happen easily.

Anyway, we need to improve this patch even though it is not a bug.
I will send out a new one.

Thanks
Miao

> Josef
[PATCH V3] Btrfs: reclaim the reserved metadata space at background
Before applying this patch, a task had to reclaim the metadata space by
itself if the metadata space was not enough. When the task started the
space reclamation, all the other tasks which wanted to reserve metadata
space were blocked. In some cases they would be blocked for a long time,
which made the performance fluctuate wildly.

So we introduce background metadata space reclamation: when the space is
about to be exhausted, we insert a reclaim work into the workqueue, and
the worker of the workqueue reclaims the reserved space in the
background. This way, the tasks needn't reclaim the space by themselves
in most cases, and even if the tasks have to reclaim the space or are
blocked for the space reclamation, they will get enough space more
quickly.

Here is my test result (tested by compilebench):
 Memory:	2GB
 CPU:		2Cores * 1CPU
 Partition:	40GB(SSD)

Test command:
 # compilebench -D mnt -m

Without this patch:
 intial create total runs 30 avg 54.36 MB/s (user 0.52s sys 2.44s)
 compile total runs 30 avg 123.72 MB/s (user 0.13s sys 1.17s)
 read compiled tree total runs 3 avg 81.15 MB/s (user 0.74s sys 4.89s)
 delete compiled tree total runs 30 avg 5.32 seconds (user 0.35s sys 4.37s)

With this patch:
 intial create total runs 30 avg 59.80 MB/s (user 0.52s sys 2.53s)
 compile total runs 30 avg 151.44 MB/s (user 0.13s sys 1.11s)
 read compiled tree total runs 3 avg 83.25 MB/s (user 0.76s sys 4.91s)
 delete compiled tree total runs 30 avg 5.29 seconds (user 0.34s sys 4.34s)

Signed-off-by: Miao Xie mi...@cn.fujitsu.com
---
Changelog v2 -> v3:
- change the condition under which the background reclamation starts.
---
 fs/btrfs/ctree.h       |   6 +++
 fs/btrfs/disk-io.c     |   3 ++
 fs/btrfs/extent-tree.c | 105 -
 fs/btrfs/super.c       |   1 +
 4 files changed, 114 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 4c48df5..f264edf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -33,6 +33,7 @@
 #include <asm/kmap_types.h>
 #include <linux/pagemap.h>
 #include <linux/btrfs.h>
+#include <linux/workqueue.h>
 #include "extent_io.h"
 #include "extent_map.h"
 #include "async-thread.h"
@@ -1313,6 +1314,8 @@ struct btrfs_stripe_hash_table {
 
 #define BTRFS_STRIPE_HASH_TABLE_BITS				11
 
+void btrfs_init_async_reclaim_work(struct work_struct *work);
+
 /* fs_info */
 struct reloc_control;
 struct btrfs_device;
@@ -1688,6 +1691,9 @@ struct btrfs_fs_info {
 
 	struct semaphore uuid_tree_rescan_sem;
 	unsigned int update_uuid_tree_gen:1;
+
+	/* Used to reclaim the metadata space in the background. */
+	struct work_struct async_reclaim_work;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 029d46c..475889a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2291,6 +2291,7 @@ int open_ctree(struct super_block *sb,
 	atomic_set(&fs_info->balance_cancel_req, 0);
 	fs_info->balance_ctl = NULL;
 	init_waitqueue_head(&fs_info->balance_wait_q);
+	btrfs_init_async_reclaim_work(&fs_info->async_reclaim_work);
 
 	sb->s_blocksize = 4096;
 	sb->s_blocksize_bits = blksize_bits(4096);
@@ -3603,6 +3604,8 @@ int close_ctree(struct btrfs_root *root)
 	/* clear out the rbtree of defraggable inodes */
 	btrfs_cleanup_defrag_inodes(fs_info);
 
+	cancel_work_sync(&fs_info->async_reclaim_work);
+
 	if (!(fs_info->sb->s_flags & MS_RDONLY)) {
 		ret = btrfs_commit_super(root);
 		if (ret)
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 1306487..5a5e156 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4201,6 +4201,104 @@ static int flush_space(struct btrfs_root *root,
 
 	return ret;
 }
+
+static inline u64
+btrfs_calc_reclaim_metadata_size(struct btrfs_root *root,
+				 struct btrfs_space_info *space_info)
+{
+	u64 used;
+	u64 expected;
+	u64 to_reclaim;
+
+	to_reclaim = min_t(u64, num_online_cpus() * 1024 * 1024,
+				16 * 1024 * 1024);
+	spin_lock(&space_info->lock);
+	if (can_overcommit(root, space_info, to_reclaim,
+			   BTRFS_RESERVE_FLUSH_ALL)) {
+		to_reclaim = 0;
+		goto out;
+	}
+
+	used = space_info->bytes_used + space_info->bytes_reserved +
+	       space_info->bytes_pinned + space_info->bytes_readonly +
+	       space_info->bytes_may_use;
+	if (can_overcommit(root, space_info, 1024 * 1024,
+			   BTRFS_RESERVE_FLUSH_ALL))
+		expected = div_factor_fine(space_info->total_bytes, 95);
+	else
+		expected = div_factor_fine(space_info->total_bytes, 90);
+
+	if (used > expected)
+		to_reclaim = used - expected;
+	else
+		to_reclaim = 0;
+	to_reclaim = min(to_reclaim,
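The visible part of btrfs_calc_reclaim_metadata_size() boils down to two pieces of arithmetic, sketched here in plain userspace C. This is a simplified illustration under stated assumptions, not the kernel code: the function names are hypothetical, the two can_overcommit() checks are replaced by a caller-supplied flag, locking is omitted, and the final min() clamp is not reproduced because the posted text is cut off at that point.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_MIB (1024ULL * 1024ULL)

/* num * factor / 100, the percentage helper the patch relies on. */
static uint64_t percent_of(uint64_t num, unsigned factor)
{
	assert(factor <= 100);
	return num * factor / 100;
}

/* Initial batch size: 1 MiB per online CPU, capped at 16 MiB. */
static uint64_t reclaim_batch(unsigned ncpus)
{
	uint64_t want = (uint64_t)ncpus * DEMO_MIB;

	return want < 16 * DEMO_MIB ? want : 16 * DEMO_MIB;
}

/*
 * Bytes by which the used space exceeds the watermark.  The watermark is
 * 95% of the total when another 1 MiB could still be overcommitted
 * (can_overcommit_1m != 0) and 90% otherwise.
 */
static uint64_t bytes_over_watermark(uint64_t used, uint64_t total,
				     int can_overcommit_1m)
{
	uint64_t expected = percent_of(total, can_overcommit_1m ? 95 : 90);

	return used > expected ? used - expected : 0;
}
```

For example, with 960 units used out of 1000, the sketch asks for 10 units when overcommit is still possible and 60 when it is not.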
Re: btrfs issues in 3.14
On Wed, May 07, 2014 at 09:35:06AM -0300, Kenny MacDermid wrote:
> On Tue, May 6, 2014 at 11:22 PM, Liu Bo bo.li@oracle.com wrote:
>> What does sysrq+w say when the hang happens?
>
> The whole system isn't hung, I may have explained that wrong. The
> system will hang if I try to shutdown, and the process will hang if
> I try to kill -9 it.
>
> It looks like the browser is in this state currently so I did an
> 'echo w > /proc/sysrq-trigger' and have attached the full dmesg with
> the browser issues and the output.

Those stacks show that the blocked tasks are waiting for a page's
writeback, but they don't show what blocks the endio processing of that
page.

I'd recommend trying the latest 3.15.0-rc4 or btrfs-next, as many fixes
were merged during this period.

thanks,
-liubo
[PATCH 2/2] btrfs-progs: update man page for btrfs-show-super
Add the '-f' option to the btrfs-show-super manpage. With this option
the sys chunk array and the backup roots info are shown as well.

Signed-off-by: Gui Hecheng guihc.f...@cn.fujitsu.com
---
 Documentation/btrfs-show-super.txt | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-show-super.txt b/Documentation/btrfs-show-super.txt
index e8e17ab..074700f 100644
--- a/Documentation/btrfs-show-super.txt
+++ b/Documentation/btrfs-show-super.txt
@@ -20,8 +20,13 @@ Mainly used for debug purpose.
 
 OPTIONS
 -------
+-f::
+Print full superblock information.
++
+Including the system chunk array and backup roots.
+
 -a::
-Print all the superblock information.
+Print information of all superblocks.
 +
 If this option is given, '-i' option will be ignored.
 
--
1.8.1.4
Re: [PATCH v2 1/3] xfstests/btrfs: add qgroup rescan stress test
On 05/08/2014 04:58 AM, Josef Bacik wrote:
> On 03/09/2014 11:44 PM, Wang Shilong wrote:
>> Test flow is to run fsstress after triggering quota rescan.
>> The rule is simple: we just remove all files and directories,
>> sync the filesystem and see if the qgroup's ref and excl are nodesize.
>>
>> Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
>> ---
>> v1->v2:
>> 	switch into new helper _run_btrfs_util_prog()
>> ---
>>  tests/btrfs/041     | 76 +
>>  tests/btrfs/041.out |  3 +++
>>  tests/btrfs/group   |  1 +
>>  3 files changed, 80 insertions(+)
>>  create mode 100644 tests/btrfs/041
>>  create mode 100644 tests/btrfs/041.out
>>
>> diff --git a/tests/btrfs/041 b/tests/btrfs/041
>> new file mode 100644
>> index 0000000..92bd080
>> --- /dev/null
>> +++ b/tests/btrfs/041
>> @@ -0,0 +1,76 @@
>> +#! /bin/bash
>> +# FSQA Test No. btrfs/041
>> +#
>> +# Quota rescan stress test, we run fsstress and quota rescan concurrently
>> +#
>> +#-----------------------------------------------------------------------
>> +# Copyright (C) 2014 Fujitsu. All rights reserved.
>> +#
>> +# This program is free software; you can redistribute it and/or
>> +# modify it under the terms of the GNU General Public License as
>> +# published by the Free Software Foundation.
>> +#
>> +# This program is distributed in the hope that it would be useful,
>> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
>> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> +# GNU General Public License for more details.
>> +#
>> +# You should have received a copy of the GNU General Public License
>> +# along with this program; if not, write the Free Software Foundation,
>> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
>> +#
>> +#-----------------------------------------------------------------------
>> +#
>> +
>> +seq=`basename $0`
>> +seqres=$RESULT_DIR/$seq
>> +echo "QA output created by $seq"
>> +
>> +here=`pwd`
>> +tmp=/tmp/$$
>> +status=1
>> +
>> +_cleanup()
>> +{
>> +	cd /
>> +	rm -f $tmp.*
>> +}
>> +trap "_cleanup; exit \$status" 0 1 2 3 15
>> +
>> +# get standard environment, filters and checks
>> +. ./common/rc
>> +. ./common/filter
>> +
>> +# real QA test starts here
>> +_need_to_be_root
>> +_supported_fs btrfs
>> +_supported_os Linux
>> +_require_scratch
>> +
>> +rm -f $seqres.full
>> +
>> +run_check _scratch_mkfs -b 1g --nodesize 4096
>> +run_check _scratch_mount
>> +
>
> Add -o nospace_cache here please, otherwise I don't get the same output.

I am a little confused about why we need to specify this mount option
explicitly. As far as I know, the space cache is not included in the
space accounted by qgroups.

Thanks,
Wang

>> +# -w ensures that the only ops are ones which cause write I/O
>> +run_check $FSSTRESS_PROG -d $SCRATCH_MNT -w -p 5 -n 1000 \
>> +	$FSSTRESS_AVOID >&/dev/null
>> +
>> +_run_btrfs_util_prog subvolume snapshot $SCRATCH_MNT \
>> +	$SCRATCH_MNT/snap1 >> $seqres.full 2>&1
>
> _run_btrfs_util_prog will already redirect to $seqres.full, you don't
> need this part.
>
>> +
>> +run_check $FSSTRESS_PROG -d $SCRATCH_MNT/snap1 -w -p 5 -n 1000 \
>> +	$FSSTRESS_AVOID >&/dev/null
>> +
>> +_run_btrfs_util_prog quota enable $SCRATCH_MNT
>> +_run_btrfs_util_prog quota rescan -w $SCRATCH_MNT
>> +
>> +#ignore removing subvolume errors
>> +rm -rf $SCRATCH_MNT/* >& /dev/null
>> +
>> +_run_btrfs_util_prog filesystem sync $SCRATCH_MNT >> $seqres.full 2>&1
>
> Same here.
>
>> +_run_btrfs_util_prog qgroup show $SCRATCH_MNT | $SED_PROG -n '/[0-9]/p' \
>> +	| $AWK_PROG '{print $1 $2 $3 }'
>> +
>
> You can't use _run_btrfs_util_prog here, it will eat the output. You
> need to use $BTRFS_UTIL_PROG instead. Fix these up and resend, this is
> a really important test and I needed it to make sure my qgroups patch
> was right (which it is now.) Thanks,
>
> Josef
Re: [PATCH] btrfs/035: update clone test to expect EOPNOTSUPP
On Wed, May 07, 2014 at 02:33:18PM +0200, David Disseldorp wrote:
> With kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, the first
> clone-range overwrite attempt now fails with EOPNOTSUPP, rather than
> tripping a Btrfs BUG_ON().
>
> This test now trips a new Btrfs bug, in which EIO is returned for
> subsequent reads following the second clone range ioctl.

Hi David,

Something different here, I didn't get EIO on 3.15.0-rc4.

thanks,
-liubo

> Signed-off-by: David Disseldorp dd...@suse.de
> ---
>  tests/btrfs/035     | 11 +++
>  tests/btrfs/035.out |  5 +
>  2 files changed, 16 insertions(+)
>
> diff --git a/tests/btrfs/035 b/tests/btrfs/035
> index 6808179..c9530f6 100755
> --- a/tests/btrfs/035
> +++ b/tests/btrfs/035
> @@ -57,21 +57,32 @@ src_str=aa
>  echo -n $src_str > $SCRATCH_MNT/src
>  
>  $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone1
> +cat $SCRATCH_MNT/src.clone1
> +echo
>  
>  src_str=bbcc
>  
>  echo -n $src_str > $SCRATCH_MNT/src
>  
>  $CLONER_PROG $SCRATCH_MNT/src $SCRATCH_MNT/src.clone2
> +cat $SCRATCH_MNT/src.clone2
> +echo
>  
> +# Prior to kernel commit 00fdf13a2e9f313a044288aa59d3b8ec29ff904a, this clone
> +# resulted in a BUG_ON in __btrfs_drop_extents(). The kernel now returns
> +# EOPNOTSUPP up to userspace.
>  snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone1 | awk '{print $5}'`
>  echo "attempting ioctl (src.clone1 src)"
>  $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
>  	$SCRATCH_MNT/src.clone1 $SCRATCH_MNT/src
> +cat $SCRATCH_MNT/src
> +echo
>  
>  snap_src_sz=`ls -lah $SCRATCH_MNT/src.clone2 | awk '{print $5}'`
>  echo "attempting ioctl (src.clone2 src)"
>  $CLONER_PROG -s 0 -d 0 -l ${snap_src_sz} \
>  	$SCRATCH_MNT/src.clone2 $SCRATCH_MNT/src
> +# BUG: subsequent access attempts currently result in EIO...
> +cat $SCRATCH_MNT/src
>  
>  status=0 ; exit
> diff --git a/tests/btrfs/035.out b/tests/btrfs/035.out
> index f86cadf..0ea2c4f 100644
> --- a/tests/btrfs/035.out
> +++ b/tests/btrfs/035.out
> @@ -1,3 +1,8 @@
>  QA output created by 035
> +aa
> +bbcc
>  attempting ioctl (src.clone1 src)
> +clone failed: Operation not supported
> +bbcc
>  attempting ioctl (src.clone2 src)
> +bbcc
> --
> 1.8.4.5
Re: [PATCH] xfstests: fix flink test
On 5/7/14, 3:54 PM, Josef Bacik wrote:
> I don't have flink support in my xfsprogs, but it doesn't fail with
> "command not found" or whatever, it fails because I don't have the -T
> option. So fix _require_xfs_io_command to check for an invalid option
> and not run. This way I get notrun instead of a failure. Thanks,

This actually doesn't work for me on an old kernel, if that matters; it
fails with:

/mnt/test: Is a directory

and nothing catches that.

Old xfsprogs tries to open the file in question RDWR even before it gets
to the -T option (which would fail, I guess), and you can't do that for
directories.

So I suppose we could explicitly test for that when checking flink:

	[ $command = flink ] && echo $testio | grep -q "Is a directory" && \
		_notrun "xfs_io flink support is missing"

or alternately, first just run xfs_io w/ the command but no file; today,
at least, that works:

[root@bp-05 xfstests]# xfs_io -c flink
command "flink" not found
[root@bp-05 xfstests]# xfs_io -c pread
[root@bp-05 xfstests]#

so could do this before the case statement:

	$XFS_IO_PROG -c $command 2>&1 | grep -q "not found" && \
		_notrun "xfs_io $command support is missing"

but that might be subject to future changes in xfs_io command parsing...

-Eric

> Signed-off-by: Josef Bacik jba...@fb.com
> ---
>  common/rc | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/common/rc b/common/rc
> index 5c13db5..4fa7e63 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -1258,6 +1258,8 @@ _require_xfs_io_command()
>  		_notrun "xfs_io $command support is missing"
>  	echo $testio | grep -q "Operation not supported" && \
>  		_notrun "xfs_io $command failed (old kernel/wrong fs?)"
> +	echo $testio | grep -q "invalid option" && \
> +		_notrun "xfs_io $command support is missing"
>  }
>  
>  # Check that a fs has enough free space (in 1024b blocks)
> --
> 1.8.3.1
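The pattern checks discussed in this thread amount to a substring search over the tool's output. The sketch below restates that idea in C purely for illustration; it is not how xfstests implements _require_xfs_io_command (that is shell), the function name notrun_reason() is made up, and the pattern list simply collects the strings mentioned in the messages above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Return the first known "feature is missing" marker found in the tool
 * output, or NULL if none of them occurs.
 */
static const char *notrun_reason(const char *output)
{
	static const char *const patterns[] = {
		"not found",			/* unknown xfs_io command */
		"Operation not supported",	/* old kernel or wrong fs */
		"invalid option",		/* option such as -T missing */
		"Is a directory",		/* old xfsprogs opened the dir RDWR */
	};

	assert(output != NULL);
	for (size_t i = 0; i < sizeof(patterns) / sizeof(patterns[0]); i++)
		if (strstr(output, patterns[i]) != NULL)
			return patterns[i];
	return NULL;
}
```

As Eric notes for the shell version, any such matching depends on the exact wording of the tool's messages and may break if that wording changes.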
[PATCH] Btrfs: remove OPT_acl parse when acl disabled
Even if CONFIG_BTRFS_FS_POSIX_ACL is not defined, ACLs can still be
enabled with a mount option. Since fs/btrfs/acl.o is not built in that
case, the mount option appears to be supported but is silently ignored.

Signed-off-by: Guangliang Zhao lucienc...@gmail.com
---
 fs/btrfs/super.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 363404b..68ae27c 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -579,9 +579,11 @@ int btrfs_parse_options(struct btrfs_root *root, char *options)
 				goto out;
 			}
 			break;
+#ifdef CONFIG_BTRFS_FS_POSIX_ACL
 		case Opt_acl:
 			root->fs_info->sb->s_flags |= MS_POSIXACL;
 			break;
+#endif
 		case Opt_noacl:
 			root->fs_info->sb->s_flags &= ~MS_POSIXACL;
 			break;
--
1.7.9.5
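The effect of guarding an option handler with a config macro can be seen in a small userspace toy. Everything here is hypothetical: DEMO_HAVE_ACL stands in for CONFIG_BTRFS_FS_POSIX_ACL, DEMO_POSIXACL for MS_POSIXACL, and demo_parse_option() is not the kernel's btrfs_parse_options(). In particular, the toy rejects unknown options with -EINVAL; how the real parser treats an option whose case has been compiled out is not shown in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define DEMO_POSIXACL 0x1u	/* stand-in for MS_POSIXACL */

/*
 * Toy option parser.  Build with -DDEMO_HAVE_ACL to model a kernel with
 * ACL support; without it the "acl" branch is not compiled at all.
 */
static int demo_parse_option(const char *opt, unsigned *flags)
{
	assert(opt != NULL && flags != NULL);

#ifdef DEMO_HAVE_ACL
	if (strcmp(opt, "acl") == 0) {
		*flags |= DEMO_POSIXACL;
		return 0;
	}
#endif
	if (strcmp(opt, "noacl") == 0) {
		*flags &= ~DEMO_POSIXACL;
		return 0;
	}
	return -EINVAL;		/* toy behaviour for anything unrecognized */
}
```

Built without DEMO_HAVE_ACL, "acl" can no longer set the flag, while "noacl" keeps working in both configurations, which mirrors the placement of the #ifdef in the patch.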