Re: Another ENOSPC situation
On Fri, Apr 1, 2016 at 10:55 PM, Duncan <1i5t5.dun...@cox.net> wrote:
> Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

>> [4/502]mh@swivel:~$ sudo btrfs fi usage /
>> Overall:
>>     Device size:         600.00GiB
>>     Device allocated:    600.00GiB
>>     Device unallocated:    1.00MiB
>
> That's the problem right there.  The admin didn't do his job and spot the
> near full allocation issue

I don't yet agree this is an admin problem. This is the 2nd or 3rd
case we've seen only recently where there's plenty of space in all
chunk types and yet ENOSPC happens, seemingly only because there's no
unallocated space remaining. I don't know that this is a regression
for sure, but it sure seems like one.

>>
>> Data,single: Size:553.93GiB, Used:405.73GiB
>>    /dev/mapper/swivelbtr 553.93GiB
>>
>> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>>    /dev/mapper/swivelbtr  46.00GiB
>>
>> System,DUP: Size:32.00MiB, Used:112.00KiB
>>    /dev/mapper/swivelbtr  64.00MiB
>>
>> Unallocated:
>>    /dev/mapper/swivelbtr   1.00MiB
>> [5/503]mh@swivel:~$
>
> Both data and metadata have several GiB free, data ~140 GiB free, and
> metadata isn't into global reserve, so the system isn't totally wedged,
> only partially, due to the lack of unallocated space.

Unallocated space alone hasn't ever caused this that I can remember.
It's most often been totally full metadata chunks, with free space in
allocated data chunks, with no unallocated space out of which to
create another metadata chunk to write out changes.

There should be plenty of space for either a -dusage=1 or -musage=1
balance to free up a bunch of partially allocated chunks. Offhand I
don't think the profiles filter is helpful in this case.

OK so where I could be wrong is that I'm expecting balance doesn't
require unallocated space to work. I'd expect that it can COW extents
from one chunk into another existing chunk (of the same type) and then
once that's successful, free up that chunk, i.e. revert it back to
unallocated. If balance can only copy into newly allocated chunks,
that seems like a big problem. I thought that problem had been fixed a
very long time ago.

And what we don't see from 'usage' that we will see from 'df' is the
GlobalReserve values. I'd like to see that.

Anyway, in the meantime there is a workaround:

btrfs dev add

Just add a device, even if it's an 8GiB flash drive. But it can be
spare space on a partition, or it can be a logical volume, or whatever
you want. That'll add some gigs of unallocated space. Now the balance
will work, or for absolutely sure there's a bug (and a new one because
this has always worked in the past). After whatever filtered or full
balance is done, make sure to 'btrfs dev rem' and confirm it's gone
with 'btrfs fi show' before removing the device. It's a two device
volume until that device is successfully removed and is in something
of a fragile state until then because any loss of data on that 2nd
device has a good chance of face planting the file system.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
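The distinction argued over in this thread, free space inside allocated chunks versus unallocated space, can be put into numbers. The following is only a sketch in C: it works on figures already read from `btrfs fi usage` and converted to MiB by the caller, and `struct chunk_usage`, `slack_mib()`, `allocation_nearly_exhausted()` and the threshold argument are hypothetical names for illustration, not part of btrfs-progs or the kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Figures as printed by `btrfs fi usage`, converted to MiB by the caller. */
struct chunk_usage {
	uint64_t size_mib;	/* space allocated to chunks of this type */
	uint64_t used_mib;	/* space actually used inside those chunks */
};

/* Free space that is locked inside already-allocated chunks. */
uint64_t slack_mib(const struct chunk_usage *u)
{
	return u->size_mib > u->used_mib ? u->size_mib - u->used_mib : 0;
}

/*
 * Hypothetical alert rule (an assumption, not a btrfs interface): flag the
 * filesystem when unallocated space has dropped below min_unalloc_mib while
 * the data chunks still hold slack that a filtered balance could hand back.
 */
bool allocation_nearly_exhausted(uint64_t unallocated_mib,
				 const struct chunk_usage *data,
				 uint64_t min_unalloc_mib)
{
	return unallocated_mib < min_unalloc_mib && slack_mib(data) > 0;
}
```

With the data figures quoted above (553.93GiB allocated, 405.73GiB used, 1.00MiB unallocated) the slack works out to roughly 148GiB, which matches the "Free (estimated)" value in the original report, and any threshold larger than 1MiB would have raised the flag.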
Re: [PATCH 10/13] btrfs: introduce helper functions to perform hot replace
Hi Anand,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160401]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Anand-Jain/Introduce-device-state-failed-Hot-spare-and-Auto-replace/20160402-093528
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-rhel (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

All errors (new ones prefixed by >>):

   fs/btrfs/dev-replace.c: In function 'btrfs_auto_replace_start':
   fs/btrfs/dev-replace.c:981:38: warning: passing argument 2 of 'btrfs_dev_replace_start' from incompatible pointer type
     ret = btrfs_dev_replace_start(root, tgt_path,
                                         ^
   fs/btrfs/dev-replace.c:308:5: note: expected 'struct btrfs_ioctl_dev_replace_args *' but argument is of type 'char *'
    int btrfs_dev_replace_start(struct btrfs_root *root,
        ^
>> fs/btrfs/dev-replace.c:981:8: error: too many arguments to function 'btrfs_dev_replace_start'
     ret = btrfs_dev_replace_start(root, tgt_path,
           ^
   fs/btrfs/dev-replace.c:308:5: note: declared here
    int btrfs_dev_replace_start(struct btrfs_root *root,
        ^

vim +/btrfs_dev_replace_start +981 fs/btrfs/dev-replace.c

   975		src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
   976		rcu_read_unlock();
   977		if (!src_path) {
   978			kfree(tgt_path);
   979			return -ENOMEM;
   980		}
 > 981		ret = btrfs_dev_replace_start(root, tgt_path,
   982				src_device->devid, src_path,
   983			BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
   984		if (ret)

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Re: Another ENOSPC situation
Marc Haber posted on Fri, 01 Apr 2016 15:40:29 +0200 as excerpted:

> Hi,
>
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
>
> Balance immediately fails with ENOSPC
>
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
>
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
>
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
>     Device size:         600.00GiB
>     Device allocated:    600.00GiB
>     Device unallocated:    1.00MiB

That's the problem right there.  The admin didn't do his job and spot the
near full allocation issue (perhaps with the help of some script set to
run periodically and tell him about it) before it got critical, and now
there's no room left to balance, to fix the problem.  This despite the
fact that the admin chose to run a not yet entirely stable filesystem
that's well known to run off the rails in precisely this sort of way,
occasionally, with specific use-cases such as heavy snapshotting more
often than others.

>     Device missing:          0.00B
>     Used:                413.40GiB
>     Free (estimated):    148.20GiB (min: 148.20GiB)

Tho the used vs. free isn't all that bad... it's just that the allocated
vs. unallocated was allowed to run off the rails and get the filesystem
in a bind.  But that does mean it should be possible to do something
about it. =:^)

>     Data ratio:               1.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
> Data,single: Size:553.93GiB, Used:405.73GiB
>    /dev/mapper/swivelbtr 553.93GiB
>
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>    /dev/mapper/swivelbtr  46.00GiB
>
> System,DUP: Size:32.00MiB, Used:112.00KiB
>    /dev/mapper/swivelbtr  64.00MiB
>
> Unallocated:
>    /dev/mapper/swivelbtr   1.00MiB
> [5/503]mh@swivel:~$

Both data and metadata have several GiB free, data ~140 GiB free, and
metadata isn't into global reserve, so the system isn't totally wedged,
only partially, due to the lack of unallocated space.

> btrfs balance -mprofiles seems to do something. one kworked and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leads to the host
> being nearly unuseable up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".
>
> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.

It's worth noting as an aside that Linux isn't necessarily tuned for
interactivity by default, tho there are definitely ways to make it more
so.

Additionally, on some mobos at least, it's possible to tweak the BIOS
balance between interactivity and thruput.  An old Tyan board (PCI not
the newer PCIE, which avoids some of the problems with multiple dedicated
buses) I had was tilted a bit heavily toward thruput, which did make
sense as it was actually a server board, until I tweaked things a bit.
That made a LOT of difference, curing the dragging, but also curing
occasional audio runouts, etc.  Turns out it was simply tuned to do huge
bus "packets" (I forgot the proper in-context term, and that board died a
few years ago, so...), increasing thruput, but also increasing latency
beyond what the sound card and keyboard/mouse (or in that case the human
operating them) could reasonably deal with.
By shortening the PCI "packet length", it reduced thruput a bit but greatly improved latency, letting other users have their turn when they needed it, not some time later. Of course in addition to PCIE putting many of those things on dedicated buses these days, ssds are so much faster that a lot of things that could potentially be problems on spinning rust, simply don't tend to be issues on ssds. As much as anything, I think that's what a lot of users bothered by such problems are turning to, and I'd bet that's a good part of why SSDs are as popular as they are, as well. I know I've simply not had many of the problems here that others had, and while I think part of it is the multiple relatively small but independent filesystems and part of it may be because I don't use snapshotting, I also think a major part of it is simply that the SSDs I'm running btrfs on are simply so much faster than spinning rust that the problems either don't occur, or if they do, they're done before I even notice them. FWIW, I do still use spinning rust, but for my media partition and (second) backups, not for anything speed critical at all. And FWIW, I still use reiserfs on
Re: Global hotspare functionality
On 04/02/2016 09:33 AM, Yauhen Kharuzhy wrote:
> On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:
>>
>> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
>>> On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
>>>>
>>>> Hi Yauhen,
>>>>
>>>>> Issue 2.
>>>>> At start of autoreplacig drive by hotspare, kernel craches in transaction
>>>>> handling code (inside of btrfs_commit_transaction() called by autoreplace
>>>>> initiating routines). I 'fixed' this by removing of closing of bdev in
>>>>> btrfs_close_one_device_dont_free(), see
>>>>> https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
>>>>> (oops text is attached also). Bdev is closed after replacing by
>>>>> btrfs_dev_replace_finishing(), so this is safe but doesn't seem
>>>>> to be right way.
>>>>
>>>> I have sent out V2. I don't see that issue with this,
>>>> could you pls try ?
>>>
>>> Yes, it reproduced on v4.4.5 kernel. I will try with current
>>> 'for-linus-4.6' Chris' tree soon.
>>>
>>> To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
>>> can be freed by kernel after releasing of all references to it.
>>
>> So far the raid group profile would adapt to lower suitable
>> group profile when device is missing/failed. This appears to
>> be not happening with RAID56 OR there are stale IO which wasn't
>> flushed out. Anyway to have this fixed I am moving the patch
>>    btrfs: introduce device dynamic state transition to offline or failed
>> to the top in v3 for any potential changes.
>> But firstly we need a reliable test case, or a very carefully
>> crafted test case which can create this situation
>>
>> Here below is the dm-error that I am using for testing, which
>> apparently doesn't report this issue. Could you please try on V3. ?
>> (pls note the device names are hard coded in the test script
>> sorry about that) This would eventually be fstests script.
>
> Sure. But I don't see any V3 patches in the list. Are you still preparing
> to send them or I missed something?

It's out now. There was a little distraction when I was
about to send it.
Re: Global hotspare functionality
On Sat, Apr 02, 2016 at 09:15:56AM +0800, Anand Jain wrote:
>
>
> On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote:
> >On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote:
> >>
> >>Hi Yauhen,
> >>
> >
> >>>
> >>>Issue 2.
> >>>At start of autoreplacig drive by hotspare, kernel craches in transaction
> >>>handling code (inside of btrfs_commit_transaction() called by autoreplace
> >>>initiating
> >>>routines). I 'fixed' this by removing of closing of bdev in
> >>>btrfs_close_one_device_dont_free(), see
> >>>https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master
> >>>(oops text is attached also). Bdev is closed after replacing by
> >>>btrfs_dev_replace_finishing(), so this is safe but doesn't seem
> >>>to be right way.
> >>
> >> I have sent out V2. I don't see that issue with this,
> >> could you pls try ?
> >
> >Yes, it reproduced on v4.4.5 kernel. I will try with current
> >'for-linus-4.6' Chris' tree soon.
> >
> >To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev
> >can be freed by kernel after releasing of all references to it.
>
>  So far the raid group profile would adapt to lower suitable
>  group profile when device is missing/failed. This appears to
>  be not happening with RAID56 OR there are stale IO which wasn't
>  flushed out. Anyway to have this fixed I am moving the patch
>     btrfs: introduce device dynamic state transition to offline or failed
>  to the top in v3 for any potential changes.
>  But firstly we need a reliable test case, or a very carefully
>  crafted test case which can create this situation
>
>  Here below is the dm-error that I am using for testing, which
>  apparently doesn't report this issue. Could you please try on V3. ?
>  (pls note the device names are hard coded in the test script
>  sorry about that) This would eventually be fstests script.

Sure. But I don't see any V3 patches in the list. Are you still preparing
to send them or I missed something?
-- 
Yauhen Kharuzhy
[PATCH 00/13 v3] Introduce device state 'failed', Hot spare and Auto replace
Thanks for various comments, tests and feedback.

Background:

Hot spare and Auto replace:
 Hot spare is predominately used to mitigate or narrow the time
 window of a degraded mode, during which any further disk failure
 might lead to a catastrophic data loss.
 Data center storage generally will have a couple of disks reserved as
 spares on the storage, so that they will automatically kick in to
 resilver the storage pool so that the pool is back to a healthy state.
 Mainly this is a storage feature rather than a FS feature. I believe
 people acquainted with enterprise storage use cases will appreciate the
 need for it, and so most/all enterprise storage has a hot spare feature.

Btrfs device states:
 This patch-set adds the 'failed' state and makes provision to use the
 'offline' state as two new device states. So to summarize the various
 device states and their meanings:

	/* missing: device wasn't found at the time of mount */
	int missing;

	/*
	 * failed: device confirmed to have experienced critical
	 * io failure
	 */
	int failed;

	/*
	 * offline: When there is no confirmation that a disk has
	 * failed. But an interim communication breakdown
	 * and not necessarily a candidate for the device replace.
	 * Device might be online after user intervention or after
	 * block transport layer error recovery.
	 */
	int offline;

Device state transition tuning and visualization:
 Sysfs interfaces are planned to provide the required tuning for
 device state transition, sensitivities and visualization of device
 states. However the sysfs framework which could provide such an
 interface is being reviewed/tested and is not yet ready.
 So for the testing and debugging of these features I have used an
 updated version of the procfs patch which is in the ML:
   [PATCH] btrfs: debug: procfs-devlist: introduce procfs interface for the device list for debugging
 I find the above patch very useful, easy to use (as compared to
 sysfs to visualize the device state) and stable. This patch set
 does not depend on any of the sysfs patches as such.

Backward compatibility:
 Adds a new incompatibility feature flag
 (BTRFS_FEATURE_INCOMPAT_SPARE_DEV) to manage the spare device when
 older kernels are used. It is tested to work fine with older
 kernel/progs versions.

Auto replace:
 Replace happens automatically, that is when any write or flush
 fails, the device will be marked as failed, which will stop any
 further IO attempt to that device. And in the next commit cycle the
 auto replace will pick the spare device to replace the failed device.
 And so the btrfs volume is back to a healthy state.

Per FSID spare vs Global spare:
 As of now only the global hot spare is supported, that is hot spare(s)
 are for all the btrfs FS in the system. However in future there will
 be a fs_info->no_auto_replace tunable which can be tuned by the user
 to limit the use of the global spare.

Example use case:
 Here below is an example use case of the hot spare setup.

 Add a spare device:
    btrfs spare add /dev/sde -f

 If a spare device has already been added, just run
    btrfs dev scan [/dev/sde]
 which will register the spare device with the kernel.

    btrfs fi show
    Label: none  uuid: 52f170c1-725c-457d-8cfd-d57090460091
	Total devices 2 FS bytes used 112.00KiB
	devid    1 size 2.00GiB used 417.50MiB path /dev/sdc
	devid    2 size 2.00GiB used 417.50MiB path /dev/sdd

    Global spare
	device size 3.00GiB path /dev/sde

Patches:
Kernel:
 First, it needs Qu's per chunk missing device patchset, which
 is part of the set.
 Next, patches 6-9 add support for the spare device. For a kernel
 without the spare feature the spare device is kept away. And when the
 kernel supports the spare device, it will prevent it from being
 mounted. Further this patch set provides helper functions to pick a
 spare device and release a spare device back to the spare device pool.

 Patch 10 provides a helper function to auto replace.
 Patch 11 provides a helper function to bring a device to the failed
 state.
 Patch 12 marks a device as failed based on flush and write errors,
 and avoids any further IO to it.
 Lastly, patch 13 triggers auto replace.

Progs:
 Needs the below 4 patches which will add the sub cli 'spare' to manage
 the spare device. As of now deleting a spare device has to be
 managed using wipefs. However in the long run we would want a proper
 btrfs command to do that job.

V2->V3:
Kernel:
 Thanks to Yauhen and Austin for the review comments.
 Again split Patch 11 and 12 which were merged in V2, for the better.
 Patch numbers are reordered (sorry about that) but for the better.
 Fix rcu issue in btrfs_get_spare_device(), we don't need rcu as it's
 under uuid_mutex
 Fix rcu issue and check for replace lock at btrfs_auto_replace_start()
 Cleanup old: casualty_kthread() new: health_kthread() with
 changes as per 838fe188 'btrfs: cleaner_kthread() doesn't need
 explicit freeze' (thanks
[PATCH 10/13] btrfs: introduce helper functions to perform hot replace
Hot replace / auto replace is an important volume manager feature
and is critical to data center operations, so that the degraded
volume can be brought back to a healthy state at the earliest and
without manual intervention.

This modifies the existing replace code to suit the needs of auto
replace; in the long run I hope both the codes will be merged.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/dev-replace.c | 43 +++
 fs/btrfs/dev-replace.h |  1 +
 2 files changed, 44 insertions(+)

diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c
index 2b926867d136..ceab4c51db32 100644
--- a/fs/btrfs/dev-replace.c
+++ b/fs/btrfs/dev-replace.c
@@ -957,3 +957,46 @@ void btrfs_bio_counter_inc_blocked(struct btrfs_fs_info *fs_info)
 				     &fs_info->fs_state));
 	}
 }
+
+int btrfs_auto_replace_start(struct btrfs_root *root,
+				struct btrfs_device *src_device)
+{
+	int ret;
+	char *tgt_path;
+	char *src_path;
+	struct btrfs_fs_info *fs_info = root->fs_info;
+
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	btrfs_dev_replace_lock(&fs_info->dev_replace, 0);
+	if (btrfs_dev_replace_is_ongoing(&fs_info->dev_replace)) {
+		btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+		return -EBUSY;
+	}
+	btrfs_dev_replace_unlock(&fs_info->dev_replace, 0);
+
+	if (btrfs_get_spare_device(&tgt_path)) {
+		btrfs_err(root->fs_info,
+			"No spare device found/configured in the kernel");
+		return -EINVAL;
+	}
+
+	rcu_read_lock();
+	src_path = kstrdup(rcu_str_deref(src_device->name), GFP_ATOMIC);
+	rcu_read_unlock();
+	if (!src_path) {
+		kfree(tgt_path);
+		return -ENOMEM;
+	}
+	ret = btrfs_dev_replace_start(root, tgt_path,
+			src_device->devid, src_path,
+		BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID);
+	if (ret)
+		btrfs_put_spare_device(tgt_path);
+
+	kfree(tgt_path);
+	kfree(src_path);
+
+	return 0;
+}
diff --git a/fs/btrfs/dev-replace.h b/fs/btrfs/dev-replace.h
index e922b42d91df..b918b9d6e5df 100644
--- a/fs/btrfs/dev-replace.h
+++ b/fs/btrfs/dev-replace.h
@@ -46,4 +46,5 @@ static inline void btrfs_dev_replace_stats_inc(atomic64_t *stat_value)
 {
 	atomic64_inc(stat_value);
 }
+int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device);
 #endif
-- 
2.7.0
[PATCH 02/13] btrfs: Do per-chunk check for mount time check
From: Qu Wenruo

Now use btrfs_check_degradable() to do the mount time degraded check.

With this patch, we can now mount in the following case:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdc
 # mount /dev/sdb /mnt/btrfs -o degraded
As the single data chunk is only on sdb, it's OK to mount as
degraded, since missing one device is OK for RAID1.

But it still fails in the following case, as expected:
 # mkfs.btrfs -f -m raid1 -d single /dev/sdb /dev/sdc
 # wipefs -a /dev/sdb
 # mount /dev/sdc /mnt/btrfs -o degraded
As the data chunk is only on sdb, it's not OK to mount it as
degraded.

Reported-by: Zhao Lei
Reported-by: Anand Jain
Signed-off-by: Qu Wenruo
[Btrfs: use btrfs_error instead of btrfs_err during mount]
Signed-off-by: Anand Jain
---
 fs/btrfs/disk-io.c | 18 ++
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index c95e3ce9f22e..bfea0f8f6a87 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2880,6 +2880,16 @@ int open_ctree(struct super_block *sb,
 		goto fail_tree_roots;
 	}
 
+	ret = btrfs_check_degradable(fs_info, fs_info->sb->s_flags);
+	if (ret < 0) {
+		btrfs_err(fs_info, "degraded writable mount failed %d", ret);
+		goto fail_tree_roots;
+	} else if (ret > 0 && !btrfs_test_opt(chunk_root, DEGRADED)) {
+		btrfs_warn(fs_info,
+			"Some device missing, but still degraded mountable, please mount with -o degraded option");
+		ret = -EACCES;
+		goto fail_tree_roots;
+	}
 	/*
 	 * keep the device that is marked to be the target device for the
 	 * dev_replace procedure
@@ -2983,14 +2993,6 @@ retry_root_backup:
 	}
 	fs_info->num_tolerated_disk_barrier_failures =
 		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
-	if (fs_info->fs_devices->missing_devices >
-	     fs_info->num_tolerated_disk_barrier_failures &&
-	    !(sb->s_flags & MS_RDONLY)) {
-		pr_warn("BTRFS: missing devices(%llu) exceeds the limit(%d), writeable mount is not allowed\n",
-			fs_info->fs_devices->missing_devices,
-			fs_info->num_tolerated_disk_barrier_failures);
-		goto fail_sysfs;
-	}
 
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
-- 
2.7.0
[PATCH 05/13] btrfs: Cleanup num_tolerated_disk_barrier_failures
From: Qu Wenruo

As we use the per-chunk degradable check, the global
num_tolerated_disk_barrier_failures is now of no use.

So clean it up.

Signed-off-by: Qu Wenruo
[Btrfs: resolve conflict to apply 'btrfs: Cleanup num_tolerated_disk_barrier_failures']
Signed-off-by: Anand Jain
---
 fs/btrfs/ctree.h   |  2 --
 fs/btrfs/disk-io.c | 56 --
 fs/btrfs/disk-io.h |  2 --
 fs/btrfs/volumes.c | 17 -
 4 files changed, 77 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b3384a..e0a50f478e01 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1829,8 +1829,6 @@ struct btrfs_fs_info {
 	/* next backup root to be overwritten */
 	int backup_root_index;
 
-	int num_tolerated_disk_barrier_failures;
-
 	/* device replace state */
 	struct btrfs_dev_replace dev_replace;
 
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 85e26d62c089..7f02f1766037 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2991,8 +2991,6 @@ retry_root_backup:
 		printk(KERN_ERR "BTRFS: Failed to read block groups: %d\n", ret);
 		goto fail_sysfs;
 	}
-	fs_info->num_tolerated_disk_barrier_failures =
-		btrfs_calc_num_tolerated_disk_barrier_failures(fs_info);
 
 	fs_info->cleaner_kthread = kthread_run(cleaner_kthread, tree_root,
 					       "btrfs-cleaner");
@@ -3559,60 +3557,6 @@ int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags)
 	return min_tolerated;
 }
 
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-	struct btrfs_fs_info *fs_info)
-{
-	struct btrfs_ioctl_space_info space;
-	struct btrfs_space_info *sinfo;
-	u64 types[] = {BTRFS_BLOCK_GROUP_DATA,
-		       BTRFS_BLOCK_GROUP_SYSTEM,
-		       BTRFS_BLOCK_GROUP_METADATA,
-		       BTRFS_BLOCK_GROUP_DATA | BTRFS_BLOCK_GROUP_METADATA};
-	int i;
-	int c;
-	int num_tolerated_disk_barrier_failures =
-		(int)fs_info->fs_devices->num_devices;
-
-	for (i = 0; i < ARRAY_SIZE(types); i++) {
-		struct btrfs_space_info *tmp;
-
-		sinfo = NULL;
-		rcu_read_lock();
-		list_for_each_entry_rcu(tmp, &fs_info->space_info, list) {
-			if (tmp->flags == types[i]) {
-				sinfo = tmp;
-				break;
-			}
-		}
-		rcu_read_unlock();
-
-		if (!sinfo)
-			continue;
-
-		down_read(&sinfo->groups_sem);
-		for (c = 0; c < BTRFS_NR_RAID_TYPES; c++) {
-			u64 flags;
-
-			if (list_empty(&sinfo->block_groups[c]))
-				continue;
-
-			btrfs_get_block_group_info(&sinfo->block_groups[c],
-						   &space);
-			if (space.total_bytes == 0 || space.used_bytes == 0)
-				continue;
-			flags = space.flags;
-
-			num_tolerated_disk_barrier_failures = min(
-				num_tolerated_disk_barrier_failures,
-				btrfs_get_num_tolerated_disk_barrier_failures(
-					flags));
-		}
-		up_read(&sinfo->groups_sem);
-	}
-
-	return num_tolerated_disk_barrier_failures;
-}
-
 static int write_all_supers(struct btrfs_root *root, int max_mirrors)
 {
 	struct list_head *head;
diff --git a/fs/btrfs/disk-io.h b/fs/btrfs/disk-io.h
index 8e79d0070bcf..dd155621f95f 100644
--- a/fs/btrfs/disk-io.h
+++ b/fs/btrfs/disk-io.h
@@ -141,8 +141,6 @@ struct btrfs_root *btrfs_create_tree(struct btrfs_trans_handle *trans,
 int btree_lock_page_hook(struct page *page, void *data,
 				void (*flush_fn)(void *));
 int btrfs_get_num_tolerated_disk_barrier_failures(u64 flags);
-int btrfs_calc_num_tolerated_disk_barrier_failures(
-	struct btrfs_fs_info *fs_info);
 int __init btrfs_end_io_wq_init(void);
 void btrfs_end_io_wq_exit(void);
 
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a840d78ba127..dff2deaf88d3 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1876,9 +1876,6 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 		free_fs_devices(cur_devices);
 	}
 
-	root->fs_info->num_tolerated_disk_barrier_failures =
-		btrfs_calc_num_tolerated_disk_barrier_failures(root->fs_info);
-
 	/*
 	 * at this point, the device is zero sized.  We want to
 	 * remove it from the devices list and zero out the old super
@@ -2405,8 +2402,6 @@ int btrfs_init_new_device(struct btrfs_root *root, char *device_path)
[PATCH 11/13] btrfs: introduce device dynamic state transition to offline or failed
This patch provides helper functions to force a device to offline
or failed, and we need these device states for the following reasons:
1) a. it can be reported that a device has failed when it does
   b. close the device when it goes offline so that the block layer can
      clean up
2) identify the candidate for the auto replace
3) avoid further commit errors reported against the failing device and
4) a device in a multi device btrfs may go offline from the system
   (but as of now in some system configs btrfs gets unmounted in this
   context, which is not a correct behavior)

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/volumes.c | 137 +
 fs/btrfs/volumes.h |  13 +
 2 files changed, 150 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 072cefac958c..eb9f28504d3f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7149,3 +7149,140 @@ out:
 	read_unlock(&map_tree->map_tree.lock);
 	return ret;
 }
+
+static void __close_device(struct work_struct *work)
+{
+	struct btrfs_device *device;
+
+	device = container_of(work, struct btrfs_device, rcu_work);
+
+	if (device->bdev)
+		blkdev_put(device->bdev, device->mode);
+
+	device->bdev = NULL;
+}
+
+static void close_device(struct rcu_head *head)
+{
+	struct btrfs_device *device;
+
+	device = container_of(head, struct btrfs_device, rcu);
+
+	INIT_WORK(&device->rcu_work, __close_device);
+	schedule_work(&device->rcu_work);
+}
+
+void btrfs_close_one_device_dont_free(struct btrfs_device *device)
+{
+	struct btrfs_fs_devices *fs_devices = device->fs_devices;
+
+	if (device->bdev)
+		fs_devices->open_devices--;
+
+	if (device->writeable &&
+	    device->devid != BTRFS_DEV_REPLACE_DEVID) {
+		list_del_init(&device->dev_alloc_list);
+		fs_devices->rw_devices--;
+	}
+
+	device->writeable = 0;
+
+	call_rcu(&device->rcu, close_device);
+}
+
+void force_device_close(struct btrfs_device *device)
+{
+	struct btrfs_device *next_device;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = device->fs_devices;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	lock_chunks(fs_devices->fs_info->fs_root);
+
+	next_device = list_entry(fs_devices->devices.next,
+				 struct btrfs_device, dev_list);
+	if (device->bdev == fs_devices->fs_info->sb->s_bdev)
+		fs_devices->fs_info->sb->s_bdev = next_device->bdev;
+
+	if (device->bdev == fs_devices->latest_bdev)
+		fs_devices->latest_bdev = next_device->bdev;
+
+	btrfs_close_one_device_dont_free(device);
+
+	/*
+	 * TODO: works for now, but its better to keep the state of
+	 * missing and offline different, and update rest of the
+	 * places where we check for only missing and not for failed
+	 * or offline as of now.
+	 */
+	device->missing = 1;
+	fs_devices->missing_devices++;
+	device->writeable = 0;
+
+	rcu_barrier();
+
+	unlock_chunks(fs_devices->fs_info->fs_root);
+	mutex_unlock(&fs_devices->device_list_mutex);
+}
+
+void btrfs_enforce_device_state(struct btrfs_device *dev, char *why)
+{
+	bool degrade_option;
+	int tolerated_fail;
+	struct btrfs_fs_info *fs_info;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = dev->fs_devices;
+	fs_info = fs_devices->fs_info;
+	degrade_option = btrfs_test_opt(fs_info->fs_root, DEGRADED);
+
+	/* todo: support seed later */
+	if (fs_devices->seeding)
+		return;
+
+	/* this shouldn't be called if device is already missing */
+	if (dev->missing || !dev->bdev)
+		return;
+
+	if (dev->offline || dev->failed)
+		return;
+
+	/* Only RW device is requested to force close let FS handle it*/
+	if (fs_devices->rw_devices == 1) {
+		btrfs_std_error(fs_info, -EIO,
+			"force offline last RW device");
+		return;
+	}
+
+	if (!strcmp(why, "offline"))
+		dev->offline = 1;
+	else if (!strcmp(why, "failed"))
+		dev->failed = 1;
+	else
+		return;
+
+	btrfs_sysfs_rm_device_link(fs_devices, dev);
+
+	force_device_close(dev);
+
+	tolerated_fail = btrfs_check_degradable(fs_info,
+						fs_info->sb->s_flags);
+	if (tolerated_fail > 0) {
+		btrfs_warn_in_rcu(fs_info, "device %s %s, chunks degraded",
+			rcu_str_deref(dev->name), why);
+	} else if(tolerated_fail < 0) {
+		btrfs_warn_in_rcu(fs_info,
+			"device %s %s, chunks failed",
+			rcu_str_deref(dev->name), why);
[PATCH 03/13] btrfs: Do per-chunk degraded check for remount
From: Qu Wenruo

Just the same as for the mount time check, use the new
btrfs_check_degradable() to do the per chunk check.

Signed-off-by: Qu Wenruo
Btrfs: use btrfs_error instead of btrfs_err during remount
Signed-off-by: Anand Jain
---
 fs/btrfs/super.c | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 00b8f37cc306..87639fa53b10 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1767,11 +1767,14 @@ static int btrfs_remount(struct super_block *sb, int *flags, char *data)
 			goto restore;
 		}
 
-		if (fs_info->fs_devices->missing_devices >
-		     fs_info->num_tolerated_disk_barrier_failures &&
-		    !(*flags & MS_RDONLY)) {
+		ret = btrfs_check_degradable(fs_info, *flags);
+		if (ret < 0) {
+			btrfs_err(fs_info,
+				"degraded writable remount failed %d", ret);
+			goto restore;
+		} else if (ret > 0 && !btrfs_test_opt(root, DEGRADED)) {
 			btrfs_warn(fs_info,
-				"too many missing devices, writeable remount is not allowed");
+				"some device missing, but still degraded mountable, please remount with -o degraded option");
 			ret = -EACCES;
 			goto restore;
 		}
-- 
2.7.0
[PATCH 01/13] btrfs: Introduce a new function to check if all chunks are OK for degraded mount
From: Qu Wenruo

Introduce a new function, btrfs_check_degradable(), to judge if all chunks
in btrfs are OK for degraded mount.

It provides the new basis for accurate btrfs mount/remount and even runtime
degraded mount checks, replacing the old one-size-fits-all method.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/volumes.c | 63 ++
 fs/btrfs/volumes.h |  1 +
 2 files changed, 64 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index e2b54d546b7c..dd3dc53a302a 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7042,3 +7042,66 @@ static void btrfs_close_one_device(struct btrfs_device *device)
 
 	call_rcu(&device->rcu, free_device);
 }
+
+/*
+ * Check if all chunks in the fs is OK for degraded mount
+ * Caller itself should do extra check if DEGRADED mount option is given
+ * for >0 return value.
+ *
+ * Return 0 if all chunks are OK.
+ * Return >0 if all chunks are degradable but not all OK.
+ * Return <0 if any chunk is not degradable or other bug.
+ */
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
+{
+	struct btrfs_mapping_tree *map_tree = &fs_info->mapping_tree;
+	struct extent_map *em;
+	u64 next_start = 0;
+	int ret = 0;
+
+	if (flags & MS_RDONLY)
+		return 0;
+
+	read_lock(&map_tree->map_tree.lock);
+	em = lookup_extent_mapping(&map_tree->map_tree, 0, (u64)(-1));
+	/* No any chunk? Should be a huge bug */
+	if (!em) {
+		ret = -ENOENT;
+		goto out;
+	}
+
+	while (em) {
+		struct map_lookup *map;
+		int missing = 0;
+		int max_tolerated;
+		int i;
+
+		map = (struct map_lookup *) em->bdev;
+		max_tolerated =
+			btrfs_get_num_tolerated_disk_barrier_failures(
+					map->type);
+		for (i = 0; i < map->num_stripes; i++) {
+			if (map->stripes[i].dev->missing)
+				missing++;
+		}
+		if (missing > max_tolerated) {
+			ret = -EIO;
+			btrfs_warn(fs_info,
+	"missing devices(%d) exceeds the limit(%d), writebale mount is not allowed",
+				missing, max_tolerated);
+			goto out;
+		} else if (missing)
+			ret = 1;
+		next_start = extent_map_end(em);
+
+		/*
+		 * Alwasy search range [next_start, (u64)-1) to find the next
+		 * chunk map
+		 */
+		em = lookup_extent_mapping(&map_tree->map_tree, next_start,
+					   (u64)(-1) - next_start);
+	}
+out:
+	read_unlock(&map_tree->map_tree.lock);
+	return ret;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 1939ebde63df..351431a3f5aa 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -566,5 +566,6 @@ static inline void unlock_chunks(struct btrfs_root *root)
 struct list_head *btrfs_get_fs_uuids(void);
 void btrfs_set_fs_info_ptr(struct btrfs_fs_info *fs_info);
 void btrfs_reset_fs_info_ptr(struct btrfs_fs_info *fs_info);
+int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags);
 
 #endif
-- 
2.7.0
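To make the difference from the old filesystem-wide check concrete, here is a rough userspace model of the per-chunk walk. The struct and function names (sketch_chunk, sketch_check_degradable) are invented for this sketch, and a plain array stands in for the extent map tree; only the counting logic and the return convention follow the patch.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the kernel's chunk map structures. */
struct sketch_chunk {
	int num_stripes;
	const int *stripe_dev;   /* device index for each stripe */
	int max_tolerated;       /* failures this chunk's profile tolerates */
};

/*
 * Same return convention as btrfs_check_degradable():
 * 0 = all chunks fully present, 1 = degraded but tolerable,
 * -EIO = at least one chunk lost more devices than it tolerates,
 * -ENOENT = no chunks at all.
 */
static int sketch_check_degradable(const struct sketch_chunk *chunks,
				   size_t nr_chunks, const int *dev_missing)
{
	int ret = 0;

	if (nr_chunks == 0)
		return -ENOENT;

	for (size_t c = 0; c < nr_chunks; c++) {
		int missing = 0;

		/* Count only the missing devices this chunk actually uses. */
		for (int i = 0; i < chunks[c].num_stripes; i++)
			if (dev_missing[chunks[c].stripe_dev[i]])
				missing++;

		if (missing > chunks[c].max_tolerated)
			return -EIO;
		if (missing)
			ret = 1;
	}
	return ret;
}
```

With one two-stripe chunk on devices 0 and 1 that tolerates one failure, plus a single-stripe chunk on device 2 that tolerates none, losing device 1 gives 1 (degradable) while losing device 2 gives -EIO. A global "missing devices > minimum tolerance" test would refuse both cases.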
[PATCH 07/13] btrfs: add check not to mount a spare device
Spare devices can be scanned but shouldn't be mountable.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/disk-io.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 7f02f1766037..b99329e37965 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -2806,6 +2806,14 @@ int open_ctree(struct super_block *sb,
 		goto fail_alloc;
 	}
 
+	if (btrfs_super_incompat_flags(disk_super) &
+		BTRFS_FEATURE_INCOMPAT_SPARE_DEV) {
+		/*You can only scan a spare device but not mount*/
+		printk(KERN_ERR "BTRFS: You can't mount a spare device\n");
+		err = -ENOTSUPP;
+		goto fail_alloc;
+	}
+
 	/*
 	 * Needn't use the lock because there is no other task which will
 	 * update the flag.
-- 
2.7.0
[PATCH 08/13] btrfs: support btrfs dev scan for spare device
When the user or system calls the BTRFS_IOC_SCAN_DEV ioctl, this patch
makes sure the device is added to the device list and marked as spare.
The same happens for BTRFS_IOC_DEVICES_READY, since that ioctl has
historically scanned the device as well.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/volumes.c | 4 ++++
 fs/btrfs/volumes.h | 2 ++
 2 files changed, 6 insertions(+)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dff2deaf88d3..d729539f9612 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -604,6 +604,10 @@ static noinline int device_list_add(const char *path,
 		if (IS_ERR(fs_devices))
 			return PTR_ERR(fs_devices);
 
+		if (btrfs_super_incompat_flags(disk_super) &
+			BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
+			fs_devices->spare = 1;
+
 		list_add(&fs_devices->list, &fs_uuids);
 
 		device = NULL;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 48ced5cc09e4..51cf716eb35b 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -263,6 +263,8 @@ struct btrfs_fs_devices {
 	struct kobject fsid_kobj;
 	struct kobject *device_dir_kobj;
 	struct completion kobj_unregister;
+
+	int spare;
 };
 
 #define BTRFS_BIO_INLINE_CSUM_SIZE	64
-- 
2.7.0
[PATCH 12/13] btrfs: check device for critical errors and mark failed
Write and Flush errors are considered as critical errors, upon which the device will be brought offline and marked as failed. Write and Flush errors are identified using device error statistics. This is monitored using a kthread btrfs_health. Signed-off-by: Anand JainTested-by: Austin S. Hemmelgarn --- fs/btrfs/ctree.h | 2 ++ fs/btrfs/disk-io.c | 101 - fs/btrfs/volumes.c | 1 + fs/btrfs/volumes.h | 4 +++ 4 files changed, 107 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index aa693cfdc9f0..47e9cd9dd29a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1569,6 +1569,7 @@ struct btrfs_fs_info { struct mutex tree_log_mutex; struct mutex transaction_kthread_mutex; struct mutex cleaner_mutex; + struct mutex health_mutex; struct mutex chunk_mutex; struct mutex volume_mutex; @@ -1686,6 +1687,7 @@ struct btrfs_fs_info { struct btrfs_workqueue *extent_workers; struct task_struct *transaction_kthread; struct task_struct *cleaner_kthread; + struct task_struct *health_kthread; int thread_pool_size; struct kobject *space_info_kobj; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b99329e37965..b523e56b34e9 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1869,6 +1869,93 @@ sleep: return 0; } +/* + * returns: + * < 0 : Check didn't run, std error + * 0 : No errors found + * > 0 : # of devices having fatal errors + */ +static int btrfs_update_devices_health(struct btrfs_root *root) +{ + int ret = 0; + struct btrfs_device *device; + struct btrfs_fs_info *fs_info = root->fs_info; + + if (btrfs_fs_closing(fs_info)) + return -EBUSY; + + /* mark disk(s) with write or flush error(s) as failed */ + mutex_lock(_info->volume_mutex); + list_for_each_entry_rcu(device, + _info->fs_devices->devices, dev_list) { + int c_err; + + if (device->failed) { + ret++; + continue; + } + + /* +* todo: replace target device's write/flush error, +* skip for now +*/ + if (device->is_tgtdev_for_dev_replace) + continue; + + if 
(!device->dev_stats_valid) + continue; + + c_err = atomic_read(>new_critical_errs); + atomic_sub(c_err, >new_critical_errs); + if (c_err) { + btrfs_crit_in_rcu(fs_info, + "fatal error on device %s", + rcu_str_deref(device->name)); + btrfs_enforce_device_state(device, "failed"); + ret ++; + } + } + mutex_unlock(_info->volume_mutex); + + return ret; +} + +/* + * Devices health maintenance kthread, gets woken-up by transaction + * kthread, once sysfs is ready, this should publish the report + * through sysfs so that user land scripts and invoke actions. + */ +static int health_kthread(void *arg) +{ + struct btrfs_root *root = arg; + + do { + if (btrfs_need_cleaner_sleep(root)) + goto sleep; + + if (!mutex_trylock(>fs_info->health_mutex)) + goto sleep; + + if (btrfs_need_cleaner_sleep(root)) { + mutex_unlock(>fs_info->health_mutex); + goto sleep; + } + + /* Check devices health */ + btrfs_update_devices_health(root); + + mutex_unlock(>fs_info->health_mutex); + +sleep: + set_current_state(TASK_INTERRUPTIBLE); + if (!kthread_should_stop()) + schedule(); + __set_current_state(TASK_RUNNING); + } while (!kthread_should_stop()); + + return 0; +} + static int transaction_kthread(void *arg) { struct btrfs_root *root = arg; @@ -1915,6 +2002,7 @@ static int transaction_kthread(void *arg) btrfs_end_transaction(trans, root); } sleep: + wake_up_process(root->fs_info->health_kthread); wake_up_process(root->fs_info->cleaner_kthread); mutex_unlock(>fs_info->transaction_kthread_mutex); @@ -2663,6 +2751,7 @@ int open_ctree(struct super_block *sb, mutex_init(_info->chunk_mutex); mutex_init(_info->transaction_kthread_mutex); mutex_init(_info->cleaner_mutex); + mutex_init(_info->health_mutex); mutex_init(_info->volume_mutex); mutex_init(_info->ro_block_group_mutex); init_rwsem(_info->commit_root_sem); @@ -3005,11 +3094,16 @@ retry_root_backup: if
[PATCH 06/13] btrfs: introduce BTRFS_FEATURE_INCOMPAT_SPARE_DEV
Add the BTRFS_FEATURE_INCOMPAT_SPARE_DEV (0x400) flag to identify a spare
device. Along with this, the mount path checks the flag and fails the
mount, since spare devices aren't mountable.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/ctree.h | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index e0a50f478e01..2c185a8e92f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -531,6 +531,7 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
 #define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
 #define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+#define BTRFS_FEATURE_INCOMPAT_SPARE_DEV	(1ULL << 10)
 
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
@@ -551,7 +552,8 @@ struct btrfs_super_block {
 	 BTRFS_FEATURE_INCOMPAT_RAID56 |		\
 	 BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF |		\
 	 BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA |	\
-	 BTRFS_FEATURE_INCOMPAT_NO_HOLES)
+	 BTRFS_FEATURE_INCOMPAT_NO_HOLES |		\
+	 BTRFS_FEATURE_INCOMPAT_SPARE_DEV)
 
 #define BTRFS_FEATURE_INCOMPAT_SAFE_SET			\
 	(BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF)
-- 
2.7.0
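As a side note on why the flag goes into the supported mask as well: an incompat bit that a kernel does not list as supported makes that kernel refuse the device. The sketch below shows the two bit tests involved. The SKETCH_ names and helper functions are illustrative only; the bit positions are the ones in the hunk above.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit values taken from the patch; the SKETCH_ prefix marks the names as illustrative. */
#define SKETCH_INCOMPAT_NO_HOLES   (1ULL << 9)
#define SKETCH_INCOMPAT_SPARE_DEV  (1ULL << 10)	/* 0x400 */

/* True when the superblock flags mark the device as a spare. */
static bool sketch_is_spare(uint64_t incompat_flags)
{
	return (incompat_flags & SKETCH_INCOMPAT_SPARE_DEV) != 0;
}

/* True when the superblock carries an incompat bit outside the supported mask. */
static bool sketch_has_unsupported(uint64_t incompat_flags,
				   uint64_t supported_mask)
{
	return (incompat_flags & ~supported_mask) != 0;
}
```

A kernel whose supported mask lacks bit 10 would see a spare device as carrying an unknown incompat feature, which is the expected behaviour for an incompat flag.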
[PATCH 04/13] btrfs: Allow barrier_all_devices to do per-chunk device check
From: Qu Wenruo

The last user of num_tolerated_disk_barrier_failures is
barrier_all_devices(). But it can easily be changed to use the new
per-chunk degradable check framework.

Now btrfs_device has two extra members, representing send/wait errors,
set at write_dev_flush() time. They are then checked in a way similar
to, but more accurate than, the old code.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/disk-io.c | 13 +++++--------
 fs/btrfs/volumes.c |  6 +++++-
 fs/btrfs/volumes.h |  4 ++++
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index bfea0f8f6a87..85e26d62c089 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3491,8 +3491,6 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 {
 	struct list_head *head;
 	struct btrfs_device *dev;
-	int errors_send = 0;
-	int errors_wait = 0;
 	int ret;
 
 	/* send down all the barriers */
@@ -3501,7 +3499,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_send++;
+			dev->err_send = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3509,7 +3507,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
 		ret = write_dev_flush(dev, 0);
 		if (ret)
-			errors_send++;
+			dev->err_send = 1;
 	}
 
 	/* wait for all the barriers */
@@ -3517,7 +3515,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 		if (dev->missing)
 			continue;
 		if (!dev->bdev) {
-			errors_wait++;
+			dev->err_wait = 1;
 			continue;
 		}
 		if (!dev->in_fs_metadata || !dev->writeable)
@@ -3525,10 +3523,9 @@ static int barrier_all_devices(struct btrfs_fs_info *info)
 
 		ret = write_dev_flush(dev, 1);
 		if (ret)
-			errors_wait++;
+			dev->err_wait = 1;
 	}
-	if (errors_send > info->num_tolerated_disk_barrier_failures ||
-	    errors_wait > info->num_tolerated_disk_barrier_failures)
+	if (btrfs_check_degradable(info, info->sb->s_flags) < 0)
 		return -EIO;
 	return 0;
 }
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index dd3dc53a302a..a840d78ba127 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -7081,8 +7081,12 @@ int btrfs_check_degradable(struct btrfs_fs_info *fs_info, unsigned flags)
 			btrfs_get_num_tolerated_disk_barrier_failures(
 					map->type);
 		for (i = 0; i < map->num_stripes; i++) {
-			if (map->stripes[i].dev->missing)
+			if (map->stripes[i].dev->missing ||
+			    map->stripes[i].dev->err_wait ||
+			    map->stripes[i].dev->err_send)
 				missing++;
+			map->stripes[i].dev->err_wait = 0;
+			map->stripes[i].dev->err_send = 0;
 		}
 		if (missing > max_tolerated) {
 			ret = -EIO;
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 351431a3f5aa..48ced5cc09e4 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -76,6 +76,10 @@ struct btrfs_device {
 	int can_discard;
 	int is_tgtdev_for_dev_replace;
 
+	/* for barrier_all_devices() check */
+	int err_send;
+	int err_wait;
+
 #ifdef __BTRFS_NEED_DEVICE_DATA_ORDERED
 	seqcount_t data_seqcount;
 #endif
-- 
2.7.0
[PATCH 09/13] btrfs: provide framework to get and put a spare device
This adds functions to get and put a spare device from the list, so
that the hot replace code can pick a spare device when needed.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/ctree.h   |  1 +
 fs/btrfs/super.c   |  5 +
 fs/btrfs/volumes.c | 53 +
 fs/btrfs/volumes.h |  2 ++
 4 files changed, 61 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 2c185a8e92f0..aa693cfdc9f0 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4185,6 +4185,7 @@ void btrfs_sysfs_remove_mounted(struct btrfs_fs_info *fs_info);
 ssize_t btrfs_listxattr(struct dentry *dentry, char *buffer, size_t size);
 
 /* super.c */
+struct file_system_type *btrfs_get_fs_type(void);
 int btrfs_parse_options(struct btrfs_root *root, char *options,
 			unsigned long new_flags);
 int btrfs_sync_fs(struct super_block *sb, int wait);
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 87639fa53b10..49ba899b2d36 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -69,6 +69,11 @@ static struct file_system_type btrfs_fs_type;
 
 static int btrfs_remount(struct super_block *sb, int *flags, char *data);
 
+struct file_system_type *btrfs_get_fs_type()
+{
+	return &btrfs_fs_type;
+}
+
 const char *btrfs_decode_error(int errno)
 {
 	char *errstr = "unknown";
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d729539f9612..072cefac958c 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -525,6 +525,59 @@ static void pending_bios_fn(struct btrfs_work *work)
 	run_scheduled_bios(device);
 }
 
+int btrfs_get_spare_device(char **path)
+{
+	int ret = 1;
+	struct btrfs_fs_devices *fs_devices;
+	struct btrfs_device *device;
+	struct list_head *fs_uuids = btrfs_get_fs_uuids();
+
+	mutex_lock(&uuid_mutex);
+	list_for_each_entry(fs_devices, fs_uuids, list) {
+		if (!fs_devices->spare)
+			continue;
+
+		/* as of now there is only one device in the spare fs_devices */
+		device = list_entry(fs_devices->devices.next,
+					struct btrfs_device, dev_list);
+
+		if (!device || !device->name)
+			continue;
+
+		fs_devices->spare = 0;
+		/*
+		 * Its under uuid_mutex and there is one spare per fsid
+		 * so rcu lock is actually not required
+		 */
+		*path = kstrdup(device->name->str, GFP_KERNEL);
+		if (*path)
+			ret = 0;
+		else
+			ret = -ENOMEM;
+		break;
+	}
+
+	if (!ret) {
+		btrfs_sysfs_remove_fsid(fs_devices);
+		list_del(&fs_devices->list);
+		free_fs_devices(fs_devices);
+	}
+	mutex_unlock(&uuid_mutex);
+
+	return ret;
+}
+
+void btrfs_put_spare_device(char *path)
+{
+	struct file_system_type *btrfs_fs_type;
+	struct btrfs_fs_devices *fs_devices;
+
+	btrfs_fs_type = btrfs_get_fs_type();
+
+	if (btrfs_scan_one_device(path, FMODE_READ,
+					btrfs_fs_type, &fs_devices))
+		printk(KERN_INFO "failed to return spare device\n");
+}
 
 void btrfs_free_stale_device(struct btrfs_device *cur_dev)
 {
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 51cf716eb35b..b4308afa3097 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -469,6 +469,8 @@ int btrfs_init_new_device(struct btrfs_root *root, char *path);
 int btrfs_init_dev_replace_tgtdev(struct btrfs_root *root, char *device_path,
 				  struct btrfs_device *srcdev,
 				  struct btrfs_device **device_out);
+int btrfs_get_spare_device(char **path);
+void btrfs_put_spare_device(char *path);
 int btrfs_balance(struct btrfs_balance_control *bctl,
 		  struct btrfs_ioctl_balance_args *bargs);
 int btrfs_resume_balance_async(struct btrfs_fs_info *fs_info);
-- 
2.7.0
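The get/put contract above (0 with an allocated path, 1 when no spare exists, -ENOMEM on allocation failure, and put to hand the device back when the replace could not start) can be modelled in userspace roughly as follows. Everything here is hypothetical: the fixed-size pool replaces the fs_uuids list and uuid_mutex, and, unlike the patch, this sketch only claims the spare after the path copy has succeeded.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define SKETCH_MAX_SPARES 4

/* Hypothetical pool; the kernel patch walks the fs_uuids list under uuid_mutex instead. */
struct sketch_spare_pool {
	const char *path[SKETCH_MAX_SPARES];
	int is_spare[SKETCH_MAX_SPARES];
};

/* Returns 0 and a heap copy of the path, 1 if no spare exists, -ENOMEM on allocation failure. */
static int sketch_get_spare(struct sketch_spare_pool *pool, char **path)
{
	for (int i = 0; i < SKETCH_MAX_SPARES; i++) {
		size_t len;

		if (!pool->is_spare[i] || !pool->path[i])
			continue;

		len = strlen(pool->path[i]) + 1;
		*path = malloc(len);
		if (!*path)
			return -ENOMEM;
		memcpy(*path, pool->path[i], len);
		pool->is_spare[i] = 0;	/* claimed: no longer offered as a spare */
		return 0;
	}
	return 1;
}

/* Hand a spare back after a failed replace. Returns 0 on success, 1 if the path is unknown. */
static int sketch_put_spare(struct sketch_spare_pool *pool, const char *path)
{
	for (int i = 0; i < SKETCH_MAX_SPARES; i++) {
		if (pool->path[i] && strcmp(pool->path[i], path) == 0) {
			pool->is_spare[i] = 1;
			return 0;
		}
	}
	return 1;
}
```

The caller owns the returned string and frees it once the replace has been started or the spare has been put back, which matches how btrfs_auto_replace_start() treats tgt_path in the thread.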
[PATCH 13/13] btrfs: check for failed device and hot replace
This patch checks for a failed device and kicks off auto replace. If the
user decides to disable auto replace, that can be done through a future
sysfs or ioctl interface that sets the fs_info->no_auto_replace
parameter to 1.

Signed-off-by: Anand Jain
Tested-by: Austin S. Hemmelgarn
---
 fs/btrfs/ctree.h   |  2 ++
 fs/btrfs/disk-io.c | 34 ++++++++++++++++++++++++++++++++++
 2 files changed, 36 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 47e9cd9dd29a..67bb36bb82ee 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1862,6 +1862,8 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	int no_auto_replace;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b523e56b34e9..f205e7e94948 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1869,6 +1869,38 @@ sleep:
 	return 0;
 }
 
+static int btrfs_recuperate(struct btrfs_root *root)
+{
+	int ret;
+	int found = 0;
+	struct btrfs_device *device;
+	struct btrfs_fs_devices *fs_devices;
+
+	fs_devices = root->fs_info->fs_devices;
+
+	mutex_lock(&fs_devices->device_list_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(device,
+			&fs_devices->devices, dev_list) {
+		if (device->failed) {
+			found = 1;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	mutex_unlock(&fs_devices->device_list_mutex);
+
+	/*
+	 * We are using the replace code which should be interrupt-able
+	 * during unmount, and as of now there is no user land stop
+	 * request that we support and this will run until its complete
+	 */
+	if (found && !root->fs_info->no_auto_replace)
+		ret = btrfs_auto_replace_start(root, device);
+
+	return ret;
+}
+
 /*
  * returns:
  * < 0 : Check didn't run, std error
@@ -1944,6 +1976,8 @@ static int health_kthread(void *arg)
 		/* Check devices health */
 		btrfs_update_devices_health(root);
 
+		btrfs_recuperate(root);
+
 		mutex_unlock(&root->fs_info->health_mutex);
 
 sleep:
-- 
2.7.0
Re: Global hotspare functionality
On 03/31/2016 06:17 AM, Yauhen Kharuzhy wrote: On Tue, Mar 29, 2016 at 10:40:40PM +0300, Yauhen Kharuzhy wrote: Hi. I am testing hotspare v2 on kernel v4.4.5 (I will try latest Chris' tree later) now with lockdep debugging enabled. At starting of replacement, lockdep warning is displayed, because kstrdup() is called with GFP_NOFS inside of rcu_read_lock/unlock() block (GFP_NOFS can sleep). Similar thing in the btrfs_auto_replace_start(): rcu_str_deref() without rcu_read_lock(): int btrfs_auto_replace_start(struct btrfs_root *root, struct btrfs_device *src_device) { int ret; char *tgt_path; if (btrfs_get_spare_device(_path)) { btrfs_err(root->fs_info, "No spare device found/configured in the kernel"); return -EINVAL; } ret = btrfs_dev_replace_start(root, tgt_path, src_device->devid, rcu_str_deref(src_device->name), This is fixed in V3. Thanks, Anand BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID); if (ret) btrfs_put_spare_device(tgt_path); kfree(tgt_path); return 0; } [ 156.168133] === [ 156.168963] [ INFO: suspicious RCU usage. ] [ 156.169822] 4.4.5-scst31x+ #20 Not tainted [ 156.170656] --- [ 156.171488] fs/btrfs/dev-replace.c:990 suspicious rcu_dereference_check() usage! 
[ 156.172920] [ 156.172920] other info that might help us debug this: [ 156.172920] [ 156.174825] [ 156.174825] rcu_scheduler_active = 1, debug_locks = 0 [ 156.176152] 1 lock held by btrfs-casualty/4807: [ 156.181917] #0: (_info->casualty_mutex){+.+...}, at: [] casualty_kthread+0x64/0x390 [btrfs] [ 156.193511] [ 156.193511] stack backtrace: [ 156.194680] CPU: 0 PID: 4807 Comm: btrfs-casualty Not tainted 4.4.5-scst31x+ #20 [ 156.201650] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 [ 156.219100] 88005d79fda0 813529e3 88005e19c600 [ 156.221216] 0001 88005d79fdd0 810d6407 [ 156.224287] 88005f4a0c00 88005da36000 88005d79fe08 [ 156.226375] Call Trace: [ 156.227078] [] dump_stack+0x85/0xc2 [ 156.228152] [] lockdep_rcu_suspicious+0xd7/0x110 [ 156.229418] [] btrfs_auto_replace_start+0xa6/0xd0 [btrfs] [ 156.230714] [] casualty_kthread+0x2c4/0x390 [btrfs] [ 156.231915] [] ? casualty_kthread+0x19c/0x390 [btrfs] [ 156.233105] [] ? btrfs_check_devices+0x200/0x200 [btrfs] [ 156.234339] [] kthread+0xef/0x110 [ 156.235309] [] ? __raw_callee_save___pv_queued_spin_unlock+0x11/0x20 [ 156.236940] [] ? kthread_create_on_node+0x200/0x200 [ 156.239489] [] ret_from_fork+0x3f/0x70 [ 156.240533] [] ? kthread_create_on_node+0x200/0x200 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Global hotspare functionality
On 03/30/2016 03:47 AM, Yauhen Kharuzhy wrote: On Tue, Mar 29, 2016 at 10:41:36PM +0800, Anand Jain wrote: Hi Yauhen, Issue 2. At start of autoreplacig drive by hotspare, kernel craches in transaction handling code (inside of btrfs_commit_transaction() called by autoreplace initiating routines). I 'fixed' this by removing of closing of bdev in btrfs_close_one_device_dont_free(), see https://bitbucket.org/jekhor/linux-btrfs/commits/dfa441c9ec7b3833f6a5e4d0b6f8c678faea29bb?at=master (oops text is attached also). Bdev is closed after replacing by btrfs_dev_replace_finishing(), so this is safe but doesn't seem to be right way. I have sent out V2. I don't see that issue with this, could you pls try ? Yes, it reproduced on v4.4.5 kernel. I will try with current 'for-linus-4.6' Chris' tree soon. To emulate a drive failure, I disconnect the drive in VirtualBox, so bdev can be freed by kernel after releasing of all references to it. So far the raid group profile would adapt to lower suitable group profile when device is missing/failed. This appears to be not happening with RAID56 OR there are stale IO which wasn't flushed out. Anyway to have this fixed I am moving the patch btrfs: introduce device dynamic state transition to offline or failed to the top in v3 for any potential changes. But firstly we need a reliable test case, or a very carefully crafted test case which can create this situation Here below is the dm-error that I am using for testing, which apparently doesn't report this issue. Could you please try on V3. ? (pls note the device names are hard coded in the test script sorry about that) This would eventually be fstests script. # cat util run() { local ret echo -- ${*} -- echo ${*} | bash ret=$? if [ $ret -ne 0 ]; then echo echo "## FAILED: RET $ret #" echo exit fi echo #echo "OK?"; read } runnt() { local ret echo -- ${*} -- echo ${*} | bash ret=$? 
echo #echo "OK?"; read } wipeall() { runnt "wipefs -a /dev/sd[c-h] > /dev/null" } create_err_dev_raid1() { dm_backing_dev="/dev/sdd" blk_dev_size=`blockdev --getsz $dm_backing_dev` dmerror_dev="/dev/mapper/dm-sdd" dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0" dmerror_table="0 $blk_dev_size error $dm_backing_dev 0" echo -e dm_backing_dev'\t'= $dm_backing_dev echo -e blk_dev_size'\t'= $blk_dev_size echo -e dmerror_dev'\t'= $dmerror_dev echo -e dmlinear_table'\t'= $dmlinear_table echo -e dmerror_table'\t'= $dmerror_table echo runnt "dmsetup remove dm-sdd > /dev/null 2>&1" run "dmsetup create dm-sdd --table '${dmlinear_table}'" run "mkfs.btrfs -f -draid1 -mraid1 /dev/sdc $dmerror_dev > /dev/null 2>&1" run mount /dev/sdc /btrfs run "fillfs /btrfs 1000 > /dev/null 2>&1" run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show # run sleep 32 run dmsetup suspend dm-sdd run "dmsetup load dm-sdd --table '$dmerror_table'" run dmsetup resume dm-sdd run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show } create_err_dev_raid56() { dm_backing_dev="/dev/sdd" blk_dev_size=`blockdev --getsz $dm_backing_dev` dmerror_dev="/dev/mapper/dm-sdd" dmlinear_table="0 $blk_dev_size linear $dm_backing_dev 0" dmerror_table="0 $blk_dev_size error $dm_backing_dev 0" echo -e dm_backing_dev'\t'= $dm_backing_dev echo -e blk_dev_size'\t'= $blk_dev_size echo -e dmerror_dev'\t'= $dmerror_dev echo -e dmlinear_table'\t'= $dmlinear_table echo -e dmerror_table'\t'= $dmerror_table echo runnt "dmsetup remove dm-sdd > /dev/null 2>&1" run "dmsetup create dm-sdd --table '${dmlinear_table}'" run "mkfs.btrfs -f -draid5 -mraid5 /dev/sdc /dev/sdf $dmerror_dev > /dev/null 2>&1" run mount /dev/sdc /btrfs run "fillfs /btrfs 1000 > /dev/null 2>&1" run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show # run sleep 32 run dmsetup suspend dm-sdd run "dmsetup load dm-sdd --table '$dmerror_table'" run dmsetup 
resume dm-sdd run "dd if=/dev/zero of=/btrfs/tf1 bs=4096 count=100 > /dev/null 2>&1" run btrfs fi show } # cat auto-replace-test56 source $(dirname $0)/util wipeall run btrfs spare add /dev/sde #run cat /proc/fs/btrfs/devlist create_err_dev_raid56 -- Thanks, Anand [ 1464.232552] BTRFS info (device sdc): dev_replace from (devid 4) to /dev/sdg started [ 1464.255824] BUG: unable to handle kernel NULL pointer dereference at 0548 [ 1464.291760]
Re: [PATCH 12/12] btrfs: check device for critical errors and mark failed
On 03/30/2016 08:49 AM, Yauhen Kharuzhy wrote:
> On Tue, Mar 29, 2016 at 10:22:29PM +0800, Anand Jain wrote:
>> Write and Flush errors are considered as critical errors,
>> upon which the device will be brought offline and marked as
>> failed. Write and Flush errors are identified using device
>> error statistics.
>>
>> Signed-off-by: Anand Jain
>>
>> btrfs: check for failed device and hot replace
>>
>> This patch creates casualty_kthread to check for the failed
>> devices, and triggers device replace.
>>
>> Signed-off-by: Anand Jain
>> ---
>>  fs/btrfs/ctree.h   |   2 +
>>  fs/btrfs/disk-io.c | 161 -
>>  fs/btrfs/disk-io.h |   2 +
>>  fs/btrfs/volumes.c |   1 +
>>  fs/btrfs/volumes.h |   4 ++
>>  5 files changed, 169 insertions(+), 1 deletion(-)
>
> btrfs_check_and_handle_casualty() tries to perform auto-replacement only
> once after each failure. If no hotspare was added in system before
> failure, only one remaining way to replace drive is to perform replace
> manually.
>
> This sounds reasonable, so just clarification: are you sure that we
> shouldn't start autoreplacement if hotspare will be added after drive
> failure?
>
> V1 of the patchset tried to perform autoreplace endlessly until replace
> drive is added.

Yeah, I made that change on purpose, but I have reverted it in V3 so
that the code is more flexible and leaves more room for design changes.

Thanks, Anand
Re: [PATCH 12/12] btrfs: check device for critical errors and mark failed
On 03/30/2016 06:41 AM, Yauhen Kharuzhy wrote: On Tue, Mar 29, 2016 at 10:22:29PM +0800, Anand Jain wrote: Write and Flush errors are considered as critical errors, upon which the device will be brought offline and marked as failed. Write and Flush errors are identified using device error statistics. Signed-off-by: Anand Jainbtrfs: check for failed device and hot replace This patch creates casualty_kthread to check for the failed devices, and triggers device replace. Signed-off-by: Anand Jain --- fs/btrfs/ctree.h | 2 + fs/btrfs/disk-io.c | 161 - fs/btrfs/disk-io.h | 2 + fs/btrfs/volumes.c | 1 + fs/btrfs/volumes.h | 4 ++ 5 files changed, 169 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 2c185a8e92f0..36f1c29e00a0 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1569,6 +1569,7 @@ struct btrfs_fs_info { struct mutex tree_log_mutex; struct mutex transaction_kthread_mutex; struct mutex cleaner_mutex; + struct mutex casualty_mutex; struct mutex chunk_mutex; struct mutex volume_mutex; @@ -1686,6 +1687,7 @@ struct btrfs_fs_info { struct btrfs_workqueue *extent_workers; struct task_struct *transaction_kthread; struct task_struct *cleaner_kthread; + struct task_struct *casualty_kthread; int thread_pool_size; struct kobject *space_info_kobj; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index b99329e37965..650e26e0acda 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -1869,6 +1869,153 @@ sleep: return 0; } +static int btrfs_check_and_handle_casualty(void *arg) +{ + int ret; + int found = 0; + struct btrfs_device *device; + struct btrfs_root *root = arg; + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices; + + btrfs_dev_replace_lock(_info->dev_replace, 0); + if (btrfs_dev_replace_is_ongoing(_info->dev_replace)) { + btrfs_dev_replace_unlock(_info->dev_replace, 0); + return -EBUSY; + } + btrfs_dev_replace_unlock(_info->dev_replace, 0); + + ret = 
btrfs_check_devices(fs_devices); + if (ret == 1) { + /* +* There were some casualties, and if its beyond a +* chunk group can tolerate, then FS will already +* be in readonly, so check that. And that's best +* btrfs could do as of now and no replace will help. +*/ + if (fs_info->sb->s_flags & MS_RDONLY) + return -EROFS; + + mutex_lock(_devices->device_list_mutex); + rcu_read_lock(); + list_for_each_entry_rcu(device, + _devices->devices, dev_list) { + if (device->failed) { + found = 1; + break; + } + } + rcu_read_unlock(); + mutex_unlock(_devices->device_list_mutex); + } + + /* +* We are using the replace code which should be interrupt-able +* during unmount, and as of now there is no user land stop +* request that we support and this will run until its complete +*/ + if (found) + ret = btrfs_auto_replace_start(root, device); + + return ret; +} + +/* + * A kthread to check if any auto maintenance be required. This is + * multithread safe, and kthread is running only if + * fs_info->casualty_kthread is not NULL, fixme: atomic ? + */ +static int casualty_kthread(void *arg) +{ + int ret; + int again; + struct btrfs_root *root = arg; + + do { + again = 0; + + if (btrfs_need_cleaner_sleep(root)) + goto sleep; + + if (!mutex_trylock(>fs_info->casualty_mutex)) + goto sleep; + + if (btrfs_need_cleaner_sleep(root)) { + mutex_unlock(>fs_info->casualty_mutex); + goto sleep; + } + + ret = btrfs_check_and_handle_casualty(arg); + if (ret == -EROFS) { + /* +* When checking and fixing the devices, the +* FS may be marked as RO in some situations. +* And on ROFS casualty thread has no work. +* So optimize here, to stop this thread until +* FS is back to RW. +*/ + } + mutex_unlock(>fs_info->casualty_mutex); + +sleep: + if (!try_to_freeze() && !again) { This block was copy-pasted from the cleaner_kthread(). 'again' variable
Re: Another ENOSPC situation
On Fri, Apr 1, 2016 at 10:40 PM, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
>> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber wrote:
>> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
>> >> > > btrfs balance -mprofiles seems to do something. one kworker and one
>> >> > > btrfs-transaction process hog one CPU core each for hours, while
>> >> > > blocking the filesystem for minutes apiece, which leads to the host
>> >> > > being nearly unusable up to the point of "clock and mouse pointer
>> >> > > frozen for nearly ten minutes".
>> >> >
>> >> > I assume you still have your every 10 minutes snapshotting running
>> >> > while balancing?
>> >>
>> >> No, I disabled the cronjob before trying the balance. I might be
>> >> crazy, but not stup^wunexperienced.
>> >
>> > That being said, I would still expect the code not to allow _this_
>> > kind of effect on the entire system when two allegedly incompatible
>> > operations run simultaneously. I mean, Linux is a multi-user,
>> > multi-tasking operating system where one simply cannot expect all
>> > processes to be cooperative to each other. We have the operating
>> > systems to prevent this kind of issue, not to cause them.
>>
>> Maybe look at it differently: Does user mh have trouble using this
>> laptop w.r.t. storing files?
>
> No. I would have cried murder otherwise.
>
>> In openSUSE Tumbleweed (the snapshot from end of March), root access
>> is needed to change the default snapshotting config, otherwise you
>> will have a 10 year history. After that change has been done according
>> to needs of the user, there is no need to run manual balance.
>
> So you are saying that balancing a filesystem should never be
> necessary? Or what are you trying to say?
There is a package, btrfsmaintenance, which does the balancing for the user after root has configured it according to the user's wishes and needs. The key thing I want to say is that you should change your snapshotting rate and/or policy. It has been hinted at before, and I think it is more a psychological issue than a technical one. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[GIT PULL] Btrfs
Hi Linus,

My for-linus-4.6 branch:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git for-linus-4.6

Has a few fixes Dave Sterba had queued up. These are all pretty small, but since they were tested I decided against waiting for more:

Alex Lyakas (2) commits (+18/-10):
    btrfs: do not write corrupted metadata blocks to disk (+13/-2)
    btrfs: csum_tree_block: return proper errno value (+5/-8)

Jiri Kosina (2) commits (+7/-10):
    btrfs: cleaner_kthread() doesn't need explicit freeze (+1/-1)
    btrfs: transaction_kthread() is not freezable (+6/-9)

Total: (4) commits (+25/-20)

 fs/btrfs/disk-io.c | 45 +
 1 file changed, 25 insertions(+), 20 deletions(-)
RE: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)
> I grabbed this part from the log after the machine crashed again
> following trying to transfer a bunch of files that included ones with
> csum errors, let me know if this looks like the same issue you were
> having:
>

Idk? You hit a soft lockup, mine got a "kernel BUG at..." Your stack trace diverges from mine after bio_endio.

James
Re: Another ENOSPC situation
On Fri, Apr 01, 2016 at 09:20:52PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber wrote:
> > On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> >> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> >> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
> >> > > btrfs balance -mprofiles seems to do something. one kworker and one
> >> > > btrfs-transaction process hog one CPU core each for hours, while
> >> > > blocking the filesystem for minutes apiece, which leads to the host
> >> > > being nearly unusable up to the point of "clock and mouse pointer
> >> > > frozen for nearly ten minutes".
> >> >
> >> > I assume you still have your every 10 minutes snapshotting running
> >> > while balancing?
> >>
> >> No, I disabled the cronjob before trying the balance. I might be
> >> crazy, but not stup^wunexperienced.
> >
> > That being said, I would still expect the code not to allow _this_
> > kind of effect on the entire system when two allegedly incompatible
> > operations run simultaneously. I mean, Linux is a multi-user,
> > multi-tasking operating system where one simply cannot expect all
> > processes to be cooperative to each other. We have the operating
> > systems to prevent this kind of issue, not to cause them.
>
> Maybe look at it differently: Does user mh have trouble using this
> laptop w.r.t. storing files?

No. I would have cried murder otherwise.

> In openSUSE Tumbleweed (the snapshot from end of March), root access
> is needed to change the default snapshotting config, otherwise you
> will have a 10 year history. After that change has been done according
> to needs of the user, there is no need to run manual balance.

So you are saying that balancing a filesystem should never be necessary? Or what are you trying to say?

Greetings
Marc

-- 
Marc Haber         | "I don't trust Computers. They | Mail address in header
Leimen, Germany    |  lose things."    Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature |  How to make an American Quilt | Fax: *49 6224 1600421
[PATCH 0/8] btrfs: uapi migration for user-visible API components
commit 55e301fd57a (Btrfs: move fs/btrfs/ioctl.h to include/uapi/linux/btrfs.h) was intended to make the ioctl definitions available to userspace. Unfortunately, moving just that file wasn't enough, and many of the ioctls aren't actually usable without the userspace programmer filling in the gaps. Specifically, for routine ioctls like BTRFS_IOC_SET_FSLABEL, BTRFS_LABEL_SIZE wasn't defined, so the ioctl definition would be incomplete. We were also missing the argument structure for defrag. Beyond that, many of the ioctl structures have a flags field that may or may not be independent of the btrfs internals. Lastly, the SEARCH_TREE ioctl exposes all of the internal items of the tree to userspace programmers, so the item structures should be exposed so that they can be parsed properly.

So, to make all this more convenient for consumers of these APIs, I've moved the flags used by the ioctl structures into btrfs.h and moved the item definitions, key IDs, tree root objectids, and other well-known objectids into a new btrfs_tree.h. ctree.h includes this new header directly, so there aren't any changes to .c files at all. The only part of this set that isn't just a direct cut-and-paste is the last one, which converts u8 and u64 values to __u8 and __u64 since the former aren't exported via include/uapi.

The goal is that everything required to use the btrfs ioctls for a particular kernel release should be made available by exporting the uapi headers for that release.

I intend to use these for the strace ioctl decoding patch I've been working on so that I don't need to duplicate the definitions in the code I send upstream as the final version of the patch. Prior to this patchset, I had to duplicate nearly 100 defines and several structures -- and that's without doing any item decoding at all.

I do expect there might be some discussion here. :)

-Jeff

Jeff Mahoney (8):
  btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
  btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
  btrfs: uapi/linux/btrfs.h migration, document subvol flags
  btrfs: uapi/linux/btrfs.h migration, move feature flags
  btrfs: uapi/linux/btrfs.h migration, move balance flags
  btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args
  btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
  btrfs: uapi/linux/btrfs_tree.h, use __u8 and __u64

 fs/btrfs/ctree.h                | 1014 +--
 fs/btrfs/volumes.h              |   46 --
 include/uapi/linux/btrfs.h      |  173 ++-
 include/uapi/linux/btrfs_tree.h |  966 +
 4 files changed, 1135 insertions(+), 1064 deletions(-)
 create mode 100644 include/uapi/linux/btrfs_tree.h

-- 
2.7.1
[PATCH 4/8] btrfs: uapi/linux/btrfs.h migration, move feature flags
The compat/compat_ro/incompat feature flags are used by the feature set/get ioctls.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/ctree.h           | 25 -
 include/uapi/linux/btrfs.h | 31 +++
 2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index c228b39..378482c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -506,31 +506,6 @@ struct btrfs_super_block {
  * Compat flags that we support. If any incompat flags are set other than the
  * ones specified below then we will fail to mount
  */
-#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
-
-#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
-#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
-#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
-#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
-/*
- * some patches floated around with a second compression method
- * lets save that incompat here for when they do get in
- * Note we don't actually support it, we're just reserving the
- * number
- */
-#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4)
-
-/*
- * older kernels tried to do bigger metadata blocks, but the
- * code was pretty buggy. Lets not let them try anymore.
- */
-#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA	(1ULL << 5)
-
-#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF	(1ULL << 6)
-#define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
-#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
-#define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
-
 #define BTRFS_FEATURE_COMPAT_SUPP		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_SET		0ULL
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR		0ULL
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 0316e23..de98717 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -222,6 +222,37 @@ struct btrfs_ioctl_fs_info_args {
 	__u64 reserved[122];	/* pad to 1k */
 };
 
+/*
+ * feature flags
+ *
+ * Used by:
+ *   struct btrfs_ioctl_feature_flags
+ */
+#define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+
+#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
+#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
+#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
+/*
+ * some patches floated around with a second compression method
+ * lets save that incompat here for when they do get in
+ * Note we don't actually support it, we're just reserving the
+ * number
+ */
+#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4)
+
+/*
+ * older kernels tried to do bigger metadata blocks, but the
+ * code was pretty buggy. Lets not let them try anymore.
+ */
+#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA	(1ULL << 5)
+
+#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF	(1ULL << 6)
+#define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
+#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
+#define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)
+
 struct btrfs_ioctl_feature_flags {
 	__u64 compat_flags;
 	__u64 compat_ro_flags;
-- 
2.7.1
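As an illustration of what a userspace consumer (such as an ioctl decoder) might do with these flags once they are exported, here is a minimal sketch that turns an incompat bitmask into a readable list. The flag values are copied from the patch; the table, the short names, and the function `example_decode_incompat` are assumptions made for this example and are not part of any kernel or btrfs-progs API. Real code would include <linux/btrfs.h> instead of redefining the macros.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local copies of the values in the patch; normally from <linux/btrfs.h>. */
#define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
#define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
#define BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS	(1ULL << 2)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO	(1ULL << 3)
#define BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2	(1ULL << 4)
#define BTRFS_FEATURE_INCOMPAT_BIG_METADATA	(1ULL << 5)
#define BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF	(1ULL << 6)
#define BTRFS_FEATURE_INCOMPAT_RAID56		(1ULL << 7)
#define BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA	(1ULL << 8)
#define BTRFS_FEATURE_INCOMPAT_NO_HOLES		(1ULL << 9)

/* Hypothetical name table, ordered by bit position. */
struct example_flag_name {
	uint64_t bit;
	const char *name;
};

static const struct example_flag_name example_incompat_names[] = {
	{ BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF,   "mixed_backref" },
	{ BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL,  "default_subvol" },
	{ BTRFS_FEATURE_INCOMPAT_MIXED_GROUPS,    "mixed_groups" },
	{ BTRFS_FEATURE_INCOMPAT_COMPRESS_LZO,    "compress_lzo" },
	{ BTRFS_FEATURE_INCOMPAT_COMPRESS_LZOv2,  "compress_lzov2" },
	{ BTRFS_FEATURE_INCOMPAT_BIG_METADATA,    "big_metadata" },
	{ BTRFS_FEATURE_INCOMPAT_EXTENDED_IREF,   "extended_iref" },
	{ BTRFS_FEATURE_INCOMPAT_RAID56,          "raid56" },
	{ BTRFS_FEATURE_INCOMPAT_SKINNY_METADATA, "skinny_metadata" },
	{ BTRFS_FEATURE_INCOMPAT_NO_HOLES,        "no_holes" },
};

/*
 * Write a comma-separated list of the known incompat flags set in 'flags'
 * into 'buf' (truncating if needed) and return the bits not in the table.
 */
static uint64_t example_decode_incompat(uint64_t flags, char *buf, size_t len)
{
	uint64_t unknown = flags;
	size_t i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(example_incompat_names) /
			sizeof(example_incompat_names[0]); i++) {
		if (!(flags & example_incompat_names[i].bit))
			continue;
		unknown &= ~example_incompat_names[i].bit;
		if (buf[0] != '\0')
			strncat(buf, ",", len - strlen(buf) - 1);
		strncat(buf, example_incompat_names[i].name,
			len - strlen(buf) - 1);
	}
	return unknown;
}
```

Returning the unrecognized bits separately lets a decoder built against older headers still report flags it does not know by number.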
[PATCH 6/8] btrfs: uapi/linux/btrfs.h migration, move struct btrfs_ioctl_defrag_range_args
struct btrfs_ioctl_defrag_range_args is used by the BTRFS_IOC_DEFRAG_RANGE ioctl.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/ctree.h           | 31 ---
 include/uapi/linux/btrfs.h | 38 +-
 2 files changed, 37 insertions(+), 32 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 378482c..89f36b6 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1992,37 +1992,6 @@ struct btrfs_root {
 	atomic_t qgroup_meta_rsv;
 };
 
-struct btrfs_ioctl_defrag_range_args {
-	/* start of the defrag operation */
-	__u64 start;
-
-	/* number of bytes to defrag, use (u64)-1 to say all */
-	__u64 len;
-
-	/*
-	 * flags for the operation, which can include turning
-	 * on compression for this one defrag
-	 */
-	__u64 flags;
-
-	/*
-	 * any extent bigger than this will be considered
-	 * already defragged. Use 0 to take the kernel default
-	 * Use 1 to say every single extent must be rewritten
-	 */
-	__u32 extent_thresh;
-
-	/*
-	 * which compression method to use if turning on compression
-	 * for this defrag operation. If unspecified, zlib will
-	 * be used
-	 */
-	__u32 compress_type;
-
-	/* spare for later */
-	__u32 unused[4];
-};
-
 /*
  * inode items have the data typically returned from stat and store other
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index abae362..98aff38 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -474,9 +474,45 @@ struct btrfs_ioctl_clone_range_args {
 	__u64 dest_offset;
 };
 
-/* flags for the defrag range ioctl */
+/*
+ * flags definition for the defrag range ioctl
+ *
+ * Used by:
+ *   struct btrfs_ioctl_defrag_range_args.flags
+ */
 #define BTRFS_DEFRAG_RANGE_COMPRESS 1
 #define BTRFS_DEFRAG_RANGE_START_IO 2
 
+struct btrfs_ioctl_defrag_range_args {
+	/* start of the defrag operation */
+	__u64 start;
+
+	/* number of bytes to defrag, use (u64)-1 to say all */
+	__u64 len;
+
+	/*
+	 * flags for the operation, which can include turning
+	 * on compression for this one defrag
+	 */
+	__u64 flags;
+
+	/*
+	 * any extent bigger than this will be considered
+	 * already defragged. Use 0 to take the kernel default
+	 * Use 1 to say every single extent must be rewritten
+	 */
+	__u32 extent_thresh;
+
+	/*
+	 * which compression method to use if turning on compression
+	 * for this defrag operation. If unspecified, zlib will
+	 * be used
+	 */
+	__u32 compress_type;
+
+	/* spare for later */
+	__u32 unused[4];
+};
+
 #define BTRFS_SAME_DATA_DIFFERS	1
 /* For extent-same ioctl */
-- 
2.7.1
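To make the field comments in the moved structure concrete, the following sketch fills in arguments for a whole-file defrag. It uses a local mirror of the structure (`struct example_defrag_range_args`) and a hypothetical helper `example_defrag_whole_file`; both are assumptions for illustration. Real code would include <linux/btrfs.h> and pass the genuine structure to the BTRFS_IOC_DEFRAG_RANGE ioctl on an open file descriptor.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the flag value in the patch; normally from <linux/btrfs.h>. */
#define BTRFS_DEFRAG_RANGE_COMPRESS 1

/* Local mirror of struct btrfs_ioctl_defrag_range_args from the patch. */
struct example_defrag_range_args {
	uint64_t start;		/* start of the defrag operation */
	uint64_t len;		/* bytes to defrag, (u64)-1 means all */
	uint64_t flags;		/* BTRFS_DEFRAG_RANGE_* */
	uint32_t extent_thresh;	/* 0 = kernel default */
	uint32_t compress_type;	/* 0 = unspecified */
	uint32_t unused[4];	/* spare for later */
};

/*
 * Prepare arguments that cover the whole file with the kernel's default
 * extent threshold, optionally requesting compression for this defrag.
 */
static void example_defrag_whole_file(struct example_defrag_range_args *args,
				      int compress)
{
	assert(args != NULL);
	memset(args, 0, sizeof(*args));	/* also clears the spare fields */
	args->start = 0;
	args->len = (uint64_t)-1;
	args->extent_thresh = 0;
	if (compress)
		args->flags |= BTRFS_DEFRAG_RANGE_COMPRESS;
}
```

Zeroing the whole structure first keeps the spare `unused` words at zero, which seems the safest choice for fields reserved for later use.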
[PATCH 2/8] btrfs: uapi/linux/btrfs.h migration, qgroup limit flags
The BTRFS_QGROUP_LIMIT_* flags are required to tell the kernel which fields are valid when using the BTRFS_IOC_QGROUP_LIMIT ioctl.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/ctree.h           | 8
 include/uapi/linux/btrfs.h | 22 +-
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 3beaa24..c228b39 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1154,14 +1154,6 @@ struct btrfs_qgroup_info_item {
 	__le64 excl_cmpr;
 } __attribute__ ((__packed__));
 
-/* flags definition for qgroup limits */
-#define BTRFS_QGROUP_LIMIT_MAX_RFER	(1ULL << 0)
-#define BTRFS_QGROUP_LIMIT_MAX_EXCL	(1ULL << 1)
-#define BTRFS_QGROUP_LIMIT_RSV_RFER	(1ULL << 2)
-#define BTRFS_QGROUP_LIMIT_RSV_EXCL	(1ULL << 3)
-#define BTRFS_QGROUP_LIMIT_RFER_CMPR	(1ULL << 4)
-#define BTRFS_QGROUP_LIMIT_EXCL_CMPR	(1ULL << 5)
-
 struct btrfs_qgroup_limit_item {
 	/*
 	 * only updated when any of the other values change
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 11eee34..9651af3 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -41,7 +41,19 @@ struct btrfs_ioctl_vol_args {
 #define BTRFS_UUID_SIZE 16
 #define BTRFS_UUID_UNPARSED_SIZE	37
 
-#define BTRFS_QGROUP_INHERIT_SET_LIMITS	(1ULL << 0)
+/*
+ * flags definition for qgroup limits
+ *
+ * Used by:
+ *   struct btrfs_qgroup_limit.flags
+ *   struct btrfs_qgroup_limit_item.flags
+ */
+#define BTRFS_QGROUP_LIMIT_MAX_RFER	(1ULL << 0)
+#define BTRFS_QGROUP_LIMIT_MAX_EXCL	(1ULL << 1)
+#define BTRFS_QGROUP_LIMIT_RSV_RFER	(1ULL << 2)
+#define BTRFS_QGROUP_LIMIT_RSV_EXCL	(1ULL << 3)
+#define BTRFS_QGROUP_LIMIT_RFER_CMPR	(1ULL << 4)
+#define BTRFS_QGROUP_LIMIT_EXCL_CMPR	(1ULL << 5)
 
 struct btrfs_qgroup_limit {
 	__u64 flags;
@@ -51,6 +63,14 @@ struct btrfs_qgroup_limit {
 	__u64 rsv_excl;
 };
 
+/*
+ * flags definition for qgroup inheritance
+ *
+ * Used by:
+ *   struct btrfs_qgroup_inherit.flags
+ */
+#define BTRFS_QGROUP_INHERIT_SET_LIMITS	(1ULL << 0)
+
 struct btrfs_qgroup_inherit {
 	__u64 flags;
 	__u64 num_qgroups;
-- 
2.7.1
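The commit message says the flags tell the kernel which fields of the limit structure are valid. A minimal sketch of that convention follows, assuming a local mirror of struct btrfs_qgroup_limit (`struct example_qgroup_limit`) and two hypothetical helpers. Nothing here calls the kernel; real code would include <linux/btrfs.h> and embed the genuine structure in the BTRFS_IOC_QGROUP_LIMIT arguments.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the values in the patch; normally from <linux/btrfs.h>. */
#define BTRFS_QGROUP_LIMIT_MAX_RFER	(1ULL << 0)
#define BTRFS_QGROUP_LIMIT_MAX_EXCL	(1ULL << 1)
#define BTRFS_QGROUP_LIMIT_RSV_RFER	(1ULL << 2)
#define BTRFS_QGROUP_LIMIT_RSV_EXCL	(1ULL << 3)

/* Local mirror of struct btrfs_qgroup_limit. */
struct example_qgroup_limit {
	uint64_t flags;
	uint64_t max_rfer;
	uint64_t max_excl;
	uint64_t rsv_rfer;
	uint64_t rsv_excl;
};

/* Set only the referenced-bytes limit and mark just that field as valid. */
static void example_limit_max_rfer(struct example_qgroup_limit *lim,
				   uint64_t bytes)
{
	assert(lim != NULL);
	memset(lim, 0, sizeof(*lim));
	lim->max_rfer = bytes;
	lim->flags = BTRFS_QGROUP_LIMIT_MAX_RFER;
}

/* Return nonzero if the field guarded by 'flag' is marked valid. */
static int example_limit_field_valid(const struct example_qgroup_limit *lim,
				     uint64_t flag)
{
	assert(lim != NULL);
	return (lim->flags & flag) != 0;
}
```

With this convention a zero in `max_excl` is simply ignored rather than read as "limit of zero bytes", which is why a decoder needs the flag definitions to print the structure meaningfully.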
[PATCH 7/8] btrfs: uapi/linux/btrfs_tree.h migration, item types and defines
The BTRFS_IOC_SEARCH_TREE ioctl returns file system items directly to userspace. In order to decode them, full type information is required. Create a new header, btrfs_tree to contain these since most users won't need them. Signed-off-by: Jeff Mahoney--- fs/btrfs/ctree.h| 949 +-- include/uapi/linux/btrfs_tree.h | 966 2 files changed, 967 insertions(+), 948 deletions(-) create mode 100644 include/uapi/linux/btrfs_tree.h diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 89f36b6..cf34fb5 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -33,6 +33,7 @@ #include #include #include +#include #include #include #include @@ -64,98 +65,6 @@ struct btrfs_ordered_sum; #define BTRFS_COMPAT_EXTENT_TREE_V0 -/* holds pointers to all of the tree roots */ -#define BTRFS_ROOT_TREE_OBJECTID 1ULL - -/* stores information about which extents are in use, and reference counts */ -#define BTRFS_EXTENT_TREE_OBJECTID 2ULL - -/* - * chunk tree stores translations from logical -> physical block numbering - * the super block points to the chunk tree - */ -#define BTRFS_CHUNK_TREE_OBJECTID 3ULL - -/* - * stores information about which areas of a given device are in use. - * one per device. The tree of tree roots points to the device tree - */ -#define BTRFS_DEV_TREE_OBJECTID 4ULL - -/* one per subvolume, storing files and directories */ -#define BTRFS_FS_TREE_OBJECTID 5ULL - -/* directory objectid inside the root tree */ -#define BTRFS_ROOT_TREE_DIR_OBJECTID 6ULL - -/* holds checksums of all the data extents */ -#define BTRFS_CSUM_TREE_OBJECTID 7ULL - -/* holds quota configuration and tracking */ -#define BTRFS_QUOTA_TREE_OBJECTID 8ULL - -/* for storing items that use the BTRFS_UUID_KEY* types */ -#define BTRFS_UUID_TREE_OBJECTID 9ULL - -/* tracks free space in block groups. 
*/ -#define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL - -/* device stats in the device tree */ -#define BTRFS_DEV_STATS_OBJECTID 0ULL - -/* for storing balance parameters in the root tree */ -#define BTRFS_BALANCE_OBJECTID -4ULL - -/* orhpan objectid for tracking unlinked/truncated files */ -#define BTRFS_ORPHAN_OBJECTID -5ULL - -/* does write ahead logging to speed up fsyncs */ -#define BTRFS_TREE_LOG_OBJECTID -6ULL -#define BTRFS_TREE_LOG_FIXUP_OBJECTID -7ULL - -/* for space balancing */ -#define BTRFS_TREE_RELOC_OBJECTID -8ULL -#define BTRFS_DATA_RELOC_TREE_OBJECTID -9ULL - -/* - * extent checksums all have this objectid - * this allows them to share the logging tree - * for fsyncs - */ -#define BTRFS_EXTENT_CSUM_OBJECTID -10ULL - -/* For storing free space cache */ -#define BTRFS_FREE_SPACE_OBJECTID -11ULL - -/* - * The inode number assigned to the special inode for storing - * free ino cache - */ -#define BTRFS_FREE_INO_OBJECTID -12ULL - -/* dummy objectid represents multiple objectids */ -#define BTRFS_MULTIPLE_OBJECTIDS -255ULL - -/* - * All files have objectids in this range. - */ -#define BTRFS_FIRST_FREE_OBJECTID 256ULL -#define BTRFS_LAST_FREE_OBJECTID -256ULL -#define BTRFS_FIRST_CHUNK_TREE_OBJECTID 256ULL - - -/* - * the device items go into the chunk tree. The key is in the form - * [ 1 BTRFS_DEV_ITEM_KEY device_id ] - */ -#define BTRFS_DEV_ITEMS_OBJECTID 1ULL - -#define BTRFS_BTREE_INODE_OBJECTID 1 - -#define BTRFS_EMPTY_SUBVOL_DIR_OBJECTID 2 - -#define BTRFS_DEV_REPLACE_DEVID 0ULL - /* * the max metadata block size. This limit is somewhat artificial, * but the memmove costs go through the roof for larger blocks. 
@@ -175,12 +84,6 @@ struct btrfs_ordered_sum; */ #define BTRFS_LINK_MAX 65535U -/* 32 bytes in various csum fields */ -#define BTRFS_CSUM_SIZE 32 - -/* csum types */ -#define BTRFS_CSUM_TYPE_CRC32 0 - static const int btrfs_csum_sizes[] = { 4 }; /* four bytes for CRC32 */ @@ -189,17 +92,6 @@ static const int btrfs_csum_sizes[] = { 4 }; /* spefic to btrfs_map_block(), therefore not in include/linux/blk_types.h */ #define REQ_GET_READ_MIRRORS (1 << 30) -#define BTRFS_FT_UNKNOWN 0 -#define BTRFS_FT_REG_FILE 1 -#define BTRFS_FT_DIR 2 -#define BTRFS_FT_CHRDEV3 -#define BTRFS_FT_BLKDEV4 -#define BTRFS_FT_FIFO 5 -#define BTRFS_FT_SOCK 6 -#define BTRFS_FT_SYMLINK 7 -#define BTRFS_FT_XATTR 8 -#define BTRFS_FT_MAX 9 - /* ioprio of readahead is set to idle */ #define BTRFS_IOPRIO_READA (IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)) @@ -207,138 +99,10 @@ static const int btrfs_csum_sizes[] = { 4 }; #define BTRFS_MAX_EXTENT_SIZE SZ_128M -/* - * The key defines the order in the tree, and so it also defines (optimal) - * block layout. - * - * objectid corresponds to the inode number. - * - * type tells us things about the object, and is a kind of stream selector. - * so for a given inode, keys with type of 1 might refer to the inode data, - * type of 2 may point to file data in the btree and type
[PATCH 3/8] btrfs: uapi/linux/btrfs.h migration, document subvol flags
Signed-off-by: Jeff Mahoney
---
 include/uapi/linux/btrfs.h | 17 ++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index 9651af3..0316e23 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -34,9 +34,6 @@ struct btrfs_ioctl_vol_args {
 
 #define BTRFS_DEVICE_PATH_NAME_MAX 1024
 
-#define BTRFS_SUBVOL_CREATE_ASYNC	(1ULL << 0)
-#define BTRFS_SUBVOL_RDONLY		(1ULL << 1)
-#define BTRFS_SUBVOL_QGROUP_INHERIT	(1ULL << 2)
 #define BTRFS_FSID_SIZE 16
 #define BTRFS_UUID_SIZE 16
 #define BTRFS_UUID_UNPARSED_SIZE	37
@@ -85,6 +82,20 @@ struct btrfs_ioctl_qgroup_limit_args {
 	struct btrfs_qgroup_limit lim;
 };
 
+/*
+ * flags for subvolumes
+ *
+ * Used by:
+ *   struct btrfs_ioctl_vol_args_v2.flags
+ *
+ * BTRFS_SUBVOL_RDONLY is also provided/consumed by the following ioctls:
+ * - BTRFS_IOC_SUBVOL_GETFLAGS
+ * - BTRFS_IOC_SUBVOL_SETFLAGS
+ */
+#define BTRFS_SUBVOL_CREATE_ASYNC	(1ULL << 0)
+#define BTRFS_SUBVOL_RDONLY		(1ULL << 1)
+#define BTRFS_SUBVOL_QGROUP_INHERIT	(1ULL << 2)
+
 #define BTRFS_SUBVOL_NAME_MAX 4039
 struct btrfs_ioctl_vol_args_v2 {
 	__s64 fd;
-- 
2.7.1
[PATCH 8/8] btrfs: uapi/linux/btrfs_tree.h, use __u8 and __u64
u8 and u64 aren't exported to userspace, while __u8 and __u64 are. Signed-off-by: Jeff Mahoney--- include/uapi/linux/btrfs_tree.h | 52 - 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/include/uapi/linux/btrfs_tree.h b/include/uapi/linux/btrfs_tree.h index 1e87505..d5ad15a 100644 --- a/include/uapi/linux/btrfs_tree.h +++ b/include/uapi/linux/btrfs_tree.h @@ -334,14 +334,14 @@ */ struct btrfs_disk_key { __le64 objectid; - u8 type; + __u8 type; __le64 offset; } __attribute__ ((__packed__)); struct btrfs_key { - u64 objectid; - u8 type; - u64 offset; + __u64 objectid; + __u8 type; + __u64 offset; } __attribute__ ((__packed__)); struct btrfs_dev_item { @@ -379,22 +379,22 @@ struct btrfs_dev_item { __le32 dev_group; /* seek speed 0-100 where 100 is fastest */ - u8 seek_speed; + __u8 seek_speed; /* bandwidth 0-100 where 100 is fastest */ - u8 bandwidth; + __u8 bandwidth; /* btrfs generated uuid for this device */ - u8 uuid[BTRFS_UUID_SIZE]; + __u8 uuid[BTRFS_UUID_SIZE]; /* uuid of FS who owns this device */ - u8 fsid[BTRFS_UUID_SIZE]; + __u8 fsid[BTRFS_UUID_SIZE]; } __attribute__ ((__packed__)); struct btrfs_stripe { __le64 devid; __le64 offset; - u8 dev_uuid[BTRFS_UUID_SIZE]; + __u8 dev_uuid[BTRFS_UUID_SIZE]; } __attribute__ ((__packed__)); struct btrfs_chunk { @@ -433,7 +433,7 @@ struct btrfs_chunk { struct btrfs_free_space_entry { __le64 offset; __le64 bytes; - u8 type; + __u8 type; } __attribute__ ((__packed__)); struct btrfs_free_space_header { @@ -486,7 +486,7 @@ struct btrfs_extent_item_v0 { struct btrfs_tree_block_info { struct btrfs_disk_key key; - u8 level; + __u8 level; } __attribute__ ((__packed__)); struct btrfs_extent_data_ref { @@ -501,7 +501,7 @@ struct btrfs_shared_data_ref { } __attribute__ ((__packed__)); struct btrfs_extent_inline_ref { - u8 type; + __u8 type; __le64 offset; } __attribute__ ((__packed__)); @@ -523,7 +523,7 @@ struct btrfs_dev_extent { __le64 chunk_objectid; __le64 chunk_offset; __le64 length; - u8 
chunk_tree_uuid[BTRFS_UUID_SIZE]; + __u8 chunk_tree_uuid[BTRFS_UUID_SIZE]; } __attribute__ ((__packed__)); struct btrfs_inode_ref { @@ -583,7 +583,7 @@ struct btrfs_dir_item { __le64 transid; __le16 data_len; __le16 name_len; - u8 type; + __u8 type; } __attribute__ ((__packed__)); #define BTRFS_ROOT_SUBVOL_RDONLY (1ULL << 0) @@ -605,8 +605,8 @@ struct btrfs_root_item { __le64 flags; __le32 refs; struct btrfs_disk_key drop_progress; - u8 drop_level; - u8 level; + __u8 drop_level; + __u8 level; /* * The following fields appear after subvol_uuids+subvol_times @@ -625,9 +625,9 @@ struct btrfs_root_item { * when invalidating the fields. */ __le64 generation_v2; - u8 uuid[BTRFS_UUID_SIZE]; - u8 parent_uuid[BTRFS_UUID_SIZE]; - u8 received_uuid[BTRFS_UUID_SIZE]; + __u8 uuid[BTRFS_UUID_SIZE]; + __u8 parent_uuid[BTRFS_UUID_SIZE]; + __u8 received_uuid[BTRFS_UUID_SIZE]; __le64 ctransid; /* updated when an inode changes */ __le64 otransid; /* trans when created */ __le64 stransid; /* trans when sent. non-zero for received subvol */ @@ -751,12 +751,12 @@ struct btrfs_file_extent_item { * it is treated like an incompat flag for reading and writing, * but not for stat. */ - u8 compression; - u8 encryption; + __u8 compression; + __u8 encryption; __le16 other_encoding; /* spare for later use */ /* are we inline data or a real extent? 
*/ - u8 type; + __u8 type; /* * disk space consumed by the extent, checksum blocks are included @@ -783,7 +783,7 @@ struct btrfs_file_extent_item { } __attribute__ ((__packed__)); struct btrfs_csum_item { - u8 csum; + __u8 csum; } __attribute__ ((__packed__)); struct btrfs_dev_stats_item { @@ -874,14 +874,14 @@ enum btrfs_raid_types { #define BTRFS_EXTENDED_PROFILE_MASK(BTRFS_BLOCK_GROUP_PROFILE_MASK | \ BTRFS_AVAIL_ALLOC_BIT_SINGLE) -static inline u64 chunk_to_extended(u64 flags) +static inline __u64 chunk_to_extended(__u64 flags) { if ((flags & BTRFS_BLOCK_GROUP_PROFILE_MASK) == 0) flags |= BTRFS_AVAIL_ALLOC_BIT_SINGLE; return flags; } -static inline u64 extended_to_chunk(u64 flags) +static inline __u64 extended_to_chunk(__u64 flags) { return flags & ~BTRFS_AVAIL_ALLOC_BIT_SINGLE; } @@ -900,7 +900,7 @@ struct btrfs_free_space_info { #define
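For a sense of how a userspace tool might consume these structures once they use exported types, here is a small sketch that parses a packed on-disk key out of a raw byte buffer. It uses <stdint.h> types in a local mirror (`struct example_disk_key`) rather than the kernel's __u8/__le64, and the helper names are assumptions for illustration only. The `__attribute__((__packed__))` syntax assumes GCC or Clang, as in the header itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct btrfs_disk_key; fields are little-endian on disk. */
struct example_disk_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
} __attribute__ ((__packed__));

/* CPU-order result, mirroring struct btrfs_key. */
struct example_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Read a little-endian 64-bit value byte by byte (no alignment needed). */
static uint64_t example_get_le64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Decode a raw disk key (sizeof(struct example_disk_key) bytes). */
static void example_parse_disk_key(const uint8_t *raw, struct example_key *out)
{
	assert(raw != NULL && out != NULL);
	out->objectid = example_get_le64(
		raw + offsetof(struct example_disk_key, objectid));
	out->type = raw[offsetof(struct example_disk_key, type)];
	out->offset = example_get_le64(
		raw + offsetof(struct example_disk_key, offset));
}
```

The packed attribute matters here: without it the compiler would pad after `type`, and the 17-byte on-disk layout would no longer match the structure.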
[PATCH 5/8] btrfs: uapi/linux/btrfs.h migration, move balance flags
The BTRFS_BALANCE_* flags are used by struct btrfs_ioctl_balance_args.flags and btrfs_ioctl_balance_args.{data,meta,sys}.flags in the BTRFS_IOC_BALANCE ioctl. Signed-off-by: Jeff Mahoney--- fs/btrfs/volumes.h | 46 - include/uapi/linux/btrfs.h | 64 ++ 2 files changed, 64 insertions(+), 46 deletions(-) diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index 1939ebd..144cec3 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -357,52 +357,6 @@ struct map_lookup { #define map_lookup_size(n) (sizeof(struct map_lookup) + \ (sizeof(struct btrfs_bio_stripe) * (n))) -/* - * Restriper's general type filter - */ -#define BTRFS_BALANCE_DATA (1ULL << 0) -#define BTRFS_BALANCE_SYSTEM (1ULL << 1) -#define BTRFS_BALANCE_METADATA (1ULL << 2) - -#define BTRFS_BALANCE_TYPE_MASK(BTRFS_BALANCE_DATA | \ -BTRFS_BALANCE_SYSTEM | \ -BTRFS_BALANCE_METADATA) - -#define BTRFS_BALANCE_FORCE(1ULL << 3) -#define BTRFS_BALANCE_RESUME (1ULL << 4) - -/* - * Balance filters - */ -#define BTRFS_BALANCE_ARGS_PROFILES(1ULL << 0) -#define BTRFS_BALANCE_ARGS_USAGE (1ULL << 1) -#define BTRFS_BALANCE_ARGS_DEVID (1ULL << 2) -#define BTRFS_BALANCE_ARGS_DRANGE (1ULL << 3) -#define BTRFS_BALANCE_ARGS_VRANGE (1ULL << 4) -#define BTRFS_BALANCE_ARGS_LIMIT (1ULL << 5) -#define BTRFS_BALANCE_ARGS_LIMIT_RANGE (1ULL << 6) -#define BTRFS_BALANCE_ARGS_STRIPES_RANGE (1ULL << 7) -#define BTRFS_BALANCE_ARGS_USAGE_RANGE (1ULL << 10) - -#define BTRFS_BALANCE_ARGS_MASK\ - (BTRFS_BALANCE_ARGS_PROFILES | \ -BTRFS_BALANCE_ARGS_USAGE | \ -BTRFS_BALANCE_ARGS_DEVID | \ -BTRFS_BALANCE_ARGS_DRANGE |\ -BTRFS_BALANCE_ARGS_VRANGE |\ -BTRFS_BALANCE_ARGS_LIMIT | \ -BTRFS_BALANCE_ARGS_LIMIT_RANGE | \ -BTRFS_BALANCE_ARGS_STRIPES_RANGE | \ -BTRFS_BALANCE_ARGS_USAGE_RANGE) - -/* - * Profile changing flags. When SOFT is set we won't relocate chunk if - * it already has the target profile (even though it may be - * half-filled). 
- */ -#define BTRFS_BALANCE_ARGS_CONVERT (1ULL << 8) -#define BTRFS_BALANCE_ARGS_SOFT(1ULL << 9) - struct btrfs_balance_args; struct btrfs_balance_progress; struct btrfs_balance_control { diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index de98717..abae362 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -317,6 +317,70 @@ struct btrfs_balance_progress { __u64 completed;/* # of chunks relocated so far */ }; +/* + * flags definition for balance + * + * Restriper's general type filter + * + * Used by: + * btrfs_ioctl_balance_args.flags + * btrfs_balance_control.flags (internal) + */ +#define BTRFS_BALANCE_DATA (1ULL << 0) +#define BTRFS_BALANCE_SYSTEM (1ULL << 1) +#define BTRFS_BALANCE_METADATA (1ULL << 2) + +#define BTRFS_BALANCE_TYPE_MASK(BTRFS_BALANCE_DATA | \ +BTRFS_BALANCE_SYSTEM | \ +BTRFS_BALANCE_METADATA) + +#define BTRFS_BALANCE_FORCE(1ULL << 3) +#define BTRFS_BALANCE_RESUME (1ULL << 4) + +/* + * flags definitions for per-type balance args + * + * Balance filters + * + * Used by: + * struct btrfs_balance_args + */ +#define BTRFS_BALANCE_ARGS_PROFILES(1ULL << 0) +#define BTRFS_BALANCE_ARGS_USAGE (1ULL << 1) +#define BTRFS_BALANCE_ARGS_DEVID (1ULL << 2) +#define BTRFS_BALANCE_ARGS_DRANGE (1ULL << 3) +#define BTRFS_BALANCE_ARGS_VRANGE (1ULL << 4) +#define BTRFS_BALANCE_ARGS_LIMIT (1ULL << 5) +#define BTRFS_BALANCE_ARGS_LIMIT_RANGE (1ULL << 6) +#define BTRFS_BALANCE_ARGS_STRIPES_RANGE (1ULL << 7) +#define BTRFS_BALANCE_ARGS_USAGE_RANGE (1ULL << 10) + +#define BTRFS_BALANCE_ARGS_MASK\ + (BTRFS_BALANCE_ARGS_PROFILES | \ +BTRFS_BALANCE_ARGS_USAGE | \ +BTRFS_BALANCE_ARGS_DEVID | \ +BTRFS_BALANCE_ARGS_DRANGE |\ +BTRFS_BALANCE_ARGS_VRANGE |\ +BTRFS_BALANCE_ARGS_LIMIT | \ +BTRFS_BALANCE_ARGS_LIMIT_RANGE | \ +BTRFS_BALANCE_ARGS_STRIPES_RANGE | \ +BTRFS_BALANCE_ARGS_USAGE_RANGE) + +/* + * Profile changing flags. 
When SOFT is set we won't relocate chunk if + * it already has the target profile (even though it may be + * half-filled). + */ +#define BTRFS_BALANCE_ARGS_CONVERT (1ULL << 8) +#define BTRFS_BALANCE_ARGS_SOFT(1ULL << 9) + + +/* + * flags definition for balance state + * + * Used by: + *
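The two groups of flags above live at different levels: the BTRFS_BALANCE_* type flags go in the top-level flags word, while the BTRFS_BALANCE_ARGS_* filter flags go in the per-type arguments. The sketch below shows that split for a data-only, usage-filtered balance. The flag values are copied from the patch, but `struct example_balance_request` is a deliberately simplified, hypothetical stand-in for struct btrfs_ioctl_balance_args, and the helpers are assumptions for illustration; this is not how btrfs-progs is actually structured.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the values in the patch; normally from <linux/btrfs.h>. */
#define BTRFS_BALANCE_DATA	(1ULL << 0)
#define BTRFS_BALANCE_SYSTEM	(1ULL << 1)
#define BTRFS_BALANCE_METADATA	(1ULL << 2)
#define BTRFS_BALANCE_TYPE_MASK	(BTRFS_BALANCE_DATA | \
				 BTRFS_BALANCE_SYSTEM | \
				 BTRFS_BALANCE_METADATA)
#define BTRFS_BALANCE_FORCE	(1ULL << 3)
#define BTRFS_BALANCE_RESUME	(1ULL << 4)
#define BTRFS_BALANCE_ARGS_USAGE	(1ULL << 1)

/* Hypothetical, simplified request: top-level flags plus data filter. */
struct example_balance_request {
	uint64_t flags;		/* BTRFS_BALANCE_* */
	uint64_t data_flags;	/* BTRFS_BALANCE_ARGS_* for data chunks */
	uint64_t data_usage;	/* usage threshold in percent */
};

/* Select data chunks only, filtered by a usage threshold. */
static void example_request_data_usage(struct example_balance_request *req,
				       uint64_t percent)
{
	assert(req != NULL && percent <= 100);
	memset(req, 0, sizeof(*req));
	req->flags = BTRFS_BALANCE_DATA;
	req->data_flags = BTRFS_BALANCE_ARGS_USAGE;
	req->data_usage = percent;
}

/* Return nonzero if 'flags' contains only the top-level bits listed above. */
static int example_top_level_flags_known(uint64_t flags)
{
	const uint64_t known = BTRFS_BALANCE_TYPE_MASK |
			       BTRFS_BALANCE_FORCE |
			       BTRFS_BALANCE_RESUME;

	return (flags & ~known) == 0;
}
```

Note that bit 1 means BTRFS_BALANCE_SYSTEM at the top level but BTRFS_BALANCE_ARGS_USAGE in the per-type arguments, so a decoder has to know which field it is looking at before naming a bit.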
[PATCH 1/8] btrfs: uapi/linux/btrfs.h migration, move BTRFS_LABEL_SIZE
BTRFS_LABEL_SIZE is required to define the BTRFS_IOC_GET_FSLABEL and
BTRFS_IOC_SET_FSLABEL ioctls.

Signed-off-by: Jeff Mahoney
---
 fs/btrfs/ctree.h | 1 -
 include/uapi/linux/btrfs.h | 1 +
 2 files changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..3beaa24 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -410,7 +410,6 @@ struct btrfs_header {
  * room to translate 14 chunks with 3 stripes each.
  */
 #define BTRFS_SYSTEM_CHUNK_ARRAY_SIZE 2048
-#define BTRFS_LABEL_SIZE 256
 /*
  * just in case we somehow lose the roots and are not able to mount,
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index dea8931..11eee34 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -23,6 +23,7 @@
 #define BTRFS_IOCTL_MAGIC 0x94
 #define BTRFS_VOL_NAME_MAX 255
+#define BTRFS_LABEL_SIZE 256
 /* this should be 4k */
 #define BTRFS_PATH_NAME_MAX 4087
--
2.7.1
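
To illustrate why the label size has to live next to the ioctl definitions, here is a rough Python sketch of how an _IOR-style ioctl number is assembled. BTRFS_IOCTL_MAGIC and BTRFS_LABEL_SIZE come from the patch; the bit layout (the asm-generic one used on x86) and the command number 49 for the get-label ioctl are assumptions from memory and are not stated in this patch, so check them against the real header.

```python
BTRFS_IOCTL_MAGIC = 0x94   # from the patch context
BTRFS_LABEL_SIZE = 256     # the constant moved by this patch

# Assumed asm-generic ioctl layout: 8 bits nr, 8 bits type,
# 14 bits size, 2 bits direction.
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_READ = 2


def ioc(direction, magic, nr, size):
    """Pack the four fields into one 32-bit ioctl request number."""
    return ((direction << _IOC_DIRSHIFT) |
            (size << _IOC_SIZESHIFT) |
            (magic << _IOC_TYPESHIFT) |
            (nr << _IOC_NRSHIFT))


def ior(magic, nr, size):
    """Equivalent of the C _IOR() macro under the assumed layout."""
    return ioc(_IOC_READ, magic, nr, size)


GET_FSLABEL_NR = 49  # assumption, not taken from this patch

get_fslabel = ior(BTRFS_IOCTL_MAGIC, GET_FSLABEL_NR, BTRFS_LABEL_SIZE)

if __name__ == "__main__":
    print(hex(get_fslabel))
```

Because the argument size is encoded in the request number, userspace cannot construct the ioctl without knowing BTRFS_LABEL_SIZE, which is presumably why the define has to be visible from the uapi header.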
Re: Another ENOSPC situation
On Fri, Apr 1, 2016 at 6:50 PM, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
>> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
>> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
>> > > btrfs balance -mprofiles seems to do something. one kworker and one
>> > > btrfs-transaction process hog one CPU core each for hours, while
>> > > blocking the filesystem for minutes apiece, which leads to the host
>> > > being nearly unusable up to the point of "clock and mouse pointer
>> > > frozen for nearly ten minutes".
>> >
>> > I assume you still have your every 10 minutes snapshotting running
>> > while balancing?
>>
>> No, I disabled the cronjob before trying the balance. I might be
>> crazy, but not stup^wunexperienced.
>
> That being said, I would still expect the code not to allow _this_
> kind of effect on the entire system when two allegedly incompatible
> operations run simultaneously. I mean, Linux is a multi-user,
> multi-tasking operating system where one simply cannot expect all
> processes to be cooperative to each other. We have the operating
> systems to prevent this kind of issues, not to cause them.

Maybe look at it differently: does user mh have trouble using this
laptop w.r.t. storing files?

In openSUSE Tumbleweed (the snapshot from the end of March), root access
is needed to change the default snapshotting config; otherwise you will
end up with a 10-year history. Once that change has been made according
to the user's needs, there is no need to run a manual balance.
Re: Compression causes kernel crashes if there are I/O or checksum errors (was: RE: kernel BUG at fs/btrfs/volumes.c:5519 when hot-removing device in RAID-1)
I grabbed this part of the log after the machine crashed again while I was transferring a batch of files that included some with csum errors. Let me know if this looks like the same issue you were having:

Mar 31 00:49:42 sl-server kernel: NMI watchdog: BUG: soft lockup - CPU#21 stuck for 22s! [kworker/u67:5:80994]
Mar 31 00:49:42 sl-server kernel: Modules linked in: fuse xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 tun ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter dm_mirror dm_region_hash dm_log dm_mod kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xfs aesni_intel lrw gf128mul glue_helper libcrc32c ablk_helper cryptd joydev input_leds edac_mce_amd k10temp edac_core fam15h_power sp5100_tco sg i2c_piix4 8250_fintek acpi_cpufreq shpchp nfsd auth_rpcgss nfs_acl
Mar 31 00:49:42 sl-server kernel:  lockd grace sunrpc ip_tables btrfs xor ata_generic pata_acpi raid6_pq sd_mod mgag200 crc32c_intel drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci serio_raw pata_atiixp libahci igb drm ptp pps_core mpt3sas dca raid_class libata i2c_algo_bit scsi_transport_sas fjes uas usb_storage
Mar 31 00:49:42 sl-server kernel: CPU: 21 PID: 80994 Comm: kworker/u67:5 Not tainted 4.5.0-1.el7.elrepo.x86_64 #1
Mar 31 00:49:42 sl-server kernel: Hardware name: Supermicro H8DG6/H8DGi/H8DG6/H8DGi, BIOS 3.5 11/25/2013
Mar 31 00:49:42 sl-server kernel: Workqueue: btrfs-endio btrfs_endio_helper [btrfs]
Mar 31 00:49:42 sl-server kernel: task: 8817f6fa8000 ti: 8800b731 task.ti: 8800b731
Mar 31 00:49:42 sl-server kernel: RIP: 0010:[] [] btrfs_decompress_buf2page+0x123/0x200 [btrfs]
Mar 31 00:49:42 sl-server kernel: RSP: 0018:8800b7313be0 EFLAGS: 0246
Mar 31 00:49:42 sl-server kernel: RAX: RBX: RCX:
Mar 31 00:49:42 sl-server kernel: RDX: RSI: c9000e3d8000 RDI: 88144c7cc000
Mar 31 00:49:42 sl-server kernel: RBP: 8800b7313c48 R08: 8810f0295000 R09: 0020
Mar 31 00:49:42 sl-server kernel: R10: 8810d2ba7869 R11: 00010008 R12: 8817f6fa8000
Mar 31 00:49:42 sl-server kernel: R13: 8800b7313ce0 R14: 0008 R15: 1000
Mar 31 00:49:42 sl-server kernel: FS: 7efce58fb740() GS:881807d4() knlGS:
Mar 31 00:49:42 sl-server kernel: CS: 0010 DS: ES: CR0: 8005003b
Mar 31 00:49:42 sl-server kernel: CR2: 7f00caf249e8 CR3: 001062121000 CR4: 000406e0
Mar 31 00:49:42 sl-server kernel: Stack:
Mar 31 00:49:42 sl-server kernel:  0020 f000 8810f0295000 8744
Mar 31 00:49:42 sl-server kernel:  00010008 c9000e3d7000 ea005131f300 0001
Mar 31 00:49:42 sl-server kernel:  0797 2869 0869 8810d2ba7000
Mar 31 00:49:42 sl-server kernel: Call Trace:
Mar 31 00:49:42 sl-server kernel:  [] lzo_decompress_biovec+0x202/0x300 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] end_compressed_bio_read+0x1f6/0x2f0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] bio_endio+0x40/0x60
Mar 31 00:49:42 sl-server kernel:  [] end_workqueue_fn+0x3c/0x40 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] normal_work_helper+0xc0/0x2c0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] btrfs_endio_helper+0x12/0x20 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] process_one_work+0x14f/0x400
Mar 31 00:49:42 sl-server kernel:  [] worker_thread+0x125/0x4b0
Mar 31 00:49:42 sl-server kernel:  [] ? rescuer_thread+0x370/0x370
Mar 31 00:49:42 sl-server kernel:  [] kthread+0xd8/0xf0
Mar 31 00:49:42 sl-server kernel:  [] ? kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel:  [] ret_from_fork+0x3f/0x70
Mar 31 00:49:42 sl-server kernel:  [] ? kthread_park+0x60/0x60
Mar 31 00:49:42 sl-server kernel: Code: c7 48 8b 45 c0 49 03 7d 00 4a 8d 34 38 e8 06 18 00 e1 41 83 ac 24 28 12 00 00 01 41 8b 84 24 28 12 00 00 85 c0 0f 88 bf 00 00 00 <48> 89 d8 49 03 45 00 49 01 df 49 29 de 48 01 5d d0 48 3d 00 10
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Found oopses: 1
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Creating problem directories
Mar 31 00:49:43 sl-server sh[1297]: abrt-dump-oops: Not going to make dump directories world readable because PrivateReports is on
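
For comparing this trace with other reports, it can help to reduce the call trace to its function names. The following is a small sketch, not an established tool: the regular expression and the parse_frames helper are illustrative assumptions, and the sample lines are copied from the log above (addresses inside the brackets are already stripped in this archive, so the pattern treats them as optional). Frames prefixed with "?" are the ones the kernel's stack unwinder marks as unreliable.

```python
import re

# Matches frames such as "[] ? kthread_park+0x60/0x60" or
# "[<ffffffff81234567>] bio_endio+0x40/0x60 [btrfs]".
FRAME_RE = re.compile(
    r"\[(?:<[0-9a-f]+>)?\]\s+(?P<unreliable>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_frames(log_text):
    """Return (symbol, module_or_None, is_reliable) for each frame found."""
    frames = []
    for match in FRAME_RE.finditer(log_text):
        frames.append((match.group("symbol"),
                       match.group("module"),
                       match.group("unreliable") is None))
    return frames


SAMPLE = """\
Mar 31 00:49:42 sl-server kernel:  [] lzo_decompress_biovec+0x202/0x300 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] end_compressed_bio_read+0x1f6/0x2f0 [btrfs]
Mar 31 00:49:42 sl-server kernel:  [] bio_endio+0x40/0x60
Mar 31 00:49:42 sl-server kernel:  [] ? rescuer_thread+0x370/0x370
"""

frames = parse_frames(SAMPLE)
# Keep only reliable frames that belong to the btrfs module.
reliable_btrfs = [sym for sym, mod, ok in frames if ok and mod == "btrfs"]

if __name__ == "__main__":
    print(reliable_btrfs)
```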
Re: btrfs_destroy_inode WARN_ON.
On Fri, Apr 01, 2016 at 02:12:27PM -0400, Dave Jones wrote:
> BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 30s!
> Showing busy workqueues and worker pools:
> workqueue events: flags=0x0
>   pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
>     pending: vmstat_shepherd
>   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
>     pending: check_corruption
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=3/256
>     pending: usb_serial_port_work, lru_add_drain_per_cpu BAR(17230), e1000_watchdog_task
> workqueue events_power_efficient: flags=0x82
>   pwq 8: cpus=0-3 flags=0x4 nice=0 active=3/256
>     pending: fb_flashcursor, neigh_periodic_work, neigh_periodic_work
> workqueue events_freezable_power_: flags=0x86
>   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/256
>     pending: disk_events_workfn
> workqueue netns: flags=0x6000a
>   pwq 8: cpus=0-3 flags=0x4 nice=0 active=1/1
>     in-flight: 10038:cleanup_net
> workqueue writeback: flags=0x4e
>   pwq 8: cpus=0-3 flags=0x4 nice=0 active=2/256
>     pending: wb_workfn, wb_workfn
> workqueue kblockd: flags=0x18
>   pwq 3: cpus=1 node=0 flags=0x0 nice=-20 active=2/256
>     pending: blk_mq_timeout_work, blk_mq_timeout_work
> workqueue vmstat: flags=0xc
>   pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
>     pending: vmstat_update
>   pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
>     pending: vmstat_update
>   pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
>     pending: vmstat_update
> pool 8: cpus=0-3 flags=0x4 nice=0 hung=0s workers=11 idle: 11638 10276 609 17937 606 9237 605 891 15998 14100
> note: trinity-c13[18815] exited with preempt_count 1

This has wedged userspace too:

23082 pts/2 SN+ 0:00 | \_ /bin/bash scripts/test-multi.sh
14140 pts/2 SNL+ 0:15 | \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdatasync
16900 ? DNs 0:04 | \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdata
18894 ? DNs 0:02 | \_ ../trinity -q -l off -N 100 -a64 -x fsync -x fdata

(14:16:02:davej@think:trinity[master])$ stack 16900
[] wait_on_page_bit_killable+0x156/0x1b0
[] __lock_page_or_retry+0x112/0x1b0
[] filemap_fault+0x367/0xb30
[] __do_fault+0x167/0x3d0
[] handle_mm_fault+0x1837/0x2520
[] __do_page_fault+0x248/0x770
[] do_page_fault+0x39/0xa0
[] page_fault+0x1f/0x30
[] mm_release+0x1ec/0x230
[] do_exit+0x5d0/0x18c0
[] do_group_exit+0xac/0x190
[] get_signal+0x48f/0xeb0
[] do_signal+0xa0/0xb50
[] exit_to_usermode_loop+0xd9/0x100
[] do_syscall_64+0x238/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

(14:16:09:davej@think:trinity[master])$ stack 18894
[] btrfs_file_write_iter+0xe8/0x9a0 [btrfs]
[] __vfs_write+0x279/0x2e0
[] vfs_write+0x11e/0x2b0
[] SyS_write+0xd2/0x1a0
[] do_syscall_64+0x103/0x2b0
[] return_from_SYSCALL_64+0x0/0x7a
[] 0x

I tried to ftrace the latter process, and the box completely hung.

Dave
Re: btrfs_destroy_inode WARN_ON.
On Sun, Mar 27, 2016 at 09:14:00PM -0400, Dave Jones wrote: > > WARNING: CPU: 2 PID: 32570 at fs/btrfs/inode.c:9261 > btrfs_destroy_inode+0x389/0x3f0 [btrfs] > > CPU: 2 PID: 32570 Comm: rm Not tainted 4.5.0-think+ #14 > > c039baf9 ef721ef0 88025966fc08 8957bcdb > > 88025966fc50 890b41f1 > > 88045d918040 242d4eed6048 88024eed6048 88024eed6048 > > Call Trace: > > [] ? btrfs_destroy_inode+0x389/0x3f0 [btrfs] > > [] dump_stack+0x68/0x9d > > [] __warn+0x111/0x130 > > [] warn_slowpath_null+0x1d/0x20 > > [] btrfs_destroy_inode+0x389/0x3f0 [btrfs] > > [] destroy_inode+0x67/0x90 > > [] evict+0x1b7/0x240 > > [] iput+0x3ae/0x4e0 > > [] ? dput+0x20e/0x460 > > [] do_unlinkat+0x256/0x440 > > [] ? do_rmdir+0x350/0x350 > > [] ? syscall_trace_enter_phase1+0x87/0x260 > > [] ? enter_from_user_mode+0x50/0x50 > > [] ? __lock_is_held+0x25/0xd0 > > [] ? mark_held_locks+0x22/0xc0 > > [] ? syscall_trace_enter_phase2+0x12d/0x3d0 > > [] ? SyS_rmdir+0x20/0x20 > > [] SyS_unlinkat+0x1b/0x30 > > [] do_syscall_64+0xf4/0x240 > > [] entry_SYSCALL64_slow_path+0x25/0x25 > > ---[ end trace a48ce4e6a1b5e409 ]--- > > > > That's WARN_ON(BTRFS_I(inode)->csum_bytes); > > > > *maybe* it's a bad disk, but there's no indication in dmesg of anything > awry. > > Spinning rust on SATA, nothing special. > > Same WARN_ON is reachable from umount too.. > > WARNING: CPU: 2 PID: 20092 at fs/btrfs/inode.c:9261 > btrfs_destroy_inode+0x40c/0x480 [btrfs] > CPU: 2 PID: 20092 Comm: umount Tainted: GW 4.5.0-think+ #1 > a32c482b 8803cd187b60 9d63af84 > c05c5e40 c04d316c > 8803cd187ba8 9d0c4c27 880460d80040 242dcd187bb0 > Call Trace: > [] dump_stack+0x95/0xe1 > [] ? btrfs_destroy_inode+0x40c/0x480 [btrfs] > [] __warn+0x147/0x170 > [] warn_slowpath_null+0x31/0x40 > [] btrfs_destroy_inode+0x40c/0x480 [btrfs] > [] ? btrfs_test_destroy_inode+0x40/0x40 [btrfs] > [] destroy_inode+0x77/0xb0 > [] evict+0x20e/0x2c0 > [] dispose_list+0x70/0xb0 > [] evict_inodes+0x26f/0x2c0 > [] ? inode_add_lru+0x60/0x60 > [] ? 
fsnotify_unmount_inodes+0x215/0x2c0 > [] generic_shutdown_super+0x76/0x1c0 > [] kill_anon_super+0x29/0x40 > [] btrfs_kill_super+0x31/0x130 [btrfs] > [] deactivate_locked_super+0x6f/0xb0 > [] deactivate_super+0x99/0xb0 > [] cleanup_mnt+0x70/0xd0 > [] __cleanup_mnt+0x1b/0x20 > [] task_work_run+0xef/0x130 > [] exit_to_usermode_loop+0xf9/0x100 > [] do_syscall_64+0x238/0x2b0 > [] entry_SYSCALL64_slow_path+0x25/0x25 Additional fallout: BTRFS: assertion failed: num_extents, file: fs/btrfs/extent-tree.c, line: 5584 [ cut here ] kernel BUG at fs/btrfs/ctree.h:4320! invalid opcode: [#1] PREEMPT SMP DEBUG_PAGEALLOC KASAN CPU: 1 PID: 18815 Comm: trinity-c13 Tainted: GW 4.6.0-rc1-think+ #1 task: 88045de10040 ti: 8803afa38000 task.ti: 8803afa38000 RIP: 0010:[] [] assfail.constprop.88+0x2b/0x2d [btrfs] RSP: 0018:8803afa3f838 EFLAGS: 00010282 RAX: 004e RBX: c046e200 RCX: RDX: RSI: 0003 RDI: ed0075f47efb RBP: 8803afa3f848 R08: 0001 R09: 0001 R10: R11: 0001 R12: 15d0 R13: 8803fda0e048 R14: 8803fda0dc38 R15: 8803fda0dc58 FS: 7fa0566d6700() GS:880468a0() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: 7fa0566d9000 CR3: 000333bc4000 CR4: 001406e0 DR0: 7fa0554fb000 DR1: DR2: DR3: DR6: 0ff0 DR7: 0600 Stack: 8803fda0e048 8803afa3f880 c032288b 880460bb33f8 8803fda0e048 8803fda0dc38 8803fda0dc58 8803afa3f8c8 c032f851 0001 Call Trace: [] drop_outstanding_extent+0x10b/0x130 [btrfs] [] btrfs_delalloc_release_metadata+0x71/0x480 [btrfs] [] ? __btrfs_buffered_write+0xa6f/0xb50 [btrfs] [] btrfs_delalloc_release_space+0x27/0x50 [btrfs] [] __btrfs_buffered_write+0xa28/0xb50 [btrfs] [] ? btrfs_dirty_pages+0x1c0/0x1c0 [btrfs] [] ? filemap_fdatawait_range+0x3e/0x50 [] ? generic_file_direct_write+0x237/0x2f0 [] ? filemap_write_and_wait_range+0xa0/0xa0 [] ? btrfs_file_write_iter+0x670/0x9a0 [btrfs] [] btrfs_file_write_iter+0x74d/0x9a0 [btrfs] [] do_iter_readv_writev+0x153/0x1f0 [] ? btrfs_sync_file+0x920/0x920 [btrfs] [] ? vfs_iter_read+0x1e0/0x1e0 [] ? preempt_count_sub+0xb9/0x130 [] ? 
percpu_down_read+0x57/0xa0 [] ? __sb_start_write+0xee/0x130 [] ? btrfs_sync_file+0x920/0x920 [btrfs] [] do_readv_writev+0x30f/0x460 [] ? vfs_write+0x2b0/0x2b0 [] ?
Re: Another ENOSPC situation
On Fri, Apr 01, 2016 at 06:30:20PM +0200, Marc Haber wrote:
> On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> > On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
> > > btrfs balance -mprofiles seems to do something. one kworker and one
> > > btrfs-transaction process hog one CPU core each for hours, while
> > > blocking the filesystem for minutes apiece, which leads to the host
> > > being nearly unusable up to the point of "clock and mouse pointer
> > > frozen for nearly ten minutes".
> >
> > I assume you still have your every 10 minutes snapshotting running
> > while balancing?
>
> No, I disabled the cronjob before trying the balance. I might be
> crazy, but not stup^wunexperienced.

That being said, I would still expect the code not to allow _this_
kind of effect on the entire system when two allegedly incompatible
operations run simultaneously. I mean, Linux is a multi-user,
multi-tasking operating system where one simply cannot expect all
processes to cooperate with each other. We have operating systems to
prevent this kind of issue, not to cause it.

Greetings
Marc

--
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Re: [PATCH v9 00/19] Btrfs dedupe framework
On Fri, Apr 01, 2016 at 08:26:43AM +0800, Qu Wenruo wrote:
> David Sterba wrote on 2016/03/31 18:12 +0200:
> > On Wed, Mar 30, 2016 at 03:55:55PM +0800, Qu Wenruo wrote:
> >> This March 30th patchset update mostly addresses the patchset structure
> >> comment from David:
> >> 1) Change the patchset sequence
> >> Not If only apply the first 14 patches, it can provide the full
> >> backward compatible in-memory only dedupe backend.
> >>
> >> Only starts from patch 15, on-disk format will be changed.
> >>
> >> So patch 1~14 is going to be pushed for next merge window, while I'll
> >> still submit them all for review purpose.
> >
> > I'll buy 1-10 with the ioctl hidden under the BTRFS_DEBUG config option
> > until the interface is settled.
>
> Nice to hear that.
>
> I'll add BTRFS_DEBUG config then.

Independent of the next merge window, I'll add them to my for-next after
you send the updated version. I'll also try to review them next week, but
I don't remember any critical issue from the first reading, so there's no
blocker.

> BTW, any comment on btrfs-convert rewrite?

This is not the right place to ask; it is better to ping in a reply to
that thread, as I could miss it here. Nevertheless, the answer is that it
is going to the devel branch. The convert tests passed (the required
minimum), but the patchset is still not reviewed to my satisfaction.
Re: Another ENOSPC situation
On Fri, Apr 01, 2016 at 05:44:30PM +0200, Henk Slager wrote:
> On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
> > btrfs balance -mprofiles seems to do something. one kworker and one
> > btrfs-transaction process hog one CPU core each for hours, while
> > blocking the filesystem for minutes apiece, which leads to the host
> > being nearly unusable up to the point of "clock and mouse pointer
> > frozen for nearly ten minutes".
>
> I assume you still have your every 10 minutes snapshotting running
> while balancing?

No, I disabled the cronjob before trying the balance. I might be
crazy, but not stup^wunexperienced.

Greetings
Marc

--
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Re: Another ENOSPC situation
On Fri, Apr 1, 2016 at 3:40 PM, Marc Haber wrote:
> Hi,
>
> just for a change, this is another btrfs on a different host. The host
> is also running Debian unstable with mainline kernels, the btrfs in
> question was created (not converted) in March 2015 with btrfs-tools
> 3.17. It is the root fs of my main work notebook which is under
> workstation load, with lots of snapshots being created and deleted.
>
> Balance immediately fails with ENOSPC
>
> balance -dprofiles=single -dusage=1 goes through "fine" ("had to
> relocate 0 out of 602 chunks")
>
> balance -dprofiles=single -dusage=2 also ENOSPCes immediately.
>
> [4/502]mh@swivel:~$ sudo btrfs fi usage /
> Overall:
>     Device size:         600.00GiB
>     Device allocated:    600.00GiB
>     Device unallocated:    1.00MiB
>     Device missing:          0.00B
>     Used:                413.40GiB
>     Free (estimated):    148.20GiB (min: 148.20GiB)
>     Data ratio:               1.00
>     Metadata ratio:           2.00
>     Global reserve:      512.00MiB (used: 0.00B)
>
> Data,single: Size:553.93GiB, Used:405.73GiB
>    /dev/mapper/swivelbtr 553.93GiB
>
> Metadata,DUP: Size:23.00GiB, Used:3.83GiB
>    /dev/mapper/swivelbtr 46.00GiB
>
> System,DUP: Size:32.00MiB, Used:112.00KiB
>    /dev/mapper/swivelbtr 64.00MiB
>
> Unallocated:
>    /dev/mapper/swivelbtr 1.00MiB
> [5/503]mh@swivel:~$
>
> btrfs balance -mprofiles seems to do something. one kworker and one
> btrfs-transaction process hog one CPU core each for hours, while
> blocking the filesystem for minutes apiece, which leads to the host
> being nearly unusable up to the point of "clock and mouse pointer
> frozen for nearly ten minutes".

I assume you still have your every-10-minutes snapshotting running
while balancing?

> The btrfs balance cancel I issued after four hours of this state took
> eleven minutes alone to complete.
> > These are all log entries that were obtained after starting btrfs > balance -mprofiles on 09:43 > Apr 1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): > found 3523 extents > Apr 1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): > relocating block group 1538365849600 flags 36 > Apr 1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): > found 3585 extents > Apr 1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): > relocating block group 1536755236864 flags 36 > Apr 1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): > found 3047 extents > Apr 1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): > relocating block group 1528702173184 flags 36 > Apr 1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): > found 4200 extents > > This kernel trace from 11:16 is not btrfs-related, is it? I guess it's > bluetooth related since it happened simultaneously to the bluetooth > device popping out an in: > Apr 1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, > device number 39 > Apr 1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status... > Apr 1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status. > Apr 1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. > Stopping. > Apr 1 11:16:38 swivel systemd[1]: Stopped target Bluetooth. 
> Apr 1 11:16:38 swivel laptop-mode: Laptop mode > Apr 1 11:16:38 swivel laptop-mode: enabled, not active > Apr 1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB > device number 40 using ehci-pci > Apr 1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device > found, idVendor=0a5c, idProduct=217f > Apr 1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device > strings: Mfr=1, Product=2, SerialNumber=3 > Apr 1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom > Bluetooth Device > Apr 1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: > Broadcom Corp > Apr 1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: > CCAF78F1274F > Apr 1 11:16:39 swivel systemd[1]: Reached target Bluetooth. > Apr 1 11:16:39 swivel kernel: [249949.507794] [ cut here > ] > Apr 1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at > arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326() > Apr 1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer > allocation failure > Apr 1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid > hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc > cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave > nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl > x86_pkg_temp_thermal arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm > snd_hda_codec_conexant
Re: Again, no space left on device while rebalancing and recipe doesnt work
On Sat, Feb 27, 2016 at 10:14:50PM +0100, Marc Haber wrote:
> I have again the issue of no space left on device while rebalancing
> (with btrfs-tools 4.4.1 on kernel 4.4.2 on Debian unstable):

Just for the record: the host started acting up in more and more
interesting ways, and after an rm call during a kernel build resulted
in SIGSEGV, I did the backup-format-restore routine for this system
back to ext4, just to find out whether I have bad hardware or a bad
filesystem.

And, since going back to ext4, the system is just fine again. So it's
not bad hardware.

This system's root drive is going to stay on ext4 for a long time. If
the btrfs phenomena I experience on other hosts get solved at some
time in the future, I might migrate /home back to btrfs, but that's
not going to happen in the next six months.

This is a really bad experience which has made me lose a lot of faith
in the new filesystem. I really feel sad about that.

Greetings
Marc

--
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Another ENOSPC situation
Hi,

just for a change, this is another btrfs on a different host. The host
is also running Debian unstable with mainline kernels; the btrfs in
question was created (not converted) in March 2015 with btrfs-tools
3.17. It is the root fs of my main work notebook, which is under
workstation load, with lots of snapshots being created and deleted.

Balance immediately fails with ENOSPC.

balance -dprofiles=single -dusage=1 goes through "fine" ("had to
relocate 0 out of 602 chunks").

balance -dprofiles=single -dusage=2 also ENOSPCes immediately.

[4/502]mh@swivel:~$ sudo btrfs fi usage /
Overall:
    Device size:         600.00GiB
    Device allocated:    600.00GiB
    Device unallocated:    1.00MiB
    Device missing:          0.00B
    Used:                413.40GiB
    Free (estimated):    148.20GiB (min: 148.20GiB)
    Data ratio:               1.00
    Metadata ratio:           2.00
    Global reserve:      512.00MiB (used: 0.00B)

Data,single: Size:553.93GiB, Used:405.73GiB
   /dev/mapper/swivelbtr 553.93GiB

Metadata,DUP: Size:23.00GiB, Used:3.83GiB
   /dev/mapper/swivelbtr 46.00GiB

System,DUP: Size:32.00MiB, Used:112.00KiB
   /dev/mapper/swivelbtr 64.00MiB

Unallocated:
   /dev/mapper/swivelbtr 1.00MiB
[5/503]mh@swivel:~$

btrfs balance -mprofiles seems to do something. One kworker and one
btrfs-transaction process hog one CPU core each for hours, while
blocking the filesystem for minutes apiece, which leads to the host
being nearly unusable, up to the point of "clock and mouse pointer
frozen for nearly ten minutes".

The btrfs balance cancel I issued after four hours in this state took
eleven minutes by itself to complete.

These are all log entries that were obtained after starting btrfs balance -mprofiles on 09:43 Apr 1 12:18:21 swivel kernel: [253651.970413] BTRFS info (device dm-14): found 3523 extents Apr 1 12:18:21 swivel kernel: [253652.035572] BTRFS info (device dm-14): relocating block group 1538365849600 flags 36 Apr 1 13:30:57 swivel kernel: [258007.653597] BTRFS info (device dm-14): found 3585 extents Apr 1 13:30:57 swivel kernel: [258007.746541] BTRFS info (device dm-14): relocating block group 1536755236864 flags 36 Apr 1 13:49:39 swivel kernel: [259130.296184] BTRFS info (device dm-14): found 3047 extents Apr 1 13:49:39 swivel kernel: [259130.357314] BTRFS info (device dm-14): relocating block group 1528702173184 flags 36 Apr 1 14:30:00 swivel kernel: [261550.776348] BTRFS info (device dm-14): found 4200 extents This kernel trace from 11:16 is not btrfs-related, is it? I guess it's bluetooth related since it happened simultaneously to the bluetooth device popping out an in: Apr 1 11:16:38 swivel kernel: [249948.993751] usb 1-1.4: USB disconnect, device number 39 Apr 1 11:16:38 swivel systemd[1]: Starting Load/Save RF Kill Switch Status... Apr 1 11:16:38 swivel systemd[1]: Started Load/Save RF Kill Switch Status. Apr 1 11:16:38 swivel systemd[1]: bluetooth.target: Unit not needed anymore. Stopping. Apr 1 11:16:38 swivel systemd[1]: Stopped target Bluetooth. 
Apr 1 11:16:38 swivel laptop-mode: Laptop mode Apr 1 11:16:38 swivel laptop-mode: enabled, not active Apr 1 11:16:39 swivel kernel: [249949.211549] usb 1-1.4: new full-speed USB device number 40 using ehci-pci Apr 1 11:16:39 swivel kernel: [249949.308386] usb 1-1.4: New USB device found, idVendor=0a5c, idProduct=217f Apr 1 11:16:39 swivel kernel: [249949.308397] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3 Apr 1 11:16:39 swivel kernel: [249949.308402] usb 1-1.4: Product: Broadcom Bluetooth Device Apr 1 11:16:39 swivel kernel: [249949.308407] usb 1-1.4: Manufacturer: Broadcom Corp Apr 1 11:16:39 swivel kernel: [249949.308412] usb 1-1.4: SerialNumber: CCAF78F1274F Apr 1 11:16:39 swivel systemd[1]: Reached target Bluetooth. Apr 1 11:16:39 swivel kernel: [249949.507794] [ cut here ] Apr 1 11:16:39 swivel kernel: [249949.507810] WARNING: CPU: 1 PID: 11 at arch/x86/kernel/cpu/perf_event_intel_ds.c:325 reserve_ds_buffers+0x102/0x326() Apr 1 11:16:39 swivel kernel: [249949.507813] alloc_bts_buffer: BTS buffer allocation failure Apr 1 11:16:39 swivel kernel: [249949.507816] Modules linked in: cpuid hid_generic usbhid hid e1000e tun ctr ccm rfcomm bridge stp llc cpufreq_userspace cpufreq_stats cpufreq_conservative cpufreq_powersave nf_conntrack_netlink nfnetlink bnep binfmt_misc intel_rapl x86_pkg_temp_thermal arc4 intel_powerclamp kvm_intel kvm irqbypass iwldvm snd_hda_codec_conexant snd_hda_codec_generic mac80211 input_leds btusb btbcm i2c_i801 snd_hda_intel btintel snd_hda_codec bluetooth iwlwifi snd_hda_core cfg80211 snd_hwdep sg snd_pcm_oss snd_mixer_oss lpc_ich mfd_core snd_pcm shpchp snd_timer thinkpad_acpi nvram snd battery soundcore rfkill ac tpm_tis tpm evdev processor xt_TCPMSS xt_tcpudp iptable_mangle iptable_filter
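
The "btrfs fi usage" figures in the report above can be cross-checked with a little arithmetic. This is only a sketch: the numbers are copied from the quoted output, the variable names are made up for illustration, and the rounding tolerance reflects the two-decimal display rather than anything btrfs guarantees.

```python
# All sizes in GiB, copied from the quoted `btrfs fi usage /` output.
device_size = 600.00
data_size, data_used = 553.93, 405.73              # Data,single (logical)
metadata_size, metadata_used = 23.00, 3.83         # Metadata,DUP (logical)
metadata_raw = 46.00                               # DUP stores two copies
system_raw = 64.0 / 1024                           # 64.00MiB on disk
unallocated = 1.0 / 1024                           # 1.00MiB

# Space that is free *inside* already-allocated chunks.
data_free = data_size - data_used
metadata_free = metadata_size - metadata_used

# Raw bytes already handed out to chunks of any type.
allocated_raw = data_size + metadata_raw + system_raw

if __name__ == "__main__":
    print(f"free inside data chunks:     {data_free:.2f} GiB")
    print(f"free inside metadata chunks: {metadata_free:.2f} GiB")
    print(f"allocated to chunks:         {allocated_raw:.2f} GiB")
    print(f"left for new chunks:         {unallocated * 1024:.2f} MiB")
```

The data-chunk slack matches the reported "Free (estimated)" of 148.20GiB, and the allocated total rounds to the full 600.00GiB device. So the filesystem has plenty of room inside existing chunks but only 1MiB from which a new chunk could be carved, which is the condition the thread is discussing.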
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On Fri, Apr 01, 2016 at 08:09:56PM +0800, Qu Wenruo wrote:
> On 04/01/2016 07:39 PM, David Sterba wrote:
> > On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> >>> After another look, why don't we use nodesize directly? Or stripesize
> >>> where applies. With max_size == 0 the test does not make sense, we ought
> >>> to know the alignment.
> >>>
> >> Yes, my first thought is also to use nodesize directly, which should be
> >> always correct.
> >>
> >> But the problem is, the related function call stack doesn't have any
> >> member to reach btrfs_root or btrfs_fs_info.
> >>
> >> In the very beginning version of such crossing stripe check, I used to
> >> add a btrfs_root/btrfs_fs_info parameter to the function.
> >>
> >> But the code change are too many, so I use 'max_size'.
> >>
> >> I can try to re-do such modification, but IIRC it didn't cause a good
> >> result.
> >
> > Yes it would require refactoring, which would be good on itself, because
> > add_extent_rec takes 12(!) parameters. Some of its callers would need to
> > be updated, but it seems doable.
>
> I'll try to refactor.

I'm working on it.
Re: "bad metadata" not fixed by btrfs repair
On Thu, Mar 31, 2016 at 08:42:46PM +0200, Henk Slager wrote:
> So also false alerts.

btrfs-tools 4.5.1 with Qu's patch from patchwork doesn't show those
warnings any more.

Greetings
Marc

--
Marc Haber | "I don't trust Computers. They | Mail address in header
Leimen, Germany | lose things." Winona Ryder | Fon: *49 6224 1600402
Nordisch by Nature | How to make an American Quilt | Fax: *49 6224 1600421
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On 04/01/2016 07:39 PM, David Sterba wrote:
> On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
>>> After another look, why don't we use nodesize directly? Or stripesize
>>> where applies. With max_size == 0 the test does not make sense, we ought
>>> to know the alignment.
>>>
>> Yes, my first thought is also to use nodesize directly, which should be
>> always correct.
>>
>> But the problem is, the related function call stack doesn't have any
>> member to reach btrfs_root or btrfs_fs_info.
>>
>> In the very beginning version of such crossing stripe check, I used to
>> add a btrfs_root/btrfs_fs_info parameter to the function.
>>
>> But the code change are too many, so I use 'max_size'.
>>
>> I can try to re-do such modification, but IIRC it didn't cause a good
>> result.
>
> Yes it would require refactoring, which would be good on itself, because
> add_extent_rec takes 12(!) parameters. Some of its callers would need to
> be updated, but it seems doable.

I'll try to refactor.

I thought the current extent-tree rework would clean up all this mess,
but considering the case of btrfs-convert, I'd better refactor the
current code rather than wait for other reviewers to appear.

Thanks,
Qu
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On Fri, Apr 01, 2016 at 04:50:06PM +0800, Qu Wenruo wrote:
> > After another look, why don't we use nodesize directly? Or stripesize
> > where applies. With max_size == 0 the test does not make sense, we ought
> > to know the alignment.
>
> Yes, my first thought is also to use nodesize directly, which should be
> always correct.
>
> But the problem is, the related function call stack doesn't have any
> member to reach btrfs_root or btrfs_fs_info.
>
> In the very beginning version of such crossing stripe check, I used to
> add a btrfs_root/btrfs_fs_info parameter to the function.
>
> But the code change are too many, so I use 'max_size'.
>
> I can try to re-do such modification, but IIRC it didn't cause a good
> result.

Yes it would require refactoring, which would be good in itself, because add_extent_rec takes 12(!) parameters. Some of its callers would need to be updated, but it seems doable.
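One common way to tame a 12-argument function in C, and to make a value such as nodesize reachable without threading btrfs_root through every caller, is to bundle the arguments into a struct. The following is only a sketch of that pattern: the struct `extent_rec_params`, its fields and the helper `effective_len` are hypothetical and are not taken from btrfs-progs.

```c
#include <stdint.h>

/* Hypothetical parameter bundle; names are illustrative, not btrfs-progs API. */
struct extent_rec_params {
	uint64_t start;     /* first byte of the extent */
	uint64_t nr;        /* length in bytes, 0 if not yet known */
	uint64_t max_size;  /* largest length seen so far, may still be 0 */
	uint32_t nodesize;  /* metadata block size, passed in explicitly */
	int metadata;       /* non-zero for a tree block */
};

/*
 * Length to use for alignment checks. For metadata the known nodesize
 * is used, so a max_size of 0 never feeds into the test.
 */
static uint64_t effective_len(const struct extent_rec_params *p)
{
	if (p->metadata)
		return p->nodesize;
	return p->max_size ? p->max_size : p->nr;
}
```

With such a bundle, adding one more input later means adding a field rather than touching every call site.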
Re: empty disk reports full
On Fri, Apr 01, 2016 at 11:50:50AM +0200, Alejandro Vargas wrote:
> I am using a 2 TB disk for incremental backups.
>
> I use rsync to back up to a subvolume, and each day I create a snapshot
> of the latest snapshot and run rsync into it.
>
> When the disk becomes nearly full (100 GB or less available), I delete the
> oldest subvolume (with btrfs subvolume delete).
>
> My problem is that *even after removing ALL the subvolumes*, the free space
> does not change. It continues reporting the same size (disk is nearly full).
>
> I tried "btrfs balance start /mnt/backup" but it takes hours and hours.
>
> I'm using linux 4.1.15
> btrfs-progs v4.1.2

Can you show us the output of both "sudo btrfs fi show" and "btrfs fi df /mnt/backup", please?

Hugo.

-- 
Hugo Mills             | The Creature from the Black Logon
hugo@... carfax.org.uk |
http://carfax.org.uk/  |
PGP: E2AB1DE4          |
Re: [PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
Hi Wang,

[auto build test ERROR on btrfs/next]
[also build test ERROR on v4.6-rc1 next-20160401]
[if your patch is applied to the wrong git tree, please drop us a note to help improving the system]

url:    https://github.com/0day-ci/linux/commits/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937
base:   https://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs.git next
config: x86_64-rhel (attached as .config)
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64

Note: the linux-review/Qu-Wenruo/Btrfs-dedupe-framework/20160401-143937 HEAD 0a445f5009c064ee1d3fc966e41bb75627594afe builds fine.
      It only hurts bisectibility.

All errors (new ones prefixed by >>):

>> ERROR: "btrfs_dedupe_disable" [fs/btrfs/btrfs.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
empty disk reports full
I am using a 2 TB disk for incremental backups.

I use rsync to back up to a subvolume, and each day I create a snapshot of the latest snapshot and run rsync into it.

When the disk becomes nearly full (100 GB or less available), I delete the oldest subvolume (with btrfs subvolume delete).

My problem is that *even after removing ALL the subvolumes*, the free space does not change. It continues reporting the same size (disk is nearly full).

I tried "btrfs balance start /mnt/backup" but it takes hours and hours.

I'm using linux 4.1.15
btrfs-progs v4.1.2

Alejandro Vargas
[PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
Introduce a new tree, the dedupe tree, to record on-disk dedupe hashes, as persistent hash storage instead of an in-memory-only implementation.

Unlike Liu Bo's implementation, in this version we won't do hacks for bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM, for such search cases, just like the in-memory backend.

Signed-off-by: Liu Bo
Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
Fix a small rebase bug, which missed 4 lines.
---
 fs/btrfs/ctree.h             | 53 +++-
 fs/btrfs/dedupe.h            |  5 +
 fs/btrfs/disk-io.c           |  6 +
 fs/btrfs/relocation.c        |  3 ++-
 include/trace/events/btrfs.h |  3 ++-
 5 files changed, 67 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0e8933c..659790c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -100,6 +100,9 @@ struct btrfs_ordered_sum;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* device stats in the device tree */
 #define BTRFS_DEV_STATS_OBJECTID 0ULL
 
@@ -538,7 +541,8 @@ struct btrfs_super_block {
 #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR	0ULL
 
 #define BTRFS_FEATURE_COMPAT_RO_SUPP			\
-	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE)
+	(BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE |	\
+	 BTRFS_FEATURE_COMPAT_RO_DEDUPE)
 
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET	0ULL
 #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR	0ULL
@@ -960,6 +964,36 @@ struct btrfs_csum_item {
 	u8 csum;
 } __attribute__ ((__packed__));
 
+/*
+ * Objectid: 0
+ * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY
+ * Offset: 0
+ */
+struct btrfs_dedupe_status_item {
+	__le64 blocksize;
+	__le64 limit_nr;
+	__le16 hash_type;
+	__le16 backend;
+} __attribute__ ((__packed__));
+
+/*
+ * Objectid: Last 64 bit of the hash
+ * Type: BTRFS_DEDUPE_HASH_ITEM_KEY
+ * Offset: Bytenr of the hash
+ *
+ * Used for hash <-> bytenr search
+ * Hash exclude the last 64 bit follows
+ */
+
+/*
+ * Objectid: bytenr
+ * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY
+ * offset: Last 64 bit of the hash
+ *
+ * Used for bytenr <-> hash search (for free_extent)
+ * Its itemsize should always be 0.
+ */
+
 struct btrfs_dev_stats_item {
 	/*
 	 * grow this item struct at the end for future enhancements and keep
@@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args {
 #define BTRFS_CHUNK_ITEM_KEY	228
 
 /*
+ * Dedup item and status
+ */
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY	230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY	231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY	232
+
+/*
  * Records the overall state of the qgroups.
  * There's only one instance of this key present,
  * (0, BTRFS_QGROUP_STATUS_KEY, 0)
@@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l)
 	return offsetof(struct btrfs_leaf, items);
 }
 
+/* btrfs_dedupe_status */
+BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item,
+		   blocksize, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item,
+		   limit_nr, 64);
+BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
+		   hash_type, 16);
+BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
+		   backend, 16);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index f5d2b45..1ac1bcb 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -60,6 +60,8 @@ struct btrfs_dedupe_hash {
 	u8 hash[];
 };
 
+struct btrfs_root;
+
 struct btrfs_dedupe_info {
 	/* dedupe blocksize */
 	u64 blocksize;
@@ -75,6 +77,9 @@ struct btrfs_dedupe_info {
 	struct list_head lru_list;
 	u64 limit_nr;
 	u64 current_nr;
+
+	/* for persist data like dedup-hash and dedupe status */
+	struct btrfs_root *dedupe_root;
 };
 
 struct btrfs_trans_handle;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index ed6a6fd..c7eda03 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset {
 	{ .id = BTRFS_DATA_RELOC_TREE_OBJECTID,	.name_stem = "dreloc"	},
 	{ .id = BTRFS_UUID_TREE_OBJECTID,	.name_stem = "uuid"	},
 	{ .id = BTRFS_FREE_SPACE_TREE_OBJECTID,	.name_stem = "free-space" },
+	{ .id = BTRFS_DEDUPE_TREE_OBJECTID,	.name_stem = "dedupe"	},
 	{ .id = 0,				.name_stem = "tree"	},
 };
 
@@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info,
 	if (location->objectid ==
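The two comment blocks in the patch describe a pair of mirrored keys: one indexed by the tail of the hash, one indexed by bytenr. A minimal sketch of how such a pair could be built is shown below. The key type numbers 231 and 232 come from the patch; `struct key`, `hash_tail`, `hash_item_key` and `bytenr_item_key` are hypothetical stand-ins, and the sketch ignores on-disk endianness by copying the last 8 hash bytes in host byte order.

```c
#include <stdint.h>
#include <string.h>

/* Key type values as defined in the patch. */
#define DEDUPE_HASH_ITEM_KEY   231
#define DEDUPE_BYTENR_ITEM_KEY 232

#define SHA256_LEN 32

/* Simplified stand-in for struct btrfs_key (CPU byte order). */
struct key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

/* Last 64 bits of the hash, copied as-is (host byte order assumed). */
static uint64_t hash_tail(const uint8_t hash[SHA256_LEN])
{
	uint64_t tail;

	memcpy(&tail, hash + SHA256_LEN - sizeof(tail), sizeof(tail));
	return tail;
}

/* Key for the hash -> bytenr direction. */
static struct key hash_item_key(const uint8_t hash[SHA256_LEN], uint64_t bytenr)
{
	struct key k = { hash_tail(hash), DEDUPE_HASH_ITEM_KEY, bytenr };

	return k;
}

/* Key for the bytenr -> hash direction (used when an extent is freed). */
static struct key bytenr_item_key(const uint8_t hash[SHA256_LEN], uint64_t bytenr)
{
	struct key k = { bytenr, DEDUPE_BYTENR_ITEM_KEY, hash_tail(hash) };

	return k;
}
```

Because objectid and offset are simply swapped between the two keys, either item can be located from the other without scanning the tree.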
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
David Sterba wrote on 2016/04/01 10:44 +0200:
> On Fri, Apr 01, 2016 at 08:28:18AM +0800, Qu Wenruo wrote:
>> David Sterba wrote on 2016/03/31 18:30 +0200:
>>> On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
>>>> At least 2 user from mail list reported btrfsck reported false alert of
>>>> "bad metadata [,) crossing stripe boundary".
>>>>
>>>> While the reported number are all inside the same 64K boundary.
>>>> After some check, all the false alert have the same bytenr feature,
>>>> which can be divided by stripe size (64K).
>>>>
>>>> The result seems to be initial 'max_size' can be 0, causing 'start' +
>>>> 'max_size' - 1, to cross the stripe boundary.
>>>>
>>>> Fix it by always update extent_record->cross_stripe when the
>>>> extent_record is updated, to avoid temporary false alert to be reported.
>>>>
>>>> Signed-off-by: Qu Wenruo
>>>
>>> Applied, thanks.
>>>
>>> Do you have a test image for that?
>>
>> Unfortunately, no.
>>
>> Although I figured out the cause of the false alert, I still didn't
>> find a image/method to reproduce it, except the images of reporters.
>>
>> I can dig a little further trying to make a image.
>
> After another look, why don't we use nodesize directly? Or stripesize
> where applies. With max_size == 0 the test does not make sense, we ought
> to know the alignment.

Yes, my first thought was also to use nodesize directly, which should always be correct.

But the problem is that the related function call stack doesn't have any member to reach btrfs_root or btrfs_fs_info.

In the very first version of this crossing-stripe check, I added a btrfs_root/btrfs_fs_info parameter to the function.

But that required too many code changes, so I used 'max_size' instead.

I can try to redo that modification, but IIRC it didn't give a good result.

Thanks,
Qu
Re: [PATCH] btrfs-progs: fsck: Fix a false metadata extent warning
On Fri, Apr 01, 2016 at 08:28:18AM +0800, Qu Wenruo wrote:
> David Sterba wrote on 2016/03/31 18:30 +0200:
> > On Thu, Mar 31, 2016 at 10:19:34AM +0800, Qu Wenruo wrote:
> >> At least 2 user from mail list reported btrfsck reported false alert of
> >> "bad metadata [,) crossing stripe boundary".
> >>
> >> While the reported number are all inside the same 64K boundary.
> >> After some check, all the false alert have the same bytenr feature,
> >> which can be divided by stripe size (64K).
> >>
> >> The result seems to be initial 'max_size' can be 0, causing 'start' +
> >> 'max_size' - 1, to cross the stripe boundary.
> >>
> >> Fix it by always update extent_record->cross_stripe when the
> >> extent_record is updated, to avoid temporary false alert to be reported.
> >>
> >> Signed-off-by: Qu Wenruo
> >
> > Applied, thanks.
> >
> > Do you have a test image for that?
>
> Unfortunately, no.
>
> Although I figured out the cause of the false alert, I still didn't
> find a image/method to reproduce it, except the images of reporters.
>
> I can dig a little further trying to make a image.

After another look, why don't we use nodesize directly? Or stripesize where applies. With max_size == 0 the test does not make sense, we ought to know the alignment.
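The false alert described in the quoted commit message can be reproduced with plain integer arithmetic. The sketch below assumes the check compares the stripe index of the first and last byte of the range, which is what 'start' + 'max_size' - 1 suggests; the names `crosses_stripe` and `STRIPE_LEN` are hypothetical.

```c
#include <stdint.h>

#define STRIPE_LEN (64 * 1024ULL) /* 64K stripe size, as stated in the thread */

/*
 * Returns 1 if the byte range [start, start + len) touches more than
 * one stripe. With len == 0 the expression start + len - 1 names the
 * byte *before* start, so a stripe-aligned start appears to cross.
 */
static int crosses_stripe(uint64_t start, uint64_t len)
{
	return start / STRIPE_LEN != (start + len - 1) / STRIPE_LEN;
}
```

For a start of 3 * 64K, a length of 0 gives a last byte in stripe 2 and a first byte in stripe 3, hence the bogus report, while the same start with a 16K length stays inside stripe 3. An unaligned start with length 0 is not affected, which matches the observation that all false alerts had a bytenr divisible by 64K.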
[PATCH v7 5/8] btrfs-progs: Add dedupe feature for mkfs and convert
Add a new DEDUPE ro compat flag and the corresponding mkfs/convert flag 'dedupe'.

Since the dedupe tree is completely isolated from the fs tree, even an old kernel can mount the filesystem read-only. So add it as an RO compat flag instead of a common incompat flag.

Signed-off-by: Qu Wenruo
---
 Documentation/mkfs.btrfs.asciidoc |  9
 btrfs-convert.c                   | 19 +++-
 mkfs.c                            |  8 +--
 utils.c                           | 47 +--
 utils.h                           |  7 +++---
 5 files changed, 67 insertions(+), 23 deletions(-)

diff --git a/Documentation/mkfs.btrfs.asciidoc b/Documentation/mkfs.btrfs.asciidoc
index e4321de..12a41c6 100644
--- a/Documentation/mkfs.btrfs.asciidoc
+++ b/Documentation/mkfs.btrfs.asciidoc
@@ -208,6 +208,15 @@ reduced-size metadata for extent references, saves a few percent of metadata
 improved representation of file extents where holes are not explicitly
 stored as an extent, saves a few percent of metadata if sparse files are used
 
+*dedupe*::
+allow btrfs to use new on-disk format designed for in-band(write time)
+de-duplication.
++
+on-disk storage backend and persist de-duplication status needs this feature.
++
+this feature is RO compat feature, means old kernel can still mount it
+read-only.
+
 BLOCK GROUPS, CHUNKS, RAID
 --
 
diff --git a/btrfs-convert.c b/btrfs-convert.c
index 4474489..77e72f6 100644
--- a/btrfs-convert.c
+++ b/btrfs-convert.c
@@ -2453,7 +2453,7 @@ static int convert_open_fs(const char *devname,
 static int do_convert(const char *devname, int datacsum, int packing,
 		int noxattr, u32 nodesize, int copylabel, const char *fslabel,
 		int progress,
-		u64 features)
+		u64 features, u64 ro_features)
 {
 	int i, ret, blocks_per_node;
 	int fd = -1;
@@ -2504,8 +2504,9 @@ static int do_convert(const char *devname, int datacsum, int packing, int noxatt
 		fprintf(stderr, "unable to open %s\n", devname);
 		goto fail;
 	}
-	btrfs_parse_features_to_string(features_buf, features);
-	if (features == BTRFS_MKFS_DEFAULT_FEATURES)
+	btrfs_parse_features_to_string(features_buf, features, ro_features);
+	if (features == BTRFS_MKFS_DEFAULT_FEATURES &&
+	    ro_features == 0)
 		strcat(features_buf, " (default)");
 
 	printf("create btrfs filesystem:\n");
@@ -2521,6 +2522,7 @@ static int do_convert(const char *devname, int datacsum, int packing, int noxatt
 	mkfs_cfg.sectorsize = blocksize;
 	mkfs_cfg.stripesize = blocksize;
 	mkfs_cfg.features = features;
+	mkfs_cfg.ro_features = ro_features;
 
 	ret = make_btrfs(fd, &mkfs_cfg);
 	if (ret) {
@@ -3071,6 +3073,7 @@ int main(int argc, char *argv[])
 	char *file;
 	char fslabel[BTRFS_LABEL_SIZE];
 	u64 features = BTRFS_MKFS_DEFAULT_FEATURES;
+	u64 ro_features = 0;
 
 	while(1) {
 		enum { GETOPT_VAL_NO_PROGRESS = 256 };
@@ -3128,7 +3131,8 @@ int main(int argc, char *argv[])
 				char *orig = strdup(optarg);
 				char *tmp = orig;
 
-				tmp = btrfs_parse_fs_features(tmp, &features);
+				tmp = btrfs_parse_fs_features(tmp, &features,
+						&ro_features);
 				if (tmp) {
 					fprintf(stderr,
 						"Unrecognized filesystem feature '%s'\n",
@@ -3146,7 +3150,9 @@ int main(int argc, char *argv[])
 					char buf[64];
 
 					btrfs_parse_features_to_string(buf,
-						features & ~BTRFS_CONVERT_ALLOWED_FEATURES);
+						features &
+						~BTRFS_CONVERT_ALLOWED_FEATURES,
+						ro_features);
 					fprintf(stderr,
 						"ERROR: features not allowed for convert: %s\n",
 						buf);
@@ -3196,7 +3202,8 @@ int main(int argc, char *argv[])
 		ret = do_rollback(file);
 	} else {
 		ret = do_convert(file, datacsum, packing, noxattr, nodesize,
-				copylabel, fslabel, progress, features);
+				copylabel, fslabel, progress, features,
+				ro_features);
 	}
 	if (ret)
 		return 1;
diff --git a/mkfs.c b/mkfs.c
index 5e79e0b..5071060 100644
--- a/mkfs.c
+++ b/mkfs.c
@@ -1369,6 +1369,7 @@ int main(int argc, char **argv)
 	int saved_optind;
 	char fs_uuid[BTRFS_UUID_UNPARSED_SIZE] = { 0 };
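The reason for choosing an RO compat flag rather than an incompat flag can be illustrated with the usual mount-time check: unknown ro-compat bits only block read-write mounts. This is a sketch of that rule under assumptions, not kernel code; the bit value chosen for `COMPAT_RO_DEDUPE` and the names `check_compat_ro` and `mount_result` are hypothetical.

```c
#include <stdint.h>

/* Bit assignments for illustration only; the DEDUPE bit value is assumed. */
#define COMPAT_RO_FREE_SPACE_TREE (1ULL << 0)
#define COMPAT_RO_DEDUPE          (1ULL << 1)

enum mount_result { MOUNT_OK, MOUNT_REFUSED };

/*
 * sb_flags:  ro-compat bits recorded in the superblock
 * supported: ro-compat bits this kernel understands
 * read_only: non-zero if a read-only mount was requested
 */
static enum mount_result check_compat_ro(uint64_t sb_flags, uint64_t supported,
					 int read_only)
{
	uint64_t unknown = sb_flags & ~supported;

	/* Unknown ro-compat features are harmless as long as nothing is written. */
	if (unknown && !read_only)
		return MOUNT_REFUSED;
	return MOUNT_OK;
}
```

An old kernel that only knows the free space tree bit would refuse a read-write mount of a dedupe-enabled filesystem but accept a read-only one, which is the behaviour the commit message describes.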
[PATCH v7 2/8] btrfs-progs: dedupe: Add enable command for dedupe command group
Add enable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-dedupe.asciidoc | 105 +++-
 btrfs-completion                    |   6 +-
 cmds-dedupe.c                       | 155
 ioctl.h                             |   2 +
 4 files changed, 266 insertions(+), 2 deletions(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 5d63c32..8ab40ab 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -21,7 +21,110 @@ use with caution.
 
 SUBCOMMAND
 --
-Nothing yet
+*enable* [options] ::
+Enable in-band de-duplication for a filesystem.
++
+`Options`
++
+-s|--storage-backend
+Specify de-duplication hash storage backend.
+Supported backends are 'ondisk' and 'inmemory'.
+If not specified, default value is 'inmemory'.
++
+Refer to *BACKENDS* sector for more information.
+
+-b|--blocksize
+Specify dedupe block size.
+Supported values are power of 2 from '16K' to '8M'.
+Default value is '128K'.
++
+Refer to *BLOCKSIZE* sector for more information.
+
+-a|--hash-algorithm
+Specify hash algorithm.
+Only 'sha256' is supported yet.
+
+-l|--limit-hash
+Specify maximum number of hashes stored in memory.
+Only works for 'inmemory' backend.
+Conflicts with '-m' option.
++
+Only positive values are valid.
+Default value is '32K'.
+
+-m|--limit-memory
+Specify maximum memory used for hashes.
+Only works for 'inmemory' backend.
+Conflicts with '-l' option.
++
+Only value larger than or equal to '1024' is valid.
+No default value.
++
+NOTE: Memory limit will be rounded down to kernel internal hash size,
+so the memory limit shown in 'btrfs dedupe status' may be different
+from the .
+
+WARNING: Too large value for '-l' or '-m' will easily trigger OOM.
+Please use with caution according to system memory or use 'ondisk' backend
+if memory usage is critical.
+
+BACKENDS
+
+Btrfs in-band de-duplication support two different backends with their own
+features.
+
+In-memory backend::
+This backend provides backward-compatibility, and more fine-tuning options.
+But hash pool is non-persistent and may exhaust kernel memory if not setup
+properly.
++
+This backend can be used on old btrfs(without '-O dedupe' mkfs option).
+When used on old btrfs, this backend needs to be enabled manually after mount.
++
+Designed for fast hash search speed, in-memory backend will keep all dedupe
+hashes in memory. (Although overall performance is still much the same with
+'ondisk' backend)
++
+And only keeps limited number of hash in memory to avoid exhausting memory.
+Hashes over the limit will be dropped following Last-Recent-Use behavior.
+So this backend has a consistent overhead for given limit but can\'t ensure
+any all duplicated blocks will be de-duplicated.
++
+After umount and mount, in-memory backend need to refill its hash pool.
+
+On-disk backend::
+This backend provides persistent hash pool, with more smart memory management
+for hash pool.
+But it\'s not backward-compatible, meaning it must be used with '-O dedupe' mkfs
+option and older kernel can\'t mount it read-write.
++
+Designed for de-duplication rate, hash pool is stored as B+ tree on disk.
+Although this behavior may cause extra disk IO for hash search under extreme
+high memory pressure,
+under most case the overall performance should be on par with 'inmemory'
+backend.
++
+After umount and mount, on-disk backend still has its hash on disk, no need to
+refill its dedupe hash pool.
+
+DEDUPE BLOCK SIZE
+
+In-band de-duplication is done at dedupe block size.
+Any data smaller than dedupe block size won\'t go through in-band
+de-duplication.
+
+And dedupe block size affects dedupe rate and fragmentation heavily.
+
+Smaller block size will cause more fragments, but higher dedupe rate.
+
+Larger block size will cause less fragments, but lower dedupe rate.
+
+In-band de-duplication rate is highly related to the workload pattern.
+So it\'s highly recommended to align dedupe block size to the workload
+block size to make full use of de-duplication.
+
+And dedupe block size larger than 128K will cause compression unavailable, as
+compression only support maximum extent size of 128K.
 
 EXIT STATUS
 ---
diff --git a/btrfs-completion b/btrfs-completion
index 3ede77b..50f7ea2 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -29,7 +29,7 @@ _btrfs()
 
 	local cmd=${words[1]}
 
-    commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup replace help version'
+    commands='subvolume filesystem balance device scrub check rescue restore inspect-internal property send receive quota qgroup dedupe replace help version'
     commands_subvolume='create delete list snapshot find-new get-default set-default show sync'
     commands_filesystem='defragment
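The block size rules stated in the documentation above (a power of 2 between 16K and 8M, 128K by default, and no compression above 128K) translate into a small validity check. This is an illustrative sketch only; the macro and function names are hypothetical and do not come from the patch.

```c
#include <stdint.h>

/* Limits as stated in the documentation text; names are hypothetical. */
#define DEDUPE_BLOCKSIZE_MIN     (16ULL * 1024)
#define DEDUPE_BLOCKSIZE_MAX     (8ULL * 1024 * 1024)
#define DEDUPE_BLOCKSIZE_DEFAULT (128ULL * 1024)
#define COMPRESS_MAX_EXTENT      (128ULL * 1024)

/* Returns 1 if bs is a power of 2 within [16K, 8M], 0 otherwise. */
static int dedupe_blocksize_valid(uint64_t bs)
{
	if (bs < DEDUPE_BLOCKSIZE_MIN || bs > DEDUPE_BLOCKSIZE_MAX)
		return 0;
	return (bs & (bs - 1)) == 0;
}

/* Returns 1 if compression stays usable at this dedupe block size. */
static int dedupe_blocksize_allows_compression(uint64_t bs)
{
	return bs <= COMPRESS_MAX_EXTENT;
}
```

The `bs & (bs - 1)` test works because a power of 2 has exactly one bit set, so clearing its lowest set bit leaves zero.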
[PATCH v7 3/8] btrfs-progs: dedupe: Add disable support for inband dedupelication
Add disable subcommand for dedupe command group.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-dedupe.asciidoc |  5 +
 btrfs-completion                    |  2 +-
 cmds-dedupe.c                       | 42 +
 3 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 8ab40ab..28fe05f 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -21,6 +21,11 @@ use with caution.
 
 SUBCOMMAND
 --
+*disable* ::
+Disable in-band de-duplication for a filesystem.
++
+This will trash all stored dedupe hash.
++
 *enable* [options] ::
 Enable in-band de-duplication for a filesystem.
 +
diff --git a/btrfs-completion b/btrfs-completion
index 50f7ea2..9a6c73b 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
     commands_property='get set list'
     commands_quota='enable disable rescan'
     commands_qgroup='assign remove create destroy show limit'
-    commands_dedupe='enable'
+    commands_dedupe='enable disable'
     commands_replace='start status cancel'
 
     if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe.c b/cmds-dedupe.c
index d9dcb10..64ac0f2 100644
--- a/cmds-dedupe.c
+++ b/cmds-dedupe.c
@@ -190,9 +190,51 @@ out:
 	return ret;
 }
 
+static const char * const cmd_dedupe_disable_usage[] = {
+	"btrfs dedupe disable ",
+	"Disable in-band(write time) de-duplication of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_disable(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream;
+	char *path;
+	int fd;
+	int ret;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_disable_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		return 1;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to disable inband deduplication: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+
+out:
+	close_file_or_dir(fd, dirstream);
+	return 0;
+}
+
 const struct cmd_group dedupe_cmd_group = {
 	dedupe_cmd_group_usage, dedupe_cmd_group_info, {
 		{ "enable", cmd_dedupe_enable, cmd_dedupe_enable_usage, NULL, 0},
+		{ "disable", cmd_dedupe_disable, cmd_dedupe_disable_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.7.4
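In the patch as posted, the `out:` label ends with `return 0`, so the `ret = 1` set on ioctl failure is never reported to the caller. The usual single-exit cleanup pattern returns the recorded status instead. The sketch below shows that pattern in isolation; `fake_ctl` and `disable_sketch` are hypothetical stand-ins and do not call any btrfs interface.

```c
#include <stdio.h>

/* Hypothetical stand-in for the ioctl call; returns < 0 on failure. */
static int fake_ctl(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* Single-exit pattern: cleanup runs on both paths, status is propagated. */
static int disable_sketch(int should_fail, int *closed)
{
	int ret;

	ret = fake_ctl(should_fail);
	if (ret < 0) {
		fprintf(stderr, "failed to disable inband deduplication\n");
		ret = 1;
		goto out;
	}
	ret = 0;

out:
	*closed = 1;	/* stands in for close_file_or_dir() */
	return ret;	/* report the recorded status, not a constant */
}
```

With this shape a failing call yields exit status 1 while still releasing the file descriptor.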
[PATCH v7 0/8] Inband dedupe for btrfs-progs
Not much change from the previous version.
1) Rebased to latest devel branch
2) Update ctree.h to follow kernel structure change
3) Update print-tree to follow kernel structure change

Qu Wenruo (7):
  btrfs-progs: Basic framework for dedupe command group
  btrfs-progs: dedupe: Add enable command for dedupe command group
  btrfs-progs: dedupe: Add disable support for inband dedupelication
  btrfs-progs: dedupe: Add status subcommand
  btrfs-progs: Add dedupe feature for mkfs and convert
  btrfs-progs: Add show-super support for new DEDUPE flag
  btrfs-progs: debug-tree: Add dedupe tree support

Wang Xiaoguang (1):
  btrfs-progs: property: add a dedupe property

 Documentation/Makefile.in             |   1 +
 Documentation/btrfs-dedupe.asciidoc   | 150
 Documentation/btrfs-property.asciidoc |   2 +
 Documentation/btrfs.asciidoc          |   4 +
 Documentation/mkfs.btrfs.asciidoc     |   9 +
 Makefile.in                           |   3 +-
 btrfs-completion                      |   6 +-
 btrfs-convert.c                       |  19 +-
 btrfs.c                               |   1 +
 cmds-dedupe.c                         | 329 ++
 cmds-inspect-dump-super.c             |  18 ++
 cmds-inspect-dump-tree.c              |   4 +
 commands.h                            |   2 +
 ctree.h                               |  46 -
 dedupe.h                              |  42 +
 ioctl.h                               |  23 +++
 mkfs.c                                |   8 +-
 print-tree.c                          | 118
 props.c                               |  73
 utils.c                               |  47 +++--
 utils.h                               |   7 +-
 21 files changed, 886 insertions(+), 26 deletions(-)
 create mode 100644 Documentation/btrfs-dedupe.asciidoc
 create mode 100644 cmds-dedupe.c
 create mode 100644 dedupe.h

-- 
2.7.4
[PATCH v7 8/8] btrfs-progs: property: add a dedupe property
From: Wang Xiaoguang

Normally if we enable online dedupe for a fs, it's filesystem wide de-duplication. With this property, we can explicitly disable data de-duplication for specified files.

Signed-off-by: Wang Xiaoguang
---
 Documentation/btrfs-property.asciidoc |  2 +
 props.c                               | 73 +++
 2 files changed, 75 insertions(+)

diff --git a/Documentation/btrfs-property.asciidoc b/Documentation/btrfs-property.asciidoc
index 8b9b7f0..ca90035 100644
--- a/Documentation/btrfs-property.asciidoc
+++ b/Documentation/btrfs-property.asciidoc
@@ -44,6 +44,8 @@ label
 label of device
 compression
 compression setting for an inode: lzo, zlib, or "" (empty string)
+dedupe
+online dedupe setting for an inode: disable or "" (empty string)
 
 *list* [-t ] ::
 Lists available properties with their descriptions for the given object.
diff --git a/props.c b/props.c
index 5b74932..d8f6925 100644
--- a/props.c
+++ b/props.c
@@ -187,6 +187,77 @@ out:
 	return ret;
 }
 
+static int prop_dedupe(enum prop_object_type type, const char *object,
+		       const char *name, const char *value)
+{
+	int ret;
+	ssize_t sret;
+	int fd = -1;
+	DIR *dirstream = NULL;
+	char *buf = NULL;
+	char *xattr_name = NULL;
+	int open_flags = value ? O_RDWR : O_RDONLY;
+
+	fd = open_file_or_dir3(object, &dirstream, open_flags);
+	if (fd == -1) {
+		ret = -errno;
+		fprintf(stderr, "ERROR: open %s failed. %s\n",
+			object, strerror(-ret));
+		goto out;
+	}
+
+	xattr_name = malloc(XATTR_BTRFS_PREFIX_LEN + strlen(name) + 1);
+	if (!xattr_name) {
+		ret = -ENOMEM;
+		goto out;
+	}
+	memcpy(xattr_name, XATTR_BTRFS_PREFIX, XATTR_BTRFS_PREFIX_LEN);
+	memcpy(xattr_name + XATTR_BTRFS_PREFIX_LEN, name, strlen(name));
+	xattr_name[XATTR_BTRFS_PREFIX_LEN + strlen(name)] = '\0';
+
+	if (value)
+		sret = fsetxattr(fd, xattr_name, value, strlen(value), 0);
+	else
+		sret = fgetxattr(fd, xattr_name, NULL, 0);
+	if (sret < 0) {
+		ret = -errno;
+		if (ret != -ENOATTR)
+			fprintf(stderr,
+				"ERROR: failed to %s dedupe for %s. %s\n",
+				value ? "set" : "get", object, strerror(-ret));
+		else
+			ret = 0;
+		goto out;
+	}
+	if (!value) {
+		size_t len = sret;
+
+		buf = malloc(len);
+		if (!buf) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		sret = fgetxattr(fd, xattr_name, buf, len);
+		if (sret < 0) {
+			ret = -errno;
+			fprintf(stderr,
+				"ERROR: failed to get dedupe for %s. %s\n",
+				object, strerror(-ret));
+			goto out;
+		}
+		fprintf(stdout, "dedupe=%.*s\n", (int)len, buf);
+	}
+
+	ret = 0;
+out:
+	free(xattr_name);
+	free(buf);
+	if (fd >= 0)
+		close_file_or_dir(fd, dirstream);
+
+	return ret;
+}
+
 const struct prop_handler prop_handlers[] = {
 	{"ro", "Set/get read-only flag of subvolume.", 0, prop_object_subvol,
 	 prop_read_only},
@@ -194,5 +265,7 @@ const struct prop_handler prop_handlers[] = {
 	 prop_object_dev | prop_object_root, prop_label},
 	{"compression", "Set/get compression for a file or directory", 0,
 	 prop_object_inode, prop_compression},
+	{"dedupe", "Set/get dedupe for a file or directory", 0,
+	 prop_object_inode, prop_dedupe},
 	{NULL, NULL, 0, 0, NULL}
 };
-- 
2.7.4
[PATCH v7 6/8] btrfs-progs: Add show-super support for new DEDUPE flag
Now btrfs-show-super can handle DEDUPE ro compat flag.

Signed-off-by: Qu Wenruo
---
 cmds-inspect-dump-super.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/cmds-inspect-dump-super.c b/cmds-inspect-dump-super.c
index 3e09ee8..6a939c9 100644
--- a/cmds-inspect-dump-super.c
+++ b/cmds-inspect-dump-super.c
@@ -198,6 +198,16 @@ struct readable_flag_entry {
 	char *output;
 };
 
+#define DEF_RO_COMPAT_FLAG_ENTRY(bit_name)		\
+	{BTRFS_FEATURE_COMPAT_RO_##bit_name, #bit_name}
+
+struct readable_flag_entry ro_compat_flags_array[] = {
+	DEF_RO_COMPAT_FLAG_ENTRY(DEDUPE)
+};
+
+static const int ro_compat_flags_num = sizeof(ro_compat_flags_array) /
+				       sizeof(struct readable_flag_entry);
+
 #define DEF_INCOMPAT_FLAG_ENTRY(bit_name)		\
 	{BTRFS_FEATURE_INCOMPAT_##bit_name, #bit_name}
 
@@ -269,6 +279,13 @@ static void __print_readable_flag(u64 flag, struct readable_flag_entry *array,
 	printf(")\n");
 }
 
+static void print_readable_ro_compat_flag(u64 ro_flag)
+{
+	return __print_readable_flag(ro_flag, ro_compat_flags_array,
+				     ro_compat_flags_num,
+				     BTRFS_FEATURE_COMPAT_RO_SUPP);
+}
+
 static void print_readable_incompat_flag(u64 flag)
 {
 	return __print_readable_flag(flag, incompat_flags_array,
@@ -360,6 +377,7 @@ static void dump_superblock(struct btrfs_super_block *sb, int full)
 	       (unsigned long long)btrfs_super_compat_flags(sb));
 	printf("compat_ro_flags\t\t0x%llx\n",
 	       (unsigned long long)btrfs_super_compat_ro_flags(sb));
+	print_readable_ro_compat_flag(btrfs_super_compat_ro_flags(sb));
 	printf("incompat_flags\t\t0x%llx\n",
 	       (unsigned long long)btrfs_super_incompat_flags(sb));
 	print_readable_incompat_flag(btrfs_super_incompat_flags(sb));
-- 
2.7.4
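The patch relies on a table of (bit, name) pairs generated with the preprocessor's token-pasting (`##`) and stringification (`#`) operators. A self-contained sketch of that technique follows. The flags `FLAG_FOO` and `FLAG_BAR`, the output format and the function `format_flags` are hypothetical and only illustrate the idea; they do not reproduce the output of `__print_readable_flag`.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag bits, for illustration only. */
#define FLAG_FOO (1ULL << 0)
#define FLAG_BAR (1ULL << 1)

struct flag_entry {
	uint64_t bit;
	const char *name;
};

/* FLAG_##bit_name pastes the constant, #bit_name yields its name as a string. */
#define DEF_FLAG_ENTRY(bit_name) { FLAG_##bit_name, #bit_name }

static const struct flag_entry flags[] = {
	DEF_FLAG_ENTRY(FOO),
	DEF_FLAG_ENTRY(BAR),
};

/* Writes the names of all set flags into out, separated by '|'. */
static void format_flags(uint64_t value, char *out, size_t outlen)
{
	size_t i, used = 0;
	uint64_t known = 0;

	if (outlen == 0)
		return;
	out[0] = '\0';
	for (i = 0; i < sizeof(flags) / sizeof(flags[0]); i++) {
		int n;

		known |= flags[i].bit;
		if (!(value & flags[i].bit))
			continue;
		n = snprintf(out + used, outlen - used, "%s%s",
			     used ? "|" : "", flags[i].name);
		if (n < 0 || (size_t)n >= outlen - used)
			return; /* output truncated, stop here */
		used += (size_t)n;
	}
	/* Report bits that have no table entry instead of dropping them. */
	if (value & ~known)
		snprintf(out + used, outlen - used, "%sunknown:0x%llx",
			 used ? "|" : "", (unsigned long long)(value & ~known));
}
```

Adding a new flag then only needs one more `DEF_FLAG_ENTRY(...)` line in the table, which is the property the patch exploits for the DEDUPE bit.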
[PATCH v7 7/8] btrfs-progs: debug-tree: Add dedupe tree support
Add dedupe tree support for btrfs-debug-tree.

Signed-off-by: Qu Wenruo
---
 cmds-inspect-dump-tree.c |   4 ++
 ctree.h                  |   7 +++
 print-tree.c             | 118 +++
 3 files changed, 129 insertions(+)

diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c
index 43c8b67..0c75a3c 100644
--- a/cmds-inspect-dump-tree.c
+++ b/cmds-inspect-dump-tree.c
@@ -496,6 +496,10 @@ again:
 					printf("multiple");
 				}
 				break;
+			case BTRFS_DEDUPE_TREE_OBJECTID:
+				if (!skip)
+					printf("dedupe");
+				break;
 			default:
 				if (!skip) {
 					printf("file");
diff --git a/ctree.h b/ctree.h
index 87ea684..15504b2 100644
--- a/ctree.h
+++ b/ctree.h
@@ -79,6 +79,9 @@ struct btrfs_free_space_ctl;
 /* tracks free space in block groups. */
 #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL
 
+/* on-disk dedupe tree (EXPERIMENTAL) */
+#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL
+
 /* for storing balance parameters in the root tree */
 #define BTRFS_BALANCE_OBJECTID -4ULL
 
@@ -1219,6 +1222,10 @@ struct btrfs_root {
 #define BTRFS_DEV_ITEM_KEY	216
 #define BTRFS_CHUNK_ITEM_KEY	228
 
+#define BTRFS_DEDUPE_STATUS_ITEM_KEY	230
+#define BTRFS_DEDUPE_HASH_ITEM_KEY	231
+#define BTRFS_DEDUPE_BYTENR_ITEM_KEY	232
+
 #define BTRFS_BALANCE_ITEM_KEY	248
 
 /*
diff --git a/print-tree.c b/print-tree.c
index d0f37a5..5b8b90c 100644
--- a/print-tree.c
+++ b/print-tree.c
@@ -25,6 +25,7 @@
 #include "disk-io.h"
 #include "print-tree.h"
 #include "utils.h"
+#include "dedupe.h"
 
 
 static void print_dir_item_type(struct extent_buffer *eb,
@@ -687,11 +688,31 @@ static void print_key_type(u64 objectid, u8 type)
 	case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
 		printf("UUID_KEY_RECEIVED_SUBVOL");
 		break;
+	case BTRFS_DEDUPE_STATUS_ITEM_KEY:
+		printf("DEDUPE_STATUS_ITEM");
+		break;
+	case BTRFS_DEDUPE_HASH_ITEM_KEY:
+		printf("DEDUPE_HASH_ITEM");
+		break;
+	case BTRFS_DEDUPE_BYTENR_ITEM_KEY:
+		printf("DEDUPE_BYTENR_ITEM");
+		break;
 	default:
 		printf("UNKNOWN.%d", type);
 	};
 }
 
+static void print_64bit_hash(u64 hash)
+{
+	int i;
+	unsigned char buf[8];
+
+	memcpy(buf, &hash, 8);
+	printf("0x");
+	for (i = 0; i < 8; i++)
+		printf("%02x", buf[i]);
+}
+
 static void print_objectid(u64 objectid, u8 type)
 {
 	switch (type) {
@@ -706,6 +727,9 @@ static void print_objectid(u64 objectid, u8 type)
 	case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
 		printf("0x%016llx", (unsigned long long)objectid);
 		return;
+	case BTRFS_DEDUPE_HASH_ITEM_KEY:
+		print_64bit_hash(objectid);
+		return;
 	}
 
 	switch (objectid) {
@@ -772,6 +796,9 @@ static void print_objectid(u64 objectid, u8 type)
 	case BTRFS_MULTIPLE_OBJECTIDS:
 		printf("MULTIPLE");
 		break;
+	case BTRFS_DEDUPE_TREE_OBJECTID:
+		printf("DEDUPE_TREE");
+		break;
 	case (u64)-1:
 		printf("-1");
 		break;
@@ -807,6 +834,9 @@ void btrfs_print_key(struct btrfs_disk_key *disk_key)
 	case BTRFS_UUID_KEY_RECEIVED_SUBVOL:
 		printf(" 0x%016llx)", (unsigned long long)offset);
 		break;
+	case BTRFS_DEDUPE_BYTENR_ITEM_KEY:
+		print_64bit_hash(offset);
+		break;
 	default:
 		if (offset == (u64)-1)
 			printf(" -1)");
@@ -835,6 +865,85 @@ static void print_uuid_item(struct extent_buffer *l, unsigned long offset,
 	}
 }
 
+static void print_dedupe_status(struct extent_buffer *node, int slot)
+{
+	struct btrfs_dedupe_status_item *status_item;
+	u64 blocksize;
+	u64 limit;
+	u16 hash_type;
+	u16 backend;
+
+	status_item = btrfs_item_ptr(node, slot,
+			struct btrfs_dedupe_status_item);
+	blocksize = btrfs_dedupe_status_blocksize(node, status_item);
+	limit = btrfs_dedupe_status_limit(node, status_item);
+	hash_type = btrfs_dedupe_status_hash_type(node, status_item);
+	backend = btrfs_dedupe_status_backend(node, status_item);
+
+	printf("\t\tdedupe status item ");
+	if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		printf("backend: inmemory\n");
+	else if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		printf("backend: ondisk\n");
+	else
+		printf("backend: Unrecognized(%u)\n", backend);
+
+	if (hash_type ==
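As a side note on print_64bit_hash() from the patch above: it dumps the eight bytes of the u64 in memory order, not as a numeric value. The sketch below does the same into a caller-supplied buffer so the result can be checked; format_64bit_hash() is a name made up for this illustration and is not part of btrfs-progs.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the 8 bytes of 'hash', in memory order, as "0x" followed by
 * 16 hex digits. 'out' must have room for 19 bytes including the NUL.
 */
static void format_64bit_hash(uint64_t hash, char out[19])
{
	unsigned char buf[8];

	memcpy(buf, &hash, sizeof(buf));
	out[0] = '0';
	out[1] = 'x';
	for (int i = 0; i < 8; i++)
		snprintf(out + 2 + 2 * i, 3, "%02x", buf[i]);
}
```

Because the bytes are emitted in memory order, the printed string matches the raw tail of the hash as stored, whatever the host byte order is, which is presumably why the patch uses memcpy() instead of a plain "%llx".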
[PATCH v7 4/8] btrfs-progs: dedupe: Add status subcommand
Add status subcommand for dedupe command group.

Signed-off-by: Qu Wenruo
---
 Documentation/btrfs-dedupe.asciidoc |  3 ++
 btrfs-completion                    |  2 +-
 cmds-dedupe.c                       | 84 +
 3 files changed, 88 insertions(+), 1 deletion(-)

diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc
index 28fe05f..5a5bf52 100644
--- a/Documentation/btrfs-dedupe.asciidoc
+++ b/Documentation/btrfs-dedupe.asciidoc
@@ -73,6 +73,9 @@ WARNING: Too large value for '-l' or '-m' will easily trigger OOM.
 Please use with caution according to system memory or use 'ondisk' backend if
 memory usage is critical.
 
+*status* ::
+Show current in-band de-duplication status of a filesystem.
+
 BACKENDS
 Btrfs in-band de-duplication support two different backends with their own
diff --git a/btrfs-completion b/btrfs-completion
index 9a6c73b..fbaae0c 100644
--- a/btrfs-completion
+++ b/btrfs-completion
@@ -40,7 +40,7 @@ _btrfs()
 	commands_property='get set list'
 	commands_quota='enable disable rescan'
 	commands_qgroup='assign remove create destroy show limit'
-	commands_dedupe='enable disable'
+	commands_dedupe='enable disable status'
 	commands_replace='start status cancel'
 
 	if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then
diff --git a/cmds-dedupe.c b/cmds-dedupe.c
index 64ac0f2..8005b6e 100644
--- a/cmds-dedupe.c
+++ b/cmds-dedupe.c
@@ -230,11 +230,95 @@ out:
 	return 0;
 }
 
+static const char * const cmd_dedupe_status_usage[] = {
+	"btrfs dedupe status ",
+	"Show current in-band(write time) de-duplication status of a btrfs.",
+	NULL
+};
+
+static int cmd_dedupe_status(int argc, char **argv)
+{
+	struct btrfs_ioctl_dedupe_args dargs;
+	DIR *dirstream;
+	char *path;
+	int fd;
+	int ret;
+	int print_limit = 1;
+
+	if (check_argc_exact(argc, 2))
+		usage(cmd_dedupe_status_usage);
+
+	path = argv[1];
+	fd = open_file_or_dir(path, &dirstream);
+	if (fd < 0) {
+		error("failed to open file or directory: %s", path);
+		ret = 1;
+		goto out;
+	}
+	memset(&dargs, 0, sizeof(dargs));
+	dargs.cmd = BTRFS_DEDUPE_CTL_STATUS;
+
+	ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, &dargs);
+	if (ret < 0) {
+		error("failed to get inband deduplication status: %s",
+		      strerror(errno));
+		ret = 1;
+		goto out;
+	}
+	ret = 0;
+	if (dargs.status == 0) {
+		printf("Status: \t\t\tDisabled\n");
+		goto out;
+	}
+	printf("Status:\t\t\tEnabled\n");
+
+	if (dargs.hash_type == BTRFS_DEDUPE_HASH_SHA256)
+		printf("Hash algorithm:\t\tSHA-256\n");
+	else
+		printf("Hash algorithm:\t\tUnrecognized(%x)\n",
+			dargs.hash_type);
+
+	if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		printf("Backend:\t\tIn-memory\n");
+		print_limit = 1;
+	} else if (dargs.backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		printf("Backend:\t\tOn-disk\n");
+		print_limit = 0;
+	} else {
+		printf("Backend:\t\tUnrecognized(%x)\n",
+			dargs.backend);
+	}
+
+	printf("Dedup Blocksize:\t%llu\n", dargs.blocksize);
+
+	if (print_limit) {
+		u64 cur_mem;
+
+		/* Limit nr may be 0 */
+		if (dargs.limit_nr)
+			cur_mem = dargs.current_nr * (dargs.limit_mem /
+					dargs.limit_nr);
+		else
+			cur_mem = 0;
+
+		printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr,
+			dargs.limit_nr);
+		printf("Memory usage: \t\t[%s/%s]\n",
+			pretty_size(cur_mem),
+			pretty_size(dargs.limit_mem));
+	}
+out:
+	close_file_or_dir(fd, dirstream);
+	return ret;
+}
+
 const struct cmd_group dedupe_cmd_group = {
 	dedupe_cmd_group_usage, dedupe_cmd_group_info, {
 		{ "enable", cmd_dedupe_enable, cmd_dedupe_enable_usage, NULL, 0},
 		{ "disable", cmd_dedupe_disable, cmd_dedupe_disable_usage,
 		  NULL, 0},
+		{ "status", cmd_dedupe_status, cmd_dedupe_status_usage,
+		  NULL, 0},
 		NULL_CMD_STRUCT
 	}
 };
-- 
2.7.4
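The "Memory usage" figure printed by the status subcommand is an estimate derived from the per-hash cost (limit_mem / limit_nr) times the number of hashes currently cached. A small standalone sketch of that arithmetic follows; dedupe_cur_mem() is a hypothetical helper name, not something the patch defines.

```c
#include <stdint.h>

/*
 * Estimate memory used by cached hashes: per-hash cost is the memory
 * limit divided by the hash-count limit (integer division, as in the
 * patch). A zero hash-count limit yields 0 instead of dividing by zero.
 */
static uint64_t dedupe_cur_mem(uint64_t current_nr, uint64_t limit_nr,
			       uint64_t limit_mem)
{
	if (limit_nr == 0)
		return 0;
	return current_nr * (limit_mem / limit_nr);
}
```

Note that the division happens first, so the result is rounded down to a multiple of the per-hash cost.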
[PATCH v7 1/8] btrfs-progs: Basic framework for dedupe command group
Add basic ioctl header and command group framework for later use. Alone with basic man page doc. Signed-off-by: Qu Wenruo--- Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe.asciidoc | 39 ++ Documentation/btrfs.asciidoc| 4 Makefile.in | 3 ++- btrfs.c | 1 + cmds-dedupe.c | 48 + commands.h | 2 ++ ctree.h | 39 +- dedupe.h| 42 ioctl.h | 21 10 files changed, 198 insertions(+), 2 deletions(-) create mode 100644 Documentation/btrfs-dedupe.asciidoc create mode 100644 cmds-dedupe.c create mode 100644 dedupe.h diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in index aea2cb4..24fd35e 100644 --- a/Documentation/Makefile.in +++ b/Documentation/Makefile.in @@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc MAN8_TXT += btrfs-replace.asciidoc MAN8_TXT += btrfs-restore.asciidoc MAN8_TXT += btrfs-property.asciidoc +MAN8_TXT += btrfs-dedupe.asciidoc # Category 5 manual page MAN5_TXT += btrfs-man5.asciidoc diff --git a/Documentation/btrfs-dedupe.asciidoc b/Documentation/btrfs-dedupe.asciidoc new file mode 100644 index 000..5d63c32 --- /dev/null +++ b/Documentation/btrfs-dedupe.asciidoc @@ -0,0 +1,39 @@ +btrfs-dedupe(8) +== + +NAME + +btrfs-dedupe - manage in-band (write time) de-duplication of a btrfs filesystem + +SYNOPSIS + +*btrfs dedupe* + +DESCRIPTION +--- +*btrfs dedupe* is used to enable/disable or show current in-band de-duplication +status of a btrfs filesystem. + +Kernel support for in-band de-duplication starts from 4.6. + +WARNING: In-band de-duplication is still an experimental feautre of btrfs, +use with caution. + +SUBCOMMAND +-- +Nothing yet + +EXIT STATUS +--- +*btrfs dedupe* returns a zero exit status if it succeeds. Non zero is +returned in case of failure. + +AVAILABILITY + +*btrfs* is part of btrfs-progs. +Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for +further details. 
+ +SEE ALSO + +`mkfs.btrfs`(8), diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc index 6a77a85..8ded842 100644 --- a/Documentation/btrfs.asciidoc +++ b/Documentation/btrfs.asciidoc @@ -43,6 +43,10 @@ COMMANDS Do off-line check on a btrfs filesystem. + See `btrfs-check`(8) for details. +*dedupe*:: + Control btrfs in-band(write time) de-duplication. + + See `btrfs-dedupe`(8) for details. + *device*:: Manage devices managed by btrfs, including add/delete/scan and so on. + diff --git a/Makefile.in b/Makefile.in index 0a1aece..c3f7072 100644 --- a/Makefile.in +++ b/Makefile.in @@ -76,7 +76,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-quota.o cmds-qgroup.o cmds-replace.o cmds-check.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ - cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o + cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ + cmds-dedupe.o libbtrfs_objects = send-stream.o send-utils.o rbtree.o btrfs-list.o crc32c.o \ uuid-tree.o utils-lib.o rbtree-utils.o libbtrfs_headers = send-stream.h send-utils.h send.h rbtree.h btrfs-list.h \ diff --git a/btrfs.c b/btrfs.c index cc70515..c0c8f27 100644 --- a/btrfs.c +++ b/btrfs.c @@ -199,6 +199,7 @@ static const struct cmd_group btrfs_cmd_group = { { "receive", cmd_receive, cmd_receive_usage, NULL, 0 }, { "quota", cmd_quota, NULL, _cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, _cmd_group, 0 }, + { "dedupe", cmd_dedupe, NULL, _cmd_group, 0 }, { "replace", cmd_replace, NULL, _cmd_group, 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, diff --git a/cmds-dedupe.c b/cmds-dedupe.c new file mode 100644 index 000..b25b8db --- /dev/null +++ b/cmds-dedupe.c @@ -0,0 +1,48 @@ +/* + * Copyright (C) 2015 Fujitsu. All rights reserved. 
+ * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy
[PATCH v10 07/21] btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
From: Wang Xiaoguang

Unlike the dedupe backends, where both in-memory and on-disk are available, only one hash method (SHA256) is supported so far, so implement the btrfs_dedupe_calc_hash() interface using SHA256.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/dedupe.c | 49 +
 1 file changed, 49 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 9175a5f..bdaea3a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -593,3 +593,52 @@ int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
 	}
 	return ret;
 }
+
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash)
+{
+	int i;
+	int ret;
+	struct page *p;
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+	struct crypto_shash *tfm = dedupe_info->dedupe_driver;
+	struct {
+		struct shash_desc desc;
+		char ctx[crypto_shash_descsize(tfm)];
+	} sdesc;
+	u64 dedupe_bs;
+	u64 sectorsize = BTRFS_I(inode)->root->sectorsize;
+
+	if (!fs_info->dedupe_enabled || !hash)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	WARN_ON(!IS_ALIGNED(start, sectorsize));
+
+	dedupe_bs = dedupe_info->blocksize;
+
+	sdesc.desc.tfm = tfm;
+	sdesc.desc.flags = 0;
+	ret = crypto_shash_init(&sdesc.desc);
+	if (ret)
+		return ret;
+	for (i = 0; sectorsize * i < dedupe_bs; i++) {
+		char *d;
+
+		p = find_get_page(inode->i_mapping,
+				  (start >> PAGE_CACHE_SHIFT) + i);
+		if (WARN_ON(!p))
+			return -ENOENT;
+		d = kmap(p);
+		ret = crypto_shash_update(&sdesc.desc, d, sectorsize);
+		kunmap(p);
+		page_cache_release(p);
+		if (ret)
+			return ret;
+	}
+	ret = crypto_shash_final(&sdesc.desc, hash->hash);
+	return ret;
+}
-- 
2.7.4
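The init/update/final pattern used above feeds one dedupe block to the hash one sector at a time. A rough userspace illustration of why that is equivalent to hashing the block in one go is sketched here. To stay self-contained it swaps SHA256 for 64-bit FNV-1a, which is NOT what the patch uses and is not suitable for dedupe; all names (fnv_state, hash_block_by_sector) are invented for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Streaming FNV-1a state, standing in for the kernel's shash_desc. */
struct fnv_state {
	uint64_t h;
};

static void fnv_init(struct fnv_state *s)
{
	s->h = 0xcbf29ce484222325ULL;	/* FNV-1a 64-bit offset basis */
}

static void fnv_update(struct fnv_state *s, const unsigned char *d, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		s->h ^= d[i];
		s->h *= 0x100000001b3ULL;	/* FNV-1a 64-bit prime */
	}
}

/*
 * Hash one dedupe block of 'block_size' bytes, feeding it to the hash
 * one sector at a time. A sector_size of 0 is treated as "whole block".
 */
static uint64_t hash_block_by_sector(const unsigned char *block,
				     size_t block_size, size_t sector_size)
{
	struct fnv_state s;

	if (sector_size == 0)
		sector_size = block_size;

	fnv_init(&s);
	for (size_t off = 0; off < block_size; off += sector_size) {
		size_t left = block_size - off;
		size_t n = left < sector_size ? left : sector_size;

		fnv_update(&s, block + off, n);
	}
	return s.h;
}
```

Since the state carries over between updates, the chunk size only affects how the data is fed in, not the resulting digest; the same holds for SHA256 through the kernel shash interface.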
[PATCH v10 12/21] btrfs: dedupe: add an inode nodedupe flag
From: Wang Xiaoguang

Introduce the BTRFS_INODE_NODEDUPE flag, so that online data deduplication can be explicitly disabled for specified files.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/ioctl.c | 6 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 85044bf..0e8933c 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -2381,6 +2381,7 @@ do { \
 #define BTRFS_INODE_NOATIME		(1 << 9)
 #define BTRFS_INODE_DIRSYNC		(1 << 10)
 #define BTRFS_INODE_COMPRESS		(1 << 11)
+#define BTRFS_INODE_NODEDUPE		(1 << 12)
 
 #define BTRFS_INODE_ROOT_ITEM_INIT	(1 << 31)
 
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index f659ed5..1fca655 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -161,7 +161,8 @@ void btrfs_update_iflags(struct inode *inode)
 /*
  * Inherit flags from the parent inode.
  *
- * Currently only the compression flags and the cow flags are inherited.
+ * Currently only the compression flags, dedupe flags and the cow flags
+ * are inherited.
  */
 void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 {
@@ -186,6 +187,9 @@ void btrfs_inherit_iflags(struct inode *inode, struct inode *dir)
 		BTRFS_I(inode)->flags |= BTRFS_INODE_NODATASUM;
 	}
 
+	if (flags & BTRFS_INODE_NODEDUPE)
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+
 	btrfs_update_iflags(inode);
 }
-- 
2.7.4
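The inheritance rule added to btrfs_inherit_iflags() is small enough to model in userspace. The sketch below only mirrors the NODEDUPE part; inherit_nodedupe() is a hypothetical helper, and the flag values are copied from the ctree.h hunk above.

```c
#include <stdint.h>

/* Flag values as in the ctree.h hunk of this patch. */
#define INODE_COMPRESS	(1u << 11)
#define INODE_NODEDUPE	(1u << 12)

/*
 * A new inode picks up NODEDUPE from its parent directory. The flag is
 * only ever added here, never cleared, matching the '|=' in the patch.
 */
static uint32_t inherit_nodedupe(uint32_t child_flags, uint32_t parent_flags)
{
	if (parent_flags & INODE_NODEDUPE)
		child_flags |= INODE_NODEDUPE;
	return child_flags;
}
```

In practice this means that marking a directory NODEDUPE opts out every file created in it afterwards, while files that already exist keep whatever flag they had.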
[PATCH v10 14/21] btrfs: dedupe: add per-file online dedupe control
From: Wang Xiaoguang

Introduce inode_need_dedupe() to implement per-file online dedupe control.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/inode.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 96790d0..c80fd74 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -708,6 +708,18 @@ static void end_dedupe_extent(struct inode *inode, u64 start,
 	}
 }
 
+static inline int inode_need_dedupe(struct btrfs_fs_info *fs_info,
+				    struct inode *inode)
+{
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return 0;
+
+	return 1;
+}
+
 /*
  * phase two of compressed writeback. This is the ordered portion
  * of the code, which only gets called in the order the work was
@@ -1680,7 +1692,8 @@ static int run_delalloc_range(struct inode *inode, struct page *locked_page,
 	} else if (BTRFS_I(inode)->flags & BTRFS_INODE_PREALLOC && !force_cow) {
 		ret = run_delalloc_nocow(inode, locked_page, start, end,
 					 page_started, 0, nr_written);
-	} else if (!inode_need_compress(inode) && !fs_info->dedupe_enabled) {
+	} else if (!inode_need_compress(inode) &&
+		   !inode_need_dedupe(fs_info, inode)) {
 		ret = cow_file_range(inode, locked_page, start, end,
 				      page_started, nr_written, 1, NULL);
 	} else {
-- 
2.7.4
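To make the effect of the changed condition in run_delalloc_range() easier to follow, here is a toy model of the decision. It is a sketch only: the enum, pick_write_path() and the boolean inputs are made up, and WRITE_COMPRESS_OR_DEDUPE simply stands for whatever the final "else" branch does, which is not visible in the hunk.

```c
#include <stdbool.h>
#include <stdint.h>

#define INODE_NODEDUPE	(1u << 12)	/* value from patch 12/21 */

/* Hypothetical labels for the two branches visible in the hunk. */
enum write_path {
	WRITE_PLAIN_COW,		/* the cow_file_range() branch */
	WRITE_COMPRESS_OR_DEDUPE,	/* the final else branch */
};

/* Userspace model of inode_need_dedupe(). */
static bool inode_needs_dedupe(bool fs_dedupe_enabled, uint32_t inode_flags)
{
	if (!fs_dedupe_enabled)
		return false;
	if (inode_flags & INODE_NODEDUPE)
		return false;
	return true;
}

static enum write_path pick_write_path(bool needs_compress,
				       bool fs_dedupe_enabled,
				       uint32_t inode_flags)
{
	if (!needs_compress &&
	    !inode_needs_dedupe(fs_dedupe_enabled, inode_flags))
		return WRITE_PLAIN_COW;
	return WRITE_COMPRESS_OR_DEDUPE;
}
```

Before this patch the filesystem-wide switch alone decided the branch; with it, a NODEDUPE inode on a dedupe-enabled filesystem goes back to the plain COW path unless compression is wanted.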
[PATCH v10 10/21] btrfs: try more times to alloc metadata reserve space
From: Wang Xiaoguang

In btrfs_delalloc_reserve_metadata(), the number of metadata bytes we try to reserve is calculated from the difference between outstanding_extents and reserved_extents.

When reserve_metadata_bytes() fails to reserve the desired metadata space, it has already done some reclaim work, such as writing ordered extents.

In that case, outstanding_extents and reserved_extents may already have changed, and a new attempt may be able to reserve enough metadata space.

So this patch retries reserve_metadata_bytes() at most 3 times, to make sure we have really run out of space.

Such false ENOSPC is mainly caused by small file extents and time-consuming delalloc functions, and mainly affects in-band de-duplication. (Compression should also be affected, but LZO/zlib is faster than SHA256, so it is harder to trigger there than with dedupe.)

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/extent-tree.c | 25 ++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index dabd721..016d2ec 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -2421,7 +2421,7 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 		 * a new extent is revered, then deleted
 		 * in one tran, and inc/dec get merged to 0.
 		 *
-		 * In this case, we need to remove its dedup
+		 * In this case, we need to remove its dedupe
 		 * hash.
 		 */
 		btrfs_dedupe_del(trans, fs_info, node->bytenr);
@@ -5675,6 +5675,7 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	bool delalloc_lock = true;
 	u64 to_free = 0;
 	unsigned dropped;
+	int loops = 0;
 
 	/* If we are a free space inode we need to not flush since we will be in
 	 * the middle of a transaction commit. We also don't need the delalloc
@@ -5690,11 +5691,12 @@ int btrfs_delalloc_reserve_metadata(struct inode *inode, u64 num_bytes)
 	    btrfs_transaction_in_commit(root->fs_info))
 		schedule_timeout(1);
 
+	num_bytes = ALIGN(num_bytes, root->sectorsize);
+
+again:
 	if (delalloc_lock)
 		mutex_lock(&BTRFS_I(inode)->delalloc_mutex);
 
-	num_bytes = ALIGN(num_bytes, root->sectorsize);
-
 	spin_lock(&BTRFS_I(inode)->lock);
 	nr_extents = (unsigned)div64_u64(num_bytes +
 					 BTRFS_MAX_EXTENT_SIZE - 1,
@@ -5815,6 +5817,23 @@ out_fail:
 	}
 	if (delalloc_lock)
 		mutex_unlock(&BTRFS_I(inode)->delalloc_mutex);
+	/*
+	 * The number of metadata bytes is calculated by the difference
+	 * between outstanding_extents and reserved_extents. Sometimes though
+	 * reserve_metadata_bytes() fails to reserve the wanted metadata bytes,
+	 * indeed it has already done some work to reclaim metadata space, hence
+	 * both outstanding_extents and reserved_extents would have changed and
+	 * the bytes we try to reserve would also has changed(may be smaller).
+	 * So here we try to reserve again. This is much useful for online
+	 * dedupe, which will easily eat almost all meta space.
+	 *
+	 * XXX: Indeed here 3 is arbitrarily choosed, it's a good workaround for
+	 * online dedupe, later we should find a better method to avoid dedupe
+	 * enospc issue.
+	 */
+	if (unlikely(ret == -ENOSPC && loops++ < 3))
+		goto again;
+
 	return ret;
 }
-- 
2.7.4
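The control flow of the retry can be sketched outside the kernel as follows. This is only a model under stated assumptions: reserve_with_retry(), reserve_fn and the fake callback are invented names, and the callback stands in for the whole body between "again:" and the failure path. One detail the model makes visible is that "loops++ < 3" allows three retries after the first failure, so up to four attempts in total.

```c
#include <errno.h>

/* Hypothetical reservation callback: returns 0 or a negative errno. */
typedef int (*reserve_fn)(void *ctx);

/*
 * Retry a reservation while it reports -ENOSPC, up to 'max_retries'
 * extra attempts. Any other result is returned immediately.
 */
static int reserve_with_retry(reserve_fn reserve, void *ctx, int max_retries)
{
	int loops = 0;
	int ret;

again:
	ret = reserve(ctx);
	if (ret == -ENOSPC && loops++ < max_retries)
		goto again;
	return ret;
}

/* Fake reservation that fails a fixed number of times, counting calls. */
struct fake_reserve {
	int failures_left;
	int calls;
};

static int fake_reserve_cb(void *ctx)
{
	struct fake_reserve *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return -ENOSPC;
	}
	return 0;
}
```

The retry only helps when the failed attempt changed something (here, the fake's counter; in the kernel, reclaim done by reserve_metadata_bytes()); a reservation that can never succeed still ends in -ENOSPC, just later.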
[PATCH v10 02/21] btrfs: dedupe: Introduce function to initialize dedupe info
From: Wang XiaoguangAdd generic function to initialize dedupe info. Signed-off-by: Qu Wenruo Signed-off-by: Wang Xiaoguang --- fs/btrfs/Makefile | 2 +- fs/btrfs/dedupe.c | 154 ++ fs/btrfs/dedupe.h | 16 +- 3 files changed, 169 insertions(+), 3 deletions(-) create mode 100644 fs/btrfs/dedupe.c diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile index 128ce17..1b8c627 100644 --- a/fs/btrfs/Makefile +++ b/fs/btrfs/Makefile @@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \ export.o tree-log.o free-space-cache.o zlib.o lzo.o \ compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \ reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \ - uuid-tree.o props.o hash.o free-space-tree.o + uuid-tree.o props.o hash.o free-space-tree.o dedupe.o btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c new file mode 100644 index 000..2211588 --- /dev/null +++ b/fs/btrfs/dedupe.c @@ -0,0 +1,154 @@ +/* + * Copyright (C) 2016 Fujitsu. All rights reserved. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License v2 as published by the Free Software Foundation. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. 
+ */ +#include "ctree.h" +#include "dedupe.h" +#include "btrfs_inode.h" +#include "transaction.h" +#include "delayed-ref.h" + +struct inmem_hash { + struct rb_node hash_node; + struct rb_node bytenr_node; + struct list_head lru_list; + + u64 bytenr; + u32 num_bytes; + + u8 hash[]; +}; + +static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type, + u16 backend, u64 blocksize, u64 limit) +{ + struct btrfs_dedupe_info *dedupe_info; + + dedupe_info = kzalloc(sizeof(*dedupe_info), GFP_NOFS); + if (!dedupe_info) + return -ENOMEM; + + dedupe_info->hash_type = type; + dedupe_info->backend = backend; + dedupe_info->blocksize = blocksize; + dedupe_info->limit_nr = limit; + + /* only support SHA256 yet */ + dedupe_info->dedupe_driver = crypto_alloc_shash("sha256", 0, 0); + if (IS_ERR(dedupe_info->dedupe_driver)) { + int ret; + + ret = PTR_ERR(dedupe_info->dedupe_driver); + kfree(dedupe_info); + return ret; + } + + dedupe_info->hash_root = RB_ROOT; + dedupe_info->bytenr_root = RB_ROOT; + dedupe_info->current_nr = 0; + INIT_LIST_HEAD(_info->lru_list); + mutex_init(_info->lock); + + *ret_info = dedupe_info; + return 0; +} + +static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type, + u16 backend, u64 blocksize, u64 limit_nr, + u64 limit_mem, u64 *ret_limit) +{ + if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX || + blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN || + blocksize < fs_info->tree_root->sectorsize || + !is_power_of_2(blocksize)) + return -EINVAL; + /* +* For new backend and hash type, we return special return code +* as they can be easily expended. 
+*/ + if (hash_type >= ARRAY_SIZE(btrfs_dedupe_sizes)) + return -EOPNOTSUPP; + if (backend >= BTRFS_DEDUPE_BACKEND_COUNT) + return -EOPNOTSUPP; + + /* Backend specific check */ + if (backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + if (!limit_nr && !limit_mem) + *ret_limit = BTRFS_DEDUPE_LIMIT_NR_DEFAULT; + else { + u64 tmp = (u64)-1; + + if (limit_mem) { + tmp = limit_mem / (sizeof(struct inmem_hash) + + btrfs_dedupe_hash_size(hash_type)); + /* Too small limit_mem to fill a hash item */ + if (!tmp) + return -EINVAL; + } + if (!limit_nr) + limit_nr = (u64)-1; + + *ret_limit = min(tmp, limit_nr); +
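The in-memory limit selection visible in check_dedupe_parameter() above boils down to "convert limit_mem to a hash count, then take the smaller of the two limits". A standalone sketch of that rule follows. It is an approximation of the part of the function quoted here, not the full kernel logic: pick_inmem_limit() and entry_size are made-up names, and LIMIT_NR_DEFAULT is a placeholder value because the real BTRFS_DEDUPE_LIMIT_NR_DEFAULT is not shown in this mail.

```c
#include <errno.h>
#include <stdint.h>

/* Placeholder only; the real default is not visible in this patch. */
#define LIMIT_NR_DEFAULT	32768ULL

/*
 * Pick the effective hash-count limit for the in-memory backend.
 * 'entry_size' is the memory cost of one cached hash (struct plus
 * digest). Returns 0 on success or -EINVAL if limit_mem cannot hold
 * even a single entry.
 */
static int pick_inmem_limit(uint64_t limit_nr, uint64_t limit_mem,
			    uint64_t entry_size, uint64_t *ret_limit)
{
	uint64_t by_mem = UINT64_MAX;

	/* Neither limit given: fall back to the default count. */
	if (!limit_nr && !limit_mem) {
		*ret_limit = LIMIT_NR_DEFAULT;
		return 0;
	}

	if (limit_mem) {
		if (entry_size == 0)
			return -EINVAL;
		by_mem = limit_mem / entry_size;
		if (!by_mem)	/* too small to hold one hash */
			return -EINVAL;
	}
	if (!limit_nr)
		limit_nr = UINT64_MAX;

	*ret_limit = by_mem < limit_nr ? by_mem : limit_nr;
	return 0;
}
```

Treating an unset limit as UINT64_MAX lets a single min() cover all three cases (only count given, only memory given, both given).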
[PATCH v10 16/21] btrfs: dedupe: Add basic tree structure for on-disk dedupe method
Introduce a new tree, dedupe tree to record on-disk dedupe hash. As a persist hash storage instead of in-memeory only implement. Unlike Liu Bo's implement, in this version we won't do hack for bytenr -> hash search, but add a new type, DEDUP_BYTENR_ITEM for such search case, just like in-memory backend. Signed-off-by: Liu BoSigned-off-by: Wang Xiaoguang Signed-off-by: Qu Wenruo --- fs/btrfs/ctree.h | 53 +++- fs/btrfs/dedupe.h| 5 + fs/btrfs/disk-io.c | 6 + fs/btrfs/relocation.c| 3 ++- include/trace/events/btrfs.h | 3 ++- 5 files changed, 67 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 0e8933c..659790c 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -100,6 +100,9 @@ struct btrfs_ordered_sum; /* tracks free space in block groups. */ #define BTRFS_FREE_SPACE_TREE_OBJECTID 10ULL +/* on-disk dedupe tree (EXPERIMENTAL) */ +#define BTRFS_DEDUPE_TREE_OBJECTID 11ULL + /* device stats in the device tree */ #define BTRFS_DEV_STATS_OBJECTID 0ULL @@ -538,7 +541,8 @@ struct btrfs_super_block { #define BTRFS_FEATURE_COMPAT_SAFE_CLEAR0ULL #define BTRFS_FEATURE_COMPAT_RO_SUPP \ - (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE) + (BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE | \ +BTRFS_FEATURE_COMPAT_RO_DEDUPE) #define BTRFS_FEATURE_COMPAT_RO_SAFE_SET 0ULL #define BTRFS_FEATURE_COMPAT_RO_SAFE_CLEAR 0ULL @@ -960,6 +964,36 @@ struct btrfs_csum_item { u8 csum; } __attribute__ ((__packed__)); +/* + * Objectid: 0 + * Type: BTRFS_DEDUPE_STATUS_ITEM_KEY + * Offset: 0 + */ +struct btrfs_dedupe_status_item { + __le64 blocksize; + __le64 limit_nr; + __le16 hash_type; + __le16 backend; +} __attribute__ ((__packed__)); + +/* + * Objectid: Last 64 bit of the hash + * Type: BTRFS_DEDUPE_HASH_ITEM_KEY + * Offset: Bytenr of the hash + * + * Used for hash <-> bytenr search + * Hash exclude the last 64 bit follows + */ + +/* + * Objectid: bytenr + * Type: BTRFS_DEDUPE_BYTENR_ITEM_KEY + * offset: Last 64 bit of the hash + * + * Used for bytenr <-> hash search (for 
free_extent) + * Its itemsize should always be 0. + */ + struct btrfs_dev_stats_item { /* * grow this item struct at the end for future enhancements and keep @@ -2168,6 +2202,13 @@ struct btrfs_ioctl_defrag_range_args { #define BTRFS_CHUNK_ITEM_KEY 228 /* + * Dedup item and status + */ +#define BTRFS_DEDUPE_STATUS_ITEM_KEY 230 +#define BTRFS_DEDUPE_HASH_ITEM_KEY 231 +#define BTRFS_DEDUPE_BYTENR_ITEM_KEY 232 + +/* * Records the overall state of the qgroups. * There's only one instance of this key present, * (0, BTRFS_QGROUP_STATUS_KEY, 0) @@ -3265,6 +3306,16 @@ static inline unsigned long btrfs_leaf_data(struct extent_buffer *l) return offsetof(struct btrfs_leaf, items); } +/* btrfs_dedupe_status */ +BTRFS_SETGET_FUNCS(dedupe_status_blocksize, struct btrfs_dedupe_status_item, + blocksize, 64); +BTRFS_SETGET_FUNCS(dedupe_status_limit, struct btrfs_dedupe_status_item, + limit_nr, 64); +BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item, + hash_type, 16); +BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item, + backend, 16); + /* struct btrfs_file_extent_item */ BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8); BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr, diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h index f5d2b45..1ac1bcb 100644 --- a/fs/btrfs/dedupe.h +++ b/fs/btrfs/dedupe.h @@ -60,6 +60,8 @@ struct btrfs_dedupe_hash { u8 hash[]; }; +struct btrfs_root; + struct btrfs_dedupe_info { /* dedupe blocksize */ u64 blocksize; @@ -75,6 +77,9 @@ struct btrfs_dedupe_info { struct list_head lru_list; u64 limit_nr; u64 current_nr; + + /* for persist data like dedup-hash and dedupe status */ + struct btrfs_root *dedupe_root; }; struct btrfs_trans_handle; diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index ed6a6fd..c7eda03 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -184,6 +184,7 @@ static struct btrfs_lockdep_keyset { { .id = BTRFS_DATA_RELOC_TREE_OBJECTID, .name_stem = 
"dreloc" }, { .id = BTRFS_UUID_TREE_OBJECTID, .name_stem = "uuid" }, { .id = BTRFS_FREE_SPACE_TREE_OBJECTID, .name_stem = "free-space" }, + { .id = BTRFS_DEDUPE_TREE_OBJECTID, .name_stem = "dedupe" }, { .id = 0, .name_stem = "tree" }, }; @@ -1678,6 +1679,11 @@ struct btrfs_root *btrfs_get_fs_root(struct btrfs_fs_info *fs_info, if (location->objectid == BTRFS_FREE_SPACE_TREE_OBJECTID)
[PATCH v10 11/21] btrfs: dedupe: Add ioctl for inband deduplication
From: Wang Xiaoguang

Add an ioctl interface for in-band deduplication, which includes:
1) enable
2) disable
3) status

Also add a pseudo RO compat flag to indicate that btrfs now supports in-band dedupe. There is no on-disk format change; it is only a pseudo RO compat flag.

All these ioctl interfaces are stateless, which means the caller doesn't need to care about the previous dedupe state before calling them, only about the final desired state.

For example, if the user wants to enable dedupe with a specified block size and limit, just fill the ioctl structure and call the enable ioctl. There is no need to check whether dedupe is already running.

These ioctls handle things like re-configuring or disabling quite well.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h | 1 +
 fs/btrfs/dedupe.c | 48
 fs/btrfs/dedupe.h | 15 ++
 fs/btrfs/disk-io.c | 3 ++
 fs/btrfs/ioctl.c | 68 ++
 fs/btrfs/sysfs.c | 2 ++
 include/uapi/linux/btrfs.h | 23
 7 files changed, 160 insertions(+)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 022ab61..85044bf 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -508,6 +508,7 @@ struct btrfs_super_block {
  * ones specified below then we will fail to mount
  */
 #define BTRFS_FEATURE_COMPAT_RO_FREE_SPACE_TREE	(1ULL << 0)
+#define BTRFS_FEATURE_COMPAT_RO_DEDUPE	(1ULL << 1)

 #define BTRFS_FEATURE_INCOMPAT_MIXED_BACKREF	(1ULL << 0)
 #define BTRFS_FEATURE_INCOMPAT_DEFAULT_SUBVOL	(1ULL << 1)
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index bdaea3a..cfb7fea 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -41,6 +41,33 @@ static inline struct inmem_hash *inmem_alloc_hash(u16 type)
 			GFP_NOFS);
 }

+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled || !dedupe_info) {
+		dargs->status = 0;
+		dargs->blocksize = 0;
+		dargs->backend = 0;
+		dargs->hash_type = 0;
+		dargs->limit_nr = 0;
+		dargs->current_nr = 0;
+		return;
+	}
+	mutex_lock(&dedupe_info->lock);
+	dargs->status = 1;
+	dargs->blocksize = dedupe_info->blocksize;
+	dargs->backend = dedupe_info->backend;
+	dargs->hash_type = dedupe_info->hash_type;
+	dargs->limit_nr = dedupe_info->limit_nr;
+	dargs->limit_mem = dedupe_info->limit_nr *
+		(sizeof(struct inmem_hash) +
+		 btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	dargs->current_nr = dedupe_info->current_nr;
+	mutex_unlock(&dedupe_info->lock);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -371,6 +398,27 @@ static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
 	mutex_unlock(&dedupe_info->lock);
 }

+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+
+	fs_info->dedupe_enabled = 0;
+	/* same as disable */
+	smp_wmb();
+	dedupe_info = fs_info->dedupe_info;
+	fs_info->dedupe_info = NULL;
+
+	if (!dedupe_info)
+		return 0;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
+
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 {
 	struct btrfs_dedupe_info *dedupe_info;
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index e5d0d34..f5d2b45 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -103,6 +103,15 @@ static inline struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type)
 int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 			u64 blocksize, u64 limit_nr, u64 limit_mem);

+
+ /*
+ * Get inband dedupe info
+ * Since it needs to access different backends' hash size, which
+ * is not exported, we need such simple function.
+ */
+void btrfs_dedupe_status(struct btrfs_fs_info *fs_info,
+			 struct btrfs_ioctl_dedupe_args *dargs);
+
 /*
  * Disable dedupe and invalidate all its dedupe data.
  * Called at dedupe disable time.
@@ -110,6 +119,12 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);

 /*
+ * Cleanup current btrfs_dedupe_info
+ * Called in umount time
+ */
+int btrfs_dedupe_cleanup(struct btrfs_fs_info *fs_info);
+
+/*
  * Calculate hash for dedup.
  * Caller must
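
The "stateless" behaviour described in the commit message above can be illustrated with a small userspace sketch. This is not the kernel code and not the real ioctl: `struct dedupe_state`, `dedupe_enable()` and `dedupe_disable()` are hypothetical names used only to show that each call sets the final desired state without the caller inspecting the previous one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the dedupe configuration; not a kernel structure. */
struct dedupe_state {
	bool enabled;
	uint64_t blocksize;
	uint64_t limit_nr;
};

/*
 * Apply the requested configuration regardless of the previous state.
 * Calling this while dedupe is already enabled simply re-configures it.
 */
int dedupe_enable(struct dedupe_state *st, uint64_t blocksize,
		  uint64_t limit_nr)
{
	if (blocksize == 0)
		return -1;	/* reject an obviously invalid request */
	st->enabled = true;
	st->blocksize = blocksize;
	st->limit_nr = limit_nr;
	return 0;
}

/* Disabling an already disabled state is not an error. */
int dedupe_disable(struct dedupe_state *st)
{
	st->enabled = false;
	st->blocksize = 0;
	st->limit_nr = 0;
	return 0;
}
```

A caller that wants a 32K block size just calls `dedupe_enable(&st, 32768, limit)`; whether dedupe was off or running with another block size beforehand makes no difference to the result.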
[PATCH v10 15/21] btrfs: relocation: Enhance error handling to avoid BUG_ON
Since the introduction of the btrfs dedupe tree, it's possible that balance can race with dedupe disabling.

When this happens, dedupe_enabled will make btrfs_get_fs_root() return ERR_PTR(-ENOENT). But due to a bug in the error handling branch, when this happens backref_cache->nr_nodes is increased, but the node is neither added to backref_cache nor is nr_nodes decreased, causing a BUG_ON() in backref_cache_cleanup():

[ 2611.668810] [ cut here ]
[ 2611.669946] kernel BUG at /home/sat/ktest/linux/fs/btrfs/relocation.c:243!
[ 2611.670572] invalid opcode: [#1] SMP
[ 2611.686797] Call Trace:
[ 2611.687034] [] btrfs_relocate_block_group+0x1b3/0x290 [btrfs]
[ 2611.687706] [] btrfs_relocate_chunk.isra.40+0x47/0xd0 [btrfs]
[ 2611.688385] [] btrfs_balance+0xb22/0x11e0 [btrfs]
[ 2611.688966] [] btrfs_ioctl_balance+0x391/0x3a0 [btrfs]
[ 2611.689587] [] btrfs_ioctl+0x1650/0x2290 [btrfs]
[ 2611.690145] [] ? lru_cache_add+0x3a/0x80
[ 2611.690647] [] ? lru_cache_add_active_or_unevictable+0x4c/0xc0
[ 2611.691310] [] ? handle_mm_fault+0xcd4/0x17f0
[ 2611.691842] [] ? cp_new_stat+0x153/0x180
[ 2611.692342] [] ? __vma_link_rb+0xfd/0x110
[ 2611.692842] [] ? vma_link+0xb9/0xc0
[ 2611.693303] [] do_vfs_ioctl+0xa1/0x5a0
[ 2611.693781] [] ? __do_page_fault+0x1b4/0x400
[ 2611.694310] [] SyS_ioctl+0x41/0x70
[ 2611.694758] [] entry_SYSCALL_64_fastpath+0x12/0x71
[ 2611.695331] Code: ff 48 8b 45 bf 49 83 af a8 05 00 00 01 49 89 87 a0 05 00 00 e9 2e fd ff ff b8 f4 ff ff ff e9 e4 fb ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 41 89 c6 e9 b8 fb ff ff e8 9e a6 e8 e0 4c 89 e7 44
[ 2611.697870] RIP [] relocate_block_group+0x741/0x7a0 [btrfs]
[ 2611.698818] RSP

This patch calls remove_backref_node() in the error handling branch, catches the returned -ENOENT in relocate_tree_blocks() and continues balancing.

Reported-by: Satoru Takeuchi
Signed-off-by: Qu Wenruo
---
 fs/btrfs/relocation.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index 33183ce..d72a981 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -887,6 +887,13 @@ again:
 			root = read_fs_root(rc->extent_root->fs_info, key.offset);
 			if (IS_ERR(root)) {
 				err = PTR_ERR(root);
+				/*
+				 * Don't forget to cleanup current node.
+				 * As it may not be added to backref_cache but nr_node
+				 * increased.
+				 * This will cause BUG_ON() in backref_cache_cleanup().
+				 */
+				remove_backref_node(&rc->backref_cache, cur);
 				goto out;
 			}

@@ -2990,14 +2997,21 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 	}

 	rb_node = rb_first(blocks);
-	while (rb_node) {
+	for (rb_node = rb_first(blocks); rb_node; rb_node = rb_next(rb_node)) {
 		block = rb_entry(rb_node, struct tree_block, rb_node);

 		node = build_backref_tree(rc, &block->key,
 					  block->level, block->bytenr);
 		if (IS_ERR(node)) {
+			/*
+			 * The root(dedupe tree yet) of the tree block is
+			 * going to be freed and can't be reached.
+			 * Just skip it and continue balancing.
+			 */
+			if (PTR_ERR(node) == -ENOENT)
+				continue;
 			err = PTR_ERR(node);
-			goto out;
+			break;
 		}

 		ret = relocate_tree_block(trans, rc, node, &block->key,
@@ -3005,11 +3019,9 @@ int relocate_tree_blocks(struct btrfs_trans_handle *trans,
 		if (ret < 0) {
 			if (ret != -EAGAIN || rb_node == rb_first(blocks))
 				err = ret;
-			goto out;
+			break;
 		}
-		rb_node = rb_next(rb_node);
 	}
-out:
 	err = finish_pending_nodes(trans, rc, path, err);

 out_free_path:
--
2.7.4
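
The control-flow change in relocate_tree_blocks() (skip a block on -ENOENT, stop on any other error, always reach the common cleanup) can be shown in isolation. The following is only a userspace sketch: `process_blocks()` and the `results` array are hypothetical and stand in for the per-block return codes, they are not part of btrfs.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Walk a list of simulated per-block results.
 *  -ENOENT : the owning root is going away, skip this block and continue
 *  other <0: remember the error and stop the loop
 *  >= 0    : the block was handled
 * The loop always falls through to the common cleanup point, which is
 * where the real code calls finish_pending_nodes().
 */
int process_blocks(const int *results, size_t n, size_t *processed)
{
	int err = 0;
	size_t i;

	*processed = 0;
	for (i = 0; i < n; i++) {
		if (results[i] == -ENOENT)
			continue;
		if (results[i] < 0) {
			err = results[i];
			break;
		}
		(*processed)++;
	}
	/* common cleanup would run here */
	return err;
}
```

Using a `for` loop with the step in the loop header is what makes `continue` safe here; with the old `while` loop, a `continue` would have skipped the `rb_next()` call at the bottom of the body.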
[PATCH v10 08/21] btrfs: ordered-extent: Add support for dedupe
From: Wang Xiaoguang

Add ordered-extent support for dedupe.

Note that the current ordered-extent support only handles non-compressed source extents. Support for compressed source extents will be added later.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ordered-data.c | 44
 fs/btrfs/ordered-data.h | 13 +
 2 files changed, 53 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 0de7da5..ef24ad1 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -26,6 +26,7 @@
 #include "extent_io.h"
 #include "disk-io.h"
 #include "compression.h"
+#include "dedupe.h"

 static struct kmem_cache *btrfs_ordered_extent_cache;

@@ -184,7 +185,8 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  */
 static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 				      u64 start, u64 len, u64 disk_len,
-				      int type, int dio, int compress_type)
+				      int type, int dio, int compress_type,
+				      struct btrfs_dedupe_hash *hash)
 {
 	struct btrfs_root *root = BTRFS_I(inode)->root;
 	struct btrfs_ordered_inode_tree *tree;
@@ -204,6 +206,31 @@ static int __btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 	entry->inode = igrab(inode);
 	entry->compress_type = compress_type;
 	entry->truncated_len = (u64)-1;
+	entry->hash = NULL;
+	/*
+	 * Hash hit must go through dedupe routine at all cost, even dedupe
+	 * is disabled. As its delayed ref is already increased.
+	 */
+	if (hash && (hash->bytenr || root->fs_info->dedupe_enabled)) {
+		struct btrfs_dedupe_info *dedupe_info;
+
+		dedupe_info = root->fs_info->dedupe_info;
+		if (WARN_ON(dedupe_info == NULL)) {
+			kmem_cache_free(btrfs_ordered_extent_cache,
+					entry);
+			return -EINVAL;
+		}
+		entry->hash = btrfs_dedupe_alloc_hash(dedupe_info->hash_type);
+		if (!entry->hash) {
+			kmem_cache_free(btrfs_ordered_extent_cache, entry);
+			return -ENOMEM;
+		}
+		entry->hash->bytenr = hash->bytenr;
+		entry->hash->num_bytes = hash->num_bytes;
+		memcpy(entry->hash->hash, hash->hash,
+		       btrfs_dedupe_sizes[dedupe_info->hash_type]);
+	}
+
 	if (type != BTRFS_ORDERED_IO_DONE && type != BTRFS_ORDERED_COMPLETE)
 		set_bit(type, &entry->flags);

@@ -250,15 +277,23 @@ int btrfs_add_ordered_extent(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }

+int btrfs_add_ordered_extent_dedupe(struct inode *inode, u64 file_offset,
+				    u64 start, u64 len, u64 disk_len, int type,
+				    struct btrfs_dedupe_hash *hash)
+{
+	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
+					  disk_len, type, 0,
+					  BTRFS_COMPRESS_NONE, hash);
+}
 int btrfs_add_ordered_extent_dio(struct inode *inode, u64 file_offset,
 				 u64 start, u64 len, u64 disk_len, int type)
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 1,
-					  BTRFS_COMPRESS_NONE);
+					  BTRFS_COMPRESS_NONE, NULL);
 }

 int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
@@ -267,7 +302,7 @@ int btrfs_add_ordered_extent_compress(struct inode *inode, u64 file_offset,
 {
 	return __btrfs_add_ordered_extent(inode, file_offset, start, len,
 					  disk_len, type, 0,
-					  compress_type);
+					  compress_type, NULL);
 }

 /*
@@ -577,6 +612,7 @@ void btrfs_put_ordered_extent(struct btrfs_ordered_extent *entry)
 			list_del(&sum->list);
 			kfree(sum);
 		}
+		kfree(entry->hash);
 		kmem_cache_free(btrfs_ordered_extent_cache, entry);
 	}
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 23c9605..8a54476 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -139,6 +139,16
[PATCH v10 17/21] btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
Since we will introduce a new on-disk based dedupe method, introduce new interfaces to resume a previous dedupe setup.

And since we introduce a new tree for status, also add a disable handler for it.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dedupe.c | 197 -
 fs/btrfs/dedupe.h | 13
 fs/btrfs/disk-io.c | 25 ++-
 fs/btrfs/disk-io.h | 1 +
 4 files changed, 232 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index cfb7fea..a274c1c 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -21,6 +21,8 @@
 #include "transaction.h"
 #include "delayed-ref.h"
 #include "qgroup.h"
+#include "disk-io.h"
+#include "locking.h"

 struct inmem_hash {
 	struct rb_node hash_node;
@@ -102,10 +104,69 @@ static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 	return 0;
 }

+static int init_dedupe_tree(struct btrfs_fs_info *fs_info,
+			    struct btrfs_dedupe_info *dedupe_info)
+{
+	struct btrfs_root *dedupe_root;
+	struct btrfs_key key;
+	struct btrfs_path *path;
+	struct btrfs_dedupe_status_item *status;
+	struct btrfs_trans_handle *trans;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	trans = btrfs_start_transaction(fs_info->tree_root, 2);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto out;
+	}
+	dedupe_root = btrfs_create_tree(trans, fs_info,
+					BTRFS_DEDUPE_TREE_OBJECTID);
+	if (IS_ERR(dedupe_root)) {
+		ret = PTR_ERR(dedupe_root);
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+	dedupe_info->dedupe_root = dedupe_root;
+
+	key.objectid = 0;
+	key.type = BTRFS_DEDUPE_STATUS_ITEM_KEY;
+	key.offset = 0;
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+				      sizeof(*status));
+	if (ret < 0) {
+		btrfs_abort_transaction(trans, fs_info->tree_root, ret);
+		goto out;
+	}
+
+	status = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				struct btrfs_dedupe_status_item);
+	btrfs_set_dedupe_status_blocksize(path->nodes[0], status,
+					  dedupe_info->blocksize);
+	btrfs_set_dedupe_status_limit(path->nodes[0], status,
+				      dedupe_info->limit_nr);
+	btrfs_set_dedupe_status_hash_type(path->nodes[0], status,
+					  dedupe_info->hash_type);
+	btrfs_set_dedupe_status_backend(path->nodes[0], status,
+					dedupe_info->backend);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+out:
+	btrfs_free_path(path);
+	if (ret == 0)
+		btrfs_commit_transaction(trans, fs_info->tree_root);
+	return ret;
+}
+
 static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
 				  u16 backend, u64 blocksize, u64 limit_nr,
 				  u64 limit_mem, u64 *ret_limit)
 {
+	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+
 	if (blocksize > BTRFS_DEDUPE_BLOCKSIZE_MAX ||
 	    blocksize < BTRFS_DEDUPE_BLOCKSIZE_MIN ||
 	    blocksize < fs_info->tree_root->sectorsize ||
@@ -140,8 +201,12 @@ static int check_dedupe_parameter(struct btrfs_fs_info *fs_info, u16 hash_type,
 			*ret_limit = min(tmp, limit_nr);
 		}
 	}
-	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+	if (backend == BTRFS_DEDUPE_BACKEND_ONDISK) {
+		/* Ondisk backend must use RO compat feature */
+		if (!(compat_ro_flag & BTRFS_FEATURE_COMPAT_RO_DEDUPE))
+			return -EOPNOTSUPP;
 		*ret_limit = 0;
+	}
 	return 0;
 }

@@ -150,11 +215,16 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 {
 	struct btrfs_dedupe_info *dedupe_info;
 	u64 limit = 0;
+	u64 compat_ro_flag = btrfs_super_compat_ro_flags(fs_info->super_copy);
+	int create_tree;
 	int ret = 0;

 	/* only one limit is accepted for enable*/
 	if (limit_nr && limit_mem)
 		return -EINVAL;
+	/* enable and disable may modify ondisk data, so block RO fs*/
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;

 	ret = check_dedupe_parameter(fs_info, type, backend, blocksize,
 				     limit_nr, limit_mem, &limit);
@@ -179,9 +249,19 @@ int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
 	}

 enable:
+	create_tree = compat_ro_flag &
[PATCH v10 20/21] btrfs: dedupe: Add support for adding hash for on-disk backend
The on-disk backend can now add hashes.

Since all the needed on-disk backend functions are added, also allow the on-disk backend to be used, by changing DEDUPE_BACKEND_COUNT from 1 (inmemory only) to 2 (inmemory + ondisk).

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dedupe.c | 83 +++
 fs/btrfs/dedupe.h | 3 +-
 2 files changed, 84 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 7c5d58a..1f0178e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -437,6 +437,87 @@ out:
 	return 0;
 }

+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del);
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret);
+static int ondisk_add(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info,
+		      struct btrfs_dedupe_hash *hash)
+{
+	struct btrfs_path *path;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_key key;
+	u64 hash_offset;
+	u64 bytenr;
+	u32 num_bytes;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	if (WARN_ON(hash_len <= 8 ||
+	    !IS_ALIGNED(hash->bytenr, dedupe_root->sectorsize)))
+		return -EINVAL;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(NULL, dedupe_info, path, hash->bytenr, 0);
+	if (ret < 0)
+		goto out;
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+	btrfs_release_path(path);
+
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	if (ret < 0)
+		goto out;
+	/* Same hash found, don't re-add to save dedupe tree space */
+	if (ret > 0) {
+		ret = 0;
+		goto out;
+	}
+
+	/* Insert hash->bytenr item */
+	memcpy(&key.objectid, hash->hash + hash_len - 8, 8);
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = hash->bytenr;
+
+	/* The last 8 bit will not be included into hash */
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
+				      hash_len - 8);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+	write_extent_buffer(path->nodes[0], hash->hash,
+			    hash_offset, hash_len - 8);
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+	btrfs_release_path(path);
+
+	/* Then bytenr->hash item */
+	key.objectid = hash->bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	memcpy(&key.offset, hash->hash + hash_len - 8, 8);
+
+	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key, 0);
+	WARN_ON(ret == -EEXIST);
+	if (ret < 0)
+		goto out;
+	btrfs_mark_buffer_dirty(path->nodes[0]);
+
+out:
+	mutex_unlock(&dedupe_info->lock);
+	btrfs_free_path(path);
+	return ret;
+}
+
 int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info,
 		     struct btrfs_dedupe_hash *hash)
@@ -458,6 +539,8 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,

 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_add(dedupe_info, hash);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_add(trans, dedupe_info, hash);
 	return -EINVAL;
 }

diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
index bfcacd7..1573456 100644
--- a/fs/btrfs/dedupe.h
+++ b/fs/btrfs/dedupe.h
@@ -31,8 +31,7 @@

 #define BTRFS_DEDUPE_BACKEND_INMEMORY		0
 #define BTRFS_DEDUPE_BACKEND_ONDISK		1
-/* Only support inmemory yet, so count is still only 1 */
-#define BTRFS_DEDUPE_BACKEND_COUNT		1
+#define BTRFS_DEDUPE_BACKEND_COUNT		2

 /* Dedup block size limit and default value */
 #define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
--
2.7.4
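
The key layout used by ondisk_add() above (the last 8 bytes of the hash become the key objectid, and only the first hash_len - 8 bytes are stored in the item body) can be sketched in userspace. This is an illustration only: `struct hash_key`, `split_hash()` and `hash_matches()` are hypothetical names, and `HASH_LEN` of 32 bytes is an assumption (a SHA-256 sized hash), not something taken from the patch.

```c
#include <stdint.h>
#include <string.h>

#define HASH_LEN 32	/* assumed hash size in bytes, for illustration */

/* Simplified key: the item type field is left out. */
struct hash_key {
	uint64_t objectid;	/* last 8 bytes of the hash */
	uint64_t offset;	/* bytenr of the extent */
};

/* Split a hash into the key part and the item payload. */
void split_hash(const uint8_t hash[HASH_LEN], uint64_t bytenr,
		struct hash_key *key, uint8_t payload[HASH_LEN - 8])
{
	memcpy(&key->objectid, hash + HASH_LEN - 8, 8);
	key->offset = bytenr;
	memcpy(payload, hash, HASH_LEN - 8);
}

/* Return 1 if key + payload together describe exactly this hash. */
int hash_matches(const struct hash_key *key,
		 const uint8_t payload[HASH_LEN - 8],
		 const uint8_t hash[HASH_LEN])
{
	if (memcmp(payload, hash, HASH_LEN - 8))
		return 0;
	return memcmp(&key->objectid, hash + HASH_LEN - 8, 8) == 0;
}
```

Storing the tail of the hash in the key saves 8 bytes per item and lets a tree search narrow the candidates before any payload is compared. Note that the split is 8 bytes (64 bits), even though one comment in the patch says "8 bit".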
[PATCH v10 04/21] btrfs: dedupe: Introduce function to remove hash from in-memory tree
From: Wang Xiaoguang

Introduce static function inmem_del() to remove hash from in-memory dedupe tree. And implement btrfs_dedupe_del() and btrfs_dedup_destroy() interfaces.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/dedupe.c | 105 ++
 1 file changed, 105 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 4e8455e..a229ded 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -303,3 +303,108 @@ int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
 		return inmem_add(dedupe_info, hash);
 	return -EINVAL;
 }
+
+static struct inmem_hash *
+inmem_search_bytenr(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct rb_node **p = &dedupe_info->bytenr_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+
+		if (bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return entry;
+	}
+
+	return NULL;
+}
+
+/* Delete a hash from in-memory dedupe tree */
+static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct inmem_hash *hash;
+
+	mutex_lock(&dedupe_info->lock);
+	hash = inmem_search_bytenr(dedupe_info, bytenr);
+	if (!hash) {
+		mutex_unlock(&dedupe_info->lock);
+		return 0;
+	}
+
+	__inmem_del(dedupe_info, hash);
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+/* Remove a dedupe hash from dedupe tree */
+int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info, u64 bytenr)
+{
+	struct btrfs_dedupe_info *dedupe_info = fs_info->dedupe_info;
+
+	if (!fs_info->dedupe_enabled)
+		return 0;
+
+	if (WARN_ON(dedupe_info == NULL))
+		return -EINVAL;
+
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		return inmem_del(dedupe_info, bytenr);
+	return -EINVAL;
+}
+
+static void inmem_destroy(struct btrfs_dedupe_info *dedupe_info)
+{
+	struct inmem_hash *entry, *tmp;
+
+	mutex_lock(&dedupe_info->lock);
+	list_for_each_entry_safe(entry, tmp, &dedupe_info->lru_list, lru_list)
+		__inmem_del(dedupe_info, entry);
+	mutex_unlock(&dedupe_info->lock);
+}
+
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
+{
+	struct btrfs_dedupe_info *dedupe_info;
+	int ret;
+
+	/* Here we don't want to increase refs of dedupe_info */
+	fs_info->dedupe_enabled = 0;
+
+	dedupe_info = fs_info->dedupe_info;
+
+	if (!dedupe_info)
+		return 0;
+
+	/* Don't allow disable status change in RO mount */
+	if (fs_info->sb->s_flags & MS_RDONLY)
+		return -EROFS;
+
+	/*
+	 * Wait for all unfinished write to complete dedupe routine
+	 * As disable operation is not a frequent operation, we are
+	 * OK to use heavy but safe sync_filesystem().
+	 */
+	down_read(&fs_info->sb->s_umount);
+	ret = sync_filesystem(fs_info->sb);
+	up_read(&fs_info->sb->s_umount);
+	if (ret < 0)
+		return ret;
+
+	fs_info->dedupe_info = NULL;
+
+	/* now we are OK to clean up everything */
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
+		inmem_destroy(dedupe_info);
+
+	crypto_free_shash(dedupe_info->dedupe_driver);
+	kfree(dedupe_info);
+	return 0;
+}
--
2.7.4
[PATCH v10 06/21] btrfs: dedupe: Introduce function to search for an existing hash
From: Wang Xiaoguang

Introduce static function inmem_search() to handle the job for the in-memory hash tree.

The trick is, we must ensure the delayed ref head is not being run at the time we search for the hash.

With inmem_search(), we can implement the btrfs_dedupe_search() interface.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/dedupe.c | 185 ++
 1 file changed, 185 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a229ded..9175a5f 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -20,6 +20,7 @@
 #include "btrfs_inode.h"
 #include "transaction.h"
 #include "delayed-ref.h"
+#include "qgroup.h"

 struct inmem_hash {
 	struct rb_node hash_node;
@@ -408,3 +409,187 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 	kfree(dedupe_info);
 	return 0;
 }
+
+/*
+ * Caller must ensure the corresponding ref head is not being run.
+ */
+static struct inmem_hash *
+inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
+{
+	struct rb_node **p = &dedupe_info->hash_root.rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+	u16 hash_type = dedupe_info->hash_type;
+	int hash_len = btrfs_dedupe_sizes[hash_type];
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+
+		if (memcmp(hash, entry->hash, hash_len) < 0) {
+			p = &(*p)->rb_left;
+		} else if (memcmp(hash, entry->hash, hash_len) > 0) {
+			p = &(*p)->rb_right;
+		} else {
+			/* Found, need to re-add it to LRU list head */
+			list_del(&entry->lru_list);
+			list_add(&entry->lru_list, &dedupe_info->lru_list);
+			return entry;
+		}
+	}
+	return NULL;
+}
+
+static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
+			struct inode *inode, u64 file_pos,
+			struct btrfs_dedupe_hash *hash)
+{
+	int ret;
+	struct btrfs_root *root = BTRFS_I(inode)->root;
+	struct btrfs_trans_handle *trans;
+	struct btrfs_delayed_ref_root *delayed_refs;
+	struct btrfs_delayed_ref_head *head;
+	struct btrfs_delayed_ref_head *insert_head;
+	struct btrfs_delayed_data_ref *insert_dref;
+	struct btrfs_qgroup_extent_record *insert_qrecord = NULL;
+	struct inmem_hash *found_hash;
+	int free_insert = 1;
+	u64 bytenr;
+	u32 num_bytes;
+
+	insert_head = kmem_cache_alloc(btrfs_delayed_ref_head_cachep, GFP_NOFS);
+	if (!insert_head)
+		return -ENOMEM;
+	insert_head->extent_op = NULL;
+	insert_dref = kmem_cache_alloc(btrfs_delayed_data_ref_cachep, GFP_NOFS);
+	if (!insert_dref) {
+		kmem_cache_free(btrfs_delayed_ref_head_cachep, insert_head);
+		return -ENOMEM;
+	}
+	if (root->fs_info->quota_enabled &&
+	    is_fstree(root->root_key.objectid)) {
+		insert_qrecord = kmalloc(sizeof(*insert_qrecord), GFP_NOFS);
+		if (!insert_qrecord) {
+			kmem_cache_free(btrfs_delayed_ref_head_cachep,
+					insert_head);
+			kmem_cache_free(btrfs_delayed_data_ref_cachep,
+					insert_dref);
+			return -ENOMEM;
+		}
+	}
+
+	trans = btrfs_join_transaction(root);
+	if (IS_ERR(trans)) {
+		ret = PTR_ERR(trans);
+		goto free_mem;
+	}
+
+again:
+	mutex_lock(&dedupe_info->lock);
+	found_hash = inmem_search_hash(dedupe_info, hash->hash);
+	/* If we don't find a duplicated extent, just return. */
+	if (!found_hash) {
+		ret = 0;
+		goto out;
+	}
+	bytenr = found_hash->bytenr;
+	num_bytes = found_hash->num_bytes;
+
+	delayed_refs = &trans->transaction->delayed_refs;
+
+	spin_lock(&delayed_refs->lock);
+	head = btrfs_find_delayed_ref_head(trans, bytenr);
+	if (!head) {
+		/*
+		 * We can safely insert a new delayed_ref as long as we
+		 * hold delayed_refs->lock.
+		 * Only need to use atomic inc_extent_ref()
+		 */
+		btrfs_add_delayed_data_ref_locked(root->fs_info, trans,
+				insert_dref, insert_head, insert_qrecord,
+				bytenr, num_bytes, 0, root->root_key.objectid,
+				btrfs_ino(inode), file_pos, 0,
+				BTRFS_ADD_DELAYED_REF);
+		spin_unlock(&delayed_refs->lock);
+
+		/* add_delayed_data_ref_locked will free unused memory */
+
[PATCH v10 19/21] btrfs: dedupe: Add support to delete hash for on-disk backend
The on-disk backend can now delete hashes.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dedupe.c | 100 ++
 1 file changed, 100 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 00f2a01..7c5d58a 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -500,6 +500,104 @@ static int inmem_del(struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
 	return 0;
 }

+/*
+ * If prepare_del is given, this will setup search_slot() for delete.
+ * Caller needs to do proper locking.
+ *
+ * Return > 0 for found.
+ * Return 0 for not found.
+ * Return < 0 for error.
+ */
+static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
+				struct btrfs_dedupe_info *dedupe_info,
+				struct btrfs_path *path, u64 bytenr,
+				int prepare_del)
+{
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	int ret;
+	int ins_len = 0;
+	int cow = 0;
+
+	if (prepare_del) {
+		if (WARN_ON(trans == NULL))
+			return -EINVAL;
+		cow = 1;
+		ins_len = -1;
+	}
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path,
+				ins_len, cow);
+
+	if (ret < 0)
+		return ret;
+	/*
+	 * Although it's almost impossible, it's still possible that
+	 * the last 64bits are all 1.
+	 */
+	if (ret == 0)
+		return 1;
+
+	ret = btrfs_previous_item(dedupe_root, path, bytenr,
+				  BTRFS_DEDUPE_BYTENR_ITEM_KEY);
+	if (ret < 0)
+		return ret;
+	if (ret > 0)
+		return 0;
+	return 1;
+}
+
+static int ondisk_del(struct btrfs_trans_handle *trans,
+		      struct btrfs_dedupe_info *dedupe_info, u64 bytenr)
+{
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	key.objectid = bytenr;
+	key.type = BTRFS_DEDUPE_BYTENR_ITEM_KEY;
+	key.offset = 0;
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = ondisk_search_bytenr(trans, dedupe_info, path, bytenr, 1);
+	if (ret <= 0)
+		goto out;
+
+	btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0]);
+	ret = btrfs_del_item(trans, dedupe_root, path);
+	btrfs_release_path(path);
+	if (ret < 0)
+		goto out;
+	/* Search for hash item and delete it */
+	key.objectid = key.offset;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = bytenr;
+
+	ret = btrfs_search_slot(trans, dedupe_root, &key, path, -1, 1);
+	if (WARN_ON(ret > 0)) {
+		ret = -ENOENT;
+		goto out;
+	}
+	if (ret < 0)
+		goto out;
+	ret = btrfs_del_item(trans, dedupe_root, path);
+
+out:
+	btrfs_free_path(path);
+	mutex_unlock(&dedupe_info->lock);
+	return ret;
+}
+
 /* Remove a dedupe hash from dedupe tree */
 int btrfs_dedupe_del(struct btrfs_trans_handle *trans,
 		     struct btrfs_fs_info *fs_info, u64 bytenr)
@@ -514,6 +612,8 @@ int btrfs_dedupe_del(struct btrfs_trans_handle *trans,

 	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY)
 		return inmem_del(dedupe_info, bytenr);
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_ONDISK)
+		return ondisk_del(trans, dedupe_info, bytenr);
 	return -EINVAL;
 }

--
2.7.4
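
The lookup strategy in ondisk_search_bytenr() above (search for the largest possible offset, then step back one item) works on any sorted key space, so it can be sketched with a plain sorted array. This is a simplified userspace illustration, not btrfs code: `struct skey`, `skey_cmp()` and `find_by_objectid()` are hypothetical, and the key type field is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified key: (objectid, offset), sorted by objectid then offset. */
struct skey {
	uint64_t objectid;
	uint64_t offset;
};

static int skey_cmp(const struct skey *a, const struct skey *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Find the item with the given objectid when its offset is unknown.
 * Returns the index of the item, or -1 if there is none.
 */
long find_by_objectid(const struct skey *items, size_t n, uint64_t objectid)
{
	struct skey probe = { objectid, UINT64_MAX };
	size_t lo = 0, hi = n;

	/* Lower bound: first index whose key is >= probe. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (skey_cmp(&items[mid], &probe) < 0)
			lo = mid + 1;
		else
			hi = mid;
	}
	/* Unlikely exact hit: the stored offset really is all ones. */
	if (lo < n && skey_cmp(&items[lo], &probe) == 0)
		return (long)lo;
	/* Otherwise the candidate is the previous item, if any. */
	if (lo == 0 || items[lo - 1].objectid != objectid)
		return -1;
	return (long)(lo - 1);
}
```

The exact-hit branch corresponds to the `ret == 0` case in the patch, and the step back corresponds to the btrfs_previous_item() call.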
[PATCH v10 05/21] btrfs: delayed-ref: Add support for increasing data ref under spinlock
For in-band dedupe, btrfs needs to increase a data ref with delayed_refs locked, so add a new function btrfs_add_delayed_data_ref_locked() to increase the extent ref with delayed_refs already locked.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/delayed-ref.c | 30 +++---
 fs/btrfs/delayed-ref.h | 8
 2 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 430b368..07474e8 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -805,6 +805,26 @@ free_ref:
 }

 /*
+ * Do real delayed data ref insert.
+ * Caller must hold delayed_refs->lock and allocation memory
+ * for dref,head_ref and record.
+ */
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action)
+{
+	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node,
+			qrecord, bytenr, num_bytes, ref_root, reserved,
+			action, 1);
+	add_delayed_data_ref(fs_info, trans, head_ref, &dref->node, bytenr,
+			num_bytes, parent, ref_root, owner, offset, action);
+}
+
+/*
  * add a delayed data ref. it's similar to btrfs_add_delayed_tree_ref.
  */
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
@@ -849,13 +869,9 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 	 * insert both the head node and the new ref without dropping
 	 * the spin lock
 	 */
-	head_ref = add_delayed_ref_head(fs_info, trans, &head_ref->node, record,
-					bytenr, num_bytes, ref_root, reserved,
-					action, 1);
-
-	add_delayed_data_ref(fs_info, trans, head_ref, &ref->node, bytenr,
-			     num_bytes, parent, ref_root, owner, offset,
-			     action);
+	btrfs_add_delayed_data_ref_locked(fs_info, trans, ref, head_ref, record,
+			bytenr, num_bytes, parent, ref_root, owner, offset,
+			reserved, action);
 	spin_unlock(&delayed_refs->lock);

 	return 0;
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index c24b653..2765858 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -239,11 +239,19 @@ static inline void btrfs_put_delayed_ref(struct btrfs_delayed_ref_node *ref)
 	}
 }

+struct btrfs_qgroup_extent_record;
 int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes, u64 parent,
 			       u64 ref_root, int level, int action,
 			       struct btrfs_delayed_extent_op *extent_op);
+void btrfs_add_delayed_data_ref_locked(struct btrfs_fs_info *fs_info,
+			struct btrfs_trans_handle *trans,
+			struct btrfs_delayed_data_ref *dref,
+			struct btrfs_delayed_ref_head *head_ref,
+			struct btrfs_qgroup_extent_record *qrecord,
+			u64 bytenr, u64 num_bytes, u64 parent, u64 ref_root,
+			u64 owner, u64 offset, u64 reserved, int action);
 int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info,
 			       struct btrfs_trans_handle *trans,
 			       u64 bytenr, u64 num_bytes,
--
2.7.4
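
The refactoring in this patch follows a common pattern: split out a `_locked` helper that neither allocates nor takes the lock, so that callers already holding the lock can reuse it. A minimal userspace sketch of the pattern follows. It is an assumption-laden illustration, not kernel code: `struct ref_list`, `add_ref_locked()` and `add_ref()` are hypothetical, and a plain `locked` flag stands in for the spinlock.

```c
#include <stdlib.h>

struct node {
	int value;
	struct node *next;
};

struct ref_list {
	int locked;		/* stand-in for a real spinlock */
	struct node *head;
	int count;
};

/*
 * Insert a pre-allocated node. The caller must hold the lock and must
 * have allocated the node, so this helper cannot fail and never sleeps.
 */
void add_ref_locked(struct ref_list *list, struct node *n, int value)
{
	n->value = value;
	n->next = list->head;
	list->head = n;
	list->count++;
}

/* Convenience wrapper: allocate first, then lock, insert and unlock. */
int add_ref(struct ref_list *list, int value)
{
	struct node *n = malloc(sizeof(*n));

	if (!n)
		return -1;
	list->locked = 1;
	add_ref_locked(list, n, value);
	list->locked = 0;
	return 0;
}
```

Doing the allocation before taking the lock is the reason the `_locked` variant takes the pre-allocated structures as parameters, which matches the "Caller must hold delayed_refs->lock" comment in the patch.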
[PATCH v10 18/21] btrfs: dedupe: Add support for on-disk hash search
The on-disk backend should now be able to search for a hash.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
 fs/btrfs/dedupe.c | 167 --
 fs/btrfs/dedupe.h | 1 +
 2 files changed, 151 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index a274c1c..00f2a01 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -652,6 +652,112 @@ int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info)
 }

 /*
+ * Compare ondisk hash with src.
+ * Return 0 if hash matches.
+ * Return non-zero for hash mismatch
+ *
+ * Caller should ensure the slot contains a valid hash item.
+ */
+static int memcmp_ondisk_hash(const struct btrfs_key *key,
+			      struct extent_buffer *node, int slot,
+			      int hash_len, const u8 *src)
+{
+	u64 offset;
+	int ret;
+
+	/* Return value doesn't make sense in this case though */
+	if (WARN_ON(hash_len <= 8 || key->type != BTRFS_DEDUPE_HASH_ITEM_KEY))
+		return -EINVAL;
+
+	/* compare the hash exlcuding the last 64 bits */
+	offset = btrfs_item_ptr_offset(node, slot);
+	ret = memcmp_extent_buffer(node, src, offset, hash_len - 8);
+	if (ret)
+		return ret;
+	return memcmp(&key->objectid, src + hash_len - 8, 8);
+}
+
+ /*
+ * Return 0 for not found
+ * Return >0 for found and set bytenr_ret
+ * Return <0 for error
+ */
+static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
+			      u64 *bytenr_ret, u32 *num_bytes_ret)
+{
+	struct btrfs_path *path;
+	struct btrfs_key key;
+	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
+	u8 *buf = NULL;
+	u64 hash_key;
+	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
+	int ret;
+
+	path = btrfs_alloc_path();
+	if (!path)
+		return -ENOMEM;
+
+	buf = kmalloc(hash_len, GFP_NOFS);
+	if (!buf) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	memcpy(&hash_key, hash + hash_len - 8, 8);
+	key.objectid = hash_key;
+	key.type = BTRFS_DEDUPE_HASH_ITEM_KEY;
+	key.offset = (u64)-1;
+
+	ret = btrfs_search_slot(NULL, dedupe_root, &key, path, 0, 0);
+	if (ret < 0)
+		goto out;
+	WARN_ON(ret == 0);
+	while (1) {
+		struct extent_buffer *node;
+		struct btrfs_dedupe_hash_item *hash_item;
+		int slot;
+
+		ret = btrfs_previous_item(dedupe_root, path, hash_key,
+					  BTRFS_DEDUPE_HASH_ITEM_KEY);
+		if (ret < 0)
+			break;
+		if (ret > 0) {
+			ret = 0;
+			break;
+		}
+
+		node = path->nodes[0];
+		slot = path->slots[0];
+		btrfs_item_key_to_cpu(node, &key, slot);
+
+		/*
+		 * Type of objectid mismatch means no previous item may
+		 * hit, exit searching
+		 */
+		if (key.type != BTRFS_DEDUPE_HASH_ITEM_KEY ||
+		    memcmp(&key.objectid, &hash_key, 8))
+			break;
+		hash_item = btrfs_item_ptr(node, slot,
+					   struct btrfs_dedupe_hash_item);
+		/*
+		 * If the hash mismatch, it's still possible that previous item
+		 * has the desired hash.
+		 */
+		if (memcmp_ondisk_hash(&key, node, slot, hash_len, hash))
+			continue;
+		/* Found */
+		ret = 1;
+		*bytenr_ret = key.offset;
+		*num_bytes_ret = dedupe_info->blocksize;
+		break;
+	}
+out:
+	kfree(buf);
+	btrfs_free_path(path);
+	return ret;
+}
+
+/*
  * Caller must ensure the corresponding ref head is not being run.
  */
 static struct inmem_hash *
@@ -681,9 +787,36 @@ inmem_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash)
 	return NULL;
 }

-static int inmem_search(struct btrfs_dedupe_info *dedupe_info,
-			struct inode *inode, u64 file_pos,
-			struct btrfs_dedupe_hash *hash)
+/* Wapper for different backends, caller needs to hold dedupe_info->lock */
+static inline int generic_search_hash(struct btrfs_dedupe_info *dedupe_info,
+				      u8 *hash, u64 *bytenr_ret,
+				      u32 *num_bytes_ret)
+{
+	if (dedupe_info->backend == BTRFS_DEDUPE_BACKEND_INMEMORY) {
+		struct inmem_hash *found_hash;
+		int ret;
+
+		found_hash = inmem_search_hash(dedupe_info, hash);
+		if (found_hash) {
+			ret = 1;
+			*bytenr_ret = found_hash->bytenr;
+
[PATCH v10 03/21] btrfs: dedupe: Introduce function to add hash into in-memory tree
From: Wang Xiaoguang

Introduce static function inmem_add() to add a hash into the in-memory
tree. With it we can implement the btrfs_dedupe_add() interface.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/dedupe.c | 151 ++
 1 file changed, 151 insertions(+)

diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 2211588..4e8455e 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -32,6 +32,14 @@ struct inmem_hash {
 	u8 hash[];
 };
 
+static inline struct inmem_hash *inmem_alloc_hash(u16 type)
+{
+	if (WARN_ON(type >= ARRAY_SIZE(btrfs_dedupe_sizes)))
+		return NULL;
+	return kzalloc(sizeof(struct inmem_hash) + btrfs_dedupe_sizes[type],
+			GFP_NOFS);
+}
+
 static int init_dedupe_info(struct btrfs_dedupe_info **ret_info, u16 type,
 			    u16 backend, u64 blocksize, u64 limit)
 {
@@ -152,3 +160,146 @@ enable:
 	fs_info->dedupe_enabled = 1;
 	return ret;
 }
+
+static int inmem_insert_hash(struct rb_root *root,
+			     struct inmem_hash *hash, int hash_len)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, hash_node);
+		if (memcmp(hash->hash, entry->hash, hash_len) < 0)
+			p = &(*p)->rb_left;
+		else if (memcmp(hash->hash, entry->hash, hash_len) > 0)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->hash_node, parent, p);
+	rb_insert_color(&hash->hash_node, root);
+	return 0;
+}
+
+static int inmem_insert_bytenr(struct rb_root *root,
+			       struct inmem_hash *hash)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct inmem_hash *entry = NULL;
+
+	while (*p) {
+		parent = *p;
+		entry = rb_entry(parent, struct inmem_hash, bytenr_node);
+		if (hash->bytenr < entry->bytenr)
+			p = &(*p)->rb_left;
+		else if (hash->bytenr > entry->bytenr)
+			p = &(*p)->rb_right;
+		else
+			return 1;
+	}
+	rb_link_node(&hash->bytenr_node, parent, p);
+	rb_insert_color(&hash->bytenr_node, root);
+	return 0;
+}
+
+static void __inmem_del(struct btrfs_dedupe_info *dedupe_info,
+			struct inmem_hash *hash)
+{
+	list_del(&hash->lru_list);
+	rb_erase(&hash->hash_node, &dedupe_info->hash_root);
+	rb_erase(&hash->bytenr_node, &dedupe_info->bytenr_root);
+
+	if (!WARN_ON(dedupe_info->current_nr == 0))
+		dedupe_info->current_nr--;
+
+	kfree(hash);
+}
+
+/*
+ * Insert a hash into in-memory dedupe tree
+ * Will remove the least recently used hashes that exceed the limit.
+ *
+ * If the hash matched an existing one, we won't insert it, to
+ * save memory
+ */
+static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
+		     struct btrfs_dedupe_hash *hash)
+{
+	int ret = 0;
+	u16 type = dedupe_info->hash_type;
+	struct inmem_hash *ihash;
+
+	ihash = inmem_alloc_hash(type);
+
+	if (!ihash)
+		return -ENOMEM;
+
+	/* Copy the data out */
+	ihash->bytenr = hash->bytenr;
+	ihash->num_bytes = hash->num_bytes;
+	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
+
+	mutex_lock(&dedupe_info->lock);
+
+	ret = inmem_insert_bytenr(&dedupe_info->bytenr_root, ihash);
+	if (ret > 0) {
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	ret = inmem_insert_hash(&dedupe_info->hash_root, ihash,
+				btrfs_dedupe_sizes[type]);
+	if (ret > 0) {
+		/*
+		 * We only keep one hash in tree to save memory, so if
+		 * hash conflicts, free the one to insert.
+		 */
+		rb_erase(&ihash->bytenr_node, &dedupe_info->bytenr_root);
+		kfree(ihash);
+		ret = 0;
+		goto out;
+	}
+
+	list_add(&ihash->lru_list, &dedupe_info->lru_list);
+	dedupe_info->current_nr++;
+
+	/* Remove the last dedupe hash if we exceed limit */
+	while (dedupe_info->current_nr > dedupe_info->limit_nr) {
+		struct inmem_hash *last;
+
+		last = list_entry(dedupe_info->lru_list.prev,
+				  struct inmem_hash, lru_list);
+		__inmem_del(dedupe_info, last);
+	}
+out:
+	mutex_unlock(&dedupe_info->lock);
+	return 0;
+}
+
+int btrfs_dedupe_add(struct btrfs_trans_handle *trans,
+		     struct btrfs_fs_info *fs_info,
+		     struct btrfs_dedupe_hash
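As a rough illustration of the insert-then-trim behaviour of inmem_add(), here is a minimal user-space sketch. It is not the patch's implementation: the rb-tree lookups are replaced by a linear scan, locking is left out, and all `toy_*` names are hypothetical. Only the policy is modelled: skip duplicates, link new entries at the head of the list, and drop entries from the tail while current_nr exceeds limit_nr.

```c
#include <stdint.h>
#include <stdlib.h>

struct toy_entry {
	uint64_t bytenr;
	struct toy_entry *prev, *next;
};

struct toy_cache {
	struct toy_entry head;	/* sentinel: head.next is newest, head.prev is oldest */
	uint64_t current_nr;
	uint64_t limit_nr;
};

static void toy_cache_init(struct toy_cache *c, uint64_t limit_nr)
{
	c->head.prev = c->head.next = &c->head;
	c->current_nr = 0;
	c->limit_nr = limit_nr;
}

/* Linear scan; the patch uses rb-trees keyed by hash and by bytenr instead. */
static int toy_cache_contains(const struct toy_cache *c, uint64_t bytenr)
{
	const struct toy_entry *e;

	for (e = c->head.next; e != &c->head; e = e->next)
		if (e->bytenr == bytenr)
			return 1;
	return 0;
}

/* Return 0 on success (a duplicate is silently skipped), -1 if malloc fails. */
static int toy_cache_add(struct toy_cache *c, uint64_t bytenr)
{
	struct toy_entry *e;

	if (toy_cache_contains(c, bytenr))
		return 0;

	e = malloc(sizeof(*e));
	if (!e)
		return -1;
	e->bytenr = bytenr;

	/* Link the new entry right after the sentinel (newest position). */
	e->next = c->head.next;
	e->prev = &c->head;
	c->head.next->prev = e;
	c->head.next = e;
	c->current_nr++;

	/* Drop entries from the tail (oldest) while over the limit. */
	while (c->current_nr > c->limit_nr) {
		struct toy_entry *last = c->head.prev;

		last->prev->next = &c->head;
		c->head.prev = last->prev;
		free(last);
		c->current_nr--;
	}
	return 0;
}
```

With a limit of 2, adding bytenr 1, 2 and 3 in that order leaves only 2 and 3 in the list.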
[PATCH v10 13/21] btrfs: dedupe: add a property handler for online dedupe
From: Wang Xiaoguang

We use btrfs extended attribute "btrfs.dedupe" to record per-file online
dedupe status, so add a dedupe property handler.

Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/props.c | 41 +
 1 file changed, 41 insertions(+)

diff --git a/fs/btrfs/props.c b/fs/btrfs/props.c
index 3699212..a430886 100644
--- a/fs/btrfs/props.c
+++ b/fs/btrfs/props.c
@@ -42,6 +42,11 @@ static int prop_compression_apply(struct inode *inode,
 				  size_t len);
 static const char *prop_compression_extract(struct inode *inode);
 
+static int prop_dedupe_validate(const char *value, size_t len);
+static int prop_dedupe_apply(struct inode *inode, const char *value,
+			     size_t len);
+static const char *prop_dedupe_extract(struct inode *inode);
+
 static struct prop_handler prop_handlers[] = {
 	{
 		.xattr_name = XATTR_BTRFS_PREFIX "compression",
@@ -50,6 +55,13 @@ static struct prop_handler prop_handlers[] = {
 		.extract = prop_compression_extract,
 		.inheritable = 1
 	},
+	{
+		.xattr_name = XATTR_BTRFS_PREFIX "dedupe",
+		.validate = prop_dedupe_validate,
+		.apply = prop_dedupe_apply,
+		.extract = prop_dedupe_extract,
+		.inheritable = 1
+	},
 };
 
 void __init btrfs_props_init(void)
@@ -426,4 +438,33 @@ static const char *prop_compression_extract(struct inode *inode)
 	return NULL;
 }
 
+static int prop_dedupe_validate(const char *value, size_t len)
+{
+	if (!strncmp("disable", value, len))
+		return 0;
+
+	return -EINVAL;
+}
+
+static int prop_dedupe_apply(struct inode *inode, const char *value, size_t len)
+{
+	if (len == 0) {
+		BTRFS_I(inode)->flags &= ~BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	if (!strncmp("disable", value, len)) {
+		BTRFS_I(inode)->flags |= BTRFS_INODE_NODEDUPE;
+		return 0;
+	}
+
+	return -EINVAL;
+}
+
+static const char *prop_dedupe_extract(struct inode *inode)
+{
+	if (BTRFS_I(inode)->flags & BTRFS_INODE_NODEDUPE)
+		return "disable";
+	return NULL;
+}
-- 
2.7.4
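To see how the apply/extract pair maps the "btrfs.dedupe" property value onto an inode flag, the following user-space sketch mirrors the logic of the handlers in this patch. The `toy_inode` type and the `TOY_INODE_NODEDUPE` bit value are hypothetical stand-ins for the kernel inode and BTRFS_INODE_NODEDUPE; nothing here calls real btrfs or xattr APIs.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define TOY_INODE_NODEDUPE (1u << 0)	/* hypothetical bit, not the kernel value */

struct toy_inode {
	unsigned int flags;
};

/* Mirror of the apply logic: an empty value clears the flag, "disable" sets it. */
static int toy_dedupe_apply(struct toy_inode *inode, const char *value,
			    size_t len)
{
	if (len == 0) {
		inode->flags &= ~TOY_INODE_NODEDUPE;
		return 0;
	}

	if (!strncmp("disable", value, len)) {
		inode->flags |= TOY_INODE_NODEDUPE;
		return 0;
	}

	return -EINVAL;
}

/* Mirror of the extract logic: report "disable" only when the flag is set. */
static const char *toy_dedupe_extract(const struct toy_inode *inode)
{
	if (inode->flags & TOY_INODE_NODEDUPE)
		return "disable";
	return NULL;
}
```

Because the comparison is bounded by `len`, a shorter prefix such as "dis" would also be accepted by this logic; whether that is intended is a question for the patch authors.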
[PATCH v10 01/21] btrfs: dedupe: Introduce dedupe framework and its header
From: Wang Xiaoguang

Introduce the header for the btrfs online (write-time) de-duplication
framework. The new de-duplication framework is going to support 2
different dedupe methods and 1 dedupe hash.

Signed-off-by: Qu Wenruo
Signed-off-by: Wang Xiaoguang
---
 fs/btrfs/ctree.h | 5 ++
 fs/btrfs/dedupe.h | 134 +
 fs/btrfs/disk-io.c | 1 +
 3 files changed, 140 insertions(+)
 create mode 100644 fs/btrfs/dedupe.h

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 84a6a5b..022ab61 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1860,6 +1860,11 @@ struct btrfs_fs_info {
 	struct list_head pinned_chunks;
 
 	int creating_free_space_tree;
+
+	/* Inband de-duplication related structures */
+	unsigned int dedupe_enabled:1;
+	struct btrfs_dedupe_info *dedupe_info;
+	struct mutex dedupe_ioctl_lock;
 };
 
 struct btrfs_subvolume_writers {
diff --git a/fs/btrfs/dedupe.h b/fs/btrfs/dedupe.h
new file mode 100644
index 000..40f4808
--- /dev/null
+++ b/fs/btrfs/dedupe.h
@@ -0,0 +1,134 @@
+/*
+ * Copyright (C) 2015 Fujitsu. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+ * General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public
+ * License along with this program; if not, write to the
+ * Free Software Foundation, Inc., 59 Temple Place - Suite 330,
+ * Boston, MA 021110-1307, USA.
+ */
+
+#ifndef __BTRFS_DEDUPE__
+#define __BTRFS_DEDUPE__
+
+#include
+#include
+#include
+
+/*
+ * Dedup storage backend
+ * On-disk is persistent storage but its overhead is large
+ * In-memory is fast but will lose all its hashes on umount
+ */
+#define BTRFS_DEDUPE_BACKEND_INMEMORY		0
+#define BTRFS_DEDUPE_BACKEND_ONDISK		1
+
+/* Only support inmemory yet, so count is still only 1 */
+#define BTRFS_DEDUPE_BACKEND_COUNT		1
+
+/* Dedup block size limit and default value */
+#define BTRFS_DEDUPE_BLOCKSIZE_MAX	(8 * 1024 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_MIN	(16 * 1024)
+#define BTRFS_DEDUPE_BLOCKSIZE_DEFAULT	(128 * 1024)
+
+/* Hash algorithm, only support SHA256 yet */
+#define BTRFS_DEDUPE_HASH_SHA256		0
+
+static int btrfs_dedupe_sizes[] = { 32 };
+
+/*
+ * For caller outside of dedup.c
+ *
+ * Different dedupe backends should have their own hash structure
+ */
+struct btrfs_dedupe_hash {
+	u64 bytenr;
+	u32 num_bytes;
+
+	/* last field is a variable length array of dedupe hash */
+	u8 hash[];
+};
+
+struct btrfs_dedupe_info {
+	/* dedupe blocksize */
+	u64 blocksize;
+	u16 backend;
+	u16 hash_type;
+
+	struct crypto_shash *dedupe_driver;
+	struct mutex lock;
+
+	/* following members are only used in in-memory dedupe mode */
+	struct rb_root hash_root;
+	struct rb_root bytenr_root;
+	struct list_head lru_list;
+	u64 limit_nr;
+	u64 current_nr;
+};
+
+struct btrfs_trans_handle;
+
+static inline int btrfs_dedupe_hash_hit(struct btrfs_dedupe_hash *hash)
+{
+	return (hash && hash->bytenr);
+}
+
+int btrfs_dedupe_hash_size(u16 type);
+struct btrfs_dedupe_hash *btrfs_dedupe_alloc_hash(u16 type);
+
+/*
+ * Initialize inband dedupe info
+ * Called at dedupe enable time.
+ */
+int btrfs_dedupe_enable(struct btrfs_fs_info *fs_info, u16 type, u16 backend,
+			u64 blocksize, u64 limit_nr, u64 limit_mem);
+
+/*
+ * Disable dedupe and invalidate all its dedupe data.
+ * Called at dedupe disable time.
+ */
+int btrfs_dedupe_disable(struct btrfs_fs_info *fs_info);
+
+/*
+ * Calculate hash for dedup.
+ * Caller must ensure [start, start + dedupe_bs) has valid data.
+ */
+int btrfs_dedupe_calc_hash(struct btrfs_fs_info *fs_info,
+			   struct inode *inode, u64 start,
+			   struct btrfs_dedupe_hash *hash);
+
+/*
+ * Search for duplicated extents by calculated hash
+ * Caller must call btrfs_dedupe_calc_hash() first to get the hash.
+ *
+ * @inode: the inode we are writing to
+ * @file_pos: offset inside the inode
+ * As we will increase extent ref immediately after a hash match,
+ * we need @file_pos and @inode in this case.
+ *
+ * Return > 0 for a hash match, and the extent ref will be
+ * *INCREASED*, and hash->bytenr/num_bytes will record the existing
+ * extent data.
+ * Return 0 for a hash miss. Nothing is done
+ */
+int btrfs_dedupe_search(struct btrfs_fs_info *fs_info,
+			struct inode *inode, u64 file_pos,
[PATCH v10 21/21] btrfs: dedupe: Preparation for compress-dedupe co-work
For dedupe to work with compression, new members recording compression
algorithm and on-disk extent length are needed.

Add them for later compress-dedupe co-work.

Signed-off-by: Qu Wenruo
---
 fs/btrfs/ctree.h | 22 +-
 fs/btrfs/dedupe.c | 78 -
 fs/btrfs/dedupe.h | 2 ++
 fs/btrfs/inode.c | 2 ++
 fs/btrfs/ordered-data.c | 2 ++
 5 files changed, 85 insertions(+), 21 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 659790c..fdbe66b 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -982,8 +982,22 @@ struct btrfs_dedupe_status_item {
  * Offset: Bytenr of the hash
  *
  * Used for hash <-> bytenr search
- * Hash exclude the last 64 bit follows
  */
+struct btrfs_dedupe_hash_item {
+	/*
+	 * length of dedupe range on disk
+	 * For in-memory length, it's always
+	 * dedupe_info->block_size
+	 */
+	__le32 disk_len;
+
+	u8 compression;
+
+	/*
+	 * Hash follows, exclude the last 64bit,
+	 * as it's already in key.objectid.
+	 */
+} __attribute__ ((__packed__));
 
 /*
  * Objectid: bytenr
@@ -3316,6 +3330,12 @@ BTRFS_SETGET_FUNCS(dedupe_status_hash_type, struct btrfs_dedupe_status_item,
 BTRFS_SETGET_FUNCS(dedupe_status_backend, struct btrfs_dedupe_status_item,
 		   backend, 16);
 
+/* btrfs_dedupe_hash_item */
+BTRFS_SETGET_FUNCS(dedupe_hash_disk_len, struct btrfs_dedupe_hash_item,
+		   disk_len, 32);
+BTRFS_SETGET_FUNCS(dedupe_hash_compression, struct btrfs_dedupe_hash_item,
+		   compression, 8);
+
 /* struct btrfs_file_extent_item */
 BTRFS_SETGET_FUNCS(file_extent_type, struct btrfs_file_extent_item, type, 8);
 BTRFS_SETGET_STACK_FUNCS(stack_file_extent_disk_bytenr,
diff --git a/fs/btrfs/dedupe.c b/fs/btrfs/dedupe.c
index 1f0178e..e91420d 100644
--- a/fs/btrfs/dedupe.c
+++ b/fs/btrfs/dedupe.c
@@ -31,6 +31,8 @@ struct inmem_hash {
 
 	u64 bytenr;
 	u32 num_bytes;
+	u32 disk_num_bytes;
+	u8 compression;
 
 	u8 hash[];
 };
@@ -397,6 +399,8 @@ static int inmem_add(struct btrfs_dedupe_info *dedupe_info,
 	/* Copy the data out */
 	ihash->bytenr = hash->bytenr;
 	ihash->num_bytes = hash->num_bytes;
+	ihash->disk_num_bytes = hash->disk_num_bytes;
+	ihash->compression = hash->compression;
 	memcpy(ihash->hash, hash->hash, btrfs_dedupe_sizes[type]);
 
 	mutex_lock(&dedupe_info->lock);
@@ -442,7 +446,8 @@ static int ondisk_search_bytenr(struct btrfs_trans_handle *trans,
 				struct btrfs_path *path, u64 bytenr,
 				int prepare_del);
 static int ondisk_search_hash(struct btrfs_dedupe_info *dedupe_info, u8 *hash,
-			      u64 *bytenr_ret, u32 *num_bytes_ret);
+			      u64 *bytenr_ret, u32 *num_bytes_ret,
+			      u32 *disk_num_bytes_ret, u8 *compression);
 static int ondisk_add(struct btrfs_trans_handle *trans,
 		      struct btrfs_dedupe_info *dedupe_info,
 		      struct btrfs_dedupe_hash *hash)
@@ -450,7 +455,7 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	struct btrfs_path *path;
 	struct btrfs_root *dedupe_root = dedupe_info->dedupe_root;
 	struct btrfs_key key;
-	u64 hash_offset;
+	struct btrfs_dedupe_hash_item *hash_item;
 	u64 bytenr;
 	u32 num_bytes;
 	int hash_len = btrfs_dedupe_sizes[dedupe_info->hash_type];
@@ -475,7 +480,8 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 	}
 	btrfs_release_path(path);
 
-	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes);
+	ret = ondisk_search_hash(dedupe_info, hash->hash, &bytenr, &num_bytes,
+				 NULL, NULL);
 	if (ret < 0)
 		goto out;
 	/* Same hash found, don't re-add to save dedupe tree space */
@@ -491,13 +497,18 @@ static int ondisk_add(struct btrfs_trans_handle *trans,
 
 	/* The last 8 bit will not be included into hash */
 	ret = btrfs_insert_empty_item(trans, dedupe_root, path, &key,
-				      hash_len - 8);
+				      sizeof(*hash_item) + hash_len - 8);
 	WARN_ON(ret == -EEXIST);
 	if (ret < 0)
 		goto out;
-	hash_offset = btrfs_item_ptr_offset(path->nodes[0], path->slots[0]);
+	hash_item = btrfs_item_ptr(path->nodes[0], path->slots[0],
+				   struct btrfs_dedupe_hash_item);
+	btrfs_set_dedupe_hash_disk_len(path->nodes[0], hash_item,
+				       hash->disk_num_bytes);
+	btrfs_set_dedupe_hash_compression(path->nodes[0], hash_item,
+					  hash->compression);
 	write_extent_buffer(path->nodes[0], hash->hash,
-			    hash_offset, hash_len - 8);
[PATCH v10 09/21] btrfs: dedupe: Inband in-memory only de-duplication implement
Core implementation of inband de-duplication.
It reuses the async_cow_start() facility to calculate the dedupe hash,
and uses the dedupe hash to do inband de-duplication at extent level.

The workflow is as below:
1) Run delalloc range for an inode
2) Calculate hash for the delalloc range at the unit of dedupe_bs
3) For the hash match (duplicated) case, just increase the source extent
   ref and insert the file extent.
   For the hash mismatch case, go through the normal cow_file_range()
   fallback, and add the hash into the dedupe_tree.
   Compression for the hash miss case is not supported yet.

The current implementation stores all dedupe hashes in an in-memory
rb-tree, with LRU behavior to control the limit.

Signed-off-by: Wang Xiaoguang
Signed-off-by: Qu Wenruo
---
 fs/btrfs/extent-tree.c | 18
 fs/btrfs/inode.c | 235 ++---
 fs/btrfs/relocation.c | 16
 3 files changed, 236 insertions(+), 33 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 53e1297..dabd721 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -37,6 +37,7 @@
 #include "math.h"
 #include "sysfs.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 #undef SCRAMBLE_DELAYED_REFS
 
@@ -2399,6 +2400,8 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 
 	if (btrfs_delayed_ref_is_head(node)) {
 		struct btrfs_delayed_ref_head *head;
+		struct btrfs_fs_info *fs_info = root->fs_info;
+
 		/*
 		 * we've hit the end of the chain and we were supposed
 		 * to insert this extent into the tree. But, it got
@@ -2413,6 +2416,15 @@ static int run_one_delayed_ref(struct btrfs_trans_handle *trans,
 			btrfs_pin_extent(root, node->bytenr,
 					 node->num_bytes, 1);
 			if (head->is_data) {
+				/*
+				 * If insert_reserved is given, it means
+				 * a new extent is reserved, then deleted
+				 * in one tran, and inc/dec get merged to 0.
+				 *
+				 * In this case, we need to remove its dedup
+				 * hash.
+				 */
+				btrfs_dedupe_del(trans, fs_info, node->bytenr);
 				ret = btrfs_del_csums(trans, root,
 						      node->bytenr,
 						      node->num_bytes);
@@ -6713,6 +6725,12 @@ static int __btrfs_free_extent(struct btrfs_trans_handle *trans,
 		btrfs_release_path(path);
 
 		if (is_data) {
+			ret = btrfs_dedupe_del(trans, info, bytenr);
+			if (ret < 0) {
+				btrfs_abort_transaction(trans, extent_root,
+							ret);
+				goto out;
+			}
 			ret = btrfs_del_csums(trans, root, bytenr, num_bytes);
 			if (ret) {
 				btrfs_abort_transaction(trans, extent_root, ret);
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 41a5688..96790d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -60,6 +60,7 @@
 #include "hash.h"
 #include "props.h"
 #include "qgroup.h"
+#include "dedupe.h"
 
 struct btrfs_iget_args {
 	struct btrfs_key *location;
@@ -106,7 +107,8 @@ static int btrfs_finish_ordered_io(struct btrfs_ordered_extent *ordered_extent);
 static noinline int cow_file_range(struct inode *inode,
 				   struct page *locked_page,
 				   u64 start, u64 end, int *page_started,
-				   unsigned long *nr_written, int unlock);
+				   unsigned long *nr_written, int unlock,
+				   struct btrfs_dedupe_hash *hash);
 static struct extent_map *create_pinned_em(struct inode *inode, u64 start,
 					   u64 len, u64 orig_start,
 					   u64 block_start, u64 block_len,
@@ -335,6 +337,7 @@ struct async_extent {
 	struct page **pages;
 	unsigned long nr_pages;
 	int compress_type;
+	struct btrfs_dedupe_hash *hash;
 	struct list_head list;
 };
 
@@ -353,7 +356,8 @@ static noinline int add_async_extent(struct async_cow *cow,
 				     u64 compressed_size,
 				     struct page **pages,
 				     unsigned long nr_pages,
-				     int compress_type)
+				     int compress_type,
+				     struct
[PATCH v10 00/21] Btrfs dedupe framework
This patchset can be fetched from github:
https://github.com/adam900710/linux.git wang_dedupe_20160401

In this patchset, we're proud to bring a completely new storage backend:
Khala backend.

With Khala backend, all dedupe hashes will be stored in the Khala,
shared with every Kalai protoss, with unlimited storage and almost zero
search latency.
A perfect backend for any Kalai protoss. "My life for Aiur!"

Unfortunately, such a backend is not available for humans.


OK, except for the super-fancy and date-related backend, the patchset is
still a serious patchset.
In this patchset, we mostly addressed the on-disk format change comments
from Chris:
1) Reduced dedupe hash item and bytenr item.
   Now the dedupe hash item structure size is reduced from 41 bytes
   (9 bytes hash_item + 32 bytes hash)
   to 29 bytes (5 bytes hash_item + 24 bytes hash).
   Without the last patch, it's even less, with only 24 bytes
   (24 bytes hash only).
   And the dedupe bytenr item structure size is reduced from 32 bytes
   (full hash) to 0.

2) Hide dedupe ioctls under CONFIG_BTRFS_DEBUG
   Advised by David, to make btrfs dedupe an experimental feature for
   advanced users.
   This allows the patchset to be merged while still allowing us to
   change the ioctl in the future.

3) Add back missing bug fix patches
   I just missed 2 bug fix patches in the previous iteration.
   Adding them back.

Now patches 1~11 provide the full backward-compatible in-memory backend.
Patches 12~14 provide the per-file dedupe flag feature.
Patches 15~20 provide the on-disk dedupe backend with persistent dedupe
state for the in-memory backend.
The last patch is just preparation for possible dedupe-compress co-work.

Changelog:
v2:
  Totally reworked to handle multiple backends
v3:
  Fix a stupid but deadly on-disk backend bug
  Add handling for the multiple hashes on same bytenr corner case to fix
  an abort trans error
  Increase dedup rate by enhancing delayed ref handler for both backends.
  Move dedup_add() to run_delayed_ref() time, to fix abort trans error.
  Increase dedup block size upper limit to 8M.
v4:
  Add dedup prop for disabling dedup for given files/dirs.
  Merge inmem_search() and ondisk_search() into generic_search() to save
  some code
  Fix another delayed_ref related bug.
  Use the same mutex for both inmem and ondisk backend.
  Move dedup_add() back to btrfs_finish_ordered_io() to increase dedup
  rate.
v5:
  Reuse compress routine for much simpler dedup function.
  Slightly improved performance due to above modification.
  Fix race between dedup enable/disable
  Fix for false ENOSPC report
v6:
  Further enable/disable race window fix.
  Minor format change according to checkpatch.
v7:
  Fix one concurrency bug with balance.
  Slightly modify return value from -EINVAL to -EOPNOTSUPP for
  btrfs_dedup_ioctl() to allow progs to distinguish unsupported commands
  and wrong parameters.
  Rebased to integration-4.6.
v8:
  Rename 'dedup' to 'dedupe'.
  Add support to allow dedupe and compression to work at the same time.
  Fix several balance related bugs. Special thanks to Satoru Takeuchi,
  who exposed most of them.
  Small dedupe hit case performance improvement.
v9:
  Re-order the patchset to completely separate pure in-memory and any
  on-disk format change.
  Fold bug fixes into their original patches.
v10:
  Adding back missing bug fix patch.
  Reduce on-disk item size.
  Hide dedupe ioctl under CONFIG_BTRFS_DEBUG.
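The item sizes quoted in point 1) above can be checked with a short user-space sketch. It assumes the packed layout of btrfs_dedupe_hash_item from patch 21 (a 32-bit disk_len plus an 8-bit compression field) and a 32-byte SHA-256 hash whose last 8 bytes live in key.objectid; the `toy_*` names are hypothetical and the packed attribute is the GCC/Clang spelling.

```c
#include <stddef.h>
#include <stdint.h>

#define HASH_LEN 32	/* SHA-256 */
#define KEY_LEN  8	/* hash tail that is kept in key.objectid */

/* User-space model of the packed on-disk hash item header. */
struct toy_dedupe_hash_item {
	uint32_t disk_len;	/* stands in for __le32 disk_len */
	uint8_t  compression;
} __attribute__((__packed__));

/* Item payload: header plus the hash without its last 8 bytes. */
static size_t toy_hash_item_size(void)
{
	return sizeof(struct toy_dedupe_hash_item) + HASH_LEN - KEY_LEN;
}
```

Under these assumptions the header is 5 bytes and the stored hash prefix is 24 bytes, which gives the 29 bytes mentioned in the cover letter.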

Qu Wenruo (9):
  btrfs: delayed-ref: Add support for increasing data ref under spinlock
  btrfs: dedupe: Inband in-memory only de-duplication implement
  btrfs: relocation: Enhance error handling to avoid BUG_ON
  btrfs: dedupe: Add basic tree structure for on-disk dedupe method
  btrfs: dedupe: Introduce interfaces to resume and cleanup dedupe info
  btrfs: dedupe: Add support for on-disk hash search
  btrfs: dedupe: Add support to delete hash for on-disk backend
  btrfs: dedupe: Add support for adding hash for on-disk backend
  btrfs: dedupe: Preparation for compress-dedupe co-work

Wang Xiaoguang (12):
  btrfs: dedupe: Introduce dedupe framework and its header
  btrfs: dedupe: Introduce function to initialize dedupe info
  btrfs: dedupe: Introduce function to add hash into in-memory tree
  btrfs: dedupe: Introduce function to remove hash from in-memory tree
  btrfs: dedupe: Introduce function to search for an existing hash
  btrfs: dedupe: Implement btrfs_dedupe_calc_hash interface
  btrfs: ordered-extent: Add support for dedupe
  btrfs: try more times to alloc metadata reserve space
  btrfs: dedupe: Add ioctl for inband dedupelication
  btrfs: dedupe: add an inode nodedupe flag
  btrfs: dedupe: add a property handler for online dedupe
  btrfs: dedupe: add per-file online dedupe control

 fs/btrfs/Makefile      |    2 +-
 fs/btrfs/ctree.h       |   80 ++-
 fs/btrfs/dedupe.c      | 1239 ++
 fs/btrfs/dedupe.h      |  181 ++
 fs/btrfs/delayed-ref.c |   30 +-
 fs/btrfs/delayed-ref.h |