Re: filesystem corruption
Zygo Blaxell posted on Mon, 03 Nov 2014 23:31:45 -0500 as excerpted:

On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
On Nov 2, 2014, at 8:43 PM, Zygo Blaxell zblax...@furryterror.org wrote:
btrfs seems to assume the data is correct on both disks (the generation numbers and checksums are OK) but gets confused by equally plausible but different metadata on each disk. It doesn't take long before the filesystem becomes data soup or crashes the kernel.

This is a pretty significant problem to still be present, honestly. I can understand the catchup mechanism is probably not built yet, but clearly the two devices don't have the same generation. The lower-generation device should probably be booted/ignored or declared missing in the meantime to prevent trashing the file system.

The problem with generation numbers is when both devices get divergent generation numbers but we can't tell them apart [snip very reasonable scenario] Now we have two disks with equal generation numbers. Generations 6..9 on sda are not the same as generations 6..9 on sdb, so if we mix the two disks' metadata we get bad confusion. It needs to be more than a sequential number. If one of the disks disappears we need to record this fact on the surviving disks, and also cope with _both_ disks claiming to be the surviving one.

Zygo's absolutely correct. There is an existing catchup mechanism, but the tracking is /purely/ sequential-generation-number based, and if the two generation sequences diverge... Welcome to the (data) Twilight Zone!

I noted this in my own early pre-deployment raid1 mode testing as well, except that I didn't at that point know about sequence numbers and never got as far as letting the filesystem make data soup of itself. What I did was this:

1) Create a two-device raid1 data and metadata filesystem, mount it and stick some data on it.
2) Unmount, pull a device, mount degraded the remaining device.
3) Change a file.
4) Unmount, switch devices, mount degraded the other device.
5) Change the same file in a different/incompatible way.
6) Unmount, plug both devices in again, mount (not degraded).
7) Wait for the sync I was used to from mdraid, which of course didn't occur.
8) Check the file to see which version showed up. I don't recall which version it was, but it wasn't the common pre-change version.
9) Unmount, pull each device one at a time, mounting the other one degraded and checking the file again.
10) The file on each device remained different, without a warning or indication of any problem at all when I mounted undegraded in 6/7.

Had I initiated a scrub, presumably it would have seen the difference, and if one copy was a newer generation it would have taken it, overwriting the other. I don't know what it would have done if both were the same generation, tho the file being small (just a few-line text file, big enough to test the effect of differing edits), I guess it would take one version or the other. If the file had been large enough to span multiple extents, however, I've no idea whether it'd take one or the other, or possibly combine the two, picking extents where they differed more or less randomly.

By that time the lack of warning, and the absence of absolute resolution to one version or the other even after mounting undegraded and accessing the file with incompatible versions on each of the two devices, was bothering me sufficiently that I didn't test any further. Being just me I have to worry about (unlike a multi-admin corporate scenario where you can never be /sure/ what the other admins will do regardless of agreed procedure), I simply set myself a set of rules very similar to what Zygo proposed:

1) If for whatever reason I ever split a btrfs raid1 with the intent or even the possibility of bringing the pieces back together again, if at all possible never mount the split pieces writable -- mount read-only.
2) If a writable mount is required, keep the writable mounts to one device of the split. As long as the other device is never mounted writable, it will have an older generation when they're reunited, and a scrub should take care of things, reliably resolving to the updated written device, rewriting the older generation on the other device. What I'd do here is physically put the removed side of the raid1 in storage, far enough from the remaining side that I couldn't possibly get them mixed up. I'd clearly label it as well, creating a defense in depth of at least two: the labeling, and the physical separation and storage of the read-only device.

3) If for whatever reason the originally read-only side must be mounted writable, very clearly mark the originally mounted-writable device POISONED/TOXIC!! *NEVER* *EVER* let such a POISONED device anywhere near its original raid1 mate until it is wiped, such that there's no possibility of btrfs getting confused and contaminated with the poisoned data.

Given how unimpressed I was
Re: [PATCH] Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2
On Mon, 3 Nov 2014 08:56:50 -0500, Josef Bacik wrote:

Our gluster boxes get several thousand statfs() calls per second, which begins to suck hardcore with all of the lock contention on the chunk mutex and dev list mutex. We don't really need to hold these things; if we have transient weirdness with statfs() because of the chunk allocator we don't care, so remove this locking. We still need the dev_list lock if you mount with -o alloc_start however, which is a good argument for nuking that thing from orbit, but that's a patch for another day. Thanks,

Signed-off-by: Josef Bacik jba...@fb.com
---
V1->V2: make sure ->alloc_start is set before doing the dev extent lookup logic.

I am confused about why we need dev_list_lock if we mount with -o alloc_start. AFAIK, ->alloc_start is protected by chunk_mutex. But I think we needn't care that someone changes ->alloc_start; in other words, we needn't take chunk_mutex during the whole process. The following case can be tolerated by the users, I think:

Task1						Task2
statfs
  mutex_lock(&fs_info->chunk_mutex);
  tmp = fs_info->alloc_start;
  mutex_unlock(&fs_info->chunk_mutex);
  btrfs_calc_avail_data_space(fs_info, tmp)
  ...
						mount -o remount,alloc_start=
						...

Thanks
Miao

 fs/btrfs/super.c | 72
 1 file changed, 47 insertions(+), 25 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 54bd91e..dc337d1 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1644,8 +1644,20 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	int i = 0, nr_devices;
 	int ret;
 
+	/*
+	 * We aren't under the device list lock, so this is racey-ish, but good
+	 * enough for our purposes.
+	 */
 	nr_devices = fs_info->fs_devices->open_devices;
-	BUG_ON(!nr_devices);
+	if (!nr_devices) {
+		smp_mb();
+		nr_devices = fs_info->fs_devices->open_devices;
+		ASSERT(nr_devices);
+		if (!nr_devices) {
+			*free_bytes = 0;
+			return 0;
+		}
+	}
 
 	devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
 			       GFP_NOFS);
@@ -1670,11 +1682,17 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 	else
 		min_stripe_size = BTRFS_STRIPE_LEN;
 
-	list_for_each_entry(device, &fs_devices->devices, dev_list) {
+	if (fs_info->alloc_start)
+		mutex_lock(&fs_devices->device_list_mutex);
+	rcu_read_lock();
+	list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
 		if (!device->in_fs_metadata || !device->bdev ||
 		    device->is_tgtdev_for_dev_replace)
 			continue;
 
+		if (i >= nr_devices)
+			break;
+
 		avail_space = device->total_bytes - device->bytes_used;
 
 		/* align with stripe_len */
@@ -1689,24 +1707,32 @@ static int btrfs_calc_avail_data_space(struct btrfs_root *root, u64 *free_bytes)
 			skip_space = 1024 * 1024;
 
 		/* user can set the offset in fs_info->alloc_start. */
-		if (fs_info->alloc_start + BTRFS_STRIPE_LEN <=
-		    device->total_bytes)
+		if (fs_info->alloc_start &&
+		    fs_info->alloc_start + BTRFS_STRIPE_LEN <=
+		    device->total_bytes) {
+			rcu_read_unlock();
 			skip_space = max(fs_info->alloc_start, skip_space);
-		/*
-		 * btrfs can not use the free space in [0, skip_space - 1],
-		 * we must subtract it from the total. In order to implement
-		 * it, we account the used space in this range first.
-		 */
-		ret = btrfs_account_dev_extents_size(device, 0, skip_space - 1,
-						     &used_space);
-		if (ret) {
-			kfree(devices_info);
-			return ret;
-		}
+
+			/*
+			 * btrfs can not use the free space in
+			 * [0, skip_space - 1], we must subtract it from the
+			 * total. In order to implement it, we account the used
+			 * space in this range first.
+			 */
+			ret = btrfs_account_dev_extents_size(device, 0,
+							     skip_space - 1,
+							     &used_space);
+			if (ret) {
+				kfree(devices_info);
+
BTRFS Quota Display Tool
Hello, I'm looking for a web-based tool for displaying quotas for btrfs volumes -- something which gets its data from btrfsQuota.py and displays nice bars on a web frontend, so we can see how much space a user is consuming right now. Although btrfs does not support per-user quotas, we made a workaround for this by separating each user's files into one subvolume per user. This way his quota can be determined. A lot of additional features would come in handy, such as an email notification when a user gets near his quota limit, or another notification when he hits it. On some of our hosting deployments this is taken care of nicely by ISPConfig (hard/soft limits can be easily configured on the control panel), repquota (displays quota usage statistics from the command line) and warnquota (able to generate the email alerts I mentioned). If you know of any existing (preferably PHP-based) framework for this, please let me know, so I wouldn't have to develop it from scratch. Thanks
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Kernel crash during btrfs device delete on raid6 volume
On Tue, Nov 4, 2014 at 9:36 AM, Erik Berg bt...@slipsprogrammoer.no wrote:

Pulled the latest btrfs-progs from kdave (v3.17-12-gcafacda) and using the latest linux release candidate (3.18.0-031800rc3-generic) from canonical/ubuntu.

btrfs fi show
Label: none  uuid: 5c5fea06-0319-4e03-a42e-004e64aeed92
	Total devices 9 FS bytes used 10.91TiB
	devid    2 size 931.48GiB used 928.02GiB path /dev/sdc1
	devid    3 size 931.48GiB used 928.02GiB path /dev/sdd1
	devid    4 size 1.82TiB used 1.67TiB path /dev/sde1
	devid    5 size 2.73TiB used 2.28TiB path /dev/sdf1
	devid    6 size 3.64TiB used 2.73TiB path /dev/sdg1
	devid    7 size 3.64TiB used 2.73TiB path /dev/sdh1
	devid    8 size 931.46GiB used 655.90GiB path /dev/sdb1
	devid    9 size 3.64TiB used 2.73TiB path /dev/sdi1
	devid   10 size 3.64TiB used 1.79TiB path /dev/sdj1

btrfs fi df
Data, RAID6: total=10.91TiB, used=10.90TiB
System, RAID6: total=96.00MiB, used=800.00KiB
Metadata, RAID6: total=13.23GiB, used=11.79GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Trying to remove device sdb1, the kernel crashes after a minute or so.

[  597.576827] ------------[ cut here ]------------
[  597.617519] kernel BUG at /home/apw/COD/linux/mm/slub.c:3334!
[ 597.668145] invalid opcode: [#1] SMP [ 597.704410] Modules linked in: arc4 md4 ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat ebtables x_tables gpio_ich intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel cryptd serio_raw hpilo hpwdt 8250_fintek acpi_power_meter ie31200_edac lpc_ich edac_core ipmi_si ipmi_msghandler mac_hid lp parport nls_utf8 cifs fscache hid_generic usbhid hid btrfs xor raid6_pq uas usb_storage tg3 ptp ahci psmouse libahci pps_core hpsa [ 598.268179] CPU: 1 PID: 129 Comm: kworker/u128:3 Not tainted 3.18.0-031800rc3-generic #201411022335 [ 598.349925] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 11/09/2013 [ 598.413231] Workqueue: writeback bdi_writeback_workfn (flush-btrfs-2) [ 598.471103] task: 8803f16a3c00 ti: 880036b7 task.ti: 880036b7 [ 598.538393] RIP: 0010:[811c74fd] [811c74fd] kfree+0x16d/0x170 [ 598.606217] RSP: 0018:880036b73528 EFLAGS: 00010246 [ 598.653844] RAX: 0100 RBX: 880036b735c8 RCX: [ 598.717899] RDX: 8803743a6010 RSI: dead00100100 RDI: 880036b735c8 [ 598.781662] RBP: 880036b73558 R08: R09: eadadcc0 [ 598.846028] R10: 0001 R11: 0010 R12: 8803f1e09800 [ 598.910713] R13: 8803ac757d40 R14: c04fed0c R15: 880036b735d8 [ 598.975333] FS: () GS:88040b42() knlGS: [ 599.048512] CS: 0010 DS: ES: CR0: 80050033 [ 599.100167] CR2: 7fa9a3854024 CR3: 01c16000 CR4: 001407e0 [ 599.165150] Stack: [ 599.183305] 8803f1e09800 0dad07c2 8803f1e09800 8803ac757d40 [ 599.249603] 8803ac757d40 880036b735d8 880036b73618 c04fed0c [ 599.316306] 8803f1b86b00 880374338000 0dad07dc 880036b73638 [ 599.383404] Call Trace: [ 599.405429] [c04fed0c] btrfs_lookup_csums_range+0x2ac/0x4a0 [btrfs] Not a new bug unfortunately, but since it is in the error handling people 
must not be hitting it often. It's also not related to device replace.

	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
		list_del(&sums->list);
		kfree(sums);
	}

We're trying to call kfree on the on-stack list head. I'm fixing it up here, thanks for posting the oops!

-chris
Re: Kernel crash during btrfs device delete on raid6 volume
On Tue, Nov 4, 2014 at 9:55 AM, Chris Mason c...@fb.com wrote:

On Tue, Nov 4, 2014 at 9:36 AM, Erik Berg bt...@slipsprogrammoer.no wrote:
Pulled the latest btrfs-progs from kdave (v3.17-12-gcafacda) and using the latest linux release candidate (3.18.0-031800rc3-generic) from canonical/ubuntu. Trying to remove device sdb1, the kernel crashes after a minute or so.

[  597.617519] kernel BUG at /home/apw/COD/linux/mm/slub.c:3334!
[  598.606217] RIP: 0010:[811c74fd] [811c74fd] kfree+0x16d/0x170
[  599.405429] Call Trace: [c04fed0c] btrfs_lookup_csums_range+0x2ac/0x4a0 [btrfs]

Not a new bug unfortunately, but since it is in the error handling people must not be hitting it often. It's also not related to device replace.

	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
		list_del(&sums->list);
		kfree(sums);
	}

We're trying to call kfree on the on-stack list head. I'm fixing it up here, thanks for posting the oops! Fix attached, or you can wait for the next rc. Thanks.

-chris

From 6e5aafb27419f32575b27ef9d6a31e5d54661aca Mon Sep 17 00:00:00 2001
From: Chris Mason c...@fb.com
Date: Tue, 4 Nov 2014 06:59:04 -0800
Subject: [PATCH] Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup

If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead.
This bug came from commit 0678b6185, btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

Signed-off-by: Chris Mason c...@fb.com
Reported-by: Erik Berg bt...@slipsprogrammoer.no
cc: sta...@vger.kernel.org # 3.3 or newer
---
 fs/btrfs/file-item.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..84a2d18 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,7 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
 fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
-		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+		sums = list_entry(tmplist.next, struct btrfs_ordered_sum, list);
 		list_del(&sums->list);
 		kfree(sums);
 	}
--
1.8.1
[PATCH] btrfs-progs: use the correct SI prefixes
The SI standard defines lowercase 'k' and uppercase for the rest.

Signed-off-by: David Sterba dste...@suse.cz
---
 Documentation/btrfs-filesystem.txt | 6 +++---
 cmds-filesystem.c                  | 8 ++++----
 utils.c                            | 2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt b/Documentation/btrfs-filesystem.txt
index 6e63d2c9c2ae..a8f2972a0e1a 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -35,11 +35,11 @@ select the 1000 base for the following options, according to the SI standard
 -k|--kbytes
 show sizes in KiB, or kB with --si
 -m|--mbytes
-show sizes in MiB, or mB with --si
+show sizes in MiB, or MB with --si
 -g|--gbytes
-show sizes in GiB, or gB with --si
+show sizes in GiB, or GB with --si
 -t|--tbytes
-show sizes in TiB, or tB with --si
+show sizes in TiB, or TB with --si
 
 If conflicting options are passed, the last one takes precedence.

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index af56fbeb48ed..e4b278590ca6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -128,11 +128,11 @@ static const char * const cmd_df_usage[] = {
 	"-h             human friendly numbers, base 1024 (default)",
 	"-H             human friendly numbers, base 1000",
 	"--iec          use 1024 as a base (KiB, MiB, GiB, TiB)",
-	"--si           use 1000 as a base (kB, mB, gB, tB)",
+	"--si           use 1000 as a base (kB, MB, GB, TB)",
 	"-k|--kbytes    show sizes in KiB, or kB with --si",
-	"-m|--mbytes    show sizes in MiB, or mB with --si",
-	"-g|--gbytes    show sizes in GiB, or gB with --si",
-	"-t|--tbytes    show sizes in TiB, or tB with --si",
+	"-m|--mbytes    show sizes in MiB, or MB with --si",
+	"-g|--gbytes    show sizes in GiB, or GB with --si",
+	"-t|--tbytes    show sizes in TiB, or TB with --si",
 	NULL
 };

diff --git a/utils.c b/utils.c
index f51bc564d8f1..4b3bace4433a 100644
--- a/utils.c
+++ b/utils.c
@@ -1328,7 +1328,7 @@ out:
 static const char* unit_suffix_binary[] =
 	{ "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
 static const char* unit_suffix_decimal[] =
-	{ "B", "kB", "mB", "gB", "tB", "pB", "eB"};
+	{ "B", "kB", "MB", "GB", "TB", "PB", "EB"};

 int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned unit_mode)
 {
--
2.1.1
Re: filesystem corruption
On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote:

On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
On Nov 2, 2014, at 8:43 PM, Zygo Blaxell zblax...@furryterror.org wrote:
btrfs seems to assume the data is correct on both disks (the generation numbers and checksums are OK) but gets confused by equally plausible but different metadata on each disk. It doesn't take long before the filesystem becomes data soup or crashes the kernel.

This is a pretty significant problem to still be present, honestly. I can understand the catchup mechanism is probably not built yet, but clearly the two devices don't have the same generation. The lower-generation device should probably be booted/ignored or declared missing in the meantime to prevent trashing the file system.

The problem with generation numbers is when both devices get divergent generation numbers but we can't tell them apart, e.g.:

1. sda generation = 5, sdb generation = 5
2. sdb temporarily disconnects, so we are degraded on just sda
3. sda gets more generations 6..9
4. sda temporarily disconnects, so we have no disks at all
5. the machine reboots, gets sdb back but not sda

If we allow degraded here, then:

6. sdb gets more generations 6..9
7. sdb disconnects, no disks so no filesystem
8. the machine reboots again, this time with sda and sdb present

Now we have two disks with equal generation numbers. Generations 6..9 on sda are not the same as generations 6..9 on sdb, so if we mix the two disks' metadata we get bad confusion. It needs to be more than a sequential number. If one of the disks disappears we need to record this fact on the surviving disks, and also cope with _both_ disks claiming to be the surviving one.

I agree this is also a problem. But the most common case is where we know that the sda generation is newer (larger value) and most recently modified, and sdb has not since been modified but needs to be caught up.
As far as I know the only way to do that on Btrfs right now is a full balance; it doesn't catch up just by being reconnected with a normal mount.

Chris Murphy
Re: btrfs deduplication and linux cache management
On Mon, Nov 03, 2014 at 03:09:11PM +0100, LuVar wrote:

Thanks for the nice replicate-at-home-yourself example. On my machine it is behaving precisely like in yours:

root@blackdawn:/home/luvar# sync; sysctl vm.drop_caches=1
vm.drop_caches = 1
root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
real	0m6.768s
user	0m0.016s
sys	0m0.599s
root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
real	0m5.259s
user	0m0.018s
sys	0m0.695s
root@blackdawn:/home/luvar# time cat /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
real	0m0.701s
user	0m0.014s
sys	0m0.288s
root@blackdawn:/home/luvar# time cat /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img > /dev/null
real	0m0.286s
user	0m0.013s
sys	0m0.272s

If you don't mind my asking, is there any plan to optimize this behaviour? I know that btrfs is not like ZFS (a whole system from block device, through cache, to VFS), so would it be possible to implement such an optimization without a major patch in the Linux block cache/VFS cache?

I'd like to know this too. I think not any time soon, though. AIUI (I'm not really an expert here), the VFS cache is keyed on tuples of (device:inode, offset), so it has no way to cope with aliasing the same physical blocks through distinct inodes. It would have to learn about reference counting (so multiple inodes can refer to shared blocks, one inode can refer to the same blocks twice, etc.) and copy-on-write (so we can modify just one share of a shared-extent cache page). For compressed data caching, the filesystem would be volunteering references to blocks that were not asked for (e.g. unread portions of compressed extents).
It's not impossible to make those changes to the VFS cache, but the only filesystem on mainline Linux that would benefit is btrfs (ZFS is not on mainline Linux, the ZFS maintainers probably prefer to use their own cache layer anyway, and nobody else shares extents between files). For filesystems that don't share extents, adding the necessary stuff to VFS is a lot of extra overhead they will never use.

Back in the day, the Linux cache used to use tuples of (device, block_number), but this approach doesn't work on non-block filesystems like NFS, so it was dropped in favor of the inode+offset caching. A block-based scheme would handle shared extents but not compressed ones (e.g. you've got a 4K cacheable page that was compressed to 312 bytes somewhere in the middle of a 57K compressed data extent... what's that page's block number, again?).

Thanks, have a nice day,
--
LuVar

----- Zygo Blaxell zblax...@furryterror.org wrote:

On Thu, Oct 30, 2014 at 10:26:07AM +0100, lu...@plaintext.sk wrote:

Hi, I want to ask if deduplicated file content will be cached in the Linux kernel just once for two deduplicated files. To explain in depth:
- I use btrfs for the whole system, with a few subvolumes and some compression on some subvolumes.
- I have two directories with the Eclipse SDK with slight differences (same version, different config).
- I assume that the given directories are deduplicated, so the two Eclipse installations take as much space on the HDD as one would (rough estimation).
- I will start one of the given Eclipses.
- The Linux kernel will cache all files opened during the start of Eclipse (I have enough free RAM).
- I am just a happy stupid Linux user:
  1. will the kernel cache file content after decompression? (I think yes)
  2. will cached data be in the VFS layer or in the block device layer?

My guess based on behavior is the VFS layer. See below.

- When I launch the second Eclipse (different from the first, but deduplicated from it) after the first one:
  1. will the second start require less data to be read from HDD?

No.

  2. will the metadata for the second instance be read from HDD? (I assume yes)

Yes (how could it not?).

  3. will the actual data be read a second time? (I hope not)

Unfortunately, yes. This is my test:

1. Create a file full of compressible data that is big enough to take a few seconds to read from disk, but not too big to fit in RAM:

	yes $(date) | head -c 500m > a

2. Create a deduplicated (shared extent) copy of same:

	cp --reflink=always a b

(use filefrag -v to verify both files have the same physical extents)

3. Drop caches:

	sync; sysctl vm.drop_caches=1

4. Time reading both files with cold and hot cache:

	time cat a > /dev/null
	time cat b > /dev/null
	time cat a > /dev/null
	time cat b > /dev/null

Ideally, the first 'cat a' would load the file back from disk, so it will take a long
Re: filesystem corruption
Chris Murphy posted on Tue, 04 Nov 2014 11:28:39 -0700 as excerpted:

It needs to be more than a sequential number. If one of the disks disappears we need to record this fact on the surviving disks, and also cope with _both_ disks claiming to be the surviving one.

I agree this is also a problem. But the most common case is where we know that the sda generation is newer (larger value) and most recently modified, and sdb has not since been modified but needs to be caught up. As far as I know the only way to do that on Btrfs right now is a full balance; it doesn't catch up just by being reconnected with a normal mount.

I thought it was a scrub that would take care of that, not a balance? (Maybe do both to be sure?)

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman
Re: [PATCH 03/11] Btrfs-progs: allow fsck to take the tree bytenr
Josef Bacik jbacik at fb.com writes:

Sometimes we have a pretty corrupted fs but have an old tree bytenr that we could use; add the ability to specify the tree root bytenr. Thanks,

Signed-off-by: Josef Bacik jbacik at fb.com
Tested-by: Ansgar Hockmann-Stolle ansgar.hockmann-stolle at uni-osnabrueck.de

This patch fixed my case: http://www.spinics.net/lists/linux-btrfs/msg38714.html Thank you! And thank you, Qu Wenruo, for the help!! I tested all the blocks that find-root gave -- but only the one with generation = want + 1 gave some possible root items. btrfsck with --tree-root pointed at that block finally fixed my file system. Hooray!

Ciao Ansgar
Re: filesystem corruption
On 11/04/2014 10:28 AM, Chris Murphy wrote:

On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote:
Now we have two disks with equal generation numbers. Generations 6..9 on sda are not the same as generations 6..9 on sdb, so if we mix the two disks' metadata we get bad confusion. It needs to be more than a sequential number. If one of the disks disappears we need to record this fact on the surviving disks, and also cope with _both_ disks claiming to be the surviving one.

I agree this is also a problem. But the most common case is where we know that the sda generation is newer (larger value) and most recently modified, and sdb has not since been modified but needs to be caught up. As far as I know the only way to do that on Btrfs right now is a full balance; it doesn't catch up just by being reconnected with a normal mount.

I would think that any time any system or fraction thereof is mounted with both a degraded and rw status, a degraded flag should be set somewhere/somehow in the superblock etc. The only way to clear this flag would be to reach a reconciled state. That state could be reached in one of several ways: removing the missing mirror element would be a fast reconcile; doing a balance or scrub would be a slow reconcile for a filesystem where all the media are returned to service (e.g. the missing volume of a RAID 1 etc. is returned).

Generation numbers are pretty good, but I'd put on a rider that any generation number or equivalent incremented while the system is degraded should have a unique quantum (say a GUID) generated and stored along with the generation number. The mere existence of this quantum would act as the degraded flag. Any check/compare/access related to the generation number would know to notice that the GUID is in place and do the necessary resolution. If successful, the GUID would be discarded. As to how this could be implemented, I'm not fully conversant with the internal layout.
One possibility would be to add a block reference, or indeed replace the current storage for generation numbers completely with a block reference to a block containing the generation number and the potential GUID. The main value of having an out-of-structure reference is that its content is less space constrained, and it could be shared by multiple usages. In the case, for instance, where the block is added (as opposed to replacing the generation number), only one such block would be needed per degraded,rw mount, and it could be attached to as many filesystem structures as needed.

Just as metadata under DUP is divergent after a degraded mount, a generation block would be divergent, and likely in a different location than its peers on a subsequently restored geometry. A generation block could have other niceties like the date/time and the devices present (or absent); such information could conceivably be used to intelligently disambiguate references. For instance, if one degraded mount had sda and sdb, and a second had sdb and sdc, then it'd be known that sdb was dominant for having been present every time.
Re: filesystem corruption
On Tue, Nov 04, 2014 at 11:28:39AM -0700, Chris Murphy wrote: On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote: It needs to be more than a sequential number. If one of the disks disappears we need to record this fact on the surviving disks, and also cope with _both_ disks claiming to be the surviving one. I agree this is also a problem. But the most common case is where we know that the sda generation is newer (larger value) and most recently modified, and sdb has not since been modified but needs to be caught up. As far as I know the only way to do that on Btrfs right now is a full balance; it doesn't catch up just by being reconnected with a normal mount. The data on the disks might be inconsistent, so resynchronization must read from only the good copy. A balance could just spread corruption around if it reads from two out-of-sync mirrors. (Maybe it already does the right thing if sdb was not modified...?) The full resync operation is more like btrfs device replace, except that it's replacing a disk in-place (i.e. without removing it first), and it would not read from the non-good disk. signature.asc Description: Digital signature
[SOLVED] btrfs unmountable: read block failed check_tree_block; Couldn't read tree root
For the solution see http://article.gmane.org/gmane.comp.file-systems.btrfs/39974

On 28.10.14 at 00:03, Ansgar Hockmann-Stolle wrote: On 27.10.14 at 14:23, Ansgar Hockmann-Stolle wrote: Hi! My btrfs system partition went readonly. After a reboot it doesn't mount anymore. The system was openSUSE 13.1 Tumbleweed (kernel 3.17.??). Now I'm on openSUSE 13.2-RC1 rescue (kernel 3.16.3). I dumped (dd) the whole 250 GB SSD to a USB file and tried some btrfs tools on another copy via a loopback device. But everything failed with: kernel: BTRFS: failed to read tree root on dm-2 See http://pastebin.com/raw.php?i=dPnU6nzg. Any hints where to go from here?

After an offlist hint (thanks Tom!) I compiled the latest btrfs-progs 3.17 and tried some more ...

linux:~/bin # ./btrfs --version
Btrfs v3.17
linux:~/bin # ./btrfs-find-root /dev/sda3
Super think's the tree root is at 1015238656, chunk root 20971520
Well block 239718400 seems great, but generation doesn't match, have=661931, want=663595 level 0
Well block 239722496 seems great, but generation doesn't match, have=661931, want=663595 level 0
Well block 320098304 seems great, but generation doesn't match, have=662233, want=663595 level 0
Well block 879341568 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879345664 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879382528 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879398912 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879403008 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879423488 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 879435776 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 880095232 seems great, but generation doesn't match, have=663227, want=663595 level 0
Well block 881504256 seems great, but generation doesn't match, have=663228, want=663595 level 0
Well block 881512448 seems great, but generation doesn't match, have=663228, want=663595 level 0
Well block 936271872 seems great, but generation doesn't match, have=663397, want=663595 level 0
Well block 1004490752 seems great, but generation doesn't match, have=663571, want=663595 level 0
Well block 1007804416 seems great, but generation doesn't match, have=663572, want=663595 level 0
Well block 1012031488 seems great, but generation doesn't match, have=663575, want=663595 level 0
Well block 1012396032 seems great, but generation doesn't match, have=663575, want=663595 level 0
Well block 1012633600 seems great, but generation doesn't match, have=663586, want=663595 level 0
Well block 1012871168 seems great, but generation doesn't match, have=663585, want=663595 level 0
Well block 1015201792 seems great, but generation doesn't match, have=663588, want=663595 level 0
Well block 1015836672 seems great, but generation doesn't match, have=663596, want=663595 level 1
Well block 44132536320 seems great, but generation doesn't match, have=658774, want=663595 level 0
Well block 44178280448 seems great, but generation doesn't match, have=658774, want=663595 level 0
Well block 87443644416 seems great, but generation doesn't match, have=661349, want=663595 level 0
Well block 87514079232 seems great, but generation doesn't match, have=651051, want=663595 level 0
Well block 87517679616 seems great, but generation doesn't match, have=661349, want=663595 level 0
Well block 98697822208 seems great, but generation doesn't match, have=643548, want=663595 level 0
Well block 103285026816 seems great, but generation doesn't match, have=661672, want=663595 level 0
Well block 103309553664 seems great, but generation doesn't match, have=661674, want=663595 level 0
Well block 103523430400 seems great, but generation doesn't match, have=661767, want=663595 level 0
No more metdata to scan, exiting

This line I found interesting, because have is want + 1:
Well block 1015836672 seems great, but generation doesn't match, have=663596, want=663595 level 1

And here is the tail of btrfs rescue chunk-recover (full output at http://pastebin.com/raw.php?i=1D5VgDxv):
[..]
Total Chunks: 234
Healthy: 231
Bad: 3
Orphan Block Groups:
Orphan Device Extents:
Couldn't map the block 1015238656
btrfs: volumes.c:1140: btrfs_num_copies: Assertion `!(ce->start > logical || ce->start + ce->size < logical)' failed.
Aborted

Sadly, btrfs check --repair keeps refusing to do its job.
linux:~ # btrfs check --repair /dev/sda3
enabling repair mode
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
read block failed check_tree_block
Couldn't read tree root
Checking filesystem on /dev/sda3
UUID: 1af256b5-b1ad-443b-aeee-a6853e70b7e2
Re: Kernel crash during btrfs device delete on raid6 volume
On Tue, Nov 04, 2014 at 10:58:48AM -0500, Chris Mason wrote: Not a new bug unfortunately, but since it is in the error handling people must not be hitting it often. It's also not related to device replace.

	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
		list_del(&sums->list);
		kfree(sums);
	}

We're trying to call kfree on the on-stack list head. I'm fixing it up here, thanks for posting the oops! Fix attached, or you can wait for the next rc. Thanks. -chris

From 6e5aafb27419f32575b27ef9d6a31e5d54661aca Mon Sep 17 00:00:00 2001
From: Chris Mason c...@fb.com
Date: Tue, 4 Nov 2014 06:59:04 -0800
Subject: [PATCH] Btrfs: fix kfree on list_head in btrfs_lookup_csums_range error cleanup

If we hit any errors in btrfs_lookup_csums_range, we'll loop through all the csums we allocate and free them. But the code was using list_entry incorrectly, and ended up trying to free the on-stack list_head instead. This bug came from commit 0678b6185

Wow, that's an old commit! Thanks for the CC. The fix looks good to me, so you can add: Reviewed-by: Mark Fasheh mfas...@suse.de if you like, thanks. --Mark

-- Mark Fasheh