Re: filesystem corruption

2014-11-04 Thread Duncan
Zygo Blaxell posted on Mon, 03 Nov 2014 23:31:45 -0500 as excerpted:

 On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
 
 On Nov 2, 2014, at 8:43 PM, Zygo Blaxell zblax...@furryterror.org
 wrote:
  btrfs seems to assume the data is correct on both disks (the
  generation numbers and checksums are OK) but gets confused by equally
  plausible but different metadata on each disk.  It doesn't take long
  before the filesystem becomes data soup or crashes the kernel.
 
 This is a pretty significant problem to still be present, honestly. I
 can understand the catchup mechanism is probably not built yet,
 but clearly the two devices don't have the same generation. The lower
 generation device should probably be booted/ignored or declared missing
 in the meantime to prevent trashing the file system.
 
 The problem with generation numbers is when both devices get divergent
 generation numbers but we can't tell them apart

[snip very reasonable scenario]

 Now we have two disks with equal generation numbers. 
 Generations 6..9 on sda are not the same as generations 6..9 on sdb, so
 if we mix the two disks' metadata we get bad confusion.
 
 It needs to be more than a sequential number.  If one of the disks
 disappears we need to record this fact on the surviving disks, and also
 cope with _both_ disks claiming to be the surviving one.

Zygo's absolutely correct.  There is an existing catchup mechanism, but 
the tracking is /purely/ sequential generation number based, and if the 
two generation sequences diverge, Welcome to the (data) Twilight Zone!

I noted this in my own early pre-deployment raid1 mode testing as well, 
except that I didn't at that point know about sequence numbers and never 
got as far as letting the filesystem make data soup of itself.

What I did was this:

1) Create a two-device raid1 data and metadata filesystem, mount it and 
stick some data on it.

2) Unmount, pull a device, mount degraded the remaining device.

3) Change a file.

4) Unmount, switch devices, mount degraded the other device.

5) Change the same file in a different/incompatible way.

6) Unmount, plug both devices in again, mount (not degraded).

7) Wait for the sync I was used to from mdraid, which of course didn't 
occur.

8) Check the file to see which version showed up.  I don't recall which 
version it was, but it wasn't the common pre-change version.

9) Unmount, pull each device one at a time, mounting the other one 
degraded and checking the file again.

10) The file on each device remained different, without a warning or 
indication of any problem at all when I mounted undegraded in steps 6/7.
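
For anyone who wants to reproduce this without sacrificing real hardware, 
the same sequence can be driven with loop devices -- a rough sketch, not a 
canned test (names, sizes and mount points are illustrative, and a btrfs 
device scan may be needed after re-attaching an image):

  truncate -s 2G img0 img1
  DEV0=$(losetup -f --show img0); DEV1=$(losetup -f --show img1)
  mkfs.btrfs -d raid1 -m raid1 "$DEV0" "$DEV1"                    # step 1
  mount "$DEV0" /mnt; echo v0 > /mnt/f; umount /mnt
  losetup -d "$DEV1"                                              # step 2: "pull" a device
  mount -o degraded "$DEV0" /mnt; echo v1a > /mnt/f; umount /mnt  # step 3
  losetup -d "$DEV0"; DEV1=$(losetup -f --show img1)              # step 4: switch devices
  mount -o degraded "$DEV1" /mnt; echo v1b > /mnt/f; umount /mnt  # step 5
  DEV0=$(losetup -f --show img0); btrfs device scan               # step 6: both back
  mount "$DEV0" /mnt; cat /mnt/f; umount /mnt                     # steps 7-8: no sync, no warning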

Had I initiated a scrub, presumably it would have seen the difference and 
if one was a newer generation, it would have taken it, overwriting the 
other.  I don't know what it would have done if both were the same 
generation, tho with the file being small (just a few-line text file, big 
enough to test the effect of differing edits), I guess it would take one
version or the other.  If the file was large enough to be multiple 
extents, however, I've no idea whether it'd take one or the other, or 
possibly combine the two, picking extents where they differed more or 
less randomly.

By that time the lack of any warning, and the failure to resolve 
absolutely to one version or the other even after mounting undegraded and 
accessing the file with incompatible versions on each of the two devices, 
was bothering me sufficiently that I didn't test any further.

With just me to worry about (unlike a multi-admin corporate scenario 
where you can never be /sure/ what the other admins will do regardless of 
agreed procedure), I simply set myself a set of rules very similar to 
what Zygo proposed:

1) If for whatever reason I ever split a btrfs raid1 with the intent, or 
even the possibility, of bringing the pieces back together again: if at 
all possible, never mount the split pieces writable -- mount read-only.

2) If a writable mount is required, keep the writable mounts to one 
device of the split.  As long as the other device is never mounted 
writable, it will have an older generation when they're reunited and a 
scrub should take care of things, reliably resolving to the updated 
written device, rewriting the older generation on the other device.

What I'd do here is physically put the removed side of the raid1 in 
storage, far enough from the remaining side that I couldn't possibly get 
them mixed up.  I'd clearly label it as well, creating a defense in 
depth of at least two, the labeling and the physical separation and 
storage of the read-only device.

3) If for whatever reason the originally read-only side must be mounted 
writable, very clearly mark the originally mounted-writable device 
POISONED/TOXIC!!  *NEVER* *EVER* let such a POISONED device anywhere near 
its original raid1 mate, until it is wiped, such that there's no 
possibility of btrfs getting confused and contaminated with the poisoned 
data.
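
In command form, the three rules boil down to something like this (a 
sketch; the device names are illustrative, and note that per the testing 
above even scrub's conflict resolution shouldn't be blindly trusted):

  # Rule 1: inspect a split half without ever writing to it
  mount -o ro,degraded /dev/sda /mnt

  # Rule 2: if writes are unavoidable, confine them to ONE half only,
  # then resolve explicitly once the pair is reunited
  mount -o degraded /dev/sda /mnt
  # ... later, with both devices present again:
  btrfs scrub start -B /mnt

  # Rule 3: a half that was written independently is POISONED; wipe it
  # before it may rejoin its old mate
  wipefs -a /dev/sdb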

Given how unimpressed I was 

Re: [PATCH] Btrfs: don't take the chunk_mutex/dev_list mutex in statfs V2

2014-11-04 Thread Miao Xie
On Mon, 3 Nov 2014 08:56:50 -0500, Josef Bacik wrote:
 Our gluster boxes get several thousand statfs() calls per second, which begins
 to suck hardcore with all of the lock contention on the chunk mutex and dev
 list mutex.  We don't really need to hold these things, if we have transient
 weirdness with statfs() because of the chunk allocator we don't care, so
 remove this locking.
 
 We still need the dev_list lock if you mount with -o alloc_start however,
 which is a good argument for nuking that thing from orbit, but that's a patch
 for another day.  Thanks,
 
 Signed-off-by: Josef Bacik jba...@fb.com
 ---
 V1->V2: make sure ->alloc_start is set before doing the dev extent lookup
 logic.

I find it strange that we would need the dev_list lock if we mount with -o
alloc_start. AFAIK, ->alloc_start is protected by the chunk_mutex.

But I think we needn't care if someone changes ->alloc_start; in other words,
we needn't take the chunk_mutex during the whole process. The following
interleaving can be tolerated by the users, I think.

Task1					Task2
statfs
  mutex_lock(&fs_info->chunk_mutex);
  tmp = fs_info->alloc_start;
  mutex_unlock(&fs_info->chunk_mutex);
  btrfs_calc_avail_data_space(fs_info, tmp)
					...
					mount -o
					remount,alloc_start=
					...
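
As an aside, the statfs() load Josef describes is easy to approximate when
testing a change like this (a sketch; the mount point is illustrative):

  # each stat -f issues one statfs() call against the given filesystem;
  # run a few of these in parallel while the allocator is busy
  for i in $(seq 1 100000); do stat -f /mnt > /dev/null; done &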

Thanks
Miao

 
  fs/btrfs/super.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++-------------------------
 
  1 file changed, 47 insertions(+), 25 deletions(-)
 
 diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
 index 54bd91e..dc337d1 100644
 --- a/fs/btrfs/super.c
 +++ b/fs/btrfs/super.c
 @@ -1644,8 +1644,20 @@ static int btrfs_calc_avail_data_space(struct 
 btrfs_root *root, u64 *free_bytes)
   int i = 0, nr_devices;
   int ret;
  
 + /*
 +  * We aren't under the device list lock, so this is racey-ish, but good
 +  * enough for our purposes.
 +  */
   nr_devices = fs_info->fs_devices->open_devices;
 - BUG_ON(!nr_devices);
 + if (!nr_devices) {
 + smp_mb();
 + nr_devices = fs_info->fs_devices->open_devices;
 + ASSERT(nr_devices);
 + if (!nr_devices) {
 + *free_bytes = 0;
 + return 0;
 + }
 + }
  
   devices_info = kmalloc_array(nr_devices, sizeof(*devices_info),
  GFP_NOFS);
 @@ -1670,11 +1682,17 @@ static int btrfs_calc_avail_data_space(struct 
 btrfs_root *root, u64 *free_bytes)
   else
   min_stripe_size = BTRFS_STRIPE_LEN;
  
 - list_for_each_entry(device, &fs_devices->devices, dev_list) {
 + if (fs_info->alloc_start)
 + mutex_lock(&fs_devices->device_list_mutex);
 + rcu_read_lock();
 + list_for_each_entry_rcu(device, &fs_devices->devices, dev_list) {
   if (!device->in_fs_metadata || !device->bdev ||
   device->is_tgtdev_for_dev_replace)
   continue;
 
 + if (i >= nr_devices)
 + break;
 +
   avail_space = device->total_bytes - device->bytes_used;
  
   /* align with stripe_len */
 @@ -1689,24 +1707,32 @@ static int btrfs_calc_avail_data_space(struct 
 btrfs_root *root, u64 *free_bytes)
   skip_space = 1024 * 1024;
  
   /* user can set the offset in fs_info->alloc_start. */
 - if (fs_info->alloc_start + BTRFS_STRIPE_LEN <=
 - device->total_bytes)
 + if (fs_info->alloc_start &&
 + fs_info->alloc_start + BTRFS_STRIPE_LEN <=
 + device->total_bytes) {
 + rcu_read_unlock();
   skip_space = max(fs_info->alloc_start, skip_space);
  
 - /*
 -  * btrfs can not use the free space in [0, skip_space - 1],
 -  * we must subtract it from the total. In order to implement
 -  * it, we account the used space in this range first.
 -  */
 - ret = btrfs_account_dev_extents_size(device, 0, skip_space - 1,
 -  &used_space);
 - if (ret) {
 - kfree(devices_info);
 - return ret;
 - }
 + /*
 +  * btrfs can not use the free space in
 +  * [0, skip_space - 1], we must subtract it from the
 +  * total. In order to implement it, we account the used
 +  * space in this range first.
 +  */
 + ret = btrfs_account_dev_extents_size(device, 0,
 +  skip_space - 1,
 +  &used_space);
 + if (ret) {
 + kfree(devices_info);
 + 

BTRFS Quota Display Tool

2014-11-04 Thread stephan008
Hello

I'm looking for a web-based tool for displaying quotas for btrfs volumes. 
Something which gets its data from btrfsQuota.py and displays nice bars on a 
web frontend, so we can see how much space a user is consuming right now. 

Although btrfs does not support per-user quotas, we made a workaround: we 
separate each user's files into their own subvolume. This way the user's 
quota can be determined.
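
For reference, that workaround is just native qgroups applied to per-user 
subvolumes -- a sketch; the paths and limits are illustrative:

  btrfs quota enable /srv/data
  btrfs subvolume create /srv/data/alice     # one subvolume per user
  btrfs qgroup limit 20G /srv/data/alice     # per-subvolume cap
  btrfs qgroup show /srv/data                # raw numbers a web frontend could render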

A lot of additional features would come in handy, such as an email 
notification when a user gets near his quota limit, or another notification 
when he hits it. On some of our hosting deployments this is taken care of 
nicely by ISPconfig (hard/soft limits can easily be configured in the control 
panel), repquota (displays quota usage statistics from the command line) and 
warnquota (able to generate the email alerts I mentioned).

Please, if you know of any existing (preferably PHP-based) framework for 
this, let me know, so I don't have to develop it from scratch.

Thanks


Re: Kernel crash during btrfs device delete on raid6 volume

2014-11-04 Thread Chris Mason
On Tue, Nov 4, 2014 at 9:36 AM, Erik Berg bt...@slipsprogrammoer.no 
wrote:
Pulled the latest btrfs-progs from kdave (v3.17-12-gcafacda) and 
using the latest linux release candidate (3.18.0-031800rc3-generic) 
from canonical/ubuntu


btrfs fi show
Label: none  uuid: 5c5fea06-0319-4e03-a42e-004e64aeed92
Total devices 9 FS bytes used 10.91TiB
	devid    2 size 931.48GiB used 928.02GiB path /dev/sdc1
	devid    3 size 931.48GiB used 928.02GiB path /dev/sdd1
	devid    4 size 1.82TiB used 1.67TiB path /dev/sde1
	devid    5 size 2.73TiB used 2.28TiB path /dev/sdf1
	devid    6 size 3.64TiB used 2.73TiB path /dev/sdg1
	devid    7 size 3.64TiB used 2.73TiB path /dev/sdh1
	devid    8 size 931.46GiB used 655.90GiB path /dev/sdb1
	devid    9 size 3.64TiB used 2.73TiB path /dev/sdi1
	devid   10 size 3.64TiB used 1.79TiB path /dev/sdj1

btrfs fi df
Data, RAID6: total=10.91TiB, used=10.90TiB
System, RAID6: total=96.00MiB, used=800.00KiB
Metadata, RAID6: total=13.23GiB, used=11.79GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

Trying to remove device sdb1, the kernel crashes after a minute or so.

[  597.576827] ------------[ cut here ]------------
[  597.617519] kernel BUG at /home/apw/COD/linux/mm/slub.c:3334!
[  597.668145] invalid opcode: 0000 [#1] SMP
[  597.704410] Modules linked in: arc4 md4 ipt_MASQUERADE 
nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT 
nf_reject_ipv4 xt_CHECKSUM iptable_mangle xt_tcpudp bridge stp llc 
ip6table_filter ip6_tables iptable_filter ip_tables ebtable_nat 
ebtables x_tables gpio_ich intel_rapl x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel cryptd serio_raw hpilo hpwdt 8250_fintek 
acpi_power_meter ie31200_edac lpc_ich edac_core ipmi_si 
ipmi_msghandler mac_hid lp parport nls_utf8 cifs fscache hid_generic 
usbhid hid btrfs xor raid6_pq uas usb_storage tg3 ptp ahci psmouse 
libahci pps_core hpsa
[  598.268179] CPU: 1 PID: 129 Comm: kworker/u128:3 Not tainted 
3.18.0-031800rc3-generic #201411022335
[  598.349925] Hardware name: HP ProLiant MicroServer Gen8, BIOS J06 
11/09/2013
[  598.413231] Workqueue: writeback bdi_writeback_workfn 
(flush-btrfs-2)
[  598.471103] task: ffff8803f16a3c00 ti: ffff880036b70000 task.ti: 
ffff880036b70000
[  598.538393] RIP: 0010:[<ffffffff811c74fd>]  [<ffffffff811c74fd>] 
kfree+0x16d/0x170

[  598.606217] RSP: 0018:ffff880036b73528  EFLAGS: 00010246
[  598.653844] RAX: 0000000000000100 RBX: ffff880036b735c8 RCX: 
0000000000000000
[  598.717899] RDX: ffff8803743a6010 RSI: dead000000100100 RDI: 
ffff880036b735c8
[  598.781662] RBP: ffff880036b73558 R08: 0000000000000000 R09: 
ffffea0000dadcc0
[  598.846028] R10: 0000000000000001 R11: 0000000000000010 R12: 
ffff8803f1e09800
[  598.910713] R13: ffff8803ac757d40 R14: ffffffffc04fed0c R15: 
ffff880036b735d8
[  598.975333] FS:  0000000000000000(0000) GS:ffff88040b420000(0000) 
knlGS:0000000000000000

[  599.048512] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  599.100167] CR2: 00007fa9a3854024 CR3: 0000000001c16000 CR4: 
00000000001407e0

[  599.165150] Stack:
[  599.183305]  ffff8803f1e09800 000000000dad07c2 ffff8803f1e09800 
ffff8803ac757d40
[  599.249603]  ffff8803ac757d40 ffff880036b735d8 ffff880036b73618 
ffffffffc04fed0c
[  599.316306]  ffff8803f1b86b00 ffff880374338000 000000000dad07dc 
ffff880036b73638

[  599.383404] Call Trace:
[  599.405429]  [<ffffffffc04fed0c>] 
btrfs_lookup_csums_range+0x2ac/0x4a0 [btrfs]

Not a new bug unfortunately, but since it is in the error handling 
people must not be hitting it often.  It's also not related to device 
replace.



	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);

		list_del(&sums->list);
		kfree(sums);
	}

We're trying to call kfree on the on-stack list head.  I'm fixing it up 
here, thanks for posting the oops!


-chris





Re: Kernel crash during btrfs device delete on raid6 volume

2014-11-04 Thread Chris Mason

On Tue, Nov 4, 2014 at 9:55 AM, Chris Mason c...@fb.com wrote:
On Tue, Nov 4, 2014 at 9:36 AM, Erik Berg bt...@slipsprogrammoer.no 
wrote:
Pulled the latest btrfs-progs from kdave (v3.17-12-gcafacda) and 
using the latest linux release candidate (3.18.0-031800rc3-generic) 
from canonical/ubuntu


Trying to remove device sdb1, the kernel crashes after a minute or 
so.


[snip oops, quoted in full above]


Not a new bug unfortunately, but since it is in the error handling 
people must not be hitting it often.  It's also not related to device 
replace.



	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);

		list_del(&sums->list);
		kfree(sums);
	}

We're trying to call kfree on the on-stack list head.  I'm fixing it 
up here, thanks for posting the oops!


Fix attached, or you can wait for the next rc.  Thanks.

-chris


From 6e5aafb27419f32575b27ef9d6a31e5d54661aca Mon Sep 17 00:00:00 2001
From: Chris Mason c...@fb.com
Date: Tue, 4 Nov 2014 06:59:04 -0800
Subject: [PATCH] Btrfs: fix kfree on list_head in btrfs_lookup_csums_range
 error cleanup

If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
the csums we allocate and free them.  But the code was using list_entry
incorrectly, and ended up trying to free the on-stack list_head instead.

This bug came from commit 0678b6185

btrfs: Don't BUG_ON kzalloc error in btrfs_lookup_csums_range()

Signed-off-by: Chris Mason c...@fb.com
Reported-by: Erik Berg bt...@slipsprogrammoer.no
cc: sta...@vger.kernel.org # 3.3 or newer
---
 fs/btrfs/file-item.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/file-item.c b/fs/btrfs/file-item.c
index 783a943..84a2d18 100644
--- a/fs/btrfs/file-item.c
+++ b/fs/btrfs/file-item.c
@@ -413,7 +413,7 @@ int btrfs_lookup_csums_range(struct btrfs_root *root, u64 start, u64 end,
 	ret = 0;
 fail:
 	while (ret < 0 && !list_empty(&tmplist)) {
-		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
+		sums = list_entry(tmplist.next, struct btrfs_ordered_sum, list);
 		list_del(&sums->list);
 		kfree(sums);
 	}
 	}
-- 
1.8.1



[PATCH] btrfs-progs: use the correct SI prefixes

2014-11-04 Thread David Sterba
The SI standard defines lowercase 'k' and uppercase for the rest.

Signed-off-by: David Sterba dste...@suse.cz
---
 Documentation/btrfs-filesystem.txt | 6 +++---
 cmds-filesystem.c  | 8 
 utils.c| 2 +-
 3 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/Documentation/btrfs-filesystem.txt 
b/Documentation/btrfs-filesystem.txt
index 6e63d2c9c2ae..a8f2972a0e1a 100644
--- a/Documentation/btrfs-filesystem.txt
+++ b/Documentation/btrfs-filesystem.txt
@@ -35,11 +35,11 @@ select the 1000 base for the following options, according 
to the SI standard
 -k|--kbytes
 show sizes in KiB, or kB with --si
 -m|--mbytes
-show sizes in MiB, or mB with --si
+show sizes in MiB, or MB with --si
 -g|--gbytes
-show sizes in GiB, or gB with --si
+show sizes in GiB, or GB with --si
 -t|--tbytes
-show sizes in TiB, or tB with --si
+show sizes in TiB, or TB with --si
 
 If conflicting options are passed, the last one takes precedence.
 
diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index af56fbeb48ed..e4b278590ca6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -128,11 +128,11 @@ static const char * const cmd_df_usage[] = {
	"-h             human friendly numbers, base 1024 (default)",
	"-H             human friendly numbers, base 1000",
	"--iec          use 1024 as a base (KiB, MiB, GiB, TiB)",
-	"--si           use 1000 as a base (kB, mB, gB, tB)",
+	"--si           use 1000 as a base (kB, MB, GB, TB)",
	"-k|--kbytes    show sizes in KiB, or kB with --si",
-	"-m|--mbytes    show sizes in MiB, or mB with --si",
-	"-g|--gbytes    show sizes in GiB, or gB with --si",
-	"-t|--tbytes    show sizes in TiB, or tB with --si",
+	"-m|--mbytes    show sizes in MiB, or MB with --si",
+	"-g|--gbytes    show sizes in GiB, or GB with --si",
+	"-t|--tbytes    show sizes in TiB, or TB with --si",
	NULL
 };
 
diff --git a/utils.c b/utils.c
index f51bc564d8f1..4b3bace4433a 100644
--- a/utils.c
+++ b/utils.c
@@ -1328,7 +1328,7 @@ out:
 static const char* unit_suffix_binary[] =
	{ "B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"};
 static const char* unit_suffix_decimal[] =
-	{ "B", "kB", "mB", "gB", "tB", "pB", "eB"};
+	{ "B", "kB", "MB", "GB", "TB", "PB", "EB"};
 
 int pretty_size_snprintf(u64 size, char *str, size_t str_size, unsigned 
unit_mode)
 {
-- 
2.1.1
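
For reference, the flags this patch touches can be exercised like so (a
sketch; the mount point is illustrative):

  btrfs filesystem df --si /mnt     # decimal base: kB, MB, GB, TB after this patch
  btrfs filesystem df --iec /mnt    # binary base: KiB, MiB, GiB, TiB (unchanged)
  btrfs filesystem df --si -m /mnt  # force megabytes with the decimal base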



Re: filesystem corruption

2014-11-04 Thread Chris Murphy

On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote:

 On Mon, Nov 03, 2014 at 10:11:18AM -0700, Chris Murphy wrote:
 
 On Nov 2, 2014, at 8:43 PM, Zygo Blaxell zblax...@furryterror.org wrote:
 btrfs seems to assume the data is correct on both disks (the generation
 numbers and checksums are OK) but gets confused by equally plausible but
 different metadata on each disk.  It doesn't take long before the
 filesystem becomes data soup or crashes the kernel.
 
 This is a pretty significant problem to still be present, honestly. I
 can understand the catchup mechanism is probably not built yet,
 but clearly the two devices don't have the same generation. The lower
 generation device should probably be booted/ignored or declared missing
 in the meantime to prevent trashing the file system.
 
 The problem with generation numbers is when both devices get divergent
 generation numbers but we can't tell them apart, e.g.
 
   1.  sda generation = 5, sdb generation = 5
 
   2.  sdb temporarily disconnects, so we are degraded on just sda
 
   3.  sda gets more generations 6..9
 
   4.  sda temporarily disconnects, so we have no disks at all.
 
   5.  the machine reboots, gets sdb back but not sda
 
 If we allow degraded here, then:
 
   6.  sdb gets more generations 6..9
 
   7.  sdb disconnects, no disks so no filesystem
 
   8.  the machine reboots again, this time with sda and sdb present
 
 Now we have two disks with equal generation numbers.  Generations 6..9
 on sda are not the same as generations 6..9 on sdb, so if we mix the
 two disks' metadata we get bad confusion.
 
 It needs to be more than a sequential number.  If one of the disks
 disappears we need to record this fact on the surviving disks, and also
 cope with _both_ disks claiming to be the surviving one.

I agree this is also a problem. But the most common case is where we know that 
sda's generation is newer (larger value) and most recently modified, and sdb 
has not since been modified but needs to be caught up. As far as I know the 
only way to do that on Btrfs right now is a full balance; it doesn't catch up 
just by being reconnected with a normal mount.


Chris Murphy


Re: btrfs deduplication and linux cache management

2014-11-04 Thread Zygo Blaxell
On Mon, Nov 03, 2014 at 03:09:11PM +0100, LuVar wrote:
 Thanks for the nice replicate-at-home-yourself example. On my machine it is 
 behaving precisely like in yours:
 
 code
 root@blackdawn:/home/luvar# sync; sysctl vm.drop_caches=1
 vm.drop_caches = 1
 root@blackdawn:/home/luvar# time cat 
 /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img
  > /dev/null 
 real	0m6.768s
 user	0m0.016s
 sys	0m0.599s
 
 root@blackdawn:/home/luvar# time cat 
 /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img
  > /dev/null 
 real	0m5.259s
 user	0m0.018s
 sys	0m0.695s
 
 root@blackdawn:/home/luvar# time cat 
 /home/luvar/programs/adt-bundle-linux/sdk/system-images/android-L/default/armeabi-v7a/userdata.img
  > /dev/null 
 real	0m0.701s
 user	0m0.014s
 sys	0m0.288s
 
 root@blackdawn:/home/luvar# time cat 
 /home/luvar/programs/android-sdk-linux/system-images/android-L/default/armeabi-v7a/userdata.img
  > /dev/null
 real	0m0.286s
 user	0m0.013s
 sys	0m0.272s
 /code
 
 If you don't mind my asking, is there any plan to optimize this
 behaviour? I know that btrfs is not like ZFS (a whole system from
 blockdevice, through cache, to VFS), so would it be possible to implement
 such an optimization without a major patch to the Linux block cache/VFS
 cache?

I'd like to know this too.  I think not any time soon though.

AIUI (I'm not really an expert here), the VFS cache is keyed on tuples of
(device:inode, offset), so it has no way to cope with aliasing the same
physical blocks through distinct inodes.  It would have to learn about
reference counting (so multiple inodes can refer to shared blocks, one
inode can refer to the same blocks twice, etc) and copy-on-write (so we
can modify just one share of a shared-extent cache page).  For compressed
data caching, the filesystem would be volunteering references to blocks
that were not asked for (e.g.  unread portions of compressed extents).

It's not impossible to make those changes to the VFS cache, but the
only filesystem on mainline Linux that would benefit is btrfs (ZFS is
not on mainline Linux, the ZFS maintainers probably prefer to use their
own cache layer anyway, and nobody else shares extents between files).
For filesystems that don't share extents, adding the necessary stuff to
VFS is a lot of extra overhead they will never use.

Back in the day, the Linux cache used to use tuples of (device,
block_number), but this approach doesn't work on non-block filesystems
like NFS, so it was dropped in favor of the inode+offset caching.
A block-based scheme would handle shared extents but not compressed ones
(e.g. you've got a 4K cacheable page that was compressed to 312 bytes
somewhere in the middle of a 57K compressed data extent...what's that
page's block number, again?).

 Thanks, have a nice day,
 --
 LuVar
 
 
 - Zygo Blaxell zblax...@furryterror.org wrote:
 
  On Thu, Oct 30, 2014 at 10:26:07AM +0100, lu...@plaintext.sk wrote:
   Hi,
   I want to ask, if deduplicated file content will be cached in linux
  kernel just once for two deduplicated files.
   
   To explain in deep:
- I use btrfs for whole system with few subvolumes with some
  compression on some subvolumes.
- I have two directories with eclipse SDK with slightly differences
  (same version, different config)
- I assume that given directories is deduplicated and so two
  eclipse installations take place on hdd like one would (in rough
  estimation)
- I will start one of given eclipse
- linux kernel will cache all opened files during start of eclipse
  (I have enough free ram)
- I am just happy stupid linux user:
   1. will kernel cache file content after decompression? (I think
  yes)
   2. cached data will be in VFS layer or in block device layer?
  
  My guess based on behavior is the VFS layer.  See below.
  
  - When I launch the second eclipse (different from the first, but
  deduplicated from the first) after the first one:
   1. will second start require less data to be read from HDD?
  
  No.
  
   2. will be metadata for second instance read from hdd? (I assume
  yes)
  
  Yes (how could it not?).
  
   3. will be actual data read second time? (I hope not)
  
  Unfortunately, yes.
  
  This is my test:
  
  1.  Create a file full of compressible data that is big enough to
  take
  a few seconds to read from disk, but not too big to fit in RAM:
  
  yes $(date) | head -c 500m > a
  
  2.  Create a deduplicated (shared extent) copy of same:
  
  cp --reflink=always a b
  
  (use filefrag -v to verify both files have same physical extents)
  
  3.  Drop caches
  
  sync; sysctl vm.drop_caches=1
  
  4.  Time reading both files with cold and hot cache:
  
  time cat a > /dev/null
  time cat b > /dev/null
  time cat a > /dev/null
  time cat b > /dev/null
  
  Ideally, the first 'cat a' would load the file back from disk, so it
  will take a long 

Re: filesystem corruption

2014-11-04 Thread Duncan
Chris Murphy posted on Tue, 04 Nov 2014 11:28:39 -0700 as excerpted:

 It needs to be more than a sequential number.  If one of the disks
 disappears we need to record this fact on the surviving disks, and also
 cope with _both_ disks claiming to be the surviving one.
 
 I agree this is also a problem. But the most common case is where we
 know that sda generation is newer (larger value) and most recently
 modified, and sdb has not since been modified but needs to be caught up.
 As far as I know the only way to do that on Btrfs right now is a full
 balance; it doesn't catch up just by being reconnected with a normal
 mount.

I thought it was a scrub that would take care of that, not a balance?

(Maybe do both to be sure?)

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: [PATCH 03/11] Btrfs-progs: allow fsck to take the tree bytenr

2014-11-04 Thread Ansgar Hockmann-Stolle
Josef Bacik jbacik at fb.com writes:
 Sometimes we have a pretty corrupted fs but have an old tree bytenr that we
 could use, add the ability to specify the tree root bytenr.  Thanks,
 
 Signed-off-by: Josef Bacik jbacik at fb.com
Tested-by: Ansgar Hockmann-Stolle ansgar.hockmann-stolle at
uni-osnabrueck.de

This patch fixed my case:
http://www.spinics.net/lists/linux-btrfs/msg38714.html
Thank you! And thank you, Qu Wenruo for the help!!

I tested all the blocks that find-root gave - but only the one with
generation = want + 1 yielded some possible root items. btrfsck with
--tree-root pointed at that block finally fixed my file system. Hooray!
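
For anyone hitting the same symptoms, the sequence that worked here was
roughly this (a sketch; the bytenr must come from your own find-root
output):

  ./btrfs-find-root /dev/sdX    # look for a block whose "have" is "want" + 1
  btrfsck --tree-root <bytenr> --repair /dev/sdX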

Ciao
Ansgar



Re: filesystem corruption

2014-11-04 Thread Robert White

On 11/04/2014 10:28 AM, Chris Murphy wrote:

On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote:

Now we have two disks with equal generation numbers.  Generations 6..9
on sda are not the same as generations 6..9 on sdb, so if we mix the
two disks' metadata we get bad confusion.

It needs to be more than a sequential number.  If one of the disks
disappears we need to record this fact on the surviving disks, and also
cope with _both_ disks claiming to be the surviving one.


I agree this is also a problem. But the most common case is where we know that 
sda generation is newer (larger value) and most recently modified, and sdb has 
not since been modified but needs to be caught up. As far as I know the only 
way to do that on Btrfs right now is a full balance; it doesn't catch up just 
by being reconnected with a normal mount.



I would think that any time any system or fraction thereof is mounted 
with both degraded and rw status, a degraded flag should be set 
somewhere/somehow in the superblock etc.


The only way to clear this flag would be to reach a reconciled state. 
That state could be reached in one of several ways. Removing the missing 
mirror element would be a fast reconcile, doing a balance or scrub would 
be a slow reconcile for a filessytem where all the media are returned to 
service (e.g. the missing volume of a RAID 1 etc is returned.)


Generation numbers are pretty good, but I'd put on a rider that any 
generation number or equivalent incremented while the system is degraded 
should have a unique quantum (say a GUID) generated and stored along with 
the generation number. The mere existence of this quantum would act as 
the degraded flag.


Any check/compare/access related to the generation number would know to 
notice that the GUID is in place and do the necessary resolution. If 
successful the GUID would be discarded.


As to how this could be implemented, I'm not fully conversant on the 
internal layout.


One possibility would be to add a block reference, or, indeed replace 
the current storage for generation numbers completely with block 
reference to a block containing the generation number and the potential 
GUID. The main value of having an out-of-structure reference is that its 
content is less space constrained, and it could be shared by multiple 
usages. In the case, for instance, where the block is added (as opposed 
to replacing the generation number) only one such block would be needed 
per degraded,rw mount, and it could be attached to as many filesystem 
structures as needed.



Just as metadata under DUP is divergent after a degraded mount, a 
generation block would be divergent, and likely in a different location 
than its peers on a subsequently restored geometry.


A generation block could have other niceties like the date/time and the 
devices present (or absent); such information could conceivably be used 
to intelligently disambiguate references. For instance, if one degraded 
mount had sda and sdb, and a second had sdb and sdc, then it'd be known 
that sdb was dominant for having been present every time.



Re: filesystem corruption

2014-11-04 Thread Zygo Blaxell
On Tue, Nov 04, 2014 at 11:28:39AM -0700, Chris Murphy wrote:
 On Nov 3, 2014, at 9:31 PM, Zygo Blaxell zblax...@furryterror.org wrote:
  It needs to be more than a sequential number.  If one of the disks
  disappears we need to record this fact on the surviving disks, and also
  cope with _both_ disks claiming to be the surviving one.
 
  I agree this is also a problem. But the most common case is where we
  know that sda generation is newer (larger value) and most recently
  modified, and sdb has not since been modified but needs to be caught
  up. As far as I know the only way to do that on Btrfs right now is
  a full balance; it doesn't catch up just by being reconnected with a
  normal mount.

The data on the disks might be inconsistent, so resynchronization must
read from only the good copy.  A balance could just spread corruption
around if it reads from two out-of-sync mirrors.  (Maybe it already does
the right thing if sdb was not modified...?).

The full resync operation is more like btrfs device replace, except that
it's replacing a disk in-place (i.e. without removing it first), and it
would not read from the non-good disk.
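
The closest existing approximation, for what it's worth, is an explicit
replace onto a spare device, told to avoid reading from the suspect copy
(a sketch; the devid and device names are illustrative):

  btrfs filesystem show /mnt                # find the devid of the stale mirror
  btrfs replace start -r 2 /dev/sdc /mnt    # -r: read srcdev only if no other mirror works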

 
 Chris Murphy




[SOLVED] btrfs unmountable: read block failed check_tree_block; Couldn't read tree root

2014-11-04 Thread Ansgar Hockmann-Stolle

For solution see
http://article.gmane.org/gmane.comp.file-systems.btrfs/39974

On 28.10.14 at 00:03, Ansgar Hockmann-Stolle wrote:

On 27.10.14 at 14:23, Ansgar Hockmann-Stolle wrote:

Hi!

My btrfs system partition went readonly. After a reboot it doesn't mount
anymore. The system was openSUSE 13.1 Tumbleweed (kernel 3.17.??). Now I'm
on openSUSE 13.2-RC1 rescue (kernel 3.16.3). I dumped (dd) the whole 250
GB SSD to a file on USB storage and tried some btrfs tools on another copy
via a loopback device. But everything failed with:

kernel: BTRFS: failed to read tree root on dm-2

See http://pastebin.com/raw.php?i=dPnU6nzg.

Any hints where to go from here?


After an offlist hint (thanks Tom!) I compiled the latest btrfs-progs
3.17 and tried some more ...

linux:~/bin # ./btrfs --version
Btrfs v3.17
linux:~/bin # ./btrfs-find-root /dev/sda3
Super think's the tree root is at 1015238656, chunk root 20971520
Well block 239718400 seems great, but generation doesn't match,
have=661931, want=663595 level 0
Well block 239722496 seems great, but generation doesn't match,
have=661931, want=663595 level 0
Well block 320098304 seems great, but generation doesn't match,
have=662233, want=663595 level 0
Well block 879341568 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879345664 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879382528 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879398912 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879403008 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879423488 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 879435776 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 880095232 seems great, but generation doesn't match,
have=663227, want=663595 level 0
Well block 881504256 seems great, but generation doesn't match,
have=663228, want=663595 level 0
Well block 881512448 seems great, but generation doesn't match,
have=663228, want=663595 level 0
Well block 936271872 seems great, but generation doesn't match,
have=663397, want=663595 level 0
Well block 1004490752 seems great, but generation doesn't match,
have=663571, want=663595 level 0
Well block 1007804416 seems great, but generation doesn't match,
have=663572, want=663595 level 0
Well block 1012031488 seems great, but generation doesn't match,
have=663575, want=663595 level 0
Well block 1012396032 seems great, but generation doesn't match,
have=663575, want=663595 level 0
Well block 1012633600 seems great, but generation doesn't match,
have=663586, want=663595 level 0
Well block 1012871168 seems great, but generation doesn't match,
have=663585, want=663595 level 0
Well block 1015201792 seems great, but generation doesn't match,
have=663588, want=663595 level 0
Well block 1015836672 seems great, but generation doesn't match,
have=663596, want=663595 level 1
Well block 44132536320 seems great, but generation doesn't match,
have=658774, want=663595 level 0
Well block 44178280448 seems great, but generation doesn't match,
have=658774, want=663595 level 0
Well block 87443644416 seems great, but generation doesn't match,
have=661349, want=663595 level 0
Well block 87514079232 seems great, but generation doesn't match,
have=651051, want=663595 level 0
Well block 87517679616 seems great, but generation doesn't match,
have=661349, want=663595 level 0
Well block 98697822208 seems great, but generation doesn't match,
have=643548, want=663595 level 0
Well block 103285026816 seems great, but generation doesn't match,
have=661672, want=663595 level 0
Well block 103309553664 seems great, but generation doesn't match,
have=661674, want=663595 level 0
Well block 103523430400 seems great, but generation doesn't match,
have=661767, want=663595 level 0
No more metdata to scan, exiting

This line I found interesting because have is want + 1:
Well block 1015836672 seems great, but generation doesn't match,
have=663596, want=663595 level 1

And here is the tail of btrfs rescue chunk-recover (full output at
http://pastebin.com/raw.php?i=1D5VgDxv)

[..]
Total Chunks:234
   Heathy:231
   Bad:3

Orphan Block Groups:

Orphan Device Extents:
Couldn't map the block 1015238656
btrfs: volumes.c:1140: btrfs_num_copies: Assertion `!(ce->start > 
logical || ce->start + ce->size < logical)' failed.
Aborted


Sadly, btrfs check --repair keeps refusing to do its job.

linux:~ # btrfs check --repair /dev/sda3
enabling repair mode
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
Check tree block failed, want=1015238656, have=0
read block failed check_tree_block
Couldn't read tree root
Checking filesystem on /dev/sda3
UUID: 1af256b5-b1ad-443b-aeee-a6853e70b7e2

Re: Kernel crash during btrfs device delete on raid6 volume

2014-11-04 Thread Mark Fasheh
On Tue, Nov 04, 2014 at 10:58:48AM -0500, Chris Mason wrote:
 Not a new bug unfortunately, but since it is in the error handling people 
 must not be hitting it often.  It's also not related to device replace.


	while (ret < 0 && !list_empty(&tmplist)) {
		sums = list_entry(&tmplist, struct btrfs_ordered_sum, list);
		list_del(&sums->list);
		kfree(sums);
	}

 We're trying to call kfree on the on-stack list head.  I'm fixing it up 
 here, thanks for posting the oops!

 Fix attached, or you can wait for the next rc.  Thanks.

 -chris



 From 6e5aafb27419f32575b27ef9d6a31e5d54661aca Mon Sep 17 00:00:00 2001
 From: Chris Mason c...@fb.com
 Date: Tue, 4 Nov 2014 06:59:04 -0800
 Subject: [PATCH] Btrfs: fix kfree on list_head in btrfs_lookup_csums_range
  error cleanup
 
 If we hit any errors in btrfs_lookup_csums_range, we'll loop through all
 the csums we allocate and free them.  But the code was using list_entry
 incorrectly, and ended up trying to free the on-stack list_head instead.
 
 This bug came from commit 0678b6185

Wow, that's an old commit! Thanks for the CC. The fix looks good to me, so
you can add:

Reviewed-by: Mark Fasheh mfas...@suse.de

if you like, thanks.
--Mark

--
Mark Fasheh