Re: BTRFS partitioning scheme (was BTRFS with RAID1 cannot boot when removing drive)

2014-02-13 Thread Austin S Hemmelgarn
On 2014-02-13 12:33, Chris Murphy wrote:
 
 On Feb 13, 2014, at 1:50 AM, Frank Kingswood 
 fr...@kingswood-consulting.co.uk wrote:
 
 On 12/02/14 17:13, Saint Germain wrote:
 Ok, based on your advice, here is what I have done so far to use UEFI (remember that the objective is to have a clean and simple BTRFS RAID1 install).

 A) I start first with only one drive, I have gone with the following
 partition scheme (Debian wheezy, kernel 3.12, grub 2.00, GPT partition
 with parted):
 sda1 = 1MiB BIOS Boot partition (no FS, "set 1 bios_grub on" with parted to set the type)
 sda2 = 550 MiB EFI System Partition (FAT32, "toggle 2 boot" with parted to set the type), mounted on /boot/efi

 I'm curious, why so big? There's only one file of about 100kb there, and I 
 was considering shrinking mine to the minimum possible (which seems to be 
 about 33 MB).
 
 I'm not sure what OS loader you're using but I haven't seen a grubx64.efi 
 less than ~500KB. In general I'm seeing it at about 1MB. The Fedora grub-efi 
 and shim packages as installed on the ESP take up 10MB. So 33MiB is a bit small, and if we were more conservative, we'd update the OS loader by writing the new one to a temp directory rather than overwriting the existing one, and then remove the old one and rename the new one.
 
 The UEFI spec says if the system partition is FAT, it should be FAT32. For 
 removable media it's FAT12/FAT16. I don't know what tool the various distro 
 installers are using, but at least on Fedora they are using mkdosfs, part of 
 dosfstools. And its cutoff for making FAT16/FAT32 based on media size is 
 500MB unless otherwise specified, and the installer doesn't specify so 
 actually by default Fedora system partitions are FAT16, to no obvious ill 
 effect. But if you want a FAT32 ESP created by the installer, the ESP needs to be at least 500 MiB (about 525 MB), so 550 MB is a reasonable number to make that happen.
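 
 For illustration only, formatting such an ESP by hand amounts to something like the following (the device name and partition number are placeholders; -F 32 simply forces mkdosfs to use FAT32 regardless of size):
 
   parted /dev/sda set 2 boot on
   mkdosfs -F 32 -n EFI /dev/sda2
   mount /dev/sda2 /boot/efi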
 
 If we were slightly smarter (and more A.R.), UEFI bugs aside, we'd put the 
 ESP as the last partition on the disk rather than as the first and then 
 honestly would we really care about consuming even 1GiB of the slowest part 
 of a spinning disk? Or causing a bit of overprovisioning for SSD? No. It's 
 probably a squeak of an improvement if anything.
 
 For those who want to use gummiboot, it calls for the kernel and initramfs to be located on the ESP, which is then mounted at /boot rather than /boot/efi. So that's another reason to make it bigger than usual.
 
 
 
 
 sda3 = 1 TiB root partition (BTRFS), mounted on /
 sda4 = 6 GiB swap partition
 (that way I should be able to be compatible with both CSM or UEFI)

 B) normal Debian installation on sdas, activate the CSM on the
 motherboard and reboot.

 C) apt-get install grub-efi-amd64 and grub-install /dev/sda

 And the problems begin:
 1) grub-install doesn't give any error but using the --debug I can see
 that it is not using EFI.
 2) Ok I force with grub-install --target=x86_64-efi
 --efi-directory=/boot/efi --bootloader-id=grub --recheck --debug
 /dev/sda
 3) This time something is generated in /boot/efi: 
 /boot/efi/EFI/grub/grubx64.efi
 4) Copy the file /boot/efi/EFI/grub/grubx64.efi to
 /boot/efi/EFI/boot/bootx64.efi

 is EFI/boot/ correct here?
 
 If you want a fallback bootloader, yes.
 

 If you're lucky then your BIOS will tell you what path it will try to read for the boot code. For me that is /EFI/debian/grubx64.efi.
 
 NVRAM is what does this. But if NVRAM becomes corrupt, or the entry is deleted for whatever reason, the proper fallback is boot<arch>.efi.

While this is what the UEFI spec says is supposed to be the fallback,
many systems don't actually look there unless the media is removable.
All of my UEFI systems instead look for EFI/Microsoft/Boot/bootmgfw.efi as
the fallback (because most x86 system designers don't care at all about
standards compliance as long as it will run Windows).
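
For anyone wanting to set that fallback up by hand, a rough sketch (the paths assume the ESP is mounted at /boot/efi, and the efibootmgr line is only an example of re-creating the normal NVRAM entry):

  mkdir -p /boot/efi/EFI/boot
  cp /boot/efi/EFI/grub/grubx64.efi /boot/efi/EFI/boot/bootx64.efi
  efibootmgr -c -d /dev/sda -p 2 -L grub -l '\EFI\grub\grubx64.efi'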



Re: Issue with btrfs balance

2014-02-13 Thread Austin S. Hemmelgarn


On 02/10/2014 08:41 AM, Brendan Hide wrote:
 On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote:
 <snip>
 Apparently, trying to use -mconvert=dup or -sconvert=dup on a
 multi-device filesystem using one of the RAID profiles for metadata
 fails with a statement to look at the kernel log, which doesn't show
 anything at all about the failure.
 ^ If this is the case then it is definitely a bug. Can you provide some
 version info? Specifically kernel, btrfs-tools, and Distro.
 <snip> it appears that the kernel stops you from converting to a dup profile for metadata in this case because it thinks that such a profile doesn't work on multiple devices, despite the fact that you can take a single-device filesystem, add a device, and it will still work fine even without converting the metadata/system profiles.
 I believe dup used to work on multiple devices but the facility was
 removed. In the standard case it doesn't make sense to use dup with
 multiple devices: It uses the same amount of diskspace but is more
 vulnerable than the RAID1 alternative.
 <snip> Ideally, this
 should be changed to allow converting to dup so that when converting a
 multi-device filesystem to single-device, you never have to have
 metadata or system chunks use a single profile.
 This is a good use-case for having the facility. I'm thinking that, if
 it is brought back in, the only caveat is that appropriate warnings
 should be put in place to indicate that it is inappropriate.
 
 My guess on how you'd like to migrate from raid1/raid1 to single/dup,
 assuming sda and sdb:
 btrfs balance start -dconvert=single -mconvert=dup /
 btrfs device delete /dev/sdb /
 
Do you happen to know which git repository and branch is preferred to
base patches on?  I'm getting ready to write one to fix this, and would
like to make it as easy as possible for the developers to merge.


Re: Issue with btrfs balance

2014-02-14 Thread Austin S Hemmelgarn
On 02/14/2014 02:56 AM, Brendan Hide wrote:
 On 14/02/14 05:42, Austin S. Hemmelgarn wrote:
 On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote:
 Do you happen to know which git repository and branch is
 preferred to base patches on?  I'm getting ready to write one to
 fix this, and would like to make it as easy as possible for the
 developers to merge.
 A list of the main repositories is maintained at 
 https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
 
 I'd suggest David Sterba's branch as he maintains it for
 userspace-tools integration.
 
In this case, it will need to be patched both in the userspace tools and
in the kernel; it's the kernel itself that prevents the balance, because
it thinks that you can't use dup profiles with multiple devices.


[PATCH] Allow forced conversion of metadata to dup profile on multiple devices

2014-02-14 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-- 
1.8.5.4




[PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-19 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control
*bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control
*bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));
 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
-- 
1.8.5.4



Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-24 Thread Austin S Hemmelgarn
On 2014-02-24 08:37, Ilya Dryomov wrote:
 On Thu, Feb 20, 2014 at 6:57 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Feb 19, 2014 at 11:10:41AM -0500, Austin S Hemmelgarn wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 Reviewed-by: David Sterba dste...@suse.cz

 Sounds useful. The multiple devices + DUP setup is allowed when a device is added; this patch only adds the 'delete' counterpart. The improved data loss protection during the process is a good thing.
 
 Hi,
 
 Have you actually tried to queue it?  Unless I'm missing something, it won't
 compile, and on top of that, it seems to be corrupted too..
The patch itself was made using git, AFAICT it should be fine.  I've
personally built and tested it using UML.
 
 IIRC multiple devices + DUP is allowed only until the first balance, has that changed?

This is just a limitation of how the kernel handles balances: DUP
profiles with multiple devices do work, they're just terribly inefficient.
The primary use case is converting a multi-device FS with RAID metadata
to a single-device FS without having to reduce integrity.
 Thanks,
 
 Ilya
 


Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices

2014-02-24 Thread Austin S Hemmelgarn
On 2014-02-24 09:12, Ilya Dryomov wrote:
 On Mon, Feb 24, 2014 at 3:44 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 2014-02-24 08:37, Ilya Dryomov wrote:
 On Thu, Feb 20, 2014 at 6:57 PM, David Sterba dste...@suse.cz wrote:
 On Wed, Feb 19, 2014 at 11:10:41AM -0500, Austin S Hemmelgarn wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 Reviewed-by: David Sterba dste...@suse.cz

 Sounds useful. The multiple devices + DUP setup is allowed when a device is added; this patch only adds the 'delete' counterpart. The improved data loss protection during the process is a good thing.

 Hi,

 Have you actually tried to queue it?  Unless I'm missing something, it won't
 compile, and on top of that, it seems to be corrupted too..
 The patch itself was made using git, AFAICT it should be fine.  I've
 personally built and tested it using UML.
 
 It doesn't look fine.  It was generated with git, but it got corrupted
 on the way: either how you pasted it or the email client you use is the
 problem.
 
 On Wed, Feb 19, 2014 at 6:10 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 Currently, btrfs balance start fails when trying to convert metadata or
 system chunks to dup profile on filesystems with multiple devices.  This
 requires that a conversion from a multi-device filesystem to a single
 device filesystem use the following methodology:
 1. btrfs balance start -dconvert=single -mconvert=single \
-sconvert=single -f /
 2. btrfs device delete /dev/sdx /
 3. btrfs balance start -mconvert=dup -sconvert=dup /
 This results in a period of time (possibly very long if the devices are
 big) where you don't have the protection guarantees of multiple copies
 of metadata chunks.

 After applying this patch, one can instead use the following methodology
 for conversion from a multi-device filesystem to a single device
 filesystem:
 1. btrfs balance start -dconvert=single -mconvert=dup \
-sconvert=dup -f /
 2. btrfs device delete /dev/sdx /
 This greatly reduces the chances of the operation causing data loss due
 to a read error during the device delete.

 Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
 ---
  fs/btrfs/volumes.c | 21 +
  1 file changed, 17 insertions(+), 4 deletions(-)

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 07629e9..38a9522 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control
 *bctl,
 
 ^^^, that should be a single line
 
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
 -	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
 -	if (num_devices == 1)
 -		allowed |= BTRFS_BLOCK_GROUP_DUP;
 -	else if (num_devices > 1)
 +	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
 +	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 |
 BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
 @@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control
 *bctl,
 
 ^^^, ditto
 
 			goto out;
 		}
 	}
 +	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
 +	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
 +	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
 +	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
 +	    (num_devs > 1)) {
 +		if (bctl->flags & BTRFS_BALANCE_FORCE) {
 +			btrfs_info(fs_info, "force conversion of
 metadata "
 +				   "to dup profile on multiple
 devices");
 +		} else {
 +			btrfs_err(fs_info, "balance will reduce
 metadata

[PATCH] btrfs: Allow forced conversion of metadata to dup profile on, multiple devices

2014-02-26 Thread Austin S Hemmelgarn
Currently, btrfs balance start fails when trying to convert metadata or
system chunks to dup profile on filesystems with multiple devices.  This
requires that a conversion from a multi-device filesystem to a single
device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete /dev/sdx /
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are
big) where you don't have the protection guarantees of multiple copies
of metadata chunks.

After applying this patch, one can instead use the following methodology
for conversion from a multi-device filesystem to a single device
filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete /dev/sdx /
This greatly reduces the chances of the operation causing data loss due
to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn ahferro...@gmail.com
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	    (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	    (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {


Re: Massive BTRFS performance degradation

2014-03-09 Thread Austin S Hemmelgarn
On 03/09/2014 04:17 AM, Swâmi Petaramesh wrote:
 On Sunday, 9 March 2014 at 08:48:20, KC wrote:
 I am experiencing massive performance degradation on my BTRFS
 root partition on SSD.
 
 BTW, is BTRFS still a SSD-killer ? It had this reputation a while
 ago, and I'm not sure if this still is the case, but I don't dare
 (yet) converting to BTRFS one of my laptops that has a SSD...
 
Actually, because of the COW nature of BTRFS, it should be better for
SSDs than something like ext4 (which DOES kill SSDs when journaling is
enabled, because it ends up doing thousands of read-modify-write cycles
to the same 128k of the disk under just generic usage).  Just make
sure that you use the 'ssd' and 'discard' mount options.
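
As an illustration (the device and mount point are placeholders, not a recommendation for every setup), an fstab entry using those options might look like:

  /dev/sda2  /  btrfs  defaults,ssd,discard  0  0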


Re: Incremental backup for a raid1

2014-03-14 Thread Austin S Hemmelgarn
On 2014-03-14 09:46, George Mitchell wrote:
 Actually, an interesting concept would be to have the initial two drive
 RAID 1 mirrored by 2 additional drives in 4-way configuration on a
 second machine at a remote location on a private high speed network with
 both machines up 24/7.  In that case, if such a configuration would
 work, either machine could be obliterated and the data would survive
 fully intact in full duplex mode.  It would just need to be remounted
 from the backup system and away it goes.  Just thinking of interesting
 possibilities with n-way mirroring.  Oh how I would love to have n-way
 mirroring to play with!
That can already be done, albeit slightly differently by stacking btrfs
RAID 1 on top of a pair of DRBD devices.  Of course, this doesn't
provide quite the same degree of safety as your suggestion, but it does
work (and DRBD makes the remote copy write-mostly for the local system
automatically).
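
A rough sketch of that stacking, assuming two DRBD resources have already been configured and show up as /dev/drbd0 and /dev/drbd1 (placeholder names; the DRBD configuration itself is omitted):

  mkfs.btrfs -d raid1 -m raid1 /dev/drbd0 /dev/drbd1
  btrfs device scan
  mount /dev/drbd0 /mnt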


Re: BTRFS setup advice for laptop performance ?

2014-04-04 Thread Austin S Hemmelgarn
On 2014-04-04 04:02, Swâmi Petaramesh wrote:
 Hi,
 
 I'm going to receive a new small laptop with a 500 GB 5400 RPM mechanical ole' rust HD, and I plan to install BTRFS on it.
 
 It will have a kernel 3.13 for now, until 3.14 gets released.
 
 However I'm still concerned with chronic dreadful BTRFS performance, and I still find that BTRFS degrades much over time even with periodic defrag and best practices etc.
I keep hearing this from people, but I personally don't find this to be
the case at all.  I'm pretty sure the 'big' performance degradation that
people are seeing is due to how they are using snapshots, not a result
of using BTRFS itself (I don't use them for anything other than ensuring
a stable system image for rsync and/or tar based backups).
 
 So I'd like to start with the best possible options and have a few questions :
 
 - Is it still recommended to mkfs with a nodesize or leafsize different 
 (bigger) than the default ? I wouldn't like to lose too much disk space 
 anyway 
 (1/2 nodesize per file on average ?), as it will be limited...
This depends on many things; the average size of the files on the disk
is the biggest factor.  In general, you should get the best disk
utilization by setting the nodesize so that a majority of the files are
smaller than the leafsize minus 256 bytes, and all but a few are smaller
than two times the leafsize minus 256 bytes.  However, if you want to
really benefit from the data compression, you should just use the
smallest leaf/nodesize for your system (which is what mkfs defaults to),
because BTRFS stores files that are (roughly) at least 256 bytes smaller
than the leafsize inline with the metadata, and doesn't compress such files.
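
For reference, the node/leaf size has to be chosen at mkfs time and cannot be changed afterwards. A minimal sketch with an illustrative 16 KiB value (the device name is a placeholder, and older btrfs-progs also take -l for the matching leaf size):

  mkfs.btrfs -n 16384 /dev/sdX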
 
 - Is it recommended to alter the FS to have skinny extents? I've done this on all of my BTRFS machines without problem, but the kernel still spits out a notice at mount time, and I'm worrying, kind of "Why is the kernel warning me I have skinny extents? Is it bad? Is it something I should avoid?"
I think that the primary reason for the warning is that it is backward
incompatible: older kernels can't mount filesystems that use it.
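
For reference, skinny extents are typically enabled on an existing (unmounted) filesystem with btrfstune; the device name below is a placeholder, and the exact flag should be checked against your btrfs-progs version:

  btrfstune -x /dev/sdX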
 
 - Are there other optimization tricks I should perform at mkfs time because 
 thay can't be changed later on ?
 
 - Are there other btrfstune or mount options I should pass before starting to 
 populate the FS with a system and data ?
Unless you are using stuff like QEMU or Virtualbox, you should probably
have autodefrag and space_cache on from the very start.
 
 - Generally speaking, does LZO compression improve or degrade performance ? 
 I'm not able to figure it out clearly.
As long as your memory bandwidth is significantly higher than your disk
bandwidth (which is almost always the case, even with SSDs), this should
provide at least some improvement for I/O involving large files.  Because
you are using a traditional hard disk instead of an SSD, you might get
better performance using zlib (assuming you don't mind slightly higher
processor usage for I/O to files larger than the leafsize).  If you care
less about disk utilization than you do about performance, you might want
to use compress-force instead of compress, as the performance boost comes
from not having to write as much data to disk.
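
Putting the options discussed in this reply together, an illustrative (not prescriptive) mount invocation for such a laptop might be:

  mount -o autodefrag,space_cache,compress-force=zlib /dev/sdX /mnt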
 
 TIA for the insight.
 


Re: BTRFS setup advice for laptop performance ?

2014-04-04 Thread Austin S Hemmelgarn
On 2014-04-04 08:48, Swâmi Petaramesh wrote:
 On Friday, 4 April 2014 at 08:33:10, Austin S Hemmelgarn wrote:
 However I'm still concerned with chronic dreadful BTRFS performance, and I still find that BTRFS degrades much over time even with periodic defrag and best practices etc.

 I keep hearing this from people, but i personally don't see this to be
 the case at all.  I'm pretty sure the 'big' performance degradation that
 people are seeing is due to how they are using snapshots, not a result
 using BTRFS itself (I don't use them for anything other than ensuring a
 stable system image for rsync and/or tar based backups).
 
 Maybe I was wrong to suppose that if a feature exists, it is supposed to be 
 usable... I have used ZFS for years, and on ZFS having *hundreds* of 
 snapshots 
 of any given FS have exactly zero impact on performance...
 
 With BTRFS, some time ago I tried to use SuSE snapper, which spends its time creating and releasing snapshots, but it soon made my systems unusable...
 
 Now, I only keep 2-3 manually made snapshots just for keeping a stable and 
 OK 
 archive of my machine in a known state just in case...
 
 But if even this has a noticeable negative impact on BTRFS performance, then 
 what the hell are BTRFS snapshots good at ??
 
 Kind regards.
 
I'm not saying that using a few snapshots is a bad thing, I'm saying
that thousands of snapshots is a bad thing (I have actually seen people
with that many, including one individual who had almost 32,000 snapshots
on the same drive).  I personally do keep a few around on my system on a
regular basis, even aside from the backups, and have no noticeable
performance degradation.  For reference, the (main) system that I am
using has an Intel Celeron 847 running at 1.1GHz, 4G of DDR3-1333 RAM,
and a 500G 5400 RPM SATA II hard disk.  My root filesystem is a BTRFS
volume mounted with autodefrag,space_cache,compress-force=lzo,noatime
(the noatime improves performance (and power efficiency) for btrfs
because metadata updates end up cascading up the metadata tree: updating
the atime on /etc/foo/bar causes the atime to be updated on /etc/foo,
which causes the atime to be updated on /etc, which causes the atime to
be updated on /).


Re: BTRFS setup advice for laptop performance ?

2014-04-07 Thread Austin S Hemmelgarn
On 2014-04-05 07:10, Swâmi Petaramesh wrote:
 On Saturday, 5 April 2014 at 10:12:17, Duncan wrote [excellent performance advice about disabling Akonadi in BTRFS etc.]:
 
 Thanks Duncan for all this excellent discussion.
 
 However I'm still rather puzzled by a filesystem for which the advice is "if you want tolerable performance, you have to turn off features that are the default with any other FS out there (relatime -> noatime), or you have to quit using this database, or you have to fiddle around with esoteric options such as disabling COW", which BTW is one of BTRFS's most prominent features.
 
The only reason AFAIK that noatime isn't the default on other
filesystems is because it breaks stuff like mutt.  Other than that,
nobody really uses atimes, and noatime will in fact get you better
performance on any filesystem.
 [...]
 To put it plainly, even if relatime causes writes, every other FS out there can cope with it. Even if Akonadi is heavy and a disk resource hog, any other FS out there can cope with it and still maintain acceptable, usable performance.
 
This is because every other filesystem (except ZFS) doesn't use COW
semantics.  IIRC, using those same features on ZFS causes the same
problems.  This in fact brings to mind one of the biggest reasons that I
refuse to use KDE (or systemd for that matter), KDE systems run slower
in my experience even on ext4, XFS, and JFS, not just on COW filesystems.


Re: BTRFS setup advice for laptop performance ?

2014-04-08 Thread Austin S Hemmelgarn
On 2014-04-08 07:56, Clemens Eisserer wrote:
 Hi,
 
 This is because every other filesystem (except ZFS) doesn't use COW
 semantics.
 
 Nilfs2 also is COW based.
 
 Regards, Clemens
 
Apologies, I had forgotten about NILFS2 (probably because I chose not
to deal with it due to stability issues that I have experienced, and a
lack of XATTR and ACL support).


Re: Which companies are using Btrfs in production?

2014-04-24 Thread Austin S. Hemmelgarn
On 2014-04-23 21:19, Marc MERLIN wrote:
 Oh while we're at it, are there companies that can say they are using btrfs
 in production?
 
 Marc
 
Ohio Gravure Technologies is currently preparing to use it on our next
generation of production systems.


Re: safe/necessary to balance system chunks?

2014-04-25 Thread Austin S Hemmelgarn
On 2014-04-25 13:24, Chris Murphy wrote:
 
 On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:
 

 Hi list,

 I've got a 3-device RAID1 btrfs filesystem that started out life as 
 single-device.

 btrfs fi df:

 Data, RAID1: total=1.31TiB, used=1.07TiB
 System, RAID1: total=32.00MiB, used=224.00KiB
 System, DUP: total=32.00MiB, used=32.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=66.00GiB, used=2.97GiB

 This still lists some system chunks as DUP, and not as RAID1.  Does this 
 mean that if one device were to fail, some system chunks would be 
 unrecoverable?  How bad would that be?
 
 Since it's system type, it might mean the whole volume is toast if the 
 drive containing those 32KB dies. I'm not sure what kind of information is in 
 system chunk type, but I'd expect it's important enough that if unavailable 
 that mounting the file system may be difficult or impossible. Perhaps btrfs 
 restore would still work?
 
 Anyway, it's probably a high penalty for losing only 32KB of data.  I think 
 this could use some testing to try and reproduce conversions where some 
 amount of system or metadata type chunks are stuck in DUP. This has come 
 up before on the list but I'm not sure how it's happening, as I've never 
 encountered it.

As far as I understand it, the system chunks are THE root chunk tree for
the entire system, that is to say, it's the tree of tree roots that is
pointed to by the superblock. (I would love to know if this
understanding is wrong).  Thus losing that data almost always means
losing the whole filesystem.

 Assuming this is something that needs to be fixed, would I be able to fix 
 this by balancing the system chunks?  Since the force flag is required, 
 does that mean that balancing system chunks is inherently risky or 
 unpleasant?
 
 I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 
 mountpoint; or with -sconvert=raid1,soft although it's probably a minor 
 distinction for such a small amount of data.
The kernel won't allow a balance involving system chunks unless you
specify force, as it considers any kind of balance using them to be
dangerous.  Given your circumstances, I'd personally say that the safety
provided by RAID1 outweighs the risk of making the FS un-mountable.
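
In other words, something along these lines (the mount point is a placeholder; as noted above, the kernel requires the force flag for a conversion involving system chunks):

  btrfs balance start -f -sconvert=raid1 /mnt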
 
 The metadata looks like it could use a balance, 66GB of metadata chunks 
 allocated but only 3GB used. So you could include something like -musage=50 
 at the same time and that will balance any chunks with 50% or less usage.
 
 
 Chris Murphy
 

Personally, I would recommend making a full backup of all the data (tar
works wonderfully for this) and recreating the entire filesystem from
scratch, passing all three devices to mkfs.btrfs.  This should result in
all the chunks being RAID1, and will also allow you to benefit from
newer features.
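
A rough sketch of that backup-and-recreate approach (all device names and paths are placeholders, and it assumes the backup target has enough free space):

  tar -cpf /backup/fs-backup.tar -C /mnt/array .
  umount /mnt/array
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb /dev/sdc
  mount /dev/sda /mnt/array
  tar -xpf /backup/fs-backup.tar -C /mnt/array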


Re: safe/necessary to balance system chunks?

2014-04-25 Thread Austin S Hemmelgarn
On 2014-04-25 14:43, Steve Leung wrote:
 On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote:
 On 2014-04-25 13:24, Chris Murphy wrote:

 On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote:

 I've got a 3-device RAID1 btrfs filesystem that started out life as
 single-device.

 btrfs fi df:

 Data, RAID1: total=1.31TiB, used=1.07TiB
 System, RAID1: total=32.00MiB, used=224.00KiB
 System, DUP: total=32.00MiB, used=32.00KiB
 System, single: total=4.00MiB, used=0.00
 Metadata, RAID1: total=66.00GiB, used=2.97GiB

 This still lists some system chunks as DUP, and not as RAID1.  Does
 this mean that if one device were to fail, some system chunks would
 be unrecoverable?  How bad would that be?

 Assuming this is something that needs to be fixed, would I be able
 to fix this by balancing the system chunks?  Since the force flag
 is required, does that mean that balancing system chunks is
 inherently risky or unpleasant?

 I don't think force is needed. You'd use btrfs balance start
 -sconvert=raid1 mountpoint; or with -sconvert=raid1,soft although
 it's probably a minor distinction for such a small amount of data.
 The kernel won't allow a balance involving system chunks unless you
 specify force, as it considers any kind of balance using them to be
 dangerous.  Given your circumstances, I'd personally say that the safety
 provided by RAID1 outweighs the risk of making the FS un-mountable.
 
 Agreed, I'll attempt the system balance shortly.
 
 Personally, I would recommend making a full backup of all the data (tar
 works wonderfully for this), and recreate the entire filesystem from
 scratch, but passing all three devices to mkfs.btrfs.  This should
 result in all the chunks being RAID1, and will also allow you to benefit
 from newer features.
 
 I do have backups of the really important stuff from this filesystem,
 but they're offsite.  As this is just for a home system, I don't have
 enough temporary space for a full backup handy (which is related to how
 I ended up in this situation in the first place).
 
 Once everything gets rebalanced though, I don't think I'd be missing out
 on any features, would I?
 
 Steve
In general, it shouldn't be an issue, but recreating it might get you
slightly better performance.  I actually have a similar situation with
how I have my desktop system set up; when I go about recreating the
filesystem (which I do every time I upgrade either the tools or the
kernel), I use the following approach:

1. Delete one of the devices from the filesystem
2. Create a new btrfs file system on the device just removed from the
filesystem
3. Copy the data from the old filesystem to the new one
4. one at a time, delete the remaining devices from the old filesystem
and add them to the new one, re-balancing the new filesystem after
adding each device.

This seems to work relatively well for me, and prevents the possibility
that there is ever just one copy of the data.  It does, however, require
that the amount of data that you are storing on the filesystem is less
than the size of one of the devices (although you can kind of work
around this limitation by setting compress-force=zlib on the new file
system when you mount it, then using defrag to decompress everything
after the conversion is done), and that you have to drop to single user
mode for the conversion (unless it's something that isn't needed all the
time, like the home directories or /usr/src, in which case you just log
everyone out and log in as root on the console to do it).
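
As a concrete sketch of those four steps for a simple two-device case (device names and mount points are placeholders):

  btrfs device delete /dev/sdb /mnt/old     # 1. free one device from the old FS
  mkfs.btrfs /dev/sdb                       # 2. create the new FS on it
  mount /dev/sdb /mnt/new
  cp -a /mnt/old/. /mnt/new/                # 3. copy the data across
  umount /mnt/old                           # 4. move the last device over and rebalance
  btrfs device add /dev/sda /mnt/new
  btrfs balance start /mnt/new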


Re: btrfs on bcache

2014-05-01 Thread Austin S Hemmelgarn
On 2014-04-30 14:16, Felix Homann wrote:
 Hi,
 a couple of months ago there has been some discussion about issues
 when using btrfs on bcache:
 
 http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018
 
 From looking at the mailing list archives I cannot tell whether or not
 this issue has been resolved in current kernels from either bcache's
 or btrfs' side.
 
 Can anyone tell me what's the current state of this issue? Should it
 be safe to use btrfs on bcache by now?

In all practicality, I don't think anyone who frequents the list knows.
 I do know that there are a number of people (myself included) who avoid
bcache in general because of having issues with seemingly random kernel
OOPSes when it is linked in (either as a module or compiled in), even
when it isn't being used.  My advice would be to just test it with some
non-essential data (maybe set up a virtual machine?).


Re: Help with space

2014-05-03 Thread Austin S Hemmelgarn
On 05/02/2014 03:21 PM, Chris Murphy wrote:
 
 On May 2, 2014, at 2:23 AM, Duncan 1i5t5.dun...@cox.net wrote:
 
 Something tells me btrfs replace (not device replace, simply
 replace) should be moved to btrfs device replace…
 
 The syntax for btrfs device is different though; replace is like
 balance: btrfs balance start and btrfs replace start. And you can
 also get a status on it. We don't (yet) have options to stop,
 start, resume, which could maybe come in handy for long rebuilds
 and a reboot is required (?) although maybe that just gets handled
 automatically: set it to pause, then unmount, then reboot, then
 mount and resume.
 
 Well, I'd say two copies if it's only two devices in the raid1...
 would be true raid1.  But if it's say four devices in the raid1,
 as is certainly possible with btrfs raid1, that if it's not
 mirrored 4-way across all devices, it's not true raid1, but
 rather some sort of hybrid raid,  raid10 (or raid01) if the
 devices are so arranged, raid1+linear if arranged that way, or
 some form that doesn't nicely fall into a well defined raid level
 categorization.
 
 Well, md raid1 is always n-way. So if you use -n 3 and specify
 three devices, you'll get 3-way mirroring (3 mirrors). But I don't
 know any hardware raid that works this way. They all seem to be
 raid 1 is strictly two devices. At 4 devices it's raid10, and only
 in pairs.
 
 Btrfs raid1 with 3+ devices is unique as far as I can tell. It is
 something like raid1 (2 copies) + linear/concat. But that
 allocation is round robin. I don't read code but based on how a 3
 disk raid1 volume grows VDI files as it's filled it looks like 1GB
 chunks are copied like this
Actually, MD RAID10 can be configured to work almost the same with an
odd number of disks, except it uses (much) smaller chunks, and it does
more intelligent striping of reads.
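
For comparison, a three-disk md RAID10 with the default near-2 layout can be created along these lines (device names are placeholders):

  mdadm --create /dev/md0 --level=10 --raid-devices=3 /dev/sda1 /dev/sdb1 /dev/sdc1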
 
 Disk1   Disk2   Disk3
 134     124     235
 679     578     689
 
 So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a
 chunk 1; disk 2 and 3 each have a chunk 2, and so on. Total of 9GB
 of data taking up 18GB of space, 6GB on each drive. You can't do
 this with any other raid1 as far as I know. You do definitely run
 out of space on one disk first though because of uneven metadata to
 data chunk allocation.
 
 Anyway I think we're off the rails with raid1 nomenclature as soon
 as we have 3 devices. It's probably better to call it replication,
 with an assumed default of 2 replicates unless otherwise
 specified.
 
 There's definitely a benefit to a 3 device volume with 2
 replicates, efficiency wise. As soon as we go to four disks 2
 replicates it makes more sense to do raid10, although I haven't
 tested odd device raid10 setups so I'm not sure what happens.
 
 
 Chris Murphy
 
 



Re: RAID-1 - suboptimal write performance?

2014-05-16 Thread Austin S Hemmelgarn
On 05/16/2014 04:41 PM, Tomasz Chmielewski wrote:
 On Fri, 16 May 2014 14:06:24 -0400
 Calvin Walton calvin.wal...@kepstin.ca wrote:
 
 No comment on the performance issue, other than to say that I've seen
 similar on RAID-10 before, I think.

 Also, what happens when the system crashes, and one drive has
 several hundred megabytes data more than the other one?

 This shouldn't be an issue as long as you occasionally run a scrub or
 balance. The scrub should find it and fix the missing data, and a
 balance would just rewrite it as proper RAID-1 as a matter of course.
 
 It's similar (writes to just one drive, while the other is idle) when
 removing (many) snapshots. 
 
 Not sure if that's optimal behaviour.
 
I think, after having looked at some of the code, that I know what is
causing this (although my interpretation of the code may be completely
off target).  As far as I can make out, BTRFS only dispatches writes to
one device at a time, and the write() system call only returns when the
data is on both devices.  While dispatching to one device at a time is
optimal when both 'devices' are partitions on the same underlying disk
(and also if your optimization metric is the simplicity of the
underlying code), it degrades very fast to the worst case when using
multiple devices.  The underlying cause however, which the one device at
a time logic in BTRFS just makes much worse, is that the buffer for the
write() call is kept in memory until the write completes, and counts
against the per-process write-caching limit; when the process fills up
its write cache, the next call it makes that would write to the disk
hangs until the write cache is less full.

The two options that I've found that work around this are:
1. Run 'sync' whenever the program stalls, or
2. Disable write-caching by adding the following to /etc/sysctl.conf
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0

Option 1 is kind of tedious but doesn't hurt performance all that much;
Option 2 will lower throughput but will cause most of the stalls to
disappear.
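
For completeness, the settings from option 2 can also be applied at runtime before editing /etc/sysctl.conf (this simply applies the values suggested above):

  sysctl -w vm.dirty_bytes=0
  sysctl -w vm.dirty_background_bytes=0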

Ideally, BTRFS should dispatch the first write for a block in a
round-robin fashion among available devices.  This won't fix the
underlying issue, but it will make it less of an issue for BTRFS.





Re: send/receive and bedup

2014-05-19 Thread Austin S Hemmelgarn
On 2014-05-19 13:12, Konstantinos Skarlatos wrote:
 On 19/5/2014 7:01 μμ, Brendan Hide wrote:
 On 19/05/14 15:00, Scott Middleton wrote:
 On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote:
 On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote:
 I read so much about BTRFS that I mistook Bedup for Duperemove.
 Duperemove is actually what I am testing.
 I'm currently using programs that find files that are the same, and
 hardlink them together:
 http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html


 hardlink.py actually seems to be the faster (memory and CPU) one even though it's in Python.
 I can get others to run out of RAM on my 8GB server easily :(

 Interesting app.

 An issue with hardlinking (with the backups use-case, this problem
 isn't likely to happen), is that if you modify a file, all the
 hardlinks get changed along with it - including the ones that you
 don't want changed.

 @Marc: Since you've been using btrfs for a while now I'm sure you've
 already considered whether or not a reflink copy is the better/worse
 option.


 Bedup should be better, but last I tried I couldn't get it to work.
 It's been updated since then, I just haven't had the chance to try it
 again since then.

 Please post what you find out, or if you have a hardlink maker that's
 better than the ones I found :)


 Thanks for that.

 I may be  completely wrong in my approach.

 I am not looking for a file level comparison. Bedup worked fine for
 that. I have a lot of virtual images and shadow protect images where
 only a few megabytes may be the difference. So a file level hash and
 comparison doesn't really achieve my goals.

 I thought duperemove may be on a lower level.

 https://github.com/markfasheh/duperemove

 Duperemove is a simple tool for finding duplicated extents and
 submitting them for deduplication. When given a list of files it will
 hash their contents on a block by block basis and compare those hashes
 to each other, finding and categorizing extents that match each
 other. When given the -d option, duperemove will submit those
 extents for deduplication using the btrfs-extent-same ioctl.

 It defaults to 128k but you can make it smaller.

 I hit a hurdle though. The 3TB HDD  I used seemed OK when I did a long
 SMART test but seems to die every few hours. Admittedly it was part of
 a failed mdadm RAID array that I pulled out of a clients machine.

 The only other copy I have of the data is the original mdadm array
 that was recently replaced with a new server, so I am loathe to use
 that HDD yet. At least for another couple of weeks!


 I am still hopeful duperemove will work.
 Duperemove does look exactly like what you are looking for. The last
 traffic on the mailing list regarding that was in August last year. It
 looks like it was pulled into the main kernel repository on September
 1st.

 The last commit to the duperemove application was on April 20th this
 year. Maybe Mark (cc'd) can provide further insight on its current
 status.

 I have been testing duperemove and it seems to work just fine, in contrast with bedup, which I have been unable to install/compile/sort out due to the mess with Python versions. I have 2 questions about duperemove:
 1) can it use existing filesystem csums instead of calculating its own?
While this might seem like a great idea at first, it really isn't.
BTRFS uses CRC32c at the moment as its checksum algorithm, and while
that is relatively good at detecting small differences (i.e. a single
bit flipped out of every 64 or so bytes), it is known to have issues
with hash collisions.  Normally, the data on disk won't change enough
even from a media error to cause a hash collision, but when you start
using it to compare extents that aren't known to be the same to begin
with, and then try to merge those extents, you run the risk of serious
file corruption.  Also, AFAIK, BTRFS doesn't expose the block checksums
to userspace directly (although I may be wrong about this, in which case
I retract the following statement), so this would require some
kernel-space support.
 2) can it be included in btrfs-progs so that it becomes a standard
 feature of btrfs?
I would definitely like to second this suggestion, I hear a lot of
people talking about how BTRFS has batch deduplication, but it's almost
impossible to make use of without extra software or writing your own code.
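
For reference, a typical invocation of the duperemove tool described above might look like this (the path and block size are illustrative; -d is what actually submits the matching extents for deduplication):

  duperemove -d -r -b 64k /mnt/data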





Re: ditto blocks on ZFS

2014-05-20 Thread Austin S Hemmelgarn
On 2014-05-19 22:07, Russell Coker wrote:
 On Mon, 19 May 2014 23:47:37 Brendan Hide wrote:
 This is extremely difficult to measure objectively. Subjectively ... see
 below.

 [snip]

 *What other failure modes* should we guard against?

 I know I'd sleep a /little/ better at night knowing that a double disk
 failure on a raid5/1/10 configuration might ruin a ton of data along
 with an obscure set of metadata in some long tree paths - but not the
 entire filesystem.
 
 My experience is that most disk failures that don't involve extreme physical 
 damage (EG dropping a drive on concrete) don't involve totally losing the 
 disk.  Much of the discussion about RAID failures concerns entirely failed 
 disks, but I believe that is due to RAID implementations such as Linux 
 software RAID that will entirely remove a disk when it gives errors.
 
 I have a disk which had ~14,000 errors of which ~2000 errors were corrected 
 by 
 duplicate metadata.  If two disks with that problem were in a RAID-1 array 
 then duplicate metadata would be a significant benefit.
 
 The other use-case/failure mode - where you are somehow unlucky enough
 to have sets of bad sectors/bitrot on multiple disks that simultaneously
 affect the only copies of the tree roots - is an extremely unlikely
 scenario. As unlikely as it may be, the scenario is a very painful
 consequence in spite of VERY little corruption. That is where the
 peace-of-mind/bragging rights come in.
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 The NetApp research on latent errors on drives is worth reading.  On page 12 
 they report latent sector errors on 9.5% of SATA disks per year.  So if you 
 lose one disk entirely the risk of having errors on a second disk is higher 
 than you would want for RAID-5.  While losing the root of the tree is 
 unlikely, losing a directory in the middle that has lots of subdirectories is 
 a risk.
 
 I can understand why people wouldn't want ditto blocks to be mandatory.  But 
 why are people arguing against them as an option?
 
 
 As an aside, I'd really like to be able to set RAID levels by subtree.  I'd 
 like to use RAID-1 with ditto blocks for my important data and RAID-0 for 
 unimportant data.
 
But the proposed changes for n-way replication would already handle
this.  They would just need the option of having more than one copy per
device (which theoretically shouldn't be too hard once you have n-way
replication).  Also, BTRFS already has the option of replicating the
root tree across multiple devices (it is included in the System Data
subset), and in fact does so by default when using multiple devices.
There are also plans for per-subvolume or per-file RAID level selection,
but IIRC that is planned for after n-way replication (and of course
RAID 5/6, as n-way replication isn't going to be implemented until
after RAID 5/6).





Re: ditto blocks on ZFS

2014-05-22 Thread Austin S Hemmelgarn
On 2014-05-21 19:05, Martin wrote:
 Very good comment from Ashford.
 
 
 Sorry, but I see no advantages from Russell's replies other than for a
 feel-good factor or a dangerous false sense of security. At best,
 there is a weak justification that for metadata, again going from 2% to
 4% isn't going to be a great problem (storage is cheap and fast).
 
 I thought an important idea behind btrfs was that we avoid by design in
 the first place the very long and vulnerable RAID rebuild scenarios
 suffered for block-level RAID...
 
 
 On 21/05/14 03:51, Russell Coker wrote:
 Absolutely. Hopefully this discussion will inspire the developers to
 consider this an interesting technical challenge and a feature that
 is needed to beat ZFS.
 
 Sorry, but I think that is completely the wrong reasoning. ...Unless
 that is you are some proprietary sales droid hyping features and big
 numbers! :-P
 
 
 Personally I'm not convinced we gain anything beyond what btrfs will
 eventually offer in any case for the n-way raid or the raid-n Cauchy stuff.
 
 Also note that usually, data is wanted to be 100% reliable and
 retrievable. Or if that fails, you go to your backups instead. Gambling
 proportions and importance rather than *ensuring* fault/error
 tolerance is a very human thing... ;-)
 
 
 Sorry:
 
 Interesting idea but not convinced there's any advantage for disk/SSD
 storage.
 
 
 Regards,
 Martin
 
 
 
 
 
Another nice option in this case might be adding logic to make sure
that there is some (considerable) offset between copies of metadata
using the dup profile (on every filesystem where I have actually
looked at the low-level on-disk structures, both copies of the
System chunks were right next to each other, right at the beginning of
the disk, which of course undermines the usefulness of storing two
copies of them on disk).  Adding an offset to those allocations would
provide better protection against some of the more common 'idiot'
failure modes (e.g. trying to use dd to write a disk image to a USB
flash drive, and accidentally overwriting the first n GB of your first
HDD instead).  Ideally, once we have n-way replication, System chunks
should default to one copy per device for multi-device filesystems.





Re: is it safe to change BTRFS_STRIPE_LEN?

2014-05-24 Thread Austin S Hemmelgarn
On 05/24/2014 12:44 PM, john terragon wrote:
 Hi.
 
 I'm playing around with (software) raid0 on SSDs and since I remember
 I read somewhere that intel recommends 128K stripe size for HDD arrays
 but only 16K stripe size for SSD arrays, I wanted to see how a
 small(er) stripe size would work on my system. Obviously with btrfs on
 top of md-raid I could use the stripe size I want. But if I'm not
 mistaken the stripe size with the native raid0 in btrfs is fixed to
 64K in BTRFS_STRIPE_LEN (volumes.h).
 So I was wondering if it would be reasonably safe to just change that
 to 16K (and duck and wait for the explosion ;) ).
 
 Can anyone adept to the inner workings of btrfs raid0 code confirm if
 that would be the right way to proceed? (obviously without absolutely
 any blame to be placed on anyone other than myself if things should go
 badly :) )
I personally can't render an opinion on whether changing it would make
things break or not, but I do know that it would need to be changed in
both the kernel and the tools, and the resultant kernel and tools would
not be entirely compatible with filesystems produced by the regular
tools and kernel, possibly to the point of corrupting any filesystem
they touch.

As for the 64k default stripe size, that sounds correct, and is probably
because that's the largest block that the I/O schedulers on Linux will
dispatch as a single write to the underlying device.





Re: btrfs send ioctl failed with -5: Input/output error

2014-05-26 Thread Austin S Hemmelgarn
On 05/26/2014 05:04 PM, Michael Welsh Duggan wrote:
 Michael Welsh Duggan m...@md5i.com writes:
 
 I am now getting the following error when trying to do a btrfs send:

 root@maru2:/usr/local/src/btrfs-progs# ./btrfs send
 /usr/local/snapshots/2014-05-15  /backup/intermediate
 At subvol /usr/local/snapshots/2014-05-15
 ERROR: send ioctl failed with -5: Input/output error

 I'm running a 3.14.4 kernel, and Btrfs progs v3.14.1.

 root@maru2:/usr/local/src/btrfs-progs# uname -a
 Linux maru2 3.14-1-amd64 #1 SMP Debian 3.14.4-1 (2014-05-13) x86_64 GNU/Linux

 root@maru2:/usr/local/src/btrfs-progs# ./btrfs --version
 Btrfs v3.14.1

 Is there anything I can do to help debug this issue?
 
 I'd like to find out what is happening here.  I am an experienced C
 programmer, but have not dealt with kernel hacking before.  I _do_ know
 how to build and install a kernel.  I'd like some hints on what logging,
 etc., I could add in order to determine where in the send ioctl
 processing the IO error is coming from.  From there, I hope to move to
 why.  Ideally I'd be able to run gdb on this, but nothing I have read
 online about kernel debugging with gdb sounds promising.
 
 I'd make an image, but the amount of data is enough to make this
 prohibitive.
 
I would look into ftrace (the Tracing submenu of the Kernel Hacking menu
in menuconfig and nconfig).  The other thing to look at, at least
initially, is KDB, but using that requires either a serial console or an
AT or PS/2 keyboard.  Using UML and GDB for debugging is possible, but
it can take a long time to set up and is often slow (admittedly, KDB
isn't much faster).  If you do go the UML+GDB route, make sure to build
in fault injection (the dm-flakey module is particularly nice for this
type of thing), and I would suggest using something like Buildroot
(http://www.buildroot.net) to generate the root filesystem for it.
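
If it helps, a minimal ftrace session for this (a rough sketch, assuming
debugfs is mounted, the function-graph tracer is built in, and that
btrfs_ioctl_send is the relevant entry point on your kernel) would look
something like:

  cd /sys/kernel/debug/tracing
  echo btrfs_ioctl_send > set_graph_function
  echo function_graph > current_tracer
  echo 1 > tracing_on
  # reproduce the failing 'btrfs send' in another terminal, then:
  echo 0 > tracing_on
  cp trace /tmp/send-trace.txt

The resulting trace should at least narrow down which callee is
returning the error.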





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 03:54, Swâmi Petaramesh wrote:
 Hi,
 
 I created a BTRFS filesytem over LVM over LUKS encryption on an SSD [yes, I 
 know...], and I noticed that the FS got created with metadata in DUP mode, 
 contrary to what man mkfs.btrfs says for SSDs - it would be supposed to be 
 SINGLE...
 
 Well I don't know if my system didn't identify the SSD because of the 
 LVM+LUKS 
 stack (however it mounts well by itself with the ssd flag and accepts the 
 discard option [yes, I know...]), or if the manpage is obsolete or if this 
 feature just doesn't work...?
 
 The SSD being a Micron RealSSD C400
 
 For both SSD preservation and data integrity, would it be advisable to change 
 metadata to SINGLE using a rebalance, or if I'd better just leave things 
 the 
 way they are...?
 
 TIA for any insight.
 
What mkfs.btrfs looks at is
/sys/block/whatever-device/queue/rotational; if that is 1, it assumes
that the device isn't an SSD.  I believe that LVM passes through whatever
the next lower layer's value is, but dmcrypt (and by extension LUKS)
always forces it to 1 (possibly to prevent programs from using
heuristics for enabling discard).
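
A quick way to check and work around it (a sketch; the device names are
just illustrative):

  # what the mkfs heuristic consults for the device it is given
  cat /sys/block/dm-1/queue/rotational
  # or just request the metadata profile explicitly at creation time
  mkfs.btrfs -m single /dev/mapper/VG-LINUX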





Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 06:35, Russell Coker wrote:
 On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:
 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 fallocate() isn't necessarily supposed to write anything really, it's
 mostly about allocating disk space in advance. I would claim that
 journald's usage of it is very much within the entire reason why it
 exists...

 I don't believe that fallocate() makes any difference to fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...
 
 Not being too different from the access patterns of other databases means 
 having all the same problems as other databases...  Oracle is now selling ZFS 
 servers specifically designed for running the Oracle database, but that's 
 with 
 hybrid storage flash (ZIL and L2ARC on SSD).  While BTRFS doesn't support 
 features equivalent for ZIL and L2ARC it's easy to run a separate filesystem 
 on SSD for things that need performance (few if any current BTRFS users would 
 have a database too big to entirely fit on a SSD).
 
 The problem we are dealing with is database-like access patterns on systems 
 that are not designed as database servers.
 
 Would it be possible to get an interface for defragmenting files that's not 
 specific to BTRFS?  If we had a standard way of doing this then systemd-
 journald could request a defragment of the file at appropriate times.
 
While this is a wonderful idea, what about all the extra I/O this will
cause (and all the extra wear on SSDs)?  I understand wanting
this to be faster, but you should also consider that defragmenting
the file on a regular basis is going to trash performance for other
applications.





Re: BTRFS, SSD and single metadata

2014-06-16 Thread Austin S Hemmelgarn
On 2014-06-16 07:18, Swâmi Petaramesh wrote:
 Hi Austin, and thanks for your reply.
 
 Le lundi 16 juin 2014, 07:09:55 Austin S Hemmelgarn a écrit :

 What mkfs.btrfs looks at is
 /sys/block/whatever-device/queue/rotational, if that is 1 it knows
 that the device isn't a SSD.  I believe that LVM passes through whatever
 the next lower layer's value is, but dmcrypt (and by extension LUKS)
 always force it to a 1 (possibly to prevent programs from using
 heuristics for enabling discard)
 
 In the current running condition, the system clearly sees this is *not* 
 rotational, even thru the LVM/dmcrypt stack :
 
 # mount | grep btrfs
 /dev/mapper/VG-LINUX on / type btrfs 
 (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag)
 
 # ll /dev/mapper/VGV-LINUX
 lrwxrwxrwx. 1 root root 7 16 juin  09:21 /dev/mapper/VG-LINUX - ../dm-1
 
 # cat /sys/block/dm-1/queue/rotational 
 0
 
 ...However, at mkfs.btrfs time, it might well not have seen it, as I made it
 from a live USB key in which neither lvm.conf nor crypttab had been
 tailored to allow trim commands...

 However, now that the FS is created, I still wonder whether I should use a
 rebalance to change the metadata from DUP to SINGLE, or if I'd better stay
 with DUP...
 
 Kind regards.
 
 
I'd personally stay with the DUP profile, but then that's just me being
paranoid.  You will almost certainly get better performance using the
SINGLE profile instead of DUP, but that is mostly because it requires
fewer blocks to be encrypted by LUKS (which is almost certainly your
primary bottleneck unless you have some high-end crypto-accelerator card).
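
If you do decide to switch, it can be done in place with a rebalance; a
sketch, assuming a btrfs-progs version with balance filters:

  # convert metadata from DUP to single on the mounted filesystem
  btrfs balance start -mconvert=single /
  # and back again later, if you change your mind
  btrfs balance start -mconvert=dup /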





Re: [systemd-devel] Slow startup of systemd-journal on BTRFS

2014-06-16 Thread Austin S Hemmelgarn
On 06/16/2014 03:52 PM, Martin wrote:
 On 16/06/14 17:05, Josef Bacik wrote:

 On 06/16/2014 03:14 AM, Lennart Poettering wrote:
 On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote:

 I am not really following though why this trips up btrfs though. I am
 not sure I understand why this breaks btrfs COW behaviour. I mean,
 
 I don't believe that fallocate() makes any difference to
 fragmentation on
 BTRFS.  Blocks will be allocated when writes occur so regardless of an
 fallocate() call the usage pattern in systemd-journald will cause
 fragmentation.

 journald's write pattern looks something like this: append something to
 the end, make sure it is written, then update a few offsets stored at
 the beginning of the file to point to the newly appended data. This is
 of course not easy to handle for COW file systems. But then again, it's
 probably not too different from access patterns of other database or
 database-like engines...
 
 Even though this appears to be a problem case for btrfs/COW, is there a
 more favourable write/access sequence possible that is easily
 implemented that is favourable for both ext4-like fs /and/ COW fs?
 
 Database-like writing is known to be 'difficult' for filesystems: can a data
 log be a simpler case?
 
 
 Was waiting for you to show up before I said anything since most systemd
 related emails always devolve into how evil you are rather than what is
 actually happening.
 
 Ouch! Hope you two know each other!! :-P :-)
 
 
 [...]
 since we shouldn't be fragmenting this badly.

 Like I said, what you guys are doing is fine; if btrfs falls on its face
 then it's not your fault.  I'd just like an exact idea of when you guys
 are fsync'ing so I can replicate it in a smaller way.  Thanks,
 
 Good if COW can be so resilient. I have about 2GBytes of data logging
 files and I must defrag those as part of my backups to stop the system
 fragmenting to a stop (I use cp -a to defrag the files to a new area
 and restart the data software logger on that).
 
 
 Random thoughts:
 
 Would using a second small file just for the mmap-ed pointers help avoid
 repeated rewriting of random offsets in the log file causing excessive
 fragmentation?
 
 Align the data writes to 16kByte or 64kByte boundaries/chunks?
 
 Are mmap-ed files a similar problem to using a swap file and so should
 the same btrfs file swap code be used for both?
 
 
 Not looked over the code so all random guesses...
 
 Regards,
 Martin
 
 
 
 
 
Just a thought, partly inspired by the mention of the swap code, has
anyone tried making the file NOCOW and pre-allocating to the max journal
size?  A similar approach has seemed to help on my systems with generic
log files (I keep debug level logs from almost everything, so I end up
with very active log files with ridiculous numbers of fragments if I
don't pre-allocate and mark them NOCOW).  I don't know for certain how
BTRFS handles appends to NOCOW files, but I would be willing to bet that
it ends up with a new fragment for each filesystem block worth of space
allocated.
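
For reference, the manual version of what I do for my own log files looks
roughly like this (a sketch; the path is just illustrative, and chattr +C
only takes effect on an empty file):

  touch /var/log/someapp.log
  chattr +C /var/log/someapp.log           # mark NOCOW while the file is still empty
  fallocate -l 128M /var/log/someapp.log   # pre-allocate the expected maximum size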





Re: btrfs on whole disk (no partitions)

2014-06-19 Thread Austin S Hemmelgarn
On 2014-06-18 16:10, Chris Murphy wrote:
 
 On Jun 18, 2014, at 1:29 PM, Daniel Cegiełka daniel.cegie...@gmail.com 
 wrote:
 
 Hi,
 I created btrfs directly to disk using such a scheme (no partitions):

 dd if=/dev/zero of=/dev/sda bs=4096
 mkfs.btrfs -L dev_sda /dev/sda
 mount /dev/sda /mnt

 cd /mnt
 btrfs subvolume create __active
 btrfs subvolume create __active/rootvol
 btrfs subvolume create __active/usr
 btrfs subvolume create __active/home
 btrfs subvolume create __active/var
 btrfs subvolume create __snapshots

 cd /
 umount /mnt
 mount -o subvol=__active/rootvol /dev/sda /mnt
 mkdir /mnt/{usr,home,var}
 mount -o subvol=__active/usr /dev/sda /mnt/usr
 mount -o subvol=__active/home /dev/sda /mnt/home
 mount -o subvol=__active/var /dev/sda /mnt/var

 # /etc/fstab
 UID=ID/btrfs rw,relative,space_cache,subvol=__active/rootvol0 0
 UUID=ID/usrbtrfs rw,relative,space_cache,subvol=__active/usr0 0
 UUID=ID/homebtrfs rw,relative,space_cache,subvol=__active/home0 0
 UUID=ID/varbtrfs rw,relative,space_cache,subvol=__active/var0 0
 
 rw and space_cache are redundant because they are default; and relative is 
 not a valid mount option. All you need is subvol= 
 
 Everything works fine. Is such a solution is recommended? In my
 opinion, the creation of the partitions seems to be completely
 unnecessary if you can use btrfs.
 
 It's firmware specific. Some BIOS firmwares will want to see a valid MBR 
 partition map at LBA 0, not just boot code. Others only care to blindly 
 execute the boot code which would be put in the Btrfs bootloader pad (64KB). 
 I don't know if parted 3.1 recognizes partitionless disks with Btrfs though 
 so it might slightly increase the risk that it's treated as something other 
 than what it is.
 
 For UEFI firmware, it would definitely need to be partitioned since an EFI 
 System partition is required.
 
 Chris Murphy
 
On most hardware, I would definitely suggest at least adding a
minimal-sized partition table; the people who design the BIOS code on
most systems make too many assumptions to trust their code to work
correctly.  That said, I regularly use BTRFS on flat devices for the
root filesystems of Xen PV guest systems, systems that boot from SAN,
and secondary disks on other systems, with no issues whatsoever.
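
If you want the safer layout, wrapping the whole disk in a single
partition costs almost nothing; a sketch (/dev/sdX is illustrative):

  parted -s /dev/sdX mklabel gpt
  parted -s /dev/sdX mkpart primary 1MiB 100%
  mkfs.btrfs -L mylabel /dev/sdX1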





Questions about BTRFS_IOC_FILE_EXTENT_SAME

2014-06-19 Thread Austin S Hemmelgarn
I have a few questions about the BTRFS_IOC_FILE_EXTENT_SAME ioctl, and
was hoping that I could get answers here without having to go source
diving or trying to test things myself:

1. What kind of overhead is there when it is called on a group of
extents that aren't actually the same (aside from the obvious pair of
context-switches that are required for an ioctl)?  I would think that it
would bail at the first difference it finds, but I have learned that
when it comes to kernel code, just because something seems obvious
doesn't mean that's how it's done.

2. Does it matter if the ranges passed in are actual extents, or can
they be arbitrary ranges of equal bytes in the files?

3. What happens if one of the ranges is truncated by the end of a file?
 IOW, if I have files A and B, and file A is longer than file B, and
file B is identical to the start of file A, what happens if I pass in
both files starting at offset 0, but pass the length of file A instead
of passing in the length of file B?

4. Does it matter if one of the extents passed in is compressed and the
other is not?

Thanks in advance.






Re: -d single for data blocks on a multiple devices doesn't work as it should

2014-06-24 Thread Austin S Hemmelgarn
 I somehow have doubts that a complex filesystem is the right project for
 me to start learning C, so I'll have to pass :-) No huge corporation
 with that itch behind me either, and I guess it will be more than a few
 hours for a btrfs programmer so no way I could sponsor that on my own.

Whether or not it is the right project really depends on where you
intend to do most of your C programming.  If you plan to do most of it
in kernel code and occasional userspace wrappers for kernel interfaces
(like me), then it could be a great place because it's under such heavy
development (which means more developers are working on it, and bugs get
spotted faster, both of which are good things for a project you are using
to learn a language).  If, however, you intend to do mostly userspace
programming, then I would definitely agree with you; programming in
userspace and in kernel-space are so different that it's almost like a
different language using the same syntax and similar semantics.





Re: [Question] Btrfs on iSCSI device

2014-06-27 Thread Austin S Hemmelgarn
On 2014-06-27 12:34, Goffredo Baroncelli wrote:
 Hi,
 On 06/27/2014 05:44 PM, Zhe Zhang wrote:
 Hi,

 I setup 2 Linux servers to share the same device through iSCSI. Then I
 created a btrfs on the device. Then I saw the problem that the 2 Linux
 servers do not see a consistent file system image.

 Details:
 -- Server 1 running kernel 2.6.32, server 2 running 3.2.1
 -- Both running btrfs v0.20-rc1
 -- Server 2 has device /dev/vdc, exposed as iSCSI target
  -- Server 1 mounts the device as /dev/sda
 -- Server 1 'mount /dev/sda /mnt/btrfs'; server 2 'mount /dev/vdc 
 /mnt/btrfs',
  -- When server 1 'touch /mnt/btrfs/foo', server 2 doesn't see any
 file under /mnt/btrfs
 -- I created /mnt/btrfs/foo on server 2 as well; then I added some
 content from both server 1 and server 2 to /mnt/btrfs/foo
 -- After that each server sees the content it adds, but not the
 content from the other server
 -- Both server 'umount /mnt/btrfs', and mount it again
 -- Then both servers see /mnt/btrfs/foo with the content added from
 server 2 (I guess it's because server 2 created the foo file later
 than server 1).

 I did a similar test on ext4 and both servers see a consistent image
 of the file system. When server 1 creates a foo file server 2
 immediately sees it.

 Is this how btrfs is supposed to work?
 
 I don't think that it is possible to mount the _same device_ at the _same
 time_ on two different machines. And this doesn't depend on the filesystem.

 The fact that you see it working is, I suspect, just coincidence.

 When I tried this (same SCSI HD connected to two machines), I had to ensure
 that the two machines never accessed the HD at the same time.
 

 Thanks,

 Zhe

 
 
If you need shared storage like that, you need to use a real cluster
filesystem like GFS2 or OCFS2, BTRFS isn't designed for any kind of
concurrent access to shared storage from separate systems.
The reason it appears to work when using iSCSI and not with directly
connected parallel SCSI or SAS is that iSCSI doesn't provide low level
hardware access.





Re: [Question] Btrfs on iSCSI device

2014-06-27 Thread Austin S Hemmelgarn
On 06/27/2014 07:40 PM, Russell Coker wrote:
 On Fri, 27 Jun 2014 18:34:34 Goffredo Baroncelli wrote:
 I don't think that it is possible to mount the _same device_ at the _same
 time_ on two different machines. And this doesn't depend by the filesystem.
 
 If you use a clustered filesystem then you can safely mount it on multiple 
 machines.
 
 If you use a non-clustered filesystem it can still mount and even appear to 
 work for a while.  It's surprising how many writes you can make to a dual-
 mounted filesystem that's not designed for such things before you get a 
 totally broken filesystem.
 
 On Fri, 27 Jun 2014 13:15:16 Austin S Hemmelgarn wrote:
 The reason it appears to work when using iSCSI and not with directly
 connected parallel SCSI or SAS is that iSCSI doesn't provide low level
 hardware access.
 
 I've tried this with dual-attached FC and had no problems mounting.  In what 
 way is directly connected SCSI different from FC?
 
FC is actually its own networking stack (and in theory you can even run
other protocols like IP and ATM on top of it), whereas parallel
SCSI is just a multi-drop bus, and SAS is just a tree-structured bus
with point-to-point communications emulated on top of it.  In other
words, parallel SCSI has topology constraints like RS-485, SAS has
topology constraints like USB, and FC has topology constraints like
Ethernet.

Secondly, most filesystems on Linux will let you mount them multiple
times on separate hosts (ext4 has a feature to prevent this, but it is
expensive and therefore turned off by default; I think XFS might have
something similar, but I'm not sure).  BTRFS should in theory be more
resilient than most because of its COW nature (as long as it's only a
few commit cycles, you should still be able to recover most of the data
just fine).
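
The ext4 feature in question is multiple-mount protection (MMP); as a
sketch of how it gets enabled (device name illustrative, filesystem must
be unmounted):

  tune2fs -O mmp /dev/sdX1
  tune2fs -l /dev/sdX1 | grep -i mmp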





Re: mount time of multi-disk arrays

2014-07-07 Thread Austin S Hemmelgarn
On 2014-07-07 09:54, Konstantinos Skarlatos wrote:
 On 7/7/2014 4:38 μμ, André-Sebastian Liebe wrote:
 Hello List,

 can anyone tell me how much time is acceptable and assumable for a
 multi-disk btrfs array with classical hard disk drives to mount?

 I'm having a bit of trouble with my current systemd setup, because it
 couldn't mount my btrfs raid anymore after adding the 5th drive. With
 the 4 drive setup it failed to mount once in a few times. Now it fails
 everytime because the default timeout of 1m 30s is reached and mount is
 aborted.
 My last 10 manual mounts took between 1m57s and 2m12s to finish.
 I have the exact same problem, and have to manually mount my large
 multi-disk btrfs filesystems, so I would be interested in a solution as
 well.
 

 My hardware setup contains a
 - Intel Core i7 4770
 - Kernel 3.15.2-1-ARCH
 - 32GB RAM
 - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm)
  - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm)

 Thanks in advance

 André-Sebastian Liebe
 --


 # btrfs fi sh
 Label: 'apc01_pool0'  uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb
  Total devices 5 FS bytes used 14.21TiB
  devid1 size 3.64TiB used 2.86TiB path /dev/sdd
  devid2 size 3.64TiB used 2.86TiB path /dev/sdc
  devid3 size 3.64TiB used 2.86TiB path /dev/sdf
  devid4 size 3.64TiB used 2.86TiB path /dev/sde
  devid5 size 3.64TiB used 2.88TiB path /dev/sdb

 Btrfs v3.14.2-dirty

 # btrfs fi df /data/pool0/
 Data, single: total=14.28TiB, used=14.19TiB
 System, RAID1: total=8.00MiB, used=1.54MiB
 Metadata, RAID1: total=26.00GiB, used=20.20GiB
 unknown, single: total=512.00MiB, used=0.00

This is interesting; I actually did some profiling of the mount timings
for a bunch of different configurations of 4 (identical other than
hardware age) 1TB Seagate disks.  One of the arrangements I tested was
Data using the single profile and Metadata/System using RAID1.  Based on
the results I got, and what you are reporting, the mount time doesn't
scale linearly with the amount of storage space.

You might want to try the RAID10 profile for Metadata; of the
configurations I tested, the fastest used Single for Data and RAID10 for
Metadata/System.
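
If you want to try that, it can be converted online with a rebalance; a
sketch (the -f is needed because the System profile is being changed):

  btrfs balance start -mconvert=raid10 -sconvert=raid10 -f /data/pool0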

Also, based on the System chunk usage, I'm guessing that you have a LOT
of subvolumes/snapshots, and I do know that having very large (100+)
numbers of either does slow down the mount command (I don't think that
we cache subvolume information between mount invocations, so it has to
re-parse the system chunks for each individual mount).





Re: btrfs RAID with enterprise SATA or SAS drives

2014-07-10 Thread Austin S Hemmelgarn
On 2014-07-09 22:10, Russell Coker wrote:
 On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote:
 - for someone using SAS or enterprise SATA drives with Linux, I
 understand btrfs gives the extra benefit of checksums, are there any
 other specific benefits over using mdadm or dmraid?

 I think I can answer this one.

 Most important advantage I think is BTRFS is aware of which blocks of the
 RAID are in use and need to be synced:

 - Instant initialization of RAID regardless of size (unless at some
 capacity mkfs.btrfs needs more time)
 
 From mdadm(8):
 
--assume-clean
   Tell mdadm that the array pre-existed and is known to be  clean.
   It  can be useful when trying to recover from a major failure as
   you can be sure that no data will be affected unless  you  actu‐
   ally  write  to  the array.  It can also be used when creating a
   RAID1 or RAID10 if you want to avoid the initial resync, however
   this  practice  — while normally safe — is not recommended.  Use
   this only if you really know what you are doing.
 
   When the devices that will be part of a new  array  were  filled
   with zeros before creation the operator knows the array is actu‐
   ally clean. If that is the case,  such  as  after  running  bad‐
   blocks,  this  argument  can be used to tell mdadm the facts the
   operator knows.
 
 While it might be regarded as a hack, it is possible to do a fairly instant 
 initialisation of a Linux software RAID-1.

This has the notable disadvantage, however, that the first scrub you run
will essentially perform a full resync if you didn't make sure that the
disks had identical data to begin with.
 - Rebuild after disk failure or disk replace will only copy *used* blocks
 
 Have you done any benchmarks on this?  The down-side of copying used blocks 
 is 
 that you first need to discover which blocks are used.  Given that seek time 
 is 
 a major bottleneck at some portion of space used it will be faster to just 
 copy the entire disk.
 
 I haven't done any tests on BTRFS in this regard, but I've seen a disk 
 replacement on ZFS run significantly slower than a dd of the block device 
 would.
 
First of all, this isn't really a good comparison for two reasons:
1. EVERYTHING on ZFS (or any filesystem that tries to do that much work)
is slower than a dd of the raw block device.
2. Even if the throughput is lower, this is only really an issue if the
disk is more than half full, because you don't copy the unused blocks.

Also, while it isn't really a recovery situation, I recently upgraded
from a 2 1TB disk BTRFS RAID1 setup to a 4 1TB disk BTRFS RAID10 setup,
and the performance of the re-balance really wasn't all that bad.  I
have maybe 100GB of actual data, so the array started out roughly 10%
full, and the re-balance only took about 2 minutes.  Of course, it
probably helps that I make a point to keep my filesystems de-fragmented,
scrub and balance regularly, and don't use a lot of sub-volumes or
snapshots, so the filesystem in question is not too different from what
it would have looked like if I had just wiped the FS and restored from a
backup.
 Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID
 should be able to do this as well. But also for scrubbing: BTRFS only
 check and repairs used blocks.
 
 When you scrub Linux Software RAID (and in fact pretty much every RAID) it 
 will only correct errors that the disks flag.  If a disk returns bad data and 
 says that it's good then the RAID scrub will happily copy the bad data over 
 the good data (for a RAID-1) or generate new valid parity blocks for bad data 
 (for RAID-5/6).
 
 http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html
 
 Page 12 of the above document says that nearline disks (IE the ones people 
 like me can afford for home use) have a 0.466% incidence of returning bad 
 data 
 and claiming it's good in a year.  Currently I run about 20 such disks in a 
 variety of servers, workstations, and laptops.  Therefore the probability of 
 having no such errors on all those disks would be .99534^20=.91081.  The 
 probability of having no such errors over a period of 10 years would be 
 (.99534^20)^10=.39290 which means that over 10 years I should expect to have 
 such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are 
 necessary features.
 






Re: Btrfs transaction checksum corruption losing root of the tree bizarre UUID change.

2014-07-10 Thread Austin S Hemmelgarn
On 07/10/2014 07:32 PM, Tomasz Kusmierz wrote:
 Hi all !
 
 So it's been some time with btrfs, and so far I was very pleased, but
 since I've upgraded to ubuntu from 13.10 to 14.04 problems started to
 occur (YES I know this might be unrelated).
 
 So in the past I've had problems with btrfs which turned out to be a
 problem caused by static from printer generating some corruption in
 ram causing checksum failures on the file system - so I'm not going to
 assume that there is something wrong with btrfs from the start.
 
 Anyway:
 On my server I'm running 6 x 2TB disk in raid 10 for general storage
 and 2 x ~0.5 TB raid 1 for system. Might be unrelated, but after
 upgrading to 14.04 I've started using Own Cloud which uses Apache 
 MySql for backing store - all data stored on storage array, mysql was
 on system array.
 
 All started with csum errors showing up in mysql data files and in
 some transactions!!!  Generally the system immediately switched btrfs
 into all-read-only mode, forced by the kernel (don't have
 dmesg / syslog now). Removed offending files, problem seemed to go
 away and started from scratch. After 5 days the problem reappeared and now
 was located around same mysql files and in files managed by apache as
 cloud. At this point since these files are rather dear to me I've
 decided to pull all stops and try to rescue as much as I can.
 
 As an exercise in btrfs management I've run btrfsck --repair - did not
 help. Repeated with --init-csum-tree - turned out that this left me
 with blank system array. Nice ! could use some warning here.
 
I know that this will eventually be pointed out by somebody, so I'm
going to save them the trouble and mention that it does say both on the
wiki and in the manpages that btrfsck should be a last resort (i.e., after
you have made sure you have backups of anything on the FS).
 I've moved all the drives to my main rig, which has a nice
 16GB of ECC ram, so errors from ram, cpu, or controller should be
 theoretically eliminated. I've used the system array drives and a spare
 drive to extract all dear to me files to newly created array (1tb +
 500GB + 640GB). Runned a scrub on it and everything seemed OK. At this
 point I've deleted dear to me files from storage array and ran  a
 scrub. Scrub now showed even more csum errors in transactions and one
 large file that was not touched FOR VERY LONG TIME (size ~1GB).
 Deleted file. Ran scrub - no errors. Copied dear to me files back to
 storage array. Ran scrub - no issues. Deleted files from my backup
 array and decided to call a day. Next day I've decided to run a scrub
 once more just to be sure this time it discovered a myriad of errors
 in files and transactions. Since I've had no time to continue decided
 to postpone on next day - next day I've started my rig and noticed
 that both backup array and storage array does not mount anymore. I was
 attempting to rescue situation without any luck. Power cycled PC and
 on next startup both arrays failed to mount, when I tried to mount
 backup array mount told me that this specific uuid DOES NOT EXIST
 !?!?!
 
 my fstab uuid:
 fcf23e83-f165-4af0-8d1c-cd6f8d2788f4
 new uuid:
 771a4ed0-5859-4e10-b916-07aec4b1a60b
 
 
 tried to mount by /dev/sdb1 and it did mount. Tried by new uuid and it
 did mount as well. Scrub passes with flying colours on backup array
 while storage array still fails to mount with:
 
 root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/
 mount: wrong fs type, bad option, bad superblock on /dev/sdd1,
missing codepage or helper program, or other error
In some cases useful info is found in syslog - try
dmesg | tail  or so
 
 for any device in the array.
 
 Honestly this is a question to more senior guys - what should I do now ?
 
 Chris Mason - have you got any updates to your old friend stress.sh
 ? If not I can try using previous version that you provided to stress
 test my system - but I this is a second system that exposes this
 erratic behaviour.
 
 Anyone - what can I do to rescue my beloved files (no sarcasm with
 zfs / ext4 / tapes / DVDs)
 
 ps. needles to say: SMART - no sata CRC errors, no relocated sectors,
 no errors what so ever (as much as I can see).
First thing that I would do is some very heavy testing with tools like
iozone and fio.  I would use the verify mode from iozone to further
check data integrity.  My guess based on what you have said is that it
is probably issues with either the storage controller (I've had issues
with almost every brand of SATA controller other than Intel, AMD, Via,
and Nvidia, and it almost always manifested as data corruption under
heavy load), or something in the disk's firmware.  I would still suggest
double-checking your RAM with Memtest, and checking the cables on the
drives.  The one other thing that I can think of is potential voltage
sags from the PSU (either because the PSU is overloaded at times, or
because of really noisy/poorly-conditioned line power).  Of course, I
may be 

Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Austin S Hemmelgarn
On 07/20/2014 10:00 AM, Tomasz Torcz wrote:
 On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote:
 TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted:

 One week for a raid10 rebuild 4x3TB drives is a very long time.
 Any thoughts?
 Can you share any statistics from your RAID10 rebuilds?


 At a week, that's nearly 5 MiB per second, which isn't great, but isn't 
 entirely out of the realm of reason either, given all the processing it's 
 doing.  A day would be 33.11+, reasonable thruput for a straight copy, 
 and a raid rebuild is rather more complex than a straight copy, so...
 
   Uhm, sorry, but 5MBps is _entirely_ unreasonable.  It is order-of-magnitude
 unreasonable.  And all the processing shouldn't even show as a blip
 on modern CPUs.
   This speed is undefendable.
 
I wholly agree that it's undefendable, but I can tell you why it is so
slow: it's not 'all the processing' (which is maybe a few hundred
instructions on x86 for each block), it's that BTRFS still serializes
writes to devices instead of queuing all of them in parallel (that is,
when there are four devices that need to be written to, it writes to
each one in sequence, waiting for the previous write to finish before
dispatching the next write).  Personally, I would love to see this
behavior improved, but I really don't have any time to work on it myself.





Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.

2014-07-26 Thread Austin S Hemmelgarn
On 07/24/2014 05:28 PM, Chris Mason wrote:
 
 
 On 06/26/2014 11:53 PM, Qu Wenruo wrote:
 Current btrfs will only use the first superblock, making the backup
 superblocks only useful for 'btrfs rescue super' command.

 The old problem is that if we use backup superblocks when the first
 superblock is not valid, we will be able to mount a none btrfs
 filesystem, which used to contains btrfs but other fs is made on it.

 The old problem can be solved relatively easily by checking the first
 superblock in a special way:
 1) If the magic number in the first superblock does not match:
This filesystem is not btrfs anymore, just exit.
If end-user consider it's really btrfs, then old 'btrfs rescue super'
method is still available.

 2) If the magic number in the first superblock matches but checksum does
not match:
This filesystem is btrfs but first superblock is corrupted, use
backup roots. Just continue searching remaining superblocks.
 
 I do agree that in these cases we can trust that the backup superblock
 comes from the same filesystem.
 
 But, for right now I'd prefer the admin get involved in using the backup
 supers.  I think silently using the backups is going to lead to surprises.
Maybe there could be a non-default mount option to use backup
superblocks iff the first one is corrupted, and then log a warning
whenever this actually happens?  Not handling stuff like this
automatically really hurts HA use cases.






Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.

2014-07-27 Thread Austin S Hemmelgarn
On 07/27/2014 08:29 PM, Qu Wenruo wrote:
 
  Original Message 
 Subject: Re: [PATCH RFC] btrfs: Use backup superblocks if and only if
 the first superblock is valid but corrupted.
 From: Austin S Hemmelgarn ahferro...@gmail.com
 To: Chris Mason c...@fb.com, Qu Wenruo quwen...@cn.fujitsu.com,
 linux-btrfs@vger.kernel.org
 Date: 2014年07月27日 10:57
 On 07/24/2014 05:28 PM, Chris Mason wrote:

 On 06/26/2014 11:53 PM, Qu Wenruo wrote:
 Current btrfs will only use the first superblock, making the backup
 superblocks only useful for 'btrfs rescue super' command.

 The old problem is that if we use backup superblocks when the first
 superblock is not valid, we will be able to mount a none btrfs
 filesystem, which used to contains btrfs but other fs is made on it.

 The old problem can be solved related easily by checking the first
 superblock in a special way:
 1) If the magic number in the first superblock does not match:
 This filesystem is not btrfs anymore, just exit.
 If end-user consider it's really btrfs, then old 'btrfs rescue
 super'
 method is still available.

 2) If the magic number in the first superblock matches but checksum
 does
 not match:
 This filesystem is btrfs but first superblock is corrupted, use
 backup roots. Just continue searching remaining superblocks.
 I do agree that in these cases we can trust that the backup superblock
 comes from the same filesystem.

 But, for right now I'd prefer the admin get involved in using the backup
 supers.  I think silently using the backups is going to lead to
 surprises.
 Maybe there could be a mount non-default mount-option to use backup
 superblocks iff the first one is corrupted, and then log a warning
 whenever this actually happens?  Not handling stuff like this
 automatically really hurts HA use cases.


 This seems better, and the comments also suggest this idea.
 What about merging the behavior into the 'recovery' mount option or adding a
 new mount option?
Personally, I'd add a new mount option, but make recovery imply that option.
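
For illustration only, usage could end up looking something like the
sketch below; the standalone option name is purely hypothetical and does
not exist anywhere yet:

  mount -o recovery /dev/sdX /mnt        # existing option, which would imply the new behaviour
  mount -o backup_super /dev/sdX /mnt    # hypothetical dedicated option for just this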






Re: Multi Core Support for compression in compression.c

2014-07-27 Thread Austin S Hemmelgarn
On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea, but compression in btrfs seems to be only
 using one core to compress.
 Depending on the CPU used and the amount of cores in the CPU we can
 make this much faster
 with multiple cores. This seems bad by my reading at least I would
 recommend for writing compression
 we write a function to use a certain amount of cores based on the load
 of the system's CPU not using
 more then 75% of the system's CPU resources as my system when idle has
 never needed more
 then one core of my i5 2500k to run when with interrupts for opening
 eclipse are running. For reading
 compression on good core seems fine to me as testing other compression
 software for reads , it's
 way less CPU intensive.
 Cheers Nick
We would probably get a bigger benefit from taking an approach like the
one SquashFS has recently added, that is, allowing multi-threaded
decompression for reads, and decompressing directly into the pagecache.
Such an approach would likely make zlib compression much more scalable
on large systems.






Re: Multi Core Support for compression in compression.c

2014-07-28 Thread Austin S Hemmelgarn
On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems to be only
 using one core to compress.
 Depending on the CPU used and the amount of cores in the CPU we can
 make this much faster
 with multiple cores. This seems bad by my reading at least I would
 recommend for writing compression
 we write a function to use a certain amount of cores based on the load
 of the system's CPU not using
 more then 75% of the system's CPU resources as my system when idle has
 never needed more
 then one core of my i5 2500k to run when with interrupts for opening
 eclipse are running. For reading
 compression on good core seems fine to me as testing other compression
 software for reads , it's
 way less CPU intensive.
 Cheers Nick
 We would probably get a bigger benefit from taking an approach like
 SquashFS has recently added, that is, allowing multi-threaded
 decompression fro reads, and decompressing directly into the pagecache.
  Such an approach would likely make zlib compression much more scalable
 on large systems.


 
 Austin,
 That seems better than my idea as you seem to be more up to date on
 btrfs development.
 If you and the other developers of btrfs are interested in adding this
 as a feature please let
 me know as I would like to help improve btrfs as the file system as
 an idea is great just
 seems like it needs a lot of work :).
 Nick
I wouldn't say that I am a BTRFS developer (power user maybe?), but I
would definitely say that parallelizing compression on writes would be a
good idea too (especially for things like lz4, which IIRC is either in
3.16 or in the queue for 3.17).  Both options would be a lot of work,
but almost any performance optimization would.  I would almost say that
it would provide a bigger performance improvement to get BTRFS to
intelligently stripe reads and writes (at the moment, any given worker
thread only dispatches one write or read to a single device at a time,
and any given write() or read() syscall gets handled by only one worker).





Re: Multi Core Support for compression in compression.c

2014-07-28 Thread Austin S Hemmelgarn
On 2014-07-28 11:57, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 11:13 AM, Nick Krause xerofo...@gmail.com
 wrote:
 On Mon, Jul 28, 2014 at 6:10 AM, Austin S Hemmelgarn 
 ahferro...@gmail.com wrote:
 On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn 
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems
 to be only using one core to compress. Depending on the
 CPU used and the amount of cores in the CPU we can make
 this much faster with multiple cores. This seems bad by
 my reading at least I would recommend for writing
 compression we write a function to use a certain amount
 of cores based on the load of the system's CPU not using 
 more then 75% of the system's CPU resources as my system
 when idle has never needed more then one core of my i5
 2500k to run when with interrupts for opening eclipse are
 running. For reading compression on good core seems fine
 to me as testing other compression software for reads ,
 it's way less CPU intensive. Cheers Nick
 We would probably get a bigger benefit from taking an
 approach like SquashFS has recently added, that is,
 allowing multi-threaded decompression fro reads, and
 decompressing directly into the pagecache. Such an approach
 would likely make zlib compression much more scalable on
 large systems.
 
 
 
 Austin, That seems better then my idea as you seem to be more
 up to date on brtfs devolopment. If you and the other
 developers of brtfs are interested in adding this as a
 feature please let me known as I would like to help improve
 brtfs as the file system as an idea is great just seems like
 it needs a lot of work :). Nick
 I wouldn't say that I am a BTRFS developer (power user maybe?),
 but I would definitely say that parallelizing compression on
 writes would be a good idea too (especially for things like
 lz4, which IIRC is either in 3.16 or in the queue for 3.17).
 Both options would be a lot of work, but almost any performance
 optimization would.  I would almost say that it would provide a
 bigger performance improvement to get BTRFS to intelligently
 stripe reads and writes (at the moment, any given worker thread
 only dispatches one write or read to a single device at a
 time, and any given write() or read() syscall gets handled by
 only one worker).
 
 
 I will look into this idea and see if I can do this for writes. 
 Regards Nick
 
 Austin, it seems that we don't want to release the cache for inodes
 in order to improve writes if we are going to use the page cache. We
 seem to be doing this for writes in end_compressed_bio_write for
 standard pages and in end_compressed_bio_write. If we want to cache
 write pages, why are we removing them then?  Seems like this needs to be
 removed in order to start off. Regards Nick
 
I'm not entirely sure; it's been a while since I went exploring in the
page-cache code.  My guess is that there is some reason, which you and I
aren't seeing, that the code is trying for write-around semantics; maybe
one of the people who originally wrote this code could weigh in?  Part of
this might be to do with the fact that normal page-cache semantics
don't always work as expected with COW filesystems (because a write goes
to a different block on the device than a read before the write would
have gone to).  It might be easier to parallelize reads first, and
then work from that (most workloads would probably benefit more
from parallelized reads anyway).





Re: Multi Core Support for compression in compression.c

2014-07-29 Thread Austin S Hemmelgarn
On 2014-07-29 13:08, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 2:36 PM, Nick Krause xerofo...@gmail.com wrote:
 On Mon, Jul 28, 2014 at 12:19 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 2014-07-28 11:57, Nick Krause wrote:
 On Mon, Jul 28, 2014 at 11:13 AM, Nick Krause xerofo...@gmail.com
 wrote:
 On Mon, Jul 28, 2014 at 6:10 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 11:21 PM, Nick Krause wrote:
 On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 On 07/27/2014 04:47 PM, Nick Krause wrote:
 This may be a bad idea , but compression in brtfs seems
 to be only using one core to compress. Depending on the
 CPU used and the amount of cores in the CPU we can make
 this much faster with multiple cores. This seems bad by
 my reading at least I would recommend for writing
 compression we write a function to use a certain amount
 of cores based on the load of the system's CPU not using
 more then 75% of the system's CPU resources as my system
 when idle has never needed more then one core of my i5
 2500k to run when with interrupts for opening eclipse are
 running. For reading compression on good core seems fine
 to me as testing other compression software for reads ,
 it's way less CPU intensive. Cheers Nick
 We would probably get a bigger benefit from taking an
 approach like SquashFS has recently added, that is,
 allowing multi-threaded decompression fro reads, and
 decompressing directly into the pagecache. Such an approach
 would likely make zlib compression much more scalable on
 large systems.



 Austin, That seems better then my idea as you seem to be more
 up to date on brtfs devolopment. If you and the other
 developers of brtfs are interested in adding this as a
 feature please let me known as I would like to help improve
 brtfs as the file system as an idea is great just seems like
 it needs a lot of work :). Nick
 I wouldn't say that I am a BTRFS developer (power user maybe?),
 but I would definitely say that parallelizing compression on
 writes would be a good idea too (especially for things like
 lz4, which IIRC is either in 3.16 or in the queue for 3.17).
 Both options would be a lot of work, but almost any performance
 optimization would.  I would almost say that it would provide a
 bigger performance improvement to get BTRFS to intelligently
 stripe reads and writes (at the moment, any given worker thread
 only dispatches one write or read to a single device at a
 time, and any given write() or read() syscall gets handled by
 only one worker).


 I will look into this idea and see if I can do this for writes.
 Regards Nick

 Austin, Seems since we don't want to release the cache for inodes
 in order to improve writes if are going to use the page cache. We
 seem to be doing this for writes in end_compressed_bio_write for
 standard pages and in end_compressed_bio_write. If we want to cache
 write pages why are we removing then ? Seems like this needs to be
 removed in order to start off. Regards Nick

 I'm not entirely sure, it's been a while since I went exploring in the
 page-cache code.  My guess is that there is some reason that you and I
 aren't seeing that we are trying for write-around semantics, maybe one
 of the people who originally wrote this code could weigh in?  Part of
 this might be to do with the fact that normal page-cache semantics
 don't always work as expected with COW filesystems (cause a write goes
 to a different block on the device than a read before the write would
 have gone to).  It might be easier to parallelize reads first, and
 then work from that (and most workloads would probably benefit more
 from the parallelized reads).

 I will look into this later today and work on it then.
 Regards Nick
 
 Seems the best way to do this is to create a kernel thread per core, like in NFS,
 and use these threads depending on the load of the system.
 Regards Nick
 
It might be more work now, but it would probably be better in the long
run to do it using kernel workqueues, as they would provide better
support for suspend/hibernate/resume, and then you wouldn't need to
worry about scheduling or how many CPU cores are in the system.





Re: Btrfs offline deduplication

2014-08-01 Thread Austin S Hemmelgarn
On 07/31/2014 07:54 PM, Timofey Titovets wrote:
 Good time of day.
 I have several questions about data deduplication on btrfs.
 Sorry if i ask stupid questions or waste you time %)
 
 What about implementation of offline data deduplication? I don't see
 any activity on this place, may be i need to ask a particular person?
 Where the problem? May be a can i try to help (testing as example)?
 
 I could be wrong, but as i understand btrfs store crc32 checksum one
 per file, if this is true, may be make a sense to create small worker
 for dedup files? Like worker for autodefrag?
 With simple logic like:
 if sum1 == sum2  file_size1 == file_size2; then
 if (bit_to_bit_identical(file1,2)); then merge(file1, file2);
 This can be first attempt to implement per file offline dedup
 What you think about it? could i be wrong? or this is a horrible crutch?
 (as i understand it not change format of fs)
 
 (bedup and other tools, its cool, but have several problem with these
 tools and i think, what kernel implementation can work better).
 
I think there may be some misunderstandings here about some of the
internals of BTRFS.  First of all, checksums are stored per block, not
per file, and secondly, deduplication can be done on a much finer scale
than individual files (you can deduplicate individual extents).

I do think however that having the option of a background thread doing
deduplication asynchronously is a good idea, but then you would have to
have some way to trigger it on individual files/trees, and triggering on
writes like the autodefrag thread does doesn't make much sense.  Having
some userspace program to tell it to run on a given set of files would
probably be the best approach for a trigger.  I don't remember if this
kind of thing was also included in the online deduplication patches that
got posted a while back or not.
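
As a point of comparison, the userspace tools mentioned later in this
thread already provide roughly that kind of trigger; a sketch with
duperemove (treat the exact flags as illustrative):

  # scan a directory tree for duplicate extents and deduplicate them
  duperemove -d -r /mnt/data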





Re: Btrfs offline deduplication

2014-08-01 Thread Austin S Hemmelgarn
On 08/01/2014 02:55 PM, Mark Fasheh wrote:
 On Fri, Aug 01, 2014 at 10:16:08AM -0400, Austin S Hemmelgarn wrote:
 On 2014-08-01 09:23, David Sterba wrote:
 On Fri, Aug 01, 2014 at 06:17:44AM -0400, Austin S Hemmelgarn wrote:
 I do think however that having the option of a background thread doing
 deduplication asynchronously is a good idea, but then you would have to
 have some way to trigger it on individual files/trees, and triggering on
 writes like the autodefrag thread does doesn't make much sense.  Having
 some userspace program to tell it to run on a given set of files would
 probably be the best approach for a trigger.  I don't remember if this
 kind of thing was also included in the online deduplication patches that
 got posted a while back or not.

 IIRC the proposed implementation only merged new writes with existing
 data.

 For the out-of-band (off-line) dedup there's bedup
 (https://github.com/g2p/bedup) or Mark's duperemove tool
 (https://github.com/markfasheh/duperemove) that work on a set of files.

 Something kernel-side to do the work asynchronously would be nice,
 especially if it could leverage the check-sums that BTRFS already stores
 for the blocks.  Having a userspace interface for offline deduplication
 similar to that for scrub operations would even better.
 
 Why does this have to be kernel side? There's userspace software already to
 dedupe that can be run on a regular basis. Exporting checksums is a
 differnet story (you can do that via ioctl) but running the dedupe software
 itself inside the kernel is exactly what we want to avoid by having the
 dedupe ioctl in the first place.
   --Mark
 
 --
 Mark Fasheh
 
Based on the same logic, however, we don't need scrub to be done kernel
side, as it would take only one more ioctl to be able to tell it which
block out of a set to treat as valid.  I'm not saying that things need
to be done in the kernel, but duperemove doesn't use the ioctl interface
even though it exists, bedup is buggy as hell (unless it's improved
greatly in the last two weeks), and neither of them is at all efficient.
I do understand that this isn't something that is computationally
simple (especially on x86 with its shortage of registers), but rsync
does almost the same thing for data transmission over the network, and
it does so seemingly much more efficiently than either option available
at the moment.





Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 09:17, Peter Waller wrote:
 For anyone else having this problem, this article is fairly useful for
 understanding disk full problems and rebalance:
 
 http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html
 
 It actually covers the problem that I had, which is that a rebalance
 can't take place because it is full.
 
 I still am unsure what is really wrong with this whole situation. Is
 it that I wasn't careful to do a rebalance when I should have been
 doing? Is it that BTRFS doesn't do a rebalance automatically when it
 could in principle?
 
 It's pretty bad to end up in a situation (with spare space) where the
 only way out is to add more storage, which may be impractical,
 difficult or expensive.
I really disagree with the statement that adding more storage is
difficult or expensive: all you need to do is plug in a 2G USB flash
drive, or allocate a ramdisk, and add the device to the filesystem only
long enough to do a full balance.
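
For reference, a rough sketch of the temporary-device trick using a
ramdisk (device names and sizes are placeholders; a USB stick works the
same way, just skip the brd lines):

modprobe brd rd_nr=1 rd_size=2097152      # creates /dev/ram0, 2 GiB (rd_size is in KiB)
btrfs device add /dev/ram0 /mnt/pool      # temporarily grow the filesystem
btrfs balance start /mnt/pool             # the balance now has somewhere to move chunks
btrfs device delete /dev/ram0 /mnt/pool   # migrate everything back off the ramdisk
rmmod brd

Just don't leave the ramdisk in the filesystem any longer than the
balance takes.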
 
 The other thing that I still don't understand I've seen repeated in a
 few places, from the above article:
 
 because the filesystem is only 55% full, I can ask balance to rewrite
 all chunks that are more than 55% full
 
 Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I
 don't understand the relationship between the FS is 55% full and
 chunks more than 55% full. What's going on here?
To understand this, you have to understand that BTRFS uses a two-level
allocation scheme.  At the top level you have chunks, which are
contiguous regions of the disk that get used for storing a specific
block type; data chunks default to 1G in size, and metadata chunks
default to 256M.  When a filesystem is created, you get the minimum
number of chunks of each type based on the replication profiles chosen
for each chunk type; with no extra options, this means 1 data chunk and
2 metadata chunks for a single-disk filesystem.  Within each chunk,
BTRFS then allocates and frees individual blocks on demand; these blocks
are the analogue of blocks in most other filesystems.  When there are no
free blocks in any chunk of a given type, BTRFS allocates new chunks of
that type based on the replication profile.  Unlike blocks, however,
chunks aren't freed automatically (there are good reasons for this
behavior, but they are kind of long to explain here).  This is where
balance comes in: it takes all of the blocks in the filesystem and sends
them back through the block allocator.  This usually causes all of the
free blocks to end up in a single chunk, and frees the unneeded chunks.

When someone talks about a chunk being x% full, they mean that x% of the
space in that chunk is used by allocated blocks.  Talking about how full
the filesystem is can get tricky because of the replication profiles,
but the usual consensus is to treat that as the percentage of the
filesystem that contains blocks that are being used.

It should say LESS than 55% full in the various articles, as the
-dusage=x option tells balance to only consider chunks that are less
than x% full for balancing.  In general, if your filesystem is totally
full, you should use numbers starting with 0 and work your way up from
there.  You may even get lucky: using -dusage=0 -musage=0 may free up
enough chunks that you don't need to add more storage.
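
As a concrete illustration (the mount point is a placeholder), digging
out of a completely full filesystem usually looks something like this:

btrfs balance start -dusage=0 -musage=0 /mnt     # free completely empty chunks first
btrfs balance start -dusage=10 -musage=10 /mnt   # then progressively fuller ones
btrfs balance start -dusage=25 -musage=25 /mnt
btrfs fi df /mnt                                 # check what got reclaimed after each pass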
 
 I conclude that now since I have added more storage, the rebalance
 won't fail and if I keep rebalancing from a cron job I won't hit this
 problem again (unless the filesystem fills up very fast! what then?).
 I don't know however what value to assign to `-dusage` in general for
 the cron rebalance. Any hints?
I've found that something between 25 and 50 tends to do well; much
outside of that range and you start to get diminishing returns.  The
exact value tends to be more a matter of personal preference; I use 25
on most of my systems, because I don't like saturating the disks with
I/O for very long.  Do make sure, however, to add -musage=x as well, as
metadata should also be balanced (especially if you have very large
numbers of small files).
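
For the cron job itself, a minimal sketch (the script path and mount
point are placeholders; adjust the percentages to taste):

#!/bin/sh
# e.g. /etc/cron.weekly/btrfs-balance
btrfs balance start -dusage=25 -musage=25 /mnt/pool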
 






Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 10:11, Peter Waller wrote:
 On 4 August 2014 15:02, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 I really disagree with the statement that adding more storage is
 difficult or expensive, all you need to do is plug in a 2G USB flash
 drive, or allocate a ramdisk, and add the device to the filesystem only
 long enough to do a full balance.
 
 What if the machine is a server in a datacenter you don't have
 physical access to and the problem is an emergency preventing your
 users from being able to get work done?
 
 What happens if you use a RAM disk and there is a power failure?
 
I'm not saying that either option is a perfect solution.  In fact, the
only reason that I even mentioned the ramdisk is because I have had good
success with that on my laptop, but then laptops essentially have a
built-in UPS.  I personally wouldn't use a ramdisk except as a last
resort if you don't have some sort of UPS or redundancy in the PSU.






Re: ENOSPC with mkdir and rename

2014-08-04 Thread Austin S Hemmelgarn
On 2014-08-04 06:31, Peter Waller wrote:
 Thanks Hugo, this is the most informative e-mail yet! (more inline)
 
 On 4 August 2014 11:22, Hugo Mills h...@carfax.org.uk wrote:

  * btrfs fi show
 - look at the total and used values. If used < total, you're OK.
   If used == total, then you could potentially hit ENOSPC.
 
 Another thing which is unclear and undocumented anywhere I can find is
 what the meaning of `btrfs fi show` is.
 
 I'm sure it is totally obvious if you are a developer or if you have
 used it for long enough. But it isn't covered in the manpage, nor in
 the oracle documentation, nor anywhere on the wiki that I could find.
 
You didn't look very hard then, because there is information in the
manpage (oh wait, you mentioned Oracle, so you're probably using RHEL or
CentOS, which are the last things you should be using if you want to use
stuff like BTRFS that is under heavy development), and it is documented
on the wiki.
 When I looked at it in my problematic situation, it said 500 GiB /
 500 GiB. That sounded fine to me because I interpreted the output as
 what fraction of which RAID devices BTRFS was using. In other words, I
 thought Oh, BTRFS will just make use of the whole device that's
 available to it.. I thought that `btrfs fi df` was the source of
 information for how much space was free inside of that.
 
  * btrfs fi df
 - look at metadata used vs total. If these are close to zero (on
   3.15+) or close to 512 MiB (on <3.15), then you are in danger of
   ENOSPC.
 
 Hmm. It's unfortunate that this could indicate an amount of space
 which is free when it actually isn't.
That depends on what you mean by 'free'.
 
 - look at data used vs total. If the used is much smaller than
   total, you can reclaim some of the allocation with a filtered
   balance (btrfs balance start -dusage=5), which will then give
   you unallocated space again (see the btrfs fi show test).
 
 So the filtered balance didn't help in my situation. I understand it's
 something to do with the 5 parameter. But I do not understand what
 the impact of changing this parameter is. It is something to do with a
 fraction of something, but those things are still not present in my
 mental model despite a large amount of reading. Is there an
 illustration which could clear this up?
 
Think of each chunk like a box, and each block as a block, and that you
have two different types of block (data and metadata) and two different
types of box (also data and metadata). The data boxes are four times the
size of the metadata boxes, and they all have to fit in one really big
container (the device itself).  You can only put data blocks in the data
boxes, and you can only put metadata blocks in metadata boxes.  Say that
in total, you can fit 128 data boxes in the large container, or you can
replace one data box with up to four metadata boxes.  Even though you
may only have a few blocks in a given box, the box still takes up the
same amount of space in the larger container.  Thus, it's possible to
have only a few blocks stored, but not be able to add any more boxes to
the larger container.  A balance operation is essentially the equivalent
of taking all of the blocks of a given type, and fitting them into the
smallest number of boxes possible.
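
To tie the analogy back to the tools (the mount point is a placeholder):

btrfs filesystem show /mnt           # per device: used == total means every box slot is allocated
btrfs filesystem df /mnt             # per type: used far below total means the boxes are mostly empty
btrfs balance start -dusage=5 /mnt   # repack nearly-empty data boxes so whole chunks can be freed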
 Among other things I also got the kernel stack trace I pasted at the
 bottom of the first e-mail to this thread when I did the rebalance.
 
This FAQ entry is pretty horrible, I'm afraid. I actually started
 rewriting it here to try to make it clearer what's going on. I'll try
 to work on it a bit more this week and put out a better version for
 the wiki.
 
 This is great to hear! :)
 
 Thanks for your response Hugo, that really cleared up a lot of mental
 model problems. I hope the documentation can be improved so that
 others can learn from my mistakes.
 






Re: ENOSPC with mkdir and rename

2014-08-05 Thread Austin S Hemmelgarn
On 2014-08-05 04:20, Duncan wrote:
 Austin S Hemmelgarn posted on Mon, 04 Aug 2014 13:09:23 -0400 as
 excerpted:
 
 Think of each chunk like a box, and each block as a block, and that you
 have two different types of block (data and metadata) and two different
 types of box (also data and metadata). The data boxes are four times the
 size of the metadata boxes, and they all have to fit in one really big
 container (the device itself).  You can only put data blocks in the data
 boxs, and you can only put metadata blocks in metadata boxes.  Say that
 in total, you can fit 128 data boxes in the large container, or you can
 replace one data box with up to four metadata boxes.  Even though you
 may only have a few blocks in a given box, the box still takes up the
 same amount of space in the larger container.  Thus, it's possible to
 have only a few blocks stored, but not be able to add any more boxes to
 the larger container.  A balance operation is essentially the equivalent
 of taking all of the blocks of a given type, and fitting them into the
 smallest number of boxes possible.
 
 FWIW, that's a great analogy to stick up on the wiki somewhere, probably 
 somewhere in the FAQ related to ENOSPC.  Please consider doing so.
 
 (Someone took one of my explanations from the list and stuck it in the 
 wiki, virtually word-for-word, with a link to the list post in the 
 archives for more.  I was glad, as for some reason I just seem to work 
 best on the lists, and seem to treat web pages as read-only, even if 
 they're on a wiki I in theory have or can get write-privs on.  I'm 
 suggesting someone, doesn't have to be you tho great if it is, do the 
 same with this.)
 
I would love to have it up on the wiki, but don't have an account or
write privileges.  FWIW, I consider anything I post on a mailing list
that isn't marked otherwise (except patches) to be public domain, so
everyone feel free to use it however you want.





Re: Ideas for a feature implementation

2014-08-10 Thread Austin S Hemmelgarn
On 08/10/2014 03:21 PM, Vimal A R wrote:
 Hello,
 
 I came across the to-do list at 
 https://btrfs.wiki.kernel.org/index.php/Project_ideas and would like to know 
 if this list is updated and recent.
 
 I am looking for a project idea for my under graduate degree which can be 
 completed in around 3-4 months. Are there any suggestions and ideas to help 
 me further?
 
 Thank you,
 Vimal
It's not really listed there (though some of the projects there might be
considered subsets of it), but improved parallelization for multi-device
setups is one thing that I know a lot of people would like to see.

Another thing that isn't listed there, but that I would personally love
to see, is support for secure file deletion.  To be truly secure though,
this would need to hook into the COW logic so that files marked for
secure deletion can't be reflinked (maybe make them automatically NOCOW
instead, and don't allow snapshots?), and so that when they get written
to, the blocks that get COW'ed have the old block overwritten.



Re: Ideas for a feature implementation

2014-08-11 Thread Austin S Hemmelgarn
On 08/11/2014 04:27 PM, Chris Murphy wrote:
 
 On Aug 10, 2014, at 8:53 PM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 
 
 Another thing that isn't listed there, that I would personally
 love to see is support for secure file deletion.  To be truly
 secure though, this would need to hook into the COW logic so that
 files marked for secure deletion can't be reflinked (maybe make
 the automatically NOCOW instead, and don't allow snapshots?), and
 when they get written to, the blocks that get COW'ed have the old
 block overwritten.
 
 If the file is reflinked or snapshot, then it can it be secure
 deleted? Because what does it mean to secure delete a file when
 there's a completely independent file pointing to the same physical
 blocks? What if someone else owns that independent file? Does the
 reflink copy get rm'd as well? Or does the file remain, but its
 blocks are zero'd/corrupted?
The semantics that I would expect would be that the extents can't be
reflinked, and when snapshotted the whole file gets COW'ed, and then
inherits the secure deletion flag, possibly with another flag saying
that the user can't disable the secure deletion flag.
 
 For SSDs, whether it's an overwrite or an FITRIM ioctl it's an open
 question when the data is actually irretrievable. It may be
 seconds, but could be much longer (hours?) so I'm not sure if it's
 useful. On HDD's using SMR it's not necessarily a given an
 overwrite will work there either.
By secure deletion, I don't mean make the data absolutely
unrecoverable by any means, I mean make it functionally impractical
for someone without low-level access to and/or extensive knowledge of
the hardware to recover the data; that is, more secure than simply
unlinking the file, but obviously less than (for example) the
application of thermite to the disk platters.  I'm talking the rough
equivalent of wiping the data from RAM.

Anyone who is truly security minded should be using whole disk
encryption anyway, but even then you have the data accessible from the
running OS.


Re: Ideas for a feature implementation

2014-08-12 Thread Austin S Hemmelgarn
On 2014-08-12 11:52, David Pottage wrote:
 
 On 11/08/14 03:53, Austin S Hemmelgarn wrote:
 
 Another thing that isn't listed there, that I would personally love to
 see is support for secure file deletion.  To be truly secure though,
 this would need to hook into the COW logic so that files marked for
 secure deletion can't be reflinked (maybe make the automatically NOCOW
 instead, and don't allow snapshots?), and when they get written to, the
 blocks that get COW'ed have the old block overwritten.
 How would secure deletion interact with file de-duplication?
 
 For example suppose you and I are both users on a multi user system. We
 both obtain copies of the same file independently, and save that file to
 our home directories.
 
 A background process notices that both files are the same and
 de-duplicates them. This means that both your file and mine point to the
 same blocks on disc. This is exactly the same as would happen if you
 made a COW copy of your file, transferred ownership to me, and I moved
 it into my home dir.
 
 You then decide to secure delete your copy of the file. What happens to
 mine? If it gets removed, then you have just deleted a file you don't
 own, if it does not then the file-system has broken the contract to
 secure delete a file when you asked it to.
 
 Also, what happens if the two files have similar portions, but they are
 not identical. For example, if you download and ISO image for ubuntu,
 and I download the ISO for kubuntu (at the same version). There will be
 a lot of sections that are the same, because they will contain a lot of
 packages in common, so there will be large gains in de-duplicating the
 similar parts, but most people would consider the files to be different.
 
 Could this mean that if you secure delete your ubuntu iso, then portions
 of my kubuntu iso might become corrupt?
 
You could work around this by marking the extent instead of the file
(marking a file would mark all of its extents), and then checking for
that marking when the extent is freed (i.e., nobody refers to it anymore).
While this approach might not seem useful to most people, there are
practical use cases for it (even without whole disk encryption).
It would actually be pretty easy to integrate this globally for a
filesystem as a mount option.
 Even if we limit secure delete to root, then we still leave the risk of
 unintentonaly breaking user files, because non-one realised that all or
 part of the file appears in other files via de-duplication. In any case
 if secure delete is limited to root, then most people would not find it
 useful. (or they would use sudo to do it, which brings us back to the
 same problems).
 
 Basically, I think that file secure deletion as a concept is not
 compatible with a 5th generation file system. If you relay want to
 securely remove a file, then copy the stuff you need elsewhere, and put
 the disc in the crusher. Alternatively put the filesystem in an encypted
 container, and then reformat the disc with a different encryption key.
 
While I agree that the traditional notion of secure deletion doesn't fit
in the current generation of file systems, there is still a need for COW
filesystems to be able to prevent sensitive data from being exposed at
run-time.  On any current BTRFS filesystem, it is still possible to find
blocks that have been COW'ed (assuming discard is turned off) and have
no referents, possibly long after the block itself is freed, especially
if the volume is much larger than the stored data set (as is the case
for a large majority of desktop users these days) or the workload is not
write intensive.





Re: Large files, nodatacow and fragmentation

2014-08-14 Thread Austin S Hemmelgarn
On 2014-08-14 10:30, G. Richard Bellamy wrote:
 On Wed, Aug 13, 2014 at 9:23 PM, Chris Murphy li...@colorremedies.com wrote:
 lsattr /var/lib/libvirt/images/atlas.qcow2

 Is the xattr actually in place on that file?
 
 2014-08-14 07:07:36
 $ filefrag /var/lib/libvirt/images/atlas.qcow2
 /var/lib/libvirt/images/atlas.qcow2: 46378 extents found
 2014-08-14 07:08:34
 $ lsattr /var/lib/libvirt/images/atlas.qcow2
 ---C /var/lib/libvirt/images/atlas.qcow2
 
 So, yeah, the attribute is set.
 

 It will fragment somewhat but I can't say that I've seen this much 
 fragmentation with xattr C applied to qcow2. What's the workload? How was 
 the qcow2 created? I recommend -o 
 preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My 
 workloads were rather simplistic: OS installs and reinstalls. What's the 
 filesystem being used in the guest that's using the qcow2 as backing?
 
 When I created the file, I definitely preallocated the metadata, but
 did not set compat or lazy_refcounts. However, isn't that more a
 function of how qemu + KVM managed the image, rather than how btrfs?
 This is a p2v target, if that matters. Workload has been minimal since
 virtualizing because I have yet to get usable performance with this
 configuration. The filesystem in the guest is Win7 NTFS. I have seen
 massive thrashing of the underlying volume during VSS operations in
 the guest, if that signifies.
 

 It might be that your workload is best suited for a preallocated raw file 
 that inherits +C, or even possibly an LV.
 
 I'm close to that decision. As I mentioned, I much prefer the btrfs
 subvolume story over lvm, so moving to raw is probably more desirable
 than that... however, then I run into my lack of understanding of the
 difference between qcow2 and raw with respect to recoverability, e.g.
 does raw have the same ACID characteristics as a qcow2 image, or is
 atomicity a completely separate concern from the format? The ability
 for the owning process to recover from corruption or inconsistency is
 a key factor in deciding whether or not to turn COW off in btrfs - if
 your overlying system is capable of such recovery, like a database
 engine or (presumably) virtualization layer, then COW isn't a
 necessary function from the underlying system.
 
 So, just since I started this reply, you can see the difference in
 fragmentation:
 2014-08-14 07:25:04
 $ filefrag /var/lib/libvirt/images/atlas.qcow2
 /var/lib/libvirt/images/atlas.qcow2: 46461 extents found
 
 That's 17 minutes, an OS without interaction (I wasn't doing anything
 with it, but it may have been doing its own work like updates, etc.),
 and I see an fragmentation increase of 83 extents, and a raid10 volume
 that was beating itself up (I could hear the drives chattering away as
 they worked).
The fact that it is Windows using NTFS is probably part of the problem.
Here are some things you can do to decrease its background disk
utilization (these also improve performance on real hardware):
1. Disable system restore points.  These aren't really necessary if you
are running in a VM and can take snapshots from the host OS.
2. Disable the indexing service.  This does a lot of background disk IO,
and most people don't need the high speed search functionality.
3. Turn off Windows Features that you don't need.  This won't help disk
utilization much, but can greatly improve overall system performance.
4. Disable the paging file.  Windows does a lot of unnecessary
background paging, which can cause lots of unneeded disk IO.  Be careful
doing this however, as it may cause problems for memory hungry applications.
5. See if you can disable boot time services you don't need.  Bluetooth,
SmartCard, and Adaptive Screen Brightness are all things you probably
don't need in a VM environment.

Of these, 1, 2, and 4 will probably help the most.  The other thing is
that NTFS is a journaling file system, and putting a journaled file
system image on a COW backing store will always cause some degree of
thrashing, because the same few hundred MB of the disk get rewritten
over and over again.  The only way to work around that on BTRFS is to
make the file NOCOW, AND preallocate the entire file in one operation
(use the fallocate command from util-linux to do this).
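
A minimal sketch of setting that up for a new raw image (the file name
and size are placeholders; note that +C only takes effect while the file
is still empty):

touch /var/lib/libvirt/images/atlas.raw
chattr +C /var/lib/libvirt/images/atlas.raw         # mark it NOCOW before any data lands in it
fallocate -l 60G /var/lib/libvirt/images/atlas.raw  # preallocate the whole file in one operation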






Re: Questions on using BtrFS for fileserver

2014-08-19 Thread Austin S Hemmelgarn
On 2014-08-19 12:21, M G Berberich wrote:
 Hello,
 
 we are thinking about using BtrFS on standard hardware for a
 fileserver with about 50T (100T raw) of storage (25×4TByte).
 
 This is what I understood so far. Is this right?
 
 · incremental send/receive works.
 
 · There is no support for hotspares (spare disks that automatically
   replaces faulty disk).
 
 · BtrFS with RAID1 is fairly stable.
 
 · RAID 5/6 spreads all data over all devices, leading to performance
   problems on large diskarrays, and there is no option to limit the
   numbers of disk per stripe so far.
 
 Some questions:
 
 · There where reports, that bcache with btrfs leads to corruption. Is
   this still so?
Based on some testing I did last month, bcache with anything has the
potential to cause data corruption.
 
 · If a disk failes, does BtrFS rebalance automatically? (This would
   give a a kind o hotspare behavior)
No, but it wouldn't be hard to write a simple monitoring program to do
this from userspace.  IIRC, the big issue is that you need to add a
device in place of the failed one for the rebalance to work.
 
 · Besides using bcache, are there any possibilities to boost
   performance by adding (dedicated) cache-SSDs to a BtrFS?
Like mentioned in one of the other responses, I would suggest looking
into dm-cache.  BTRFS itself does not have any functionality for this,
although there has been talk of implementing device priorities for
reads, which could provide a similar performance boost.
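
For what it's worth, a rough sketch of what an lvmcache (dm-cache via
LVM) setup looks like, assuming a reasonably recent LVM and with the
volume group, LV names and sizes as placeholders:

lvcreate -n cache -L 100G vg /dev/ssd       # cache data LV on the SSD
lvcreate -n cache_meta -L 1G vg /dev/ssd    # cache metadata LV on the SSD
lvconvert --type cache-pool --poolmetadata vg/cache_meta vg/cache
lvconvert --type cache --cachepool vg/cache vg/bulk   # attach the pool to the big LV
mkfs.btrfs /dev/vg/bulk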
 
 · Are there any reports/papers/web-pages about BtrFS-systems this size
   in use? Praises, complains, performance-reviews, whatever…
While it doesn't quite fit the description, I have had very good success
with a very active 2TB BTRFS RAID10 filesystem built from four
unpartitioned 1TB SATA III hard drives.  The filesystem gets in excess
of 100GB of data written to it each day (almost all rewrites, however),
and is what I use for /home, /var/log, and /var/lib.  I've had no issues
with it that were caused by BTRFS, and in fact BTRFS itself helped me
recover data when the storage controller the drives are connected to
went bad.  On average, I get about 125% of raw disk performance on
writes, and about 110% on reads.

If you are using a very large number of disks, then I would not suggest
that you use BTRFS RAID10, but instead BTRFS RAID1, as RAID10 will try
to stripe things across ALL of the devices in the filesystem.  Unless
each storage controller has no more than about four disks attached to
it, the overhead outweighs the benefit of striping the data.
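
For example, a sketch of setting that up (device names are placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sd[b-z]
# or, to convert an existing filesystem in place:
btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt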

Also, just to make sure it's clear, in BTRFS RAID1, each block gets
written EXACTLY twice.  On the plus side though, this means that if you
do set-up a caching mechanism, you may be able to keep most of the array
spun down a majority of the time.





Re: Questions on using BtrFS for fileserver

2014-08-20 Thread Austin S Hemmelgarn
On 08/19/2014 05:38 PM, Andrej Manduch wrote:
 Hi,
 
 On 08/19/2014 06:21 PM, M G Berberich wrote: · Are there any
 reports/papers/web-pages about BtrFS-systems this size
   in use? Praises, complains, performance-reviews, whatever…
 
 I don't know about papers or benchmarks but few weeks ago there was a
 guy who has problem with really long mounting with btrfs with similiar size.
 https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36226.html
 
 And I would not recommend 3TB disks. *I'm not btrfs dev* but as far as I
 know there is a quite different between rebuilding disk on real RAID and
 btrfs RAID. The problem is btrfs has RAID on filesystem level not on hw
 level so there is bigger mechanical overheat on drives and thus it take
 significantli longer than regular RAID.
It really surprises me that so many people come to this conclusion, but
maybe they don't provide as much slack space as I do on my systems.  In
general you will only have a longer rebuild on BTRFS than on hardware
RAID if the filesystem is more than about 50% full.  On my desktop array
(4x 1TB disks using BTRFS RAID10), I've replaced disks before and it
took less than an hour for the operation.  Of course that array is
usually not more than 10% full.  Interestingly, it took less time to
rebuild this array the last time I lost a disk than it did back when it
was 3x 1TB disks in a BTRFS RAID1, so things might improve overall with
a larger number of disks in the array.


Re: Significance of high number of mails on this list?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-20 23:22, Shriramana Sharma wrote:
 Hello. People on this list have been kind enough to reply to my
 technical questions. However, seeing the high number of mails on this
 list, esp with the title PATCH, I have a question about the
 development itself:
 
 Is this just an indication of a vibrant user/devel community [*] and
 healthy development of many new nice features to eventually come out
 in stable form later, or are we still at the fixing rough edges stage?
 IOW what is the proportion of commits adding new features to those
 stabilising/fixing features?
 
 [* Since there is no separate btrfs-users vs brtfs-dev I'm not able to
 gauge this difference either. i.e. if there were a dedicated -dev list
 I might not be alarmed by a high number of mails indicating fast
 development.]
 
 Mostly I have read like BTRFS is mostly stable but there might be a
 few corner cases as yet unknown since this is a totally new generation
 of FSs. But still given the volume of mails here I wanted to ask...
 I'm sorry I realize I'm being a bit vague but I'm not sure how to
 exactly express what I'm feeling about BTRFS right now...
 
Personally I'd say that BTRFS is 'stable' enough for light usage without
using stuff like quotas or RAID5/6.  So far, having used it since 3.10,
I've only once had a filesystem get corrupted when there wasn't some
serious underlying hardware issue (crashed disk, SATA controller
dropping random single sectors from writes, etc.), and it gives me much
better performance than what I previously used (ext4 on top of LVM).
As far as what to make of the volume of patches on the mailing list, I'd
say that that shouldn't be used as a measure of quality.  The ext4
mailing list is almost as busy on a regular basis, and people have been
using that in production for years, and the XFS mailing list gets much
higher volume of patches from time to time, and it's generally
considered the gold standard of a stable filesystem.





Re: Distro vs latest kernel for BTRFS?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-22 07:59, Shriramana Sharma wrote:
 Hello. I've seen repeated advices to use the latest kernel. While
 hearing of the recent compression bug affecting recent kernels does
 somewhat warn one off the previous advice, I would like to know what
 people who are running regular distros do to get the latest kernel.
 
 Personally I'm on Kubuntu, which provides mainline kernels till a
 particular point but not beyond that.
 
 Do people here always compile the latest kernel themselves just to get
 the latest BTRFS stability fixes (and  improvements, though as a
 second priority)?
 
I personally use Gentoo Unstable on all my systems, so I build all my
kernels locally anyway, and stay pretty much in-line with the current
stable Mainline kernel.
Interestingly, I haven't had any issues related to either of the
recently discovered bugs, despite meeting all of the criteria for being
affected by them.





Re: Distro vs latest kernel for BTRFS?

2014-08-22 Thread Austin S Hemmelgarn
On 2014-08-22 14:22, Rich Freeman wrote:
 On Fri, Aug 22, 2014 at 8:04 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:

 I personally use Gentoo Unstable on all my systems, so I build all my
 kernels locally anyway, and stay pretty much in-line with the current
 stable Mainline kernel.
 
 Gentoo Unstable probably means gentoo-sources, testing version,
 which follows the stable kernel branch, but the most recent stable,
 and not the long-term stable.  gentoo-sources stable version generally
 follows the most recent longterm stable kernel (so 3.14 right now).
 I'm not sure what the exact policy is, but that is my sense of it.
 
 So, you're still running a stable kernel most likely.  If you really
 want mainline then you want git-sources.  That follows the most recent
 mainline I believe.  Of course, if you're following it that closely
 then you probably should think about just doing a git clone and
 managing it yourself, since then you can handle patches/etc more
 easily.
 
 I think the best option for somebody running btrfs is to stick with a
 stable kernel branch, either the current stable or a very recent
 longterm.  I wouldn't go back into 3.2 land or anything like that.
 
 But, yes, if you had stuck with 3.14 and not gone to the current
 stable then you would have missed the compress=lzo deadlock.  So, pick
 your poison.  :)
 
 Rich
 
By saying 'unstable' I'm referring to the stuff delimited in portage
with the ~ARCH keywords.  Personally, I wouldn't use that term myself
(all of my systems running on such packages have been rock-solid stable
from a software perspective), but that is how the official documentation
refers to things with the ~ARCH keywords.  There are a lot of Gentoo
users who don't know about the keyword thing other than as an occasional
inconvenience when emerging certain packages, so I just use the same
term as the documentation.

For the record, I am using the gentoo-sources package, but instead of
using what they mark as stable (which is 3.14), I'm using the most
recent version (which is 3.16.1).





Re: superblock checksum mismatch after crash, cannot mount

2014-08-25 Thread Austin S Hemmelgarn
On 2014-08-24 15:48, Chris Murphy wrote:
 
 On Aug 24, 2014, at 10:59 AM, Flash ROM flashromg...@yandex.com wrote:
 While it sounds dumb, this strange thing being done to put partition table 
 in separate erase block, so it never read-modify-written when FAT entries 
 are updated. Should something go wrong, FAR can recover from backup copy. 
 But erased partition table just suxx. Then, FAT tables are aligned in way to 
 fit well around erase block bounds.
 
 I think you seriously overestimate the knowledge of camera manufacturer's 
 about the details of flash storage; and any ability to discover it; and any 
 willingness on the part of the flash manufacturer to reveal such underlying 
 details. The whole point of these cards is to completely abstract the reality 
 of the underlying hardware from the application layer - in this case the 
 camera or mobile device using it.
 
If you really know what you are doing, it is possible to determine erase
block size by looking at device performance timings, with surprisingly
high accuracy (assuming you aren't trying to have software do it for
you).  I've actually done this before on several occasions, with nearly
100% success.
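
A crude sketch of the timing approach (destructive, and the device name
is a placeholder):

# write with direct I/O at increasing block sizes; throughput usually jumps
# once the block size reaches a multiple of the erase block size
for bs in 64K 128K 256K 512K 1M 2M 4M 8M; do
    printf '%s: ' "$bs"
    dd if=/dev/zero of=/dev/sdX bs=$bs count=64 oflag=direct 2>&1 | tail -n 1
done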
 Also, with SDXC exFAT is now specified. And it has only one FAT there isn't a 
 backup FAT. So they're even more difficult to recover data from should things 
 go awry filesystem wise.
 
It's too bad that TFAT didn't catch on, as it would have been great for
SD cards if it could be configured to put each FAT on a different erase
block.
 
 This said, you can *try* to reformat, BUT no standard OS of firmware 
 formatter will help you with default settings. They can't know geometry of 
 underlying NAND and controller properties. There is no standard, widely 
 accepted way to get such information from card. No matter if you use OS 
 formatter, camera formatter or whatever. YOU WILL RUIN factory format (which 
 is crafted in best possible way) and replace it with another, very likely 
 suboptimal one.
 
 It's recommended by the card manufacturers to reformat it in each camera its 
 inserted into. It's the only recommended way to erase the sd card for 
 re-use, they don't recommend selectively deleting images. And it's known that 
 one camera's partition table and formatting can irritate another camera 
 make/model if the card isn't reformatted by that camera.
 
It's not just cameras that have this issue, a lot of other hardware
makes stupid assumptions about the format of media.  The first firmware
release for the Nintendo Wii for example, chocked if you tried to use an
SD card with more than one partition on it, and old desktop versions of
Windows won't ever show you anything other than the first partition on
an SD card (or most USB storage devices for that matter).






Re: ext4 vs btrfs performance on SSD array

2014-09-02 Thread Austin S Hemmelgarn
I wholeheartedly agree.  Of course, getting something other than CFQ as
the default I/O scheduler is going to be a difficult task.  Enough
people upstream are convinced that we all NEED I/O priorities, when most
of what I see people doing with them is bandwidth provisioning, which
can be done much more accurately (and flexibly) using cgroups.

Ironically, there have been a lot of in-kernel defaults that I have run
into issues with recently, most of which originated in the DOS era,
where a few MB of RAM was high-end.
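
For reference, the sorts of defaults discussed below can all be changed
at runtime; a sketch (device names are placeholders):

blockdev --setra 1024 /dev/sdd                    # bigger readahead
echo 1024 > /sys/block/sdd/queue/max_sectors_kb   # allow larger requests
echo deadline > /sys/block/sdd/queue/scheduler    # swap out CFQ
echo 4096 > /sys/block/md0/md/stripe_cache_size   # bigger RAID5 stripe cache (md arrays only)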

On 2014-09-02 08:55, Zack Coffey wrote:
 While I'm sure some of those settings were selected with good reason,
 maybe there can be a few options (2 or 3) that have some basic
 intelligence at creation to pick a more sane option.
 
 Some checks to see if an option or two might be better suited for the
 fs. Like the RAID5 stripe size. Leave the default as is, but maybe a
 quick speed test to automatically choose from a handful of the most
 common values. If they fail or nothing better is found, then apply the
 default value just like it would now.
 
 
 On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig h...@infradead.org wrote:
 On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote:
 Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs, ext4
 and XFS are doing is doing 128k IOs because that's the default block
 device readahead size.  'blockdev --setra 1024 /dev/sdd' before
 mounting the filesystem will probably fix it.

 Btw, it's really getting time to make Linux storage fs work out the
 box.  There's way to many things that are stupid by default and we
 require everyone to fix up manually:

  - the ridiculously low max_sectors default
  - the very small max readahead size
  - replacing cfq with deadline (or noop)
  - the too small RAID5 stripe cache size

 and probably a few I forgot about.  It's time to make things perform
 well out of the box..
 






Re: Large files, nodatacow and fragmentation

2014-09-02 Thread Austin S Hemmelgarn
On 2014-09-02 14:31, G. Richard Bellamy wrote:
 I thought I'd follow-up and give everyone an update, in case anyone
 had further interest.
 
 I've rebuilt the RAID10 volume in question with a Samsung 840 Pro for
 bcache front device.
 
 It's 5x600GB SAS 15k RPM drives RAID10, with the 512MB SSD bcache.
 
 2014-09-02 11:23:16
 root@eanna i /var/lib/libvirt/images # lsblk
 NAME  MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
 sda 8:00 558.9G  0 disk
 └─bcache3 254:30 558.9G  0 disk /var/lib/btrfs/data
 sdb 8:16   0 558.9G  0 disk
 └─bcache2 254:20 558.9G  0 disk
 sdc 8:32   0 558.9G  0 disk
 └─bcache1 254:10 558.9G  0 disk
 sdd 8:48   0 558.9G  0 disk
 └─bcache0 254:00 558.9G  0 disk
 sde 8:64   0 558.9G  0 disk
 └─bcache4 254:40 558.9G  0 disk
 sdf 8:80   0   1.8T  0 disk
 └─sdf1  8:81   0   1.8T  0 part
 sdg 8:96   0   477G  0 disk /var/lib/btrfs/system
 sdh 8:112  0   477G  0 disk
 sdi 8:128  0   477G  0 disk
 ├─bcache0 254:00 558.9G  0 disk
 ├─bcache1 254:10 558.9G  0 disk
 ├─bcache2 254:20 558.9G  0 disk
 ├─bcache3 254:30 558.9G  0 disk /var/lib/btrfs/data
 └─bcache4 254:40 558.9G  0 disk
 sr011:01  1024M  0 rom
 
 I further split the system and data drives of the VM Win7 guest. It's
 very interesting to see the huge level of fragmentation I'm seeing,
 even with the help of ordered writes offered by bcache - in other
 words while bcache seems to be offering me stability and better
 behavior to the guest, the underlying the filesystem is still seeing a
 level of fragmentation that has me scratching my head.
 
 That being said, I don't know what would be normal fragmentation of a
 VM Win7 guest system drive, so could be I'm just operating in my zone
 of ignorance again.
 
 2014-09-01 14:41:19
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 7 extents found
 atlas-system.qcow2: 154 extents found
 2014-09-01 18:12:27
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 28171 extents found
 2014-09-02 08:22:00
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 35281 extents found
 2014-09-02 08:44:43
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 37203 extents found
 2014-09-02 10:14:32
 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
 atlas-data.qcow2: 564 extents found
 atlas-system.qcow2: 40903 extents found
 
This may sound odd, but are you exposing the disk to the Win7 guest as a
non-rotational device?  Win7 and higher tend to have different write
behavior when they think they are on an SSD (or something else where
seek latency is effectively 0).  Most VMMs (at least, most that I've
seen) will use fallocate to punch holes for ranges that get TRIM'ed in
the guest, so if Windows is sending TRIM commands, that may also be part
of the issue.  Also, you might try reducing the amount of logging in the
guest.





Re: No space on empty, degraded raid10

2014-09-08 Thread Austin S Hemmelgarn
On 2014-09-07 16:38, Or Tal wrote:
 Hi,
 
 I've created a new raid10 array from 4, 4TB drives in order to migrate
 old data to it.
 As I didn't have enough sata ports, I:
 - disconnected one of the raid10 disks to free a sata port,
 - connected an old disk I wanted to migrate,
 - mounted the array with -o degraded
 - copied the data it it.
 
 After about 2MB I got a no space left on device message.
 btrfs fi df showed strange things - much less space in every category
 (about 8GB?) and none of then was full.
 
 Ubuntu 14.10 beta - linux 3.16.0-14
Yeah, RAID10 doesn't really work in degraded mode (even if you have two
disks that have stripes from the same copy).  The approach that would be
needed for what you want to do is:
 1. Make a BTRFS RAID1 filesystem with _3_ new drives
 2. Connect one of the old disks
 3. Transfer data from old disk to new filesystem
 4. After repeating steps 2 and 3 for each old disk, connect the final
new disk, add it to the filesystem, and rebalance with '-dconvert=raid10
-mconvert=raid10'
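
A sketch of those steps (device names and mount points are placeholders):

mkfs.btrfs -m raid1 -d raid1 /dev/sda /dev/sdb /dev/sdc
mount /dev/sda /mnt/new
# for each old disk in turn: connect it, mount it, copy, then swap in the next
cp -a /mnt/old/. /mnt/new/
# once the last old disk is done, connect the fourth new drive:
btrfs device add /dev/sdd /mnt/new
btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/new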

Also, I've found out the hard way that system chunks really should be
RAID1, _NOT_ RAID10, otherwise it's very likely that the filesystem
won't mount at all if you lose 2 disks.





Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 08:27, Bob Williams wrote:
 I have two 2TB disks formatted as a btrfs raid1 array, mirroring both
 data and metadata. Last night I started
 
 # btrfs filesystem balance path
 
In general, unless things are really bad, you don't ever want to use
balance on such a big filesystem without some filters to control what
gets balanced (especially if the filesystem is more than about 50% full
most of the time).

My suggestion in this case would be to use:
# btrfs balance start -dusage=25 -musage=25 path
on a roughly weekly basis.  This will only balance chunks that are less
than 25% full, and therefore run much faster.  If you are particular
about high storage efficiency, then try 50 instead of 25.
 and it is still running 18 hours later. This suggests that most stuff
 only gets written to one physical device, which in turn suggests that
 there is a risk of lost data if one physical device fails. Or is there
 something clever about btrfs raid that I've missed? I've used linux
 software raid (mdraid) before, and it appeared to write to both
 devices simultaneously.
The reason that a full balance takes so long on a big (and I'm assuming
based on the 18 hours it's taken, very full) filesystem is that it reads
all of the data, and writes it out to both disks, but it doesn't do very
good load-balancing like mdraid or LVM do.  I've got a 4x 500GiB BTRFS
RAID10 filesystem that I use for my home directory on my desktop system,
and a full balance on that takes about 6 hours.
 
 Is it safe to interrupt [^Z] the btrfs balancing process?
^Z sends a SIGSTOP, which is a really bad idea with something that is
doing low-level stuff to a filesystem.  If you need to stop the balance
process (and are using a recent enough kernel and btrfs-progs), the
preferred way to do so is to use the following from another terminal:
# btrfs balance stop path
Depending on what the balance operation is working when you do this, it
may take a few minutes before it actually stops (the longest that I've
seen it take is ~200 seconds).
 
 As a rough guide, how often should one perform
 
 a) balance
 b) defragment
 c) scrub
 
 on a btrfs raid setup?
In general, you should be running scrub regularly, and balance and
defragment as needed.  On the BTRFS RAID filesystems that I have, I use
the following policy:
1) Run a 25% balance (the command I mentioned above) on a weekly basis.
2) If the filesystem has less than 50% of either the data or metadata
chunks full at the end of the month, run a full balance on it.
3) Run a scrub on a daily basis.
4) Defragment files only as needed (which isn't often for me because I
use the autodefrag mount option).
5) Make sure that only one of balance, scrub or defrag is running at a
given time.
Normally, you shouldn't need to run balance at all on most BTRFS
filesystems, unless your usage patterns vary widely over time (I'm
actually a good example of this, most of the files in my home directory
are relatively small, except for when I am building a system with
buildroot or compiling a kernel, and on occasion I have VM images that
I'm working with).





Re: Is it necessary to balance a btrfs raid1 array?

2014-09-10 Thread Austin S Hemmelgarn
On 2014-09-10 09:48, Rich Freeman wrote:
 On Wed, Sep 10, 2014 at 9:06 AM, Austin S Hemmelgarn
 ahferro...@gmail.com wrote:
 Normally, you shouldn't need to run balance at all on most BTRFS
 filesystems, unless your usage patterns vary widely over time (I'm
 actually a good example of this, most of the files in my home directory
 are relatively small, except for when I am building a system with
 buildroot or compiling a kernel, and on occasion I have VM images that
 I'm working with).
 
 Tend to agree, but I do keep a close eye on free space.  If I get to
 the point where I'm over 90% allocated to chunks with lots of unused
 space otherwise I run a balance.  I tend to have the most problems
 with my root/OS filesystem running on a 64GB SSD, likely because it is
 so small.
 
 Is there a big performance penalty running mixed chunks on an SSD?  I
 believe this would get rid of the risk of ENOSPC issues if everything
 gets allocated to chunks.  There are obviously no issues with random
 access on an SSD, but there could be other problems (cache
 utilization, etc).
There shouldn't be any more performance penalty than for normally
running mixed chunks.  Also, a 64GB SSD is not small, I use a pair of
64GB SSD's in a BTRFS RAID1 configuration for root on my desktop, and
consistently use less than a quarter (12G on average) of the available
space, and that's with stuff like LibreOffice and the entire OpenClipart
distribution (although I'm not running an 'enterprise' distribution, and
keep /tmp and /var/tmp on tmpfs).
 
 I tend to watch btrfs fi sho and if the total space used starts
 getting high then I run a balance.  Usually I run with -dusage=30 or
 -dusage=50, but sometimes I get to the point where I just need to do a
 full balance.  Often it is helpful to run a series of balance commands
 starting at -dusage=10 and moving up in increments.  This at least
 prevents killing IO continuously for hours.  If we can get to a point
 where balancing can operate at low IO priority that would be helpful.
 
 IO priority is a problem in btrfs in general.  Even tasks run at idle
 scheduling priority can really block up a disk.  I've seen a lot of
 hurry-and-wait behavior in btrfs.  It seems like the initial commit to
 the log/etc is willing to accept a very large volume of data, and then
 when all the trees get updated the system grinds to a crawl trying to
 deal with all the data that was committed.  The problem is that you
 have two queues, with the second queue being rate-limiting but the
 first queue being the one that applies priority control.  What we
 really need is for the log to have controls on how much it accepts so
 that the updating of the trees/etc never is rate-limiting.   That will
 limit the ability to have short IO write bursts, but it would prevent
 low-priority writes from blocking high-priority read/writes.

You know, you can pretty easily control bandwidth utilization just using
cgroups.  This is what I do, and I get much better results with cgroups
and the deadline IO scheduler than I ever did with CFQ. Abstract
priorities are not bad for controlling relative CPU utilization, but
they really suck for IO scheduling.
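
As an illustration, a minimal sketch using the (v1) blkio controller and
the deadline scheduler; the group name, mount point, and the 8:16
major:minor numbers are placeholders:

echo deadline > /sys/block/sdb/queue/scheduler
mkdir /sys/fs/cgroup/blkio/maintenance
# cap writes from this group to ~20 MB/s on the device with major:minor 8:16
echo '8:16 20971520' > /sys/fs/cgroup/blkio/maintenance/blkio.throttle.write_bps_device
echo $$ > /sys/fs/cgroup/blkio/maintenance/tasks   # move this shell into the group
btrfs balance start -dusage=30 /mnt                # then start the balance from it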





Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 02:40, Russell Coker wrote:
 On Mon, 8 Sep 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 Also, I've found out the hard way that system chunks really should be
 RAID1, NOT RAID10, otherwise it's very likely that the filesystem
 won't mount at all if you lose 2 disks.
 
 Why would that be different?
 
 In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be 
 different?
That's still the case, but in a RAID1 with four disks, of the six
different pairs of two disks you could lose, only one will make the
filesystem un-mountable, whereas for a four disk RAID10, there are two
different pairs of two disks you could lose to make the filesystem
un-mountable.  I haven't run the numbers for higher numbers of disks,
but things are likely not better, because if you lose both copies of the
same stripe, things will fail.
 
 Also it would be nice if there was a N-way mirror option for system data.  As 
 such data is tiny (32MB on the 120G filesystem in my workstation) the space 
 used by having a copy on every disk in the array shouldn't matter.
 
N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
is ready, mkfs should default to one copy per disk in the filesystem.





Re: No space on empty, degraded raid10

2014-09-11 Thread Austin S Hemmelgarn
On 2014-09-11 07:38, Hugo Mills wrote:
 On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote:
 On 2014-09-11 02:40, Russell Coker wrote:
 Also it would be nice if there was a N-way mirror option for system data.  
 As 
 such data is tiny (32MB on the 120G filesystem in my workstation) the space 
 used by having a copy on every disk in the array shouldn't matter.

 N-way mirroring is in the queue for after RAID5/6 work; ideally, once it
 is ready, mkfs should default to one copy per disk in the filesystem.
 
Why change the default from 2-copies, which it's been for years?

Sorry about the ambiguity in my statement, I meant that the default for
system chunks should be one copy per disk in the filesystem.  If you
don't have a copy of the system chunks, then you essentially don't have
a filesystem, and that means that BTRFS RAID6 can't provide true
resilience against 2 disks failing catastrophically unless there are at
least 3 copies of the system chunks.





Problem with unmountable filesystem.

2014-09-16 Thread Austin S Hemmelgarn
So, I just recently had to hard reset a system running root on BTRFS,
and when it tried to come back up, it choked on the root filesystem.
Based on the kernel messages, the primary issue is log corruption, and
in theory btrfs-zero-log should fix it.  The actual issue however, is
that the primary superblock appears to be pointing at a corrupted root
tree, which causes pretty much everything that does anything other than
just read the sb to fail.  The first backup sb does point to a good
tree, but only btrfs check and btrfs restore have any option to ignore
the first sb and use one of the backups instead.  To make matters more
complicated, the first sb still has a valid checksum and passes the
tests done by btrfs rescue super-recover, and therefore that can't be
used to recover either.  I was wondering if anyone here might have any
advice.  I'm fine using dd to replace the primary sb with one of the
backups, but don't know the exact parameters that would be needed.
Also, we should consider adding a mount option to select a specific sb
mirror to use; I know that ext* have such an option, and that has
actually saved me a couple of times.  I'm using btrfs-progs 3.16 and
kernel 3.16.1.





Re: Problem with unmountable filesystem.

2014-09-17 Thread Austin S Hemmelgarn
On 2014-09-16 16:57, Chris Murphy wrote:
 
 On Sep 16, 2014, at 8:40 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:
 
 Based on the kernel messages, the primary issue is log corruption, and
 in theory btrfs-zero-log should fix it.
 
 Can you provide a complete dmesg somewhere for this initial failure, just for 
 reference? I'm curious what this indication looks like compared to other 
 problems.
 
Okay, I can't really get a 'complete' dmesg, because the system panics 
on the mount failure (the filesystem in question is the system's root 
filesystem), the system has no serial ports, and I didn't think to 
build in support for console on ttyUSB0.  I can however get what the 
recovery environment (locally compiled based on buildroot) shows when I 
try to mount the filesystem:
[   30.871036] BTRFS: device label gentoo devid 1 transid 160615 /dev/sda3
[   30.875225] BTRFS info (device sda3): disk space caching is enabled
[   30.917091] BTRFS: detected SSD devices, enabling SSD mode
[   30.920536] BTRFS: bad tree block start 0 130402254848
[   30.924018] BTRFS: bad tree block start 0 130402254848
[   30.926234] BTRFS: failed to read log tree
[   30.953055] BTRFS: open_ctree failed
  The actual issue however, is
 that the primary superblock appears to be pointing at a corrupted root
 tree, which causes pretty much everything that does anything other than
 just read the sb to fail.  The first backup sb does point to a good
 tree, but only btrfs check and btrfs restore have any option to ignore
 the first sb and use one of the backups instead.
 
 Maybe use wipefs -a on this volume, which removes the magic from only the 
 first superblock by default (you can specify another location). And then try 
 btrfs-show-super -F which dumps supers with bad magic.
 
Thanks for the suggestion, I hadn't thought of that...
 I just tried this:
 # wipefs -a /dev/sdb
 /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 
 5f 4d
 # btrfs-show-super -F /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x5c1196d7 [DON'T MATCH]
 bytenr65536
 flags 0x1
 magic  [DON'T MATCH]
 […]
 # btrfs-show-super -i1 /dev/sdb
 superblock: bytenr=67108864, device=/dev/sdb
 -
 csum  0xfc70be19 [match]
 bytenr67108864
 flags 0x1
 magic _BHRfS_M [match]
 
 So the mirror is definitely there and valid.
 # btrfs rescue super-recover -yv /dev/sdb
 No valid Btrfs found on /dev/sdb
 Usage or syntax errors
 
 Not expected at all, man page says Recover bad superblocks from good 
 copies. There's a good copy, it's not being found by btrfs rescue 
 super-recover. Seems like a bug.
 
 
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 
 # btrfs check -s1 /dev/sdb
 using SB copy 1, bytenr 67108864
 Checking filesystem on /dev/sdb
 UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
 [snip]
 OK it finds it, maybe a --repair will fix the bad first one?
 # btrfs check -s1 /dev/sdb
 using SB copy 1, bytenr 67108864
 enabling repair mode
 Checking filesystem on /dev/sdb
 UUID: 9acf13de-5b98-4f28-9992-533e4a99d348
 [snip]
 No indication of repair
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 # btrfs check /dev/sdb
 No valid Btrfs found on /dev/sdb
 Couldn't open file system
 [root@f21v ~]# btrfs-show-super -F /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x5c1196d7 [DON'T MATCH]
 bytenr65536
 flags 0x1
 magic  [DON'T MATCH]
 
 
 Still not fixed. Maybe I needed to corrupt something else in the superblock 
 other than the magic and this behavior is intentional, otherwise wipefs -a, 
 followed by btrfsck would resurrect an intentionally wiped btrfs fs, 
 potentially wiping out some newer file system in the process.
 
...though maybe it's a good thing I didn't.
 
 
 I'm fine using dd to replace the primary sb with one of the
 backups, but don't know the exact parameters that would be needed.
 
 Here's an idea:
 
 # btrfs-show-super /dev/sdb
 superblock: bytenr=65536, device=/dev/sdb
 -
 csum  0x92aa51ab [match]
 [snip]
 So I know what I'm looking for starts at LBA 65536/512
 
 # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C
   92 aa 51 ab 00 00 00 00  00 00 00 00 00 00 00 00  |..Q.............|
 [snip]
 
 And as it turns out the csum is right at the beginning, 4 bytes. So use bs of 
 4 bytes, seek 65536/4, count of 1. This should zero just 4 bytes starting at 
 65536 bytes in.
 
 # dd if=/dev/zero of=/dev/sdb bs=4 seek=16384 count=1
 
 Checked it with the earlier skip=128
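 For reference, re-running that earlier check (same device and offsets as 
 above) should now show zeros in the first four bytes where the csum used 
 to be, something along the lines of:
 
 # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C | head -n1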

Re: Problem with unmountable filesystem.

2014-09-18 Thread Austin S Hemmelgarn
On 09/17/2014 02:57 PM, Chris Murphy wrote:
 
 On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:

 Thanks for all the help.
 
 Well, it's not much help. It seems possible to corrupt a primary superblock 
 that points to a corrupt tree root, and use btrfs rescue super-recover to 
 replace it, and then mount should work. One thing I didn't try was corrupting 
 the primary superblock and just mounting normally or with recovery, to see if 
 it'll automatically ignore the primary superblock and use the backup.
 
 But I think you're onto something, that a good superblock can point to a 
 corrupt tree root, and then not have a straightforward way to mount the good 
 tree root. If I understand this correctly.
 
Corrupting the primary superblock did in fact work, and I decided to try
mounting immediately, which failed.  I didn't try with -o recovery, but
I think that would probably fail as well.  Things worked perfectly
however after using btrfs rescue super-recover.  As far as avoiding
future problems, I think the best solution would be to have the mount
operation try the tree root pointed to by the backup superblock if the
one pointed to by the primary seems corrupted.

Secondarily, this almost makes me want to set the ssd option on all
BTRFS filesystems, just to get the rotating superblock updates, because
if it weren't for that behavior, I probably wouldn't have been able to
recover anything in this particular case.


Re: Problem with unmountable filesystem.

2014-09-18 Thread Austin S Hemmelgarn
On 09/17/2014 04:22 PM, Duncan wrote:
 Austin S Hemmelgarn posted on Wed, 17 Sep 2014 07:23:46 -0400 as
 excerpted:
 
 I've also discovered, when trying to use btrfs restore to copy out the
 data to a different system, that 3.14.1 restore apparently chokes on
 filesystems that have lzo compression turned on.  It's reporting errors
 trying to inflate compressed files, and I know for a fact that none of
 those files were even open, let alone being written to, when the system
 crashed.  I don't know if this is a known bug or even if it is still the
 case with btrfs-progs 3.16, but I figured I'd comment about it because I
 haven't seen anything about it anywhere.
 
 FWIW that's a known and recently patched issue.  If you're still seeing 
 issues with it with btrfs-progs 3.16, report it, but 3.14.1 almost 
 certainly wouldn't have had the fix.  (This is one related patch turned 
 up by a quick search; there may be others.)
 
 * commit 93ebec96f2ae1d3276ebe89e2d6188f9b46692fb
 | Author: Vincent Stehlé vincent.ste...@laposte.net
 | Date:   Wed Jun 18 18:51:19 2014 +0200
 |
 | btrfs-progs: restore: check lzo compress length
 |
 | When things go wrong for lzo-compressed btrfs, feeding
 | lzo1x_decompress_safe() with corrupt data during restore
 | can lead to crashes. Reduce the risk by adding
 | a check on the input length.
 |
 | Signed-off-by: Vincent Stehlé vincent.ste...@laposte.net
 | Signed-off-by: David Sterba dste...@suse.cz
 |
 |  cmds-restore.c | 6 ++
 |  1 file changed, 6 insertions(+)
 
Yeah, 3.16 seems fine, I just hadn't updated my recovery environment
yet.  Ironically, I did some performance testing afterwards, and
realized that using any compression was actually slowing down my system
(my disk appears to be faster than my RAM, which is really sad, even for
a laptop).


Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:18, Rob Spanton wrote:

Hi,

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.

I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?

Here are the things that the wiki page asked for [1]:

uname -a:

 Linux zarniwoop.blob 3.16.2-200.fc20.x86_64 #1 SMP Mon Sep 8
 11:54:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version:

 Btrfs v3.16

btrfs fi show:

 Label: 'fedora'  uuid: 717c0a1b-815c-4e6a-86c0-60b921e84d75
Total devices 1 FS bytes used 1.49TiB
devid1 size 2.72TiB used 1.50TiB path /dev/sda4

 Btrfs v3.16

btrfs fi df /:

 Data, single: total=1.48TiB, used=1.48TiB
 System, DUP: total=32.00MiB, used=208.00KiB
 Metadata, DUP: total=11.50GiB, used=10.43GiB
 unknown, single: total=512.00MiB, used=0.00

dmesg dump is attached.

Please CC any responses to me, as I'm not subscribed to the list.

Cheers,

Rob

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list


WRT the performance of Evolution, the issue is probably fragmentation of 
the data files.  If you run the command:

# btrfs fi defrag -rv /home
you should see some improvement in evolution performance (until you get 
any new mail that is).  Evolution (like most graphical e-mail clients 
these days) uses sqlite for data storage, and sqlite database files are 
one of the known pathological cases for COW filesystems in general; the 
solution is to mark the files as NOCOW (see the info about VM images in 
[1] and [2], the same suggestions apply to database files).
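As a rough sketch of what that looks like for an existing data directory 
(the paths here are only an example, the mail client should be closed 
first, and note that chattr +C only affects files created after the flag 
is set on the directory):

$ mkdir ~/.local/share/evolution.nocow
$ chattr +C ~/.local/share/evolution.nocow
$ cp -a ~/.local/share/evolution/. ~/.local/share/evolution.nocow/
$ rm -rf ~/.local/share/evolution
$ mv ~/.local/share/evolution.nocow ~/.local/share/evolution

Copying the files into the NOCOW directory creates fresh files that 
inherit the attribute, which is what actually disables COW for them.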


As for git, I haven't seen any performance issues specific to BTRFS; are 
you using any compress= mount option? zlib based compression is known to 
cause serious slowdowns.  I don't think that git uses any kind of 
database for data storage.  Also, if the performance comparison is from 
other systems, unless those systems have the EXACT same hardware 
configuration, they aren't really a good comparison.  Unless the pc this 
is on is a relatively recent system (less than a year or two old), it 
may just be hardware that is the performance bottleneck.






Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:25, Swâmi Petaramesh wrote:

Le vendredi 19 septembre 2014, 13:18:34 Rob Spanton a écrit :

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.


Weeelll I have the same over-complicated kind of setup, and an Arch Linux
BTRFS system which used to boot in some decent amount of time in the past now
takes about 5 full minutes to just make it to the KDM login prompt, and
another 5 minutes before KDE is fully started. Makes me think of the good ole'
times of Windows 95 OSR2 on a 486SX with a dying 1 GB Hard disk...
Well, part of your problem might be KDE itself; it's extremely CPU 
intensive these days.  I'd suggest disabling the 'semantic desktop' 
stuff, because that tends to be the worst offender as far as soaking up 
system resources.  Also, if you recently switched to systemd, that may 
be causing some slowdown as well (journald's default settings are 
terrible for performance).
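For example (purely a sketch, with arbitrary values), journald can be 
told to keep its journal in RAM, or to cap the on-disk journal, in 
/etc/systemd/journald.conf:

[Journal]
Storage=volatile
#SystemMaxUse=100M
#Compress=yes

and then the service restarted:

# systemctl restart systemd-journald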


Now, let me add that I had removed all snaphots, ran a full defrag, and even
rebalanced the damn thing without any positive effect...

(And yes, my HD is physically in good shape, SMART feels fully happy, and it's
less than 75% full...)

I've been using BTRFS for 2-3 years on a dozen of different systems, and if
something doesn't surprise me at all, it's « slow performance », indeed,
although I'm myself more accustomed to « incredibly fscking damn slow
performance »...
It's kind of funny, but I haven't had any performance issues with BTRFS 
since about 3.10, even on the systems my employer is using Fedora 20 on, 
and those use only a Core 2 Duo Processor, DDR2-800 RAM, and SATA2 hard 
drives.

HTH








Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 08:49, Austin S Hemmelgarn wrote:

On 2014-09-19 08:18, Rob Spanton wrote:

Hi,

I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.

I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?

Here are the things that the wiki page asked for [1]:

uname -a:

 Linux zarniwoop.blob 3.16.2-200.fc20.x86_64 #1 SMP Mon Sep 8
 11:54:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

btrfs --version:

 Btrfs v3.16

btrfs fi show:

 Label: 'fedora'  uuid: 717c0a1b-815c-4e6a-86c0-60b921e84d75
 Total devices 1 FS bytes used 1.49TiB
 devid1 size 2.72TiB used 1.50TiB path /dev/sda4

 Btrfs v3.16

btrfs fi df /:

 Data, single: total=1.48TiB, used=1.48TiB
 System, DUP: total=32.00MiB, used=208.00KiB
 Metadata, DUP: total=11.50GiB, used=10.43GiB
 unknown, single: total=512.00MiB, used=0.00

dmesg dump is attached.

Please CC any responses to me, as I'm not subscribed to the list.

Cheers,

Rob

[1] https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list



WRT the performance of Evolution, the issue is probably fragmentation of
the data files.  If you run the command:
# btrfs fi defrag -rv /home
you should see some improvement in evolution performance (until you get
any new mail that is).  Evolution (like most graphical e-mail clients
these days) uses sqlite for data storage, and sqlite database files are
one of the known pathological cases for COW filesystems in general; the
solution is to mark the files as NOCOW (see the info about VM images in
[1] and [2], the same suggestions apply to database files).

As for git, I haven't seen any performance issues specific to BTRFS; are
you using any compress= mount option? zlib based compression is known to
cause serious slowdowns.  I don't think that git uses any kind of
database for data storage.  Also, if the performance comparison is from
other systems, unless those systems have the EXACT same hardware
configuration, they aren't really a good comparison.  Unless the pc this
is on is a relatively recent system (less than a year or two old), it
may just be hardware that is the performance bottleneck.


Realized after I sent this that I forgot the links for [1] and [2]

[1] https://btrfs.wiki.kernel.org/index.php/UseCases
[2] https://btrfs.wiki.kernel.org/index.php/FAQ





Re: Performance Issues

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 09:51, Holger Hoffstätte wrote:


On Fri, 19 Sep 2014 13:18:34 +0100, Rob Spanton wrote:


I have a particularly uncomplicated setup (a desktop PC with a hard
disk) and I'm seeing particularly slow performance from btrfs.  A `git
status` in the linux source tree takes about 46 seconds after dropping
caches, whereas on other machines using ext4 this takes about 13s.  My
mail client (evolution) also seems to perform particularly poorly on
this setup, and my hunch is that it's spending a lot of time waiting on
the filesystem.


This is - unfortunately - a particular btrfs oddity/characteristic/flaw,
whatever you want to call it. git relies a lot on fast stat() calls,
and those seem to be particularly slow with btrfs esp. on rotational
media. I have the same problem with rsync on a freshly mounted volume;
it gets fast (quite so!) after the first run.
I find that kind of funny, because regardless of filesystem, stat() is 
one of the *slowest* syscalls on almost every *nix system in existence.


The simplest thing to fix this is a du -s >/dev/null to pre-cache all
file inodes.

I'd also love a technical explanation why this happens and how it could
be fixed. Maybe it's just a consequence of how the metadata tree(s)
are laid out on disk.
While I don't know for certain, I think it's largely just a side effect 
of the lack of performance tuning in the BTRFS code.



I've tried mounting with noatime, and this has had no effect.  Anyone
got any ideas?


Don't drop the caches :-)

-h








Re: Problem with unmountable filesystem.

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 13:07, Chris Murphy wrote:

Possibly btrfs-select-super can do some of the things I was doing the hard way. It's 
possible to select a super to overwrite other supers, even if they're good 
ones. Whereas btrfs rescue super-recover won't do that, and neither will btrfsck, hence 
why I corrupted the one I didn't want first. This command isn't built by default (at 
least not on Fedora).
I don't think it's built by default on any of the major distributions. 
On Gentoo you need to set package specific configure options.







Re: Problem with unmountable filesystem.

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 13:54, Chris Murphy wrote:


On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote:

[   30.920536] BTRFS: bad tree block start 0 130402254848
[   30.924018] BTRFS: bad tree block start 0 130402254848
[   30.926234] BTRFS: failed to read log tree
[   30.953055] BTRFS: open_ctree failed

I'm still confused. Btrfs knows this tree root is bad, but it has backup roots. 
So why wasn't one of those used by -o recovery? I thought that's the whole 
point of that mount option. Backup tree roots are per superblock, so 
conceivably you'd have up to 8 of these with two superblocks, they're shown with
btrfs-show-super -af  ## and -F even if a super is bad

But skipping that, to fix this you need to know which super is pointing to the 
wrong tree root, since you're using ssd mount option with rotating supers. I 
assume mount uses the super with the highest generation number. So you'd need 
to:
btrfs-show-super -a
to find out the super with the most recent generation. You'd assume that one 
was wrong. And then use btrfs-select-super to pick the right one, and replace 
the wrong one. Then you could mount.

I also wonder if btrfs check -sX would show different results in your case. I'd 
think it would because it ought to know one of those tree roots is bad, seeing 
as mount knows it. And then it seems (I'm speculating a ton) that --repair 
might try to fix the bad tree root, and then if it fails I'd like to think it 
can just find the most recent good tree root, ideally one listed as a 
backup_tree_root by any good superblock, and then have the next mount use that.

I'm not sure why this persistently fails, and I wonder if there are cases of 
users giving up and blowing away file systems that could actually be mountable. 
But it's just really a manual process figuring out what things to do in what 
order to get them to mount.

From what I can tell, btrfs check doesn't do anything about backup 
superblocks unless you specifically tell it to.  In this case, running 
btrfs check without specifying a superblock mirror, and with explicitly 
specifying the primary superblock produced identical results (namely it 
 choked, hard, with an error message similar to that from the kernel). 
However, running it with -s1 to select the first backup superblock 
returned no errors at all other than the space_cache being invalid and 
the count of used blocks being wrong.


Based on my (limited) understanding of the mount code, it does try to 
use the superblock with the highest generation (regardless of whether we 
are on an ssd or not), but doesn't properly fall back to a secondary 
superblock after trying to mount using the primary.


As far as btrfs check repair trying to fix this, I don't think that it 
does so currently, probably for the same reason that mount fails.







Re: Single disk parrallelization

2014-09-19 Thread Austin S Hemmelgarn

On 2014-09-19 14:10, Jeb Thomson wrote:

With the advanced features of btrfs, it would be an additional simple task to 
make different platters run in parallel.

In this case, say a disk has three platters, and so three seek heads as well. 
If we can identify that much, and what offsets they are at, it then becomes a 
trivial matter to place the reads and writes to different platters at the same 
time.

In effect, this means each platter should be operating as a single virtualized 
unit, instead of the whole drive acting as one single unit...


In theory this is a great idea except for two things:
1) Most consumer drives have only one platter.
2) The kernel doesn't have such low-level hardware access, so it would 
have to be implemented in device firmware (and I'd be willing to bet 
that most drive manufacturers already stripe data across multiple 
platters when possible).







Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-22 16:51, Stefan G. Weichinger wrote:

Am 20.09.2014 um 11:32 schrieb Duncan:


What I do as part of my regular backup regime, is every few kernel cycles
I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new
optional features as I believe appropriate.  Then I boot to the new
backup and run a bit to test it, then wipe the normal working copy and do
a fresh mkfs.btrfs on it, again with the new optional features enabled
that I want.


Is re-creating btrfs-filesystems *recommended* in any way?

Does that actually make a difference in the fs-structure?

I would recommend it; there are some newer features that you can only 
set at mkfs time.  Quite often, when a new feature is implemented, it 
takes some time before it can be enabled online, and even then nothing 
gets converted until it is rewritten.

So far I assumed it was enough to keep the kernel up2date, use current
(stable) btrfs-progs and run some scrub every week or so (not to mention
backups .. if it ain't backed up, it was/isn't important).

Stefan









Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-23 09:06, Stefan G. Weichinger wrote:

Am 23.09.2014 um 14:08 schrieb Austin S Hemmelgarn:

On 2014-09-22 16:51, Stefan G. Weichinger wrote:

Is re-creating btrfs-filesystems *recommended* in any way?

Does that actually make a difference in the fs-structure?


I would recommend it, there are some newer features that you can only
set at mkfs time.  Quite often, when a new feature is implemented, it is
some time before things are such that it can be enabled online, and even
then that doesn't convert anything until it is rewritten.


What features for example?
Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the 
following list of features:

mixed-bg- mixed data and metadata block groups
extref  - increased hard-link limit per file to 65536
raid56  - raid56 extended format
skinny-metadata - reduced size metadata extent refs
no-holes- no explicit hole extents for files

mixed-bg is something that you generally wouldn't want to change after mkfs.
extref can be enabled online, and the filesystem metadata gets updated 
as needed; it doesn't provide any real performance improvement (but is 
needed for some mail servers that have HUGE mail-queues).
I don't know anything about the raid56 option, but there isn't any way 
to change it after mkfs.
skinny-metadata can be changed online, and the format gets updated on 
rewrite of each metadata block.  This one does provide a performance 
improvement (stat() in particular runs noticeably faster).  You should 
probably enable this if it isn't already enabled, even if you don't 
recreate your filesystem.
no-holes cannot currently be changed online, and is a very recent 
addition (post v3.14 btrfs-progs I believe) that provides improved 
performance for sparse files (which is particularly useful if you are 
doing things with fixed size virtual machine disk images).


It's this last one that prompted me personally to recreate my 
filesystems most recently, as I use sparse files to save space as much 
as possible.


I created my main btrfs a few months ago and would like to avoid
recreating it as this would mean restoring my root-fs on my main
workstation.

Although I would do it if it is worth it ;-)

I assume I could read some kind of version number out of the superblock
or so?

btrfs-show-super ?

AFAIK there isn't really any 'version number' that has any meaning in 
the superblock (except for telling the kernel that it uses the stable 
disk layout), however, there are flag bits that you can look for 
(compat_flags, compat_ro_flags, and incompat_flags).  I'm not 100% 
certain what each bit means, but on my system with a only 1 month old 
BTRFS filesystem, with extref, skinny-metadata, and no-holes turned on, 
I have compat_flags: 0x0, compat_ro_flags: 0x0, and incompat_flags: 0x16b.
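If you'd rather not decode the raw flag words by hand, the relevant 
commands are simple enough (a sketch; /dev/sdX is a placeholder):

# btrfs-show-super /dev/sdX | grep flags    # dump the compat/incompat flag words
# mkfs.btrfs -O list-all                    # list the features this progs build knows about
# mkfs.btrfs -O extref,skinny-metadata,no-holes /dev/sdX    # enable them at mkfs time

The last command of course wipes the device, so it only applies when 
recreating the filesystem anyway.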


The other potentially significant thing is that the default 
nodesize/leafsize has changed recently from 4096 to 16384, as that gives 
somewhat better performance for most use cases.







Re: general thoughts and questions + general and RAID5/6 stability?

2014-09-23 Thread Austin S Hemmelgarn

On 2014-09-23 10:23, Tobias Holst wrote:

If it is unknown, which of these options have been used at btrfs
creation time - is it possible to check the state of these options
afterwards on a mounted or unmounted filesystem?


2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn ahferro...@gmail.com
mailto:ahferro...@gmail.com:

Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives
the following list of features:
mixed-bg- mixed data and metadata block groups
extref  - increased hard-link limit per file to 65536
raid56  - raid56 extended format
skinny-metadata - reduced size metadata extent refs
no-holes- no explicit hole extents for files

I don't think there is a specific tool for doing this, but some of them 
do show up in dmesg, for example skinny-metadata shows up as a mention 
of the FS having skinny extents.
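So the quickest check is just to mount the filesystem and look at the 
kernel log (a sketch; the exact message wording varies a bit between 
kernel versions):

# mount /dev/sdX /mnt
# dmesg | grep -i btrfs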






Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-08 15:11, Eric Sandeen wrote:

I was looking at Marc's post:

http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html

and it feels like there isn't exactly a cohesive, overarching vision for
repair of a corrupted btrfs filesystem.

In other words - I'm an admin cruising along, when the kernel throws some
fs corruption error, or for whatever reason btrfs fails to mount.
What should I do?

Marc lays out several steps, but to me this highlights that there seem to
be a lot of disjoint mechanisms out there to deal with these problems;
mostly from Marc's blog, with some bits of my own:

* btrfs scrub
Errors are corrected along if possible (what *is* possible?)
* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.
* mount -o degraded
Allow mounts to continue with missing devices.
(This isn't really a way to recover from corruption, right?)
* btrfs-zero-log
remove the log tree if log tree is corrupt
* btrfs rescue
Recover a damaged btrfs filesystem
chunk-recover
super-recover
How does this relate to btrfs check?
* btrfs check
repair a btrfs filesystem
--repair
--init-csum-tree
--init-extent-tree
How does this relate to btrfs rescue?
* btrfs restore
try to salvage files from a damaged filesystem
(not really repair, it's disk-scraping)


What's the vision for, say, scrub vs. check vs. rescue?  Should they repair the
same errors, only online vs. offline?  If not, what class of errors does one 
fix vs.
the other?  How would an admin know?  Can btrfs check recover a bad tree root
in the same way that mount -o recovery does?  How would I know if I should use
--init-*-tree, or chunk-recover, and what are the ramifications of using
these options?

It feels like recovery tools have been badly splintered, and if there's an
overarching design or vision for btrfs fs repair, I can't tell what it is.
Can anyone help me?


Well, based on my understanding:
* btrfs scrub is intended to be almost exactly equivalent to scrubbing a 
RAID volume; that is, it fixes disparity between multiple copies of the 
same block.  IOW, it isn't really repair per se, but more preventative 
maintenance.  Currently, it only works for cases where you have multiple 
copies of a block (dup, raid1, and raid10 profiles), but support is 
planned for error correction of raid5 and raid6 profiles.
* mount -o recovery I don't know much about, but AFAICT, it is more for 
dealing with metadata-related FS corruption.
* mount -o degraded is used to mount a fs configured for a raid storage 
profile with fewer devices than the profile minimum.  It's primarily so 
that you can get the fs into a state where you can run 'btrfs device 
replace'
* btrfs-zero-log only deals with log tree corruption.  This would be 
roughly equivalent to zeroing out the journal on an XFS or ext4 
filesystem, and should almost never be needed.
* btrfs rescue is intended for low-level recovery of corruption on an 
offline fs.
* chunk-recover I'm not entirely sure about, but I believe it's 
like scrub for a single chunk on an offline fs
* super-recover is for dealing with corrupted superblocks, and 
tries to replace it with one of the other copies (which hopefully isn't 
corrupted)
* btrfs check is intended to (eventually) be equivalent to the fsck 
utility for most other filesystems.  Currently, it's relatively good at 
identifying corruption, but less so at actually fixing it.  There are 
however, some things that it won't catch, like a superblock pointing to 
a corrupted root tree.
* btrfs restore is essentially disk scraping, but with built-in 
knowledge of the filesystem's on-disk structure, which makes it more 
reliable than more generic tools like scalpel for files that are too big 
to fit in the metadata blocks, and it is pretty much essential for 
dealing with transparently compressed files.


In general, my personal procedure for handling a misbehaving BTRFS 
filesystem is:
* Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify 
what's wrong

* Try mounting it using -o recovery
* Try mounting it using -o ro,recovery
* Use -o degraded only if it's a BTRFS raid set that lost a disk
* If btrfs check AND dmesg both seem to indicate that the log tree is 
corrupt, try btrfs-zero-log
* If btrfs check indicated a corrupt superblock, try btrfs rescue 
super-recover

* If all of the above fails, ask for advice on the mailing list or IRC
Also, you should be running btrfs scrub regularly to correct bit-rot and 
force remapping of blocks with read errors.  While BTRFS technically 
handles both transparently on reads, it only corrects things on disk when 
you do a scrub.
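A reasonable way to do that is a periodic foreground scrub (a sketch 
only; the schedule and mount point are arbitrary):

# btrfs scrub start -Bd /    # run in the foreground, report per-device stats
# btrfs scrub status /       # check on a scrub started in the background

or, as an /etc/crontab entry:

0 3 * * 0  root  btrfs scrub start -Bdq /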






Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on 
disk when it detects an error during a read.  I know it doesn't if the fs 
is mounted ro (even if the media is writable), because I did some 
testing to see how 'read-only' mounting a btrfs filesystem really is.


Also, that's a much better description of how multiple copies work than 
I could probably have ever given.







Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 08:12, Hugo Mills wrote:

On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote:

On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct bit-rot
and force remapping of blocks with read errors.  While BTRFS
technically handles both transparently on reads, it only corrects thing
on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50% chance of
btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads the bad
one, it checks the other one and assuming it's good, replaces the bad one
with the good one both for the read (which otherwise errors out), and by
overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading it for
some reason, but even then, chances are 50% it'll pick the good one and
won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it with
the good copy, scrub is the only way to systematically detect and (if
there's a good copy) fix these checksum errors.  It's not that btrfs
doesn't do it if it finds them, it's that the chances of finding them are
relatively low, unless you do a scrub, which systematically checks the
entire filesystem (well, other than files marked nocsum, or nocow, which
implies nocsum, or files written when mounted with nodatacow or
nodatasum).

At least that's the way it /should/ work.  I guess it's possible that
btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but
if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on disk
when it detects an error during a read,


I'm fairly sure it does, as I've had it happen to me. :)
I probably just misinterpreted the source code, while I know enough C to 
generally understand things, I'm by far no expert.



I know it doesn't it the fs is
mounted ro (even if the media is writable), because I did some testing to
see how 'read-only' mounting a btrfs filesystem really is.


If the FS is RO, then yes, it won't fix things.

Hugo.








Re: What is the vision for btrfs fs repair?

2014-10-09 Thread Austin S Hemmelgarn

On 2014-10-09 08:34, Duncan wrote:

On Thu, 09 Oct 2014 08:07:51 -0400
Austin S Hemmelgarn ahferro...@gmail.com wrote:


On 2014-10-09 07:53, Duncan wrote:

Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as
excerpted:


Also, you should be running btrfs scrub regularly to correct
bit-rot and force remapping of blocks with read errors.  While
BTRFS technically handles both transparently on reads, it only
corrects thing on disk when you do a scrub.


AFAIK that isn't quite correct.  Currently, the number of copies is
limited to two, meaning if one of the two is bad, there's a 50%
chance of btrfs reading the good one on first try.

If btrfs reads the good copy, it simply uses it.  If btrfs reads
the bad one, it checks the other one and assuming it's good,
replaces the bad one with the good one both for the read (which
otherwise errors out), and by overwriting the bad one.

But here's the rub.  The chances of detecting that bad block are
relatively low in most cases.  First, the system must try reading
it for some reason, but even then, chances are 50% it'll pick the
good one and won't even notice the bad one.

Thus, while btrfs may randomly bump into a bad block and rewrite it
with the good copy, scrub is the only way to systematically detect
and (if there's a good copy) fix these checksum errors.  It's not
that btrfs doesn't do it if it finds them, it's that the chances of
finding them are relatively low, unless you do a scrub, which
systematically checks the entire filesystem (well, other than files
marked nocsum, or nocow, which implies nocsum, or files written
when mounted with nodatacow or nodatasum).

At least that's the way it /should/ work.  I guess it's possible
that btrfs isn't doing those routine bump-into-it-and-fix-it
fixes yet, but if so, that's the first /I/ remember reading of it.


I'm not 100% certain, but I believe it doesn't actually fix things on
disk when it detects an error during a read; I know it doesn't if the
fs is mounted ro (even if the media is writable), because I did some
testing to see how 'read-only' mounting a btrfs filesystem really is.


Definitely it won't with a read-only mount.  But then scrub shouldn't
be able to write to a read-only mount either.  The only way a read-only
mount should be writable is if it's mounted (bind-mounted or
btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to
that mount, not the read-only mounted location.

In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* The superblock gets updated if there are 'any' writes
* The free space cache 'might' be updated if there are any writes

All in all, a BTRFS filesystem mounted ro is much more read-only than 
say ext4 (which at least updates the sb, and old versions replayed the 
journal, in addition to the atime updates).
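A simple way to run that sort of test (just a sketch, using a scratch 
device; whether this matches the exact method used for the testing 
mentioned above is an assumption) is to mark the block device itself 
read-only and then watch dmesg for failed writes during and after the 
ro mount:

# blockdev --setro /dev/sdX
# mount -o ro /dev/sdX /mnt
# dmesg | tail
# umount /mnt; blockdev --setrw /dev/sdX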


There's even debate about replaying the journal or doing orphan-delete
on read-only mounts (at least on-media, the change could, and arguably
should, occur in RAM and be cached, marking the cache dirty at the
same time so it's appropriately flushed if/when the filesystem goes
writable), with some arguing read-only means just that, don't
write /anything/ to it until it's read-write mounted.

But writable-mounted, detected checksum errors (with a good copy
available) should be rewritten as far as I know.  If not, I'd call it
a bug.  The problem is in the detection, not in the rewriting.  Scrub's
the only way to reliably detect these errors since it's the only thing
that systematically checks /everything/.


Also, that's a much better description of how multiple copies work
than I could probably have ever given.


Thanks.  =:^)








Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Austin S Hemmelgarn

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as the
default tree root, and bypass the bad tree root if it cannot be
repaired, so that the volume can be mounted with default mount options
(i.e. the ones in fstab). Otherwise it's a filesystem that isn't well
suited for general purpose use as rootfs let alone for boot.



A filesystem which is suited for general purpose use is a filesystem
which honors fsync, and doesn't *ever* auto-roll-back without user
intervention.

Anything different is not suited for database transactions at all. Any
paid service which has the users database on btrfs is going to be at
risk of losing payments, and probably without the company even knowing.
If btrfs goes this way I hope a big warning is written on the wiki and
on the manpages telling that this filesystem is totally unsuitable for
hosting databases performing transactions.
If they need reliability, they should have some form of redundancy 
in place and/or run the database directly on the block device, because 
even ext4, XFS, and pretty much every other filesystem can lose data 
sometimes.  The difference is that those tend to give worse results 
than BTRFS when hardware is misbehaving, because BTRFS usually has 
an old copy of whatever data structure gets corrupted to fall back on.


Also, you really shouldn't be running databases on a BTRFS filesystem at 
the moment anyway, because of the significant performance implications.


At most I can suggest that a flag in the metadata be added to
allow/disallow auto-roll-back-on-error on such filesystem, so people can
decide the tolerant vs. transaction-safe mode at filesystem creation.



The problem with this is that if the auto-recovery code did run (and 
IMHO the kernel should spit out a warning to the system log whenever it 
does), then chances are that you wouldn't have had a consistent view if 
you had prevented it from running either; and, if the database is 
properly distributed/replicated, then it should recover by itself.







Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-10 18:05, Eric Sandeen wrote:

On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote:

On 2014-10-10 13:43, Bob Marley wrote:

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as
the default tree root, and bypass the bad tree root if it cannot
be repaired, so that the volume can be mounted with default mount
options (i.e. the ones in fstab). Otherwise it's a filesystem
that isn't well suited for general purpose use as rootfs let
alone for boot.



A filesystem which is suited for general purpose use is a
filesystem which honors fsync, and doesn't *ever* auto-roll-back
without user intervention.

Anything different is not suited for database transactions at all.
Any paid service which has the users database on btrfs is going to
be at risk of losing payments, and probably without the company
even knowing. If btrfs goes this way I hope a big warning is
written on the wiki and on the manpages telling that this
filesystem is totally unsuitable for hosting databases performing
transactions.

If they need reliability, they should have some form of redundancy
in-place and/or run the database directly on the block device;
because even ext4, XFS, and pretty much every other filesystem can
lose data sometimes,


Not if i.e. fsync returns.  If the data is gone later, it's a hardware
problem, or occasionally a bug - bugs that are usually found  fixed
pretty quickly.

Yes, barring bugs and hardware problems they won't lose data.



the difference being that those tend to give
worse results when hardware is misbehaving than BTRFS does, because
BTRFS usually has a old copy of whatever data structure gets
corrupted to fall back on.


I'm curious, is that based on conjecture or real-world testing?

I wouldn't really call it testing, but based on personal experience I 
know that ext4 can lose whole directory sub-trees if it gets a single 
corrupt sector in the wrong place.  I've also had that happen on FAT32 
and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and 
I've actually had individual files disappear on HFS+ without any 
discernible hardware issues.  I don't have as much experience with XFS, 
but would assume based on what I do know of it that it could have 
similar issues.  As for BTRFS, I've only ever had any issues with it 3 
times, one was due to the kernel panicking during resume from S1, and 
the other two were due to hardware problems that would have caused 
issues on most other filesystems as well.  In both cases of hardware 
issues, while the filesystem was initially unmountable, it was 
relatively simple to fix once I knew how.  I tried to fix an ext4 fs 
that had become unmountable due to dropped writes once, and that was 
anything but simple, even with the much greater amount of documentation.






Re: What is the vision for btrfs fs repair?

2014-10-13 Thread Austin S Hemmelgarn

On 2014-10-12 06:14, Martin Steigerwald wrote:

Am Freitag, 10. Oktober 2014, 10:37:44 schrieb Chris Murphy:

On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote:

On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery

Enable autorecovery attempts if a bad tree root is found at mount
time.


I'm confused why it's not the default yet. Maybe it's continuing to
evolve at a pace that suggests something could sneak in that makes
things worse? It is almost an oxymoron in that I'm manually enabling an
autorecovery

If true, maybe the closest indication we'd get of btrfs stablity is the
default enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think at distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one side,
than btrfs had corruption. When I remount it, definitely the worst thing
that can happen is that it auto-rolls-back to a previous known-good
state.

For a general purpose file system, losing 30 seconds (or less) of
questionably committed data, likely corrupt, is a file system that won't
mount without user intervention, which requires a secret decoder ring to
get it to mount at all. And may require the use of specialized tools to
retrieve that data in any case.

The fail safe behavior is to treat the known good tree root as the default
tree root, and bypass the bad tree root if it cannot be repaired, so that
the volume can be mounted with default mount options (i.e. the ones in
fstab). Otherwise it's a filesystem that isn't well suited for general
purpose use as rootfs let alone for boot.


To understand this a bit better:

What can be the reasons a recent tree gets corrupted?


Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just 
happened to be in the middle of a tree commit.

2. Generic power loss during a tree commit.
3. A device not properly honoring write-barriers (the operations 
immediately adjacent to the write barrier weren't being ordered 
correctly all the time).


Based on what I know about BTRFS, the following could also cause problems:
1. A single-event-upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this 
happen to me, but know people who have).


In general, any of these will cause problems for pretty much any 
filesystem, not just BTRFS.

I always thought with a controller and device and driver combination that
honors fsync with BTRFS it would either be the new state of the last known
good state *anyway*. So where does the need to rollback arise from?

I think that in this case the term rollback is a bit ambiguous; here it 
means a rollback from the point of view of userspace, which sees the FS 
as having 'rolled back' from the most recent state to the last known 
good state.

That said all journalling filesystems have some sort of rollback as far as I
understand: If the last journal entry is incomplete they discard it on journal
replay. So even there you use the last seconds of write activity.

But in case fsync() returns the data needs to be safe on disk. I always
thought BTRFS honors this under *any* circumstance. If some proposed
autorollback breaks this guarentee, I think something is broke elsewhere.

And fsync is an fsync is an fsync. Its semantics are clear as crystal. There
is nothing, absolutely nothing to discuss about it.

An fsync completes if the device itself reported Yeah, I have the data on
disk, all safe and cool to go. Anything else is a bug IMO.

Or a hardware issue: most filesystems need disks to properly honor write 
barriers to provide guaranteed semantics on an fsync, and many consumer 
disk drives still don't honor them consistently.






Re: Wishlist Item :: One Subvol in Multiple Places

2014-10-15 Thread Austin S Hemmelgarn

On 2014-10-14 18:25, Robert White wrote:

I've got no idea if this is possible given the current storage layout,
but it would be Really Nice™ if there were a way to have a single
subvolume exist in more than one place in hirearchy. I know this can be
faked via mount tricks (bind or use of subvol=), but having it be a real
thing would be preferable.

For example, if I have two or more distributions on a computer or want
to switch between 32bit and 64bit environments frequently, but I want to
use the same /home (which is its own subvolume anyway) it would be nice
if the native layout could be permuted such that /__System_32/home and
/__System_64/home were the actual same subvolume.

The mechanism, were it possible, would be something like btrfs
subvolume link /existing/path /new/path (or bind instead of link)

I've got no idea if the directory structure would allow for this, but if
it would it would simplify several things (for me anyway) if the file
system layout represented the runtime layout.
This probably won't be implemented, for the same reason that most modern 
unix systems disallow hardlinks to directories; namely, it results in 
ambiguity regarding resolution of the .. directory entry.
The better solution would be to put /home in a separate top-level 
sub-volume, and then mount that in each location.
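With a top-level subvolume (here assumed to be named 'home'; the UUID is 
a placeholder) the fstab entries would look roughly like:

UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /__System_32/home  btrfs  subvol=home  0 0
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /__System_64/home  btrfs  subvol=home  0 0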






Re: strange 3.16.3 problem

2014-10-20 Thread Austin S Hemmelgarn

On 2014-10-20 09:02, Zygo Blaxell wrote:

On Mon, Oct 20, 2014 at 04:38:28AM +, Duncan wrote:

Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted:


# find . -name *546
./1412233213.M638209P10546 # ls -l ./1412233213.M638209P10546 ls: cannot
access ./1412233213.M638209P10546: No such file or directory


Does your mail server do a lot of renames?  Is one perhaps stuck?  If so,
that sounds like the same thing Zygo Blaxell is reporting in the
3.16.3..3.17.1 hang in renameat2() thread, OP on Sun, 19 Oct 2014
15:25:26 -400, Msg-ID: 20141019192525.ga29...@hungrycats.org, as linked
here:

http://permalink.gmane.org/gmane.comp.file-systems.btrfs/39539

I pointed him at this thread too.  I hadn't seen you mention a hung
rename, but the other symptoms sound similar.


Not really.  It looks like Russell having a NFS client-side problem,
I'm having a server-side one (maybe).  Also, all Russell's system calls
seem to be returning promptly, while some of mine are not.  Even if
there were timeouts, an NFS server timeout gives a different error than
'No such file or directory'.  Finally, the one and only thing I _can_
do with my bug is 'ls' on the renamed files (for me, the find would get
stuck before returning any output).

For Russell's issue...most of the stuff I can think of has been
tried already.  I didn't see if there was any attempt try to ls the
file from the NFS server as well as the client side.  If ls is OK on
the server but not the client, it's an NFS issue (possibly interacting
with some btrfs-specific quirk); otherwise, it's likely a corrupted
filesystem (mail servers seem to be unusually good at making these).

Most of the I/O time on mail servers tends to land in the fsync() system
call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after
3.16, and not in the 3.16.x stable update for x = 5 (the last one
I've checked)).  That said, I'm not familiar with how fsync() translates
over NFS, so it might not be relevant after all.

If the NFS server's view of the filesystem is OK, check the NFS protocol
version from /proc/mounts on the client.  Sometimes NFS clients will
get some transient network error during connection and fall back to some
earlier (and potentially buggier) NFS version.  I've seen very different
behavior in some important corner cases from v4 and v3 clients, for
example, and if the client is falling all the way back to v2 the bugs
and their workarounds start to get just plain _weird_ (e.g. filenames
which produce specific values from some hash function or that contain
specific character sequences are unusable).  v2 is so old it may even
have issues with 64-bit inode numbers.

Just now saw this thread, but IIRC 'No such file or directory' also gets 
returned sometimes when trying to automount a share that can't be 
enumerated by the client, and also sometimes when there is a stale NFS 
file handle.






Re: Poll: time to switch skinny-metadata on by default?

2014-10-21 Thread Austin S Hemmelgarn

On 2014-10-21 05:29, Duncan wrote:

David Sterba posted on Mon, 20 Oct 2014 18:34:03 +0200 as excerpted:


On Thu, Oct 16, 2014 at 01:33:37PM +0200, David Sterba wrote:

I'd like to make it default with the 3.17 release of btrfs-progs.
Please let me know if you have objections.


For the record, 3.17 will not change the defaults. The timing of the
poll was very bad to get enough feedback before the release. Let's keep
it open for now.


FWIW my own results agree with yours, I've had no problem with skinny-
metadata here, and it has been my default now for a couple backup-and-new-
mkfs.btrfs generations, now.

As you know there were some problems with it in the first kernel cycle or
two after it was introduced as an option, and I waited awhile until they
died down before trying it here, but as I said, no problems since I
switched it on, and I've been running it awhile now.

So defaulting to skinny-metadata looks good from here. =:^)

Same here, I've been using it on all my systems since I switched from 
3.15 to 3.16, and have had no issues whatsoever.






Re: downgrade from kernel 3.17 to 3.10

2014-10-21 Thread Austin S Hemmelgarn

On 2014-10-21 11:34, Cristian Falcas wrote:

I will start investigating how can we build our own rpms from the 3.16
sources. Until then we are stuck with the ones from the official repos
or elrepo. Which means 3.10 is the latest for el6. We used this until
now and seems we where lucky enough to not hit anything bad.

IIRC there is a make target in the kernel sources that generates the 
appropriate RPMs for you, although building from mainline won't get you 
any of the patches from Oracle that they use in EL.
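Something along these lines should work from an unpacked 3.16.x source 
tree (a sketch; the config step just reuses the running kernel's 
configuration):

$ cp /boot/config-$(uname -r) .config
$ make olddefconfig
$ make -j$(nproc) binrpm-pkg    # or 'rpm-pkg' for source plus binary RPMs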

We upgraded to 3.17 because we use ceph on the machine with openstack
and on the ceph site they recommended 3.14. And because we need
writable snapshots, we are forced to use btrfs under ceph.

Thank you all for your advice.




On Tue, Oct 21, 2014 at 6:20 PM, Robert White rwh...@pobox.com wrote:

On 10/21/2014 06:18 AM, Cristian Falcas wrote:


Thank you for your answer.

I will reformat the disk with a 3.10 kernel in the meantime, because I
don't have any rpms for 3.16 now.



More concisely: Don't use 3.10 BTRFS for data you value. There is a
non-trivial chance that the problems you observed are/were due to bad
things on the disk written there by 3.10.

There is no value to recreating your file systems under 3.10 as the same
thing is likely to go bad again when you get out of the dungeon.

What are your RPM options? What about just getting the sources from
kernel.org and compiling your own 3.16.5?

Seriously, 3.10 just... no...

8-)








Re: device balance times

2014-10-22 Thread Austin S Hemmelgarn

On 2014-10-21 16:44, Arnaud Kapp wrote:

Hello,

I would like to ask if the balance time is related to the number of
snapshot or if this is related only to data (or both).

I currently have about 4TB of data and around 5k snapshots. I'm thinking
of going raid1 instead of single. From the numbers I see this seems
totally impossible as it would take *way* too long.

Would destroying snapshots (those are hourly snapshots to prevent stupid
error to happens, like `rm my_important_file`) help?

Should I reconsider moving to raid1 because of the time it would take?

Sorry if I'm somehow hijacking this thread, but it seemed related :)

Thanks,

The issue is the snapshots.  I regularly fully re-balance my home 
directory on my desktop, which is ~150GB on a BTRFS raid10 setup with 
only 3 or 4 snapshots (I only do daily snapshots, because anything I need 
finer granularity on I have under git), and that takes only about 2 or 3 
hours depending on how many empty chunks I have.


I would remove the snapshots, and also start keeping fewer of them (5k 
hourly snapshots is more than six months worth of file versions), and 
then run the balance.  I would also suggest converting data by itself 
first, and then converting metadata, as converting data chunks will 
require re-writing large parts of the metadata.
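The conversion itself is just two balance runs with convert filters 
(a sketch; the mount point and the added device are placeholders, and 
the second device has to be added before converting to raid1):

# btrfs device add /dev/sdY /mnt
# btrfs balance start -dconvert=raid1 /mnt
# btrfs balance start -mconvert=raid1 /mnt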

On 10/21/2014 10:14 PM, Piotr Pawłow wrote:

On 21.10.2014 20:59, Tomasz Chmielewski wrote:

FYI - after a failed disk and replacing it I've run a balance; it took
almost 3 weeks to complete, for 120 GBs of data:


Looks normal to me. Last time I started a balance after adding 6th
device to my FS, it took 4 days to move 25GBs of data. Some chunks took
20 hours to move. I currently have 156 snapshots on this FS (nightly
rsync backups).

I think it is so slow, because it's disassembling chunks piece by piece
and stuffing these pieces elsewhere, instead of moving chunks as a
whole. If you have a lot of little pieces (as I do), it will take a
while...









Re: 5 _thousand_ snapshots? even 160?

2014-10-22 Thread Austin S Hemmelgarn

On 2014-10-21 21:10, Robert White wrote:


I don't think balance will _ever_ move the contents of a read only
snapshot. I could be wrong. I think you just end up with an endlessly
fragmented storage space and balance has to take each chunk and search
for someplace else it might better fit. Which explains why it took so long.

And just _forget_ single-extent large files at that point.

(Of course I could be wrong about the never-move rule, but moving would
then require the checksums on the potentially hundreds or thousands of
references to be recalculated, which would make incremental send/receive
unfathomable.)

Balance doesn't do anything different for snapshots from what it does 
with regular data.  I think you are confusing balance with 
defragmentation, as that does (in theory) handle snapshots differently. 
Balance just takes all of the blocks selected by the filters, sends them 
through the block allocator again, and then updates the metadata to 
point to the new blocks.  It can result in some fragmentation, but 
usually only for files bigger than about 256M, and even then it doesn't 
always cause fragmentation.
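
For example, a filtered balance that only rewrites chunks which are at
most half full (mount point and threshold chosen arbitrarily):

    # rewrite only data/metadata chunks that are <=50% used, packing them together
    btrfs balance start -dusage=50 -musage=50 /mnt/pool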


On 10/21/2014 01:44 PM, Arnaud Kapp wrote:

Hello,

I would like to ask if the balance time is related to the number of
snapshots, or only to the amount of data (or both).

I currently have about 4TB of data and around 5k snapshots. I'm thinking
of going raid1 instead of single. From the numbers I see this seems
totally impossible as it would take *way* too long.

Would destroying snapshots (those are hourly snapshots to prevent stupid
errors from happening, like `rm my_important_file`) help?

Should I reconsider moving to raid1 because of the time it would take?

Sorry if I'm somehow hijacking this thread, but it seemed related :)

Thanks,

On 10/21/2014 10:14 PM, Piotr Pawłow wrote:

On 21.10.2014 20:59, Tomasz Chmielewski wrote:

FYI - after a failed disk and replacing it I've run a balance; it took
almost 3 weeks to complete, for 120 GBs of data:


Looks normal to me. Last time I started a balance after adding a 6th
device to my FS, it took 4 days to move 25GB of data. Some chunks took
20 hours to move. I currently have 156 snapshots on this FS (nightly
rsync backups).

I think it is so slow, because it's disassembling chunks piece by piece
and stuffing these pieces elsewhere, instead of moving chunks as a
whole. If you have a lot of little pieces (as I do), it will take a
while...








Re: NOCOW and Swap Files?

2014-10-23 Thread Austin S Hemmelgarn

On 2014-10-22 16:08, Robert White wrote:

So the documentation is clear that you can't mount a swap file through
BTRFS (unless you use a loop device).

Why is a NOCOW file that has been fully pre-allocated -- as with
fallocate(1) -- not suitable for swapping?

I found one reference to an unimplemented feature necessary for swap,
but wouldn't it be reasonable for that feature to exist for NOCOW files?
(or does this relate to my previous questions about the COW operation
that happens after a snapshot?)

I actually use a swapfile on BTRFS on a regular basis on my laptop 
(trying to keep the number of partitions to a minimum, because I 
dual-boot Windows), and here's what the init script I use for it does:
1. Remove any old swap file (the fs is on an SSD, so I do this mostly to 
get the discard operation).
2. Use touch to create a new file.
3. Use chattr to mark the file NOCOW.
4. Use fallocate to pre-allocate the space for the file.
5. Bind the file to a loop device.
6. Format as swap and add as swapspace.

This works very reliably for me; the overhead of the loop device is 
relatively insignificant for my use case (because my disk is actually 
faster than my RAM), and I can safely balance/defrag/fstrim the 
filesystem without causing issues with the swap file.
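
For reference, a minimal sketch of those steps (the path, size and loop
handling are examples, not the exact script I use):

    SWAPFILE=/var/swap/swapfile                  # example path
    rm -f "$SWAPFILE"                            # 1. drop the old file
    touch "$SWAPFILE"                            # 2. create a new, empty file
    chattr +C "$SWAPFILE"                        # 3. mark it NOCOW while still empty
    fallocate -l 8G "$SWAPFILE"                  # 4. pre-allocate the space (example size)
    LOOP=$(losetup -f --show "$SWAPFILE")        # 5. bind it to a free loop device
    mkswap "$LOOP" && swapon "$LOOP"             # 6. format and enable as swap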


If you can avoid using a swapfile though, I would suggest doing so, 
regardless of which FS you are using.  I actually use a 4-disk RAID-0 
LVM volume on my desktop, and it gets noticeably better performance than 
using a swap file.
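
For reference, a striped swap LV can be set up roughly like this (VG
name, size and stripe count are made up):

    lvcreate -n swap -L 16G -i 4 vg0   # 4-way striped LV across the VG's PVs
    mkswap /dev/vg0/swap
    swapon /dev/vg0/swap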






Re: device balance times

2014-10-23 Thread Austin S Hemmelgarn

On 2014-10-23 05:19, Miao Xie wrote:

On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote:

On 22.10.2014 03:43, Chris Murphy wrote:

On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote:

Looks normal to me. Last time I started a balance after adding a 6th device to my
FS, it took 4 days to move 25GB of data.

It's long-term untenable. At some point it must be fixed. It's way, way slower
than md raid.
At a certain point it needs to fall back to block-level copying, with a ~32KB
block. It can't be treating things as if they're 1K files, doing file-level
copying that takes forever. It's just too risky that another device fails in
the meantime.


There's device replace for restoring redundancy, which is fast, but not 
implemented yet for RAID5/6.


Now my colleague and I are implementing scrub/replace for RAID5/6,
and I have a plan to reimplement balance and split it off from the
metadata/file-data processing.  The main idea is:
- allocate a new chunk with the same size as the relocated one, but don't
  insert it into the block group list, so we don't allocate free space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with that of the new chunk
  (the new chunk has the same logical address and length as the old one)
- release the source chunk

This way, we needn't process the data one extent at a time, and needn't do
any space reservation, so the speed will be very fast even when we have
lots of snapshots.

Even if balance gets re-implemented this way, we should still provide 
some way to consolidate the data from multiple partially full chunks. 
Maybe keep the old balance path and have some option (maybe call it 
aggressive?) that turns it on instead of the new code.







Re: Heavy nocow'd VM image fragmentation

2014-10-27 Thread Austin S Hemmelgarn

On 2014-10-26 13:20, Larkin Lowrey wrote:

On 10/24/2014 10:28 PM, Duncan wrote:

Robert White posted on Fri, 24 Oct 2014 19:41:32 -0700 as excerpted:


On 10/24/2014 04:49 AM, Marc MERLIN wrote:

On Thu, Oct 23, 2014 at 06:04:43PM -0500, Larkin Lowrey wrote:

I have a 240GB VirtualBox vdi image that is showing heavy
fragmentation (filefrag). The file was created in a dir that was
chattr +C'd, the file was created via fallocate, and the contents of
the original image were copied into the file via dd. I verified that
the image was +C.

To be honest, I have the same problem, and it's vexing:

If I understand correctly, when you take a snapshot the file goes into
what I call 1COW mode.

Yes, but the OP said he hadn't snapshotted since creating the file, and
MM's a regular who actually wrote much of the wiki documentation on the
raid56 modes, so he'd better know about the snapshotting problem too.

So that can't be it.  There's apparently a bug in some recent code, and
it's not honoring the NOCOW even in normal operation, when it should be.

(FWIW I'm not running any VMs or large DBs here, so don't have nocow set
on anything and can and do use autodefrag on all my btrfs.  So I can't
say one way or the other, personally.)



Correct, there were no snapshots during VM usage when the fragmentation
occurred.

One unusual property of my setup is that I have my fs on top of bcache.
More specifically, the stack is md raid6 -> bcache -> lvm -> btrfs. When
the fs mounts it gets the 'ssd' mount option, because bcache sets
/sys/block/bcache0/queue/rotational to 0.

Is there any reason why either the 'ssd' mount option or being backed by
bcache could be responsible?



Two things:
First, regarding your question: the ssd mount option shouldn't be 
responsible for this, because it is only supposed to spread out 
allocation at the chunk level, not the block level, though some recent 
commit may have changed that.  Are you using any kind of compression in 
btrfs?  If so, then filefrag won't report the number of fragments 
correctly (it currently reports the number of compressed blocks in the 
file instead), and in that case I would expect the number of compressed 
blocks to go up as you use more space in the VM image: long runs of zero 
bytes compress well, other stuff (especially on-disk structures from 
encapsulated filesystems) doesn't.  You might also consider putting the 
VM images directly on the LVM layer instead; in my experience that tends 
to get much better performance than storing them on a filesystem.
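
A quick way to check both points (the image path is just an example):

    lsattr /srv/vm/disk.vdi        # C = NOCOW set, c = per-file compression
    grep btrfs /proc/mounts        # shows compress/compress-force and ssd options
    filefrag /srv/vm/disk.vdi      # extent count; with compression this counts
                                   # ~128KiB compressed extents, not real fragments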


Secondly, I'd recommend switching from bcache under LVM to dm-cache on 
top of LVM, as that makes it much easier to recover from the various 
failure modes, and also to deal with a corrupted cache, because dm-cache 
doesn't put any metadata on the backing device.  It takes longer to shut 
down when in write-back mode, and isn't SSD-optimized, but it has also 
been much more reliable in my experience.
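
Roughly, with the lvmcache tooling (the VG/LV names, sizes and SSD
device are made-up examples):

    # fast LVs on the SSD PV that will become the cache pool
    lvcreate -n cache0      -L 40G vg0 /dev/sdX1
    lvcreate -n cache0_meta -L  1G vg0 /dev/sdX1
    # combine them into a cache pool, then attach it to the existing origin LV
    lvconvert --type cache-pool --poolmetadata vg0/cache0_meta vg0/cache0
    lvconvert --type cache --cachepool vg0/cache0 vg0/data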






Re: btrfs deduplication and linux cache management

2014-10-30 Thread Austin S Hemmelgarn

On 2014-10-30 05:26, lu...@plaintext.sk wrote:

Hi,
I want to ask whether deduplicated file content will be cached in the linux
kernel just once for two deduplicated files.

To explain in depth:
  - I use btrfs for the whole system, with a few subvolumes and compression
on some of them.
  - I have two directories with the eclipse SDK with slight differences (same
version, different config)
  - I assume that the given directories are deduplicated, so the two eclipse
installations take up roughly as much space on the hdd as one would
  - I will start one of the given eclipses
  - the linux kernel will cache all files opened during the start of eclipse (I
have enough free ram)
  - I am just a happy stupid linux user:
     1. will the kernel cache file content after decompression? (I think yes)
     2. will cached data be in the VFS layer or in the block device layer?
  - When I launch the second eclipse (different from the first, but deduplicated
from the first) after the first one:
     1. will the second start require less data to be read from the HDD?
     2. will the metadata for the second instance be read from the hdd? (I assume yes)
     3. will the actual data be read a second time? (I hope not)

Thanks for answers,
have a nice day,


I don't know for certain, but here is how I understand things work in 
this case:
1. Individual blocks are cached in the block device layer, which means 
that the de-duplicated data would be cached at most as many times as 
there are disks it is on (i.e. at most once for a single-device 
filesystem, up to twice for a multi-device btrfs raid1 setup).
2. In the VFS layer, the cache handles decoded inodes (the actual file 
metadata), dentries (the file's entry in the parent directory), and 
individual pages of file content (after decompression).  AFAIK, the VFS 
layer's cache is pathname-based, so it would probably cache two copies 
of the data, but after the metadata look-up it wouldn't need to read 
from the disk because of the block layer cache.


Overall, this means that while de-duplicated data may be cached more 
than once, it shouldn't need to be reread from disk if there is still a 
copy in cache.  Metadata may or may not need to be read from the disk, 
depending on what is in the VFS cache.
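
One rough way to test this empirically (both install paths are invented
for the example, and dropping caches needs root):

    # drop the caches, then read one copy cold and the other warm
    sync && echo 3 > /proc/sys/vm/drop_caches
    time cat /opt/eclipse-a/plugins/*.jar > /dev/null   # cold: real disk reads
    time cat /opt/eclipse-b/plugins/*.jar > /dev/null   # mostly warm if dedup + block cache help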






Re: scrub implies failing drive - smartctl blissfully unaware

2014-11-18 Thread Austin S Hemmelgarn

On 2014-11-18 02:29, Brendan Hide wrote:

Hey, guys

See further below extracted output from a daily scrub showing csum
errors on sdb, part of a raid1 btrfs. Looking back, it has been getting
errors like this for a few days now.

The disk is patently unreliable, but smartctl's output implies there are
no issues. Is this somehow standard fare for S.M.A.R.T. output?

Here are (I think) the important bits of the smartctl output for
$(smartctl -a /dev/sdb) (the full results are attached):
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  100   253   006    Pre-fail Always   -           0
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always   -           1
  7 Seek_Error_Rate         0x000f  086   060   030    Pre-fail Always   -           440801014
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always   -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline  -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always   -           0
200 Multi_Zone_Error_Rate   0x      100   253   000    Old_age  Offline  -           0
202 Data_Address_Mark_Errs  0x0032  100   253   000    Old_age  Always   -           0



 Original Message 
Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) root@watricky
To: brendan@watricky



WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
 scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
 total bytes scrubbed: 189.49GiB with 5420 errors
 error details: read=5 csum=5415
 corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

In addition to the storage controller being a possibility, as mentioned 
in another reply, there are some parts of the drive that aren't covered 
by SMART attributes on most disks, most notably the on-drive cache. 
There really isn't a way to disable the read cache on the drive, but you 
can disable write-caching, which may improve things (and if it's a cheap 
disk, may provide better reliability for BTRFS as well).  The other 
thing I would suggest trying is a different data cable to the drive 
itself; I've had issues with some SATA cables (the cheap red ones you 
get in the retail packaging for some hard disks in particular) having 
either bad connectors or bad strain-reliefs, and failing after only a 
few hundred hours of use.
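
Disabling the drive's write cache is a one-liner with hdparm (using the
device name from your report):

    hdparm -W 0 /dev/sdb   # turn off the drive's volatile write cache
    hdparm -W /dev/sdb     # read back the current setting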






Re: btrfs send and an existing backup

2014-11-21 Thread Austin S Hemmelgarn

On 2014-11-20 09:10, Duncan wrote:

Bardur Arantsson posted on Thu, 20 Nov 2014 14:17:52 +0100 as excerpted:


If you have no other backups, I would really recommend that you *don't*
use btrfs for your backup, or at least have a *third* backup which isn't
on btrfs -- there are *still* problems with btrfs that can potentially
wreck your backup filesystem. (Although it's obviously less likely if
the external HDD will only be connected occasionally.)

Don't get me wrong, btrfs is becoming more and more stable, but I
wouldn't trust it with my *only* backup, especially if also running
btrfs on the backed-up filesystem.


This.

My working versions and first backups are btrfs.  My secondary backups
are reiserfs (my old filesystem of choice, which has been very reliable
for me), just in case both the btrfs versions bite the dust due to a bug
in btrfs itself.

Likewise, except I use compressed, encrypted tarballs stored on both 
Amazon S3 and Dropbox.
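
For what it's worth, a minimal sketch of that kind of backup (paths,
bucket name and cipher choice are just placeholders):

    # compressed, symmetrically encrypted tarball of /home
    tar -cJf - /home | gpg --symmetric --cipher-algo AES256 \
        -o /backups/home-$(date +%F).tar.xz.gpg
    # then upload with whichever client you prefer, e.g. the AWS CLI:
    # aws s3 cp /backups/home-2014-11-21.tar.xz.gpg s3://example-backup-bucket/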



