Re: BTRFS partitioning scheme (was BTRFS with RAID1 cannot boot when removing drive)
On 2014-02-13 12:33, Chris Murphy wrote:
> On Feb 13, 2014, at 1:50 AM, Frank Kingswood <fr...@kingswood-consulting.co.uk> wrote:
>> On 12/02/14 17:13, Saint Germain wrote:
>>> Ok, based on your advice, here is what I have done so far to use UEFI
>>> (remember that the objective is to have a clean and simple BTRFS
>>> RAID1 install).
>>>
>>> A) I start first with only one drive. I have gone with the following
>>> partition scheme (Debian wheezy, kernel 3.12, grub 2.00, GPT
>>> partition table created with parted):
>>> sda1 = 1 MiB BIOS Boot partition (no FS, "set 1 bios_grub on" with
>>>        parted to set the type)
>>> sda2 = 550 MiB EFI System Partition (FAT32, "toggle 2 boot" with
>>>        parted to set the type), mounted on /boot/efi
>>
>> I'm curious, why so big? There's only one file of about 100 KB there,
>> and I was considering shrinking mine to the minimum possible (which
>> seems to be about 33 MB).
>
> I'm not sure what OS loader you're using, but I haven't seen a
> grubx64.efi less than ~500 KB. In general I'm seeing it at about 1 MB.
> The Fedora grub-efi and shim packages as installed on the ESP take up
> 10 MB. So 33 MiB is a bit small, and if we were more conservative, we'd
> update the OS loader by writing the new one to a temp directory rather
> than overwriting the existing one, and then remove the old and rename
> the new.
>
> The UEFI spec says that if the system partition is FAT, it should be
> FAT32; for removable media it's FAT12/FAT16. I don't know what tool the
> various distro installers are using, but at least on Fedora they use
> mkdosfs, part of dosfstools. Its cutoff for choosing FAT16 vs. FAT32
> based on media size is 500 MB unless otherwise specified, and the
> installer doesn't specify, so by default Fedora system partitions are
> actually FAT16, to no obvious ill effect. But if you want a FAT32 ESP
> created by the installer, the ESP needs to be over 500 MB (525 MB,
> say), so 550 MB is a reasonable number to make that happen.
> If we were slightly smarter (and more A.R.), UEFI bugs aside, we'd put
> the ESP as the last partition on the disk rather than the first. And
> honestly, would we really care about consuming even 1 GiB of the
> slowest part of a spinning disk, or causing a bit of overprovisioning
> on an SSD? No. It's probably a squeak of an improvement if anything.
> For those who want to use gummiboot, it calls for the kernel and
> initramfs to be located on the ESP, which is mounted at /boot rather
> than /boot/efi. So that's also a reason to make it bigger than usual.

>>> sda3 = 1 TiB root partition (BTRFS), mounted on /
>>> sda4 = 6 GiB swap partition
>>> (that way I should be able to be compatible with both CSM and UEFI)
>>>
>>> B) Normal Debian installation on sda, activate the CSM on the
>>> motherboard, and reboot.
>>>
>>> C) apt-get install grub-efi-amd64 and grub-install /dev/sda
>>>
>>> And the problems begin:
>>> 1) grub-install doesn't give any error, but using --debug I can see
>>>    that it is not using EFI.
>>> 2) OK, I force it with: grub-install --target=x86_64-efi
>>>    --efi-directory=/boot/efi --bootloader-id=grub --recheck --debug
>>>    /dev/sda
>>> 3) This time something is generated in /boot/efi:
>>>    /boot/efi/EFI/grub/grubx64.efi
>>> 4) Copy the file /boot/efi/EFI/grub/grubx64.efi to
>>>    /boot/efi/EFI/boot/bootx64.efi
>>
>> Is EFI/boot/ correct here?

If you want a fallback bootloader, yes. If you're lucky, your firmware will tell you what path it will try to read the boot code from; for me that is /EFI/debian/grubx64.efi. NVRAM is what does this. But if NVRAM becomes corrupt, or the entry is deleted for whatever reason, the proper fallback is boot<arch>.efi. While this is what the UEFI spec says is supposed to be the fallback, many systems don't actually look there unless the media is removable. All of my UEFI systems instead look for Microsoft/Boot/bootmgfw.efi as the fallback (because most x86 system designers don't care at all about standards compliance as long as it will run Windows).
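For anyone following along, a condensed sketch of the workaround in steps 2) and 4) above (a sketch only: the device name /dev/sda and the Debian paths are the ones from this thread; adjust for your system, and run as root):

```shell
# Force the EFI target explicitly; when booted via CSM, grub-install
# may otherwise silently default to the BIOS (i386-pc) target.
grub-install --target=x86_64-efi --efi-directory=/boot/efi \
    --bootloader-id=grub --recheck /dev/sda

# Also install to the spec-defined fallback path boot<arch>.efi, which
# firmware is supposed to try when the NVRAM boot entries are missing
# or corrupt.
mkdir -p /boot/efi/EFI/boot
cp /boot/efi/EFI/grub/grubx64.efi /boot/efi/EFI/boot/bootx64.efi
```

As noted above, many boards ignore the fallback path on fixed disks, so keeping the NVRAM entry healthy (efibootmgr) is still the primary mechanism.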
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue with btrfs balance
On 02/10/2014 08:41 AM, Brendan Hide wrote:
> On 2014/02/10 04:33 AM, Austin S Hemmelgarn wrote:
>> [snip]
>> Apparently, trying to use -mconvert=dup or -sconvert=dup on a
>> multi-device filesystem using one of the RAID profiles for metadata
>> fails with a statement to look at the kernel log, which doesn't show
>> anything at all about the failure.
> ^ If this is the case then it is definitely a bug. Can you provide
> some version info? Specifically kernel, btrfs-tools, and distro.
> [snip]
>> It appears that the kernel stops you from converting to a dup profile
>> for metadata in this case because it thinks that such a profile
>> doesn't work on multiple devices, despite the fact that you can take
>> a single-device filesystem, add a device, and it will still work fine
>> even without converting the metadata/system profiles.
> I believe dup used to work on multiple devices but the facility was
> removed. In the standard case it doesn't make sense to use dup with
> multiple devices: it uses the same amount of disk space but is more
> vulnerable than the RAID1 alternative.
> [snip]
>> Ideally, this should be changed to allow converting to dup, so that
>> when converting a multi-device filesystem to single-device, you never
>> have to have metadata or system chunks use a single profile.
> This is a good use case for having the facility. I'm thinking that, if
> it is brought back in, the only caveat is that appropriate warnings
> should be put in place to indicate when it is inappropriate. My guess
> on how you'd like to migrate from raid1/raid1 to single/dup, assuming
> sda and sdb:
> btrfs balance start -dconvert=single -mconvert=dup /
> btrfs device delete /dev/sdb /

Do you happen to know which git repository and branch is preferred to base patches on? I'm getting ready to write one to fix this, and would like to make it as easy as possible for the developers to merge.
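Putting Brendan's guess together with the device removal, the full migration might look like this (a sketch assuming sda/sdb as in the thread; on unpatched kernels of this era the dup conversion is rejected on multi-device filesystems, which is exactly the behavior under discussion):

```shell
# Convert data to single and metadata/system to dup while both devices
# are still present, so metadata keeps two copies throughout...
btrfs balance start -dconvert=single -mconvert=dup -sconvert=dup /
# ...then shrink the filesystem to one device.
btrfs device delete /dev/sdb /
# Verify the resulting chunk profiles.
btrfs filesystem df /
```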
Re: Issue with btrfs balance
On 02/14/2014 02:56 AM, Brendan Hide wrote:
> On 14/02/14 05:42, Austin S. Hemmelgarn wrote:
>> Do you happen to know which git repository and branch is preferred to
>> base patches on? I'm getting ready to write one to fix this, and
>> would like to make it as easy as possible for the developers to
>> merge.
> A list of the main repositories is maintained at
> https://btrfs.wiki.kernel.org/index.php/Btrfs_source_repositories
> I'd suggest David Sterba's branch, as he maintains it for
> userspace-tools integration.

In this case, it will need to be patched both in the userspace tools and in the kernel; it's the kernel itself that prevents the balance, because it thinks that you can't do dup profiles with multiple devices.
[PATCH] Allow forced conversion of metadata to dup profile on multiple devices
Currently, "btrfs balance start" fails when trying to convert metadata or system chunks to the dup profile on filesystems with multiple devices. This requires that a conversion from a multi-device filesystem to a single-device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete / /dev/sdx
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are big) where you don't have the protection guarantees of multiple copies of metadata chunks. After applying this patch, one can instead use the following methodology for conversion from a multi-device filesystem to a single-device filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete / /dev/sdx
This greatly reduces the chances of the operation causing data loss due to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn <ahferro...@gmail.com>
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 		num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	     (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
--
1.8.5.4
[PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices
Currently, "btrfs balance start" fails when trying to convert metadata or system chunks to the dup profile on filesystems with multiple devices. This requires that a conversion from a multi-device filesystem to a single-device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete / /dev/sdx
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are big) where you don't have the protection guarantees of multiple copies of metadata chunks. After applying this patch, one can instead use the following methodology for conversion from a multi-device filesystem to a single-device filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete / /dev/sdx
This greatly reduces the chances of the operation causing data loss due to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn <ahferro...@gmail.com>
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 		num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	     (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
--
1.8.5.4
Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices
On 2014-02-24 08:37, Ilya Dryomov wrote:
> On Thu, Feb 20, 2014 at 6:57 PM, David Sterba <dste...@suse.cz> wrote:
>> On Wed, Feb 19, 2014 at 11:10:41AM -0500, Austin S Hemmelgarn wrote:
>>> [patch description snipped]
>>
>> Reviewed-by: David Sterba <dste...@suse.cz>
>>
>> Sounds useful. Multiple devices + DUP is an allowed setup when a
>> device is added; this patch only adds the 'delete' counterpart. The
>> improved data loss protection during the process is a good thing.
>
> Hi,
>
> Have you actually tried to queue it? Unless I'm missing something, it
> won't compile, and on top of that, it seems to be corrupted too.

The patch itself was made using git; AFAICT it should be fine. I've personally built and tested it using UML.

> IIRC multiple devices + DUP is allowed only until the first balance,
> has that changed?

This is just a limitation of how the kernel handles balances. DUP profiles with multiple devices work, it's just terribly inefficient. The primary use case is converting a multi-device FS with RAID for metadata to a single-device FS without having to reduce integrity.

> Thanks,
> Ilya
Re: [PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices
On 2014-02-24 09:12, Ilya Dryomov wrote:
> On Mon, Feb 24, 2014 at 3:44 PM, Austin S Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> On 2014-02-24 08:37, Ilya Dryomov wrote:
>>> Have you actually tried to queue it? Unless I'm missing something,
>>> it won't compile, and on top of that, it seems to be corrupted too.
>> The patch itself was made using git, AFAICT it should be fine. I've
>> personally built and tested it using UML.
>
> It doesn't look fine. It was generated with git, but it got corrupted
> on the way: either how you pasted it or the email client you use is
> the problem.
>
> On Wed, Feb 19, 2014 at 6:10 PM, Austin S Hemmelgarn
> <ahferro...@gmail.com> wrote:
>> [patch description snipped]
>>
>> Signed-off-by: Austin S. Hemmelgarn <ahferro...@gmail.com>
>> ---
>>  fs/btrfs/volumes.c | 21 +++++++++++++++++----
>>  1 file changed, 17 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
>> index 07629e9..38a9522 100644
>> --- a/fs/btrfs/volumes.c
>> +++ b/fs/btrfs/volumes.c
>> @@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control
>> *bctl,
       ^^^, that should be a single line
>>                 num_devices--;
>>         }
>>         btrfs_dev_replace_unlock(&fs_info->dev_replace);
>> -       allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
>> -       if (num_devices == 1)
>> -               allowed |= BTRFS_BLOCK_GROUP_DUP;
>> -       else if (num_devices > 1)
>> +       allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
>> +       if (num_devices > 1)
>>                 allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
>>         if (num_devices > 2)
>>                 allowed |= BTRFS_BLOCK_GROUP_RAID5;
>> @@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control
>> *bctl,
       ^^^, ditto
>>                         goto out;
>>                 }
>>         }
>> +       if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
>> +            (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
>> +            (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
>> +            (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
>> +           (num_devs > 1)) {
>> +               if (bctl->flags & BTRFS_BALANCE_FORCE) {
>> +                       btrfs_info(fs_info, "force conversion of metadata "
>> +                                  "to dup profile on multiple devices");
>> +               } else {
>> +                       btrfs_err(fs_info, "balance will reduce metadata
[PATCH] btrfs: Allow forced conversion of metadata to dup profile on multiple devices
Currently, "btrfs balance start" fails when trying to convert metadata or system chunks to the dup profile on filesystems with multiple devices. This requires that a conversion from a multi-device filesystem to a single-device filesystem use the following methodology:
1. btrfs balance start -dconvert=single -mconvert=single \
   -sconvert=single -f /
2. btrfs device delete / /dev/sdx
3. btrfs balance start -mconvert=dup -sconvert=dup /
This results in a period of time (possibly very long if the devices are big) where you don't have the protection guarantees of multiple copies of metadata chunks. After applying this patch, one can instead use the following methodology for conversion from a multi-device filesystem to a single-device filesystem:
1. btrfs balance start -dconvert=single -mconvert=dup \
   -sconvert=dup -f /
2. btrfs device delete / /dev/sdx
This greatly reduces the chances of the operation causing data loss due to a read error during the device delete.

Signed-off-by: Austin S. Hemmelgarn <ahferro...@gmail.com>
---
 fs/btrfs/volumes.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 07629e9..38a9522 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -3152,10 +3152,8 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 		num_devices--;
 	}
 	btrfs_dev_replace_unlock(&fs_info->dev_replace);
-	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE;
-	if (num_devices == 1)
-		allowed |= BTRFS_BLOCK_GROUP_DUP;
-	else if (num_devices > 1)
+	allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP;
+	if (num_devices > 1)
 		allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1);
 	if (num_devices > 2)
 		allowed |= BTRFS_BLOCK_GROUP_RAID5;
@@ -3221,6 +3219,21 @@ int btrfs_balance(struct btrfs_balance_control *bctl,
 			goto out;
 		}
 	}
+	if (((bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->sys.target & ~BTRFS_BLOCK_GROUP_DUP) ||
+	     (bctl->meta.flags & BTRFS_BALANCE_ARGS_CONVERT) &&
+	     (bctl->meta.target & ~BTRFS_BLOCK_GROUP_DUP)) &&
+	    (num_devs > 1)) {
+		if (bctl->flags & BTRFS_BALANCE_FORCE) {
+			btrfs_info(fs_info, "force conversion of metadata "
+				   "to dup profile on multiple devices");
+		} else {
+			btrfs_err(fs_info, "balance will reduce metadata "
+				  "integrity, use force if you want this");
+			ret = -EINVAL;
+			goto out;
+		}
+	}
 	} while (read_seqretry(&fs_info->profiles_lock, seq));

 	if (bctl->sys.flags & BTRFS_BALANCE_ARGS_CONVERT) {
Re: Massive BTRFS performance degradation
On 03/09/2014 04:17 AM, Swâmi Petaramesh wrote:
> On Sunday, 9 March 2014 at 08:48:20, KC wrote:
>> I am experiencing massive performance degradation on my BTRFS root
>> partition on SSD.
> BTW, is BTRFS still an SSD-killer? It had this reputation a while ago,
> and I'm not sure if this still is the case, but I don't dare (yet)
> convert one of my laptops that has an SSD to BTRFS...

Actually, because of the COW nature of BTRFS, it should be better for SSDs than something like ext4 (which DOES kill SSDs when journaling is enabled, because it ends up doing thousands of read-modify-write cycles to the same 128k of the disk under just generic usage). Just make sure that you use the 'ssd' and 'discard' mount options.
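For reference, a possible /etc/fstab line with those options (the UUID is a placeholder; note that 'discard' enables continuous TRIM, which some people prefer to replace with a periodic fstrim run):

```shell
# /etc/fstab — btrfs root on an SSD; 'ssd' tunes allocation for flash,
# 'discard' issues TRIM as extents are freed.
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /  btrfs  defaults,ssd,discard  0  0
```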
Re: Incremental backup for a raid1
On 2014-03-14 09:46, George Mitchell wrote:
> Actually, an interesting concept would be to have the initial
> two-drive RAID 1 mirrored by two additional drives in a 4-way
> configuration on a second machine at a remote location on a private
> high-speed network, with both machines up 24/7. In that case, if such
> a configuration would work, either machine could be obliterated and
> the data would survive fully intact in full duplex mode. It would just
> need to be remounted from the backup system and away it goes. Just
> thinking of interesting possibilities with n-way mirroring. Oh, how I
> would love to have n-way mirroring to play with!

That can already be done, albeit slightly differently, by stacking btrfs RAID 1 on top of a pair of DRBD devices. Of course, this doesn't provide quite the same degree of safety as your suggestion, but it does work (and DRBD makes the remote copy write-mostly for the local system automatically).
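A rough sketch of that stacking, assuming two DRBD resources (drbd0 and drbd1, one per replicated disk pair) are already configured and promoted to primary on the local node:

```shell
# btrfs RAID 1 across the two replicated block devices: each mirror
# leg is itself replicated to the remote machine by DRBD, giving four
# physical copies in total.
mkfs.btrfs -d raid1 -m raid1 /dev/drbd0 /dev/drbd1
mount /dev/drbd0 /mnt/data
```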
Re: BTRFS setup advice for laptop performance ?
On 2014-04-04 04:02, Swâmi Petaramesh wrote:
> Hi,
>
> I'm going to receive a new small laptop with a 500 GB 5400 RPM
> mechanical ole' rust HD, and I plan to install BTRFS on it. It will
> have kernel 3.13 for now, until 3.14 gets released. However I'm still
> concerned with chronic BTRFS dreadful performance, and I still find
> that BTRFS degrades much over time even with periodic defrag and best
> practices etc.

I keep hearing this from people, but I personally don't see this to be the case at all. I'm pretty sure the 'big' performance degradation that people are seeing is due to how they are using snapshots, not a result of using BTRFS itself (I don't use them for anything other than ensuring a stable system image for rsync and/or tar based backups).

> So I'd like to start with the best possible options, and I have a few
> questions:
>
> - Is it still recommended to mkfs with a nodesize or leafsize
> different (bigger) than the default? I wouldn't like to lose too much
> disk space anyway (1/2 nodesize per file on average?), as it will be
> limited...

This depends on many things; the average size of the files on the disk is the biggest factor. In general you should get the best disk utilization by setting nodesize so that a majority of the files are smaller than the leafsize minus 256 bytes, and all but a few are smaller than two times the leafsize minus 256 bytes. However, if you want to really benefit from data compression, you should just use the smallest leaf/nodesize for your system (which is what mkfs defaults to), because BTRFS stores files that are at least (roughly) 256 bytes smaller than the leafsize inline with the metadata, and doesn't compress such files.

> - Is it recommended to alter the FS to have skinny extents? I've done
> this on all of my BTRFS machines without problem, but the kernel still
> spits a notice at mount time, and I'm worrying: why is the kernel
> warning me I have skinny extents? Is it bad? Is it something I should
> avoid?

I think that the primary reason for the warning is that it is backward incompatible; older kernels can't mount filesystems using it.

> - Are there other optimization tricks I should perform at mkfs time
> because they can't be changed later on?
>
> - Are there other btrfstune or mount options I should pass before
> starting to populate the FS with a system and data?

Unless you are using stuff like QEMU or VirtualBox, you should probably have autodefrag and space_cache on from the very start.

> - Generally speaking, does LZO compression improve or degrade
> performance? I'm not able to figure it out clearly.

As long as your memory bandwidth is significantly higher than disk bandwidth (which is almost always the case, even with SSDs), this should provide at least some improvement with respect to IO involving large files. Because you are using a traditional hard disk instead of an SSD, you might get better performance using zlib (assuming you don't mind slightly higher processor usage for IO to files larger than the leafsize). If you care less about disk utilization than you do about performance, you might want to use compress-force instead of compress, as the performance boost comes from not having to write as much data to disk.

> TIA for the insight.
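To make the pieces above concrete, here is one way the mkfs-time and mount-time choices fit together (a sketch only: 16384 is an example nodesize and /dev/sdXn is a placeholder device):

```shell
# Nodesize/leafsize can only be chosen at mkfs time.
mkfs.btrfs -n 16384 /dev/sdXn

# Skinny extents can also be enabled later with btrfstune on an
# unmounted filesystem, but older kernels then can't mount it.
btrfstune -x /dev/sdXn

# Mount options discussed above; swap compress for compress-force if
# you favor on-disk size and write throughput over CPU time.
mount -o autodefrag,space_cache,compress=lzo /dev/sdXn /mnt
```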
Re: BTRFS setup advice for laptop performance ?
On 2014-04-04 08:48, Swâmi Petaramesh wrote:
> On Friday, 4 April 2014 at 08:33:10, Austin S Hemmelgarn wrote:
>>> However I'm still concerned with chronic BTRFS dreadful performance,
>>> and I still find that BTRFS degrades much over time even with
>>> periodic defrag and best practices etc.
>> I keep hearing this from people, but I personally don't see this to
>> be the case at all. I'm pretty sure the 'big' performance degradation
>> that people are seeing is due to how they are using snapshots, not a
>> result of using BTRFS itself (I don't use them for anything other
>> than ensuring a stable system image for rsync and/or tar based
>> backups).
> Maybe I was wrong to suppose that if a feature exists, it is supposed
> to be usable... I have used ZFS for years, and on ZFS having
> *hundreds* of snapshots of any given FS has exactly zero impact on
> performance... With BTRFS, some time ago I tried to use SuSE snapper,
> which passes its time making and releasing snapshots, but it soon made
> my systems unusable... Now I only keep 2-3 manually made snapshots,
> just for keeping a stable and OK archive of my machine in a known
> state, just in case... But if even this has a noticeable negative
> impact on BTRFS performance, then what the hell are BTRFS snapshots
> good at??
>
> Kind regards.

I'm not saying that using a few snapshots is a bad thing; I'm saying that thousands of snapshots is a bad thing (I have actually seen people with that many, including one individual who had almost 32,000 snapshots on the same drive). I personally do keep a few around on my system on a regular basis, even aside from the backups, and have no noticeable performance degradation. For reference, the (main) system that I am using has an Intel Celeron 847 running at 1.1 GHz, 4G of DDR3-1333 RAM, and a 500G 5400 RPM SATA II hard disk. My root filesystem is a BTRFS volume mounted with autodefrag,space_cache,compress-force=lzo,noatime. (The noatime improves performance (and power efficiency) for btrfs because metadata updates end up cascading up the metadata tree: updating the atime on /etc/foo/bar causes the atime to be updated on /etc/foo, which causes the atime to be updated on /etc, which causes the atime to be updated on /.)
Re: BTRFS setup advice for laptop performance ?
On 2014-04-05 07:10, Swâmi Petaramesh wrote:
> On Saturday, 5 April 2014 at 10:12:17, Duncan wrote [excellent
> performance advice about disabling Akonadi on BTRFS etc.]:
>
> Thanks Duncan for all this excellent discussion. However I'm still
> rather puzzled by a filesystem for which the advice is: if you want
> tolerable performance, you have to turn off features that are the
> default with any other FS out there (relatime -> noatime), or you have
> to quit using this database, or you have to fiddle around with
> esoteric options such as disabling COW, which BTW is one of BTRFS's
> most prominent features.

The only reason AFAIK that noatime isn't the default on other filesystems is that it breaks stuff like mutt. Other than that, nobody really uses atimes, and noatime will in fact get you better performance on any filesystem.

> [...]
> To put it plain flat clear, even if relatime causes writes, every
> other FS out there can cope with it. Even if Akonadi is heavy and a
> disk resource hog, any other FS out there can cope with it and still
> maintain acceptable, usable performance.

This is because every other filesystem (except ZFS) doesn't use COW semantics. IIRC, using those same features on ZFS causes the same problems. This in fact brings to mind one of the biggest reasons that I refuse to use KDE (or systemd for that matter): KDE systems run slower in my experience even on ext4, XFS, and JFS, not just on COW filesystems.
Re: BTRFS setup advice for laptop performance ?
On 2014-04-08 07:56, Clemens Eisserer wrote:
> Hi,
>
>> This is because every other filesystem (except ZFS) doesn't use COW
>> semantics.
> Nilfs2 also is COW based.
>
> Regards, Clemens

Apologies, I had forgotten about NILFS2 (probably because I choose not to deal with it due to stability issues that I have experienced, and a lack of XATTR and ACL support).
Re: Which companies are using Btrfs in production?
On 2014-04-23 21:19, Marc MERLIN wrote: Oh while we're at it, are there companies that can say they are using btrfs in production? Marc Ohio Gravure Technologies is currently preparing to use it on our next generation of production systems.
Re: safe/necessary to balance system chunks?
On 2014-04-25 13:24, Chris Murphy wrote: On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote: Hi list, I've got a 3-device RAID1 btrfs filesystem that started out life as single-device. btrfs fi df: Data, RAID1: total=1.31TiB, used=1.07TiB System, RAID1: total=32.00MiB, used=224.00KiB System, DUP: total=32.00MiB, used=32.00KiB System, single: total=4.00MiB, used=0.00 Metadata, RAID1: total=66.00GiB, used=2.97GiB This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be? Since it's the system chunk type, it might mean the whole volume is toast if the drive containing those 32KiB dies. I'm not sure what kind of information is in the system chunk type, but I'd expect it's important enough that, if it's unavailable, mounting the file system may be difficult or impossible. Perhaps btrfs restore would still work? Anyway, it's probably a high penalty for losing only 32KiB of data. I think this could use some testing to try to reproduce conversions where some amount of system or metadata type chunks are stuck in DUP. This has come up before on the list, but I'm not sure how it's happening, as I've never encountered it. As far as I understand it, the system chunks are THE root chunk tree for the entire filesystem, that is to say, the tree of tree roots that is pointed to by the superblock. (I would love to know if this understanding is wrong.) Thus losing that data almost always means losing the whole filesystem. Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks? Since the force flag is required, does that mean that balancing system chunks is inherently risky or unpleasant? I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 mountpoint; or with -sconvert=raid1,soft, although it's probably a minor distinction for such a small amount of data.
The kernel won't allow a balance involving system chunks unless you specify force, as it considers any kind of balance using them to be dangerous. Given your circumstances, I'd personally say that the safety provided by RAID1 outweighs the risk of making the FS un-mountable. The metadata looks like it could use a balance too: 66GB of metadata chunks allocated but only 3GB used. So you could include something like -musage=50 at the same time, and that will balance any chunks with 50% or less usage. Chris Murphy Personally, I would recommend making a full backup of all the data (tar works wonderfully for this) and recreating the entire filesystem from scratch, passing all three devices to mkfs.btrfs. This should result in all the chunks being RAID1, and will also allow you to benefit from newer features.
Re: safe/necessary to balance system chunks?
On 2014-04-25 14:43, Steve Leung wrote: On 04/25/2014 12:12 PM, Austin S Hemmelgarn wrote: On 2014-04-25 13:24, Chris Murphy wrote: On Apr 25, 2014, at 8:57 AM, Steve Leung sjle...@shaw.ca wrote: I've got a 3-device RAID1 btrfs filesystem that started out life as single-device. btrfs fi df: Data, RAID1: total=1.31TiB, used=1.07TiB System, RAID1: total=32.00MiB, used=224.00KiB System, DUP: total=32.00MiB, used=32.00KiB System, single: total=4.00MiB, used=0.00 Metadata, RAID1: total=66.00GiB, used=2.97GiB This still lists some system chunks as DUP, and not as RAID1. Does this mean that if one device were to fail, some system chunks would be unrecoverable? How bad would that be? Assuming this is something that needs to be fixed, would I be able to fix this by balancing the system chunks? Since the force flag is required, does that mean that balancing system chunks is inherently risky or unpleasant? I don't think force is needed. You'd use btrfs balance start -sconvert=raid1 mountpoint; or with -sconvert=raid1,soft although it's probably a minor distinction for such a small amount of data. The kernel won't allow a balance involving system chunks unless you specify force, as it considers any kind of balance using them to be dangerous. Given your circumstances, I'd personally say that the safety provided by RAID1 outweighs the risk of making the FS un-mountable. Agreed, I'll attempt the system balance shortly. Personally, I would recommend making a full backup of all the data (tar works wonderfully for this), and recreate the entire filesystem from scratch, but passing all three devices to mkfs.btrfs. This should result in all the chunks being RAID1, and will also allow you to benefit from newer features. I do have backups of the really important stuff from this filesystem, but they're offsite. As this is just for a home system, I don't have enough temporary space for a full backup handy (which is related to how I ended up in this situation in the first place). 
Once everything gets rebalanced though, I don't think I'd be missing out on any features, would I? Steve In general, it shouldn't be an issue, but it might get you slightly better performance to recreate it. I actually have a similar situation with how I have my desktop system set up. When I recreate the filesystem (which I do every time I upgrade either the tools or the kernel), I use the following approach:
1. Delete one of the devices from the filesystem.
2. Create a new btrfs filesystem on the device just removed.
3. Copy the data from the old filesystem to the new one.
4. One at a time, delete the remaining devices from the old filesystem and add them to the new one, rebalancing the new filesystem after adding each device.
This seems to work relatively well for me, and prevents there ever being just one copy of the data. It does, however, require that the amount of data you are storing on the filesystem is less than the size of one of the devices (although you can partially work around this limitation by setting compress-force=zlib on the new filesystem when you mount it, then using defrag to decompress everything after the conversion is done), and that you drop to single-user mode for the conversion (unless it's something that isn't needed all the time, like the home directories or /usr/src, in which case you just log everyone out and log in as root on the console to do it).
Re: btrfs on bcache
On 2014-04-30 14:16, Felix Homann wrote: Hi, a couple of months ago there has been some discussion about issues when using btrfs on bcache: http://thread.gmane.org/gmane.comp.file-systems.btrfs/31018 From looking at the mailing list archives I cannot tell whether or not this issue has been resolved in current kernels from either bcache's or btrfs' side. Can anyone tell me what's the current state of this issue? Should it be safe to use btrfs on bcache by now? In all practicality, I don't think anyone who frequents the list knows. I do know that there are a number of people (myself included) who avoid bcache in general because of having issues with seemingly random kernel OOPSes when it is linked in (either as a module or compiled in), even when it isn't being used. My advice would be to just test it with some non-essential data (maybe set up a virtual machine?).
Re: Help with space
On 05/02/2014 03:21 PM, Chris Murphy wrote: On May 2, 2014, at 2:23 AM, Duncan 1i5t5.dun...@cox.net wrote: Something tells me btrfs replace (not device replace, simply replace) should be moved to btrfs device replace… The syntax for btrfs device is different though; replace is like balance: btrfs balance start and btrfs replace start. And you can also get a status on it. We don't (yet) have options to stop, pause, or resume, which could come in handy for long rebuilds where a reboot is required (?), although maybe that just gets handled automatically: set it to pause, then unmount, then reboot, then mount and resume. Well, I'd say two copies if it's only two devices in the raid1... would be true raid1. But if it's say four devices in the raid1, as is certainly possible with btrfs raid1, then if it's not mirrored 4-way across all devices, it's not true raid1, but rather some sort of hybrid raid: raid10 (or raid01) if the devices are so arranged, raid1+linear if arranged that way, or some form that doesn't nicely fall into a well-defined raid level categorization. Well, md raid1 is always n-way. So if you use -n 3 and specify three devices, you'll get 3-way mirroring (3 mirrors). But I don't know any hardware raid that works this way. They all seem to treat raid1 as strictly two devices. At 4 devices it's raid10, and only in pairs. Btrfs raid1 with 3+ devices is unique as far as I can tell. It is something like raid1 (2 copies) + linear/concat. But that allocation is round-robin. I don't read code, but based on how a 3-disk raid1 volume grows VDI files as it's filled, it looks like 1GB chunks are copied like this. Actually, MD RAID10 can be configured to work almost the same with an odd number of disks, except it uses (much) smaller chunks, and it does more intelligent striping of reads.
Disk1: 1 3 4 6 7 9
Disk2: 1 2 4 5 7 8
Disk3: 2 3 5 6 8 9
So 1 through 9 each represent a 1GB chunk. Disk 1 and 2 each have a chunk 1; disk 2 and 3 each have a chunk 2, and so on.
Total of 9GB of data taking up 18GB of space, 6GB on each drive. You can't do this with any other raid1 as far as I know. You do definitely run out of space on one disk first though, because of uneven metadata to data chunk allocation. Anyway, I think we're off the rails with raid1 nomenclature as soon as we have 3 devices. It's probably better to call it replication, with an assumed default of 2 replicas unless otherwise specified. There's definitely a benefit to a 3-device volume with 2 replicas, efficiency-wise. As soon as we go to four disks and 2 replicas it makes more sense to do raid10, although I haven't tested odd-device raid10 setups so I'm not sure what happens. Chris Murphy
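The round-robin allocation Chris describes can be simulated in a few lines. This is purely my own illustration, not btrfs code — the real allocator picks the two devices with the most free space, which for equal-size, equally-full disks produces the same pattern:

```python
from itertools import cycle

def allocate(num_chunks, num_disks=3):
    """Simulate btrfs raid1 chunk placement: each chunk gets 2 copies,
    cycling round-robin over disk pairs (1,2), (2,3), (3,1), ..."""
    pairs = [(d, (d + 1) % num_disks) for d in range(num_disks)]
    disks = [[] for _ in range(num_disks)]
    for chunk, pair in zip(range(1, num_chunks + 1), cycle(pairs)):
        for d in pair:
            disks[d].append(chunk)
    return disks

for i, d in enumerate(allocate(9), 1):
    print(f"Disk{i}: {d}")
# → Disk1: [1, 3, 4, 6, 7, 9]
#   Disk2: [1, 2, 4, 5, 7, 8]
#   Disk3: [2, 3, 5, 6, 8, 9]
```

Running it for 9 chunks on 3 disks reproduces the table from the thread exactly: every chunk lands on two distinct disks, 6 chunks per disk.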
Re: RAID-1 - suboptimal write performance?
On 05/16/2014 04:41 PM, Tomasz Chmielewski wrote: On Fri, 16 May 2014 14:06:24 -0400 Calvin Walton calvin.wal...@kepstin.ca wrote: No comment on the performance issue, other than to say that I've seen similar on RAID-10 before, I think. Also, what happens when the system crashes, and one drive has several hundred megabytes more data than the other one? This shouldn't be an issue as long as you occasionally run a scrub or balance. The scrub should find and fix the missing data, and a balance would just rewrite it as proper RAID-1 as a matter of course. It's similar (writes to just one drive, while the other is idle) when removing (many) snapshots. Not sure if that's optimal behaviour. I think, after having looked at some of the code, that I know what is causing this (although my interpretation of the code may be completely off target). As far as I can make out, BTRFS only dispatches writes to one device at a time, and the write() system call only returns when the data is on both devices. While dispatching to one device at a time is optimal when both 'devices' are partitions on the same underlying disk (and also if your optimization metric is the simplicity of the underlying code), it degrades very fast to the worst case when using multiple devices. The underlying cause, however, which the one-device-at-a-time logic in BTRFS just makes much worse, is that the buffer for the write() call is kept in memory until the write completes and counts against the per-process write-caching limit, and when the process fills up its write cache, the next call it makes that would write to the disk hangs until the write cache is less full. The two options that I've found that work around this are:
1. Run 'sync' whenever the program stalls, or
2. Disable write-caching by adding the following to /etc/sysctl.conf:
vm.dirty_bytes = 0
vm.dirty_background_bytes = 0
Option 1 is kind of tedious, but doesn't hurt performance all that much; Option 2 will lower throughput, but will cause most of the stalls to disappear. Ideally, BTRFS should dispatch the first write for a block in a round-robin fashion among the available devices. This won't fix the underlying issue, but it will make it less of an issue for BTRFS.
Re: send/receive and bedup
On 2014-05-19 13:12, Konstantinos Skarlatos wrote: On 19/5/2014 7:01 PM, Brendan Hide wrote: On 19/05/14 15:00, Scott Middleton wrote: On 19 May 2014 09:07, Marc MERLIN m...@merlins.org wrote: On Wed, May 14, 2014 at 11:36:03PM +0800, Scott Middleton wrote: I read so much about BtrFS that I mistook Bedup for Duperemove. Duperemove is actually what I am testing. I'm currently using programs that find files that are the same, and hardlink them together: http://marc.merlins.org/perso/linux/post_2012-05-01_Handy-tip-to-save-on-inodes-and-disk-space_-finddupes_-fdupes_-and-hardlink_py.html hardlink.py actually seems to be the fastest (memory and CPU) one even though it's in Python. I can get others to run out of RAM on my 8GB server easily :( Interesting app. An issue with hardlinking (with the backups use-case this problem isn't likely to happen) is that if you modify a file, all the hardlinks get changed along with it - including the ones that you don't want changed. @Marc: Since you've been using btrfs for a while now, I'm sure you've already considered whether or not a reflink copy is the better/worse option. Bedup should be better, but last I tried I couldn't get it to work. It's been updated since then, I just haven't had the chance to try it again. Please post what you find out, or if you have a hardlink maker that's better than the ones I found :) Thanks for that. I may be completely wrong in my approach. I am not looking for a file-level comparison. Bedup worked fine for that. I have a lot of virtual images and ShadowProtect images where only a few megabytes may be the difference. So a file-level hash and comparison doesn't really achieve my goals. I thought duperemove might work on a lower level. https://github.com/markfasheh/duperemove Duperemove is a simple tool for finding duplicated extents and submitting them for deduplication.
When given a list of files it will hash their contents on a block-by-block basis and compare those hashes to each other, finding and categorizing extents that match each other. When given the -d option, duperemove will submit those extents for deduplication using the btrfs-extent-same ioctl. It defaults to 128k but you can make it smaller. I hit a hurdle though. The 3TB HDD I used seemed OK when I did a long SMART test but seems to die every few hours. Admittedly it was part of a failed mdadm RAID array that I pulled out of a client's machine. The only other copy I have of the data is the original mdadm array that was recently replaced with a new server, so I am loath to use that HDD yet. At least for another couple of weeks! I am still hopeful duperemove will work. Duperemove does look exactly like what you are looking for. The last traffic on the mailing list regarding it was in August last year. It looks like it was pulled into the main kernel repository on September 1st. The last commit to the duperemove application was on April 20th this year. Maybe Mark (cc'd) can provide further insight on its current status. I have been testing duperemove and it seems to work just fine, in contrast with bedup, which I have been unable to install/compile/sort out the mess with Python versions. I have 2 questions about duperemove: 1) can it use existing filesystem csums instead of calculating its own? While this might seem like a great idea at first, it really isn't. BTRFS uses CRC32c at the moment as its checksum algorithm, and while that is relatively good at detecting small differences (i.e. a single bit flipped out of every 64 or so bytes), it is known to have issues with hash collisions. Normally, the data on disk won't change enough even from a media error to cause a hash collision, but when you start using it to compare extents that aren't known to be the same to begin with, and then try to merge those extents, you run the risk of serious file corruption.
Also, AFAIK, BTRFS doesn't expose the block checksums to userspace directly (although I may be wrong about this, in which case I retract the following statement), so this would require some kernelspace support. 2) can it be included in btrfs-progs so that it becomes a standard feature of btrfs? I would definitely like to second this suggestion; I hear a lot of people talking about how BTRFS has batch deduplication, but it's almost impossible to make use of without extra software or writing your own code.
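For the curious, the candidate-finding phase that duperemove performs can be sketched roughly like this. This is my own simplified illustration, not duperemove's actual code; note also that the btrfs-extent-same ioctl that consumes these candidates re-verifies the ranges byte-by-byte in the kernel before merging, so a hash collision at this stage cannot corrupt data:

```python
import hashlib
from collections import defaultdict

BLOCK = 128 * 1024  # duperemove's default block size

def find_duplicate_blocks(paths, block=BLOCK):
    """Map each block hash to the (path, offset) locations where it occurs;
    groups with more than one entry are dedup candidates."""
    candidates = defaultdict(list)
    for path in paths:
        with open(path, "rb") as f:
            offset = 0
            while True:
                data = f.read(block)
                if not data:
                    break
                digest = hashlib.sha256(data).hexdigest()
                candidates[digest].append((path, offset))
                offset += len(data)
    # Keep only hashes seen at more than one location.
    return {h: locs for h, locs in candidates.items() if len(locs) > 1}
```

A real tool would then hand each candidate group to the kernel for verification and extent sharing rather than trusting the hashes.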
Re: ditto blocks on ZFS
On 2014-05-19 22:07, Russell Coker wrote: On Mon, 19 May 2014 23:47:37 Brendan Hide wrote: This is extremely difficult to measure objectively. Subjectively ... see below. [snip] *What other failure modes* should we guard against? I know I'd sleep a /little/ better at night knowing that a double disk failure on a raid5/1/10 configuration might ruin a ton of data along with an obscure set of metadata in some long tree paths - but not the entire filesystem. My experience is that most disk failures that don't involve extreme physical damage (EG dropping a drive on concrete) don't involve totally losing the disk. Much of the discussion about RAID failures concerns entirely failed disks, but I believe that is due to RAID implementations such as Linux software RAID that will entirely remove a disk when it gives errors. I have a disk which had ~14,000 errors of which ~2000 errors were corrected by duplicate metadata. If two disks with that problem were in a RAID-1 array then duplicate metadata would be a significant benefit. The other use-case/failure mode - where you are somehow unlucky enough to have sets of bad sectors/bitrot on multiple disks that simultaneously affect the only copies of the tree roots - is an extremely unlikely scenario. As unlikely as it may be, the scenario is a very painful consequence in spite of VERY little corruption. That is where the peace-of-mind/bragging rights come in. http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html The NetApp research on latent errors on drives is worth reading. On page 12 they report latent sector errors on 9.5% of SATA disks per year. So if you lose one disk entirely the risk of having errors on a second disk is higher than you would want for RAID-5. While losing the root of the tree is unlikely, losing a directory in the middle that has lots of subdirectories is a risk. I can understand why people wouldn't want ditto blocks to be mandatory. But why are people arguing against them as an option? 
As an aside, I'd really like to be able to set RAID levels by subtree. I'd like to use RAID-1 with ditto blocks for my important data and RAID-0 for unimportant data. But the proposed changes for n-way replication would already handle this. They would just need the option of having more than one copy per device (which theoretically shouldn't be too hard once you have n-way replication). Also, BTRFS already has the option of replicating the root tree across multiple devices (it is included in the System Data subset), and in fact does so by default when using multiple devices. Also, there are plans to have per-subvolume or per-file RAID level selection, but IIRC that is planned for after n-way replication (and of course, RAID 5/6, as n-way replication isn't going to be implemented until after RAID 5/6).
Re: ditto blocks on ZFS
On 2014-05-21 19:05, Martin wrote: Very good comment from Ashford. Sorry, but I see no advantages from Russell's replies other than a feel-good factor or a dangerous false sense of security. At best, there is a weak justification that for metadata, again, going from 2% to 4% isn't going to be a great problem (storage is cheap and fast). I thought an important idea behind btrfs was that we avoid by design in the first place the very long and vulnerable RAID rebuild scenarios suffered by block-level RAID... On 21/05/14 03:51, Russell Coker wrote: Absolutely. Hopefully this discussion will inspire the developers to consider this an interesting technical challenge and a feature that is needed to beat ZFS. Sorry, but I think that is completely the wrong reasoning. ...Unless, that is, you are some proprietary sales droid hyping features and big numbers! :-P Personally I'm not convinced we gain anything beyond what btrfs will eventually offer in any case with the n-way raid or the raid-n Cauchy stuff. Also note that usually data is wanted to be 100% reliable and retrievable, or if that fails, you go to your backups instead. Gambling on proportions and importance rather than *ensuring* fault/error tolerance is a very human thing... ;-) Sorry: interesting idea, but not convinced there's any advantage for disk/SSD storage. Regards, Martin Another nice option in this case might be adding logic to make sure that there is some (considerable) offset between copies of metadata using the dup profile (all of the filesystems where I have actually looked at the low-level on-disk structures have had both copies of the System chunks right next to each other, right at the beginning of the disk, which of course limits the usefulness of storing two copies of them on disk).
Adding an offset in those allocations would provide some better protection against some of the more common 'idiot' failure modes (i.e. trying to use dd to write a disk image to a USB flash drive, and accidentally overwriting the first n GB of your first HDD instead). Ideally, once we have n-way replication, System chunks should default to one copy per device for multi-device filesystems.
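The placement idea can be expressed as a toy allocator. This is entirely my own illustration — btrfs's real chunk allocator works on device extents and takes no such parameter — but it shows the invariant being proposed: the second DUP copy lands at least some minimum distance from the first, so a single contiguous overwrite at the start of the disk cannot destroy both.

```python
def place_dup_copies(device_size, chunk_size, min_offset):
    """Toy placement: first copy at the start of the usable area, second
    copy at least min_offset bytes away (illustration only)."""
    first = 0
    second = max(first + chunk_size, min_offset)
    if second + chunk_size > device_size:
        raise ValueError("device too small for requested separation")
    return first, second

# e.g. a 32MiB system chunk on a 1TiB disk, copies at least 16GiB apart
first, second = place_dup_copies(1 << 40, 32 << 20, 16 << 30)
```

With those (made-up) numbers, an accidental dd over the first few GB of the disk would clobber only the first copy.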
Re: is it safe to change BTRFS_STRIPE_LEN?
On 05/24/2014 12:44 PM, john terragon wrote: Hi. I'm playing around with (software) raid0 on SSDs, and since I remember reading somewhere that Intel recommends a 128K stripe size for HDD arrays but only a 16K stripe size for SSD arrays, I wanted to see how a smaller stripe size would work on my system. Obviously with btrfs on top of md-raid I could use whatever stripe size I want. But if I'm not mistaken, the stripe size with the native raid0 in btrfs is fixed at 64K in BTRFS_STRIPE_LEN (volumes.h). So I was wondering if it would be reasonably safe to just change that to 16K (and duck and wait for the explosion ;) ). Can anyone adept at the inner workings of the btrfs raid0 code confirm whether that would be the right way to proceed? (obviously without absolutely any blame to be placed on anyone other than myself if things should go badly :) ) I personally can't render an opinion on whether changing it would break things or not, but I do know that it would need to be changed both in the kernel and in the tools, and the resultant kernel and tools would not be entirely compatible with filesystems produced by the regular tools and kernel, possibly to the point of corrupting any filesystem they touch. As for the 64K default stripe size, that sounds correct, and is probably because that's the largest block that the I/O schedulers on Linux will dispatch as a single write to the underlying device.
Re: btrfs send ioctl failed with -5: Input/output error
On 05/26/2014 05:04 PM, Michael Welsh Duggan wrote: Michael Welsh Duggan m...@md5i.com writes: I am now getting the following error when trying to do a btrfs send: root@maru2:/usr/local/src/btrfs-progs# ./btrfs send /usr/local/snapshots/2014-05-15 /backup/intermediate At subvol /usr/local/snapshots/2014-05-15 ERROR: send ioctl failed with -5: Input/output error I'm running a 3.14.4 kernel, and btrfs-progs v3.14.1. root@maru2:/usr/local/src/btrfs-progs# uname -a Linux maru2 3.14-1-amd64 #1 SMP Debian 3.14.4-1 (2014-05-13) x86_64 GNU/Linux root@maru2:/usr/local/src/btrfs-progs# ./btrfs --version Btrfs v3.14.1 Is there anything I can do to help debug this issue? I'd like to find out what is happening here. I am an experienced C programmer, but have not dealt with kernel hacking before. I _do_ know how to build and install a kernel. I'd like some hints on what logging, etc., I could add in order to determine where in the send ioctl processing the I/O error is coming from. From there, I hope to move on to why. Ideally I'd be able to run gdb on this, but nothing I have read online about kernel debugging with gdb sounds promising. I'd make an image, but the amount of data is enough to make this prohibitive. I would look into ftrace (the Tracing submenu of the Kernel Hacking menu in menuconfig and nconfig). The other thing to look at, at least initially, is KDB, but using that requires either a serial console or an AT or PS/2 keyboard. Using UML and GDB for debugging is possible, but it can take a long time to set up and is often slow (admittedly, KDB isn't much faster). If you do go the UML+GDB route, make sure to build in fault injection (the dm-flakey DM target is particularly nice for this type of thing), and I would suggest using something like Buildroot (http://www.buildroot.net) to generate the root filesystem for it.
Re: BTRFS, SSD and single metadata
On 2014-06-16 03:54, Swâmi Petaramesh wrote: Hi, I created a BTRFS filesystem over LVM over LUKS encryption on an SSD [yes, I know...], and I noticed that the FS got created with metadata in DUP mode, contrary to what man mkfs.btrfs says for SSDs - it would be supposed to be SINGLE... Well, I don't know if my system didn't identify the SSD because of the LVM+LUKS stack (however it mounts well by itself with the ssd flag and accepts the discard option [yes, I know...]), or if the manpage is obsolete, or if this feature just doesn't work...? The SSD being a Micron RealSSD C400. For both SSD preservation and data integrity, would it be advisable to change the metadata to SINGLE using a rebalance, or should I just leave things the way they are...? TIA for any insight. What mkfs.btrfs looks at is /sys/block/whatever-device/queue/rotational; if that is 1, it knows that the device isn't an SSD. I believe that LVM passes through whatever the next lower layer's value is, but dmcrypt (and by extension LUKS) always forces it to 1 (possibly to prevent programs from using heuristics for enabling discard).
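The same check can be made from userspace; a small sketch (the sysfs path layout is real, but the helper and its fallback behaviour are my own):

```python
from pathlib import Path

def is_nonrotational(dev_name, sysfs_root="/sys"):
    """Return True if the kernel reports the block device as non-rotational
    (i.e. an SSD), mirroring the check mkfs.btrfs performs. dev_name is the
    base device name, e.g. 'sda' or 'dm-1'."""
    flag = Path(sysfs_root) / "block" / dev_name / "queue" / "rotational"
    try:
        return flag.read_text().strip() == "0"
    except OSError:
        # Attribute missing or unreadable: assume rotational, the
        # conservative default.
        return False
```

On the system above, `is_nonrotational("dm-1")` would return True, since /sys/block/dm-1/queue/rotational reads 0.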
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 2014-06-16 06:35, Russell Coker wrote: On Mon, 16 Jun 2014 12:14:49 Lennart Poettering wrote: On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following though why this trips up btrfs though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, fallocate() isn't necessarily supposed to write anything really, it's mostly about allocating disk space in advance. I would claim that journald's usage of it is very much within the entire reason why it exists... I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur, so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from the access patterns of other database or database-like engines... Not being too different from the access patterns of other databases means having all the same problems as other databases... Oracle is now selling ZFS servers specifically designed for running the Oracle database, but that's with hybrid flash storage (ZIL and L2ARC on SSD). While BTRFS doesn't support features equivalent to ZIL and L2ARC, it's easy to run a separate filesystem on SSD for things that need performance (few if any current BTRFS users would have a database too big to fit entirely on an SSD). The problem we are dealing with is database-like access patterns on systems that are not designed as database servers. Would it be possible to get an interface for defragmenting files that's not specific to BTRFS? If we had a standard way of doing this then systemd-journald could request a defragment of the file at appropriate times.
While this is a wonderful idea, what about all the extra I/O this will cause (and all the extra wear on SSDs)? While I understand wanting this to be faster, you should also consider the fact that defragmenting the file on a regular basis is going to trash performance for other applications.
Re: BTRFS, SSD and single metadata
On 2014-06-16 07:18, Swâmi Petaramesh wrote: Hi Austin, and thanks for your reply. On Monday 16 June 2014 at 07:09:55, Austin S Hemmelgarn wrote: What mkfs.btrfs looks at is /sys/block/whatever-device/queue/rotational; if that is 1, it knows that the device isn't an SSD. I believe that LVM passes through whatever the next lower layer's value is, but dmcrypt (and by extension LUKS) always forces it to 1 (possibly to prevent programs from using heuristics for enabling discard). In the current running condition, the system clearly sees this is *not* rotational, even through the LVM/dmcrypt stack: # mount | grep btrfs /dev/mapper/VG-LINUX on / type btrfs (rw,noatime,seclabel,compress=lzo,ssd,discard,space_cache,autodefrag) # ll /dev/mapper/VGV-LINUX lrwxrwxrwx. 1 root root 7 16 juin 09:21 /dev/mapper/VG-LINUX - ../dm-1 # cat /sys/block/dm-1/queue/rotational 0 ...However, at mkfs.btrfs time, it might well not have seen it, as I ran it from a live USB key on which both lvm.conf and crypttab had not been tailored to allow trim commands... However, now that the FS is created, I still wonder whether I should use a rebalance to change the metadata from DUP to SINGLE, or if I'd better stay with DUP... Kind regards. I'd personally stay with the DUP profile, but then that's just me being paranoid. You will almost certainly get better performance using the SINGLE profile instead of DUP, but this is mostly due to it requiring fewer blocks to be encrypted by LUKS (which is almost certainly your primary bottleneck unless you have some high-end crypto-accelerator card).
Re: [systemd-devel] Slow startup of systemd-journal on BTRFS
On 06/16/2014 03:52 PM, Martin wrote: On 16/06/14 17:05, Josef Bacik wrote: On 06/16/2014 03:14 AM, Lennart Poettering wrote: On Mon, 16.06.14 10:17, Russell Coker (russ...@coker.com.au) wrote: I am not really following why this trips up btrfs though. I am not sure I understand why this breaks btrfs COW behaviour. I mean, I don't believe that fallocate() makes any difference to fragmentation on BTRFS. Blocks will be allocated when writes occur so regardless of an fallocate() call the usage pattern in systemd-journald will cause fragmentation. journald's write pattern looks something like this: append something to the end, make sure it is written, then update a few offsets stored at the beginning of the file to point to the newly appended data. This is of course not easy to handle for COW file systems. But then again, it's probably not too different from access patterns of other database or database-like engines... Even though this appears to be a problem case for btrfs/COW, is there a more favourable write/access sequence possible that is easily implemented that is favourable for both ext4-like fs /and/ COW fs? Database-like writing is known 'difficult' for filesystems: Can a data log be a simpler case? Was waiting for you to show up before I said anything since most systemd related emails always devolve into how evil you are rather than what is actually happening. Ouch! Hope you two know each other!! :-P :-) [...] since we shouldn't be fragmenting this badly. Like I said what you guys are doing is fine, if btrfs falls on its face then it's not your fault. I'd just like an exact idea of when you guys are fsync'ing so I can replicate in a smaller way. Thanks, Good if COW can be so resilient. I have about 2GBytes of data logging files and I must defrag those as part of my backups to stop the system fragmenting to a stop (I use cp -a to defrag the files to a new area and restart the data software logger on that).
Random thoughts: Would using a second small file just for the mmap-ed pointers help avoid repeated rewriting of random offsets in the log file causing excessive fragmentation? Align the data writes to 16kByte or 64kByte boundaries/chunks? Are mmap-ed files a similar problem to using a swap file and so should the same btrfs file swap code be used for both? Not looked over the code so all random guesses... Regards, Martin Just a thought, partly inspired by the mention of the swap code, has anyone tried making the file NOCOW and pre-allocating to the max journal size? A similar approach has seemed to help on my systems with generic log files (I keep debug level logs from almost everything, so I end up with very active log files with ridiculous numbers of fragments if I don't pre-allocate and mark them NOCOW). I don't know for certain how BTRFS handles appends to NOCOW files, but I would be willing to bet that it ends up with a new fragment for each filesystem block worth of space allocated.
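For anyone wanting to try the NOCOW-plus-preallocate approach programmatically, here is a sketch. The ioctl numbers are the x86-64 values of FS_IOC_GETFLAGS/FS_IOC_SETFLAGS (they differ on other architectures), and the flag change simply fails harmlessly on filesystems without COW semantics; the function name is made up for illustration:

```python
import fcntl
import os
import struct

FS_IOC_GETFLAGS = 0x80086601   # _IOR('f', 1, long); x86-64 value
FS_IOC_SETFLAGS = 0x40086602   # _IOW('f', 2, long); x86-64 value
FS_NOCOW_FL = 0x00800000       # from linux/fs.h

def preallocate_nocow(path, size):
    """Create `path`, try to mark it NOCOW, then preallocate `size` bytes.

    The NOCOW flag only takes effect on btrfs and must be set while the
    file is still empty, which is why this is done before fallocate.
    """
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
    try:
        try:
            buf = fcntl.ioctl(fd, FS_IOC_GETFLAGS, struct.pack("l", 0))
            flags = struct.unpack("l", buf)[0] | FS_NOCOW_FL
            fcntl.ioctl(fd, FS_IOC_SETFLAGS, struct.pack("l", flags))
        except OSError:
            pass  # not btrfs, or flag unsupported; preallocate anyway
        os.posix_fallocate(fd, 0, size)
    finally:
        os.close(fd)
```

This is the same effect as `chattr +C file && fallocate -l SIZE file` from a shell, just done atomically on a freshly created file.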
Re: btrfs on whole disk (no partitions)
On 2014-06-18 16:10, Chris Murphy wrote: On Jun 18, 2014, at 1:29 PM, Daniel Cegiełka daniel.cegie...@gmail.com wrote: Hi, I created btrfs directly to disk using such a scheme (no partitions): dd if=/dev/zero of=/dev/sda bs=4096 mkfs.btrfs -L dev_sda /dev/sda mount /dev/sda /mnt cd /mnt btrfs subvolume create __active btrfs subvolume create __active/rootvol btrfs subvolume create __active/usr btrfs subvolume create __active/home btrfs subvolume create __active/var btrfs subvolume create __snapshots cd / umount /mnt mount -o subvol=__active/rootvol /dev/sda /mnt mkdir /mnt/{usr,home,var} mount -o subvol=__active/usr /dev/sda /mnt/usr mount -o subvol=__active/home /dev/sda /mnt/home mount -o subvol=__active/var /dev/sda /mnt/var # /etc/fstab UUID=ID / btrfs rw,relative,space_cache,subvol=__active/rootvol 0 0 UUID=ID /usr btrfs rw,relative,space_cache,subvol=__active/usr 0 0 UUID=ID /home btrfs rw,relative,space_cache,subvol=__active/home 0 0 UUID=ID /var btrfs rw,relative,space_cache,subvol=__active/var 0 0 rw and space_cache are redundant because they are default; and relative is not a valid mount option. All you need is subvol= Everything works fine. Is such a solution recommended? In my opinion, the creation of the partitions seems to be completely unnecessary if you can use btrfs. It's firmware specific. Some BIOS firmwares will want to see a valid MBR partition map at LBA 0, not just boot code. Others only care to blindly execute the boot code which would be put in the Btrfs bootloader pad (64KB). I don't know if parted 3.1 recognizes partitionless disks with Btrfs though so it might slightly increase the risk that it's treated as something other than what it is. For UEFI firmware, it would definitely need to be partitioned since an EFI System partition is required.
Chris Murphy On most hardware, I would definitely suggest at least adding a minimally sized partition table; the people who design the BIOS code on most systems make too many assumptions to trust their code to work correctly. That said, I regularly use BTRFS on flat devices for the root filesystems for Xen PV guest systems, systems that boot from SAN, and secondary disks on other systems with no issues whatsoever.
Questions about BTRFS_IOC_FILE_EXTENT_SAME
I have a few questions about the BTRFS_IOC_FILE_EXTENT_SAME ioctl, and was hoping that I could get answers here without having to go source diving or trying to test things myself: 1. What kind of overhead is there when it is called on a group of extents that aren't actually the same (aside from the obvious pair of context-switches that are required for an ioctl)? I would think that it would bail at the first difference it finds, but I have learned that when it comes to kernel code, just because something seems obvious doesn't mean that's how it's done. 2. Does it matter if the ranges passed in are actual extents, or can they be arbitrary ranges of equal bytes in the files? 3. What happens if one of the ranges is truncated by the end of a file? IOW, if I have files A and B, and file A is longer than file B, and file B is identical to the start of file A, what happens if I pass in both files starting at offset 0, but pass the length of file A instead of passing in the length of file B? 4. Does it matter if one of the extents passed in is compressed and the other is not? Thanks in advance.
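For context on what the ioctl actually takes: its argument is variable-length, a header describing the source range followed by one record per destination file. A sketch of the buffer layout, with field order taken from linux/btrfs.h (the helper and the example values are purely illustrative; real callers must also read back each destination's per-extent status field after the call):

```python
import struct

# struct btrfs_ioctl_same_args: logical_offset, length, dest_count,
# reserved1, reserved2 (little-endian, x86-64 layout)
SAME_ARGS = struct.Struct("<QQHHI")
# struct btrfs_ioctl_same_extent_info: fd, logical_offset, length,
# status (filled in by the kernel), reserved
EXTENT_INFO = struct.Struct("<qQQiI")

BTRFS_IOC_FILE_EXTENT_SAME = 0xC0189436  # _IOWR(0x94, 54, 24-byte args)

def pack_same_args(src_offset, length, dests):
    """Build the ioctl argument buffer.

    `dests` is a list of (fd, offset) pairs; status and the reserved
    fields start as zero.  The resulting buffer would be passed as
    fcntl.ioctl(src_fd, BTRFS_IOC_FILE_EXTENT_SAME, buf).
    """
    buf = SAME_ARGS.pack(src_offset, length, len(dests), 0, 0)
    for fd, off in dests:
        buf += EXTENT_INFO.pack(fd, off, length, 0, 0)
    return buf
```

(On newer kernels the same operation is exposed filesystem-independently as FIDEDUPERANGE with an equivalent structure.)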
Re: -d single for data blocks on a multiple devices doesn't work as it should
I somehow have doubts that a complex filesystem is the right project for me to start learning C, so I'll have to pass :-) No huge corporation with that itch behind me either, and I guess it will be more than a few hours for a btrfs programmer so no way I could sponsor that on my own. Whether or not it is the right project really depends on where you intend to do most of your C programming. If you plan to do most of it in kernel code and occasional userspace wrappers for kernel interfaces (like me), then it could be a great place because it's under such heavy development (which means more developers are working on it, and bugs get spotted faster, both of which are good things for a project you are using to learn a language). If, however, you intend to use it mostly for userspace, then I would definitely agree with you; programming in userspace and kernel-space are so different that it's almost like a different language using the same syntax and similar semantics.
Re: [Question] Btrfs on iSCSI device
On 2014-06-27 12:34, Goffredo Baroncelli wrote: Hi, On 06/27/2014 05:44 PM, Zhe Zhang wrote: Hi, I setup 2 Linux servers to share the same device through iSCSI. Then I created a btrfs on the device. Then I saw the problem that the 2 Linux servers do not see a consistent file system image. Details: -- Server 1 running kernel 2.6.32, server 2 running 3.2.1 -- Both running btrfs v0.20-rc1 -- Server 2 has device /dev/vdc, exposed as iSCSI target -- Server 1 mounts the device as /dev/sda -- Server 1 'mount /dev/sda /mnt/btrfs'; server 2 'mount /dev/vdc /mnt/btrfs', -- When server 1 'touch /mnt/btrfs/foo', server 2 doesn't see any file under /mnt/btrfs -- I created /mnt/btrfs/foo on server 2 as well; then I added some content from both server 1 and server 2 to /mnt/btrfs/foo -- After that each server sees the content it adds, but not the content from the other server -- Both server 'umount /mnt/btrfs', and mount it again -- Then both servers see /mnt/btrfs/foo with the content added from server 2 (I guess it's because server 2 created the foo file later than server 1). I did a similar test on ext4 and both servers see a consistent image of the file system. When server 1 creates a foo file server 2 immediately sees it. Is this how btrfs is supposed to work? I don't think that it is possible to mount the _same device_ at the _same time_ on two different machines. And this doesn't depend on the filesystem. The fact that you see it working, I suspect, is coincidental. When I tried this (same SCSI HD connected to two machines), I had to ensure that the two machines never accessed the HD at the same time.
Thanks, Zhe If you need shared storage like that, you need to use a real cluster filesystem like GFS2 or OCFS2; BTRFS isn't designed for any kind of concurrent access to shared storage from separate systems. The reason it appears to work when using iSCSI and not with directly connected parallel SCSI or SAS is that iSCSI doesn't provide low-level hardware access.
Re: [Question] Btrfs on iSCSI device
On 06/27/2014 07:40 PM, Russell Coker wrote: On Fri, 27 Jun 2014 18:34:34 Goffredo Baroncelli wrote: I don't think that it is possible to mount the _same device_ at the _same time_ on two different machines. And this doesn't depend on the filesystem. If you use a clustered filesystem then you can safely mount it on multiple machines. If you use a non-clustered filesystem it can still mount and even appear to work for a while. It's surprising how many writes you can make to a dual-mounted filesystem that's not designed for such things before you get a totally broken filesystem. On Fri, 27 Jun 2014 13:15:16 Austin S Hemmelgarn wrote: The reason it appears to work when using iSCSI and not with directly connected parallel SCSI or SAS is that iSCSI doesn't provide low-level hardware access. I've tried this with dual-attached FC and had no problems mounting. In what way is directly connected SCSI different from FC? FC is actually its own networking stack (and you can even run (in theory) other protocols like IP and ATM on top of it), whereas parallel SCSI is just a multi-drop bus, and SAS is just a tree-structured bus with point-to-point communications emulated on top of it. In other words, parallel SCSI has topological constraints like RS-485, SAS has topology constraints like USB, and FC has topology constraints like Ethernet. Secondarily, most filesystems on Linux will let you mount them multiple times on separate hosts (ext4 has features to prevent this, but they are expensive and therefore turned off by default; I think XFS might have similar features, but I'm not sure). BTRFS should in theory be more resilient than most because of the COW nature (as long as it's only a few commit cycles, you should still be able to recover most of the data just fine).
Re: mount time of multi-disk arrays
On 2014-07-07 09:54, Konstantinos Skarlatos wrote: On 7/7/2014 4:38 PM, André-Sebastian Liebe wrote: Hello List, can anyone tell me how much time is acceptable and assumable for a multi-disk btrfs array with classical hard disk drives to mount? I'm having a bit of trouble with my current systemd setup, because it couldn't mount my btrfs raid anymore after adding the 5th drive. With the 4 drive setup it failed to mount once every few attempts. Now it fails every time because the default timeout of 1m 30s is reached and mount is aborted. My last 10 manual mounts took between 1m57s and 2m12s to finish. I have the exact same problem, and have to manually mount my large multi-disk btrfs filesystems, so I would be interested in a solution as well. My hardware setup contains a - Intel Core i7 4770 - Kernel 3.15.2-1-ARCH - 32GB RAM - dev 1-4 are 4TB Seagate ST4000DM000 (5900rpm) - dev 5 is a 4TB Western Digital WDC WD40EFRX (5400rpm) Thanks in advance André-Sebastian Liebe -- # btrfs fi sh Label: 'apc01_pool0' uuid: 066141c6-16ca-4a30-b55c-e606b90ad0fb Total devices 5 FS bytes used 14.21TiB devid1 size 3.64TiB used 2.86TiB path /dev/sdd devid2 size 3.64TiB used 2.86TiB path /dev/sdc devid3 size 3.64TiB used 2.86TiB path /dev/sdf devid4 size 3.64TiB used 2.86TiB path /dev/sde devid5 size 3.64TiB used 2.88TiB path /dev/sdb Btrfs v3.14.2-dirty # btrfs fi df /data/pool0/ Data, single: total=14.28TiB, used=14.19TiB System, RAID1: total=8.00MiB, used=1.54MiB Metadata, RAID1: total=26.00GiB, used=20.20GiB unknown, single: total=512.00MiB, used=0.00 This is interesting, I actually did some profiling of the mount timings for a bunch of different configurations of 4 (identical other than hardware age) 1TB Seagate disks. One of the arrangements I tested was Data using single profile and Metadata/System using RAID1. Based on the results I got, and what you are reporting, the mount time doesn't scale linearly in proportion to the amount of storage space.
You might want to try the RAID10 profile for Metadata; of the configurations I tested, the fastest used Single for Data and RAID10 for Metadata/System. Also, based on the System chunk usage, I'm guessing that you have a LOT of subvolumes/snapshots, and I do know that having very large (100+) numbers of either does slow down the mount command (I don't think that we cache subvolume information between mount invocations, so it has to re-parse the system chunks for each individual mount).
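On the systemd side, rather than mounting by hand every boot, the timeout can also be raised per mount point in fstab (assuming a systemd new enough to honor the x-systemd.* mount options; the UUID below is the pool from the `btrfs fi sh` output above):

```
# /etc/fstab: give the 5-disk array five minutes to assemble
# before systemd gives up on the mount unit
UUID=066141c6-16ca-4a30-b55c-e606b90ad0fb  /data/pool0  btrfs  defaults,x-systemd.device-timeout=300s  0  0
```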
Re: btrfs RAID with enterprise SATA or SAS drives
On 2014-07-09 22:10, Russell Coker wrote: On Wed, 9 Jul 2014 16:48:05 Martin Steigerwald wrote: - for someone using SAS or enterprise SATA drives with Linux, I understand btrfs gives the extra benefit of checksums, are there any other specific benefits over using mdadm or dmraid? I think I can answer this one. Most important advantage I think is BTRFS is aware of which blocks of the RAID are in use and need to be synced: - Instant initialization of RAID regardless of size (unless at some capacity mkfs.btrfs needs more time) From mdadm(8): --assume-clean Tell mdadm that the array pre-existed and is known to be clean. It can be useful when trying to recover from a major failure as you can be sure that no data will be affected unless you actually write to the array. It can also be used when creating a RAID1 or RAID10 if you want to avoid the initial resync, however this practice — while normally safe — is not recommended. Use this only if you really know what you are doing. When the devices that will be part of a new array were filled with zeros before creation the operator knows the array is actually clean. If that is the case, such as after running badblocks, this argument can be used to tell mdadm the facts the operator knows. While it might be regarded as a hack, it is possible to do a fairly instant initialisation of a Linux software RAID-1. This has the notable disadvantage however that the first scrub you run will essentially perform a full resync if you didn't make sure that the disks had identical data to begin with. - Rebuild after disk failure or disk replace will only copy *used* blocks Have you done any benchmarks on this? The down-side of copying used blocks is that you first need to discover which blocks are used. Given that seek time is a major bottleneck at some portion of space used it will be faster to just copy the entire disk.
I haven't done any tests on BTRFS in this regard, but I've seen a disk replacement on ZFS run significantly slower than a dd of the block device would. First of all, this isn't really a good comparison for two reasons: 1. EVERYTHING on ZFS (or any filesystem that tries to do that much work) is slower than a dd of the raw block device. 2. Even if the throughput is lower, this is only really an issue if the disk is more than half full, because you don't copy the unused blocks Also, while it isn't really a recovery situation, I recently upgraded from a 2 1TB disk BTRFS RAID1 setup to a 4 1TB disk BTRFS RAID10 setup, and the performance of the re-balance really wasn't all that bad. I have maybe 100GB of actual data, so the array started out roughly 10% full, and the re-balance only took about 2 minutes. Of course, it probably helps that I make a point to keep my filesystems de-fragmented, scrub and balance regularly, and don't use a lot of sub-volumes or snapshots, so the filesystem in question is not too different from what it would have looked like if I had just wiped the FS and restored from a backup. Scrubbing can repair from good disk if RAID with redundancy, but SoftRAID should be able to do this as well. But also for scrubbing: BTRFS only checks and repairs used blocks. When you scrub Linux Software RAID (and in fact pretty much every RAID) it will only correct errors that the disks flag. If a disk returns bad data and says that it's good then the RAID scrub will happily copy the bad data over the good data (for a RAID-1) or generate new valid parity blocks for bad data (for RAID-5/6). http://research.cs.wisc.edu/adsl/Publications/corruption-fast08.html Page 12 of the above document says that nearline disks (i.e. the ones people like me can afford for home use) have a 0.466% incidence of returning bad data and claiming it's good in a year. Currently I run about 20 such disks in a variety of servers, workstations, and laptops.
Therefore the probability of having no such errors on all those disks would be .99534^20=.91081. The probability of having no such errors over a period of 10 years would be (.99534^20)^10=.39290 which means that over 10 years I should expect to have such errors, which is why BTRFS RAID-1 and DUP metadata on single disks are necessary features.
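The arithmetic is easy to double-check, starting from the paper's 0.466%-per-disk-per-year figure for silent corruption:

```python
# Chance that one nearline disk returns *no* silently-corrupted data
# in a year, per the FAST'08 paper cited above.
p_clean_one_disk_year = 1 - 0.00466          # 0.99534

# Across 20 independent disks for one year, then for a decade.
p_clean_20_disks_year = p_clean_one_disk_year ** 20    # about 0.911
p_clean_20_disks_decade = p_clean_20_disks_year ** 10  # about 0.393
```

So over ten years there is a roughly 60% chance of at least one silent corruption event across a 20-disk fleet, which is the argument for checksummed redundancy.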
Re: Btrfs transaction checksum corruption losing root of the tree bizarre UUID change.
On 07/10/2014 07:32 PM, Tomasz Kusmierz wrote: Hi all ! So it's been some time with btrfs, and so far I was very pleased, but since I've upgraded ubuntu from 13.10 to 14.04 problems started to occur (YES I know this might be unrelated). So in the past I've had problems with btrfs which turned out to be caused by static from a printer generating some corruption in RAM, causing checksum failures on the file system - so I'm not going to assume that there is something wrong with btrfs from the start. Anyway: On my server I'm running 6 x 2TB disks in raid 10 for general storage and 2 x ~0.5 TB raid 1 for system. Might be unrelated, but after upgrading to 14.04 I've started using ownCloud, which uses Apache and MySQL for its backing store - all data stored on the storage array, mysql was on the system array. It all started with csum errors showing up in mysql data files and in some transactions !!! Generally the system was immediately switching all btrfs to read-only mode, forced by the kernel (don't have dmesg / syslog now). Removed the offending files, the problem seemed to go away, and I started from scratch. After 5 days the problem reappeared and was now located around the same mysql files and in files managed by Apache as the cloud. At this point, since these files are rather dear to me, I've decided to pull out all the stops and try to rescue as much as I can. As an exercise in btrfs management I've run btrfsck --repair - did not help. Repeated with --init-csum-tree - turned out that this left me with a blank system array. Nice! Could use some warning here. I know that this will eventually be pointed out by somebody, so I'm going to save them the trouble and mention that it does say on both the wiki and in the manpages that btrfsck should be a last resort (ie, after you have made sure you have backups of anything on the FS). I've moved all the drives to my main rig, which has a nice 16GB of ECC RAM, so errors from RAM, CPU, or controller should theoretically be eliminated.
I've used the system array drives and a spare drive to extract all the dear-to-me files to a newly created array (1TB + 500GB + 640GB). Ran a scrub on it and everything seemed OK. At this point I've deleted the dear-to-me files from the storage array and ran a scrub. Scrub now showed even more csum errors in transactions and one large file that was not touched FOR A VERY LONG TIME (size ~1GB). Deleted the file. Ran scrub - no errors. Copied the dear-to-me files back to the storage array. Ran scrub - no issues. Deleted the files from my backup array and decided to call it a day. Next day I decided to run a scrub once more just to be sure; this time it discovered a myriad of errors in files and transactions. Since I had no time to continue I decided to postpone to the next day - next day I started my rig and noticed that both the backup array and the storage array do not mount anymore. I was attempting to rescue the situation without any luck. Power cycled the PC and on next startup both arrays failed to mount; when I tried to mount the backup array, mount told me that this specific uuid DOES NOT EXIST !?!?! my fstab uuid: fcf23e83-f165-4af0-8d1c-cd6f8d2788f4 new uuid: 771a4ed0-5859-4e10-b916-07aec4b1a60b tried to mount by /dev/sdb1 and it did mount. Tried by new uuid and it did mount as well. Scrub passes with flying colours on the backup array while the storage array still fails to mount with: root@ubuntu-pc:~# mount /dev/sdd1 /arrays/@storage/ mount: wrong fs type, bad option, bad superblock on /dev/sdd1, missing codepage or helper program, or other error In some cases useful info is found in syslog - try dmesg | tail or so Honestly this is a question for more senior guys - what should I do now? Chris Mason - have you got any updates to your old friend stress.sh? If not I can try using the previous version that you provided to stress test my system - but this is the second system that exposes this erratic behaviour.
Anyone - what can I do to rescue my beloved files (no sarcasm with zfs / ext4 / tapes / DVDs) ps. needless to say: SMART - no SATA CRC errors, no reallocated sectors, no errors whatsoever (as much as I can see). First thing that I would do is some very heavy testing with tools like iozone and fio. I would use the verify mode from iozone to further check data integrity. My guess based on what you have said is that it is probably issues with either the storage controller (I've had issues with almost every brand of SATA controller other than Intel, AMD, Via, and Nvidia, and it almost always manifested as data corruption under heavy load), or something in the disk's firmware. I would still suggest double-checking your RAM with Memtest, and check the cables on the drives. The one other thing that I can think of is potential voltage sags from the PSU (either because the PSU is overloaded at times, or because of really noisy/poorly-conditioned line power). Of course, I may be
Re: 1 week to rebuid 4x 3TB raid10 is a long time!
On 07/20/2014 10:00 AM, Tomasz Torcz wrote: On Sun, Jul 20, 2014 at 01:53:34PM +, Duncan wrote: TM posted on Sun, 20 Jul 2014 08:45:51 + as excerpted: One week for a raid10 rebuild of 4x3TB drives is a very long time. Any thoughts? Can you share any statistics from your RAID10 rebuilds? At a week, that's nearly 5 MiB per second, which isn't great, but isn't entirely out of the realm of reason either, given all the processing it's doing. A day would be 33.11+, reasonable throughput for a straight copy, and a raid rebuild is rather more complex than a straight copy, so... Uhm, sorry, but 5MBps is _entirely_ unreasonable. It is order-of-magnitude unreasonable. And all the processing shouldn't even show as a blip on modern CPUs. This speed is indefensible. I wholly agree that it's indefensible, but I can tell you why it is so slow; it's not 'all the processing' (which is maybe a few hundred instructions on x86 for each block), it's because BTRFS still serializes writes to devices, instead of queuing all of them in parallel (that is, when there are four devices that need to be written to, it writes to each one in sequence, waiting for the previous write to finish before dispatching the next write). Personally, I would love to see this behavior improved, but I really don't have any time to work on it myself.
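Duncan's figures check out arithmetically; a quick back-of-the-envelope (assuming the rebuild rewrites one drive's worth of data, 3 TB):

```python
# Throughput implied by rebuilding one replacement drive's 3 TB of data
# in a week vs. in a day.
TB = 10 ** 12
MiB = 2 ** 20

drive_bytes = 3 * TB
rate_week_mib = drive_bytes / (7 * 86400) / MiB   # "nearly 5 MiB/s"
rate_day_mib = drive_bytes / 86400 / MiB          # "33.11+" MiB/s
```

So the "nearly 5 MiB per second" and "33.11+" numbers above both follow from a straight 3 TB-per-drive assumption.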
Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.
On 07/24/2014 05:28 PM, Chris Mason wrote: On 06/26/2014 11:53 PM, Qu Wenruo wrote: Current btrfs will only use the first superblock, making the backup superblocks only useful for the 'btrfs rescue super' command. The old problem is that if we use backup superblocks when the first superblock is not valid, we will be able to mount a non-btrfs filesystem which used to contain btrfs but has had another fs made on it. The old problem can be solved relatively easily by checking the first superblock in a special way: 1) If the magic number in the first superblock does not match: This filesystem is not btrfs anymore, just exit. If the end-user considers it really btrfs, then the old 'btrfs rescue super' method is still available. 2) If the magic number in the first superblock matches but the checksum does not match: This filesystem is btrfs but the first superblock is corrupted, use backup roots. Just continue searching the remaining superblocks. I do agree that in these cases we can trust that the backup superblock comes from the same filesystem. But, for right now I'd prefer the admin get involved in using the backup supers. I think silently using the backups is going to lead to surprises. Maybe there could be a non-default mount option to use backup superblocks iff the first one is corrupted, and then log a warning whenever this actually happens? Not handling stuff like this automatically really hurts HA use cases.
Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted.
On 07/27/2014 08:29 PM, Qu Wenruo wrote: Original Message Subject: Re: [PATCH RFC] btrfs: Use backup superblocks if and only if the first superblock is valid but corrupted. From: Austin S Hemmelgarn ahferro...@gmail.com To: Chris Mason c...@fb.com, Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org Date: 2014-07-27 10:57 On 07/24/2014 05:28 PM, Chris Mason wrote: On 06/26/2014 11:53 PM, Qu Wenruo wrote: Current btrfs will only use the first superblock, making the backup superblocks only useful for the 'btrfs rescue super' command. The old problem is that if we use backup superblocks when the first superblock is not valid, we will be able to mount a non-btrfs filesystem which used to contain btrfs but has had another fs made on it. The old problem can be solved relatively easily by checking the first superblock in a special way: 1) If the magic number in the first superblock does not match: This filesystem is not btrfs anymore, just exit. If the end-user considers it really btrfs, then the old 'btrfs rescue super' method is still available. 2) If the magic number in the first superblock matches but the checksum does not match: This filesystem is btrfs but the first superblock is corrupted, use backup roots. Just continue searching the remaining superblocks. I do agree that in these cases we can trust that the backup superblock comes from the same filesystem. But, for right now I'd prefer the admin get involved in using the backup supers. I think silently using the backups is going to lead to surprises. Maybe there could be a non-default mount option to use backup superblocks iff the first one is corrupted, and then log a warning whenever this actually happens? Not handling stuff like this automatically really hurts HA use cases. This seems better and comments also show this idea. What about merging the behavior into the 'recovery' mount option or adding a new mount option? Personally, I'd add a new mount option, but make recovery imply that option.
Re: Multi Core Support for compression in compression.c
On 07/27/2014 04:47 PM, Nick Krause wrote: This may be a bad idea, but compression in btrfs seems to be only using one core to compress. Depending on the CPU used and the amount of cores in the CPU we can make this much faster with multiple cores. This seems bad by my reading at least. I would recommend for writing compression we write a function to use a certain amount of cores based on the load of the system's CPU, not using more than 75% of the system's CPU resources, as my system when idle has never needed more than one core of my i5 2500k even with the interrupts from opening Eclipse running. For reading, compression on one core seems fine to me, as testing other compression software for reads, it's way less CPU intensive. Cheers Nick We would probably get a bigger benefit from taking an approach like SquashFS has recently added, that is, allowing multi-threaded decompression for reads, and decompressing directly into the pagecache. Such an approach would likely make zlib compression much more scalable on large systems.
Re: Multi Core Support for compression in compression.c
On 07/27/2014 11:21 PM, Nick Krause wrote: On Sun, Jul 27, 2014 at 10:56 PM, Austin S Hemmelgarn ahferro...@gmail.com wrote: On 07/27/2014 04:47 PM, Nick Krause wrote: This may be a bad idea, but compression in btrfs seems to be only using one core to compress. Depending on the CPU used and the amount of cores in the CPU we can make this much faster with multiple cores. This seems bad by my reading at least. I would recommend for writing compression we write a function to use a certain amount of cores based on the load of the system's CPU, not using more than 75% of the system's CPU resources, as my system when idle has never needed more than one core of my i5 2500k even with the interrupts from opening Eclipse running. For reading, compression on one core seems fine to me, as testing other compression software for reads, it's way less CPU intensive. Cheers Nick We would probably get a bigger benefit from taking an approach like SquashFS has recently added, that is, allowing multi-threaded decompression for reads, and decompressing directly into the pagecache. Such an approach would likely make zlib compression much more scalable on large systems. Austin, That seems better than my idea, as you seem to be more up to date on btrfs development. If you and the other developers of btrfs are interested in adding this as a feature please let me know, as I would like to help improve btrfs; the filesystem as an idea is great, it just seems like it needs a lot of work :). Nick I wouldn't say that I am a BTRFS developer (power user maybe?), but I would definitely say that parallelizing compression on writes would be a good idea too (especially for things like lz4, which IIRC is either in 3.16 or in the queue for 3.17). Both options would be a lot of work, but almost any performance optimization would.
I would almost say that it would provide a bigger performance improvement to get BTRFS to intelligently stripe reads and writes (at the moment, any given worker thread only dispatches one write or read to a single device at a time, and any given write() or read() syscall gets handled by only one worker).
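To make the idea concrete, here's a minimal userspace sketch (Python, purely illustrative — not btrfs code) of spreading per-chunk compression across cores. The 128 KiB chunk size and the worker count are assumptions for the example; zlib releases the GIL while compressing, so plain threads genuinely parallelize here:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 128 * 1024  # assumed chunking granularity for the example

def compress_chunk(chunk):
    # zlib releases the GIL during compression, so threads
    # really do spread the work across CPU cores
    return zlib.compress(chunk, 3)

def compress_parallel(data, workers=4):
    """Split data into fixed-size chunks and compress them concurrently."""
    chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compress_chunk, chunks))

def decompress_all(parts):
    """Sequential decompression, for round-trip verification."""
    return b"".join(zlib.decompress(p) for p in parts)
```

The per-chunk layout also mirrors why parallel decompression is attractive: each compressed chunk is independent, so reads can be fanned out the same way.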
Re: Multi Core Support for compression in compression.c
On 2014-07-28 11:57, Nick Krause wrote: [quoted thread snipped]
I will look into this idea and see if I can do this for writes. Regards, Nick

Austin, it seems we don't want to release the cached pages for inodes if we are going to use the page cache to improve writes, yet we seem to be releasing them for writes in end_compressed_bio_write, for both standard and compressed pages. If we want to cache written pages, why are we removing them? It seems like this would need to change as a starting point. Regards, Nick

I'm not entirely sure; it's been a while since I went exploring in the page-cache code. My guess is that there is some reason that you and I aren't seeing that we are trying for write-around semantics; maybe one of the people who originally wrote this code could weigh in? Part of this might have to do with the fact that normal page-cache semantics don't always work as expected with COW filesystems (because a write goes to a different block on the device than a read before the write would have gone to). It might be easier to parallelize reads first, and then work from that (and most workloads would probably benefit more from the parallelized reads).
Re: Multi Core Support for compression in compression.c
On 2014-07-29 13:08, Nick Krause wrote: [quoted thread snipped]
[quoted discussion snipped] I will look into this later today and work on it then. It seems the best way to do this is to create one kernel thread per core, as NFS does, and use those threads depending on the load of the system.
Regards, Nick

It might be more work now, but it would probably be better in the long run to do it using kernel workqueues, as they would provide better support for suspend/hibernate/resume, and then you wouldn't need to worry about scheduling or how many CPU cores are in the system.
Re: Btrfs offline deduplication
On 07/31/2014 07:54 PM, Timofey Titovets wrote: Good time of day. I have several questions about data deduplication on btrfs; sorry if I ask stupid questions or waste your time. What about an implementation of offline data deduplication? I don't see any activity in this area; maybe I need to ask a particular person? Where is the problem? Maybe I can try to help (with testing, for example)? I could be wrong, but as I understand it btrfs stores one crc32 checksum per file. If this is true, maybe it makes sense to create a small worker to dedup files, like the worker for autodefrag, with simple logic like: if sum1 == sum2 && file_size1 == file_size2, then if bit_to_bit_identical(file1, file2), then merge(file1, file2). This could be a first attempt at per-file offline dedup. What do you think about it? Could I be wrong, or is this a horrible crutch? (As I understand it, this would not change the on-disk format.) (bedup and the other tools are cool, but I have several problems with them, and I think a kernel implementation could work better.)

I think there may be some misunderstandings here about some of the internals of BTRFS. First of all, checksums are stored per block, not per file, and secondly, deduplication can be done on a much finer scale than individual files (you can deduplicate individual extents). I do think however that having the option of a background thread doing deduplication asynchronously is a good idea, but then you would have to have some way to trigger it on individual files/trees; triggering on writes like the autodefrag thread does doesn't make much sense. Having a userspace program tell it to run on a given set of files would probably be the best approach for a trigger. I don't remember whether this kind of thing was also included in the online deduplication patches that got posted a while back.
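Timofey's size-then-checksum-then-byte-compare logic can be sketched in userspace like this (illustrative Python; `find_duplicate_files` is a hypothetical helper, and a real tool would merge extents through the kernel's clone/extent-same ioctls rather than trusting the hash alone):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicate_files(paths):
    """Group candidate files by size, then by content hash:
    the sum1 == sum2 && file_size1 == file_size2 logic."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    duplicates = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size can't have a duplicate
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, "rb") as f:
                by_hash[hashlib.sha256(f.read()).hexdigest()].append(p)
        for group in by_hash.values():
            if len(group) >= 2:
                # bit_to_bit_identical + merge would happen here:
                # a real tool hands the candidates to the kernel,
                # which verifies and merges the extents atomically.
                duplicates.append(group)
    return duplicates
```

This is file-granularity only; as noted above, btrfs can deduplicate at extent granularity, which this sketch deliberately ignores.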
Re: Btrfs offline deduplication
On 08/01/2014 02:55 PM, Mark Fasheh wrote: On Fri, Aug 01, 2014 at 10:16:08AM -0400, Austin S Hemmelgarn wrote: On 2014-08-01 09:23, David Sterba wrote: On Fri, Aug 01, 2014 at 06:17:44AM -0400, Austin S Hemmelgarn wrote: [suggestion of an asynchronous kernel-side dedup thread snipped] IIRC the proposed implementation only merged new writes with existing data. For the out-of-band (off-line) dedup there's bedup (https://github.com/g2p/bedup) or Mark's duperemove tool (https://github.com/markfasheh/duperemove) that work on a set of files. Something kernel-side to do the work asynchronously would be nice, especially if it could leverage the checksums that BTRFS already stores for the blocks. Having a userspace interface for offline deduplication similar to that for scrub operations would be even better. Why does this have to be kernel side? There's userspace software already to dedupe that can be run on a regular basis. Exporting checksums is a different story (you can do that via ioctl), but running the dedupe software itself inside the kernel is exactly what we want to avoid by having the dedupe ioctl in the first place. --Mark -- Mark Fasheh

Based on the same logic however, we don't need scrub to be done kernel side, as it wouldn't take but one more ioctl to be able to tell it which block out of a set to treat as valid.
I'm not saying that things need to be done in the kernel, but duperemove doesn't use the ioctl interface even where it exists, and bedup is buggy as hell (unless it's improved greatly in the last two weeks), and neither of them is at all efficient. I do understand that this isn't something that is computationally simple (especially on x86 with its deficiency of registers), but rsync does almost the same thing for data transmission over the network, and it does so seemingly much more efficiently than either option available at the moment.
Re: ENOSPC with mkdir and rename
On 2014-08-04 09:17, Peter Waller wrote: For anyone else having this problem, this article is fairly useful for understanding disk-full problems and rebalancing: http://marc.merlins.org/perso/btrfs/post_2014-05-04_Fixing-Btrfs-Filesystem-Full-Problems.html It actually covers the problem that I had, which is that a rebalance can't take place because the filesystem is full. I am still unsure what is really wrong with this whole situation. Is it that I wasn't careful to do a rebalance when I should have? Is it that BTRFS doesn't do a rebalance automatically when it could in principle? It's pretty bad to end up in a situation (with spare space) where the only way out is to add more storage, which may be impractical, difficult or expensive. I really disagree with the statement that adding more storage is difficult or expensive; all you need to do is plug in a 2G USB flash drive, or allocate a ramdisk, and add the device to the filesystem only long enough to do a full balance. The other thing that I still don't understand I've seen repeated in a few places, from the above article: because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full. Then he uses `btrfs balance start -dusage=55 /mnt/btrfs_pool1`. I don't understand the relationship between the FS being 55% full and chunks being more than 55% full. What's going on here? To understand this, you have to know that BTRFS uses a two-level allocation scheme: at the top level you have chunks, which are contiguous regions of the disk that get used for storing a specific block type. Data chunks default to 1G in size; metadata chunks default to 256M. When a filesystem is created, you get the minimum number of chunks of each type based on the replication profiles chosen for each chunk type; with no extra options, this means 1 data chunk and 2 metadata chunks for a single-disk filesystem.
Within each chunk, BTRFS then allocates and frees individual blocks on demand; these blocks are the analogue of blocks in most other filesystems. When there are no free blocks in any chunks of a given type, BTRFS allocates new chunks of that type based on the replication profile. Unlike blocks, however, chunks aren't freed automatically (there are good reasons for this behavior, but they are kind of long to explain here). This is where balance comes in: it takes all of the blocks in the filesystem and sends them back through the block allocator. This usually causes all of the free blocks to end up in a single chunk, and frees the unneeded chunks. When someone talks about a chunk being x% full, they mean that x% of the space in that chunk is used by allocated blocks. Talking about how full the filesystem is can get tricky because of the replication profiles, but the usual consensus is to treat that as the percentage of the filesystem that contains blocks that are being used. It should say LESS than 55% full in the various articles, as the -dusage=x option tells balance to only consider chunks that are less than x% full for balancing. In general, if your filesystem is totally full, you should use numbers starting with 0, working your way up from there. You may even get lucky, and using -dusage=0 -musage=0 may free up enough chunks that you don't need to add more storage. I conclude that now, since I have added more storage, the rebalance won't fail, and if I keep rebalancing from a cron job I won't hit this problem again (unless the filesystem fills up very fast! what then?). I don't know however what value to assign to `-dusage` in general for the cron rebalance. Any hints? I've found that something between 25 and 50 tends to do well; much outside of that range and you start to get diminishing returns. The exact value tends to be more personal preference; I use 25 on most of my systems, because I don't like saturating the disks with I/O for very long.
Do make sure however to add -musage=x as well; metadata also should be balanced (especially if you have very large numbers of small files). -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
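The two-level chunk/block scheme and the -dusage filter can be sketched as a toy model (illustrative only; CHUNK_BLOCKS and the helper names are made up for the example, and real data chunks are ~1 GiB):

```python
# Toy model: chunks hold blocks, and `balance -dusage=N` only
# rewrites chunks that are less than N% full.

CHUNK_BLOCKS = 8  # blocks per chunk in this toy model

def usage_pct(chunk):
    """Percentage of the chunk occupied by allocated blocks."""
    return 100 * len(chunk) // CHUNK_BLOCKS

def balance(chunks, dusage):
    """Repack blocks from under-used chunks into as few chunks as possible."""
    keep = [c for c in chunks if usage_pct(c) >= dusage]
    loose = [b for c in chunks if usage_pct(c) < dusage for b in c]
    repacked = [loose[i:i + CHUNK_BLOCKS]
                for i in range(0, len(loose), CHUNK_BLOCKS)]
    return keep + repacked

# One full chunk plus three mostly-empty ones (100%, 25%, 37%, 12% full):
chunks = [[1] * 8, [1] * 2, [1] * 3, [1] * 1]
after = balance(chunks, dusage=55)
# The three sparse chunks collapse into one; the full chunk is untouched.
assert len(after) == 2
```

Note how the 100%-full chunk is never rewritten: that's why a low -dusage value can free space quickly and cheaply on a nearly full filesystem.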
Re: ENOSPC with mkdir and rename
On 2014-08-04 10:11, Peter Waller wrote: On 4 August 2014 15:02, Austin S Hemmelgarn ahferro...@gmail.com wrote: [suggestion of temporarily adding a USB flash drive or ramdisk, quoted] What if the machine is a server in a datacenter you don't have physical access to and the problem is an emergency preventing your users from being able to get work done? What happens if you use a RAM disk and there is a power failure? I'm not saying that either option is a perfect solution. In fact, the only reason that I even mentioned the ramdisk is because I have had good success with that on my laptop, but then laptops essentially have a built-in UPS. I personally wouldn't use a ramdisk except as a last resort if you don't have some sort of UPS or redundancy in the PSU.
Re: ENOSPC with mkdir and rename
On 2014-08-04 06:31, Peter Waller wrote: Thanks Hugo, this is the most informative e-mail yet! (more inline) On 4 August 2014 11:22, Hugo Mills h...@carfax.org.uk wrote: * btrfs fi show - look at the total and used values. If used < total, you're OK. If used == total, then you could potentially hit ENOSPC. Another thing which is unclear and undocumented anywhere I can find is what the meaning of `btrfs fi show` is. I'm sure it is totally obvious if you are a developer or if you have used it for long enough, but it isn't covered in the manpage, nor in the Oracle documentation, nor anywhere on the wiki that I could find. You didn't look very hard then, because there is information in the manpage (oh wait, you mentioned Oracle; you're probably using RHEL or CentOS, which are the last things you should be using if you want to use stuff like BTRFS that is under heavy development), and it is documented on the wiki. When I looked at it in my problematic situation, it said 500 GiB / 500 GiB. That sounded fine to me because I interpreted the output as what fraction of which RAID devices BTRFS was using. In other words, I thought, "Oh, BTRFS will just make use of the whole device that's available to it." I thought that `btrfs fi df` was the source of information for how much space was free inside of that. * btrfs fi df - look at metadata used vs total. If these are close to zero (on 3.15+) or close to 512 MiB (pre-3.15), then you are in danger of ENOSPC. Hmm. It's unfortunate that this could indicate an amount of space which is free when it actually isn't. That depends on what you mean by 'free'. - look at data used vs total. If the used is much smaller than total, you can reclaim some of the allocation with a filtered balance (btrfs balance start -dusage=5), which will then give you unallocated space again (see the btrfs fi show test). So the filtered balance didn't help in my situation. I understand it's something to do with the 5 parameter.
But I do not understand what the impact of changing this parameter is. It is something to do with a fraction of something, but those things are still not present in my mental model despite a large amount of reading. Is there an illustration which could clear this up? Think of each chunk like a box, and each block as a block, and that you have two different types of block (data and metadata) and two different types of box (also data and metadata). The data boxes are four times the size of the metadata boxes, and they all have to fit in one really big container (the device itself). You can only put data blocks in the data boxes, and you can only put metadata blocks in metadata boxes. Say that in total, you can fit 128 data boxes in the large container, or you can replace one data box with up to four metadata boxes. Even though you may only have a few blocks in a given box, the box still takes up the same amount of space in the larger container. Thus, it's possible to have only a few blocks stored, but not be able to add any more boxes to the larger container. A balance operation is essentially the equivalent of taking all of the blocks of a given type and fitting them into the smallest number of boxes possible. Among other things, I also got the kernel stack trace I pasted at the bottom of the first e-mail to this thread when I did the rebalance. This FAQ entry is pretty horrible, I'm afraid. I actually started rewriting it here to try to make it clearer what's going on. I'll try to work on it a bit more this week and put out a better version for the wiki. This is great to hear! :) Thanks for your response Hugo, that really cleared up a lot of mental model problems. I hope the documentation can be improved so that others can learn from my mistakes.
Re: ENOSPC with mkdir and rename
On 2014-08-05 04:20, Duncan wrote: Austin S Hemmelgarn posted on Mon, 04 Aug 2014 13:09:23 -0400 as excerpted: [box analogy quoted in full, snipped] FWIW, that's a great analogy to stick up on the wiki somewhere, probably somewhere in the FAQ related to ENOSPC. Please consider doing so. (Someone took one of my explanations from the list and stuck it in the wiki, virtually word-for-word, with a link to the list post in the archives for more. I was glad, as for some reason I just seem to work best on the lists, and seem to treat web pages as read-only, even if they're on a wiki I in theory have or can get write-privs on. I'm suggesting someone, doesn't have to be you tho great if it is, do the same with this.)

I would love to have it up on the wiki, but don't have an account or write privileges. FWIW, I consider anything I post on a mailing list that isn't marked otherwise (except patches) to be public domain, so everyone feel free to use it however you want.
Re: Ideas for a feature implementation
On 08/10/2014 03:21 PM, Vimal A R wrote: Hello, I came across the to-do list at https://btrfs.wiki.kernel.org/index.php/Project_ideas and would like to know if this list is updated and recent. I am looking for a project idea for my undergraduate degree which can be completed in around 3-4 months. Are there any suggestions and ideas to help me further? Thank you, Vimal

It's not really listed there (though some of the projects there might be considered subsets of it), but improved parallelization for multi-device setups is one thing that I know a lot of people would like to see. Another thing that isn't listed there, that I would personally love to see, is support for secure file deletion. To be truly secure though, this would need to hook into the COW logic so that files marked for secure deletion can't be reflinked (maybe make them automatically NOCOW instead, and don't allow snapshots?), and when they get written to, the blocks that get COW'ed have the old block overwritten.
Re: Ideas for a feature implementation
On 08/11/2014 04:27 PM, Chris Murphy wrote: On Aug 10, 2014, at 8:53 PM, Austin S Hemmelgarn ahferro...@gmail.com wrote: Another thing that isn't listed there, that I would personally love to see, is support for secure file deletion. To be truly secure though, this would need to hook into the COW logic so that files marked for secure deletion can't be reflinked (maybe make them automatically NOCOW instead, and don't allow snapshots?), and when they get written to, the blocks that get COW'ed have the old block overwritten. If the file is reflinked or snapshotted, can it then be securely deleted? Because what does it mean to secure delete a file when there's a completely independent file pointing to the same physical blocks? What if someone else owns that independent file? Does the reflink copy get rm'd as well? Or does the file remain, but its blocks are zero'd/corrupted? The semantics that I would expect would be that the extents can't be reflinked, and when snapshotted the whole file gets COW'ed and then inherits the secure deletion flag, possibly with another flag saying that the user can't disable the secure deletion flag. For SSDs, whether it's an overwrite or an FITRIM ioctl, it's an open question when the data is actually irretrievable. It may be seconds, but could be much longer (hours?), so I'm not sure if it's useful. On HDDs using SMR it's not necessarily a given that an overwrite will work there either. By secure deletion, I don't mean make the data absolutely unrecoverable by any means; I mean make it functionally impractical for someone without low-level access to and/or extensive knowledge of the hardware to recover the data; that is, more secure than simply unlinking the file, but obviously less than (for example) the application of thermite to the disk platters. I'm talking the rough equivalent of wiping the data from RAM.
Anyone who is truly security minded should be using whole disk encryption anyway, but even then you have the data accessible from the running OS.
Re: Ideas for a feature implementation
On 2014-08-12 11:52, David Pottage wrote: On 11/08/14 03:53, Austin S Hemmelgarn wrote: Another thing that isn't listed there, that I would personally love to see is support for secure file deletion. To be truly secure though, this would need to hook into the COW logic so that files marked for secure deletion can't be reflinked (maybe make the automatically NOCOW instead, and don't allow snapshots?), and when they get written to, the blocks that get COW'ed have the old block overwritten. How would secure deletion interact with file de-duplication? For example suppose you and I are both users on a multi user system. We both obtain copies of the same file independently, and save that file to our home directories. A background process notices that both files are the same and de-duplicates them. This means that both your file and mine point to the same blocks on disc. This is exactly the same as would happen if you made a COW copy of your file, transferred ownership to me, and I moved it into my home dir. You then decide to secure delete your copy of the file. What happens to mine? If it gets removed, then you have just deleted a file you don't own, if it does not then the file-system has broken the contract to secure delete a file when you asked it to. Also, what happens if the two files have similar portions, but they are not identical. For example, if you download and ISO image for ubuntu, and I download the ISO for kubuntu (at the same version). There will be a lot of sections that are the same, because they will contain a lot of packages in common, so there will be large gains in de-duplicating the similar parts, but most people would consider the files to be different. Could this mean that if you secure delete your ubuntu iso, then portions of my kubuntu iso might become corrupt? 
You could work around this by marking the extent, instead of the file (marking a file would mark all of its extents), and then checking for that marking when the extent is freed (i.e., nobody refers to it anymore). While this approach might not seem useful to most people, there are practical use cases for it (even without whole disk encryption). It would be pretty easy, actually, to integrate this globally for a filesystem as a mount option. Even if we limit secure delete to root, then we still leave the risk of unintentionally breaking user files, because no-one realised that all or part of the file appears in other files via de-duplication. In any case, if secure delete is limited to root, then most people would not find it useful (or they would use sudo to do it, which brings us back to the same problems). Basically, I think that file secure deletion as a concept is not compatible with a 5th generation file system. If you really want to securely remove a file, then copy the stuff you need elsewhere and put the disc in the crusher. Alternatively, put the filesystem in an encrypted container, and then reformat the disc with a different encryption key. While I agree that the traditional notion of secure deletion doesn't fit in the current generation of file systems, there is still a need for COW filesystems to be able to prevent sensitive data from being exposed during run-time. On any current BTRFS filesystem, it is still possible to find blocks that have been COW'ed (assuming discard is turned off) and have no referents, possibly long after the block itself is freed, and especially if the volume is much larger than the stored data set (like a large majority of desktop users these days) or the workload is not write intensive.
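For comparison, the classical overwrite-before-unlink approach looks like this (illustrative Python sketch; as discussed above, on a COW filesystem like btrfs this is not sufficient, because the overwrite itself is COW'ed to new blocks unless the file is NOCOW):

```python
import os

def shred_and_unlink(path, passes=1):
    """Overwrite a file's bytes in place, then unlink it.
    Effective on traditional update-in-place filesystems; on a COW
    filesystem each overwrite goes to NEW blocks, so the original
    data survives on disk until those blocks happen to be reused."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(b"\x00" * size)
            f.flush()
            os.fsync(f.fileno())  # force the overwrite out to the device
    os.unlink(path)
```

This is exactly the gap the proposed extent-marking would close: the filesystem itself would have to overwrite the superseded extents, since userspace cannot reach them.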
Re: Large files, nodatacow and fragmentation
On 2014-08-14 10:30, G. Richard Bellamy wrote: On Wed, Aug 13, 2014 at 9:23 PM, Chris Murphy li...@colorremedies.com wrote: lsattr /var/lib/libvirt/images/atlas.qcow2 Is the xattr actually in place on that file? 2014-08-14 07:07:36 $ filefrag /var/lib/libvirt/images/atlas.qcow2 /var/lib/libvirt/images/atlas.qcow2: 46378 extents found 2014-08-14 07:08:34 $ lsattr /var/lib/libvirt/images/atlas.qcow2 ---C /var/lib/libvirt/images/atlas.qcow2 So, yeah, the attribute is set. It will fragment somewhat but I can't say that I've seen this much fragmentation with xattr C applied to qcow2. What's the workload? How was the qcow2 created? I recommend -o preallocation=metadata,compat=1.1,lazy_refcounts=on when creating it. My workloads were rather simplistic: OS installs and reinstalls. What's the filesystem being used in the guest that's using the qcow2 as backing? When I created the file, I definitely preallocated the metadata, but did not set compat or lazy_refcounts. However, isn't that more a function of how qemu + KVM managed the image, rather than how btrfs? This is a p2v target, if that matters. Workload has been minimal since virtualizing because I have yet to get usable performance with this configuration. The filesystem in the guest is Win7 NTFS. I have seen massive thrashing of the underlying volume during VSS operations in the guest, if that signifies. It might be that your workload is best suited for a preallocated raw file that inherits +C, or even possibly an LV. I'm close to that decision. As I mentioned, I much prefer the btrfs subvolume story over lvm, so moving to raw is probably more desirable than that... however, then I run into my lack of understanding of the difference between qcow2 and raw with respect to recoverability, e.g. does raw have the same ACID characteristics as a qcow2 image, or is atomicity a completely separate concern from the format? 
The ability for the owning process to recover from corruption or inconsistency is a key factor in deciding whether or not to turn COW off in btrfs: if your overlying system is capable of such recovery, like a database engine or (presumably) a virtualization layer, then COW isn't a necessary function of the underlying system. So, just since I started this reply, you can see the difference in fragmentation: 2014-08-14 07:25:04 $ filefrag /var/lib/libvirt/images/atlas.qcow2 /var/lib/libvirt/images/atlas.qcow2: 46461 extents found That's 17 minutes, an OS without interaction (I wasn't doing anything with it, but it may have been doing its own work like updates, etc.), and I see a fragmentation increase of 83 extents, and a raid10 volume that was beating itself up (I could hear the drives chattering away as they worked). The fact that it is Windows using NTFS is probably part of the problem. Here are some things you can do to decrease its background disk utilization (these also improve performance on real hardware):
1. Disable system restore points. These aren't really necessary if you are running in a VM and can take snapshots from the host OS.
2. Disable the indexing service. This does a lot of background disk IO, and most people don't need the high-speed search functionality.
3. Turn off Windows Features that you don't need. This won't help disk utilization much, but can greatly improve overall system performance.
4. Disable the paging file. Windows does a lot of unnecessary background paging, which can cause lots of unneeded disk IO. Be careful doing this however, as it may cause problems for memory-hungry applications.
5. See if you can disable boot-time services you don't need. Bluetooth, SmartCard, and Adaptive Screen Brightness are all things you probably don't need in a VM environment.
Of these, 1, 2, and 4 will probably help the most.
The other thing is that NTFS is a journaling file system, and putting a journaled file system image on a COW backing store will always cause some degree of thrashing, because the same few hundred MB of the disk get rewritten over and over again. The only way to work around that on BTRFS is to make the file NOCOW, AND preallocate the entire file in one operation (use the fallocate command from util-linux to do this).
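A minimal sketch of the NOCOW-plus-preallocate advice above. The image path and size are made-up examples; note that chattr +C only takes effect on an empty file on btrfs, so it is applied right after creation and allowed to fail gracefully on other filesystems.

```shell
# Hedged sketch: mark a new VM image NOCOW, then preallocate it in a
# single fallocate call so its extents are laid out contiguously.
IMG=/tmp/guest-disk.img                      # example path, not from the thread
touch "$IMG"                                 # +C must be set while the file is empty
chattr +C "$IMG" 2>/dev/null || echo "note: +C only works on btrfs"
fallocate -l 16M "$IMG"                      # one preallocation operation
stat -c '%s' "$IMG"                          # confirm the allocated size
```

Creating the file at full size in one fallocate call is what avoids the incremental-append fragmentation the thread describes.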
Re: Questions on using BtrFS for fileserver
On 2014-08-19 12:21, M G Berberich wrote: Hello, we are thinking about using BtrFS on standard hardware for a fileserver with about 50T (100T raw) of storage (25×4TByte). This is what I understood so far. Is this right? · incremental send/receive works. · There is no support for hot spares (spare disks that automatically replace a faulty disk). · BtrFS with RAID1 is fairly stable. · RAID 5/6 spreads all data over all devices, leading to performance problems on large disk arrays, and there is no option to limit the number of disks per stripe so far. Some questions: · There were reports that bcache with btrfs leads to corruption. Is this still so? Based on some testing I did last month, bcache with anything has the potential to cause data corruption. · If a disk fails, does BtrFS rebalance automatically? (This would give a kind of hot-spare behavior) No, but it wouldn't be hard to write a simple monitoring program to do this from userspace. IIRC, the big issue is that you need to add a device in place of the failed one for the re-balance to work. · Besides using bcache, are there any possibilities to boost performance by adding (dedicated) cache SSDs to a BtrFS? Like mentioned in one of the other responses, I would suggest looking into dm-cache. BTRFS itself does not have any functionality for this, although there has been talk of implementing device priorities for reads, which could provide a similar performance boost. · Are there any reports/papers/web-pages about BtrFS systems this size in use? Praises, complaints, performance reviews, whatever… While it doesn't quite fit the description, I have had very good success with a very active 2TB BTRFS RAID10 filesystem consisting of BTRFS on four unpartitioned 1TB SATA III hard drives.
The filesystem gets in excess of 100GB of data written to it each day (almost all rewrites, however), and is what I use for /home, /var/log, and /var/lib, and I've had no issues with it that were caused by BTRFS; in fact, the very fact that it uses BTRFS helped me recover data when the storage controller they are connected to went bad. On average, I get about 125% of raw disk performance on writes, and about 110% on reads. If you are using a very large number of disks, then I would not suggest that you use BTRFS RAID10, but instead BTRFS RAID1, as RAID10 will try to stripe things across ALL of the devices in the filesystem, and unless you have no more than about four times as many disks as storage controllers (that is, each controller has no more than four disks attached to it), the overhead outweighs the benefit of striping the data. Also, just to make sure it's clear, in BTRFS RAID1, each block gets written EXACTLY twice. On the plus side though, this means that if you do set up a caching mechanism, you may be able to keep most of the array spun down a majority of the time.
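The reply above suggests that a userspace monitoring program could approximate hot-spare behavior. A hypothetical sketch of the detection half: `check_missing` is a made-up helper, and the sample text stands in for real `btrfs fi show` output, since the exact wording can vary between btrfs-progs versions.

```shell
# Hypothetical watchdog sketch: scan `btrfs fi show` output for a
# missing device; a real monitor would then add a spare and rebalance.
check_missing() {
    # $1: captured output of `btrfs fi show` for the watched filesystem
    printf '%s\n' "$1" | grep -qi 'missing'
}

# Stand-in for real output (assumed format, for illustration only):
sample='Label: data  uuid: 00000000-0000-0000-0000-000000000000
        Total devices 4 FS bytes used 1.10TiB
        *** Some devices missing'

if check_missing "$sample"; then
    echo "array degraded"
fi
```

A real version would run this from cron or a systemd timer and, on detection, invoke something like `btrfs replace start` against a standby device.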
Re: Questions on using BtrFS for fileserver
On 08/19/2014 05:38 PM, Andrej Manduch wrote: Hi, On 08/19/2014 06:21 PM, M G Berberich wrote: · Are there any reports/papers/web-pages about BtrFS systems this size in use? Praises, complaints, performance reviews, whatever… I don't know about papers or benchmarks, but a few weeks ago there was a guy who had problems with really long mount times on a btrfs of similar size. https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg36226.html And I would not recommend 3TB disks. *I'm not a btrfs dev*, but as far as I know there is quite a difference between rebuilding a disk on real RAID and on btrfs RAID. The problem is that btrfs has RAID at the filesystem level, not at the hw level, so there is bigger mechanical overhead on the drives and thus it takes significantly longer than regular RAID. It really surprises me that so many people come to this conclusion, but maybe they don't provide as much slack space as I do on my systems. In general you will only have a longer rebuild on BTRFS than on hardware RAID if the filesystem is more than about 50% full. On my desktop array (4x 1TB disks using BTRFS RAID10), I've replaced disks before and it took less than an hour for the operation. Of course that array is usually not more than 10% full. Interestingly, it took less time to rebuild this array the last time I lost a disk than it did back when it was 3x 1TB disks in a BTRFS RAID1, so things might improve overall with a larger number of disks in the array.
Re: Significance of high number of mails on this list?
On 2014-08-20 23:22, Shriramana Sharma wrote: Hello. People on this list have been kind enough to reply to my technical questions. However, seeing the high number of mails on this list, esp with the title PATCH, I have a question about the development itself: Is this just an indication of a vibrant user/devel community [*] and healthy development of many new nice features to eventually come out in stable form later, or are we still at the fixing-rough-edges stage? IOW, what is the proportion of commits adding new features to those stabilising/fixing features? [* Since there is no separate btrfs-users vs btrfs-dev list, I'm not able to gauge this difference either. i.e. if there were a dedicated -dev list I might not be alarmed by a high number of mails indicating fast development.] Mostly I have read that BTRFS is mostly stable, but there might be a few corner cases as yet unknown since this is a totally new generation of FSs. But still, given the volume of mails here, I wanted to ask... I'm sorry, I realize I'm being a bit vague, but I'm not sure how to exactly express what I'm feeling about BTRFS right now... Personally I'd say that BTRFS is 'stable' enough for light usage without using stuff like quotas or RAID5/6. So far, having used it since 3.10, I've only once had a filesystem get corrupted when there wasn't some serious underlying hardware issue (crashed disk, SATA controller dropping random single sectors from writes, etc.), and it gives me much better performance than what I previously used (ext4 on top of LVM). As far as what to make of the volume of patches on the mailing list, I'd say that it shouldn't be used as a measure of quality. The ext4 mailing list is almost as busy on a regular basis, and people have been using that in production for years, and the XFS mailing list gets a much higher volume of patches from time to time, and it's generally considered the gold standard of a stable filesystem.
Re: Distro vs latest kernel for BTRFS?
On 2014-08-22 07:59, Shriramana Sharma wrote: Hello. I've seen repeated advice to use the latest kernel. While hearing of the recent compression bug affecting recent kernels does somewhat warn one off that advice, I would like to know what people who are running regular distros do to get the latest kernel. Personally I'm on Kubuntu, which provides mainline kernels till a particular point but not beyond that. Do people here always compile the latest kernel themselves just to get the latest BTRFS stability fixes (and improvements, though as a second priority)? I personally use Gentoo Unstable on all my systems, so I build all my kernels locally anyway, and stay pretty much in line with the current stable mainline kernel. Interestingly, I haven't had any issues related to either of the recently discovered bugs, despite meeting all of the criteria for being affected by them.
Re: Distro vs latest kernel for BTRFS?
On 2014-08-22 14:22, Rich Freeman wrote: On Fri, Aug 22, 2014 at 8:04 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote: I personally use Gentoo Unstable on all my systems, so I build all my kernels locally anyway, and stay pretty much in-line with the current stable Mainline kernel. Gentoo Unstable probably means gentoo-sources, testing version, which follows the stable kernel branch, but the most recent stable, and not the long-term stable. gentoo-sources stable version generally follows the most recent longterm stable kernel (so 3.14 right now). I'm not sure what the exact policy is, but that is my sense of it. So, you're still running a stable kernel most likely. If you really want mainline then you want git-sources. That follows the most recent mainline I believe. Of course, if you're following it that closely then you probably should think about just doing a git clone and managing it yourself, since then you can handle patches/etc more easily. I think the best option for somebody running btrfs is to stick with a stable kernel branch, either the current stable or a very recent longterm. I wouldn't go back into 3.2 land or anything like that. But, yes, if you had stuck with 3.14 and not gone to the current stable then you would have missed the compress=lzo deadlock. So, pick your poison. :) Rich By saying 'unstable' I'm referring to the stuff delimited in portage with the ~ARCH keywords. Personally, I wouldn't use that term myself (all of my systems running on such packages have been rock-solid stable from a software perspective), but that is how the official documentation refers to things with the ~ARCH keywords. There are a lot of Gentoo users who don't know about the keyword thing other than as an occasional inconvenience when emerging certain packages, so I just use the same term as the documentation. 
For the record, I am using the gentoo-sources package, but instead of using what they mark as stable (which is 3.14), I'm using the most recent version (which is 3.16.1).
Re: superblock checksum mismatch after crash, cannot mount
On 2014-08-24 15:48, Chris Murphy wrote: On Aug 24, 2014, at 10:59 AM, Flash ROM flashromg...@yandex.com wrote: While it sounds dumb, this strange thing is done to put the partition table in a separate erase block, so it is never read-modify-written when FAT entries are updated. Should something go wrong, the FAT can be recovered from the backup copy, but an erased partition table just sucks. Then, the FAT tables are aligned in a way that fits well around erase block boundaries. I think you seriously overestimate the knowledge of camera manufacturers about the details of flash storage; and any ability to discover it; and any willingness on the part of the flash manufacturer to reveal such underlying details. The whole point of these cards is to completely abstract the reality of the underlying hardware from the application layer, in this case the camera or mobile device using it. If you really know what you are doing, it is possible to determine erase block size by looking at device performance timings, with surprisingly high accuracy (assuming you aren't trying to have software do it for you). I've actually done this before on several occasions, with nearly 100% success. Also, with SDXC, exFAT is now specified. And it has only one FAT; there isn't a backup FAT. So they're even more difficult to recover data from should things go awry filesystem-wise. It's too bad that TFAT didn't catch on, as it would have been great for SD cards if it could be configured to put each FAT on a different erase block. This said, you can *try* to reformat, BUT no standard OS or firmware formatter will help you with default settings. They can't know the geometry of the underlying NAND and controller properties. There is no standard, widely accepted way to get such information from the card, no matter if you use the OS formatter, the camera formatter, or whatever. YOU WILL RUIN the factory format (which is crafted in the best possible way) and replace it with another, very likely suboptimal one.
It's recommended by the card manufacturers to reformat it in each camera it's inserted into. It's the only recommended way to erase the SD card for re-use; they don't recommend selectively deleting images. And it's known that one camera's partition table and formatting can irritate another camera make/model if the card isn't reformatted by that camera. It's not just cameras that have this issue; a lot of other hardware makes stupid assumptions about the format of media. The first firmware release for the Nintendo Wii, for example, choked if you tried to use an SD card with more than one partition on it, and old desktop versions of Windows won't ever show you anything other than the first partition on an SD card (or most USB storage devices for that matter).
Re: ext4 vs btrfs performance on SSD array
I wholeheartedly agree. Of course, getting something other than CFQ as the default I/O scheduler is going to be a difficult task. Enough people upstream are convinced that we all NEED I/O priorities, when most of what I see people doing with them is bandwidth provisioning, which can be done much more accurately (and flexibly) using cgroups. Ironically, there have been a lot of in-kernel defaults that I have run into issues with recently, most of which originated in the DOS era, when a few MB of RAM was high-end. On 2014-09-02 08:55, Zack Coffey wrote: While I'm sure some of those settings were selected with good reason, maybe there can be a few options (2 or 3) that have some basic intelligence at creation to pick a more sane option. Some checks to see if an option or two might be better suited for the fs. Like the RAID5 stripe size: leave the default as is, but maybe do a quick speed test to automatically choose from a handful of the most common values. If they fail or nothing better is found, then apply the default value just like it would now. On Mon, Sep 1, 2014 at 9:22 PM, Christoph Hellwig h...@infradead.org wrote: On Tue, Sep 02, 2014 at 10:08:22AM +1000, Dave Chinner wrote: Pretty obvious difference: avgrq-sz. btrfs is doing 512k IOs; ext4 and XFS are doing 128k IOs because that's the default block device readahead size. 'blockdev --setra 1024 /dev/sdd' before mounting the filesystem will probably fix it. Btw, it's really getting time to make Linux storage filesystems work out of the box. There are way too many things that are stupid by default and we require everyone to fix up manually: - the ridiculously low max_sectors default - the very small max readahead size - replacing cfq with deadline (or noop) - the too small RAID5 stripe cache size and probably a few I forgot about. It's time to make things perform well out of the box.
Re: Large files, nodatacow and fragmentation
On 2014-09-02 14:31, G. Richard Bellamy wrote: I thought I'd follow up and give everyone an update, in case anyone had further interest. I've rebuilt the RAID10 volume in question with a Samsung 840 Pro as the bcache front device. It's 5x600GB SAS 15k RPM drives in RAID10, with the 512MB SSD bcache. 2014-09-02 11:23:16 root@eanna i /var/lib/libvirt/images # lsblk
NAME      MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda         8:0    0 558.9G  0 disk
└─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
sdb         8:16   0 558.9G  0 disk
└─bcache2 254:2    0 558.9G  0 disk
sdc         8:32   0 558.9G  0 disk
└─bcache1 254:1    0 558.9G  0 disk
sdd         8:48   0 558.9G  0 disk
└─bcache0 254:0    0 558.9G  0 disk
sde         8:64   0 558.9G  0 disk
└─bcache4 254:4    0 558.9G  0 disk
sdf         8:80   0   1.8T  0 disk
└─sdf1      8:81   0   1.8T  0 part
sdg         8:96   0   477G  0 disk /var/lib/btrfs/system
sdh         8:112  0   477G  0 disk
sdi         8:128  0   477G  0 disk
├─bcache0 254:0    0 558.9G  0 disk
├─bcache1 254:1    0 558.9G  0 disk
├─bcache2 254:2    0 558.9G  0 disk
├─bcache3 254:3    0 558.9G  0 disk /var/lib/btrfs/data
└─bcache4 254:4    0 558.9G  0 disk
sr0        11:0    1  1024M  0 rom
I further split the system and data drives of the VM Win7 guest. It's very interesting to see the huge level of fragmentation I'm seeing, even with the help of ordered writes offered by bcache; in other words, while bcache seems to be offering me stability and better behavior to the guest, the underlying filesystem is still seeing a level of fragmentation that has me scratching my head. That being said, I don't know what would be normal fragmentation for a VM Win7 guest system drive, so it could be that I'm just operating in my zone of ignorance again.
2014-09-01 14:41:19 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 7 extents found
atlas-system.qcow2: 154 extents found
2014-09-01 18:12:27 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 28171 extents found
2014-09-02 08:22:00 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 35281 extents found
2014-09-02 08:44:43 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 37203 extents found
2014-09-02 10:14:32 root@eanna i /var/lib/libvirt/images # filefrag atlas-*
atlas-data.qcow2: 564 extents found
atlas-system.qcow2: 40903 extents found
This may sound odd, but are you exposing the disk to the Win7 guest as a non-rotational device? Win7 and higher tend to have different write behavior when they think they are on an SSD (or something else where seek latency is effectively 0). Most VMMs (at least, most that I've seen) will use fallocate to punch holes for ranges that get TRIM'ed in the guest, so if Windows is sending TRIM commands, that may also be part of the issue. Also, you might try reducing the amount of logging in the guest.
Re: No space on empty, degraded raid10
On 2014-09-07 16:38, Or Tal wrote: Hi, I've created a new raid10 array from 4x 4TB drives in order to migrate old data to it. As I didn't have enough SATA ports, I: - disconnected one of the raid10 disks to free a SATA port, - connected an old disk I wanted to migrate, - mounted the array with -o degraded - copied the data to it. After about 2MB I got a no space left on device message. btrfs fi df showed strange things: much less space in every category (about 8GB?) and none of them was full. Ubuntu 14.10 beta - linux 3.16.0-14 Yeah, RAID10 doesn't really work in degraded mode (even if you have two disks that have stripes from the same copy). The approach that would be needed for what you want to do is:
1. Make a BTRFS RAID1 filesystem with _3_ new drives
2. Connect one of the old disks
3. Transfer data from old disk to new filesystem
4. After repeating steps 2 and 3 for each old disk, connect the final new disk, add it to the filesystem, and rebalance with '-dconvert=raid10 -mconvert=raid10'
Also, I've found out the hard way that system chunks really should be RAID1, _NOT_ RAID10, otherwise it's very likely that the filesystem won't mount at all if you lose 2 disks.
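The four steps above can be sketched as the following command sequence. The device names and mount points are hypothetical, and `run` just echoes each command (a dry run) so nothing here touches real disks; drop the `run` prefix only after adapting the names.

```shell
# Dry-run sketch of the suggested migration; device names are examples.
run() { echo "+ $*"; }

run mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd   # step 1: 3 new drives
run rsync -a /mnt/old/ /mnt/new/                              # steps 2-3, once per old disk
run btrfs device add /dev/sde /mnt/new                        # step 4: final new disk
run btrfs balance start -dconvert=raid10 -mconvert=raid10 /mnt/new
```

The conversion balance at the end rewrites every existing chunk in the RAID10 layout, so it will take time roughly proportional to the data already copied.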
Re: Is it necessary to balance a btrfs raid1 array?
On 2014-09-10 08:27, Bob Williams wrote: I have two 2TB disks formatted as a btrfs raid1 array, mirroring both data and metadata. Last night I started # btrfs filesystem balance path In general, unless things are really bad, you don't ever want to use balance on such a big filesystem without some filters to control what gets balanced (especially if the filesystem is more than about 50% full most of the time). My suggestion in this case would be to use: # btrfs balance start -dusage=25 -musage=25 path on a roughly weekly basis. This will only balance chunks that are less than 25% full, and therefore run much faster. If you are particular about high storage efficiency, then try 50 instead of 25. and it is still running 18 hours later. This suggests that most stuff only gets written to one physical device, which in turn suggests that there is a risk of lost data if one physical device fails. Or is there something clever about btrfs raid that I've missed? I've used linux software raid (mdraid) before, and it appeared to write to both devices simultaneously. The reason that a full balance takes so long on a big (and I'm assuming, based on the 18 hours it's taken, very full) filesystem is that it reads all of the data and writes it out to both disks, but it doesn't do very good load balancing like mdraid or LVM do. I've got a 4x 500GiB BTRFS RAID10 filesystem that I use for my home directory on my desktop system, and a full balance on that takes about 6 hours. Is it safe to interrupt [^Z] the btrfs balancing process? ^Z sends a SIGTSTP, which is a really bad idea with something that is doing low-level stuff to a filesystem.
If you need to stop the balance process (and are using a recent enough kernel and btrfs-progs), the preferred way to do so is to use the following from another terminal: # btrfs balance stop path Depending on what the balance operation is working on when you do this, it may take a few minutes before it actually stops (the longest that I've seen it take is ~200 seconds). As a rough guide, how often should one perform a) balance b) defragment c) scrub on a btrfs raid setup? In general, you should be running scrub regularly, and balance and defragment as needed. On the BTRFS RAID filesystems that I have, I use the following policy:
1) Run a 25% balance (the command I mentioned above) on a weekly basis.
2) If the filesystem has less than 50% of either the data or metadata chunks full at the end of the month, run a full balance on it.
3) Run a scrub on a daily basis.
4) Defragment files only as needed (which isn't often for me because I use the autodefrag mount option).
5) Make sure that only one of balance, scrub, or defrag is running at a given time.
Normally, you shouldn't need to run balance at all on most BTRFS filesystems, unless your usage patterns vary widely over time (I'm actually a good example of this: most of the files in my home directory are relatively small, except for when I am building a system with buildroot or compiling a kernel, and on occasion I have VM images that I'm working with).
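The maintenance policy above can be sketched as a small script suitable for cron or a systemd timer. The mount point is a placeholder, and `run` echoes each command rather than executing it, so this is safe to read through as a dry run.

```shell
# Dry-run sketch of the 5-point policy; MNT is a made-up mount point.
MNT=/mnt/array
run() { echo "+ $*"; }    # replace the echo with "$@" to actually execute

run btrfs balance start -dusage=25 -musage=25 "$MNT"   # point 1: weekly filtered balance
run btrfs scrub start -B "$MNT"                        # point 3: daily scrub (-B waits for completion)
run btrfs filesystem defragment -r "$MNT/some/dir"     # point 4: only as needed
```

Point 5 (never overlapping these jobs) is why `-B` matters on the scrub: it makes the command block until the scrub finishes, so a sequential script can't start two operations at once.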
Re: Is it necessary to balance a btrfs raid1 array?
On 2014-09-10 09:48, Rich Freeman wrote: On Wed, Sep 10, 2014 at 9:06 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote: Normally, you shouldn't need to run balance at all on most BTRFS filesystems, unless your usage patterns vary widely over time (I'm actually a good example of this, most of the files in my home directory are relatively small, except for when I am building a system with buildroot or compiling a kernel, and on occasion I have VM images that I'm working with). Tend to agree, but I do keep a close eye on free space. If I get to the point where I'm over 90% allocated to chunks with lots of unused space otherwise, I run a balance. I tend to have the most problems with my root/OS filesystem running on a 64GB SSD, likely because it is so small. Is there a big performance penalty running mixed chunks on an SSD? I believe this would get rid of the risk of ENOSPC issues if everything gets allocated to chunks. There are obviously no issues with random access on an SSD, but there could be other problems (cache utilization, etc). There shouldn't be any more performance penalty than for normally running mixed chunks. Also, a 64GB SSD is not small; I use a pair of 64GB SSDs in a BTRFS RAID1 configuration for root on my desktop, and consistently use less than a quarter (12G on average) of the available space, and that's with stuff like LibreOffice and the entire OpenClipart distribution (although I'm not running an 'enterprise' distribution, and keep /tmp and /var/tmp on tmpfs). I tend to watch btrfs fi show, and if the total space used starts getting high then I run a balance. Usually I run with -dusage=30 or -dusage=50, but sometimes I get to the point where I just need to do a full balance. Often it is helpful to run a series of balance commands starting at -dusage=10 and moving up in increments. This at least prevents killing IO continuously for hours. If we can get to a point where balancing can operate at low IO priority, that would be helpful.
IO priority is a problem in btrfs in general. Even tasks run at idle scheduling priority can really block up a disk. I've seen a lot of hurry-up-and-wait behavior in btrfs. It seems like the initial commit to the log/etc is willing to accept a very large volume of data, and then when all the trees get updated, the system grinds to a crawl trying to deal with all the data that was committed. The problem is that you have two queues, with the second queue being rate-limiting but the first queue being the one that applies priority control. What we really need is for the log to have controls on how much it accepts, so that the updating of the trees/etc is never rate-limiting. That will limit the ability to have short IO write bursts, but it would prevent low-priority writes from blocking high-priority reads/writes. You know, you can pretty easily control bandwidth utilization just using cgroups. This is what I do, and I get much better results with cgroups and the deadline IO scheduler than I ever did with CFQ. Abstract priorities are not bad for controlling relative CPU utilization, but they really suck for IO scheduling.
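A hedged sketch of the cgroup-based bandwidth control mentioned above, using the cgroup-v2 io.max interface. The 8:16 major:minor pair and the 10 MiB/s write cap are made-up example values, and `run` echoes the commands rather than executing them, since the real paths require root and an actual block device.

```shell
# Dry-run sketch: cap write bandwidth for a btrfs maintenance job.
run() { echo "+ $*"; }

run 'mkdir /sys/fs/cgroup/btrfs-maint'
run 'echo "8:16 wbps=10485760" > /sys/fs/cgroup/btrfs-maint/io.max'   # 10 MiB/s writes on device 8:16
run 'echo $$ > /sys/fs/cgroup/btrfs-maint/cgroup.procs'               # move this shell in, then start balance
```

The io.max format takes a major:minor device number followed by rbps/wbps/riops/wiops limits, so a balance started from inside the cgroup is throttled without touching scheduler priorities at all.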
Re: No space on empty, degraded raid10
On 2014-09-11 02:40, Russell Coker wrote: On Mon, 8 Sep 2014, Austin S Hemmelgarn ahferro...@gmail.com wrote: Also, I've found out the hard way that system chunks really should be RAID1, NOT RAID10, otherwise it's very likely that the filesystem won't mount at all if you lose 2 disks. Why would that be different? In a RAID-1 you expect system problems if 2 disks fail, why would RAID-10 be different? That's still the case, but in a RAID1 with four disks, of the six different pairs of two disks you could lose, only one will make the filesystem un-mountable, whereas for a four-disk RAID10, there are two different pairs of two disks you could lose to make the filesystem un-mountable. I haven't run the numbers for higher numbers of disks, but things are likely no better, because if you lose both copies of the same stripe, things will fail. Also it would be nice if there was an N-way mirror option for system data. As such data is tiny (32MB on the 120G filesystem in my workstation), the space used by having a copy on every disk in the array shouldn't matter. N-way mirroring is in the queue for after the RAID5/6 work; ideally, once it is ready, mkfs should default to one copy per disk in the filesystem.
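The pair-counting argument above can be checked by brute force. This sketch assumes, for illustration, that the RAID1 chunk in question has its two copies on disks 1 and 2, and that the four-disk RAID10 mirrors are the pairs (1,2) and (3,4); it then enumerates all six possible 2-disk failures.

```shell
# Enumerate 2-disk failures on a 4-disk array (disks numbered 1-4).
# Assumed layout: RAID1 chunk copies on disks 1,2; RAID10 mirror
# halves on (1,2) and (3,4).
fatal_r1=0; fatal_r10=0; total=0
for a in 1 2 3; do
  for b in $(seq $((a + 1)) 4); do
    total=$((total + 1))
    if [ "$a$b" = "12" ]; then fatal_r1=$((fatal_r1 + 1)); fi       # both RAID1 copies lost
    case "$a$b" in 12|34) fatal_r10=$((fatal_r10 + 1));; esac       # a whole mirror pair lost
  done
done
echo "RAID1: $fatal_r1/$total fatal pairs, RAID10: $fatal_r10/$total"
```

This reproduces the 1-in-6 versus 2-in-6 figures in the reply for one chunk; a real filesystem spreads many chunks across different pairs, which is why the system chunks specifically are the concern.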
Re: No space on empty, degraded raid10
On 2014-09-11 07:38, Hugo Mills wrote: On Thu, Sep 11, 2014 at 07:19:00AM -0400, Austin S Hemmelgarn wrote: On 2014-09-11 02:40, Russell Coker wrote: Also it would be nice if there was an N-way mirror option for system data. As such data is tiny (32MB on the 120G filesystem in my workstation) the space used by having a copy on every disk in the array shouldn't matter. N-way mirroring is in the queue for after RAID5/6 work; ideally, once it is ready, mkfs should default to one copy per disk in the filesystem. Why change the default from 2-copies, which it's been for years? Sorry about the ambiguity in my statement, I meant that the default for system chunks should be one copy per disk in the filesystem. If you don't have a copy of the system chunks, then you essentially don't have a filesystem, and that means that BTRFS RAID6 can't provide true resilience against 2 disks failing catastrophically unless there are at least 3 copies of the system chunks.
Problem with unmountable filesystem.
So, I just recently had to hard-reset a system running root on BTRFS, and when it tried to come back up, it choked on the root filesystem. Based on the kernel messages, the primary issue is log corruption, and in theory btrfs-zero-log should fix it. The actual issue however, is that the primary superblock appears to be pointing at a corrupted root tree, which causes pretty much everything that does anything other than just read the sb to fail. The first backup sb does point to a good tree, but only btrfs check and btrfs restore have any option to ignore the first sb and use one of the backups instead. To make matters more complicated, the first sb still has a valid checksum and passes the tests done by btrfs rescue super-recover, and therefore that can't be used to recover either. I was wondering if anyone here might have any advice. I'm fine using dd to replace the primary sb with one of the backups, but don't know the exact parameters that would be needed. Also, we should consider adding a mount option to select a specific sb mirror to use; I know that ext* have such an option, and it has actually saved me a couple of times. I'm using btrfs-progs 3.16 and kernel 3.16.1.
Re: Problem with unmountable filesystem.
On 2014-09-16 16:57, Chris Murphy wrote: On Sep 16, 2014, at 8:40 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote: Based on the kernel messages, the primary issue is log corruption, and in theory btrfs-zero-log should fix it. Can you provide a complete dmesg somewhere for this initial failure, just for reference? I'm curious what this indication looks like compared to other problems. Okay, I can't really get a 'complete' dmesg, because the system panics on the mount failure (the filesystem in question is the system's root filesystem), the system has no serial ports, and I didn't think to build in support for console on ttyUSB0. I can however get what the recovery environment (locally compiled based on buildroot) shows when I try to mount the filesystem:
[ 30.871036] BTRFS: device label gentoo devid 1 transid 160615 /dev/sda3
[ 30.875225] BTRFS info (device sda3): disk space caching is enabled
[ 30.917091] BTRFS: detected SSD devices, enabling SSD mode
[ 30.920536] BTRFS: bad tree block start 0 130402254848
[ 30.924018] BTRFS: bad tree block start 0 130402254848
[ 30.926234] BTRFS: failed to read log tree
[ 30.953055] BTRFS: open_ctree failed
The actual issue however, is that the primary superblock appears to be pointing at a corrupted root tree, which causes pretty much everything that does anything other than just read the sb to fail. The first backup sb does point to a good tree, but only btrfs check and btrfs restore have any option to ignore the first sb and use one of the backups instead. Maybe use wipefs -a on this volume, which removes the magic from only the first superblock by default (you can specify another location). And then try btrfs-show-super -F, which dumps supers with bad magic. Thanks for the suggestion, I hadn't thought of that...
I just tried this: # wipefs -a /dev/sdb /dev/sdb: 8 bytes were erased at offset 0x00010040 (btrfs): 5f 42 48 52 66 53 5f 4d # btrfs-show-super -F /dev/sdb superblock: bytenr=65536, device=/dev/sdb - csum 0x5c1196d7 [DON'T MATCH] bytenr 65536 flags 0x1 magic [DON'T MATCH] […] # btrfs-show-super -i1 /dev/sdb superblock: bytenr=67108864, device=/dev/sdb - csum 0xfc70be19 [match] bytenr 67108864 flags 0x1 magic _BHRfS_M [match] So the mirror is definitely there and valid. # btrfs rescue super-recover -yv /dev/sdb No valid Btrfs found on /dev/sdb Usage or syntax errors Not expected at all; the man page says Recover bad superblocks from good copies. There's a good copy, it's not being found by btrfs rescue super-recover. Seems like a bug. # btrfs check /dev/sdb No valid Btrfs found on /dev/sdb Couldn't open file system # btrfs check -s1 /dev/sdb using SB copy 1, bytenr 67108864 Checking filesystem on /dev/sdb UUID: 9acf13de-5b98-4f28-9992-533e4a99d348 [snip] OK it finds it, maybe a --repair will fix the bad first one? # btrfs check --repair -s1 /dev/sdb using SB copy 1, bytenr 67108864 enabling repair mode Checking filesystem on /dev/sdb UUID: 9acf13de-5b98-4f28-9992-533e4a99d348 [snip] No indication of repair # btrfs check /dev/sdb No valid Btrfs found on /dev/sdb Couldn't open file system [root@f21v ~]# btrfs-show-super -F /dev/sdb superblock: bytenr=65536, device=/dev/sdb - csum 0x5c1196d7 [DON'T MATCH] bytenr 65536 flags 0x1 magic [DON'T MATCH] Still not fixed. Maybe I needed to corrupt something else in the superblock other than the magic and this behavior is intentional; otherwise wipefs -a followed by btrfsck would resurrect an intentionally wiped btrfs fs, potentially wiping out some newer file system in the process. ...though maybe it's a good thing I didn't. I'm fine using dd to replace the primary sb with one of the backups, but don't know the exact parameters that would be needed.
Here's an idea: # btrfs-show-super /dev/sdb superblock: bytenr=65536, device=/dev/sdb - csum 0x92aa51ab [match] [snip] So I know what I'm looking for starts at LBA 128 (65536/512). # dd if=/dev/sdb skip=128 count=4 2>/dev/null | hexdump -C 92 aa 51 ab 00 00 00 00 00 00 00 00 00 00 00 00 |..Q.............| [snip] And as it turns out the csum is right at the beginning, 4 bytes. So use a bs of 4 bytes, a seek of 65536/4, and a count of 1. This should zero just 4 bytes starting at 65536 bytes in. # dd if=/dev/zero of=/dev/sdb bs=4 seek=16384 count=1 Checked it with the earlier skip=128
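For anyone repeating this arithmetic, the offsets involved are fixed by the on-disk format (primary super at 64KiB, first mirror at 64MiB), so the dd parameters can be derived rather than guessed; a small sketch:

```shell
# btrfs superblock mirror offsets (fixed by the on-disk format):
# primary at 64KiB, first mirror at 64MiB.
PRIMARY=$((64 * 1024))          # 65536
MIRROR1=$((64 * 1024 * 1024))   # 67108864
BS=4                            # block size used for the csum-zeroing dd above
SEEK=$((PRIMARY / BS))          # the byte offset must be a multiple of bs
echo "seek=$SEEK"               # matches the seek=16384 used above
```

Note that, as I understand the format, a raw dd copy of a backup super over the primary is not quite enough on its own: the superblock records its own bytenr (and a csum over it), which is part of why tools like btrfs-select-super exist rather than leaving this to dd.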
Re: Problem with unmountable filesystem.
On 09/17/2014 02:57 PM, Chris Murphy wrote: On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote: Thanks for all the help. Well, it's not much help. It seems possible to corrupt a primary superblock that points to a corrupt tree root, and use btrfs rescue super-recover to replace it, and then mount should work. One thing I didn't try was corrupting the primary superblock and just mounting normally or with recovery, to see if it'll automatically ignore the primary superblock and use the backup. But I think you're onto something, that a good superblock can point to a corrupt tree root, and then not have a straightforward way to mount the good tree root. If I understand this correctly. Corrupting the primary superblock did in fact work, and I decided to try mounting immediately, which failed. I didn't try with -o recovery, but I think that would probably fail as well. Things worked perfectly however after using btrfs rescue super-recover. As far as avoiding future problems, I think the best solution would be to have the mount operation try the tree root pointed to by the backup superblock if the one pointed to by the primary seems corrupted. Secondarily, this almost makes me want to set the ssd option on all BTRFS filesystems, just to get the rotating superblock updates, because if it weren't for that behavior, I probably wouldn't have been able to recover anything in this particular case. -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Problem with unmountable filesystem.
On 09/17/2014 04:22 PM, Duncan wrote: Austin S Hemmelgarn posted on Wed, 17 Sep 2014 07:23:46 -0400 as excerpted: I've also discovered, when trying to use btrfs restore to copy out the data to a different system, that 3.14.1 restore apparently chokes on filesystems that have lzo compression turned on. It's reporting errors trying to inflate compressed files, and I know for a fact that none of those files were even open, let alone being written to, when the system crashed. I don't know if this is a known bug or even if it is still the case with btrfs-progs 3.16, but I figured I'd comment about it because I haven't seen anything about it anywhere. FWIW that's a known and recently patched issue. If you're still seeing issues with it with btrfs-progs 3.16, report it, but 3.14.1 almost certainly wouldn't have had the fix. (This is one related patch turned up by a quick search; there may be others.) * commit 93ebec96f2ae1d3276ebe89e2d6188f9b46692fb | Author: Vincent Stehlé vincent.ste...@laposte.net | Date: Wed Jun 18 18:51:19 2014 +0200 | | btrfs-progs: restore: check lzo compress length | | When things go wrong for lzo-compressed btrfs, feeding | lzo1x_decompress_safe() with corrupt data during restore | can lead to crashes. Reduce the risk by adding | a check on the input length. | | Signed-off-by: Vincent Stehlé vincent.ste...@laposte.net | Signed-off-by: David Sterba dste...@suse.cz | | cmds-restore.c | 6 ++ | 1 file changed, 6 insertions(+) Yeah, 3.16 seems fine, I just hadn't updated my recovery environment yet. Ironically, I did some performance testing afterwards, and realized that using any compression was actually slowing down my system (my disk appears to be faster than my RAM, which is really sad, even for a laptop).
Re: Performance Issues
On 2014-09-19 08:18, Rob Spanton wrote: Hi, I have a particularly uncomplicated setup (a desktop PC with a hard disk) and I'm seeing particularly slow performance from btrfs. A `git status` in the linux source tree takes about 46 seconds after dropping caches, whereas on other machines using ext4 this takes about 13s. My mail client (evolution) also seems to perform particularly poorly on this setup, and my hunch is that it's spending a lot of time waiting on the filesystem. I've tried mounting with noatime, and this has had no effect. Anyone got any ideas? Here are the things that the wiki page asked for [1]: uname -a: Linux zarniwoop.blob 3.16.2-200.fc20.x86_64 #1 SMP Mon Sep 8 11:54:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux btrfs --version: Btrfs v3.16 btrfs fi show: Label: 'fedora' uuid: 717c0a1b-815c-4e6a-86c0-60b921e84d75 Total devices 1 FS bytes used 1.49TiB devid1 size 2.72TiB used 1.50TiB path /dev/sda4 Btrfs v3.16 btrfs fi df /: Data, single: total=1.48TiB, used=1.48TiB System, DUP: total=32.00MiB, used=208.00KiB Metadata, DUP: total=11.50GiB, used=10.43GiB unknown, single: total=512.00MiB, used=0.00 dmesg dump is attached. Please CC any responses to me, as I'm not subscribed to the list. Cheers, Rob [1] https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list WRT the performance of Evolution, the issue is probably fragmentation of the data files. If you run the command: # btrfs fi defrag -rv /home you should see some improvement in evolution performance (until you get any new mail that is). Evolution (like most graphical e-mail clients these days) uses sqlite for data storage, and sqlite database files are one of the known pathological cases for COW filesystems in general; the solution is to mark the files as NOCOW (see the info about VM images in [1] and [2], the same suggestions apply to database files). As for git, I haven't seen any performance issues specific to BTRFS; are you using any compress= mount option? 
zlib-based compression is known to cause serious slowdowns. I don't think that git uses any kind of database for data storage. Also, if the performance comparison is from other systems, unless those systems have the EXACT same hardware configuration, they aren't really a good comparison. Unless the PC this is on is a relatively recent system (less than a year or two old), it may just be hardware that is the performance bottleneck.
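For reference, marking files NOCOW is done with chattr +C; a minimal sketch, assuming Evolution keeps its sqlite files somewhere under ~/.local/share (the exact path varies by version, so the directory name below is a placeholder). The attribute only takes effect for files created after it is set, so set it on an empty directory and copy the databases back in:

```shell
# Hypothetical data dir; +C must be set before files are created in it.
DIR="${EVO_DIR:-$HOME/.local/share/evolution-nocow}"
mkdir -p "$DIR"
# chattr +C fails on filesystems without COW semantics, hence the guard.
chattr +C "$DIR" 2>/dev/null || echo "NOCOW not supported on this filesystem"
lsattr -d "$DIR" 2>/dev/null || true
```

The trade-off: NOCOW files also lose data checksumming (and compression), which is exactly why it is recommended only for pathological cases like database and VM image files.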
Re: Performance Issues
On 2014-09-19 08:25, Swâmi Petaramesh wrote: On Friday 19 September 2014 at 13:18:34, Rob Spanton wrote: I have a particularly uncomplicated setup (a desktop PC with a hard disk) and I'm seeing particularly slow performance from btrfs. Weeelll I have the same over-complicated kind of setup, and an Arch Linux BTRFS system which used to boot in some decent amount of time in the past now takes about 5 full minutes to just make it to the KDM login prompt, and another 5 minutes before KDE is fully started. Makes me think of the good ole' times of Windows 95 OSR2 on a 486SX with a dying 1 GB hard disk... Well, part of your problem might be KDE itself, it's extremely CPU intensive these days. I'd suggest disabling the 'semantic desktop' stuff, because that tends to be the worst offender as far as soaking up system resources. Also, if you recently switched to systemd, that may be causing some slowdown as well (journald's default settings are terrible for performance). Now, let me add that I had removed all snapshots, ran a full defrag, and even rebalanced the damn thing without any positive effect... (And yes, my HD is physically in good shape, SMART feels fully happy, and it's less than 75% full...) I've been using BTRFS for 2-3 years on a dozen different systems, and if something doesn't surprise me at all, it's « slow performance », indeed, although I'm myself more accustomed to « incredibly fscking damn slow performance »... It's kind of funny, but I haven't had any performance issues with BTRFS since about 3.10, even on the systems my employer is using Fedora 20 on, and those use only a Core 2 Duo processor, DDR2-800 RAM, and SATA2 hard drives. HTH
Re: Performance Issues
On 2014-09-19 08:49, Austin S Hemmelgarn wrote: [snip] ...the solution is to mark the files as NOCOW (see the info about VM images in [1] and [2], the same suggestions apply to database files).
As for git, I haven't seen any performance issues specific to BTRFS; are you using any compress= mount option? zlib-based compression is known to cause serious slowdowns. I don't think that git uses any kind of database for data storage. Also, if the performance comparison is from other systems, unless those systems have the EXACT same hardware configuration, they aren't really a good comparison. Unless the PC this is on is a relatively recent system (less than a year or two old), it may just be hardware that is the performance bottleneck. Realized after I sent this that I forgot the links for [1] and [2] [1] https://btrfs.wiki.kernel.org/index.php/UseCases [2] https://btrfs.wiki.kernel.org/index.php/FAQ
Re: Performance Issues
On 2014-09-19 09:51, Holger Hoffstätte wrote: On Fri, 19 Sep 2014 13:18:34 +0100, Rob Spanton wrote: I have a particularly uncomplicated setup (a desktop PC with a hard disk) and I'm seeing particularly slow performance from btrfs. A `git status` in the linux source tree takes about 46 seconds after dropping caches, whereas on other machines using ext4 this takes about 13s. My mail client (evolution) also seems to perform particularly poorly on this setup, and my hunch is that it's spending a lot of time waiting on the filesystem. This is - unfortunately - a particular btrfs oddity/characteristic/flaw, whatever you want to call it. git relies a lot on fast stat() calls, and those seem to be particularly slow with btrfs esp. on rotational media. I have the same problem with rsync on a freshly mounted volume; it gets fast (quite so!) after the first run. I find that kind of funny, because regardless of filesystem, stat() is one of the *slowest* syscalls on almost every *nix system in existence. The simplest thing to fix this is a du -s . >/dev/null to pre-cache all file inodes. I'd also love a technical explanation why this happens and how it could be fixed. Maybe it's just a consequence of how the metadata tree(s) are laid out on disk. While I don't know for certain, I think it's largely just a side effect of the lack of performance tuning in the BTRFS code. I've tried mounting with noatime, and this has had no effect. Anyone got any ideas? Don't drop the caches :-) -h
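The pre-caching trick amounts to walking the tree once before the stat-heavy tool runs; a trivial sketch (run from the directory in question):

```shell
# Walk the tree once so every inode lands in the VFS caches; the
# following stat-heavy command (git status, rsync, ...) then runs warm.
du -s . >/dev/null
echo "inode cache warmed for $(pwd)"
```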
Re: Problem with unmountable filesystem.
On 2014-09-19 13:07, Chris Murphy wrote: Possibly btrfs-select-super can do some of the things I was doing the hard way. It's possible to select a super to overwrite other supers, even if they're good ones. Whereas btrfs rescue super-recover won't do that, and neither will btrfsck, hence why I corrupted the one I didn't want first. This command isn't built by default (at least not on Fedora). I don't think it's built by default on any of the major distributions. On Gentoo you need to set package specific configure options.
Re: Problem with unmountable filesystem.
On 2014-09-19 13:54, Chris Murphy wrote: On Sep 17, 2014, at 5:23 AM, Austin S Hemmelgarn ahferro...@gmail.com wrote: [ 30.920536] BTRFS: bad tree block start 0 130402254848 [ 30.924018] BTRFS: bad tree block start 0 130402254848 [ 30.926234] BTRFS: failed to read log tree [ 30.953055] BTRFS: open_ctree failed I'm still confused. Btrfs knows this tree root is bad, but it has backup roots. So why wasn't one of those used by -o recovery? I thought that's the whole point of that mount option. Backup tree roots are per superblock, so conceivably you'd have up to 8 of these with two superblocks, they're shown with btrfs-show-super -af ## and -F even if a super is bad But skipping that, to fix this you need to know which super is pointing to the wrong tree root, since you're using ssd mount option with rotating supers. I assume mount uses the super with the highest generation number. So you'd need to: btrfs-show-super -a to find out the super with the most recent generation. You'd assume that one was wrong. And then use btrfs-select-super to pick the right one, and replace the wrong one. Then you could mount. I also wonder if btrfs check -sX would show different results in your case. I'd think it would because it ought to know one of those tree roots is bad, seeing as mount knows it. And then it seems (I'm speculating a ton) that --repair might try to fix the bad tree root, and then if it fails I'd like to think it can just find the most recent good tree root, ideally one listed as a backup_tree_root by any good superblock, and then have the next mount use that. I'm not sure why this persistently fails, and I wonder if there are cases of users giving up and blowing away file systems that could actually be mountable. But it's just really a manual process figuring out what things to do in what order to get them to mount. From what I can tell, btrfs check doesn't do anything about backup superblocks unless you specifically tell it to. 
In this case, running btrfs check without specifying a superblock mirror, and with explicitly specifying the primary superblock, produced identical results (namely it choked, hard, with an error message similar to that from the kernel). However, running it with -s1 to select the first backup superblock returned no errors at all other than the space_cache being invalid and the count of used blocks being wrong. Based on my (limited) understanding of the mount code, it does try to use the superblock with the highest generation (regardless of whether we are on an ssd or not), but doesn't properly fall back to a secondary superblock after trying to mount using the primary. As far as btrfs check --repair trying to fix this, I don't think that it does so currently, probably for the same reason that mount fails.
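Pulling Chris's outline together, the manual repair path looks roughly like the sequence below. It is printed rather than executed here, since btrfs-select-super overwrites supers irreversibly and /dev/sdX is a placeholder:

```shell
# Dry run of the manual repair path discussed above.
DEV=/dev/sdX   # placeholder device, substitute your own
PLAN=$(cat <<EOF
btrfs-show-super -a $DEV      # compare 'generation' across the super mirrors
btrfs check -s1 $DEV          # sanity-check the tree that backup super 1 points to
btrfs-select-super -s 1 $DEV  # overwrite the supers from mirror 1 (destructive)
mount $DEV /mnt
EOF
)
echo "$PLAN"
```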
Re: Single disk parrallelization
On 2014-09-19 14:10, Jeb Thomson wrote: With the advanced features of btrfs, it would be a simple additional task to make different platters run in parallel. In this case, say a disk has three platters, and so three seek heads as well. If we can identify that much, and what offsets they are at, it then becomes a trivial matter to place the reads and writes to different platters at the same time. In effect, this means each platter should be operating as a single virtualized unit, instead of one single unit... In theory this is a great idea except for two things: 1) Most consumer drives have only one platter. 2) The kernel doesn't have such low-level hardware access, so it would have to be implemented in device firmware (and I'd be willing to bet that most drive manufacturers already stripe data across multiple platters when possible).
Re: general thoughts and questions + general and RAID5/6 stability?
On 2014-09-22 16:51, Stefan G. Weichinger wrote: Am 20.09.2014 um 11:32 schrieb Duncan: What I do as part of my regular backup regime, is every few kernel cycles I wipe the (first level) backup and do a fresh mkfs.btrfs, activating new optional features as I believe appropriate. Then I boot to the new backup and run a bit to test it, then wipe the normal working copy and do a fresh mkfs.btrfs on it, again with the new optional features enabled that I want. Is re-creating btrfs-filesystems *recommended* in any way? Does that actually make a difference in the fs-structure? I would recommend it, there are some newer features that you can only set at mkfs time. Quite often, when a new feature is implemented, it is some time before things are such that it can be enabled online, and even then that doesn't convert anything until it is rewritten. So far I assumed it was enough to keep the kernel up2date, use current (stable) btrfs-progs and run some scrub every week or so (not to mention backups .. if it ain't backed up, it was/isn't important). Stefan
Re: general thoughts and questions + general and RAID5/6 stability?
On 2014-09-23 09:06, Stefan G. Weichinger wrote: Am 23.09.2014 um 14:08 schrieb Austin S Hemmelgarn: On 2014-09-22 16:51, Stefan G. Weichinger wrote: Is re-creating btrfs-filesystems *recommended* in any way? Does that actually make a difference in the fs-structure? I would recommend it, there are some newer features that you can only set at mkfs time. Quite often, when a new feature is implemented, it is some time before things are such that it can be enabled online, and even then that doesn't convert anything until it is rewritten. What features for example? Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the following list of features: mixed-bg - mixed data and metadata block groups extref - increased hard-link limit per file to 65536 raid56 - raid56 extended format skinny-metadata - reduced size metadata extent refs no-holes - no explicit hole extents for files mixed-bg is something that you generally wouldn't want to change after mkfs. extref can be enabled online, and the filesystem metadata gets updated as-needed, and doesn't provide any real performance improvement (but is needed for some mail servers that have HUGE mail-queues). I don't know anything about the raid56 option, but there isn't any way to change it after mkfs. skinny-metadata can be changed online, and the format gets updated on rewrite of each metadata block. This one does provide a performance improvement (stat() in particular runs noticeably faster). You should probably enable this if it isn't already enabled, even if you don't recreate your filesystem. no-holes cannot currently be changed online, and is a very recent addition (post v3.14 btrfs-progs I believe) that provides improved performance for sparse files (which is particularly useful if you are doing things with fixed size virtual machine disk images). It's this last one that prompted me personally to recreate my filesystems most recently, as I use sparse files to save space as much as possible.
I created my main btrfs a few months ago and would like to avoid recreating it as this would mean restoring my root-fs on my main workstation. Although I would do it if it is worth it ;-) I assume I could read some kind of version number out of the superblock or so? btrfs-show-super ? AFAIK there isn't really any 'version number' that has any meaning in the superblock (except for telling the kernel that it uses the stable disk layout); however, there are flag bits that you can look for (compat_flags, compat_ro_flags, and incompat_flags). I'm not 100% certain what each bit means, but on my system with an only-1-month-old BTRFS filesystem, with extref, skinny-metadata, and no-holes turned on, I have compat_flags: 0x0, compat_ro_flags: 0x0, and incompat_flags: 0x16b. The other potentially significant thing is that the default nodesize/leafsize has changed recently from 4096 to 16384, as that gives somewhat better performance for most use cases.
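The incompat_flags value is just a bitmask, so it can be decoded by hand. The bit assignments below are taken from my reading of the 3.16-era kernel headers, so treat them as an assumption rather than a reference; decoding the 0x16b quoted above:

```shell
# Bit names from (my reading of) the 3.16-era ctree.h; not exhaustive.
FLAGS=0x16b
SET=""
for pair in 0x1:MIXED_BACKREF 0x2:DEFAULT_SUBVOL 0x4:MIXED_GROUPS \
            0x8:COMPRESS_LZO 0x20:BIG_METADATA 0x40:EXTENDED_IREF \
            0x80:RAID56 0x100:SKINNY_METADATA 0x200:NO_HOLES; do
  bit=${pair%%:*}
  name=${pair##*:}
  if [ $(( FLAGS & bit )) -ne 0 ]; then
    SET="$SET $name"
  fi
done
echo "set:$SET"
```

Of course, which bits you actually see depends on your mkfs options and kernel, so check your own btrfs-show-super output rather than the numbers here.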
Re: general thoughts and questions + general and RAID5/6 stability?
On 2014-09-23 10:23, Tobias Holst wrote: If it is unknown which of these options have been used at btrfs creation time - is it possible to check the state of these options afterwards on a mounted or unmounted filesystem? 2014-09-23 15:38 GMT+02:00 Austin S Hemmelgarn ahferro...@gmail.com: Well, running 'mkfs.btrfs -O list-all' with 3.16 btrfs-progs gives the following list of features: mixed-bg - mixed data and metadata block groups extref - increased hard-link limit per file to 65536 raid56 - raid56 extended format skinny-metadata - reduced size metadata extent refs no-holes - no explicit hole extents for files I don't think there is a specific tool for doing this, but some of them do show up in dmesg, for example skinny-metadata shows up as a mention of the FS having skinny extents.
Re: What is the vision for btrfs fs repair?
On 2014-10-08 15:11, Eric Sandeen wrote: I was looking at Marc's post: http://marc.merlins.org/perso/btrfs/post_2014-03-19_Btrfs-Tips_-Btrfs-Scrub-and-Btrfs-Filesystem-Repair.html and it feels like there isn't exactly a cohesive, overarching vision for repair of a corrupted btrfs filesystem. In other words - I'm an admin cruising along, when the kernel throws some fs corruption error, or for whatever reason btrfs fails to mount. What should I do? Marc lays out several steps, but to me this highlights that there seem to be a lot of disjoint mechanisms out there to deal with these problems; mostly from Marc's blog, with some bits of my own: * btrfs scrub Errors are corrected along the way if possible (what *is* possible?) * mount -o recovery Enable autorecovery attempts if a bad tree root is found at mount time. * mount -o degraded Allow mounts to continue with missing devices. (This isn't really a way to recover from corruption, right?) * btrfs-zero-log removes the log tree if the log tree is corrupt * btrfs rescue Recover a damaged btrfs filesystem chunk-recover super-recover How does this relate to btrfs check? * btrfs check repair a btrfs filesystem --repair --init-csum-tree --init-extent-tree How does this relate to btrfs rescue? * btrfs restore try to salvage files from a damaged filesystem (not really repair, it's disk-scraping) What's the vision for, say, scrub vs. check vs. rescue? Should they repair the same errors, only online vs. offline? If not, what class of errors does one fix vs. the other? How would an admin know? Can btrfs check recover a bad tree root in the same way that mount -o recovery does? How would I know if I should use --init-*-tree, or chunk-recover, and what are the ramifications of using these options? It feels like recovery tools have been badly splintered, and if there's an overarching design or vision for btrfs fs repair, I can't tell what it is. Can anyone help me?
Well, based on my understanding: * btrfs scrub is intended to be almost exactly equivalent to scrubbing a RAID volume; that is, it fixes disparity between multiple copies of the same block. IOW, it isn't really repair per se, but more preventative maintenance. Currently, it only works for cases where you have multiple copies of a block (dup, raid1, and raid10 profiles), but support is planned for error correction of raid5 and raid6 profiles. * mount -o recovery I don't know much about, but AFAICT, it's more for dealing with metadata related FS corruption. * mount -o degraded is used to mount a fs configured for a raid storage profile with fewer devices than the profile minimum. It's primarily so that you can get the fs into a state where you can run 'btrfs device replace' * btrfs-zero-log only deals with log tree corruption. This would be roughly equivalent to zeroing out the journal on an XFS or ext4 filesystem, and should almost never be needed. * btrfs rescue is intended for low level recovery of corruption on an offline fs. * chunk-recover I'm not entirely sure about, but I believe it's like scrub for a single chunk on an offline fs * super-recover is for dealing with corrupted superblocks, and tries to replace it with one of the other copies (which hopefully isn't corrupted) * btrfs check is intended to (eventually) be equivalent to the fsck utility for most other filesystems. Currently, it's relatively good at identifying corruption, but less so at actually fixing it. There are however, some things that it won't catch, like a superblock pointing to a corrupted root tree. * btrfs restore is essentially disk scraping, but with built-in knowledge of the filesystem's on-disk structure, which makes it more reliable than more generic tools like scalpel for files that are too big to fit in the metadata blocks, and it is pretty much essential for dealing with transparently compressed files.
In general, my personal procedure for handling a misbehaving BTRFS filesystem is: * Run btrfs check on it WITHOUT ANY OTHER OPTIONS to try to identify what's wrong * Try mounting it using -o recovery * Try mounting it using -o ro,recovery * Use -o degraded only if it's a BTRFS raid set that lost a disk * If btrfs check AND dmesg both seem to indicate that the log tree is corrupt, try btrfs-zero-log * If btrfs check indicated a corrupt superblock, try btrfs rescue super-recover * If all of the above fails, ask for advice on the mailing list or IRC Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub.
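Running scrub "regularly" is easy to automate; a minimal sketch as a cron config fragment (the path and schedule are assumptions, adjust to taste):

```
# /etc/cron.d/btrfs-scrub (hypothetical) -- scrub the root fs every Sunday at 03:00
0 3 * * 0  root  /usr/bin/btrfs scrub start -Bd /
```

-B keeps scrub in the foreground so cron can report a non-zero exit, and -d prints per-device statistics in the mail cron sends you.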
Re: What is the vision for btrfs fs repair?
On 2014-10-09 07:53, Duncan wrote: Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted: Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub. AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and assuming it's good, replaces the bad one with the good one both for the read (which otherwise errors out), and by overwriting the bad one. But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them, it's that the chances of finding them are relatively low, unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum). At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it. I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read, I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is.
Also, that's a much better description of how multiple copies work than I could probably have ever given.
Re: What is the vision for btrfs fs repair?
On 2014-10-09 08:12, Hugo Mills wrote: On Thu, Oct 09, 2014 at 08:07:51AM -0400, Austin S Hemmelgarn wrote: [snip] I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read, I'm fairly sure it does, as I've had it happen to me.
:) I probably just misinterpreted the source code; while I know enough C to generally understand things, I'm by far no expert. I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is. If the FS is RO, then yes, it won't fix things. Hugo.
Re: What is the vision for btrfs fs repair?
On 2014-10-09 08:34, Duncan wrote: On Thu, 09 Oct 2014 08:07:51 -0400 Austin S Hemmelgarn ahferro...@gmail.com wrote: On 2014-10-09 07:53, Duncan wrote: Austin S Hemmelgarn posted on Thu, 09 Oct 2014 07:29:23 -0400 as excerpted: Also, you should be running btrfs scrub regularly to correct bit-rot and force remapping of blocks with read errors. While BTRFS technically handles both transparently on reads, it only corrects things on disk when you do a scrub. AFAIK that isn't quite correct. Currently, the number of copies is limited to two, meaning if one of the two is bad, there's a 50% chance of btrfs reading the good one on the first try. If btrfs reads the good copy, it simply uses it. If btrfs reads the bad one, it checks the other one and, assuming it's good, replaces the bad one with the good one, both for the read (which would otherwise error out) and by overwriting the bad one. But here's the rub. The chances of detecting that bad block are relatively low in most cases. First, the system must try reading it for some reason, but even then, chances are 50% it'll pick the good one and won't even notice the bad one. Thus, while btrfs may randomly bump into a bad block and rewrite it with the good copy, scrub is the only way to systematically detect and (if there's a good copy) fix these checksum errors. It's not that btrfs doesn't do it if it finds them, it's that the chances of finding them are relatively low unless you do a scrub, which systematically checks the entire filesystem (well, other than files marked nocsum, or nocow, which implies nocsum, or files written when mounted with nodatacow or nodatasum). At least that's the way it /should/ work. I guess it's possible that btrfs isn't doing those routine bump-into-it-and-fix-it fixes yet, but if so, that's the first /I/ remember reading of it.
I'm not 100% certain, but I believe it doesn't actually fix things on disk when it detects an error during a read. I know it doesn't if the fs is mounted ro (even if the media is writable), because I did some testing to see how 'read-only' mounting a btrfs filesystem really is. Definitely it won't with a read-only mount. But then scrub shouldn't be able to write to a read-only mount either. The only way a read-only mount should be writable is if it's mounted (bind-mounted or btrfs-subvolume-mounted) read-write elsewhere, and the write occurs to that mount, not the read-only mounted location. In theory yes, but there are caveats to this, namely:
* atime updates still happen unless you have mounted the fs with noatime
* the superblock gets updated if there are 'any' writes
* the free space cache 'might' be updated if there are any writes
All in all, a BTRFS filesystem mounted ro is much more read-only than, say, ext4 (which at least updates the sb, and old versions replayed the journal, in addition to the atime updates). There's even debate about replaying the journal or doing orphan-delete on read-only mounts (at least on-media; the change could, and arguably should, occur in RAM and be cached, marking the cache dirty at the same time so it's appropriately flushed if/when the filesystem goes writable), with some arguing read-only means just that, don't write /anything/ to it until it's read-write mounted. But writable-mounted, detected checksum errors (with a good copy available) should be rewritten as far as I know. If not, I'd call it a bug. The problem is in the detection, not in the rewriting. Scrub's the only way to reliably detect these errors since it's the only thing that systematically checks /everything/. Also, that's a much better description of how multiple copies work than I could probably have ever given. Thanks. =:^)
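The detection odds described above can be illustrated with a toy simulation (an illustration only, not btrfs code): a normal read of a corrupted two-copy block picks one copy at random, so it trips over the bad copy only about half the time, while a scrub checksums every copy and always finds it.

```shell
# Toy model: one corrupted block with two mirrored copies.
# A normal read picks a copy at random; scrub verifies both copies.
awk 'BEGIN {
    srand(1); n = 100000; found = 0
    for (i = 1; i <= n; i++)
        if (rand() < 0.5) found++        # this read happened to hit the bad copy
    printf "random-read detection rate: %.2f\n", found / n
    print  "scrub detection rate: 1.00"  # scrub checks every copy, every time
}'
```

The first number hovers around 0.50, which is why a bad copy can sit undetected for a long time on a rarely-read file until a scrub walks the whole filesystem.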
Re: What is the vision for btrfs fs repair?
On 2014-10-10 13:43, Bob Marley wrote: On 10/10/2014 16:37, Chris Murphy wrote: The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot. A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions. If they need reliability, they should have some form of redundancy in place and/or run the database directly on the block device; because even ext4, XFS, and pretty much every other filesystem can lose data sometimes, the difference being that those tend to give worse results when hardware is misbehaving than BTRFS does, because BTRFS usually has an old copy of whatever data structure gets corrupted to fall back on. Also, you really shouldn't be running databases on a BTRFS filesystem at the moment anyway, because of the significant performance implications. At most I can suggest that a flag be added to the metadata to allow/disallow auto-roll-back-on-error on such filesystems, so people can decide between the tolerant and transaction-safe modes at filesystem creation.
The problem with this is that if the auto-recovery code did run (and IMHO the kernel should spit out a warning to the system log whenever it does), then chances are that you wouldn't have had a consistent view even if you had prevented it from running; and if the database is properly distributed/replicated, then it should recover by itself.
Re: What is the vision for btrfs fs repair?
On 2014-10-10 18:05, Eric Sandeen wrote: On 10/10/14 2:35 PM, Austin S Hemmelgarn wrote: On 2014-10-10 13:43, Bob Marley wrote: On 10/10/2014 16:37, Chris Murphy wrote: The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot. A filesystem which is suited for general purpose use is a filesystem which honors fsync, and doesn't *ever* auto-roll-back without user intervention. Anything different is not suited for database transactions at all. Any paid service which has the users' database on btrfs is going to be at risk of losing payments, and probably without the company even knowing. If btrfs goes this way I hope a big warning is written on the wiki and on the manpages telling that this filesystem is totally unsuitable for hosting databases performing transactions. If they need reliability, they should have some form of redundancy in place and/or run the database directly on the block device; because even ext4, XFS, and pretty much every other filesystem can lose data sometimes, Not if i.e. fsync returns. If the data is gone later, it's a hardware problem, or occasionally a bug - bugs that are usually found and fixed pretty quickly. Yes, barring bugs and hardware problems they won't lose data. the difference being that those tend to give worse results when hardware is misbehaving than BTRFS does, because BTRFS usually has an old copy of whatever data structure gets corrupted to fall back on. I'm curious, is that based on conjecture or real-world testing? I wouldn't really call it testing, but based on personal experience I know that ext4 can lose whole directory sub-trees if it gets a single corrupt sector in the wrong place.
I've also had that happen on FAT32 and (somewhat interestingly) HFS+ with failing/misbehaving hardware; and I've actually had individual files disappear on HFS+ without any discernible hardware issues. I don't have as much experience with XFS, but would assume based on what I do know of it that it could have similar issues. As for BTRFS, I've only ever had any issues with it 3 times: one was due to the kernel panicking during resume from S1, and the other two were due to hardware problems that would have caused issues on most other filesystems as well. In both cases of hardware issues, while the filesystem was initially unmountable, it was relatively simple to fix once I knew how. I tried to fix an ext4 fs that had become unmountable due to dropped writes once, and that was anything but simple, even with the much greater amount of documentation.
Re: What is the vision for btrfs fs repair?
On 2014-10-12 06:14, Martin Steigerwald wrote: On Friday, 10 October 2014, 10:37:44, Chris Murphy wrote: On Oct 10, 2014, at 6:53 AM, Bob Marley bobmar...@shiftmail.org wrote: On 10/10/2014 03:58, Chris Murphy wrote: * mount -o recovery Enable autorecovery attempts if a bad tree root is found at mount time. I'm confused why it's not the default yet. Maybe it's continuing to evolve at a pace that suggests something could sneak in that makes things worse? It is almost an oxymoron in that I'm manually enabling an autorecovery. If true, maybe the closest indication we'd get of btrfs stability is the default enabling of autorecovery. No way! I wouldn't want a default like that. If you think of distributed transactions: suppose a sync was issued on both sides of a distributed transaction, then power was lost on one side, then btrfs had corruption. When I remount it, definitely the worst thing that can happen is that it auto-rolls-back to a previous known-good state. For a general purpose file system, losing 30 seconds (or less) of questionably committed data, likely corrupt, is preferable to a file system that won't mount without user intervention, which requires a secret decoder ring to get it to mount at all. And may require the use of specialized tools to retrieve that data in any case. The fail safe behavior is to treat the known good tree root as the default tree root, and bypass the bad tree root if it cannot be repaired, so that the volume can be mounted with default mount options (i.e. the ones in fstab). Otherwise it's a filesystem that isn't well suited for general purpose use as rootfs, let alone for boot. To understand this a bit better: What can be the reasons a recent tree gets corrupted? Well, so far I have had the following cause corrupted trees:
1. Kernel panic during resume from ACPI S1 (suspend to RAM), which just happened to be in the middle of a tree commit.
2. Generic power loss during a tree commit.
3.
A device not properly honoring write-barriers (the operations immediately adjacent to the write barrier weren't being ordered correctly all the time). Based on what I know about BTRFS, the following could also cause problems:
1. A single-event upset somewhere in the write path.
2. The kernel issuing a write to the wrong device (I haven't had this happen to me, but know people who have).
In general, any of these will cause problems for pretty much any filesystem, not just BTRFS. I always thought that, with a controller, device, and driver combination that honors fsync, BTRFS would be in either the new state or the last known good state *anyway*. So where does the need to roll back arise from? I think that in this case the term rollback is a bit ambiguous; here it means from the point of view of userspace, which sees the FS as having 'rolled back' from the most recent state to the last known good state. That said, all journalling filesystems have some sort of rollback as far as I understand: if the last journal entry is incomplete, they discard it on journal replay. So even there you lose the last seconds of write activity. But in case fsync() returns, the data needs to be safe on disk. I always thought BTRFS honors this under *any* circumstance. If some proposed autorollback breaks this guarantee, I think something is broken elsewhere. And fsync is an fsync is an fsync. Its semantics are clear as crystal. There is nothing, absolutely nothing to discuss about it. An fsync completes if the device itself reported "Yeah, I have the data on disk, all safe and cool to go." Anything else is a bug IMO. Or a hardware issue; most filesystems need disks to properly honor write barriers to provide guaranteed semantics on an fsync, and many consumer disk drives still don't honor them consistently.
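The fsync contract under discussion is exactly what the classic write-new-then-rename update pattern relies on. A minimal POSIX-shell sketch (the filename is made up; note that sync(1) accepting a file argument is a GNU coreutils >= 8.24 behavior, hence the fallback):

```shell
#!/bin/sh
# Crash-safe update sketch: after a crash the file holds either the old or
# the new contents, never a torn mix -- provided the flush really hits media.
set -e
db="ledger.txt"                       # hypothetical data file
printf 'balance=100\n' > "$db"        # old, committed state
printf 'balance=90\n'  > "$db.tmp"    # write the new state beside it
sync "$db.tmp" 2>/dev/null || sync    # flush to stable storage (~fsync)
mv "$db.tmp" "$db"                    # atomic rename within one filesystem
cat "$db"
```

If a drive acknowledges the flush before the data is actually on media, no filesystem can uphold this guarantee, which is the point being made about consumer drives and write barriers.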
Re: Wishlist Item :: One Subvol in Multiple Places
On 2014-10-14 18:25, Robert White wrote: I've got no idea if this is possible given the current storage layout, but it would be Really Nice™ if there were a way to have a single subvolume exist in more than one place in the hierarchy. I know this can be faked via mount tricks (bind or use of subvol=), but having it be a real thing would be preferable. For example, if I have two or more distributions on a computer, or want to switch between 32-bit and 64-bit environments frequently, but I want to use the same /home (which is its own subvolume anyway), it would be nice if the native layout could be permuted such that /__System_32/home and /__System_64/home were the actual same subvolume. The mechanism, were it possible, would be something like btrfs subvolume link /existing/path /new/path (or bind instead of link). I've got no idea if the directory structure would allow for this, but if it would, it would simplify several things (for me anyway) if the file system layout represented the runtime layout. This probably won't be implemented, for the same reason that most modern unix systems disallow hardlinks to directories; namely, it results in ambiguity when resolving the .. directory entry. The better solution would be to put /home in a separate top-level subvolume, and then mount that in each location.
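The suggested mount-based approach can live in /etc/fstab; a sketch, assuming a top-level subvolume named home and a placeholder UUID:

```
# Mount the same 'home' subvolume at both roots (UUID is a placeholder)
UUID=0000-0000-0000-0000  /__System_32/home  btrfs  subvol=home  0  0
UUID=0000-0000-0000-0000  /__System_64/home  btrfs  subvol=home  0  0
```

Both entries reference the same pool, so the two paths see the same files without any directory-hardlink ambiguity.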
Re: strange 3.16.3 problem
On 2014-10-20 09:02, Zygo Blaxell wrote: On Mon, Oct 20, 2014 at 04:38:28AM +0000, Duncan wrote: Russell Coker posted on Sat, 18 Oct 2014 14:54:19 +1100 as excerpted: # find . -name *546 ./1412233213.M638209P10546 # ls -l ./1412233213.M638209P10546 ls: cannot access ./1412233213.M638209P10546: No such file or directory Does your mail server do a lot of renames? Is one perhaps stuck? If so, that sounds like the same thing Zygo Blaxell is reporting in the 3.16.3..3.17.1 hang in renameat2() thread, OP on Sun, 19 Oct 2014 15:25:26 -0400, Msg-ID: 20141019192525.ga29...@hungrycats.org, as linked here: http://permalink.gmane.org/gmane.comp.file-systems.btrfs/39539 I pointed him at this thread too. I hadn't seen you mention a hung rename, but the other symptoms sound similar. Not really. It looks like Russell is having an NFS client-side problem; I'm having a server-side one (maybe). Also, all Russell's system calls seem to be returning promptly, while some of mine are not. Even if there were timeouts, an NFS server timeout gives a different error than 'No such file or directory'. Finally, the one and only thing I _can_ do with my bug is 'ls' on the renamed files (for me, the find would get stuck before returning any output). For Russell's issue... most of the stuff I can think of has been tried already. I didn't see if there was any attempt to ls the file from the NFS server as well as the client side. If ls is OK on the server but not the client, it's an NFS issue (possibly interacting with some btrfs-specific quirk); otherwise, it's likely a corrupted filesystem (mail servers seem to be unusually good at making these). Most of the I/O time on mail servers tends to land in the fsync() system call, and some nasty fsync() btrfs bugs were fixed in 3.17 (i.e. after 3.16, and not in the 3.16.x stable updates for x <= 5, the last one I've checked). That said, I'm not familiar with how fsync() translates over NFS, so it might not be relevant after all.
If the NFS server's view of the filesystem is OK, check the NFS protocol version from /proc/mounts on the client. Sometimes NFS clients will get some transient network error during connection and fall back to some earlier (and potentially buggier) NFS version. I've seen very different behavior in some important corner cases from v4 and v3 clients, for example, and if the client is falling all the way back to v2 the bugs and their workarounds start to get just plain _weird_ (e.g. filenames which produce specific values from some hash function or that contain specific character sequences are unusable). v2 is so old it may even have issues with 64-bit inode numbers. Just now saw this thread, but IIRC 'No such file or directory' also gets returned sometimes when trying to automount a share that can't be enumerated by the client, and also sometimes when there is a stale NFS file handle.
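Checking which NFS version the client actually negotiated, as suggested, is a one-liner against /proc/mounts (it prints nothing if there are no NFS mounts; the vers= option in the fourth field shows the protocol version):

```shell
# List mount point, fs type, and options (including vers=) for NFS mounts
awk '$3 ~ /^nfs/ { print $2, $3, $4 }' /proc/mounts
```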
Re: Poll: time to switch skinny-metadata on by default?
On 2014-10-21 05:29, Duncan wrote: David Sterba posted on Mon, 20 Oct 2014 18:34:03 +0200 as excerpted: On Thu, Oct 16, 2014 at 01:33:37PM +0200, David Sterba wrote: I'd like to make it default with the 3.17 release of btrfs-progs. Please let me know if you have objections. For the record, 3.17 will not change the defaults. The timing of the poll was very bad to get enough feedback before the release. Let's keep it open for now. FWIW my own results agree with yours, I've had no problem with skinny-metadata here, and it has been my default now for a couple of backup-and-new-mkfs.btrfs generations. As you know there were some problems with it in the first kernel cycle or two after it was introduced as an option, and I waited awhile until they died down before trying it here, but as I said, no problems since I switched it on, and I've been running it awhile now. So defaulting to skinny-metadata looks good from here. =:^) Same here, I've been using it on all my systems since I switched from 3.15 to 3.16, and have had no issues whatsoever.
Re: downgrade from kernel 3.17 to 3.10
On 2014-10-21 11:34, Cristian Falcas wrote: I will start investigating how we can build our own rpms from the 3.16 sources. Until then we are stuck with the ones from the official repos or elrepo. Which means 3.10 is the latest for el6. We used this until now and it seems we were lucky enough to not hit anything bad. IIRC there is a make target in the kernel sources that generates the appropriate RPMs for you, although doing so from mainline won't get you any of the patches from Oracle that they use in el. We upgraded to 3.17 because we use ceph on the machine with openstack, and on the ceph site they recommended 3.14. And because we need writable snapshots, we are forced to use btrfs under ceph. Thank you all for your advice. On Tue, Oct 21, 2014 at 6:20 PM, Robert White rwh...@pobox.com wrote: On 10/21/2014 06:18 AM, Cristian Falcas wrote: Thank you for your answer. I will reformat the disk with a 3.10 kernel in the meantime, because I don't have any rpms for 3.16 now. More concisely: Don't use 3.10 BTRFS for data you value. There is a non-trivial chance that the problems you observed are/were due to bad things written to the disk by 3.10. There is no value in recreating your file systems under 3.10, as the same thing is likely to go bad again when you get out of the dungeon. What are your RPM options? What about just getting the sources from kernel.org and compiling your own 3.16.5? Seriously, 3.10 just... no... 8-)
Re: device balance times
On 2014-10-21 16:44, Arnaud Kapp wrote: Hello, I would like to ask if the balance time is related to the number of snapshots or if it is related only to data (or both). I currently have about 4TB of data and around 5k snapshots. I'm thinking of going raid1 instead of single. From the numbers I see this seems totally impossible as it would take *way* too long. Would destroying snapshots (those are hourly snapshots to guard against stupid errors, like `rm my_important_file`) help? Should I reconsider moving to raid1 because of the time it would take? Sorry if I'm somehow hijacking this thread, but it seemed related :) Thanks, The issue is the snapshots, because I regularly fully re-balance my home directory on my desktop, which is ~150GB on a BTRFS raid10 setup with only 3 or 4 snapshots (I only do daily snapshots, because anything I need finer granularity on I have under git), and that takes only about 2 or 3 hours depending on how many empty chunks I have. I would remove the snapshots, and also start keeping fewer of them (5k hourly snapshots is more than six months' worth of file versions), and then run the balance. I would also suggest converting data by itself first, and then converting metadata, as converting data chunks will require re-writing large parts of the metadata. On 10/21/2014 10:14 PM, Piotr Pawłow wrote: On 21.10.2014 20:59, Tomasz Chmielewski wrote: FYI - after a failed disk and replacing it I've run a balance; it took almost 3 weeks to complete, for 120 GBs of data: Looks normal to me. Last time I started a balance after adding a 6th device to my FS, it took 4 days to move 25GBs of data. Some chunks took 20 hours to move. I currently have 156 snapshots on this FS (nightly rsync backups). I think it is so slow because it's disassembling chunks piece by piece and stuffing these pieces elsewhere, instead of moving chunks as a whole. If you have a lot of little pieces (as I do), it will take a while...
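The parenthetical retention estimate is easy to sanity-check; 5,000 hourly snapshots cover:

```shell
# 5000 hourly snapshots, expressed in days of history (integer division)
echo "$(( 5000 / 24 )) days"   # 208 days, i.e. roughly seven months
```

So pruning to, say, a few weeks of hourlies plus monthly archives would cut the snapshot count by well over an order of magnitude before the balance.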
-- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: 5 _thousand_ snapshots? even 160?
On 2014-10-21 21:10, Robert White wrote: I don't think balance will _ever_ move the contents of a read-only snapshot. I could be wrong. I think you just end up with an endlessly fragmented storage space, and balance has to take each chunk and search for someplace else it might better fit. Which explains why it took so long. And just _forget_ single-extent large files at that point. (Of course I could be wrong about the never-move rule, but that would just make the checksums on the potentially hundreds or thousands of references need to be recalculated after a move, which would make incremental send/receive unfathomable.) Balance doesn't do anything different for snapshots from what it does with regular data. I think you are confusing balance with defragmentation, as that does (in theory) handle snapshots differently. Balance just takes all of the blocks selected by the filters, sends them through the block allocator again, and then updates the metadata to point to the new blocks. It can result in some fragmentation, but usually only for files bigger than about 256M, and even then doesn't always cause fragmentation. On 10/21/2014 01:44 PM, Arnaud Kapp wrote: Hello, I would like to ask if the balance time is related to the number of snapshots or if it is related only to data (or both). I currently have about 4TB of data and around 5k snapshots. I'm thinking of going raid1 instead of single. From the numbers I see this seems totally impossible as it would take *way* too long. Would destroying snapshots (those are hourly snapshots to guard against stupid errors, like `rm my_important_file`) help? Should I reconsider moving to raid1 because of the time it would take?
Sorry if I'm somehow hijacking this thread, but it seemed related :) Thanks, On 10/21/2014 10:14 PM, Piotr Pawłow wrote: On 21.10.2014 20:59, Tomasz Chmielewski wrote: FYI - after a failed disk and replacing it I've run a balance; it took almost 3 weeks to complete, for 120 GBs of data: Looks normal to me. Last time I started a balance after adding a 6th device to my FS, it took 4 days to move 25GBs of data. Some chunks took 20 hours to move. I currently have 156 snapshots on this FS (nightly rsync backups). I think it is so slow because it's disassembling chunks piece by piece and stuffing these pieces elsewhere, instead of moving chunks as a whole. If you have a lot of little pieces (as I do), it will take a while...
Re: NOCOW and Swap Files?
On 2014-10-22 16:08, Robert White wrote: So the documentation is clear that you can't use a swap file on BTRFS (unless you use a loop device). Why is a NOCOW file that has been fully pre-allocated -- as with fallocate(1) -- not suitable for swapping? I found one reference to an unimplemented feature necessary for swap, but wouldn't it be reasonable for that feature to exist for NOCOW files? (or does this relate to my previous questions about the COW operation that happens after a snapshot?) I actually use a swapfile on BTRFS on a regular basis on my laptop (trying to keep the number of partitions to a minimum, because I dual-boot Windows), and here's what the init script I use for it does:
1. Remove any old swap file (the fs is on an SSD, so I do this mostly to get the discard operation).
2. Use touch to create a new file.
3. Use chattr to mark the file NOCOW.
4. Use fallocate to pre-allocate the space for the file.
5. Bind the file to a loop device.
6. Format as swap and add as swapspace.
This works very reliably for me, and the overhead of the loop device is relatively insignificant (because my disk is actually faster than my RAM) for my use case, and I can safely balance/defrag/fstrim the filesystem without causing issues with the swap file. If you can avoid using a swapfile, though, I would suggest doing so, regardless of which FS you are using. I actually use a 4-disk RAID-0 LVM volume as swap on my desktop, and it gets noticeably better performance than using a swap file.
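The six steps above, as a runnable sketch. The path, size, and loop-device name are assumptions, and by default it only prints the commands; set DRYRUN=0 and run it as root on a btrfs mount to actually execute them:

```shell
#!/bin/sh
# Recreate a NOCOW swap file on btrfs, backed by a loop device (sketch).
set -e
SWAPFILE="${SWAPFILE:-/var/swap/swapfile}"   # assumed location on a btrfs fs
SIZE="${SIZE:-4G}"
run() {                                      # dry-run wrapper: print, maybe execute
    echo "+ $*"
    [ "${DRYRUN:-1}" = 0 ] && "$@" || true
}
run swapoff "$SWAPFILE"                      # a real script should tolerate failure here
run rm -f "$SWAPFILE"                        # 1: drop the old file (discard on SSD)
run touch "$SWAPFILE"                        # 2: create a new, empty file
run chattr +C "$SWAPFILE"                    # 3: mark NOCOW while still zero-length
run fallocate -l "$SIZE" "$SWAPFILE"         # 4: preallocate the space
run losetup /dev/loop7 "$SWAPFILE"           # 5: bind to a loop device (assumed free)
run mkswap /dev/loop7                        # 6: format as swap...
run swapon /dev/loop7                        #    ...and enable it
```

Note that chattr +C only takes effect on an empty file, which is why step 3 has to precede the fallocate.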
Re: device balance times
On 2014-10-23 05:19, Miao Xie wrote: On Wed, 22 Oct 2014 14:40:47 +0200, Piotr Pawłow wrote: On 22.10.2014 03:43, Chris Murphy wrote: On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote: Looks normal to me. Last time I started a balance after adding a 6th device to my FS, it took 4 days to move 25GBs of data. It's long-term untenable. At some point it must be fixed. It's way, way slower than md raid. At a certain point it needs to fall back to block-level copying, with a ~32KB block. It can't be treating things as if they're 1K files, doing file-level copying that takes forever. It's just too risky that another device fails in the meantime. There's device replace for restoring redundancy, which is fast, but not implemented yet for RAID5/6. Now my colleague and I are implementing scrub/replace for RAID5/6, and I have a plan to reimplement balance and split it off from the metadata/file data process. The main idea is:
- allocate a new chunk which has the same size as the relocated one, but don't insert it into the block group list, so we don't allocate free space from it
- set the source chunk to be read-only
- copy the data from the source chunk to the new chunk
- replace the extent map of the source chunk with the one of the new chunk (the new chunk has the same logical address and length as the old one)
- release the source chunk
This way, we needn't process the data extent by extent, and needn't do any space reservation, so it will be very fast even if we have lots of snapshots. Even if balance gets re-implemented this way, we should still provide some way to consolidate the data from multiple partially full chunks. Maybe keep the old balance path and have some option (maybe call it aggressive?) that turns it on instead of the new code.
Re: Heavy nocow'd VM image fragmentation
On 2014-10-26 13:20, Larkin Lowrey wrote: On 10/24/2014 10:28 PM, Duncan wrote: Robert White posted on Fri, 24 Oct 2014 19:41:32 -0700 as excerpted: On 10/24/2014 04:49 AM, Marc MERLIN wrote: On Thu, Oct 23, 2014 at 06:04:43PM -0500, Larkin Lowrey wrote: I have a 240GB VirtualBox vdi image that is showing heavy fragmentation (filefrag). The file was created in a dir that was chattr +C'd, the file was created via fallocate, and the contents of the original image were copied into the file via dd. I verified that the image was +C. To be honest, I have the same problem, and it's vexing: If I understand correctly, when you take a snapshot the file goes into what I call 1COW mode. Yes, but the OP said he hadn't snapshotted since creating the file, and MM's a regular who actually wrote much of the wiki documentation on raid56 modes, so he'd better know about the snapshotting problem too. So that can't be it. There's apparently a bug in some recent code, and it's not honoring the NOCOW even in normal operation, when it should be. (FWIW I'm not running any VMs or large DBs here, so don't have nocow set on anything and can and do use autodefrag on all my btrfs. So I can't say one way or the other, personally.) Correct, there were no snapshots during VM usage when the fragmentation occurred. One unusual property of my setup is that I have my fs on top of bcache. More specifically, the stack is md raid6 -> bcache -> lvm -> btrfs. When the fs mounts it has mount option 'ssd' due to the fact that bcache sets /sys/block/bcache0/queue/rotational to 0. Is there any reason why either the 'ssd' mount option or being backed by bcache could be responsible? Two things: First, regarding your question, the ssd mount option shouldn't be responsible for this, because it is supposed to spread out allocation only at the chunk level, not the block level, but some recent commit may have changed that. Are you using any kind of compression in btrfs?
If so, then filefrag won't report the number of fragments correctly (it currently reports the number of compressed blocks in the file instead), and in fact, if you are using compression in btrfs, I would expect the number of compressed blocks to go up as you use more space in the VM image: long runs of zero bytes compress well, other stuff (especially on-disk structures from encapsulated filesystems) doesn't. You might consider putting the vm images directly on the LVM layer instead; that tends to get much better performance in my experience than storing them on a filesystem. Secondly, I'd recommend switching from using bcache under LVM to using dm-cache on top of LVM, as it makes it much easier to recover from the various failure modes, and also to deal with a corrupted cache, due to the fact that dm-cache doesn't put any metadata on the backing device. It takes longer to shut down when in write-back mode, and isn't SSD-optimized, but it has also been much more reliable in my experience.
Re: btrfs deduplication and linux cache management
On 2014-10-30 05:26, lu...@plaintext.sk wrote: Hi, I want to ask whether deduplicated file content will be cached in the Linux kernel just once for two deduplicated files. To explain in depth:
- I use btrfs for the whole system, with a few subvolumes and compression on some subvolumes.
- I have two directories with the Eclipse SDK with slight differences (same version, different config).
- I assume that the given directories are deduplicated, so the two Eclipse installations take up space on the hdd as one would (as a rough estimation).
- I will start one of the given Eclipse installations.
- The Linux kernel will cache all files opened during the start of Eclipse (I have enough free RAM).
- I am just a happy stupid Linux user: 1. will the kernel cache file content after decompression? (I think yes) 2. will the cached data be in the VFS layer or in the block device layer?
- When I launch the second Eclipse (different from the first, but deduplicated from it) after the first one: 1. will the second start require less data to be read from the HDD? 2. will the metadata for the second instance be read from the hdd? (I assume yes) 3. will the actual data be read a second time? (I hope not)
Thanks for answers, have a nice day. I don't know for certain, but here is how I understand things work in this case: 1. Individual blocks are cached in the block device layer, which means that the deduplicated data would only be cached at most as many times as there are disks it is on (i.e. at most once for a single-device filesystem, up to twice for a multi-device btrfs raid1 setup). 2. In the VFS layer, the cache handles decoded inodes (the actual file metadata), dentries (the file's entry in the parent directory), and individual pages of file content (after decompression). AFAIK, the VFS layer's cache is pathname based, so that would probably cache two copies of the data, but after the metadata look-up, it wouldn't need to read from the disk because of the block layer cache.
Overall, this means that while deduplicated data may be cached more than once, it shouldn't need to be reread from disk as long as a copy is still in the cache. Metadata may or may not need to be read from disk, depending on what is in the VFS cache.
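One way to see the extent sharing that makes the block-layer cache hit possible is to compare the physical offsets reported by filefrag. This is only a sketch: it assumes a btrfs mount at /mnt/btrfs and uses a reflink copy as a stand-in for deduplicated files (the sharing mechanism on disk is the same):

```shell
# Hedged sketch: verify extent sharing between two files on btrfs.
# /mnt/btrfs is an assumed mount point; run where you have write access.
cd /mnt/btrfs
dd if=/dev/urandom of=a.bin bs=1M count=8 2>/dev/null
cp --reflink=always a.bin b.bin   # b.bin shares a.bin's extents

sync          # flush delayed allocation so filefrag sees real extents
filefrag -v a.bin
filefrag -v b.bin
# Matching physical_offset columns in the two listings mean the files
# share on-disk blocks, so the block layer only caches them once.
```

If the physical offsets differ, the files are not sharing extents and each copy will occupy its own space in the block-layer cache.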
Re: scrub implies failing drive - smartctl blissfully unaware
On 2014-11-18 02:29, Brendan Hide wrote: Hey, guys. See further below extracted output from a daily scrub showing csum errors on sdb, part of a raid1 btrfs. Looking back, it has been getting errors like this for a few days now. The disk is patently unreliable, but smartctl's output implies there are no issues. Is this somehow standard fare for S.M.A.R.T. output? Here are (I think) the important bits of the smartctl output for $(smartctl -a /dev/sdb) (the full results are attached):

ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f  100   253   006    Pre-fail Always  -           0
  5 Reallocated_Sector_Ct   0x0033  100   100   036    Pre-fail Always  -           1
  7 Seek_Error_Rate         0x000f  086   060   030    Pre-fail Always  -           440801014
197 Current_Pending_Sector  0x0012  100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0010  100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x003e  200   200   000    Old_age  Always  -           0
200 Multi_Zone_Error_Rate   0x      100   253   000    Old_age  Offline -           0
202 Data_Address_Mark_Errs  0x0032  100   253   000    Old_age  Always  -           0

Original Message
Subject: Cron root@watricky /usr/local/sbin/btrfs-scrub-all
Date: Tue, 18 Nov 2014 04:19:12 +0200
From: (Cron Daemon) root@watricky
To: brendan@watricky

WARNING: errors detected during scrubbing, corrected.
[snip]
scrub device /dev/sdb2 (id 2) done
        scrub started at Tue Nov 18 03:22:58 2014 and finished after 2682 seconds
        total bytes scrubbed: 189.49GiB with 5420 errors
        error details: read=5 csum=5415
        corrected errors: 5420, uncorrectable errors: 0, unverified errors: 164
[snip]

In addition to the storage controller being a possibility, as mentioned in another reply, there are some parts of the drive that aren't covered by SMART attributes on most disks, most notably the on-drive cache. There really isn't a way to disable the read cache on the drive, but you can disable write caching, which may improve things (and if it's a cheap disk, may provide better reliability for BTRFS as well).
The other thing I would suggest trying is a different data cable to the drive itself. I've had issues with some SATA cables (particularly the cheap red ones that come in the retail packaging of some hard disks) having either bad connectors or bad strain reliefs, and failing after only a few hundred hours of use.
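As a concrete starting point, the write-cache toggle and a follow-up scrub could look like this (the device name and mount point are assumptions; hdparm -W writes a setting to the drive, so double-check the target device before running it):

```shell
# Hedged sketch: disable on-drive write caching, then re-verify with a scrub.
# /dev/sdb and /mnt are illustrative; substitute your own device and mount.

hdparm -W /dev/sdb     # query the current write-cache state first
hdparm -W 0 /dev/sdb   # disable write caching (many drives reset this on
                       # power cycle, so re-apply it from a boot script)

# Then re-run a scrub to see whether the csum errors keep appearing:
btrfs scrub start -Bd /mnt   # -B: wait for completion, -d: per-device stats
```

If the error counts keep climbing with write caching off and a known-good cable, that points back at the drive electronics or the controller rather than the cache.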
Re: btrfs send and an existing backup
On 2014-11-20 09:10, Duncan wrote: Bardur Arantsson posted on Thu, 20 Nov 2014 14:17:52 +0100 as excerpted: If you have no other backups, I would really recommend that you *don't* use btrfs for your backup, or at least have a *third* backup which isn't on btrfs -- there are *still* problems with btrfs that can potentially wreck your backup filesystem. (Although it's obviously less likely if the external HDD will only be connected occasionally.) Don't get me wrong, btrfs is becoming more and more stable, but I wouldn't trust it with my *only* backup, especially if also running btrfs on the backed-up filesystem. This. My working versions and first backups are btrfs. My secondary backups are reiserfs (my old filesystem of choice, which has been very reliable for me), just in case both the btrfs versions bite the dust due to a bug in btrfs itself. Likewise, except I use compressed, encrypted tarballs stored on both Amazon S3 and Dropbox.
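A minimal sketch of the "compressed, encrypted tarball" approach mentioned above (the paths, filename, and passphrase are illustrative assumptions, not anyone's actual setup; a real deployment would keep the passphrase in a protected file, not on the command line):

```shell
# Hedged sketch: build a compressed, symmetrically encrypted tarball
# suitable for pushing to S3, Dropbox, or similar.
set -e

SRC=$(mktemp -d)                 # stand-in for the tree being backed up
echo "important data" > "$SRC/notes.txt"
OUT=$(mktemp -d)                 # stand-in for a staging area before upload
STAMP=$(date +%Y%m%d)

# Compress and encrypt in one pipeline; gpg -c does symmetric encryption.
tar czf - -C "$SRC" . | gpg --batch --yes --pinentry-mode loopback \
    --passphrase demo-passphrase -c -o "$OUT/backup-$STAMP.tar.gz.gpg"

ls -l "$OUT"
# The resulting file can then be copied up with the tool of your choice
# (e.g. aws s3 cp, rclone), keeping the filesystem entirely out of the
# backup's failure domain.
```

The advantage over a filesystem-level backup is exactly the one made in the thread: a bug in the source filesystem cannot corrupt an opaque encrypted archive sitting on someone else's storage.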