Re: 3.14.0rc3: did not find backref in send_root
On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:
> I got this during a btrfs send:
> BTRFS error (device dm-2): did not find backref in send_root. inode=22672, offset=524288, disk_byte=1490517954560 found extent=1490517954560
> I'll try a scrub when I've finished my backup, but is there anything I can run on the file I've found from the inode?
> gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve -v 22672 file.mp3
> ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
> file.mp3

I've just seen this error:

BTRFS error (device sda4): did not find backref in send_root. inode=411890, offset=307200, disk_byte=48100618240 found extent=48100618240

during a send between two snapshots I have, after moving to 3.14.2. I've seen it on two filesystems now since moving to 3.14. I have the two read-only snapshots if there is anything helpful I can figure out from them. Scrub reports no errors, but I don't seem to be able to back up anything now.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/2] Kernel space btrfs missing device detection.
The original btrfs code will not detect any missing device, since there is no notification mechanism for the fs layer to learn about device removal in the block layer. However, we don't really need to notify the fs layer on device removal: probing at dev_info/rm_dev ioctl time is good enough, since those are the only two ioctls that care about missing devices. This patchset does ioctl-time missing device detection and returns the missing status in the dev_info ioctl, using a new member of btrfs_ioctl_dev_info_args added in a backward-compatible way.

Cc: Anand Jain anand.j...@oracle.com

Qu Wenruo (2):
  btrfs: Add missing device check in dev_info/rm_dev ioctl
  btrfs: Add new member of btrfs_ioctl_dev_info_args.

 fs/btrfs/ioctl.c           |  4 ++++
 fs/btrfs/volumes.c         | 25 ++++++++++++++++++++++++-
 fs/btrfs/volumes.h         |  2 ++
 include/uapi/linux/btrfs.h |  5 ++++-
 4 files changed, 34 insertions(+), 2 deletions(-)

-- 
1.9.2
[RFC PATCH 2/2] btrfs-progs: Add userspace support for kernel missing dev detection.
Add userspace support for the kernel missing device detection from the dev_info ioctl. Now 'btrfs fi show' auto-detects the output format of the dev_info ioctl and uses kernel missing device detection if supported. The userspace missing device detection is kept as a fallback method; when it is used, an info message is printed noting that 'btrfs dev del missing' will not work.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-filesystem.c | 29 +++++++++++++++++++++-------
 utils.c           |  2 ++
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 306f715..0ff1ca6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -369,6 +369,7 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 	char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
 	struct btrfs_ioctl_dev_info_args *tmp_dev_info;
 	int ret;
+	int new_flag = 0;
 
 	ret = add_seen_fsid(fs_info->fsid);
 	if (ret == -EEXIST)
@@ -389,13 +390,22 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 	for (i = 0; i < fs_info->num_devices; i++) {
 		tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)&dev_info[i];
-		/* Add check for missing devices even mounted */
-		fd = open((char *)tmp_dev_info->path, O_RDONLY);
-		if (fd < 0) {
-			missing = 1;
-			continue;
+		new_flag = tmp_dev_info->flags & BTRFS_IOCTL_DEV_INFO_FLAG_SET;
+		if (!new_flag) {
+			/* Add check for missing devices even mounted */
+			fd = open((char *)tmp_dev_info->path, O_RDONLY);
+			if (fd < 0) {
+				missing = 1;
+				continue;
+			}
+			close(fd);
+		} else {
+			if (tmp_dev_info->flags &
+			    BTRFS_IOCTL_DEV_INFO_MISSING) {
+				missing = 1;
+				continue;
+			}
 		}
-		close(fd);
 		printf("\tdevid %4llu size %s used %s path %s\n",
 		       tmp_dev_info->devid,
 		       pretty_size(tmp_dev_info->total_bytes),
@@ -403,8 +413,13 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 		       tmp_dev_info->path);
 	}
 
-	if (missing)
+	if (missing) {
 		printf("\t*** Some devices missing\n");
+		if (!new_flag) {
+			printf("\tOlder kernel detected\n");
+			printf("\t'btrfs dev delete missing' may not work\n");
+		}
+	}
 
 	printf("\n");
 	return 0;
 }
diff --git a/utils.c b/utils.c
index 3e9c527..230471f 100644
--- a/utils.c
+++ b/utils.c
@@ -1670,6 +1670,8 @@ int get_device_info(int fd, u64 devid,
 
 	di_args->devid = devid;
 	memset(&di_args->uuid, '\0', sizeof(di_args->uuid));
+	/* Clear flags to ensure an old kernel returns untouched flags */
+	memset(&di_args->flags, 0, sizeof(di_args->flags));
 
 	ret = ioctl(fd, BTRFS_IOC_DEV_INFO, di_args);
 	return ret ? -errno : 0;
-- 
1.9.2
[RFC PATCH 2/2] btrfs: Add new member of btrfs_ioctl_dev_info_args.
Add a flags member to btrfs_ioctl_dev_info_args to report missing btrfs devices. The new member is added in the original padding area, so the ioctl ABI is not affected, but user headers need to be updated.

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c           | 3 +++
 include/uapi/linux/btrfs.h | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7680a40..1920f24 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2610,6 +2610,9 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	di_args->devid = dev->devid;
 	di_args->bytes_used = dev->bytes_used;
 	di_args->total_bytes = dev->total_bytes;
+	di_args->flags = BTRFS_IOCTL_DEV_INFO_FLAG_SET;
+	if (dev->missing)
+		di_args->flags |= BTRFS_IOCTL_DEV_INFO_MISSING;
 	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
 	if (dev->name) {
 		struct rcu_string *name;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d6909..5eb1f03 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -168,12 +168,15 @@ struct btrfs_ioctl_dev_replace_args {
 	__u64 spare[64];
 };
 
+#define BTRFS_IOCTL_DEV_INFO_MISSING	(1ULL << 0)
+#define BTRFS_IOCTL_DEV_INFO_FLAG_SET	(1ULL << 63)
 struct btrfs_ioctl_dev_info_args {
 	__u64 devid;				/* in/out */
 	__u8 uuid[BTRFS_UUID_SIZE];		/* in/out */
 	__u64 bytes_used;			/* out */
 	__u64 total_bytes;			/* out */
-	__u64 unused[379];			/* pad to 4k */
+	__u64 flags;				/* out */
+	__u64 unused[378];			/* pad to 4k */
 	__u8 path[BTRFS_DEVICE_PATH_NAME_MAX];	/* out */
 };
-- 
1.9.2
[RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
Old btrfs can't find a missing btrfs device, since there is no mechanism for the block layer to inform the fs layer. But we can use a workaround: check the status of every device in a btrfs filesystem (by using request_queue->queue_flags) only when the dev_info/rm_dev ioctls are called, since the other ioctls do not really care about missing devices.

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c   |  1 +
 fs/btrfs/volumes.c | 25 ++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0401397..7680a40 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2606,6 +2606,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 		goto out;
 	}
 
+	btrfs_check_dev_missing(root, dev, 1);
 	di_args->devid = dev->devid;
 	di_args->bytes_used = dev->bytes_used;
 	di_args->total_bytes = dev->total_bytes;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d241130a..c7d7908 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1548,9 +1548,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	 * is held.
 	 */
 	list_for_each_entry(tmp, devices, dev_list) {
+		btrfs_check_dev_missing(root, tmp, 0);
 		if (tmp->in_fs_metadata &&
 		    !tmp->is_tgtdev_for_dev_replace &&
-		    !tmp->bdev) {
+		    (!tmp->bdev || tmp->missing)) {
 			device = tmp;
 			break;
 		}
@@ -6300,3 +6301,25 @@ int btrfs_scratch_superblock(struct btrfs_device *device)
 
 	return 0;
 }
+
+/* If need_lock is set, uuid_mutex will be used */
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock)
+{
+	struct request_queue *q;
+
+	if (unlikely(!dev || !dev->bdev || !dev->bdev->bd_queue))
+		return -ENOENT;
+	q = dev->bdev->bd_queue;
+
+	if (need_lock)
+		mutex_lock(&uuid_mutex);
+	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+	    test_bit(QUEUE_FLAG_DYING, &q->queue_flags)) {
+		dev->missing = 1;
+		root->fs_info->fs_devices->missing_devices++;
+	}
+	if (need_lock)
+		mutex_unlock(&uuid_mutex);
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 80754f9..47a44af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -356,6 +356,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 				struct btrfs_root *extent_root,
 				u64 chunk_offset, u64 chunk_size);
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock);
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev, int index)
 {
-- 
1.9.2
[RFC PATCH 1/2] btrfs-progs: Follow kernel changes to add new member of btrfs_ioctl_dev_info_args.
Follow the kernel header changes to add the new member of btrfs_ioctl_dev_info_args. This change uses a special bit to keep backward compatibility, so even on old kernels it will not screw anything up.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 ioctl.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/ioctl.h b/ioctl.h
index 9627e8d..672a3a3 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -156,12 +156,15 @@ struct btrfs_ioctl_dev_replace_args {
 	__u64 spare[64];
 };
 
+#define BTRFS_IOCTL_DEV_INFO_MISSING	(1ULL << 0)
+#define BTRFS_IOCTL_DEV_INFO_FLAG_SET	(1ULL << 63)
 struct btrfs_ioctl_dev_info_args {
 	__u64 devid;				/* in/out */
 	__u8 uuid[BTRFS_UUID_SIZE];		/* in/out */
 	__u64 bytes_used;			/* out */
 	__u64 total_bytes;			/* out */
-	__u64 unused[379];			/* pad to 4k */
+	__u64 flags;				/* out */
+	__u64 unused[378];			/* pad to 4k */
 	__u8 path[BTRFS_DEVICE_PATH_NAME_MAX];	/* out */
 };
-- 
1.9.2
Re: 3.14.0rc3: did not find backref in send_root
On 05/06/2014 08:10 AM, David Brown wrote:
> On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:
>> I got this during a btrfs send:
>> BTRFS error (device dm-2): did not find backref in send_root. inode=22672, offset=524288, disk_byte=1490517954560 found extent=1490517954560
>> I'll try a scrub when I've finished my backup, but is there anything I can run on the file I've found from the inode?
>> gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve -v 22672 file.mp3
>> ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
>> file.mp3
>
> I've just seen this error:
>
> BTRFS error (device sda4): did not find backref in send_root. inode=411890, offset=307200, disk_byte=48100618240 found extent=48100618240
>
> during a send between two snapshots I have, after moving to 3.14.2. I've seen it on two filesystems now since moving to 3.14. I have the two read-only snapshots if there is anything helpful I can figure out from them. Scrub reports no errors, but I don't seem to be able to back up anything now.
>
> David

I am also seeing this on 3.14.1 (on Arch Linux). Scrub also reports no errors, and I could not do a full send either. Balancing made it better for a while (I was able to send a full snapshot of one subvolume, but not another), but it did not help. Offline repairing the fs with btrfsck --repair also did not affect it.

Blaz
Re: btrfs on software RAID0
just one last doubt: why do you use --align-payload=1024? (or 8192)

The cryptsetup man page says that the default for the payload alignment is 2048 (512-byte sectors), so it's already aligned by default to 4K-byte physical sectors (if that was your concern). Am I missing something?

John

On Mon, May 5, 2014 at 11:25 PM, Marc MERLIN m...@merlins.org wrote:
> On Mon, May 05, 2014 at 10:51:46PM +0200, john terragon wrote:
>> Hi. I'm about to try btrfs on a RAID0 md device (to be precise, there will be dm-crypt in between the md device and btrfs). If I used ext4 I would set the stride and stripe_width extended options. Is there anything similar I should be doing with mkfs.btrfs? Or maybe some mount options beneficial to this kind of setting.
>
> This is not directly an answer to your question; so far I haven't used a special option like this with btrfs on my arrays, although my understanding is that it's not as important as with ext4.
>
> That said, please read
> http://marc.merlins.org/perso/btrfs/post_2014-04-27_Btrfs-Multi-Device-Dmcrypt.html
>
> 1) use align-payload=1024 on cryptsetup instead of something bigger like 8192. This will reduce write amplification (if you're not on an SSD).
>
> 2) you don't need md0 in the middle: crypt each device and then use the btrfs built-in raid0, which will be faster (and is stable, at least as far as we know :) ). Then use /etc/crypttab or a script like this
> http://marc.merlins.org/linux/scripts/start-btrfs-dmcrypt
> to decrypt all your devices in one swoop and mount btrfs.
>
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
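For what it's worth, --align-payload is expressed in 512-byte sectors, so the values being discussed work out like this (a quick sanity check of the arithmetic only, not a cryptsetup invocation):

```python
SECTOR = 512  # cryptsetup's --align-payload unit: 512-byte sectors

def payload_offset_bytes(align_payload_sectors: int) -> int:
    """Byte offset/alignment of the encrypted payload for a given
    --align-payload value."""
    return align_payload_sectors * SECTOR

# Marc's suggestion: --align-payload=1024 -> 512 KiB alignment
print(payload_offset_bytes(1024))           # 524288
# cryptsetup's default of 2048 sectors is 1 MiB, already a
# multiple of a 4 KiB physical sector
print(payload_offset_bytes(2048) % 4096)    # 0
```

So both 1024 and the default 2048 are 4K-aligned; the difference is only how large the alignment granularity is.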
[PATCH] xfstests: add regression test for inode cache vs tree log
This patch adds a regression test to verify that btrfs cannot reuse an inode id until the transaction has been committed. This was addressed by the following kernel patch:

    Btrfs: fix inode cache vs tree log

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 tests/btrfs/049     | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/049.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 111 insertions(+)
 create mode 100644 tests/btrfs/049
 create mode 100644 tests/btrfs/049.out

diff --git a/tests/btrfs/049 b/tests/btrfs/049
new file mode 100644
index 000..3101d09
--- /dev/null
+++ b/tests/btrfs/049
@@ -0,0 +1,109 @@
+#! /bin/bash
+# FS QA Test No. btrfs/049
+#
+# Regression test for btrfs inode caching vs tree log which was
+# addressed by the following kernel patch.
+#
+# Btrfs: fix inode caching vs tree log
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -rf $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
+MOUNT_OPTIONS="$MOUNT_OPTIONS -o inode_cache,commit=100"
+
+# create a basic flakey device that will never error out
+_init_flakey
+_mount_flakey
+
+_get_inode_id()
+{
+	local inode_id
+	inode_id=`stat $1 | grep Inode: | $AWK_PROG '{print $4}'`
+	echo $inode_id
+}
+
+$XFS_IO_PROG -f -c "pwrite 0 10M" -c "fsync" \
+	$SCRATCH_MNT/data > /dev/null
+
+inode_id=`_get_inode_id $SCRATCH_MNT/data`
+rm -f $SCRATCH_MNT/data
+
+for i in `seq 1 5`;
+do
+	mkdir $SCRATCH_MNT/dir_$i
+	new_inode_id=`_get_inode_id $SCRATCH_MNT/dir_$i`
+	if [ $new_inode_id -eq $inode_id ]
+	then
+		$XFS_IO_PROG -f -c "pwrite 0 1M" -c "fsync" \
+			$SCRATCH_MNT/dir_$i/data1 > /dev/null
+		_load_flakey_table 1
+		_unmount_flakey
+		need_umount=1
+		break
+	fi
+	sleep 1
+done
+
+# restore previous mount options
+export MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
+
+# ok mount so that any recovery that needs to happen is done
+if [ $new_inode_id -eq $inode_id ];then
+	_load_flakey_table $FLAKEY_ALLOW_WRITES
+	_mount_flakey
+	_unmount_flakey
+fi
+
+# make sure we got a valid fs after replay
+_check_scratch_fs $FLAKEY_DEV
+
+status=0
+exit
diff --git a/tests/btrfs/049.out b/tests/btrfs/049.out
new file mode 100644
index 000..cb0061b
--- /dev/null
+++ b/tests/btrfs/049.out
@@ -0,0 +1 @@
+QA output created by 049
diff --git a/tests/btrfs/group b/tests/btrfs/group
index af60c79..59b0c98 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -51,3 +51,4 @@
 046 auto quick
 047 auto quick
 048 auto quick
+049 auto quick
-- 
1.8.2.1
Scrub status: no stats available
Dear list,

I am running btrfs on Arch Linux ARM (Linux 3.14.2, Btrfs v3.14.1). I can run scrub without errors, but I never get stats from scrub status. What I get is:

btrfs scrub status /pools/dataPool
scrub status for b5f082e2-2ce0-4f91-b54b-c2d26185a635
	no stats available
	total bytes scrubbed: 694.13GiB with 0 errors

Please mind the line "no stats available". Where can I start digging?

Thank you,
Wolfgang
Btrfs raid allocator
Hello all!

I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is: how does the block allocator decide which device to write to? Can this decision be dynamic, and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space; is this still true?

Cheers
Hendrik
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?

For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.

There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups.

I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.

I would recommend thoroughly benchmarking your application with the FS first, though, just to see how it's going to behave for you.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Ceci n'est pas une pipe: | ---

signature.asc
Description: Digital signature
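To make the 64 KiB striping concrete, here is a small sketch of the arithmetic Hugo describes: which device and chunk-internal offset a byte offset within a RAID-0 block group lands on, assuming simple round-robin striping across N devices (an illustration only, not btrfs code):

```python
STRIPE_LEN = 64 * 1024  # 64 KiB stripe length, per Hugo's description

def raid0_map(offset: int, num_devices: int):
    """Map a byte offset within a RAID-0 block group to
    (device index, byte offset within that device's chunk)."""
    stripe_nr = offset // STRIPE_LEN          # which 64 KiB stripe overall
    device = stripe_nr % num_devices          # stripes rotate across devices
    dev_stripe = stripe_nr // num_devices     # full stripes already on that device
    return device, dev_stripe * STRIPE_LEN + offset % STRIPE_LEN

# With 3 devices: bytes 0..64K-1 land on device 0, 64K..128K-1 on device 1, ...
print(raid0_map(0, 3))                 # (0, 0)
print(raid0_map(65536, 3))             # (1, 0)
print(raid0_map(3 * 65536 + 100, 3))   # (0, 65636)
```

The consequence Hendrik asks about falls straight out of this: every device gets an equal share of stripes, so a write can only complete as fast as the slowest device.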
Re: Scrub status: no stats available
On Tue, May 06, 2014 at 11:52:58AM +0200, Wolfgang Mader wrote:
> Dear list,
> I am running btrfs on Arch Linux ARM (Linux 3.14.2, Btrfs v3.14.1). I can run scrub w/o errors, but I never get stats from scrub status. What I get is
> btrfs scrub status /pools/dataPool
> scrub status for b5f082e2-2ce0-4f91-b54b-c2d26185a635
> 	no stats available
> 	total bytes scrubbed: 694.13GiB with 0 errors
> Please mind the line "no stats available". Where can I start digging?

Here:

legolas:~# l /var/lib/btrfs/
total 16
drwxr-xr-x 1 root root  494 May  6 04:08 ./
drwxr-xr-x 1 root root 1360 Apr 27 22:15 ../
srwxr-xr-x 1 root root    0 May  6 03:48 scrub.progress.4850ee22-bf32-4131-a841-02abdb4a5ba6=
-rw------- 1 root root  428 May  6 04:08 scrub.status.4850ee22-bf32-4131-a841-02abdb4a5ba6
-rw------- 1 root root  427 May  5 05:04 scrub.status.6afd4707-876c-46d6-9de2-21c4085b7bed
-rw------- 1 root root  418 Jan 11  2013 scrub.status.92584fa9-85cd-4df6-b182-d32198b76a0b
-rw------- 1 root root  420 May 17  2013 scrub.status.9f52c100-8c89-45b6-a005-3f5de1c12b38

Marc
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
> On 06.05.2014 12:59, Hugo Mills wrote:
>> On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
>>> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?
>>
>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>
> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?

Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.

>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>
> Why does prealloc change anything? For me latency does not matter, only continuous throughput!

It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.

>> I would recommend thoroughly benchmarking your application with the FS first though, just to see how it's going to behave for you.
>
> Of course - it's just that I do not yet have the hardware, but I plan to test with a small model - I just try to find out how it actually works first, so I know what to look out for.

Good luck. :)

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I am the author. You are the audience. I outrank you! ---

signature.asc
Description: Digital signature
Re: Btrfs raid allocator
On 06.05.2014 13:19, Hugo Mills wrote:
> On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
>> On 06.05.2014 12:59, Hugo Mills wrote:
>>> On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
>>>> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?
>>>
>>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>>
>> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?
>
> Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.

So striping is fixed, but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk', as chunk allocation is dynamic, correct?

>>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>>
>> Why does prealloc change anything? For me latency does not matter, only continuous throughput!
>
> It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.

Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere?

>>> I would recommend thoroughly benchmarking your application with the FS first though, just to see how it's going to behave for you.
>>
>> Of course - it's just that I do not yet have the hardware, but I plan to test with a small model - I just try to find out how it actually works first, so I know what to look out for.
>
> Good luck. :)
>
> Hugo.

Thanks!
Hendrik
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:
> On 06.05.2014 13:19, Hugo Mills wrote:
>> On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
>>> On 06.05.2014 12:59, Hugo Mills wrote:
>>>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>>>
>>> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?
>>
>> Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.
>
> So striping is fixed, but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk', as chunk allocation is dynamic, correct?

You'd have to rewrite the chunk allocator to do this, _and_ provide different RAID levels for different subvolumes. The chunk/block group allocator right now uses only one rule for allocating data, and one for allocating metadata.

Now, both of these are planned, and _might_ between them possibly cover the use-case you're talking about, but I'm not certain it's necessarily a sensible thing to do in this case.

My question is, if you actually care about the performance of this system, why are you buying some slow devices to drag the performance of your fast devices down? It seems like a recipe for disaster...

>>>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>>>
>>> Why does prealloc change anything? For me latency does not matter, only continuous throughput!
>>
>> It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.
>
> Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere?

I guess it'll depend on the approach you use to do these very large writes, and on the exact definition of very large. This is not an area I know a huge amount about.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I am the author. You are the audience. I outrank you! ---

signature.asc
Description: Digital signature
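As an aside on the prealloc suggestion: from userspace, preallocated extents are typically requested with fallocate(2). A minimal sketch (Python here for brevity; this demonstrates the generic syscall on a throwaway temp file, nothing btrfs-specific):

```python
import os
import tempfile

# Preallocate space up front so the filesystem can hand out large
# contiguous extents, instead of allocating piecemeal as data arrives.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1024 * 1024)  # reserve 1 MiB at offset 0
    size = os.fstat(fd).st_size
    print(size)  # 1048576
finally:
    os.close(fd)
    os.remove(path)
```

On btrfs this creates a prealloc extent, which is what turns a trickle of appends into the "large contiguous sections of striping" Hugo mentions.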
Re: Btrfs raid allocator
On 06.05.2014 13:46, Hugo Mills wrote: On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote: On 06.05.2014 13:19, Hugo Mills wrote: On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote: On 06.05.2014 12:59, Hugo Mills wrote: On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote: Hello all! I would like to use btrfs (or anyting else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/troughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true? For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices. So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independend of write speeds? So one slow device would slow down the whole raid? Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point. So striping is fixed but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk' as chunk allocation is dynamic, correct? You'd have to rewrite the chunk allocator to do this, _and_ provide different RAID levels for different subvolumes. The chunk/block group allocator right now uses only one rule for allocating data, and one for allocating metadata. 
Now, both of these are planned, and _might_ between them possibly cover the use-case you're talking about, but I'm not certain it's necessarily a sensible thing to do in this case. But what does the allocator currently do when one disk runs out of space? I thought those disks do not get used but we can still write data. So the mechanism is already there, it just needs to be invoked when a drive is too busy instead of too full. My question is, if you actually care about the performance of this system, why are you buying some slow devices to drag the performance of your fast devices down? It seems like a recipe for disaster... Even the speed of a single hdd varies depending on where I write the data. So actually there is not much choice :-D. I'm aware that this could be a case of overengineering. Actually my first thought was to write a simple fuse module which only handles data and puts metadata on a regular filesystem. But then I thought that it would be nice to have this in btrfs - and not just for raid0. There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping. Why does prealloc change anything? For me latency does not matter, only continuous throughput! It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes. Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere? I guess it'll depend on the approach you use to do these very large writes, and on the exact definition of very large. This is not an area I know a huge amount about. Hugo. Never mind, I'll just try it out! 
Hendrik -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
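Hugo's description of the RAID-0 allocator above — every device with free space participates in a new block group (minimum of two), and data is then striped across those chunks in 64 KiB units — can be sketched as a toy model. This is only an illustration of the rule as described in the thread, not btrfs's actual allocator; the function names are made up:

```python
def raid0_chunk_devices(free_space):
    """Pick devices for a new RAID-0 block group: every device that
    still has free space participates (minimum of two devices)."""
    devs = [d for d, free in free_space.items() if free > 0]
    if len(devs) < 2:
        raise ValueError("RAID-0 needs at least 2 devices with free space")
    return sorted(devs)

def stripe_map(devices, length, stripe=64 * 1024):
    """Map a write of `length` bytes onto 64 KiB stripes, round-robin
    across the participating devices (ceil division for the tail)."""
    return [devices[i % len(devices)] for i in range(-(-length // stripe))]

devs = raid0_chunk_devices({"sda": 10, "sdb": 5, "sdc": 0})
# sdc has no free space left, so only sda and sdb take part:
print(devs)                          # ['sda', 'sdb']
print(stripe_map(devs, 256 * 1024))  # ['sda', 'sdb', 'sda', 'sdb']
```

This also makes Hendrik's concern concrete: every stripe unit lands on every participating device in turn, so the slowest participant gates the whole write.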
Re: Please review and comment, dealing with btrfs full issues
On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% and less used, which would by necessity get about ~160GB worth chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote: Forgot this part: Also in your last example, you used -dusage=0 and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) Correct. That FS was very mis-balanced. On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote: The standard response on the mailing list for this issue is to temporarily add an additional device to the filesystem (even e.g. a 4GB USB flash drive is often enough) - this will add space to allocate a few new chunks, allowing the balance to proceed. You can remove the extra device after the balance completes. I just added that tip, thank you. On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote: Recently kernel 3.14 allowed fixing a metadata space error that seemed to be impossible to solve with 3.13. 
So it's possible that some of my other problems with a lack of metadata space could have been solved with kernel 3.14 too. Good point. I added that tip too. Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
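Brendan's arithmetic above (636GB used out of 800GB allocated, so 636/800 ≈ 79%, hence -dusage=80) can be written down as a small helper. This is a sketch of that rule of thumb, not an official tool; the function name is made up:

```python
import math

def dusage_for(used_gib, allocated_gib):
    """Suggest a -dusage percentage for `btrfs balance start`: round the
    filesystem-wide data usage ratio up to the next whole percent, so
    chunks fuller than the average are left alone."""
    return math.ceil(used_gib / allocated_gib * 100)

pct = dusage_for(636, 800)
print(pct)  # 80
print(f"btrfs balance start -dusage={pct} /mnt/btrfs_pool1")
```

The command printed at the end is the same shape as the one used in the thread; only the threshold is computed rather than guessed.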
Re: Please review and comment, dealing with btrfs full issues
Hi, Marc. Inline below. :) On 2014/05/06 02:19 PM, Marc MERLIN wrote: On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? As usual, it depends on what end-result you want. Paranoid rebalancing - always ensuring there are as many free chunks as possible - is totally unnecessary. There may be more good reasons to rebalance - but I'm only aware of two: a) to avoid ENOSPC due to running out of free chunks; and b) to change allocation type. If you want all chunks either full or empty (except for that last chunk which will be somewhere in between), -dusage=55 will get you 99% there. In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% or less used, which would by necessity get about ~160GB worth of chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. 
On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote: Forgot this part: Also in your last example, you used -dusage=0 and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) Correct. That FS was very mis-balanced. On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote: The standard response on the mailing list for this issue is to temporarily add an additional device to the filesystem (even e.g. a 4GB USB flash drive is often enough) - this will add space to allocate a few new chunks, allowing the balance to proceed. You can remove the extra device after the balance completes. I just added that tip, thank you. On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote: Recently kernel 3.14 allowed fixing a metadata space error that seemed to be impossible to solve with 3.13. So it's possible that some of my other problems with a lack of metadata space could have been solved with kernel 3.14 too. Good point. I added that tip too. Thanks, Marc -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
Re: Please review and comment, dealing with btrfs full issues
On Tue, May 06, 2014 at 06:30:31PM +0200, Brendan Hide wrote: Hi, Marc. Inline below. :) On 2014/05/06 02:19 PM, Marc MERLIN wrote: On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? As usual, it depends on what end-result you want. Paranoid rebalancing - always ensuring there are as many free chunks as possible - is totally unnecessary. There may be more good reasons to rebalance - but I'm only aware of two: a) to avoid ENOSPC due to running out of free chunks; and b) to change allocation type. c) its original reason: to redistribute the data on the FS, for example in the case of a new device being added or removed. If you want all chunks either full or empty (except for that last chunk which will be somewhere in between), -dusage=55 will get you 99% there. In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% or less used, which would by necessity get about ~160GB worth of chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. 
Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. I've actually looked into implementing a smallest=n filter that would take only the n least-full chunks (by fraction) and balance those. However, it's not entirely trivial to do efficiently with the current filtering code. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Hail and greetings. We are a flat-pack invasion force from --- Planet Ikea. We come in pieces.
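The usage-filter behaviour discussed in this thread — chunks at or below the -dusage threshold get rewritten, and at 0 only completely empty chunks qualify, which need no destination space and so can be reclaimed even on a totally full filesystem — can be modelled in a few lines. This is a simplified illustration of the filter semantics as described, not the kernel's filtering code:

```python
def balance_candidates(chunk_usage_pct, dusage):
    """Return the indices of data chunks a balance with the given
    -dusage threshold would rewrite: those whose usage percentage is
    at or below the threshold.  At dusage=0, only entirely empty
    chunks qualify - and since they hold no data, rewriting them
    needs no free destination chunk, which is why -dusage=0 can
    still succeed on a filesystem with no unallocated space."""
    return [i for i, pct in enumerate(chunk_usage_pct) if pct <= dusage]

chunks = [0, 0, 37, 82, 100]          # per-chunk usage percentages
print(balance_candidates(chunks, 0))   # [0, 1]  - the empty chunks only
print(balance_candidates(chunks, 50))  # [0, 1, 2]
```

A hypothetical smallest=n filter like the one Hugo mentions would instead sort chunks by usage fraction and take the first n, rather than applying a fixed threshold.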
Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection.
Hi, instead of extending the BTRFS_IOCTL_DEV_INFO ioctl, why not add a field under /sys/fs/btrfs/UUID/ ? Something like /sys/fs/btrfs/UUID/missing_device BR G.Baroncelli On 05/06/2014 08:33 AM, Qu Wenruo wrote: Original btrfs will not detect any missing device since there is no notification mechanism for fs layer to detect missing device in block layer. However we don't really need to notify fs layer upon dev remove, probing in dev_info/rm_dev ioctl is good enough since they are the only two ioctls caring about missing device. This patchset will do ioctl time missing dev detection and return device missing status in dev_info ioctl using a new member in btrfs_ioctl_dev_info_args with a backward compatible method. Cc: Anand Jain anand.j...@oracle.com Qu Wenruo (2): btrfs: Add missing device check in dev_info/rm_dev ioctl btrfs: Add new member of btrfs_ioctl_dev_info_args. fs/btrfs/ioctl.c | 4 fs/btrfs/volumes.c | 25 - fs/btrfs/volumes.h | 2 ++ include/uapi/linux/btrfs.h | 5 - 4 files changed, 34 insertions(+), 2 deletions(-) -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: error 2001, no inode item
Hi, I tried with a newer version of btrfs, but I'm still getting the same error. checking extents checking free space cache checking fs roots root 5 inode 5769204 errors 2001, no inode item, link count wrong unresolved ref dir 5783881 index 3 namelen 38 name 61bd2ed1fba8bc8d2f12766c7e4b3dafff6350 filetype 1 error 4, no inode ref root 5 inode 5899187 errors 2001, no inode item, link count wrong unresolved ref dir 5906761 index 3 namelen 38 name 61bd2ed1fba8bc8d2f12766c7e4b3dafff6350 filetype 1 error 0 Checking filesystem on /dev/sda4 UUID: 98190f1e-426f-433d-8335-1216b9a63d16 found 28521431809 bytes used err is 1 total csum bytes: 124070732 total tree bytes: 722415616 total fs tree bytes: 552411136 total extent tree bytes: 32673792 btree space waste bytes: 171189111 file data blocks allocated: 188149448704 referenced 126695161856 Btrfs v3.14.1+20140502 # uname -a Linux apersaud 3.14.2-25.g1474ea5-desktop #1 SMP PREEMPT Sun Apr 27 14:35:22 UTC 2014 (1474ea5) x86_64 x86_64 x86_64 GNU/Linux Any idea how I can fix the missing inode problem? Arun
Re: Copying related snapshots to another server with btrfs send/receive?
Brendan Hide posted on Sun, 04 May 2014 09:54:38 +0200 as excerpted: From the man page section on -c: You must not specify clone sources unless you guarantee that these snapshots are exactly in the same state on both sides, the sender and the receiver. It is allowed to omit the '-p parent' option when '-c clone-src' options are given, in which case 'btrfs send' will determine a suitable parent among the clone sources itself. -p does require that the sources be read-only. I suspect -c does as well. This means that it won't be so simple as you want your sources to be read-write. Probably the only way then would be to make read-only snapshots whenever you want to sync these over while also ensuring that you keep at least one read-only snapshot intact - again, much like incremental backups. I don't claim in any way to be a send/receive expert as I don't use it for my use-case at all. However... It's worth noting in the context of that manpage quote, that really the only practical way to guarantee that the snapshots are exactly the same on both sides is to have them read-only the entire time. Because the moment you make them writable on either side all bets are off as to whether something has been written, thereby killing the exact-same-state guarantee. =:^( *However*: snapshotting a read-only snapshot and making the new one writable is easy enough[1]. Just keep the originals read-only so they can be used as parents/clones, and make a second, writable snapshot of the first, to do your writable stuff in. 
--- [1] Snapshotting a snapshot: I'm getting a metaphorical flashing light saying I need to go check the wiki FAQ that deals with this again before I post, but unfortunately I can't check out why ATM as I just upgraded firefox and cairo and am currently getting a blank window where the firefox content should be, that will hopefully be gone and the content displayed after I reboot and get rid of the still loaded old libs, so unfortunately I can't check that flashing light ATM and am writing blind. Hopefully that flashing light warning isn't for something /too/ major that I'm overlooking! =:^( -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: copies= option
Hugo Mills posted on Sun, 04 May 2014 19:31:55 +0100 as excerpted: My proposal was simply a description mechanism, not an implementation. The description is N-copies, M-device-stripe, P-parity-devices (NcMsPp), and (more or less comfortably) covers at minimum all of the current and currently-proposed replication levels. There's a couple of tweaks covering description of allocation rules (DUP vs RAID-1). Thanks. That was it. =:^) But I had interpreted the discussion as a bit more concrete in terms of ultimate implementation than it apparently was. Anyway, it would indeed be nice to see an eventual implementation such that the above notation could be used with, for instance, mkfs.btrfs, and btrfs balance start -Xconvert, but regardless, that does look to be a way off. -- Duncan
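Hugo's proposed NcMsPp notation (N copies, M-device stripe, P parity devices) is compact enough that a toy parser shows how the existing profiles map onto it. This is only a sketch of the proposed notation, which was never an implementation; the defaults for omitted fields (1 copy, 1-device stripe, 0 parity) are my assumption, and the function name is made up:

```python
import re

def parse_repl(spec):
    """Parse an NcMsPp replication description into its components.
    Any field may be omitted; assumed defaults: 1 copy, 1-device
    stripe, 0 parity devices."""
    m = re.fullmatch(r"(?:(\d+)c)?(?:(\d+)s)?(?:(\d+)p)?", spec)
    if not m or spec == "":
        raise ValueError(f"bad replication spec: {spec!r}")
    copies, stripe, parity = (int(g) if g else default
                              for g, default in zip(m.groups(), (1, 1, 0)))
    return {"copies": copies, "stripe": stripe, "parity": parity}

print(parse_repl("2c"))      # RAID-1-like: {'copies': 2, 'stripe': 1, 'parity': 0}
print(parse_repl("1c3s1p"))  # RAID-5-like: {'copies': 1, 'stripe': 3, 'parity': 1}
```

The DUP-vs-RAID-1 distinction Hugo mentions (same-device vs different-device copies) is exactly the kind of allocation-rule tweak this string form does not capture on its own.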
Re: Using mount -o bind vs mount -o subvol=vol
Brendan Hide posted on Mon, 05 May 2014 08:55:55 +0200 as excerpted: You are 100% right, though. The scale is very small. By negligible, the penalty is at most a few CPU cycles. When compared to the wait time on a spindle, it really doesn't matter much. The analogy I've used before is that of taking a trip (which the data effectively is, between the device and the CPU). We've booked a 10-day cruise and are now debating what we plan on taking to and from the boarding dock. Will taking the local bus with a couple of transfers, or a taxi that will take us there in one trip but there's road construction and thus a detour, or a helicopter to fly directly, get us back from the cruise faster? Obviously, taking the helicopter (at least for the return leg) will get us back a bit faster, but we're talking perhaps a couple hours difference at the end of a 10-day cruise! =:^) -- Duncan
Re: Using mount -o bind vs mount -o subvol=vol
Marc MERLIN posted on Sat, 03 May 2014 17:47:32 -0700 as excerpted: Is there any functional difference between mount -o subvol=usr /dev/sda1 /usr and mount /dev/sda1 /mnt/btrfs_pool mount -o bind /mnt/btrfs_pool/usr /usr ? Brendan answered the primary aspect of this well so I won't deal with that. However, I've some additional (somewhat controversial) opinion/comments on the topic of subvolumes in general. TL;DR: Put simply, with certain sometimes major exceptions, IMO subvolumes are /mostly/ a solution looking for a problem. In the /general/ case, I don't see the point and personally STRONGLY prefer multiple independent partitions for their much stronger data safety and mounting/backup flexibility. That's why I use independent partitions, here. Relevant points to consider: Subvolume negatives, independent partition positives: 1) Multiple subvolumes on a common filesystem share the filesystem tree- and super-structure. If something happens to that filesystem, you had all your data eggs in that one basket and the bottom just dropped out of it! If you can't recover, kiss **ALL** those data eggs goodbye! That's the important one; the one that would prevent me sleeping well if that's the solution I had chosen to use. But there's a number of others, more practical in the binary it's not an unrecoverable failure case. 2) Presently, btrfs is rather limited in the opposing mount options it can apply to subvolumes on the same overall filesystem. Mounting just one subvolume nodatacow, for instance, without mounting all mounted subvolumes of the filesystem nodatacow isn't yet possible, tho the filesystem design allows for it and the feature is roadmapped to appear sometime in the future. This means that at present, the subvolumes solution severely limits your mount options flexibility, altho that problem should go away to a large degree at some rather handwavily defined point in the future. 
3) Filesystem size and time to complete whole-filesystem operations such as balance, scrub and check are directly related; the larger the filesystem, the longer such operations take. There are reports here of balances taking days on multi-terabyte filesystems, and double-digit hours isn't unusual at all. Of course SSDs are generally smaller and (much) faster, but still, a filesystem the size of a quarter or a half-gig SSD could easily take an hour or more to balance or scrub, and that can still be a big deal. Contrast that with the /trivial/ balance/scrub times I see on my partitioned btrfs-on-ssd setup here, some of them under a minute, even the big btrfs of 24 GiB (gentoo packages/sources/ccache filesystem) taking under three minutes (just under 7 seconds per GiB). At those times the return is fast enough I normally run the thing in foreground and wait for it to return in real-time; times trivial enough I can actually do a full filesystem rebalance in order to time it to make this point on a post! =:^) Of course the other aspect of that is that I can for instance fsck my dedicated multimedia filesystem without it interfering with running X and my ordinary work on /home. If it's all the same filesystem and I have to fsck from the initramfs or a rescue disk... Now ask yourself, how likely are you to routinely run a scrub or balance as preventive maintenance if you know it's going to take the entire day to finish? Here, the times are literally so trivial I can and do run a full filesystem rebalance to time it and make this point, and maintenance such as scrub or balance simply ceases to be an issue. I actually learned this point back on mdraid, before I switched to btrfs. When I first setup mdraid, I had only three raids, primary/working, secondary/first-backup, and the raid0 for stuff like package cache that I could simply redownload if necessary. 
But if a device dropped (as it occasionally did after a resume from hibernate, due to hardware taking too long to wake up and the kernel giving up on it), the rebuild would take HOURS! Later on, after a few layout changes, I had many more raids and kept some of them (like the one containing my distro package cache) deactivated unless I actually needed to use them (if I was actually doing an update). Since a good portion of the many more but smaller raids were offline most of the time, if a device dropped, I had far fewer and smaller raids to rebuild, and was typically back up and running in under a half hour. Filesystem maintenance time DOES make a difference! Subvolume positives, independent partition negatives: 4) Many distros are using btrfs subvolumes on a single btrfs storage pool the way they formerly used LVM volume groups, as a common storage pool allowing them the flexibility to (re)allocate space to whatever lvm volume or btrfs subvolume needs it. This is a killer feature from the viewpoint of many distros and users as the flexibility means no more hassle with guessing incorrectly
Re: How does Suse do live filesystem revert with btrfs?
Marc MERLIN posted on Sat, 03 May 2014 17:52:57 -0700 as excerpted: (more questions I'm asking myself while writing my talk slides) I know Suse uses btrfs to roll back filesystem changes. So I understand how you can take a snapshot before making a change, but not how you revert to that snapshot without rebooting or using rsync. How do you do a pivot-root like mountpoint swap to an older snapshot, especially if you have filehandles opened on the current snapshot? Is that what Suse manages, or are they doing something simpler? While I don't have any OpenSuSE specific knowledge on this, I strongly suspect their solution is more along the select-the-root-snapshot-to-roll-back-to-from-the-initramfs/initrd line. Consider, they do the snapshot, then the upgrade. In-use files won't be entirely removed and the upgrade actually activated for them until a reboot or at least an application restart[1] for all those running apps in order to free their in-use files, anyway. At that point, if the user finds something broke, they've just rebooted[1], so rebooting[1] to select the pre-upgrade rootfs snapshot won't be too big a deal, since they've already disrupted the normal high level session and have just attempted a reload in order to discover the breakage, in the first place. IOW, for the rootfs and main system, anyway, the rollback technology is a great step up from not having that snapshot to rollback to in the first place, but it's /not/ /magic/; if a rollback is needed, they almost certainly will need to reboot[1] and from there select the rootfs snapshot to rollback to, in order to mount it and accomplish that rollback. 
--- [1] Reboot: Or possibly dipped to single user mode, and/or to the initramfs, which they'd need to reload and switch-root into for the purpose, but systemd is doing just that sort of thing these days in order to properly unmount rootfs after upgrades before shutdown as it's a step safer than the old style remount read-only, and implementing a snapshot selector and remount of the rootfs in that initr* instead of dropping all the way to a full reboot is only a small step from there. -- Duncan
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 22:06:17 -0700 as excerpted: That's true, but in this case I barely see the point of -m single vs -m raid0. It sounds like they both stripe data anyway, maybe not at the same level, but if both are striped, then they're almost the same in my book :) Single only stripes in such extremely large (1 GiB data, quarter-GiB metadata, per strip) chunks that it doesn't matter for speed, and then only as a result of its chunk allocation policy. If one can define such large strips as striping, which it is in a way, but not really in the practical sense. The effect of a lost device, then, is more or less random, tho for single metadata the effect is likely to be quite large up to total loss, due to the damage to the tree. It's not out of thin air that the multi-device metadata default is raid1 (which unlike the single-device case, should be the same on SSD or spinning rust, since by definition the copies will be on different devices and thus cannot be affected by SSDs' FTL-level de-dup). So the below assumes copies=2 raid1 metadata and is thus only considering single vs. raid0 data. For single data, only files that happened to be partially allocated on the lost device will be damaged. For file sizes above the 1 GiB data chunk size, the chance of damage is therefore rather high, as by definition the file will require multiple chunks and the chances of one of them being on the lost device go up accordingly. But for file sizes significantly under 1 GiB, where data fragmentation is relatively low at least (think a recent rebalance or (auto)defrag), relatively small files are very likely to be located on a single chunk and thus either all there or all missing, depending on whether that chunk was on the missing device or not. 
That contrasts with raid0, where the striping is at sizes well under a chunk (memory page size or 4 MiB on x86/amd64 data I believe, tho the fact that files under the 16 MiB node size may actually be entirely folded into metadata and not have a data extent allocation at all skews things for up to the 16 MiB metadata node size), so the definition of small file likely to be recovered is **MUCH** smaller on raid0, than on single. Effectively, raid0 data you're only (relatively) likely to recover files smaller than 16 MiB, while single data, it's files smaller than 1 GiB. Big difference! -- Duncan
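Duncan's argument — under single, a small file usually sits inside one 1 GiB chunk on one device and so survives unless that device is the lost one, while under raid0 any file bigger than a stripe unit touches every device — can be turned into a back-of-envelope probability model. The placement assumptions here (whole chunks on one random device for single, stripe units spread over all devices for raid0) are deliberate simplifications for illustration, not btrfs's actual layout:

```python
def chance_file_intact(file_bytes, n_devices, profile):
    """Rough model of a file surviving the loss of one device out of
    n_devices.  'single' places whole 1 GiB chunks on one device each;
    'raid0' stripes in 64 KiB units across all devices.  Illustrative
    only - real allocation and fragmentation are more complicated."""
    chunk = 1 << 30    # 1 GiB data chunk
    stripe = 64 << 10  # 64 KiB stripe unit
    unit = chunk if profile == "single" else stripe
    pieces = -(-file_bytes // unit)  # ceil division
    # a striped file touches at most n_devices distinct devices
    placements = min(pieces, n_devices) if profile == "raid0" else pieces
    # the file is intact only if no piece sat on the lost device
    p_piece_safe = (n_devices - 1) / n_devices
    return p_piece_safe ** placements

# a 100 MiB file on a 4-device array:
print(round(chance_file_intact(100 << 20, 4, "single"), 2))  # 0.75
print(round(chance_file_intact(100 << 20, 4, "raid0"), 2))   # 0.32
```

The model reproduces the qualitative point: under single, anything below 1 GiB has a decent chance of surviving; under raid0, anything above a handful of stripe units is almost certainly gone.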
Re: How does Suse do live filesystem revert with btrfs?
Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted: On Mon, May 05, 2014 at 01:36:39AM +0100, Hugo Mills wrote: I'm guessing it involves reflink copies of files from the snapshot back to the original, and then restarting affected services. That's about the only other thing that I can think of, but it's got a load of race conditions in it (albeit difficult to hit in most cases, I suspect). Aaah, right, you can use a script to see the file differences between two snapshots, and then restore that with reflink if you can truly get a list of all changed files. However, that is indeed not atomic at all, even if faster than rsync. Would send/receive help in such a script? -- Duncan
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 18:27:19 -0700 as excerpted: On Sun, May 04, 2014 at 09:44:41AM +0200, Brendan Hide wrote: Ah, I see the man page now. This is because SSDs can remap blocks internally so duplicate blocks could end up in the same erase block which negates the benefits of doing metadata duplication. You can force dup but, per the man page, whether or not that is beneficial is questionable. So the reason I was confused originally was this: legolas:~# btrfs fi df /mnt/btrfs_pool1 Data, single: total=734.01GiB, used=435.39GiB System, DUP: total=8.00MiB, used=96.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=6.74GiB Metadata, single: total=8.00MiB, used=0.00 This is on my laptop with an SSD. Clearly btrfs is using duplicate metadata on an SSD, and I did not ask it to do so. Note that I'm still generally happy with the idea of duplicate metadata on an SSD even if it's not bulletproof. In regard to metadata defaulting to single rather than the (otherwise) dup on single-device ssd: 1) In order to do that, btrfs (I guess mkfs.btrfs in this case) must be able to detect that the device *IS* ssd. Depending on the SSD, the kernel version, and whether the btrfs is being created direct on bare-metal device or on some device layered (lvm or dmcrypt or whatever) on top of the bare metal, btrfs may or may not successfully detect that. Obviously in your case[1] the ssd wasn't detected. Question: Does btrfs detect ssd and automatically add it to the mount options for that btrfs? I suspect not, thus consistent behavior in not detecting the SSD. FWIW, it is detected here. I've never specifically added ssd to any of my btrfs mount options, but it's always there in /proc/self/mounts when I check.[2] I believe I've seen you mention using dmcrypt or the like, however, which probably doesn't pass whatever is used for ssd protection on thru, thus explaining btrfs not seeing it and having to specify it yourself, if you wish. 
While I'm not sure, I /think/ btrfs may use the sysfs rotational file (or rather, the same information that the kernel exports to that file) for this detection. For my bare-metal devices that's: /sys/block/sdX/queue/rotational For my ssds that file contains 0 while for spinning rust, it contains 1. The contents of that file are derived in turn from the information exported by the device. I believe the same information can be seen with hdparm -I, in the Configuration section, as Nominal Media Rotation Rate. For my spinning rust that returns an RPM value such as 7200. For my ssds it returns Solid State Device. The same information can be seen with smartctl -i, which has much shorter output so it's easier to find. Look for Rotation Rate. Again, my ssds report Solid State Device, while my spinning rust reports a value such as 7200 rpm. 2) The only reason I happen to know about the SSD metadata single-device single mode default exception (where metadata otherwise defaults to dup mode on single-device, and to raid1 mode on multi-device regardless of the media), is as a result of I believe Chris Mason commenting on it in an on-list reply. The reasoning given in that reply was not the erase-block reason I've seen someone else mention here (and which doesn't quite make sense to me, since I don't know why that would make a difference), but rather: Some SSD firmware does automatic deduplication and compression. On these devices, DUP-mode would almost certainly be stored as a single internal data block with two external address references anyway, so it would actually be single in any case, and defaulting to single (a) doesn't hide that fact, and (b) reduces overhead that's justified for safety otherwise, but if the firmware is doing an end run around that safety anyway, might as well just shortcut the overhead as well. However, while the btrfs default will apply to all (detected) ssds, not all ssds have firmware that does this internal deduplication! 
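The sysfs hint Duncan describes can be read programmatically. The sketch below is a hypothetical helper, not part of any btrfs tooling; the `sysfs_root` parameter is there only so the function can be exercised against a fake directory tree, and the `None` return models exactly the layered-device case (dm-crypt, LVM) where the hint may not be visible:

```python
from pathlib import Path

def is_ssd(device, sysfs_root="/sys/block"):
    """Report whether the kernel flags a block device as non-rotational
    (the same hint discussed above for btrfs's ssd detection).
    Returns True for an SSD, False for spinning rust, and None if the
    sysfs file can't be read at all."""
    f = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        return f.read_text().strip() == "0"
    except OSError:
        return None

# e.g. is_ssd("sda") -> True on an SSD, False on a 7200 rpm disk
```

This mirrors checking the file by hand with `cat /sys/block/sdX/queue/rotational`, as the post describes.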
In fact, the documentation for my ssds sells its LACK of such compression and deduplication as a feature, pointing out that such features tend to make the behavior of a device far less predictable[3], tho they do increase maximum speed and capacity. Which is why I've chosen to specify dup mode on my single-device btrfs here, even on ssds.[4] While it'd be the wrong choice on ssds that do compression and deduplication, on mine, it's still the right choice. =:^) If your SSDs don't do firmware-based dedup/compression, then dup metadata is still arguably the best choice on ssd. But if they do, the single metadata default does indeed make more sense, even if that's not the default you're getting due to lack of ssd detection. --- [1] Obviously ssd not detected: Assuming you didn't specify metadata level, probably a safe assumption or we'd not be having the discussion. Personally, I always make a point of specifying both data and
Re: copies= option
N-copies, M-device-stripe, P-parity-devices (NcMsPp) At the expense of being the terminology nut, who doesn't even like SNIA's chosen terminology because it's confusing, I suggest a concerted effort to either use SNIA's terms anyway, or push back and ask them to make changes before propagating deviant terminology. A strip is a consecutive run of blocks in a single extent (on a single device). Strip size is the number of blocks in a single extent (on a single device). A stripe is a set of strips, one on each member extent (on multiple devices). Stripe size is strip size times non-parity extents. e.g. the Btrfs default strip size is 64KiB, therefore a 5 disk raid5 volume stripe size is 256KiB. I use and specify size units in bytes rather than SNIA's blocks (sectors) because it's less ambiguous. In other words, for M- what we care about is the strip size, which is what md/mdadm calls a chunk. We can't know the stripe size without knowing how many non-parity member devices there are. Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
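The arithmetic in Chris's example can be spelled out as code (function and variable names are mine, not anything from btrfs or SNIA):

```python
KIB = 1024

def stripe_size(strip_size, num_devices, parity_devices=0):
    """Stripe size = strip size times the number of non-parity strips,
    per the definitions above."""
    return strip_size * (num_devices - parity_devices)

# Btrfs default 64KiB strip on a 5-disk raid5: 64KiB * (5 - 1) = 256KiB
print(stripe_size(64 * KIB, 5, parity_devices=1) // KIB)  # 256
```

The same function covers mdadm's terms too: pass the md chunk size as strip_size and parity_devices=0 for raid0, 1 for raid5, 2 for raid6.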
Re: Thoughts on RAID nomenclature
On 05/05/2014 11:17 PM, Hugo Mills wrote: [...] Does this all make sense? Are there any other options or features that we might consider for chunk allocation at this point? The kind of chunk (DATA, METADATA, MIXED) and the subvolume (when/if this possibility will come). As for how to write this information, I suggest the following options: -[DATA|METADATA|MIXED|SYSTEM:]NcMsPp[:driveslist[:/subvolume/path]] Where driveslist is an expression of the disk allocation policy: a) {sdX1:W1,sdX2:W2...} where sdX is the partition involved and W is the weight: #1 {sda:1,sdb:1,sdc:1} means spread over all the disks #2 {sda:1,sdb:2,sdc:3} means linear from sda to sdc #3 {sda:1,sdb:1,sdc:2} means spread on sda and sdb (grouped) then (when full) sdc or b) #1 (sda,sdb,sdc) means spread over all the disks #2 [sda,sdb,sdc] means linear from sda to sdc #3 [(sda,sdb),sdc] means spread on sda and sdb (grouped) then (when full) sdc or c) #1 (sda,sdb,sdc) means spread over all the disks #2 sda,sdb,sdc means linear from sda to sdc #3 (sda,sdb),sdc means spread on sda and sdb (grouped) then (when full) sdc Some examples: - 1c2s3b Default allocation policy - DATA:2c3s4b Default allocation policy for the DATA - METADATA:1c4s:(sda,sdb,sdc,sdd) Spread over all the 4 disks for metadata - MIXED:1c4s:sda,sdc,sdb,sdd Linear over the 4 disks, ordered as the list, for Data+Metadata - DATA:1c4s:(sda,sdc),(sdb,sdd) spread over sda,sdc and then, when these are filled, spread over sdb and sdd - METADATA:1c4s:(sda,sdb,sdc,sdd):/subvolume/path Spread over all the 4 disks for metadata belonging to the subvolume /subvolume/path I think it would be interesting to explore some configuration like - DATA:1c:(sda) - METADATA:2c:(sdb) if sda is bigger and sdb is faster Some further thoughts: - the more I think about an allocation policy on a per-subvolume and/or per-file basis, the more I think it would be messy to manage -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 
8B82 E0B5
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 18:27:19 -0700 as excerpted: The original reason why I was asking myself this question and trying to figure out how much better -m raid1 -d raid0 was over -m raid0 -d raid0: I think the summary is that in the first case, you're going to be able to recover all/most small files (think maildir) if you lose one device, whereas in the 2nd case, with half the metadata missing, your FS is pretty much fully gone. Fair to say that? Yes. =:^) Now, if I don't care about speed, but wouldn't mind recovering a few bits should something happen (actually in my case mostly knowing the state of the filesystem when a drive was lost so that I can see how many new files showed up since my last backup), it sounds like it wouldn't be bad to use: -m raid1 -d linear Well, assuming that by -d linear you meant -d single. Btrfs doesn't call it linear, tho at the data safety level, btrfs single is actually quite comparable to mdadm linear. =:^) (I had to check. I knew I didn't remember btrfs having linear as an option, and hadn't seen any patches float by on the list that would add it, but since I'm not a dev I don't follow patches /that/ closely, and thought I might have missed it. So I thought I better go check to see what this possible new linear option actually was, if indeed I had missed it. Turns out I didn't miss it after all; there's still no linear option that I can see, unless it's there and simply not documented. =:^) This will not give me the speed boost from raid0 which I don't care about, it will give me metadata redundancy, and due to linear, there is a decent chance that half my files are intact on the remaining drive (depending on their size apparently). Yes. =:^) So one place I use it is not for speed but for one FS that gives me more space without redundancy (rotating buffer streaming video from security cams). 
At the time I used -m raid1 -d raid0, but it sounds like for slightly extra recoverability, I should have used -m raid1 -d linear (and yes, I understand that one should not consider a -d linear recoverable when a drive went missing). That appears to be a very good use of either -d raid0 or -d single, yes. And since you're apparently not streaming such high resolution video that you NEED the raid0, single does indeed give you a somewhat better chance at recovery. Tho with streaming video I wonder what your filesizes are, as video files tend to be pretty big. If they're over the 1 GiB btrfs data chunk size, particularly if you're only running a two-device btrfs, you'd probably lose near all files anyway. Assuming single data mode and file sizes between a GiB and 2 GiB, statistically you should lose near 100% on a two device btrfs with one dropping out, 67% on a three device btrfs with a single device dropout, 50% on four devices, 40% on five devices... If file sizes are 2-3 GiB, you should lose near 100% on 2-3 devices, 75% on four devices, 60% on five, 50% on six... With raid0 data stats would be similar but I believe starting at 16 MiB with 4 MiB intervals. Due to many files under 16 MiB being stored in the metadata, you'd lose few of them, but that'd jump to 100% loss at 16 MiB until you had 5+ devices in the raid0, with 16-20 MiB file loss chance on a 5-device raid0 80%, since chances would be 80% of one strip of the stripe being on the lost device. (That's assuming my 4 MiB strip size assumption is correct, it could be smaller than that, possibly 64 KiB.) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
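Duncan's percentages fall out of a simple model: a file spanning k chunks, each allocated on a distinct device, is lost whenever any of those chunks sat on the failed device, giving a loss probability of min(1, k/d) on d devices. A sketch of that arithmetic (my formulation of the reasoning above, not anything from btrfs itself):

```python
def file_loss_probability(chunks_per_file, num_devices):
    """Chance a file is lost when one device fails, assuming its chunks
    sit on distinct devices chosen uniformly (single data profile)."""
    return min(1.0, chunks_per_file / num_devices)

# Files of 1-2 GiB span two 1 GiB data chunks:
for d in (2, 3, 4, 5):
    print(d, f"{file_loss_probability(2, d):.0%}")  # 100%, 67%, 50%, 40%
```

The 2-3 GiB figures follow the same way with chunks_per_file=3: 100% up to three devices, then 75%, 60%, 50%.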
Re: How does btrfs fi show show full?
Marc MERLIN posted on Sun, 04 May 2014 22:50:29 -0700 as excerpted: In the second FS: Label: btrfs_pool1 uuid: [...] Total devices 1 FS bytes used 442.17GiB devid1 size 865.01GiB used 751.04GiB path [...] The difference is huge between 'Total used' and 'devid used'. Is btrfs going to fix this on its own, or likely not and I'm stuck doing a full balance (without filters since I'm balancing data and not metadata)? If that helps. legolas:~# btrfs fi df /mnt/btrfs_pool1 Data, single: total=734.01GiB, used=435.29GiB System, DUP: total=8.00MiB, used=96.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=6.74GiB Metadata, single: total=8.00MiB, used=0.00 Definitely helps. The spread is in data. Try: btrfs balance start -dusage=20 /mnt/btrfs_pool1 You still have plenty of unused (if allocated) space available, so you can play around with the usage= a bit. -dusage=20 will be faster than something like -dusage=50 or -dusage=80, likely MUCH faster, but will return fewer chunks to unallocated, as well. Still, your spread between data-total and data-used is high enough, I expect -dusage=20 will give you pretty good results. Since fi show says you still have ~100 GiB unallocated, there's no real urgency, and again I'd try -dusage=20 the first time. If that doesn't cut it you can of course try bumping the usage= as needed, but because you still have 100 GiB unallocated and because the data used vs. total spread is so big, I really do think -dusage=20 will do it for you. As your actual device usage goes up the spread between used and size will go down, meaning more frequent balances to keep some reasonable unallocated space available, and you'll either need to actually delete some stuff or to bump up those usage= numbers as well, but usage=20 is very likely to be sufficient at this point. 
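The "spread" being eyeballed above is just total minus used on each btrfs fi df line; a small helper makes the arithmetic explicit (the parsing regex is mine and assumes GiB units, as in the output quoted above):

```python
import re

def chunk_slack_gib(fi_df_line):
    """Return allocated-but-unused space (GiB) from one `btrfs fi df` line."""
    m = re.search(r"total=([\d.]+)GiB, used=([\d.]+)GiB", fi_df_line)
    total, used = float(m.group(1)), float(m.group(2))
    return total - used

print(round(chunk_slack_gib("Data, single: total=734.01GiB, used=435.29GiB"), 2))
# ~298.72 GiB allocated but empty -- a good balance candidate
```

The bigger this number relative to total, the more likely a low usage= filter (like -dusage=20) will reclaim plenty of chunks cheaply.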
I hadn't seen anyone try an actual formula as Brendan suggests in his post, and I'm not actually sure that formula will apply well in all use-cases, as I think fragmentation and fill-pattern will have a lot to do with it, but based on his post it does apply for his use-case, and the same general principle if not the specific formula should apply everywhere and is what I'm doing above, only simply eyeballing it, not using a specific formula.
Re: Thoughts on RAID nomenclature
Brendan Hide posted on Mon, 05 May 2014 23:47:17 +0200 as excerpted: At the moment, we have two chunk allocation strategies: dup and spread (for want of a better word; not to be confused with the ssd_spread mount option, which is a whole different kettle of borscht). The dup allocation strategy is currently only available for 2c replication, and only on single-device filesystems. When a filesystem with dup allocation has a second device added to it, it's automatically upgraded to spread. I thought this step was manual - but okay! :) AFAIK, the /allocator/ automatically updates to spread when a second device is added. That is, assuming previous dup metadata on a single device, adding a device will cause new allocations to be in raid1/spread mode, instead of dup. What's manual, however, is that /existing/ chunk allocations don't get automatically updated. For that, a balance must be done. But existing allocations are by definition already allocated, so the chunk allocator doesn't do anything with them. (A rebalance allocates new chunks, rewriting the contents of the old chunks into the new ones before unmapping the now unused old chunks, so again, existing chunks stay where they are until unmapped, it's the NEW chunks that get mapped by the updated allocation policy.)
Re: Btrfs raid allocator
Hendrik Siedelmann posted on Tue, 06 May 2014 12:41:38 +0200 as excerpted: I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. If flexible parallelization is all you're worried about, not data integrity or the other things btrfs does, I'd suggest looking at a more mature solution such as md- or dm-raid. They're more mature and less complex than btrfs, and if you're not using the other features of btrfs anyway, they should simply work better for your use-case.
Re: Please review and comment, dealing with btrfs full issues
Brendan Hide posted on Tue, 06 May 2014 18:30:31 +0200 as excerpted: So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. I've actually used -Xusage=0 (where X=m or d, obviously) for exactly that. If every last bit of filesystem is allocated so another chunk simply cannot be written in order to rewrite partially used chunks into, BUT the spread between allocated and actually used is quite high, there's a reasonably good chance that at least one of those allocated chunks is entirely empty, and -Xusage=0 allows returning it to the unallocated pool without actually requiring a new chunk allocation to do so. With luck, that will free at least one zero-usage chunk (two for metadata dup, but it would both allocate and return to unallocated in pairs, so it balances out), allowing the user to rerun balance, this time with a higher -Xusage=. The other known valid use-case for -Xusage=0 is when freeing the extraneous zero-usage single-mode chunks first created by mkfs.btrfs as part of the mkfs process, so they don't clutter up the btrfs filesystem df output. =:^)
Re: Btrfs raid allocator
On May 6, 2014, at 4:41 AM, Hendrik Siedelmann hendrik.siedelm...@googlemail.com wrote: Hello all! I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. I think the only way to know what works best for your workload is to test configurations with the actual workload. For optimization of multiple device file systems, it's hard to beat XFS on raid0 or even linear/concat due to its parallelization, if you have more than one stream (or a stream that produces a lot of files that XFS can allocate into separate allocation groups). Also mdadm supports user-specified strip/chunk sizes, whereas currently on Btrfs this is fixed to 64KiB. Depending on the file size for your workload, it's possible a much larger strip will yield better performance. Another optimization is hardware RAID with a battery backed write cache (the drives' write caches are disabled) and using the nobarrier mount option. If your workload supports linear/concat then it's fine to use md linear for this. What I'm not sure of is if it's an OK practice to disable barriers if the system is on a UPS (rather than a battery backed hardware RAID cache). You should post the workload and hardware details on the XFS list to get suggestions about such things. They'll also likely recommend the deadline scheduler over cfq. Unless you have a workload really familiar to the responder, they'll tell you any benchmarking you do needs to approximate the actual workflow. A mismatched benchmark to the workload will lead you to the wrong conclusions. Typically when you optimize for a particular workload, other workloads suffer. Chris Murphy
Re: Btrfs raid allocator
On 06.05.2014 23:49, Chris Murphy wrote: On May 6, 2014, at 4:41 AM, Hendrik Siedelmann hendrik.siedelm...@googlemail.com wrote: Hello all! I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. I think the only way to know what works best for your workload is to test configurations with the actual workload. For optimization of multiple device file systems, it's hard to beat XFS on raid0 or even linear/concat due to its parallelization, if you have more than one stream (or a stream that produces a lot of files that XFS can allocate into separate allocation groups). Also mdadm supports user-specified strip/chunk sizes, whereas currently on Btrfs this is fixed to 64KiB. Depending on the file size for your workload, it's possible a much larger strip will yield better performance. Thanks, that's quite a few knobs I can try out - I just have a lot of data - with a rate of up to 450MB/s that I want to write out in time, preferably without having to rely on too expensive hardware. Another optimization is hardware RAID with a battery backed write cache (the drives' write cache are disabled) and using nobarrier mount option. If your workload supports linear/concat then it's fine to use md linear for this. What I'm not sure of is if it's an OK practice to disable barriers if the system is on a UPS (rather than a battery backed hardware RAID cache). You should post the workload and hardware details on the XFS list to get suggestions about such things. They'll also likely recommend the deadline scheduler over cfq. Actually data integrity does not matter for the workload. If everything is successful the result will be backed up - before that, full filesystem corruption is acceptable as a failure mode. Unless you have a workload really familiar to the responder, they'll tell you any benchmarking you do needs to approximate the actual workflow. 
A mismatched benchmark to the workload will lead you to the wrong conclusions. Typically when you optimize for a particular workload, other workloads suffer. Chris Murphy Thanks again for all the info! I'll get back if everything works fine - or if it doesn't ;-) Cheers Hendrik
btrfs issues in 3.14
Hello, I've been having a number of issues with processes hanging due to btrfs using 3.14 kernels. This seems pretty new as it has been working fine before. I also rebuilt the filesystem and am still receiving hangs. The filesystem is running on dmcrypt which is running on lvm2 which is running on an SSD (SAMSUNG MZMTD256HAGM-000L1). When the issue occurs the process is unable to be killed and the system will not fully shutdown. $ uname -a Linux orange 3.14.2-1-ARCH #1 SMP PREEMPT Sun Apr 27 11:28:44 CEST 2014 x86_64 GNU/Linux $ btrfs --version Btrfs v3.14.1 $ btrfs fi show Btrfs v3.14.1 $ btrfs fi df /home Data, single: total=71.01GiB, used=68.72GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.50GiB, used=863.33MiB Metadata, single: total=8.00MiB, used=0.00 I opened bugs 75181 and 75191 and I'll include the relevant journalctl entries. The kernel was upgraded from 3.14.1-1 to 3.14.2-1 during this time, and the filesystem was rebuilt after the orphan issue. I'm not on this list so please CC me on replies. Thanks, Kenny journal.txt.gz Description: GNU Zip compressed data
Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection.
Original Message Subject: Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection. From: Goffredo Baroncelli kreij...@libero.it To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org Date: 2014-05-07 02:10 Hi, instead of extending the BTRFS_IOCTL_DEV_INFO ioctl, why not add a field under /sys/fs/btrfs/UUID/ ? Something like /sys/fs/btrfs/UUID/missing_device BR G.Baroncelli I think that is also a good idea. I'll try to add it later. Thanks, Qu On 05/06/2014 08:33 AM, Qu Wenruo wrote: Original btrfs will not detect any missing device since there is no notification mechanism for the fs layer to detect missing devices in the block layer. However we don't really need to notify the fs layer upon dev remove; probing in the dev_info/rm_dev ioctls is good enough since they are the only two ioctls caring about missing devices. This patchset will do ioctl-time missing dev detection and return device missing status in the dev_info ioctl using a new member in btrfs_ioctl_dev_info_args with a backward compatible method. Cc: Anand Jain anand.j...@oracle.com Qu Wenruo (2): btrfs: Add missing device check in dev_info/rm_dev ioctl btrfs: Add new member of btrfs_ioctl_dev_info_args. fs/btrfs/ioctl.c | 4 fs/btrfs/volumes.c | 25 - fs/btrfs/volumes.h | 2 ++ include/uapi/linux/btrfs.h | 5 - 4 files changed, 34 insertions(+), 2 deletions(-)
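If the sysfs attribute Goffredo proposes were added, the userspace side would be trivial; a sketch, assuming the /sys/fs/btrfs/UUID/missing_device interface from his suggestion (which is a proposal, not an existing interface at the time of this thread — the function takes the path as a parameter precisely because the attribute is hypothetical):

```python
def read_missing_devices(attr_path):
    """Read a (proposed, hypothetical) btrfs sysfs attribute listing
    missing devices, one per line; [] if the attribute is absent or empty."""
    try:
        with open(attr_path) as f:
            return [line for line in (l.strip() for l in f) if line]
    except FileNotFoundError:
        return []
```

Compared to extending btrfs_ioctl_dev_info_args, a sysfs file needs no new ioctl ABI and is readable from shell scripts, which is presumably why it appealed here.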
Re: btrfs issues in 3.14
On Tue, May 06, 2014 at 08:49:04PM -0300, Kenny MacDermid wrote: Hello, I've been having a number of issues with processes hanging due to btrfs using 3.14 kernels. This seems pretty new as it has been working fine before. I also rebuilt the filesystem and am still receiving hangs. The filesystem is running on dmcrypt which is running on lvm2 which is running on an SSD (SAMSUNG MZMTD256HAGM-000L1). When the issue occurs the process is unable to be killed and the system will not fully shutdown. $ uname -a Linux orange 3.14.2-1-ARCH #1 SMP PREEMPT Sun Apr 27 11:28:44 CEST 2014 x86_64 GNU/Linux $ btrfs --version Btrfs v3.14.1 $ btrfs fi show Btrfs v3.14.1 $ btrfs fi df /home Data, single: total=71.01GiB, used=68.72GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.50GiB, used=863.33MiB Metadata, single: total=8.00MiB, used=0.00 I opened bugs 75181 and 75191 and I'll include the relevant journalctl entries. The kernel was upgraded from 3.14.1-1 to 3.14.2-1 during this time, and the filesystem was rebuilt after the orphan issue. I'm not on this list so please CC me on replies. What does sysrq+w say when the hang happens? -liubo
Re: Using noCow with snapshots ?
How could BTRFS and a database fight about data recovery? BTRFS offers similar guarantees about data durability etc to other journalled filesystems, and only differs by having checksums, so that while a snapshot might have half the data that was written by an app, you at least know that that half will be consistent. If you had database files on a separate subvol to the database log then you would be at risk of having problems making any sort of consistent snapshot (the Debian approach of /var/log/mysql and /var/lib/mysql is a bad idea). But there would be no difference with LVM snapshots in that regard. -- Sent from my Samsung Galaxy Note 2 with K-9 Mail.