Re: 3.14.0rc3: did not find backref in send_root
On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:
> I got this during a btrfs send:
> BTRFS error (device dm-2): did not find backref in send_root. inode=22672, offset=524288, disk_byte=1490517954560 found extent=1490517954560
> I'll try a scrub when I've finished my backup, but is there anything I can run on the file I've found from the inode?
> gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve -v 22672 file.mp3
> ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
> file.mp3

I've just seen this error:

BTRFS error (device sda4): did not find backref in send_root. inode=411890, offset=307200, disk_byte=48100618240 found extent=48100618240

during a send between two snapshots I have, after moving to 3.14.2. I've seen it on two filesystems now since moving to 3.14. I have the two read-only snapshots if there is anything helpful I can figure out from them. Scrub reports no errors, but I don't seem to be able to back up anything now.

David
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC PATCH 0/2] Kernel space btrfs missing device detection.
The original btrfs code will not detect any missing device, since there is no notification mechanism for the fs layer to learn about device removal in the block layer. However, we don't really need to notify the fs layer on device removal: probing at dev_info/rm_dev ioctl time is good enough, since those are the only two ioctls that care about missing devices. This patchset does ioctl-time missing device detection and returns the missing status in the dev_info ioctl, using a new member of btrfs_ioctl_dev_info_args added in a backward-compatible way.

Cc: Anand Jain anand.j...@oracle.com

Qu Wenruo (2):
  btrfs: Add missing device check in dev_info/rm_dev ioctl
  btrfs: Add new member of btrfs_ioctl_dev_info_args.

 fs/btrfs/ioctl.c           |  4 ++++
 fs/btrfs/volumes.c         | 25 ++++++++++++++++++++++++-
 fs/btrfs/volumes.h         |  2 ++
 include/uapi/linux/btrfs.h |  5 ++++-
 4 files changed, 34 insertions(+), 2 deletions(-)

-- 
1.9.2
[RFC PATCH 2/2] btrfs-progs: Add userspace support for kernel missing dev detection.
Add userspace support for the kernel missing device detection from the dev_info ioctl. Now 'btrfs fi show' auto-detects the output format of the dev_info ioctl and uses kernel missing device detection if supported. The userspace missing device detection is kept as a fallback method; when it is used, an info message is printed noting that 'btrfs dev del missing' will not work.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 cmds-filesystem.c | 29 +++++++++++++++++++++-------
 utils.c           |  2 ++
 2 files changed, 24 insertions(+), 7 deletions(-)

diff --git a/cmds-filesystem.c b/cmds-filesystem.c
index 306f715..0ff1ca6 100644
--- a/cmds-filesystem.c
+++ b/cmds-filesystem.c
@@ -369,6 +369,7 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 	char uuidbuf[BTRFS_UUID_UNPARSED_SIZE];
 	struct btrfs_ioctl_dev_info_args *tmp_dev_info;
 	int ret;
+	int new_flag = 0;
 
 	ret = add_seen_fsid(fs_info->fsid);
 	if (ret == -EEXIST)
@@ -389,13 +390,22 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 	for (i = 0; i < fs_info->num_devices; i++) {
 		tmp_dev_info = (struct btrfs_ioctl_dev_info_args *)&dev_info[i];
-		/* Add check for missing devices even mounted */
-		fd = open((char *)tmp_dev_info->path, O_RDONLY);
-		if (fd < 0) {
-			missing = 1;
-			continue;
+		new_flag = tmp_dev_info->flags & BTRFS_IOCTL_DEV_INFO_FLAG_SET;
+		if (!new_flag) {
+			/* Add check for missing devices even mounted */
+			fd = open((char *)tmp_dev_info->path, O_RDONLY);
+			if (fd < 0) {
+				missing = 1;
+				continue;
+			}
+			close(fd);
+		} else {
+			if (tmp_dev_info->flags &
+			    BTRFS_IOCTL_DEV_INFO_MISSING) {
+				missing = 1;
+				continue;
+			}
 		}
-		close(fd);
 		printf("\tdevid %4llu size %s used %s path %s\n",
 		       tmp_dev_info->devid,
 		       pretty_size(tmp_dev_info->total_bytes),
@@ -403,8 +413,13 @@ static int print_one_fs(struct btrfs_ioctl_fs_info_args *fs_info,
 		       tmp_dev_info->path);
 	}
 
-	if (missing)
+	if (missing) {
 		printf("\t*** Some devices missing\n");
+		if (!new_flag) {
+			printf("\tOlder kernel detected\n");
+			printf("\t'btrfs dev delete missing' may not work\n");
+		}
+	}
 
 	printf("\n");
 	return 0;
 }
diff --git a/utils.c b/utils.c
index 3e9c527..230471f 100644
--- a/utils.c
+++ b/utils.c
@@ -1670,6 +1670,8 @@ int get_device_info(int fd, u64 devid,
 
 	di_args->devid = devid;
 	memset(&di_args->uuid, '\0', sizeof(di_args->uuid));
+	/* Clear flags to ensure an old kernel returns untouched flags */
+	memset(&di_args->flags, 0, sizeof(di_args->flags));
 
 	ret = ioctl(fd, BTRFS_IOC_DEV_INFO, di_args);
 	return ret ? -errno : 0;
-- 
1.9.2
[RFC PATCH 2/2] btrfs: Add new member of btrfs_ioctl_dev_info_args.
Add a flags member to btrfs_ioctl_dev_info_args to report missing btrfs devices. The new member is added in the original padding area, so the ioctl ABI is not affected, but user headers need to be updated.

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c           | 3 +++
 include/uapi/linux/btrfs.h | 5 ++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 7680a40..1920f24 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2610,6 +2610,9 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 	di_args->devid = dev->devid;
 	di_args->bytes_used = dev->bytes_used;
 	di_args->total_bytes = dev->total_bytes;
+	di_args->flags = BTRFS_IOCTL_DEV_INFO_FLAG_SET;
+	if (dev->missing)
+		di_args->flags |= BTRFS_IOCTL_DEV_INFO_MISSING;
 	memcpy(di_args->uuid, dev->uuid, sizeof(di_args->uuid));
 	if (dev->name) {
 		struct rcu_string *name;
diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h
index b4d6909..5eb1f03 100644
--- a/include/uapi/linux/btrfs.h
+++ b/include/uapi/linux/btrfs.h
@@ -168,12 +168,15 @@ struct btrfs_ioctl_dev_replace_args {
 	__u64 spare[64];
 };
 
+#define BTRFS_IOCTL_DEV_INFO_MISSING	(1ULL << 0)
+#define BTRFS_IOCTL_DEV_INFO_FLAG_SET	(1ULL << 63)
 struct btrfs_ioctl_dev_info_args {
 	__u64 devid;				/* in/out */
 	__u8 uuid[BTRFS_UUID_SIZE];		/* in/out */
 	__u64 bytes_used;			/* out */
 	__u64 total_bytes;			/* out */
-	__u64 unused[379];			/* pad to 4k */
+	__u64 flags;				/* out */
+	__u64 unused[378];			/* pad to 4k */
 	__u8 path[BTRFS_DEVICE_PATH_NAME_MAX];	/* out */
 };
-- 
1.9.2
[RFC PATCH 1/2] btrfs: Add missing device check in dev_info/rm_dev ioctl
Old btrfs can't find a missing btrfs device, since there is no mechanism for the block layer to inform the fs layer. But we can use a workaround: check the status of every device in a btrfs filesystem (by using request_queue->queue_flags) only when the dev_info/rm_dev ioctls are called, since the other ioctls do not really care about missing devices.

Cc: Anand Jain anand.j...@oracle.com
Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 fs/btrfs/ioctl.c   |  1 +
 fs/btrfs/volumes.c | 25 ++++++++++++++++++++++++-
 fs/btrfs/volumes.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0401397..7680a40 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2606,6 +2606,7 @@ static long btrfs_ioctl_dev_info(struct btrfs_root *root, void __user *arg)
 		goto out;
 	}
 
+	btrfs_check_dev_missing(root, dev, 1);
 	di_args->devid = dev->devid;
 	di_args->bytes_used = dev->bytes_used;
 	di_args->total_bytes = dev->total_bytes;
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index d241130a..c7d7908 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -1548,9 +1548,10 @@ int btrfs_rm_device(struct btrfs_root *root, char *device_path)
 	 * is held.
 	 */
 	list_for_each_entry(tmp, devices, dev_list) {
+		btrfs_check_dev_missing(root, tmp, 0);
 		if (tmp->in_fs_metadata &&
 		    !tmp->is_tgtdev_for_dev_replace &&
-		    !tmp->bdev) {
+		    (!tmp->bdev || tmp->missing)) {
 			device = tmp;
 			break;
 		}
@@ -6300,3 +6301,25 @@ int btrfs_scratch_superblock(struct btrfs_device *device)
 
 	return 0;
 }
+
+/* If need_lock is set, uuid_mutex will be used */
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock)
+{
+	struct request_queue *q;
+
+	if (unlikely(!dev || !dev->bdev || !dev->bdev->bd_queue))
+		return -ENOENT;
+	q = dev->bdev->bd_queue;
+
+	if (need_lock)
+		mutex_lock(&uuid_mutex);
+	if (test_bit(QUEUE_FLAG_DEAD, &q->queue_flags) ||
+	    test_bit(QUEUE_FLAG_DYING, &q->queue_flags)) {
+		dev->missing = 1;
+		root->fs_info->fs_devices->missing_devices++;
+	}
+	if (need_lock)
+		mutex_unlock(&uuid_mutex);
+	return 0;
+}
diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 80754f9..47a44af 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -356,6 +356,8 @@ unsigned long btrfs_full_stripe_len(struct btrfs_root *root,
 int btrfs_finish_chunk_alloc(struct btrfs_trans_handle *trans,
 				struct btrfs_root *extent_root,
 				u64 chunk_offset, u64 chunk_size);
+int btrfs_check_dev_missing(struct btrfs_root *root, struct btrfs_device *dev,
+			    int need_lock);
 
 static inline void btrfs_dev_stat_inc(struct btrfs_device *dev, int index)
 {
-- 
1.9.2
[RFC PATCH 1/2] btrfs-progs: Follow kernel changes to add new member of btrfs_ioctl_dev_info_args.
Follow the kernel header changes to add the new member of btrfs_ioctl_dev_info_args. This change uses a special bit to keep backward compatibility, so even on old kernels it will not screw anything up.

Signed-off-by: Qu Wenruo quwen...@cn.fujitsu.com
---
 ioctl.h | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/ioctl.h b/ioctl.h
index 9627e8d..672a3a3 100644
--- a/ioctl.h
+++ b/ioctl.h
@@ -156,12 +156,15 @@ struct btrfs_ioctl_dev_replace_args {
 	__u64 spare[64];
 };
 
+#define BTRFS_IOCTL_DEV_INFO_MISSING	(1ULL << 0)
+#define BTRFS_IOCTL_DEV_INFO_FLAG_SET	(1ULL << 63)
 struct btrfs_ioctl_dev_info_args {
 	__u64 devid;				/* in/out */
 	__u8 uuid[BTRFS_UUID_SIZE];		/* in/out */
 	__u64 bytes_used;			/* out */
 	__u64 total_bytes;			/* out */
-	__u64 unused[379];			/* pad to 4k */
+	__u64 flags;				/* out */
+	__u64 unused[378];			/* pad to 4k */
 	__u8 path[BTRFS_DEVICE_PATH_NAME_MAX];	/* out */
 };
-- 
1.9.2
Re: 3.14.0rc3: did not find backref in send_root
On 05/06/2014 08:10 AM, David Brown wrote:
> On Mon, Feb 24, 2014 at 10:36:52PM -0800, Marc MERLIN wrote:
>> I got this during a btrfs send:
>> BTRFS error (device dm-2): did not find backref in send_root. inode=22672, offset=524288, disk_byte=1490517954560 found extent=1490517954560
>> I'll try a scrub when I've finished my backup, but is there anything I can run on the file I've found from the inode?
>> gargamel:/mnt/dshelf1/Sound# btrfs inspect-internal inode-resolve -v 22672 file.mp3
>> ioctl ret=0, bytes_left=3998, bytes_missing=0, cnt=1, missed=0
>> file.mp3
>
> I've just seen this error:
>
> BTRFS error (device sda4): did not find backref in send_root. inode=411890, offset=307200, disk_byte=48100618240 found extent=48100618240
>
> during a send between two snapshots I have, after moving to 3.14.2. I've seen it on two filesystems now since moving to 3.14. I have the two read-only snapshots if there is anything helpful I can figure out from them. Scrub reports no errors, but I don't seem to be able to back up anything now.
>
> David

I am also seeing this on 3.14.1 (on Arch Linux). Scrub also reports no errors, and I could not do a full send either. Balancing made it better for a while (I was able to send a full snapshot of one subvolume, but not another), but it did not help. Offline repairing the fs with btrfsck --repair also did not affect it.

Blaz
Re: btrfs on software RAID0
just one last doubt: why do you use --align-payload=1024? (or 8192)

The cryptsetup man page says that the default for the payload alignment is 2048 (512-byte sectors), so it's already aligned by default to 4K-byte physical sectors (if that was your concern). Am I missing something?

John

On Mon, May 5, 2014 at 11:25 PM, Marc MERLIN m...@merlins.org wrote:
> On Mon, May 05, 2014 at 10:51:46PM +0200, john terragon wrote:
>> Hi. I'm about to try btrfs on a RAID0 md device (to be precise, there will be dm-crypt in between the md device and btrfs). If I used ext4 I would set the stride and stripe_width extended options. Is there anything similar I should be doing with mkfs.btrfs? Or maybe some mount options beneficial to this kind of setting.
>
> This is not directly an answer to your question; so far I haven't used a special option like this with btrfs on my arrays, although my understanding is that it's not as important as with ext4.
>
> That said, please read
> http://marc.merlins.org/perso/btrfs/post_2014-04-27_Btrfs-Multi-Device-Dmcrypt.html
>
> 1) use align-payload=1024 on cryptsetup instead of something bigger like 8192. This will reduce write amplification (if you're not on an SSD).
>
> 2) you don't need md0 in the middle: crypt each device and then use the btrfs built-in raid0, which will be faster (and is stable, at least as far as we know :) ). Then use /etc/crypttab or a script like this
> http://marc.merlins.org/linux/scripts/start-btrfs-dmcrypt
> to decrypt all your devices in one swoop and mount btrfs.
>
> Marc
> --
> "A mouse is a device used to point at the xterm you want to type in" - A.S.R.
> Microsoft is to operating systems what McDonalds is to gourmet cooking
> Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
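For what it's worth, --align-payload is expressed in 512-byte sectors, so the values being discussed work out like this (a quick sanity check of the arithmetic only, not a cryptsetup invocation):

```python
SECTOR = 512  # cryptsetup's --align-payload unit: 512-byte sectors

def payload_offset_bytes(align_payload_sectors: int) -> int:
    """Byte offset/alignment of the encrypted payload for a given
    --align-payload value."""
    return align_payload_sectors * SECTOR

# Marc's suggestion: --align-payload=1024 -> 512 KiB alignment
print(payload_offset_bytes(1024))           # 524288
# cryptsetup's default of 2048 sectors is 1 MiB, already a
# multiple of a 4 KiB physical sector
print(payload_offset_bytes(2048) % 4096)    # 0
```

So both 1024 and the default 2048 are 4K-aligned; the difference is only how large the alignment granularity is.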
[PATCH] xfstests: add regression test for inode cache vs tree log
This patch adds a regression test to verify that btrfs cannot reuse an inode id until the transaction has been committed. This was addressed by the following kernel patch:

    Btrfs: fix inode cache vs tree log

Signed-off-by: Wang Shilong wangsl.f...@cn.fujitsu.com
---
 tests/btrfs/049     | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/btrfs/049.out |   1 +
 tests/btrfs/group   |   1 +
 3 files changed, 111 insertions(+)
 create mode 100644 tests/btrfs/049
 create mode 100644 tests/btrfs/049.out

diff --git a/tests/btrfs/049 b/tests/btrfs/049
new file mode 100644
index 000..3101d09
--- /dev/null
+++ b/tests/btrfs/049
@@ -0,0 +1,109 @@
+#! /bin/bash
+# FS QA Test No. btrfs/049
+#
+# Regression test for btrfs inode caching vs tree log which was
+# addressed by the following kernel patch.
+#
+# Btrfs: fix inode caching vs tree log
+#
+#-----------------------------------------------------------------------
+# Copyright (c) 2014 Fujitsu.  All Rights Reserved.
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#-----------------------------------------------------------------------
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	_cleanup_flakey
+	rm -rf $tmp
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/dmflakey
+
+# real QA test starts here
+_supported_fs generic
+_supported_os Linux
+_need_to_be_root
+_require_scratch
+_require_dm_flakey
+
+rm -f $seqres.full
+
+_scratch_mkfs >> $seqres.full 2>&1
+
+SAVE_MOUNT_OPTIONS="$MOUNT_OPTIONS"
+MOUNT_OPTIONS="$MOUNT_OPTIONS -o inode_cache,commit=100"
+
+# create a basic flakey device that will never error out
+_init_flakey
+_mount_flakey
+
+_get_inode_id()
+{
+	local inode_id
+	inode_id=`stat $1 | grep Inode: | $AWK_PROG '{print $4}'`
+	echo $inode_id
+}
+
+$XFS_IO_PROG -f -c "pwrite 0 10M" -c "fsync" \
+	$SCRATCH_MNT/data > /dev/null
+
+inode_id=`_get_inode_id $SCRATCH_MNT/data`
+rm -f $SCRATCH_MNT/data
+
+for i in `seq 1 5`;
+do
+	mkdir $SCRATCH_MNT/dir_$i
+	new_inode_id=`_get_inode_id $SCRATCH_MNT/dir_$i`
+	if [ $new_inode_id -eq $inode_id ]
+	then
+		$XFS_IO_PROG -f -c "pwrite 0 1M" -c "fsync" \
+			$SCRATCH_MNT/dir_$i/data1 > /dev/null
+		_load_flakey_table 1
+		_unmount_flakey
+		need_umount=1
+		break
+	fi
+	sleep 1
+done
+
+# restore previous mount options
+export MOUNT_OPTIONS="$SAVE_MOUNT_OPTIONS"
+
+# ok mount so that any recovery that needs to happen is done
+if [ $new_inode_id -eq $inode_id ];then
+	_load_flakey_table $FLAKEY_ALLOW_WRITES
+	_mount_flakey
+	_unmount_flakey
+fi
+
+# make sure we got a valid fs after replay
+_check_scratch_fs $FLAKEY_DEV
+
+status=0
+exit
diff --git a/tests/btrfs/049.out b/tests/btrfs/049.out
new file mode 100644
index 000..cb0061b
--- /dev/null
+++ b/tests/btrfs/049.out
@@ -0,0 +1 @@
+QA output created by 049
diff --git a/tests/btrfs/group b/tests/btrfs/group
index af60c79..59b0c98 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -51,3 +51,4 @@
 046 auto quick
 047 auto quick
 048 auto quick
+049 auto quick
-- 
1.8.2.1
Scrub status: no stats available
Dear list,

I am running btrfs on Arch Linux ARM (Linux 3.14.2, Btrfs v3.14.1). I can run scrub without errors, but I never get stats from scrub status. What I get is:

btrfs scrub status /pools/dataPool
scrub status for b5f082e2-2ce0-4f91-b54b-c2d26185a635
	no stats available
	total bytes scrubbed: 694.13GiB with 0 errors

Please mind the line "no stats available". Where can I start digging?

Thank you,
Wolfgang
Btrfs raid allocator
Hello all!

I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is: how does the block allocator decide which device to write to? Can this decision be dynamic, and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space; is this still true?

Cheers
Hendrik
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?

For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.

There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups.

I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.

I would recommend thoroughly benchmarking your application with the FS first, though, just to see how it's going to behave for you.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- Ceci n'est pas une pipe: | ---

signature.asc
Description: Digital signature
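To make the 64 KiB striping concrete, here is a small sketch of the arithmetic Hugo describes: which device and chunk-internal offset a byte offset within a RAID-0 block group lands on, assuming simple round-robin striping across N devices (an illustration only, not btrfs code):

```python
STRIPE_LEN = 64 * 1024  # 64 KiB stripe length, per Hugo's description

def raid0_map(offset: int, num_devices: int):
    """Map a byte offset within a RAID-0 block group to
    (device index, byte offset within that device's chunk)."""
    stripe_nr = offset // STRIPE_LEN          # which 64 KiB stripe overall
    device = stripe_nr % num_devices          # stripes rotate across devices
    dev_stripe = stripe_nr // num_devices     # full stripes already on that device
    return device, dev_stripe * STRIPE_LEN + offset % STRIPE_LEN

# With 3 devices: bytes 0..64K-1 land on device 0, 64K..128K-1 on device 1, ...
print(raid0_map(0, 3))                 # (0, 0)
print(raid0_map(65536, 3))             # (1, 0)
print(raid0_map(3 * 65536 + 100, 3))   # (0, 65636)
```

The consequence Hendrik asks about falls straight out of this: every device gets an equal share of stripes, so a write can only complete as fast as the slowest device.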
Re: Scrub status: no stats available
On Tue, May 06, 2014 at 11:52:58AM +0200, Wolfgang Mader wrote:
> Dear list,
> I am running btrfs on Arch Linux ARM (Linux 3.14.2, Btrfs v3.14.1). I can run scrub w/o errors, but I never get stats from scrub status. What I get is
> btrfs scrub status /pools/dataPool
> scrub status for b5f082e2-2ce0-4f91-b54b-c2d26185a635
> 	no stats available
> 	total bytes scrubbed: 694.13GiB with 0 errors
> Please mind the line "no stats available". Where can I start digging?

Here:

legolas:~# l /var/lib/btrfs/
total 16
drwxr-xr-x 1 root root  494 May  6 04:08 ./
drwxr-xr-x 1 root root 1360 Apr 27 22:15 ../
srwxr-xr-x 1 root root    0 May  6 03:48 scrub.progress.4850ee22-bf32-4131-a841-02abdb4a5ba6=
-rw------- 1 root root  428 May  6 04:08 scrub.status.4850ee22-bf32-4131-a841-02abdb4a5ba6
-rw------- 1 root root  427 May  5 05:04 scrub.status.6afd4707-876c-46d6-9de2-21c4085b7bed
-rw------- 1 root root  418 Jan 11  2013 scrub.status.92584fa9-85cd-4df6-b182-d32198b76a0b
-rw------- 1 root root  420 May 17  2013 scrub.status.9f52c100-8c89-45b6-a005-3f5de1c12b38

Marc
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
> On 06.05.2014 12:59, Hugo Mills wrote:
>> On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
>>> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?
>>
>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>
> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?

Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.

>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>
> Why does prealloc change anything? For me latency does not matter, only continuous throughput!

It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.

>> I would recommend thoroughly benchmarking your application with the FS first though, just to see how it's going to behave for you.
>
> Of course - it's just that I do not yet have the hardware, but I plan to test with a small model - I just try to find out how it actually works first, so I know what to look out for.

Good luck. :)

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I am the author. You are the audience. I outrank you! ---

signature.asc
Description: Digital signature
Re: Btrfs raid allocator
On 06.05.2014 13:19, Hugo Mills wrote:
> On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
>> On 06.05.2014 12:59, Hugo Mills wrote:
>>> On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote:
>>>> Hello all! I would like to use btrfs (or anything else, actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/throughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true?
>>>
>>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>>
>> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?
>
> Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.

So striping is fixed, but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk', as chunk allocation is dynamic, correct?

>>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>>
>> Why does prealloc change anything? For me latency does not matter, only continuous throughput!
>
> It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.

Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere?

>>> I would recommend thoroughly benchmarking your application with the FS first though, just to see how it's going to behave for you.
>>
>> Of course - it's just that I do not yet have the hardware, but I plan to test with a small model - I just try to find out how it actually works first, so I know what to look out for.
>
> Good luck. :)
>
> Hugo.

Thanks!
Hendrik
Re: Btrfs raid allocator
On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote:
> On 06.05.2014 13:19, Hugo Mills wrote:
>> On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote:
>>> On 06.05.2014 12:59, Hugo Mills wrote:
>>>> For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices.
>>>
>>> So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independent of write speeds? So one slow device would slow down the whole raid?
>>
>> Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point.
>
> So striping is fixed, but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk', as chunk allocation is dynamic, correct?

You'd have to rewrite the chunk allocator to do this, _and_ provide different RAID levels for different subvolumes. The chunk/block group allocator right now uses only one rule for allocating data, and one for allocating metadata.

Now, both of these are planned, and _might_ between them possibly cover the use-case you're talking about, but I'm not certain it's necessarily a sensible thing to do in this case.

My question is, if you actually care about the performance of this system, why are you buying some slow devices to drag the performance of your fast devices down? It seems like a recipe for disaster...

>>>> There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping.
>>>
>>> Why does prealloc change anything? For me latency does not matter, only continuous throughput!
>>
>> It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes.
>
> Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere?

I guess it'll depend on the approach you use to do these very large writes, and on the exact definition of very large. This is not an area I know a huge amount about.

Hugo.

-- 
=== Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk ===
PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk
--- I am the author. You are the audience. I outrank you! ---

signature.asc
Description: Digital signature
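As an aside on the prealloc suggestion: from userspace, preallocated extents are typically requested with fallocate(2). A minimal sketch (Python here for brevity; this demonstrates the generic syscall on a throwaway temp file, nothing btrfs-specific):

```python
import os
import tempfile

# Preallocate space up front so the filesystem can hand out large
# contiguous extents, instead of allocating piecemeal as data arrives.
fd, path = tempfile.mkstemp()
try:
    os.posix_fallocate(fd, 0, 1024 * 1024)  # reserve 1 MiB at offset 0
    size = os.fstat(fd).st_size
    print(size)  # 1048576
finally:
    os.close(fd)
    os.remove(path)
```

On btrfs this creates a prealloc extent, which is what turns a trickle of appends into the "large contiguous sections of striping" Hugo mentions.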
Re: Btrfs raid allocator
On 06.05.2014 13:46, Hugo Mills wrote: On Tue, May 06, 2014 at 01:26:44PM +0200, Hendrik Siedelmann wrote: On 06.05.2014 13:19, Hugo Mills wrote: On Tue, May 06, 2014 at 01:14:26PM +0200, Hendrik Siedelmann wrote: On 06.05.2014 12:59, Hugo Mills wrote: On Tue, May 06, 2014 at 12:41:38PM +0200, Hendrik Siedelmann wrote: Hello all! I would like to use btrfs (or anyting else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. So my question is, how is the block allocator deciding on which device to write, can this decision be dynamic and could it incorporate timing/troughput decisions? I'm willing to write code, I just have no clue as to how this works right now. I read somewhere that the decision is based on free space, is this still true? For (current) RAID-0 allocation, the block group allocator will use as many chunks as there are devices with free space (down to a minimum of 2). Data is then striped across those chunks in 64 KiB stripes. Thus, the first block group will be N GiB of usable space, striped across N devices. So do I understand this correctly that (assuming we have enough space) data will be spread equally between the disks independend of write speeds? So one slow device would slow down the whole raid? Yes. Exactly the same as it would be with DM RAID-0 on the same configuration. There's not a lot we can do about that at this point. So striping is fixed but which disk takes part with a chunk is dynamic? But for large workloads slower disks could 'skip a chunk' as chunk allocation is dynamic, correct? You'd have to rewrite the chunk allocator to do this, _and_ provide different RAID levels for different subvolumes. The chunk/block group allocator right now uses only one rule for allocating data, and one for allocating metadata. 
Now, both of these are planned, and _might_ between them possibly cover the use-case you're talking about, but I'm not certain it's necessarily a sensible thing to do in this case. But what does the allocator currently do when one disk runs out of space? I thought those disks do not get used but we can still write data. So the mechanism is already there, it just needs to be invoked when a drive is too busy instead of too full. My question is, if you actually care about the performance of this system, why are you buying some slow devices to drag the performance of your fast devices down? It seems like a recipe for disaster... Even the speed of a single hdd varies depending on where I write the data. So actually there is not much choice :-D. I'm aware that this could be a case of overengineering. Actually my first thought was to write a simple fuse module which only handles data and puts metadata on a regular filesystem. But then I thought that it would be nice to have this in btrfs - and not just for raid0. There's a second level of allocation (which I haven't looked at at all), which is how the FS decides where to put data within the allocated block groups. I think it will almost certainly be beneficial in your case to use prealloc extents, which will turn your continuous write into large contiguous sections of striping. Why does prealloc change anything? For me latency does not matter, only continuous throughput! It makes the extent allocation algorithm much simpler, because it can then allocate in larger chunks and do more linear writes. Is this still true if I do very large writes? Or do those get broken down by the kernel somewhere? I guess it'll depend on the approach you use to do these very large writes, and on the exact definition of very large. This is not an area I know a huge amount about. Hugo. Never mind, I'll just try it out! 
Hendrik -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
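Hugo's description of the RAID-0 allocator above — every device with free space participates in a new block group (minimum of two), and data is then striped across those chunks in 64 KiB units — can be sketched as a toy model. This is only an illustration of the rule as described in the thread, not btrfs's actual allocator; the function names are made up:

```python
def raid0_chunk_devices(free_space):
    """Pick devices for a new RAID-0 block group: every device that
    still has free space participates (minimum of two devices)."""
    devs = [d for d, free in free_space.items() if free > 0]
    if len(devs) < 2:
        raise ValueError("RAID-0 needs at least 2 devices with free space")
    return sorted(devs)

def stripe_map(devices, length, stripe=64 * 1024):
    """Map a write of `length` bytes onto 64 KiB stripes, round-robin
    across the participating devices (ceil division for the tail)."""
    return [devices[i % len(devices)] for i in range(-(-length // stripe))]

devs = raid0_chunk_devices({"sda": 10, "sdb": 5, "sdc": 0})
# sdc has no free space left, so only sda and sdb take part:
print(devs)                          # ['sda', 'sdb']
print(stripe_map(devs, 256 * 1024))  # ['sda', 'sdb', 'sda', 'sdb']
```

This also makes Hendrik's concern concrete: every stripe unit lands on every participating device in turn, so the slowest participant gates the whole write.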
Re: Please review and comment, dealing with btrfs full issues
On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% and less used, which would by necessity get about ~160GB worth chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote: Forgot this part: Also in your last example, you used -dusage=0 and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) Correct. That FS was very mis-balanced. On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote: The standard response on the mailing list for this issue is to temporarily add an additional device to the filesystem (even e.g. a 4GB USB flash drive is often enough) - this will add space to allocate a few new chunks, allowing the balance to proceed. You can remove the extra device after the balance completes. I just added that tip, thank you. On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote: Recently kernel 3.14 allowed fixing a metadata space error that seemed to be impossible to solve with 3.13. 
So it's possible that some of my other problems with a lack of metadata space could have been solved with kernel 3.14 too. Good point. I added that tip too. Thanks, Marc -- A mouse is a device used to point at the xterm you want to type in - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 1024R/763BE901
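Brendan's arithmetic above (636GB used out of 800GB allocated, so 636/800 ≈ 79%, hence -dusage=80) can be written down as a small helper. This is a sketch of that rule of thumb, not an official tool; the function name is made up:

```python
import math

def dusage_for(used_gib, allocated_gib):
    """Suggest a -dusage percentage for `btrfs balance start`: round the
    filesystem-wide data usage ratio up to the next whole percent, so
    chunks fuller than the average are left alone."""
    return math.ceil(used_gib / allocated_gib * 100)

pct = dusage_for(636, 800)
print(pct)  # 80
print(f"btrfs balance start -dusage={pct} /mnt/btrfs_pool1")
```

The command printed at the end is the same shape as the one used in the thread; only the threshold is computed rather than guessed.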
Re: Please review and comment, dealing with btrfs full issues
Hi, Marc. Inline below. :) On 2014/05/06 02:19 PM, Marc MERLIN wrote: On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? As usual, it depends on what end-result you want. Paranoid rebalancing - always ensuring there are as many free chunks as possible - is totally unnecessary. There may be more good reasons to rebalance - but I'm only aware of two: a) to avoid ENOSPC due to running out of free chunks; and b) to change allocation type. If you want all chunks either full or empty (except for that last chunk which will be somewhere in between), -dusage=55 will get you 99% there. In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% or less used, which would by necessity get about ~160GB worth of chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. 
On Mon, May 05, 2014 at 07:09:22PM +0200, Brendan Hide wrote: Forgot this part: Also in your last example, you used -dusage=0 and it balanced 91 chunks. That means you had 91 empty or very-close-to-empty chunks. ;) Correct. That FS was very mis-balanced. On Mon, May 05, 2014 at 02:36:09PM -0400, Calvin Walton wrote: The standard response on the mailing list for this issue is to temporarily add an additional device to the filesystem (even e.g. a 4GB USB flash drive is often enough) - this will add space to allocate a few new chunks, allowing the balance to proceed. You can remove the extra device after the balance completes. I just added that tip, thank you. On Tue, May 06, 2014 at 02:41:16PM +1000, Russell Coker wrote: Recently kernel 3.14 allowed fixing a metadata space error that seemed to be impossible to solve with 3.13. So it's possible that some of my other problems with a lack of metadata space could have been solved with kernel 3.14 too. Good point. I added that tip too. Thanks, Marc -- __ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97
Re: Please review and comment, dealing with btrfs full issues
On Tue, May 06, 2014 at 06:30:31PM +0200, Brendan Hide wrote: Hi, Marc. Inline below. :) On 2014/05/06 02:19 PM, Marc MERLIN wrote: On Mon, May 05, 2014 at 07:07:29PM +0200, Brendan Hide wrote: In the case above, because the filesystem is only 55% full, I can ask balance to rewrite all chunks that are more than 55% full: legolas:~# btrfs balance start -dusage=50 /mnt/btrfs_pool1 -dusage=50 will balance all chunks that are 50% *or less* used, Sorry, I actually meant to write 55 there. not more. The idea is that full chunks are better left alone while emptyish chunks are bundled together to make new full chunks, leaving big open areas for new chunks. Your process is good however - just the explanation that needs the tweak. :) Mmmh, so if I'm 55% full, should I actually use -dusage=45 or 55? As usual, it depends on what end-result you want. Paranoid rebalancing - always ensuring there are as many free chunks as possible - is totally unnecessary. There may be more good reasons to rebalance - but I'm only aware of two: a) to avoid ENOSPC due to running out of free chunks; and b) to change allocation type. c) its original reason: to redistribute the data on the FS, for example in the case of a new device being added or removed. If you want all chunks either full or empty (except for that last chunk which will be somewhere in between), -dusage=55 will get you 99% there. In your last example, a full rebalance is not necessary. If you want to clear all unnecessary chunks you can run the balance with -dusage=80 (636GB/800GB~=79%). That will cause a rebalance only of the data chunks that are 80% or less used, which would by necessity get about ~160GB worth of chunks back out of data and available for re-use. So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. 
Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. I've actually looked into implementing a smallest=n filter that would take only the n least-full chunks (by fraction) and balance those. However, it's not entirely trivial to do efficiently with the current filtering code. Hugo. -- === Hugo Mills: hugo@... carfax.org.uk | darksatanic.net | lug.org.uk === PGP key: 65E74AC0 from wwwkeys.eu.pgp.net or http://www.carfax.org.uk --- Hail and greetings. We are a flat-pack invasion force from --- Planet Ikea. We come in pieces.
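The usage-filter behaviour discussed in this thread — chunks at or below the -dusage threshold get rewritten, and at 0 only completely empty chunks qualify, which need no destination space and so can be reclaimed even on a totally full filesystem — can be modelled in a few lines. This is a simplified illustration of the filter semantics as described, not the kernel's filtering code:

```python
def balance_candidates(chunk_usage_pct, dusage):
    """Return the indices of data chunks a balance with the given
    -dusage threshold would rewrite: those whose usage percentage is
    at or below the threshold.  At dusage=0, only entirely empty
    chunks qualify - and since they hold no data, rewriting them
    needs no free destination chunk, which is why -dusage=0 can
    still succeed on a filesystem with no unallocated space."""
    return [i for i, pct in enumerate(chunk_usage_pct) if pct <= dusage]

chunks = [0, 0, 37, 82, 100]          # per-chunk usage percentages
print(balance_candidates(chunks, 0))   # [0, 1]  - the empty chunks only
print(balance_candidates(chunks, 50))  # [0, 1, 2]
```

A hypothetical smallest=n filter like the one Hugo mentions would instead sort chunks by usage fraction and take the first n, rather than applying a fixed threshold.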
Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection.
Hi, instead of extending the BTRFS_IOCTL_DEV_INFO ioctl, why not add a field under /sys/fs/btrfs/UUID/ ? Something like /sys/fs/btrfs/UUID/missing_device BR G.Baroncelli On 05/06/2014 08:33 AM, Qu Wenruo wrote: Original btrfs will not detect any missing device since there is no notification mechanism for fs layer to detect missing device in block layer. However we don't really need to notify fs layer upon dev remove, probing in dev_info/rm_dev ioctl is good enough since they are the only two ioctls caring about missing device. This patchset will do ioctl time missing dev detection and return device missing status in dev_info ioctl using a new member in btrfs_ioctl_dev_info_args with a backward compatible method. Cc: Anand Jain anand.j...@oracle.com Qu Wenruo (2): btrfs: Add missing device check in dev_info/rm_dev ioctl btrfs: Add new member of btrfs_ioctl_dev_info_args. fs/btrfs/ioctl.c | 4 fs/btrfs/volumes.c | 25 - fs/btrfs/volumes.h | 2 ++ include/uapi/linux/btrfs.h | 5 - 4 files changed, 34 insertions(+), 2 deletions(-) -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: error 2001, no inode item
Hi, I tried with a newer version of btrfs, but I'm still getting the same error. checking extents checking free space cache checking fs roots root 5 inode 5769204 errors 2001, no inode item, link count wrong unresolved ref dir 5783881 index 3 namelen 38 name 61bd2ed1fba8bc8d2f12766c7e4b3dafff6350 filetype 1 error 4, no inode ref root 5 inode 5899187 errors 2001, no inode item, link count wrong unresolved ref dir 5906761 index 3 namelen 38 name 61bd2ed1fba8bc8d2f12766c7e4b3dafff6350 filetype 1 error 0 Checking filesystem on /dev/sda4 UUID: 98190f1e-426f-433d-8335-1216b9a63d16 found 28521431809 bytes used err is 1 total csum bytes: 124070732 total tree bytes: 722415616 total fs tree bytes: 552411136 total extent tree bytes: 32673792 btree space waste bytes: 171189111 file data blocks allocated: 188149448704 referenced 126695161856 Btrfs v3.14.1+20140502 # uname -a Linux apersaud 3.14.2-25.g1474ea5-desktop #1 SMP PREEMPT Sun Apr 27 14:35:22 UTC 2014 (1474ea5) x86_64 x86_64 x86_64 GNU/Linux Any idea how I can fix the missing inode problem? Arun
Re: Copying related snapshots to another server with btrfs send/receive?
Brendan Hide posted on Sun, 04 May 2014 09:54:38 +0200 as excerpted: From the man page section on -c: You must not specify clone sources unless you guarantee that these snapshots are exactly in the same state on both sides, the sender and the receiver. It is allowed to omit the '-p parent' option when '-c clone-src' options are given, in which case 'btrfs send' will determine a suitable parent among the clone sources itself. -p does require that the sources be read-only. I suspect -c does as well. This means that it won't be so simple as you want your sources to be read-write. Probably the only way then would be to make read-only snapshots whenever you want to sync these over while also ensuring that you keep at least one read-only snapshot intact - again, much like incremental backups. I don't claim in any way to be a send/receive expert as I don't use it for my use-case at all. However... It's worth noting in the context of that manpage quote, that really the only practical way to guarantee that the snapshots are exactly the same on both sides is to have them read-only the entire time. Because the moment you make them writable on either side all bets are off as to whether something has been written, thereby killing the exact-same-state guarantee. =:^( *However*: snapshotting a read-only snapshot and making the new one writable is easy enough[1]. Just keep the originals read-only so they can be used as parents/clones, and make a second, writable snapshot of the first, to do your writable stuff in. 
--- [1] Snapshotting a snapshot: I'm getting a metaphorical flashing light saying I need to go check the wiki FAQ that deals with this again before I post, but unfortunately I can't check out why ATM as I just upgraded firefox and cairo and am currently getting a blank window where the firefox content should be, that will hopefully be gone and the content displayed after I reboot and get rid of the still loaded old libs, so unfortunately I can't check that flashing light ATM and am writing blind. Hopefully that flashing light warning isn't for something /too/ major that I'm overlooking! =:^( -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
Re: copies= option
Hugo Mills posted on Sun, 04 May 2014 19:31:55 +0100 as excerpted: My proposal was simply a description mechanism, not an implementation. The description is N-copies, M-device-stripe, P-parity-devices (NcMsPp), and (more or less comfortably) covers at minimum all of the current and currently-proposed replication levels. There's a couple of tweaks covering description of allocation rules (DUP vs RAID-1). Thanks. That was it. =:^) But I had interpreted the discussion as a bit more concrete in terms of ultimate implementation than it apparently was. Anyway, it would indeed be nice to see an eventual implementation such that the above notation could be used with, for instance, mkfs.btrfs, and btrfs balance start -Xconvert, but regardless, that does look to be a way off. -- Duncan
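Hugo's proposed NcMsPp notation (N copies, M-device stripe, P parity devices) is compact enough that a toy parser shows how the existing profiles map onto it. This is only a sketch of the proposed notation, which was never an implementation; the defaults for omitted fields (1 copy, 1-device stripe, 0 parity) are my assumption, and the function name is made up:

```python
import re

def parse_repl(spec):
    """Parse an NcMsPp replication description into its components.
    Any field may be omitted; assumed defaults: 1 copy, 1-device
    stripe, 0 parity devices."""
    m = re.fullmatch(r"(?:(\d+)c)?(?:(\d+)s)?(?:(\d+)p)?", spec)
    if not m or spec == "":
        raise ValueError(f"bad replication spec: {spec!r}")
    copies, stripe, parity = (int(g) if g else default
                              for g, default in zip(m.groups(), (1, 1, 0)))
    return {"copies": copies, "stripe": stripe, "parity": parity}

print(parse_repl("2c"))      # RAID-1-like: {'copies': 2, 'stripe': 1, 'parity': 0}
print(parse_repl("1c3s1p"))  # RAID-5-like: {'copies': 1, 'stripe': 3, 'parity': 1}
```

The DUP-vs-RAID-1 distinction Hugo mentions (same-device vs different-device copies) is exactly the kind of allocation-rule tweak this string form does not capture on its own.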
Re: Using mount -o bind vs mount -o subvol=vol
Brendan Hide posted on Mon, 05 May 2014 08:55:55 +0200 as excerpted: You are 100% right, though. The scale is very small. By negligible, the penalty is at most a few CPU cycles. When compared to the wait time on a spindle, it really doesn't matter much. The analogy I've used before is that of taking a trip (which the data effectively is, between the device and the CPU). We've booked a 10-day cruise and are now debating what we plan on taking to and from the boarding dock. Will taking the local bus with a couple of transfers, or a taxi that will take us there in one trip but there's road construction and thus a detour, or a helicopter to fly directly, get us back from the cruise faster? Obviously, taking the helicopter (at least for the return leg) will get us back a bit faster, but we're talking perhaps a couple hours difference at the end of a 10-day cruise! =:^) -- Duncan
Re: Using mount -o bind vs mount -o subvol=vol
Marc MERLIN posted on Sat, 03 May 2014 17:47:32 -0700 as excerpted: Is there any functional difference between mount -o subvol=usr /dev/sda1 /usr and mount /dev/sda1 /mnt/btrfs_pool mount -o bind /mnt/btrfs_pool/usr /usr ? Brendan answered the primary aspect of this well so I won't deal with that. However, I've some additional (somewhat controversial) opinion/comments on the topic of subvolumes in general. TL;DR: Put simply, with certain sometimes major exceptions, IMO subvolumes are /mostly/ a solution looking for a problem. In the /general/ case, I don't see the point and personally STRONGLY prefer multiple independent partitions for their much stronger data safety and mounting/backup flexibility. That's why I use independent partitions, here. Relevant points to consider: Subvolume negatives, independent partition positives: 1) Multiple subvolumes on a common filesystem share the filesystem tree- and super-structure. If something happens to that filesystem, you had all your data eggs in that one basket and the bottom just dropped out of it! If you can't recover, kiss **ALL** those data eggs goodbye! That's the important one; the one that would prevent me sleeping well if that's the solution I had chosen to use. But there's a number of others, more practical in the binary it's not an unrecoverable failure case. 2) Presently, btrfs is rather limited in the opposing mount options it can apply to subvolumes on the same overall filesystem. Mounting just one subvolume nodatacow, for instance, without mounting all mounted subvolumes of the filesystem nodatacow isn't yet possible, tho the filesystem design allows for it and the feature is roadmapped to appear sometime in the future. This means that at present, the subvolumes solution severely limits your mount options flexibility, altho that problem should go away to a large degree at some rather handwavily defined point in the future. 
3) Filesystem size and time to complete whole-filesystem operations such as balance, scrub and check are directly related; the larger the filesystem, the longer such operations take. There are reports here of balances taking days on multi-terabyte filesystems, and double-digit hours isn't unusual at all. Of course SSDs are generally smaller and (much) faster, but still, a filesystem the size of a quarter or a half-gig SSD could easily take an hour or more to balance or scrub, and that can still be a big deal. Contrast that with the /trivial/ balance/scrub times I see on my partitioned btrfs-on-ssd setup here, some of them under a minute, even the big btrfs of 24 GiB (gentoo packages/sources/ccache filesystem) taking under three minutes (just under 7 seconds per GiB). At those times the return is fast enough I normally run the thing in foreground and wait for it to return in real-time; times trivial enough I can actually do a full filesystem rebalance in order to time it to make this point on a post! =:^) Of course the other aspect of that is that I can for instance fsck my dedicated multimedia filesystem without it interfering with running X and my ordinary work on /home. If it's all the same filesystem and I have to fsck from the initramfs or a rescue disk... Now ask yourself, how likely are you to routinely run a scrub or balance as preventive maintenance if you know it's going to take the entire day to finish? Here, the times are literally so trivial I can and do run a full filesystem rebalance to time it and make this point, and maintenance such as scrub or balance simply ceases to be an issue. I actually learned this point back on mdraid, before I switched to btrfs. When I first setup mdraid, I had only three raids, primary/working, secondary/first-backup, and the raid0 for stuff like package cache that I could simply redownload if necessary. 
But if a device dropped (as it occasionally did after a resume from hibernate, due to hardware taking too long to wake up and the kernel giving up on it), the rebuild would take HOURS! Later on, after a few layout changes, I had many more raids and kept some of them (like the one containing my distro package cache) deactivated unless I actually needed to use them (if I was actually doing an update). Since a good portion of the many more but smaller raids were offline most of the time, if a device dropped, I had far fewer and smaller raids to rebuild, and was typically back up and running in under a half hour. Filesystem maintenance time DOES make a difference! Subvolume positives, independent partition negatives: 4) Many distros are using btrfs subvolumes on a single btrfs storage pool the way they formerly used LVM volume groups, as a common storage pool allowing them the flexibility to (re)allocate space to whatever lvm volume or btrfs subvolume needs it. This is a killer feature from the viewpoint of many distros and users as the flexibility means no more hassle with guessing incorrectly
Re: How does Suse do live filesystem revert with btrfs?
Marc MERLIN posted on Sat, 03 May 2014 17:52:57 -0700 as excerpted: (more questions I'm asking myself while writing my talk slides) I know Suse uses btrfs to roll back filesystem changes. So I understand how you can take a snapshot before making a change, but not how you revert to that snapshot without rebooting or using rsync. How do you do a pivot-root like mountpoint swap to an older snapshot, especially if you have filehandles opened on the current snapshot? Is that what Suse manages, or are they doing something simpler? While I don't have any OpenSuSE specific knowledge on this, I strongly suspect their solution is more along the select-the-root-snapshot-to-roll-back-to-from-the-initramfs/initrd line. Consider, they do the snapshot, then the upgrade. In-use files won't be entirely removed and the upgrade actually activated for them until a reboot or at least an application restart[1] for all those running apps in order to free their in-use files, anyway. At that point, if the user finds something broke, they've just rebooted[1], so rebooting[1] to select the pre-upgrade rootfs snapshot won't be too big a deal, since they've already disrupted the normal high level session and have just attempted a reload in order to discover the breakage, in the first place. IOW, for the rootfs and main system, anyway, the rollback technology is a great step up from not having that snapshot to rollback to in the first place, but it's /not/ /magic/; if a rollback is needed, they almost certainly will need to reboot[1] and from there select the rootfs snapshot to rollback to, in order to mount it and accomplish that rollback. 
--- [1] Reboot: Or possibly dipped to single user mode, and/or to the initramfs, which they'd need to reload and switch-root into for the purpose, but systemd is doing just that sort of thing these days in order to properly unmount rootfs after upgrades before shutdown as it's a step safer than the old style remount read-only, and implementing a snapshot selector and remount of the rootfs in that initr* instead of dropping all the way to a full reboot is only a small step from there. -- Duncan
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 22:06:17 -0700 as excerpted: That's true, but in this case I barely see the point of -m single vs -m raid0. It sounds like they both stripe data anyway, maybe not at the same level, but if both are striped, then they're almost the same in my book :) Single only stripes in such extremely large (1 GiB data, quarter-GiB metadata, per strip) chunks that it doesn't matter for speed, and then only as a result of its chunk allocation policy. If one can define such large strips as striping, which it is in a way, but not really in the practical sense. The effect of a lost device, then, is more or less random, tho for single metadata the effect is likely to be quite large up to total loss, due to the damage to the tree. It's not out of thin air that the multi-device metadata default is raid1 (which unlike the single-device case, should be the same on SSD or spinning rust, since by definition the copies will be on different devices and thus cannot be affected by SSDs' FTL-level de-dup). So the below assumes copies=2 raid1 metadata and is thus only considering single vs. raid0 data. For single data, only files that happened to be partially allocated on the lost device will be damaged. For file sizes above the 1 GiB data chunk size, the chance of damage is therefore rather high, as by definition the file will require multiple chunks and the chances of one of them being on the lost device go up accordingly. But for file sizes significantly under 1 GiB, where data fragmentation is relatively low at least (think a recent rebalance or (auto)defrag), relatively small files are very likely to be located on a single chunk and thus either all there or all missing, depending on whether that chunk was on the missing device or not. 
That contrasts with raid0, where the striping is at sizes well under a chunk (memory page size or 4 MiB on x86/amd64 data I believe, tho the fact that files under the 16 MiB node size may actually be entirely folded into metadata and not have a data extent allocation at all skews things for up to the 16 MiB metadata node size), so the definition of small file likely to be recovered is **MUCH** smaller on raid0, than on single. Effectively, raid0 data you're only (relatively) likely to recover files smaller than 16 MiB, while single data, it's files smaller than 1 GiB. Big difference! -- Duncan
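Duncan's argument — under single, a small file usually sits inside one 1 GiB chunk on one device and so survives unless that device is the lost one, while under raid0 any file bigger than a stripe unit touches every device — can be turned into a back-of-envelope probability model. The placement assumptions here (whole chunks on one random device for single, stripe units spread over all devices for raid0) are deliberate simplifications for illustration, not btrfs's actual layout:

```python
def chance_file_intact(file_bytes, n_devices, profile):
    """Rough model of a file surviving the loss of one device out of
    n_devices.  'single' places whole 1 GiB chunks on one device each;
    'raid0' stripes in 64 KiB units across all devices.  Illustrative
    only - real allocation and fragmentation are more complicated."""
    chunk = 1 << 30    # 1 GiB data chunk
    stripe = 64 << 10  # 64 KiB stripe unit
    unit = chunk if profile == "single" else stripe
    pieces = -(-file_bytes // unit)  # ceil division
    # a striped file touches at most n_devices distinct devices
    placements = min(pieces, n_devices) if profile == "raid0" else pieces
    # the file is intact only if no piece sat on the lost device
    p_piece_safe = (n_devices - 1) / n_devices
    return p_piece_safe ** placements

# a 100 MiB file on a 4-device array:
print(round(chance_file_intact(100 << 20, 4, "single"), 2))  # 0.75
print(round(chance_file_intact(100 << 20, 4, "raid0"), 2))   # 0.32
```

The model reproduces the qualitative point: under single, anything below 1 GiB has a decent chance of surviving; under raid0, anything above a handful of stripe units is almost certainly gone.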
Re: How does Suse do live filesystem revert with btrfs?
Marc MERLIN posted on Sun, 04 May 2014 22:04:59 -0700 as excerpted: On Mon, May 05, 2014 at 01:36:39AM +0100, Hugo Mills wrote: I'm guessing it involves reflink copies of files from the snapshot back to the original, and then restarting affected services. That's about the only other thing that I can think of, but it's got a load of race conditions in it (albeit difficult to hit in most cases, I suspect). Aaah, right, you can use a script to see the file differences between two snapshots, and then restore that with reflink if you can truly get a list of all changed files. However, that is indeed not atomic at all, even if faster than rsync. Would send/receive help in such a script? -- Duncan
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 18:27:19 -0700 as excerpted: On Sun, May 04, 2014 at 09:44:41AM +0200, Brendan Hide wrote: Ah, I see the man page now. This is because SSDs can remap blocks internally so duplicate blocks could end up in the same erase block which negates the benefits of doing metadata duplication. You can force dup but, per the man page, whether or not that is beneficial is questionable. So the reason I was confused originally was this: legolas:~# btrfs fi df /mnt/btrfs_pool1 Data, single: total=734.01GiB, used=435.39GiB System, DUP: total=8.00MiB, used=96.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=6.74GiB Metadata, single: total=8.00MiB, used=0.00 This is on my laptop with an SSD. Clearly btrfs is using duplicate metadata on an SSD, and I did not ask it to do so. Note that I'm still generally happy with the idea of duplicate metadata on an SSD even if it's not bulletproof. In regard to metadata defaulting to single rather than the (otherwise) dup on single-device ssd: 1) In order to do that, btrfs (I guess mkfs.btrfs in this case) must be able to detect that the device *IS* ssd. Depending on the SSD, the kernel version, and whether the btrfs is being created direct on bare-metal device or on some device layered (lvm or dmcrypt or whatever) on top of the bare metal, btrfs may or may not successfully detect that. Obviously in your case[1] the ssd wasn't detected. Question: Does btrfs detect ssd and automatically add it to the mount options for that btrfs? I suspect not, thus consistent behavior in not detecting the SSD. FWIW, it is detected here. I've never specifically added ssd to any of my btrfs mount options, but it's always there in /proc/self/mounts when I check.[2] I believe I've seen you mention using dmcrypt or the like, however, which probably doesn't pass whatever is used for ssd protection on thru, thus explaining btrfs not seeing it and having to specify it yourself, if you wish. 
While I'm not sure, I /think/ btrfs may use the sysfs rotational file (or rather, the same information that the kernel exports to that file) for this detection. For my bare-metal devices that's: /sys/block/sdX/queue/rotational For my ssds that file contains 0 while for spinning rust, it contains 1. The contents of that file are derived in turn from the information exported by the device. I believe the same information can be seen with hdparm -I, in the Configuration section, as Nominal Media Rotation Rate. For my spinning rust that returns an RPM value such as 7200. For my ssds it returns Solid State Device. The same information can be seen with smartctl -i, which has much shorter output so it's easier to find. Look for Rotation Rate. Again, my ssds report Solid State Device, while my spinning rust reports a value such as 7200 rpm. 2) The only reason I happen to know about the SSD metadata single-device single mode default exception (where metadata otherwise defaults to dup mode on single-device, and to raid1 mode on multi-device regardless of the media), is as a result of I believe Chris Mason commenting on it in an on-list reply. The reasoning given in that reply was not the erase-block reason I've seen someone else mention here (and which doesn't quite make sense to me, since I don't know why that would make a difference), but rather: Some SSD firmware does automatic deduplication and compression. On these devices, DUP-mode would almost certainly be stored as a single internal data block with two external address references anyway, so it would actually be single in any case, and defaulting to single (a) doesn't hide that fact, and (b) reduces overhead that's justified for safety otherwise, but if the firmware is doing an end run around that safety anyway, might as well just shortcut the overhead as well. However, while the btrfs default will apply to all (detected) ssds, not all ssds have firmware that does this internal deduplication! 
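The sysfs hint Duncan describes can be read programmatically. The sketch below is a hypothetical helper, not part of any btrfs tooling; the `sysfs_root` parameter is there only so the function can be exercised against a fake directory tree, and the `None` return models exactly the layered-device case (dm-crypt, LVM) where the hint may not be visible:

```python
from pathlib import Path

def is_ssd(device, sysfs_root="/sys/block"):
    """Report whether the kernel flags a block device as non-rotational
    (the same hint discussed above for btrfs's ssd detection).
    Returns True for an SSD, False for spinning rust, and None if the
    sysfs file can't be read at all."""
    f = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        return f.read_text().strip() == "0"
    except OSError:
        return None

# e.g. is_ssd("sda") -> True on an SSD, False on a 7200 rpm disk
```

This mirrors checking the file by hand with `cat /sys/block/sdX/queue/rotational`, as the post describes.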
In fact, the documentation for my ssds sells its LACK of such compression and deduplication as a feature, pointing out that such features tend to make the behavior of a device far less predictable[3], tho they do increase maximum speed and capacity. Which is why I've chosen to specify dup mode on my single-device btrfs here, even on ssds.[4] While it'd be the wrong choice on ssds that do compression and deduplication, on mine, it's still the right choice. =:^) If your SSDs don't do firmware-based dedup/compression, then dup metadata is still arguably the best choice on ssd. But if they do, the single metadata default does indeed make more sense, even if that's not the default you're getting due to lack of ssd detection. --- [1] Obviously ssd not detected: Assuming you didn't specify metadata level, probably a safe assumption or we'd not be having the discussion. Personally, I always make a point of specifying both data and
Re: copies= option
N-copies, M-device-stripe, P-parity-devices (NcMsPp) At the expense of being the terminology nut, who doesn't even like SNIA's chosen terminology because it's confusing, I suggest a concerted effort to either use SNIA's terms anyway, or push back and ask them to make changes before propagating deviant terminology. A strip is a consecutive run of blocks in a single extent (on a single device). Strip size is the number of blocks in a single extent (on a single device). A stripe is a set of strips, one on each member extent (on multiple devices). Stripe size is strip size times non-parity extents. e.g. the Btrfs default strip size is 64KiB, therefore a 5 disk raid5 volume stripe size is 256KiB. I use and specify size units in bytes rather than SNIA's blocks (sectors) because it's less ambiguous. In other words, for M- what we care about is the strip size, which is what md/mdadm calls a chunk. We can't know the stripe size without knowing how many non-parity member devices there are. Chris Murphy -- To unsubscribe from this list: send the line unsubscribe linux-btrfs in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
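The arithmetic in Chris's example can be spelled out as code (function and variable names are mine, not anything from btrfs or SNIA):

```python
KIB = 1024

def stripe_size(strip_size, num_devices, parity_devices=0):
    """Stripe size = strip size times the number of non-parity strips,
    per the definitions above."""
    return strip_size * (num_devices - parity_devices)

# Btrfs default 64KiB strip on a 5-disk raid5: 64KiB * (5 - 1) = 256KiB
print(stripe_size(64 * KIB, 5, parity_devices=1) // KIB)  # 256
```

The same function covers mdadm's terms too: pass the md chunk size as strip_size and parity_devices=0 for raid0, 1 for raid5, 2 for raid6.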
Re: Thoughts on RAID nomenclature
On 05/05/2014 11:17 PM, Hugo Mills wrote: [...] Does this all make sense? Are there any other options or features that we might consider for chunk allocation at this point? The kind of chunk (DATA, METADATA, MIXED) and the subvolume (when/if this possibility will come). As for how to write this information, I suggest the following options: -[DATA|METADATA|MIXED|SYSTEM:]NcMsPp[:driveslist[:/subvolume/path]] Where driveslist is an expression of the disk allocation policy: a) {sdX1:W1,sdX2:W2...} where sdX is the partition involved and W is the weight: #1 {sda:1,sdb:1,sdc:1} means spread over all the disks #2 {sda:1,sdb:2,sdc:3} means linear from sda to sdc #3 {sda:1,sdb:1,sdc:2} means spread on sda and sdb (grouped) then (when full) sdc or b) #1 (sda,sdb,sdc) means spread over all the disks #2 [sda,sdb,sdc] means linear from sda to sdc #3 [(sda,sdb),sdc] means spread on sda and sdb (grouped) then (when full) sdc or c) #1 (sda,sdb,sdc) means spread over all the disks #2 sda,sdb,sdc means linear from sda to sdc #3 (sda,sdb),sdc means spread on sda and sdb (grouped) then (when full) sdc Some examples: - 1c2s3b Default allocation policy - DATA:2c3s4b Default allocation policy for the DATA - METADATA:1c4s:(sda,sdb,sdc,sdd) Spread over all the 4 disks for metadata - MIXED:1c4s:sda,sdc,sdb,sdd Linear over the 4 disks, ordered as the list, for Data+Metadata - DATA:1c4s:(sda,sdc),(sdb,sdd) spread over sda,sdc and then, when these are filled, spread over sdb and sdd - METADATA:1c4s:(sda,sdb,sdc,sdd):/subvolume/path Spread over all the 4 disks for metadata belonging to the subvolume /subvolume/path I think it would be interesting to explore some configuration like - DATA:1c:(sda) - METADATA:2c:(sdb) if sda is bigger and sdb is faster Some further thoughts: - the more I think about an allocation policy on a per-subvolume and/or per-file basis, the more I think it would be messy to manage -- gpg @keyserver.linux.it: Goffredo Baroncelli (kreijackATinwind.it) Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 
8B82 E0B5
Re: Is metadata redundant over more than one drive with raid0 too?
Marc MERLIN posted on Sun, 04 May 2014 18:27:19 -0700 as excerpted: The original reason why I was asking myself this question and trying to figure out how much better -m raid1 -d raid0 was over -m raid0 -d raid0: I think the summary is that in the first case, you're going to be able to recover all/most small files (think maildir) if you lose one device, whereas in the 2nd case, with half the metadata missing, your FS is pretty much fully gone. Fair to say that? Yes. =:^) Now, if I don't care about speed, but wouldn't mind recovering a few bits should something happen (actually in my case mostly knowing the state of the filesystem when a drive was lost so that I can see how many new files showed up since my last backup), it sounds like it wouldn't be bad to use: -m raid1 -d linear Well, assuming that by -d linear you meant -d single. Btrfs doesn't call it linear, tho at the data safety level, btrfs single is actually quite comparable to mdadm linear. =:^) (I had to check. I knew I didn't remember btrfs having linear as an option, and hadn't seen any patches float by on the list that would add it, but since I'm not a dev I don't follow patches /that/ closely, and thought I might have missed it. So I thought I better go check to see what this possible new linear option actually was, if indeed I had missed it. Turns out I didn't miss it after all; there's still no linear option that I can see, unless it's there and simply not documented. =:^) This will not give me the speed boost from raid0 which I don't care about, it will give me metadata redundancy, and due to linear, there is a decent chance that half my files are intact on the remaining drive (depending on their size apparently). Yes. =:^) So one place I use it is not for speed but for one FS that gives me more space without redundancy (rotating buffer streaming video from security cams). 
At the time I used -m raid1 -d raid0, but it sounds like for slightly extra recoverability, I should have used -m raid1 -d linear (and yes, I understand that one should not consider a -d linear recoverable when a drive went missing). That appears to be a very good use of either -d raid0 or -d single, yes. And since you're apparently not streaming such high resolution video that you NEED the raid0, single does indeed give you a somewhat better chance at recovery. Tho with streaming video I wonder what your filesizes are, as video files tend to be pretty big. If they're over the 1 GiB btrfs data chunk size, particularly if you're only running a two-device btrfs, you'd probably lose near all files anyway. Assuming single data mode and file sizes between a GiB and 2 GiB, statistically you should lose near 100% on a two device btrfs with one dropping out, 67% on a three device btrfs with a single device dropout, 50% on four devices, 40% on five devices... If file sizes are 2-3 GiB, you should lose near 100% on 2-3 devices, 75% on four devices, 60% on five, 50% on six... With raid0 data stats would be similar but I believe starting at 16 MiB with 4 MiB intervals. Due to many files under 16 MiB being stored in the metadata, you'd lose few of them, but that'd jump to 100% loss at 16 MiB until you had 5+ devices in the raid0, with 16-20 MiB file loss chance on a 5-device raid0 80%, since chances would be 80% of one strip of the stripe being on the lost device. (That's assuming my 4 MiB strip size assumption is correct, it could be smaller than that, possibly 64 KiB.) -- Duncan - List replies preferred. No HTML msgs. Every nonfree program has a lord, a master -- and if you use the program, he is your master. Richard Stallman
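Duncan's percentages fall out of a simple model: a file spanning k chunks, each allocated on a distinct device, is lost whenever any of those chunks sat on the failed device, giving a loss probability of min(1, k/d) on d devices. A sketch of that arithmetic (my formulation of the reasoning above, not anything from btrfs itself):

```python
def file_loss_probability(chunks_per_file, num_devices):
    """Chance a file is lost when one device fails, assuming its chunks
    sit on distinct devices chosen uniformly (single data profile)."""
    return min(1.0, chunks_per_file / num_devices)

# Files of 1-2 GiB span two 1 GiB data chunks:
for d in (2, 3, 4, 5):
    print(d, f"{file_loss_probability(2, d):.0%}")  # 100%, 67%, 50%, 40%
```

The 2-3 GiB figures follow the same way with chunks_per_file=3: 100% up to three devices, then 75%, 60%, 50%.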
Re: How does btrfs fi show show full?
Marc MERLIN posted on Sun, 04 May 2014 22:50:29 -0700 as excerpted: In the second FS: Label: btrfs_pool1 uuid: [...] Total devices 1 FS bytes used 442.17GiB devid1 size 865.01GiB used 751.04GiB path [...] The difference is huge between 'Total used' and 'devid used'. Is btrfs going to fix this on its own, or likely not and I'm stuck doing a full balance (without filters since I'm balancing data and not metadata)? If that helps. legolas:~# btrfs fi df /mnt/btrfs_pool1 Data, single: total=734.01GiB, used=435.29GiB System, DUP: total=8.00MiB, used=96.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=8.50GiB, used=6.74GiB Metadata, single: total=8.00MiB, used=0.00 Definitely helps. The spread is in data. Try: btrfs balance start -dusage=20 /mnt/btrfs_pool1 You still have plenty of unused (if allocated) space available, so you can play around with the usage= a bit. -dusage=20 will be faster than something like -dusage=50 or -dusage=80, likely MUCH faster, but will return fewer chunks to unallocated, as well. Still, your spread between data-total and data-used is high enough, I expect -dusage=20 will give you pretty good results. Since fi show says you still have ~100 GiB unallocated, there's no real urgency, and again I'd try -dusage=20 the first time. If that doesn't cut it you can of course try bumping the usage= as needed, but because you still have 100 GiB unallocated and because the data used vs. total spread is so big, I really do think -dusage=20 will do it for you. As your actual device usage goes up the spread between used and size will go down, meaning more frequent balances to keep some reasonable unallocated space available, and you'll either need to actually delete some stuff or to bump up those usage= numbers as well, but usage=20 is very likely to be sufficient at this point. 
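The "spread" being eyeballed above is just total minus used on each btrfs fi df line; a small helper makes the arithmetic explicit (the parsing regex is mine and assumes GiB units, as in the output quoted above):

```python
import re

def chunk_slack_gib(fi_df_line):
    """Return allocated-but-unused space (GiB) from one `btrfs fi df` line."""
    m = re.search(r"total=([\d.]+)GiB, used=([\d.]+)GiB", fi_df_line)
    total, used = float(m.group(1)), float(m.group(2))
    return total - used

print(round(chunk_slack_gib("Data, single: total=734.01GiB, used=435.29GiB"), 2))
# ~298.72 GiB allocated but empty -- a good balance candidate
```

The bigger this number relative to total, the more likely a low usage= filter (like -dusage=20) will reclaim plenty of chunks cheaply.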
I hadn't seen anyone try an actual formula as Brendan suggests in his post, and I'm not actually sure that formula will apply well in all use-cases, as I think fragmentation and fill-pattern will have a lot to do with it, but based on his post it does apply for his use-case, and the same general principle if not the specific formula should apply everywhere and is what I'm doing above, only simply eyeballing it, not using a specific formula.
Re: Thoughts on RAID nomenclature
Brendan Hide posted on Mon, 05 May 2014 23:47:17 +0200 as excerpted: At the moment, we have two chunk allocation strategies: dup and spread (for want of a better word; not to be confused with the ssd_spread mount option, which is a whole different kettle of borscht). The dup allocation strategy is currently only available for 2c replication, and only on single-device filesystems. When a filesystem with dup allocation has a second device added to it, it's automatically upgraded to spread. I thought this step was manual - but okay! :) AFAIK, the /allocator/ automatically updates to spread when a second device is added. That is, assuming previous dup metadata on a single device, adding a device will cause new allocations to be in raid1/spread mode, instead of dup. What's manual, however, is that /existing/ chunk allocations don't get automatically updated. For that, a balance must be done. But existing allocations are by definition already allocated, so the chunk allocator doesn't do anything with them. (A rebalance allocates new chunks, rewriting the contents of the old chunks into the new ones before unmapping the now unused old chunks, so again, existing chunks stay where they are until unmapped, it's the NEW chunks that get mapped by the updated allocation policy.)
Re: Btrfs raid allocator
Hendrik Siedelmann posted on Tue, 06 May 2014 12:41:38 +0200 as excerpted: I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. If flexible parallelization is all you're worried about, not data integrity or the other things btrfs does, I'd suggest looking at a more mature solution such as md- or dm-raid. They're more mature and less complex than btrfs, and if you're not using the other features of btrfs anyway, they should simply work better for your use-case.
Re: Please review and comment, dealing with btrfs full issues
Brendan Hide posted on Tue, 06 May 2014 18:30:31 +0200 as excerpted: So in my case when I hit that case, I had to use dusage=0 to recover. Anything above that just didn't work. I suspect when using more than zero the first chunk it wanted to balance wasn't empty - and it had nowhere to put it. Then when you did dusage=0, it didn't need a destination for the data. That is actually an interesting workaround for that case. I've actually used -Xusage=0 (where X=m or d, obviously) for exactly that. If every last bit of filesystem is allocated so another chunk simply cannot be written in order to rewrite partially used chunks into, BUT the spread between allocated and actually used is quite high, there's a reasonably good chance that at least one of those allocated chunks is entirely empty, and -Xusage=0 allows returning it to the unallocated pool without actually requiring a new chunk allocation to do so. With luck, that will free at least one zero-usage chunk (two for metadata dup, but it would both allocate and return to unallocated in pairs, so it balances out), allowing the user to rerun balance, this time with a higher -Xusage=. The other known valid use-case for -Xusage=0 is when freeing the extraneous zero-usage single-mode chunks first created by mkfs.btrfs as part of the mkfs process, so they don't clutter up the btrfs filesystem df output. =:^)
Re: Btrfs raid allocator
On May 6, 2014, at 4:41 AM, Hendrik Siedelmann hendrik.siedelm...@googlemail.com wrote: Hello all! I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. I think the only way to know what works best for your workload is to test configurations with the actual workload. For optimization of multiple device file systems, it's hard to beat XFS on raid0 or even linear/concat due to its parallelization, if you have more than one stream (or a stream that produces a lot of files that XFS can allocate into separate allocation groups). Also mdadm supports user-specified strip/chunk sizes, whereas currently on Btrfs this is fixed to 64KiB. Depending on the file size for your workload, it's possible a much larger strip will yield better performance. Another optimization is hardware RAID with a battery backed write cache (the drives' write caches are disabled) and using the nobarrier mount option. If your workload supports linear/concat then it's fine to use md linear for this. What I'm not sure of is if it's an OK practice to disable barriers if the system is on a UPS (rather than a battery backed hardware RAID cache). You should post the workload and hardware details on the XFS list to get suggestions about such things. They'll also likely recommend the deadline scheduler over cfq. Unless you have a workload really familiar to the responder, they'll tell you any benchmarking you do needs to approximate the actual workflow. A mismatched benchmark to the workload will lead you to the wrong conclusions. Typically when you optimize for a particular workload, other workloads suffer. Chris Murphy
Re: Btrfs raid allocator
On 06.05.2014 23:49, Chris Murphy wrote: On May 6, 2014, at 4:41 AM, Hendrik Siedelmann hendrik.siedelm...@googlemail.com wrote: Hello all! I would like to use btrfs (or anything else actually) to maximize raid0 performance. Basically I have a relatively constant stream of data that simply has to be written out to disk. I think the only way to know what works best for your workload is to test configurations with the actual workload. For optimization of multiple device file systems, it's hard to beat XFS on raid0 or even linear/concat due to its parallelization, if you have more than one stream (or a stream that produces a lot of files that XFS can allocate into separate allocation groups). Also mdadm supports user-specified strip/chunk sizes, whereas currently on Btrfs this is fixed to 64KiB. Depending on the file size for your workload, it's possible a much larger strip will yield better performance. Thanks, that's quite a few knobs I can try out - I just have a lot of data - with a rate of up to 450MB/s that I want to write out in time, preferably without having to rely on too expensive hardware. Another optimization is hardware RAID with a battery backed write cache (the drives' write cache are disabled) and using nobarrier mount option. If your workload supports linear/concat then it's fine to use md linear for this. What I'm not sure of is if it's an OK practice to disable barriers if the system is on a UPS (rather than a battery backed hardware RAID cache). You should post the workload and hardware details on the XFS list to get suggestions about such things. They'll also likely recommend the deadline scheduler over cfq. Actually data integrity does not matter for the workload. If everything is successful the result will be backed up - before that, full filesystem corruption is acceptable as a failure mode. Unless you have a workload really familiar to the responder, they'll tell you any benchmarking you do needs to approximate the actual workflow. 
A mismatched benchmark to the workload will lead you to the wrong conclusions. Typically when you optimize for a particular workload, other workloads suffer. Chris Murphy Thanks again for all the info! I'll get back if everything works fine - or if it doesn't ;-) Cheers Hendrik
btrfs issues in 3.14
Hello, I've been having a number of issues with processes hanging due to btrfs using 3.14 kernels. This seems pretty new as it has been working fine before. I also rebuilt the filesystem and am still receiving hangs. The filesystem is running on dmcrypt which is running on lvm2 which is running on an SSD (SAMSUNG MZMTD256HAGM-000L1). When the issue occurs the process is unable to be killed and the system will not fully shutdown. $ uname -a Linux orange 3.14.2-1-ARCH #1 SMP PREEMPT Sun Apr 27 11:28:44 CEST 2014 x86_64 GNU/Linux $ btrfs --version Btrfs v3.14.1 $ btrfs fi show Btrfs v3.14.1 $ btrfs fi df /home Data, single: total=71.01GiB, used=68.72GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.50GiB, used=863.33MiB Metadata, single: total=8.00MiB, used=0.00 I opened bugs 75181 and 75191 and I'll include the relevant journalctl entries. The kernel was upgraded from 3.14.1-1 to 3.14.2-1 during this time, and the filesystem was rebuilt after the orphan issue. I'm not on this list so please CC me on replies. Thanks, Kenny journal.txt.gz Description: GNU Zip compressed data
Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection.
Original Message Subject: Re: [RFC PATCH 0/2] Kernel space btrfs missing device detection. From: Goffredo Baroncelli kreij...@libero.it To: Qu Wenruo quwen...@cn.fujitsu.com, linux-btrfs@vger.kernel.org Date: 2014-05-07 02:10 Hi, instead of extending the BTRFS_IOCTL_DEV_INFO ioctl, why not add a field under /sys/fs/btrfs/UUID/ ? Something like /sys/fs/btrfs/UUID/missing_device BR G.Baroncelli I think that is also a good idea. I'll try to add it later. Thanks, Qu On 05/06/2014 08:33 AM, Qu Wenruo wrote: Original btrfs will not detect any missing device since there is no notification mechanism for the fs layer to detect missing devices in the block layer. However we don't really need to notify the fs layer upon dev remove; probing in the dev_info/rm_dev ioctls is good enough since they are the only two ioctls caring about missing devices. This patchset will do ioctl-time missing dev detection and return device missing status in the dev_info ioctl using a new member in btrfs_ioctl_dev_info_args with a backward compatible method. Cc: Anand Jain anand.j...@oracle.com Qu Wenruo (2): btrfs: Add missing device check in dev_info/rm_dev ioctl btrfs: Add new member of btrfs_ioctl_dev_info_args. fs/btrfs/ioctl.c | 4 fs/btrfs/volumes.c | 25 - fs/btrfs/volumes.h | 2 ++ include/uapi/linux/btrfs.h | 5 - 4 files changed, 34 insertions(+), 2 deletions(-)
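If the sysfs attribute Goffredo proposes were added, the userspace side would be trivial; a sketch, assuming the /sys/fs/btrfs/UUID/missing_device interface from his suggestion (which is a proposal, not an existing interface at the time of this thread — the function takes the path as a parameter precisely because the attribute is hypothetical):

```python
def read_missing_devices(attr_path):
    """Read a (proposed, hypothetical) btrfs sysfs attribute listing
    missing devices, one per line; [] if the attribute is absent or empty."""
    try:
        with open(attr_path) as f:
            return [line for line in (l.strip() for l in f) if line]
    except FileNotFoundError:
        return []
```

Compared to extending btrfs_ioctl_dev_info_args, a sysfs file needs no new ioctl ABI and is readable from shell scripts, which is presumably why it appealed here.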
Re: btrfs issues in 3.14
On Tue, May 06, 2014 at 08:49:04PM -0300, Kenny MacDermid wrote: Hello, I've been having a number of issues with processes hanging due to btrfs using 3.14 kernels. This seems pretty new as it has been working fine before. I also rebuilt the filesystem and am still receiving hangs. The filesystem is running on dmcrypt which is running on lvm2 which is running on an SSD (SAMSUNG MZMTD256HAGM-000L1). When the issue occurs the process is unable to be killed and the system will not fully shutdown. $ uname -a Linux orange 3.14.2-1-ARCH #1 SMP PREEMPT Sun Apr 27 11:28:44 CEST 2014 x86_64 GNU/Linux $ btrfs --version Btrfs v3.14.1 $ btrfs fi show Btrfs v3.14.1 $ btrfs fi df /home Data, single: total=71.01GiB, used=68.72GiB System, DUP: total=8.00MiB, used=16.00KiB System, single: total=4.00MiB, used=0.00 Metadata, DUP: total=1.50GiB, used=863.33MiB Metadata, single: total=8.00MiB, used=0.00 I opened bugs 75181 and 75191 and I'll include the relevant journalctl entries. The kernel was upgraded from 3.14.1-1 to 3.14.2-1 during this time, and the filesystem was rebuilt after the orphan issue. I'm not on this list so please CC me on replies. What does sysrq+w say when the hang happens? -liubo
Re: Using noCow with snapshots ?
How could BTRFS and a database fight about data recovery? BTRFS offers similar guarantees about data durability etc to other journalled filesystems, and only differs by having checksums, so that while a snapshot might have half the data that was written by an app, you at least know that that half will be consistent. If you had database files on a separate subvol to the database log then you would be at risk of having problems making any sort of consistent snapshot (the Debian approach of /var/log/mysql and /var/lib/mysql is a bad idea). But there would be no difference with LVM snapshots in that regard. -- Sent from my Samsung Galaxy Note 2 with K-9 Mail.