[PATCH] xfstest/btrfs: check for matching kernel send stream ver 2

2014-07-21 Thread Anand Jain
The test case btrfs/049 is relevant to send stream version 2, and
needs kernel patches as well. So call _notrun if there isn't
matching kernel support, as shown below:

btrfs/047 [not run] Missing btrfs kernel patch for send stream version 2, skipped this test
Not run: btrfs/047

Signed-off-by: Anand Jain <anand.j...@oracle.com>
---
 common/rc | 5 +
 1 file changed, 5 insertions(+)

diff --git a/common/rc b/common/rc
index 4a6511f..1c914bb 100644
--- a/common/rc
+++ b/common/rc
@@ -2223,6 +2223,11 @@ _require_btrfs_send_stream_version()
 	if [ $? -ne 0 ]; then
 		_notrun "Missing btrfs-progs send --stream-version command line option, skipped this test"
 	fi
+
+	# test if btrfs kernel supports send stream version 2
+	if [ ! -f /sys/fs/btrfs/send/stream_version ]; then
+		_notrun "Missing btrfs kernel patch for send stream version 2, skipped this test"
+	fi
 }
 
 _require_btrfs_mkfs_feature()
-- 
2.0.0.153.g79d
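
For context, a minimal sketch of how a test could consume the helper extended above; the test number and the send invocation are illustrative only, and --stream-version is the option added by the companion btrfs-progs patch below:

# hypothetical excerpt from tests/btrfs/049
_require_btrfs_send_stream_version
# only reached when both btrfs-progs and the running kernel support stream version 2
$BTRFS_UTIL_PROG send --stream-version 2 -f $tmp.stream $SCRATCH_MNT/snap1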



[PATCH] btrfs-progs: check if there is required kernel send stream version

2014-07-21 Thread Anand Jain
When the kernel does not have the send stream version 2 patches,
btrfs send with --stream-version 2 would fail without giving any
detail about what is wrong. This patch helps to correctly identify
that the required kernel patches are missing.

Signed-off-by: Anand Jain <anand.j...@oracle.com>
---
 cmds-send.c | 13 +
 send.h  |  2 ++
 utils.c | 17 +
 utils.h |  1 +
 4 files changed, 33 insertions(+)

diff --git a/cmds-send.c b/cmds-send.c
index 9a73b32..0c20a6f 100644
--- a/cmds-send.c
+++ b/cmds-send.c
@@ -435,6 +435,7 @@ int cmd_send(int argc, char **argv)
 	u64 parent_root_id = 0;
 	int full_send = 1;
 	int new_end_cmd_semantic = 0;
+	int k_sstream;
 
 	memset(&send, 0, sizeof(send));
 	send.dump_fd = fileno(stdout);
@@ -544,6 +545,18 @@ int cmd_send(int argc, char **argv)
 			ret = 1;
 			goto out;
 		}
+
+		/* check if btrfs kernel supports send stream ver 2 */
+		if (g_stream_version > BTRFS_SEND_STREAM_VERSION_1) {
+			k_sstream = btrfs_read_sysfs(BTRFS_SEND_STREAM_VER_PATH);
+			if (k_sstream < g_stream_version) {
+				fprintf(stderr,
+					"ERROR: Need btrfs kernel send stream version %d or above, %d\n",
+					BTRFS_SEND_STREAM_VERSION_2, k_sstream);
+				ret = 1;
+				goto out;
+			}
+		}
 		break;
 	case 's':
 		g_total_data_size = 1;
diff --git a/send.h b/send.h
index ea56965..d7a171b 100644
--- a/send.h
+++ b/send.h
@@ -24,6 +24,8 @@ extern "C" {
 #endif
 
 #define BTRFS_SEND_STREAM_MAGIC "btrfs-stream"
+#define BTRFS_SEND_STREAM_VER_PATH "/sys/fs/btrfs/send/stream_version"
+
 #define BTRFS_SEND_STREAM_VERSION_1 1
 #define BTRFS_SEND_STREAM_VERSION_2 2
 /* Max supported stream version. */
diff --git a/utils.c b/utils.c
index e144dfd..e3d4fa2 100644
--- a/utils.c
+++ b/utils.c
@@ -2681,3 +2681,20 @@ int fsid_to_mntpt(__u8 *fsid, char *mntpt, int *mnt_cnt)
 
return ret;
 }
+
+int btrfs_read_sysfs(char path[PATH_MAX])
+{
+	int fd;
+	char val;
+
+	fd = open(path, O_RDONLY);
+	if (fd < 0)
+		return -errno;
+
+	if (read(fd, &val, sizeof(char)) < sizeof(char)) {
+		close(fd);
+		return -EINVAL;
+	}
+	close(fd);
+	return atoi((const char *)&val);
+}
diff --git a/utils.h b/utils.h
index ddf31cf..0c9b65f 100644
--- a/utils.h
+++ b/utils.h
@@ -153,5 +153,6 @@ static inline u64 btrfs_min_dev_size(u32 leafsize)
return 2 * (BTRFS_MKFS_SYSTEM_GROUP_SIZE +
btrfs_min_global_blk_rsv_size(leafsize));
 }
+int btrfs_read_sysfs(char path[PATH_MAX]);
 
 #endif
-- 
2.0.0.153.g79d
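
As a usage sketch, a script could probe the proposed sysfs interface before asking for stream version 2; note the path below only exists with the (not yet merged) kernel send stream version 2 patches, so treat it as an assumption:

# check the kernel's supported send stream version before using --stream-version 2
if [ -r /sys/fs/btrfs/send/stream_version ]; then
	echo "kernel send stream version: $(cat /sys/fs/btrfs/send/stream_version)"
else
	echo "kernel lacks the send stream version 2 patches" >&2
fi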



[PATCH] btrfs: Add show_path function for btrfs_super_ops.

2014-07-21 Thread Qu Wenruo
The show_path() function in struct super_operations is used to output
subtree mount info for mountinfo.
Without an implementation of show_path(), users cannot find out where
each subvolume is mounted when using the 'subvolid=' mount option.
(When mounted with the 'subvol=' mount option, vfs is aware of the subtree
mount and can do the path resolution by itself.)

With this patch, end users will be able to use findmnt(8) or other
programs reading mountinfo to find out which btrfs subvolume is mounted.

Though we use fs_info->subvol_sem to protect show_path() from subvolume
destroying/creating, if a user renames/moves the parent non-subvolume
dir of a subvolume, it is still possible that concurrency may happen and
cause btrfs_search_slot() to fail to find the desired key.
In that case, we just return -EBUSY and inform the user to try again, since
extra locking such as locking the whole subvolume tree is too expensive for
such usage.

Reported-by: Stefan G. Weichinger <li...@xunil.at>
Signed-off-by: Qu Wenruo <quwen...@cn.fujitsu.com>
---
 fs/btrfs/ctree.h |   2 +
 fs/btrfs/ioctl.c |   4 +-
 fs/btrfs/super.c | 112 +++
 3 files changed, 116 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be91397..63fba05 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3881,6 +3881,8 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
struct btrfs_ioctl_space_info *space);
 void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
   struct btrfs_ioctl_balance_args *bargs);
+int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
+ u64 tree_id, u64 dirid, char *name);
 
 
 /* file.c */
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 47aceb4..c2bd6b5 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -2218,8 +2218,8 @@ static noinline int btrfs_ioctl_tree_search_v2(struct file *file,
  * Search INODE_REFs to identify path name of 'dirid' directory
  * in a 'tree_id' tree. and sets path name to 'name'.
  */
-static noinline int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
-   u64 tree_id, u64 dirid, char *name)
+int btrfs_search_path_in_tree(struct btrfs_fs_info *info,
+ u64 tree_id, u64 dirid, char *name)
 {
struct btrfs_root *root;
struct btrfs_key key;
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 8e16bca..b5ece81 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1831,6 +1831,117 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root)
return 0;
 }
 
+static char *str_prepend(char *dest, char *src)
+{
+   memmove(dest + strlen(src), dest, strlen(dest) + 1);
+   memcpy(dest, src, strlen(src));
+   return dest;
+}
+
+static int alloc_mem_if_needed(char **dest, char *src, int *len)
+{
+   char *tmp;
+
+	if (unlikely(strlen(*dest) + strlen(src) > *len)) {
+   *len *= 2;
+   tmp = krealloc(*dest, *len, GFP_NOFS);
+   if (!tmp) {
+   return -ENOMEM;
+   }
+   *dest = tmp;
+   }
+   return 0;
+}
+
+static int btrfs_show_path(struct seq_file *m, struct dentry *mount_root)
+{
+	struct inode *inode = mount_root->d_inode;
+	struct btrfs_root *subv_root = BTRFS_I(inode)->root;
+	struct btrfs_fs_info *fs_info = subv_root->fs_info;
+	struct btrfs_root *tree_root = fs_info->tree_root;
+   struct btrfs_root_ref *ref;
+   struct btrfs_key key;
+   struct btrfs_key found_key;
+   struct btrfs_path *path = NULL;
+   char *name = NULL;
+   char *buf = NULL;
+   int ret = 0;
+   int len;
+   u64 dirid = 0;
+   u16 namelen;
+
+   name = kmalloc(PAGE_SIZE, GFP_NOFS);
+   len = PAGE_SIZE;
+   buf = kmalloc(BTRFS_INO_LOOKUP_PATH_MAX, GFP_NOFS);
+   path = btrfs_alloc_path();
+   if (!name || !buf || !path) {
+   ret = -ENOMEM;
+   goto out_free;
+   }
+   *name = '/';
+   *(name + 1) = '\0';
+
+	key.objectid = subv_root->root_key.objectid;
+   key.type = BTRFS_ROOT_BACKREF_KEY;
+   key.offset = 0;
+	down_read(&fs_info->subvol_sem);
+	while (key.objectid != BTRFS_FS_TREE_OBJECTID) {
+		ret = btrfs_search_slot_for_read(tree_root, &key, path, 1, 1);
+		if (ret < 0)
+			goto out;
+		if (ret) {
+			ret = -ENOENT;
+			goto out;
+		}
+		btrfs_item_key_to_cpu(path->nodes[0], &found_key,
+				      path->slots[0]);
+		if (found_key.objectid != key.objectid ||
+		    found_key.type != BTRFS_ROOT_BACKREF_KEY) {
+			ret = -ENOENT;
+			goto out;
+		}
+   /* append the subvol 

Re: ENOSPC errors during balance

2014-07-21 Thread Brendan Hide

On 20/07/14 14:59, Duncan wrote:

Marc Joliet posted on Sun, 20 Jul 2014 12:22:33 +0200 as excerpted:


On the other hand, the wiki [0] says that defragmentation (and
balancing) is optional, and the only reason stated for doing either is
because they will have impact on performance.

Yes.  That's what threw off the other guy as well.  He decided to skip it
for the same reason.

If I had a wiki account I'd change it, but for whatever reason I tend to
be far more comfortable writing list replies, sometimes repeatedly, than
writing anything on the web, which I tend to treat as read-only.  So I've
never gotten a wiki account and thus haven't changed it, and apparently
the other guy with the problem and anyone else that knows hasn't changed
it either, so the conversion page still continues to underemphasize the
importance of completing the conversion steps, including the defrag, in
proper order.

I've inserted information specific to this in the wiki. Others with wiki 
accounts, feel free to review:

https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3#Before_first_use

--
__
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97



Re: ENOSPC errors during balance

2014-07-21 Thread Marc Joliet
Am Sun, 20 Jul 2014 21:44:40 +0200
schrieb Marc Joliet mar...@gmx.de:

[...]
 What I did:
 
 - delete the single largest file on the file system, a 12 GB VM image, along
   with all subvolumes that contained it
 - rsync it over again
[...]

I want to point out at this point, though, that doing those two steps freed a
disproportionate amount of space.  The image file is only 12 GB, and it hadn't
changed in any of the snapshots (I haven't used this VM since June), so that
subvolume delete -c snapshots returned after a few seconds. Yet deleting it
seems to have freed up twice as much. You can see this from the filesystem df
output: before, used was at 229.04 GiB, and after deleting it and copying it
back (and after a day's worth of backups) went down to 218 GiB.

Does anyone have any idea how this happened?

Actually, now I remember something that is probably related: when I first
moved to my current backup scheme last week, I first copied the data from the
last rsnapshot based backup with cp --reflink to the new backup location, but
forgot to use -a.  I interrupted it and ran cp -a -u --reflink, but it had
already copied a lot, and I was too impatient to start over; after all, the
data hadn't changed.  Then, when rsync (with --inplace) ran for the first time,
all of these files with wrong permissions and different time stamps were copied
over, but for some reason, the space used increased *greatly*; *much* more than
I would expect from changed metadata.

The total size of the file system data should be around 142 GB (+ snapshots),
but, well, it's more than 1.5 times as much.

Perhaps cp --reflink treats hard links differently than expected?  I would have
expected the data pointed to by the hard link to have been referenced, but
maybe something else happened?

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




Re: 1 week to rebuild 4x 3TB raid10 is a long time!

2014-07-21 Thread TM
Wang Shilong wangsl.fnst at cn.fujitsu.com writes:

 Just my two cents:
 
 Since 'btrfs replace' supports RAID10, I suppose using the replace
 operation is better than 'device removal and add'.
 
 Another question is related to btrfs snapshot-aware balance.
 How many snapshots did you have in your system?
 
 Of course, during balance/resize/device removal operations,
 you could still snapshot, but fewer snapshots should speed things up!
 
 Anyway, 'btrfs replace' is implemented more effectively than
 'device removal and add'.
 


Hi Wang,
just one subvolume, no snapshots or anything else.

device replace: to tell you the truth I have not used it in the past. Most
of my testing was done 2 years ago. So in this 'kind of production' system I
did not try it. But if I knew that it was faster, perhaps I could have used
it. Does anyone have statistics for such a replace and the time it takes?

Also, can replace be used when one device is missing? Can't find
documentation, e.g.
btrfs replace start missing /dev/sdXX


TM




Q: BTRFS_IOC_DEFRAG_RANGE and START_IO

2014-07-21 Thread Timofey Titovets
I am working on readahead in systemd and trying to complete a todo item for it.
One of the todos is:
 readahead: use BTRFS_IOC_DEFRAG_RANGE instead of BTRFS_IOC_DEFRAG
 ioctl, with START_IO

Can someone explain what the start_io flag in BTRFS_IOC_DEFRAG_RANGE does?
Does it just force the data to be written out after defragmentation, or does it do something else?
Does this flag mean that btrfs can guarantee data consistency after defragmentation?

Thanks for any explanation!

-- 
Best regards,
Timofey.


Re: 1 week to rebuild 4x 3TB raid10 is a long time!

2014-07-21 Thread ronnie sahlberg
On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

 If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
 an 8.3ms rotational latency (half a rotation), an average 64kb write and
 a 100MB/S streaming write speed, each write comes in at ~21ms, which
 gives us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S,
 DISK LIMITED.

 The 5MB/S that TM is seeing is fine, considering the small files he says
 he has.

 Thanks for the additional numbers supporting my point. =:^)

 I had run some of the numbers but not to the extent you just did, so I
 didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
 range of expectation for spinning rust, given the current state of
 optimization... or more accurately the lack thereof, due to the focus
 still being on features.


That is actually nonsense.
Raid rebuild operates on the block/stripe layer and not on the filesystem layer.
It does not matter at all what the average file size is.

Raid rebuild is really only limited by disk i/o speed when performing
a linear read of the whole spindle using huge i/o sizes,
or, if you have multiple spindles on the same bus, the bus saturation speed.

Thus it is perfectly reasonable to expect ~50MByte/second, per spindle,
when doing a raid rebuild.
That is for the naive rebuild that rebuilds every single stripe. A
smarter rebuild that knows which stripes are unused can skip the
unused stripes and thus become even faster than that.


Now, that the rebuild is off by an order of magnitude is by design but
should be fixed at some stage; with the current state of btrfs it is
probably better to focus on other, more urgent areas first.


Re: `btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.` in `btrfsck`

2014-07-21 Thread Karl-Philipp Richter
Hi,
I could `btrfsck --repair` the sparse file with Linux 3.15.6-utopic from
http://kernel.ubuntu.com/~kernel-ppa/mainline/ and btrfsck 3.12-1 (from
btrfs-tools package in Ubuntu 14.04).

Thanks for your hints, Wang!

All the best,
Karl

Am 18.07.2014 14:13, schrieb Wang Shilong:
 
 Hi,
 
 There are some patches for fsck in flight; they are integrated in David's branches.
 You can pull from David's latest branch, and see if it helps:
 
 https://github.com/kdave/btrfs-progs (branch integration-20140704)
 
 Have a try and see if it helps anyway.
 
 Thanks,
 Wang
 
 Hi together,
 I'm experiencing the following issues when I invoke `btrfsck` on a
 sparse file image with a GPT and one (the only) btrfs partition attached
 to a loop device

$ sudo btrfsck --repair --init-csum-tree --init-extent-tree -b /dev/loop0p1
Incorrect local backref count on 128510738432 root 5 owner 3849475 offset 0 found 1 wanted 0 back 0xbab41270
backpointer mismatch on [128510738432 4096]
ref mismatch on [128510742528 12288] extent item 0, found 1
btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.

$ sudo btrfsck --repair --init-csum-tree --init-extent-tree /dev/loop0p1
Incorrect local backref count on 128510726144 root 5 owner 3849470 offset 0 found 1 wanted 0 back 0xbbcb9500
backpointer mismatch on [128510726144 12288]
ref mismatch on [128510738432 4096] extent item 0, found 1
adding new data backref on 128510738432 root 5 owner 3849475 offset 0 found 1
Backref 128510738432 root 5 owner 3849475 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 128510738432 root 5 owner 3849475 offset 0 found 1 wanted 0 back 0xbbcb9630
backpointer mismatch on [128510738432 4096]
ref mismatch on [128510742528 12288] extent item 0, found 1
btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.

$ sudo btrfsck --repair /dev/loop0p1
Incorrect local backref count on 130861096960 root 5 owner 22733727 offset 0 found 1 wanted 0 back 0xc7c7d170
backpointer mismatch on [130861096960 8192]
ref mismatch on [130861105152 8192] extent item 0, found 1
btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.

$ sudo btrfsck --repair /dev/loop0p1
Backref 130861096960 root 5 owner 22733727 offset 0 num_refs 0 not found in extent tree
Incorrect local backref count on 130861096960 root 5 owner 22733727 offset 0 found 1 wanted 0 back 0xc7f31170
backpointer mismatch on [130861096960 8192]
ref mismatch on [130861105152 8192] extent item 0, found 1
btrfsck: extent_io.c:612: free_extent_buffer: Assertion `!(eb->flags & 1)' failed.

 I'm using `btrfs-progs` 24cf4d8c3ee924b474f68514e0167cc2e602a48d on
 Linux 3.16-rc5 (anything else, i.e. older versions, gives me an immediate
 error after start because of the erroneous file system)

 I'd like to know whether this (assertion) error is related to a bug or
 missing feature in btrfs-progs and might be fixed at some point or
 whether this might indicate a completely messed up btrfs.

 Best regards,
 Karl-P. Richter

 





Re: 1 week to rebuild 4x 3TB raid10 is a long time!

2014-07-21 Thread Chris Murphy

On Jul 21, 2014, at 10:46 AM, ronnie sahlberg ronniesahlb...@gmail.com wrote:

 On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:
 
 If you assume a 12ms average seek time (normal for 7200RPM SATA drives),
 an 8.3ms rotational latency (half a rotation), an average 64kb write and
 a 100MB/S streaming write speed, each write comes in at ~21ms, which
 gives us ~47 IOPS.  With the 64KB write size, this comes out to ~3MB/S,
 DISK LIMITED.
 
 The 5MB/S that TM is seeing is fine, considering the small files he says
 he has.
 
 Thanks for the additional numbers supporting my point. =:^)
 
 I had run some of the numbers but not to the extent you just did, so I
 didn't know where 5 MiB/s fit in, only that it wasn't entirely out of the
 range of expectation for spinning rust, given the current state of
 optimization... or more accurately the lack thereof, due to the focus
 still being on features.
 
 
 That is actually nonsense.
 Raid rebuild operates on the block/stripe layer and not on the filesystem layer.

Not on Btrfs. It is on the filesystem layer. However, a rebuild is about 
replicating metadata (up to 256MB) and data (up to 1GB) chunks. For raid10, 
those are further broken down into 64KB strips. So the smallest size unit for 
replication during a rebuild on Btrfs would be 64KB.

Anyway 5MB/s seems really low to me, so I'm suspicious something else is going 
on. I haven't done a rebuild in a couple months, but my recollection is it's 
always been as fast as the write performance of a single device in the btrfs 
volume.

I'd be looking in dmesg for any of the physical drives being reset, having read 
or write errors, and I'd do some individual drive testing to see if the problem 
can be isolated. And if that's not helpful, well, something that is really tedious 
and produces verbose amounts of information, but might reveal the issue, is to 
capture the actual commands going to the physical devices:

http://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg34886.html
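
One common way to capture that kind of per-device command trace (not necessarily what the linked message describes) is blktrace; a rough sketch, with /dev/sdb standing in for one of the physical drives:

# trace block-layer requests hitting one member device during the rebuild
blktrace -d /dev/sdb -o - | blkparse -i -

# or record for 60 seconds and parse afterwards
blktrace -d /dev/sdb -w 60 -o rebuild-sdb
blkparse -i rebuild-sdb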

My expectation (i.e. I'm guessing) based on previous testing is that whether 
raid1 or raid10, the actual read/write commands will each be 256KB in size. 
Btrfs rebuild is basically designed to be a sequential operation. This could 
maybe fall apart if there were somehow many minimally full chunks, which is 
probably unlikely.

Chris Murphy



Testing with flaky disk

2014-07-21 Thread ronnie sahlberg
List, btrfs developers.

I started working on a test tool for SCSI initiators and filesystem folks.
It is an iSCSI target that implements a bad, flaky disk where you
can set precise controls of how/what is broken, which you can use to test
error and recovery paths in the initiator/filesystem.

The tool is available at :
https://github.com/rsahlberg/flaky-stgt.git
and is a modified version of the TGTD iscsi target.


Right now it is just an initial prototype and it needs more work to
add more types of errors as well as to make it more user-friendly.
But it is still useful enough to illustrate certain failure cases
which could be helpful to btrfs and others.


Let me illustrate. Let's start by creating a BTRFS filesystem spanning
three 1G disks:

#
# Create three disks and export them through flaky iSCSI
#
truncate -s 1G /data/tmp/disk1.img
truncate -s 1G /data/tmp/disk2.img
truncate -s 1G /data/tmp/disk3.img

killall -9 tgtd
./usr/tgtd -f -d 1 &

sleep 3

./usr/tgtadm --op new --mode target --tid 1 -T iqn.ronnie.test

./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 1 -b
/data/tmp/disk1.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 2 -b
/data/tmp/disk2.img --blocksize=4096
./usr/tgtadm --op new --mode logicalunit --tid 1 --lun 3 -b
/data/tmp/disk3.img --blocksize=4096

./usr/tgtadm --op bind --mode target --tid 1 -I ALL


#
# connect to the three disks
#
iscsiadm --mode discoverydb --type sendtargets --portal 127.0.0.1 --discover
iscsiadm --mode node --targetname iqn.ronnie.test --portal
127.0.0.1:3260 --login
#
# check dmesg, you should now have three new 1G disks
#
# Use: iscsiadm --mode node --targetname iqn.ronnie.test \
#  --portal 127.0.0.1:3260 --logout
# to disconnect the disks when you are finished.


# create a btrfs filesystem
mkfs.btrfs -f -d raid1
/dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1
/dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-2
/dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3

# mount the filesystem
mount /dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-1 /mnt


Then we can proceed to copy a bunch of data to the filesystem so that
there will be some blocks used.


Now we can see how/what happens in the case of a single bad disk.
Let's say the disk has gone bad: it is still possible to read from the
disk, but all writes fail with a medium error.
Perhaps this is similar to the case of a cheap disk that has
completely run out of blocks to reallocate to?


===
# make all writes to the third disk fail with write error.
# 3 - MEDIUM ERROR
# 0x0c02 - WRITE ERROR AUTOREALLOCATION FAILED
#
./usr/tgtadm --mode error --op new --tid 1 --lun 3 --error
op=WRITE10,lba=0,len=,pct=100,pause=0,repeat=0,action=CHECK_CONDITION,key=3,asc=0x0c02

# To show all current error injects:
# ./usr/tgtadm --mode error --op show
#
# To delete/clear all current error injects:
# ./usr/tgtadm --mode error --op delete
===
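
With an inject like the above armed, a quick way to see its effect from the initiator side is just to push writes through the filesystem and watch the kernel log; a rough sketch (file name and sizes are arbitrary, and the exact error messages vary by kernel version):

# exercise the injected write errors from the initiator side
dd if=/dev/zero of=/mnt/testfile bs=1M count=64 conv=fsync
# expect I/O errors / btrfs write errors reported against the lun-3 disk
dmesg | tail -n 50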



If you now know that this disk has gone bad,  you could try to delete
the device :

btrfs device delete
/dev/disk/by-path/ip-127.0.0.1:3260-iscsi-iqn.ronnie.test-lun-3 /mnt

but this will probably not work, since at least up to semi-recent
versions of btrfs you cannot remove a device from the filesystem
UNLESS you can also write to the device.

This makes it impossible to remove the bad device in any way other than
physically removing the device.
This is suboptimal from a data integrity point of view since, if the
disk is readable, it can potentially still contain valid copies of the
data which might be silently errored on the other mirror.

At some stage, from a data integrity and data robustness standpoint,
it would be nice to be able to device delete a device that is
readable, and contain a valid copy of the data, but still unwriteable.


There are a bunch of other things you can test and emulate with this too.
I have only tested this with semi-recent versions of btrfs and not the
latest version.
I will wait until the current versions of btrfs become more
stable/robust before I start experimenting with it.


Since I think this could be invaluable for a filesystem
developer, please have a look. I am more than happy to add additional
features that would make it even more useful for
filesystem-error-path-and-recovery testing.



regards
ronnie sahlberg


Re: ENOSPC errors during balance

2014-07-21 Thread Marc Joliet
Am Mon, 21 Jul 2014 15:22:16 +0200
schrieb Marc Joliet mar...@gmx.de:

 Am Sun, 20 Jul 2014 21:44:40 +0200
 schrieb Marc Joliet mar...@gmx.de:
 
 [...]
  What I did:
  
  - delete the single largest file on the file system, a 12 GB VM image, along
with all subvolumes that contained it
  - rsync it over again
 [...]
 
 I want to point out at this point, though, that doing those two steps freed a
 disproportionate amount of space.  The image file is only 12 GB, and it hadn't
 changed in any of the snapshots (I haven't used this VM since June), so that
 subvolume delete -c snapshots returned after a few seconds. Yet deleting it
 seems to have freed up twice as much. You can see this from the filesystem df
 output: before, used was at 229.04 GiB, and after deleting it and copying it
 back (and after a day's worth of backups) went down to 218 GiB.

 Does anyone have any idea how this happened?

 Actually, now I remember something that is probably related: when I first
 moved to my current backup scheme last week, I first copied the data from the
 last rsnapshot based backup with cp --reflink to the new backup location, but
 forgot to use -a.  I interrupted it and ran cp -a -u --reflink, but it had
 already copied a lot, and I was too impatient to start over; after all, the
 data hadn't changed.  Then, when rsync (with --inplace) ran for the first time,
 all of these files with wrong permissions and different time stamps were copied
 over, but for some reason, the space used increased *greatly*; *much* more than
 I would expect from changed metadata.

 The total size of the file system data should be around 142 GB (+ snapshots),
 but, well, it's more than 1.5 times as much.

 Perhaps cp --reflink treats hard links differently than expected?  I would have
 expected the data pointed to by the hard link to have been referenced, but
 maybe something else happened?

Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it
freed up whatever was taking up so much space, so as of now the file system
uses only 169.14 GiB (from 218).  Weird.

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup




Re: ENOSPC errors during balance

2014-07-21 Thread Marc Joliet
Am Tue, 22 Jul 2014 00:30:57 +0200
schrieb Marc Joliet mar...@gmx.de:

 Am Mon, 21 Jul 2014 15:22:16 +0200
 schrieb Marc Joliet mar...@gmx.de:
 
  Am Sun, 20 Jul 2014 21:44:40 +0200
  schrieb Marc Joliet mar...@gmx.de:
  
  [...]
   What I did:

   - delete the single largest file on the file system, a 12 GB VM image, along
     with all subvolumes that contained it
   - rsync it over again
  [...]

  I want to point out at this point, though, that doing those two steps freed a
  disproportionate amount of space.  The image file is only 12 GB, and it hadn't
  changed in any of the snapshots (I haven't used this VM since June), so that
  subvolume delete -c snapshots returned after a few seconds. Yet deleting it
  seems to have freed up twice as much. You can see this from the filesystem df
  output: before, used was at 229.04 GiB, and after deleting it and copying it
  back (and after a day's worth of backups) went down to 218 GiB.

  Does anyone have any idea how this happened?

  Actually, now I remember something that is probably related: when I first
  moved to my current backup scheme last week, I first copied the data from the
  last rsnapshot based backup with cp --reflink to the new backup location, but
  forgot to use -a.  I interrupted it and ran cp -a -u --reflink, but it had
  already copied a lot, and I was too impatient to start over; after all, the
  data hadn't changed.  Then, when rsync (with --inplace) ran for the first time,
  all of these files with wrong permissions and different time stamps were copied
  over, but for some reason, the space used increased *greatly*; *much* more than
  I would expect from changed metadata.

  The total size of the file system data should be around 142 GB (+ snapshots),
  but, well, it's more than 1.5 times as much.

  Perhaps cp --reflink treats hard links differently than expected?  I would have
  expected the data pointed to by the hard link to have been referenced, but
  maybe something else happened?
 
 Hah, OK, apparently when my daily backup removed the oldest daily snapshot, it
 freed up whatever was taking up so much space, so as of now the file system
 uses only 169.14 GiB (from 218).  Weird.

And now that the background deletion of the old snapshots is done, the file
system ended up at:

# btrfs filesystem df /run/media/marcec/MARCEC_BACKUP
Data, single: total=219.00GiB, used=140.13GiB
System, DUP: total=32.00MiB, used=36.00KiB
Metadata, DUP: total=4.50GiB, used=2.40GiB
unknown, single: total=512.00MiB, used=0.00

I don't know how reliable du is for this, but I used it to estimate how much
used data I should expect, and I get 138 GiB.  That means that the snapshots
yield about 2 GiB overhead, which is very reasonable, I think.  Obviously
I'll be starting a full balance now.

I still think this whole... thing is very odd, hopefully somebody can shed
some light on it for me (maybe it's obvious, but I don't see it).

-- 
Marc Joliet
--
People who think they know everything really annoy those of us who know we
don't - Bjarne Stroustrup






Re: 1 week to rebuild 4x 3TB raid10 is a long time!

2014-07-21 Thread Wang Shilong

On 07/21/2014 10:00 PM, TM wrote:

Wang Shilong wangsl.fnst at cn.fujitsu.com writes:


Just my two cents:

Since 'btrfs replace' supports RAID10, I suppose using the replace
operation is better than 'device removal and add'.

Another question is related to btrfs snapshot-aware balance.
How many snapshots did you have in your system?

Of course, during balance/resize/device removal operations,
you could still snapshot, but fewer snapshots should speed things up!

Anyway, 'btrfs replace' is implemented more effectively than
'device removal and add'.



Hi Wang,
just one subvolume, no snapshots or anything else.

device replace: to tell you the truth I have not used it in the past. Most
of my testing was done 2 years ago. So in this 'kind of production' system I
did not try it. But if I knew that it was faster, perhaps I could have used
it. Does anyone have statistics for such a replace and the time it takes?

I don't have specific statistics about this. The conclusion comes from
the implementation differences between replace and 'device removal'.



Also, can replace be used when one device is missing? Can't find
documentation, e.g.
btrfs replace start missing /dev/sdXX
The latest btrfs-progs includes a man page for btrfs-replace. Actually, you
could use it

something like:

btrfs replace start <srcdev>|<devid> <targetdev> <mnt>

You could use 'btrfs file show' to see the missing device id, and then run
btrfs replace.
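
A rough sketch of that sequence for a missing device (the devid, device name and mount point below are made up):

# find the devid reported as missing
btrfs filesystem show /mnt
# the filesystem must be mounted, with -o degraded if a device is absent
btrfs replace start -B 3 /dev/sde /mnt
btrfs replace status /mnt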


Thanks,
Wang



TM







Re: 1 week to rebuild 4x 3TB raid10 is a long time!

2014-07-21 Thread Duncan
ronnie sahlberg posted on Mon, 21 Jul 2014 09:46:07 -0700 as excerpted:

 On Sun, Jul 20, 2014 at 7:48 PM, Duncan 1i5t5.dun...@cox.net wrote:
 ashford posted on Sun, 20 Jul 2014 12:59:21 -0700 as excerpted:

 If you assume a 12ms average seek time (normal for 7200RPM SATA
 drives), an 8.3ms rotational latency (half a rotation), an average
 64kb write and a 100MB/S streaming write speed, each write comes in
 at ~21ms, which gives us ~47 IOPS.  With the 64KB write size, this
 comes out to ~3MB/S, DISK LIMITED.

 The 5MB/S that TM is seeing is fine, considering the small files he
 says he has.

 That is actually nonsense.
 Raid rebuild operates on the block/stripe layer and not on the
 filesystem layer.

If we were talking about a normal raid, yes.  But we're talking about 
btrFS, note the FS for filesystem, so indeed it *IS* the filesystem 
layer.  Now this particular filesystem /does/ happen to have raid 
properties as well, but it's definitely filesystem level...

 It does not matter at all what the average file size is.

... and the filesize /does/ matter.

 Raid rebuild is really only limited by disk i/o speed when performing a
 linear read of the whole spindle using huge i/o sizes,
 or, if you have multiple spindles on the same bus, the bus saturation
 speed.

Makes sense... if you're dealing at the raid level.  If we were talking 
about dmraid or mdraid... and they're both much more mature and 
optimized, as well, so 50 MiB/sec, per spindle in parallel, would indeed 
be a reasonable expectation for them.

But (barring bugs, which will and do happen at this stage of development) 
btrfs both makes far better data validity guarantees, and does a lot more 
complex processing what with COW and snapshotting, etc, of course in 
addition to the normal filesystem level stuff AND the raid-level stuff it 
does.

 Thus is is perfectly reasonabe to expect ~50MByte/second, per spindle,
 when doing a raid rebuild.

... And perfectly reasonable, at least at this point, to expect ~5 MiB/
sec total thruput, one spindle at a time, for btrfs.

 That is for the naive rebuild that rebuilds every single stripe. A
 smarter rebuild that knows which stripes are unused can skip the unused
 stripes and thus become even faster than that.
 
 
 Now, that the rebuild is off by an order of magnitude is by design but
 should be fixed at some stage, but with the current state of btrfs it is
 probably better to focus on other more urgent areas first.

Because of all the extra work it does, btrfs may never get to full 
streaming speed across all spindles at once.  But it can and will 
certainly get much better than it is, once the focus moves to 
optimization.  *AND*, because it /does/ know which areas of the device 
are actually in use, once btrfs is optimized, it's quite likely that 
despite the slower raw speed, because it won't have to deal with the 
unused area, at least with the typically 20-60% unused filesystems most 
people run, rebuild times will match or be faster than raid-layer-only 
technologies that must rebuild the entire device, because they do /not/ 
know which areas are unused.

-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman



Re: ENOSPC errors during balance

2014-07-21 Thread Duncan
Marc Joliet posted on Tue, 22 Jul 2014 01:30:22 +0200 as excerpted:

 And now that the background deletion of the old snapshots is done, the file
 system ended up at:
 
 # btrfs filesystem df /run/media/marcec/MARCEC_BACKUP
 Data, single: total=219.00GiB, used=140.13GiB
 System, DUP: total=32.00MiB, used=36.00KiB
 Metadata, DUP: total=4.50GiB, used=2.40GiB
 unknown, single: total=512.00MiB, used=0.00
 
 I don't know how reliable du is for this, but I used it to estimate how much
 used data I should expect, and I get 138 GiB.  That means that the snapshots
 yield about 2 GiB overhead, which is very reasonable, I think.  Obviously
 I'll be starting a full balance now.

FWIW, the balance should reduce the data total quite a bit, to 141-ish GiB
(might be 142 or 145, but it should definitely come down from 219 GiB),
because the spread between total and used is relatively high, now, and balance
is what's used to bring that back down.

Metadata total will probably come down a bit as well, to 3.00 GiB or so.

What's going on there is this:  Btrfs allocates and deallocates data and
metadata in two stages.  First it allocates chunks, 1 GiB in size for
data, 256 MiB in size for metadata, but because metadata is dup by default
it allocates two chunks so half a GiB at a time, there.  Then the actual
file data and metadata can be written into the pre-allocated chunks, filling
them up.  As they near full, more chunks will be allocated from the
unallocated pool as necessary.

But on file deletion, btrfs only automatically handles the file
data/metadata level; it doesn't (yet) automatically deallocate the chunks,
nor can it change the allocation from say a data chunk to a metadata chunk.
So when a chunk is allocated, it stays allocated.

That's the spread you see in btrfs filesystem df, between total and used,
for each chunk type.

The way to recover those allocated but unused chunks to the unallocated
pool, so they can be reallocated between data and metadata as necessary,
is with a balance.  That balance, therefore, should reduce the spread
seen in the above between total and used.

Meanwhile, btrfs filesystem df shows the spread between allocated and
used for each type, but what about unallocated?  Simple.  Btrfs
filesystem show lists total filesystem size as well as allocated
usage for each device.  (The total line is something else, I recommend
ignoring it as it's simply confusing.  Only pay attention to the
individual device lines.)

Thus, to get a proper picture of the space usage status on a btrfs
filesystem, you must have both the btrfs filesystem show and
btrfs filesystem df output for that filesystem, show to tell
you how much of the total space is chunk-allocated for each device,
df to tell you what those allocations are, and how much of the
chunk-allocated space is actually used, for each allocation type.

It's wise to keep track of the show output in particular, and
when the spread between used (allocated) and total for each
device gets low, under a few GiB, check btrfs fi df and see
what's using that space unnecessarily and then do a balance
to recover it, if possible.
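
As a concrete sketch of that workflow (device names and mount point are illustrative; the usage filters are an optional refinement over a plain full balance):

# how much of each device is chunk-allocated
btrfs filesystem show /mnt
# what those chunks hold and how full they really are
btrfs filesystem df /mnt
# give back mostly-empty chunks to the unallocated pool
btrfs balance start -dusage=50 -musage=50 /mnt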


-- 
Duncan - List replies preferred.   No HTML msgs.
Every nonfree program has a lord, a master --
and if you use the program, he is your master.  Richard Stallman
