date:20170822

[PATCH 1/6] btrfs-progs: check: enable repair in lowmem mode

2017-08-22 Thread Lu Fengqi

From: Su Yue 

Enable the --repair option together with --mode=lowmem in btrfsck.

Signed-off-by: Su Yue 
---
 cmds-check.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index c5faa2b..829f7c5 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -12844,12 +12844,10 @@ int cmd_check(int argc, char **argv)
}
 
/*
-* Not supported yet
+* experimental and dangerous
 */
-   if (repair && check_mode == CHECK_MODE_LOWMEM) {
-   error("low memory mode doesn't support repair yet");
-   exit(1);
-   }
+   if (repair && check_mode == CHECK_MODE_LOWMEM)
+   printf("Low memory mode supports repair partially\n");
 
radix_tree_init();
cache_tree_init(_cache);
-- 
2.7.4



--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 4/6] btrfs-progs: check: Introduce repair_chunk_item()

2017-08-22 Thread Lu Fengqi

From: Su Yue 

Because this patchset concentrates on repairing the extent tree,
repair_chunk_item() currently only inserts the missing block group item
into the extent tree.

Some work is left as TODO, for example repairing dev_item.
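
For illustration only (this is not part of the patch): the "clear the bit once
repaired" pattern used by repair_chunk_item() can be sketched in plain C. The
bit values and the helper clear_if_repaired() are placeholders made up for this
sketch; repair_ret stands in for the return value of the call that inserts the
block group item.

```c
#include <assert.h>

/* Placeholder bit values for this sketch; the real ones live in cmds-check.c. */
#define BACKREF_MISSING    (1 << 0)
#define REFERENCER_MISSING (1 << 2)

/*
 * Hypothetical helper: drop REFERENCER_MISSING from the error mask only
 * when the repair call reported success (0). Other bits are left alone.
 */
static int clear_if_repaired(int err, int repair_ret)
{
	assert(repair_ret <= 0);	/* 0 on success, negative errno on failure */

	if ((err & REFERENCER_MISSING) && repair_ret == 0)
		err &= ~REFERENCER_MISSING;
	return err;
}
```

A caller would OR the returned value into its accumulated error mask, so only
problems that are still unrepaired get reported.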

Signed-off-by: Su Yue 
---
 cmds-check.c | 46 ++
 1 file changed, 46 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 0f26394..726e330 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11543,6 +11543,50 @@ out:
 }
 
 /*
+ * Add block group item to the extent tree if @err contains
+ * REFERENCER_MISSING.
+ * TO DO: repair error about dev_item.
+ *
+ * Returns error after repair.
+ */
+static int repair_chunk_item(struct btrfs_trans_handle *trans,
+struct btrfs_root *chunk_root,
+struct btrfs_path *path, int err)
+{
+   struct btrfs_chunk *chunk;
+   struct btrfs_key chunk_key;
+   struct extent_buffer *eb = path->nodes[0];
+   u64 length;
+   int slot = path->slots[0];
+   u64 type;
+   int ret = 0;
+
+   btrfs_item_key_to_cpu(eb, &chunk_key, slot);
+   if (chunk_key.type != BTRFS_CHUNK_ITEM_KEY)
+   return err;
+   chunk = btrfs_item_ptr(eb, slot, struct btrfs_chunk);
+   type = btrfs_chunk_type(path->nodes[0], chunk);
+   length = btrfs_chunk_length(eb, chunk);
+
+   if (err & REFERENCER_MISSING) {
+   ret = btrfs_make_block_group(trans, chunk_root->fs_info, 0,
+type, chunk_key.objectid, chunk_key.offset, length);
+   if (ret) {
+   error("fail to add block group item[%llu %llu]",
+ chunk_key.offset, length);
+   goto out;
+   } else {
+   err &= ~REFERENCER_MISSING;
+   printf("Added block group item[%llu %llu]\n",
+  chunk_key.offset, length);
+   }
+   }
+
+out:
+   return err;
+}
+
+/*
  * Check a chunk item.
  * Including checking all referred dev_extents and block group
  */
@@ -11729,6 +11773,8 @@ again:
break;
case BTRFS_CHUNK_ITEM_KEY:
ret = check_chunk_item(fs_info, eb, slot);
+   if (repair && ret)
+   ret = repair_chunk_item(trans, root, path, ret);
err |= ret;
break;
case BTRFS_DEV_EXTENT_KEY:
-- 
2.7.4




[PATCH 3/6] btrfs-progs: check: delete wrong items in lowmem repair

2017-08-22 Thread Lu Fengqi

From: Su Yue 

Introduce delete_extent_tree_item() and repair_extent_item(), which only
delete.

While checking the extent tree, just delete any wrong item:
for an extent item, free the wrong backref; for any other item, delete the
item itself. The items remaining in the extent tree should then be correct.
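
For illustration only (not part of the patch): the three-way return convention
that the diff below documents for repair_extent_item() can be sketched as a
small pure function. The name repair_result() and its parameters are
hypothetical; search_ret stands in for the result of re-searching the old key
after the backref was freed.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper mirroring the tail of repair_extent_item():
 *   <0  the extent item is gone (the re-search did not find it)
 *   >0  a backref was freed but the extent item remains; value is the
 *       remaining error mask
 *    0  nothing happened
 */
static int repair_result(int search_ret, int freed, int err_after)
{
	assert(err_after >= 0);	/* error bits form a non-negative mask */

	if (search_ret)
		return -ENOENT;
	if (freed)
		return err_after;
	return 0;
}
```

Note that with this convention a successful free that leaves no error bits set
also yields 0, so a caller cannot tell it apart from "nothing happened".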

Signed-off-by: Su Yue 
---
 cmds-check.c | 151 ++-
 1 file changed, 138 insertions(+), 13 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 7c9036c..0f26394 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -11084,24 +11084,77 @@ out:
 }
 
 /*
+ * Only delete backref if REFERENCER_MISSING now
+ *
+ * Returns <0   the extent was deleted
+ * Returns >0   the backref was deleted but the extent still exists,
+ *  returned value means err after repair
+ * Returns  0   nothing happened
+ */
+static int repair_extent_item(struct btrfs_trans_handle *trans,
+ struct btrfs_root *root, struct btrfs_path *path,
+ u64 bytenr, u64 num_bytes, u64 parent, u64 root_objectid,
+ u64 owner, u64 offset, int err)
+{
+   struct btrfs_key old_key;
+   int freed = 0;
+   int ret;
+
+   btrfs_item_key_to_cpu(path->nodes[0], &old_key, path->slots[0]);
+
+   if (err & (REFERENCER_MISSING | REFERENCER_MISMATCH)) {
+   /* delete the backref */
+   ret = btrfs_free_extent(trans, root->fs_info->fs_root, bytenr,
+ num_bytes, parent, root_objectid, owner, offset);
+   if (!ret) {
+   freed = 1;
+   err &= ~REFERENCER_MISSING;
+   printf("Delete backref in extent [%llu %llu]\n",
+  bytenr, num_bytes);
+   } else {
+   error("fail to delete backref in extent [%llu %llu]\n",
+  bytenr, num_bytes);
+   }
+   }
+
+   /* btrfs_free_extent may delete the extent */
+   btrfs_release_path(path);
+   ret = btrfs_search_slot(NULL, root, &old_key, path, 0, 0);
+
+   if (ret)
+   ret = -ENOENT;
+   else if (freed)
+   ret = err;
+   return ret;
+}
+
+/*
  * This function will check a given extent item, including its backref and
  * itself (like crossing stripe boundary and type)
  *
  * Since we don't use extent_record anymore, introduce new error bit
  */
-static int check_extent_item(struct btrfs_fs_info *fs_info,
-struct extent_buffer *eb, int slot)
+static int check_extent_item(struct btrfs_trans_handle *trans,
+struct btrfs_fs_info *fs_info,
+struct btrfs_path *path)
 {
struct btrfs_extent_item *ei;
struct btrfs_extent_inline_ref *iref;
struct btrfs_extent_data_ref *dref;
+   struct extent_buffer *eb = path->nodes[0];
unsigned long end;
unsigned long ptr;
+   int slot = path->slots[0];
int type;
u32 nodesize = btrfs_super_nodesize(fs_info->super_copy);
u32 item_size = btrfs_item_size_nr(eb, slot);
u64 flags;
u64 offset;
+   u64 parent;
+   u64 num_bytes;
+   u64 root_objectid;
+   u64 owner;
+   u64 owner_offset;
int metadata = 0;
int level;
struct btrfs_key key;
@@ -11109,10 +11162,13 @@ static int check_extent_item(struct btrfs_fs_info *fs_info,
int err = 0;
 
btrfs_item_key_to_cpu(eb, &key, slot);
-   if (key.type == BTRFS_EXTENT_ITEM_KEY)
+   if (key.type == BTRFS_EXTENT_ITEM_KEY) {
bytes_used += key.offset;
-   else
+   num_bytes = key.offset;
+   } else {
bytes_used += nodesize;
+   num_bytes = nodesize;
+   }
 
if (item_size < sizeof(*ei)) {
/*
@@ -11150,7 +11206,6 @@ static int check_extent_item(struct btrfs_fs_info *fs_info,
level = key.offset;
}
end = (unsigned long)ei + item_size;
-
 next:
/* Reached extent item end normally */
if (ptr == end)
@@ -11164,42 +11219,63 @@ next:
goto out;
}
 
+   parent = 0;
+   root_objectid = 0;
+   owner = 0;
+   owner_offset = 0;
/* Now check every backref in this extent item */
iref = (struct btrfs_extent_inline_ref *)ptr;
type = btrfs_extent_inline_ref_type(eb, iref);
offset = btrfs_extent_inline_ref_offset(eb, iref);
switch (type) {
case BTRFS_TREE_BLOCK_REF_KEY:
+   root_objectid = offset;
+   owner = level;
ret = check_tree_block_backref(fs_info, offset, key.objectid,
   level);
err |= ret;
break;
case BTRFS_SHARED_BLOCK_REF_KEY:
+   parent = offset;
ret =

[PATCH 5/6] btrfs-progs: check: Introduce repair_tree_block_ref()

2017-08-22 Thread Lu Fengqi

From: Su Yue 

The only thing repair_tree_block_ref() does is add the backref of the
tree block, just as the original mode repair does.

It first searches for the corresponding extent item, then:
1. If the extent item exists but the backref is missing, add one backref to the
   extent.
2. If nothing is found, insert an extent item and add one backref.
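
For illustration only (not part of the patch): the two cases above amount to a
"find or insert, then add one reference" step. The sketch below uses a toy
in-memory table in place of the extent tree; every name in it (toy_extent,
toy_extent_tree, toy_lookup(), toy_add_backref()) is hypothetical and none of
the btrfs-progs API is used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_EXTENTS 8

struct toy_extent {
	uint64_t bytenr;	/* start of the extent */
	uint64_t refs;		/* number of backrefs recorded */
};

struct toy_extent_tree {
	struct toy_extent items[TOY_MAX_EXTENTS];
	size_t nr;
};

/* Return the record for bytenr, or NULL when no extent item exists. */
static struct toy_extent *toy_lookup(struct toy_extent_tree *t, uint64_t bytenr)
{
	for (size_t i = 0; i < t->nr; i++)
		if (t->items[i].bytenr == bytenr)
			return &t->items[i];
	return NULL;
}

/* Add one backref, inserting an extent item with zero refs first if needed. */
static int toy_add_backref(struct toy_extent_tree *t, uint64_t bytenr)
{
	struct toy_extent *e;

	assert(t != NULL);
	e = toy_lookup(t, bytenr);
	if (!e) {				/* case 2: nothing found */
		if (t->nr == TOY_MAX_EXTENTS)
			return -1;		/* toy table is full */
		e = &t->items[t->nr++];
		e->bytenr = bytenr;
		e->refs = 0;
	}
	e->refs++;				/* both cases: add one backref */
	return 0;
}
```

Starting the new item at zero refs and then going through the same increment
path keeps the two cases from diverging, which matches the order of operations
described in the list.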

Signed-off-by: Su Yue 
---
 cmds-check.c | 147 +++
 1 file changed, 147 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index 726e330..deebc70 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -2323,6 +2323,150 @@ static void account_bytes(struct btrfs_root *root, struct btrfs_path *path,
}
 }
 
+/*
+ * This function only handles BACKREF_MISSING,
+ * If correspond extent item exists, increase the ref;
+ * else insert an extent item and backref.
+ *
+ * Returns error bits after repair.
+ */
+static int repair_tree_block_ref(struct btrfs_trans_handle *trans,
+struct btrfs_root *root,
+struct extent_buffer *node,
+struct node_refs *nrefs, int level, int err)
+{
+   struct btrfs_fs_info *fs_info = root->fs_info;
+   struct btrfs_root *extent_root = fs_info->extent_root;
+   struct btrfs_path path;
+   struct btrfs_extent_item *ei;
+   struct btrfs_tree_block_info *bi;
+   struct btrfs_key key;
+   struct extent_buffer *eb;
+   u32 size = sizeof(*ei);
+   u32 node_size = root->fs_info->nodesize;
+   int insert_extent = 0;
+   int skinny_metadata = btrfs_fs_incompat(fs_info, SKINNY_METADATA);
+   int root_level = btrfs_header_level(root->node);
+   int generation;
+   int ret;
+   u64 owner;
+   u64 bytenr;
+   u64 flags = BTRFS_EXTENT_FLAG_TREE_BLOCK;
+   u64 parent = 0;
+
+   if ((err & BACKREF_MISSING) == 0)
+   return err;
+
+   WARN_ON(level > BTRFS_MAX_LEVEL);
+   WARN_ON(level < 0);
+
+   btrfs_init_path(&path);
+   bytenr = btrfs_header_bytenr(node);
+   owner = btrfs_header_owner(node);
+   generation = btrfs_header_generation(node);
+
+   key.objectid = bytenr;
+   key.type = (u8)-1;
+   key.offset = (u64)-1;
+
+   /* Search for the extent item */
+   ret = btrfs_search_slot(NULL, extent_root, &key, &path, 0, 0);
+   if (ret <= 0) {
+   ret = -EIO;
+   goto out;
+   }
+
+   ret = btrfs_previous_extent_item(extent_root, &path, bytenr);
+   if (ret)
+   insert_extent = 1;
+
+   /* calculate the extent item flag is full backref or not */
+   if (nrefs->full_backref[level] != 0)
+   flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
+
+   /* insert an extent item */
+   if (insert_extent) {
+   struct btrfs_disk_key copy_key;
+
+   generation = btrfs_header_generation(node);
+
+   if (level < root_level && nrefs->full_backref[level + 1] &&
+   owner != root->objectid) {
+   flags |= BTRFS_BLOCK_FLAG_FULL_BACKREF;
+   }
+
+   key.objectid = bytenr;
+   if (!skinny_metadata) {
+   key.type = BTRFS_EXTENT_ITEM_KEY;
+   key.offset = node_size;
+   size += sizeof(*bi);
+   } else {
+   key.type = BTRFS_METADATA_ITEM_KEY;
+   key.offset = level;
+   }
+
+   btrfs_release_path(&path);
+   ret = btrfs_insert_empty_item(trans, extent_root, &path, &key,
+ size);
+   if (ret)
+   goto out;
+
+   eb = path.nodes[0];
+   ei = btrfs_item_ptr(eb, path.slots[0],
+   struct btrfs_extent_item);
+
+   btrfs_set_extent_refs(eb, ei, 0);
+   btrfs_set_extent_generation(eb, ei, generation);
+   btrfs_set_extent_flags(eb, ei, flags);
+
+   if (!skinny_metadata) {
+   bi = (struct btrfs_tree_block_info *)(ei + 1);
+   memset_extent_buffer(eb, 0, (unsigned long)bi,
+sizeof(*bi));
+   btrfs_set_disk_key_objectid(&copy_key,
+   root->objectid);
+   btrfs_set_disk_key_type(&copy_key, 0);
+   btrfs_set_disk_key_offset(&copy_key, 0);
+
+   btrfs_set_tree_block_level(eb, bi, level);
+   btrfs_set_tree_block_key(eb, bi, &copy_key);
+   }
+   btrfs_mark_buffer_dirty(eb);
+   printf("Added a extent item [%llu %u]\n", bytenr,
+  node_size);
+   btrfs_update_block_group(trans, extent_root, bytenr, node_size,
+

[PATCH 0/6] btrfs-progs: check: extent tree lowmem repair

2017-08-22 Thread Lu Fengqi

From: Su Yue 

This is part 2 of the lowmem repair patchsets:
1. Change the traversal in lowmem check to use walk_up_tree_v2() and
   walk_down_tree_v2(), so that it now scans all trees.
2. Repair these cases: block group missing, tree block backref missing,
   extent item mismatch, extent data mismatch and data extent backref missing.

The method used to repair the extent tree is similar to the original mode:
1. Delete all wrong extents (REFERENCER_MISSING or REFERENCER_MISMATCH).
2. Traverse all trees and extent data to rebuild the extent tree.

Some issues:
1. Because all trees are scanned, the check may be very slow.
2. Unlike the original mode, which gathers everything together and then
   checks, extent tree repair in lowmem mode may print some incorrect
   information after the check. The data on disk is fine, however, and the
   next check should be OK.
3. After repair, some extents of extent tree nodes may be reported as
   "Extent buffer leak ", but the next check is fine.
   (I don't know what's wrong.)

The reason why lowmem check has to check all trees is given in the commit
message of [patch 2/6].

Although I have tested this code with the option --init-extent-tree on images
whose tree level is 2 or 3 and which have snapshots, I am still worried about
some corner cases.

Su Yue (6):
  btrfs-progs: check: enable repair in lowmem mode
  btrfs-progs: check: change traversal way of lowmem mode
  btrfs-progs: check: delete wrong items in lowmem repair
  btrfs-progs: check: Introduce repair_chunk_item()
  btrfs-progs: check: Introduce repair_tree_block_ref()
  btrfs-progs: check: Introduce repair_extent_data_item()

 cmds-check.c | 1262 +++---
 1 file changed, 943 insertions(+), 319 deletions(-)

-- 
2.7.4




[PATCH 6/6] btrfs-progs: check: Introduce repair_extent_data_item()

2017-08-22 Thread Lu Fengqi

From: Su Yue 

The only thing repair_extent_data_item() does is add the backref of the
data extent referenced by a file extent item, just as the original mode
repair does.

It first searches for the corresponding extent item, then:
1. If the extent item exists but the backref is missing, add one backref to the
   extent.
2. If nothing is found, insert an extent item and add one backref.
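
For illustration only (not part of the patch): the diff below computes the
offset stored in the data backref as file_offset - extent_offset. A minimal
sketch of that arithmetic follows. The helper name data_backref_offset() is
hypothetical, and the assert encodes a simplifying assumption of this sketch
(the file offset is not smaller than the offset into the extent), not a rule
taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * file_offset:   key.offset of the file extent item (position in the file)
 * extent_offset: how far into the on-disk extent this file extent starts
 *
 * Returns the offset recorded in the data backref, i.e. the file position
 * that corresponds to the first byte of the on-disk extent.
 */
static uint64_t data_backref_offset(uint64_t file_offset, uint64_t extent_offset)
{
	assert(file_offset >= extent_offset);	/* assumption of this sketch */
	return file_offset - extent_offset;
}
```

For example, a file extent item at file offset 4096 that starts at the
beginning of an extent, and one at file offset 8192 that starts 4096 bytes into
the same extent, both give a backref offset of 4096.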

Signed-off-by: Su Yue 
---
 cmds-check.c | 117 +++
 1 file changed, 117 insertions(+)

diff --git a/cmds-check.c b/cmds-check.c
index deebc70..09c8d4d 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -10665,6 +10665,120 @@ out:
 }
 
 /*
+ * If @err contains BACKREF_MISSING then add extent of the
+ * file_extent_data_item.
+ *
+ * Returns error bits after repair.
+ */
+static int repair_extent_data_item(struct btrfs_trans_handle *trans,
+  struct btrfs_root *root,
+  struct btrfs_path *pathp,
+  struct node_refs *nrefs,
+  int err)
+{
+   struct btrfs_file_extent_item *fi;
+   struct btrfs_key fi_key;
+   struct btrfs_key key;
+   struct btrfs_extent_item *ei;
+   struct btrfs_path path;
+   struct btrfs_root *extent_root = root->fs_info->extent_root;
+   struct extent_buffer *eb;
+   u64 size;
+   u64 disk_bytenr;
+   u64 num_bytes;
+   u64 parent;
+   u64 offset;
+   u64 extent_offset;
+   u64 file_offset;
+   int generation;
+   int slot;
+   int ret = 0;
+
+   eb = pathp->nodes[0];
+   slot = pathp->slots[0];
+   btrfs_item_key_to_cpu(eb, &fi_key, slot);
+   fi = btrfs_item_ptr(eb, slot, struct btrfs_file_extent_item);
+
+   if (btrfs_file_extent_type(eb, fi) == BTRFS_FILE_EXTENT_INLINE ||
+   btrfs_file_extent_disk_bytenr(eb, fi) == 0)
+   return err;
+
+   file_offset = fi_key.offset;
+   generation = btrfs_file_extent_generation(eb, fi);
+   disk_bytenr = btrfs_file_extent_disk_bytenr(eb, fi);
+   num_bytes = btrfs_file_extent_disk_num_bytes(eb, fi);
+   extent_offset = btrfs_file_extent_offset(eb, fi);
+   offset = file_offset - extent_offset;
+
+   /* now repair only adds backref */
+   if ((err & BACKREF_MISSING) == 0)
+   return err;
+
+   /* search extent item */
+   key.objectid = disk_bytenr;
+   key.type = BTRFS_EXTENT_ITEM_KEY;
+   key.offset = num_bytes;
+
+   btrfs_init_path(&path);
+   ret = btrfs_search_slot(NULL, extent_root, &key, &path, 0, 0);
+   if (ret < 0) {
+   ret = -EIO;
+   goto out;
+   }
+
+   /* insert an extent item */
+   if (ret > 0) {
+   key.objectid = disk_bytenr;
+   key.type = BTRFS_EXTENT_ITEM_KEY;
+   key.offset = num_bytes;
+   size = sizeof(*ei);
+
+   btrfs_release_path(&path);
+   ret = btrfs_insert_empty_item(trans, extent_root, &path, &key,
+ size);
+   if (ret)
+   goto out;
+   eb = path.nodes[0];
+   ei = btrfs_item_ptr(eb, path.slots[0],
+   struct btrfs_extent_item);
+
+   btrfs_set_extent_refs(eb, ei, 0);
+   btrfs_set_extent_generation(eb, ei, generation);
+   btrfs_set_extent_flags(eb, ei, BTRFS_EXTENT_FLAG_DATA);
+
+   btrfs_mark_buffer_dirty(eb);
+   ret = btrfs_update_block_group(trans, extent_root, disk_bytenr,
+  num_bytes, 1, 0);
+   btrfs_release_path(&path);
+   }
+
+   if (nrefs->full_backref[0])
+   parent = btrfs_header_bytenr(eb);
+   else
+   parent = 0;
+
+   ret = btrfs_inc_extent_ref(trans, root, disk_bytenr, num_bytes, parent,
+  root->objectid,
+  parent ? BTRFS_FIRST_FREE_OBJECTID : fi_key.objectid,
+  offset);
+   if (ret) {
+   error("failed to increase extent data backref[%llu %llu] root %llu",
+ disk_bytenr, num_bytes, root->objectid);
+   goto out;
+   } else {
+   printf("Add one extent data backref [%llu %llu]\n",
+  disk_bytenr, num_bytes);
+   }
+
+   err &= ~BACKREF_MISSING;
+out:
+   if (ret)
+   error("can't repair root %llu extent data item[%llu %llu]",
+ root->objectid, disk_bytenr, num_bytes);
+   return err;
+}
+
+/*
  * Check EXTENT_DATA item, mainly for its dbackref in extent tree
  *
  * Return >0 any error found and output error message
@@ -11905,6 +12019,9 @@ again:
switch (type) {
case BTRFS_EXTENT_DATA_KEY:
ret = check_extent_data_item(root, path, nrefs, account_bytes);
+

[PATCH 2/6] btrfs-progs: check: change traversal way of lowmem mode

2017-08-22 Thread Lu Fengqi

From: Su Yue 

This patch is a preparation for extent tree repair in lowmem mode.
In lowmem mode, the tree blocks of the various trees are checked
recursively.
During repair, however, adding or deleting item(s) may modify upper nodes,
which would make the repair complicated and dangerous.

Before this patch:
One problem of lowmem check is that it only checks the backref of the
lowest node in check_tree_block_ref.
This ensures that the checked tree blocks are legal and, for the sake of
speed, avoids traversing all trees.
However, it has one shortcoming: it cannot detect a backref error when
an extent whose owner == offset lacks other backref(s).

In check, correctness is more important than speed.
If errors cannot be detected, repair is impossible.

Changes in this patch:
check_chunks_and_extents now has to check *ALL* trees, so lowmem check
behaves like the original mode.
The traversal is changed to match the fs tree check, which calls
walk_down_tree_v2() and walk_up_tree_v2(); this makes further repair
easier.

Signed-off-by: Su Yue 
---
 cmds-check.c | 695 +--
 1 file changed, 443 insertions(+), 252 deletions(-)

diff --git a/cmds-check.c b/cmds-check.c
index 829f7c5..7c9036c 100644
--- a/cmds-check.c
+++ b/cmds-check.c
@@ -1878,10 +1878,15 @@ struct node_refs {
u64 bytenr[BTRFS_MAX_LEVEL];
u64 refs[BTRFS_MAX_LEVEL];
int need_check[BTRFS_MAX_LEVEL];
+   /* field for check all trees*/
+   int checked[BTRFS_MAX_LEVEL];
+   /* the correspond extent should mark as full backref or not */
+   int full_backref[BTRFS_MAX_LEVEL];
 };
 
 static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
-struct node_refs *nrefs, u64 level);
+struct extent_buffer *eb, struct node_refs *nrefs,
+u64 level, int check_all);
 static int check_inode_item(struct btrfs_root *root, struct btrfs_path *path,
unsigned int ext_ref);
 
@@ -1943,7 +1948,7 @@ again:
 
ret = update_nodes_refs(root,
path->nodes[i]->start,
-   nrefs, i);
+   path->nodes[i], nrefs, i, 0);
if (ret)
goto out;
 
@@ -2062,25 +2067,42 @@ static int need_check(struct btrfs_root *root, struct ulist *roots)
return 1;
 }
 
+static int calc_extent_flag_v2(struct btrfs_root *root,
+  struct extent_buffer *eb,
+  u64 *flags_ret);
 /*
  * for a tree node or leaf, we record its reference count, so later if we still
  * process this node or leaf, don't need to compute its reference count again.
+ *
+ * @bytenr  if @bytenr == (u64)-1, only update nrefs->full_backref[level]
  */
 static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
-struct node_refs *nrefs, u64 level)
+struct extent_buffer *eb, struct node_refs *nrefs,
+u64 level, int check_all)
 {
-   int check, ret;
-   u64 refs;
struct ulist *roots;
+   u64 refs = 0;
+   u64 flags = 0;
+   int root_level = btrfs_header_level(root->node);
+   int check;
+   int ret;
 
-   if (nrefs->bytenr[level] != bytenr) {
+   if (nrefs->bytenr[level] == bytenr)
+   return 0;
+
+   if (bytenr != (u64)-1) {
+   /* the return value of this function seems a mistake */
ret = btrfs_lookup_extent_info(NULL, root, bytenr,
-  level, 1, &refs, NULL);
-   if (ret < 0)
+  level, 1, &refs, &flags);
+   /* temporary fix */
+   if (ret < 0 && !check_all)
return ret;
 
nrefs->bytenr[level] = bytenr;
nrefs->refs[level] = refs;
+   nrefs->full_backref[level] = 0;
+   nrefs->checked[level] = 0;
+
if (refs > 1) {
ret = btrfs_find_all_roots(NULL, root->fs_info, bytenr,
   0, &roots);
@@ -2091,13 +2113,56 @@ static int update_nodes_refs(struct btrfs_root *root, u64 bytenr,
ulist_free(roots);
nrefs->need_check[level] = check;
} else {
-   nrefs->need_check[level] = 1;
+   if (!check_all) {
+   nrefs->need_check[level] = 1;
+   } else {
+   if (level == root_level)
+   nrefs->need_check[level] = 1;
+   else
+   /*
+* the node refs may have not been updated
+

RE: [PATCH] btrfs-progs: mkfs: Replace number with enum

2017-08-22 Thread Gu, Jinxiang



> -Original Message-
> From: David Sterba [mailto:dste...@suse.cz]
> Sent: Tuesday, August 22, 2017 10:04 PM
> To: Gu, Jinxiang/顾 金香 ; linux-btrfs@vger.kernel.org
> Subject: Re: [PATCH] btrfs-progs: mkfs: Replace number with enum
> 
> On Mon, Aug 21, 2017 at 07:39:49PM +0200, David Sterba wrote:
> > > +/* roots: root tree, extent tree, chunk tree, dev tree, fs tree,
> > > +csum tree */ enum btrfs_mkfs_block {
> > > + SUPER_BLOCK = 0,
> > > + ROOT_TREE,
> > > + EXTENT_TREE,
> > > + CHUNK_TREE,
> > > + DEV_TREE,
> > > + FS_TREE,
> > > + CSUM_TREE,
> > > + BLOCK_COUNT
> 
> BLOCK_COUNT is 7

> 
> > > +};
> > > +
> > >  struct btrfs_mkfs_config {
> > >   /* Label of the new filesystem */
> > >   const char *label;
> > > @@ -43,7 +55,7 @@ struct btrfs_mkfs_config {
> > >   /* Output fields, set during creation */
> > >
> > >   /* Logical addresses of superblock [0] and other tree roots */
> > > - u64 blocks[8];
> > > + u64 blocks[BLOCK_COUNT];
> 
> This replaces 8 with 7 then, so the fs_uuid gets overwritten, can be also 
> caught by simply running 'make test-mkfs'.

I made this change because blocks[7] is never used. I ran 'make test-mkfs'
and got no errors.
Why does a spare u64 need to be left before fs_uuid?
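
For reference, a small standalone check of the sizes being discussed (a sketch
only; toy_mkfs_config and toy_index_in_range() are made-up names, and the
struct holds nothing but the array in question):

```c
#include <assert.h>
#include <stdint.h>

/* Enum as quoted in the patch under discussion. */
enum btrfs_mkfs_block {
	SUPER_BLOCK = 0,
	ROOT_TREE,
	EXTENT_TREE,
	CHUNK_TREE,
	DEV_TREE,
	FS_TREE,
	CSUM_TREE,
	BLOCK_COUNT
};

/* Toy struct with only the array, so its size is easy to reason about. */
struct toy_mkfs_config {
	uint64_t blocks[BLOCK_COUNT];
};

/* Seven enumerators precede BLOCK_COUNT, so the array has 7 slots, not 8. */
static_assert(BLOCK_COUNT == 7, "BLOCK_COUNT should be 7");
static_assert(sizeof(struct toy_mkfs_config) == 7 * sizeof(uint64_t),
	      "blocks[] should hold exactly 7 u64 values");

/* Hypothetical bounds check: is idx a valid index into blocks[]? */
static int toy_index_in_range(int idx)
{
	return idx >= 0 && idx < BLOCK_COUNT;
}
```

With the enum, index 7 is no longer inside blocks[], so the change is only
safe if no code reads or writes that slot.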

> 
> > >   char fs_uuid[BTRFS_UUID_UNPARSED_SIZE];
> > >   char chunk_uuid[BTRFS_UUID_UNPARSED_SIZE];
> > >
> > > --
> > > 2.9.4
> > >
> > >
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe
> > > linux-btrfs" in the body of a message to majord...@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
> > in the body of a message to majord...@vger.kernel.org More majordomo
> > info at  http://vger.kernel.org/majordomo-info.html
>

[PATCH v5 1/6] Btrfs: heuristic make use compression workspaces

2017-08-22 Thread Timofey Titovets

Move the heuristic to a separate file.
Implement compression workspace support for
heuristic resources.
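
For illustration only (not part of the patch): the diff below adds the
heuristic as one more entry in the table of compression ops and dispatches
through it with index type - 1. A userspace sketch of that table lookup
follows; all toy_* names are hypothetical and the heuristic callback is
reduced to a trivial function.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the shape of the enum in the diff: type 0 means "none". */
enum toy_type {
	TOY_NONE = 0,
	TOY_ZLIB = 1,
	TOY_LZO = 2,
	TOY_HEURISTIC = 3,
	TOY_TYPES = 3,
};

struct toy_op {
	const char *name;
	int (*heuristic)(int input);	/* NULL when a backend has none */
};

static int toy_heuristic_fn(int input)
{
	return input > 0;
}

static const struct toy_op toy_zlib = { "zlib", NULL };
static const struct toy_op toy_lzo = { "lzo", NULL };
static const struct toy_op toy_heur = { "heuristic", toy_heuristic_fn };

/* Type 0 has no entry, so the table is indexed by type - 1. */
static const struct toy_op *const toy_ops[TOY_TYPES] = {
	&toy_zlib,
	&toy_lzo,
	&toy_heur,
};

static const struct toy_op *toy_op_for(enum toy_type type)
{
	assert(type >= TOY_ZLIB && type <= TOY_TYPES);
	return toy_ops[type - 1];
}
```

In this sketch only the heuristic entry provides the callback, so a caller
must pick the heuristic type before calling through the pointer.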

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/Makefile  |  2 +-
 fs/btrfs/compression.c | 18 +
 fs/btrfs/compression.h |  7 -
 fs/btrfs/heuristic.c   | 70 ++
 4 files changed, 84 insertions(+), 13 deletions(-)
 create mode 100644 fs/btrfs/heuristic.c

diff --git a/fs/btrfs/Makefile b/fs/btrfs/Makefile
index 128ce17a80b0..6fa8479dff43 100644
--- a/fs/btrfs/Makefile
+++ b/fs/btrfs/Makefile
@@ -9,7 +9,7 @@ btrfs-y += super.o ctree.o extent-tree.o print-tree.o root-tree.o dir-item.o \
   export.o tree-log.o free-space-cache.o zlib.o lzo.o \
   compression.o delayed-ref.o relocation.o delayed-inode.o scrub.o \
   reada.o backref.o ulist.o qgroup.o send.o dev-replace.o raid56.o \
-  uuid-tree.o props.o hash.o free-space-tree.o
+  uuid-tree.o props.o hash.o free-space-tree.o heuristic.o

 btrfs-$(CONFIG_BTRFS_FS_POSIX_ACL) += acl.o
 btrfs-$(CONFIG_BTRFS_FS_CHECK_INTEGRITY) += check-integrity.o
diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 883ecc58fd0d..f0aaf27bcc95 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -704,6 +704,7 @@ static struct {
 static const struct btrfs_compress_op * const btrfs_compress_op[] = {
&btrfs_zlib_compress,
&btrfs_lzo_compress,
+   &btrfs_heuristic,
 };

 void __init btrfs_init_compress(void)
@@ -1065,18 +1066,13 @@ int btrfs_decompress_buf2page(const char *buf, unsigned long buf_start,
  */
 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end)
 {
-   u64 index = start >> PAGE_SHIFT;
-   u64 end_index = end >> PAGE_SHIFT;
-   struct page *page;
-   int ret = 1;
+   int ret;
+   enum btrfs_compression_type type = BTRFS_HEURISTIC;
+   struct list_head *workspace = find_workspace(type);

-   while (index <= end_index) {
-   page = find_get_page(inode->i_mapping, index);
-   kmap(page);
-   kunmap(page);
-   put_page(page);
-   index++;
-   }
+   ret = btrfs_compress_op[type-1]->heuristic(workspace, inode,
+  start, end);

+   free_workspace(type, workspace);
return ret;
 }
diff --git a/fs/btrfs/compression.h b/fs/btrfs/compression.h
index 3b1b0ac15fdc..10e9ffa6dfa4 100644
--- a/fs/btrfs/compression.h
+++ b/fs/btrfs/compression.h
@@ -99,7 +99,8 @@ enum btrfs_compression_type {
BTRFS_COMPRESS_NONE  = 0,
BTRFS_COMPRESS_ZLIB  = 1,
BTRFS_COMPRESS_LZO   = 2,
-   BTRFS_COMPRESS_TYPES = 2,
+   BTRFS_HEURISTIC = 3,
+   BTRFS_COMPRESS_TYPES = 3,
 };

 struct btrfs_compress_op {
@@ -123,10 +124,14 @@ struct btrfs_compress_op {
  struct page *dest_page,
  unsigned long start_byte,
  size_t srclen, size_t destlen);
+
+   int (*heuristic)(struct list_head *workspace,
+struct inode *inode, u64 start, u64 end);
 };

 extern const struct btrfs_compress_op btrfs_zlib_compress;
 extern const struct btrfs_compress_op btrfs_lzo_compress;
+extern const struct btrfs_compress_op btrfs_heuristic;

 int btrfs_compress_heuristic(struct inode *inode, u64 start, u64 end);

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
new file mode 100644
index ..96ae3e9334bc
--- /dev/null
+++ b/fs/btrfs/heuristic.c
@@ -0,0 +1,70 @@
+/*
+ * Copyright (c) 2017
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public
+ * License v2 as published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include "compression.h"
+
+struct workspace {
+   struct list_head list;
+};
+
+static void heuristic_free_workspace(struct list_head *ws)
+{
+   struct workspace *workspace = list_entry(ws, struct workspace, list);
+   kfree(workspace);
+}
+
+static struct list_head *heuristic_alloc_workspace(void)
+{
+   struct workspace *workspace;
+
+   workspace = kzalloc(sizeof(*workspace), GFP_KERNEL);
+   if (!workspace)
+   return ERR_PTR(-ENOMEM);
+
+   INIT_LIST_HEAD(&workspace->list);
+
+   return &workspace->list;
+}
+
+static int heuristic(struct list_head *ws, struct inode *inode,
+u64 start, u64 end)
+{
+   struct page *page;
+   u64 index, index_end;
+   u8 *input_data;
+
+   index = start >> PAGE_SHIFT;
+   index_end = end >> PAGE_SHIFT;
+
+   for (; index <= index_end; index++) {
+

[PATCH v5 2/6] Btrfs: heuristic workspace add bucket and sample items

2017-08-22 Thread Timofey Titovets

Heuristic workspace:
 - Add a bucket for storing byte type counters
 - Add a sample array for storing a partial copy of
   the input data range
 - Add a counter for storing the current sample size in the workspace
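
For illustration only (not part of the patch): the size of the sample buffer
follows from the macros in the diff below. The sketch repeats that arithmetic
in userspace, assuming 4 KiB pages and a 128 KiB maximum uncompressed range
(the TOY_* values are assumptions of this sketch; the kernel uses PAGE_SHIFT,
PAGE_SIZE and BTRFS_MAX_UNCOMPRESSED).

```c
#include <assert.h>

#define TOY_PAGE_SHIFT 12			/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE (1u << TOY_PAGE_SHIFT)
#define TOY_MAX_UNCOMPRESSED (128u * 1024u)	/* assumption: 128 KiB */

#define READ_SIZE 16	/* bytes copied per sampling step */
#define ITER_SHIFT 256	/* distance between sampling steps */

#define MAX_INPUT_PAGES ((TOY_MAX_UNCOMPRESSED >> TOY_PAGE_SHIFT) + 1)
#define MAX_SAMPLE_SIZE (MAX_INPUT_PAGES * TOY_PAGE_SIZE * READ_SIZE / ITER_SHIFT)

/* 128 KiB / 4 KiB = 32 pages, plus one because index <= index_end is used. */
static_assert(MAX_INPUT_PAGES == 33, "expected 33 input pages");
/* 33 pages * 4096 bytes * 16 / 256 = 8448 bytes of sample. */
static_assert(MAX_SAMPLE_SIZE == 8448, "expected an 8448 byte sample buffer");

static unsigned int toy_max_sample_size(void)
{
	return MAX_SAMPLE_SIZE;
}
```

Under these assumptions the heuristic looks at 1/16 of the input, so the
sample for a full 128 KiB range is a little over 8 KiB.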

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/heuristic.c | 30 ++
 1 file changed, 30 insertions(+)

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
index 96ae3e9334bc..2c2cadc9dfad 100644
--- a/fs/btrfs/heuristic.c
+++ b/fs/btrfs/heuristic.c
@@ -20,13 +20,36 @@
 #include 
 #include "compression.h"

+#define READ_SIZE 16
+#define ITER_SHIFT 256
+#define BUCKET_SIZE 256
+
+/*
+ * When a 128KiB range is mapped into pages (with a 4k PAGE_SIZE, for example)
+ * and iterated with index <= index_end, the code visits
+ * indexes 0-32, that is, 33 pages
+ */
+#define MAX_INPUT_PAGES ((BTRFS_MAX_UNCOMPRESSED >> PAGE_SHIFT)+1)
+#define MAX_SAMPLE_SIZE (MAX_INPUT_PAGES*PAGE_SIZE*READ_SIZE/ITER_SHIFT)
+
+struct bucket_item {
+   u32 count;
+};
+
 struct workspace {
+   u8  *sample;
+   /* Partial copy of input data */
+   u32 sample_size;
+   /* Bucket store counter for each byte type */
+   struct bucket_item bucket[BUCKET_SIZE];
struct list_head list;
 };

 static void heuristic_free_workspace(struct list_head *ws)
 {
struct workspace *workspace = list_entry(ws, struct workspace, list);
+
+   kvfree(workspace->sample);
kfree(workspace);
 }

@@ -38,9 +61,16 @@ static struct list_head *heuristic_alloc_workspace(void)
if (!workspace)
return ERR_PTR(-ENOMEM);

+   workspace->sample = kvmalloc(MAX_SAMPLE_SIZE, GFP_KERNEL);
+   if (!workspace->sample)
+   goto fail;
+
INIT_LIST_HEAD(&workspace->list);

return &workspace->list;
+fail:
+   heuristic_free_workspace(&workspace->list);
+   return ERR_PTR(-ENOMEM);
 }

 static int heuristic(struct list_head *ws, struct inode *inode,
--
2.14.1

[PATCH v5 3/6] Btrfs: implement heuristic sampling logic

2017-08-22 Thread Timofey Titovets

Copy sample data from the input data range into the sample buffer,
then count the occurrences of each byte value in that sample into the bucket.
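
For illustration only (not part of the patch): the two steps can be tried in
userspace on a flat buffer. This is a simplified sketch, not the kernel code:
it ignores pages, uses slightly different loop bounds than the diff below, and
the toy_* function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define READ_SIZE 16	/* bytes copied per step */
#define ITER_SHIFT 256	/* distance between steps in the input */
#define BUCKET_SIZE 256	/* one counter per possible byte value */

/*
 * Copy READ_SIZE bytes out of every ITER_SHIFT bytes of `in` into `sample`
 * (capacity `cap`). Returns the number of bytes copied.
 */
static size_t toy_take_sample(const uint8_t *in, size_t in_len,
			      uint8_t *sample, size_t cap)
{
	size_t a = 0, b = 0;

	while (a + READ_SIZE <= in_len && b + READ_SIZE <= cap) {
		memcpy(&sample[b], &in[a], READ_SIZE);
		a += ITER_SHIFT;
		b += READ_SIZE;
	}
	return b;
}

/* Count how often each byte value occurs in the sample. */
static void toy_count_bytes(const uint8_t *sample, size_t len,
			    uint32_t bucket[BUCKET_SIZE])
{
	assert(sample != NULL || len == 0);

	memset(bucket, 0, BUCKET_SIZE * sizeof(bucket[0]));
	for (size_t i = 0; i < len; i++)
		bucket[sample[i]]++;
}
```

Sampling first and counting afterwards keeps the counting loop independent of
how the input is laid out, which is what lets the later patches in this series
work on the sample and bucket alone.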

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/heuristic.c | 31 +--
 1 file changed, 29 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
index 2c2cadc9dfad..5336638a3b7c 100644
--- a/fs/btrfs/heuristic.c
+++ b/fs/btrfs/heuristic.c
@@ -76,20 +76,47 @@ static struct list_head *heuristic_alloc_workspace(void)
 static int heuristic(struct list_head *ws, struct inode *inode,
 u64 start, u64 end)
 {
+   struct workspace *workspace = list_entry(ws, struct workspace, list);
struct page *page;
u64 index, index_end;
-   u8 *input_data;
+   u8 *in_data;
+   u32 a, b;
+   u8 byte;
+
+   /*
+* Compression only handle first 128kb of input range
+* And just shift over range in loop for compressing it.
+* Let's do the same.
+*/
+   if (end - start > BTRFS_MAX_UNCOMPRESSED)
+   end = start + BTRFS_MAX_UNCOMPRESSED;

index = start >> PAGE_SHIFT;
index_end = end >> PAGE_SHIFT;

+   b = 0;
for (; index <= index_end; index++) {
page = find_get_page(inode->i_mapping, index);
-   input_data = kmap(page);
+   in_data = kmap(page);
+   a = 0;
+   while (a < PAGE_SIZE-READ_SIZE && b < MAX_SAMPLE_SIZE) {
+   memcpy(&workspace->sample[b], &in_data[a], READ_SIZE);
+   a += ITER_SHIFT;
+   b += READ_SIZE;
+   }
kunmap(page);
put_page(page);
}

+   workspace->sample_size = b;
+
+   memset(workspace->bucket, 0, sizeof(*workspace->bucket)*BUCKET_SIZE);
+
+   for (a = 0; a < workspace->sample_size; a++) {
+   byte = workspace->sample[a];
+   workspace->bucket[byte].count++;
+   }
+
return 1;
 }

--
2.14.1

[PATCH v5 4/6] Btrfs: heuristic add detection of zeroed sample

2017-08-22 Thread Timofey Titovets

Use memcmp to check whether the sample data is all zeroes.
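
For illustration only (not part of the patch): a userspace sketch of the same
check. The name toy_sample_zeroed() is hypothetical, and the sketch assumes
the sample length is a multiple of READ_SIZE, which holds when the sample is
filled in READ_SIZE chunks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define READ_SIZE 16

/* Return true when every READ_SIZE chunk of the sample is all zero bytes. */
static bool toy_sample_zeroed(const uint8_t *sample, size_t len)
{
	static const uint8_t zero[READ_SIZE];	/* static storage is zeroed */

	assert(len % READ_SIZE == 0);	/* assumption of this sketch */

	for (size_t i = 0; i < len; i += READ_SIZE) {
		if (memcmp(&sample[i], zero, READ_SIZE))
			return false;	/* found a non-zero byte */
	}
	return true;
}
```

Comparing chunk by chunk lets the loop stop at the first chunk that contains
any non-zero byte.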

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/heuristic.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
index 5336638a3b7c..4557ea1db373 100644
--- a/fs/btrfs/heuristic.c
+++ b/fs/btrfs/heuristic.c
@@ -73,6 +73,21 @@ static struct list_head *heuristic_alloc_workspace(void)
return ERR_PTR(-ENOMEM);
 }

+static bool sample_zeroed(struct workspace *workspace)
+{
+   u32 i;
+   u8 zero[READ_SIZE];
+
+   memset(&zero, 0, sizeof(zero));
+
+   for (i = 0; i < workspace->sample_size; i += sizeof(zero)) {
+   if (memcmp(&workspace->sample[i], &zero, sizeof(zero)))
+   return false;
+   }
+
+   return true;
+}
+
 static int heuristic(struct list_head *ws, struct inode *inode,
 u64 start, u64 end)
 {
@@ -110,6 +125,9 @@ static int heuristic(struct list_head *ws, struct inode *inode,

workspace->sample_size = b;

+   if (sample_zeroed(workspace))
+   return 1;
+
memset(workspace->bucket, 0, sizeof(*workspace->bucket)*BUCKET_SIZE);

for (a = 0; a < workspace->sample_size; a++) {
--
2.14.1
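
The zero check above compares the sample against a zeroed block one READ_SIZE chunk at a time. A minimal userspace sketch of the same idea follows; sample_is_zeroed() is a hypothetical name, and the extra handling of a tail shorter than READ_SIZE is an addition that the kernel patch does not need if the sample is always built from whole READ_SIZE chunks.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define READ_SIZE 16    /* chunk size, as in the series */

/*
 * Return true if every byte of the sample is zero.
 * Hypothetical userspace helper for illustration only.
 */
bool sample_is_zeroed(const uint8_t *sample, size_t sample_size)
{
    static const uint8_t zero[READ_SIZE];   /* static storage is zero-filled */
    size_t i = 0;

    /* Compare whole READ_SIZE chunks first. */
    for (; i + READ_SIZE <= sample_size; i += READ_SIZE) {
        if (memcmp(&sample[i], zero, READ_SIZE) != 0)
            return false;
    }

    /* Then compare a possible tail shorter than READ_SIZE. */
    return i == sample_size ||
           memcmp(&sample[i], zero, sample_size - i) == 0;
}
```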

[PATCH v5 5/6] Btrfs: heuristic add byte set calculation

2017-08-22 Thread Timofey Titovets

Calculate the byte set size for the data sample:
count how many unique byte values are present in the sample
by counting all bucket entries with count > 0.
If the byte set is small (~25%), the data are easily compressible.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/heuristic.c | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
index 4557ea1db373..953428fde305 100644
--- a/fs/btrfs/heuristic.c
+++ b/fs/btrfs/heuristic.c
@@ -31,6 +31,7 @@
  */
 #define MAX_INPUT_PAGES ((BTRFS_MAX_UNCOMPRESSED >> PAGE_SHIFT)+1)
 #define MAX_SAMPLE_SIZE (MAX_INPUT_PAGES*PAGE_SIZE*READ_SIZE/ITER_SHIFT)
+#define BYTE_SET_THRESHOLD 64

 struct bucket_item {
u32 count;
@@ -73,6 +74,27 @@ static struct list_head *heuristic_alloc_workspace(void)
return ERR_PTR(-ENOMEM);
 }

+static int byte_set_size(const struct workspace *workspace)
+{
+   int a = 0;
+   int byte_set_size = 0;
+
+   for (; a < BYTE_SET_THRESHOLD; a++) {
+   if (workspace->bucket[a].count > 0)
+   byte_set_size++;
+   }
+
+   for (; a < BUCKET_SIZE; a++) {
+   if (workspace->bucket[a].count > 0) {
+   byte_set_size++;
+   if (byte_set_size > BYTE_SET_THRESHOLD)
+   return byte_set_size;
+   }
+   }
+
+   return byte_set_size;
+}
+
 static bool sample_zeroed(struct workspace *workspace)
 {
u32 i;
@@ -135,6 +157,10 @@ static int heuristic(struct list_head *ws, struct inode *inode,
workspace->bucket[byte].count++;
}

+   a = byte_set_size(workspace);
+   if (a > BYTE_SET_THRESHOLD)
+   return 2;
+
return 1;
 }

--
2.14.1

[PATCH v5 6/6] Btrfs: heuristic add byte core set calculation

2017-08-22 Thread Timofey Titovets

Calculate the byte core set for the data sample:
sort the bucket counts in decreasing order and
count how many byte values account for 90% of the sample.
If the core set is small (<=25%), the data are easily compressible.
If the core set is large (>=80%), the data are not compressible.

Signed-off-by: Timofey Titovets 
---
 fs/btrfs/heuristic.c | 51 ++-
 1 file changed, 50 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/heuristic.c b/fs/btrfs/heuristic.c
index 953428fde305..14128f77d5ae 100644
--- a/fs/btrfs/heuristic.c
+++ b/fs/btrfs/heuristic.c
@@ -18,6 +18,7 @@
 #include 
 #include 
 #include 
+#include <linux/sort.h>
 #include "compression.h"

 #define READ_SIZE 16
@@ -32,6 +33,8 @@
 #define MAX_INPUT_PAGES ((BTRFS_MAX_UNCOMPRESSED >> PAGE_SHIFT)+1)
 #define MAX_SAMPLE_SIZE (MAX_INPUT_PAGES*PAGE_SIZE*READ_SIZE/ITER_SHIFT)
 #define BYTE_SET_THRESHOLD 64
+#define BYTE_CORE_SET_LOW  BYTE_SET_THRESHOLD
+#define BYTE_CORE_SET_HIGH 200 // ~80%

 struct bucket_item {
u32 count;
@@ -74,6 +77,45 @@ static struct list_head *heuristic_alloc_workspace(void)
return ERR_PTR(-ENOMEM);
 }

+/* For bucket sorting */
+static inline int bucket_compare(const void *lv, const void *rv)
+{
+   struct bucket_item *l = (struct bucket_item *)(lv);
+   struct bucket_item *r = (struct bucket_item *)(rv);
+
+   return r->count - l->count;
+}
+
+/*
+ * Byte Core set size
+ * How many bytes use 90% of sample
+ */
+static int byte_core_set_size(struct workspace *workspace)
+{
+   int a = 0;
+   u32 coreset_sum = 0;
+   struct bucket_item *bucket = workspace->bucket;
+   u32 core_set_threshold = workspace->sample_size*90/100;
+
+   /* Sort in reverse order */
+   sort(bucket, BUCKET_SIZE, sizeof(*bucket),
+&bucket_compare, NULL);
+
+   for (; a < BYTE_CORE_SET_LOW; a++)
+   coreset_sum += bucket[a].count;
+
+   if (coreset_sum > core_set_threshold)
+   return a;
+
+   for (; a < BYTE_CORE_SET_HIGH && bucket[a].count > 0; a++) {
+   coreset_sum += bucket[a].count;
+   if (coreset_sum > core_set_threshold)
+   break;
+   }
+
+   return a;
+}
+
 static int byte_set_size(const struct workspace *workspace)
 {
int a = 0;
@@ -161,7 +203,14 @@ static int heuristic(struct list_head *ws, struct inode *inode,
if (a > BYTE_SET_THRESHOLD)
return 2;

-   return 1;
+   a = byte_core_set_size(workspace);
+   if (a <= BYTE_CORE_SET_LOW)
+   return 3;
+
+   if (a >= BYTE_CORE_SET_HIGH)
+   return 0;
+
+   return 4;
 }

 const struct btrfs_compress_op btrfs_heuristic = {
--
2.14.1

[PATCH v5 0/6] Btrfs: populate heuristic with code

2017-08-22 Thread Timofey Titovets

Based on kdave for-next

Patches in short:
1. Move the heuristic to use compression workspaces
   A bit tricky, but it works.

2. Add heuristic counters and buffer to workspaces

3. Implement simple input data sampling
   It takes 16-byte samples at 256-byte shifts
   over the input data and collects info about how many
   different bytes (symbols) are found in the sample data

4. Implement a check of the sample for zeroes
   Just check whether all bytes in the sample are 0

5. Add code to calculate
   how many unique bytes are found in the sample data
   That can quickly detect easily compressible data

6. Add code to calculate the byte core set size
   i.e. how many unique bytes account for 90% of the sample data
   That code requires the numbers in the bucket to be sorted
   That can detect easily compressible data with many repeated bytes
   That can detect incompressible data with evenly distributed bytes

Changes v1 -> v2:
  - Change input data iterator shift 512 -> 256
  - Replace magic macro numbers with direct values
  - Drop useless symbol population in bucket,
    as no one cares about where and which symbol is stored
    in the bucket for now

Changes v2 -> v3 (only update #3 patch):
  - Fix u64 division problem by using u32 for input_size
  - Fix input size calculation start - end -> end - start
  - Add missing sort.h header

Changes v3 -> v4 (only update #1 patch):
  - Change counter type in bucket item u16 -> u32
  - Drop other fields from bucket item for now,
    no one uses them

Changes v4 -> v5:
  - Move heuristic code to external file
  - Make heuristic use compression workspaces
  - Add check of sample for zeroes

Timofey Titovets (6):
  Btrfs: heuristic make use compression workspaces
  Btrfs: heuristic workspace add bucket and sample items
  Btrfs: Implement heuristic sampling logic
  Btrfs: heuristic add detection of zeroed sample
  Btrfs: heuristic add byte set calculation
  Btrfs: heuristic add byte core set calculation

 fs/btrfs/Makefile  |   2 +-
 fs/btrfs/compression.c |  18 ++--
 fs/btrfs/compression.h |   7 +-
 fs/btrfs/heuristic.c   | 220 +
 4 files changed, 234 insertions(+), 13 deletions(-)
 create mode 100644 fs/btrfs/heuristic.c

--
2.14.1
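
Steps 5 and 6 of the cover letter above (byte set and byte core set) can be sketched in plain userspace C. This is an illustration under assumptions, not the code from the series: the helper names are hypothetical, qsort() stands in for the kernel's sort(), and the early-exit shortcuts around BYTE_SET_THRESHOLD, BYTE_CORE_SET_LOW and BYTE_CORE_SET_HIGH are left out, so the functions return exact sizes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define BUCKET_SIZE 256     /* one counter per possible byte value */

/* Fill counts[v] with the number of occurrences of byte value v. */
void count_bytes(const uint8_t *sample, size_t len,
                 uint32_t counts[BUCKET_SIZE])
{
    for (int v = 0; v < BUCKET_SIZE; v++)
        counts[v] = 0;
    for (size_t i = 0; i < len; i++)
        counts[sample[i]]++;
}

/* Byte set size: how many distinct byte values occur at all. */
int byte_set_size(const uint32_t counts[BUCKET_SIZE])
{
    int n = 0;

    for (int v = 0; v < BUCKET_SIZE; v++)
        if (counts[v] > 0)
            n++;
    return n;
}

/* qsort() comparator: larger counts sort first. */
int cmp_count_desc(const void *lv, const void *rv)
{
    uint32_t l = *(const uint32_t *)lv;
    uint32_t r = *(const uint32_t *)rv;

    return (l < r) - (l > r);
}

/*
 * Byte core set size: how many of the most frequent byte values are
 * needed to cover more than 90% of the sample. Sorts counts in place,
 * so call byte_set_size() first if both values are needed.
 */
int byte_core_set_size(uint32_t counts[BUCKET_SIZE], size_t len)
{
    uint64_t threshold = (uint64_t)len * 90 / 100;
    uint64_t sum = 0;
    int n = 0;

    qsort(counts, BUCKET_SIZE, sizeof(counts[0]), cmp_count_desc);
    while (n < BUCKET_SIZE && counts[n] > 0) {
        sum += counts[n++];
        if (sum > threshold)
            break;
    }
    return n;
}
```

A sample dominated by one byte value gives a core set of 1, while a sample in which every byte value occurs equally often needs 231 of the 256 values to pass 90%, which is above the ~80% (200) limit used in the series.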

Re: Btrfs Raid5 issue.

2017-08-22 Thread Qu Wenruo




On 2017年08月23日 00:37, Robert LeBlanc wrote:

Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting, I just passed it in to see if it would
help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
would increase the chance of mounting the volume even if it is
degraded, but one could hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
        Total devices 3 FS bytes used 3.30TiB
        devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
        devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
        devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:                   8.18TiB
    Device allocated:                0.00B
    Device unallocated:            8.18TiB
    Device missing:                  0.00B
    Used:                            0.00B
    Free (estimated):                0.00B      (min: 8.00EiB)
    Data ratio:                       0.00
    Metadata ratio:                   0.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
/dev/bcache0    2.08TiB
/dev/bcache16   2.08TiB
/dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
/dev/bcache0   11.00GiB
/dev/bcache16  11.00GiB
/dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
/dev/bcache0   32.00MiB
/dev/bcache16  32.00MiB
/dev/bcache32  32.00MiB

Unallocated:
/dev/bcache0  655.00GiB
/dev/bcache16 655.00GiB
/dev/bcache32 656.49GiB

So it looks like I set the metadata and system data to RAID5 and not
RAID1. I guess that it could have been affected by the write hole
causing the problem I was seeing.

Since I get the same space usage with RAID1 and RAID5,


Well, RAID1 uses more raw space than 3-disk RAID5.
Space efficiency is 50% for RAID1 versus about 67% for 3-disk RAID5.

So you may lose some available space.
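
The arithmetic behind those percentages, as a small sketch. The function names are hypothetical, and the model assumes equally sized devices, two copies of every block for RAID1, and one parity element per full stripe spread over all devices for RAID5.

```c
/* Usable fraction of raw capacity for RAID1: every block is stored twice. */
double raid1_efficiency(void)
{
    return 0.5;
}

/*
 * Usable fraction of raw capacity for RAID5 striped over ndevs devices:
 * one of the ndevs stripe elements holds parity.
 * Returns -1.0 for fewer than two devices.
 */
double raid5_efficiency(int ndevs)
{
    if (ndevs < 2)
        return -1.0;
    return (double)(ndevs - 1) / ndevs;
}

/* Usable capacity for a given raw capacity and efficiency. */
double usable_capacity(double raw, double efficiency)
{
    return raw * efficiency;
}
```

For the three 2.73TiB devices shown above (8.18TiB raw), this gives roughly 4.09TiB usable with RAID1 and roughly 5.45TiB with RAID5.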


I think I'm
just going to use RAID1. I don't need stripe performance or anything
like that.


And RAID5/6 won't always improve performance.
Especially when IO blocksize is smaller than full stripe size (in your 
case it's 128K).


When doing sequential IO with blocksize smaller than 128K, there will be
an obvious performance drop due to the read-modify-write (RMW) cycle.

This is not limited to Btrfs RAID56; it applies to all RAID5/6 implementations.
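
A simplified model of when a write falls into that RMW case. The names are hypothetical and the model assumes a 64KiB stripe element per device, which matches the 128K full stripe quoted above for three devices (two data elements plus one parity element).

```c
#include <stdbool.h>
#include <stdint.h>

#define STRIPE_LEN (64 * 1024ULL)   /* per-device stripe element (assumed) */

/* Data bytes covered by one full RAID5 stripe across ndevs devices. */
uint64_t full_stripe_size(int ndevs)
{
    return (uint64_t)(ndevs - 1) * STRIPE_LEN;
}

/*
 * True if the write [offset, offset + len) covers some full stripe only
 * partially, so parity cannot be computed from the new data alone and
 * the old data or parity has to be read first.
 */
bool write_needs_rmw(uint64_t offset, uint64_t len, uint64_t full_stripe)
{
    if (len == 0)
        return false;
    return (offset % full_stripe) != 0 || (len % full_stripe) != 0;
}
```

In this model a 64K write at offset 0 on the 3-disk array needs RMW, while an aligned 128K or 256K write does not.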


It would be nice if btrfs supported hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I'm familiar with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but the complexity is quite overkill for my needs at home.

As far as bcache performance:
I have two Crucial MX200 250GB drives that were md raid1 containing
/boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
would be painfully slow. Running iostat, the SSDs would be doing a few
hundred IOPs and the backing disks would be very busy and would be the
limiting factor overall. Even though apt-get just downloaded the file
(should be on the SSDs because of writeback), it still involved the
backend disks way too much. The amount of dirty data was always less
than 10% so there should have been plenty of space to free up cache
without having to flush. I experimented with changing the size of
contiguous IO to force more to cache, increasing the dirty ratio, etc,
nothing seemed to provide the performance I was hoping for. To be fair
having a pair of SSDs (md raid1) caching three spindles (btrfs raid5)
may not be an ideal configuration. If I had three SSDs, one for each
drive, then it may have performed better? I have also ~980 snapshots
spread over a year's time, so I don't know how much that impacts
things. I did use a btrfs utility to help find duplicate files/chunks
and dedupe them so that updated system binaries between upgraded LXC
containers would use the same space on disk and be more efficient in
bcache cache usage.


Well, RAID1 SSDs, offline dedupe, bcache, many snapshots: way more
complex than I thought.

So I'm uncertain where the bottleneck is.



After restoring the root and LXC roots snapshots on the SSD (broke the
md raid1 so I could restore to one of them), I ran apt-get and got
upwards to 2,400 IOPs with it being sustained around 1,200 IOPs (btrfs
single on md raid1 degraded). I know that btrfs has some performance
challenges, but I don't think I was hitting those. It was most likely a
very unusual set-up of bcache and btrfs raid that caused the problem.
I have bcache on 10 year old desktop box with a

user snapshots

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (19:36), Peter Grandi wrote:

> For somewhat good reasons subvolumes including snapshots cannot be
> deleted by users though unless mount option 'user_subvol_rm_allowed' is
> used.

Also in https://btrfs.wiki.kernel.org/index.php/Mount_options
"user_subvol_rm_allowed (...) Use with caution."

Why? What is the problem?

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<22940.31139.194399.982...@tree.ty.sabi.co.uk>

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (19:36), Peter Grandi wrote:

> Indeed and there is a fair description of some options for
> subvolume nesting policies here which may be interesting to the
> original poster:
> 
>   https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Layout
> 
> It is unsurprising to me that there are tradeoffs involved in
> every choice. I find the "Flat" layout particularly desirable.

My layout is already nearly "flat".
It seems my decision was right :-)



> Btrfs snapshots can only be done for a whole subvolume.

I know this.

> Subvolumes and snapshots can be created by users, but too many snapshots
> (see below) can cause trouble. For somewhat good reasons subvolumes
> including snapshots cannot be deleted by users though unless mount option
> 'user_subvol_rm_allowed' is used.

Ooops, this is new to me!

framstag@fex:~: btrfs subvolume create xx
Create subvolume './xx'

framstag@fex:~: btrfs subvolume delete xx
Delete subvolume '/local/home/framstag/xx'
ERROR: cannot delete '/local/home/framstag/xx' - Operation not permitted

This means root has to remove the subvolume.
Is it possible to disallow creation of subvolumes for normal users?



> >>> Because Netapp do it this way - for at least 20 years and we
> >>> have a multi-PB Netapp storage environment. No chance to change
> >>> this.
> 
> Send patches :-).

For waffle or btrfs? :-)


> Assumptions that all Btrfs features such as snapshots are
> infinitely scalable at no cost may be optimistic:
> 
>   
> https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

"when you do device removes on file systems with a lot of snapshots, it
 is unbelievably slow ... took nearly a week to move 20GB of FS data from
 one device to the other using that method"
  
"a balance on 2TB of data that was heavily snapshotted - it took 3 months" 

ARGH!!
Thanks for this warning!
I will rethink my multi-snapshot plan!

-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<22940.31139.194399.982...@tree.ty.sabi.co.uk>

[josef-btrfs:slab-priority 4/6] fs//ntfs/attrib.c:2549:35: error: implicit declaration of function 'inode_to_bdi'

2017-08-22 Thread kbuild test robot

tree:   https://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
slab-priority
head:   a1be3b41415243d20c90e9e92e82808fe1ff91a0
commit: fe049b0156a10dd0bb3fbf3d4dad3ca943874f10 [4/6] remove mapping from 
balance_dirty_pages*()
config: i386-randconfig-a1-201734 (attached as .config)
compiler: gcc-5 (Debian 5.4.1-2) 5.4.1 20160904
reproduce:
git checkout fe049b0156a10dd0bb3fbf3d4dad3ca943874f10
# save the attached .config to linux build tree
make ARCH=i386 

All error/warnings (new ones prefixed by >>):

   fs//ntfs/attrib.c: In function 'ntfs_attr_set':
>> fs//ntfs/attrib.c:2549:35: error: implicit declaration of function 
>> 'inode_to_bdi' [-Werror=implicit-function-declaration]
  balance_dirty_pages_ratelimited(inode_to_bdi(inode),
  ^
>> fs//ntfs/attrib.c:2549:35: warning: passing argument 1 of 
>> 'balance_dirty_pages_ratelimited' makes pointer from integer without a cast 
>> [-Wint-conversion]
   In file included from include/linux/memcontrol.h:31:0,
from include/linux/swap.h:8,
from fs//ntfs/attrib.c:26:
   include/linux/writeback.h:380:6: note: expected 'struct backing_dev_info *' 
but argument is of type 'int'
void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
 ^
   fs//ntfs/attrib.c:2591:35: warning: passing argument 1 of 
'balance_dirty_pages_ratelimited' makes pointer from integer without a cast 
[-Wint-conversion]
  balance_dirty_pages_ratelimited(inode_to_bdi(inode),
  ^
   In file included from include/linux/memcontrol.h:31:0,
from include/linux/swap.h:8,
from fs//ntfs/attrib.c:26:
   include/linux/writeback.h:380:6: note: expected 'struct backing_dev_info *' 
but argument is of type 'int'
void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
 ^
   fs//ntfs/attrib.c:2609:35: warning: passing argument 1 of 
'balance_dirty_pages_ratelimited' makes pointer from integer without a cast 
[-Wint-conversion]
  balance_dirty_pages_ratelimited(inode_to_bdi(inode),
  ^
   In file included from include/linux/memcontrol.h:31:0,
from include/linux/swap.h:8,
from fs//ntfs/attrib.c:26:
   include/linux/writeback.h:380:6: note: expected 'struct backing_dev_info *' 
but argument is of type 'int'
void balance_dirty_pages_ratelimited(struct backing_dev_info *bdi,
 ^
   cc1: some warnings being treated as errors

vim +/inode_to_bdi +2549 fs//ntfs/attrib.c

  2472  
  2473  /**
  2474   * ntfs_attr_set - fill (a part of) an attribute with a byte
  2475   * @ni: ntfs inode describing the attribute to fill
  2476   * @ofs:offset inside the attribute at which to start to fill
  2477   * @cnt:number of bytes to fill
  2478   * @val:the unsigned 8-bit value with which to fill the 
attribute
  2479   *
  2480   * Fill @cnt bytes of the attribute described by the ntfs inode @ni 
starting at
  2481   * byte offset @ofs inside the attribute with the constant byte @val.
  2482   *
  2483   * This function is effectively like memset() applied to an ntfs 
attribute.
  2484   * Note thie function actually only operates on the page cache pages 
belonging
  2485   * to the ntfs attribute and it marks them dirty after doing the 
memset().
  2486   * Thus it relies on the vm dirty page write code paths to cause the 
modified
  2487   * pages to be written to the mft record/disk.
  2488   *
  2489   * Return 0 on success and -errno on error.  An error code of -ESPIPE 
means
  2490   * that @ofs + @cnt were outside the end of the attribute and no write 
was
  2491   * performed.
  2492   */
  2493  int ntfs_attr_set(ntfs_inode *ni, const s64 ofs, const s64 cnt, const 
u8 val)
  2494  {
  2495  ntfs_volume *vol = ni->vol;
  2496  struct inode *inode = VFS_I(ni);
  2497  struct address_space *mapping;
  2498  struct page *page;
  2499  u8 *kaddr;
  2500  pgoff_t idx, end;
  2501  unsigned start_ofs, end_ofs, size;
  2502  
  2503  ntfs_debug("Entering for ofs 0x%llx, cnt 0x%llx, val 0x%hx.",
  2504  (long long)ofs, (long long)cnt, val);
  2505  BUG_ON(ofs < 0);
  2506  BUG_ON(cnt < 0);
  2507  if (!cnt)
  2508  goto done;
  2509  /*
  2510   * FIXME: Compressed and encrypted attributes are not supported 
when
  2511   * writing and we should never have gotten here for them.
  2512   */
  2513  BUG_ON(NInoCompressed(ni));
  2514  BUG_ON(NInoEncrypted(ni));
  2515  mapping = VFS_I(ni)->i_mapping;
  2516  /* Work out the starting index and page offset. */
  2517  idx = ofs >> PAGE_SHIFT;
  2518  start_ofs = ofs & ~PAGE_MASK;
  2519  /* Work out the ending

[PATCH][v2] btrfs: change how we decide to commit transactions during flushing

2017-08-22 Thread josef

From: Josef Bacik 

Nikolay reported that generic/273 was failing currently with ENOSPC.
Turns out this is because we get to the point where the outstanding
reservations are greater than the pinned space on the fs.  This is a
mistake, previously we used the current reservation amount in
may_commit_transaction, not the entire outstanding reservation amount.
Fix this to find the minimum byte size needed to make progress in
flushing, and pass that into may_commit_transaction.  From there we can
make a smarter decision on whether to commit the transaction or not.
This fixes the failure in generic/273.

Reported-by: Nikolay Borisov 
Signed-off-by: Josef Bacik 
---
v1->v2:
- check the ticket bytes in may_commit_transaction instead of copying bytes
  around.
- clean up may_commit_transaction to remove unused arguments

 fs/btrfs/extent-tree.c | 42 --
 1 file changed, 28 insertions(+), 14 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index a5d59dd..1464678 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -4836,6 +4836,13 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
}
 }
 
+struct reserve_ticket {
+   u64 bytes;
+   int error;
+   struct list_head list;
+   wait_queue_head_t wait;
+};
+
 /**
  * maybe_commit_transaction - possibly commit the transaction if its ok to
  * @root - the root we're allocating for
@@ -4847,18 +4854,29 @@ static void shrink_delalloc(struct btrfs_fs_info *fs_info, u64 to_reclaim,
  * will return -ENOSPC.
  */
 static int may_commit_transaction(struct btrfs_fs_info *fs_info,
- struct btrfs_space_info *space_info,
- u64 bytes, int force)
+ struct btrfs_space_info *space_info)
 {
+   struct reserve_ticket *ticket = NULL;
struct btrfs_block_rsv *delayed_rsv = &fs_info->delayed_block_rsv;
struct btrfs_trans_handle *trans;
+   u64 bytes;
 
trans = (struct btrfs_trans_handle *)current->journal_info;
if (trans)
return -EAGAIN;
 
-   if (force)
-   goto commit;
+   spin_lock(&space_info->lock);
+   if (!list_empty(&space_info->priority_tickets))
+   ticket = list_first_entry(&space_info->priority_tickets,
+ struct reserve_ticket, list);
+   else if (!list_empty(&space_info->tickets))
+   ticket = list_first_entry(&space_info->tickets,
+ struct reserve_ticket, list);
+   bytes = (ticket) ? ticket->bytes : 0;
+   spin_unlock(&space_info->lock);
+
+   if (!bytes)
+   return 0;
 
/* See if there is enough pinned space to make this reservation */
if (percpu_counter_compare(&space_info->total_bytes_pinned,
@@ -4873,8 +4891,12 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
return -ENOSPC;
 
spin_lock(&delayed_rsv->lock);
+   if (delayed_rsv->size > bytes)
+   bytes = 0;
+   else
+   bytes -= delayed_rsv->size;
if (percpu_counter_compare(&space_info->total_bytes_pinned,
-  bytes - delayed_rsv->size) < 0) {
+  bytes) < 0) {
spin_unlock(&delayed_rsv->lock);
return -ENOSPC;
}
@@ -4888,13 +4910,6 @@ static int may_commit_transaction(struct btrfs_fs_info *fs_info,
return btrfs_commit_transaction(trans);
 }
 
-struct reserve_ticket {
-   u64 bytes;
-   int error;
-   struct list_head list;
-   wait_queue_head_t wait;
-};
-
 /*
  * Try to flush some data based on policy set by @state. This is only advisory
  * and may fail for various reasons. The caller is supposed to examine the
@@ -4944,8 +4959,7 @@ static void flush_space(struct btrfs_fs_info *fs_info,
ret = 0;
break;
case COMMIT_TRANS:
-   ret = may_commit_transaction(fs_info, space_info,
-num_bytes, 0);
+   ret = may_commit_transaction(fs_info, space_info);
break;
default:
ret = -ENOSPC;
-- 
2.7.4
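
The decision made by the patched may_commit_transaction() can be modelled in userspace roughly as below. This is a sketch under assumptions: the enum and function names are hypothetical, locking and the percpu counter are replaced by plain integers, and the first check (only partly visible in the hunk) is assumed to mean "commit if the pinned bytes alone cover the ticket". Other checks between the two hunks are not modelled.

```c
#include <stdint.h>

/* Hypothetical outcome codes for the model, not kernel return values. */
enum commit_decision {
    COMMIT_NOTHING_WAITING, /* no ticket queued, nothing to do */
    COMMIT_YES,             /* committing can free enough pinned space */
    COMMIT_ENOSPC           /* committing would not help */
};

/*
 * ticket_bytes:     size of the first waiting reservation ticket (0 if none)
 * pinned:           bytes that a transaction commit would unpin
 * delayed_rsv_size: size of the delayed block reservation
 */
enum commit_decision may_commit_model(uint64_t ticket_bytes,
                                      uint64_t pinned,
                                      uint64_t delayed_rsv_size)
{
    uint64_t needed;

    if (ticket_bytes == 0)
        return COMMIT_NOTHING_WAITING;

    /* Pinned space alone satisfies the ticket. */
    if (pinned >= ticket_bytes)
        return COMMIT_YES;

    /*
     * Clamp at zero instead of computing ticket_bytes - delayed_rsv_size
     * directly, which would wrap around for unsigned operands.
     */
    if (delayed_rsv_size > ticket_bytes)
        needed = 0;
    else
        needed = ticket_bytes - delayed_rsv_size;

    return pinned >= needed ? COMMIT_YES : COMMIT_ENOSPC;
}
```

For example, a 100-byte ticket with 50 bytes pinned and a 60-byte delayed reservation needs 40 more bytes and results in a commit, while the same ticket with only 30 bytes pinned results in ENOSPC.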


Re: netapp-alike snapshots?

2017-08-22 Thread Peter Grandi

[ ... ]

 It is beneficial to not have snapshots in-place. With a local
 directory of snapshots, [ ... ]

Indeed and there is a fair description of some options for
subvolume nesting policies here which may be interesting to the
original poster:

  https://btrfs.wiki.kernel.org/index.php/SysadminGuide#Layout

It is unsurprising to me that there are tradeoffs involved in
every choice. I find the "Flat" layout particularly desirable.

>>> Netapp snapshots are invisible for tools doing opendir()/
>>> readdir() One could simulate this with symlinks for the
>>> snapshot directory: store the snapshot elsewhere (not inplace)
>>> and create a symlink to it, in every directory.

More precisely in every subvolume root directory.

>>> My users want the snapshots locally in a .snapshot
>>> subdirectory.

Btrfs snapshots can only be done for a whole subvolume. Subvolumes
and snapshots can be created by users, but too many snapshots (see
below) can cause trouble. For somewhat good reasons subvolumes
including snapshots cannot be deleted by users though unless mount
option 'user_subvol_rm_allowed' is used.

>>> Because Netapp do it this way - for at least 20 years and we
>>> have a multi-PB Netapp storage environment. No chance to change
>>> this.

Send patches :-).

> Not only du works recursivly, but also find and with option
> also ls, grep, etc.

Note also that subvolume root directory inodes are indeed root
directory inodes so they can be 'mount'ed and therefore the
transition from a subvolume into a contained subvolume can be
detected at the mountpoint.

So 'find' has the '-xdev' option and 'du' has the '-x' option, and
nearly all other tools have something similar, so perhaps someone
expects that to happen :-).
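
The check those options rely on can be sketched with stat(): a tool remembers the st_dev of its starting directory and refuses to descend into anything with a different st_dev. As far as I know Btrfs reports a distinct st_dev for each subvolume, so the same comparison detects the transition into a nested subvolume or snapshot. crosses_device() is a hypothetical helper name, and note that stat() follows symlinks.

```c
#include <sys/stat.h>

/*
 * Returns 1 if `child` lives on a different device ID than `parent`,
 * 0 if both report the same st_dev, and -1 if either stat() call fails.
 */
int crosses_device(const char *parent, const char *child)
{
    struct stat ps;
    struct stat cs;

    if (stat(parent, &ps) != 0 || stat(child, &cs) != 0)
        return -1;
    return ps.st_dev != cs.st_dev ? 1 : 0;
}
```

A recursive walker would call this for each directory before descending and skip the entry when the result is 1.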

> And it would require a bind mount for EVERY directory. There can
> be hundreds... thousends!

Assumptions that all Btrfs features such as snapshots are
infinitely scalable at no cost may be optimistic:

  
https://btrfs.wiki.kernel.org/index.php/Gotchas#Having_many_subvolumes_can_be_very_slow

Re: [PATCH 0/5 v2] btrfs-progs: convert fixes + reiserfs support

2017-08-22 Thread David Sterba

On Thu, Jul 27, 2017 at 11:47:18AM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> Changes since v1:
> - reiserfs conversion:
>   - use bool instead of int
>   - catch 'impossible' condition of multiple discontiguous tails
>   - properly handle hole followed by tail
>   - add testing for combinations of real blocks, tails, and holes
>   - print error indicating filename (and key) that caused a failure
>   - constify buffer arg to btrfs_insert_inline_extent
>   - add tails=on to reiserfs mount options
>   - fixed absence of libreiserfscore to be non-fatal unless specificially
> enabled
> 
> - btrfs-progs: convert: use search_cache_extent in migrate_one_reserved_range
>   - In testing, we would not be able to roll back to part of the 0-1MB range
> not being migrated.
> 
> - btrfs-progs: tests: fix typo in convert-tests/008-readonly-image
>   - The test used ext2_save instead of ext2_saved as the filename

Oh crap, I noticed v2 after I had finished merging v1. The changes have
been now transferred, so the version in devel is v2 + my fixups. The 2
new patches ("use search_cache_extent in migrate_one_reserved_range" and
"tests: fix typo in convert-tests/008-readonly-image") have been also
added.

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (22:36), Roman Mamedov wrote:

> > My users want the snapshots locally in a .snapshot subdirectory.
> > Because Netapp do it this way - for at least 20 years and we have a
> > multi-PB Netapp storage environment.
> 
> Just a side note, you do know that only subvolumes can be snapshotted on 
> Btrfs,
> not any regular directory? And that snapshots are not recursive, i.e. if a
> subvolume "contains" other subvolumes (hint: it really doesn't), snapshots of
> the parent one will not include content of subvolumes below that in the tree.

Yes, I know this. But thanks for your hints! (Other readers here may
not be aware of this)


> I don't know how Netapp does this

I am only a Netapp/waffle user, so I know no internals.
Netapp is not Linux based and definitely a lot older than btrfs.


> from the way you describe that setup it feels like with Btrfs you're
> still in for some bad surprises and a part of your expectations will not
> be met.

I will take care :-)


> Do you plan to make each and every directory and subdirectory a subvolume

No. My idea is to place a symlink in every subdirectory pointing to the
snapshot directory. Not yet programmed...
I was hoping someone already has implemented such a feature.


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:<2017083647.350ca27d@natsu>

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (19:19), A L wrote:
> Perhaps using a bind mount? It would look and work the same as a ordinary fs. 
> Just need to make sure du uses one filesystem.
> 
>  From: Ulli Horlacher  -- Sent: 2017-08-22 
> - 18:57 
> 
> > On Tue 2017-08-22 (21:45), Roman Mamedov wrote:
> > 
> >> It is beneficial to not have snapshots in-place. With a local directory of
> >> snapshots, issuing things like "find", "grep -r" or even "du" will take an
> >> inordinate amount of time and will produce a result you do not expect.
> > 
> > Netapp snapshots are invisible for tools doing opendir()/readdir()
> > One could simulate this with symlinks for the snapshot directory:
> > store the snapshot elsewhere (not inplace) and create a symlink to it, in
> > every directory.

Not only du works recursively, but also find and, with an option, also
ls, grep, etc.

And it would require a bind mount for EVERY directory. There can be
hundreds... thousands!


-- 
Ullrich Horlacher              Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart         E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a                Tel:    ++49-711-68565868
70569 Stuttgart (Germany)      WWW:    http://www.tik.uni-stuttgart.de/
REF:

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Peter Grandi

[ ... ]

>> There is no fixed relationship between the root directory
>> inode of a subvolume and the root directory inode of any
>> other subvolume or the main volume.

> Actually, there is, because it's inherently rooted in the
> hierarchy of the volume itself. That root inode for the
> subvolume is anchored somewhere under the next higher
> subvolume.

This stupid point relies on ignoring that it is not mandatory to
mount the main volume, and that therefore "There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume", because the "root directory inode" of the "main volume"
may not be mounted at all.

This stupid point also relies on ignoring that subvolumes can be
mounted *also* under another directory, even if the main volume
is mounted somewhere else. Suppose that the following applies:

  subvol=5    /local
  subvol=383  /local/.backup/home
  subvol=383  /mnt/home-backup

and you are given the mountpoint '/mnt/home-backup', how can you
find the main volume mountpoint '/local' from that?

Please explain how '/mnt/home-backup' is indeed "inherently
rooted in the hierarchy of the volume itself", because there is
always a "fixed relationship between the root directory inode of
a subvolume and the root directory inode of any other subvolume
or the main volume".

[ ... ]

> Again, it does, it's just not inherently exposed to userspace
> unless you mount the top-level subvolume (subvolid=5 and/or
> subvol=/ in mount options).

This extra stupid point is based on ignoring that to "mount the
top-level subvolume" relies on knowing already which one is the
"top-level subvolume", which is begging the question.

[ ... ]

Re: netapp-alike snapshots?

2017-08-22 Thread Roman Mamedov

On Tue, 22 Aug 2017 18:57:25 +0200
Ulli Horlacher  wrote:

> On Tue 2017-08-22 (21:45), Roman Mamedov wrote:
> 
> > It is beneficial to not have snapshots in-place. With a local directory of
> > snapshots, issuing things like "find", "grep -r" or even "du" will take an
> > inordinate amount of time and will produce a result you do not expect.
> 
> Netapp snapshots are invisible for tools doing opendir()/readdir()
> One could simulate this with symlinks for the snapshot directory:
> store the snapshot elsewhere (not inplace) and create a symlink to it, in
> every directory.
> 
> 
> > Personally I prefer to have a /snapshots directory on every FS
> 
> My users want the snapshots locally in a .snapshot subdirectory.
> Because Netapp do it this way - for at least 20 years and we have a
> multi-PB Netapp storage environment.
> No chance to change this.

Just a side note, you do know that only subvolumes can be snapshotted on Btrfs,
not any regular directory? And that snapshots are not recursive, i.e. if a
subvolume "contains" other subvolumes (hint: it really doesn't), snapshots of
the parent one will not include content of subvolumes below that in the tree.

I don't know how Netapp does this, but from the way you describe that setup it
feels like with Btrfs you're still in for some bad surprises, and a part of
your expectations will not be met.

Do you plan to make each and every directory and subdirectory a subvolume (so
that it could have a trail of its own snapshots)? There will be performance
implications to that. Also deleting subvolumes can only be done via the
"btrfs" tool, they won't delete like normal dirs, e.g. when trying to do that
remotely via NFS or Samba share.

-- 
With respect,
Roman

Re: netapp-alike snapshots?

2017-08-22 Thread A L

Perhaps using a bind mount? It would look and work the same as an ordinary fs.
You just need to make sure du stays on one filesystem.

 From: Ulli Horlacher  -- Sent: 2017-08-22 - 
18:57 

> On Tue 2017-08-22 (21:45), Roman Mamedov wrote:
> 
>> It is beneficial to not have snapshots in-place. With a local directory of
>> snapshots, issuing things like "find", "grep -r" or even "du" will take an
>> inordinate amount of time and will produce a result you do not expect.
> 
> Netapp snapshots are invisible for tools doing opendir()/readdir()
> One could simulate this with symlinks for the snapshot directory:
> store the snapshot elsewhere (not inplace) and create a symlink to it, in
> every directory.
> 
> 
>> Personally I prefer to have a /snapshots directory on every FS
> 
> My users want the snapshots locally in a .snapshot subdirectory.
> Because Netapp do it this way - for at least 20 years and we have a
> multi-PB Netapp storage environment.
> No chance to change this.
> 
> -- 
> Ullrich Horlacher  Server und Virtualisierung
> Rechenzentrum TIK 
> Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
> Allmandring 30aTel:++49-711-68565868
> 70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
> REF:<20170822214531.44538589@natsu>



Re: [PATCH 3/3] btrfs-progs: convert: add support for converting reiserfs

2017-08-22 Thread David Sterba

On Tue, Jul 25, 2017 at 04:54:43PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> This patch adds support to convert reiserfs file systems in-place to btrfs.
> 
> It will convert extended attribute files to btrfs extended attributes,
> translate ACLs, coalesce tails that consist of multiple items into one item,
> and convert tails that are too big into indirect files.
> 
> This requires that libreiserfscore 3.6.27 be available.
> 
> Many of the test cases for convert apply regardless of what the source
> file system is and using ext4 is sufficient.  I've included several
> test cases that are reiserfs-specific.
> 
> Signed-off-by: Jeff Mahoney 

Patches merged, with quite a few small fixups here and there. It took me
less time to fix them on the way than to take them through the
mailing list. The tests were split into a separate patch. I'm fine with
keeping the reiserfs bits in one patch as it's an isolated feature.
The tests are now running, I'll let them finish. There's some code or
test code duplication that can be cleaned up eventually, but for now I
consider reiserfs conversion support done. Thanks.

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (21:45), Roman Mamedov wrote:

> It is beneficial to not have snapshots in-place. With a local directory of
> snapshots, issuing things like "find", "grep -r" or even "du" will take an
> inordinate amount of time and will produce a result you do not expect.

Netapp snapshots are invisible for tools doing opendir()/readdir()
One could simulate this with symlinks for the snapshot directory:
store the snapshot elsewhere (not inplace) and create a symlink to it, in
every directory.

> Personally I prefer to have a /snapshots directory on every FS

My users want the snapshots locally in a .snapshot subdirectory.
Because Netapp do it this way - for at least 20 years and we have a
multi-PB Netapp storage environment.
No chance to change this.

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170822214531.44538589@natsu>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Roman Mamedov

On Tue, 22 Aug 2017 17:45:37 +0200
Ulli Horlacher  wrote:

> In perl I have now:
> 
> $root = $volume;
> while (`btrfs subvolume show "$root" 2>/dev/null` !~ /toplevel subvolume/) {
>   $root = dirname($root);
>   last if $root eq '/';
> }
> 
> 

If you are okay with rolling your own solutions like this, take a look at
"btrfs filesystem usage ". It will print the blockdevice used for
mounting the base FS. From that you can find the mountpoint via /proc/mounts.

Performance-wise it seems to work instantly on an almost full 2TB FS.
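
The /proc/mounts lookup suggested above could be sketched roughly as follows.
This is only an illustration: the function names are made up for this note,
the device name has to come from whatever "btrfs filesystem usage" printed,
and the sketch assumes the kernel lists "subvolid=5" among the mount options
of the top-level subvolume (recent kernels do, older ones may not). Paths
containing spaces are escaped as \040 in /proc/mounts and are not decoded
here.

```python
def mountpoints_for_device(device, mounts_text):
    """Return (mountpoint, options) pairs for every btrfs mount of `device`.

    `mounts_text` is the content of /proc/mounts (passed in as a string so
    the function can be tried on sample data).
    """
    matches = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip empty or malformed lines
        source, mountpoint, fstype, options = fields[:4]
        if source == device and fstype == "btrfs":
            matches.append((mountpoint, options.split(",")))
    return matches


def toplevel_mountpoint(device, mounts_text):
    """Return the mountpoint whose options include subvolid=5, or None."""
    for mountpoint, options in mountpoints_for_device(device, mounts_text):
        if "subvolid=5" in options:
            return mountpoint
    return None
```

On a live system one would call it with open("/proc/mounts").read() as the
second argument. A None result means the top-level subvolume is not mounted
anywhere, which is the case Peter Grandi describes elsewhere in the thread.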

-- 
With respect,
Roman

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (18:08), Peter Becker wrote:
> This is possible. Use the -b or -B option.
> 
> -b basedir places the snapshot in basedir with a directory structure
> that mimics the mountpoint
> -B basedir places the snapshots in basedir with NO additional
> subdirectory structure
> 
> 2017-08-22 16:24 GMT+02:00 Ulli Horlacher :
> > On Tue 2017-08-22 (15:44), Peter Becker wrote:
> >> I use: https://github.com/jf647/btrfs-snap
> >>
> >> 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> >> > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> >> > You can find these snapshots in every local directory (readonly).
> >> > Example:
> >> >
> >> > framstag@fex:/sw/share: ll .snapshot/
> >> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> >> > .snapshot/daily.2017-08-15_0010
> >> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> >> > .snapshot/daily.2017-08-16_0010
> >> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> >> > .snapshot/daily.2017-08-17_0010
> >> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> >> > .snapshot/daily.2017-08-18_0010
> >> > drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
> >> > .snapshot/daily.2017-08-19_0010
> >> > drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
> >> > .snapshot/daily.2017-08-20_0010
> >> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> >> > .snapshot/daily.2017-08-21_0010
> >> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> >> > .snapshot/hourly.2017-08-20_1210
> >> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> >> > .snapshot/hourly.2017-08-20_1610
> >> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> >> > .snapshot/hourly.2017-08-20_2010
> >> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> >> > .snapshot/hourly.2017-08-21_0810
> >> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> >> > .snapshot/hourly.2017-08-21_1210
> >> > drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
> >> > .snapshot/hourly.2017-08-21_1610
> >
> > btrfs-snap does not create local .snapshot/ sub-directories, but saves the
> > snapshots in the toplevel root volume directory.

No, I want a subdirectory named .snapshot in EVERY directory of the source
tree, for example:

framstag@fex:/sw/share: ll .snapshot a*/.snapshot a*/*/.snapshot
drwxrwxrwx  root root - 2017-08-22 16:10:01 .snapshot
drwxrwxrwx  root root - 2017-08-22 16:10:01 aggis-1.0/.snapshot
drwxrwxrwx  root root - 2017-08-22 16:10:01 aggis-1.0/bin/.snapshot
drwxrwxrwx  root root - 2017-08-22 16:10:01 aggis-1.0/man/.snapshot

(this is on a Netapp NFS volume)

btrfs-snap creates a snapshot directory tree on a different path.


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:

Re: netapp-alike snapshots?

2017-08-22 Thread Roman Mamedov

On Tue, 22 Aug 2017 16:24:51 +0200
Ulli Horlacher  wrote:

> On Tue 2017-08-22 (15:44), Peter Becker wrote:
> > I use: https://github.com/jf647/btrfs-snap
> > 
> > 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> > > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> > > You can find these snapshots in every local directory (readonly).
> > > Example:
> > >
> > > framstag@fex:/sw/share: ll .snapshot/
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-15_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-16_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-17_0010
> > > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > > .snapshot/daily.2017-08-18_0010
> > > drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
> > > .snapshot/daily.2017-08-19_0010
> > > drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
> > > .snapshot/daily.2017-08-20_0010
> > > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > > .snapshot/daily.2017-08-21_0010
> > > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > > .snapshot/hourly.2017-08-20_1210
> > > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > > .snapshot/hourly.2017-08-20_1610
> > > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > > .snapshot/hourly.2017-08-20_2010
> > > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > > .snapshot/hourly.2017-08-21_0810
> > > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > > .snapshot/hourly.2017-08-21_1210
> > > drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
> > > .snapshot/hourly.2017-08-21_1610
> 
> btrfs-snap does not create local .snapshot/ sub-directories, but saves the
> snapshots in the toplevel root volume directory.

It is beneficial to not have snapshots in-place. With a local directory of
snapshots, issuing things like "find", "grep -r" or even "du" will take an
inordinate amount of time and will produce a result you do not expect.

For some of those tools the problem can be avoided (by always keeping in mind
to use "-x" with du, or "--one-file-system" with tar), but not for all of them.
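
For scripts of your own, the same "stay on one filesystem" behaviour can be
approximated by comparing device numbers, which is roughly what "du -x" does.
A minimal sketch, assuming that each Btrfs subvolume (and therefore each
snapshot) reports its own st_dev; the function name is invented for this
note:

```python
import os


def walk_one_filesystem(top):
    """Yield file paths under `top`, skipping directories on other devices."""
    top_dev = os.lstat(top).st_dev
    for dirpath, dirnames, filenames in os.walk(top):
        # Prune subdirectories whose device number differs from the start
        # directory. Editing dirnames in place stops os.walk from descending.
        dirnames[:] = [
            name for name in dirnames
            if os.lstat(os.path.join(dirpath, name)).st_dev == top_dev
        ]
        for name in filenames:
            yield os.path.join(dirpath, name)
```

This only helps for tools you control; it does nothing for programs that
recurse without such a check, which is the point being made above.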

Personally I prefer to have a /snapshots directory on every FS, and e.g. timed
snapshots of /home/username/src will live in /snapshots/home-username-src/. No
point to hide it there with a dot either, as it's convenient to be able to
browse older snapshots with GUI filemanagers (which hide dot-files by default).
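
The naming scheme described above can be written down in a few lines. This is
just a sketch of the convention, not part of any existing tool; the timestamp
format is an assumption borrowed from the Netapp listing earlier in the
thread.

```python
from datetime import datetime


def snapshot_dir(source_path, snapshot_root="/snapshots"):
    """Map "/home/username/src" to "/snapshots/home-username-src"."""
    flat = source_path.strip("/").replace("/", "-")
    return f"{snapshot_root}/{flat}"


def snapshot_path(source_path, when, snapshot_root="/snapshots"):
    """Append a timestamp (hypothetical format) to the snapshot directory."""
    stamp = when.strftime("%Y-%m-%d_%H%M")
    return f"{snapshot_dir(source_path, snapshot_root)}/{stamp}"
```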

-- 
With respect,
Roman

Re: Btrfs Raid5 issue.

2017-08-22 Thread Robert LeBlanc

Thanks for the explanations. Chris, I don't think 'degraded' did
anything to help the mounting, I just passed it in to see if it would
help (I'm not sure if btrfs is "smart" enough to ignore a drive if it
would increase the chance of mounting the volume even if it is
degraded, but one could hope). I believe the key was 'nologreplay'.
Here is some info about the corrupted fs:

# btrfs fi show /tmp/root/
Label: 'kvm-btrfs'  uuid: fef29f0a-dc4c-4cc4-b524-914e6630803c
        Total devices 3 FS bytes used 3.30TiB
        devid    1 size 2.73TiB used 2.09TiB path /dev/bcache32
        devid    2 size 2.73TiB used 2.09TiB path /dev/bcache0
        devid    3 size 2.73TiB used 2.09TiB path /dev/bcache16

# btrfs fi usage /tmp/root/
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
WARNING: RAID56 detected, not implemented
Overall:
    Device size:                   8.18TiB
    Device allocated:                0.00B
    Device unallocated:            8.18TiB
    Device missing:                  0.00B
    Used:                            0.00B
    Free (estimated):                0.00B      (min: 8.00EiB)
    Data ratio:                       0.00
    Metadata ratio:                   0.00
    Global reserve:              512.00MiB      (used: 0.00B)

Data,RAID5: Size:4.15TiB, Used:3.28TiB
   /dev/bcache0    2.08TiB
   /dev/bcache16   2.08TiB
   /dev/bcache32   2.08TiB

Metadata,RAID5: Size:22.00GiB, Used:20.69GiB
   /dev/bcache0   11.00GiB
   /dev/bcache16  11.00GiB
   /dev/bcache32  11.00GiB

System,RAID5: Size:64.00MiB, Used:400.00KiB
   /dev/bcache0   32.00MiB
   /dev/bcache16  32.00MiB
   /dev/bcache32  32.00MiB

Unallocated:
   /dev/bcache0  655.00GiB
   /dev/bcache16 655.00GiB
   /dev/bcache32 656.49GiB

So it looks like I set the metadata and system data to RAID5 and not
RAID1. I guess that it could have been affected by the write hole
causing the problem I was seeing.
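
As a sanity check on the listing, the per-device figures and the "Size:"
values are consistent with RAID5 striped across all three devices, where one
device's worth of each stripe holds parity. A small sketch of that
arithmetic (the function name is made up; the numbers are the ones printed
above):

```python
def raid5_usable(per_device_allocation, device_count):
    """Usable size of RAID5 chunks striped across `device_count` devices."""
    if device_count < 2:
        raise ValueError("RAID5 needs at least two devices")
    # One device's share of every stripe is parity, the rest is data.
    return per_device_allocation * (device_count - 1)


metadata_gib = raid5_usable(11.00, 3)  # listing reports Size:22.00GiB
system_mib = raid5_usable(32.00, 3)    # listing reports Size:64.00MiB
data_tib = raid5_usable(2.08, 3)       # listing reports Size:4.15TiB
```

The data figure comes out slightly above the reported 4.15TiB only because
the per-device value is already rounded to two decimals in the listing.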

Since I get the same space usage with RAID1 and RAID5, I think I'm
just going to use RAID1. I don't need stripe performance or anything
like that. It would be nice if btrfs supported hotplug and re-plug a
little better so that it is more "production" quality, but I just have
to be patient. I'm familiar with Gluster and contributed code to Ceph,
so I'm familiar with those types of distributed systems. I really like
them, but the complexity is quite overkill for my needs at home.

As far as bcache performance:
I have two Crucial MX200 250GB drives that were md raid1 containing
/boot (ext2), swap and then bcache. I have 2 WD Reds and a Seagate
Barracuda Desktop drive all 3TB. With bcache in writeback, apt-get
would be painfully slow. Running iostat, the SSDs would be doing a few
hundred IOPs and the backing disks would be very busy and would be the
limiting factor overall. Even though apt-get just downloaded the file
(should be on the SSDs because of writeback), it still involved the
backend disks way too much. The amount of dirty data was always less
than 10% so there should have been plenty of space to free up cache
without having to flush. I experimented with changing the size of
contiguous IO to force more to cache, increasing the dirty ratio, etc,
nothing seemed to provide the performance I was hoping. To be fair
having a pair of SSDs (md raid1) caching three spindles (btrfs raid5)
may not be an ideal configuration. If I had three SSDs, one for each
drive, then it may have performed better? I also have ~980 snapshots
spread over a year's time, so I don't know how much that impacts
things. I did use a btrfs utility to help find duplicate files/chunks
and dedupe them so that updated system binaries between upgraded LXC
containers would use the same space on disk and be more efficient in
bcache cache usage.

After restoring the root and LXC root snapshots onto the SSD (I broke the
md raid1 so I could restore to one of them), I ran apt-get and got
upwards of 2,400 IOPS, sustained at around 1,200 IOPS (btrfs
single on degraded md raid1). I know that btrfs has some performance
challenges, but I don't think I was hitting those. It was most likely a
very unusual set-up of bcache and btrfs raid that caused the problem.
I have bcache on a 10-year-old desktop box with a single nvme drive that
performs a little better, but it is hard to be certain because of its
age. It has bcache in write-around (since there is only a single nvme)
and btrfs in raid1. I haven't watched that box as closely because it
is responsive enough. It also has only four GB of RAM, so it constantly
has to swap (web pages are hogs these days), which is one of the reasons I
retrofitted that box with nvme rather than an MX200.

If you have any other questions, feel free to ask.

Thanks


Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

Re: netapp-alike snapshots?

2017-08-22 Thread Peter Becker

This is possible. Use the -b or -B option.

-b basedir places the snapshot in basedir with a directory structure
that mimics the mountpoint
-B basedir places the snapshots in basedir with NO additional
subdirectory structure

2017-08-22 16:24 GMT+02:00 Ulli Horlacher :
> On Tue 2017-08-22 (15:44), Peter Becker wrote:
>> I use: https://github.com/jf647/btrfs-snap
>>
>> 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
>> > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
>> > You can find these snapshots in every local directory (readonly).
>> > Example:
>> >
>> > framstag@fex:/sw/share: ll .snapshot/
>> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
>> > .snapshot/daily.2017-08-15_0010
>> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
>> > .snapshot/daily.2017-08-16_0010
>> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
>> > .snapshot/daily.2017-08-17_0010
>> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
>> > .snapshot/daily.2017-08-18_0010
>> > drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
>> > .snapshot/daily.2017-08-19_0010
>> > drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
>> > .snapshot/daily.2017-08-20_0010
>> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
>> > .snapshot/daily.2017-08-21_0010
>> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
>> > .snapshot/hourly.2017-08-20_1210
>> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
>> > .snapshot/hourly.2017-08-20_1610
>> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
>> > .snapshot/hourly.2017-08-20_2010
>> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
>> > .snapshot/hourly.2017-08-21_0810
>> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
>> > .snapshot/hourly.2017-08-21_1210
>> > drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
>> > .snapshot/hourly.2017-08-21_1610
>
> btrfs-snap does not create local .snapshot/ sub-directories, but saves the
> snapshots in the toplevel root volume directory.
>
>
>
> --
> Ullrich Horlacher  Server und Virtualisierung
> Rechenzentrum TIK
> Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
> Allmandring 30aTel:++49-711-68565868
> 70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
>

Re: [PATCH 3/7] btrfs-progs: extent-cache: actually cache extent buffers

2017-08-22 Thread David Sterba

On Tue, Jul 25, 2017 at 04:51:34PM -0400, je...@suse.com wrote:
> From: Jeff Mahoney 
> 
> We have the infrastructure to cache extent buffers but we don't actually
> do the caching.  As soon as the last reference is dropped, the buffer
> is dropped.  This patch keeps the extent buffers around until the max
> cache size is reached (defaults to 25% of memory) and then it drops
> the last 10% of the LRU to free up cache space for reallocation.  The
> cache size is configurable (for use by e.g. lowmem) when the cache is
> initialized.
> 
> Signed-off-by: Jeff Mahoney 
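
For readers who have not looked at the patch, the policy described in the
quoted changelog (keep buffers until a size limit is hit, then drop a batch
from the cold end of the LRU) can be illustrated with a toy cache. This is
not the btrfs-progs code, which is C; the class and its names are invented
here, and "drop the last 10%" is interpreted as "evict oldest entries until
usage is 10% below the limit", which is an assumption.

```python
from collections import OrderedDict


class BatchEvictingLRU:
    """Toy LRU cache that evicts in batches once a byte limit is exceeded."""

    def __init__(self, max_bytes, evict_fraction=0.10):
        self.max_bytes = max_bytes
        self.evict_fraction = evict_fraction
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size), oldest first

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return entry[0]

    def put(self, key, value, size):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, size)
        self.used_bytes += size
        if self.used_bytes > self.max_bytes:
            self._evict_batch()

    def _evict_batch(self):
        # Evict least recently used entries until usage has dropped to
        # max_bytes minus the configured fraction of max_bytes.
        target = self.max_bytes - int(self.max_bytes * self.evict_fraction)
        while self._entries and self.used_bytes > target:
            _key, (_value, size) = self._entries.popitem(last=False)
            self.used_bytes -= size
```

Evicting in batches rather than one entry at a time means the eviction path
runs less often, at the price of throwing away a few more entries per run.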

I've started to merge the series, changed code according to the review.
In this patch, test-mkfs and test-check fail (segfaults and assertions).
A debugging build or asan (make D=all,asan) does not reproduce the
errors so this will need to be found by other means.

I also fixed some trivial coding style issues, so the changes are now in
the branch
https://github.com/kdave/btrfs-progs/tree/ext/jeffm/extent-cache

Please use this as a starting point, I'm fine with resending just this
patch or an incremental.

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (11:03), Austin S. Hemmelgarn wrote:


> Or alternatively, repeatedly call `btrfs filesystem show` on the path, 
> removing one component from the end each time until you get a zero 
> return code.  The path you called it on that got a zero return code is 
> where the mount is (and thus what filesystem that subvolume is part of), 
> and the output just gave you a list of devices it's on.

"btrfs filesystem show" is relatively slow (2.6 s),
"btrfs subvolume show" is MUCH faster (0.02 s).


In perl I have now:

$root = $volume;
while (`btrfs subvolume show "$root" 2>/dev/null` !~ /toplevel subvolume/) {
  $root = dirname($root);
  last if $root eq '/';
}
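
For comparison, roughly the same walk-up loop in Python. It is a sketch, not
a drop-in replacement: the function names are invented, the check relies on
the same "toplevel subvolume" wording in the output of "btrfs subvolume show"
that the Perl version matches (other btrfs-progs versions may print something
different), and unlike the Perl loop it also tests "/" itself and returns
None when nothing matches.

```python
import os
import subprocess


def is_toplevel_subvolume(path):
    """True if `btrfs subvolume show` output mentions the toplevel subvolume."""
    result = subprocess.run(
        ["btrfs", "subvolume", "show", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        universal_newlines=True,
    )
    return "toplevel subvolume" in result.stdout


def find_root(volume, is_root=is_toplevel_subvolume):
    """Strip trailing path components until `is_root` accepts the path."""
    root = volume
    while not is_root(root):
        parent = os.path.dirname(root)
        if parent == root:  # reached "/" without a match
            return None
        root = parent
    return root
```

The predicate is a parameter so the loop can be tried without a Btrfs
filesystem, e.g. find_root("/local/.backup/home", lambda p: p == "/local").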


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<62494c0c-0c27-5b36-3727-b8755eb2c...@gmail.com>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Austin S. Hemmelgarn


On 2017-08-22 10:43, Peter Grandi wrote:

>> How do I find the root filesystem of a subvolume?
>> Example:
>> root@fex:~# df -T
>> Filesystem Type  1K-blocks      Used Available Use% Mounted on
>> -          -    1073740800 104244552 967773976  10% /local/.backup/home
>
> [ ... ]
>
>> I know, the root filesystem is /local,
>
> That question is somewhat misunderstood and uses the wrong
> concepts and terms. In UNIX filesystems a filesystem "root" is a
> directory inode with a number that is local to itself, and can
> be "mounted" anywhere, or left unmounted, and that is a property
> of the running system, not of the filesystem "root". Usually
> UNIX filesystems have a single "root" directory inode.
>
> In the case of Btrfs the main volume and its subvolumes all have
> filesystem "root" directory inodes, which may or may not be
> "mounted", anywhere the administrators of the running system
> please, as a property of the running system. There is no fixed
> relationship between the root directory inode of a subvolume and
> the root directory inode of any other subvolume or the main
> volume.

Actually, there is, because it's inherently rooted in the hierarchy of
the volume itself.  That root inode for the subvolume is anchored
somewhere under the next higher subvolume.  It's the same concept as
nested data-sets in ZFS, BTRFS just inherently exposes them at the
appropriate location in the hierarchy and allows intermediary directories.

> Note: in Btrfs terminology "volume" seems to mean both the main
> volume and the collection of devices where it and subvolumes are
> hosted.

Standard terminology from what I've seen uses 'volume' in the same way
it's used for ext4, XFS, LVM, MD, and similar things, namely a BTRFS
'volume' is the collection of devices as well as the sum total of all
subvolumes on those devices.  This ends up meaning that it implicitly
refers to the top-level subvolume when there are no other subvolumes,
and as a result it's come to sometimes mean the top-level subvolume
(though I rarely see that usage, and would actively discourage it).

>> but how can I show it by command?
>
> The system does not keep an explicit record of which Btrfs
> "root" directory inode is related to which other Btrfs "root"
> directory inode in the same volume, whether mounted or
> unmounted.

Again, it does, it's just not inherently exposed to userspace unless you
mount the top-level subvolume (subvolid=5 and/or subvol=/ in mount
options).  Mount the top level subvolume (once you know what volume the
subvolume is on), and call `btrfs subvolume list` on it.  The `top level
N` part of the output from that tells you what the next subvolume up the
hierarchy is for each subvolume, and the `path` part at the end tells
you the location within that next higher subvolume where this one is
rooted.  The output may not make sense though if you don't have the root
subvolume mounted (because it may be non trivial to trace things up the
tree).

> That relationship has to be discovered by using volume UUIDs,
> which are the same for the main subvolume and the other
> subvolumes, whether mounted or not, so one has to do:
>
>   * For the indicated mounted subvolume "root" read its UUID.
>   * For every mounted filesystem "root", check whether its type
>     is 'btrfs' and if it is obtain its UUID.
>   * If the UUID is the same, and the subvolume id is '5', that's
>     the main subvolume, and terminate.
>   * For every block device which is not mounted, check whether it
>     has a Btrfs superblock.
>   * If the type is 'btrfs' and the volume UUID is the same as
>     that of the subvolume, list the block device.
>
> In the latter case since the main volume is not mounted the only
> way to identify it is to list the block devices that host it.

Or alternatively, repeatedly call `btrfs filesystem show` on the path,
removing one component from the end each time until you get a zero
return code.  The path you called it on that got a zero return code is
where the mount is (and thus what filesystem that subvolume is part of),
and the output just gave you a list of devices it's on.
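
The `top level N` and `path` fields of `btrfs subvolume list` mentioned
earlier in this message are easy to pick apart in a script. A minimal
sketch, assuming the usual "ID ... gen ... top level ... path ..." line
format; the function names and the sample IDs in the usage note are
illustrative only:

```python
import re

_LINE = re.compile(r"^ID (\d+) gen (\d+) top level (\d+) path (.+)$")


def parse_subvolume_list(output):
    """Map subvolume ID -> (parent subvolume ID, path) from the listing."""
    subvolumes = {}
    for line in output.splitlines():
        match = _LINE.match(line.strip())
        if match:
            subvol_id, _gen, parent_id, path = match.groups()
            subvolumes[int(subvol_id)] = (int(parent_id), path)
    return subvolumes


def ancestry(subvol_id, subvolumes, toplevel_id=5):
    """Follow `top level` links upward until the top-level subvolume."""
    chain = [subvol_id]
    while chain[-1] != toplevel_id and chain[-1] in subvolumes:
        chain.append(subvolumes[chain[-1]][0])
    return chain
```

Given a listing in which subvolume 383 has `top level 257` and 257 has
`top level 5`, ancestry(383, ...) would return [383, 257, 5].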


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Peter Grandi

> How do I find the root filesystem of a subvolume?
> Example:
> root@fex:~# df -T
> Filesystem Type  1K-blocks      Used Available Use% Mounted on
> -          -    1073740800 104244552 967773976  10% /local/.backup/home
[ ... ]
> I know, the root filesystem is /local,

That question is somewhat misunderstood and uses the wrong
concepts and terms. In UNIX filesystems a filesystem "root" is a
directory inode with a number that is local to itself, and can
be "mounted" anywhere, or left unmounted, and that is a property
of the running system, not of the filesystem "root". Usually
UNIX filesystems have a single "root" directory inode.

In the case of Btrfs the main volume and its subvolumes all have
filesystem "root" directory inodes, which may or may not be
"mounted", anywhere the administrators of the running system
please, as a property of the running system. There is no fixed
relationship between the root directory inode of a subvolume and
the root directory inode of any other subvolume or the main
volume.

Note: in Btrfs terminology "volume" seems to mean both the main
volume and the collection of devices where it and subvolumes are
hosted.

> but how can I show it by command?

The system does not keep an explicit record of which Btrfs
"root" directory inode is related to which other Btrfs "root"
directory inode in the same volume, whether mounted or
unmounted.

That relationship has to be discovered by using volume UUIDs,
which are the same for the main subvolume and the other
subvolumes, whether mounted or not, so one has to do:

  * For the indicated mounted subvolume "root" read its UUID.
  * For every mounted filesystem "root", check whether its type
is 'btrfs' and if it is obtain its UUID.
  * If the UUID is the same, and the subvolume id is '5', that's
the main subvolume, and terminate.
  * For every block device which is not mounted, check whether it
has a Btrfs superblock.
  * If the type is 'btrfs' and the volume UUID is the same as
    that of the subvolume, list the block device.

In the latter case since the main volume is not mounted the only
way to identify it is to list the block devices that host it.
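
The mounted-filesystem part of the steps above could be sketched as follows.
The record type and function name are invented for illustration, and how the
records are collected (mount table plus filesystem UUIDs) is deliberately
left out; the block-device scan for the unmounted case is not covered.

```python
from collections import namedtuple

# One record per mounted filesystem; the field names are hypothetical.
Mount = namedtuple("Mount", "mountpoint fstype fs_uuid subvolid")


def main_volume_mountpoint(target, mounts):
    """Return where subvolume id 5 of `target`'s filesystem is mounted."""
    by_mountpoint = {m.mountpoint: m for m in mounts}
    wanted = by_mountpoint[target]
    for m in mounts:
        same_fs = m.fstype == "btrfs" and m.fs_uuid == wanted.fs_uuid
        if same_fs and m.subvolid == 5:
            return m.mountpoint
    # The main volume is not mounted: the caller has to fall back to
    # listing the block devices that carry the same UUID.
    return None
```

With the three mounts used as an example elsewhere in the thread (/local as
subvolume 5, and subvolume 383 mounted at both /local/.backup/home and
/mnt/home-backup), asking for "/mnt/home-backup" returns "/local".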

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Austin S. Hemmelgarn


On 2017-08-22 10:23, Hugo Mills wrote:

> On Tue, Aug 22, 2017 at 10:12:25AM -0400, Austin S. Hemmelgarn wrote:
>> On 2017-08-22 09:53, Ulli Horlacher wrote:
>>> On Tue 2017-08-22 (09:37), Austin S. Hemmelgarn wrote:
>>>
>>>>> root@fex:~# df -T /local/.backup/home
>>>>> Filesystem Type  1K-blocks      Used Available Use% Mounted on
>>>>> -          -    1073740800 104252160 967766336  10% /local/.backup/home
>>>>
>>>> Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and
>>>> 16.04.3 VM's I have (I only run current and the most recent LTS
>>>> version), and neither of them behave like this.
>>>
>>> I have this kind of output on all of my Ubuntu hosts:
>>>
>>> root@moep:~# grep PRETTY_NAME /etc/os-release
>>> PRETTY_NAME="Ubuntu 16.04.3 LTS"
>>>
>>> root@moep:~# df -T /usb/UF/tmp/blubb
>>> Filesystem Type 1K-blocks    Used Available Use% Mounted on
>>> -          -     12581888 3690524   7253700  34% /usb/UF/tmp/blubb
>>>
>>> root@moep:~# btrfs subvolume show /usb/UF/tmp/blubb
>>> /usb/UF/tmp/blubb
>>>         Name:                   blubb
>>>         UUID:                   ecf8c804-d4a3-9948-89fe-b0c1971c25cb
>>>         Parent UUID:            -
>>>         Received UUID:          -
>>>         Creation time:          2017-08-22 12:54:16 +0200
>>>         Subvolume ID:           262
>>>         Generation:             23
>>>         Gen at creation:        22
>>>         Parent ID:              5
>>>         Top level ID:           5
>>>         Flags:                  -
>>>         Snapshot(s):
>>>
>>> root@moep:~# dpkg -l | grep btrfs
>>> ii  btrfs-tools 4.4-1ubuntu1
>>>  amd64Checksumming Copy on Write Filesystem utilities
>>
>> Hmm, interesting.  Are you using qgroups by chance?
>
> I get this behaviour (the "- -") only if it's a non-mounted
> subvolume:
>
> hrm@amelia:~ $ df -T .
> Filesystem Type  1K-blocks     Used Available Use% Mounted on
> /dev/sdb1  btrfs 117220284 95271852  18611060  84% /home
>
> hrm@amelia:~ $ sudo btrfs sub crea foo
> Create subvolume './foo'
>
> hrm@amelia:~ $ df -T ./foo
> Filesystem Type 1K-blocks     Used Available Use% Mounted on
> -          -    117220284 95271880  18611032  84% /home/hrm/foo
>
> hrm@amelia:~ $ sudo mkdir foo/bar
> hrm@amelia:~ $ df -T foo/bar
> Filesystem Type 1K-blocks     Used Available Use% Mounted on
> -          -    117220284 95271852  18611060  84% /home/hrm/foo
>
> hrm@amelia:~ $ mkdir foo2
>
> hrm@amelia:~ $ sudo mount /dev/sdb1 ./foo2 -o subvol=home/hrm/foo
>
> hrm@amelia:~ $ df -T foo2
> Filesystem Type  1K-blocks     Used Available Use% Mounted on
> /dev/sdb1  btrfs 117220284 95272384  18610528  84% /home/hrm/foo2

Wait, I think I see what's up here.  I was just calling `df -T` without 
pointing at the subvolume (which correctly ignores it because it's not 
actually mounted).  It looks like this is a side effect of the (rather 
irritating) fake mount-point behavior of subvolumes.


Re: [PATCH 2/2] btrfs-progs: Use named constants for common sizes

2017-08-22 Thread David Sterba

On Thu, Jul 27, 2017 at 11:17:00AM +0300, Nikolay Borisov wrote:
> There are multiple places where we use well-known sizes - 1,8,16,32 megabytes. We
> also have them defined as constants in the sizes.h header. So let's use them.
> No functional changes.

Both applied, thanks.

Re: [PATCH 2/2] btrfs-progs: Use named constants for common sizes

2017-08-22 Thread David Sterba

On Thu, Jul 27, 2017 at 09:02:12PM +, Duncan wrote:
> Nikolay Borisov posted on Thu, 27 Jul 2017 11:17:00 +0300 as excerpted:
> 
> > diff --git a/convert/main.c b/convert/main.c
> > index c56382e915fd..49ab829b5641 100644
> > --- a/convert/main.c
> > +++ b/convert/main.c
> 
> > @@ -1586,7 +1586,7 @@ next:
> >   * |   RSV 1   |  | Old  |   |RSV 2  | | Old  | |   RSV 3   |
> >   * |   0~1M|  | Fs   |   | SB2 + 64K | | Fs   | | SB3 + 64K |
> >   *
> > - * On the other hande, the converted fs image in btrfs is a completely 
> > + * On the other hande, the converted fs image in btrfs is a completely
> >   * valid old fs.
> >   *
> >   * |<-Converted fs image in btrfs>|
> 
> If you're going to kill the line-terminating space, you might as well
> do the spell-correct in the same line:
> 
> s/hande/hand/

Fixed.

Re: netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (15:44), Peter Becker wrote:
> I use: https://github.com/jf647/btrfs-snap
> 
> 2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> > With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> > You can find these snapshots in every local directory (readonly).
> > Example:
> >
> > framstag@fex:/sw/share: ll .snapshot/
> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > .snapshot/daily.2017-08-15_0010
> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > .snapshot/daily.2017-08-16_0010
> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > .snapshot/daily.2017-08-17_0010
> > drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> > .snapshot/daily.2017-08-18_0010
> > drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
> > .snapshot/daily.2017-08-19_0010
> > drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
> > .snapshot/daily.2017-08-20_0010
> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > .snapshot/daily.2017-08-21_0010
> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > .snapshot/hourly.2017-08-20_1210
> > drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> > .snapshot/hourly.2017-08-20_1610
> > drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> > .snapshot/hourly.2017-08-20_2010
> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > .snapshot/hourly.2017-08-21_0810
> > drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> > .snapshot/hourly.2017-08-21_1210
> > drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
> > .snapshot/hourly.2017-08-21_1610

btrfs-snap does not create local .snapshot/ sub-directories, but saves the
snapshots in the toplevel root volume directory.



-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Hugo Mills

On Tue, Aug 22, 2017 at 10:12:25AM -0400, Austin S. Hemmelgarn wrote:
> On 2017-08-22 09:53, Ulli Horlacher wrote:
> >On Tue 2017-08-22 (09:37), Austin S. Hemmelgarn wrote:
> >
> >>>root@fex:~# df -T /local/.backup/home
> >>>Filesystem Type  1K-blocks  Used Available Use% Mounted on
> >>>-  -1073740800 104252160 967766336  10% /local/.backup/home
> >>
> >>Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and
> >>16.04.3 VM's I have (I only run current and the most recent LTS
> >>version), and neither of them behave like this.
> >
> >I have this kind of output on all of my Ubuntu hosts:
> >
> >root@moep:~# grep PRETTY_NAME /etc/os-release
> >PRETTY_NAME="Ubuntu 16.04.3 LTS"
> >
> >root@moep:~# df -T /usb/UF/tmp/blubb
> >Filesystem Type 1K-blocksUsed Available Use% Mounted on
> >-  - 12581888 3690524   7253700  34% /usb/UF/tmp/blubb
> >
> >root@moep:~# btrfs subvolume show /usb/UF/tmp/blubb
> >/usb/UF/tmp/blubb
> > Name:   blubb
> > UUID:   ecf8c804-d4a3-9948-89fe-b0c1971c25cb
> > Parent UUID:-
> > Received UUID:  -
> > Creation time:  2017-08-22 12:54:16 +0200
> > Subvolume ID:   262
> > Generation: 23
> > Gen at creation:22
> > Parent ID:  5
> > Top level ID:   5
> > Flags:  -
> > Snapshot(s):
> >
> >root@moep:~# dpkg -l | grep btrfs
> >ii  btrfs-tools 4.4-1ubuntu1 
> >amd64Checksumming Copy on Write Filesystem utilities
> >
> Hmm, interesting.  Are you using qgroups by chance?

   I get this behaviour (the "- -") only if it's a non-mounted
subvolume:

hrm@amelia:~ $ df -T .
Filesystem Type  1K-blocks Used Available Use% Mounted on
/dev/sdb1  btrfs 117220284 95271852  18611060  84% /home

hrm@amelia:~ $ sudo btrfs sub crea foo
Create subvolume './foo'

hrm@amelia:~ $ df -T ./foo
Filesystem Type 1K-blocks Used Available Use% Mounted on
-  -117220284 95271880  18611032  84% /home/hrm/foo

hrm@amelia:~ $ sudo mkdir foo/bar
hrm@amelia:~ $ df -T foo/bar
Filesystem Type 1K-blocks Used Available Use% Mounted on
-  -117220284 95271852  18611060  84% /home/hrm/foo

hrm@amelia:~ $ mkdir foo2

hrm@amelia:~ $ sudo mount /dev/sdb1 ./foo2 -o subvol=home/hrm/foo

hrm@amelia:~ $ df -T foo2
Filesystem Type  1K-blocks Used Available Use% Mounted on
/dev/sdb1  btrfs 117220284 95272384  18610528  84% /home/hrm/foo2

   Hugo.

-- 
Hugo Mills | "Your problem is that you have a negative
hugo@... carfax.org.uk | personality."
http://carfax.org.uk/  | "No, I don't!"
PGP: E2AB1DE4  |  Londo and Vir, Babylon 5



Re: [PATCH] Btrfs-progs: Check root before printing item

2017-08-22 Thread David Sterba

On Mon, Aug 21, 2017 at 03:57:13PM +0800, zhangyu-f...@cn.fujitsu.com wrote:
> From: Zhang Yu 
> 
> [TEST/fuzz] case: 004-simple-dump-tree
> 
> Since the wrong key(DATA_RELOC_TREE CHUNK_ITEM 0) in root tree,
> error calling print_chunk(), resulting in num_stripes == 0.
> 
> ERROR:
>  [TEST/fuzz]   004-simple-dump-tree
> ctree.h:317: btrfs_chunk_item_size: BUG_ON `num_stripes == 0`
> triggered, value 1
> 
> failed (ignored, ret=134): /myproject/btrfs-progs/btrfs
> inspect-internal dump-tree
> /myproject/btrfs-progs/tests/fuzz-tests/images/
> bko-155201-wrong-chunk-item-in-root-tree.raw.restored
> 
> test failed for case 004-simple-dump-tree
> Makefile:288: recipe for target 'test-fuzz' failed
> make: *** [test-fuzz] Error 1
> 
> So, before printing item, determine the root is valid or not.

I don't think this is the right way to fix it. The print-tree function
should print everything that's found, possibly doing sanity checks and
then only skip the bad data. For debugging or other purposes, we want to
get exact state of the trees.

The original problem you found is wrong number of stripes, so it should
be dealt with in print_chunk.
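
One way to read the suggestion above, as a minimal sketch and not the
actual btrfs-progs code: the printing routine checks the stripe count
itself, reports the bad value, and skips only the stripe details. The
names demo_chunk and demo_print_chunk are hypothetical, and the struct
is a reduced stand-in, not the on-disk chunk layout.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, reduced view of a chunk item; not the on-disk layout. */
struct demo_chunk {
    uint64_t length;
    uint16_t num_stripes;
};

/*
 * Print what is there.  If num_stripes is 0, report it and skip only the
 * stripe details instead of aborting.  Returns 0 when the item looked
 * sane, -1 when something had to be skipped.
 */
int demo_print_chunk(FILE *out, const struct demo_chunk *chunk)
{
    fprintf(out, "\t\tchunk length %llu num_stripes %u\n",
            (unsigned long long)chunk->length,
            (unsigned)chunk->num_stripes);

    if (chunk->num_stripes == 0) {
        fprintf(out, "\t\tinvalid num_stripes: 0, skipping stripes\n");
        return -1;
    }

    for (unsigned i = 0; i < chunk->num_stripes; i++)
        fprintf(out, "\t\t\tstripe %u\n", i);
    return 0;
}
```

With a check like this the dump still shows the damaged item, which is
the point made above about wanting the exact state of the trees.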

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Austin S. Hemmelgarn


On 2017-08-22 09:53, Ulli Horlacher wrote:
> On Tue 2017-08-22 (09:37), Austin S. Hemmelgarn wrote:
>
>>> root@fex:~# df -T /local/.backup/home
>>> Filesystem Type  1K-blocks  Used Available Use% Mounted on
>>> -  -1073740800 104252160 967766336  10% /local/.backup/home
>>
>> Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and
>> 16.04.3 VM's I have (I only run current and the most recent LTS
>> version), and neither of them behave like this.
>
> I have this kind of output on all of my Ubuntu hosts:
>
> root@moep:~# grep PRETTY_NAME /etc/os-release
> PRETTY_NAME="Ubuntu 16.04.3 LTS"
>
> root@moep:~# df -T /usb/UF/tmp/blubb
> Filesystem Type 1K-blocksUsed Available Use% Mounted on
> -  - 12581888 3690524   7253700  34% /usb/UF/tmp/blubb
>
> root@moep:~# btrfs subvolume show /usb/UF/tmp/blubb
> /usb/UF/tmp/blubb
>  Name:   blubb
>  UUID:   ecf8c804-d4a3-9948-89fe-b0c1971c25cb
>  Parent UUID:-
>  Received UUID:  -
>  Creation time:  2017-08-22 12:54:16 +0200
>  Subvolume ID:   262
>  Generation: 23
>  Gen at creation:22
>  Parent ID:  5
>  Top level ID:   5
>  Flags:  -
>  Snapshot(s):
>
> root@moep:~# dpkg -l | grep btrfs
> ii  btrfs-tools 4.4-1ubuntu1
>  amd64Checksumming Copy on Write Filesystem utilities


Hmm, interesting.  Are you using qgroups by chance?


Re: [PATCH] btrfs-progs: Make in-place exit to a common exit block

2017-08-22 Thread David Sterba

On Tue, Aug 22, 2017 at 01:35:06PM +0800, Gu Jinxiang wrote:
> As comment pointed out by David, make in-place exit
> to a common exit block of mkfs.
> 
> v1:
> Add some close(fd) when error occurs in mkfs.
> And add close(fd) when end use it.
> 
> Signed-off-by: Gu Jinxiang 

Applied, thanks.

Re: [PATCH] btrfs-progs: mkfs: Replace number with enum

2017-08-22 Thread David Sterba

On Mon, Aug 21, 2017 at 07:39:49PM +0200, David Sterba wrote:
> > +/* roots: root tree, extent tree, chunk tree, dev tree, fs tree, csum tree */
> > +enum btrfs_mkfs_block {
> > +   SUPER_BLOCK = 0,
> > +   ROOT_TREE,
> > +   EXTENT_TREE,
> > +   CHUNK_TREE,
> > +   DEV_TREE,
> > +   FS_TREE,
> > +   CSUM_TREE,
> > +   BLOCK_COUNT

BLOCK_COUNT is 7

> > +};
> > +
> >  struct btrfs_mkfs_config {
> > /* Label of the new filesystem */
> > const char *label;
> > @@ -43,7 +55,7 @@ struct btrfs_mkfs_config {
> > /* Output fields, set during creation */
> >  
> > /* Logical addresses of superblock [0] and other tree roots */
> > -   u64 blocks[8];
> > +   u64 blocks[BLOCK_COUNT];

This replaces 8 with 7, so fs_uuid gets overwritten. It can also be
caught by simply running 'make test-mkfs'.
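
The off-by-one can be illustrated with a reduced stand-in for the config
struct. This is only a sketch: the enum is taken from the quoted patch,
while demo_config_before, demo_config_after, demo_slot_offset and
DEMO_UUID_UNPARSED_SIZE (assumed here to be 37) are made up for
illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Enum as quoted in the patch: seven block kinds, so BLOCK_COUNT is 7. */
enum btrfs_mkfs_block {
    SUPER_BLOCK = 0,
    ROOT_TREE,
    EXTENT_TREE,
    CHUNK_TREE,
    DEV_TREE,
    FS_TREE,
    CSUM_TREE,
    BLOCK_COUNT
};

/* Assumed value for this sketch; the real constant lives in btrfs-progs. */
#define DEMO_UUID_UNPARSED_SIZE 37

/* Reduced stand-ins for the config struct before and after the patch. */
struct demo_config_before {
    uint64_t blocks[8];
    char fs_uuid[DEMO_UUID_UNPARSED_SIZE];
};

struct demo_config_after {
    uint64_t blocks[BLOCK_COUNT];
    char fs_uuid[DEMO_UUID_UNPARSED_SIZE];
};

/* Byte offset, from the start of blocks[], of the slot with this index. */
size_t demo_slot_offset(size_t index)
{
    return index * sizeof(uint64_t);
}
```

In this layout offsetof(struct demo_config_after, fs_uuid) equals
demo_slot_offset(7), so any code that still writes an eighth slot
(index 7) would land on the first bytes of fs_uuid.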

> > char fs_uuid[BTRFS_UUID_UNPARSED_SIZE];
> > char chunk_uuid[BTRFS_UUID_UNPARSED_SIZE];
> >  
> > -- 
> > 2.9.4
> > 
> > 
> > 

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (09:37), Austin S. Hemmelgarn wrote:

> > root@fex:~# df -T /local/.backup/home
> > Filesystem Type  1K-blocks  Used Available Use% Mounted on
> > -  -1073740800 104252160 967766336  10% /local/.backup/home
> 
> Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and 
> 16.04.3 VM's I have (I only run current and the most recent LTS 
> version), and neither of them behave like this.

I have this kind of output on all of my Ubuntu hosts:

root@moep:~# grep PRETTY_NAME /etc/os-release
PRETTY_NAME="Ubuntu 16.04.3 LTS"

root@moep:~# df -T /usb/UF/tmp/blubb
Filesystem Type 1K-blocksUsed Available Use% Mounted on
-  - 12581888 3690524   7253700  34% /usb/UF/tmp/blubb

root@moep:~# btrfs subvolume show /usb/UF/tmp/blubb
/usb/UF/tmp/blubb
Name:   blubb
UUID:   ecf8c804-d4a3-9948-89fe-b0c1971c25cb
Parent UUID:-
Received UUID:  -
Creation time:  2017-08-22 12:54:16 +0200
Subvolume ID:   262
Generation: 23
Gen at creation:22
Parent ID:  5
Top level ID:   5
Flags:  -
Snapshot(s):

root@moep:~# dpkg -l | grep btrfs
ii  btrfs-tools 4.4-1ubuntu1
 amd64Checksumming Copy on Write Filesystem utilities


-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<0c35fba9-a514-31dd-a703-17f4727ed...@gmail.com>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili

> Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and
> 16.04.3 VM's I have (I only run current and the most recent LTS
> version), and neither of them behave like this.


Was also shocked, but:


$ lsb_release -a
No LSB modules are available.
Distributor ID:Ubuntu
Description:Ubuntu 16.04.3 LTS
Release:16.04
Codename:xenial

$ df -T | grep /mnt/data/lxc

$ df -T /mnt/data/lxc
Filesystem Type  1K-blocks Used  Available Use% Mounted on
-  -2907008836 90829848 2815107576   4% /mnt/data/lxc


--

With Best Regards,
Marat Khalili


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili


> I have no subvol=/ option at all:

Probably depends on kernel, but I presume missing subvol means the same
as subvol=/.
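
A minimal sketch of that check, under the presumption stated above that a
missing subvol= option means the same as subvol=/. The function name
demo_is_toplevel_mount is hypothetical; its input is the comma-separated
options field of one /proc/mounts line.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return 1 if the mount options describe the top-level subvolume:
 * either "subvol=/" is present, or there is no subvol= option at all
 * (assumed here to mean the same thing).  Return 0 otherwise.
 */
int demo_is_toplevel_mount(const char *opts)
{
    const char *p = opts;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* "subvolid=..." does not match, because of the '=' at index 6. */
        if (len >= 7 && strncmp(p, "subvol=", 7) == 0)
            return len == 8 && p[7] == '/';

        p += len;
        if (*p == ',')
            p++;
    }
    return 1; /* no subvol= option found */
}
```

For the options shown earlier in the thread
("rw,relatime,compress=zlib,space_cache") this returns 1.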



> I am only interested in mounted volumes.

If your initial path (/local/.backup/home) is a subvolume but it's not
itself present in /proc/mounts then it's probably mounted as part of some
higher-level subvolume, but this higher-level subvolume does not have to
be root. Do you need volume root or just some higher-level subvolume
that's mounted?


--

With Best Regards,
Marat Khalili


Re: netapp-alike snapshots?

2017-08-22 Thread Peter Becker

I use: https://github.com/jf647/btrfs-snap

2017-08-22 15:22 GMT+02:00 Ulli Horlacher :
> With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
> You can find these snapshots in every local directory (readonly).
> Example:
>
> framstag@fex:/sw/share: ll .snapshot/
> drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> .snapshot/daily.2017-08-15_0010
> drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> .snapshot/daily.2017-08-16_0010
> drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> .snapshot/daily.2017-08-17_0010
> drwxr-xr-x  framstag root - 2017-08-14 10:21:47 
> .snapshot/daily.2017-08-18_0010
> drwxr-xr-x  framstag root - 2017-08-18 23:59:29 
> .snapshot/daily.2017-08-19_0010
> drwxr-xr-x  framstag root - 2017-08-19 21:01:25 
> .snapshot/daily.2017-08-20_0010
> drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> .snapshot/daily.2017-08-21_0010
> drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> .snapshot/hourly.2017-08-20_1210
> drwxr-xr-x  framstag root - 2017-08-20 02:50:18 
> .snapshot/hourly.2017-08-20_1610
> drwxr-xr-x  framstag root - 2017-08-20 19:48:40 
> .snapshot/hourly.2017-08-20_2010
> drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> .snapshot/hourly.2017-08-21_0810
> drwxr-xr-x  framstag root - 2017-08-21 00:42:28 
> .snapshot/hourly.2017-08-21_1210
> drwxr-xr-x  framstag root - 2017-08-21 13:05:28 
> .snapshot/hourly.2017-08-21_1610
>
> I would like to have something similar with btrfs.
> Programming such a feature is not a general problem for me, but I think I
> am not the first one who wants this kind of auto-snapshooting.
> Is there (where?) such a tool?
>
> I know snapper, but it has a totally different approach.
>
> --
> Ullrich Horlacher  Server und Virtualisierung
> Rechenzentrum TIK
> Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
> Allmandring 30aTel:++49-711-68565868
> 70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
> REF:<20170822132208.gd14...@rus.uni-stuttgart.de>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Austin S. Hemmelgarn


On 2017-08-22 09:30, Ulli Horlacher wrote:
> On Tue 2017-08-22 (09:27), Austin S. Hemmelgarn wrote:
>
>>>>> root@fex:~# df -T
>>>>> Filesystem Type  1K-blocks  Used Available Use% Mounted on
>>>>> -  -1073740800 104244552 967773976  10% /local/.backup/home
>>>>
>>>> I've never seen the "- -" output from df before. Is this a bind
>>>> mount or something?
>>>
>>> No, /local/.backup/home is just a btrfs subvolume
>>
>> It arguably shouldn't be showing up here then if it's not been
>> explicitly mounted.  I'm betting you're running OpenSUSE or SLES
>
> No:
>
> root@fex:~# cat /etc/os-release
> NAME="Ubuntu"
> VERSION="14.04.5 LTS, Trusty Tahr"
> ID=ubuntu
> ID_LIKE=debian
> PRETTY_NAME="Ubuntu 14.04.5 LTS"
> VERSION_ID="14.04"
> HOME_URL="http://www.ubuntu.com/"
> SUPPORT_URL="http://help.ubuntu.com/"
> BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
>
> root@fex:~# df -T /local/.backup/home
> Filesystem Type  1K-blocks  Used Available Use% Mounted on
> -  -1073740800 104252160 967766336  10% /local/.backup/home
>
> root@fex:~# type df
> df is hashed (/bin/df)
>
> root@fex:~# dpkg -S /bin/df
> coreutils: /bin/df

Hmm, now I'm really confused, I just checked on the Ubuntu 17.04 and 
16.04.3 VM's I have (I only run current and the most recent LTS 
version), and neither of them behave like this.


Re: finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (09:27), Austin S. Hemmelgarn wrote:

> >>> root@fex:~# df -T
> >>> Filesystem Type  1K-blocks  Used Available Use% Mounted on
> >>> -  -1073740800 104244552 967773976  10% 
> >>> /local/.backup/home
> >>
> >> I've never seen the "- -" output from df before. Is this a bind
> >> mount or something?
> > 
> > No, /local/.backup/home is just a btrfs subvolume
> 
> It arguably shouldn't be showing up here then if it's not been 
> explicitly mounted.  I'm betting you're running OpenSUSE or SLES

No:

root@fex:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="14.04.5 LTS, Trusty Tahr"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 14.04.5 LTS"
VERSION_ID="14.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"

root@fex:~# df -T /local/.backup/home
Filesystem Type  1K-blocks  Used Available Use% Mounted on
-  -1073740800 104252160 967766336  10% /local/.backup/home

root@fex:~# type df
df is hashed (/bin/df)

root@fex:~# dpkg -S /bin/df
coreutils: /bin/df

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<16778020-9167-b7cc-4768-ee33dca2b...@gmail.com>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Austin S. Hemmelgarn


On 2017-08-22 08:50, Ulli Horlacher wrote:
> On Tue 2017-08-22 (12:40), Hugo Mills wrote:
>> On Tue, Aug 22, 2017 at 02:23:50PM +0200, Ulli Horlacher wrote:
>>
>>> How do I find the root filesystem of a subvolume?
>>> Example:
>>>
>>> root@fex:~# df -T
>>> Filesystem Type  1K-blocks  Used Available Use% Mounted on
>>> -  -1073740800 104244552 967773976  10% /local/.backup/home
>>
>> I've never seen the "- -" output from df before. Is this a bind
>> mount or something?
>
> No, /local/.backup/home is just a btrfs subvolume

It arguably shouldn't be showing up here then if it's not been
explicitly mounted.  I'm betting you're running OpenSUSE or SLES and
they finally got their df integration done, as that df output absolutely
matches the type of brain-dead handling of BTRFS I'm coming to expect
out of them.



Note to SUSE people reading this:  You should be including actual 
information for at least the Type field, and ideally the Filesystem 
field too.  People expect this to behave reasonably, and not listing any 
info about where the 'mount' originated or what it is is not reasonable.


netapp-alike snapshots?

2017-08-22 Thread Ulli Horlacher

With Netapp/waffle you have automatic hourly/daily/weekly snapshots.
You can find these snapshots in every local directory (readonly).
Example:

framstag@fex:/sw/share: ll .snapshot/
drwxr-xr-x  framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-15_0010
drwxr-xr-x  framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-16_0010
drwxr-xr-x  framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-17_0010
drwxr-xr-x  framstag root - 2017-08-14 10:21:47 .snapshot/daily.2017-08-18_0010
drwxr-xr-x  framstag root - 2017-08-18 23:59:29 .snapshot/daily.2017-08-19_0010
drwxr-xr-x  framstag root - 2017-08-19 21:01:25 .snapshot/daily.2017-08-20_0010
drwxr-xr-x  framstag root - 2017-08-20 19:48:40 .snapshot/daily.2017-08-21_0010
drwxr-xr-x  framstag root - 2017-08-20 02:50:18 .snapshot/hourly.2017-08-20_1210
drwxr-xr-x  framstag root - 2017-08-20 02:50:18 .snapshot/hourly.2017-08-20_1610
drwxr-xr-x  framstag root - 2017-08-20 19:48:40 .snapshot/hourly.2017-08-20_2010
drwxr-xr-x  framstag root - 2017-08-21 00:42:28 .snapshot/hourly.2017-08-21_0810
drwxr-xr-x  framstag root - 2017-08-21 00:42:28 .snapshot/hourly.2017-08-21_1210
drwxr-xr-x  framstag root - 2017-08-21 13:05:28 .snapshot/hourly.2017-08-21_1610

I would like to have something similar with btrfs.
Programming such a feature is not a general problem for me, but I think I
am not the first one who wants this kind of auto-snapshooting.
Is there (where?) such a tool?

I know snapper, but it has a totally different approach.

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170822132208.gd14...@rus.uni-stuttgart.de>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (15:58), Marat Khalili wrote:
> On 22/08/17 15:50, Ulli Horlacher wrote:
> 
> > It seems I have to scan the subvolume path upwards until I find a real
> > mount point.
> 
> I think searching /proc/mounts for the same device and subvol=/ in 
> options is most straightforward.

I have no subvol=/ option at all:

root@fex:~# grep btrfs /proc/mounts
/dev/sdf1 /backup btrfs rw,relatime,compress=zlib,space_cache 0 0
/dev/sde1 /local btrfs rw,relatime,compress=zlib,space_cache 0 0


> But what makes you think it's mounted at all?

I am only interested in mounted volumes.

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Marat Khalili


On 22/08/17 15:50, Ulli Horlacher wrote:
> It seems I have to scan the subvolume path upwards until I find a real
> mount point.

I think searching /proc/mounts for the same device and subvol=/ in
options is most straightforward. But what makes you think it's mounted
at all?


--

With Best Regards,
Marat Khalili

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

On Tue 2017-08-22 (12:40), Hugo Mills wrote:
> On Tue, Aug 22, 2017 at 02:23:50PM +0200, Ulli Horlacher wrote:
> 
> > How do I find the root filesystem of a subvolume?
> > Example:
> > 
> > root@fex:~# df -T 
> > Filesystem Type  1K-blocks  Used Available Use% Mounted on
> > -  -1073740800 104244552 967773976  10% /local/.backup/home
> 
>I've never seen the "- -" output from df before. Is this a bind
> mount or something?

No, /local/.backup/home is just a btrfs subvolume


> > I know the root filesystem is /local, but how can I show it with a command?
> 
>Probably in /proc/self/mountinfo 

root@fex:~# grep home /proc/self/mountinfo
root@fex:~# grep btrfs /proc/self/mountinfo
31 22 0:23 / /backup rw,relatime - btrfs /dev/sdf1 rw,compress=zlib,space_cache
32 22 0:26 / /local rw,relatime - btrfs /dev/sde1 rw,compress=zlib,space_cache

No information about the subvolume /local/.backup/home

It seems I have to scan the subvolume path upwards until I find a real
mount point.
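
A minimal sketch of such an upward scan, assuming the mount points have
already been read from /proc/self/mountinfo into an array: the helper
picks the longest mount point that is a path prefix of the subvolume
path. The function name find_mount_point and the array input are
hypothetical, and parsing of mountinfo is left out.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: return the entry of mount_points[] that is the
 * longest path prefix of `path`, or NULL if none matches.  The array
 * stands in for the mount-point column of /proc/self/mountinfo.
 */
const char *find_mount_point(const char *path,
                             const char *const *mount_points, size_t count)
{
    const char *best = NULL;
    size_t best_len = 0;

    for (size_t i = 0; i < count; i++) {
        const char *mp = mount_points[i];
        size_t len = strlen(mp);

        if (len == 0 || strncmp(path, mp, len) != 0)
            continue;
        /*
         * Only accept matches that end on a path component boundary,
         * so that "/local" does not match "/localfoo".
         */
        if (mp[len - 1] != '/' && path[len] != '\0' && path[len] != '/')
            continue;
        if (len > best_len) {
            best = mp;
            best_len = len;
        }
    }
    return best;
}
```

With the mount points "/", "/backup" and "/local", the path
/local/.backup/home resolves to /local. The comparison is purely
textual, so a path containing symlinks would need to be canonicalised
first, for example with realpath(3).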

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170822124036.ga32...@carfax.org.uk>

Re: finding root filesystem of a subvolume?

2017-08-22 Thread Hugo Mills

On Tue, Aug 22, 2017 at 02:23:50PM +0200, Ulli Horlacher wrote:
> How do I find the root filesystem of a subvolume?
> Example:
> 
> root@fex:~# df -T 
> Filesystem Type  1K-blocks  Used Available Use% Mounted on
> -  -1073740800 104244552 967773976  10% /local/.backup/home

   I've never seen the "- -" output from df before. Is this a bind
mount or something?

> root@fex:~# btrfs subvolume show /local/.backup/home
> /local/.backup/home
> Name:   home
> uuid:   f86a2db0-6a82-124f-9a71-1cd4c20fd6fb
> Parent uuid:ba4d388f-44bf-7b46-b2b8-00e2a9a87181
> Creation time:  2017-08-10 22:19:15
> Object ID:  383
> Generation (Gen):   148
> Gen at creation:148
> Parent: 5
> Top Level:  5
> Flags:  readonly
> Snapshot(s):
> 
> 
> I know the root filesystem is /local, but how can I show it with a command?

   Probably in /proc/self/mountinfo -- that should give you the full
set of applied mount options, plus the original source for the mount
(which will be a block device for most filesystem mounts, a path for
bind mounts, or something FS-specific for network filesystems).

   Hugo.

-- 
Hugo Mills | And what rough beast, its hour come round at last /
hugo@... carfax.org.uk | slouches towards Bethlehem, to be born?
http://carfax.org.uk/  |
PGP: E2AB1DE4  | W.B. Yeats, The Second Coming



finding root filesystem of a subvolume?

2017-08-22 Thread Ulli Horlacher

How do I find the root filesystem of a subvolume?
Example:

root@fex:~# df -T 
Filesystem Type  1K-blocks  Used Available Use% Mounted on
-  -1073740800 104244552 967773976  10% /local/.backup/home

root@fex:~# btrfs subvolume show /local/.backup/home
/local/.backup/home
Name:   home
uuid:   f86a2db0-6a82-124f-9a71-1cd4c20fd6fb
Parent uuid:ba4d388f-44bf-7b46-b2b8-00e2a9a87181
Creation time:  2017-08-10 22:19:15
Object ID:  383
Generation (Gen):   148
Gen at creation:148
Parent: 5
Top Level:  5
Flags:  readonly
Snapshot(s):


I know the root filesystem is /local, but how can I show it with a command?

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20170822122350.ga14...@rus.uni-stuttgart.de>

Re: [PATCH 3/3] btrfs: Add sanity check for EXTENT_DATA when reading out leaf

2017-08-22 Thread Nikolay Borisov



On 22.08.2017 14:23, Qu Wenruo wrote:
> 
> 
> On 2017年08月22日 19:00, Nikolay Borisov wrote:
>>
>>
>> On 22.08.2017 13:57, Nikolay Borisov wrote:
>>>
>>>
>>> On 22.08.2017 10:37, Qu Wenruo wrote:
>>>> Add extra checker for item with EXTENT_DATA type.
>>>> This checks the following thing:
>>>> 1) Item size
>>>>    Plain text inline file extent size must match item size.
>>>>    (compressed inline file extent has no info about its on-disk size)
>>>>    Regular/preallocated file extent size must be a fixed value.
>>>>
>>>> 2) Every member of regular file extent item
>>>>    Including alignment for bytenr and offset, possible value for
>>>>    compression/encryption/type.
>>>>
>>>> 3) Type/compression/encode must be one of the valid values.
>>>>
>>>> This should be the most comprehensive and restrict check in the context
>>>> of btrfs_item for EXTENT_DATA.
>>>>
>>>> Signed-off-by: Qu Wenruo
>>>> ---
>>>>  fs/btrfs/disk-io.c              | 88 +
>>>>  include/uapi/linux/btrfs_tree.h |  1 +
>>>>  2 files changed, 89 insertions(+)
>>>>
>>>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>>>> index 59ee7b959bf0..557f9a520e2a 100644
>>>> --- a/fs/btrfs/disk-io.c
>>>> +++ b/fs/btrfs/disk-io.c
>>>> @@ -549,6 +549,83 @@ static int check_tree_block_fsid(struct btrfs_fs_info *fs_info,
>>>>      btrfs_header_level(eb) == 0 ? "leaf" : "node",\
>>>>      reason, btrfs_header_bytenr(eb), root->objectid, slot)
>>>> +static int check_extent_data_item(struct btrfs_root *root,
>>>> +                                  struct extent_buffer *leaf, int slot)
>>>> +{
>>>> +    struct btrfs_file_extent_item *fi;
>>>> +    u32 sectorsize = root->fs_info->sectorsize;
>>>> +    u32 item_size = btrfs_item_size_nr(leaf, slot);
>>>> +
>>>> +    fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
>>>> +
>>>> +    if (btrfs_file_extent_type(leaf, fi) >= BTRFS_FILE_EXTENT_LAST_TYPE) {
>>>> +        CORRUPT("invalid file extent type", leaf, root, slot);
>>>> +        return -EIO;
>>>> +    }
>>>> +    if (btrfs_file_extent_compression(leaf, fi) >= BTRFS_COMPRESS_LAST) {
>>>> +        CORRUPT("invalid file extent compression", leaf, root, slot);
>>>> +        return -EIO;
>>>> +    }
>>>> +    if (btrfs_file_extent_encryption(leaf, fi)) {
>>>> +        CORRUPT("invalid file extent encryption", leaf, root, slot);
>>>> +        return -EIO;
>>>> +    }
>>>> +    if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE) {
>>>> +        if (btrfs_file_extent_compression(leaf, fi) !=
>>>> +            BTRFS_COMPRESS_NONE)
>>>> +            return 0;
>>>> +        /* Plaintext inline extent size must match item size */
>>>> +        if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +
>>>> +            btrfs_file_extent_ram_bytes(leaf, fi)) {
>>>> +            CORRUPT("plaintext inline extent has invalid size",
>>>> +                leaf, root, slot);
>>>> +            return -EIO;
>>>> +        }
>>>> +        return 0;
>>>> +    }
>>
>> One more thing - don't we really want to use -EUCLEAN rather than -EIO?
> 
> Nice suggestion.
> Since it's not really something wrong with IO routine, EUCLEAN is better.

Yeah, I'm not saying it's wrong. But my mental model for -EIO vs
-EUCLEAN is the following:

- When we write data and something goes wrong, we should return -EIO
(we basically cover this, since we have always used -EIO).

- When we read data and it fails the validity checks we perform on it
(as is the case with your patch), we should return -EUCLEAN.

Basically the FS needs to ensure that it always feeds valid data to
disk, so the only error on that path could be -EIO. But if this same
data is read some time later and our internal checks show that it is
inconsistent, we should say so rather than just return -EIO.

I've mentioned this before and as a result David created the following
wiki entry:

https://btrfs.wiki.kernel.org/index.php/Project_ideas#Distinguish_EIO_and_EUCLEAN_types_of_errors

I guess we should start from somewhere :)
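
To make the distinction concrete, here is a minimal user-space sketch of the
convention described above. Everything prefixed demo_ or DEMO_ is
hypothetical and only stands in for the real kernel structures and helpers;
it is not the code from the patch.

```c
#include <errno.h>
#include <stdint.h>

#ifndef EUCLEAN
#define EUCLEAN 117 /* Linux value, defined here only as a fallback */
#endif

#define DEMO_EXTENT_LAST_TYPE 3     /* hypothetical: types 0..2 are valid */
#define DEMO_SECTORSIZE       4096u /* hypothetical sector size */

/* Hypothetical, heavily simplified stand-in for a file extent item. */
struct demo_extent {
	uint8_t type;
	uint64_t disk_bytenr;
};

/*
 * Read path: the device handed us bytes without complaint, but the
 * content fails validation, so report corruption with -EUCLEAN.
 */
static int demo_validate_on_read(const struct demo_extent *e)
{
	if (e->type >= DEMO_EXTENT_LAST_TYPE)
		return -EUCLEAN;
	if (e->disk_bytenr % DEMO_SECTORSIZE)
		return -EUCLEAN;
	return 0;
}

/*
 * Write path: the data is assumed valid, so the only failure is the
 * device itself. device_ok models the result of the block layer.
 */
static int demo_write(int device_ok)
{
	return device_ok ? 0 : -EIO;
}
```

With this split, a caller seeing -EUCLEAN knows the medium answered but the
structure is inconsistent, while -EIO means the transfer itself failed.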
> 
>>
>>
 +
 +
 +/* regular or preallocated extent has fixed item size */
 +if (item_size != sizeof(*fi)) {
 +CORRUPT(
 +"regular or preallocated extent data item size is invalid",
 +leaf, root, slot);
 +return -EIO;
 +}
 +if (!IS_ALIGNED(btrfs_file_extent_ram_bytes(leaf, fi),
 sectorsize) ||
 +!IS_ALIGNED(btrfs_file_extent_disk_bytenr(leaf, fi),
 sectorsize) ||
 +!IS_ALIGNED(btrfs_file_extent_disk_num_bytes(leaf, fi),
 +sectorsize) ||
 +!IS_ALIGNED(btrfs_file_extent_offset(leaf, fi), sectorsize) ||
 +!IS_ALIGNED(btrfs_file_extent_num_bytes(leaf, fi),
 sectorsize)) {
 +CORRUPT(
 +"regular or preallocated extent data item has unaligned
 value",
 +leaf, root, slot);
 +return -EIO;
 +

Re: [PATCH 3/3] btrfs: Add sanity check for EXTENT_DATA when reading out leaf

2017-08-22 Thread Qu Wenruo




On 2017-08-22 19:00, Nikolay Borisov wrote:



On 22.08.2017 13:57, Nikolay Borisov wrote:



On 22.08.2017 10:37, Qu Wenruo wrote:

Add extra checker for item with EXTENT_DATA type.
This checks the following thing:
1) Item size
Plain text inline file extent size must match item size.
(compressed inline file extent has no info about its on-disk size)
Regular/preallocated file extent size must be a fixed value.

2) Every member of regular file extent item
Including alignment for bytenr and offset, possible value for
compression/encryption/type.

3) Type/compression/encode must be one of the valid values.

This should be the most comprehensive and restrict check in the context
of btrfs_item for EXTENT_DATA.

Signed-off-by: Qu Wenruo 
---
  fs/btrfs/disk-io.c  | 88 +
  include/uapi/linux/btrfs_tree.h |  1 +
  2 files changed, 89 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59ee7b959bf0..557f9a520e2a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -549,6 +549,83 @@ static int check_tree_block_fsid(struct btrfs_fs_info 
*fs_info,
   btrfs_header_level(eb) == 0 ? "leaf" : "node",   \
   reason, btrfs_header_bytenr(eb), root->objectid, slot)
  
+static int check_extent_data_item(struct btrfs_root *root,

+ struct extent_buffer *leaf, int slot)
+{
+   struct btrfs_file_extent_item *fi;
+   u32 sectorsize = root->fs_info->sectorsize;
+   u32 item_size = btrfs_item_size_nr(leaf, slot);
+
+   fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+
+   if (btrfs_file_extent_type(leaf, fi) >= BTRFS_FILE_EXTENT_LAST_TYPE) {
+   CORRUPT("invalid file extent type", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_compression(leaf, fi) >= BTRFS_COMPRESS_LAST) {
+   CORRUPT("invalid file extent compression", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_encryption(leaf, fi)) {
+   CORRUPT("invalid file extent encryption", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE) {
+   if (btrfs_file_extent_compression(leaf, fi) !=
+   BTRFS_COMPRESS_NONE)
+   return 0;
+   /* Plaintext inline extent size must match item size */
+   if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +
+   btrfs_file_extent_ram_bytes(leaf, fi)) {
+   CORRUPT("plaintext inline extent has invalid size",
+   leaf, root, slot);
+   return -EIO;
+   }
+   return 0;
+   }


> One more thing - don't we really want to use -EUCLEAN rather than -EIO?

Nice suggestion.
Since nothing is really wrong with the IO routine, EUCLEAN is better.





+
+
+   /* regular or preallocated extent has fixed item size */
+   if (item_size != sizeof(*fi)) {
+   CORRUPT(
+   "regular or preallocated extent data item size is invalid",
+   leaf, root, slot);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(btrfs_file_extent_ram_bytes(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_disk_bytenr(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_disk_num_bytes(leaf, fi),
+   sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_offset(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_num_bytes(leaf, fi), sectorsize)) {
+   CORRUPT(
+   "regular or preallocated extent data item has unaligned value",
+   leaf, root, slot);
+   return -EIO;
+   }
+
+   return 0;
+}
+
+static int check_leaf_item(struct btrfs_root *root,
+  struct extent_buffer *leaf, int slot)
+{
+   struct btrfs_key key;
+   int ret = 0;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);


> nit: We already have the key in the proper format in the caller of this
> function. Why not just pass in the type as an argument and save a
> redundant call for every item in a leaf? Perhaps it's a
> microoptimisation but for very densely populated trees the miniature
> cost might build up.

Sounds valid. Considering how many times item_key_to_cpu() gets called
in a large leaf, the micro-optimization counts.

I'll update this in the next revision.

Thanks for your review,
Qu




+   /*
+* Considering how overcrowded the code will be inside the switch,
+* complex verification is better to moved its own function.
+*/
+   switch (key.type) {
+   case BTRFS_EXTENT_DATA_KEY:
+   ret = check_extent_data_item(root, leaf, slot);
+   break;
+

Re: [PATCH 3/3] btrfs: Add sanity check for EXTENT_DATA when reading out leaf

2017-08-22 Thread Nikolay Borisov



On 22.08.2017 13:57, Nikolay Borisov wrote:
> 
> 
> On 22.08.2017 10:37, Qu Wenruo wrote:
>> Add extra checker for item with EXTENT_DATA type.
>> This checks the following thing:
>> 1) Item size
>>Plain text inline file extent size must match item size.
>>(compressed inline file extent has no info about its on-disk size)
>>Regular/preallocated file extent size must be a fixed value.
>>
>> 2) Every member of regular file extent item
>>Including alignment for bytenr and offset, possible value for
>>compression/encryption/type.
>>
>> 3) Type/compression/encode must be one of the valid values.
>>
>> This should be the most comprehensive and restrict check in the context
>> of btrfs_item for EXTENT_DATA.
>>
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/disk-io.c  | 88 
>> +
>>  include/uapi/linux/btrfs_tree.h |  1 +
>>  2 files changed, 89 insertions(+)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 59ee7b959bf0..557f9a520e2a 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -549,6 +549,83 @@ static int check_tree_block_fsid(struct btrfs_fs_info 
>> *fs_info,
>> btrfs_header_level(eb) == 0 ? "leaf" : "node",   \
>> reason, btrfs_header_bytenr(eb), root->objectid, slot)
>>  
>> +static int check_extent_data_item(struct btrfs_root *root,
>> +  struct extent_buffer *leaf, int slot)
>> +{
>> +struct btrfs_file_extent_item *fi;
>> +u32 sectorsize = root->fs_info->sectorsize;
>> +u32 item_size = btrfs_item_size_nr(leaf, slot);
>> +
>> +fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
>> +
>> +if (btrfs_file_extent_type(leaf, fi) >= BTRFS_FILE_EXTENT_LAST_TYPE) {
>> +CORRUPT("invalid file extent type", leaf, root, slot);
>> +return -EIO;
>> +}
>> +if (btrfs_file_extent_compression(leaf, fi) >= BTRFS_COMPRESS_LAST) {
>> +CORRUPT("invalid file extent compression", leaf, root, slot);
>> +return -EIO;
>> +}
>> +if (btrfs_file_extent_encryption(leaf, fi)) {
>> +CORRUPT("invalid file extent encryption", leaf, root, slot);
>> +return -EIO;
>> +}
>> +if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE) {
>> +if (btrfs_file_extent_compression(leaf, fi) !=
>> +BTRFS_COMPRESS_NONE)
>> +return 0;
>> +/* Plaintext inline extent size must match item size */
>> +if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +
>> +btrfs_file_extent_ram_bytes(leaf, fi)) {
>> +CORRUPT("plaintext inline extent has invalid size",
>> +leaf, root, slot);
>> +return -EIO;
>> +}
>> +return 0;
>> +}

One more thing - don't we really want to use -EUCLEAN rather than -EIO?


>> +
>> +
>> +/* regular or preallocated extent has fixed item size */
>> +if (item_size != sizeof(*fi)) {
>> +CORRUPT(
>> +"regular or preallocated extent data item size is invalid",
>> +leaf, root, slot);
>> +return -EIO;
>> +}
>> +if (!IS_ALIGNED(btrfs_file_extent_ram_bytes(leaf, fi), sectorsize) ||
>> +!IS_ALIGNED(btrfs_file_extent_disk_bytenr(leaf, fi), sectorsize) ||
>> +!IS_ALIGNED(btrfs_file_extent_disk_num_bytes(leaf, fi),
>> +sectorsize) ||
>> +!IS_ALIGNED(btrfs_file_extent_offset(leaf, fi), sectorsize) ||
>> +!IS_ALIGNED(btrfs_file_extent_num_bytes(leaf, fi), sectorsize)) {
>> +CORRUPT(
>> +"regular or preallocated extent data item has unaligned value",
>> +leaf, root, slot);
>> +return -EIO;
>> +}
>> +
>> +return 0;
>> +}
>> +
>> +static int check_leaf_item(struct btrfs_root *root,
>> +   struct extent_buffer *leaf, int slot)
>> +{
>> +struct btrfs_key key;
>> +int ret = 0;
>> +
>> +btrfs_item_key_to_cpu(leaf, &key, slot);
> 
> nit: We already have the key in the proper format in the caller of this
> function. Why not just pass in the type as an argument and save a
> redundant call for every item in a leaf? Perhaps it's a
> microoptimisation but for very densely populated trees the miniature
> cost might build up.
> 
>> +/*
>> + * Considering how overcrowded the code will be inside the switch,
>> + * complex verification is better to moved its own function.
>> + */
>> +switch (key.type) {
>> +case BTRFS_EXTENT_DATA_KEY:
>> +ret = check_extent_data_item(root, leaf, slot);
>> +break;
>> +}
>> +return ret;
>> +}
>> +
>>  static noinline int check_leaf(struct btrfs_root *root,
>> struct extent_buffer *leaf)
>>  {
>> @@ -605,9 +682,13 @@ static noinline int

Re: [PATCH 0/3] Introduce comprehensive sanity check framework and

2017-08-22 Thread Nikolay Borisov



On 22.08.2017 10:37, Qu Wenruo wrote:
> The patchset introduce a new framework to do more comprehensive (if not
> the most) sanity check when reading out a leaf.
> 
> The new sanity checker will include:
> 
> 1) Key order
>Existing code
> 
> 2) Item boundary
>Existing code with enhanced checker to ensure item pointer doesn't
>overlap with item itself.
> 
> 3) Key type based sanity checker
>Only EXTENT_DATA checker is implemented yet.
>As each checker should go through review and tests, or it can easily
>make a valid btrfs failed to be mounted.
>So only one checker is implemented as an example.
> 
>Existing checker like INODE_REF checker can be moved to this
>framework easily, and we can centralize all existing checkers, make
>the rest of codes more clean.
> 
> Performance wise, it's just iterating a leaf.
> And it will only get triggered when read out a leaf, cached leaf will
> not go through such checker.
> So it won't be a performance breaker.
> 
> I tested with the patchset applied on v4.13-rc6 with fstests, no
> regression is detected.
> 
> Qu Wenruo (3):
>   btrfs: Refactor check_leaf function for later expansion.
>   btrfs: Check if item pointer overlap with item itself
>   btrfs: Add sanity check for EXTENT_DATA when reading out leaf

I have one minor comment on 3/3 which I've sent separately but otherwise
this series looks good and I like the direction it's steering future
code into.

For the whole series:

Reviewed-by: Nikolay Borisov 

> 
>  fs/btrfs/disk-io.c  | 137 
> ++--
>  include/uapi/linux/btrfs_tree.h |   1 +
>  2 files changed, 119 insertions(+), 19 deletions(-)
> 

Re: [PATCH 3/3] btrfs: Add sanity check for EXTENT_DATA when reading out leaf

2017-08-22 Thread Nikolay Borisov



On 22.08.2017 10:37, Qu Wenruo wrote:
> Add extra checker for item with EXTENT_DATA type.
> This checks the following thing:
> 1) Item size
>Plain text inline file extent size must match item size.
>(compressed inline file extent has no info about its on-disk size)
>Regular/preallocated file extent size must be a fixed value.
> 
> 2) Every member of regular file extent item
>Including alignment for bytenr and offset, possible value for
>compression/encryption/type.
> 
> 3) Type/compression/encode must be one of the valid values.
> 
> This should be the most comprehensive and restrict check in the context
> of btrfs_item for EXTENT_DATA.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/disk-io.c  | 88 
> +
>  include/uapi/linux/btrfs_tree.h |  1 +
>  2 files changed, 89 insertions(+)
> 
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 59ee7b959bf0..557f9a520e2a 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -549,6 +549,83 @@ static int check_tree_block_fsid(struct btrfs_fs_info 
> *fs_info,
>  btrfs_header_level(eb) == 0 ? "leaf" : "node",   \
>  reason, btrfs_header_bytenr(eb), root->objectid, slot)
>  
> +static int check_extent_data_item(struct btrfs_root *root,
> +   struct extent_buffer *leaf, int slot)
> +{
> + struct btrfs_file_extent_item *fi;
> + u32 sectorsize = root->fs_info->sectorsize;
> + u32 item_size = btrfs_item_size_nr(leaf, slot);
> +
> + fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
> +
> + if (btrfs_file_extent_type(leaf, fi) >= BTRFS_FILE_EXTENT_LAST_TYPE) {
> + CORRUPT("invalid file extent type", leaf, root, slot);
> + return -EIO;
> + }
> + if (btrfs_file_extent_compression(leaf, fi) >= BTRFS_COMPRESS_LAST) {
> + CORRUPT("invalid file extent compression", leaf, root, slot);
> + return -EIO;
> + }
> + if (btrfs_file_extent_encryption(leaf, fi)) {
> + CORRUPT("invalid file extent encryption", leaf, root, slot);
> + return -EIO;
> + }
> + if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE) {
> + if (btrfs_file_extent_compression(leaf, fi) !=
> + BTRFS_COMPRESS_NONE)
> + return 0;
> + /* Plaintext inline extent size must match item size */
> + if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +
> + btrfs_file_extent_ram_bytes(leaf, fi)) {
> + CORRUPT("plaintext inline extent has invalid size",
> + leaf, root, slot);
> + return -EIO;
> + }
> + return 0;
> + }
> +
> +
> + /* regular or preallocated extent has fixed item size */
> + if (item_size != sizeof(*fi)) {
> + CORRUPT(
> + "regular or preallocated extent data item size is invalid",
> + leaf, root, slot);
> + return -EIO;
> + }
> + if (!IS_ALIGNED(btrfs_file_extent_ram_bytes(leaf, fi), sectorsize) ||
> + !IS_ALIGNED(btrfs_file_extent_disk_bytenr(leaf, fi), sectorsize) ||
> + !IS_ALIGNED(btrfs_file_extent_disk_num_bytes(leaf, fi),
> + sectorsize) ||
> + !IS_ALIGNED(btrfs_file_extent_offset(leaf, fi), sectorsize) ||
> + !IS_ALIGNED(btrfs_file_extent_num_bytes(leaf, fi), sectorsize)) {
> + CORRUPT(
> + "regular or preallocated extent data item has unaligned value",
> + leaf, root, slot);
> + return -EIO;
> + }
> +
> + return 0;
> +}
> +
> +static int check_leaf_item(struct btrfs_root *root,
> +struct extent_buffer *leaf, int slot)
> +{
> + struct btrfs_key key;
> + int ret = 0;
> +
> + btrfs_item_key_to_cpu(leaf, &key, slot);

nit: We already have the key in the proper format in the caller of this
function. Why not just pass in the type as an argument and save a
redundant call for every item in a leaf? Perhaps it's a
microoptimisation but for very densely populated trees the miniature
cost might build up.
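
A rough sketch of what the suggestion above could look like, written as
stand-alone user-space C. The demo_ types and functions are hypothetical
stand-ins (only the key type value 108 matches BTRFS_EXTENT_DATA_KEY); the
point is just that the caller, which has already decoded the key, hands it
down instead of the callee decoding it again.

```c
#include <stdint.h>

/* Simplified stand-in for struct btrfs_key (hypothetical, user space only). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

#define DEMO_EXTENT_DATA_KEY 108 /* same value as BTRFS_EXTENT_DATA_KEY */

/* Counts how often the per-type checker runs; for illustration only. */
static int demo_extent_checks_run;

/* Placeholder for the real EXTENT_DATA checker. */
static int demo_check_extent_data_item(int slot)
{
	(void)slot;
	demo_extent_checks_run++;
	return 0;
}

/*
 * The key is passed in by the caller, so no second key decode is
 * needed per item. Key types without a checker are accepted as-is.
 */
static int demo_check_leaf_item(const struct demo_key *key, int slot)
{
	int ret = 0;

	switch (key->type) {
	case DEMO_EXTENT_DATA_KEY:
		ret = demo_check_extent_data_item(slot);
		break;
	}
	return ret;
}
```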

> + /*
> +  * Considering how overcrowded the code will be inside the switch,
> +  * complex verification is better to moved its own function.
> +  */
> + switch (key.type) {
> + case BTRFS_EXTENT_DATA_KEY:
> + ret = check_extent_data_item(root, leaf, slot);
> + break;
> + }
> + return ret;
> +}
> +
>  static noinline int check_leaf(struct btrfs_root *root,
>  struct extent_buffer *leaf)
>  {
> @@ -605,9 +682,13 @@ static noinline int check_leaf(struct btrfs_root *root,
>* 1) key order
>* 2) item offset and size
>*No overlap, no hole, all inside the leaf.
> +  * 3) item content
> +

Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2017-08-22 Thread Dmitrii Tcvetkov

On Tue, 22 Aug 2017 11:31:23 +0200
g6094...@freenet.de wrote:
> So 1st should be investigating why did the disk not get removed
> correctly? Btrfs dev del should remove the device correctly, right? Is
> there a bug?

It should and probably did. To check that we need to see the output of
btrfs filesystem show <mountpoint>
and the output of
btrfs filesystem usage <mountpoint>

If there are non-raid1 chunks then you need to do a soft balance:
btrfs balance start -mconvert=raid1,soft -dconvert=raid1,soft <mountpoint>

The balance should finish very quickly, as you probably have only one
single-profile chunk each for data and metadata. They appeared during
writes when the filesystem was mounted read-write in degraded mode.

RE: [PATCH 00/15] Btrfs-progs offline scrub

2017-08-22 Thread Gu, Jinxiang

Ping

-Original Message-
From: linux-btrfs-ow...@vger.kernel.org 
[mailto:linux-btrfs-ow...@vger.kernel.org] On Behalf Of Gu Jinxiang
Sent: Tuesday, July 18, 2017 2:34 PM
To: linux-btrfs@vger.kernel.org
Cc: quwenruo.bt...@gmx.com
Subject: [PATCH 00/15] Btrfs-progs offline scrub

For anyone who wants to try it, it can be fetched from my repo:
https://github.com/gujx2017/btrfs-progs/tree/offline_scrub

In this v5, I just made some small fixups to the comments in the remaining 15
patches, according to the problems David pointed out when merging the first 5
patches of this patchset.
It is rebased onto 93a9004dde410d920f08f85c6365e138713992d8.

Several reports of kernel scrub screwing up good data stripes have been on
the ML for some time.

And since kernel scrub doesn't account for P/Q corruption, it is quite hard
for us to detect errors like the kernel screwing up P/Q when scrubbing.

To get a tool comparable to kernel scrub, we need a user-space tool to act as
a benchmark against which to compare behavior.

So here is the patchset for user-space scrub.

Which can do:
1) All mirror/backup check for non-parity based stripe
   Which means for RAID1/DUP/RAID10, we can really check all mirrors
   other than the 1st good mirror.

   The current "--check-data-csum" option should eventually be replaced
   by offline scrub.
   "--check-data-csum" doesn't really check all mirrors: if it hits
   a good copy, the remaining copies are just ignored.

   In the v4 update, the data check is further improved. Inspired by
   kernel behavior, a data extent is now checked sector by sector, so it
   can handle the following corruption case:

   Data extent A contains data from 0~28K.
   And |///| = corrupted  |   | = good
             0   4k  8k  12k 16k 20k 24k 28k
   Mirror 0  |///|   |///|   |///|   |   |
   Mirror 1  |   |///|   |///|   |///|   |

   Extent A should be RECOVERABLE, while v3 treated data extent A as
   a whole unit and reported the above case as CORRUPTED.

2) RAID5/6 full stripe check
   It makes full use of btrfs csums (both tree and data).
   It will only recover the full stripe if all recovered data matches
   its csum.

   NOTE: Due to the lack of good bitmap facilities, RAID56 sector by
   sector repair will be quite complex, especially when NODATASUM is
   involved.

   So the current RAID56 code doesn't support vertical sector recovery yet.

   Data extent A contains data from 0~64K
   And |///| = corrupted while |   | = good
                  0   8K  16K 24K 32K 40K 48K 56K 64K
   Data stripe 0  |///|   |///|   |///|   |///|   |
   Data stripe 1  |   |///|   |///|   |///|   |///|
   Parity         |   |   |   |   |   |   |   |   |

   The kernel will recover it, while the current scrub will report it as
   CORRUPTED.

3) Repair
   In v4 update, repair is finally added.
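
The sector-by-sector decision described in item 1) can be sketched as
follows. This is only an illustration under assumptions: the function name
and the per-mirror "bad sector" bitmap representation are hypothetical and
are not taken from the patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * bad_bitmaps[m] has bit s set when sector s of mirror m failed its
 * csum check. An extent is recoverable when every sector has at least
 * one mirror holding a good copy. Supports up to 32 sectors per extent.
 */
static bool demo_extent_recoverable(const uint32_t *bad_bitmaps,
				    int nr_mirrors, int nr_sectors)
{
	for (int s = 0; s < nr_sectors; s++) {
		bool any_good = false;

		for (int m = 0; m < nr_mirrors; m++) {
			if (!(bad_bitmaps[m] & (1u << s))) {
				any_good = true;
				break;
			}
		}
		if (!any_good)
			return false;
	}
	return true;
}
```

For the diagram in item 1), mirror 0 is bad at sectors 0, 2 and 4 (bitmap
0x15) and mirror 1 at sectors 1, 3 and 5 (bitmap 0x2A) across 7 sectors of
4K. No sector is bad in both mirrors, so the extent counts as recoverable,
whereas a whole-extent check would reject both mirrors.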

This patchset also introduces a new btrfs_map_block() function, which is more
flexible than the current btrfs_map_block() and has a unified interface for
all profiles, not just an extra array for RAID56.

Check the 6th and 7th patches for details.

They are already used in RAID5/6 scrub, but can also be used for other
profiles.

The to-do list has been shortened, since repair was added in the v4 update.
1) Test cases
   Need to make the infrastructure able to handle multi-device first.

2) Make btrfsck able to handle RAID5 with missing device
   Now it doesn't even open RAID5 btrfs with missing device, even though
   scrub should be able to handle it.

3) RAID56 vertical sector repair
   Although I consider such a case minor compared to RAID1 vertical
   sector repair.
   For RAID1, an extent can be as large as 128M, while for RAID56 one
   stripe will always be 64K, much smaller than the RAID1 case, making it
   less likely.

   I prefer to add this function after the patchset gets merged, as no
   one really likes getting 20 mails every time I update the patchset.

For those who want to review the patchset, there is a slide deck on the basic
function relationships.
I hope this will reduce the time needed to understand what the patchset is doing.
https://docs.google.com/presentation/d/1tAU3lUVaRUXooSjhFaDUeyW3wauHDSg9H-AiLBOSuIM/edit?usp=sharing

Changelog:
V0.8 RFC:
   Initial RFC patchset

v1:
   First formal patchset.
   RAID6 recovery support added, mainly copied from the kernel raid6 lib.
   Cleaner recovery logic.

v2:
   More comments in both code and commit message, suggested by David.
   File re-arrangement: no check/ dir, raid56.ch moved to kernel-lib,
   suggested by David.

v3:
  Put the "--offline" option in scrub, rather than in fsck.
  Use bitmap to read multiple csums in one run, to improve performance.
  Add --progress/--no-progress option, to tell the user we're not just
  wasting CPU and IO.

v4:
  Improve data check: data extents are now checked sector by sector.
  Repair is now supported.

Gu Jinxiang (1):
  btrfs-progs: Introduce new btrfs_map_block function which returns more
unified result.

Qu Wenruo (14):
  btrfs-progs: Allow __btrfs_map_block_v2 to remove

Re: degraded BTRFS RAID 1 not mountable: open_ctree failed, unable to find block group for 0

2017-08-22 Thread g6094199

Hi guys,


picking up this old topic because I'm running into a similar problem.


I'm running an Ubuntu 16.04 (HWE K4.8) server with 2 NVMe SSDs as RAID1 for /.
One NVMe died and I had to replace it, which is where the trouble began. I
replaced the NVMe, booted degraded, added the new disk to the RAID
(btrfs dev add) and removed the missing/dead device (btrfs dev del).
Everything worked well. BUT when I rebooted I ran into the "BTRFS RAID 1
not mountable: open_ctree failed, unable to find block group for 0"
error because of a MISSING disk?! I checked the btrfs list and found that
there was a patch that enabled stricter behavior in handling missing
devices (at the moment I can't find the related patch anymore), which was
merged some kernels before k4.8 but was NOT in k4.4. So I managed to install
the k4.4 Ubuntu kernel and the system started booting and working again. So
my problem is that I can't update this production machine to anything
after k4.4. :-(

So, 1st: we should investigate why the disk did not get removed
correctly. Btrfs dev del should remove the device correctly, right? Is
there a bug?

2nd: was the restriction on handling missing devices too strict? Is there
a bug?

3rd: I saw https://patchwork.kernel.org/patch/9419189/ from Roman. Did he
receive any comments on his patch? It could help with this problem,
too.


Regards

Sash


[PATCH 1/3] btrfs: Refactor check_leaf function for later expansion.

2017-08-22 Thread Qu Wenruo

The current check_leaf() function does a good job of checking key order and
item offset/size.

However, it only iterates from slot 0 to the second-to-last slot. This
works but makes later expansion hard.

So this refactoring iterates from slot 0 to the last slot.
For key comparison, it uses an all-zero key as the initial key, so every
valid key should be larger than it.

And for the item size/offset check, it compares the current item's end with
the previous item's offset.
For slot 0, the leaf end is used as a special case.

This makes later item/key offset checks and item size checks easier to
implement.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 42 +++---
 1 file changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 080e2ebb8aa0..919ddd4b774c 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -553,8 +553,9 @@ static noinline int check_leaf(struct btrfs_root *root,
   struct extent_buffer *leaf)
 {
struct btrfs_fs_info *fs_info = root->fs_info;
+   /* No valid key type is 0, so all key should be larger than this key */
+   struct btrfs_key prev_key = {0, 0, 0};
struct btrfs_key key;
-   struct btrfs_key leaf_key;
u32 nritems = btrfs_header_nritems(leaf);
int slot;
 
@@ -597,26 +598,21 @@ static noinline int check_leaf(struct btrfs_root *root,
if (nritems == 0)
return 0;
 
-   /* Check the 0 item */
-   if (btrfs_item_offset_nr(leaf, 0) + btrfs_item_size_nr(leaf, 0) !=
-   BTRFS_LEAF_DATA_SIZE(fs_info)) {
-   CORRUPT("invalid item offset size pair", leaf, root, 0);
-   return -EIO;
-   }
-
/*
-* Check to make sure each items keys are in the correct order and their
-* offsets make sense.  We only have to loop through nritems-1 because
-* we check the current slot against the next slot, which verifies the
-* next slot's offset+size makes sense and that the current's slot
-* offset is correct.
+* Check the following things to make sure this is a good leaf, and
+* leaf users won't need to bother similar sanity check:
+*
+* 1) key order
+* 2) item offset and size
+*No overlap, no hole, all inside the leaf.
 */
-   for (slot = 0; slot < nritems - 1; slot++) {
-   btrfs_item_key_to_cpu(leaf, &leaf_key, slot);
-   btrfs_item_key_to_cpu(leaf, &key, slot + 1);
+   for (slot = 0; slot < nritems; slot++) {
+   u32 item_end_expected;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
 
/* Make sure the keys are in the right order */
-   if (btrfs_comp_cpu_keys(&leaf_key, &key) >= 0) {
+   if (btrfs_comp_cpu_keys(&prev_key, &key) >= 0) {
CORRUPT("bad key order", leaf, root, slot);
return -EIO;
}
@@ -626,8 +622,12 @@ static noinline int check_leaf(struct btrfs_root *root,
 * item data starts at the end of the leaf and grows towards the
 * front.
 */
-   if (btrfs_item_offset_nr(leaf, slot) !=
-   btrfs_item_end_nr(leaf, slot + 1)) {
+   if (slot == 0)
+   item_end_expected = BTRFS_LEAF_DATA_SIZE(fs_info);
+   else
+   item_end_expected = btrfs_item_offset_nr(leaf,
+slot - 1);
+   if (btrfs_item_end_nr(leaf, slot) != item_end_expected) {
CORRUPT("slot offset bad", leaf, root, slot);
return -EIO;
}
@@ -642,6 +642,10 @@ static noinline int check_leaf(struct btrfs_root *root,
CORRUPT("slot end outside of leaf", leaf, root, slot);
return -EIO;
}
+
+   prev_key.objectid = key.objectid;
+   prev_key.type = key.type;
+   prev_key.offset = key.offset;
}
 
return 0;
-- 
2.14.1
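
For readers who want to see the refactored loop in isolation, here is a
small user-space sketch of the same idea: an all-zero previous key, and an
expected item end that is the leaf data size for slot 0 and the previous
item's offset otherwise. The demo_ structures and function names are
hypothetical simplifications, not the kernel's types, and the sketch returns
-1 where the patch returns -EIO.

```c
#include <stdint.h>

/* Simplified stand-ins for btrfs_key and a leaf item (hypothetical). */
struct demo_key {
	uint64_t objectid;
	uint8_t type;
	uint64_t offset;
};

struct demo_item {
	struct demo_key key;
	uint32_t offset; /* start of item data inside the leaf */
	uint32_t size;   /* length of item data */
};

/* Orders keys by objectid, then type, then offset. */
static int demo_comp_keys(const struct demo_key *a, const struct demo_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/* Returns 0 for a consistent leaf, -1 on the first violation found. */
static int demo_check_leaf(const struct demo_item *items, uint32_t nritems,
			   uint32_t leaf_data_size)
{
	/* No valid key is all zero, so every real key compares larger. */
	struct demo_key prev_key = { 0, 0, 0 };

	for (uint32_t slot = 0; slot < nritems; slot++) {
		uint32_t item_end_expected;

		if (demo_comp_keys(&prev_key, &items[slot].key) >= 0)
			return -1; /* bad key order */

		/* Item data grows from the end of the leaf to the front. */
		if (slot == 0)
			item_end_expected = leaf_data_size;
		else
			item_end_expected = items[slot - 1].offset;
		if (items[slot].offset + items[slot].size != item_end_expected)
			return -1; /* hole or overlap */

		prev_key = items[slot].key;
	}
	return 0;
}
```

Because every slot is visited, per-item checks can later be added inside the
loop body without special-casing the last slot.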


[PATCH 3/3] btrfs: Add sanity check for EXTENT_DATA when reading out leaf

2017-08-22 Thread Qu Wenruo

Add an extra checker for items with EXTENT_DATA type.
This checks the following things:
1) Item size
   Plain text inline file extent size must match item size.
   (compressed inline file extent has no info about its on-disk size)
   Regular/preallocated file extent size must be a fixed value.

2) Every member of regular file extent item
   Including alignment for bytenr and offset, possible value for
   compression/encryption/type.

3) Type/compression/encode must be one of the valid values.

This should be the most comprehensive and strict check possible in the
context of btrfs_item for EXTENT_DATA.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c  | 88 +
 include/uapi/linux/btrfs_tree.h |  1 +
 2 files changed, 89 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 59ee7b959bf0..557f9a520e2a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -549,6 +549,83 @@ static int check_tree_block_fsid(struct btrfs_fs_info 
*fs_info,
   btrfs_header_level(eb) == 0 ? "leaf" : "node",   \
   reason, btrfs_header_bytenr(eb), root->objectid, slot)
 
+static int check_extent_data_item(struct btrfs_root *root,
+ struct extent_buffer *leaf, int slot)
+{
+   struct btrfs_file_extent_item *fi;
+   u32 sectorsize = root->fs_info->sectorsize;
+   u32 item_size = btrfs_item_size_nr(leaf, slot);
+
+   fi = btrfs_item_ptr(leaf, slot, struct btrfs_file_extent_item);
+
+   if (btrfs_file_extent_type(leaf, fi) >= BTRFS_FILE_EXTENT_LAST_TYPE) {
+   CORRUPT("invalid file extent type", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_compression(leaf, fi) >= BTRFS_COMPRESS_LAST) {
+   CORRUPT("invalid file extent compression", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_encryption(leaf, fi)) {
+   CORRUPT("invalid file extent encryption", leaf, root, slot);
+   return -EIO;
+   }
+   if (btrfs_file_extent_type(leaf, fi) == BTRFS_FILE_EXTENT_INLINE) {
+   if (btrfs_file_extent_compression(leaf, fi) !=
+   BTRFS_COMPRESS_NONE)
+   return 0;
+   /* Plaintext inline extent size must match item size */
+   if (item_size != BTRFS_FILE_EXTENT_INLINE_DATA_START +
+   btrfs_file_extent_ram_bytes(leaf, fi)) {
+   CORRUPT("plaintext inline extent has invalid size",
+   leaf, root, slot);
+   return -EIO;
+   }
+   return 0;
+   }
+
+
+   /* regular or preallocated extent has fixed item size */
+   if (item_size != sizeof(*fi)) {
+   CORRUPT(
+   "regular or preallocated extent data item size is invalid",
+   leaf, root, slot);
+   return -EIO;
+   }
+   if (!IS_ALIGNED(btrfs_file_extent_ram_bytes(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_disk_bytenr(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_disk_num_bytes(leaf, fi),
+   sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_offset(leaf, fi), sectorsize) ||
+   !IS_ALIGNED(btrfs_file_extent_num_bytes(leaf, fi), sectorsize)) {
+   CORRUPT(
+   "regular or preallocated extent data item has unaligned value",
+   leaf, root, slot);
+   return -EIO;
+   }
+
+   return 0;
+}
+
+static int check_leaf_item(struct btrfs_root *root,
+  struct extent_buffer *leaf, int slot)
+{
+   struct btrfs_key key;
+   int ret = 0;
+
+   btrfs_item_key_to_cpu(leaf, &key, slot);
+   /*
+* Considering how overcrowded the code will be inside the switch,
+* complex verification is better moved to its own function.
+*/
+   switch (key.type) {
+   case BTRFS_EXTENT_DATA_KEY:
+   ret = check_extent_data_item(root, leaf, slot);
+   break;
+   }
+   return ret;
+}
+
 static noinline int check_leaf(struct btrfs_root *root,
   struct extent_buffer *leaf)
 {
@@ -605,9 +682,13 @@ static noinline int check_leaf(struct btrfs_root *root,
 * 1) key order
 * 2) item offset and size
 *No overlap, no hole, all inside the leaf.
+* 3) item content
+*If possible, do a comprehensive sanity check.
+*NOTE: All checks must rely only on the item data itself.
 */
for (slot = 0; slot < nritems; slot++) {
u32 item_end_expected;
+   int ret;
 
btrfs_item_key_to_cpu(leaf, &key, slot);
 
@@ -650,6 +731,13 @@ static noinline int check_leaf(struct btrfs_root *root,
return -EIO;

[PATCH 0/3] Introduce comprehensive sanity check framework and

2017-08-22 Thread Qu Wenruo

This patchset introduces a new framework to do a more comprehensive (if
not the most comprehensive) sanity check when reading out a leaf.

The new sanity checker will include:

1) Key order
   Existing code

2) Item boundary
   Existing code, enhanced to ensure that the item pointer doesn't
   overlap with the item itself.

3) Key type based sanity checker
   Only the EXTENT_DATA checker is implemented so far.
   Each checker must go through review and testing, otherwise it can
   easily cause a valid btrfs to fail to mount.
   So only one checker is implemented as an example.

   Existing checkers, like the INODE_REF checker, can easily be moved to
   this framework, so that we can centralize all existing checkers and
   make the rest of the code cleaner.
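
   A minimal userspace sketch of the key-type dispatch described in 3),
   assuming simplified stand-in structures. Every toy_* / TOY_* name
   below is hypothetical and is not a kernel API; only the shape of the
   check (a switch on key type, then an alignment test on the
   EXTENT_DATA fields) follows the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical key types; real btrfs uses BTRFS_*_KEY constants. */
enum toy_key_type { TOY_KEY_OTHER, TOY_KEY_EXTENT_DATA };

/* Hypothetical extent types, mirroring inline/regular/preallocated. */
enum toy_extent_type {
	TOY_EXTENT_INLINE,
	TOY_EXTENT_REG,
	TOY_EXTENT_PREALLOC,
	TOY_EXTENT_LAST_TYPE
};

/* Simplified stand-in for a file extent item (assumption). */
struct toy_file_extent {
	uint8_t type;
	uint64_t ram_bytes;
	uint64_t disk_bytenr;
	uint64_t disk_num_bytes;
	uint64_t offset;
	uint64_t num_bytes;
};

struct toy_item {
	enum toy_key_type key_type;
	struct toy_file_extent extent; /* used only for EXTENT_DATA */
};

/* sectorsize is assumed to be a power of two. */
static bool toy_is_aligned(uint64_t value, uint32_t sectorsize)
{
	return (value & ((uint64_t)sectorsize - 1)) == 0;
}

static int toy_check_extent_data(const struct toy_file_extent *fi,
				 uint32_t sectorsize)
{
	if (fi->type >= TOY_EXTENT_LAST_TYPE)
		return -EIO;
	/* Inline extents are byte-granular; their size check is omitted here. */
	if (fi->type == TOY_EXTENT_INLINE)
		return 0;
	if (!toy_is_aligned(fi->ram_bytes, sectorsize) ||
	    !toy_is_aligned(fi->disk_bytenr, sectorsize) ||
	    !toy_is_aligned(fi->disk_num_bytes, sectorsize) ||
	    !toy_is_aligned(fi->offset, sectorsize) ||
	    !toy_is_aligned(fi->num_bytes, sectorsize))
		return -EIO;
	return 0;
}

/* Dispatch on key type; key types without a checker pass unchanged. */
static int toy_check_leaf_item(const struct toy_item *item,
			       uint32_t sectorsize)
{
	switch (item->key_type) {
	case TOY_KEY_EXTENT_DATA:
		return toy_check_extent_data(&item->extent, sectorsize);
	default:
		return 0;
	}
}
```

   Adding another checker in this sketch would mean one more case in
   the switch and one more helper, which is the centralization the
   cover letter argues for.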

Performance-wise, it's just iterating over a leaf.
And it is only triggered when a leaf is read out; a cached leaf does
not go through the checker again.
So it won't be a performance breaker.

I tested the patchset applied on v4.13-rc6 with fstests; no regression
was detected.

Qu Wenruo (3):
  btrfs: Refactor check_leaf function for later expansion.
  btrfs: Check if item pointer overlap with item itself
  btrfs: Add sanity check for EXTENT_DATA when reading out leaf

 fs/btrfs/disk-io.c  | 137 ++--
 include/uapi/linux/btrfs_tree.h |   1 +
 2 files changed, 119 insertions(+), 19 deletions(-)

-- 
2.14.1


[PATCH 2/3] btrfs: Check if item pointer overlap with item itself

2017-08-22 Thread Qu Wenruo

Function check_leaf() checks if any item pointer points outside of the
leaf, but it doesn't check if the pointer overlaps with the item itself.

Normally only the last item may be the victim, but adding such a check
is never a bad idea anyway.

Signed-off-by: Qu Wenruo 
---
 fs/btrfs/disk-io.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 919ddd4b774c..59ee7b959bf0 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -643,6 +643,13 @@ static noinline int check_leaf(struct btrfs_root *root,
return -EIO;
}
 
+   /* Also check if the item pointer overlaps with btrfs item. */
+   if (btrfs_item_nr_offset(slot) + sizeof(struct btrfs_item) >
+   btrfs_item_ptr_offset(leaf, slot)) {
+   CORRUPT("slot overlap with its data", leaf, root, slot);
+   return -EIO;
+   }
+
prev_key.objectid = key.objectid;
prev_key.type = key.type;
prev_key.offset = key.offset;
-- 
2.14.1
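
A small worked sketch of the overlap test above, assuming a simplified
leaf layout in which item headers are packed right after a fixed leaf
header and item data lives elsewhere in the same leaf. The toy_* names
and the two size constants are illustrative assumptions, not kernel
definitions.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative sizes for this sketch (assumptions). */
#define TOY_LEAF_HEADER_SIZE 101u
#define TOY_ITEM_HEADER_SIZE 25u

/* Byte offset of the slot's item header within the leaf. */
static uint32_t toy_item_header_offset(uint32_t slot)
{
	return TOY_LEAF_HEADER_SIZE + slot * TOY_ITEM_HEADER_SIZE;
}

/*
 * data_offset is the byte offset of the slot's data within the leaf.
 * The item is rejected when its own header extends past the start of
 * its data, i.e. when the header and the data share bytes.
 */
static int toy_check_item_overlap(uint32_t slot, uint32_t data_offset)
{
	if (toy_item_header_offset(slot) + TOY_ITEM_HEADER_SIZE > data_offset)
		return -EIO;
	return 0;
}
```

With these sizes, the header of slot 0 occupies bytes 101 to 125, so
data starting at offset 126 is accepted while data starting at offset
125 is rejected.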


Re: Btrfs Raid5 issue.

2017-08-22 Thread Qu Wenruo




On 2017-08-22 13:19, Robert LeBlanc wrote:

Chris and Qu, thanks for your help. I was able to restore the data off
the volume. The only file I could not read was one I tried to rsync (a
MySQL bin log), but it wasn't critical as I had an off-site snapshot
from that morning and ownCloud could resync the files that were
changed anyway. This turned out much better than the md RAID failure
that I had a year ago. Much faster recovery thanks to snapshots.

Is there anything you would like from this damaged filesystem to help
determine what went wrong and to help make btrfs better? If I don't
hear back from you in a day, I'll destroy it so that I can add the
disks into the new btrfs volumes to restore redundancy.

Feel free to destroy the old images.

If nologreplay works, that's good enough.
The problem seems to be in the extent tree, but it's too hard to locate
the real cause.




Bcache wasn't providing the performance I was hoping for, so I'm
putting the root and roots for my LXC containers on the SSDs (btrfs
RAID1) and the bulk stuff on the three spindle drives (btrfs RAID1).


Well, I'm more interested in the bcache performance.

I was considering using my Intel 600P NVMe to cache one 2.5" HGST 1TB
HDD (7200rpm) in my btrfs KVM host (also my daily machine).


Would you please share more details about the performance problem?
(Maybe it's a btrfs performance problem, not a bcache one. Btrfs is
not good at workloads like databases or metadata-heavy operations.)



For some reason, it seemed that the btrfs RAID5 setup required one of
the drives, but I thought I had data with RAID5 and metadata with 2
copies. Was I missing something else that prevented mounting with that
specific drive? I don't want to get into a situation where one drive
dies and I can't get to any data.


The direct cause is that btrfs fails to replay its log, and it is the
corrupted extent tree that makes log replay fail.
Normally such a failure will definitely cause problems, so btrfs just
stops the mount procedure.


In your case, if "nologreplay" is specified, btrfs skips the problem,
and since you must specify RO for nologreplay, btrfs doesn't need to
touch the extent tree at all.

So btrfs can be mounted.

Why the extent tree got corrupted is still unknown. If your metadata is
also RAID5, then the write hole may be the cause.

If your metadata profile is RAID1, then I don't know why this could happen.

So from this point of view, even if we fixed the btrfs scrub/race
problems, it's still not good enough to survive a disk removal in the
real world.


With a RAID1 setup, at least we don't need to care about the write hole,
and csum will help us determine which copy is correct, so I think it
will be much better than RAID56.


If you have spare time, you could try to hot-plug RAID1 devices to
verify how it works.
But please note that re-attaching an unplugged device may require you
to umount the fs and re-scan btrfs.


And even if you're using 3 devices with RAID1, it's still 2 copies.
So you can lose at most 1 device.
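
A toy sketch of that reasoning, assuming each chunk keeps exactly two
copies on two distinct devices. The toy_* names and the chunk placement
in the usage note are made up for illustration and say nothing about
how btrfs actually allocates chunks.

```c
#include <stdbool.h>
#include <stddef.h>

/* One two-copy chunk: indices of the two devices holding its copies. */
struct toy_chunk {
	int dev_a;
	int dev_b;
};

/*
 * Returns true if every chunk still has at least one copy on a device
 * that has not failed. failed[i] is true when device i is gone.
 */
static bool toy_all_chunks_readable(const struct toy_chunk *chunks,
				    size_t nr_chunks, const bool *failed)
{
	for (size_t i = 0; i < nr_chunks; i++) {
		if (failed[chunks[i].dev_a] && failed[chunks[i].dev_b])
			return false;
	}
	return true;
}
```

With three devices and chunks placed on the pairs (0,1), (1,2) and
(0,2), any single failed device leaves all chunks readable, while any
two failed devices take out the chunk stored on exactly that pair.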

Thanks,
Qu



Thank you again.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1