Re: [PATCH 3/3] Btrfs-progs: list snapshots by generation

2012-08-02 Thread Anand Jain


Liu Bo,

 I am trying to resolve a conflict with your patch vs Hugo'
 integration branch  
  http://git.darksatanic.net/repo/btrfs-progs-unstable.git

  (integration-20120605)

---

@@ -185,11 +199,15 @@ static int add_root(struct root_lookup *root_lookup,
ri->dir_id = dir_id;
ri->root_id = root_id;
ri->ref_tree = ref_tree;
-   strncpy(ri->name, name, name_len);
+   if (name)
+   strncpy(ri->name, name, name_len);
if (name_len>  0)
ri->name[name_len] = 0;gen = *gen;

---

 Per the original patch
  http://permalink.gmane.org/gmane.comp.file-systems.btrfs/16914
 it is ri->name[name_len-1] instead.
 and I don't see any further patches which made it to be
 ri->name[name_len] as in your patch. any idea?

Thanks, Anand


On 31/07/12 13:49, Liu Bo wrote:

The idea is that we usually use snapshot to backup/restore our data, and the
common way can be a cron script which makes lots of snapshots, so we can end
up with spending some time to find the latest snapshot to restore.

This adds a feature for 'btrfs subvolume list' to let it list snapshots by their
_created_ generation.

What we need to do is just to list them in descending order and get the latest
snapshot.  What's more, we can find the oldest snapshot as well by listing
snapshots in ascending order.

Signed-off-by: Liu Bo
---
  btrfs-list.c |  176 --
  cmds-subvolume.c |   19 +-
  2 files changed, 185 insertions(+), 10 deletions(-)

diff --git a/btrfs-list.c b/btrfs-list.c
index 05360dc..0374b41 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -87,13 +87,23 @@ static int comp_entry(struct root_info *entry, u64 root_id, 
u64 ref_tree)
return 0;
  }

+static int comp_entry_with_gen(struct root_info *entry, u64 root_id,
+  u64 ref_tree, u64 gen)
+{
+   if (entry->gen<  gen)
+   return 1;
+   if (entry->gen>  gen)
+   return -1;
+   return comp_entry(entry, root_id, ref_tree);
+}
+
  /*
   * insert a new root into the tree.  returns the existing root entry
   * if one is already there.  Both root_id and ref_tree are used
   * as the key
   */
  static struct rb_node *tree_insert(struct rb_root *root, u64 root_id,
-  u64 ref_tree, struct rb_node *node)
+  u64 ref_tree, u64 *gen, struct rb_node *node)
  {
struct rb_node ** p =&root->rb_node;
struct rb_node * parent = NULL;
@@ -104,7 +114,11 @@ static struct rb_node *tree_insert(struct rb_root *root, 
u64 root_id,
parent = *p;
entry = rb_entry(parent, struct root_info, rb_node);

-   comp = comp_entry(entry, root_id, ref_tree);
+   if (!gen)
+   comp = comp_entry(entry, root_id, ref_tree);
+   else
+   comp = comp_entry_with_gen(entry, root_id, ref_tree,
+  *gen);

if (comp<  0)
p =&(*p)->rb_left;
@@ -171,7 +185,7 @@ static struct root_info *tree_search(struct rb_root *root, 
u64 root_id)
   */
  static int add_root(struct root_lookup *root_lookup,
u64 root_id, u64 ref_tree, u64 dir_id, char *name,
-   int name_len)
+   int name_len, u64 *gen)
  {
struct root_info *ri;
struct rb_node *ret;
@@ -185,11 +199,15 @@ static int add_root(struct root_lookup *root_lookup,
ri->dir_id = dir_id;
ri->root_id = root_id;
ri->ref_tree = ref_tree;
-   strncpy(ri->name, name, name_len);
+   if (name)
+   strncpy(ri->name, name, name_len);
if (name_len>  0)
ri->name[name_len] = 0;
+   if (gen)
+   ri->gen = *gen;

-   ret = tree_insert(&root_lookup->root, root_id, ref_tree,&ri->rb_node);
+   ret = tree_insert(&root_lookup->root, root_id, ref_tree, gen,
+   &ri->rb_node);
if (ret) {
printf("failed to insert tree %llu\n", (unsigned long 
long)root_id);
exit(1);
@@ -693,7 +711,7 @@ again:
dir_id = btrfs_stack_root_ref_dirid(ref);

add_root(root_lookup, sh->objectid, sh->offset,
-dir_id, name, name_len);
+dir_id, name, name_len, NULL);
} else if (get_gen&&  sh->type == BTRFS_ROOT_ITEM_KEY) {
ri = (struct btrfs_root_item *)(args.buf + off);
gen = btrfs_root_generation(ri);
@@ -750,6 +768,79 @@ again:
return 0;
  }

+static int __list_snapshot_search(int fd, struct root_lookup *root_lookup)
+{
+   int ret;
+   struct btrfs_ioctl_search_args args;
+  

Btrfs send/receive fixes and PULL request

2012-08-02 Thread Alexander Block
Hello Chris,

You can find and pull a lot of fixes for btrfs send/receive in my git repo:

git://github.com/ablock84/linux-btrfs.git for-chris

These fixes are mostly the results of the reviews from Arne and Alex
that I got here on the list. I'm not posting the patches to the list
at the moment as they got quite a lot (23 patches). If anyone thinks I
should still post them, please tell me. The branch is based on current
master of Linus repo. As I already announced, I won't have time to
work on btrfs starting from now.

Alex.


Alexander Block (23):
  Btrfs: add rdev to get_inode_info in send/receive
  Btrfs: fix cur_ino < parent_ino case for send/receive
  Btrfs: add missing check for dir != tmp_dir to is_first_ref
  Btrfs: remove unused code with #if 0
  Btrfs: add correct parent to check_dirs when dir got moved
  Btrfs: remove unused use_list from send/receive code
  Btrfs: rename backref_ctx::found_in_send_root to found_itself
  Btrfs: use kmalloc instead of stack for backref_ctx
  Btrfs: use normal return path for root == send_root case
  Btrfs: don't break in the final loop of find_extent_clone
  Btrfs: fix memory leak for name_cache in send/receive
  Btrfs: fix use of radix_tree for name_cache in send/receive
  Btrfs: make aux field of ulist 64 bit
  Btrfs: update send_progress at correct places
  Btrfs: add/fix comments/documentation for send/receive
  Btrfs: code cleanups for send/receive
  Btrfs: remove unused tmp_path from iterate_dir_item
  Btrfs: free nce and nce_head on error in name_cache_insert
  Btrfs: fix check for changed extent in is_extent_unchanged
  Btrfs: use <= instead of < in is_extent_unchanged
  Btrfs: pass root instead of parent_root to iterate_inode_ref
  Btrfs: ignore non-FS inodes for send/receive
  Btrfs: don't treat top/root directory inode as deleted/reused

Dan Carpenter (1):
  Btrfs: fix some endian bugs handling the root times

 fs/btrfs/backref.c |8 +-
 fs/btrfs/ioctl.c   |2 +-
 fs/btrfs/qgroup.c  |   20 +-
 fs/btrfs/root-tree.c   |4 +-
 fs/btrfs/send.c|  851 ++--
 fs/btrfs/transaction.c |2 +-
 fs/btrfs/ulist.c   |7 +-
 fs/btrfs/ulist.h   |9 +-
 8 files changed, 483 insertions(+), 420 deletions(-)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Btrfs send/receive fixes and PULL request

2012-08-02 Thread Alexander Block
On Thu, Aug 2, 2012 at 11:55 AM, Alexander Block
 wrote:
> Hello Chris,
>
> You can find and pull a lot of fixes for btrfs send/receive in my git repo:
>
> git://github.com/ablock84/linux-btrfs.git for-chris
>
> These fixes are mostly the results of the reviews from Arne and Alex
> that I got here on the list. I'm not posting the patches to the list
> at the moment as they got quite a lot (23 patches). If anyone thinks I
> should still post them, please tell me. The branch is based on current
> master of Linus repo. As I already announced, I won't have time to
> work on btrfs starting from now.
>
> Alex.
>
> 
> Alexander Block (23):
>   Btrfs: add rdev to get_inode_info in send/receive
>   Btrfs: fix cur_ino < parent_ino case for send/receive
>   Btrfs: add missing check for dir != tmp_dir to is_first_ref
>   Btrfs: remove unused code with #if 0
>   Btrfs: add correct parent to check_dirs when dir got moved
>   Btrfs: remove unused use_list from send/receive code
>   Btrfs: rename backref_ctx::found_in_send_root to found_itself
>   Btrfs: use kmalloc instead of stack for backref_ctx
>   Btrfs: use normal return path for root == send_root case
>   Btrfs: don't break in the final loop of find_extent_clone
>   Btrfs: fix memory leak for name_cache in send/receive
>   Btrfs: fix use of radix_tree for name_cache in send/receive
>   Btrfs: make aux field of ulist 64 bit
>   Btrfs: update send_progress at correct places
>   Btrfs: add/fix comments/documentation for send/receive
>   Btrfs: code cleanups for send/receive
>   Btrfs: remove unused tmp_path from iterate_dir_item
>   Btrfs: free nce and nce_head on error in name_cache_insert
>   Btrfs: fix check for changed extent in is_extent_unchanged
>   Btrfs: use <= instead of < in is_extent_unchanged
>   Btrfs: pass root instead of parent_root to iterate_inode_ref
>   Btrfs: ignore non-FS inodes for send/receive
>   Btrfs: don't treat top/root directory inode as deleted/reused
>
> Dan Carpenter (1):
>   Btrfs: fix some endian bugs handling the root times
>
>  fs/btrfs/backref.c |8 +-
>  fs/btrfs/ioctl.c   |2 +-
>  fs/btrfs/qgroup.c  |   20 +-
>  fs/btrfs/root-tree.c   |4 +-
>  fs/btrfs/send.c|  851 
> ++--
>  fs/btrfs/transaction.c |2 +-
>  fs/btrfs/ulist.c   |7 +-
>  fs/btrfs/ulist.h   |9 +-
>  8 files changed, 483 insertions(+), 420 deletions(-)

I forgot to write where to find the btrfs-progs fixes:
git://github.com/ablock84/btrfs-progs.git for-chris
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs-progs: enhance btrfs subvol list to show read-only snapshots

2012-08-02 Thread zhoubo
From: Zhou Bo 

This patch enhances btrfs subvol list to show read-only snapshots.
You can use the -r option showing read-only snapshots, for example:
btrfs subvolume list -r 

Signed-off-by: Zhou Bo 
---
 btrfs-list.c |   68 +++--
 cmds-subvolume.c |   15 
 ctree.h  |2 +
 3 files changed, 72 insertions(+), 13 deletions(-)

diff --git a/btrfs-list.c b/btrfs-list.c
index 693d241..9c10b9e 100644
--- a/btrfs-list.c
+++ b/btrfs-list.c
@@ -608,7 +608,46 @@ out:
return 0;
 }
 
-static int __list_subvol_search(int fd, struct root_lookup *root_lookup)
+static int is_readonly_subvol(int fd, u64 objectid)
+{
+   int ret;
+   struct btrfs_ioctl_search_args args;
+   struct btrfs_ioctl_search_key *sk = &args.key;
+   struct btrfs_ioctl_search_header *sh;
+   unsigned long off = 0;
+   int found = 0;
+   struct btrfs_root_item *item;
+
+   memset(&args, 0, sizeof(args));
+
+   /* search in the tree of tree roots */
+   sk->tree_id = BTRFS_ROOT_TREE_OBJECTID;
+   sk->min_objectid = objectid;
+   sk->max_objectid = objectid;
+   sk->max_type = BTRFS_ROOT_ITEM_KEY;
+   sk->min_type = BTRFS_ROOT_ITEM_KEY;
+   sk->max_offset = (u64)-1;
+   sk->max_transid = (u64)-1;
+   sk->nr_items = 1;
+   ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args);
+   if (ret < 0)
+   return ret;
+   if (sk->nr_items == 0) {
+   errno = -ENOENT;
+   found = -1;
+   goto out;
+   }
+   sh = (struct btrfs_ioctl_search_header *)args.buf;
+   off += sizeof(*sh);
+   item = (struct btrfs_root_item *)(args.buf + off);
+   if (item->flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_RDONLY))
+   found = 1;
+out:
+   return found;
+}
+
+static int __list_subvol_search(int fd, struct root_lookup *root_lookup,
+   int list_ro)
 {
int ret;
struct btrfs_ioctl_search_args args;
@@ -668,9 +707,23 @@ static int __list_subvol_search(int fd, struct root_lookup 
*root_lookup)
name_len = btrfs_stack_root_ref_name_len(ref);
name = (char *)(ref + 1);
dir_id = btrfs_stack_root_ref_dirid(ref);
-
-   add_root(root_lookup, sh->objectid, sh->offset,
-dir_id, name, name_len);
+   if (list_ro) {
+   int subvol_readonly =
+   is_readonly_subvol(fd,
+   sh->objectid);
+   if (subvol_readonly < 0) {
+   return subvol_readonly;
+   } else if (subvol_readonly) {
+   add_root(root_lookup,
+   sh->objectid,
+   sh->offset, dir_id,
+   name, name_len);
+   }
+   } else {
+   add_root(root_lookup, sh->objectid,
+   sh->offset, dir_id,
+   name, name_len);
+   }
}
 
off += sh->len;
@@ -719,7 +772,7 @@ static int __list_subvol_fill_paths(int fd, struct 
root_lookup *root_lookup)
return 0;
 }
 
-int list_subvols(int fd, int print_parent, int get_default)
+int list_subvols(int fd, int print_parent, int get_default, int list_ro)
 {
struct root_lookup root_lookup;
struct rb_node *n;
@@ -745,7 +798,7 @@ int list_subvols(int fd, int print_parent, int get_default)
}
}
 
-   ret = __list_subvol_search(fd, &root_lookup);
+   ret = __list_subvol_search(fd, &root_lookup, list_ro);
if (ret) {
fprintf(stderr, "ERROR: can't perform the search - %s\n",
strerror(errno));
@@ -788,7 +841,6 @@ int list_subvols(int fd, int print_parent, int get_default)
(unsigned long long)entry->root_id,
(unsigned long long)level, path);
}
-
free(path);
n = rb_prev(n);
}
@@ -983,7 +1035,7 @@ char *path_for_root(int fd, u64 root)
char *ret_path = NULL;
int ret;
 
-   ret = __list_subvol_search(fd, &root_lookup);
+   ret = __list_subvol_search(fd, &root_lookup, 0);
if (ret < 0)
return ERR_PTR(ret);
 
diff --git a/cmds-subvolume.c b/cmds-subvolume.c
index 3508c

[RFC][PATCH] Btrfs-progs, btrfsck: add block group check function

2012-08-02 Thread Miao Xie
From: Chen Yang 

This patch adds the function to check correspondence between block group,
chunk and device extent.

Signed-off-by: Cheng Yang 
Signed-off-by: Miao Xie 
---
 Makefile   |2 +-
 btrfsck.c  |  569 +++-
 dev-extent-cache.c |  188 +
 dev-extent-cache.h |   60 ++
 4 files changed, 812 insertions(+), 7 deletions(-)
 create mode 100644 dev-extent-cache.c
 create mode 100644 dev-extent-cache.h

diff --git a/Makefile b/Makefile
index 969..75eced8 100644
--- a/Makefile
+++ b/Makefile
@@ -4,7 +4,7 @@ CFLAGS = -g -O0
 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \
  root-tree.o dir-item.o file-item.o inode-item.o \
  inode-map.o crc32c.o rbtree.o extent-cache.o extent_io.o \
- volumes.o utils.o btrfs-list.o btrfslabel.o repair.o
+ volumes.o utils.o btrfs-list.o btrfslabel.o repair.o 
dev-extent-cache.o
 cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \
   cmds-inspect.o cmds-balance.o
 
diff --git a/btrfsck.c b/btrfsck.c
index 088b9f4..a25c948 100644
--- a/btrfsck.c
+++ b/btrfsck.c
@@ -34,6 +34,56 @@
 #include "list.h"
 #include "version.h"
 #include "utils.h"
+#include "dev-extent-cache.h"
+
+
+struct block_group_rec {
+   struct cache_extent cache;
+   u64 objectid;
+   u8  type;
+   u64 offset;
+
+   u64 flags;
+};
+
+struct dev_rec {
+   struct cache_extent cache;
+   u64 objectid;
+   u8  type;
+   u64 offset;
+
+   u64 devid;
+   u64 total_byte;
+   u64 byte_used;
+};
+
+struct stripe {
+   u64 devid;
+   u64 offset;
+};
+
+struct chunk_rec {
+   struct cache_extent cache;
+   u64 objectid;
+   u8  type;
+   u64 offset;
+
+   u64 length;
+   u64 type_flags;
+   u16 num_stripes;
+   struct stripe stripes[0];
+};
+
+struct dev_extent_rec {
+   struct cache_dev_extent cache;
+   u64 objectid;
+   u8  type;
+   u64 offset;
+
+   u64 chunk_objecteid;
+   u64 chunk_offset;
+   u64 length;
+};
 
 static u64 bytes_used = 0;
 static u64 total_csum_bytes = 0;
@@ -1852,7 +1902,7 @@ static int all_backpointers_checked(struct extent_record 
*rec, int print_errs)
(unsigned long long)rec->start,
back->full_backref ?
"parent" : "root",
-   back->full_backref ? 
+   back->full_backref ?
(unsigned long long)dback->parent:
(unsigned long long)dback->root,
(unsigned long long)dback->owner,
@@ -2440,6 +2490,149 @@ static int process_extent_ref_v0(struct cache_tree 
*extent_cache,
 }
 #endif
 
+static int process_chunk_item(struct cache_tree *chunk_cache,
+   struct btrfs_key *key, struct extent_buffer *eb, int slot)
+{
+   struct btrfs_chunk *ptr;
+   struct chunk_rec *rec;
+   int num_stripes, i;
+   int ret = 0;
+
+   ptr = btrfs_item_ptr(eb,
+   slot, struct btrfs_chunk);
+
+   num_stripes = btrfs_chunk_num_stripes(eb, ptr);
+
+   rec = malloc(sizeof(*rec) +
+   num_stripes * sizeof(*rec->stripes));
+   if (!rec) {
+   fprintf(stderr, "memory allocation failed\n");
+   return -ENOMEM;
+   }
+
+   rec->cache.start = key->offset;
+   rec->cache.size = 1;
+
+   rec->objectid = key->objectid;
+   rec->type = key->type;
+   rec->offset = key->offset;
+
+   rec->length = btrfs_chunk_length(eb, ptr);
+   rec->type = btrfs_chunk_type(eb, ptr);
+   rec->num_stripes = num_stripes;
+
+   for (i = 0; i < rec->num_stripes; ++i) {
+   rec->stripes[i].devid =
+   btrfs_stripe_devid_nr(eb, ptr, i);
+   rec->stripes[i].offset =
+   btrfs_stripe_offset_nr(eb, ptr, i);
+   }
+
+   ret = insert_existing_cache_extent(
+   chunk_cache, &rec->cache);
+
+   return ret;
+}
+
+static int process_dev_item(struct cache_tree *dev_cache,
+   struct btrfs_key *key, struct extent_buffer *eb, int slot)
+{
+   struct btrfs_dev_item *ptr;
+   struct dev_rec *rec;
+   int ret = 0;
+
+   ptr = btrfs_item_ptr(eb,
+   slot, struct btrfs_dev_item);
+
+   rec = malloc(sizeof(*rec));
+   if (!rec) {
+   fprintf(stderr, "memory allocation failed\n");
+   return -ENOMEM;
+   }
+
+   rec->cache.start = key->offset;
+   rec->cache.size = 1;
+
+   rec->objectid = key->objectid;
+   rec->type = key->type;
+   rec->offset = key->offset;
+
+   rec->devid = btrfs_device_id(eb, ptr);
+   rec->total_byte = btrfs_device_total_bytes(eb, ptr);
+  

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Stefan Behrens
On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
>> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
>>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
 On 08/01/2012 07:45 PM, Stefan Behrens wrote:
> With commit acce952b0, btrfs was changed to flag the filesystem with
> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
> error happened like a write I/O errors of all mirrors.
> In such situations, on unmount, the superblock is written in
> btrfs_error_commit_super(). This is done with the intention to be able
> to evaluate the error flag on the next mount. A warning is printed
> in this case during the next mount and the log tree is ignored.
>
> The issue is that it is possible that the superblock points to a root
> that was not written (due to write I/O errors).
> The result is that the filesystem cannot be mounted. btrfsck also does
> not start and all the other btrfs-progs tools fail to start as well.
> However, mount -o recovery is working well and does the right things
> to recover the filesystem (i.e., don't use the log root, clear the
> free space cache and use the next mountable root that is stored in the
> root backup array).
>
> This patch removes the writing of the superblock when
> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
> flag in the mount function.
>

 Yes, I have to admit that this can be a serious problem.

 But we'll need to send the error flag stored in the super block into
 disk in the future so that the next mount can find it unstable and do
 fsck by itself maybe.
>>>
>>> Hum, that's possible. However, I neither see
>>>
>>> a) a safe way to get that flag to disk
>>>
>>> nor
>>>
>>> b) a situation where this flag would help. When we abort a transaction, we 
>>> just
>>> roll everything back to the last commit, i.e. a consistent state. So if we 
>>> stop
>>> writing a potentially corrupt super block, we should be fine anyway. Or am I
>>> missing something?
>>>
>>
>> I'm just wondering if we can roll everything back well, why do we need fsck?
> 
> If the disks support barriers, we roll everything back very well. The
> most recent superblock on the disks always defines a consistent
> filesystem state. There are only two remaining filesystem consistency
> issues left that can cause inconsistent states, one is the one that the
> patch in this email addresses, and the second one is that the error
> result from barrier_all_devices() is ignored (which I want to change next).

Hi Liu Bo,

Do you have any remaining objections to that patch?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Liu Bo
On 08/02/2012 06:30 PM, Stefan Behrens wrote:
> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
>>> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
 On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
>> With commit acce952b0, btrfs was changed to flag the filesystem with
>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
>> error happened like a write I/O errors of all mirrors.
>> In such situations, on unmount, the superblock is written in
>> btrfs_error_commit_super(). This is done with the intention to be able
>> to evaluate the error flag on the next mount. A warning is printed
>> in this case during the next mount and the log tree is ignored.
>>
>> The issue is that it is possible that the superblock points to a root
>> that was not written (due to write I/O errors).
>> The result is that the filesystem cannot be mounted. btrfsck also does
>> not start and all the other btrfs-progs tools fail to start as well.
>> However, mount -o recovery is working well and does the right things
>> to recover the filesystem (i.e., don't use the log root, clear the
>> free space cache and use the next mountable root that is stored in the
>> root backup array).
>>
>> This patch removes the writing of the superblock when
>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
>> flag in the mount function.
>>
>
> Yes, I have to admit that this can be a serious problem.
>
> But we'll need to send the error flag stored in the super block into
> disk in the future so that the next mount can find it unstable and do
> fsck by itself maybe.

 Hum, that's possible. However, I neither see

 a) a safe way to get that flag to disk

 nor

 b) a situation where this flag would help. When we abort a transaction, we 
 just
 roll everything back to the last commit, i.e. a consistent state. So if we 
 stop
 writing a potentially corrupt super block, we should be fine anyway. Or am 
 I
 missing something?

>>>
>>> I'm just wondering if we can roll everything back well, why do we need fsck?
>>
>> If the disks support barriers, we roll everything back very well. The
>> most recent superblock on the disks always defines a consistent
>> filesystem state. There are only two remaining filesystem consistency
>> issues left that can cause inconsistent states, one is the one that the
>> patch in this email addresses, and the second one is that the error
>> result from barrier_all_devices() is ignored (which I want to change next).
> 
> Hi Liu Bo,
> 
> Do you have any remaining objections to that patch?
> 

Hi Stefan,

Still I have another question:

Our metadata can be flushed into disk if we reach the limit, 32k, so we
can end up with updated metadata and the latest superblock if we do not
write the current super block.

Any ideas?

thanks,
liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: barrier before waitqueue_active

2012-08-02 Thread Liu Bo
On 08/02/2012 04:25 AM, Josef Bacik wrote:
> We need an smb_mb() before waitqueue_active to avoid missing wakeups.
> Before Mitch was hitting a deadlock between the ordered flushers and the
> transaction commit because the ordered flushers were waiting for more refs
> and were never woken up, so those smp_mb()'s are the most important.
> Everything else I added for correctness sake and to avoid getting bitten by
> this again somewhere else.  Thanks,
> 

Hi Josef,

I'll appreciate a lot if you can add some comments for each memory
barrier, because not everyone knows why it is used here and there. :)

thanks,
liubo

> Signed-off-by: Josef Bacik 
> ---
>  fs/btrfs/compression.c   |1 +
>  fs/btrfs/delayed-inode.c |   16 ++--
>  fs/btrfs/delayed-ref.c   |   18 --
>  fs/btrfs/disk-io.c   |   11 ---
>  fs/btrfs/inode.c |8 +---
>  fs/btrfs/volumes.c   |8 +---
>  6 files changed, 41 insertions(+), 21 deletions(-)
> 
> diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
> index 86eff48..43d1c5a 100644
> --- a/fs/btrfs/compression.c
> +++ b/fs/btrfs/compression.c
> @@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head 
> *workspace)
>   btrfs_compress_op[idx]->free_workspace(workspace);
>   atomic_dec(alloc_workspace);
>  wake:
> + smp_mb();
>   if (waitqueue_active(workspace_wait))
>   wake_up(workspace_wait);
>  }
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 335605c..8cc9b19 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -513,9 +513,11 @@ static void __btrfs_remove_delayed_item(struct 
> btrfs_delayed_item *delayed_item)
>   rb_erase(&delayed_item->rb_node, root);
>   delayed_item->delayed_node->count--;
>   atomic_dec(&delayed_root->items);
> - if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND &&
> - waitqueue_active(&delayed_root->wait))
> - wake_up(&delayed_root->wait);
> + if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND) {
> + smp_mb();
> + if (waitqueue_active(&delayed_root->wait))
> + wake_up(&delayed_root->wait);
> + }
>  }
>  
>  static void btrfs_release_delayed_item(struct btrfs_delayed_item *item)
> @@ -1057,9 +1059,11 @@ static void btrfs_release_delayed_inode(struct 
> btrfs_delayed_node *delayed_node)
>   delayed_root = delayed_node->root->fs_info->delayed_root;
>   atomic_dec(&delayed_root->items);
>   if (atomic_read(&delayed_root->items) <
> - BTRFS_DELAYED_BACKGROUND &&
> - waitqueue_active(&delayed_root->wait))
> - wake_up(&delayed_root->wait);
> + BTRFS_DELAYED_BACKGROUND) {
> + smp_mb();
> + if (waitqueue_active(&delayed_root->wait))
> + wake_up(&delayed_root->wait);
> + }
>   }
>  }
>  
> diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
> index da7419e..858ef02 100644
> --- a/fs/btrfs/delayed-ref.c
> +++ b/fs/btrfs/delayed-ref.c
> @@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
> *fs_info,
>   add_delayed_tree_ref(fs_info, trans, &ref->node, bytenr,
>  num_bytes, parent, ref_root, level, action,
>  for_cow);
> - if (!need_ref_seq(for_cow, ref_root) &&
> - waitqueue_active(&fs_info->tree_mod_seq_wait))
> - wake_up(&fs_info->tree_mod_seq_wait);
> + if (!need_ref_seq(for_cow, ref_root)) {
> + smp_mb();
> + if (waitqueue_active(&fs_info->tree_mod_seq_wait))
> + wake_up(&fs_info->tree_mod_seq_wait);
> + }
> +
>   spin_unlock(&delayed_refs->lock);
>   if (need_ref_seq(for_cow, ref_root))
>   btrfs_qgroup_record_ref(trans, &ref->node, extent_op);
> @@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
> *fs_info,
>   add_delayed_data_ref(fs_info, trans, &ref->node, bytenr,
>  num_bytes, parent, ref_root, owner, offset,
>  action, for_cow);
> - if (!need_ref_seq(for_cow, ref_root) &&
> - waitqueue_active(&fs_info->tree_mod_seq_wait))
> - wake_up(&fs_info->tree_mod_seq_wait);
> + if (!need_ref_seq(for_cow, ref_root)) {
> + smp_mb();
> + if (waitqueue_active(&fs_info->tree_mod_seq_wait))
> + wake_up(&fs_info->tree_mod_seq_wait);
> + }
>   spin_unlock(&delayed_refs->lock);
>   if (need_ref_seq(for_cow, ref_root))
>   btrfs_qgroup_record_ref(trans, &ref->node, extent_op);
> @@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
> *fs_info,
>  num_bytes, BTRFS_UPDATE_DELAYED_HEAD,
>   

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Arne Jansen
On 02.08.2012 12:36, Liu Bo wrote:
> On 08/02/2012 06:30 PM, Stefan Behrens wrote:
>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
>>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
 On 08/01/2012 09:07 PM, Jan Schmidt wrote:
> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
>> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
>>> With commit acce952b0, btrfs was changed to flag the filesystem with
>>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
>>> error happened like a write I/O errors of all mirrors.
>>> In such situations, on unmount, the superblock is written in
>>> btrfs_error_commit_super(). This is done with the intention to be able
>>> to evaluate the error flag on the next mount. A warning is printed
>>> in this case during the next mount and the log tree is ignored.
>>>
>>> The issue is that it is possible that the superblock points to a root
>>> that was not written (due to write I/O errors).
>>> The result is that the filesystem cannot be mounted. btrfsck also does
>>> not start and all the other btrfs-progs tools fail to start as well.
>>> However, mount -o recovery is working well and does the right things
>>> to recover the filesystem (i.e., don't use the log root, clear the
>>> free space cache and use the next mountable root that is stored in the
>>> root backup array).
>>>
>>> This patch removes the writing of the superblock when
>>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
>>> flag in the mount function.
>>>
>>
>> Yes, I have to admit that this can be a serious problem.
>>
>> But we'll need to send the error flag stored in the super block into
>> disk in the future so that the next mount can find it unstable and do
>> fsck by itself maybe.
>
> Hum, that's possible. However, I neither see
>
> a) a safe way to get that flag to disk
>
> nor
>
> b) a situation where this flag would help. When we abort a transaction, 
> we just
> roll everything back to the last commit, i.e. a consistent state. So if 
> we stop
> writing a potentially corrupt super block, we should be fine anyway. Or 
> am I
> missing something?
>

 I'm just wondering if we can roll everything back well, why do we need 
 fsck?
>>>
>>> If the disks support barriers, we roll everything back very well. The
>>> most recent superblock on the disks always defines a consistent
>>> filesystem state. There are only two remaining filesystem consistency
>>> issues left that can cause inconsistent states, one is the one that the
>>> patch in this email addresses, and the second one is that the error
>>> result from barrier_all_devices() is ignored (which I want to change next).
>>
>> Hi Liu Bo,
>>
>> Do you have any remaining objections to that patch?
>>
> 
> Hi Stefan,
> 
> Still I have another question:
> 
> Our metadata can be flushed into disk if we reach the limit, 32k, so we
> can end up with updated metadata and the latest superblock if we do not
> write the current super block.

The old metadata stays valid until the new superblock is written,
so no problem here, or maybe I don't understand your question :)

> 
> Any ideas?
> 
> thanks,
> liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Martin Steigerwald
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote:
> > Its getting quite strange.
>  
> I would agree :)
> 
> Before I paste a bunch of thing, I wanted to thank you for not giving up on 
> me 
> and offering your time to help me figure this out :)

You are welcome.

Well I am holding Linux performance analysis & tuning trainings and I am
really interested into issues like this ;)

I will take care of myself and I take my time to respond or even do not
respond at all anymore if I run out of ideas ;).

> > I lost track of whether you did that already or not, but if you didn´t
> > please post some
> > 
> > vmstat 1
> > iostat -xd 1
> > on the device while it is being slow.
>  
> Sure thing, here's the 24 second du -s run:
> procs ---memory-- ---swap-- -io -system-- cpu
>  r  b   swpd   free   buff  cache   si   sobibo   in   cs us sy id wa
>  2  1  0 2747264 44 348388002850  242  184 19  6 74  1
>  1  0  0 2744128 44 35170000   144 0 2758 32115 30  5 61  
> 4
>  2  1  0 2743100 44 35199200   792 0 2616 30613 28  4 50 
> 18
>  1  1  0 2741592 44 35266800   776 0 2574 31551 29  4 45 
> 21
>  1  1  0 2740720 44 35343200   692 0 2734 32891 30  4 45 
> 22
>  1  1  0 2740104 44 35428400   460 0 2639 31585 30  4 45 
> 21
>  3  1  0 2738520 44 35469200   544   264 2834 30302 32  5 42 
> 21
>  1  1  0 2736936 44 35547600  1064  2012 2867 31172 28  4 45 
> 23

A bit more wait I/O with not even 10% of the throughput as compared to
the Intel SSD 320 figures. Seems that Intel SSD is running circles around
your Samsung SSD while not – as expected for that use case – being fully
utilized.

> Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat)08/01/2012  
> _x86_64_(4 CPU)
> rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz   await 
> r_await w_await  svctm  %util
>   2.18 0.686.45   17.7778.12   153.3919.12 0.187.51   
>  8.527.15   4.46  10.81
>   0.00 0.00  118.000.00   540.00 0.00 9.15 1.189.93   
>  9.930.00   4.98  58.80
>   0.00 0.00  217.000.00   868.00 0.00 8.00 1.908.77   
>  8.770.00   4.44  96.40
>   0.00 0.00  192.000.00   768.00 0.00 8.00 1.638.44   
>  8.440.00   5.10  98.00
>   0.00 0.00  119.000.00   476.00 0.00 8.00 1.069.01   
>  9.010.00   8.20  97.60
>   0.00 0.00  125.000.00   500.00 0.00 8.00 1.088.67   
>  8.670.00   7.55  94.40
>   0.00 0.00  165.000.00   660.00 0.00 8.00 1.509.12   
>  9.120.00   5.87  96.80
>   0.00 0.00  195.00   13.00   780.00   272.0010.12 1.688.10   
>  7.94   10.46   4.65  96.80
>   0.00 0.00  173.000.00   692.00 0.00 8.00 1.729.87   
>  9.870.00   5.71  98.80
>   0.00 0.00  171.000.00   684.00 0.00 8.00 1.629.33   
>  9.330.00   5.75  98.40
>   0.00 0.00  161.000.00   644.00 0.00 8.00 1.529.57   
>  9.570.00   6.14  98.80
>   0.00 0.00  136.000.00   544.00 0.00 8.00 1.269.29   
>  9.290.00   7.24  98.40
>   0.00 0.00  199.000.00   796.00 0.00 8.00 1.949.73   
>  9.730.00   4.94  98.40
>   0.00 0.00  201.000.00   804.00 0.00 8.00 1.708.54   
>  8.540.00   4.80  96.40
>   0.00 0.00  272.00   15.00  1088.00   272.00 9.48 2.358.21   
>  8.463.73   3.39  97.20
[…]
> > I am interested in wait I/O and latencies and disk utilization.
>  
> Cool tool, I didn't know about iostat.
> My r_await numbers don't look good obviously and yet %util is pretty much
> 100% the entire time.
> 
> Does that show that it's indeed the device that is unable to deliver the 
> requests any quicker, despite
> being an ssd, or are you reading this differently?

That, or…

> > Also I am interested in
> > merkaba:~> hdparm -I /dev/sda | grep -i queue
> > Queue depth: 32
> >*Native Command Queueing (NCQ)
> > output for your SSD.
> 
> gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue   
>   Queue depth: 32
>  *Native Command Queueing (NCQ)
> gandalfthegreat:/var/local# 
> 
> I've the the fio tests in:
> /dev/mapper/cryptroot /var btrfs 
> rw,noatime,compress=lzo,nossd,discard,space_cache 0 0

… you are still using dm_crypt?

Please test without dm_crypt. My figures are from within LVM, but no
dm_crypt. Its good to have a comparable base for the measurements.

> (discard is there, so fstrim shouldn't be needed)

I can´t imagine why it should matter, but maybe its worth having some
tests without „discard“. 

> > I also suggest to use fio with with the ssd-test example

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Martin Steigerwald
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> So, doctor, is it bad? :)
> 
> randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
> 2.0.8
> Starting 4 processes
> randomwrite: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0  iops] [eta 
> 00m:00s]  
> randomwrite: (groupid=0, jobs=1): err= 0: pid=7193
>   write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec
> slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55
> clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63
>  lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57
> clat percentiles (msec):

Heck, I didn´t look at the IOPS figure!

189 IOPS for a SATA-600 SSD. Thats pathetic.

So again, please test this without dm_crypt. I can´t believe that this
is the maximum the hardware is able to achieve.

A really fast 15000 rpm SAS harddisk might top that.

-- 
Martin 'Helios' Steigerwald - http://www.Lichtvoll.de
GPG: 03B0 0D6C 0040 0710 4AFA  B82F 991B EAAC A599 84C7
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Liu Bo
On 08/02/2012 07:18 PM, Arne Jansen wrote:
> On 02.08.2012 12:36, Liu Bo wrote:
>> On 08/02/2012 06:30 PM, Stefan Behrens wrote:
>>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
 On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
>>> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
 With commit acce952b0, btrfs was changed to flag the filesystem with
 BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
 error happened like a write I/O errors of all mirrors.
 In such situations, on unmount, the superblock is written in
 btrfs_error_commit_super(). This is done with the intention to be able
 to evaluate the error flag on the next mount. A warning is printed
 in this case during the next mount and the log tree is ignored.

 The issue is that it is possible that the superblock points to a root
 that was not written (due to write I/O errors).
 The result is that the filesystem cannot be mounted. btrfsck also does
 not start and all the other btrfs-progs tools fail to start as well.
 However, mount -o recovery is working well and does the right things
 to recover the filesystem (i.e., don't use the log root, clear the
 free space cache and use the next mountable root that is stored in the
 root backup array).

 This patch removes the writing of the superblock when
 BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
 flag in the mount function.

>>>
>>> Yes, I have to admit that this can be a serious problem.
>>>
>>> But we'll need to send the error flag stored in the super block into
>>> disk in the future so that the next mount can find it unstable and do
>>> fsck by itself maybe.
>>
>> Hum, that's possible. However, I neither see
>>
>> a) a safe way to get that flag to disk
>>
>> nor
>>
>> b) a situation where this flag would help. When we abort a transaction, 
>> we just
>> roll everything back to the last commit, i.e. a consistent state. So if 
>> we stop
>> writing a potentially corrupt super block, we should be fine anyway. Or 
>> am I
>> missing something?
>>
>
> I'm just wondering if we can roll everything back well, why do we need 
> fsck?

 If the disks support barriers, we roll everything back very well. The
 most recent superblock on the disks always defines a consistent
 filesystem state. There are only two remaining filesystem consistency
 issues left that can cause inconsistent states, one is the one that the
 patch in this email addresses, and the second one is that the error
 result from barrier_all_devices() is ignored (which I want to change next).
>>>
>>> Hi Liu Bo,
>>>
>>> Do you have any remaining objections to that patch?
>>>
>>
>> Hi Stefan,
>>
>> Still I have another question:
>>
>> Our metadata can be flushed into disk if we reach the limit, 32k, so we
>> can end up with updated metadata and the latest superblock if we do not
>> write the current super block.
> 
> The old metadata stays valid until the new superblock is written,
> so no problem here, or maybe I don't understand your question :)
> 

Yeah, Arne, you're right :)

But for undetected and unexpected errors as Arne had mentioned,  I want
to keep the error flag which is able to inform users that this FS is
recommended (but not must) to do fsck at least.

thanks,
liubo

>>
>> Any ideas?
>>
>> thanks,
>> liubo
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Arne Jansen
On 02.08.2012 13:34, Liu Bo wrote:
> On 08/02/2012 07:18 PM, Arne Jansen wrote:
>> On 02.08.2012 12:36, Liu Bo wrote:
>>> On 08/02/2012 06:30 PM, Stefan Behrens wrote:
 On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
>> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
>>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
 On 08/01/2012 07:45 PM, Stefan Behrens wrote:
> With commit acce952b0, btrfs was changed to flag the filesystem with
> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
> error happened like a write I/O errors of all mirrors.
> In such situations, on unmount, the superblock is written in
> btrfs_error_commit_super(). This is done with the intention to be able
> to evaluate the error flag on the next mount. A warning is printed
> in this case during the next mount and the log tree is ignored.
>
> The issue is that it is possible that the superblock points to a root
> that was not written (due to write I/O errors).
> The result is that the filesystem cannot be mounted. btrfsck also does
> not start and all the other btrfs-progs tools fail to start as well.
> However, mount -o recovery is working well and does the right things
> to recover the filesystem (i.e., don't use the log root, clear the
> free space cache and use the next mountable root that is stored in the
> root backup array).
>
> This patch removes the writing of the superblock when
> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
> flag in the mount function.
>

 Yes, I have to admit that this can be a serious problem.

 But we'll need to send the error flag stored in the super block into
 disk in the future so that the next mount can find it unstable and do
 fsck by itself maybe.
>>>
>>> Hum, that's possible. However, I neither see
>>>
>>> a) a safe way to get that flag to disk
>>>
>>> nor
>>>
>>> b) a situation where this flag would help. When we abort a transaction, 
>>> we just
>>> roll everything back to the last commit, i.e. a consistent state. So if 
>>> we stop
>>> writing a potentially corrupt super block, we should be fine anyway. Or 
>>> am I
>>> missing something?
>>>
>>
>> I'm just wondering if we can roll everything back well, why do we need 
>> fsck?
>
> If the disks support barriers, we roll everything back very well. The
> most recent superblock on the disks always defines a consistent
> filesystem state. There are only two remaining filesystem consistency
> issues left that can cause inconsistent states, one is the one that the
> patch in this email addresses, and the second one is that the error
> result from barrier_all_devices() is ignored (which I want to change 
> next).

 Hi Liu Bo,

 Do you have any remaining objections to that patch?

>>>
>>> Hi Stefan,
>>>
>>> Still I have another question:
>>>
>>> Our metadata can be flushed into disk if we reach the limit, 32k, so we
>>> can end up with updated metadata and the latest superblock if we do not
>>> write the current super block.
>>
>> The old metadata stays valid until the new superblock is written,
>> so no problem here, or maybe I don't understand your question :)
>>
> 
> Yeah, Arne, you're right :)
> 
> But for undetected and unexpected errors as Arne had mentioned,  I want
> to keep the error flag which is able to inform users that this FS is
> recommended (but not must) to do fsck at least.

How about storing the flag in a different location than the superblock?
If the fs is in an unknown state, every write potentially makes it only
worse.

> 
> thanks,
> liubo
> 
>>>
>>> Any ideas?
>>>
>>> thanks,
>>> liubo
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>>> the body of a message to majord...@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 2/2] Btrfs: fix the snapshot that should not exist

2012-08-02 Thread David Sterba
Hi,

appologies for late reply,

On Thu, Aug 02, 2012 at 12:40:46PM +0800, Miao Xie wrote:
> Changelog v1 -> v2:
> - add comment to explain why we need deal with the delayed items after
>   snapshot creation and why this operation do not corrupt the metadata.

I'm sorry, the comment did not fix the bug :)

The subvol stress is able to hit this:

[ 2360.444321] [ cut here ]
[ 2360.448019] kernel BUG at fs/btrfs/extent-tree.c:6047!
[ 2360.448019] invalid opcode:  [#1] SMP
[ 2360.448019] CPU 0
[ 2360.448019] Modules linked in: btrfs aoe [last unloaded: btrfs]
[ 2360.448019]
[ 2360.448019] Pid: 8212, comm: btrfs Not tainted 3.5.0-default+ #170 Intel 
Corporation Santa Rosa platform/Matanzas
[ 2360.448019] RIP: 0010:[]  [] 
run_clustered_refs+0xa11/0xa20 [btrfs]
[ 2360.448019] RSP: 0018:88003eca1a68  EFLAGS: 00010246
[ 2360.448019] RAX: 07ff RBX: 880017a694c8 RCX: 88003eca1a08
[ 2360.448019] RDX: 880028aa9000 RSI: 07fe RDI: 880064223cf0
[ 2360.448019] RBP: 88003eca1b48 R08: 07ff R09: 88003eca19f8
[ 2360.448019] R10: 88002435d1e8 R11:  R12: 880025d66d28
[ 2360.448019] R13: 88003864 R14: 8800778dfa88 R15: 880060f010d0
[ 2360.448019] FS:  7f3289f35740() GS:88007dc0() 
knlGS:
[ 2360.448019] CS:  0010 DS:  ES:  CR0: 8005003b
[ 2360.448019] CR2: ff600400 CR3: 2e112000 CR4: 07f0
[ 2360.448019] DR0:  DR1:  DR2: 
[ 2360.448019] DR3:  DR6: 0ff0 DR7: 0400
[ 2360.448019] Process btrfs (pid: 8212, threadinfo 88003eca, task 
88001d834200)
[ 2360.448019] Stack:
[ 2360.448019]    0001 

[ 2360.448019]  07ed 88002435d1e8 3eca1b18 

[ 2360.448019]  0770  5cb1e000 
88003eca1c08
[ 2360.448019] Call Trace:
[ 2360.448019]  [] btrfs_run_delayed_refs+0x1c9/0x550 [btrfs]
[ 2360.448019]  [] ? trace_hardirqs_on_caller+0x155/0x1d0
[ 2360.448019]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
[ 2360.448019]  [] ? btrfs_run_delayed_items+0xf1/0x160 
[btrfs]
[ 2360.448019]  [] btrfs_commit_transaction+0x605/0xb00 
[btrfs]
[ 2360.448019]  [] ? lock_release_holdtime+0x3d/0x1c0
[ 2360.448019]  [] ? btrfs_mksubvol+0x298/0x360 [btrfs]
[ 2360.448019]  [] ? wake_up_bit+0x40/0x40
[ 2360.448019]  [] ? do_raw_spin_unlock+0x5e/0xb0
[ 2360.448019]  [] btrfs_mksubvol+0x358/0x360 [btrfs]
[ 2360.448019]  [] 
btrfs_ioctl_snap_create_transid+0x10a/0x190 [btrfs]
[ 2360.448019]  [] 
btrfs_ioctl_snap_create_v2.clone.0+0xfd/0x110 [btrfs]
[ 2360.448019]  [] btrfs_ioctl+0x48e/0x1340 [btrfs]
[ 2360.448019]  [] ? do_page_fault+0x2d0/0x580
[ 2360.448019]  [] ? _raw_spin_unlock_irq+0x30/0x50
[ 2360.448019]  [] ? finish_task_switch+0x83/0xf0
[ 2360.448019]  [] do_vfs_ioctl+0x98/0x560
[ 2360.448019]  [] ? retint_swapgs+0x13/0x1b
[ 2360.448019]  [] sys_ioctl+0x4f/0x80
[ 2360.448019]  [] system_call_fastpath+0x16/0x1b
[ 2360.448019] Code: 8b 76 40 48 89 d7 48 89 55 a0 e8 2b 74 ff ff 83 f8 17 0f 
87 1e ff ff ff 0f 0b 80 fa b2 0f 84 b4 f8 ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 
0f 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 57
[ 2360.448019] RIP  [] run_clustered_refs+0xa11/0xa20 [btrfs]
[ 2360.448019]  RSP 
[ 2360.814508] ---[ end trace 555a16cac3620ccb ]---
[ 2360.820398] note: btrfs[8212] exited with preempt_count 1
[ 2360.827072] BUG: sleeping function called from invalid context at 
kernel/rwsem.c:20
[ 2360.836047] in_atomic(): 1, irqs_disabled(): 0, pid: 8212, name: btrfs
[ 2360.843859] INFO: lockdep is turned off.
[ 2360.849021] Pid: 8212, comm: btrfs Tainted: G  D  3.5.0-default+ #170
[ 2360.849022] Call Trace:
[ 2360.849027]  [] __might_sleep+0xfc/0x130
[ 2360.849030]  [] down_read+0x26/0xa0
[ 2360.849034]  [] acct_collect+0x4b/0x1b0
[ 2360.849038]  [] do_exit+0x718/0x9a0
[ 2360.849041]  [] ? kmsg_dump+0x26/0x140
[ 2360.849043]  [] oops_end+0xb0/0xf0
[ 2360.849046]  [] die+0x5b/0x90
[ 2360.849048]  [] do_trap+0xc4/0x170
[ 2360.849052]  [] do_invalid_op+0x95/0xb0
[ 2360.849067]  [] ? run_clustered_refs+0xa11/0xa20 [btrfs]
[ 2360.849071]  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
[ 2360.849073]  [] ? restore_args+0x30/0x30
[ 2360.849076]  [] invalid_op+0x1b/0x20
[ 2360.849087]  [] ? run_clustered_refs+0xa11/0xa20 [btrfs]
[ 2360.849097]  [] ? run_clustered_refs+0x69b/0xa20 [btrfs]
[ 2360.849108]  [] btrfs_run_delayed_refs+0x1c9/0x550 [btrfs]
[ 2360.849110]  [] ? trace_hardirqs_on_caller+0x155/0x1d0
[ 2360.849119]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
[ 2360.849133]  [] ? btrfs_run_delayed_items+0xf1/0x160 
[btrfs]
[ 2360.849145]  [] btrfs_commit_transaction+0x605/0xb00 
[btrfs]
[ 2360.849148]  [] ? lock_release_holdtime+0x3d/0x1c0
[ 2360.849161]  [] ? btrfs_mksubvol+0x298/0x360 [btrfs]
[ 2360.849164]  [] ? wake_up_bit

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Liu Bo
On 08/02/2012 07:40 PM, Arne Jansen wrote:
> On 02.08.2012 13:34, Liu Bo wrote:
>> On 08/02/2012 07:18 PM, Arne Jansen wrote:
>>> On 02.08.2012 12:36, Liu Bo wrote:
 On 08/02/2012 06:30 PM, Stefan Behrens wrote:
> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
>>> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
 On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
>> With commit acce952b0, btrfs was changed to flag the filesystem with
>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
>> error happened like a write I/O errors of all mirrors.
>> In such situations, on unmount, the superblock is written in
>> btrfs_error_commit_super(). This is done with the intention to be 
>> able
>> to evaluate the error flag on the next mount. A warning is printed
>> in this case during the next mount and the log tree is ignored.
>>
>> The issue is that it is possible that the superblock points to a root
>> that was not written (due to write I/O errors).
>> The result is that the filesystem cannot be mounted. btrfsck also 
>> does
>> not start and all the other btrfs-progs tools fail to start as well.
>> However, mount -o recovery is working well and does the right things
>> to recover the filesystem (i.e., don't use the log root, clear the
>> free space cache and use the next mountable root that is stored in 
>> the
>> root backup array).
>>
>> This patch removes the writing of the superblock when
>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
>> flag in the mount function.
>>
>
> Yes, I have to admit that this can be a serious problem.
>
> But we'll need to send the error flag stored in the super block into
> disk in the future so that the next mount can find it unstable and do
> fsck by itself maybe.

 Hum, that's possible. However, I neither see

 a) a safe way to get that flag to disk

 nor

 b) a situation where this flag would help. When we abort a 
 transaction, we just
 roll everything back to the last commit, i.e. a consistent state. So 
 if we stop
 writing a potentially corrupt super block, we should be fine anyway. 
 Or am I
 missing something?

>>>
>>> I'm just wondering if we can roll everything back well, why do we need 
>>> fsck?
>>
>> If the disks support barriers, we roll everything back very well. The
>> most recent superblock on the disks always defines a consistent
>> filesystem state. There are only two remaining filesystem consistency
>> issues left that can cause inconsistent states, one is the one that the
>> patch in this email addresses, and the second one is that the error
>> result from barrier_all_devices() is ignored (which I want to change 
>> next).
>
> Hi Liu Bo,
>
> Do you have any remaining objections to that patch?
>

 Hi Stefan,

 Still I have another question:

 Our metadata can be flushed into disk if we reach the limit, 32k, so we
 can end up with updated metadata and the latest superblock if we do not
 write the current super block.
>>>
>>> The old metadata stays valid until the new superblock is written,
>>> so no problem here, or maybe I don't understand your question :)
>>>
>>
>> Yeah, Arne, you're right :)
>>
>> But for undetected and unexpected errors as Arne had mentioned,  I want
>> to keep the error flag which is able to inform users that this FS is
>> recommended (but not must) to do fsck at least.
> 
> How about storing the flag in a different location than the superblock?
> If the fs is in an unknown state, every write potentially makes it only
> worse.
> 

IMO it does not make sense if we don't write the flag into disk, and on
ext4's side, it just tries to write the super block.

Anyway, for now, our error flag has only been stored in memory, so what
about just keep it until we find a graceful way?


thanks,
liubo


>>
>> thanks,
>> liubo
>>

 Any ideas?

 thanks,
 liubo
 --
 To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.o

Re: [josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function

2012-08-02 Thread Josef Bacik
On Wed, Aug 01, 2012 at 10:29:21PM -0600, Miao Xie wrote:
> On Thu, 2 Aug 2012 10:31:23 +0800, Fengguang Wu wrote:
> > There are new compile warnings show up in
> > 
> > tree:   git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git 
> > master
> > head:   8e1163044779e90662e96887cdd692c1594146ad
> > commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error 
> > path in create_pending_snapshot()
> > config: x86_64-randconfig-h031 (attached as .config)
> > 
> > All error/warnings:
> > 
> > fs/btrfs/transaction.c: In function 'create_pending_snapshot':
> > fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized 
> > in this function [-Wmaybe-uninitialized]
> > fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in 
> > this function [-Wmaybe-uninitialized]
> > 
> > vim +1118 fs/btrfs/transaction.c
> >   1115  if (ret)
> >   1116  goto abort_trans;
> >   1117  fail:
> >> 1118   dput(parent);
> >   1119  trans->block_rsv = rsv;
> >   1120  no_free_objectid:
> >   1121  kfree(new_root_item);
> 
> The patch I sent before is based on the v3.5.0, not Chris's latest for-linus 
> branch, so...
> 
> I will send new patches to fix it. Thanks for your report.
> 

I've pulled these patches, Dave still has concerns about corruption.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: barrier before waitqueue_active

2012-08-02 Thread Josef Bacik
On Thu, Aug 02, 2012 at 04:46:44AM -0600, Liu Bo wrote:
> On 08/02/2012 04:25 AM, Josef Bacik wrote:
> > We need an smb_mb() before waitqueue_active to avoid missing wakeups.
> > Before Mitch was hitting a deadlock between the ordered flushers and the
> > transaction commit because the ordered flushers were waiting for more refs
> > and were never woken up, so those smp_mb()'s are the most important.
> > Everything else I added for correctness sake and to avoid getting bitten by
> > this again somewhere else.  Thanks,
> > 
> 
> Hi Josef,
> 
> I'll appreciate a lot if you can add some comments for each memory
> barrier, because not everyone knows why it is used here and there. :)
> 

I'm not going to add comments to all those places, you need a memory barrier in
places you don't have an implicit barrier before you do waitqueue_active because
you could miss somebody being added to the waitqueue, it's just basic
correctness.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: barrier before waitqueue_active

2012-08-02 Thread cwillu
On Thu, Aug 2, 2012 at 4:46 AM, Liu Bo  wrote:
> On 08/02/2012 04:25 AM, Josef Bacik wrote:
>> We need an smb_mb() before waitqueue_active to avoid missing wakeups.
>> Before Mitch was hitting a deadlock between the ordered flushers and the
>> transaction commit because the ordered flushers were waiting for more refs
>> and were never woken up, so those smp_mb()'s are the most important.
>> Everything else I added for correctness sake and to avoid getting bitten by
>> this again somewhere else.  Thanks,
>>
>
> Hi Josef,
>
> I'll appreciate a lot if you can add some comments for each memory
> barrier, because not everyone knows why it is used here and there. :)

Everyone who wants to know should read the memory-barriers.txt file
that's hiding in the oddly named "Documentation" folder of their
kernel tree.  :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] Btrfs: barrier before waitqueue_active

2012-08-02 Thread David Sterba
On Thu, Aug 02, 2012 at 08:11:58AM -0400, Josef Bacik wrote:
> On Thu, Aug 02, 2012 at 04:46:44AM -0600, Liu Bo wrote:
> > On 08/02/2012 04:25 AM, Josef Bacik wrote:
> > > We need an smb_mb() before waitqueue_active to avoid missing wakeups.
> > > Before Mitch was hitting a deadlock between the ordered flushers and the
> > > transaction commit because the ordered flushers were waiting for more refs
> > > and were never woken up, so those smp_mb()'s are the most important.
> > > Everything else I added for correctness sake and to avoid getting bitten 
> > > by
> > > this again somewhere else.  Thanks,
> > 
> > I'll appreciate a lot if you can add some comments for each memory
> > barrier, because not everyone knows why it is used here and there. :)
> 
> I'm not going to add comments to all those places, you need a memory barrier 
> in
> places you don't have an implicit barrier before you do waitqueue_active 
> because
> you could miss somebody being added to the waitqueue, it's just basic
> correctness.  Thanks,

This asks for a helper:

+   smp_mb();
+   if (waitqueue_active(&fs_info->async_submit_wait))
+   wake_up(&fs_info->async_submit_wait);

->

void wake_up_if_active(wait) {
/*
 * the comment
 */
smp_mb();
if(waitqueue_active(wait)
wake_up(wait);
}
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)

2012-08-02 Thread David Sterba
On Wed, Aug 01, 2012 at 09:48:15PM +0200, Olivier Bonvalet wrote:
> I have some trouble with a btrfs filesystem.
> As you can see in logs, there is lines which are from btrfs (I
> supposed), then some warnings at fs/btrfs/extent-tree.c, and finally a
> "kernel BUG" at fs/btrfs/extent-tree.c:5038.

Did you really see this BUG ? It comes from some unused and disabled code:

5036 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0
5037 if (item_size < sizeof(*ei)) {
5038 BUG_ON(found_extent || extent_slot != path->slots[0]);


5039 ret = convert_extent_item_v0(trans, extent_root, path,
5040  owner_objectid, 0);
5041 if (ret < 0)
5042 goto abort;
5043
5044 btrfs_release_path(path);
5045 path->leave_spinning = 1;
5046
5047 key.objectid = bytenr;
5048 key.type = BTRFS_EXTENT_ITEM_KEY;
5049 key.offset = num_bytes;
5050
5051 ret = btrfs_search_slot(trans, extent_root, &key, path,
5052 -1, 1);
5053 if (ret) {
5054 printk(KERN_ERR "umm, got %d back from search"
5055", was looking for %llu\n", ret,
5056(unsigned long long)bytenr);
5057 btrfs_print_leaf(extent_root, path->nodes[0]);
5058 }
5059 if (ret < 0)
5060 goto abort;
5061 extent_slot = path->slots[0];
5062 leaf = path->nodes[0];
5063 item_size = btrfs_item_size_nr(leaf, extent_slot);
5064 }
5065 #endif


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: barrier before waitqueue_active V2

2012-08-02 Thread Josef Bacik
We need a barrir before calling waitqueue_active otherwise we will miss
wakeups.  So in places that do atomic_dec(); then atomic_read() use
atomic_dec_return() which imply a memory barrier (see memory-barriers.txt)
and then add an explicit memory barrier everywhere else that need them.
Thanks,

Signed-off-by: Josef Bacik 
---
V1->V2: changed atomic_dec/atomic_read combos to atomic_dec_return() for the
implied barrier.
 fs/btrfs/compression.c   |1 +
 fs/btrfs/delayed-inode.c |7 +++
 fs/btrfs/delayed-ref.c   |   18 --
 fs/btrfs/disk-io.c   |7 ---
 fs/btrfs/inode.c |4 +---
 fs/btrfs/volumes.c   |3 +--
 6 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c
index 86eff48..43d1c5a 100644
--- a/fs/btrfs/compression.c
+++ b/fs/btrfs/compression.c
@@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head 
*workspace)
btrfs_compress_op[idx]->free_workspace(workspace);
atomic_dec(alloc_workspace);
 wake:
+   smp_mb();
if (waitqueue_active(workspace_wait))
wake_up(workspace_wait);
 }
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 335605c..ee2050c 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -512,8 +512,8 @@ static void __btrfs_remove_delayed_item(struct 
btrfs_delayed_item *delayed_item)
 
rb_erase(&delayed_item->rb_node, root);
delayed_item->delayed_node->count--;
-   atomic_dec(&delayed_root->items);
-   if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND &&
+   if (atomic_dec_return(&delayed_root->items) <
+   BTRFS_DELAYED_BACKGROUND &&
waitqueue_active(&delayed_root->wait))
wake_up(&delayed_root->wait);
 }
@@ -1055,8 +1055,7 @@ static void btrfs_release_delayed_inode(struct 
btrfs_delayed_node *delayed_node)
delayed_node->count--;
 
delayed_root = delayed_node->root->fs_info->delayed_root;
-   atomic_dec(&delayed_root->items);
-   if (atomic_read(&delayed_root->items) <
+   if (atomic_dec_return(&delayed_root->items) <
BTRFS_DELAYED_BACKGROUND &&
waitqueue_active(&delayed_root->wait))
wake_up(&delayed_root->wait);
diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index da7419e..858ef02 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info 
*fs_info,
add_delayed_tree_ref(fs_info, trans, &ref->node, bytenr,
   num_bytes, parent, ref_root, level, action,
   for_cow);
-   if (!need_ref_seq(for_cow, ref_root) &&
-   waitqueue_active(&fs_info->tree_mod_seq_wait))
-   wake_up(&fs_info->tree_mod_seq_wait);
+   if (!need_ref_seq(for_cow, ref_root)) {
+   smp_mb();
+   if (waitqueue_active(&fs_info->tree_mod_seq_wait))
+   wake_up(&fs_info->tree_mod_seq_wait);
+   }
+
spin_unlock(&delayed_refs->lock);
if (need_ref_seq(for_cow, ref_root))
btrfs_qgroup_record_ref(trans, &ref->node, extent_op);
@@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info 
*fs_info,
add_delayed_data_ref(fs_info, trans, &ref->node, bytenr,
   num_bytes, parent, ref_root, owner, offset,
   action, for_cow);
-   if (!need_ref_seq(for_cow, ref_root) &&
-   waitqueue_active(&fs_info->tree_mod_seq_wait))
-   wake_up(&fs_info->tree_mod_seq_wait);
+   if (!need_ref_seq(for_cow, ref_root)) {
+   smp_mb();
+   if (waitqueue_active(&fs_info->tree_mod_seq_wait))
+   wake_up(&fs_info->tree_mod_seq_wait);
+   }
spin_unlock(&delayed_refs->lock);
if (need_ref_seq(for_cow, ref_root))
btrfs_qgroup_record_ref(trans, &ref->node, extent_op);
@@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info 
*fs_info,
   num_bytes, BTRFS_UPDATE_DELAYED_HEAD,
   extent_op->is_data);
 
+   smp_mb();
if (waitqueue_active(&fs_info->tree_mod_seq_wait))
wake_up(&fs_info->tree_mod_seq_wait);
spin_unlock(&delayed_refs->lock);
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 502b20c..2ed81c8 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -754,9 +754,7 @@ static void run_one_async_done(struct btrfs_work *work)
limit = btrfs_async_submit_limit(fs_info);
limit = limit * 2 / 3;
 
-   atomic_dec(&fs_info->nr_async_submits);
-
-   if (atomic_read(&fs_info->nr_async_submits) < limit &&
+   if (atomic_dec_return(&fs_info->nr_async_submits

Re: BTRFS crash on mount with 3.4.4

2012-08-02 Thread David Sterba
On Tue, Jul 31, 2012 at 06:01:04PM -0700, Marc MERLIN wrote:
> On Tue, Jul 31, 2012 at 11:04:12AM -0700, Marc MERLIN wrote:
> > My kernel crashed for some other reason, and now I can't mount my btrfs
> > filesystem.
> > I don't care about the data, it's backed up.
> > 
> > I'll compile a 3.5 kernel, but is there any info you'd like off that
> > filesystem to see why btrfs is crashing on mount?
> 
> On the plus side, 3.5 mounted the filesystem:
> [  183.069179] device label btrfs_pool1 devid 1 transid 20769 /dev/mapper/test
> [  183.087805] btrfs: disk space caching is enabled
> [  183.224784] btrfs: no dev_stats entry found for device /dev/mapper/test 
> (devid 1) (OK on first mount after mkfs)
> [  186.277795] Btrfs detected SSD devices, enabling SSD mode
> [  188.477443] btrfs: unlinked 6 orphans
> [  188.477451] btrfs: truncated 1 orphans
> [  411.837015] btrfs: unlinked 13 orphans
> [  414.373500] btrfs: unlinked 35 orphans

there's this commit in 3.5 that looks like it fixed the problem:

Author: Josef Bacik 
AuthorDate: Wed Jun 27 15:10:56 2012 -0400

Btrfs: fix tree log remove space corner case

The tree log stuff can have allocated space that we end up having split
across a bitmap and a real extent.  The free space code does not deal with
this, it assumes that if it finds an extent or bitmap entry that the entire
range must fall within the entry it finds.  This isn't necessarily the case,
so rework the remove function so it can handle this case properly.  This
fixed two panics the user hit, first in the case where the space was
initially in a bitmap and then in an extent entry, and then the reverse
case.  Thanks,
---

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)

2012-08-02 Thread Olivier Bonvalet

On 02/08/2012 15:22, David Sterba wrote:

On Wed, Aug 01, 2012 at 09:48:15PM +0200, Olivier Bonvalet wrote:

I have some trouble with a btrfs filesystem.
As you can see in logs, there is lines which are from btrfs (I
supposed), then some warnings at fs/btrfs/extent-tree.c, and finally a
"kernel BUG" at fs/btrfs/extent-tree.c:5038.


Did you really see this BUG ? It comes from some unused and disabled code:

5036 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0
5037 if (item_size<  sizeof(*ei)) {
5038 BUG_ON(found_extent || extent_slot != path->slots[0]);


5039 ret = convert_extent_item_v0(trans, extent_root, path,
5040  owner_objectid, 0);
5041 if (ret<  0)
5042 goto abort;
5043
5044 btrfs_release_path(path);
5045 path->leave_spinning = 1;
5046
5047 key.objectid = bytenr;
5048 key.type = BTRFS_EXTENT_ITEM_KEY;
5049 key.offset = num_bytes;
5050
5051 ret = btrfs_search_slot(trans, extent_root,&key, path,
5052 -1, 1);
5053 if (ret) {
5054 printk(KERN_ERR "umm, got %d back from search"
5055", was looking for %llu\n", ret,
5056(unsigned long long)bytenr);
5057 btrfs_print_leaf(extent_root, path->nodes[0]);
5058 }
5059 if (ret<  0)
5060 goto abort;
5061 extent_slot = path->slots[0];
5062 leaf = path->nodes[0];
5063 item_size = btrfs_item_size_nr(leaf, extent_slot);
5064 }
5065 #endif


david



Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Arne Jansen
On 02.08.2012 13:57, Liu Bo wrote:
> On 08/02/2012 07:40 PM, Arne Jansen wrote:
>> On 02.08.2012 13:34, Liu Bo wrote:
>>> On 08/02/2012 07:18 PM, Arne Jansen wrote:
 On 02.08.2012 12:36, Liu Bo wrote:
> On 08/02/2012 06:30 PM, Stefan Behrens wrote:
>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
>>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
 On 08/01/2012 09:07 PM, Jan Schmidt wrote:
> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
>> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
>>> With commit acce952b0, btrfs was changed to flag the filesystem with
>>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
>>> error happened like a write I/O errors of all mirrors.
>>> In such situations, on unmount, the superblock is written in
>>> btrfs_error_commit_super(). This is done with the intention to be 
>>> able
>>> to evaluate the error flag on the next mount. A warning is printed
>>> in this case during the next mount and the log tree is ignored.
>>>
>>> The issue is that it is possible that the superblock points to a 
>>> root
>>> that was not written (due to write I/O errors).
>>> The result is that the filesystem cannot be mounted. btrfsck also 
>>> does
>>> not start and all the other btrfs-progs tools fail to start as well.
>>> However, mount -o recovery is working well and does the right things
>>> to recover the filesystem (i.e., don't use the log root, clear the
>>> free space cache and use the next mountable root that is stored in 
>>> the
>>> root backup array).
>>>
>>> This patch removes the writing of the superblock when
>>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error
>>> flag in the mount function.
>>>
>>
>> Yes, I have to admit that this can be a serious problem.
>>
>> But we'll need to send the error flag stored in the super block into
>> disk in the future so that the next mount can find it unstable and do
>> fsck by itself maybe.
>
> Hum, that's possible. However, I neither see
>
> a) a safe way to get that flag to disk
>
> nor
>
> b) a situation where this flag would help. When we abort a 
> transaction, we just
> roll everything back to the last commit, i.e. a consistent state. So 
> if we stop
> writing a potentially corrupt super block, we should be fine anyway. 
> Or am I
> missing something?
>

 I'm just wondering if we can roll everything back well, why do we need 
 fsck?
>>>
>>> If the disks support barriers, we roll everything back very well. The
>>> most recent superblock on the disks always defines a consistent
>>> filesystem state. There are only two remaining filesystem consistency
>>> issues left that can cause inconsistent states, one is the one that the
>>> patch in this email addresses, and the second one is that the error
>>> result from barrier_all_devices() is ignored (which I want to change 
>>> next).
>>
>> Hi Liu Bo,
>>
>> Do you have any remaining objections to that patch?
>>
>
> Hi Stefan,
>
> Still I have another question:
>
> Our metadata can be flushed into disk if we reach the limit, 32k, so we
> can end up with updated metadata and the latest superblock if we do not
> write the current super block.

 The old metadata stays valid until the new superblock is written,
 so no problem here, or maybe I don't understand your question :)

>>>
>>> Yeah, Arne, you're right :)
>>>
>>> But for undetected and unexpected errors as Arne had mentioned,  I want
>>> to keep the error flag which is able to inform users that this FS is
>>> recommended (but not must) to do fsck at least.
>>
>> How about storing the flag in a different location than the superblock?
>> If the fs is in an unknown state, every write potentially makes it only
>> worse.
>>
> 
> IMO it does not make sense if we don't write the flag into disk, and on
> ext4's side, it just tries to write the super block.
> 
> Anyway, for now, our error flag has only been stored in memory, so what
> about just keep it until we find a graceful way?

Yeah, we need this patch to restore consistency. We can define a fixed
area on disk (e.g. behind the superblock) where we can write the flag
to without risking the superblock.

> 
> 
> thanks,
> liubo
> 
> 
>>>
>>> thanks,
>>> liubo
>>>
>
> Any ideas?
>
> thanks,
> liubo
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordo

Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)

2012-08-02 Thread David Sterba
On Thu, Aug 02, 2012 at 03:41:03PM +0200, Olivier Bonvalet wrote:
> Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ?

I was mistaken, it really is enabled unconditionally in ctree.h:55.
Josef says that the V0 extent refs are not used for a long time, so the
question is how did they appear in your filesystem. Did you switch to a
new kernel from a very old one?


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread David Sterba
On Thu, Aug 02, 2012 at 03:46:50PM +0200, Arne Jansen wrote:
> > Anyway, for now, our error flag has only been stored in memory, so what
> > about just keep it until we find a graceful way?
> 
> Yeah, we need this patch to restore consistency. We can define a fixed
> area on disk (e.g. behind the superblock) where we can write the flag
> to without risking the superblock.

Is it possible that sectors around the superblock are somehow damaged
and unwritable that writing to them will fail as well? Ie. some safe
distance from the superblock would be needed.

david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Arne Jansen
On 02.08.2012 15:57, David Sterba wrote:
> On Thu, Aug 02, 2012 at 03:46:50PM +0200, Arne Jansen wrote:
>>> Anyway, for now, our error flag has only been stored in memory, so what
>>> about just keep it until we find a graceful way?
>>
>> Yeah, we need this patch to restore consistency. We can define a fixed
>> area on disk (e.g. behind the superblock) where we can write the flag
>> to without risking the superblock.
> 
> Is it possible that sectors around the superblock are somehow damaged
> and unwritable that writing to them will fail as well? Ie. some safe
> distance from the superblock would be needed.

In about 1MB distance you're side-by-side with the superblock, so closer
to it might even be safer ;)

> 
> david

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: do not use missing devices when showing devname

2012-08-02 Thread Josef Bacik
If you do the following

mkfs.btrfs /dev/sdb /dev/sdc
rmmod btrfs
dd if=/dev/zero of=/dev/sdb bs=1M count=1
mount -o degraded /dev/sdc /mnt/btrfs-test

the box will panic trying to deref the name for the missing dev since it is
the lower numbered devid.  So fix show_devname to not use missing devices.
Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/super.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index fa61ef5..ffcfe49 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -1505,6 +1505,8 @@ static int btrfs_show_devname(struct seq_file *m, struct 
dentry *root)
while (cur_devices) {
head = &cur_devices->devices;
list_for_each_entry(dev, head, dev_list) {
+   if (dev->missing)
+   continue;
if (!first_dev || dev->devid < first_dev->devid)
first_dev = dev;
}
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: do not strdup non existent strings

2012-08-02 Thread Josef Bacik
When we close devices we add back empty devices for some reason that escapes
me.  In the case of a missing dev we don't allocate an rcu_string for it's
name, so check to see if the device has a name and if it doesn't don't
bother strdup()'ing it.  Thanks,

Signed-off-by: Josef Bacik 
---
 fs/btrfs/volumes.c |8 +---
 1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index 3821302..0b1e69d 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -568,9 +568,11 @@ static int __btrfs_close_devices(struct btrfs_fs_devices 
*fs_devices)
memcpy(new_device, device, sizeof(*new_device));
 
/* Safe because we are under uuid_mutex */
-   name = rcu_string_strdup(device->name->str, GFP_NOFS);
-   BUG_ON(device->name && !name); /* -ENOMEM */
-   rcu_assign_pointer(new_device->name, name);
+   if (device->name) {
+   name = rcu_string_strdup(device->name->str, GFP_NOFS);
+   BUG_ON(device->name && !name); /* -ENOMEM */
+   rcu_assign_pointer(new_device->name, name);
+   }
new_device->bdev = NULL;
new_device->writeable = 0;
new_device->in_fs_metadata = 0;
-- 
1.7.7.6

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BTRFS crash on mount with 3.4.4

2012-08-02 Thread Marc MERLIN
On Thu, Aug 02, 2012 at 03:40:03PM +0200, David Sterba wrote:
> On Tue, Jul 31, 2012 at 06:01:04PM -0700, Marc MERLIN wrote:
> > On Tue, Jul 31, 2012 at 11:04:12AM -0700, Marc MERLIN wrote:
> > > My kernel crashed for some other reason, and now I can't mount my btrfs
> > > filesystem.
> > > I don't care about the data, it's backed up.
> > > 
> > > I'll compile a 3.5 kernel, but is there any info you'd like off that
> > > filesystem to see why btrfs is crashing on mount?
> > 
> > On the plus side, 3.5 mounted the filesystem:
> > [  183.069179] device label btrfs_pool1 devid 1 transid 20769 
> > /dev/mapper/test
> > [  183.087805] btrfs: disk space caching is enabled
> > [  183.224784] btrfs: no dev_stats entry found for device /dev/mapper/test 
> > (devid 1) (OK on first mount after mkfs)
> > [  186.277795] Btrfs detected SSD devices, enabling SSD mode
> > [  188.477443] btrfs: unlinked 6 orphans
> > [  188.477451] btrfs: truncated 1 orphans
> > [  411.837015] btrfs: unlinked 13 orphans
> > [  414.373500] btrfs: unlinked 35 orphans
> 
> there's this commit in 3.5 that looks like it fixed the problem:
> 
> Author: Josef Bacik 
> AuthorDate: Wed Jun 27 15:10:56 2012 -0400
> 
> Btrfs: fix tree log remove space corner case
> 
> The tree log stuff can have allocated space that we end up having split
> across a bitmap and a real extent.  The free space code does not deal with
> this, it assumes that if it finds an extent or bitmap entry that the 
> entire
> range must fall within the entry it finds.  This isn't necessarily the 
> case,
> so rework the remove function so it can handle this case properly.  This
> fixed two panics the user hit, first in the case where the space was
> initially in a bitmap and then in an extent entry, and then the reverse
> case.  Thanks,

Indeed. Thanks for finding this David and thanks for writing this Josef :)

Marc
-- 
"A mouse is a device used to point at the xterm you want to type in" - A.S.R.
Microsoft is to operating systems 
   what McDonalds is to gourmet cooking
Home page: http://marc.merlins.org/  
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Jan Schmidt
On Thu, August 02, 2012 at 15:46 (+0200), Arne Jansen wrote:
> On 02.08.2012 13:57, Liu Bo wrote:
>> Anyway, for now, our error flag has only been stored in memory, so what
>> about just keep it until we find a graceful way?
> 
> Yeah, we need this patch to restore consistency. We can define a fixed
> area on disk (e.g. behind the superblock) where we can write the flag
> to without risking the superblock.

At least we all agree that we need this patch, fine.

We don't yet agree that we need a place to store a "please consider fsck" flag.
Can I please get one concrete example in which situation we

a) do detect the user should really do a file system check AND
b) do not abort the transaction to clean the mess up?

(An example on how we could fail transaction cleanup is also accepted).

If such a situation doesn't exist, there's no need for this flag. The fact that
ext has such a flag doesn't convince me, probably because I know nothing about
ext. I can imagine that they can detect file system errors without the ability
to return to a potentially older consistent state.

Thanks,
-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread cwillu
On Thu, Aug 2, 2012 at 7:46 AM, Arne Jansen  wrote:
> On 02.08.2012 13:57, Liu Bo wrote:
>> On 08/02/2012 07:40 PM, Arne Jansen wrote:
>>> On 02.08.2012 13:34, Liu Bo wrote:
 On 08/02/2012 07:18 PM, Arne Jansen wrote:
> On 02.08.2012 12:36, Liu Bo wrote:
>> On 08/02/2012 06:30 PM, Stefan Behrens wrote:
>>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote:
 On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote:
> On 08/01/2012 09:07 PM, Jan Schmidt wrote:
>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote:
>>> On 08/01/2012 07:45 PM, Stefan Behrens wrote:
 With commit acce952b0, btrfs was changed to flag the filesystem 
 with
 BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal
 error happened like a write I/O errors of all mirrors.
 In such situations, on unmount, the superblock is written in
 btrfs_error_commit_super(). This is done with the intention to be 
 able
 to evaluate the error flag on the next mount. A warning is printed
 in this case during the next mount and the log tree is ignored.

 The issue is that it is possible that the superblock points to a 
 root
 that was not written (due to write I/O errors).
 The result is that the filesystem cannot be mounted. btrfsck also 
 does
 not start and all the other btrfs-progs tools fail to start as 
 well.
 However, mount -o recovery is working well and does the right 
 things
 to recover the filesystem (i.e., don't use the log root, clear the
 free space cache and use the next mountable root that is stored in 
 the
 root backup array).

 This patch removes the writing of the superblock when
 BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the 
 error
 flag in the mount function.

>>>
>>> Yes, I have to admit that this can be a serious problem.
>>>
>>> But we'll need to send the error flag stored in the super block into
>>> disk in the future so that the next mount can find it unstable and 
>>> do
>>> fsck by itself maybe.
>>
>> Hum, that's possible. However, I neither see
>>
>> a) a safe way to get that flag to disk
>>
>> nor
>>
>> b) a situation where this flag would help. When we abort a 
>> transaction, we just
>> roll everything back to the last commit, i.e. a consistent state. So 
>> if we stop
>> writing a potentially corrupt super block, we should be fine anyway. 
>> Or am I
>> missing something?
>>
>
> I'm just wondering if we can roll everything back well, why do we 
> need fsck?

 If the disks support barriers, we roll everything back very well. The
 most recent superblock on the disks always defines a consistent
 filesystem state. There are only two remaining filesystem consistency
 issues left that can cause inconsistent states, one is the one that the
 patch in this email addresses, and the second one is that the error
 result from barrier_all_devices() is ignored (which I want to change 
 next).
>>>
>>> Hi Liu Bo,
>>>
>>> Do you have any remaining objections to that patch?
>>>
>>
>> Hi Stefan,
>>
>> Still I have another question:
>>
>> Our metadata can be flushed into disk if we reach the limit, 32k, so we
>> can end up with updated metadata and the latest superblock if we do not
>> write the current super block.
>
> The old metadata stays valid until the new superblock is written,
> so no problem here, or maybe I don't understand your question :)
>

 Yeah, Arne, you're right :)

 But for undetected and unexpected errors as Arne had mentioned,  I want
 to keep the error flag which is able to inform users that this FS is
 recommended (but not must) to do fsck at least.
>>>
>>> How about storing the flag in a different location than the superblock?
>>> If the fs is in an unknown state, every write potentially makes it only
>>> worse.
>>>
>>
>> IMO it does not make sense if we don't write the flag into disk, and on
>> ext4's side, it just tries to write the super block.
>>
>> Anyway, for now, our error flag has only been stored in memory, so what
>> about just keep it until we find a graceful way?
>
> Yeah, we need this patch to restore consistency. We can define a fixed
> area on disk (e.g. behind the superblock) where we can write the flag
> to without risking the superblock.

Is there a reason btrfs_error_commit_super couldn't do the as treelog:
update only the first

Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)

2012-08-02 Thread Olivier Bonvalet

On 02/08/2012 15:53, David Sterba wrote:

On Thu, Aug 02, 2012 at 03:41:03PM +0200, Olivier Bonvalet wrote:

Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ?


I was mistaken, it really is enabled unconditionally in ctree.h:55.
Josef says that the V0 extent refs are not used for a long time, so the
question is how did they appear in your filesystem. Did you switch to a
new kernel from a very old one?


david
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html



mmm I have upgraded that system every two months, but the first setup is 
from april 2011 I think.


Should I start a "scrub" on all that systems installed a long time ago ?
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Marc MERLIN
On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > I've the the fio tests in:
> > /dev/mapper/cryptroot /var btrfs 
> > rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> 
> … you are still using dm_crypt?
 
That was my biggest partition and so far I've found no performance impact
on file access between unencrypted and dm_crypt.
I just took out my swap partition and made a smaller btrfs there:
/dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0

I mounted without discard.

> > lat (usec) : 50=0.01%
> > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89%
> > lat (msec) : 500=72.44%, 750=14.43%
> 
> Gosh, look at these latencies!
> 
> 72,44% of all requests above 500 (in words: five hundred) milliseconds!
> And 14,43% above 750 msecs. The percentage of requests served at 100 msecs
> or less is was below one percent! Hey, is this an SSD or what?
 
Yeah, that's kind of what I've been complaining about since the beginning :)
Once I'm reading sequentially, it goes fast, but random access/latency is
indeed abysmal.

> Still even with iodepth 64 totally different picture. And look at the IOPS
> and throughput.
 
Yep. I know mine are bad :(

> For reference, this refers to
> 
> [global]
> ioengine=libaio
> direct=1
> iodepth=64

Since it's slightly different than the first job file you gave me, I re-ran
with this one this time.

gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
gandalfthegreat:~# grep sda2 /proc/mounts
/dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0

here's the btrfs test (ext4 is lower down):
gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64
zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, 
iodepth=64
sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, 
iodepth=64
2.0.8
Starting 4 processes
zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193  iops] [eta 02m:10s]   
  
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
  read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70
clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92
 lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00
clat percentiles (msec):
 |  1.00th=[  145],  5.00th=[  269], 10.00th=[  371], 20.00th=[  408],
 | 30.00th=[  424], 40.00th=[  437], 50.00th=[  449], 60.00th=[  457],
 | 70.00th=[  474], 80.00th=[  490], 90.00th=[  570], 95.00th=[  644],
 | 99.00th=[  865], 99.50th=[  922], 99.90th=[  963], 99.95th=[  979],
 | 99.99th=[  979]
bw (KB/s)  : min=8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57
lat (usec) : 20=0.01%
lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%
  cpu  : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=8262/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340
  read : io=2048.0MB, bw=211257KB/s, iops=23473 , runt=  9927msec
slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44
clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00
 lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09
clat percentiles (usec):
 |  1.00th=[1],  5.00th=[   10], 10.00th=[   30], 20.00th=[  100],
 | 30.00th=[  217], 40.00th=[  362], 50.00th=[  494], 60.00th=[  636],
 | 70.00th=[  892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632],
 | 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848],
 | 99.99th=[57600]
bw (KB/s)  : min=166675, max=260984, per=99.83%, avg=210892.26, 
stdev=22433.65
lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72%
lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44%
lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11%
lat (msec) : 100=0.04%
  cpu  : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=233025/w=0/d=0, short=r=0/w=0/d=0
zufälligschreiben: (

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Martin Steigerwald
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote:
> > > I've the the fio tests in:
> > > /dev/mapper/cryptroot /var btrfs 
> > > rw,noatime,compress=lzo,nossd,discard,space_cache 0 0
> > 
> > … you are still using dm_crypt?
[…]
> I just took out my swap partition and made a smaller btrfs there:
> /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0
> 
> I mounted without discard.
[…]
> > For reference, this refers to
> > 
> > [global]
> > ioengine=libaio
> > direct=1
> > iodepth=64
> 
> Since it's slightly different than the first job file you gave me, I re-ran
> with this one this time.
> 
> gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2
> gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2
> gandalfthegreat:~# grep sda2 /proc/mounts
> /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0
> 
> here's the btrfs test (ext4 is lower down):
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> 2.0.8

Still abysmal except for sequential reads.

> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193  iops] [eta 02m:10s] 
> 
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318
>   read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec
[…]
> lat (usec) : 20=0.01%
> lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82%
> lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58%

1 second latency? Oh well…


> Run status group 3 (all jobs):
>   WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, 
> mint=60005msec, maxt=60005msec
> gandalfthegreat:/mnt/mnt2#
> (no lines beyond this, fio 2.0.8)
[…]
> > Can you also post the last lines:
> > 
> > Disk stats (read/write):
> >   dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, 
> > in_queue=3867448, util=98.81%, aggrios=504829/504643, 
> > aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, 
> > aggrutil=98.78%
> > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, 
> > in_queue=3223048, util=98.78%
> > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba>
>  
> I didn't get these lines.

Hmmm, back then I guess this was fio 1.5.9 or so.

> I tried deadline and noop, and indeed I'm not see much of a difference for my 
> basic tests.
> Fro now I have deadline.
>  
> > So my recommendation of now:
> > 
> > Remove as much factors as possible and in order to compare results with
> > what I posted try with plain logical volume with Ext4.
> 
> gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 
> /dev/sda2
> /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0
> 
> gandalfthegreat:/mnt/mnt2# fio ~/fio.job2
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=64
> 2.0.8
> Starting 4 processes
> zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB)
> zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB)
> Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280  iops] [eta 01m:21s] 
>  
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077
>   read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt=  7592msec
> slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01
> clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63
>  lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36
> clat percentiles (usec):
>  |  1.00th=[  450],  5.00th=[  548], 10.00th=[  620], 20.00th=[  724],
>  | 30.00th=[  820], 40.00th=[  908], 50.00th=[ 1004], 60.00th=[ 1096],
>  | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040],
>  | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168],
>  | 99.99th=[33024]
> bw (KB/s)  : min=76463, max=385328, per=100.00%, avg=277313.20, 
> stdev=94661.29
> lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70%
> lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17%
> lat (msec) : 100=0.01%
> 

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Marc MERLIN
On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> Hey, whats this? With Ext4 you have really good random read performance
> now! Way better than the Intel SSD 320 and…

Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
Obviously it's sending IO in a way that either the IO subsystem, linux
driver, or drive prefer.

> The performance of your Samsung SSD seems to be quite erratic. It seems
> that the device is capable of being fast, but only sometimes shows this
> capability.
 
That's indeed exactly what I'm seeing in real life :)

> > > Have the IOPS run on the device it self. That will remove any filesystem
> > > layer. But only the read only tests, to make sure I suggest to use fio
> > > with the --readonly option as safety guard. Unless you have a spare SSD
> > > that you can afford to use for write testing which will likely destroy
> > > every filesystem on it. Or let it run on just one logical volume.
> >  
> > Can you send me a recommended job config you'd like me to run if the runs
> > above haven't already answered your questions?
 
> [global]
(...)

I used this and just changed filename to /dev/sda. Since I'm reading
from the beginning of the drive, reads have to be aligned.

> I won´t expect much of a difference, but then the random read performance
> is quite different between Ext4 and BTRFS on this disk. That would make
> it interesting to test without any filesystem in between and over the
> whole device.

Here is the output:
gandalfthegreat:~# fio --readonly ./fio.job3
zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0  iops] [eta 01m:00s] 
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
  read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec
slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
 lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
clat percentiles (usec):
 |  1.00th=[  126],  5.00th=[  235], 10.00th=[ 5216], 20.00th=[ 5920],
 | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
 | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
 | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
 | 99.99th=[27264]
bw (KB/s)  : min=  423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
lat (usec) : 1000=0.02%
lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
  cpu  : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=6532/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
  read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec
slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
 lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
clat percentiles (usec):
 |  1.00th=[   61],  5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
 | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
 | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536],
 | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568],
 | 99.99th=[34048]
bw (KB/s)  : min=  405, max= 2680, per=100.00%, avg=912.92, stdev=261.62
lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02%
lat (usec) : 1000=0.03%
lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36%
  cpu  : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=6101/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, 
maxt=60002msec

Run status group 1 (all jobs):
   READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, 
maxt=60002msec

Disk stats (read/write):
  sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, 
util=99.30%

> … or get yourself another SSD. Its your decision.
> 
> I admire your endurance. ;)

Since I've gotten 2 SSDs to make sure I didn't get one bad one, and that the
company is greating great reviews for them, I'm now pretty sure that
it's either a problem with a linux driver, which is interesting for us
all to debug :)

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Martin Steigerwald
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN:
> On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote:
> > Hey, whats this? With Ext4 you have really good random read performance
> > now! Way better than the Intel SSD 320 and…
> 
> Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> Obviously it's sending IO in a way that either the IO subsystem, linux
> driver, or drive prefer.

But only on reads.

> > > > Have the IOPS run on the device it self. That will remove any filesystem
> > > > layer. But only the read only tests, to make sure I suggest to use fio
> > > > with the --readonly option as safety guard. Unless you have a spare SSD
> > > > that you can afford to use for write testing which will likely destroy
> > > > every filesystem on it. Or let it run on just one logical volume.
> > >  
> > > Can you send me a recommended job config you'd like me to run if the runs
> > > above haven't already answered your questions?
>  
> > [global]
> (...)
> 
> I used this and just changed filename to /dev/sda. Since I'm reading
> from the beginning of the drive, reads have to be aligned.
> 
> > I won´t expect much of a difference, but then the random read performance
> > is quite different between Ext4 and BTRFS on this disk. That would make
> > it interesting to test without any filesystem in between and over the
> > whole device.
> 
> Here is the output:
> gandalfthegreat:~# fio --readonly ./fio.job3
> zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, 
> iodepth=1
> sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1
> 2.0.8
> Starting 2 processes
> Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0  iops] [eta 01m:00s] 
> zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172
>   read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec

WTF?

Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely
without it.

Also I noticed that I had iodepth 1 in there to circumvent any in drive
cache / optimization.

> slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64
> clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66
>  lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88
> clat percentiles (usec):
>  |  1.00th=[  126],  5.00th=[  235], 10.00th=[ 5216], 20.00th=[ 5920],
>  | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480],
>  | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768],
>  | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656],
>  | 99.99th=[27264]
> bw (KB/s)  : min=  423, max= 5776, per=100.00%, avg=986.48, stdev=480.68
> lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02%
> lat (usec) : 1000=0.02%
> lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18%
>   cpu  : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26
>   IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
>  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, 
> >=64=0.0%
>  issued: total=r=6532/w=0/d=0, short=r=0/w=0/d=0

Latency is still way to high even with iodepth 1, 10 milliseconds for
43% of requests. And the throughput and IOPS is still abysmal even
for iodepth 1 (see below for Intel SSD 320 values).

Okay, one further idea: Remove the bsrange to just test with 4k blocks.

Additionally test these are aligned with

   blockalign=int[,int], ba=int[,int]
  At what boundary to align random IO offsets. Defaults to
  the  same  as  'blocksize'  the minimum blocksize given.
  Minimum alignment is typically 512b for using direct IO,
  though  it  usually  depends on the hardware block size.
  This option is mutually exclusive with  using  a  random
  map for files, so it will turn off that option.

I would first test with 4k blocks as is. And then do something like:

blocksize=4k
blockalign=4k

And then raise

blockalign to some values that may matter like 8k, 128k, 512k, 1m or so.

But thats just guess work. I do not even exactly now if it works this
way in fio.

There is something pretty wierd going on. But I am not sure what it is.
Maybe an alignment issue, since Ext4 with stripe alignment was able to
do so much faster on reads.

> sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199
>   read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec

Ey, whats this?

> slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91
> clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32
>  lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06
> clat percentiles (usec):
>  |  1.00th=[   61],  5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920],
>  | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608],
>  | 70.00th=[12608], 80.00th=[12736], 90.00th=

Re: How can btrfs take 23sec to stat 23K files from an SSD?

2012-08-02 Thread Marc MERLIN
On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote:
> > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs.
> > Obviously it's sending IO in a way that either the IO subsystem, linux
> > driver, or drive prefer.
> 
> But only on reads.

Yes. I was focussing on the simple case. SSDs can be slower than hard
drives on writes, and now you're getting into deep firmware magic for
corner cases, so I didn't want to touch that here :)
 
> WTF?
> 
> Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely
> without it.

I left it as is:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
size=476940m
bsrange=2k-16k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

I just removed it now.

> Okay, one further idea: Remove the bsrange to just test with 4k blocks.

Ok, there you go:
[global]
ioengine=libaio
direct=1
filename=/dev/sda
blocksize=4k
blockalign=4k

[zufälliglesen]
rw=randread
runtime=60

[sequentielllesen]
stonewall
rw=read
runtime=60

gandalfthegreat:~# fio --readonly ./fio.job4
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614
  read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec
slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05
clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69
 lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99
clat percentiles (usec):
 |  1.00th=[   41],  5.00th=[   46], 10.00th=[   51], 20.00th=[   66],
 | 30.00th=[  113], 40.00th=[  163], 50.00th=[  205], 60.00th=[ 4960],
 | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608],
 | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120],
 | 99.99th=[22912]
bw (KB/s)  : min=  253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74
lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94%
lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09%
lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05%
  cpu  : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=15208/w=0/d=0, short=r=0/w=0/d=0
sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636
  read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec
slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23
clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76
 lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31
clat percentiles (usec):
 |  1.00th=[   27],  5.00th=[   29], 10.00th=[   31], 20.00th=[   35],
 | 30.00th=[   47], 40.00th=[   61], 50.00th=[   78], 60.00th=[  107],
 | 70.00th=[  270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608],
 | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584],
 | 99.99th=[21120]
bw (KB/s)  : min=  282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36
lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61%
lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03%
lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03%
  cpu  : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=22117/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, 
mint=60012msec, maxt=60012msec

Run status group 1 (all jobs):
   READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, 
mint=60009msec, maxt=60009msec

Disk stats (read/write):
  sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, 
util=99.51%


and the same test with blockalign=512k:
gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k
fio: Any use of blockalign= turns off randommap
zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1
2.0.8
Starting 2 processes
Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0  iops] [eta 01m:00s]
zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679
  read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec
slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51
clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32
 lat (usec): min=44 , max=38980 , avg=4

[current linus/master] deadlock during snapshot, sb_writers

2012-08-02 Thread David Sterba
Hi,

top commit (v3.5-9237) fc6bdb59a501740b28ed3b616641a22c8dc5dd31, observed with
btrfs-next/master as well.

mount:
/dev/sdb on /mnt/sdb type btrfs (rw,relatime,space_cache)

fresh mkfs, snapshot stresstest, stuck on first snapshot call afaics

[  113.751610] =
[  113.752178] [ INFO: possible recursive locking detected ]
[  113.752178] 3.5.0linus-default+ #174 Not tainted
[  113.752178] -
[  113.752178] btrfs/3125 is trying to acquire lock:
[  113.752178]  (sb_writers#10){.+.+.+}, at: [] 
mnt_want_write+0x24/0x50
[  113.752178]
[  113.752178] but task is already holding lock:
[  113.752178]  (sb_writers#10){.+.+.+}, at: [] 
mnt_want_write_file+0x28/0x60
[  113.752178]
[  113.752178] other info that might help us debug this:
[  113.752178]  Possible unsafe locking scenario:
[  113.752178]
[  113.752178]CPU0
[  113.752178]
[  113.752178]   lock(sb_writers#10);
[  113.752178]   lock(sb_writers#10);
[  113.752178]
[  113.752178]  *** DEADLOCK ***
[  113.752178]
[  113.856366]  May be due to missing lock nesting notation
[  113.856366]
[  113.856366] 1 lock held by btrfs/3125:
[  113.856366]  #0:  (sb_writers#10){.+.+.+}, at: [] 
mnt_want_write_file+0x28/0x60
[  113.856366]
[  113.856366] stack backtrace:
[  113.856366] Pid: 3125, comm: btrfs Not tainted 3.5.0linus-default+ #174
[  113.856366] Call Trace:
[  113.856366]  [] __lock_acquire+0x1663/0x1e90
[  113.856366]  [] ? trace_hardirqs_off_caller+0x29/0xc0
[  113.856366]  [] ? native_sched_clock+0x15/0x80
[  113.856366]  [] ? local_clock+0x6f/0x80
[  113.856366]  [] ? trace_hardirqs_off_caller+0x29/0xc0
[  113.856366]  [] ? mnt_want_write+0x24/0x50
[  113.856366]  [] lock_acquire+0x94/0x130
[  113.856366]  [] ? mnt_want_write+0x24/0x50
[  113.856366]  [] __sb_start_write+0x9c/0x1b0
[  113.856366]  [] ? mnt_want_write+0x24/0x50
[  113.856366]  [] ? native_sched_clock+0x15/0x80
[  113.856366]  [] ? mnt_want_write+0x24/0x50
[  113.856366]  [] ? trace_hardirqs_off_caller+0x29/0xc0
[  113.856366]  [] mnt_want_write+0x24/0x50
[  113.856366]  [] ? fget+0x84/0x100
[  113.856366]  [] btrfs_mksubvol+0x47/0x370 [btrfs]
[  113.856366]  [] btrfs_ioctl_snap_create_transid+0xfa/0x190 
[btrfs]
[  113.856366]  [] ? might_fault+0x9c/0xb0
[  113.856366]  [] ? might_fault+0x53/0xb0
[  113.856366]  [] btrfs_ioctl_snap_create_v2+0x106/0x140 
[btrfs]
[  113.856366]  [] ? lock_release_holdtime+0x3d/0x1c0
[  113.856366]  [] ? do_page_fault+0x2d0/0x580
[  113.856366]  [] btrfs_ioctl+0x558/0x19c0 [btrfs]
[  113.856366]  [] ? up_read+0x23/0x40
[  113.856366]  [] ? do_page_fault+0x2d0/0x580
[  113.856366]  [] ? _raw_spin_unlock_irq+0x30/0x50
[  113.856366]  [] ? finish_task_switch+0x83/0xf0
[  113.856366]  [] do_vfs_ioctl+0x98/0x560
[  113.856366]  [] ? retint_swapgs+0x13/0x1b
[  113.856366]  [] sys_ioctl+0x4f/0x80
[  113.856366]  [] system_call_fastpath+0x16/0x1b


tasks in D-state:

 3030 ?D  0:00 [btrfs-endio-wri]
[] btrfs_run_delayed_refs+0x402/0x560 [btrfs]
[] __btrfs_end_transaction+0xcf/0x320 [btrfs]
[] btrfs_end_transaction+0x15/0x20 [btrfs]
[] btrfs_finish_ordered_io+0x19b/0x410 [btrfs]
[] finish_ordered_fn+0x15/0x20 [btrfs]
[] worker_loop+0xc4/0x5a0 [btrfs]
[] kthread+0xae/0xc0
[] kernel_thread_helper+0x4/0x10
[] 0x
  PID TTY  STAT   TIME COMMAND
 3036 ?D  0:00 [btrfs-transacti]
[] btrfs_run_delayed_refs+0x402/0x560 [btrfs]
[] btrfs_commit_transaction+0x9d/0xb50 [btrfs]
[] transaction_kthread+0x1ae/0x240 [btrfs]
[] kthread+0xae/0xc0
[] kernel_thread_helper+0x4/0x10
[] 0x
  PID TTY  STAT   TIME COMMAND
 3104 pts/3D+ 0:06 tar x
[] wait_current_trans+0xc5/0x120 [btrfs]
[] start_transaction+0x14e/0x400 [btrfs]
[] btrfs_start_transaction+0x13/0x20 [btrfs]
[] btrfs_create+0x46/0x220 [btrfs]
[] vfs_create+0x89/0xc0
[] do_last+0x853/0xd10
[] path_openat+0xb6/0x4a0
[] do_filp_open+0x49/0xa0
[] do_sys_open+0x102/0x1e0
[] sys_openat+0x14/0x20
[] system_call_fastpath+0x16/0x1b
[] 0x
  PID TTY  STAT   TIME COMMAND
 3115 ?D  0:00 [btrfs-endio-wri]
[] btrfs_run_delayed_refs+0x402/0x560 [btrfs]
[] __btrfs_end_transaction+0xcf/0x320 [btrfs]
[] btrfs_end_transaction+0x15/0x20 [btrfs]
[] btrfs_finish_ordered_io+0x19b/0x410 [btrfs]
[] finish_ordered_fn+0x15/0x20 [btrfs]
[] worker_loop+0xc4/0x5a0 [btrfs]
[] kthread+0xae/0xc0
[] kernel_thread_helper+0x4/0x10
[] 0x
  PID TTY  STAT   TIME COMMAND
 3116 ?D  0:00 [btrfs-endio-wri]
[] btrfs_run_delayed_refs+0x402/0x560 [btrfs]
[] __btrfs_end_transaction+0xcf/0x320 [btrfs]
[] btrfs_end_transaction+0x15/0x20 [btrfs]
[] btrfs_finish_ordered_io+0x19b/0x410 [btrfs]
[] finish_ordered_fn+0x15/0x20 [btrfs]
[] worker_loop+0xc4/0x5a0 [btrfs]
[] kthread+0xae/0xc0
[] kernel_thread_helper+0x4/0x10
[] 0x
  PID TTY  STAT   TIME COMMAND
 3117 ?D  0:00 [btrfs-endio-wri]
[] btrfs_run_delayed_refs+0

Re: [current linus/master] deadlock during snapshot, sb_writers

2012-08-02 Thread Alexander Block
On Fri, Aug 3, 2012 at 12:16 AM, David Sterba  wrote:
> Hi,
>
> top commit (v3.5-9237) fc6bdb59a501740b28ed3b616641a22c8dc5dd31, observed with
> btrfs-next/master as well.
>
> mount:
> /dev/sdb on /mnt/sdb type btrfs (rw,relatime,space_cache)
>
> fresh mkfs, snapshot stresstest, stuck on first snapshot call afaics
>
The problem seems to be a merge error. Commit a874a63 removed the
mnt_want_write call from btrfs_mksubvol and added a replacement call
to mnt_want_write_file in btrfs_ioctl_snap_create_transid. Commit
e7848683 however tried to move all calls to mnt_want_write above
i_mutex. So somewhere while merging this, it got mixed up. The
solution is to remove the mnt_want_write call completely from
mksubvol. Will send a patch in a few minutes.
[...]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Btrfs: remove mnt_want_write call in btrfs_mksubvol

2012-08-02 Thread Alexander Block
We got a recursive lock in mksubvol because the caller already held
a lock. I think we got into this due to a merge error. Commit a874a63
removed the mnt_want_write call from btrfs_mksubvol and added a
replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid.
Commit e7848683 however tried to move all calls to mnt_want_write above
i_mutex. So somewhere while merging this, it got mixed up. The
solution is to remove the mnt_want_write call completely from
mksubvol.

Reported-by: David Sterba 
Signed-off-by: Alexander Block 
---
 fs/btrfs/ioctl.c |5 -
 1 file changed, 5 deletions(-)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 00ddf22..9df50fa 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -664,10 +664,6 @@ static noinline int btrfs_mksubvol(struct path *parent,
struct dentry *dentry;
int error;
 
-   error = mnt_want_write(parent->mnt);
-   if (error)
-   return error;
-
mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT);
 
dentry = lookup_one_len(name, parent->dentry, namelen);
@@ -703,7 +699,6 @@ out_dput:
dput(dentry);
 out_unlock:
mutex_unlock(&dir->i_mutex);
-   mnt_drop_write(parent->mnt);
return error;
 }
 
-- 
1.7.10.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH V3 2/2] Btrfs: fix the snapshot that should not exist

2012-08-02 Thread Miao Xie
On Thu, 2 Aug 2012 13:46:31 +0200, David Sterba wrote:
> Hi,
> 
> appologies for late reply,
> 
> On Thu, Aug 02, 2012 at 12:40:46PM +0800, Miao Xie wrote:
>> Changelog v1 -> v2:
>> - add comment to explain why we need deal with the delayed items after
>>   snapshot creation and why this operation do not corrupt the metadata.
> 
> I'm sorry, the comment did not fix the bug :)
> 
> The subvol stress is able to hit this:
> 
> [ 2360.444321] [ cut here ]
> [ 2360.448019] kernel BUG at fs/btrfs/extent-tree.c:6047!
> [ 2360.448019] invalid opcode:  [#1] SMP
> [ 2360.448019] CPU 0
> [ 2360.448019] Modules linked in: btrfs aoe [last unloaded: btrfs]
> [ 2360.448019]
> [ 2360.448019] Pid: 8212, comm: btrfs Not tainted 3.5.0-default+ #170 Intel 
> Corporation Santa Rosa platform/Matanzas
> [ 2360.448019] RIP: 0010:[]  [] 
> run_clustered_refs+0xa11/0xa20 [btrfs]
> [ 2360.448019] RSP: 0018:88003eca1a68  EFLAGS: 00010246
> [ 2360.448019] RAX: 07ff RBX: 880017a694c8 RCX: 
> 88003eca1a08
> [ 2360.448019] RDX: 880028aa9000 RSI: 07fe RDI: 
> 880064223cf0
> [ 2360.448019] RBP: 88003eca1b48 R08: 07ff R09: 
> 88003eca19f8
> [ 2360.448019] R10: 88002435d1e8 R11:  R12: 
> 880025d66d28
> [ 2360.448019] R13: 88003864 R14: 8800778dfa88 R15: 
> 880060f010d0
> [ 2360.448019] FS:  7f3289f35740() GS:88007dc0() 
> knlGS:
> [ 2360.448019] CS:  0010 DS:  ES:  CR0: 8005003b
> [ 2360.448019] CR2: ff600400 CR3: 2e112000 CR4: 
> 07f0
> [ 2360.448019] DR0:  DR1:  DR2: 
> 
> [ 2360.448019] DR3:  DR6: 0ff0 DR7: 
> 0400
> [ 2360.448019] Process btrfs (pid: 8212, threadinfo 88003eca, task 
> 88001d834200)
> [ 2360.448019] Stack:
> [ 2360.448019]    0001 
> 
> [ 2360.448019]  07ed 88002435d1e8 3eca1b18 
> 
> [ 2360.448019]  0770  5cb1e000 
> 88003eca1c08
> [ 2360.448019] Call Trace:
> [ 2360.448019]  [] btrfs_run_delayed_refs+0x1c9/0x550 
> [btrfs]
> [ 2360.448019]  [] ? trace_hardirqs_on_caller+0x155/0x1d0
> [ 2360.448019]  [] ? btrfs_free_path+0x2a/0x40 [btrfs]
> [ 2360.448019]  [] ? btrfs_run_delayed_items+0xf1/0x160 
> [btrfs]
> [ 2360.448019]  [] btrfs_commit_transaction+0x605/0xb00 
> [btrfs]
> [ 2360.448019]  [] ? lock_release_holdtime+0x3d/0x1c0
> [ 2360.448019]  [] ? btrfs_mksubvol+0x298/0x360 [btrfs]
> [ 2360.448019]  [] ? wake_up_bit+0x40/0x40
> [ 2360.448019]  [] ? do_raw_spin_unlock+0x5e/0xb0
> [ 2360.448019]  [] btrfs_mksubvol+0x358/0x360 [btrfs]
> [ 2360.448019]  [] 
> btrfs_ioctl_snap_create_transid+0x10a/0x190 [btrfs]
> [ 2360.448019]  [] 
> btrfs_ioctl_snap_create_v2.clone.0+0xfd/0x110 [btrfs]
> [ 2360.448019]  [] btrfs_ioctl+0x48e/0x1340 [btrfs]
> [ 2360.448019]  [] ? do_page_fault+0x2d0/0x580
> [ 2360.448019]  [] ? _raw_spin_unlock_irq+0x30/0x50
> [ 2360.448019]  [] ? finish_task_switch+0x83/0xf0
> [ 2360.448019]  [] do_vfs_ioctl+0x98/0x560
> [ 2360.448019]  [] ? retint_swapgs+0x13/0x1b
> [ 2360.448019]  [] sys_ioctl+0x4f/0x80
> [ 2360.448019]  [] system_call_fastpath+0x16/0x1b
> [ 2360.448019] Code: 8b 76 40 48 89 d7 48 89 55 a0 e8 2b 74 ff ff 83 f8 17 0f 
> 87 1e ff ff ff 0f 0b 80 fa b2 0f 84 b4 f8 ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 
> 0b 0f 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 57
> [ 2360.448019] RIP  [] run_clustered_refs+0xa11/0xa20 
> [btrfs]
> [ 2360.448019]  RSP 
> [ 2360.814508] ---[ end trace 555a16cac3620ccb ]---
> [ 2360.820398] note: btrfs[8212] exited with preempt_count 1
> [ 2360.827072] BUG: sleeping function called from invalid context at 
> kernel/rwsem.c:20
> [ 2360.836047] in_atomic(): 1, irqs_disabled(): 0, pid: 8212, name: btrfs
> [ 2360.843859] INFO: lockdep is turned off.
> [ 2360.849021] Pid: 8212, comm: btrfs Tainted: G  D  3.5.0-default+ 
> #170
> [ 2360.849022] Call Trace:
> [ 2360.849027]  [] __might_sleep+0xfc/0x130
> [ 2360.849030]  [] down_read+0x26/0xa0
> [ 2360.849034]  [] acct_collect+0x4b/0x1b0
> [ 2360.849038]  [] do_exit+0x718/0x9a0
> [ 2360.849041]  [] ? kmsg_dump+0x26/0x140
> [ 2360.849043]  [] oops_end+0xb0/0xf0
> [ 2360.849046]  [] die+0x5b/0x90
> [ 2360.849048]  [] do_trap+0xc4/0x170
> [ 2360.849052]  [] do_invalid_op+0x95/0xb0
> [ 2360.849067]  [] ? run_clustered_refs+0xa11/0xa20 [btrfs]
> [ 2360.849071]  [] ? trace_hardirqs_off_thunk+0x3a/0x3c
> [ 2360.849073]  [] ? restore_args+0x30/0x30
> [ 2360.849076]  [] invalid_op+0x1b/0x20
> [ 2360.849087]  [] ? run_clustered_refs+0xa11/0xa20 [btrfs]
> [ 2360.849097]  [] ? run_clustered_refs+0x69b/0xa20 [btrfs]
> [ 2360.849108]  [] btrfs_run_delayed_refs+0x1c9/0x550 
> [btrfs]
> [ 2360.849110]  [] ? trace_hardirqs_on_caller+0x155/0x1d0
> [ 2360.849119]  [] ? btrfs_free_path+0x

Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Liu Bo
On 08/02/2012 10:28 PM, Jan Schmidt wrote:
> On Thu, August 02, 2012 at 15:46 (+0200), Arne Jansen wrote:
>> On 02.08.2012 13:57, Liu Bo wrote:
>>> Anyway, for now, our error flag has only been stored in memory, so what
>>> about just keep it until we find a graceful way?
>>
>> Yeah, we need this patch to restore consistency. We can define a fixed
>> area on disk (e.g. behind the superblock) where we can write the flag
>> to without risking the superblock.
> 
> At least we all agree that we need this patch, fine.
> 
> We don't yet agree that we need a place to store a "please consider fsck" 
> flag.
> Can I please get one concrete example in which situation we
> 
> a) do detect the user should really do a file system check AND
> b) do not abort the transaction to clean the mess up?
> 
> (An example on how we could fail transaction cleanup is also accepted).
> 

Unfortunately I don't have such an example either.

Since we always get COW on metadata, I believe that it's ok to just roll
back on failure.

> If such a situation doesn't exist, there's no need for this flag. The fact 
> that
> ext has such a flag doesn't convince me, probably because I know nothing about
> ext. I can imagine that they can detect file system errors without the ability
> to return to a potentially older consistent state.
> 

This error flag is also used to indicate filesystem's error state for
transaction cleanup, so keeping it in memory is reasonable.

thanks,
liubo

> Thanks,
> -Jan
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2] Btrfs: remove superblock writing after fatal error

2012-08-02 Thread Jan Schmidt
[added Josef]

Josef: Please queue this patch for rc2, we finally agreed.

On Fri, August 03, 2012 at 05:20 (+0200), Liu Bo wrote:
> On 08/02/2012 10:28 PM, Jan Schmidt wrote:
>> If such a situation doesn't exist, there's no need for this flag. The fact 
>> that
>> ext has such a flag doesn't convince me, probably because I know nothing 
>> about
>> ext. I can imagine that they can detect file system errors without the 
>> ability
>> to return to a potentially older consistent state.
>>
> 
> This error flag is also used to indicate filesystem's error state for
> transaction cleanup, so keeping it in memory is reasonable.

Granted :-)

-Jan
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html