Re: [PATCH 3/3] Btrfs-progs: list snapshots by generation
Liu Bo, I am trying to resolve a conflict with your patch vs Hugo' integration branch http://git.darksatanic.net/repo/btrfs-progs-unstable.git (integration-20120605) --- @@ -185,11 +199,15 @@ static int add_root(struct root_lookup *root_lookup, ri->dir_id = dir_id; ri->root_id = root_id; ri->ref_tree = ref_tree; - strncpy(ri->name, name, name_len); + if (name) + strncpy(ri->name, name, name_len); if (name_len> 0) ri->name[name_len] = 0;gen = *gen; --- Per the original patch http://permalink.gmane.org/gmane.comp.file-systems.btrfs/16914 it is ri->name[name_len-1] instead. and I don't see any further patches which made it to be ri->name[name_len] as in your patch. any idea? Thanks, Anand On 31/07/12 13:49, Liu Bo wrote: The idea is that we usually use snapshot to backup/restore our data, and the common way can be a cron script which makes lots of snapshots, so we can end up with spending some time to find the latest snapshot to restore. This adds a feature for 'btrfs subvolume list' to let it list snapshots by their _created_ generation. What we need to do is just to list them in descending order and get the latest snapshot. What's more, we can find the oldest snapshot as well by listing snapshots in ascending order. Signed-off-by: Liu Bo --- btrfs-list.c | 176 -- cmds-subvolume.c | 19 +- 2 files changed, 185 insertions(+), 10 deletions(-) diff --git a/btrfs-list.c b/btrfs-list.c index 05360dc..0374b41 100644 --- a/btrfs-list.c +++ b/btrfs-list.c @@ -87,13 +87,23 @@ static int comp_entry(struct root_info *entry, u64 root_id, u64 ref_tree) return 0; } +static int comp_entry_with_gen(struct root_info *entry, u64 root_id, + u64 ref_tree, u64 gen) +{ + if (entry->gen< gen) + return 1; + if (entry->gen> gen) + return -1; + return comp_entry(entry, root_id, ref_tree); +} + /* * insert a new root into the tree. returns the existing root entry * if one is already there. Both root_id and ref_tree are used * as the key */ static struct rb_node *tree_insert(struct rb_root *root, u64 root_id, - u64 ref_tree, struct rb_node *node) + u64 ref_tree, u64 *gen, struct rb_node *node) { struct rb_node ** p =&root->rb_node; struct rb_node * parent = NULL; @@ -104,7 +114,11 @@ static struct rb_node *tree_insert(struct rb_root *root, u64 root_id, parent = *p; entry = rb_entry(parent, struct root_info, rb_node); - comp = comp_entry(entry, root_id, ref_tree); + if (!gen) + comp = comp_entry(entry, root_id, ref_tree); + else + comp = comp_entry_with_gen(entry, root_id, ref_tree, + *gen); if (comp< 0) p =&(*p)->rb_left; @@ -171,7 +185,7 @@ static struct root_info *tree_search(struct rb_root *root, u64 root_id) */ static int add_root(struct root_lookup *root_lookup, u64 root_id, u64 ref_tree, u64 dir_id, char *name, - int name_len) + int name_len, u64 *gen) { struct root_info *ri; struct rb_node *ret; @@ -185,11 +199,15 @@ static int add_root(struct root_lookup *root_lookup, ri->dir_id = dir_id; ri->root_id = root_id; ri->ref_tree = ref_tree; - strncpy(ri->name, name, name_len); + if (name) + strncpy(ri->name, name, name_len); if (name_len> 0) ri->name[name_len] = 0; + if (gen) + ri->gen = *gen; - ret = tree_insert(&root_lookup->root, root_id, ref_tree,&ri->rb_node); + ret = tree_insert(&root_lookup->root, root_id, ref_tree, gen, + &ri->rb_node); if (ret) { printf("failed to insert tree %llu\n", (unsigned long long)root_id); exit(1); @@ -693,7 +711,7 @@ again: dir_id = btrfs_stack_root_ref_dirid(ref); add_root(root_lookup, sh->objectid, sh->offset, -dir_id, name, name_len); +dir_id, name, name_len, NULL); } else if (get_gen&& sh->type == BTRFS_ROOT_ITEM_KEY) { ri = (struct btrfs_root_item *)(args.buf + off); gen = btrfs_root_generation(ri); @@ -750,6 +768,79 @@ again: return 0; } +static int __list_snapshot_search(int fd, struct root_lookup *root_lookup) +{ + int ret; + struct btrfs_ioctl_search_args args; +
Btrfs send/receive fixes and PULL request
Hello Chris, You can find and pull a lot of fixes for btrfs send/receive in my git repo: git://github.com/ablock84/linux-btrfs.git for-chris These fixes are mostly the results of the reviews from Arne and Alex that I got here on the list. I'm not posting the patches to the list at the moment as they got quite a lot (23 patches). If anyone thinks I should still post them, please tell me. The branch is based on current master of Linus repo. As I already announced, I won't have time to work on btrfs starting from now. Alex. Alexander Block (23): Btrfs: add rdev to get_inode_info in send/receive Btrfs: fix cur_ino < parent_ino case for send/receive Btrfs: add missing check for dir != tmp_dir to is_first_ref Btrfs: remove unused code with #if 0 Btrfs: add correct parent to check_dirs when dir got moved Btrfs: remove unused use_list from send/receive code Btrfs: rename backref_ctx::found_in_send_root to found_itself Btrfs: use kmalloc instead of stack for backref_ctx Btrfs: use normal return path for root == send_root case Btrfs: don't break in the final loop of find_extent_clone Btrfs: fix memory leak for name_cache in send/receive Btrfs: fix use of radix_tree for name_cache in send/receive Btrfs: make aux field of ulist 64 bit Btrfs: update send_progress at correct places Btrfs: add/fix comments/documentation for send/receive Btrfs: code cleanups for send/receive Btrfs: remove unused tmp_path from iterate_dir_item Btrfs: free nce and nce_head on error in name_cache_insert Btrfs: fix check for changed extent in is_extent_unchanged Btrfs: use <= instead of < in is_extent_unchanged Btrfs: pass root instead of parent_root to iterate_inode_ref Btrfs: ignore non-FS inodes for send/receive Btrfs: don't treat top/root directory inode as deleted/reused Dan Carpenter (1): Btrfs: fix some endian bugs handling the root times fs/btrfs/backref.c |8 +- fs/btrfs/ioctl.c |2 +- fs/btrfs/qgroup.c | 20 +- fs/btrfs/root-tree.c |4 +- fs/btrfs/send.c| 851 ++-- fs/btrfs/transaction.c |2 +- fs/btrfs/ulist.c |7 +- fs/btrfs/ulist.h |9 +- 8 files changed, 483 insertions(+), 420 deletions(-) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Btrfs send/receive fixes and PULL request
On Thu, Aug 2, 2012 at 11:55 AM, Alexander Block wrote: > Hello Chris, > > You can find and pull a lot of fixes for btrfs send/receive in my git repo: > > git://github.com/ablock84/linux-btrfs.git for-chris > > These fixes are mostly the results of the reviews from Arne and Alex > that I got here on the list. I'm not posting the patches to the list > at the moment as they got quite a lot (23 patches). If anyone thinks I > should still post them, please tell me. The branch is based on current > master of Linus repo. As I already announced, I won't have time to > work on btrfs starting from now. > > Alex. > > > Alexander Block (23): > Btrfs: add rdev to get_inode_info in send/receive > Btrfs: fix cur_ino < parent_ino case for send/receive > Btrfs: add missing check for dir != tmp_dir to is_first_ref > Btrfs: remove unused code with #if 0 > Btrfs: add correct parent to check_dirs when dir got moved > Btrfs: remove unused use_list from send/receive code > Btrfs: rename backref_ctx::found_in_send_root to found_itself > Btrfs: use kmalloc instead of stack for backref_ctx > Btrfs: use normal return path for root == send_root case > Btrfs: don't break in the final loop of find_extent_clone > Btrfs: fix memory leak for name_cache in send/receive > Btrfs: fix use of radix_tree for name_cache in send/receive > Btrfs: make aux field of ulist 64 bit > Btrfs: update send_progress at correct places > Btrfs: add/fix comments/documentation for send/receive > Btrfs: code cleanups for send/receive > Btrfs: remove unused tmp_path from iterate_dir_item > Btrfs: free nce and nce_head on error in name_cache_insert > Btrfs: fix check for changed extent in is_extent_unchanged > Btrfs: use <= instead of < in is_extent_unchanged > Btrfs: pass root instead of parent_root to iterate_inode_ref > Btrfs: ignore non-FS inodes for send/receive > Btrfs: don't treat top/root directory inode as deleted/reused > > Dan Carpenter (1): > Btrfs: fix some endian bugs handling the root times > > fs/btrfs/backref.c |8 +- > fs/btrfs/ioctl.c |2 +- > fs/btrfs/qgroup.c | 20 +- > fs/btrfs/root-tree.c |4 +- > fs/btrfs/send.c| 851 > ++-- > fs/btrfs/transaction.c |2 +- > fs/btrfs/ulist.c |7 +- > fs/btrfs/ulist.h |9 +- > 8 files changed, 483 insertions(+), 420 deletions(-) I forgot to write where to find the btrfs-progs fixes: git://github.com/ablock84/btrfs-progs.git for-chris -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs-progs: enhance btrfs subvol list to show read-only snapshots
From: Zhou Bo This patch enhances btrfs subvol list to show read-only snapshots. You can use the -r option showing read-only snapshots, for example: btrfs subvolume list -r Signed-off-by: Zhou Bo --- btrfs-list.c | 68 +++-- cmds-subvolume.c | 15 ctree.h |2 + 3 files changed, 72 insertions(+), 13 deletions(-) diff --git a/btrfs-list.c b/btrfs-list.c index 693d241..9c10b9e 100644 --- a/btrfs-list.c +++ b/btrfs-list.c @@ -608,7 +608,46 @@ out: return 0; } -static int __list_subvol_search(int fd, struct root_lookup *root_lookup) +static int is_readonly_subvol(int fd, u64 objectid) +{ + int ret; + struct btrfs_ioctl_search_args args; + struct btrfs_ioctl_search_key *sk = &args.key; + struct btrfs_ioctl_search_header *sh; + unsigned long off = 0; + int found = 0; + struct btrfs_root_item *item; + + memset(&args, 0, sizeof(args)); + + /* search in the tree of tree roots */ + sk->tree_id = BTRFS_ROOT_TREE_OBJECTID; + sk->min_objectid = objectid; + sk->max_objectid = objectid; + sk->max_type = BTRFS_ROOT_ITEM_KEY; + sk->min_type = BTRFS_ROOT_ITEM_KEY; + sk->max_offset = (u64)-1; + sk->max_transid = (u64)-1; + sk->nr_items = 1; + ret = ioctl(fd, BTRFS_IOC_TREE_SEARCH, &args); + if (ret < 0) + return ret; + if (sk->nr_items == 0) { + errno = -ENOENT; + found = -1; + goto out; + } + sh = (struct btrfs_ioctl_search_header *)args.buf; + off += sizeof(*sh); + item = (struct btrfs_root_item *)(args.buf + off); + if (item->flags & cpu_to_le64(BTRFS_ROOT_SUBVOL_RDONLY)) + found = 1; +out: + return found; +} + +static int __list_subvol_search(int fd, struct root_lookup *root_lookup, + int list_ro) { int ret; struct btrfs_ioctl_search_args args; @@ -668,9 +707,23 @@ static int __list_subvol_search(int fd, struct root_lookup *root_lookup) name_len = btrfs_stack_root_ref_name_len(ref); name = (char *)(ref + 1); dir_id = btrfs_stack_root_ref_dirid(ref); - - add_root(root_lookup, sh->objectid, sh->offset, -dir_id, name, name_len); + if (list_ro) { + int subvol_readonly = + is_readonly_subvol(fd, + sh->objectid); + if (subvol_readonly < 0) { + return subvol_readonly; + } else if (subvol_readonly) { + add_root(root_lookup, + sh->objectid, + sh->offset, dir_id, + name, name_len); + } + } else { + add_root(root_lookup, sh->objectid, + sh->offset, dir_id, + name, name_len); + } } off += sh->len; @@ -719,7 +772,7 @@ static int __list_subvol_fill_paths(int fd, struct root_lookup *root_lookup) return 0; } -int list_subvols(int fd, int print_parent, int get_default) +int list_subvols(int fd, int print_parent, int get_default, int list_ro) { struct root_lookup root_lookup; struct rb_node *n; @@ -745,7 +798,7 @@ int list_subvols(int fd, int print_parent, int get_default) } } - ret = __list_subvol_search(fd, &root_lookup); + ret = __list_subvol_search(fd, &root_lookup, list_ro); if (ret) { fprintf(stderr, "ERROR: can't perform the search - %s\n", strerror(errno)); @@ -788,7 +841,6 @@ int list_subvols(int fd, int print_parent, int get_default) (unsigned long long)entry->root_id, (unsigned long long)level, path); } - free(path); n = rb_prev(n); } @@ -983,7 +1035,7 @@ char *path_for_root(int fd, u64 root) char *ret_path = NULL; int ret; - ret = __list_subvol_search(fd, &root_lookup); + ret = __list_subvol_search(fd, &root_lookup, 0); if (ret < 0) return ERR_PTR(ret); diff --git a/cmds-subvolume.c b/cmds-subvolume.c index 3508c
[RFC][PATCH] Btrfs-progs, btrfsck: add block group check function
From: Chen Yang This patch adds the function to check correspondence between block group, chunk and device extent. Signed-off-by: Cheng Yang Signed-off-by: Miao Xie --- Makefile |2 +- btrfsck.c | 569 +++- dev-extent-cache.c | 188 + dev-extent-cache.h | 60 ++ 4 files changed, 812 insertions(+), 7 deletions(-) create mode 100644 dev-extent-cache.c create mode 100644 dev-extent-cache.h diff --git a/Makefile b/Makefile index 969..75eced8 100644 --- a/Makefile +++ b/Makefile @@ -4,7 +4,7 @@ CFLAGS = -g -O0 objects = ctree.o disk-io.o radix-tree.o extent-tree.o print-tree.o \ root-tree.o dir-item.o file-item.o inode-item.o \ inode-map.o crc32c.o rbtree.o extent-cache.o extent_io.o \ - volumes.o utils.o btrfs-list.o btrfslabel.o repair.o + volumes.o utils.o btrfs-list.o btrfslabel.o repair.o dev-extent-cache.o cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-inspect.o cmds-balance.o diff --git a/btrfsck.c b/btrfsck.c index 088b9f4..a25c948 100644 --- a/btrfsck.c +++ b/btrfsck.c @@ -34,6 +34,56 @@ #include "list.h" #include "version.h" #include "utils.h" +#include "dev-extent-cache.h" + + +struct block_group_rec { + struct cache_extent cache; + u64 objectid; + u8 type; + u64 offset; + + u64 flags; +}; + +struct dev_rec { + struct cache_extent cache; + u64 objectid; + u8 type; + u64 offset; + + u64 devid; + u64 total_byte; + u64 byte_used; +}; + +struct stripe { + u64 devid; + u64 offset; +}; + +struct chunk_rec { + struct cache_extent cache; + u64 objectid; + u8 type; + u64 offset; + + u64 length; + u64 type_flags; + u16 num_stripes; + struct stripe stripes[0]; +}; + +struct dev_extent_rec { + struct cache_dev_extent cache; + u64 objectid; + u8 type; + u64 offset; + + u64 chunk_objecteid; + u64 chunk_offset; + u64 length; +}; static u64 bytes_used = 0; static u64 total_csum_bytes = 0; @@ -1852,7 +1902,7 @@ static int all_backpointers_checked(struct extent_record *rec, int print_errs) (unsigned long long)rec->start, back->full_backref ? "parent" : "root", - back->full_backref ? + back->full_backref ? (unsigned long long)dback->parent: (unsigned long long)dback->root, (unsigned long long)dback->owner, @@ -2440,6 +2490,149 @@ static int process_extent_ref_v0(struct cache_tree *extent_cache, } #endif +static int process_chunk_item(struct cache_tree *chunk_cache, + struct btrfs_key *key, struct extent_buffer *eb, int slot) +{ + struct btrfs_chunk *ptr; + struct chunk_rec *rec; + int num_stripes, i; + int ret = 0; + + ptr = btrfs_item_ptr(eb, + slot, struct btrfs_chunk); + + num_stripes = btrfs_chunk_num_stripes(eb, ptr); + + rec = malloc(sizeof(*rec) + + num_stripes * sizeof(*rec->stripes)); + if (!rec) { + fprintf(stderr, "memory allocation failed\n"); + return -ENOMEM; + } + + rec->cache.start = key->offset; + rec->cache.size = 1; + + rec->objectid = key->objectid; + rec->type = key->type; + rec->offset = key->offset; + + rec->length = btrfs_chunk_length(eb, ptr); + rec->type = btrfs_chunk_type(eb, ptr); + rec->num_stripes = num_stripes; + + for (i = 0; i < rec->num_stripes; ++i) { + rec->stripes[i].devid = + btrfs_stripe_devid_nr(eb, ptr, i); + rec->stripes[i].offset = + btrfs_stripe_offset_nr(eb, ptr, i); + } + + ret = insert_existing_cache_extent( + chunk_cache, &rec->cache); + + return ret; +} + +static int process_dev_item(struct cache_tree *dev_cache, + struct btrfs_key *key, struct extent_buffer *eb, int slot) +{ + struct btrfs_dev_item *ptr; + struct dev_rec *rec; + int ret = 0; + + ptr = btrfs_item_ptr(eb, + slot, struct btrfs_dev_item); + + rec = malloc(sizeof(*rec)); + if (!rec) { + fprintf(stderr, "memory allocation failed\n"); + return -ENOMEM; + } + + rec->cache.start = key->offset; + rec->cache.size = 1; + + rec->objectid = key->objectid; + rec->type = key->type; + rec->offset = key->offset; + + rec->devid = btrfs_device_id(eb, ptr); + rec->total_byte = btrfs_device_total_bytes(eb, ptr); +
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: > On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: >> On 08/01/2012 09:07 PM, Jan Schmidt wrote: >>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: On 08/01/2012 07:45 PM, Stefan Behrens wrote: > With commit acce952b0, btrfs was changed to flag the filesystem with > BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal > error happened like a write I/O errors of all mirrors. > In such situations, on unmount, the superblock is written in > btrfs_error_commit_super(). This is done with the intention to be able > to evaluate the error flag on the next mount. A warning is printed > in this case during the next mount and the log tree is ignored. > > The issue is that it is possible that the superblock points to a root > that was not written (due to write I/O errors). > The result is that the filesystem cannot be mounted. btrfsck also does > not start and all the other btrfs-progs tools fail to start as well. > However, mount -o recovery is working well and does the right things > to recover the filesystem (i.e., don't use the log root, clear the > free space cache and use the next mountable root that is stored in the > root backup array). > > This patch removes the writing of the superblock when > BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error > flag in the mount function. > Yes, I have to admit that this can be a serious problem. But we'll need to send the error flag stored in the super block into disk in the future so that the next mount can find it unstable and do fsck by itself maybe. >>> >>> Hum, that's possible. However, I neither see >>> >>> a) a safe way to get that flag to disk >>> >>> nor >>> >>> b) a situation where this flag would help. When we abort a transaction, we >>> just >>> roll everything back to the last commit, i.e. a consistent state. So if we >>> stop >>> writing a potentially corrupt super block, we should be fine anyway. Or am I >>> missing something? >>> >> >> I'm just wondering if we can roll everything back well, why do we need fsck? > > If the disks support barriers, we roll everything back very well. The > most recent superblock on the disks always defines a consistent > filesystem state. There are only two remaining filesystem consistency > issues left that can cause inconsistent states, one is the one that the > patch in this email addresses, and the second one is that the error > result from barrier_all_devices() is ignored (which I want to change next). Hi Liu Bo, Do you have any remaining objections to that patch? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/02/2012 06:30 PM, Stefan Behrens wrote: > On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: >> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: >>> On 08/01/2012 09:07 PM, Jan Schmidt wrote: On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: > On 08/01/2012 07:45 PM, Stefan Behrens wrote: >> With commit acce952b0, btrfs was changed to flag the filesystem with >> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal >> error happened like a write I/O errors of all mirrors. >> In such situations, on unmount, the superblock is written in >> btrfs_error_commit_super(). This is done with the intention to be able >> to evaluate the error flag on the next mount. A warning is printed >> in this case during the next mount and the log tree is ignored. >> >> The issue is that it is possible that the superblock points to a root >> that was not written (due to write I/O errors). >> The result is that the filesystem cannot be mounted. btrfsck also does >> not start and all the other btrfs-progs tools fail to start as well. >> However, mount -o recovery is working well and does the right things >> to recover the filesystem (i.e., don't use the log root, clear the >> free space cache and use the next mountable root that is stored in the >> root backup array). >> >> This patch removes the writing of the superblock when >> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error >> flag in the mount function. >> > > Yes, I have to admit that this can be a serious problem. > > But we'll need to send the error flag stored in the super block into > disk in the future so that the next mount can find it unstable and do > fsck by itself maybe. Hum, that's possible. However, I neither see a) a safe way to get that flag to disk nor b) a situation where this flag would help. When we abort a transaction, we just roll everything back to the last commit, i.e. a consistent state. So if we stop writing a potentially corrupt super block, we should be fine anyway. Or am I missing something? >>> >>> I'm just wondering if we can roll everything back well, why do we need fsck? >> >> If the disks support barriers, we roll everything back very well. The >> most recent superblock on the disks always defines a consistent >> filesystem state. There are only two remaining filesystem consistency >> issues left that can cause inconsistent states, one is the one that the >> patch in this email addresses, and the second one is that the error >> result from barrier_all_devices() is ignored (which I want to change next). > > Hi Liu Bo, > > Do you have any remaining objections to that patch? > Hi Stefan, Still I have another question: Our metadata can be flushed into disk if we reach the limit, 32k, so we can end up with updated metadata and the latest superblock if we do not write the current super block. Any ideas? thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: barrier before waitqueue_active
On 08/02/2012 04:25 AM, Josef Bacik wrote: > We need an smb_mb() before waitqueue_active to avoid missing wakeups. > Before Mitch was hitting a deadlock between the ordered flushers and the > transaction commit because the ordered flushers were waiting for more refs > and were never woken up, so those smp_mb()'s are the most important. > Everything else I added for correctness sake and to avoid getting bitten by > this again somewhere else. Thanks, > Hi Josef, I'll appreciate a lot if you can add some comments for each memory barrier, because not everyone knows why it is used here and there. :) thanks, liubo > Signed-off-by: Josef Bacik > --- > fs/btrfs/compression.c |1 + > fs/btrfs/delayed-inode.c | 16 ++-- > fs/btrfs/delayed-ref.c | 18 -- > fs/btrfs/disk-io.c | 11 --- > fs/btrfs/inode.c |8 +--- > fs/btrfs/volumes.c |8 +--- > 6 files changed, 41 insertions(+), 21 deletions(-) > > diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c > index 86eff48..43d1c5a 100644 > --- a/fs/btrfs/compression.c > +++ b/fs/btrfs/compression.c > @@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head > *workspace) > btrfs_compress_op[idx]->free_workspace(workspace); > atomic_dec(alloc_workspace); > wake: > + smp_mb(); > if (waitqueue_active(workspace_wait)) > wake_up(workspace_wait); > } > diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c > index 335605c..8cc9b19 100644 > --- a/fs/btrfs/delayed-inode.c > +++ b/fs/btrfs/delayed-inode.c > @@ -513,9 +513,11 @@ static void __btrfs_remove_delayed_item(struct > btrfs_delayed_item *delayed_item) > rb_erase(&delayed_item->rb_node, root); > delayed_item->delayed_node->count--; > atomic_dec(&delayed_root->items); > - if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND && > - waitqueue_active(&delayed_root->wait)) > - wake_up(&delayed_root->wait); > + if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND) { > + smp_mb(); > + if (waitqueue_active(&delayed_root->wait)) > + wake_up(&delayed_root->wait); > + } > } > > static void btrfs_release_delayed_item(struct btrfs_delayed_item *item) > @@ -1057,9 +1059,11 @@ static void btrfs_release_delayed_inode(struct > btrfs_delayed_node *delayed_node) > delayed_root = delayed_node->root->fs_info->delayed_root; > atomic_dec(&delayed_root->items); > if (atomic_read(&delayed_root->items) < > - BTRFS_DELAYED_BACKGROUND && > - waitqueue_active(&delayed_root->wait)) > - wake_up(&delayed_root->wait); > + BTRFS_DELAYED_BACKGROUND) { > + smp_mb(); > + if (waitqueue_active(&delayed_root->wait)) > + wake_up(&delayed_root->wait); > + } > } > } > > diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c > index da7419e..858ef02 100644 > --- a/fs/btrfs/delayed-ref.c > +++ b/fs/btrfs/delayed-ref.c > @@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info > *fs_info, > add_delayed_tree_ref(fs_info, trans, &ref->node, bytenr, > num_bytes, parent, ref_root, level, action, > for_cow); > - if (!need_ref_seq(for_cow, ref_root) && > - waitqueue_active(&fs_info->tree_mod_seq_wait)) > - wake_up(&fs_info->tree_mod_seq_wait); > + if (!need_ref_seq(for_cow, ref_root)) { > + smp_mb(); > + if (waitqueue_active(&fs_info->tree_mod_seq_wait)) > + wake_up(&fs_info->tree_mod_seq_wait); > + } > + > spin_unlock(&delayed_refs->lock); > if (need_ref_seq(for_cow, ref_root)) > btrfs_qgroup_record_ref(trans, &ref->node, extent_op); > @@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info > *fs_info, > add_delayed_data_ref(fs_info, trans, &ref->node, bytenr, > num_bytes, parent, ref_root, owner, offset, > action, for_cow); > - if (!need_ref_seq(for_cow, ref_root) && > - waitqueue_active(&fs_info->tree_mod_seq_wait)) > - wake_up(&fs_info->tree_mod_seq_wait); > + if (!need_ref_seq(for_cow, ref_root)) { > + smp_mb(); > + if (waitqueue_active(&fs_info->tree_mod_seq_wait)) > + wake_up(&fs_info->tree_mod_seq_wait); > + } > spin_unlock(&delayed_refs->lock); > if (need_ref_seq(for_cow, ref_root)) > btrfs_qgroup_record_ref(trans, &ref->node, extent_op); > @@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info > *fs_info, > num_bytes, BTRFS_UPDATE_DELAYED_HEAD, >
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 02.08.2012 12:36, Liu Bo wrote: > On 08/02/2012 06:30 PM, Stefan Behrens wrote: >> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: >>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: On 08/01/2012 09:07 PM, Jan Schmidt wrote: > On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: >> On 08/01/2012 07:45 PM, Stefan Behrens wrote: >>> With commit acce952b0, btrfs was changed to flag the filesystem with >>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal >>> error happened like a write I/O errors of all mirrors. >>> In such situations, on unmount, the superblock is written in >>> btrfs_error_commit_super(). This is done with the intention to be able >>> to evaluate the error flag on the next mount. A warning is printed >>> in this case during the next mount and the log tree is ignored. >>> >>> The issue is that it is possible that the superblock points to a root >>> that was not written (due to write I/O errors). >>> The result is that the filesystem cannot be mounted. btrfsck also does >>> not start and all the other btrfs-progs tools fail to start as well. >>> However, mount -o recovery is working well and does the right things >>> to recover the filesystem (i.e., don't use the log root, clear the >>> free space cache and use the next mountable root that is stored in the >>> root backup array). >>> >>> This patch removes the writing of the superblock when >>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error >>> flag in the mount function. >>> >> >> Yes, I have to admit that this can be a serious problem. >> >> But we'll need to send the error flag stored in the super block into >> disk in the future so that the next mount can find it unstable and do >> fsck by itself maybe. > > Hum, that's possible. However, I neither see > > a) a safe way to get that flag to disk > > nor > > b) a situation where this flag would help. When we abort a transaction, > we just > roll everything back to the last commit, i.e. a consistent state. So if > we stop > writing a potentially corrupt super block, we should be fine anyway. Or > am I > missing something? > I'm just wondering if we can roll everything back well, why do we need fsck? >>> >>> If the disks support barriers, we roll everything back very well. The >>> most recent superblock on the disks always defines a consistent >>> filesystem state. There are only two remaining filesystem consistency >>> issues left that can cause inconsistent states, one is the one that the >>> patch in this email addresses, and the second one is that the error >>> result from barrier_all_devices() is ignored (which I want to change next). >> >> Hi Liu Bo, >> >> Do you have any remaining objections to that patch? >> > > Hi Stefan, > > Still I have another question: > > Our metadata can be flushed into disk if we reach the limit, 32k, so we > can end up with updated metadata and the latest superblock if we do not > write the current super block. The old metadata stays valid until the new superblock is written, so no problem here, or maybe I don't understand your question :) > > Any ideas? > > thanks, > liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN: > On Wed, Aug 01, 2012 at 11:57:39PM +0200, Martin Steigerwald wrote: > > Its getting quite strange. > > I would agree :) > > Before I paste a bunch of thing, I wanted to thank you for not giving up on > me > and offering your time to help me figure this out :) You are welcome. Well I am holding Linux performance analysis & tuning trainings and I am really interested into issues like this ;) I will take care of myself and I take my time to respond or even do not respond at all anymore if I run out of ideas ;). > > I lost track of whether you did that already or not, but if you didn´t > > please post some > > > > vmstat 1 > > iostat -xd 1 > > on the device while it is being slow. > > Sure thing, here's the 24 second du -s run: > procs ---memory-- ---swap-- -io -system-- cpu > r b swpd free buff cache si sobibo in cs us sy id wa > 2 1 0 2747264 44 348388002850 242 184 19 6 74 1 > 1 0 0 2744128 44 35170000 144 0 2758 32115 30 5 61 > 4 > 2 1 0 2743100 44 35199200 792 0 2616 30613 28 4 50 > 18 > 1 1 0 2741592 44 35266800 776 0 2574 31551 29 4 45 > 21 > 1 1 0 2740720 44 35343200 692 0 2734 32891 30 4 45 > 22 > 1 1 0 2740104 44 35428400 460 0 2639 31585 30 4 45 > 21 > 3 1 0 2738520 44 35469200 544 264 2834 30302 32 5 42 > 21 > 1 1 0 2736936 44 35547600 1064 2012 2867 31172 28 4 45 > 23 A bit more wait I/O with not even 10% of the throughput as compared to the Intel SSD 320 figures. Seems that Intel SSD is running circles around your Samsung SSD while not – as expected for that use case – being fully utilized. > Linux 3.5.0-amd64-preempt-noide-20120410 (gandalfthegreat)08/01/2012 > _x86_64_(4 CPU) > rrqm/s wrqm/s r/s w/srkB/swkB/s avgrq-sz avgqu-sz await > r_await w_await svctm %util > 2.18 0.686.45 17.7778.12 153.3919.12 0.187.51 > 8.527.15 4.46 10.81 > 0.00 0.00 118.000.00 540.00 0.00 9.15 1.189.93 > 9.930.00 4.98 58.80 > 0.00 0.00 217.000.00 868.00 0.00 8.00 1.908.77 > 8.770.00 4.44 96.40 > 0.00 0.00 192.000.00 768.00 0.00 8.00 1.638.44 > 8.440.00 5.10 98.00 > 0.00 0.00 119.000.00 476.00 0.00 8.00 1.069.01 > 9.010.00 8.20 97.60 > 0.00 0.00 125.000.00 500.00 0.00 8.00 1.088.67 > 8.670.00 7.55 94.40 > 0.00 0.00 165.000.00 660.00 0.00 8.00 1.509.12 > 9.120.00 5.87 96.80 > 0.00 0.00 195.00 13.00 780.00 272.0010.12 1.688.10 > 7.94 10.46 4.65 96.80 > 0.00 0.00 173.000.00 692.00 0.00 8.00 1.729.87 > 9.870.00 5.71 98.80 > 0.00 0.00 171.000.00 684.00 0.00 8.00 1.629.33 > 9.330.00 5.75 98.40 > 0.00 0.00 161.000.00 644.00 0.00 8.00 1.529.57 > 9.570.00 6.14 98.80 > 0.00 0.00 136.000.00 544.00 0.00 8.00 1.269.29 > 9.290.00 7.24 98.40 > 0.00 0.00 199.000.00 796.00 0.00 8.00 1.949.73 > 9.730.00 4.94 98.40 > 0.00 0.00 201.000.00 804.00 0.00 8.00 1.708.54 > 8.540.00 4.80 96.40 > 0.00 0.00 272.00 15.00 1088.00 272.00 9.48 2.358.21 > 8.463.73 3.39 97.20 […] > > I am interested in wait I/O and latencies and disk utilization. > > Cool tool, I didn't know about iostat. > My r_await numbers don't look good obviously and yet %util is pretty much > 100% the entire time. > > Does that show that it's indeed the device that is unable to deliver the > requests any quicker, despite > being an ssd, or are you reading this differently? That, or… > > Also I am interested in > > merkaba:~> hdparm -I /dev/sda | grep -i queue > > Queue depth: 32 > >*Native Command Queueing (NCQ) > > output for your SSD. > > gandalfthegreat:/var/local# hdparm -I /dev/sda | grep -i queue > Queue depth: 32 > *Native Command Queueing (NCQ) > gandalfthegreat:/var/local# > > I've the the fio tests in: > /dev/mapper/cryptroot /var btrfs > rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 … you are still using dm_crypt? Please test without dm_crypt. My figures are from within LVM, but no dm_crypt. Its good to have a comparable base for the measurements. > (discard is there, so fstrim shouldn't be needed) I can´t imagine why it should matter, but maybe its worth having some tests without „discard“. > > I also suggest to use fio with with the ssd-test example
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN: > So, doctor, is it bad? :) > > randomwrite: (g=0): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > sequentialwrite: (g=1): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > randomread: (g=2): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > sequentialread: (g=3): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 > 2.0.8 > Starting 4 processes > randomwrite: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___R] [100.0% done] [558.8M/0K /s] [63.8K/0 iops] [eta > 00m:00s] > randomwrite: (groupid=0, jobs=1): err= 0: pid=7193 > write: io=102048KB, bw=1700.8KB/s, iops=189 , runt= 60003msec > slat (usec): min=21 , max=219834 , avg=5250.91, stdev=5936.55 > clat (usec): min=25 , max=738932 , avg=329339.45, stdev=106004.63 > lat (msec): min=4 , max=751 , avg=334.59, stdev=107.57 > clat percentiles (msec): Heck, I didn´t look at the IOPS figure! 189 IOPS for a SATA-600 SSD. Thats pathetic. So again, please test this without dm_crypt. I can´t believe that this is the maximum the hardware is able to achieve. A really fast 15000 rpm SAS harddisk might top that. -- Martin 'Helios' Steigerwald - http://www.Lichtvoll.de GPG: 03B0 0D6C 0040 0710 4AFA B82F 991B EAAC A599 84C7 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/02/2012 07:18 PM, Arne Jansen wrote: > On 02.08.2012 12:36, Liu Bo wrote: >> On 08/02/2012 06:30 PM, Stefan Behrens wrote: >>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: > On 08/01/2012 09:07 PM, Jan Schmidt wrote: >> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: >>> On 08/01/2012 07:45 PM, Stefan Behrens wrote: With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. >>> >>> Yes, I have to admit that this can be a serious problem. >>> >>> But we'll need to send the error flag stored in the super block into >>> disk in the future so that the next mount can find it unstable and do >>> fsck by itself maybe. >> >> Hum, that's possible. However, I neither see >> >> a) a safe way to get that flag to disk >> >> nor >> >> b) a situation where this flag would help. When we abort a transaction, >> we just >> roll everything back to the last commit, i.e. a consistent state. So if >> we stop >> writing a potentially corrupt super block, we should be fine anyway. Or >> am I >> missing something? >> > > I'm just wondering if we can roll everything back well, why do we need > fsck? If the disks support barriers, we roll everything back very well. The most recent superblock on the disks always defines a consistent filesystem state. There are only two remaining filesystem consistency issues left that can cause inconsistent states, one is the one that the patch in this email addresses, and the second one is that the error result from barrier_all_devices() is ignored (which I want to change next). >>> >>> Hi Liu Bo, >>> >>> Do you have any remaining objections to that patch? >>> >> >> Hi Stefan, >> >> Still I have another question: >> >> Our metadata can be flushed into disk if we reach the limit, 32k, so we >> can end up with updated metadata and the latest superblock if we do not >> write the current super block. > > The old metadata stays valid until the new superblock is written, > so no problem here, or maybe I don't understand your question :) > Yeah, Arne, you're right :) But for undetected and unexpected errors as Arne had mentioned, I want to keep the error flag which is able to inform users that this FS is recommended (but not must) to do fsck at least. thanks, liubo >> >> Any ideas? >> >> thanks, >> liubo >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 02.08.2012 13:34, Liu Bo wrote: > On 08/02/2012 07:18 PM, Arne Jansen wrote: >> On 02.08.2012 12:36, Liu Bo wrote: >>> On 08/02/2012 06:30 PM, Stefan Behrens wrote: On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: > On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: >> On 08/01/2012 09:07 PM, Jan Schmidt wrote: >>> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: On 08/01/2012 07:45 PM, Stefan Behrens wrote: > With commit acce952b0, btrfs was changed to flag the filesystem with > BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal > error happened like a write I/O errors of all mirrors. > In such situations, on unmount, the superblock is written in > btrfs_error_commit_super(). This is done with the intention to be able > to evaluate the error flag on the next mount. A warning is printed > in this case during the next mount and the log tree is ignored. > > The issue is that it is possible that the superblock points to a root > that was not written (due to write I/O errors). > The result is that the filesystem cannot be mounted. btrfsck also does > not start and all the other btrfs-progs tools fail to start as well. > However, mount -o recovery is working well and does the right things > to recover the filesystem (i.e., don't use the log root, clear the > free space cache and use the next mountable root that is stored in the > root backup array). > > This patch removes the writing of the superblock when > BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error > flag in the mount function. > Yes, I have to admit that this can be a serious problem. But we'll need to send the error flag stored in the super block into disk in the future so that the next mount can find it unstable and do fsck by itself maybe. >>> >>> Hum, that's possible. However, I neither see >>> >>> a) a safe way to get that flag to disk >>> >>> nor >>> >>> b) a situation where this flag would help. When we abort a transaction, >>> we just >>> roll everything back to the last commit, i.e. a consistent state. So if >>> we stop >>> writing a potentially corrupt super block, we should be fine anyway. Or >>> am I >>> missing something? >>> >> >> I'm just wondering if we can roll everything back well, why do we need >> fsck? > > If the disks support barriers, we roll everything back very well. The > most recent superblock on the disks always defines a consistent > filesystem state. There are only two remaining filesystem consistency > issues left that can cause inconsistent states, one is the one that the > patch in this email addresses, and the second one is that the error > result from barrier_all_devices() is ignored (which I want to change > next). Hi Liu Bo, Do you have any remaining objections to that patch? >>> >>> Hi Stefan, >>> >>> Still I have another question: >>> >>> Our metadata can be flushed into disk if we reach the limit, 32k, so we >>> can end up with updated metadata and the latest superblock if we do not >>> write the current super block. >> >> The old metadata stays valid until the new superblock is written, >> so no problem here, or maybe I don't understand your question :) >> > > Yeah, Arne, you're right :) > > But for undetected and unexpected errors as Arne had mentioned, I want > to keep the error flag which is able to inform users that this FS is > recommended (but not must) to do fsck at least. How about storing the flag in a different location than the superblock? If the fs is in an unknown state, every write potentially makes it only worse. > > thanks, > liubo > >>> >>> Any ideas? >>> >>> thanks, >>> liubo >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 2/2] Btrfs: fix the snapshot that should not exist
Hi, appologies for late reply, On Thu, Aug 02, 2012 at 12:40:46PM +0800, Miao Xie wrote: > Changelog v1 -> v2: > - add comment to explain why we need deal with the delayed items after > snapshot creation and why this operation do not corrupt the metadata. I'm sorry, the comment did not fix the bug :) The subvol stress is able to hit this: [ 2360.444321] [ cut here ] [ 2360.448019] kernel BUG at fs/btrfs/extent-tree.c:6047! [ 2360.448019] invalid opcode: [#1] SMP [ 2360.448019] CPU 0 [ 2360.448019] Modules linked in: btrfs aoe [last unloaded: btrfs] [ 2360.448019] [ 2360.448019] Pid: 8212, comm: btrfs Not tainted 3.5.0-default+ #170 Intel Corporation Santa Rosa platform/Matanzas [ 2360.448019] RIP: 0010:[] [] run_clustered_refs+0xa11/0xa20 [btrfs] [ 2360.448019] RSP: 0018:88003eca1a68 EFLAGS: 00010246 [ 2360.448019] RAX: 07ff RBX: 880017a694c8 RCX: 88003eca1a08 [ 2360.448019] RDX: 880028aa9000 RSI: 07fe RDI: 880064223cf0 [ 2360.448019] RBP: 88003eca1b48 R08: 07ff R09: 88003eca19f8 [ 2360.448019] R10: 88002435d1e8 R11: R12: 880025d66d28 [ 2360.448019] R13: 88003864 R14: 8800778dfa88 R15: 880060f010d0 [ 2360.448019] FS: 7f3289f35740() GS:88007dc0() knlGS: [ 2360.448019] CS: 0010 DS: ES: CR0: 8005003b [ 2360.448019] CR2: ff600400 CR3: 2e112000 CR4: 07f0 [ 2360.448019] DR0: DR1: DR2: [ 2360.448019] DR3: DR6: 0ff0 DR7: 0400 [ 2360.448019] Process btrfs (pid: 8212, threadinfo 88003eca, task 88001d834200) [ 2360.448019] Stack: [ 2360.448019] 0001 [ 2360.448019] 07ed 88002435d1e8 3eca1b18 [ 2360.448019] 0770 5cb1e000 88003eca1c08 [ 2360.448019] Call Trace: [ 2360.448019] [] btrfs_run_delayed_refs+0x1c9/0x550 [btrfs] [ 2360.448019] [] ? trace_hardirqs_on_caller+0x155/0x1d0 [ 2360.448019] [] ? btrfs_free_path+0x2a/0x40 [btrfs] [ 2360.448019] [] ? btrfs_run_delayed_items+0xf1/0x160 [btrfs] [ 2360.448019] [] btrfs_commit_transaction+0x605/0xb00 [btrfs] [ 2360.448019] [] ? lock_release_holdtime+0x3d/0x1c0 [ 2360.448019] [] ? btrfs_mksubvol+0x298/0x360 [btrfs] [ 2360.448019] [] ? wake_up_bit+0x40/0x40 [ 2360.448019] [] ? do_raw_spin_unlock+0x5e/0xb0 [ 2360.448019] [] btrfs_mksubvol+0x358/0x360 [btrfs] [ 2360.448019] [] btrfs_ioctl_snap_create_transid+0x10a/0x190 [btrfs] [ 2360.448019] [] btrfs_ioctl_snap_create_v2.clone.0+0xfd/0x110 [btrfs] [ 2360.448019] [] btrfs_ioctl+0x48e/0x1340 [btrfs] [ 2360.448019] [] ? do_page_fault+0x2d0/0x580 [ 2360.448019] [] ? _raw_spin_unlock_irq+0x30/0x50 [ 2360.448019] [] ? finish_task_switch+0x83/0xf0 [ 2360.448019] [] do_vfs_ioctl+0x98/0x560 [ 2360.448019] [] ? retint_swapgs+0x13/0x1b [ 2360.448019] [] sys_ioctl+0x4f/0x80 [ 2360.448019] [] system_call_fastpath+0x16/0x1b [ 2360.448019] Code: 8b 76 40 48 89 d7 48 89 55 a0 e8 2b 74 ff ff 83 f8 17 0f 87 1e ff ff ff 0f 0b 80 fa b2 0f 84 b4 f8 ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 57 [ 2360.448019] RIP [] run_clustered_refs+0xa11/0xa20 [btrfs] [ 2360.448019] RSP [ 2360.814508] ---[ end trace 555a16cac3620ccb ]--- [ 2360.820398] note: btrfs[8212] exited with preempt_count 1 [ 2360.827072] BUG: sleeping function called from invalid context at kernel/rwsem.c:20 [ 2360.836047] in_atomic(): 1, irqs_disabled(): 0, pid: 8212, name: btrfs [ 2360.843859] INFO: lockdep is turned off. [ 2360.849021] Pid: 8212, comm: btrfs Tainted: G D 3.5.0-default+ #170 [ 2360.849022] Call Trace: [ 2360.849027] [] __might_sleep+0xfc/0x130 [ 2360.849030] [] down_read+0x26/0xa0 [ 2360.849034] [] acct_collect+0x4b/0x1b0 [ 2360.849038] [] do_exit+0x718/0x9a0 [ 2360.849041] [] ? kmsg_dump+0x26/0x140 [ 2360.849043] [] oops_end+0xb0/0xf0 [ 2360.849046] [] die+0x5b/0x90 [ 2360.849048] [] do_trap+0xc4/0x170 [ 2360.849052] [] do_invalid_op+0x95/0xb0 [ 2360.849067] [] ? run_clustered_refs+0xa11/0xa20 [btrfs] [ 2360.849071] [] ? trace_hardirqs_off_thunk+0x3a/0x3c [ 2360.849073] [] ? restore_args+0x30/0x30 [ 2360.849076] [] invalid_op+0x1b/0x20 [ 2360.849087] [] ? run_clustered_refs+0xa11/0xa20 [btrfs] [ 2360.849097] [] ? run_clustered_refs+0x69b/0xa20 [btrfs] [ 2360.849108] [] btrfs_run_delayed_refs+0x1c9/0x550 [btrfs] [ 2360.849110] [] ? trace_hardirqs_on_caller+0x155/0x1d0 [ 2360.849119] [] ? btrfs_free_path+0x2a/0x40 [btrfs] [ 2360.849133] [] ? btrfs_run_delayed_items+0xf1/0x160 [btrfs] [ 2360.849145] [] btrfs_commit_transaction+0x605/0xb00 [btrfs] [ 2360.849148] [] ? lock_release_holdtime+0x3d/0x1c0 [ 2360.849161] [] ? btrfs_mksubvol+0x298/0x360 [btrfs] [ 2360.849164] [] ? wake_up_bit
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/02/2012 07:40 PM, Arne Jansen wrote: > On 02.08.2012 13:34, Liu Bo wrote: >> On 08/02/2012 07:18 PM, Arne Jansen wrote: >>> On 02.08.2012 12:36, Liu Bo wrote: On 08/02/2012 06:30 PM, Stefan Behrens wrote: > On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: >> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: >>> On 08/01/2012 09:07 PM, Jan Schmidt wrote: On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: > On 08/01/2012 07:45 PM, Stefan Behrens wrote: >> With commit acce952b0, btrfs was changed to flag the filesystem with >> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal >> error happened like a write I/O errors of all mirrors. >> In such situations, on unmount, the superblock is written in >> btrfs_error_commit_super(). This is done with the intention to be >> able >> to evaluate the error flag on the next mount. A warning is printed >> in this case during the next mount and the log tree is ignored. >> >> The issue is that it is possible that the superblock points to a root >> that was not written (due to write I/O errors). >> The result is that the filesystem cannot be mounted. btrfsck also >> does >> not start and all the other btrfs-progs tools fail to start as well. >> However, mount -o recovery is working well and does the right things >> to recover the filesystem (i.e., don't use the log root, clear the >> free space cache and use the next mountable root that is stored in >> the >> root backup array). >> >> This patch removes the writing of the superblock when >> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error >> flag in the mount function. >> > > Yes, I have to admit that this can be a serious problem. > > But we'll need to send the error flag stored in the super block into > disk in the future so that the next mount can find it unstable and do > fsck by itself maybe. Hum, that's possible. However, I neither see a) a safe way to get that flag to disk nor b) a situation where this flag would help. When we abort a transaction, we just roll everything back to the last commit, i.e. a consistent state. So if we stop writing a potentially corrupt super block, we should be fine anyway. Or am I missing something? >>> >>> I'm just wondering if we can roll everything back well, why do we need >>> fsck? >> >> If the disks support barriers, we roll everything back very well. The >> most recent superblock on the disks always defines a consistent >> filesystem state. There are only two remaining filesystem consistency >> issues left that can cause inconsistent states, one is the one that the >> patch in this email addresses, and the second one is that the error >> result from barrier_all_devices() is ignored (which I want to change >> next). > > Hi Liu Bo, > > Do you have any remaining objections to that patch? > Hi Stefan, Still I have another question: Our metadata can be flushed into disk if we reach the limit, 32k, so we can end up with updated metadata and the latest superblock if we do not write the current super block. >>> >>> The old metadata stays valid until the new superblock is written, >>> so no problem here, or maybe I don't understand your question :) >>> >> >> Yeah, Arne, you're right :) >> >> But for undetected and unexpected errors as Arne had mentioned, I want >> to keep the error flag which is able to inform users that this FS is >> recommended (but not must) to do fsck at least. > > How about storing the flag in a different location than the superblock? > If the fs is in an unknown state, every write potentially makes it only > worse. > IMO it does not make sense if we don't write the flag into disk, and on ext4's side, it just tries to write the super block. Anyway, for now, our error flag has only been stored in memory, so what about just keep it until we find a graceful way? thanks, liubo >> >> thanks, >> liubo >> Any ideas? thanks, liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.o
Re: [josef-btrfs:master 9/11] fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized in this function
On Wed, Aug 01, 2012 at 10:29:21PM -0600, Miao Xie wrote: > On Thu, 2 Aug 2012 10:31:23 +0800, Fengguang Wu wrote: > > There are new compile warnings show up in > > > > tree: git://git.kernel.org/pub/scm/linux/kernel/git/josef/btrfs-next.git > > master > > head: 8e1163044779e90662e96887cdd692c1594146ad > > commit: dd817e81e81bbb83b63317b169d0e57a5d7ae0e1 [9/11] Btrfs: fix error > > path in create_pending_snapshot() > > config: x86_64-randconfig-h031 (attached as .config) > > > > All error/warnings: > > > > fs/btrfs/transaction.c: In function 'create_pending_snapshot': > > fs/btrfs/transaction.c:1118:6: warning: 'parent' may be used uninitialized > > in this function [-Wmaybe-uninitialized] > > fs/btrfs/transaction.c:1119:19: warning: 'rsv' may be used uninitialized in > > this function [-Wmaybe-uninitialized] > > > > vim +1118 fs/btrfs/transaction.c > > 1115 if (ret) > > 1116 goto abort_trans; > > 1117 fail: > >> 1118 dput(parent); > > 1119 trans->block_rsv = rsv; > > 1120 no_free_objectid: > > 1121 kfree(new_root_item); > > The patch I sent before is based on the v3.5.0, not Chris's latest for-linus > branch, so... > > I will send new patches to fix it. Thanks for your report. > I've pulled these patches, Dave still has concerns about corruption. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: barrier before waitqueue_active
On Thu, Aug 02, 2012 at 04:46:44AM -0600, Liu Bo wrote: > On 08/02/2012 04:25 AM, Josef Bacik wrote: > > We need an smb_mb() before waitqueue_active to avoid missing wakeups. > > Before Mitch was hitting a deadlock between the ordered flushers and the > > transaction commit because the ordered flushers were waiting for more refs > > and were never woken up, so those smp_mb()'s are the most important. > > Everything else I added for correctness sake and to avoid getting bitten by > > this again somewhere else. Thanks, > > > > Hi Josef, > > I'll appreciate a lot if you can add some comments for each memory > barrier, because not everyone knows why it is used here and there. :) > I'm not going to add comments to all those places, you need a memory barrier in places you don't have an implicit barrier before you do waitqueue_active because you could miss somebody being added to the waitqueue, it's just basic correctness. Thanks, Josef -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: barrier before waitqueue_active
On Thu, Aug 2, 2012 at 4:46 AM, Liu Bo wrote: > On 08/02/2012 04:25 AM, Josef Bacik wrote: >> We need an smb_mb() before waitqueue_active to avoid missing wakeups. >> Before Mitch was hitting a deadlock between the ordered flushers and the >> transaction commit because the ordered flushers were waiting for more refs >> and were never woken up, so those smp_mb()'s are the most important. >> Everything else I added for correctness sake and to avoid getting bitten by >> this again somewhere else. Thanks, >> > > Hi Josef, > > I'll appreciate a lot if you can add some comments for each memory > barrier, because not everyone knows why it is used here and there. :) Everyone who wants to know should read the memory-barriers.txt file that's hiding in the oddly named "Documentation" folder of their kernel tree. :) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Btrfs: barrier before waitqueue_active
On Thu, Aug 02, 2012 at 08:11:58AM -0400, Josef Bacik wrote: > On Thu, Aug 02, 2012 at 04:46:44AM -0600, Liu Bo wrote: > > On 08/02/2012 04:25 AM, Josef Bacik wrote: > > > We need an smb_mb() before waitqueue_active to avoid missing wakeups. > > > Before Mitch was hitting a deadlock between the ordered flushers and the > > > transaction commit because the ordered flushers were waiting for more refs > > > and were never woken up, so those smp_mb()'s are the most important. > > > Everything else I added for correctness sake and to avoid getting bitten > > > by > > > this again somewhere else. Thanks, > > > > I'll appreciate a lot if you can add some comments for each memory > > barrier, because not everyone knows why it is used here and there. :) > > I'm not going to add comments to all those places, you need a memory barrier > in > places you don't have an implicit barrier before you do waitqueue_active > because > you could miss somebody being added to the waitqueue, it's just basic > correctness. Thanks, This asks for a helper: + smp_mb(); + if (waitqueue_active(&fs_info->async_submit_wait)) + wake_up(&fs_info->async_submit_wait); -> void wake_up_if_active(wait) { /* * the comment */ smp_mb(); if(waitqueue_active(wait) wake_up(wait); } -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)
On Wed, Aug 01, 2012 at 09:48:15PM +0200, Olivier Bonvalet wrote: > I have some trouble with a btrfs filesystem. > As you can see in logs, there is lines which are from btrfs (I > supposed), then some warnings at fs/btrfs/extent-tree.c, and finally a > "kernel BUG" at fs/btrfs/extent-tree.c:5038. Did you really see this BUG ? It comes from some unused and disabled code: 5036 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0 5037 if (item_size < sizeof(*ei)) { 5038 BUG_ON(found_extent || extent_slot != path->slots[0]); 5039 ret = convert_extent_item_v0(trans, extent_root, path, 5040 owner_objectid, 0); 5041 if (ret < 0) 5042 goto abort; 5043 5044 btrfs_release_path(path); 5045 path->leave_spinning = 1; 5046 5047 key.objectid = bytenr; 5048 key.type = BTRFS_EXTENT_ITEM_KEY; 5049 key.offset = num_bytes; 5050 5051 ret = btrfs_search_slot(trans, extent_root, &key, path, 5052 -1, 1); 5053 if (ret) { 5054 printk(KERN_ERR "umm, got %d back from search" 5055", was looking for %llu\n", ret, 5056(unsigned long long)bytenr); 5057 btrfs_print_leaf(extent_root, path->nodes[0]); 5058 } 5059 if (ret < 0) 5060 goto abort; 5061 extent_slot = path->slots[0]; 5062 leaf = path->nodes[0]; 5063 item_size = btrfs_item_size_nr(leaf, extent_slot); 5064 } 5065 #endif david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: barrier before waitqueue_active V2
We need a barrir before calling waitqueue_active otherwise we will miss wakeups. So in places that do atomic_dec(); then atomic_read() use atomic_dec_return() which imply a memory barrier (see memory-barriers.txt) and then add an explicit memory barrier everywhere else that need them. Thanks, Signed-off-by: Josef Bacik --- V1->V2: changed atomic_dec/atomic_read combos to atomic_dec_return() for the implied barrier. fs/btrfs/compression.c |1 + fs/btrfs/delayed-inode.c |7 +++ fs/btrfs/delayed-ref.c | 18 -- fs/btrfs/disk-io.c |7 --- fs/btrfs/inode.c |4 +--- fs/btrfs/volumes.c |3 +-- 6 files changed, 22 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/compression.c b/fs/btrfs/compression.c index 86eff48..43d1c5a 100644 --- a/fs/btrfs/compression.c +++ b/fs/btrfs/compression.c @@ -818,6 +818,7 @@ static void free_workspace(int type, struct list_head *workspace) btrfs_compress_op[idx]->free_workspace(workspace); atomic_dec(alloc_workspace); wake: + smp_mb(); if (waitqueue_active(workspace_wait)) wake_up(workspace_wait); } diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index 335605c..ee2050c 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -512,8 +512,8 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) rb_erase(&delayed_item->rb_node, root); delayed_item->delayed_node->count--; - atomic_dec(&delayed_root->items); - if (atomic_read(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND && + if (atomic_dec_return(&delayed_root->items) < + BTRFS_DELAYED_BACKGROUND && waitqueue_active(&delayed_root->wait)) wake_up(&delayed_root->wait); } @@ -1055,8 +1055,7 @@ static void btrfs_release_delayed_inode(struct btrfs_delayed_node *delayed_node) delayed_node->count--; delayed_root = delayed_node->root->fs_info->delayed_root; - atomic_dec(&delayed_root->items); - if (atomic_read(&delayed_root->items) < + if (atomic_dec_return(&delayed_root->items) < BTRFS_DELAYED_BACKGROUND && waitqueue_active(&delayed_root->wait)) wake_up(&delayed_root->wait); diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c index da7419e..858ef02 100644 --- a/fs/btrfs/delayed-ref.c +++ b/fs/btrfs/delayed-ref.c @@ -662,9 +662,12 @@ int btrfs_add_delayed_tree_ref(struct btrfs_fs_info *fs_info, add_delayed_tree_ref(fs_info, trans, &ref->node, bytenr, num_bytes, parent, ref_root, level, action, for_cow); - if (!need_ref_seq(for_cow, ref_root) && - waitqueue_active(&fs_info->tree_mod_seq_wait)) - wake_up(&fs_info->tree_mod_seq_wait); + if (!need_ref_seq(for_cow, ref_root)) { + smp_mb(); + if (waitqueue_active(&fs_info->tree_mod_seq_wait)) + wake_up(&fs_info->tree_mod_seq_wait); + } + spin_unlock(&delayed_refs->lock); if (need_ref_seq(for_cow, ref_root)) btrfs_qgroup_record_ref(trans, &ref->node, extent_op); @@ -713,9 +716,11 @@ int btrfs_add_delayed_data_ref(struct btrfs_fs_info *fs_info, add_delayed_data_ref(fs_info, trans, &ref->node, bytenr, num_bytes, parent, ref_root, owner, offset, action, for_cow); - if (!need_ref_seq(for_cow, ref_root) && - waitqueue_active(&fs_info->tree_mod_seq_wait)) - wake_up(&fs_info->tree_mod_seq_wait); + if (!need_ref_seq(for_cow, ref_root)) { + smp_mb(); + if (waitqueue_active(&fs_info->tree_mod_seq_wait)) + wake_up(&fs_info->tree_mod_seq_wait); + } spin_unlock(&delayed_refs->lock); if (need_ref_seq(for_cow, ref_root)) btrfs_qgroup_record_ref(trans, &ref->node, extent_op); @@ -744,6 +749,7 @@ int btrfs_add_delayed_extent_op(struct btrfs_fs_info *fs_info, num_bytes, BTRFS_UPDATE_DELAYED_HEAD, extent_op->is_data); + smp_mb(); if (waitqueue_active(&fs_info->tree_mod_seq_wait)) wake_up(&fs_info->tree_mod_seq_wait); spin_unlock(&delayed_refs->lock); diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 502b20c..2ed81c8 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -754,9 +754,7 @@ static void run_one_async_done(struct btrfs_work *work) limit = btrfs_async_submit_limit(fs_info); limit = limit * 2 / 3; - atomic_dec(&fs_info->nr_async_submits); - - if (atomic_read(&fs_info->nr_async_submits) < limit && + if (atomic_dec_return(&fs_info->nr_async_submits
Re: BTRFS crash on mount with 3.4.4
On Tue, Jul 31, 2012 at 06:01:04PM -0700, Marc MERLIN wrote: > On Tue, Jul 31, 2012 at 11:04:12AM -0700, Marc MERLIN wrote: > > My kernel crashed for some other reason, and now I can't mount my btrfs > > filesystem. > > I don't care about the data, it's backed up. > > > > I'll compile a 3.5 kernel, but is there any info you'd like off that > > filesystem to see why btrfs is crashing on mount? > > On the plus side, 3.5 mounted the filesystem: > [ 183.069179] device label btrfs_pool1 devid 1 transid 20769 /dev/mapper/test > [ 183.087805] btrfs: disk space caching is enabled > [ 183.224784] btrfs: no dev_stats entry found for device /dev/mapper/test > (devid 1) (OK on first mount after mkfs) > [ 186.277795] Btrfs detected SSD devices, enabling SSD mode > [ 188.477443] btrfs: unlinked 6 orphans > [ 188.477451] btrfs: truncated 1 orphans > [ 411.837015] btrfs: unlinked 13 orphans > [ 414.373500] btrfs: unlinked 35 orphans there's this commit in 3.5 that looks like it fixed the problem: Author: Josef Bacik AuthorDate: Wed Jun 27 15:10:56 2012 -0400 Btrfs: fix tree log remove space corner case The tree log stuff can have allocated space that we end up having split across a bitmap and a real extent. The free space code does not deal with this, it assumes that if it finds an extent or bitmap entry that the entire range must fall within the entry it finds. This isn't necessarily the case, so rework the remove function so it can handle this case properly. This fixed two panics the user hit, first in the case where the space was initially in a bitmap and then in an extent entry, and then the reverse case. Thanks, --- david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)
On 02/08/2012 15:22, David Sterba wrote: On Wed, Aug 01, 2012 at 09:48:15PM +0200, Olivier Bonvalet wrote: I have some trouble with a btrfs filesystem. As you can see in logs, there is lines which are from btrfs (I supposed), then some warnings at fs/btrfs/extent-tree.c, and finally a "kernel BUG" at fs/btrfs/extent-tree.c:5038. Did you really see this BUG ? It comes from some unused and disabled code: 5036 #ifdef BTRFS_COMPAT_EXTENT_TREE_V0 5037 if (item_size< sizeof(*ei)) { 5038 BUG_ON(found_extent || extent_slot != path->slots[0]); 5039 ret = convert_extent_item_v0(trans, extent_root, path, 5040 owner_objectid, 0); 5041 if (ret< 0) 5042 goto abort; 5043 5044 btrfs_release_path(path); 5045 path->leave_spinning = 1; 5046 5047 key.objectid = bytenr; 5048 key.type = BTRFS_EXTENT_ITEM_KEY; 5049 key.offset = num_bytes; 5050 5051 ret = btrfs_search_slot(trans, extent_root,&key, path, 5052 -1, 1); 5053 if (ret) { 5054 printk(KERN_ERR "umm, got %d back from search" 5055", was looking for %llu\n", ret, 5056(unsigned long long)bytenr); 5057 btrfs_print_leaf(extent_root, path->nodes[0]); 5058 } 5059 if (ret< 0) 5060 goto abort; 5061 extent_slot = path->slots[0]; 5062 leaf = path->nodes[0]; 5063 item_size = btrfs_item_size_nr(leaf, extent_slot); 5064 } 5065 #endif david Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 02.08.2012 13:57, Liu Bo wrote: > On 08/02/2012 07:40 PM, Arne Jansen wrote: >> On 02.08.2012 13:34, Liu Bo wrote: >>> On 08/02/2012 07:18 PM, Arne Jansen wrote: On 02.08.2012 12:36, Liu Bo wrote: > On 08/02/2012 06:30 PM, Stefan Behrens wrote: >> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: >>> On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: On 08/01/2012 09:07 PM, Jan Schmidt wrote: > On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: >> On 08/01/2012 07:45 PM, Stefan Behrens wrote: >>> With commit acce952b0, btrfs was changed to flag the filesystem with >>> BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal >>> error happened like a write I/O errors of all mirrors. >>> In such situations, on unmount, the superblock is written in >>> btrfs_error_commit_super(). This is done with the intention to be >>> able >>> to evaluate the error flag on the next mount. A warning is printed >>> in this case during the next mount and the log tree is ignored. >>> >>> The issue is that it is possible that the superblock points to a >>> root >>> that was not written (due to write I/O errors). >>> The result is that the filesystem cannot be mounted. btrfsck also >>> does >>> not start and all the other btrfs-progs tools fail to start as well. >>> However, mount -o recovery is working well and does the right things >>> to recover the filesystem (i.e., don't use the log root, clear the >>> free space cache and use the next mountable root that is stored in >>> the >>> root backup array). >>> >>> This patch removes the writing of the superblock when >>> BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error >>> flag in the mount function. >>> >> >> Yes, I have to admit that this can be a serious problem. >> >> But we'll need to send the error flag stored in the super block into >> disk in the future so that the next mount can find it unstable and do >> fsck by itself maybe. > > Hum, that's possible. However, I neither see > > a) a safe way to get that flag to disk > > nor > > b) a situation where this flag would help. When we abort a > transaction, we just > roll everything back to the last commit, i.e. a consistent state. So > if we stop > writing a potentially corrupt super block, we should be fine anyway. > Or am I > missing something? > I'm just wondering if we can roll everything back well, why do we need fsck? >>> >>> If the disks support barriers, we roll everything back very well. The >>> most recent superblock on the disks always defines a consistent >>> filesystem state. There are only two remaining filesystem consistency >>> issues left that can cause inconsistent states, one is the one that the >>> patch in this email addresses, and the second one is that the error >>> result from barrier_all_devices() is ignored (which I want to change >>> next). >> >> Hi Liu Bo, >> >> Do you have any remaining objections to that patch? >> > > Hi Stefan, > > Still I have another question: > > Our metadata can be flushed into disk if we reach the limit, 32k, so we > can end up with updated metadata and the latest superblock if we do not > write the current super block. The old metadata stays valid until the new superblock is written, so no problem here, or maybe I don't understand your question :) >>> >>> Yeah, Arne, you're right :) >>> >>> But for undetected and unexpected errors as Arne had mentioned, I want >>> to keep the error flag which is able to inform users that this FS is >>> recommended (but not must) to do fsck at least. >> >> How about storing the flag in a different location than the superblock? >> If the fs is in an unknown state, every write potentially makes it only >> worse. >> > > IMO it does not make sense if we don't write the flag into disk, and on > ext4's side, it just tries to write the super block. > > Anyway, for now, our error flag has only been stored in memory, so what > about just keep it until we find a graceful way? Yeah, we need this patch to restore consistency. We can define a fixed area on disk (e.g. behind the superblock) where we can write the flag to without risking the superblock. > > > thanks, > liubo > > >>> >>> thanks, >>> liubo >>> > > Any ideas? > > thanks, > liubo > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordo
Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)
On Thu, Aug 02, 2012 at 03:41:03PM +0200, Olivier Bonvalet wrote: > Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ? I was mistaken, it really is enabled unconditionally in ctree.h:55. Josef says that the V0 extent refs are not used for a long time, so the question is how did they appear in your filesystem. Did you switch to a new kernel from a very old one? david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On Thu, Aug 02, 2012 at 03:46:50PM +0200, Arne Jansen wrote: > > Anyway, for now, our error flag has only been stored in memory, so what > > about just keep it until we find a graceful way? > > Yeah, we need this patch to restore consistency. We can define a fixed > area on disk (e.g. behind the superblock) where we can write the flag > to without risking the superblock. Is it possible that sectors around the superblock are somehow damaged and unwritable that writing to them will fail as well? Ie. some safe distance from the superblock would be needed. david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 02.08.2012 15:57, David Sterba wrote: > On Thu, Aug 02, 2012 at 03:46:50PM +0200, Arne Jansen wrote: >>> Anyway, for now, our error flag has only been stored in memory, so what >>> about just keep it until we find a graceful way? >> >> Yeah, we need this patch to restore consistency. We can define a fixed >> area on disk (e.g. behind the superblock) where we can write the flag >> to without risking the superblock. > > Is it possible that sectors around the superblock are somehow damaged > and unwritable that writing to them will fail as well? Ie. some safe > distance from the superblock would be needed. In about 1MB distance you're side-by-side with the superblock, so closer to it might even be safer ;) > > david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: do not use missing devices when showing devname
If you do the following mkfs.btrfs /dev/sdb /dev/sdc rmmod btrfs dd if=/dev/zero of=/dev/sdb bs=1M count=1 mount -o degraded /dev/sdc /mnt/btrfs-test the box will panic trying to deref the name for the missing dev since it is the lower numbered devid. So fix show_devname to not use missing devices. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/super.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index fa61ef5..ffcfe49 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1505,6 +1505,8 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root) while (cur_devices) { head = &cur_devices->devices; list_for_each_entry(dev, head, dev_list) { + if (dev->missing) + continue; if (!first_dev || dev->devid < first_dev->devid) first_dev = dev; } -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: do not strdup non existent strings
When we close devices we add back empty devices for some reason that escapes me. In the case of a missing dev we don't allocate an rcu_string for it's name, so check to see if the device has a name and if it doesn't don't bother strdup()'ing it. Thanks, Signed-off-by: Josef Bacik --- fs/btrfs/volumes.c |8 +--- 1 files changed, 5 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 3821302..0b1e69d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -568,9 +568,11 @@ static int __btrfs_close_devices(struct btrfs_fs_devices *fs_devices) memcpy(new_device, device, sizeof(*new_device)); /* Safe because we are under uuid_mutex */ - name = rcu_string_strdup(device->name->str, GFP_NOFS); - BUG_ON(device->name && !name); /* -ENOMEM */ - rcu_assign_pointer(new_device->name, name); + if (device->name) { + name = rcu_string_strdup(device->name->str, GFP_NOFS); + BUG_ON(device->name && !name); /* -ENOMEM */ + rcu_assign_pointer(new_device->name, name); + } new_device->bdev = NULL; new_device->writeable = 0; new_device->in_fs_metadata = 0; -- 1.7.7.6 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BTRFS crash on mount with 3.4.4
On Thu, Aug 02, 2012 at 03:40:03PM +0200, David Sterba wrote: > On Tue, Jul 31, 2012 at 06:01:04PM -0700, Marc MERLIN wrote: > > On Tue, Jul 31, 2012 at 11:04:12AM -0700, Marc MERLIN wrote: > > > My kernel crashed for some other reason, and now I can't mount my btrfs > > > filesystem. > > > I don't care about the data, it's backed up. > > > > > > I'll compile a 3.5 kernel, but is there any info you'd like off that > > > filesystem to see why btrfs is crashing on mount? > > > > On the plus side, 3.5 mounted the filesystem: > > [ 183.069179] device label btrfs_pool1 devid 1 transid 20769 > > /dev/mapper/test > > [ 183.087805] btrfs: disk space caching is enabled > > [ 183.224784] btrfs: no dev_stats entry found for device /dev/mapper/test > > (devid 1) (OK on first mount after mkfs) > > [ 186.277795] Btrfs detected SSD devices, enabling SSD mode > > [ 188.477443] btrfs: unlinked 6 orphans > > [ 188.477451] btrfs: truncated 1 orphans > > [ 411.837015] btrfs: unlinked 13 orphans > > [ 414.373500] btrfs: unlinked 35 orphans > > there's this commit in 3.5 that looks like it fixed the problem: > > Author: Josef Bacik > AuthorDate: Wed Jun 27 15:10:56 2012 -0400 > > Btrfs: fix tree log remove space corner case > > The tree log stuff can have allocated space that we end up having split > across a bitmap and a real extent. The free space code does not deal with > this, it assumes that if it finds an extent or bitmap entry that the > entire > range must fall within the entry it finds. This isn't necessarily the > case, > so rework the remove function so it can handle this case properly. This > fixed two panics the user hit, first in the case where the space was > initially in a bitmap and then in an extent entry, and then the reverse > case. Thanks, Indeed. Thanks for finding this David and thanks for writing this Josef :) Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On Thu, August 02, 2012 at 15:46 (+0200), Arne Jansen wrote: > On 02.08.2012 13:57, Liu Bo wrote: >> Anyway, for now, our error flag has only been stored in memory, so what >> about just keep it until we find a graceful way? > > Yeah, we need this patch to restore consistency. We can define a fixed > area on disk (e.g. behind the superblock) where we can write the flag > to without risking the superblock. At least we all agree that we need this patch, fine. We don't yet agree that we need a place to store a "please consider fsck" flag. Can I please get one concrete example in which situation we a) do detect the user should really do a file system check AND b) do not abort the transaction to clean the mess up? (An example on how we could fail transaction cleanup is also accepted). If such a situation doesn't exist, there's no need for this flag. The fact that ext has such a flag doesn't convince me, probably because I know nothing about ext. I can imagine that they can detect file system errors without the ability to return to a potentially older consistent state. Thanks, -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On Thu, Aug 2, 2012 at 7:46 AM, Arne Jansen wrote: > On 02.08.2012 13:57, Liu Bo wrote: >> On 08/02/2012 07:40 PM, Arne Jansen wrote: >>> On 02.08.2012 13:34, Liu Bo wrote: On 08/02/2012 07:18 PM, Arne Jansen wrote: > On 02.08.2012 12:36, Liu Bo wrote: >> On 08/02/2012 06:30 PM, Stefan Behrens wrote: >>> On Wed, 01 Aug 2012 16:31:54 +0200, Stefan Behrens wrote: On Wed, 01 Aug 2012 21:31:58 +0800, Liu Bo wrote: > On 08/01/2012 09:07 PM, Jan Schmidt wrote: >> On Wed, August 01, 2012 at 14:02 (+0200), Liu Bo wrote: >>> On 08/01/2012 07:45 PM, Stefan Behrens wrote: With commit acce952b0, btrfs was changed to flag the filesystem with BTRFS_SUPER_FLAG_ERROR and switch to read-only mode after a fatal error happened like a write I/O errors of all mirrors. In such situations, on unmount, the superblock is written in btrfs_error_commit_super(). This is done with the intention to be able to evaluate the error flag on the next mount. A warning is printed in this case during the next mount and the log tree is ignored. The issue is that it is possible that the superblock points to a root that was not written (due to write I/O errors). The result is that the filesystem cannot be mounted. btrfsck also does not start and all the other btrfs-progs tools fail to start as well. However, mount -o recovery is working well and does the right things to recover the filesystem (i.e., don't use the log root, clear the free space cache and use the next mountable root that is stored in the root backup array). This patch removes the writing of the superblock when BTRFS_SUPER_FLAG_ERROR is set, and removes the handling of the error flag in the mount function. >>> >>> Yes, I have to admit that this can be a serious problem. >>> >>> But we'll need to send the error flag stored in the super block into >>> disk in the future so that the next mount can find it unstable and >>> do >>> fsck by itself maybe. >> >> Hum, that's possible. However, I neither see >> >> a) a safe way to get that flag to disk >> >> nor >> >> b) a situation where this flag would help. When we abort a >> transaction, we just >> roll everything back to the last commit, i.e. a consistent state. So >> if we stop >> writing a potentially corrupt super block, we should be fine anyway. >> Or am I >> missing something? >> > > I'm just wondering if we can roll everything back well, why do we > need fsck? If the disks support barriers, we roll everything back very well. The most recent superblock on the disks always defines a consistent filesystem state. There are only two remaining filesystem consistency issues left that can cause inconsistent states, one is the one that the patch in this email addresses, and the second one is that the error result from barrier_all_devices() is ignored (which I want to change next). >>> >>> Hi Liu Bo, >>> >>> Do you have any remaining objections to that patch? >>> >> >> Hi Stefan, >> >> Still I have another question: >> >> Our metadata can be flushed into disk if we reach the limit, 32k, so we >> can end up with updated metadata and the latest superblock if we do not >> write the current super block. > > The old metadata stays valid until the new superblock is written, > so no problem here, or maybe I don't understand your question :) > Yeah, Arne, you're right :) But for undetected and unexpected errors as Arne had mentioned, I want to keep the error flag which is able to inform users that this FS is recommended (but not must) to do fsck at least. >>> >>> How about storing the flag in a different location than the superblock? >>> If the fs is in an unknown state, every write potentially makes it only >>> worse. >>> >> >> IMO it does not make sense if we don't write the flag into disk, and on >> ext4's side, it just tries to write the super block. >> >> Anyway, for now, our error flag has only been stored in memory, so what >> about just keep it until we find a graceful way? > > Yeah, we need this patch to restore consistency. We can define a fixed > area on disk (e.g. behind the superblock) where we can write the flag > to without risking the superblock. Is there a reason btrfs_error_commit_super couldn't do the as treelog: update only the first
Re: kernel BUG at fs/btrfs/extent-tree.c:5038 (linux 3.4.7)
On 02/08/2012 15:53, David Sterba wrote: On Thu, Aug 02, 2012 at 03:41:03PM +0200, Olivier Bonvalet wrote: Yes... it's a copy from my /var/log/kern.log. Is it really "disabled" ? I was mistaken, it really is enabled unconditionally in ctree.h:55. Josef says that the V0 extent refs are not used for a long time, so the question is how did they appear in your filesystem. Did you switch to a new kernel from a very old one? david -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html mmm I have upgraded that system every two months, but the first setup is from april 2011 I think. Should I start a "scrub" on all that systems installed a long time ago ? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote: > > I've the the fio tests in: > > /dev/mapper/cryptroot /var btrfs > > rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 > > … you are still using dm_crypt? That was my biggest partition and so far I've found no performance impact on file access between unencrypted and dm_crypt. I just took out my swap partition and made a smaller btrfs there: /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0 I mounted without discard. > > lat (usec) : 50=0.01% > > lat (msec) : 10=0.02%, 20=0.02%, 50=0.05%, 100=0.14%, 250=12.89% > > lat (msec) : 500=72.44%, 750=14.43% > > Gosh, look at these latencies! > > 72,44% of all requests above 500 (in words: five hundred) milliseconds! > And 14,43% above 750 msecs. The percentage of requests served at 100 msecs > or less is was below one percent! Hey, is this an SSD or what? Yeah, that's kind of what I've been complaining about since the beginning :) Once I'm reading sequentially, it goes fast, but random access/latency is indeed abysmal. > Still even with iodepth 64 totally different picture. And look at the IOPS > and throughput. Yep. I know mine are bad :( > For reference, this refers to > > [global] > ioengine=libaio > direct=1 > iodepth=64 Since it's slightly different than the first job file you gave me, I re-ran with this one this time. gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 gandalfthegreat:~# grep sda2 /proc/mounts /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 here's the btrfs test (ext4 is lower down): gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=64 2.0.8 Starting 4 processes zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec slat (usec): min=3 , max=37432 , avg=7252.52, stdev=5717.70 clat (usec): min=13 , max=981927 , avg=454046.13, stdev=110527.92 lat (msec): min=5 , max=999 , avg=461.30, stdev=112.00 clat percentiles (msec): | 1.00th=[ 145], 5.00th=[ 269], 10.00th=[ 371], 20.00th=[ 408], | 30.00th=[ 424], 40.00th=[ 437], 50.00th=[ 449], 60.00th=[ 457], | 70.00th=[ 474], 80.00th=[ 490], 90.00th=[ 570], 95.00th=[ 644], | 99.00th=[ 865], 99.50th=[ 922], 99.90th=[ 963], 99.95th=[ 979], | 99.99th=[ 979] bw (KB/s) : min=8, max= 2807, per=100.00%, avg=1227.75, stdev=317.57 lat (usec) : 20=0.01% lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58% cpu : usr=0.12%, sys=1.13%, ctx=12186, majf=0, minf=276 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=0.4%, >=64=99.2% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued: total=r=8262/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=30340 read : io=2048.0MB, bw=211257KB/s, iops=23473 , runt= 9927msec slat (usec): min=1 , max=56321 , avg=20.51, stdev=424.44 clat (usec): min=0 , max=57987 , avg=2695.98, stdev=6624.00 lat (usec): min=1 , max=58015 , avg=2716.75, stdev=6642.09 clat percentiles (usec): | 1.00th=[1], 5.00th=[ 10], 10.00th=[ 30], 20.00th=[ 100], | 30.00th=[ 217], 40.00th=[ 362], 50.00th=[ 494], 60.00th=[ 636], | 70.00th=[ 892], 80.00th=[ 1656], 90.00th=[ 7392], 95.00th=[21632], | 99.00th=[29056], 99.50th=[29568], 99.90th=[43776], 99.95th=[46848], | 99.99th=[57600] bw (KB/s) : min=166675, max=260984, per=99.83%, avg=210892.26, stdev=22433.65 lat (usec) : 2=2.16%, 4=0.43%, 10=2.33%, 20=2.80%, 50=5.72% lat (usec) : 100=6.47%, 250=12.35%, 500=18.29%, 750=15.52%, 1000=5.44% lat (msec) : 2=13.04%, 4=3.59%, 10=3.83%, 20=1.88%, 50=6.11% lat (msec) : 100=0.04% cpu : usr=4.51%, sys=35.70%, ctx=11480, majf=0, minf=278 IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0% issued: total=r=233025/w=0/d=0, short=r=0/w=0/d=0 zufälligschreiben: (
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN: > On Thu, Aug 02, 2012 at 01:18:07PM +0200, Martin Steigerwald wrote: > > > I've the the fio tests in: > > > /dev/mapper/cryptroot /var btrfs > > > rw,noatime,compress=lzo,nossd,discard,space_cache 0 0 > > > > … you are still using dm_crypt? […] > I just took out my swap partition and made a smaller btrfs there: > /dev/sda3 /mnt/mnt3 btrfs rw,noatime,ssd,space_cache 0 0 > > I mounted without discard. […] > > For reference, this refers to > > > > [global] > > ioengine=libaio > > direct=1 > > iodepth=64 > > Since it's slightly different than the first job file you gave me, I re-ran > with this one this time. > > gandalfthegreat:~# /sbin/mkfs.btrfs -L test /dev/sda2 > gandalfthegreat:~# mount -o noatime /dev/sda2 /mnt/mnt2 > gandalfthegreat:~# grep sda2 /proc/mounts > /dev/sda2 /mnt/mnt2 btrfs rw,noatime,ssd,space_cache 0 0 > > here's the btrfs test (ext4 is lower down): > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > 2.0.8 Still abysmal except for sequential reads. > Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [59.5% done] [0K/1800K /s] [0 /193 iops] [eta 02m:10s] > > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30318 > read : io=73682KB, bw=1227.1KB/s, iops=137 , runt= 60004msec […] > lat (usec) : 20=0.01% > lat (msec) : 10=0.01%, 20=0.02%, 50=0.04%, 100=0.46%, 250=3.82% > lat (msec) : 500=79.48%, 750=13.57%, 1000=2.58% 1 second latency? Oh well… > Run status group 3 (all jobs): > WRITE: io=84902KB, aggrb=1414KB/s, minb=1414KB/s, maxb=1414KB/s, > mint=60005msec, maxt=60005msec > gandalfthegreat:/mnt/mnt2# > (no lines beyond this, fio 2.0.8) […] > > Can you also post the last lines: > > > > Disk stats (read/write): > > dm-2: ios=616191/613142, merge=0/0, ticks=1300820/2565384, > > in_queue=3867448, util=98.81%, aggrios=504829/504643, > > aggrmerge=111362/111451, aggrticks=1058320/2164664, aggrin_queue=3223048, > > aggrutil=98.78% > > sda: ios=504829/504643, merge=111362/111451, ticks=1058320/2164664, > > in_queue=3223048, util=98.78% > > martin@merkaba:~/Artikel/LinuxNewMedia/fio/Recherche/Messungen/merkaba> > > I didn't get these lines. Hmmm, back then I guess this was fio 1.5.9 or so. > I tried deadline and noop, and indeed I'm not see much of a difference for my > basic tests. > Fro now I have deadline. > > > So my recommendation of now: > > > > Remove as much factors as possible and in order to compare results with > > what I posted try with plain logical volume with Ext4. > > gandalfthegreat:~# mkfs.ext4 -O extent -b 4096 -E stride=128,stripe-width=128 > /dev/sda2 > /dev/sda2 /mnt/mnt2 ext4 rw,noatime,stripe=128,data=ordered 0 0 > > gandalfthegreat:/mnt/mnt2# fio ~/fio.job2 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > zufälligschreiben: (g=2): rw=randwrite, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > sequentiellschreiben: (g=3): rw=write, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=64 > 2.0.8 > Starting 4 processes > zufälliglesen: Laying out IO file(s) (1 file(s) / 2048MB) > sequentielllesen: Laying out IO file(s) (1 file(s) / 2048MB) > zufälligschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > sequentiellschreiben: Laying out IO file(s) (1 file(s) / 2048MB) > Jobs: 1 (f=1): [___W] [63.8% done] [0K/2526K /s] [0 /280 iops] [eta 01m:21s] > > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=30077 > read : io=2048.0MB, bw=276232KB/s, iops=50472 , runt= 7592msec > slat (usec): min=2 , max=2276 , avg= 6.87, stdev=12.01 > clat (usec): min=249 , max=52128 , avg=1258.87, stdev=1714.63 > lat (usec): min=260 , max=52134 , avg=1266.00, stdev=1715.36 > clat percentiles (usec): > | 1.00th=[ 450], 5.00th=[ 548], 10.00th=[ 620], 20.00th=[ 724], > | 30.00th=[ 820], 40.00th=[ 908], 50.00th=[ 1004], 60.00th=[ 1096], > | 70.00th=[ 1208], 80.00th=[ 1368], 90.00th=[ 1640], 95.00th=[ 2040], > | 99.00th=[ 8256], 99.50th=[14912], 99.90th=[21120], 99.95th=[23168], > | 99.99th=[33024] > bw (KB/s) : min=76463, max=385328, per=100.00%, avg=277313.20, > stdev=94661.29 > lat (usec) : 250=0.01%, 500=2.46%, 750=19.82%, 1000=27.70% > lat (msec) : 2=44.79%, 4=3.55%, 10=0.72%, 20=0.78%, 50=0.17% > lat (msec) : 100=0.01% >
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote: > Hey, whats this? With Ext4 you have really good random read performance > now! Way better than the Intel SSD 320 and… Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. Obviously it's sending IO in a way that either the IO subsystem, linux driver, or drive prefer. > The performance of your Samsung SSD seems to be quite erratic. It seems > that the device is capable of being fast, but only sometimes shows this > capability. That's indeed exactly what I'm seeing in real life :) > > > Have the IOPS run on the device it self. That will remove any filesystem > > > layer. But only the read only tests, to make sure I suggest to use fio > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > that you can afford to use for write testing which will likely destroy > > > every filesystem on it. Or let it run on just one logical volume. > > > > Can you send me a recommended job config you'd like me to run if the runs > > above haven't already answered your questions? > [global] (...) I used this and just changed filename to /dev/sda. Since I'm reading from the beginning of the drive, reads have to be aligned. > I won´t expect much of a difference, but then the random read performance > is quite different between Ext4 and BTRFS on this disk. That would make > it interesting to test without any filesystem in between and over the > whole device. Here is the output: gandalfthegreat:~# fio --readonly ./fio.job3 zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 clat percentiles (usec): | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], | 99.99th=[27264] bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% lat (usec) : 1000=0.02% lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued: total=r=6532/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 clat percentiles (usec): | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], | 70.00th=[12608], 80.00th=[12736], 90.00th=[16512], 95.00th=[17536], | 99.00th=[18816], 99.50th=[19584], 99.90th=[24960], 99.95th=[29568], | 99.99th=[34048] bw (KB/s) : min= 405, max= 2680, per=100.00%, avg=912.92, stdev=261.62 lat (usec) : 50=0.41%, 100=1.77%, 250=1.20%, 500=0.23%, 750=0.02% lat (usec) : 1000=0.03% lat (msec) : 2=0.02%, 10=43.06%, 20=52.91%, 50=0.36% cpu : usr=0.15%, sys=0.45%, ctx=6103, majf=0, minf=28 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued: total=r=6101/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=59036KB, aggrb=983KB/s, minb=983KB/s, maxb=983KB/s, mint=60002msec, maxt=60002msec Run status group 1 (all jobs): READ: io=54658KB, aggrb=910KB/s, minb=910KB/s, maxb=910KB/s, mint=60002msec, maxt=60002msec Disk stats (read/write): sda: ios=12660/2072, merge=5/34, ticks=119452/22496, in_queue=141936, util=99.30% > … or get yourself another SSD. Its your decision. > > I admire your endurance. ;) Since I've gotten 2 SSDs to make sure I didn't get one bad one, and that the company is greating great reviews for them, I'm now pretty sure that it's either a problem with a linux driver, which is interesting for us all to debug :)
Re: How can btrfs take 23sec to stat 23K files from an SSD?
Am Donnerstag, 2. August 2012 schrieb Marc MERLIN: > On Thu, Aug 02, 2012 at 10:20:07PM +0200, Martin Steigerwald wrote: > > Hey, whats this? With Ext4 you have really good random read performance > > now! Way better than the Intel SSD 320 and… > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > Obviously it's sending IO in a way that either the IO subsystem, linux > driver, or drive prefer. But only on reads. > > > > Have the IOPS run on the device it self. That will remove any filesystem > > > > layer. But only the read only tests, to make sure I suggest to use fio > > > > with the --readonly option as safety guard. Unless you have a spare SSD > > > > that you can afford to use for write testing which will likely destroy > > > > every filesystem on it. Or let it run on just one logical volume. > > > > > > Can you send me a recommended job config you'd like me to run if the runs > > > above haven't already answered your questions? > > > [global] > (...) > > I used this and just changed filename to /dev/sda. Since I'm reading > from the beginning of the drive, reads have to be aligned. > > > I won´t expect much of a difference, but then the random read performance > > is quite different between Ext4 and BTRFS on this disk. That would make > > it interesting to test without any filesystem in between and over the > > whole device. > > Here is the output: > gandalfthegreat:~# fio --readonly ./fio.job3 > zufälliglesen: (g=0): rw=randread, bs=2K-16K/2K-16K, ioengine=libaio, > iodepth=1 > sequentielllesen: (g=1): rw=read, bs=2K-16K/2K-16K, ioengine=libaio, iodepth=1 > 2.0.8 > Starting 2 processes > Jobs: 1 (f=1): [_R] [66.9% done] [966K/0K /s] [108 /0 iops] [eta 01m:00s] > zufälliglesen: (groupid=0, jobs=1): err= 0: pid=2172 > read : io=59036KB, bw=983.93KB/s, iops=108 , runt= 60002msec WTF? Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely without it. Also I noticed that I had iodepth 1 in there to circumvent any in drive cache / optimization. > slat (usec): min=5 , max=158 , avg=27.62, stdev=10.64 > clat (usec): min=45 , max=27348 , avg=9150.78, stdev=4452.66 > lat (usec): min=53 , max=27370 , avg=9179.05, stdev=4454.88 > clat percentiles (usec): > | 1.00th=[ 126], 5.00th=[ 235], 10.00th=[ 5216], 20.00th=[ 5920], > | 30.00th=[ 5920], 40.00th=[ 5984], 50.00th=[ 7712], 60.00th=[12480], > | 70.00th=[12608], 80.00th=[12736], 90.00th=[12864], 95.00th=[16768], > | 99.00th=[18560], 99.50th=[18816], 99.90th=[20352], 99.95th=[22656], > | 99.99th=[27264] > bw (KB/s) : min= 423, max= 5776, per=100.00%, avg=986.48, stdev=480.68 > lat (usec) : 50=0.11%, 100=0.64%, 250=4.47%, 500=1.65%, 750=0.02% > lat (usec) : 1000=0.02% > lat (msec) : 2=0.06%, 4=0.03%, 10=43.31%, 20=49.51%, 50=0.18% > cpu : usr=0.17%, sys=0.45%, ctx=6534, majf=0, minf=26 > IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% > submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, > >=64=0.0% > issued: total=r=6532/w=0/d=0, short=r=0/w=0/d=0 Latency is still way to high even with iodepth 1, 10 milliseconds for 43% of requests. And the throughput and IOPS is still abysmal even for iodepth 1 (see below for Intel SSD 320 values). Okay, one further idea: Remove the bsrange to just test with 4k blocks. Additionally test these are aligned with blockalign=int[,int], ba=int[,int] At what boundary to align random IO offsets. Defaults to the same as 'blocksize' the minimum blocksize given. Minimum alignment is typically 512b for using direct IO, though it usually depends on the hardware block size. This option is mutually exclusive with using a random map for files, so it will turn off that option. I would first test with 4k blocks as is. And then do something like: blocksize=4k blockalign=4k And then raise blockalign to some values that may matter like 8k, 128k, 512k, 1m or so. But thats just guess work. I do not even exactly now if it works this way in fio. There is something pretty wierd going on. But I am not sure what it is. Maybe an alignment issue, since Ext4 with stripe alignment was able to do so much faster on reads. > sequentielllesen: (groupid=1, jobs=1): err= 0: pid=2199 > read : io=54658KB, bw=932798 B/s, iops=101 , runt= 60002msec Ey, whats this? > slat (usec): min=5 , max=140 , avg=28.63, stdev= 9.91 > clat (usec): min=39 , max=34210 , avg=9799.18, stdev=4471.32 > lat (usec): min=45 , max=34228 , avg=9828.50, stdev=4472.06 > clat percentiles (usec): > | 1.00th=[ 61], 5.00th=[ 5088], 10.00th=[ 5856], 20.00th=[ 5920], > | 30.00th=[ 5984], 40.00th=[ 6048], 50.00th=[11840], 60.00th=[12608], > | 70.00th=[12608], 80.00th=[12736], 90.00th=
Re: How can btrfs take 23sec to stat 23K files from an SSD?
On Thu, Aug 02, 2012 at 11:21:51PM +0200, Martin Steigerwald wrote: > > Yep, my du -sh tests do show that ext4 is 2x faster than btrfs. > > Obviously it's sending IO in a way that either the IO subsystem, linux > > driver, or drive prefer. > > But only on reads. Yes. I was focussing on the simple case. SSDs can be slower than hard drives on writes, and now you're getting into deep firmware magic for corner cases, so I didn't want to touch that here :) > WTF? > > Hey, did you adapt the size= keyword? It seems fio 2.0.8 can do completely > without it. I left it as is: [global] ioengine=libaio direct=1 filename=/dev/sda size=476940m bsrange=2k-16k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 I just removed it now. > Okay, one further idea: Remove the bsrange to just test with 4k blocks. Ok, there you go: [global] ioengine=libaio direct=1 filename=/dev/sda blocksize=4k blockalign=4k [zufälliglesen] rw=randread runtime=60 [sequentielllesen] stonewall rw=read runtime=60 gandalfthegreat:~# fio --readonly ./fio.job4 zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [1476K/0K /s] [369 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3614 read : io=60832KB, bw=1013.7KB/s, iops=253 , runt= 60012msec slat (usec): min=4 , max=892 , avg=24.36, stdev=16.05 clat (usec): min=2 , max=27575 , avg=3915.39, stdev=4990.69 lat (usec): min=44 , max=27591 , avg=3940.35, stdev=4991.99 clat percentiles (usec): | 1.00th=[ 41], 5.00th=[ 46], 10.00th=[ 51], 20.00th=[ 66], | 30.00th=[ 113], 40.00th=[ 163], 50.00th=[ 205], 60.00th=[ 4960], | 70.00th=[ 5856], 80.00th=[11584], 90.00th=[12480], 95.00th=[12608], | 99.00th=[14272], 99.50th=[17280], 99.90th=[18816], 99.95th=[21120], | 99.99th=[22912] bw (KB/s) : min= 253, max= 2520, per=100.00%, avg=1014.86, stdev=364.74 lat (usec) : 4=0.05%, 10=0.01%, 20=0.01%, 50=8.07%, 100=19.94% lat (usec) : 250=26.26%, 500=2.78%, 750=0.30%, 1000=0.09% lat (msec) : 2=0.12%, 4=0.07%, 10=20.75%, 20=21.51%, 50=0.05% cpu : usr=0.31%, sys=0.88%, ctx=15250, majf=0, minf=23 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued: total=r=15208/w=0/d=0, short=r=0/w=0/d=0 sequentielllesen: (groupid=1, jobs=1): err= 0: pid=3636 read : io=88468KB, bw=1474.3KB/s, iops=368 , runt= 60009msec slat (usec): min=3 , max=340 , avg=16.46, stdev=10.23 clat (usec): min=1 , max=27356 , avg=2692.99, stdev=4564.76 lat (usec): min=30 , max=27370 , avg=2709.87, stdev=4566.31 clat percentiles (usec): | 1.00th=[ 27], 5.00th=[ 29], 10.00th=[ 31], 20.00th=[ 35], | 30.00th=[ 47], 40.00th=[ 61], 50.00th=[ 78], 60.00th=[ 107], | 70.00th=[ 270], 80.00th=[ 5856], 90.00th=[11712], 95.00th=[12608], | 99.00th=[16512], 99.50th=[17536], 99.90th=[18816], 99.95th=[19584], | 99.99th=[21120] bw (KB/s) : min= 282, max= 5096, per=100.00%, avg=1474.40, stdev=899.36 lat (usec) : 2=0.05%, 4=0.01%, 20=0.02%, 50=32.45%, 100=24.61% lat (usec) : 250=12.52%, 500=1.40%, 750=0.03%, 1000=0.03% lat (msec) : 2=0.02%, 4=0.03%, 10=13.95%, 20=14.85%, 50=0.03% cpu : usr=0.33%, sys=0.90%, ctx=22176, majf=0, minf=25 IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0% submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0% issued: total=r=22117/w=0/d=0, short=r=0/w=0/d=0 Run status group 0 (all jobs): READ: io=60832KB, aggrb=1013KB/s, minb=1013KB/s, maxb=1013KB/s, mint=60012msec, maxt=60012msec Run status group 1 (all jobs): READ: io=88468KB, aggrb=1474KB/s, minb=1474KB/s, maxb=1474KB/s, mint=60009msec, maxt=60009msec Disk stats (read/write): sda: ios=54013/1836, merge=291/17, ticks=251660/11052, in_queue=262668, util=99.51% and the same test with blockalign=512k: gandalfthegreat:~# fio --readonly ./fio.job4 # blockalign=512k fio: Any use of blockalign= turns off randommap zufälliglesen: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 sequentielllesen: (g=1): rw=read, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=1 2.0.8 Starting 2 processes Jobs: 1 (f=1): [_R] [66.9% done] [4180K/0K /s] [1045 /0 iops] [eta 01m:00s] zufälliglesen: (groupid=0, jobs=1): err= 0: pid=3679 read : io=56096KB, bw=957355 B/s, iops=233 , runt= 60001msec slat (usec): min=4 , max=344 , avg=27.73, stdev=15.51 clat (usec): min=2 , max=38954 , avg=4244.27, stdev=5153.32 lat (usec): min=44 , max=38980 , avg=4
[current linus/master] deadlock during snapshot, sb_writers
Hi, top commit (v3.5-9237) fc6bdb59a501740b28ed3b616641a22c8dc5dd31, observed with btrfs-next/master as well. mount: /dev/sdb on /mnt/sdb type btrfs (rw,relatime,space_cache) fresh mkfs, snapshot stresstest, stuck on first snapshot call afaics [ 113.751610] = [ 113.752178] [ INFO: possible recursive locking detected ] [ 113.752178] 3.5.0linus-default+ #174 Not tainted [ 113.752178] - [ 113.752178] btrfs/3125 is trying to acquire lock: [ 113.752178] (sb_writers#10){.+.+.+}, at: [] mnt_want_write+0x24/0x50 [ 113.752178] [ 113.752178] but task is already holding lock: [ 113.752178] (sb_writers#10){.+.+.+}, at: [] mnt_want_write_file+0x28/0x60 [ 113.752178] [ 113.752178] other info that might help us debug this: [ 113.752178] Possible unsafe locking scenario: [ 113.752178] [ 113.752178]CPU0 [ 113.752178] [ 113.752178] lock(sb_writers#10); [ 113.752178] lock(sb_writers#10); [ 113.752178] [ 113.752178] *** DEADLOCK *** [ 113.752178] [ 113.856366] May be due to missing lock nesting notation [ 113.856366] [ 113.856366] 1 lock held by btrfs/3125: [ 113.856366] #0: (sb_writers#10){.+.+.+}, at: [] mnt_want_write_file+0x28/0x60 [ 113.856366] [ 113.856366] stack backtrace: [ 113.856366] Pid: 3125, comm: btrfs Not tainted 3.5.0linus-default+ #174 [ 113.856366] Call Trace: [ 113.856366] [] __lock_acquire+0x1663/0x1e90 [ 113.856366] [] ? trace_hardirqs_off_caller+0x29/0xc0 [ 113.856366] [] ? native_sched_clock+0x15/0x80 [ 113.856366] [] ? local_clock+0x6f/0x80 [ 113.856366] [] ? trace_hardirqs_off_caller+0x29/0xc0 [ 113.856366] [] ? mnt_want_write+0x24/0x50 [ 113.856366] [] lock_acquire+0x94/0x130 [ 113.856366] [] ? mnt_want_write+0x24/0x50 [ 113.856366] [] __sb_start_write+0x9c/0x1b0 [ 113.856366] [] ? mnt_want_write+0x24/0x50 [ 113.856366] [] ? native_sched_clock+0x15/0x80 [ 113.856366] [] ? mnt_want_write+0x24/0x50 [ 113.856366] [] ? trace_hardirqs_off_caller+0x29/0xc0 [ 113.856366] [] mnt_want_write+0x24/0x50 [ 113.856366] [] ? fget+0x84/0x100 [ 113.856366] [] btrfs_mksubvol+0x47/0x370 [btrfs] [ 113.856366] [] btrfs_ioctl_snap_create_transid+0xfa/0x190 [btrfs] [ 113.856366] [] ? might_fault+0x9c/0xb0 [ 113.856366] [] ? might_fault+0x53/0xb0 [ 113.856366] [] btrfs_ioctl_snap_create_v2+0x106/0x140 [btrfs] [ 113.856366] [] ? lock_release_holdtime+0x3d/0x1c0 [ 113.856366] [] ? do_page_fault+0x2d0/0x580 [ 113.856366] [] btrfs_ioctl+0x558/0x19c0 [btrfs] [ 113.856366] [] ? up_read+0x23/0x40 [ 113.856366] [] ? do_page_fault+0x2d0/0x580 [ 113.856366] [] ? _raw_spin_unlock_irq+0x30/0x50 [ 113.856366] [] ? finish_task_switch+0x83/0xf0 [ 113.856366] [] do_vfs_ioctl+0x98/0x560 [ 113.856366] [] ? retint_swapgs+0x13/0x1b [ 113.856366] [] sys_ioctl+0x4f/0x80 [ 113.856366] [] system_call_fastpath+0x16/0x1b tasks in D-state: 3030 ?D 0:00 [btrfs-endio-wri] [] btrfs_run_delayed_refs+0x402/0x560 [btrfs] [] __btrfs_end_transaction+0xcf/0x320 [btrfs] [] btrfs_end_transaction+0x15/0x20 [btrfs] [] btrfs_finish_ordered_io+0x19b/0x410 [btrfs] [] finish_ordered_fn+0x15/0x20 [btrfs] [] worker_loop+0xc4/0x5a0 [btrfs] [] kthread+0xae/0xc0 [] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 3036 ?D 0:00 [btrfs-transacti] [] btrfs_run_delayed_refs+0x402/0x560 [btrfs] [] btrfs_commit_transaction+0x9d/0xb50 [btrfs] [] transaction_kthread+0x1ae/0x240 [btrfs] [] kthread+0xae/0xc0 [] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 3104 pts/3D+ 0:06 tar x [] wait_current_trans+0xc5/0x120 [btrfs] [] start_transaction+0x14e/0x400 [btrfs] [] btrfs_start_transaction+0x13/0x20 [btrfs] [] btrfs_create+0x46/0x220 [btrfs] [] vfs_create+0x89/0xc0 [] do_last+0x853/0xd10 [] path_openat+0xb6/0x4a0 [] do_filp_open+0x49/0xa0 [] do_sys_open+0x102/0x1e0 [] sys_openat+0x14/0x20 [] system_call_fastpath+0x16/0x1b [] 0x PID TTY STAT TIME COMMAND 3115 ?D 0:00 [btrfs-endio-wri] [] btrfs_run_delayed_refs+0x402/0x560 [btrfs] [] __btrfs_end_transaction+0xcf/0x320 [btrfs] [] btrfs_end_transaction+0x15/0x20 [btrfs] [] btrfs_finish_ordered_io+0x19b/0x410 [btrfs] [] finish_ordered_fn+0x15/0x20 [btrfs] [] worker_loop+0xc4/0x5a0 [btrfs] [] kthread+0xae/0xc0 [] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 3116 ?D 0:00 [btrfs-endio-wri] [] btrfs_run_delayed_refs+0x402/0x560 [btrfs] [] __btrfs_end_transaction+0xcf/0x320 [btrfs] [] btrfs_end_transaction+0x15/0x20 [btrfs] [] btrfs_finish_ordered_io+0x19b/0x410 [btrfs] [] finish_ordered_fn+0x15/0x20 [btrfs] [] worker_loop+0xc4/0x5a0 [btrfs] [] kthread+0xae/0xc0 [] kernel_thread_helper+0x4/0x10 [] 0x PID TTY STAT TIME COMMAND 3117 ?D 0:00 [btrfs-endio-wri] [] btrfs_run_delayed_refs+0
Re: [current linus/master] deadlock during snapshot, sb_writers
On Fri, Aug 3, 2012 at 12:16 AM, David Sterba wrote: > Hi, > > top commit (v3.5-9237) fc6bdb59a501740b28ed3b616641a22c8dc5dd31, observed with > btrfs-next/master as well. > > mount: > /dev/sdb on /mnt/sdb type btrfs (rw,relatime,space_cache) > > fresh mkfs, snapshot stresstest, stuck on first snapshot call afaics > The problem seems to be a merge error. Commit a874a63 removed the mnt_want_write call from btrfs_mksubvol and added a replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid. Commit e7848683 however tried to move all calls to mnt_want_write above i_mutex. So somewhere while merging this, it got mixed up. The solution is to remove the mnt_want_write call completely from mksubvol. Will send a patch in a few minutes. [...] > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: remove mnt_want_write call in btrfs_mksubvol
We got a recursive lock in mksubvol because the caller already held a lock. I think we got into this due to a merge error. Commit a874a63 removed the mnt_want_write call from btrfs_mksubvol and added a replacement call to mnt_want_write_file in btrfs_ioctl_snap_create_transid. Commit e7848683 however tried to move all calls to mnt_want_write above i_mutex. So somewhere while merging this, it got mixed up. The solution is to remove the mnt_want_write call completely from mksubvol. Reported-by: David Sterba Signed-off-by: Alexander Block --- fs/btrfs/ioctl.c |5 - 1 file changed, 5 deletions(-) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index 00ddf22..9df50fa 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -664,10 +664,6 @@ static noinline int btrfs_mksubvol(struct path *parent, struct dentry *dentry; int error; - error = mnt_want_write(parent->mnt); - if (error) - return error; - mutex_lock_nested(&dir->i_mutex, I_MUTEX_PARENT); dentry = lookup_one_len(name, parent->dentry, namelen); @@ -703,7 +699,6 @@ out_dput: dput(dentry); out_unlock: mutex_unlock(&dir->i_mutex); - mnt_drop_write(parent->mnt); return error; } -- 1.7.10.4 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V3 2/2] Btrfs: fix the snapshot that should not exist
On Thu, 2 Aug 2012 13:46:31 +0200, David Sterba wrote: > Hi, > > appologies for late reply, > > On Thu, Aug 02, 2012 at 12:40:46PM +0800, Miao Xie wrote: >> Changelog v1 -> v2: >> - add comment to explain why we need deal with the delayed items after >> snapshot creation and why this operation do not corrupt the metadata. > > I'm sorry, the comment did not fix the bug :) > > The subvol stress is able to hit this: > > [ 2360.444321] [ cut here ] > [ 2360.448019] kernel BUG at fs/btrfs/extent-tree.c:6047! > [ 2360.448019] invalid opcode: [#1] SMP > [ 2360.448019] CPU 0 > [ 2360.448019] Modules linked in: btrfs aoe [last unloaded: btrfs] > [ 2360.448019] > [ 2360.448019] Pid: 8212, comm: btrfs Not tainted 3.5.0-default+ #170 Intel > Corporation Santa Rosa platform/Matanzas > [ 2360.448019] RIP: 0010:[] [] > run_clustered_refs+0xa11/0xa20 [btrfs] > [ 2360.448019] RSP: 0018:88003eca1a68 EFLAGS: 00010246 > [ 2360.448019] RAX: 07ff RBX: 880017a694c8 RCX: > 88003eca1a08 > [ 2360.448019] RDX: 880028aa9000 RSI: 07fe RDI: > 880064223cf0 > [ 2360.448019] RBP: 88003eca1b48 R08: 07ff R09: > 88003eca19f8 > [ 2360.448019] R10: 88002435d1e8 R11: R12: > 880025d66d28 > [ 2360.448019] R13: 88003864 R14: 8800778dfa88 R15: > 880060f010d0 > [ 2360.448019] FS: 7f3289f35740() GS:88007dc0() > knlGS: > [ 2360.448019] CS: 0010 DS: ES: CR0: 8005003b > [ 2360.448019] CR2: ff600400 CR3: 2e112000 CR4: > 07f0 > [ 2360.448019] DR0: DR1: DR2: > > [ 2360.448019] DR3: DR6: 0ff0 DR7: > 0400 > [ 2360.448019] Process btrfs (pid: 8212, threadinfo 88003eca, task > 88001d834200) > [ 2360.448019] Stack: > [ 2360.448019] 0001 > > [ 2360.448019] 07ed 88002435d1e8 3eca1b18 > > [ 2360.448019] 0770 5cb1e000 > 88003eca1c08 > [ 2360.448019] Call Trace: > [ 2360.448019] [] btrfs_run_delayed_refs+0x1c9/0x550 > [btrfs] > [ 2360.448019] [] ? trace_hardirqs_on_caller+0x155/0x1d0 > [ 2360.448019] [] ? btrfs_free_path+0x2a/0x40 [btrfs] > [ 2360.448019] [] ? btrfs_run_delayed_items+0xf1/0x160 > [btrfs] > [ 2360.448019] [] btrfs_commit_transaction+0x605/0xb00 > [btrfs] > [ 2360.448019] [] ? lock_release_holdtime+0x3d/0x1c0 > [ 2360.448019] [] ? btrfs_mksubvol+0x298/0x360 [btrfs] > [ 2360.448019] [] ? wake_up_bit+0x40/0x40 > [ 2360.448019] [] ? do_raw_spin_unlock+0x5e/0xb0 > [ 2360.448019] [] btrfs_mksubvol+0x358/0x360 [btrfs] > [ 2360.448019] [] > btrfs_ioctl_snap_create_transid+0x10a/0x190 [btrfs] > [ 2360.448019] [] > btrfs_ioctl_snap_create_v2.clone.0+0xfd/0x110 [btrfs] > [ 2360.448019] [] btrfs_ioctl+0x48e/0x1340 [btrfs] > [ 2360.448019] [] ? do_page_fault+0x2d0/0x580 > [ 2360.448019] [] ? _raw_spin_unlock_irq+0x30/0x50 > [ 2360.448019] [] ? finish_task_switch+0x83/0xf0 > [ 2360.448019] [] do_vfs_ioctl+0x98/0x560 > [ 2360.448019] [] ? retint_swapgs+0x13/0x1b > [ 2360.448019] [] sys_ioctl+0x4f/0x80 > [ 2360.448019] [] system_call_fastpath+0x16/0x1b > [ 2360.448019] Code: 8b 76 40 48 89 d7 48 89 55 a0 e8 2b 74 ff ff 83 f8 17 0f > 87 1e ff ff ff 0f 0b 80 fa b2 0f 84 b4 f8 ff ff 0f 0b 0f 0b 0f 0b 0f 0b <0f> > 0b 0f 0b 0f 0b 0f 0b 0f 1f 80 00 00 00 00 55 48 89 e5 41 57 > [ 2360.448019] RIP [] run_clustered_refs+0xa11/0xa20 > [btrfs] > [ 2360.448019] RSP > [ 2360.814508] ---[ end trace 555a16cac3620ccb ]--- > [ 2360.820398] note: btrfs[8212] exited with preempt_count 1 > [ 2360.827072] BUG: sleeping function called from invalid context at > kernel/rwsem.c:20 > [ 2360.836047] in_atomic(): 1, irqs_disabled(): 0, pid: 8212, name: btrfs > [ 2360.843859] INFO: lockdep is turned off. > [ 2360.849021] Pid: 8212, comm: btrfs Tainted: G D 3.5.0-default+ > #170 > [ 2360.849022] Call Trace: > [ 2360.849027] [] __might_sleep+0xfc/0x130 > [ 2360.849030] [] down_read+0x26/0xa0 > [ 2360.849034] [] acct_collect+0x4b/0x1b0 > [ 2360.849038] [] do_exit+0x718/0x9a0 > [ 2360.849041] [] ? kmsg_dump+0x26/0x140 > [ 2360.849043] [] oops_end+0xb0/0xf0 > [ 2360.849046] [] die+0x5b/0x90 > [ 2360.849048] [] do_trap+0xc4/0x170 > [ 2360.849052] [] do_invalid_op+0x95/0xb0 > [ 2360.849067] [] ? run_clustered_refs+0xa11/0xa20 [btrfs] > [ 2360.849071] [] ? trace_hardirqs_off_thunk+0x3a/0x3c > [ 2360.849073] [] ? restore_args+0x30/0x30 > [ 2360.849076] [] invalid_op+0x1b/0x20 > [ 2360.849087] [] ? run_clustered_refs+0xa11/0xa20 [btrfs] > [ 2360.849097] [] ? run_clustered_refs+0x69b/0xa20 [btrfs] > [ 2360.849108] [] btrfs_run_delayed_refs+0x1c9/0x550 > [btrfs] > [ 2360.849110] [] ? trace_hardirqs_on_caller+0x155/0x1d0 > [ 2360.849119] [] ? btrfs_free_path+0x
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
On 08/02/2012 10:28 PM, Jan Schmidt wrote: > On Thu, August 02, 2012 at 15:46 (+0200), Arne Jansen wrote: >> On 02.08.2012 13:57, Liu Bo wrote: >>> Anyway, for now, our error flag has only been stored in memory, so what >>> about just keep it until we find a graceful way? >> >> Yeah, we need this patch to restore consistency. We can define a fixed >> area on disk (e.g. behind the superblock) where we can write the flag >> to without risking the superblock. > > At least we all agree that we need this patch, fine. > > We don't yet agree that we need a place to store a "please consider fsck" > flag. > Can I please get one concrete example in which situation we > > a) do detect the user should really do a file system check AND > b) do not abort the transaction to clean the mess up? > > (An example on how we could fail transaction cleanup is also accepted). > Unfortunately I don't have such an example either. Since we always get COW on metadata, I believe that it's ok to just roll back on failure. > If such a situation doesn't exist, there's no need for this flag. The fact > that > ext has such a flag doesn't convince me, probably because I know nothing about > ext. I can imagine that they can detect file system errors without the ability > to return to a potentially older consistent state. > This error flag is also used to indicate filesystem's error state for transaction cleanup, so keeping it in memory is reasonable. thanks, liubo > Thanks, > -Jan > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] Btrfs: remove superblock writing after fatal error
[added Josef] Josef: Please queue this patch for rc2, we finally agreed. On Fri, August 03, 2012 at 05:20 (+0200), Liu Bo wrote: > On 08/02/2012 10:28 PM, Jan Schmidt wrote: >> If such a situation doesn't exist, there's no need for this flag. The fact >> that >> ext has such a flag doesn't convince me, probably because I know nothing >> about >> ext. I can imagine that they can detect file system errors without the >> ability >> to return to a potentially older consistent state. >> > > This error flag is also used to indicate filesystem's error state for > transaction cleanup, so keeping it in memory is reasonable. Granted :-) -Jan -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html