Re: Ceph on btrfs 3.4rc

2012-05-10 Thread Josef Bacik
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> On 24 April 2012 at 18:26, Sage Weil wrote:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
> 
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
> 
> You will need a single system with two disks and a bit of memory.
> 
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> 
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
> 
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> 
> - Now you should create and mount btrfs. Here is what I did:
> 
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> 
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
> 
> - Create the ceph filesystems:
> 
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
> 
> - Start ceph (e.g. "service ceph start")
> 
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
> 
> - "rbd create --size 100 testimg" will create an rbd image on the ceph cluster.
> 
> - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> with "./rbdtest testimg".
> 
> I can see the first btrfs_orphan_commit_root warning after an hour or
> so... I hope that I've described all necessary steps. If there is a
> problem just send me a note.
> 

Well I feel like an idiot, I finally get it to reproduce, go look at where I
want to put my printks and there's the problem staring me right in the face.
I've looked seriously at this problem 2 or 3 times and have missed this every
single freaking time.  Here is the patch I'm trying, please try it on yours to
make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
me so I won't be able to fully test it until tomorrow, but so far it hasn't
broken anything so it should be good.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index eefe573..4ad628d 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -164,6 +161,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8a89888..6dd20f3 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --gi

Re: Ceph on btrfs 3.4rc

2012-05-11 Thread Josef Bacik
On Thu, May 10, 2012 at 04:35:23PM -0400, Josef Bacik wrote:
> On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> > On 24 April 2012 at 18:26, Sage Weil wrote:
> > > On Tue, 24 Apr 2012, Josef Bacik wrote:
> > >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > >> > After running ceph on XFS for some time, I decided to try btrfs again.
> > >> > Performance with the current "for-linux-min" branch and big metadata
> > >> > is much better. The only problem (?) I'm still seeing is a warning
> > >> > that seems to occur from time to time:
> > >
> > > Actually, before you do that... we have a new tool,
> > > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > > local file system.  It's a subset of what a full OSD might do, but if
> > > we're lucky it will be sufficient to reproduce this issue.  Something like
> > >
> > >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> > >
> > > will hopefully do the trick.
> > >
> > > Christian, maybe you can see if that is able to trigger this warning?
> > > You'll need to pull it from the current master branch; it wasn't in the
> > > last release.
> > 
> > Trying to reproduce with test_filestore_workloadgen didn't work for
> > me. So here are some instructions on how to reproduce with a minimal
> > ceph setup.
> > 
> > You will need a single system with two disks and a bit of memory.
> > 
> > - Compile and install ceph (detailed instructions:
> > http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> > 
> > - For the test setup I've used two tmpfs files as journal devices. To
> > create these, do the following:
> > 
> > # mkdir -p /ceph/temp
> > # mount -t tmpfs tmpfs /ceph/temp
> > # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> > # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> > 
> > - Now you should create and mount btrfs. Here is what I did:
> > 
> > # mkfs.btrfs -l 64k -n 64k /dev/sda
> > # mkfs.btrfs -l 64k -n 64k /dev/sdb
> > # mkdir /ceph/osd.000
> > # mkdir /ceph/osd.001
> > # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> > # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> > 
> > - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> > will probably have to change the btrfs devices and the hostname
> > (os39).
> > 
> > - Create the ceph filesystems:
> > 
> > # mkdir /ceph/mon
> > # mkcephfs -a -c /etc/ceph/ceph.conf
> > 
> > - Start ceph (e.g. "service ceph start")
> > 
> > - Now you should be able to use ceph - "ceph -s" will tell you about
> > the state of the ceph cluster.
> > 
> > - "rbd create --size 100 testimg" will create an rbd image on the ceph cluster.
> > 
> > - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> > with "./rbdtest testimg".
> > 
> > I can see the first btrfs_orphan_commit_root warning after an hour or
> > so... I hope that I've described all necessary steps. If there is a
> > problem just send me a note.
> > 
> 
> Well I feel like an idiot, I finally get it to reproduce, go look at where I
> want to put my printks and there's the problem staring me right in the face.
> I've looked seriously at this problem 2 or 3 times and have missed this every
> single freaking time.  Here is the patch I'm trying, please try it on yours to
> make sure it fixes the problem.  It takes like 2 hours for it to reproduce for
> me so I won't be able to fully test it until tomorrow, but so far it hasn't
> broken anything so it should be good.  Thanks,
> 

That previous patch was against btrfs-next; this patch is against 3.4-rc6 if you
are on mainline.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..54af1fa 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -156,6 +153,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned

Re: Ceph on btrfs 3.4rc

2012-05-11 Thread Josef Bacik
On Fri, May 11, 2012 at 08:33:34PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> On 11.05.2012 at 15:31, Josef Bacik wrote:
> >That previous patch was against btrfs-next, this patch is against 3.4-rc6 if you
> >are on mainline.  Thanks,
> 
> I tried your patch against mainline, after a few minutes I hit this bug.
> 

Heh duh, sorry, try this one instead.  Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..54af1fa 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -156,6 +153,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..5ba68d0 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+   if (!BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
insert = 1;
 #endif
insert = 1;
+   atomic_inc(&root->orphan_inodes);
}
 
if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2195,9 +2196,13 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
int release_rsv = 0;
int ret = 0;
 
+   /*
+* evict_inode gets called without holding the i_mutex so we need to
+* take the orphan lock to make sure we are safe in messing with these.
+*/
spin_lock(&root->orphan_lock);
-   if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_del_init(&BTRFS_I(inode)->i_orphan);
+   if (BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 0;
delete_item = 1;
}
 
@@ -2215,6 +2220,9 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
if (release_rsv)
btrfs_orphan_release_metadata(inode);
 
+   if (trans && delete_item)
+   atomic_dec(&root->orphan

Re: Ceph on btrfs 3.4rc

2012-05-14 Thread Josef Bacik
On Mon, May 14, 2012 at 04:19:37PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> On 11.05.2012 at 21:16, Josef Bacik wrote:
> >Heh duh, sorry, try this one instead.  Thanks,
> 
> With this patch I got this Bug:

Yeah Christian reported the same thing on Friday.  I'm going to work on a patch
and actually run it here to make sure it doesn't blow up and then send it to the
list when I think I've got something that works.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph on btrfs 3.4rc

2012-05-16 Thread Josef Bacik
On Mon, May 14, 2012 at 10:20:48AM -0400, Josef Bacik wrote:
> On Mon, May 14, 2012 at 04:19:37PM +0200, Martin Mailand wrote:
> > Hi Josef,
> > 
> > On 11.05.2012 at 21:16, Josef Bacik wrote:
> > >Heh duh, sorry, try this one instead.  Thanks,
> > 
> > With this patch I got this Bug:
> 
> Yeah Christian reported the same thing on Friday.  I'm going to work on a patch
> and actually run it here to make sure it doesn't blow up and then send it to the
> list when I think I've got something that works.  Thanks,
> 

Hrm ok so I finally got some time to try and debug it and let the test run a
good long while (5 hours almost) and I couldn't hit either the original bug or
the one you guys were hitting.  So either my extra little bit of locking did the
trick or I get to keep my "Worst reproducer ever" award.  Can you guys give this
one a whirl, and if it panics send the entire dmesg, since it should spit out a
WARN_ON() to let me know whether what I thought was the problem actually was it.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..c0cff20 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2166,8 +2166,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+   if (!BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2180,6 +2180,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
insert = 1;
 #endif
insert = 1;
+   atomic_inc(&root->orphan_inodes);
}
 
if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2198,6 +2199,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
if (insert >= 1) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret &

Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Josef Bacik
On Thu, May 17, 2012 at 12:29:32PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> somehow I still get the kernel Bug messages, I used your patch from
> the 16th against rc7.
> 

Was there anything above those messages?  There should have been a WARN_ON() or
something.  If not that's fine; I just need to know one way or the other so I can
figure out what to do next.  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-05-17 Thread Josef Bacik
On Thu, May 17, 2012 at 05:12:55PM +0200, Martin Mailand wrote:
> Hi Josef,
> no, there was nothing above. Here is another dmesg output.
> 

Hrm ok give this a try and hopefully this is it, still couldn't reproduce.
Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..7cc1c96 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2166,8 +2166,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+   if (!BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2180,6 +2180,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
insert = 1;
 #endif
insert = 1;
+   atomic_inc(&root->orphan_inodes);
}
 
if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2198,6 +2199,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
if (insert >= 1) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret && ret != -EEXIST) {
+   spin_lock(&root->orphan_lock);
+   BTRFS_I(inode)->has_orphan_item = 0;
+   spin_unlock(&root->orphan_lock);
btrfs_abort_transaction(trans, root, ret);
return ret;
}
@@ -2227,13 +2231,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
int release_rsv = 0;
int ret = 0;
 
+   /*
+* evict_inode gets called without holding the i_mutex so we need to
+* take the orphan lock to make sure we are safe in messing with these.
+*/
spin_lock(&root->orphan_lock);
-   if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_del_init(&BTRFS_I(inode)->i_orphan);
-   delete_item = 1;
+   if (BTRFS_I(inode)->has_orphan_item) {
+   if (trans) {
+   BTRFS_I(inode)->has_orphan_item = 0;
+   delete_item = 1;
+   } else {
+ 

Re: Ceph on btrfs 3.4rc

2012-05-18 Thread Josef Bacik
On Thu, May 17, 2012 at 11:18:25PM +0200, Martin Mailand wrote:
> Hi Josef,
> 
> I hit exactly the same bug as Christian with your last patch.
> 

Ok hopefully this will print something out that makes sense.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..492c74f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -156,6 +153,8 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
+   unsigned doing_truncate:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..7de7f6f 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+   if (!BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
insert = 1;
 #endif
insert = 1;
+   atomic_inc(&root->orphan_inodes);
}
 
if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2166,6 +2167,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
if (insert >= 1) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret && ret != -EEXIST) {
+   spin_lock(&root->orphan_lock);
+   BTRFS_I(inode)->has_orphan_item = 0;
+   spin_unlock(&root->orphan_lock);
btrfs_abort_transaction(trans, root, ret);
return ret;
}
@@ -2195,13 +2199,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
int release_rsv = 0;
int ret = 0;
 
+   /*
+* evict_inode gets called without holding the i_mutex so we need to
+* take the orphan lock to make sure we are safe in messing with these.
+*/
spin_lock(&root->orphan_lock);
-   if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_del_init(&BTRFS_I(inode)->i_orphan);
-   delete_item = 1;
+   if (BTRFS_I(inode)->has_orphan_item) {
+   if (trans) {
+   BTRFS_I(inode)->has_orphan_item = 0;
+   delete_item = 1;
+  

Re: Ceph on btrfs 3.4rc

2012-05-18 Thread Josef Bacik
On Fri, May 18, 2012 at 07:24:25PM +0200, Martin Mailand wrote:
> Hi Josef,
> there was one line before the bug.
> 
> [  995.725105] couldn't find orphan item for 524
> 
> 

*sigh* ok try this, hopefully it will point me in the right direction.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..492c74f 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -156,6 +153,8 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
+   unsigned doing_truncate:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..ff3bf4b 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 61b16c6..572da13 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2072,12 +2072,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan_cleanup_state != ORPHAN_CLEANUP_DONE)
return;
 
spin_lock(&root->orphan_lock);
-   if (!list_empty(&root->orphan_list)) {
+   if (atomic_read(&root->orphan_inodes)) {
spin_unlock(&root->orphan_lock);
return;
}
@@ -2134,8 +2134,8 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
block_rsv = NULL;
}
 
-   if (list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_add(&BTRFS_I(inode)->i_orphan, &root->orphan_list);
+   if (!BTRFS_I(inode)->has_orphan_item) {
+   BTRFS_I(inode)->has_orphan_item = 1;
 #if 0
/*
 * For proper ENOSPC handling, we should do orphan
@@ -2148,6 +2148,7 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
insert = 1;
 #endif
insert = 1;
+   atomic_inc(&root->orphan_inodes);
}
 
if (!BTRFS_I(inode)->orphan_meta_reserved) {
@@ -2166,6 +2167,9 @@ int btrfs_orphan_add(struct btrfs_trans_handle *trans, struct inode *inode)
if (insert >= 1) {
ret = btrfs_insert_orphan_item(trans, root, btrfs_ino(inode));
if (ret && ret != -EEXIST) {
+   spin_lock(&root->orphan_lock);
+   BTRFS_I(inode)->has_orphan_item = 0;
+   spin_unlock(&root->orphan_lock);
btrfs_abort_transaction(trans, root, ret);
return ret;
}
@@ -2195,13 +2199,21 @@ int btrfs_orphan_del(struct btrfs_trans_handle *trans, struct inode *inode)
int release_rsv = 0;
int ret = 0;
 
+   /*
+* evict_inode gets called without holding the i_mutex so we need to
+* take the orphan lock to make sure we are safe in messing with these.
+*/
spin_lock(&root->orphan_lock);
-   if (!list_empty(&BTRFS_I(inode)->i_orphan)) {
-   list_del_init(&BTRFS_I(inode)->i_orphan);
-   delete_item = 1;
+   if (BTRFS_I(inode)->has_orphan_item) {
+   if (trans) {
+   BTRFS_I(inode)->has_orphan_item = 0;
+

Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Josef Bacik
On Mon, May 21, 2012 at 11:59:54AM +0800, Miao Xie wrote:
> Hi Josef,
> 
> On Fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> > diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> > index 9b9b15f..492c74f 100644
> > --- a/fs/btrfs/btrfs_inode.h
> > +++ b/fs/btrfs/btrfs_inode.h
> > @@ -57,9 +57,6 @@ struct btrfs_inode {
> > /* used to order data wrt metadata */
> > struct btrfs_ordered_inode_tree ordered_tree;
> >  
> > -   /* for keeping track of orphaned inodes */
> > -   struct list_head i_orphan;
> > -
> > /* list of all the delalloc inodes in the FS.  There are times we need
> >  * to write all the delalloc pages to disk, and this list is used
> >  * to walk them all.
> > @@ -156,6 +153,8 @@ struct btrfs_inode {
> > unsigned dummy_inode:1;
> > unsigned in_defrag:1;
> > unsigned delalloc_meta_reserved:1;
> > +   unsigned has_orphan_item:1;
> > +   unsigned doing_truncate:1;
> 
> I think the problem is that we should not use different locks to protect
> bit fields that are stored in the same machine word. Otherwise some bit
> fields may be clobbered by others when someone changes those fields.
> Could you try to declare ->delalloc_meta_reserved and ->has_orphan_item
> as integers?
> 

Oh freaking duh, thank you Miao, I'm an idiot.

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph on btrfs 3.4rc

2012-05-22 Thread Josef Bacik
On Tue, May 22, 2012 at 12:29:59PM +0200, Christian Brunner wrote:
> 2012/5/21 Miao Xie :
> > Hi Josef,
> >
> > On fri, 18 May 2012 15:01:05 -0400, Josef Bacik wrote:
> >> diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
> >> index 9b9b15f..492c74f 100644
> >> --- a/fs/btrfs/btrfs_inode.h
> >> +++ b/fs/btrfs/btrfs_inode.h
> >> @@ -57,9 +57,6 @@ struct btrfs_inode {
> >>       /* used to order data wrt metadata */
> >>       struct btrfs_ordered_inode_tree ordered_tree;
> >>
> >> -     /* for keeping track of orphaned inodes */
> >> -     struct list_head i_orphan;
> >> -
> >>       /* list of all the delalloc inodes in the FS.  There are times we 
> >> need
> >>        * to write all the delalloc pages to disk, and this list is used
> >>        * to walk them all.
> >> @@ -156,6 +153,8 @@ struct btrfs_inode {
> >>       unsigned dummy_inode:1;
> >>       unsigned in_defrag:1;
> >>       unsigned delalloc_meta_reserved:1;
> >> +     unsigned has_orphan_item:1;
> >> +     unsigned doing_truncate:1;
> >
> > I think the problem is that we should not use different locks to protect
> > bit fields that are stored in the same machine word. Otherwise some bit
> > fields may be clobbered by others when someone changes those fields.
> > Could you try to declare ->delalloc_meta_reserved and ->has_orphan_item
> > as integers?
> 
> I have tried changing it to:
> 
> struct btrfs_inode {
> unsigned orphan_meta_reserved:1;
> unsigned dummy_inode:1;
> unsigned in_defrag:1;
> -   unsigned delalloc_meta_reserved:1;
> +   int delalloc_meta_reserved;
> +   int has_orphan_item;
> +   int doing_truncate;
> 
> The strange thing is, that I'm no longer hitting the BUG_ON, but the
> old WARNING (no additional messages):
> 

Yeah you would also need to change orphan_meta_reserved.  I fixed this by just
taking the BTRFS_I(inode)->lock when messing with these since we don't want to
take up all that space in the inode just for a marker.  I ran this patch for 3
hours with no issues, let me know if it works for you.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 3771b85..559e716 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -57,9 +57,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -153,6 +150,7 @@ struct btrfs_inode {
unsigned dummy_inode:1;
unsigned in_defrag:1;
unsigned delalloc_meta_reserved:1;
+   unsigned has_orphan_item:1;
 
/*
 * always compress this one file
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index ba8743b..72cdf98 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 19f5b45..25dba7a 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&root->log_commit[1], 0);
atomic_set(&root->log_writers, 0);
+   atomic_set(&root->orphan_inodes, 0);
root->log_batch = 0;
root->log_transid = 0;
root->last_log_commit = 0;
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 54ae3df..54f1b30 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2104,12 +2104,12 @@ void btrfs_orphan_commit_root(struct btrfs_trans_handle *trans,
struct btrfs_block_rsv *block_rsv;
int ret;
 
-   if (!list_empty(&root->orphan_list) ||
+   if (atomic_read(&root->orphan_inodes) ||
root->orphan

Re: Ceph on btrfs 3.4rc

2012-05-23 Thread Josef Bacik
On Wed, May 23, 2012 at 02:34:43PM +0200, Christian Brunner wrote:
> 2012/5/22 Josef Bacik :
> >>
> >
> > Yeah you would also need to change orphan_meta_reserved.  I fixed this by 
> > just
> > taking the BTRFS_I(inode)->lock when messing with these since we don't want 
> > to
> > take up all that space in the inode just for a marker.  I ran this patch 
> > for 3
> > hours with no issues, let me know if it works for you.  Thanks,
> 
> Compared to the last runs, I had to run it much longer, but somehow I
> managed to hit a BUG_ON again:
> 

Yeah it's because we access other parts of that bitfield with no lock at all,
which is likely what's screwing us.  I'm going to have to redo that part and
then do the orphan fix; I'll have a patch shortly.  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-05-23 Thread Josef Bacik
On Wed, May 23, 2012 at 02:34:43PM +0200, Christian Brunner wrote:
> 2012/5/22 Josef Bacik :
> >>
> >
> > Yeah you would also need to change orphan_meta_reserved.  I fixed this by 
> > just
> > taking the BTRFS_I(inode)->lock when messing with these since we don't want 
> > to
> > take up all that space in the inode just for a marker.  I ran this patch 
> > for 3
> > hours with no issues, let me know if it works for you.  Thanks,
> 
> Compared to the last runs, I had to run it much longer, but somehow I
> managed to hit a BUG_ON again:
> 

Ok give this a shot, it should do it.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9b9b15f..41ddec8 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -24,6 +24,22 @@
 #include "ordered-data.h"
 #include "delayed-inode.h"
 
+/*
+ * ordered_data_close is set by truncate when a file that used
+ * to have good data has been truncated to zero.  When it is set
+ * the btrfs file release call will add this inode to the
+ * ordered operations list so that we make sure to flush out any
+ * new data the application may have written before commit.
+ */
+#define BTRFS_INODE_ORDERED_DATA_CLOSE 0
+#define BTRFS_INODE_ORPHAN_META_RESERVED   1
+#define BTRFS_INODE_DUMMY  2
+#define BTRFS_INODE_IN_DEFRAG  3
+#define BTRFS_INODE_DELALLOC_META_RESERVED 4
+#define BTRFS_INODE_HAS_ORPHAN_ITEM5
+#define BTRFS_INODE_FORCE_ZLIB 6
+#define BTRFS_INODE_FORCE_LZO  7
+
 /* in memory btrfs inode */
 struct btrfs_inode {
/* which subvolume this inode belongs to */
@@ -57,9 +73,6 @@ struct btrfs_inode {
/* used to order data wrt metadata */
struct btrfs_ordered_inode_tree ordered_tree;
 
-   /* for keeping track of orphaned inodes */
-   struct list_head i_orphan;
-
/* list of all the delalloc inodes in the FS.  There are times we need
 * to write all the delalloc pages to disk, and this list is used
 * to walk them all.
@@ -143,24 +156,7 @@ struct btrfs_inode {
 */
unsigned outstanding_extents;
unsigned reserved_extents;
-
-   /*
-* ordered_data_close is set by truncate when a file that used
-* to have good data has been truncated to zero.  When it is set
-* the btrfs file release call will add this inode to the
-* ordered operations list so that we make sure to flush out any
-* new data the application may have written before commit.
-*/
-   unsigned ordered_data_close:1;
-   unsigned orphan_meta_reserved:1;
-   unsigned dummy_inode:1;
-   unsigned in_defrag:1;
-   unsigned delalloc_meta_reserved:1;
-
-   /*
-* always compress this one file
-*/
-   unsigned force_compress:4;
+   unsigned long runtime_flags;
 
struct btrfs_delayed_node *delayed_node;
 
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 8fd7233..aad2600 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1375,7 +1375,7 @@ struct btrfs_root {
struct list_head root_list;
 
spinlock_t orphan_lock;
-   struct list_head orphan_list;
+   atomic_t orphan_inodes;
struct btrfs_block_rsv *orphan_block_rsv;
int orphan_item_inserted;
int orphan_cleanup_state;
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 03e3748..5190861 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -669,8 +669,8 @@ static int btrfs_delayed_inode_reserve_metadata(
return ret;
} else if (src_rsv == &root->fs_info->delalloc_block_rsv) {
spin_lock(&BTRFS_I(inode)->lock);
-   if (BTRFS_I(inode)->delalloc_meta_reserved) {
-   BTRFS_I(inode)->delalloc_meta_reserved = 0;
+   if (test_and_clear_bit(BTRFS_INODE_DELALLOC_META_RESERVED,
+  &BTRFS_I(inode)->runtime_flags)) {
spin_unlock(&BTRFS_I(inode)->lock);
release = true;
goto migrate;
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index a7ffc88..0ddeb0d 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1153,7 +1153,6 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
root->orphan_block_rsv = NULL;
 
INIT_LIST_HEAD(&root->dirty_list);
-   INIT_LIST_HEAD(&root->orphan_list);
INIT_LIST_HEAD(&root->root_list);
spin_lock_init(&root->orphan_lock);
spin_lock_init(&root->inode_lock);
@@ -1166,6 +1165,7 @@ static void __setup_root(u32 nodesize, u32 leafsize, u32 sectorsize,
atomic_set(&root->log_commit[0], 0);
atomic_set(&

Re: 2.6.39-rc1: btrfs "WARNING: at fs/btrfs/inode.c:2177"

2011-04-07 Thread Josef Bacik

On 04/07/2011 05:41 AM, Jeff Wu wrote:

Hi,
I ran an iozone stress test on a ceph client for x86_64 against a ceph 0.26 +
linux-2.6.39-rc1 server, and the kernel printed
"WARNING: at fs/btrfs/inode.c:2177"

Crap I was hoping I had fixed this, could you run with this debug patch 
and get me the output so I can figure out what's going on?  Thanks,


Josef


diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f619c3c..79ec933 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3696,6 +3696,7 @@ int btrfs_block_rsv_add(struct btrfs_trans_handle *trans,
 {
int ret;
 
+   WARN_ON(block_rsv == root->orphan_block_rsv);
if (num_bytes == 0)
return 0;
 


Re: 2.6.39-rc1: btrfs "WARNING: at fs/btrfs/inode.c:2177"

2011-04-08 Thread Josef Bacik

On 04/08/2011 01:53 AM, Jeff Wu wrote:


Hi,
I applied the patch to 2.6.39-rc1 and took the following steps to compile
it: make && make modules_install && make install && mkinitramfs
but it seems that it doesn't reach "WARN_ON(block_rsv ==
root->orphan_block_rsv);"

I attached the code and logs below:



Bummer ok so here's a much bigger debug patch, remove the previous one I 
sent you and apply this one instead and run with it.  As soon as you get 
a warning stop iozone because this debug patch will create _a lot_ of 
debug output, and I don't want to have to sift through all of it.  Just 
send me your logs after running this patch so I can try and piece 
together what's going on.  Thanks,


Josef
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 0d00a07..c0d8c1d 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -766,6 +766,7 @@ struct btrfs_block_rsv {
unsigned int durable:1;
unsigned int refill_used:1;
unsigned int full:1;
+   unsigned int orphan:1;
 };
 
 /*
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f619c3c..5ebcda8 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -3429,6 +3429,9 @@ static int reserve_metadata_bytes(struct btrfs_trans_handle *trans,
bool reserved = false;
bool committed = false;
 
+   if (block_rsv->orphan)
+   printk(KERN_ERR "reserving metadata bytes for orphan rsv %llu\n",
+  orig_bytes);
 again:
ret = -ENOSPC;
if (reserved)
@@ -3556,6 +3559,9 @@ static struct btrfs_block_rsv *get_block_rsv(struct btrfs_trans_handle *trans,
if (!block_rsv)
block_rsv = &root->fs_info->empty_block_rsv;
 
+   if (block_rsv->orphan)
+   printk(KERN_ERR "got orphan block rsv\n");
+
return block_rsv;
 }
 
@@ -3563,6 +3569,9 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
   u64 num_bytes)
 {
int ret = -ENOSPC;
+   if (block_rsv->orphan)
+   printk(KERN_ERR "using %llu bytes from orphan\n",
+  num_bytes);
spin_lock(&block_rsv->lock);
if (block_rsv->reserved >= num_bytes) {
block_rsv->reserved -= num_bytes;
@@ -3577,6 +3586,9 @@ static int block_rsv_use_bytes(struct btrfs_block_rsv *block_rsv,
 static void block_rsv_add_bytes(struct btrfs_block_rsv *block_rsv,
u64 num_bytes, int update_size)
 {
+   if (block_rsv->orphan)
+   printk(KERN_ERR "adding %llu bytes, update_size=%d\n",
+  num_bytes, update_size);
spin_lock(&block_rsv->lock);
block_rsv->reserved += num_bytes;
if (update_size)
@@ -3592,6 +3604,10 @@ void block_rsv_release_bytes(struct btrfs_block_rsv *block_rsv,
struct btrfs_space_info *space_info = block_rsv->space_info;
 
spin_lock(&block_rsv->lock);
+   if (block_rsv->orphan)
+   printk(KERN_ERR "releasing %llu bytes from orphan, size=%llu, "
+  "reserved=%llu\n", num_bytes, block_rsv->size,
+  block_rsv->reserved);
if (num_bytes == (u64)-1)
num_bytes = block_rsv->size;
block_rsv->size -= num_bytes;
@@ -3668,6 +3684,9 @@ struct btrfs_block_rsv *btrfs_alloc_block_rsv(struct btrfs_root *root)
 void btrfs_free_block_rsv(struct btrfs_root *root,
  struct btrfs_block_rsv *rsv)
 {
+   if (rsv->orphan)
+   printk(KERN_ERR "freeing orphan rsv\n");
+
if (rsv && atomic_dec_and_test(&rsv->usage)) {
btrfs_block_rsv_release(root, rsv, (u64)-1);
if (!rsv->durable)
@@ -3696,6 +3715,10 @@ int btrfs_block_rsv_add(struct btrfs_trans_handle *trans,
 {
int ret;
 
+   if (block_rsv->orphan)
+   printk(KERN_ERR "adding %llu bytes to orphan\n",
+  num_bytes);
+
if (num_bytes == 0)
return 0;
 
@@ -3720,6 +3743,10 @@ int btrfs_block_rsv_check(struct btrfs_trans_handle *trans,
if (!block_rsv)
return 0;
 
+   if (block_rsv->orphan)
+   printk(KERN_ERR "checking orphan reserve for %llu bytes, "
+  "%d min factor\n", min_reserved, min_factor);
+
spin_lock(&block_rsv->lock);
if (min_factor > 0)
num_bytes = div_factor(block_rsv->size, min_factor);
@@ -3964,6 +3991,7 @@ int btrfs_orphan_reserve_metadata(struct btrfs_trans_handle *trans,
 * transaction and use space it freed.
 */
u64 num_bytes = calc_trans_metadata_size(root, 4);
+   printk(KERN_ERR "reserving %llu bytes for orphan\n", num_bytes);
return block_rsv_migrate_bytes(src_rsv, dst_rsv, num_bytes);
 }
 
@@ -3971,6 +3999,7 @@ void btrfs_orphan_release_metadata(struct inode *inode)
 {
struct btrfs_root *root = BTRFS_I(inode)->root;
u64 num_bytes = calc_trans_metada

Re: btrfs warnings from 2.6.39-rc5

2011-05-03 Thread Josef Bacik

On 04/27/2011 02:43 PM, Jim Schutt wrote:

Hi,

I'm not sure if they matter, but I got these warnings on
one of the machines I'm using as a Ceph OSD server:

[ 1806.549469] [ cut here ]
[ 1806.554593] WARNING: at fs/btrfs/extent-tree.c:5790
use_block_rsv+0xa7/0x101 [btrfs]()
[ 1806.562903] Hardware name: PowerEdge 1950
[ 1806.567126] Modules linked in: loop btrfs zlib_deflate lzo_compress
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables bridge stp i2c_dev i2c_core ext3 jbd ib_iser libi]
[ 1806.689084] Pid: 12025, comm: cosd Not tainted
2.6.39-rc5-6-g7b0bd4b #2
[ 1806.697425] Call Trace:
[ 1806.700332] [] ? use_block_rsv+0xa7/0x101 [btrfs]
[ 1806.707032] [] ? warn_slowpath_common+0x85/0x9e
[ 1806.713502] [] ? warn_slowpath_null+0x1a/0x1c
[ 1806.720755] [] ? use_block_rsv+0xa7/0x101 [btrfs]
[ 1806.731488] [] ? btrfs_alloc_free_block+0x30/0x198
[btrfs]
[ 1806.743858] [] ?
map_private_extent_buffer+0xb2/0xd9 [btrfs]
[ 1806.752600] [] ? __kmap_atomic+0x12/0x47 [btrfs]
[ 1806.760057] [] ? read_extent_buffer+0xc2/0xd4 [btrfs]
[ 1806.770897] [] ? __btrfs_cow_block+0x10e/0x2f5 [btrfs]
[ 1806.779552] [] ? btrfs_header_generation+0x1f/0x25
[btrfs]
[ 1806.788994] [] ? btrfs_cow_block+0xfc/0x121 [btrfs]
[ 1806.798434] [] ? btrfs_search_slot+0x144/0x3ae [btrfs]
[ 1806.806444] [] ? btrfs_lookup_inode+0x31/0x86 [btrfs]
[ 1806.813611] [] ? btrfs_update_inode+0x52/0xc1 [btrfs]
[ 1806.820904] [] ? btrfs_truncate+0x239/0x297 [btrfs]
[ 1806.829290] [] ? btrfs_setsize+0x8c/0x9b [btrfs]
[ 1806.835962] [] ? btrfs_setattr+0x61/0x9d [btrfs]
[ 1806.842892] [] ? notify_change+0x174/0x1bc
[ 1806.850878] [] ? do_truncate+0x6e/0x8a
[ 1806.857374] [] ? generic_permission+0x1c/0x8e
[ 1806.868498] [] ? do_sys_truncate+0xf8/0x10a
[ 1806.877746] [] ? sys_truncate+0xe/0x10
[ 1806.902751] [] ? system_call_fastpath+0x16/0x1b
[ 1806.909736] ---[ end trace cd0ae33f1a4433d9 ]---
[ 1812.142333] [ cut here ]
[ 1812.146996] WARNING: at fs/btrfs/inode.c:2180
btrfs_orphan_commit_root+0x8c/0xab [btrfs]()
[ 1812.155275] Hardware name: PowerEdge 1950
[ 1812.159280] Modules linked in: loop btrfs zlib_deflate lzo_compress
ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4
xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables
x_tables bridge stp i2c_dev i2c_core ext3 jbd ib_iser libi]
[ 1812.240872] Pid: 7923, comm: kworker/3:2 Tainted: G W
2.6.39-rc5-6-g7b0bd4b #2
[ 1812.249216] Call Trace:
[ 1812.251712] [] ? btrfs_orphan_commit_root+0x8c/0xab
[btrfs]
[ 1812.258976] [] ? warn_slowpath_common+0x85/0x9e
[ 1812.265199] [] ? warn_slowpath_null+0x1a/0x1c
[ 1812.271420] [] ? btrfs_orphan_commit_root+0x8c/0xab
[btrfs]
[ 1812.278704] [] ? commit_fs_roots+0x95/0xfd [btrfs]
[ 1812.285220] [] ? btrfs_run_delayed_refs+0x112/0x15e
[btrfs]
[ 1812.292459] [] ? need_resched+0x23/0x2d
[ 1812.297961] [] ? should_resched+0xe/0x2f
[ 1812.303604] [] ?
btrfs_commit_transaction+0x349/0x5a3 [btrfs]
[ 1812.310999] [] ? list_del_init+0x21/0x21
[ 1812.316615] [] ? do_async_commit+0x1f/0x2c [btrfs]
[ 1812.323055] [] ? process_one_work+0x124/0x1e0
[ 1812.329083] [] ?
btrfs_commit_transaction+0x5a3/0x5a3 [btrfs]
[ 1812.336526] [] ? workqueue_congested+0x1e/0x1e
[ 1812.342636] [] ? worker_thread+0x8f/0x124
[ 1812.348353] [] ? kthread+0x72/0x7a
[ 1812.353439] [] ? kernel_thread_helper+0x4/0x10
[ 1812.359555] [] ? retint_restore_args+0xe/0xe
[ 1812.365484] [] ? kthread_bind+0x64/0x64
[ 1812.370986] [] ? gs_change+0xb/0xb
[ 1812.376047] ---[ end trace cd0ae33f1a4433da ]---

Please let me know if I can do anything to help sort these out.



I just posted a patch for this problem, it's titled

Btrfs: fix how we do space reservation for truncate

Please apply it and test it and see if it makes this problem go away. 
Thanks,


Josef


Re: btrfs warnings from 2.6.39-rc5

2011-05-16 Thread Josef Bacik

On 05/16/2011 10:28 AM, Jim Schutt wrote:

Josef Bacik wrote:

On 04/27/2011 02:43 PM, Jim Schutt wrote:

Hi,

I'm not sure if they matter, but I got these warnings on
one of the machines I'm using as a Ceph OSD server:

[ 1806.549469] [ cut here ]
[ 1806.554593] WARNING: at fs/btrfs/extent-tree.c:5790
use_block_rsv+0xa7/0x101 [btrfs]()



Please let me know if I can do anything to help sort these out.



I just posted a patch for this problem, it's titled

Btrfs: fix how we do space reservation for truncate

Please apply it and test it and see if it makes this problem go away.


This patch has been working very well for me, thanks.
I've had no sign of truncate trouble since I started using it.

-- Jim



Fantastic, thank you for testing.

Josef


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Josef Bacik
On 06/10/2011 02:14 PM, Sage Weil wrote:
> On Fri, 10 Jun 2011, Sage Weil wrote:
>> On Fri, 10 Jun 2011, Chris Mason wrote:
>>> Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
>>>
>>> [ two different btrfs crashes ]
>>>
>>> I think your two crashes in btrfs were from the uninit variables and
>>> those should be fixed in rc2.
>>>
 When I did my bisection, my criteria for success/failure was
 "did mkcephfs succeed?".  When I apply this criteria to a recent
 linus kernel (e.g. 06e86849cf4019), which includes the fix you
 mentioned (aa0467d8d2a00e), I get still a different failure mode,
 which doesn't actually reference btrfs:

 [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
 000a
 [  276.365127] IP: [] journal_start+0x3e/0x9c [jbd]
>>>
>>> Looking at the resulting code in the oops, we're here in journal_start:
>>>
>>> if (handle) {
>>> J_ASSERT(handle->h_transaction->t_journal == journal);
>>>
>>> handle comes from current->journal_info, and we're doing a deref on
>>> handle->h_transaction, which is probably 0xa.
>>>
>>> So, we're leaving crud in current->journal_info and ext3 is finding it.
>>>
>>> Perhaps its from ceph starting a transaction but leaving it running?
>>> The bug came with Josef's transaction performance fixes, but it is
>>> probably a mixture of his code with the ioctls ceph is using.
>>
>> Ah, yeah, that's the problem.  We saw a similar problem a while back with 
>> the start/stop transaction ioctls.  In this case, create_snapshot is doing
>>
>>  trans = btrfs_start_transaction(root->fs_info->extent_root, 5);
>>  if (IS_ERR(trans)) {
>>  ret = PTR_ERR(trans);
>>  goto fail;
>>  }
>>
>> which sets current->journal_info.  Then
>>
>>  ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
>>  BUG_ON(ret);
>>
>>  list_add(&pending_snapshot->list,
>>   &trans->transaction->pending_snapshots);
>>  if (async_transid) {
>>  *async_transid = trans->transid;
>>  ret = btrfs_commit_transaction_async(trans,
>>   root->fs_info->extent_root, 1);
>>  } else {
>>  ret = btrfs_commit_transaction(trans,
>> root->fs_info->extent_root);
>>  }
>>
>> but the async snap creation ioctl takes the async path, which runs 
>> btrfs_commit_transaction in a worker thread.
>>
>> I'm not sure what the right thing to do here is... can whatever is in 
>> journal_info be attached to trans instead in 
>> btrfs_commit_transaction_async()?
> 
> It looks like it's not used for anything in btrfs, actually; it's just set 
> and cleared.  What's the point of that?
> 

It is used now, check the beginning of start_transaction().  Thanks,

Josef


Re: 3.0-rcX BUG at fs/btrfs/ioctl.c:432 - bisected

2011-06-10 Thread Josef Bacik
On 06/10/2011 02:35 PM, Sage Weil wrote:
> On Fri, 10 Jun 2011, Josef Bacik wrote:
>> On 06/10/2011 02:14 PM, Sage Weil wrote:
>>> On Fri, 10 Jun 2011, Sage Weil wrote:
>>>> On Fri, 10 Jun 2011, Chris Mason wrote:
>>>>> Excerpts from Jim Schutt's message of 2011-06-10 13:06:22 -0400:
>>>>>
>>>>> [ two different btrfs crashes ]
>>>>>
>>>>> I think your two crashes in btrfs were from the uninit variables and
>>>>> those should be fixed in rc2.
>>>>>
>>>>>> When I did my bisection, my criteria for success/failure was
>>>>>> "did mkcephfs succeed?".  When I apply this criteria to a recent
>>>>>> linus kernel (e.g. 06e86849cf4019), which includes the fix you
>>>>>> mentioned (aa0467d8d2a00e), I get still a different failure mode,
>>>>>> which doesn't actually reference btrfs:
>>>>>>
>>>>>> [  276.364178] BUG: unable to handle kernel NULL pointer dereference at 
>>>>>> 000a
>>>>>> [  276.365127] IP: [] journal_start+0x3e/0x9c [jbd]
>>>>>
>>>>> Looking at the resulting code in the oops, we're here in journal_start:
>>>>>
>>>>> if (handle) {
>>>>> J_ASSERT(handle->h_transaction->t_journal == journal);
>>>>>
>>>>> handle comes from current->journal_info, and we're doing a deref on
>>>>> handle->h_transaction, which is probably 0xa.
>>>>>
>>>>> So, we're leaving crud in current->journal_info and ext3 is finding it.
>>>>>
>>>>> Perhaps its from ceph starting a transaction but leaving it running?
>>>>> The bug came with Josef's transaction performance fixes, but it is
>>>>> probably a mixture of his code with the ioctls ceph is using.
>>>>
>>>> Ah, yeah, that's the problem.  We saw a similar problem a while back with 
>>>> the start/stop transaction ioctls.  In this case, create_snapshot is doing
>>>>
>>>>trans = btrfs_start_transaction(root->fs_info->extent_root, 5);
>>>>if (IS_ERR(trans)) {
>>>>ret = PTR_ERR(trans);
>>>>goto fail;
>>>>}
>>>>
>>>> which sets current->journal_info.  Then
>>>>
>>>>ret = btrfs_snap_reserve_metadata(trans, pending_snapshot);
>>>>BUG_ON(ret);
>>>>
>>>>list_add(&pending_snapshot->list,
>>>> &trans->transaction->pending_snapshots);
>>>>if (async_transid) {
>>>>*async_transid = trans->transid;
>>>>ret = btrfs_commit_transaction_async(trans,
>>>> root->fs_info->extent_root, 1);
>>>>} else {
>>>>ret = btrfs_commit_transaction(trans,
>>>>   root->fs_info->extent_root);
>>>>}
>>>>
>>>> but the async snap creation ioctl takes the async path, which runs 
>>>> btrfs_commit_transaction in a worker thread.
>>>>
>>>> I'm not sure what the right thing to do here is... can whatever is in 
>>>> journal_info be attached to trans instead in 
>>>> btrfs_commit_transaction_async()?
>>>
>>> It looks like it's not used for anything in btrfs, actually; it's just set 
>>> and cleared.  What's the point of that?
>>>
>>
>> It is used now, check the beginning of start_transaction().  Thanks,
> 
> Oh I see, okay.
> 
> So clearing it in btrfs_commit_transaction_async should be fine then, 
> right?  When btrfs_commit_transaction runs in the other thread it won't 
> care that current->journal_info is NULL.
> 

Oh yeah your patch is good :),

Josef


Re: stalls with latest btrfs merge into 3.0-rc2

2011-06-13 Thread Josef Bacik

On 06/13/2011 05:07 PM, Jim Schutt wrote:

Hi,

On a system under a heavy write load from multiple ceph OSDs,
I'm running into the following hung tasks where btrfs is implicated.
I'm running commit 3c25fa740e2 from Linus' tree merged with
commit cb9b41c92fa from git://ceph.newdream.net/git/ceph-client.git.

Let me know what else I can do to help sort this out.

[ 961.318047] INFO: task kworker/1:2:2346 blocked for more than 120
seconds.
[ 961.324993] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 961.332891] 8802253dbcd0 0046 8802253dbcb0
88021c138000
[ 961.340398] 880222c3dac0 880222c3dac0 000108c0
88021c138000
[ 961.347893] 880222c3de80 0001 8802253dbd00
813b0f69
[ 961.355384] Call Trace:
[ 961.357838] [] schedule+0x164/0x19e
[ 961.363041] [] btrfs_start_ordered_extent+0xa8/0xc4
[btrfs]
[ 961.370268] [] ? list_del_init+0x21/0x21
[ 961.376075] [] btrfs_wait_ordered_extents+0xd8/0x143
[btrfs]
[ 961.383387] [] btrfs_commit_transaction+0x20b/0x5a4
[btrfs]
[ 961.390642] [] ? list_del_init+0x21/0x21
[ 961.396284] [] do_async_commit+0x1f/0x2c [btrfs]
[ 961.402638] [] process_one_work+0x124/0x1e0
[ 961.408478] [] ?
btrfs_commit_transaction+0x5a4/0x5a4 [btrfs]
[ 961.415917] [] ? destroy_workqueue+0x161/0x161
[ 961.422155] [] worker_thread+0x8f/0x124
[ 961.427642] [] kthread+0x72/0x7a
[ 961.432636] [] kernel_thread_helper+0x4/0x10
[ 961.438561] [] ? retint_restore_args+0xe/0xe
[ 961.444530] [] ? kthread_bind+0x53/0x53
[ 961.450100] [] ? gs_change+0xb/0xb
[ 961.455188] INFO: task btrfs-transacti:7653 blocked for more than 120
seconds.
[ 961.462506] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 961.477512] 88021c24fd90 0046 88021c24fd70
880190a65ac0
[ 961.491735] 8802246cad60 8802246cad60 000108c0
880190a65ac0
[ 961.505853] 8802246cb120 0001 88021c24fdc0
813b0f69
[ 961.521106] Call Trace:
[ 961.526052] [] schedule+0x164/0x19e
[ 961.531910] [] wait_current_trans+0xb8/0xec [btrfs]
[ 961.544101] [] ? list_del_init+0x21/0x21
[ 961.550394] [] ? spin_lock+0xe/0x10 [btrfs]
[ 961.556471] [] start_transaction+0xd1/0x206 [btrfs]
[ 961.563023] [] ?
btree_readpage_end_io_hook+0x192/0x192 [btrfs]
[ 961.570597] [] btrfs_join_transaction+0x15/0x17 [btrfs]
[ 961.577609] [] transaction_kthread+0x154/0x22d [btrfs]
[ 961.584465] [] ? need_resched+0x23/0x2d
[ 961.589967] [] kthread+0x72/0x7a
[ 961.594843] [] kernel_thread_helper+0x4/0x10
[ 961.600777] [] ? retint_restore_args+0xe/0xe
[ 961.606702] [] ? kthread_bind+0x53/0x53
[ 961.612206] [] ? gs_change+0xb/0xb
[ 961.617322] INFO: task cosd:16719 blocked for more than 120 seconds.
[ 961.623702] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 961.631542] 8801bf163cb8 0086 8801bf163c98
880226db16b0
[ 961.639072] 8801df4f5ac0 8801df4f5ac0 000108c0
880226db16b0
[ 961.646516] 8801df4f5e80 0001 8801bf163ce8
813b0f69
[ 961.653955] Call Trace:
[ 961.656483] [] schedule+0x164/0x19e
[ 961.661725] []
wait_current_trans_commit_start_and_unblock+0xa9/0xce [btrfs]
[ 961.670461] [] ? list_del_init+0x21/0x21
[ 961.676028] [] ? queue_work+0x1f/0x21
[ 961.681436] []
btrfs_commit_transaction_async+0xd3/0x115 [btrfs]
[ 961.689118] [] create_snapshot+0xe5/0x177 [btrfs]
[ 961.695546] [] btrfs_mksubvol+0xfa/0x167 [btrfs]
[ 961.702416] []
btrfs_ioctl_snap_create_transid+0xff/0x121 [btrfs]
[ 961.710454] [] btrfs_ioctl_snap_create_v2+0x88/0xea
[btrfs]
[ 961.718000] [] btrfs_ioctl+0x208/0x358 [btrfs]
[ 961.724088] [] vfs_ioctl+0x1d/0x34
[ 961.729159] [] do_vfs_ioctl+0x171/0x17a
[ 961.734732] [] ? fget_light+0x69/0x81
[ 961.740057] [] sys_ioctl+0x5c/0x7c
[ 961.745333] [] ? jbd_free_handle+0x1b/0x1d [jbd]
[ 961.752028] [] system_call_fastpath+0x16/0x1b
[ 961.758429] INFO: task cosd:16720 blocked for more than 120 seconds.
[ 961.765416] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[ 961.776419] 8801bf123bb8 0086 8801bf123b98
8801664bad60
[ 961.792604] 8801df4f2d60 8801df4f2d60 000108c0
8801664bad60
[ 961.814250] 8801df4f3120  8801bf123be8
813b0f69
[ 961.825356] Call Trace:
[ 961.830932] [] schedule+0x164/0x19e
[ 961.837403] [] wait_current_trans+0xb8/0xec [btrfs]
[ 961.844413] [] ? list_del_init+0x21/0x21
[ 961.850203] [] ? kmem_cache_alloc+0xad/0xb9
[ 961.856255] [] start_transaction+0xd1/0x206 [btrfs]
[ 961.863351] [] btrfs_start_transaction+0x13/0x15
[btrfs]
[ 961.870549] [] btrfs_create+0x3b/0x197 [btrfs]
[ 961.876889] [] vfs_create+0x72/0x92
[ 961.882115] [] do_last+0x22c/0x40b
[ 961.887195] [] path_openat+0xc0/0x2ef
[ 961.892581] [] do_filp_open+0x3d/0x87
[ 961.897888] [] ? strncpy_from_user+0x43/0x4d
[ 961.903880] [] ? getname_flags+0x2e/0x80
[ 961.909463] [] ? do_getname+0x14b/0x173
[ 961.915472] [] ? audi

Re: stalls with latest btrfs merge into 3.0-rc2

2011-06-13 Thread Josef Bacik

On 06/13/2011 05:07 PM, Jim Schutt wrote:

Hi,

On a system under a heavy write load from multiple ceph OSDs,
I'm running into the following hung tasks where btrfs is implicated.
I'm running commit 3c25fa740e2 from Linus' tree merged with
commit cb9b41c92fa from git://ceph.newdream.net/git/ceph-client.git.



Please try this patch and verify it fixes the problem.  If it does I'll 
make it less crappy and send it along.  Thanks,


Josef
diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c
index 7a9f517..532139e 100644
--- a/fs/btrfs/transaction.c
+++ b/fs/btrfs/transaction.c
@@ -1236,12 +1236,16 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans,
schedule_timeout(1);
 
finish_wait(&cur_trans->writer_wait, &wait);
-   spin_lock(&root->fs_info->trans_lock);
-   root->fs_info->trans_no_join = 1;
-   spin_unlock(&root->fs_info->trans_lock);
} while (atomic_read(&cur_trans->num_writers) > 1 ||
 (should_grow && cur_trans->num_joined != joined));
 
+   spin_lock(&root->fs_info->trans_lock);
+   root->fs_info->trans_no_join = 1;
+   spin_unlock(&root->fs_info->trans_lock);
+
+   while (atomic_read(&cur_trans->num_writers) > 1)
+   schedule_timeout(1);
+
ret = create_pending_snapshots(trans, root->fs_info);
BUG_ON(ret);
 


Re: Delayed inode operations not doing the right thing with enospc

2011-07-12 Thread Josef Bacik
On 07/12/2011 11:20 AM, Christian Brunner wrote:
> 2011/6/7 Josef Bacik :
>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>> On Fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>> I got a lot of these when running stress.sh on my test box
>>>>
>>>>
>>>>
>>>> This is because use_block_rsv() is having to do a
>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>> reserved enough space for those operations to complete.  This is
>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>> trans->block_rsv, which will only have what the current transaction
>>>> starter had reserved.
>>>>
>>>> What needs to be done instead is we need to have a block reserve that
>>>> any reservation that is done at create time for these inodes is migrated
>>>> to this special reserve, and then when you run the delayed inode items
>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>> accounting is all done properly.
>>>>
>>>> This is just off the top of my head, there may be a better way to do it,
>>>> I've not actually looked at the delayed inode code at all.
>>>>
>>>> I would do this myself but I have an ever-increasing list of shit to do
>>>> so will somebody pick this up and fix it please?  Thanks,
>>>
>>> Sorry, it's my mistake.
>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>> the space from trans_block_rsv to global_block_rsv.
>>>
>>> I'll fix it soon.
>>>
>>
>> There is another problem, we're failing xfstest 204.  I tried making
>> reserve_metadata_bytes commit the transaction regardless of whether or
>> not there were pinned bytes but the test just hung there.  Usually it
>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>> 204 just creates a crap ton of files, which is what is killing us.
>> There needs to be a way to start flushing delayed inode items so we can
>> reclaim the space they are holding onto so we don't get enospc, and it
>> needs to be better than just committing the transaction because that is
>> dog slow.  Thanks,
>>
>> Josef
> 
> Is there a solution for this?
> 
> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
> (except the plugging). When starting a ceph rebuild on the btrfs
> volumes I get a lot of warnings from block_rsv_use_bytes in
> use_block_rsv:
> 

Yeah there is something wonky going on here, I meant to take a look this
week but I will go ahead and look into it now.  I have a way to
reproduce it thankfully, but I may have you run my patches when I get
somewhere.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Delayed inode operations not doing the right thing with enospc

2011-07-13 Thread Josef Bacik
On 07/12/2011 11:20 AM, Christian Brunner wrote:
> 2011/6/7 Josef Bacik :
>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>> On Fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>> I got a lot of these when running stress.sh on my test box
>>>>
>>>>
>>>>
>>>> This is because use_block_rsv() is having to do a
>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>> reserved enough space for those operations to complete.  This is
>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>> trans->block_rsv, which will only have what the current transaction
>>>> starter had reserved.
>>>>
>>>> What needs to be done instead is we need to have a block reserve that
>>>> any reservation that is done at create time for these inodes is migrated
>>>> to this special reserve, and then when you run the delayed inode items
>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>> accounting is all done properly.
>>>>
>>>> This is just off the top of my head, there may be a better way to do it,
>>>> I've not actually looked at the delayed inode code at all.
>>>>
>>>> I would do this myself but I have an ever-increasing list of shit to do
>>>> so will somebody pick this up and fix it please?  Thanks,
>>>
>>> Sorry, it's my mistake.
>>> I forgot to set trans->block_rsv to global_block_rsv, since we have migrated
>>> the space from trans_block_rsv to global_block_rsv.
>>>
>>> I'll fix it soon.
>>>
>>
>> There is another problem, we're failing xfstest 204.  I tried making
>> reserve_metadata_bytes commit the transaction regardless of whether or
>> not there were pinned bytes but the test just hung there.  Usually it
>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>> 204 just creates a crap ton of files, which is what is killing us.
>> There needs to be a way to start flushing delayed inode items so we can
>> reclaim the space they are holding onto so we don't get enospc, and it
>> needs to be better than just committing the transaction because that is
>> dog slow.  Thanks,
>>
>> Josef
> 
> Is there a solution for this?
> 
> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
> (except the plugging). When starting a ceph rebuild on the btrfs
> volumes I get a lot of warnings from block_rsv_use_bytes in
> use_block_rsv:
> 

Ok I think I've got this nailed down.  Will you run with this patch and make 
sure the warnings go away?  Thanks,

Josef

diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..2263d29 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 */
u64 disk_i_size;
 
-   /* flags field from the on disk inode */
-   u32 flags;
-
/*
 * if this is a directory then index_cnt is the counter for the index
 * number for new files that are created
@@ -128,14 +125,8 @@ struct btrfs_inode {
 */
u64 last_unlink_trans;
 
-   /*
-* Counters to keep track of the number of extent item's we may use due
-* to delalloc and such.  outstanding_extents is the number of extent
-* items we think we'll end up using, and reserved_extents is the number
-* of extent items we've reserved metadata for.
-*/
-   atomic_t outstanding_extents;
-   atomic_t reserved_extents;
+   /* flags field from the on disk inode */
+   u32 flags;
 
/*
 * ordered_data_close is set by truncate when a file that used
@@ -151,12 +142,21 @@ struct btrfs_inode {
unsigned orphan_meta_reserved:1;
unsigned dummy_inode:1;
unsigned in_defrag:1;
-
/*
 * always compress this one file
 */
unsigned force_compress:4;
 
+   /*
+* Counters to keep track of the number of extent item's we may use due
+* to delalloc and such.  outstanding_extents is the number of extent
+* items we think we'll end up using, and reserved_extents is the number
+* of extent items we've reserved metadata for.
+*/
+   spinlock_t extents_count_lock;
+   unsigned outstanding_extents;
+   unsigned reserved_extents;
+
struct btrfs_delayed_node *delayed_node;
 
struct inode vfs_inode;
diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index be02cae..3ba4d5f 100644
--- a/fs

Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On Fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked at the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have an ever-increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my mistake.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have 
>>>>> migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the plugging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Ok I see what's wrong, our checksum calculation is completely bogus.
I'm in the middle of something big so I can't give you a nice clean
patch, but if you can just go into extent-tree.c and replace
calc_csum_metadata_size with this, you should be good to go:

static u64 calc_csum_metadata_size(struct inode *inode, u64 num_bytes)
{
	struct btrfs_root *root = BTRFS_I(inode)->root;
	int num_leaves;
	int num_csums;
	u16 csum_size = btrfs_super_csum_size(&root->fs_info->super_copy);

	num_csums = (int)div64_u64(num_bytes, root->sectorsize);
	num_leaves = (int)((num_csums * csum_size) / root->leafsize);

	return btrfs_calc_trans_metadata_size(root, num_leaves);
}


Thanks,

Josef


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On Fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked at the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have an ever-increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my mistake.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have 
>>>>> migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the plugging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed that I'm not getting these messages when the
> free_space_cache is disabled.
> 
>

Actually scratch that last note, it's wrong.  I'll send you an updated
patch when I've got this mess all sorted out.  Thanks,

Josef


Re: Delayed inode operations not doing the right thing with enospc

2011-07-14 Thread Josef Bacik
On 07/14/2011 03:27 AM, Christian Brunner wrote:
> 2011/7/13 Josef Bacik :
>> On 07/12/2011 11:20 AM, Christian Brunner wrote:
>>> 2011/6/7 Josef Bacik :
>>>> On 06/06/2011 09:39 PM, Miao Xie wrote:
>>>>> On Fri, 03 Jun 2011 14:46:10 -0400, Josef Bacik wrote:
>>>>>> I got a lot of these when running stress.sh on my test box
>>>>>>
>>>>>>
>>>>>>
>>>>>> This is because use_block_rsv() is having to do a
>>>>>> reserve_metadata_bytes(), which shouldn't happen as we should have
>>>>>> reserved enough space for those operations to complete.  This is
>>>>>> happening because use_block_rsv() will call get_block_rsv(), which if
>>>>>> root->ref_cows is set (which is the case on all fs roots) we will use
>>>>>> trans->block_rsv, which will only have what the current transaction
>>>>>> starter had reserved.
>>>>>>
>>>>>> What needs to be done instead is we need to have a block reserve that
>>>>>> any reservation that is done at create time for these inodes is migrated
>>>>>> to this special reserve, and then when you run the delayed inode items
>>>>>> stuff you set trans->block_rsv to the special block reserve so the
>>>>>> accounting is all done properly.
>>>>>>
>>>>>> This is just off the top of my head, there may be a better way to do it,
>>>>>> I've not actually looked at the delayed inode code at all.
>>>>>>
>>>>>> I would do this myself but I have an ever-increasing list of shit to do
>>>>>> so will somebody pick this up and fix it please?  Thanks,
>>>>>
>>>>> Sorry, it's my mistake.
>>>>> I forgot to set trans->block_rsv to global_block_rsv, since we have 
>>>>> migrated
>>>>> the space from trans_block_rsv to global_block_rsv.
>>>>>
>>>>> I'll fix it soon.
>>>>>
>>>>
>>>> There is another problem, we're failing xfstest 204.  I tried making
>>>> reserve_metadata_bytes commit the transaction regardless of whether or
>>>> not there were pinned bytes but the test just hung there.  Usually it
>>>> takes 7 seconds to run and I ctrl+c'ed it after a couple of minutes.
>>>> 204 just creates a crap ton of files, which is what is killing us.
>>>> There needs to be a way to start flushing delayed inode items so we can
>>>> reclaim the space they are holding onto so we don't get enospc, and it
>>>> needs to be better than just committing the transaction because that is
>>>> dog slow.  Thanks,
>>>>
>>>> Josef
>>>
>>> Is there a solution for this?
>>>
>>> I'm running a 2.6.38.8 kernel with all the btrfs patches from 3.0rc7
>>> (except the pluging). When starting a ceph rebuild on the btrfs
>>> volumes I get a lot of warnings from block_rsv_use_bytes in
>>> use_block_rsv:
>>>
>>
>> Ok I think I've got this nailed down.  Will you run with this patch and make 
>> sure the warnings go away?  Thanks,
> 
> I'm sorry, I'm still getting a lot of warnings like the one below.
> 
> I've also noticed that I'm not getting these messages when the
> free_space_cache is disabled.
> 

Ok ditch that previous patch and try this one, it should work.  Thanks,

Josef


diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 52d7eca..2263d29 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -112,9 +112,6 @@ struct btrfs_inode {
 */
u64 disk_i_size;

-   /* flags field from the on disk inode */
-   u32 flags;
-
/*
 * if this is a directory then index_cnt is the counter for the index
 * number for new files that are created
@@ -128,14 +125,8 @@ struct btrfs_inode {
 */
u64 last_unlink_trans;

-   /*
-* Counters to keep track of the number of extent item's we may use due
-* to delalloc and such.  outstanding_extents is the number of extent
-* items we think we'll end up using, and reserved_extents is the number
-* of extent items we've reserved metadata for.
-*/
-   atomic_t outstanding_extents;
-   atomic_t reserved_extents;
+   /* flags field from the on disk inode */
+   u32 flags;

/*
 * ordered_data_close is set by truncate when a file that used
@@ -151,12 +142,21 @@ struc

Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

2011-09-15 Thread Josef Bacik
On Thu, Sep 15, 2011 at 11:44:09AM -0700, Sage Weil wrote:
> On Tue, 13 Sep 2011, Liu Bo wrote:
> > On 09/11/2011 05:47 AM, Martin Mailand wrote:
> > > Hi
> > > I am hitting this Warning reproducibly; the workload is a ceph osd,
> > > kernel is 3.1.0-rc5.
> > > 
> > 
> > Have posted a patch for this:
> > 
> > http://marc.info/?l=linux-btrfs&m=131547325515336&w=2
> 
> We're still seeing this with -rc6, which includes 98c9942 and 65450aa.
> 
> I haven't looked at the reservation code in much detail.  Is there 
> anything I can do to help track this down?
> 

This should be taken care of with all my enospc changes.  You can pull them down
from my btrfs-work tree as soon as kernel.org comes back from the dead :).
Thanks,

Josef


Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

2011-09-16 Thread Josef Bacik
On 09/16/2011 10:09 AM, Martin Mailand wrote:
> Hi Josef,
> after a quick test it seems that I do not hit this Warning any longer.
> But I got a new one.
> 

Hmm looks like that may not be my newest stuff, is commit

57f499e1bb76ba3ebeb09cd12e9dac84baa5812b

in there?  Specifically look at __btrfs_end_transaction in transaction.c
and see if the line

trans->block_rsv = NULL;

is before the first while() loop.  Thanks,

Josef


Re: WARNING: at fs/btrfs/inode.c:2193 btrfs_orphan_commit_root+0xb0/0xc0 [btrfs]()

2011-09-19 Thread Josef Bacik
On 09/16/2011 12:25 PM, Jim Schutt wrote:
> David Sterba wrote:
>> On Thu, Sep 15, 2011 at 11:44:09AM -0700, Sage Weil wrote:
>>> On Tue, 13 Sep 2011, Liu Bo wrote:
 On 09/11/2011 05:47 AM, Martin Mailand wrote:
> Hi
> I am hitting this Warning reproducibly; the workload is a ceph osd,
> kernel is 3.1.0-rc5.
>
 Have posted a patch for this:

 http://marc.info/?l=linux-btrfs&m=131547325515336&w=2
>>> We're still seeing this with -rc6, which includes 98c9942 and 65450aa.
>>
>> Me too, for the
>> WARNING: at fs/btrfs/extent-tree.c:5711
>> btrfs_alloc_free_block+0x180/0x350 [btrfs]()
>>
> 
> FWIW, I'm seeing a slightly different case, while testing branch
> integration/btrfs-next (commit 2828cbd9620e03) from
> git://repo.or.cz/linux-2.6/btrfs-unstable.git merged into branch
> master (commit c455ea4f122d21) from git://github.com/torvalds/linux.git
> 

Ah yeah sorry I see what's going on here, I just missed a few places we
call run_delayed_refs() where we can still have trans->block_rsv set.  I
will fix this and send a patch soon.  Thanks,

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Josef Bacik
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> [adding linux-btrfs to cc]
> 
> Josef, Chris, any ideas on the below issues?
> 
> On Mon, 24 Oct 2011, Christian Brunner wrote:
> > Thanks for explaining this. I don't have any objections against btrfs
> > as a osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > scare me, since I can use the ceph replication to recover a lost
> > btrfs-filesystem. The only problem I have is, that btrfs is not stable
> > on our side and I wonder what you are doing to make it work. (Maybe
> > it's related to the load pattern of using ceph as a backend store for
> > qemu).
> > 
> > Here is a list of the btrfs problems I'm having:
> > 
> > - When I run ceph with the default configuration (btrfs snaps enabled)
> > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > Btrfs-cleaner is using more and more time in
> > btrfs_clean_old_snapshots().
> 
> In theory, there shouldn't be any significant difference between taking a 
> snapshot and removing it a few commits later, and the prior root refs that 
> btrfs holds on to internally until the new commit is complete.  That's 
> clearly not quite the case, though.
> 
> In any case, we're going to try to reproduce this issue in our 
> environment.
> 

I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't.  I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).

> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
> 
> FYI in this scenario you're exposed to the same journal replay issues that 
> ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> not be all that special, though, so this problem shouldn't be unique to 
> ceph.
> 

Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
is up to.

> > Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> > from time to time. Maybe it's related to the performance issues, but
> > I haven't been able to verify this.
> 
> I haven't seen this yet with the latest stuff from Josef, but others have.  
> Josef, is there any information we can provide to help track it down?
>

Actually this would show up in 2 cases, I fixed the one most people hit with my
earlier stuff and then fixed the other one more recently, hopefully it will be
fixed in 3.2.  A full backtrace would be nice so I can figure out which one it
is you are hitting.
 
> > It's really sad to see, that ceph performance and stability is
> > suffering that much from the underlying filesystems and that this
> > hasn't changed over the last months.
> 
> We don't have anyone internally working on btrfs at the moment, and are 
> still struggling to hire experienced kernel/fs people.  Josef has been 
> very helpful with tracking these issues down, but he has responsibilities 
> beyond just the Ceph related issues.  Progress is slow, but we are 
> working on it!

I'm open to offers ;).  These things are being hit by people all over the place,
but it's hard for me to reproduce, especially since most of the reports are "run
X server for Y days and wait for it to start sucking."  I will try and get a box
setup that I can let stress.sh run on for a few days to see if I can make some
of this stuff come out to play with me, but unfortunately I end up having to
debug these kind of things over email, which means they get a whole lot of
nowhere.  Thanks,

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik :
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what 
> > btrfs-endio-write
> > is up to.
> 
> Capturing this doesn't seem to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  2053 root  20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root  20   0 000 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osds and btrfs-endio-writers.
> 

I'm seeing a lot of this

[schedule]  1654.6 msec 96.4 %
schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
generic_write_sync blkdev_aio_write do_sync_readv_writev
do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from fsyncing a block device
directly, and not so much from being tied up by btrfs.  With 22% CPU being
taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
record -ag when this is going on and then perf report so we can see what
btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
horribly wrong or introducing a lot of latency.  Most of it seems to be when
running the delayed refs and having to read in blocks.  I've been suspecting for
a while that the delayed ref stuff ends up doing way more work than it needs to
per task, and it's possible that btrfs-endio-wri is simply getting screwed by
other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is not
related to btrfs, the latency seems to all be from the fact that ceph-osd is
fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
its blowing a lot of CPU time, so perf record -ag is probably going to be your
best bet when it's using lots of cpu so we can figure out what it's spinning on.
Thanks,

Josef



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik :
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik :
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues 
> >> >> that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what 
> >> > btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this does not seem to be easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> a nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running with iostat I see the following:
> >>
> >> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd        0.00    0.00  0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
> >> sdc        0.00    1.00  0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> >> sdb        0.00    0.00  0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
> >> sda        0.00    0.00  0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much being tied up by btrfs directly.  With 22% CPU 
> > being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
> > perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
> > to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the 
> > rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
> > anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the delayed refs and having to read in blocks.  I've been 
> > suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it 
> > needs to
> > be per task, and it's possible that btrfs-endio-wri is simply getting 
> > screwed by
> > other people doing work.
> >
> > At this poi

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > 
> > > Attached is a perf-report. I have included the whole report, so that
> > > you can see the difference between the good and the bad
> > > btrfs-endio-wri.
> > >
> > 
> > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > thanks so much for this, I should be able to nail this down pretty easily.
> > Thanks,
> 
> Looks like we're getting there from reserve_metadata_bytes when we join
> the transaction?
>

We don't do reservations in the endio stuff, we assume you've reserved all the
space you need in delalloc, plus we would have seen reserve_metadata_bytes in
the trace.  Though it does look like perf is lying to us in at least one case
since btrfs_alloc_logged_file_extent is only called from log replay and not
during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Josef 


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > > 
> > > > > Attached is a perf-report. I have included the whole report, so that
> > > > > you can see the difference between the good and the bad
> > > > > btrfs-endio-wri.
> > > > >
> > > > 
> > > > We also shouldn't be running run_ordered_operations, man this is 
> > > > screwed up,
> > > > thanks so much for this, I should be able to nail this down pretty 
> > > > easily.
> > > > Thanks,
> > > 
> > > Looks like we're getting there from reserve_metadata_bytes when we join
> > > the transaction?
> > >
> > 
> > We don't do reservations in the endio stuff, we assume you've reserved all 
> > the
> > space you need in delalloc, plus we would have seen reserve_metadata_bytes 
> > in
> > the trace.  Though it does look like perf is lying to us in at least one 
> > case
> > since btrfs_alloc_logged_file_extent is only called from log replay and not
> > during normal runtime, so it definitely shouldn't be showing up.  Thanks,
> 
> Whoops, I should have read that num_items > 0 check harder.
> 
> btrfs_end_transaction is doing it by setting ->blocked = 1
> 
> if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
> should_end_transaction(trans, root)) {
> trans->transaction->blocked = 1;
>   ^
> smp_wmb();
> }
> 
>if (lock && cur_trans->blocked && !cur_trans->in_commit) {
>^^^
> if (throttle) {
> /*
>  * We may race with somebody else here so end up 
> having
>  * to call end_transaction on ourselves again, so inc
>  * our use_count.
>  */
> trans->use_count++;
> return btrfs_commit_transaction(trans, root);
> } else {
> wake_up_process(info->transaction_kthread);
> }
> }
> 

Not sure what you are getting at here?  Even if we set blocked, we're not
throttling so it will just wake up the transaction kthread, so we won't do the
commit in the endio case.  Thanks

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Thu, Oct 27, 2011 at 11:07:38AM -0400, Josef Bacik wrote:
> On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > > > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > > > 
> > > > > > Attached is a perf-report. I have included the whole report, so that
> > > > > > you can see the difference between the good and the bad
> > > > > > btrfs-endio-wri.
> > > > > >
> > > > > 
> > > > > We also shouldn't be running run_ordered_operations, man this is 
> > > > > screwed up,
> > > > > thanks so much for this, I should be able to nail this down pretty 
> > > > > easily.
> > > > > Thanks,
> > > > 
> > > > Looks like we're getting there from reserve_metadata_bytes when we join
> > > > the transaction?
> > > >
> > > 
> > > We don't do reservations in the endio stuff, we assume you've reserved 
> > > all the
> > > space you need in delalloc, plus we would have seen 
> > > reserve_metadata_bytes in
> > > the trace.  Though it does look like perf is lying to us in at least one 
> > > case
> > > since btrfs_alloc_logged_file_extent is only called from log replay and 
> > > not
> > > during normal runtime, so it definitely shouldn't be showing up.  Thanks,
> > 
> > Whoops, I should have read that num_items > 0 check harder.
> > 
> > btrfs_end_transaction is doing it by setting ->blocked = 1
> > 
> > if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
> > should_end_transaction(trans, root)) {
> > trans->transaction->blocked = 1;
> > ^
> > smp_wmb();
> > }
> > 
> >if (lock && cur_trans->blocked && !cur_trans->in_commit) {
> >^^^
> > if (throttle) {
> > /*
> >  * We may race with somebody else here so end up 
> > having
> >  * to call end_transaction on ourselves again, so 
> > inc
> >  * our use_count.
> >  */
> > trans->use_count++;
> > return btrfs_commit_transaction(trans, root);
> > } else {
> > wake_up_process(info->transaction_kthread);
> > }
> > }
> > 
> 
> Not sure what you are getting at here?  Even if we set blocked, we're not
> throttling so it will just wake up the transaction kthread, so we won't do the
> commit in the endio case.  Thanks
> 

Oh I see what you were trying to say, that we'd set blocking and then commit the
transaction from the endio process which would run ordered operations, but since
throttle isn't set that won't happen.  I think that the perf symbols are just
lying to us.  Thanks,
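
[Editor's note: the branch Josef and Chris are discussing can be modeled in a few lines. This is a simplified Python sketch of the quoted kernel logic, not real kernel code; `lock` is assumed true, as in the quoted path.]

```python
def end_transaction_action(blocked, in_commit, throttle):
    """Model of the quoted btrfs_end_transaction() branch: even when
    blocked is set, the commit path (which runs ordered operations) is
    only taken if throttle is also set.  The endio path never throttles,
    so it merely wakes the transaction kthread."""
    if blocked and not in_commit:
        if throttle:
            return "btrfs_commit_transaction"   # runs ordered operations
        return "wake_up transaction_kthread"    # endio case
    return "nothing"
```

Since the endio case has throttle unset, it never reaches the commit, which is why the run_ordered_operations frames in the perf output look bogus.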

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik :
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what 
> > btrfs-endio-write
> > is up to.
> 
> Capturing this does not seem to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following:
> 
> Device:  rrqm/s  wrqm/s   r/s     w/s  rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd        0.00    0.00  0.00    4.33    0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc        0.00    1.00  0.00  228.33    0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb        0.00    0.00  0.00    1.33    0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda        0.00    0.00  0.00    0.67    0.00     8.00    12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
> 
> Regards,
> Christian

Ok just a shot in the dark, but could you give this a whirl and see if it helps
you?  Thanks

Josef


diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 125cf76..fbc196e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -210,9 +210,9 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-  struct list_head *cluster, u64 start)
+  struct list_head *cluster, u64 start, unsigned long max_count)
 {
-   int count = 0;
+   unsigned long count = 0;
struct btrfs_delayed_ref_root *delayed_refs;
struct rb_node *node;
struct btrfs_delayed_ref_node *ref;
@@ -242,7 +242,7 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
node = rb_first(&delayed_refs->root);
}
 again:
-   while (node && count < 32) {
+   while (node && count < max_count) {
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
if (btrfs_delayed_ref_is_head(ref)) {
head = btrfs_delayed_node_to_head(ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e287e3b..b15a6ad 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -169,7 +169,8 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
   struct btrfs_delayed_ref_head *head);
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-  struct list_head *cluster, u64 search_start);
+  struct list_head *cluster, u64 search_start,
+  unsigned long max_count);
 /*
  * a node might live in a head or a regular ref, this lets you
  * test for the proper type to use.
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 31d84e7..c190282 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -81,6 +81,7 @@ int btrfs_inse

Re: Btrfs slowdown with ceph (how to reproduce)

2012-01-23 Thread Josef Bacik
On Fri, Jan 20, 2012 at 01:13:37PM +0100, Christian Brunner wrote:
> As you might know, I have been seeing btrfs slowdowns in our ceph
> cluster for quite some time. Even with the latest btrfs code for 3.3
> I'm still seeing these problems. To make things reproducible, I've now
> written a small test, that imitates ceph's behavior:
> 
> On a freshly created btrfs filesystem (2 TB size, mounted with
> "noatime,nodiratime,compress=lzo,space_cache,inode_cache") I'm opening
> 100 files. After that I'm doing random writes on these files with a
> sync_file_range after each write (each write has a size of 100 bytes)
> and ioctl(BTRFS_IOC_SYNC) after every 100 writes.
> 
> After approximately 20 minutes, write activity suddenly increases
> fourfold and the average request size decreases (see chart in the
> attachment).
> 
> You can find IOstat output here: http://pastebin.com/Smbfg1aG
> 
> I hope that you are able to trace down the problem with the test
> program in the attachment.
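
[Editor's note: Christian's test program was attached to the original mail and is not in the archive. A rough Python sketch of the described workload follows; file names and the use of os.fsync()/os.sync() in place of sync_file_range() and ioctl(BTRFS_IOC_SYNC) are illustrative stand-ins. Run it on a btrfs mount to approximate the original.]

```python
import os
import random

def run_workload(dirname, n_files=100, n_writes=1000, write_size=100,
                 file_size=4 * 1024 * 1024, sync_every=100):
    """Imitate ceph's behavior as described above: open a set of files,
    do small random writes, sync after each write, and force a
    filesystem-wide sync every `sync_every` writes."""
    fds = []
    for i in range(n_files):
        path = os.path.join(dirname, "obj%03d" % i)
        fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o644)
        os.ftruncate(fd, file_size)
        fds.append(fd)
    for n in range(1, n_writes + 1):
        fd = random.choice(fds)
        os.lseek(fd, random.randrange(file_size - write_size), os.SEEK_SET)
        os.write(fd, os.urandom(write_size))
        os.fsync(fd)        # stand-in for sync_file_range() per write
        if n % sync_every == 0:
            os.sync()       # stand-in for ioctl(BTRFS_IOC_SYNC)
    for fd in fds:
        os.close(fd)
    return n_writes

```

The original ran 100 files on a 2 TB filesystem for about 20 minutes before write activity jumped; scale n_files and n_writes accordingly.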
 
Ran it, saw the problem, tried the dangerdonteveruse branch in Chris's tree and
formatted the fs with 64k node and leaf sizes and the problem appeared to go
away.  So surprise surprise, fragmentation is biting us in the ass.  If you can,
try running that branch with 64k node and leaf sizes on your ceph cluster and
see how that works out.  Of course, you should only do that if you don't mind
losing everything :).  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-04-24 Thread Josef Bacik
On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> After running ceph on XFS for some time, I decided to try btrfs again.
> Performance with the current "for-linux-min" branch and big metadata
> is much better. The only problem (?) I'm still seeing is a warning
> that seems to occur from time to time:
> 
> [87703.784552] [ cut here ]
> [87703.789759] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [87703.799070] Hardware name: ProLiant DL180 G6
> [87703.804024] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [87703.828166] Pid: 929, comm: kworker/1:2 Tainted: P   O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [87703.837513] Call Trace:
> [87703.840280]  [] warn_slowpath_common+0x7f/0xc0
> [87703.847016]  [] warn_slowpath_null+0x1a/0x20
> [87703.853533]  [] btrfs_orphan_commit_root+0xf6/0x100 
> [btrfs]
> [87703.861541]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [87703.868674]  []
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [87703.876745]  [] ? __switch_to+0x153/0x440
> [87703.882966]  [] ? wake_up_bit+0x40/0x40
> [87703.888997]  [] ?
> btrfs_commit_transaction+0xa50/0xa50 [btrfs]
> [87703.897271]  [] do_async_commit+0x1f/0x30 [btrfs]
> [87703.904262]  [] process_one_work+0x129/0x450
> [87703.910777]  [] worker_thread+0x17b/0x3c0
> [87703.916991]  [] ? manage_workers+0x220/0x220
> [87703.923504]  [] kthread+0x9e/0xb0
> [87703.928952]  [] kernel_thread_helper+0x4/0x10
> [87703.93]  [] ? kthread_freezable_should_stop+0x70/0x70
> [87703.943323]  [] ? gs_change+0x13/0x13
> [87703.949149] ---[ end trace b8c31966cca731fa ]---
> [91128.812399] [ cut here ]
> [91128.817576] WARNING: at fs/btrfs/inode.c:2103
> btrfs_orphan_commit_root+0xf6/0x100 [btrfs]()
> [91128.826930] Hardware name: ProLiant DL180 G6
> [91128.831897] Modules linked in: btrfs zlib_deflate libcrc32c xfs
> exportfs sunrpc bonding ipv6 sg serio_raw pcspkr iTCO_wdt
> iTCO_vendor_support i7core_edac edac_core ixgbe dca mdio
> iomemory_vsl(PO) hpsa squashfs [last unloaded: scsi_wait_scan]
> [91128.856086] Pid: 6806, comm: btrfs-transacti Tainted: PW  O
> 3.3.2-1.fits.1.el6.x86_64 #1
> [91128.865912] Call Trace:
> [91128.868670]  [] warn_slowpath_common+0x7f/0xc0
> [91128.875379]  [] warn_slowpath_null+0x1a/0x20
> [91128.881900]  [] btrfs_orphan_commit_root+0xf6/0x100 
> [btrfs]
> [91128.889894]  [] commit_fs_roots+0xc6/0x1c0 [btrfs]
> [91128.897019]  [] ?
> btrfs_run_delayed_items+0xf1/0x160 [btrfs]
> [91128.905075]  []
> btrfs_commit_transaction+0x5db/0xa50 [btrfs]
> [91128.913156]  [] ? start_transaction+0x92/0x310 [btrfs]
> [91128.920643]  [] ? wake_up_bit+0x40/0x40
> [91128.926667]  [] transaction_kthread+0x26b/0x2e0 [btrfs]
> [91128.934254]  [] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.943671]  [] ?
> btrfs_destroy_marked_extents.clone.0+0x1f0/0x1f0 [btrfs]
> [91128.953079]  [] kthread+0x9e/0xb0
> [91128.958532]  [] kernel_thread_helper+0x4/0x10
> [91128.965133]  [] ? kthread_freezable_should_stop+0x70/0x70
> [91128.972913]  [] ? gs_change+0x13/0x13
> [91128.978826] ---[ end trace b8c31966cca731fb ]---
> 
> I'm able to reproduce this with ceph on a single server with 4 disks
> (4 filesystems/osds) and a small test program based on librbd. It is
> simply writing random bytes on a rbd volume (see attachment).
> 
> Is this something I should care about? Any hint's on solving this
> would be appreciated.
> 

Can you send me a config or some basic steps for me to setup ceph on my box so I
can run this program and finally track down this problem?  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-04-24 Thread Josef Bacik
On Tue, Apr 24, 2012 at 09:26:15AM -0700, Sage Weil wrote:
> On Tue, 24 Apr 2012, Josef Bacik wrote:
> > On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> > > After running ceph on XFS for some time, I decided to try btrfs again.
> > > Performance with the current "for-linux-min" branch and big metadata
> > > is much better. The only problem (?) I'm still seeing is a warning
> > > that seems to occur from time to time:
> 
> Actually, before you do that... we have a new tool, 
> test_filestore_workloadgen, that generates a ceph-osd-like workload on the 
> local file system.  It's a subset of what a full OSD might do, but if 
> we're lucky it will be sufficient to reproduce this issue.  Something like
> 
>  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> 
> will hopefully do the trick.
> 
> Christian, maybe you can see if that is able to trigger this warning?  
> You'll need to pull it from the current master branch; it wasn't in the 
> last release.
> 

Keep up the good work Sage, at this rate I'll never have to set up ceph for
myself :),

Josef


Re: Ceph on btrfs 3.4rc

2012-05-03 Thread Josef Bacik
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> Am 24. April 2012 18:26 schrieb Sage Weil :
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
> 
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
> 
> You will need a single system with two disks and a bit of memory.
> 
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> 
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
> 
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> 
> - Now you should create and mount btrfs. Here is what I did:
> 
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> 
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
> 
> - Create the ceph filesystems:
> 
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
> 
> - Start ceph (e.g. "service ceph start")
> 
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
> 
> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> 

It's failing here

http://fpaste.org/e3BG/

Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-05-03 Thread Josef Bacik
On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik 
> wrote:
> > On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> >> Am 24. April 2012 18:26 schrieb Sage Weil :
> >> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> >> > Performance with the current "for-linux-min" branch and big metadata
> >> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> >> > that seems to occur from time to time:
> >> >
> >> > Actually, before you do that... we have a new tool,
> >> > test_filestore_workloadgen, that generates a ceph-osd-like workload on 
> >> > the
> >> > local file system.  It's a subset of what a full OSD might do, but if
> >> > we're lucky it will be sufficient to reproduce this issue.  Something 
> >> > like
> >> >
> >> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >> >
> >> > will hopefully do the trick.
> >> >
> >> > Christian, maybe you can see if that is able to trigger this warning?
> >> > You'll need to pull it from the current master branch; it wasn't in the
> >> > last release.
> >>
> >> Trying to reproduce with test_filestore_workloadgen didn't work for
> >> me. So here are some instructions on how to reproduce with a minimal
> >> ceph setup.
> >>
> >> You will need a single system with two disks and a bit of memory.
> >>
> >> - Compile and install ceph (detailed instructions:
> >> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> >>
> >> - For the test setup I've used two tmpfs files as journal devices. To
> >> create these, do the following:
> >>
> >> # mkdir -p /ceph/temp
> >> # mount -t tmpfs tmpfs /ceph/temp
> >> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> >> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> >>
> >> - Now you should create and mount btrfs. Here is what I did:
> >>
> >> # mkfs.btrfs -l 64k -n 64k /dev/sda
> >> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> >> # mkdir /ceph/osd.000
> >> # mkdir /ceph/osd.001
> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda 
> >> /ceph/osd.000
> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb 
> >> /ceph/osd.001
> >>
> >> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> >> will probably have to change the btrfs devices and the hostname
> >> (os39).
> >>
> >> - Create the ceph filesystems:
> >>
> >> # mkdir /ceph/mon
> >> # mkcephfs -a -c /etc/ceph/ceph.conf
> >>
> >> - Start ceph (e.g. "service ceph start")
> >>
> >> - Now you should be able to use ceph - "ceph -s" will tell you about
> >> the state of the ceph cluster.
> >>
> >> - "rbd create -size 100 testimg" will create an rbd image on the ceph 
> >> cluster.
> >>
> > 
> > It's failing here
> > 
> > http://fpaste.org/e3BG/
> 
> 2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
> osd.1 127.0.0.1:6803/2379 3  osd_op_reply(3 rbd_info [call] = -5
> (Input/output error)) v4  107+0+0 (3948821281 0 0) 0x7fcb380009a0
> con 0x1cad3e0
> 
> This is probably because the osd isn't finding the rbd class.
> Do you have 'rbd_cls.so' in /usr/lib64/rados-classes? Wherever
> rbd_cls.so is,
> try adding 'osd class dir = /path/to/rados-classes' to the [osd]
> section
> in your ceph.conf, and restarting the osds.
> 
> If you set 'debug osd = 10' you should see '_load_class rbd' in the osd
> log
> when you try to create an rbd image.
> 
> Autotools should be setting the default location correctly, but if
> you're
> running the osds in a chroot or something the path would be wrong.
> 
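
[Editor's note: Josh's suggestion above amounts to a ceph.conf fragment like the following; the class-dir path is illustrative and should point at wherever rbd_cls.so actually lives on the system.]

```ini
[osd]
        ; override only if autotools picked the wrong default,
        ; e.g. when running the osds in a chroot
        osd class dir = /usr/lib64/rados-classes
        ; verbose logging: a successful load shows "_load_class rbd"
        debug osd = 10
```

After editing, restart the osds and retry the rbd image creation.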

Yeah all that was in the right place, I rebooted and I magically stopped getting
that error, but now I'm getting this

http://fpaste.org/OE92/

with that ping thing repeating over and over.  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-05-03 Thread Josef Bacik
On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
> > On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik wrote:
> > On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> >> On Thu, 3 May 2012 10:13:55 -0400, Josef Bacik wrote:
> >> > On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> >> >> On 24 April 2012 at 18:26, Sage Weil wrote:
> >> >> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> >> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> >> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> >> >> > Performance with the current "for-linux-min" branch and big metadata
> >> >> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> >> >> > that seems to occur from time to time:
> >> >> >
> >> >> > Actually, before you do that... we have a new tool,
> >> >> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> >> >> > local file system.  It's a subset of what a full OSD might do, but if
> >> >> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >> >> >
> >> >> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >> >> >
> >> >> > will hopefully do the trick.
> >> >> >
> >> >> > Christian, maybe you can see if that is able to trigger this warning?
> >> >> > You'll need to pull it from the current master branch; it wasn't in the
> >> >> > last release.
> >> >>
> >> >> Trying to reproduce with test_filestore_workloadgen didn't work for
> >> >> me. So here are some instructions on how to reproduce with a minimal
> >> >> ceph setup.
> >> >>
> >> >> You will need a single system with two disks and a bit of memory.
> >> >>
> >> >> - Compile and install ceph (detailed instructions:
> >> >> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> >> >>
> >> >> - For the test setup I've used two tmpfs files as journal devices. To
> >> >> create these, do the following:
> >> >>
> >> >> # mkdir -p /ceph/temp
> >> >> # mount -t tmpfs tmpfs /ceph/temp
> >> >> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> >> >> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> >> >>
> >> >> - Now you should create and mount btrfs. Here is what I did:
> >> >>
> >> >> # mkfs.btrfs -l 64k -n 64k /dev/sda
> >> >> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> >> >> # mkdir /ceph/osd.000
> >> >> # mkdir /ceph/osd.001
> >> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> >> >> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> >> >>
> >> >> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> >> >> will probably have to change the btrfs devices and the hostname
> >> >> (os39).
> >> >>
> >> >> - Create the ceph filesystems:
> >> >>
> >> >> # mkdir /ceph/mon
> >> >> # mkcephfs -a -c /etc/ceph/ceph.conf
> >> >>
> >> >> - Start ceph (e.g. "service ceph start")
> >> >>
> >> >> - Now you should be able to use ceph - "ceph -s" will tell you about
> >> >> the state of the ceph cluster.
> >> >>
> >> >> - "rbd create -size 100 testimg" will create an rbd image on the ceph 
> >> >> cluster.
> >> >>
> >> >
> >> > It's failing here
> >> >
> >> > http://fpaste.org/e3BG/
> >>
> >> 2012-05-03 10:11:28.818308 7fcb5a0ee700 -- 127.0.0.1:0/1003269 <==
> >> osd.1 127.0.0.1:6803/2379 3  osd_op_reply(3 rbd_info [call] = -5
> >> (Input/output error)) v4  107+0+0 (3948821281 0 0) 0x7fcb380009a0
> >> con 0x1cad3e0
> >>
>

Re: Ceph on btrfs 3.4rc

2012-05-09 Thread Josef Bacik
On Fri, May 04, 2012 at 10:24:16PM +0200, Christian Brunner wrote:
> 2012/5/3 Josef Bacik :
> > On Thu, May 03, 2012 at 09:38:27AM -0700, Josh Durgin wrote:
> >> On Thu, 3 May 2012 11:20:53 -0400, Josef Bacik wrote:
> >> > On Thu, May 03, 2012 at 08:17:43AM -0700, Josh Durgin wrote:
> >> >
> >> > Yeah all that was in the right place, I rebooted and I magically
> >> > stopped getting
> >> > that error, but now I'm getting this
> >> >
> >> > http://fpaste.org/OE92/
> >> >
> >> > with that ping thing repeating over and over.  Thanks,
> >>
> >> That just looks like the osd isn't running. If you restart the
> >> osd with 'debug osd = 20' the osd log should tell us what's going on.
> >
> > Ok that part was my fault, Duh I need to redo the tmpfs and mkcephfs stuff after
> > reboot.  But now I'm back to my original problem
> >
> > http://fpaste.org/PfwO/
> >
> > I have the osd class dir = /usr/lib64/rados-classes thing set and libcls_rbd is
> > in there, so I'm not sure what is wrong.  Thanks,
> 
> That's really strange. Do you have the osd logs in /var/log/ceph? If
> so, can you check whether you find anything about "rbd" or "class" loading
> in there?
> 
> Another thing you should try is, whether you can access ceph with rados:
> 
> # rados -p rbd ls
> # rados -p rbd -i /proc/cpuinfo put testobj
> # rados -p rbd -o - get testobj
>

Ok weirdly ceph is trying to dlopen /usr/lib64/rados-classes/libcls_rbd.so but
all I had was libcls_rbd.so.1 and libcls_rbd.so.1.0.0.  Symlink fixed that part,
I'll see if I can reproduce now.  Thanks,

Josef


Re: Ceph on btrfs 3.4rc

2012-05-10 Thread Josef Bacik
On Fri, Apr 27, 2012 at 01:02:08PM +0200, Christian Brunner wrote:
> On 24 April 2012 at 18:26, Sage Weil wrote:
> > On Tue, 24 Apr 2012, Josef Bacik wrote:
> >> On Fri, Apr 20, 2012 at 05:09:34PM +0200, Christian Brunner wrote:
> >> > After running ceph on XFS for some time, I decided to try btrfs again.
> >> > Performance with the current "for-linux-min" branch and big metadata
> >> > is much better. The only problem (?) I'm still seeing is a warning
> >> > that seems to occur from time to time:
> >
> > Actually, before you do that... we have a new tool,
> > test_filestore_workloadgen, that generates a ceph-osd-like workload on the
> > local file system.  It's a subset of what a full OSD might do, but if
> > we're lucky it will be sufficient to reproduce this issue.  Something like
> >
> >  test_filestore_workloadgen --osd-data /foo --osd-journal /bar
> >
> > will hopefully do the trick.
> >
> > Christian, maybe you can see if that is able to trigger this warning?
> > You'll need to pull it from the current master branch; it wasn't in the
> > last release.
> 
> Trying to reproduce with test_filestore_workloadgen didn't work for
> me. So here are some instructions on how to reproduce with a minimal
> ceph setup.
> 
> You will need a single system with two disks and a bit of memory.
> 
> - Compile and install ceph (detailed instructions:
> http://ceph.newdream.net/docs/master/ops/install/mkcephfs/)
> 
> - For the test setup I've used two tmpfs files as journal devices. To
> create these, do the following:
> 
> # mkdir -p /ceph/temp
> # mount -t tmpfs tmpfs /ceph/temp
> # dd if=/dev/zero of=/ceph/temp/journal0 count=500 bs=1024k
> # dd if=/dev/zero of=/ceph/temp/journal1 count=500 bs=1024k
> 
> - Now you should create and mount btrfs. Here is what I did:
> 
> # mkfs.btrfs -l 64k -n 64k /dev/sda
> # mkfs.btrfs -l 64k -n 64k /dev/sdb
> # mkdir /ceph/osd.000
> # mkdir /ceph/osd.001
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sda /ceph/osd.000
> # mount -o noatime,space_cache,inode_cache,autodefrag /dev/sdb /ceph/osd.001
> 
> - Create /etc/ceph/ceph.conf similar to the attached ceph.conf. You
> will probably have to change the btrfs devices and the hostname
> (os39).
> 
> - Create the ceph filesystems:
> 
> # mkdir /ceph/mon
> # mkcephfs -a -c /etc/ceph/ceph.conf
> 
> - Start ceph (e.g. "service ceph start")
> 
> - Now you should be able to use ceph - "ceph -s" will tell you about
> the state of the ceph cluster.
> 
> - "rbd create -size 100 testimg" will create an rbd image on the ceph cluster.
> 
> - Compile my test with "gcc -o rbdtest rbdtest.c -lrbd" and run it
> with "./rbdtest testimg".
> 
> I can see the first btrfs_orphan_commit_root warning after an hour or
> so... I hope that I've described all necessary steps. If there is a
> problem just send me a note.
> 

Well it's only taken me 2 weeks but I've finally got it all up and running,
hopefully I'll reproduce.  Thanks,

Josef