ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Sage Weil
[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:
> Thanks for explaining this. I don't have any objections against btrfs
> as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> scare me, since I can use the ceph replication to recover a lost
> btrfs-filesystem. The only problem I have is, that btrfs is not stable
> on our side and I wonder what you are doing to make it work. (Maybe
> it's related to the load pattern of using ceph as a backend store for
> qemu).
> 
> Here is a list of the btrfs problems I'm having:
> 
> - When I run ceph with the default configuration (btrfs snaps enabled)
> I can see a rapid increase in Disk-I/O after a few hours of uptime.
> Btrfs-cleaner is using more and more time in
> btrfs_clean_old_snapshots().

In theory, there shouldn't be any significant difference between taking a 
snapshot and removing it a few commits later, and the prior root refs that 
btrfs holds on to internally until the new commit is complete.  That's 
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our 
environment.

> - When I run ceph with btrfs snaps disabled, the situation is getting
> slightly better. I can run an OSD for about 3 days without problems,
> but then again the load increases. This time, I can see that the
> ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> than usual.

FYI in this scenario you're exposed to the same journal replay issues that 
ext4 and XFS are.  The btrfs workload that ceph is generating will also 
not be all that special, though, so this problem shouldn't be unique to 
ceph.

> Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> from time to time. Maybe it's related to the performance issues, but
> I haven't been able to verify this.

I haven't seen this yet with the latest stuff from Josef, but others have.  
Josef, is there any information we can provide to help track it down?

> It's really sad to see that ceph performance and stability are
> suffering that much from the underlying filesystems and that this
> hasn't changed over the last months.

We don't have anyone internally working on btrfs at the moment, and are 
still struggling to hire experienced kernel/fs people.  Josef has been 
very helpful with tracking these issues down, but he has responsibilities 
beyond just the Ceph related issues.  Progress is slow, but we are 
working on it!

sage


> 
> Kind regards,
> Christian
> 
> 2011/10/24 Sage Weil :
> > Although running on ext4, xfs, or whatever other non-btrfs you want mostly
> > works, there are a few important remaining issues:
> >
> > 1- ext4 limits total xattrs to 4KB.  This can cause problems in some
> > cases, as Ceph uses xattrs extensively.  Most of the time we don't hit
> > this.  We do hit the limit with radosgw pretty easily, though, and may
> > also hit it in exceptional cases where the OSD cluster is very unhealthy.
> >
> > There is a large xattr patch for ext4 from the Lustre folks that has been
> > floating around for (I think) years.  Maybe as interest grows in running
> > Ceph on ext4 this can move upstream.
> >
> > Previously we were being forgiving about large setxattr failures on ext3,
> > but we found that was leading to corruption in certain cases (because we
> > couldn't set our internal metadata), so the next release will assert/crash
> > in that case (fail-stop instead of fail-maybe-eventually-corrupt).
> >
> > XFS does not have an xattr size limit and thus does not have this problem.
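
To make the fail-stop behaviour described above concrete, here is a minimal
sketch (not the actual Ceph code; the path and attribute name are made up for
the example, and the exact errno ext4 returns may vary). The point is only
that a failed setxattr() is treated as fatal rather than ignored:

    /* Create a file and try to attach an xattr larger than ext4's 4KB
     * per-inode limit; any setxattr() failure aborts (fail-stop). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/xattr.h>

    static void set_required_xattr(const char *path, const char *name,
                                   const void *val, size_t len)
    {
            if (setxattr(path, name, val, len, 0) == 0)
                    return;
            /* Continuing without this metadata risks silent corruption,
             * so stop immediately instead of failing much later. */
            fprintf(stderr, "setxattr(%s, %s): %s -- aborting\n",
                    path, name, strerror(errno));
            abort();
    }

    int main(void)
    {
            static char big[6000];              /* > 4KB on purpose */
            int fd = open("/tmp/testobj", O_CREAT | O_WRONLY, 0600);
            if (fd >= 0)
                    close(fd);
            memset(big, 'x', sizeof(big));
            set_required_xattr("/tmp/testobj", "user.test.meta",
                               big, sizeof(big));
            return 0;
    }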
> >
> > 2- The other problem is with OSD journal replay of non-idempotent
> > transactions.  On non-btrfs backends, the Ceph OSDs use a write-ahead
> > journal.  After restart, the OSD does not know exactly which transactions
> > in the journal may have already been committed to disk, and may reapply a
> > transaction again during replay.  For most operations (write, delete,
> > truncate) this is fine.
> >
> > Some operations, though, are non-idempotent.  The simplest example is
> > CLONE, which copies (efficiently, on btrfs) data from one object to
> > another.  If the source object is modified, the osd restarts, and then
> > the clone is replayed, the target will get incorrect (newer) data.  For
> > example,
> >
> > 1- clone A -> B
> > 2- modify A
> >   
> >
> > B will get new instead of old contents.
> >
> > (This doesn't happen on btrfs because the snapshots allow us to replay
> > from a known consistent point in time.)
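
A tiny sketch of why the replay is not idempotent (purely illustrative; the
objects are modelled as in-memory buffers rather than files on disk):

    #include <stdio.h>
    #include <string.h>

    static char obj_a[16] = "old";
    static char obj_b[16] = "";

    static void op_clone_a_to_b(void) { strcpy(obj_b, obj_a); }  /* CLONE */
    static void op_modify_a(void)     { strcpy(obj_a, "new"); }  /* WRITE */

    int main(void)
    {
            /* original execution */
            op_clone_a_to_b();          /* B == "old", correct */
            op_modify_a();              /* A == "new" */

            /* crash + journal replay: the OSD cannot tell the clone was
             * already applied, so it runs it again against the new A */
            op_clone_a_to_b();          /* B becomes "new", wrong */

            printf("A=%s B=%s (B should still be \"old\")\n", obj_a, obj_b);
            return 0;
    }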
> >
> > For things like clone, skipping the operation if the target already exists almost
> > works, except for cases like
> >
> > 1- clone A -> B
> > 2- modify A
> > ...
> > 3- delete B
> >   
> >
> > (Although in that example who cares if B had bad data; it was removed
> > anyway.)  The larger problem, though, is that that doesn't always work;
> > CLONERANGE copies a range of a file from A to B, where B may already
> > exist.
> >
> > In practice, the higher level in

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Josef Bacik
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> [adding linux-btrfs to cc]
> 
> Josef, Chris, any ideas on the below issues?
> 
> On Mon, 24 Oct 2011, Christian Brunner wrote:
> > Thanks for explaining this. I don't have any objections against btrfs
> > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > scare me, since I can use the ceph replication to recover a lost
> > btrfs-filesystem. The only problem I have is, that btrfs is not stable
> > on our side and I wonder what you are doing to make it work. (Maybe
> > it's related to the load pattern of using ceph as a backend store for
> > qemu).
> > 
> > Here is a list of the btrfs problems I'm having:
> > 
> > - When I run ceph with the default configuration (btrfs snaps enabled)
> > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > Btrfs-cleaner is using more and more time in
> > btrfs_clean_old_snapshots().
> 
> In theory, there shouldn't be any significant difference between taking a 
> snapshot and removing it a few commits later, and the prior root refs that 
> btrfs holds on to internally until the new commit is complete.  That's 
> clearly not quite the case, though.
> 
> In any case, we're going to try to reproduce this issue in our 
> environment.
> 

I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't.  I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).
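
A hedged sketch of such a reproducer, mirroring what the OSD's snap mode does
(create a snapshot, let a few commits pass, delete it again). The ioctl
numbers and struct layout are copied here as they appear in the 3.0-era
fs/btrfs/ioctl.h, and the mount point and subvolume name are placeholders, so
double-check all of them against your own kernel and setup:

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>

    #define BTRFS_IOCTL_MAGIC       0x94
    #define BTRFS_PATH_NAME_MAX     4087
    struct btrfs_ioctl_vol_args {
            long long fd;
            char name[BTRFS_PATH_NAME_MAX + 1];
    };
    #define BTRFS_IOC_SNAP_CREATE  _IOW(BTRFS_IOCTL_MAGIC, 1,  struct btrfs_ioctl_vol_args)
    #define BTRFS_IOC_SNAP_DESTROY _IOW(BTRFS_IOCTL_MAGIC, 15, struct btrfs_ioctl_vol_args)

    int main(void)
    {
            /* /mnt/btrfs must be the filesystem root and "current" an
             * existing subvolume that another workload keeps writing to. */
            int dirfd  = open("/mnt/btrfs", O_RDONLY);
            int subvol = open("/mnt/btrfs/current", O_RDONLY);
            struct btrfs_ioctl_vol_args args;
            int i;

            for (i = 0; i < 1000; i++) {
                    memset(&args, 0, sizeof(args));
                    args.fd = subvol;
                    snprintf(args.name, sizeof(args.name), "snap_%d", i);
                    if (ioctl(dirfd, BTRFS_IOC_SNAP_CREATE, &args) < 0)
                            perror("snap create");
                    sleep(5);       /* let a few transaction commits pass */
                    if (ioctl(dirfd, BTRFS_IOC_SNAP_DESTROY, &args) < 0)
                            perror("snap destroy");
            }
            return 0;
    }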

> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
> 
> FYI in this scenario you're exposed to the same journal replay issues that 
> ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> not be all that special, though, so this problem shouldn't be unique to 
> ceph.
> 

Can you get sysrq+w when this happens?  I'd like to see what btrfs-endio-write
is up to.

> > Another thing is that I'm seeing a WARNING: at fs/btrfs/inode.c:2114
> > from time to time. Maybe it's related to the performance issues, but
> > I haven't been able to verify this.
> 
> I haven't seen this yet with the latest stuff from Josef, but others have.  
> Josef, is there any information we can provide to help track it down?
>

Actually this would show up in 2 cases; I fixed the one most people hit with my
earlier stuff and then fixed the other one more recently, so hopefully it will be
fixed in 3.2.  A full backtrace would be nice so I can figure out which one it
is you are hitting.
 
> > It's really sad to see that ceph performance and stability are
> > suffering that much from the underlying filesystems and that this
> > hasn't changed over the last months.
> 
> We don't have anyone internally working on btrfs at the moment, and are 
> still struggling to hire experienced kernel/fs people.  Josef has been 
> very helpful with tracking these issues down, but he has responsibilities 
> beyond just the Ceph related issues.  Progress is slow, but we are 
> working on it!

I'm open to offers ;).  These things are being hit by people all over the place,
but it's hard for me to reproduce, especially since most of the reports are "run
X server for Y days and wait for it to start sucking."  I will try and get a box
setup that I can let stress.sh run on for a few days to see if I can make some
of this stuff come out to play with me, but unfortunately I end up having to
debug these kind of things over email, which means they get a whole lot of
nowhere.  Thanks,

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Chris Mason
On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > [adding linux-btrfs to cc]
> > 
> > Josef, Chris, any ideas on the below issues?
> > 
> > On Mon, 24 Oct 2011, Christian Brunner wrote:
> > > Thanks for explaining this. I don't have any objections against btrfs
> > > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
> > > scare me, since I can use the ceph replication to recover a lost
> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
> > > on our side and I wonder what you are doing to make it work. (Maybe
> > > it's related to the load pattern of using ceph as a backend store for
> > > qemu).
> > > 
> > > Here is a list of the btrfs problems I'm having:
> > > 
> > > - When I run ceph with the default configuration (btrfs snaps enabled)
> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
> > > Btrfs-cleaner is using more and more time in
> > > btrfs_clean_old_snapshots().
> > 
> > In theory, there shouldn't be any significant difference between taking a 
> > snapshot and removing it a few commits later, and the prior root refs that 
> > btrfs holds on to internally until the new commit is complete.  That's 
> > clearly not quite the case, though.
> > 
> > In any case, we're going to try to reproduce this issue in our 
> > environment.
> > 
> 
> I've noticed this problem too, clean_old_snapshots is taking quite a while in
> cases where it really shouldn't.  I will see if I can come up with a 
> reproducer
> that doesn't require setting up ceph ;).

This sounds familiar though, I thought we had fixed a similar
regression.  Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]

-chris


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Christian Brunner
2011/10/24 Chris Mason :
> On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:
>> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> > [adding linux-btrfs to cc]
>> >
>> > Josef, Chris, any ideas on the below issues?
>> >
>> > On Mon, 24 Oct 2011, Christian Brunner wrote:
>> > > Thanks for explaining this. I don't have any objections against btrfs
>> > > as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
>> > > scare me, since I can use the ceph replication to recover a lost
>> > > btrfs-filesystem. The only problem I have is, that btrfs is not stable
>> > > on our side and I wonder what you are doing to make it work. (Maybe
>> > > it's related to the load pattern of using ceph as a backend store for
>> > > qemu).
>> > >
>> > > Here is a list of the btrfs problems I'm having:
>> > >
>> > > - When I run ceph with the default configuration (btrfs snaps enabled)
>> > > I can see a rapid increase in Disk-I/O after a few hours of uptime.
>> > > Btrfs-cleaner is using more and more time in
>> > > btrfs_clean_old_snapshots().
>> >
>> > In theory, there shouldn't be any significant difference between taking a
>> > snapshot and removing it a few commits later, and the prior root refs that
>> > btrfs holds on to internally until the new commit is complete.  That's
>> > clearly not quite the case, though.
>> >
>> > In any case, we're going to try to reproduce this issue in our
>> > environment.
>> >
>>
>> I've noticed this problem too, clean_old_snapshots is taking quite a while in
>> cases where it really shouldn't.  I will see if I can come up with a 
>> reproducer
>> that doesn't require setting up ceph ;).
>
> This sounds familiar though, I thought we had fixed a similar
> regression.  Either way, Arne's readahead code should really help.
>
> Which kernel version were you running?
>
> [ ack on the rest of Josef's comments ]

This was with a 3.0 kernel, including all btrfs-patches from Josef's
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-24 Thread Arne Jansen

On 24.10.2011 23:34, Christian Brunner wrote:

2011/10/24 Chris Mason:

On Mon, Oct 24, 2011 at 03:51:47PM -0400, Josef Bacik wrote:

On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:

[adding linux-btrfs to cc]

Josef, Chris, any ideas on the below issues?

On Mon, 24 Oct 2011, Christian Brunner wrote:

Thanks for explaining this. I don't have any objections against btrfs
as an osd filesystem. Even the fact that there is no btrfs-fsck doesn't
scare me, since I can use the ceph replication to recover a lost
btrfs-filesystem. The only problem I have is, that btrfs is not stable
on our side and I wonder what you are doing to make it work. (Maybe
it's related to the load pattern of using ceph as a backend store for
qemu).

Here is a list of the btrfs problems I'm having:

- When I run ceph with the default configuration (btrfs snaps enabled)
I can see a rapid increase in Disk-I/O after a few hours of uptime.
Btrfs-cleaner is using more and more time in
btrfs_clean_old_snapshots().


In theory, there shouldn't be any significant difference between taking a
snapshot and removing it a few commits later, and the prior root refs that
btrfs holds on to internally until the new commit is complete.  That's
clearly not quite the case, though.

In any case, we're going to try to reproduce this issue in our
environment.



I've noticed this problem too, clean_old_snapshots is taking quite a while in
cases where it really shouldn't.  I will see if I can come up with a reproducer
that doesn't require setting up ceph ;).


This sounds familiar though, I thought we had fixed a similar
regression.  Either way, Arne's readahead code should really help.

Which kernel version were you running?

[ ack on the rest of Josef's comments ]


This was with a 3.0 kernel, including all btrfs-patches from Josef's
git repo plus the "use the global reserve when truncating the free
space cache inode" patch.

I'll try the readahead code.


The current readahead code is only used for scrub. I plan to extend it
to snapshot deletion in a next step, but currently I'm afraid it can't
help.

-Arne



Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christoph Hellwig
On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > - When I run ceph with btrfs snaps disabled, the situation is getting
> > slightly better. I can run an OSD for about 3 days without problems,
> > but then again the load increases. This time, I can see that the
> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > than usual.
> 
> FYI in this scenario you're exposed to the same journal replay issues that 
> ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> not be all that special, though, so this problem shouldn't be unique to 
> ceph.

What journal replay issues would ext4 and XFS be exposed to?



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik :
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what 
> > btrfs-endio-write
> > is up to.
> 
> Capturing this seems to be not easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  COMMAND
>  2053 root  20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root  20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency, than the other ceph-osd and btrfs-endio-writers.
> 

I'm seeing a lot of this

[schedule]  1654.6 msec 96.4 %
schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
generic_write_sync blkdev_aio_write do_sync_readv_writev
do_readv_writev vfs_writev sys_writev system_call_fastpath

where ceph-osd's latency is mostly coming from this fsync of a block device
directly, and not so much being tied up by btrfs directly.  With 22% CPU being
taken up by btrfs-endio-wri we must be doing something wrong.  Can you run perf
record -ag when this is going on and then perf report so we can see what
btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to get
only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
horribly wrong or introducing a lot of latency.  Most of it seems to be when
running the delayed refs and having to read in blocks.  I've been suspecting for
a while that the delayed ref stuff ends up doing way more work than it needs to
be per task, and it's possible that btrfs-endio-wri is simply getting screwed by
other people doing work.

At this point it seems like the biggest problem with latency in ceph-osd is not
related to btrfs, the latency seems to all be from the fact that ceph-osd is
fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems like
it's blowing a lot of CPU time, so perf record -ag is probably going to be your
best bet when it's using lots of cpu so we can figure out what it's spinning on.
Thanks,

Josef



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> 2011/10/24 Josef Bacik :
>> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
>> >> [adding linux-btrfs to cc]
>> >>
>> >> Josef, Chris, any ideas on the below issues?
>> >>
>> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
>> >> >
>> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
>> >> > slightly better. I can run an OSD for about 3 days without problems,
>> >> > but then again the load increases. This time, I can see that the
>> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
>> >> > than usual.
>> >>
>> >> FYI in this scenario you're exposed to the same journal replay issues that
>> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
>> >> not be all that special, though, so this problem shouldn't be unique to
>> >> ceph.
>> >>
>> >
>> > Can you get sysrq+w when this happens?  I'd like to see what 
>> > btrfs-endio-write
>> > is up to.
>>
>> Capturing this seems to be not easy. I have a few traces (see
>> attachment), but with sysrq+w I do not get a stacktrace of
>> btrfs-endio-write. What I have is a "latencytop -c" output which is
>> interesting:
>>
>> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> tries to balance the load over all OSDs, so all filesystems should get
>> a nearly equal load. At the moment one filesystem seems to have a
>> problem. When running with iostat I see the following
>>
>> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
>>
>> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> with top I see this process and a btrfs-endio-writer (PID 5447):
>>
>>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>>
>> In the latencytop output you can see that those processes have a much
>> higher latency, than the other ceph-osd and btrfs-endio-writers.
>>
>
> I'm seeing a lot of this
>
>        [schedule]      1654.6 msec         96.4 %
>                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>                generic_write_sync blkdev_aio_write do_sync_readv_writev
>                do_readv_writev vfs_writev sys_writev system_call_fastpath
>
> where ceph-osd's latency is mostly coming from this fsync of a block device
> directly, and not so much being tied up by btrfs directly.  With 22% CPU being
> taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
> perf
> record -ag when this is going on and then perf report so we can see what
> btrfs-endio-wri is doing with the cpu.  You can drill down in perf report to 
> get
> only what btrfs-endio-wri is doing, so that would be best.  As far as the rest
> of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing anything
> horribly wrong or introducing a lot of latency.  Most of it seems to be when
> running the delayed refs and having to read in blocks.  I've been suspecting 
> for
> a while that the delayed ref stuff ends up doing way more work than it needs 
> to
> be per task, and it's possible that btrfs-endio-wri is simply getting screwed 
> by
> other people doing work.
>
> At this point it seems like the biggest problem with latency in ceph-osd is 
> not
> related to btrfs, the latency seems to all be from the fact that ceph-osd is
> fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
> like
> its blowing a lot of CPU time, so perf record -ag is probably going to be your
> best bet when it's using lots of cpu so we can figure out what it's spinning 
> on.

Attached is a perf-report. I have included the whole report, so that
you can see the difference between the good and the bad
btrfs-endio-wri.

Thanks,
Christian


perf.report.bz2
Description: BZip2 compressed data


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik :
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik :
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues 
> >> >> that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what 
> >> > btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this seems to be not easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> a nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running with iostat I see the following
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> >> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency, than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much being tied up by btrfs directly.  With 22% CPU 
> > being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
> > perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
> > to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the 
> > rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
> > anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the delayed refs and having to read in blocks.  I've been 
> > suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it 
> > needs to
> > be per task, and it's possible that btrfs-endio-wri is simply getting 
> > screwed by
> > other people doing work.
> >
> > At this point it seems like the biggest problem with latency in ceph-osd is 
> > not
> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
> > like
> > its blowing a lot of CPU time, so perf record -ag is probably going to be 
> > your
> > best bet when it's using lots of cpu so we can figure out what it's 
> > spinning on.
> 
> Attached is a perf-report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

Oh shit we're inserting xattrs in endio, that's not good.  I'll look more into
this when I get back home but this is definitely a problem, we're doing a lot
more work in endio than we should.  Thanks,

Josef 

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> 2011/10/25 Josef Bacik :
> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> >> 2011/10/24 Josef Bacik :
> >> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> >> [adding linux-btrfs to cc]
> >> >>
> >> >> Josef, Chris, any ideas on the below issues?
> >> >>
> >> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >> >
> >> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> >> > but then again the load increases. This time, I can see that the
> >> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> >> > than usual.
> >> >>
> >> >> FYI in this scenario you're exposed to the same journal replay issues 
> >> >> that
> >> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> >> not be all that special, though, so this problem shouldn't be unique to
> >> >> ceph.
> >> >>
> >> >
> >> > Can you get sysrq+w when this happens?  I'd like to see what 
> >> > btrfs-endio-write
> >> > is up to.
> >>
> >> Capturing this seems to be not easy. I have a few traces (see
> >> attachment), but with sysrq+w I do not get a stacktrace of
> >> btrfs-endio-write. What I have is a "latencytop -c" output which is
> >> interesting:
> >>
> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> >> tries to balance the load over all OSDs, so all filesystems should get
> >> a nearly equal load. At the moment one filesystem seems to have a
> >> problem. When running with iostat I see the following
> >>
> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> >> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> >>
> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> >> with top I see this process and a btrfs-endio-writer (PID 5447):
> >>
> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
> >>
> >> In the latencytop output you can see that those processes have a much
> >> higher latency, than the other ceph-osd and btrfs-endio-writers.
> >>
> >
> > I'm seeing a lot of this
> >
> >        [schedule]      1654.6 msec         96.4 %
> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
> >
> > where ceph-osd's latency is mostly coming from this fsync of a block device
> > directly, and not so much being tied up by btrfs directly.  With 22% CPU 
> > being
> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
> > perf
> > record -ag when this is going on and then perf report so we can see what
> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
> > to get
> > only what btrfs-endio-wri is doing, so that would be best.  As far as the 
> > rest
> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
> > anything
> > horribly wrong or introducing a lot of latency.  Most of it seems to be when
> > running the delayed refs and having to read in blocks.  I've been 
> > suspecting for
> > a while that the delayed ref stuff ends up doing way more work than it 
> > needs to
> > be per task, and it's possible that btrfs-endio-wri is simply getting 
> > screwed by
> > other people doing work.
> >
> > At this point it seems like the biggest problem with latency in ceph-osd is 
> > not
> > related to btrfs, the latency seems to all be from the fact that ceph-osd is
> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
> > like
> > its blowing a lot of CPU time, so perf record -ag is probably going to be 
> > your
> > best bet when it's using lots of cpu so we can figure out what it's 
> > spinning on.
> 
> Attached is a perf-report. I have included the whole report, so that
> you can see the difference between the good and the bad
> btrfs-endio-wri.
>

We also shouldn't be running run_ordered_operations, man this is screwed up,
thanks so much for this, I should be able to nail this down pretty easily.
Thanks,

Josef 

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> 2011/10/25 Josef Bacik :
>> > On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
[...]
>> >>
>> >> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
>> >> tries to balance the load over all OSDs, so all filesystems should get
>> >> a nearly equal load. At the moment one filesystem seems to have a
>> >> problem. When running with iostat I see the following
>> >>
>> >> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
>> >> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
>> >> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
>> >> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
>> >> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
>> >>
>> >> The PID of the ceph-osd that is running on sdc is 2053 and when I look
>> >> with top I see this process and a btrfs-endio-writer (PID 5447):
>> >>
>> >>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>> >>  2053 root      20   0  537m 146m 2364 S 33.2 0.6 43:31.24 ceph-osd
>> >>  5447 root      20   0     0    0    0 S 22.6 0.0 19:32.18 btrfs-endio-wri
>> >>
>> >> In the latencytop output you can see that those processes have a much
>> >> higher latency, than the other ceph-osd and btrfs-endio-writers.
>> >>
>> >
>> > I'm seeing a lot of this
>> >
>> >        [schedule]      1654.6 msec         96.4 %
>> >                schedule blkdev_issue_flush blkdev_fsync vfs_fsync_range
>> >                generic_write_sync blkdev_aio_write do_sync_readv_writev
>> >                do_readv_writev vfs_writev sys_writev system_call_fastpath
>> >
>> > where ceph-osd's latency is mostly coming from this fsync of a block device
>> > directly, and not so much being tied up by btrfs directly.  With 22% CPU 
>> > being
>> > taken up by btrfs-endio-wri we must be doing something wrong.  Can you run 
>> > perf
>> > record -ag when this is going on and then perf report so we can see what
>> > btrfs-endio-wri is doing with the cpu.  You can drill down in perf report 
>> > to get
>> > only what btrfs-endio-wri is doing, so that would be best.  As far as the 
>> > rest
>> > of the latencytop goes, it doesn't seem like btrfs-endio-wri is doing 
>> > anything
>> > horribly wrong or introducing a lot of latency.  Most of it seems to be 
>> > when
>> > running the delayed refs and having to read in blocks.  I've been 
>> > suspecting for
>> > a while that the delayed ref stuff ends up doing way more work than it 
>> > needs to
>> > be per task, and it's possible that btrfs-endio-wri is simply getting 
>> > screwed by
>> > other people doing work.
>> >
>> > At this point it seems like the biggest problem with latency in ceph-osd 
>> > is not
>> > related to btrfs, the latency seems to all be from the fact that ceph-osd 
>> > is
>> > fsyncing a block dev for whatever reason.  As for btrfs-endio-wri it seems 
>> > like
>> > its blowing a lot of CPU time, so perf record -ag is probably going to be 
>> > your
>> > best bet when it's using lots of cpu so we can figure out what it's 
>> > spinning on.
>>
>> Attached is a perf-report. I have included the whole report, so that
>> you can see the difference between the good and the bad
>> btrfs-endio-wri.
>>
>
> We also shouldn't be running run_ordered_operations, man this is screwed up,
> thanks so much for this, I should be able to nail this down pretty easily.

Please note that this is with "btrfs snaps disabled" in the ceph conf.
When I enable snaps our problems get worse (the btrfs-cleaner thing),
but I would be glad if this one thing gets solved. I can run debugging
with snaps enabled, if you want, but I would suggest, that we do this
afterwards.

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Christoph Hellwig wrote:
> On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> > > - When I run ceph with btrfs snaps disabled, the situation is getting
> > > slightly better. I can run an OSD for about 3 days without problems,
> > > but then again the load increases. This time, I can see that the
> > > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> > > than usual.
> > 
> > FYI in this scenario you're exposed to the same journal replay issues that 
> > ext4 and XFS are.  The btrfs workload that ceph is generating will also 
> > not be all that special, though, so this problem shouldn't be unique to 
> > ceph.
> 
> What journal replay issues would ext4 and XFS be exposed to?

It's the ceph-osd journal replay, not the ext4/XFS journal... the #2 
item in

http://marc.info/?l=ceph-devel&m=131942130322957&w=2

sage


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Josef Bacik wrote:
> At this point it seems like the biggest problem with latency in ceph-osd 
> is not related to btrfs, the latency seems to all be from the fact that 
> ceph-osd is fsyncing a block dev for whatever reason. 

There is one place where we sync_file_range() on the journal block device, 
but that should only happen if directio is disabled (it's on by default).  
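
A minimal sketch of the two paths being described (illustrative only, not the
Ceph journal code): with dio the device is opened O_DIRECT and nothing extra
is needed, while the buffered fallback pushes each write out with
sync_file_range(), which is the call showing up in the traces:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    static ssize_t journal_append(int fd, const void *buf, size_t len,
                                  off_t off, int use_dio)
    {
            ssize_t r = pwrite(fd, buf, len, off);

            /* buffered (non-dio) path only: push the dirty range toward
             * the device instead of waiting for the flusher threads */
            if (r > 0 && !use_dio)
                    sync_file_range(fd, off, r,
                                    SYNC_FILE_RANGE_WAIT_BEFORE |
                                    SYNC_FILE_RANGE_WRITE |
                                    SYNC_FILE_RANGE_WAIT_AFTER);
            return r;
    }

    int main(void)
    {
            char buf[4096] = "journal entry";
            /* stand-in for the journal block device; O_DIRECT would be
             * added here when "journal dio" is enabled */
            int fd = open("/tmp/journal.test", O_CREAT | O_WRONLY, 0600);

            journal_append(fd, buf, sizeof(buf), 0, 0 /* buffered */);
            close(fd);
            return 0;
    }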

Christian, have you tweaked those settings in your ceph.conf?  It would be 
something like 'journal dio = false'.  If not, can you verify that 
directio shows true when the journal is initialized from your osd log?  
E.g.,

 2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 14: 
104857600 bytes, block size 4096 bytes, directio = 1

If directio = 1 for you, something else funky is causing those 
blkdev_fsync's...

sage


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Sage Weil :
> On Tue, 25 Oct 2011, Josef Bacik wrote:
>> At this point it seems like the biggest problem with latency in ceph-osd
>> is not related to btrfs, the latency seems to all be from the fact that
>> ceph-osd is fsyncing a block dev for whatever reason.
>
> There is one place where we sync_file_range() on the journal block device,
> but that should only happen if directio is disabled (it's on by default).
>
> Christian, have you tweaked those settings in your ceph.conf?  It would be
> something like 'journal dio = false'.  If not, can you verify that
> directio shows true when the journal is initialized from your osd log?
> E.g.,
>
>  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 
> 14: 104857600 bytes, block size 4096 bytes, directio = 1
>
> If directio = 1 for you, something else funky is causing those
> blkdev_fsync's...

I've looked it up in the logs - directio is 1:

Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
/dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
bytes, directio = 1

Regards,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Chris Mason
On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > 
> > Attached is a perf-report. I have included the whole report, so that
> > you can see the difference between the good and the bad
> > btrfs-endio-wri.
> >
> 
> We also shouldn't be running run_ordered_operations, man this is screwed up,
> thanks so much for this, I should be able to nail this down pretty easily.
> Thanks,

Looks like we're getting there from reserve_metadata_bytes when we join
the transaction?

-chris


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Josef Bacik
On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > 
> > > Attached is a perf-report. I have included the whole report, so that
> > > you can see the difference between the good and the bad
> > > btrfs-endio-wri.
> > >
> > 
> > We also shouldn't be running run_ordered_operations, man this is screwed up,
> > thanks so much for this, I should be able to nail this down pretty easily.
> > Thanks,
> 
> Looks like we're getting there from reserve_metadata_bytes when we join
> the transaction?
>

We don't do reservations in the endio stuff, we assume you've reserved all the
space you need in delalloc, plus we would have seen reserve_metadata_bytes in
the trace.  Though it does look like perf is lying to us in at least one case
since btrfs_alloc_logged_file_extent is only called from log replay and not
during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Josef 


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Sage Weil
On Tue, 25 Oct 2011, Christian Brunner wrote:
> 2011/10/25 Sage Weil :
> > On Tue, 25 Oct 2011, Josef Bacik wrote:
> >> At this point it seems like the biggest problem with latency in ceph-osd
> >> is not related to btrfs, the latency seems to all be from the fact that
> >> ceph-osd is fsyncing a block dev for whatever reason.
> >
> > There is one place where we sync_file_range() on the journal block device,
> > but that should only happen if directio is disabled (it's on by default).
> >
> > Christian, have you tweaked those settings in your ceph.conf?  It would be
> > something like 'journal dio = false'.  If not, can you verify that
> > directio shows true when the journal is initialized from your osd log?
> > E.g.,
> >
> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal fd 
> > 14: 104857600 bytes, block size 4096 bytes, directio = 1
> >
> > If directio = 1 for you, something else funky is causing those
> > blkdev_fsync's...
> 
> I've looked it up in the logs - directio is 1:
> 
> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> bytes, directio = 1

Do you mind capturing an strace?  I'd like to see where that blkdev_fsync 
is coming from.

thanks!
sage

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-25 Thread Christian Brunner
2011/10/25 Josef Bacik :
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>> > >
>> > > Attached is a perf-report. I have included the whole report, so that
>> > > you can see the difference between the good and the bad
>> > > btrfs-endio-wri.
>> > >
>> >
>> > We also shouldn't be running run_ordered_operations, man this is screwed 
>> > up,
>> > thanks so much for this, I should be able to nail this down pretty easily.
>> > Thanks,
>>
>> Looks like we're getting there from reserve_metadata_bytes when we join
>> the transaction?
>>
>
> We don't do reservations in the endio stuff, we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace.  Though it does look like perf is lying to us in at least one case
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Strange! - I'll check if symbols got messed up in the report tomorrow.

Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Sage Weil :
> On Wed, 26 Oct 2011, Christian Brunner wrote:
>> >> > Christian, have you tweaked those settings in your ceph.conf?  It would 
>> >> > be
>> >> > something like 'journal dio = false'.  If not, can you verify that
>> >> > directio shows true when the journal is initialized from your osd log?
>> >> > E.g.,
>> >> >
>> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open dev/osd0.journal 
>> >> > fd 14: 104857600 bytes, block size 4096 bytes, directio = 1
>> >> >
>> >> > If directio = 1 for you, something else funky is causing those
>> >> > blkdev_fsync's...
>> >>
>> >> I've looked it up in the logs - directio is 1:
>> >>
>> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
>> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
>> >> bytes, directio = 1
>> >
>> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
>> > is coming from.
>>
>> Here is an strace. I can see a lot of sync_file_range operations.
>
> Yeah, these all look like the flusher thread, and shouldn't be hitting
> blkdev_fsync.  Can you confirm that with
>
>        filestore flusher = false
>        filestore sync flush = false
>
> you get no sync_file_range at all?  I wonder if this is also perf lying
> about the call chain.

Yes, setting this makes the sync_file_range calls go away.

Is it safe to use these settings with "filestore btrfs snap = 0"?

Thanks,
Christian


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Christian Brunner
2011/10/26 Christian Brunner :
> 2011/10/25 Josef Bacik :
>> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
>>> On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
>>> > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
>>> > >
>>> > > Attached is a perf-report. I have included the whole report, so that
>>> > > you can see the difference between the good and the bad
>>> > > btrfs-endio-wri.
>>> > >
>>> >
>>> > We also shouldn't be running run_ordered_operations, man this is screwed 
>>> > up,
>>> > thanks so much for this, I should be able to nail this down pretty easily.
>>> > Thanks,
>>>
>>> Looks like we're getting there from reserve_metadata_bytes when we join
>>> the transaction?
>>>
>>
>> We don't do reservations in the endio stuff, we assume you've reserved all 
>> the
>> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
>> the trace.  Though it does look like perf is lying to us in at least one case
>> since btrfs_alloc_logged_file_extent is only called from log replay and not
>> during normal runtime, so it definitely shouldn't be showing up.  Thanks,
>
> Strange! - I'll check if symbols got messed up in the report tomorrow.

I've checked this now: except for the missing symbols for the iomemory_vsl
module, everything is looking normal.

I've also run the report on another OSD again, but the results look
quite similar.

Regards,
Christian

PS: This is what perf report -v is saying...

build id event received for [kernel.kallsyms]:
805ca93f4057cc0c8f53b061a849b3f847f2de40
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/fs/btrfs/btrfs.ko:
64a723e05af3908fb9593f4a3401d6563cb1a01b
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/lib/libcrc32c.ko:
b1391be8d33b54b6de20e07b7f2ee8d777fc09d2
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/bonding/bonding.ko:
663392df0f407211ab8f9527c482d54fce890c5e
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/scsi/hpsa.ko:
676eecffd476aef1b0f2f8c1bf8c8e6120d369c9
build id event received for
/lib/modules/3.0.6-1.fits.8.el6.x86_64/kernel/drivers/net/ixgbe/ixgbe.ko:
db7c200894b27e71ae6fe5cf7adaebf787c90da9
build id event received for [iomemory_vsl]:
4ed417c9a815e6bbe77a1656bceda95d9f06cb13
build id event received for /lib64/libc-2.12.so:
2ab28d41242ede641418966ef08f9aacffd9e8c7
build id event received for /lib64/libpthread-2.12.so:
c177389a6f119b3883ea0b3c33cb04df3f8e5cc7
build id event received for /sbin/rsyslogd:
1372ef1e2ec550967fe20d0bdddbc0aab0bb36dc
build id event received for /lib64/libglib-2.0.so.0.2200.5:
d880be15bf992b5fbcc629e6bbf1c747a928ddd5
build id event received for /usr/sbin/irqbalance:
842de64f46ca9fde55efa29a793c08b197d58354
build id event received for /lib64/libm-2.12.so:
46ac89195918407d2937bd1450c0ec99c8d41a2a
build id event received for /usr/bin/ceph-osd:
9fcb36e020c49fc49171b4c88bd784b38eb0675b
build id event received for /usr/lib64/libstdc++.so.6.0.13:
d1b2ca4e1ec8f81ba820e5f1375d960107ac7e50
build id event received for /usr/lib64/libtcmalloc.so.0.2.0:
02766551b2eb5a453f003daee0c5fc9cd176e831
Looking at the vmlinux_path (6 entries long)
dso__load_sym: cannot get elf header.
Using /proc/kallsyms for symbols
Looking at the vmlinux_path (6 entries long)
No kallsyms or vmlinux with build-id
4ed417c9a815e6bbe77a1656bceda95d9f06cb13 was found
[iomemory_vsl] with build id 4ed417c9a815e6bbe77a1656bceda95d9f06cb13
not found, continuing without symbols


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Chris Mason
On Tue, Oct 25, 2011 at 04:22:48PM -0400, Josef Bacik wrote:
> On Tue, Oct 25, 2011 at 04:15:45PM -0400, Chris Mason wrote:
> > On Tue, Oct 25, 2011 at 11:05:12AM -0400, Josef Bacik wrote:
> > > On Tue, Oct 25, 2011 at 04:25:02PM +0200, Christian Brunner wrote:
> > > > 
> > > > Attached is a perf-report. I have included the whole report, so that
> > > > you can see the difference between the good and the bad
> > > > btrfs-endio-wri.
> > > >
> > > 
> > > We also shouldn't be running run_ordered_operations, man this is screwed 
> > > up,
> > > thanks so much for this, I should be able to nail this down pretty easily.
> > > Thanks,
> > 
> > Looks like we're getting there from reserve_metadata_bytes when we join
> > the transaction?
> >
> 
> We don't do reservations in the endio stuff, we assume you've reserved all the
> space you need in delalloc, plus we would have seen reserve_metadata_bytes in
> the trace.  Though it does look like perf is lying to us in at least one case
> since btrfs_alloc_logged_file_extent is only called from log replay and not
> during normal runtime, so it definitely shouldn't be showing up.  Thanks,

Whoops, I should have read that num_items > 0 check harder.

btrfs_end_transaction is doing it by setting ->blocked = 1

        if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
            should_end_transaction(trans, root)) {
                trans->transaction->blocked = 1;
                                    ^
                smp_wmb();
        }

        if (lock && cur_trans->blocked && !cur_trans->in_commit) {
                    ^^^
                if (throttle) {
                        /*
                         * We may race with somebody else here so end up having
                         * to call end_transaction on ourselves again, so inc
                         * our use_count.
                         */
                        trans->use_count++;
                        return btrfs_commit_transaction(trans, root);
                } else {
                        wake_up_process(info->transaction_kthread);
                }
        }

perf is definitely lying a little bit about the trace ;)

-chris



Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-26 Thread Sage Weil
On Wed, 26 Oct 2011, Christian Brunner wrote:
> 2011/10/26 Sage Weil :
> > On Wed, 26 Oct 2011, Christian Brunner wrote:
> >> >> > Christian, have you tweaked those settings in your ceph.conf?  It 
> >> >> > would be
> >> >> > something like 'journal dio = false'.  If not, can you verify that
> >> >> > directio shows true when the journal is initialized from your osd log?
> >> >> > E.g.,
> >> >> >
> >> >> >  2011-10-21 15:21:02.026789 7ff7e5c54720 journal _open 
> >> >> > dev/osd0.journal fd 14: 104857600 bytes, block size 4096 bytes, 
> >> >> > directio = 1
> >> >> >
> >> >> > If directio = 1 for you, something else funky is causing those
> >> >> > blkdev_fsync's...
> >> >>
> >> >> I've looked it up in the logs - directio is 1:
> >> >>
> >> >> Oct 25 17:20:16 os00 osd.000[1696]: 7f0016841740 journal _open
> >> >> /dev/vg01/lv_osd_journal_0 fd 15: 17179869184 bytes, block size 4096
> >> >> bytes, directio = 1
> >> >
> >> > Do you mind capturing an strace?  I'd like to see where that blkdev_fsync
> >> > is coming from.
> >>
> >> Here is an strace. I can see a lot of sync_file_range operations.
> >
> > Yeah, these all look like the flusher thread, and shouldn't be hitting
> > blkdev_fsync.  Can you confirm that with
> >
> >        filestore flusher = false
> >        filestore sync flush = false
> >
> > you get no sync_file_range at all?  I wonder if this is also perf lying
> > about the call chain.
> 
> Yes, setting this makes the sync_file_range calls go away.

Okay.  That means either sync_file_range on a regular btrfs file is 
triggering blkdev_fsync somewhere in btrfs, there is an extremely sneaky 
bug that is mixing up file descriptors, or latencytop is lying.  I'm 
guessing the latter, given the other weirdness Josef and Chris were 
seeing.  :)
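
For reference, the kind of capture that answers this question is an strace
narrowed to the sync calls, so you can see which file descriptors they land
on.  A minimal sketch (the OSD PID and output file name are placeholders,
not taken from the thread):

  # follow threads, timestamp each syscall and record how long it takes,
  # tracing only the sync-related calls
  strace -f -tt -T -e trace=sync_file_range,fsync,fdatasync \
         -p <ceph-osd pid> -o osd-sync.strace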

> Is it safe to use these settings with "filestore btrfs snap = 0"?

Yeah.  They're purely a performance thing to push as much dirty data to 
disk as quickly as possible to minimize the snapshot create latency.  
You'll notice the write throughput tends to tank when you turn them off.

sage
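
For anyone following along, the options discussed above would typically sit
in the [osd] section of ceph.conf.  A minimal sketch of the configuration
being tested in this thread (option names as used above; not a recommended
production setup):

  [osd]
          filestore btrfs snap = 0
          filestore flusher = false
          filestore sync flush = false

'journal dio = false' is the related knob mentioned earlier for disabling
direct I/O on the journal; in this thread directio stayed enabled.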

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Martin Mailand

Hi,
resending without the perf attachment, which can be found here:
http://tuxadero.com/multistorage/perf.report.txt.bz2

Best Regards,
 martin

 Original Message 
Subject: Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]
Date: Wed, 26 Oct 2011 22:38:47 +0200
From: Martin Mailand 
Reply-To: mar...@tuxadero.com
To: Sage Weil 
CC: Christian Brunner , ceph-devel@vger.kernel.org, 
 linux-bt...@vger.kernel.org


Hi,
I have more or less the same setup as Christian and I suffer the same
problems.
But as far as I can see, the output of latencytop and perf differs from
Christian's; both are attached.
I was wondering about the high latency from btrfs-submit.

Process btrfs-submit-0 (970) Total: 2123.5 msec

I have the high IO rate and high IO wait as well.

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.60    0.00    2.20   82.40    0.00   14.80

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00    8.40     0.00    74.40    17.71     0.03    3.81    0.00    3.81   3.81   3.20
sdb               0.00     7.00    0.00  269.80     0.00  1224.80     9.08   107.19  398.69    0.00  398.69   3.15  85.00

top - 21:57:41 up  8:41,  1 user,  load average: 0.65, 0.79, 0.76
Tasks: 179 total,   1 running, 178 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.6%us,  2.4%sy,  0.0%ni, 70.8%id, 25.8%wa,  0.0%hi,  0.3%si,  0.0%st
Mem:   4018276k total,  1577728k used,  2440548k free,    10496k buffers
Swap:  1998844k total,        0k used,  1998844k free,  1316696k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1399 root      20   0  548m 103m 3428 S  0.0  2.6   2:01.85 ceph-osd
 1401 root      20   0  548m 103m 3428 S  0.0  2.6   1:51.71 ceph-osd
 1400 root      20   0  548m 103m 3428 S  0.0  2.6   1:50.30 ceph-osd
 1391 root      20   0     0    0    0 S  0.0  0.0   1:18.39 btrfs-endio-wri
  976 root      20   0     0    0    0 S  0.0  0.0   1:18.11 btrfs-endio-wri
 1367 root      20   0     0    0    0 S  0.0  0.0   1:05.60 btrfs-worker-1
  968 root      20   0     0    0    0 S  0.0  0.0   1:05.45 btrfs-worker-0
 1163 root      20   0  141m 1636 1100 S  0.0  0.0   1:00.56 collectd
  970 root      20   0     0    0    0 S  0.0  0.0   0:47.73 btrfs-submit-0
 1402 root      20   0  548m 103m 3428 S  0.0  2.6   0:34.86 ceph-osd
 1392 root      20   0     0    0    0 S  0.0  0.0   0:33.70 btrfs-endio-met
  975 root      20   0     0    0    0 S  0.0  0.0   0:32.70 btrfs-endio-met
 1415 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.29 ceph-osd
 1414 root      20   0  548m 103m 3428 S  0.0  2.6   0:28.24 ceph-osd
 1397 root      20   0  548m 103m 3428 S  0.0  2.6   0:24.60 ceph-osd
 1436 root      20   0  548m 103m 3428 S  0.0  2.6   0:13.31 ceph-osd


Here is my setup.
Kernel v3.1 + Josef

The config for this osd (ceph version 0.37
(commit:a6f3bbb744a6faea95ae48317f0b838edb16a896)) is:
[osd.1]
        host = s-brick-003
        osd journal = /dev/sda7
        btrfs devs = /dev/sdb
        btrfs options = noatime
        filestore_btrfs_snap = false

I hope this helps to pinpoint the problem.

Best Regards,
martin


Sage Weil wrote:

[quoted message trimmed]

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Stefan Majer
Hi Martin,

a quick dig into your perf report shows a large amount of swapper work.
If this is the case, I would suspect that is where the latency comes from.
Do you perhaps not have enough physical RAM in your machine?

Greetings

Stefan Majer

On Thu, Oct 27, 2011 at 12:53 PM, Martin Mailand  wrote:
> [quoted message trimmed]

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Martin Mailand

Hi Stefan,
I think the machine has enough ram.

root@s-brick-003:~# free -m
             total       used       free     shared    buffers     cached
Mem:          3924       2401       1522          0         42       2115
-/+ buffers/cache:        243       3680
Swap:         1951          0       1951

There is no swap usage at all.

-martin


On 27.10.2011 12:59, Stefan Majer wrote:

Hi Martin,

a quick dig into your perf report shows a large amount of swapper work.
If this is the case, I would suspect that is where the latency comes from.
Do you perhaps not have enough physical RAM in your machine?

Greetings

Stefan Majer

[quoted messages trimmed]

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Wed, Oct 26, 2011 at 09:23:54AM -0400, Chris Mason wrote:
> > [earlier quoted text trimmed]
> 
> Whoops, I should have read that num_items > 0 check harder.
> 
> btrfs_end_transaction is doing it by setting ->blocked = 1
> 
>         if (lock && !atomic_read(&root->fs_info->open_ioctl_trans) &&
>             should_end_transaction(trans, root)) {
>                 trans->transaction->blocked = 1;
>                                     ^
>                 smp_wmb();
>         }
>
>         if (lock && cur_trans->blocked && !cur_trans->in_commit) {
>                     ^^^
>                 if (throttle) {
>                         /*
>                          * We may race with somebody else here so end up having
>                          * to call end_transaction on ourselves again, so inc
>                          * our use_count.
>                          */
>                         trans->use_count++;
>                         return btrfs_commit_transaction(trans, root);
>                 } else {
>                         wake_up_process(info->transaction_kthread);
>                 }
>         }
> 

Not sure what you are getting at here?  Even if we set blocked, we're not
throttling so it will just wake up the transaction kthread, so we won't do the
commit in the endio case.  Thanks

Josef


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Thu, Oct 27, 2011 at 11:07:38AM -0400, Josef Bacik wrote:
> [earlier quoted text trimmed]
> 
> Not sure what you are getting at here?  Even if we set blocked, we're not
> throttling so it will just wake up the transaction kthread, so we won't do the
> commit in the endio case.  Thanks
> 

Oh I see what you were trying to say, that we'd set blocking and then commit the
transaction from the endio process which would run ordered operations, but since
throttle isn't set that won't happen.  I think that the perf symbols are just
lying to us.  Thanks,

Josef
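
For what it's worth, one way to sanity-check a suspect chain like this is to
re-record with call graphs against just the kernel thread in question and
then look at what actually calls the symbol.  A typical invocation (not taken
from the thread; the PID is a placeholder):

  # sample the btrfs-endio-wri thread with call graphs for 30 seconds
  perf record -g -p <btrfs-endio-wri pid> sleep 30
  # then look for the callers of the suspect function in the text report
  perf report --stdio | grep -B 5 -A 20 btrfs_alloc_logged_file_extent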


Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Josef Bacik
On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
> 2011/10/24 Josef Bacik :
> > On Mon, Oct 24, 2011 at 10:06:49AM -0700, Sage Weil wrote:
> >> [adding linux-btrfs to cc]
> >>
> >> Josef, Chris, any ideas on the below issues?
> >>
> >> On Mon, 24 Oct 2011, Christian Brunner wrote:
> >> >
> >> > - When I run ceph with btrfs snaps disabled, the situation is getting
> >> > slightly better. I can run an OSD for about 3 days without problems,
> >> > but then again the load increases. This time, I can see that the
> >> > ceph-osd (blkdev_issue_flush) and btrfs-endio-wri are doing more work
> >> > than usual.
> >>
> >> FYI in this scenario you're exposed to the same journal replay issues that
> >> ext4 and XFS are.  The btrfs workload that ceph is generating will also
> >> not be all that special, though, so this problem shouldn't be unique to
> >> ceph.
> >>
> >
> > Can you get sysrq+w when this happens?  I'd like to see what 
> > btrfs-endio-write
> > is up to.
> 
> Capturing this does not seem to be easy. I have a few traces (see
> attachment), but with sysrq+w I do not get a stacktrace of
> btrfs-endio-write. What I have is a "latencytop -c" output which is
> interesting:
> 
> In our Ceph-OSD server we have 4 disks with 4 btrfs filesystems. Ceph
> tries to balance the load over all OSDs, so all filesystems should get
> a nearly equal load. At the moment one filesystem seems to have a
> problem. When running with iostat I see the following
> 
> Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
> sdd               0.00     0.00    0.00    4.33     0.00    53.33    12.31     0.08   19.38  12.23   5.30
> sdc               0.00     1.00    0.00  228.33     0.00  1957.33     8.57    74.33  380.76   2.74  62.57
> sdb               0.00     0.00    0.00    1.33     0.00    16.00    12.00     0.03   25.00  19.75   2.63
> sda               0.00     0.00    0.00    0.67     0.00     8.00    12.00     0.01   19.50  12.50   0.83
> 
> The PID of the ceph-osd that is running on sdc is 2053 and when I look
> with top I see this process and a btrfs-endio-writer (PID 5447):
> 
>   PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
>  2053 root      20   0  537m 146m 2364 S 33.2  0.6  43:31.24 ceph-osd
>  5447 root      20   0     0    0    0 S 22.6  0.0  19:32.18 btrfs-endio-wri
> 
> In the latencytop output you can see that those processes have a much
> higher latency than the other ceph-osd and btrfs-endio-writers.
> 
> Regards,
> Christian

Ok just a shot in the dark, but could you give this a whirl and see if it helps
you?  Thanks

Josef


diff --git a/fs/btrfs/delayed-ref.c b/fs/btrfs/delayed-ref.c
index 125cf76..fbc196e 100644
--- a/fs/btrfs/delayed-ref.c
+++ b/fs/btrfs/delayed-ref.c
@@ -210,9 +210,9 @@ int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
 }
 
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-  struct list_head *cluster, u64 start)
+  struct list_head *cluster, u64 start, unsigned long max_count)
 {
-   int count = 0;
+   unsigned long count = 0;
struct btrfs_delayed_ref_root *delayed_refs;
struct rb_node *node;
struct btrfs_delayed_ref_node *ref;
@@ -242,7 +242,7 @@ int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
node = rb_first(&delayed_refs->root);
}
 again:
-   while (node && count < 32) {
+   while (node && count < max_count) {
ref = rb_entry(node, struct btrfs_delayed_ref_node, rb_node);
if (btrfs_delayed_ref_is_head(ref)) {
head = btrfs_delayed_node_to_head(ref);
diff --git a/fs/btrfs/delayed-ref.h b/fs/btrfs/delayed-ref.h
index e287e3b..b15a6ad 100644
--- a/fs/btrfs/delayed-ref.h
+++ b/fs/btrfs/delayed-ref.h
@@ -169,7 +169,8 @@ btrfs_find_delayed_ref_head(struct btrfs_trans_handle *trans, u64 bytenr);
 int btrfs_delayed_ref_lock(struct btrfs_trans_handle *trans,
   struct btrfs_delayed_ref_head *head);
 int btrfs_find_ref_cluster(struct btrfs_trans_handle *trans,
-  struct list_head *cluster, u64 search_start);
+  struct list_head *cluster, u64 search_start,
+  unsigned long max_count);
 /*
  * a node might live in a head or a regular ref, this lets you
  * test for the proper type to use.
diff --git a/fs/btrfs/dir-item.c b/fs/btrfs/dir-item.c
index 31d84e7..c190282 100644
--- a/fs/btrfs/dir-item.c
+++ b/fs/btrfs/dir-item.c
@@ -81,6 +81,7 @@ int btrfs_insert_xattr_item(struct btrfs_trans_handle *trans,
u32 data_size;
 
BUG_ON(name_len + data_len > BTRFS_MAX_XATTR_SIZE(root));
+   WARN_ON(trans->endio);
 
key.objectid = objectid;
btrfs_set_key_type(&key, BTRFS_XATTR_ITEM_KEY);
diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index 4eb7d2b..0977a10 100644
--- a/fs/b

Re: ceph on btrfs [was Re: ceph on non-btrfs file systems]

2011-10-27 Thread Christian Brunner
2011/10/27 Josef Bacik :
> On Tue, Oct 25, 2011 at 01:56:48PM +0200, Christian Brunner wrote:
>> [earlier quoted text trimmed]
>
> Ok just a shot in the dark, but could you give this a whirl and see if it 
> helps
> you?  Thanks

Thanks for the patch! I'll install it tomorrow and I think that I can
report back on Monday. It always takes a few days until the load goes
up.

Regards,
Christian