Re: [ceph-users] Performance issues related to scrubbing

2016-02-17 Thread Cullen King
On Wed, Feb 17, 2016 at 12:13 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:
>
> > Thanks for the helpful commentary Christian. Cluster is performing much
> > better with 50% more spindles (12 to 18 drives), along with setting scrub
> > sleep to 0.1. Didn't really see any gain from moving from the Samsung 850
> > Pro journal drives to Intel 3710's, even though dd and other direct tests
> > of the drives yielded much better results. rados bench numbers with 4k
> > requests are still awfully low. I'll figure that problem out next.
> >
> Got examples, numbers, watched things with atop?
> 4KB rados benches are what can make my CPUs melt on the cluster here
> that's most similar to yours. ^o^
>
> > I ended up bumping up the number of placement groups from 512 to 1024
> > which should help a little bit. Basically it'll change the worst case
> > scrub performance such that it is distributed a little more across
> > drives rather than clustered on a single drive for longer.
> >
> Of course with osd_max_scrubs at its default of 1 there should never be
> more than one scrub per OSD.
> However I seem to vaguely remember that this is per "primary" scrub, so in
> case of deep-scrubs there could still be plenty of contention going on.
> Again, I've always had good success with that manually kicked-off scrub
> of all OSDs.
> It seems to sequence things nicely and finishes within 4 hours on my
> "good" production cluster.
>
> > I think the real solution here is to create a secondary SSD pool, pin
> > some radosgw buckets to it and put my thumbnail data on the smaller,
> > faster pool. I'll reserve the spindle based pool for original high res
> > photos, which are only read to create thumbnails when necessary. This
> > should put the majority of my random read IO on SSDs, and thumbnails
> > average 50kb each so it shouldn't be too spendy. I am considering trying
> > the newer Samsung SM863 drives as we are read heavy; any potential data
> > loss on this thumbnail pool will not be catastrophic.
> >
> I seriously detest it when makers don't have their endurance data on the
> web page with all the other specifications and make you look up things in
> a slightly hidden PDF.
> Then giving the total endurance and making you calculate drive writes per
> day. ^o^
> Only to find that these have 3 DWPD, which is nothing to be ashamed of
> and should be fine for this particular use case.
>
> However take a look at this old posting of mine:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> With that in mind, I'd recommend you do some testing with real world data
> before you invest too much into something that will wear out long before
> it has paid for itself.
>

We are not write heavy at all; if my current drives are any indication, I'd
only do one drive write per year on the things.


>
> Christian
>
> > Third, it seems that I am also running into the known "Lots Of Small
> > Files" performance issue. Looks like performance in my use case will be
> > drastically improved with the upcoming bluestore, though migrating to it
> > sounds painful!
> >
> > On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> > >
> > > > Replies in-line:
> > > >
> > > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> > > >  wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I've been trying to nail down a nasty performance issue related
> > > > > > to scrubbing. I am mostly using radosgw with a handful of buckets
> > > > > > containing millions of various sized objects. When ceph scrubs,
> > > > > > both regular and deep, radosgw blocks on external requests, and
> > > > > > my cluster has a bunch of requests that have blocked for > 32
> > > > > > seconds. Frequently OSDs are marked down.
> > > > > >
> > > > > From my own (painful) experiences let me state this:
> > > > >
> > > > > 1. When your cluster runs out of steam during deep-scrubs, drop
> > > > > what you're doing and order more HW (OSDs).
> > > > > Because this is a sign that it would also be in trouble when doing
> > > > > recoveries.
> > > > >
> > > >
> > > > When I've initiated recoveries from working on the hardware the
> > > > cluster hasn't had a problem keeping up. It seems that it only has a
> > > > problem with scrubbing, meaning it feels like the IO pattern is
> > > > drastically different. I would think that with scrubbing I'd see
> > > > something closer to bursty sequential reads, rather than just
> > > > thrashing the drives with a more random IO pattern, especially given
> > > > our low cluster utilization.
> > > >
> > > It's probably more pronounced when phasing in/out entire OSDs, where it
> > > also has to read the entire (primary) data off it.
> > >
> > > >
> > > > >
> > > > > 2. If y

Re: [ceph-users] Performance issues related to scrubbing

2016-02-17 Thread Christian Balzer

Hello,

On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:

> Thanks for the helpful commentary Christian. Cluster is performing much
> better with 50% more spindles (12 to 18 drives), along with setting scrub
> sleep to 0.1. Didn't really see any gain from moving from the Samsung 850
> Pro journal drives to Intel 3710's, even though dd and other direct tests
> of the drives yielded much better results. rados bench numbers with 4k
> requests are still awfully low. I'll figure that problem out next.
> 
Got examples, numbers, watched things with atop?
4KB rados benches are what can make my CPUs melt on the cluster here
that's most similar to yours. ^o^

> I ended up bumping up the number of placement groups from 512 to 1024
> which should help a little bit. Basically it'll change the worst case
> scrub performance such that it is distributed a little more across
> drives rather than clustered on a single drive for longer.
> 
Of course with osd_max_scrubs at its default of 1 there should never be
more than one scrub per OSD. 
However I seem to vaguely remember that this is per "primary" scrub, so in
case of deep-scrubs there could still be plenty of contention going on.
Again, I've always had good success with that manually kicked-off scrub
of all OSDs. 
It seems to sequence things nicely and finishes within 4 hours on my
"good" production cluster.

> I think the real solution here is to create a secondary SSD pool, pin
> some radosgw buckets to it and put my thumbnail data on the smaller,
> faster pool. I'll reserve the spindle based pool for original high res
> photos, which are only read to create thumbnails when necessary. This
> should put the majority of my random read IO on SSDs, and thumbnails
> average 50kb each so it shouldn't be too spendy. I am considering trying
> the newer Samsung SM863 drives as we are read heavy; any potential data
> loss on this thumbnail pool will not be catastrophic.
> 
I seriously detest it when makers don't have their endurance data on the
web page with all the other specifications and make you look up things in
a slightly hidden PDF. 
Then giving the total endurance and making you calculate drive writes per
day. ^o^
Only to find that these have 3 DWPD, which is nothing to be ashamed of
and should be fine for this particular use case.
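
For anyone doing the same conversion, the arithmetic is roughly:

  DWPD = endurance (TBW) / (capacity in TB x warranty period in days)

e.g. a hypothetical 960 GB drive rated for ~5300 TBW over a 5 year warranty
works out to 5300 / (0.96 x 1825), or about 3 DWPD.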

However take a look at this old posting of mine:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

With that in mind, I'd recommend you do some testing with real world data
before you invest too much into something that will wear out long before
it has paid for itself.
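
A rough way to gather that real world data is to watch the SMART write
counters on the current journal SSDs over a week or so (attribute names and
units vary per vendor, and /dev/sdg is just a placeholder):

  smartctl -A /dev/sdg | egrep -i 'Total_LBAs_Written|Wear_Leveling|Wearout'

Multiply the delta by the counter's unit (often 512-byte sectors, but check
the datasheet) to get the actual write rate hitting the journal.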

Christian

> Third, it seems that I am also running into the known "Lots Of Small
> Files" performance issue. Looks like performance in my use case will be
> drastically improved with the upcoming bluestore, though migrating to it
> sounds painful!
> 
> On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> >
> > > Replies in-line:
> > >
> > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> > >  wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I've been trying to nail down a nasty performance issue related
> > > > > to scrubbing. I am mostly using radosgw with a handful of buckets
> > > > > containing millions of various sized objects. When ceph scrubs,
> > > > > both regular and deep, radosgw blocks on external requests, and
> > > > > my cluster has a bunch of requests that have blocked for > 32
> > > > > seconds. Frequently OSDs are marked down.
> > > > >
> > > > From my own (painful) experiences let me state this:
> > > >
> > > > 1. When your cluster runs out of steam during deep-scrubs, drop
> > > > what you're doing and order more HW (OSDs).
> > > > Because this is a sign that it would also be in trouble when doing
> > > > recoveries.
> > > >
> > >
> > > When I've initiated recoveries from working on the hardware the
> > > cluster hasn't had a problem keeping up. It seems that it only has a
> > > problem with scrubbing, meaning it feels like the IO pattern is
> > > drastically different. I would think that with scrubbing I'd see
> > > something closer to bursty sequential reads, rather than just
> > > thrashing the drives with a more random IO pattern, especially given
> > > our low cluster utilization.
> > >
> > It's probably more pronounced when phasing in/out entire OSDs, where it
> > also has to read the entire (primary) data off it.
> >
> > >
> > > >
> > > > 2. If your cluster is inconvenienced by even mere scrubs, you're
> > > > really in trouble.
> > > > Threaten the penny pincher with bodily violence and have that new
> > > > HW phased in yesterday.
> > > >
> > >
> > > I am the penny pincher, biz owner, dev and ops guy for
> > > http://ridewithgps.com :) More hardware isn't an issue, it just feels
> > > pretty crazy to have this low of performance

Re: [ceph-users] Performance issues related to scrubbing

2016-02-16 Thread Cullen King
Thanks for the helpful commentary Christian. Cluster is performing much
better with 50% more spindles (12 to 18 drives), along with setting scrub
sleep to 0.1. Didn't really see any gain from moving from the Samsung 850
Pro journal drives to Intel 3710's, even though dd and other direct tests
of the drives yielded much better results. rados bench numbers with 4k
requests are still awfully low. I'll figure that problem out next.

I ended up bumping up the number of placement groups from 512 to 1024 which
should help a little bit. Basically it'll change the worst case scrub
performance such that it is distributed a little more across drives rather
than clustered on a single drive for longer.
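
For the record, the bump itself is just two commands; the pool name below is
whatever your radosgw data pool is called (.rgw.buckets in a stock setup),
and it can be done in smaller steps if the cluster complains:

  ceph osd pool set .rgw.buckets pg_num 1024
  ceph osd pool set .rgw.buckets pgp_num 1024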

I think the real solution here is to create a secondary SSD pool, pin some
radosgw buckets to it and put my thumbnail data on the smaller, faster
pool. I'll reserve the spindle based pool for original high res photos,
which are only read to create thumbnails when necessary. This should put
the majority of my random read IO on SSDs, and thumbnails average 50kb each
so it shouldn't be too spendy. I am considering trying the newer Samsung
SM863 drives as we are read heavy; any potential data loss on this
thumbnail pool will not be catastrophic.
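
A rough sketch of the shape of that (pool name, PG counts and rule id are
made up, and the radosgw placement step differs a bit between versions):

  ceph osd pool create thumbs-ssd 256 256
  ceph osd pool set thumbs-ssd crush_ruleset 1   # rule 1 assumed to map to SSD hosts
  radosgw-admin zone get > zone.json
  # edit zone.json: add a placement target whose data pool is thumbs-ssd
  radosgw-admin zone set < zone.json

New buckets would then have to be created against that placement target for
the thumbnails to land on the SSD pool.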

Third, it seems that I am also running into the known "Lots Of Small Files"
performance issue. Looks like performance in my use case will be
drastically improved with the upcoming bluestore, though migrating to it
sounds painful!

On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
>
> > Replies in-line:
> >
> > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> >  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been trying to nail down a nasty performance issue related to
> > > > scrubbing. I am mostly using radosgw with a handful of buckets
> > > > containing millions of various sized objects. When ceph scrubs, both
> > > > regular and deep, radosgw blocks on external requests, and my
> > > > cluster has a bunch of requests that have blocked for > 32 seconds.
> > > > Frequently OSDs are marked down.
> > > >
> > > From my own (painful) experiences let me state this:
> > >
> > > 1. When your cluster runs out of steam during deep-scrubs, drop what
> > > you're doing and order more HW (OSDs).
> > > Because this is a sign that it would also be in trouble when doing
> > > recoveries.
> > >
> >
> > When I've initiated recoveries from working on the hardware the cluster
> > hasn't had a problem keeping up. It seems that it only has a problem with
> > scrubbing, meaning it feels like the IO pattern is drastically
> > different. I would think that with scrubbing I'd see something closer to
> > bursty sequential reads, rather than just thrashing the drives with a
> > more random IO pattern, especially given our low cluster utilization.
> >
> It's probably more pronounced when phasing in/out entire OSDs, where it
> also has to read the entire (primary) data off it.
>
> >
> > >
> > > 2. If your cluster is inconvenienced by even mere scrubs, you're really
> > > in trouble.
> > > Threaten the penny pincher with bodily violence and have that new HW
> > > phased in yesterday.
> > >
> >
> > I am the penny pincher, biz owner, dev and ops guy for
> > http://ridewithgps.com :) More hardware isn't an issue, it just feels
> > pretty crazy to have this low of performance on a 12 OSD system. Granted,
> > that feeling isn't backed by anything concrete! In general, I like to
> > understand the problem before I solve it with hardware, though I am
> > definitely not averse to it. I already ordered 6 more 4tb drives along
> > with the new journal SSDs, anticipating the need.
> >
> > As you can see from the output of ceph status, we are not space hungry by
> > any means.
> >
>
> Well, in Ceph having just one OSD pegged to max will (eventually) impact
> everything that needs to read/write primary PGs on it.
>
> More below.
>
> >
> > >
> > > > According to atop, the OSDs being deep scrubbed are reading at only
> > > > 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20
> > > > minutes.
> > > >
> > > > Here's a screenshot of atop from a node:
> > > > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> > > >
> > > This looks familiar.
> > > Basically at this point in time the competing read requests for all the
> > > objects clash with write requests and completely saturate your HD
> > > (about 120 IOPS and 85% busy according to your atop screenshot).
> > >
> >
> > In your experience would the scrub operation benefit from a bigger
> > readahead? Meaning is it more sequential than random reads? I already
> > bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
> >
> I played with that a long time ago (in benchmark scenarios) and didn't see
> any noticeable improvement.
> Deep-scrub might (fragmentation could hurt it though), regular scrub no

Re: [ceph-users] Performance issues related to scrubbing

2016-02-16 Thread Cullen King
Thanks for the tuning tips, Bob. I'll play with them after solidifying some
of my other fixes (another 24-48 hours before my migration to 1024
placement groups is finished).

Glad you enjoy ridewithgps; shoot me an email if you have any
questions/ideas/needs :)

On Fri, Feb 5, 2016 at 4:42 PM, Bob R  wrote:

> Cullen,
>
> We operate a cluster with 4 nodes, each has 2xE5-2630, 64gb ram, 10x4tb
> spinners. We've recently replaced 2xm550 journals with a single p3700 nvme
> drive per server and didn't see the performance gains we were hoping for.
> After making the changes below we're now seeing significantly better 4k
> performance. Unfortunately we pushed all of these at once so I wasn't able
> to break down the performance improvement per option but you might want to
> take a look at some of these.
>
> before:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:   120.001910
> Total reads made: 1530642
> Read size:4096
> Bandwidth (MB/sec):   49.8
> Average IOPS: 12755
> Stddev IOPS:  1272
> Max IOPS: 14087
> Min IOPS: 8165
> Average Latency:  0.005
> Max latency:  0.307
> Min latency:  0.000411
>
> after:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:   120.004069
> Total reads made: 4285054
> Read size:4096
> Bandwidth (MB/sec):   139
> Average IOPS: 35707
> Stddev IOPS:  6282
> Max IOPS: 40917
> Min IOPS: 3815
> Average Latency:  0.00178
> Max latency:  1.73
> Min latency:  0.000239
>
> [bobr@bobr ~]$ diff ceph03-before ceph03-after
> 6,8c6,8
> < "debug_lockdep": "0\/1",
> < "debug_context": "0\/1",
> < "debug_crush": "1\/1",
> ---
> > "debug_lockdep": "0\/0",
> > "debug_context": "0\/0",
> > "debug_crush": "0\/0",
> 15,17c15,17
> < "debug_buffer": "0\/1",
> < "debug_timer": "0\/1",
> < "debug_filer": "0\/1",
> ---
> > "debug_buffer": "0\/0",
> > "debug_timer": "0\/0",
> > "debug_filer": "0\/0",
> 19,21c19,21
> < "debug_objecter": "0\/1",
> < "debug_rados": "0\/5",
> < "debug_rbd": "0\/5",
> ---
> > "debug_objecter": "0\/0",
> > "debug_rados": "0\/0",
> > "debug_rbd": "0\/0",
> 26c26
> < "debug_osd": "0\/5",
> ---
> > "debug_osd": "0\/0",
> 29c29
> < "debug_filestore": "1\/3",
> ---
> > "debug_filestore": "0\/0",
> 31,32c31,32
> < "debug_journal": "1\/3",
> < "debug_ms": "0\/5",
> ---
> > "debug_journal": "0\/0",
> > "debug_ms": "0\/0",
> 34c34
> < "debug_monc": "0\/10",
> ---
> > "debug_monc": "0\/0",
> 36,37c36,37
> < "debug_tp": "0\/5",
> < "debug_auth": "1\/5",
> ---
> > "debug_tp": "0\/0",
> > "debug_auth": "0\/0",
> 39,41c39,41
> < "debug_finisher": "1\/1",
> < "debug_heartbeatmap": "1\/5",
> < "debug_perfcounter": "1\/5",
> ---
> > "debug_finisher": "0\/0",
> > "debug_heartbeatmap": "0\/0",
> > "debug_perfcounter": "0\/0",
> 132c132
> < "ms_dispatch_throttle_bytes": "104857600",
> ---
> > "ms_dispatch_throttle_bytes": "1048576000",
> 329c329
> < "objecter_inflight_ops": "1024",
> ---
> > "objecter_inflight_ops": "10240",
> 506c506
> < "osd_op_threads": "4",
> ---
> > "osd_op_threads": "20",
> 510c510
> < "osd_disk_threads": "4",
> ---
> > "osd_disk_threads": "1",
> 697c697
> < "filestore_max_inline_xattr_size": "0",
> ---
> > "filestore_max_inline_xattr_size": "254",
> 701c701
> < "filestore_max_inline_xattrs": "0",
> ---
> > "filestore_max_inline_xattrs": "6",
> 708c708
> < "filestore_max_sync_interval": "5",
> ---
> > "filestore_max_sync_interval": "10",
> 721,724c721,724
> < "filestore_queue_max_ops": "1000",
> < "filestore_queue_max_bytes": "209715200",
> < "filestore_queue_committing_max_ops": "1000",
> < "filestore_queue_committing_max_bytes": "209715200",
> ---
> > "filestore_queue_max_ops": "500",
> > "filestore_queue_max_bytes": "1048576000",
> > "filestore_queue_committing_max_ops": "5000",
> > "filestore_queue_committing_max_bytes": "1048576000",
> 758,761c758,761
> < "journal_max_write_bytes": "10485760",
> < "journal_max_write_entries": "100",
> < "journal_queue_max_ops": "300",
> < "journal_queue_max_bytes": "33554432",
> ---
> > "journal_max_write_bytes": "1048576000",
> > "journal_max_write_entries": "1000",
> > "journal_queue_max_ops": "3000",
> > "journal_queue_max_bytes": "1048576000",
>
> Good luck,
> Bob
>
> PS. thanks for ridewithgps :)
>
>
> On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:
>
>>
>> Hello,
>>
>> On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
>>
>> > Replies in-line:
>> >
>> > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
>> >  wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
>> > >
>> > > > Hello,
>> > > >
>> > > > I've b

Re: [ceph-users] Performance issues related to scrubbing

2016-02-05 Thread Bob R
Cullen,

We operate a cluster with 4 nodes, each has 2xE5-2630, 64gb ram, 10x4tb
spinners. We've recently replaced 2xm550 journals with a single p3700 nvme
drive per server and didn't see the performance gains we were hoping for.
After making the changes below we're now seeing significantly better 4k
performance. Unfortunately we pushed all of these at once so I wasn't able
to break down the performance improvement per option but you might want to
take a look at some of these.

before:
[cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
Total time run:   120.001910
Total reads made: 1530642
Read size:4096
Bandwidth (MB/sec):   49.8
Average IOPS: 12755
Stddev IOPS:  1272
Max IOPS: 14087
Min IOPS: 8165
Average Latency:  0.005
Max latency:  0.307
Min latency:  0.000411

after:
[cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
Total time run:   120.004069
Total reads made: 4285054
Read size:4096
Bandwidth (MB/sec):   139
Average IOPS: 35707
Stddev IOPS:  6282
Max IOPS: 40917
Min IOPS: 3815
Average Latency:  0.00178
Max latency:  1.73
Min latency:  0.000239
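
For anyone reproducing these numbers: the rand test needs objects to read,
so a write pass with --no-cleanup has to come first, something like (with
"one" being the pool name used above):

  rados -p one bench 60 write -b 4096 -t 64 --no-cleanup
  rados -p one bench 120 rand -t 64
  rados -p one cleanup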

[bobr@bobr ~]$ diff ceph03-before ceph03-after
6,8c6,8
< "debug_lockdep": "0\/1",
< "debug_context": "0\/1",
< "debug_crush": "1\/1",
---
> "debug_lockdep": "0\/0",
> "debug_context": "0\/0",
> "debug_crush": "0\/0",
15,17c15,17
< "debug_buffer": "0\/1",
< "debug_timer": "0\/1",
< "debug_filer": "0\/1",
---
> "debug_buffer": "0\/0",
> "debug_timer": "0\/0",
> "debug_filer": "0\/0",
19,21c19,21
< "debug_objecter": "0\/1",
< "debug_rados": "0\/5",
< "debug_rbd": "0\/5",
---
> "debug_objecter": "0\/0",
> "debug_rados": "0\/0",
> "debug_rbd": "0\/0",
26c26
< "debug_osd": "0\/5",
---
> "debug_osd": "0\/0",
29c29
< "debug_filestore": "1\/3",
---
> "debug_filestore": "0\/0",
31,32c31,32
< "debug_journal": "1\/3",
< "debug_ms": "0\/5",
---
> "debug_journal": "0\/0",
> "debug_ms": "0\/0",
34c34
< "debug_monc": "0\/10",
---
> "debug_monc": "0\/0",
36,37c36,37
< "debug_tp": "0\/5",
< "debug_auth": "1\/5",
---
> "debug_tp": "0\/0",
> "debug_auth": "0\/0",
39,41c39,41
< "debug_finisher": "1\/1",
< "debug_heartbeatmap": "1\/5",
< "debug_perfcounter": "1\/5",
---
> "debug_finisher": "0\/0",
> "debug_heartbeatmap": "0\/0",
> "debug_perfcounter": "0\/0",
132c132
< "ms_dispatch_throttle_bytes": "104857600",
---
> "ms_dispatch_throttle_bytes": "1048576000",
329c329
< "objecter_inflight_ops": "1024",
---
> "objecter_inflight_ops": "10240",
506c506
< "osd_op_threads": "4",
---
> "osd_op_threads": "20",
510c510
< "osd_disk_threads": "4",
---
> "osd_disk_threads": "1",
697c697
< "filestore_max_inline_xattr_size": "0",
---
> "filestore_max_inline_xattr_size": "254",
701c701
< "filestore_max_inline_xattrs": "0",
---
> "filestore_max_inline_xattrs": "6",
708c708
< "filestore_max_sync_interval": "5",
---
> "filestore_max_sync_interval": "10",
721,724c721,724
< "filestore_queue_max_ops": "1000",
< "filestore_queue_max_bytes": "209715200",
< "filestore_queue_committing_max_ops": "1000",
< "filestore_queue_committing_max_bytes": "209715200",
---
> "filestore_queue_max_ops": "500",
> "filestore_queue_max_bytes": "1048576000",
> "filestore_queue_committing_max_ops": "5000",
> "filestore_queue_committing_max_bytes": "1048576000",
758,761c758,761
< "journal_max_write_bytes": "10485760",
< "journal_max_write_entries": "100",
< "journal_queue_max_ops": "300",
< "journal_queue_max_bytes": "33554432",
---
> "journal_max_write_bytes": "1048576000",
> "journal_max_write_entries": "1000",
> "journal_queue_max_ops": "3000",
> "journal_queue_max_bytes": "1048576000",

Good luck,
Bob

PS. thanks for ridewithgps :)


On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
>
> > Replies in-line:
> >
> > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> >  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > >
> > > > Hello,
> > > >
> > > > I've been trying to nail down a nasty performance issue related to
> > > > scrubbing. I am mostly using radosgw with a handful of buckets
> > > > containing millions of various sized objects. When ceph scrubs, both
> > > > regular and deep, radosgw blocks on external requests, and my
> > > > cluster has a bunch of requests that have blocked for > 32 seconds.
> > > > Frequently OSDs are marked down.
> > > >
> > > From my own (painful) experiences let me state this:
> > >
> > > 1. When your cluster runs out of steam during deep-scrubs, drop what
> > > you're doing and order more HW (OSDs).
> > > Because this is a sign that

Re: [ceph-users] Performance issues related to scrubbing

2016-02-04 Thread Christian Balzer

Hello,

On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:

> Replies in-line:
> 
> On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
>  wrote:
> 
> >
> > Hello,
> >
> > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> >
> > > Hello,
> > >
> > > I've been trying to nail down a nasty performance issue related to
> > > scrubbing. I am mostly using radosgw with a handful of buckets
> > > containing millions of various sized objects. When ceph scrubs, both
> > > regular and deep, radosgw blocks on external requests, and my
> > > cluster has a bunch of requests that have blocked for > 32 seconds.
> > > Frequently OSDs are marked down.
> > >
> > From my own (painful) experiences let me state this:
> >
> > 1. When your cluster runs out of steam during deep-scrubs, drop what
> > you're doing and order more HW (OSDs).
> > Because this is a sign that it would also be in trouble when doing
> > recoveries.
> >
> 
> When I've initiated recoveries from working on the hardware the cluster
> hasn't had a problem keeping up. It seems that it only has a problem with
> scrubbing, meaning it feels like the IO pattern is drastically
> different. I would think that with scrubbing I'd see something closer to
> bursty sequential reads, rather than just thrashing the drives with a
> more random IO pattern, especially given our low cluster utilization.
>
It's probably more pronounced when phasing in/out entire OSDs, where it
also has to read the entire (primary) data off it.
 
> 
> >
> > 2. If your cluster is inconvenienced by even mere scrubs, you're really
> > in trouble.
> > Threaten the penny pincher with bodily violence and have that new HW
> > phased in yesterday.
> >
> 
> I am the penny pincher, biz owner, dev and ops guy for
> http://ridewithgps.com :) More hardware isn't an issue, it just feels
> pretty crazy to have this low of performance on a 12 OSD system. Granted,
> that feeling isn't backed by anything concrete! In general, I like to
> understand the problem before I solve it with hardware, though I am
> definitely not averse to it. I already ordered 6 more 4tb drives along
> with the new journal SSDs, anticipating the need.
> 
> As you can see from the output of ceph status, we are not space hungry by
> any means.
> 

Well, in Ceph having just one OSD pegged to max will (eventually) impact
everything that needs to read/write primary PGs on it.

More below.

> 
> >
> > > According to atop, the OSDs being deep scrubbed are reading at only
> > > 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20
> > > minutes.
> > >
> > > Here's a screenshot of atop from a node:
> > > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> > >
> > This looks familiar.
> > Basically at this point in time the competing read requests for all the
> > objects clash with write requests and completely saturate your HD
> > (about 120 IOPS and 85% busy according to your atop screenshot).
> >
> 
> In your experience would the scrub operation benefit from a bigger
> readahead? Meaning is it more sequential than random reads? I already
> bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
> 
I played with that a long time ago (in benchmark scenarios) and didn't see
any noticeable improvement. 
Deep-scrub might (fragmentation could hurt it though), regular scrub not so
much.

> About half of our reads are on objects with an average size of 40kb (map
> thumbnails), and the other half are on photo thumbs with a size between
> 10kb and 150kb.
> 

Noted, see below.

> After doing a little more research, I came across this:
> 
> http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
> 
> Sounds like I am probably running into issues with lots of random read
> IO, combined with known issues around small files. To give an idea, I
> have about 15 million small map thumbnails stored in my two largest
> buckets, and I am pushing out about 30 requests per second right now
> from those two buckets.
> 
This is certainly a factor, but that knowledge of a future improvement
won't help you with your current problem of course. ^_-

> 
> 
> > There are ceph configuration options that can mitigate this to some
> > extent and which I don't see in your config, like
> > "osd_scrub_load_threshold" and "osd_scrub_sleep" along with the
> > various IO priority settings.
> > However the points above still stand.
> >
> 
> Yes, I have a running series of notes of config options to try out, just
> wanted to touch base with other community members before shooting in the
> dark.
> 
osd_scrub_sleep is probably the most effective immediately available
option for you to prevent slow, stalled IO. 
At the obvious cost of scrubs taking even longer.
There is of course also the option to disable scrubs entirely until your HW
has been upgraded.
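
For completeness, both can be done on the fly (the sleep value is just an
example; the injectargs change lasts until the OSDs restart, while the flags
stick until they are unset):

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
  ceph osd set noscrub
  ceph osd set nodeep-scrub
  ceph osd unset noscrub          # once the new HW is in
  ceph osd unset nodeep-scrub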

> 
> >
> > XFS defragmentation might help, significantly if your FS is badly
> > fragmented. But again, this is only a temporary band-aid.
> >
> > > First question: is this a r

Re: [ceph-users] Performance issues related to scrubbing

2016-02-04 Thread Cullen King
Replies in-line:

On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer 
wrote:

>
> Hello,
>
> On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
>
> > Hello,
> >
> > I've been trying to nail down a nasty performance issue related to
> > scrubbing. I am mostly using radosgw with a handful of buckets containing
> > millions of various sized objects. When ceph scrubs, both regular and
> > deep, radosgw blocks on external requests, and my cluster has a bunch of
> > requests that have blocked for > 32 seconds. Frequently OSDs are marked
> > down.
> >
> From my own (painful) experiences let me state this:
>
> 1. When your cluster runs out of steam during deep-scrubs, drop what
> you're doing and order more HW (OSDs).
> Because this is a sign that it would also be in trouble when doing
> recoveries.
>

When I've initiated recoveries from working on the hardware the cluster
hasn't had a problem keeping up. It seems that it only has a problem with
scrubbing, meaning it feels like the IO pattern is drastically different. I
would think that with scrubbing I'd see something closer to bursty
sequential reads, rather than just thrashing the drives with a more random
IO pattern, especially given our low cluster utilization.


>
> 2. If your cluster is inconvenienced by even mere scrubs, you're really in
> trouble.
> Threaten the penny pincher with bodily violence and have that new HW
> phased in yesterday.
>

I am the penny pincher, biz owner, dev and ops guy for
http://ridewithgps.com :) More hardware isn't an issue, it just feels
pretty crazy to have this low of performance on a 12 OSD system. Granted,
that feeling isn't backed by anything concrete! In general, I like to
understand the problem before I solve it with hardware, though I am
definitely not averse to it. I already ordered 6 more 4tb drives along with
the new journal SSDs, anticipating the need.

As you can see from the output of ceph status, we are not space hungry by
any means.


>
> > According to atop, the OSDs being deep scrubbed are reading at only 5mb/s
> > to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> >
> > Here's a screenshot of atop from a node:
> > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> >
> This looks familiar.
> Basically at this point in time the competing read requests for all the
> objects clash with write requests and completely saturate your HD (about
> 120 IOPS and 85% busy according to your atop screenshot).
>

In your experience would the scrub operation benefit from a bigger
readahead? Meaning is it more sequential than random reads? I already
bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
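
For reference, that is roughly:

  for q in /sys/block/sd*/queue/read_ahead_kb; do echo 512 > "$q"; done

which is not persistent across reboots without a udev rule or an rc.local
entry.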

About half of our reads are on objects with an average size of 40kb (map
thumbnails), and the other half are on photo thumbs with a size between
10kb and 150kb.

After doing a little more research, I came across this:

http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage

Sounds like I am probably running into issues with lots of random read IO,
combined with known issues around small files. To give an idea, I have
about 15 million small map thumbnails stored in my two largest buckets, and
I am pushing out about 30 requests per second right now from those two
buckets.



> There are ceph configuration options that can mitigate this to some
> extent and which I don't see in your config, like
> "osd_scrub_load_threshold" and "osd_scrub_sleep" along with the various IO
> priority settings.
> However the points above still stand.
>

Yes, I have a running series of notes of config options to try out, just
wanted to touch base with other community members before shooting in the
dark.


>
> XFS defragmentation might help, significantly if your FS is badly
> fragmented. But again, this is only a temporary band-aid.
>
> > First question: is this a reasonable speed for scrubbing, given a very
> > lightly used cluster? Here's some cluster details:
> >
> > deploy@drexler:~$ ceph --version
> > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> >
> >
> > 2x Xeon E5-2630 per node, 64gb of ram per node.
> >
> More memory can help by keeping hot objects in the page cache (so the
> actual disks need not be read and can write at their full IOPS capacity).
> A lot of memory (and the correct sysctl settings) will also allow for a
> large SLAB space, keeping all those directory entries and other bits in
> memory without having to go to disk to get them.
>
> You seem to be just fine CPU wise.
>

I thought about bumping each node up to 128gb of ram as another cheap
insurance policy. I'll try that after the other changes. I'd like to know
why, so I'll try to change one thing at a time, though I am also just eager
to have this thing stable.


>
> >
> > deploy@drexler:~$ ceph status
> > cluster 234c6825-0e2b-4256-a710-71d29f4f023e
> >  health HEALTH_WARN
> > 118 requests are blocked > 32 sec
> >  monmap e1: 3 mons at {drexler=
> > 10.0.0.

Re: [ceph-users] Performance issues related to scrubbing

2016-02-03 Thread Christian Balzer

Hello,

On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:

> Hello,
> 
> I've been trying to nail down a nasty performance issue related to
> scrubbing. I am mostly using radosgw with a handful of buckets containing
> millions of various sized objects. When ceph scrubs, both regular and
> deep, radosgw blocks on external requests, and my cluster has a bunch of
> requests that have blocked for > 32 seconds. Frequently OSDs are marked
> down.
>   
From my own (painful) experiences let me state this:

1. When your cluster runs out of steam during deep-scrubs, drop what
you're doing and order more HW (OSDs).
Because this is a sign that it would also be in trouble when doing
recoveries. 

2. If your cluster is inconvenienced by even mere scrubs, you're really in
trouble. 
Threaten the penny pincher with bodily violence and have that new HW
phased in yesterday.

> According to atop, the OSDs being deep scrubbed are reading at only 5mb/s
> to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> 
> Here's a screenshot of atop from a node:
> https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
>   
This looks familiar. 
Basically at this point in time the competing read requests for all the
objects clash with write requests and completely saturate your HD (about
120 IOPS and 85% busy according to your atop screenshot). 

There are ceph configuration options that can mitigate this to some
extent and which I don't see in your config, like
"osd_scrub_load_threshold" and "osd_scrub_sleep" along with the various IO
priority settings.
However the points above still stand.
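
As a hedged example only (the values are starting points, not
recommendations, and the ioprio knobs only matter with the CFQ scheduler):

  [osd]
  osd scrub sleep = 0.1
  osd scrub load threshold = 0.5
  osd scrub chunk max = 5
  osd disk thread ioprio class = idle
  osd disk thread ioprio priority = 7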

XFS defragmentation might help, significantly if your FS is badly
fragmented. But again, this is only a temporary band-aid.
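
Checking whether it is actually fragmented is cheap (device and mount point
below are placeholders):

  xfs_db -r -c frag /dev/sdb1              # report the fragmentation factor
  xfs_fsr -v /var/lib/ceph/osd/ceph-0      # online defrag of that filesystem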

> First question: is this a reasonable speed for scrubbing, given a very
> lightly used cluster? Here's some cluster details:
> 
> deploy@drexler:~$ ceph --version
> ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> 
> 
> 2x Xeon E5-2630 per node, 64gb of ram per node.
>  
More memory can help by keeping hot objects in the page cache (so the
actual disks need not be read and can write at their full IOPS capacity).
A lot of memory (and the correct sysctl settings) will also allow for a
large SLAB space, keeping all those directory entries and other bits in
memory without having to go to disk to get them.
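
The sysctl side of that is mostly about telling the kernel to hold on to
dentries and inodes; a sketch for /etc/sysctl.conf, values to taste:

  vm.vfs_cache_pressure = 10
  vm.min_free_kbytes = 262144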

You seem to be just fine CPU wise. 

> 
> deploy@drexler:~$ ceph status
> cluster 234c6825-0e2b-4256-a710-71d29f4f023e
>  health HEALTH_WARN
> 118 requests are blocked > 32 sec
>  monmap e1: 3 mons at {drexler=
> 10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
> election epoch 296, quorum 0,1,2 paley,drexler,lucy
>  mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
>  osdmap e1115: 12 osds: 12 up, 12 in
>   pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
> 10060 GB used, 34629 GB / 44690 GB avail
> 1422 active+clean
>1 active+clean+scrubbing+deep
>1 active+clean+scrubbing
>   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
>   
You want to avoid having scrubs going on willy-nilly in parallel and at
high peak times, even IF your cluster is capable of handling them.

Depending on how busy your cluster is and its usage pattern, you may do
what I did. 
Kick off a deep scrub of all OSDs ("ceph osd deep-scrub \*") at around 01:00
on a Saturday morning.
If your cluster is fast enough, it will finish before 07:00 (without
killing your client performance) and all regular scrubs will now happen in
that time frame as well (given default settings).
If your cluster isn't fast enough, see my initial 2 points. ^o^
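
The scheduled version of that is a one-liner in cron; day, time and path are
examples only:

  # /etc/cron.d/ceph-deep-scrub
  0 1 * * 6  root  /usr/bin/ceph osd deep-scrub \*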

> deploy@drexler:~$ ceph osd tree
> ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 43.67999 root default
> -2 14.56000 host paley
>  0  3.64000 osd.0 up  1.0  1.0
>  3  3.64000 osd.3 up  1.0  1.0
>  6  3.64000 osd.6 up  1.0  1.0
>  9  3.64000 osd.9 up  1.0  1.0
> -3 14.56000 host lucy
>  1  3.64000 osd.1 up  1.0  1.0
>  4  3.64000 osd.4 up  1.0  1.0
>  7  3.64000 osd.7 up  1.0  1.0
> 11  3.64000 osd.11up  1.0  1.0
> -4 14.56000 host drexler
>  2  3.64000 osd.2 up  1.0  1.0
>  5  3.64000 osd.5 up  1.0  1.0
>  8  3.64000 osd.8 up  1.0  1.0
> 10  3.64000 osd.10up  1.0  1.0
> 
> 
> My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung 850
> Pro journals (very slow, ordered s3700 replacements, but shouldn't pose
> problems for reading as far as I understand things).   

Just to make sure, these are genuine DeskStars?
I'm aski

[ceph-users] Performance issues related to scrubbing

2016-02-03 Thread Cullen King
Hello,

I've been trying to nail down a nasty performance issue related to
scrubbing. I am mostly using radosgw with a handful of buckets containing
millions of various sized objects. When ceph scrubs, both regular and deep,
radosgw blocks on external requests, and my cluster has a bunch of requests
that have blocked for > 32 seconds. Frequently OSDs are marked down.
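
For anyone else hitting this, the usual first checks to see which OSDs and
ops are involved are (osd.3 being a placeholder):

  ceph health detail | grep -i blocked
  ceph daemon osd.3 dump_ops_in_flight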

According to atop, the OSDs being deep scrubbed are reading at only 5mb/s
to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.

Here's a screenshot of atop from a node:
https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png

First question: is this a reasonable speed for scrubbing, given a very
lightly used cluster? Here's some cluster details:

deploy@drexler:~$ ceph --version
ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)


2x Xeon E5-2630 per node, 64gb of ram per node.


deploy@drexler:~$ ceph status
cluster 234c6825-0e2b-4256-a710-71d29f4f023e
 health HEALTH_WARN
118 requests are blocked > 32 sec
 monmap e1: 3 mons at {drexler=
10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
election epoch 296, quorum 0,1,2 paley,drexler,lucy
 mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
 osdmap e1115: 12 osds: 12 up, 12 in
  pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
10060 GB used, 34629 GB / 44690 GB avail
1422 active+clean
   1 active+clean+scrubbing+deep
   1 active+clean+scrubbing
  client io 721 kB/s rd, 33398 B/s wr, 53 op/s

deploy@drexler:~$ ceph osd tree
ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 43.67999 root default
-2 14.56000 host paley
 0  3.64000 osd.0 up  1.0  1.0
 3  3.64000 osd.3 up  1.0  1.0
 6  3.64000 osd.6 up  1.0  1.0
 9  3.64000 osd.9 up  1.0  1.0
-3 14.56000 host lucy
 1  3.64000 osd.1 up  1.0  1.0
 4  3.64000 osd.4 up  1.0  1.0
 7  3.64000 osd.7 up  1.0  1.0
11  3.64000 osd.11up  1.0  1.0
-4 14.56000 host drexler
 2  3.64000 osd.2 up  1.0  1.0
 5  3.64000 osd.5 up  1.0  1.0
 8  3.64000 osd.8 up  1.0  1.0
10  3.64000 osd.10up  1.0  1.0


My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung 850 Pro
journals (very slow, ordered s3700 replacements, but shouldn't pose
problems for reading as far as I understand things). MONs are co-located
with OSD nodes, but the nodes are fairly beefy and have very low load.
Drives are on a expanding backplane, with an LSI SAS3008 controller.

I have a fairly standard config as well:

https://gist.github.com/kingcu/aae7373eb62ceb7579da

I know that I don't have a ton of OSDs, but I'd expect a little better
performance than this. Check out the munin graphs of my three nodes:

http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk


Any input would be appreciated, before I start trying to micro-optimize
config params, as well as upgrading to Infernalis.


Cheers,

Cullen