Replies in-line:

On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer <c-bal...@fusioncom.co.jp>
wrote:

>
> Hello,
>
> On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
>
> > Hello,
> >
> > I've been trying to nail down a nasty performance issue related to
> > scrubbing. I am mostly using radosgw with a handful of buckets containing
> > millions of various sized objects. When ceph scrubs, both regular and
> > deep, radosgw blocks on external requests, and my cluster has a bunch of
> > requests that have blocked for > 32 seconds. Frequently OSDs are marked
> > down.
> >
> From my own (painful) experiences let me state this:
>
> 1. When your cluster runs out of steam during deep-scrubs, drop what
> you're doing and order more HW (OSDs).
> Because this is a sign that it would also be in trouble when doing
> recoveries.
>

When I've triggered recoveries while working on the hardware, the cluster
hasn't had a problem keeping up. It seems to struggle only with scrubbing,
which makes me think the IO pattern is drastically different. I would
expect scrubbing to look more like bursty sequential reads than random IO
thrashing the drives, especially given our low cluster utilization.


>
> 2. If your cluster is inconvenienced by even mere scrubs, you're really in
> trouble.
> Threaten the penny pincher with bodily violence and have that new HW
> phased in yesterday.
>

I am the penny pincher, biz owner, dev and ops guy for
http://ridewithgps.com :) More hardware isn't an issue, it just feels
pretty crazy to have performance this low on a 12 OSD system. Granted,
that feeling isn't backed by anything concrete! In general, I like to
understand the problem before I solve it with hardware, though I am
definitely not averse to it. I already ordered 6 more 4tb drives along with
the new journal SSDs, anticipating the need.

As you can see from the output of ceph status, we are not space hungry by
any means.


>
> > According to atop, the OSDs being deep scrubbed are reading at only 5mb/s
> > to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 minutes.
> >
> > Here's a screenshot of atop from a node:
> > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png
> >
> This looks familiar.
> Basically at this point in time the competing read requests for all the
> objects clash with write requests and completely saturate your HD (about
> 120 IOPS and 85% busy according to your atop screenshot).
>

In your experience would the scrub operation benefit from a bigger
readahead? Meaning is it more sequential than random reads? I already
bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.

About half of our reads are on objects with an average size of 40kb (map
thumbnails), and the other half are on photo thumbs with a size between
10kb and 150kb.

After doing a little more research, I came across this:

http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage

Sounds like I am probably running into issues with lots of random read IO,
combined with known issues around small files. To give an idea, I have
about 15 million small map thumbnails stored in my two largest buckets, and
I am pushing out about 30 requests per second right now from those two
buckets.
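
As a rough sanity check on the random-IO theory, the figures quoted in this
thread can be turned into objects read per second during a deep scrub. This
is only a sketch: it assumes the scrubbed placement group holds objects of
the cluster-wide average size (3185 GB over 20493 kobjects from ceph
status), which may not hold for the thumbnail buckets.

```python
# Back-of-the-envelope: how many objects per second does a deep scrub read?
# Inputs come from the ceph status and atop figures quoted in this thread.
# Assumption: the PG's objects match the cluster-wide average object size.

DATA_BYTES = 3185e9                 # "3185 GB data" (treated as decimal GB)
OBJECT_COUNT = 20493e3              # "20493 kobjects"
PG_BYTES = 6.4e9                    # size of the scrubbed placement group
SCRUB_SECONDS = (10 * 60, 20 * 60)  # "10-20 minutes"

avg_object_bytes = DATA_BYTES / OBJECT_COUNT
objects_in_pg = PG_BYTES / avg_object_bytes


def scrub_rates(seconds):
    """Return (objects per second, MB per second) for one scrub duration."""
    return objects_in_pg / seconds, PG_BYTES / seconds / 1e6


for seconds in SCRUB_SECONDS:
    obj_s, mb_s = scrub_rates(seconds)
    print(f"{seconds // 60} min scrub: {obj_s:.0f} objects/s, {mb_s:.1f} MB/s")
```

Under that assumption the scrub touches a few tens of objects per second at
roughly 5 to 11 MB/s, which is in line with the atop numbers and looks far
more like a seek-bound workload than a sequential one.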



> There are ceph configuration options that can mitigate this to some
> extent and which I don't see in your config, like
> "osd_scrub_load_threshold" and "osd_scrub_sleep" along with the various IO
> priority settings.
> However the points above still stand.
>

Yes, I have a running series of notes of config options to try out, just
wanted to touch base with other community members before shooting in the
dark.
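
One way to try such options one at a time without restarting OSDs is to
inject them at runtime. The sketch below only builds and prints the
commands, it does not run them. The two option names are the ones mentioned
above; the values are placeholders chosen for illustration, not
recommendations, and injected values do not survive an OSD restart unless
they are also added to ceph.conf.

```python
# Sketch: build (but do not run) "ceph tell osd.* injectargs" commands so
# scrub throttles can be changed one at a time on a live cluster.
# The option names come from this thread; the values are placeholders only.
import shlex

TRIAL_SETTINGS = [
    ("osd_scrub_sleep", "0.1"),           # hypothetical value
    ("osd_scrub_load_threshold", "0.5"),  # hypothetical value
]


def injectargs_command(option, value):
    """Return the argv list for injecting one option into all OSDs."""
    return ["ceph", "tell", "osd.*", "injectargs", f"--{option} {value}"]


for option, value in TRIAL_SETTINGS:
    # shlex.join quotes arguments so the line can be pasted into a shell.
    print(shlex.join(injectargs_command(option, value)))
```

Printing rather than executing keeps a record of exactly what was changed
and in which order, which helps when changing one thing at a time.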


>
> XFS defragmentation might help, significantly if your FS is badly
> fragmented. But again, this is only a temporary band-aid.
>
> > First question: is this a reasonable speed for scrubbing, given a very
> > lightly used cluster? Here's some cluster details:
> >
> > deploy@drexler:~$ ceph --version
> > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
> >
> >
> > 2x Xeon E5-2630 per node, 64gb of ram per node.
> >
> More memory can help by keeping hot objects in the page cache (so the
> actual disks need not be read and can write at their full IOPS capacity).
> A lot of memory (and the correct sysctl settings) will also allow for a
> large SLAB space, keeping all those directory entries and other bits in
> memory without having to go to disk to get them.
>
> You seem to be just fine CPU wise.
>

I thought about bumping each node up to 128gb of ram as another cheap
insurance policy. I'll try that after the other changes. I'd like to know
why this is happening, so I'll try to change one thing at a time, though I
am also just eager to have this thing stable.


>
> >
> > deploy@drexler:~$ ceph status
> >     cluster 234c6825-0e2b-4256-a710-71d29f4f023e
> >      health HEALTH_WARN
> >             118 requests are blocked > 32 sec
> >      monmap e1: 3 mons at {drexler=
> > 10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
> >             election epoch 296, quorum 0,1,2 paley,drexler,lucy
> >      mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
> >      osdmap e1115: 12 osds: 12 up, 12 in
> >       pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
> >             10060 GB used, 34629 GB / 44690 GB avail
> >                 1422 active+clean
> >                    1 active+clean+scrubbing+deep
> >                    1 active+clean+scrubbing
> >   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
> >
> You want to avoid having scrubs going on willy-nilly in parallel and at
> peak times, even IF your cluster is capable of handling them.
>
> Depending on how busy your cluster is and its usage pattern, you may do
> what I did.
> Kick off a deep scrub of all OSDs ("ceph osd deep-scrub \*") at around
> 01:00 on a Saturday morning.
> If your cluster is fast enough, it will finish before 07:00 (without
> killing your client performance) and all regular scrubs will now happen in
> that time frame as well (given default settings).
> If your cluster isn't fast enough, see my initial 2 points. ^o^
>

The problem is that our cluster is the image and upload store for our site,
which is a reasonably busy international site. We have about 60% of our
customers in North America, and 30% or so in Europe and Asia. We definitely
would be better off with more scrubs between 11pm and 7am (GMT-8 to GMT+0),
though we can't afford to slam the cluster.
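
To get a feel for whether a full deep scrub could fit into a quiet window
at all, here is a rough estimate from the numbers in this thread. It is a
sketch under two assumptions that are not from the thread: every OSD scrubs
concurrently at the rate seen in atop, and all raw used space has to be
read once.

```python
# Rough estimate: how long would a full deep scrub take at the observed rate?
# Assumptions (not from the thread): every OSD scrubs concurrently at the
# atop-observed rate, and all raw used space has to be read once.

RAW_USED_BYTES = 10060e9   # "10060 GB used" from ceph status
OSD_COUNT = 12
SCRUB_RATE_MB_S = (5, 8)   # per-OSD read rate seen in atop
WINDOW_HOURS = 6           # the suggested 01:00 to 07:00 window


def full_scrub_hours(rate_mb_s):
    """Hours to read all used space with every OSD reading at rate_mb_s."""
    cluster_bytes_per_s = OSD_COUNT * rate_mb_s * 1e6
    return RAW_USED_BYTES / cluster_bytes_per_s / 3600


for rate in SCRUB_RATE_MB_S:
    hours = full_scrub_hours(rate)
    verdict = "fits" if hours <= WINDOW_HOURS else "does not fit"
    print(f"{rate} MB/s per OSD: {hours:.0f} h, {verdict} in {WINDOW_HOURS} h")
```

Even with these optimistic assumptions the estimate comes out at roughly 29
to 47 hours, so at the current scrub rate a single overnight window would
not be enough.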

I suppose that our cluster is a much more random mix of reads than many
others using ceph as a RBD store. Operating systems probably have a
stronger mix of sequential reads, whereas our users are concurrently
viewing different pages with different images, a more random workload.

It sounds like we have to keep cluster storage utilization below 25% in
order to have reasonable performance. I guess this makes sense, since our
random IO needs are much greater than our storage needs.
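
For reference, the utilization implied by the ceph status output can be
worked out directly. The second figure is a projection only: it assumes the
six drives on order each add the same usable space as an existing OSD and
that the amount of stored data stays the same.

```python
# Raw utilization from the "ceph status" output quoted in this thread, plus
# a projection for the six extra drives. Assumption: each new OSD adds the
# same usable space as an existing one and the stored data does not grow.

USED_GB = 10060
TOTAL_GB = 44690
CURRENT_OSDS = 12
ADDED_OSDS = 6  # the six extra 4tb drives already on order


def utilisation(used_gb, total_gb):
    """Fraction of raw capacity in use."""
    return used_gb / total_gb


per_osd_gb = TOTAL_GB / CURRENT_OSDS
expanded_total_gb = per_osd_gb * (CURRENT_OSDS + ADDED_OSDS)

print(f"today:        {utilisation(USED_GB, TOTAL_GB):.1%}")
print(f"with 18 OSDs: {utilisation(USED_GB, expanded_total_gb):.1%}")
```

That puts the cluster at about 22.5% today and about 15% once the new
drives are in, if the assumptions hold.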


>
> > deploy@drexler:~$ ceph osd tree
> > ID WEIGHT   TYPE NAME        UP/DOWN REWEIGHT PRIMARY-AFFINITY
> > -1 43.67999 root default
> > -2 14.56000     host paley
> >  0  3.64000         osd.0         up  1.00000          1.00000
> >  3  3.64000         osd.3         up  1.00000          1.00000
> >  6  3.64000         osd.6         up  1.00000          1.00000
> >  9  3.64000         osd.9         up  1.00000          1.00000
> > -3 14.56000     host lucy
> >  1  3.64000         osd.1         up  1.00000          1.00000
> >  4  3.64000         osd.4         up  1.00000          1.00000
> >  7  3.64000         osd.7         up  1.00000          1.00000
> > 11  3.64000         osd.11        up  1.00000          1.00000
> > -4 14.56000     host drexler
> >  2  3.64000         osd.2         up  1.00000          1.00000
> >  5  3.64000         osd.5         up  1.00000          1.00000
> >  8  3.64000         osd.8         up  1.00000          1.00000
> > 10  3.64000         osd.10        up  1.00000          1.00000
> >
> >
> > My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung 850
> > Pro journals (very slow, ordered s3700 replacements, but shouldn't pose
> > problems for reading as far as I understand things).
>
> Just to make sure, these are genuine DeskStars?
> I'm asking both because AFAIK they are out of production and their direct
> successors, the Toshiba DT drives (can) have a nasty firmware bug that
> totally ruins their performance (from ~8 hours per week to permanently
> until power-cycled).
>

These are original DeskStars. Didn't realize they weren't in production, I
just grabbed 6 more of the Hitachi DeskStar NAS edition 4tb drives, which
are readily available. I probably should have ordered 6tb drives, as I'd
end up with better seek times due to them not being fully utilized - the
data would be confined to a narrower band of the platters.


>
> Regards,
>
> Christian
> > MONs are co-located
> > with OSD nodes, but the nodes are fairly beefy and have very low load.
> > Drives are on an expanding backplane, with an LSI SAS3008 controller.
> >
> > I have a fairly standard config as well:
> >
> > https://gist.github.com/kingcu/aae7373eb62ceb7579da
> >
> > I know that I don't have a ton of OSDs, but I'd expect a little better
> > performance than this. Check out munin of my three nodes:
> >
> >
> http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
> >
> http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
> >
> http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk
> >
> >
> > Any input would be appreciated, before I start trying to micro-optimize
> > config params, as well as upgrading to Infernalis.
> >
> >
> > Cheers,
> >
> > Cullen
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com