Thanks for the tuning tips, Bob. I'll play with them after solidifying some of my other fixes (another 24-48 hours before my migration to 1024 placement groups is finished).
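For reference, the placement group bump itself is just the usual two-step pool change. This is only a sketch: ".rgw.buckets" is a stand-in for whichever pool is being split, and a jump this large is normally done in smaller increments rather than one step:

    # create the new PGs, then allow them to be used for data placement
    ceph osd pool set .rgw.buckets pg_num 1024
    ceph osd pool set .rgw.buckets pgp_num 1024
    # watch the splitting/backfill progress with: ceph -s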
Glad you enjoy ridewithgps, shoot me an email if you have any questions/ideas/needs :)

On Fri, Feb 5, 2016 at 4:42 PM, Bob R <b...@drinksbeer.org> wrote:
> Cullen,
>
> We operate a cluster with 4 nodes, each has 2xE5-2630, 64gb ram, 10x4tb
> spinners. We've recently replaced 2xm550 journals with a single p3700 nvme
> drive per server and didn't see the performance gains we were hoping for.
> After making the changes below we're now seeing significantly better 4k
> performance. Unfortunately we pushed all of these at once so I wasn't able
> to break down the performance improvement per option but you might want to
> take a look at some of these.
>
> before:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:       120.001910
> Total reads made:     1530642
> Read size:            4096
> Bandwidth (MB/sec):   49.8
> Average IOPS:         12755
> Stddev IOPS:          1272
> Max IOPS:             14087
> Min IOPS:             8165
> Average Latency:      0.005
> Max latency:          0.307
> Min latency:          0.000411
>
> after:
> [cephuser@ceph03 ~]$ rados -p one bench 120 rand -t 64
> Total time run:       120.004069
> Total reads made:     4285054
> Read size:            4096
> Bandwidth (MB/sec):   139
> Average IOPS:         35707
> Stddev IOPS:          6282
> Max IOPS:             40917
> Min IOPS:             3815
> Average Latency:      0.00178
> Max latency:          1.73
> Min latency:          0.000239
>
> [bobr@bobr ~]$ diff ceph03-before ceph03-after
> 6,8c6,8
> < "debug_lockdep": "0\/1",
> < "debug_context": "0\/1",
> < "debug_crush": "1\/1",
> ---
> > "debug_lockdep": "0\/0",
> > "debug_context": "0\/0",
> > "debug_crush": "0\/0",
> 15,17c15,17
> < "debug_buffer": "0\/1",
> < "debug_timer": "0\/1",
> < "debug_filer": "0\/1",
> ---
> > "debug_buffer": "0\/0",
> > "debug_timer": "0\/0",
> > "debug_filer": "0\/0",
> 19,21c19,21
> < "debug_objecter": "0\/1",
> < "debug_rados": "0\/5",
> < "debug_rbd": "0\/5",
> ---
> > "debug_objecter": "0\/0",
> > "debug_rados": "0\/0",
> > "debug_rbd": "0\/0",
> 26c26
> < "debug_osd": "0\/5",
> ---
> > "debug_osd": "0\/0",
> 29c29
> < "debug_filestore": "1\/3",
> ---
> > "debug_filestore": "0\/0",
> 31,32c31,32
> < "debug_journal": "1\/3",
> < "debug_ms": "0\/5",
> ---
> > "debug_journal": "0\/0",
> > "debug_ms": "0\/0",
> 34c34
> < "debug_monc": "0\/10",
> ---
> > "debug_monc": "0\/0",
> 36,37c36,37
> < "debug_tp": "0\/5",
> < "debug_auth": "1\/5",
> ---
> > "debug_tp": "0\/0",
> > "debug_auth": "0\/0",
> 39,41c39,41
> < "debug_finisher": "1\/1",
> < "debug_heartbeatmap": "1\/5",
> < "debug_perfcounter": "1\/5",
> ---
> > "debug_finisher": "0\/0",
> > "debug_heartbeatmap": "0\/0",
> > "debug_perfcounter": "0\/0",
> 132c132
> < "ms_dispatch_throttle_bytes": "104857600",
> ---
> > "ms_dispatch_throttle_bytes": "1048576000",
> 329c329
> < "objecter_inflight_ops": "1024",
> ---
> > "objecter_inflight_ops": "10240",
> 506c506
> < "osd_op_threads": "4",
> ---
> > "osd_op_threads": "20",
> 510c510
> < "osd_disk_threads": "4",
> ---
> > "osd_disk_threads": "1",
> 697c697
> < "filestore_max_inline_xattr_size": "0",
> ---
> > "filestore_max_inline_xattr_size": "254",
> 701c701
> < "filestore_max_inline_xattrs": "0",
> ---
> > "filestore_max_inline_xattrs": "6",
> 708c708
> < "filestore_max_sync_interval": "5",
> ---
> > "filestore_max_sync_interval": "10",
> 721,724c721,724
> < "filestore_queue_max_ops": "1000",
> < "filestore_queue_max_bytes": "209715200",
> < "filestore_queue_committing_max_ops": "1000",
> < "filestore_queue_committing_max_bytes": "209715200",
> ---
> > "filestore_queue_max_ops": "500",
> > "filestore_queue_max_bytes": "1048576000",
> > "filestore_queue_committing_max_ops": "5000",
> > "filestore_queue_committing_max_bytes": "1048576000",
> 758,761c758,761
> < "journal_max_write_bytes": "10485760",
> < "journal_max_write_entries": "100",
> < "journal_queue_max_ops": "300",
> < "journal_queue_max_bytes": "33554432",
> ---
> > "journal_max_write_bytes": "1048576000",
> > "journal_max_write_entries": "1000",
> > "journal_queue_max_ops": "3000",
> > "journal_queue_max_bytes": "1048576000",
>
> Good luck,
> Bob
>
> PS. thanks for ridewithgps :)
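For easier experimenting, the same changes can be written as a ceph.conf fragment rather than a config-show diff. This is only a sketch: the values are copied straight from Bob's diff above, the [global]/[osd] section split is my assumption, and the same options can also be pushed to running daemons with "ceph tell osd.* injectargs":

    [global]
    debug lockdep = 0/0
    debug context = 0/0
    debug crush = 0/0
    debug buffer = 0/0
    debug timer = 0/0
    debug filer = 0/0
    debug objecter = 0/0
    debug rados = 0/0
    debug rbd = 0/0
    debug osd = 0/0
    debug filestore = 0/0
    debug journal = 0/0
    debug ms = 0/0
    debug monc = 0/0
    debug tp = 0/0
    debug auth = 0/0
    debug finisher = 0/0
    debug heartbeatmap = 0/0
    debug perfcounter = 0/0
    ms dispatch throttle bytes = 1048576000
    objecter inflight ops = 10240

    [osd]
    osd op threads = 20
    osd disk threads = 1
    filestore max inline xattr size = 254
    filestore max inline xattrs = 6
    filestore max sync interval = 10
    filestore queue max ops = 500
    filestore queue max bytes = 1048576000
    filestore queue committing max ops = 5000
    filestore queue committing max bytes = 1048576000
    journal max write bytes = 1048576000
    journal max write entries = 1000
    journal queue max ops = 3000
    journal queue max bytes = 1048576000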
"5000", > > "filestore_queue_committing_max_bytes": "1048576000", > 758,761c758,761 > < "journal_max_write_bytes": "10485760", > < "journal_max_write_entries": "100", > < "journal_queue_max_ops": "300", > < "journal_queue_max_bytes": "33554432", > --- > > "journal_max_write_bytes": "1048576000", > > "journal_max_write_entries": "1000", > > "journal_queue_max_ops": "3000", > > "journal_queue_max_bytes": "1048576000", > > Good luck, > Bob > > PS. thanks for ridewithgps :) > > > On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer <ch...@gol.com> wrote: > >> >> Hello, >> >> On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote: >> >> > Replies in-line: >> > >> > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer >> > <c-bal...@fusioncom.co.jp> wrote: >> > >> > > >> > > Hello, >> > > >> > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote: >> > > >> > > > Hello, >> > > > >> > > > I've been trying to nail down a nasty performance issue related to >> > > > scrubbing. I am mostly using radosgw with a handful of buckets >> > > > containing millions of various sized objects. When ceph scrubs, both >> > > > regular and deep, radosgw blocks on external requests, and my >> > > > cluster has a bunch of requests that have blocked for > 32 seconds. >> > > > Frequently OSDs are marked down. >> > > > >> > > From my own (painful) experiences let me state this: >> > > >> > > 1. When your cluster runs out of steam during deep-scrubs, drop what >> > > you're doing and order more HW (OSDs). >> > > Because this is a sign that it would also be in trouble when doing >> > > recoveries. >> > > >> > >> > When I've initiated recoveries from working on the hardware the cluster >> > hasn't had a problem keeping up. It seems that it only has a problem >> with >> > scrubbing, meaning it feels like the IO pattern is drastically >> > different. I would think that with scrubbing I'd see something closer to >> > bursty sequential reads, rather than just thrashing the drives with a >> > more random IO pattern, especially given our low cluster utilization. >> > >> It's probably more pronounced when phasing in/out entire OSDs, where it >> also has to read the entire (primary) data off it. >> >> > >> > > >> > > 2. If you cluster is inconvenienced by even mere scrubs, you're really >> > > in trouble. >> > > Threaten the penny pincher with bodily violence and have that new HW >> > > phased in yesterday. >> > > >> > >> > I am the penny pincher, biz owner, dev and ops guy for >> > http://ridewithgps.com :) More hardware isn't an issue, it just feels >> > pretty crazy to have this low of performance on a 12 OSD system. >> Granted, >> > that feeling isn't backed by anything concrete! In general, I like to >> > understand the problem before I solve it with hardware, though I am >> > definitely not averse to it. I already ordered 6 more 4tb drives along >> > with the new journal SSDs, anticipating the need. >> > >> > As you can see from the output of ceph status, we are not space hungry >> by >> > any means. >> > >> >> Well, in Ceph having just one OSD pegged to max will impact (eventually) >> everything when they need to read/write primary PGs on it. >> >> More below. >> >> > >> > > >> > > > According to atop, the OSDs being deep scrubbed are reading at only >> > > > 5mb/s to 8mb/s, and a scrub of a 6.4gb placement group takes 10-20 >> > > > minutes. >> > > > >> > > > Here's a screenshot of atop from a node: >> > > > https://s3.amazonaws.com/rwgps/screenshots/DgSSRyeF.png >> > > > >> > > This looks familiar. 
>> > > Basically at this point in time the competing read requests for all the
>> > > objects clash with write requests and completely saturate your HD
>> > > (about 120 IOPS and 85% busy according to your atop screenshot).
>> > >
>> >
>> > In your experience, would the scrub operation benefit from a bigger
>> > readahead? Meaning, is it more sequential than random reads? I already
>> > bumped /sys/block/sd{x}/queue/read_ahead_kb to 512kb.
>> >
>> I played with that a long time ago (in benchmark scenarios) and didn't see
>> any noticeable improvement.
>> Deep-scrub might benefit (fragmentation could hurt it though), regular scrub
>> not so much.
>>
>> > About half of our reads are on objects with an average size of 40kb (map
>> > thumbnails), and the other half are on photo thumbs with a size between
>> > 10kb and 150kb.
>> >
>> Noted, see below.
>>
>> > After doing a little more research, I came across this:
>> >
>> > http://tracker.ceph.com/projects/ceph/wiki/Optimize_Newstore_for_massive_small_object_storage
>> >
>> > Sounds like I am probably running into issues with lots of random read
>> > IO, combined with known issues around small files. To give an idea, I
>> > have about 15 million small map thumbnails stored in my two largest
>> > buckets, and I am pushing out about 30 requests per second right now
>> > from those two buckets.
>> >
>> This is certainly a factor, but that knowledge of a future improvement
>> won't help you with your current problem of course. ^_-
>>
>> > > There are ceph configuration options that can mitigate this to some
>> > > extent and which I don't see in your config, like
>> > > "osd_scrub_load_threshold" and "osd_scrub_sleep", along with the
>> > > various IO priority settings.
>> > > However the points above still stand.
>> > >
>> >
>> > Yes, I have a running series of notes of config options to try out, just
>> > wanted to touch base with other community members before shooting in the
>> > dark.
>> >
>> osd_scrub_sleep is probably the most effective immediately available
>> option for you to prevent slow, stalled IO.
>> At the obvious cost of scrubs taking even longer.
>> There is of course also the option to disable scrubs entirely until your
>> HW has been upgraded.
>>
>> > > XFS defragmentation might help, significantly so if your FS is badly
>> > > fragmented. But again, this is only a temporary band-aid.
>> > >
>> > > > First question: is this a reasonable speed for scrubbing, given a
>> > > > very lightly used cluster? Here's some cluster details:
>> > > >
>> > > > deploy@drexler:~$ ceph --version
>> > > > ceph version 0.94.1-5-g85a68f9 (85a68f9a8237f7e74f44a1d1fbbd6cb4ac50f8e8)
>> > > >
>> > > > 2x Xeon E5-2630 per node, 64gb of ram per node.
>> > > >
>> > > More memory can help by keeping hot objects in the page cache (so the
>> > > actual disks need not be read and can write at their full IOPS
>> > > capacity). A lot of memory (and the correct sysctl settings) will also
>> > > allow for a large SLAB space, keeping all those directory entries and
>> > > other bits in memory without having to go to disk to get them.
>> > >
>> > > You seem to be just fine CPU wise.
>> > >
>> >
>> > I thought about bumping each node up to 128gb of ram as another cheap
>> > insurance policy. I'll try that after the other changes. I'd like to know
>> > why, so I'll try and change one thing at a time, though I am also just
>> > eager to have this thing stable.
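For the archive, the scrub throttles mentioned above can be tried on a live cluster without restarts. A sketch only, with illustrative values rather than recommendations:

    # sleep between scrub chunks so client IO gets a look-in (seconds)
    ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
    # don't start new scrubs while the OSD host's load average is above this
    ceph tell osd.* injectargs '--osd_scrub_load_threshold 0.5'
    # or the "disable entirely until the new HW arrives" option:
    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # and later: ceph osd unset noscrub ; ceph osd unset nodeep-scrub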
>>
>> For me everything was sweet and dandy as long as all the really hot objects
>> fit in the page cache and the FS bits were all in SLAB (no need to
>> go to disk for an "ls -R").
>>
>> Past that point it all went to molasses land "quickly".
>>
>> > > > deploy@drexler:~$ ceph status
>> > > >     cluster 234c6825-0e2b-4256-a710-71d29f4f023e
>> > > >      health HEALTH_WARN
>> > > >             118 requests are blocked > 32 sec
>> > > >      monmap e1: 3 mons at {drexler=10.0.0.36:6789/0,lucy=10.0.0.38:6789/0,paley=10.0.0.34:6789/0}
>> > > >             election epoch 296, quorum 0,1,2 paley,drexler,lucy
>> > > >      mdsmap e19989: 1/1/1 up {0=lucy=up:active}, 1 up:standby
>> > > >      osdmap e1115: 12 osds: 12 up, 12 in
>> > > >       pgmap v21748062: 1424 pgs, 17 pools, 3185 GB data, 20493 kobjects
>> > > >             10060 GB used, 34629 GB / 44690 GB avail
>> > > >                 1422 active+clean
>> > > >                    1 active+clean+scrubbing+deep
>> > > >                    1 active+clean+scrubbing
>> > > >   client io 721 kB/s rd, 33398 B/s wr, 53 op/s
>> > > >
>> > > You want to avoid having scrubs going on willy-nilly in parallel and at
>> > > high peak times, even IF your cluster is capable of handling them.
>> > >
>> > > Depending on how busy your cluster is and its usage pattern, you may do
>> > > what I did.
>> > > Kick off a deep scrub of all OSDs with "ceph osd deep-scrub \*" at, say,
>> > > 01:00 on a Saturday morning.
>> > > If your cluster is fast enough, it will finish before 07:00 (without
>> > > killing your client performance) and all regular scrubs will now
>> > > happen in that time frame as well (given default settings).
>> > > If your cluster isn't fast enough, see my initial 2 points. ^o^
>> > >
>> >
>> > The problem is our cluster is the image and upload store for our site,
>> > which is a reasonably busy international site. We have about 60% of
>> > our customers in North America, and 30% or so in Europe and Asia. We
>> > definitely would be better off with more scrubs between 11pm and 7am (-8
>> > to 0 GMT), though we can't afford to slam the cluster.
>> >
>> > I suppose that our cluster is a much more random mix of reads than many
>> > others using ceph as an RBD store. Operating systems probably have a
>> > stronger mix of sequential reads, whereas our users are concurrently
>> > viewing different pages with different images, a more random workload.
>> >
>> > It sounds like we have to maintain a cluster storage capacity of less
>> > than 25% in order to have reasonable performance. I guess this makes
>> > sense, we have much more random IO needs than storage needs.
>> >
>> In your use case (and most others) random IOPS tends to be the bottleneck
>> long, long before either space or sequential bandwidth becomes an issue.
>>
>> More spindles, more IOPS. See below. ^o^
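One possible way to wire up that Saturday-morning deep-scrub kick-off is a cron entry on an admin node. A sketch only: /etc/cron.d syntax (which includes a user field), and it assumes a client.admin keyring is available to root on that host; the deep-scrub command itself is the one quoted above:

    # /etc/cron.d/ceph-weekly-deep-scrub -- start deep scrubs of all OSDs early Saturday
    0 1 * * 6   root   /usr/bin/ceph osd deep-scrub \*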
>>
>> > > > deploy@drexler:~$ ceph osd tree
>> > > > ID WEIGHT   TYPE NAME       UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> > > > -1 43.67999 root default
>> > > > -2 14.56000     host paley
>> > > >  0  3.64000         osd.0        up  1.00000          1.00000
>> > > >  3  3.64000         osd.3        up  1.00000          1.00000
>> > > >  6  3.64000         osd.6        up  1.00000          1.00000
>> > > >  9  3.64000         osd.9        up  1.00000          1.00000
>> > > > -3 14.56000     host lucy
>> > > >  1  3.64000         osd.1        up  1.00000          1.00000
>> > > >  4  3.64000         osd.4        up  1.00000          1.00000
>> > > >  7  3.64000         osd.7        up  1.00000          1.00000
>> > > > 11  3.64000         osd.11       up  1.00000          1.00000
>> > > > -4 14.56000     host drexler
>> > > >  2  3.64000         osd.2        up  1.00000          1.00000
>> > > >  5  3.64000         osd.5        up  1.00000          1.00000
>> > > >  8  3.64000         osd.8        up  1.00000          1.00000
>> > > > 10  3.64000         osd.10       up  1.00000          1.00000
>> > > >
>> > > > My OSDs are 4tb 7200rpm Hitachi DeskStars, using XFS, with Samsung
>> > > > 850 Pro journals (very slow, ordered s3700 replacements, but
>> > > > shouldn't pose problems for reading as far as I understand things).
>> > >
>> > > Just to make sure, these are genuine DeskStars?
>> > > I'm asking both because AFAIK they are out of production and because their
>> > > direct successors, the Toshiba DT drives, (can) have a nasty firmware
>> > > bug that totally ruins their performance (from ~8 hours per week to
>> > > permanently until power-cycled).
>> > >
>> >
>> > These are original DeskStars. Didn't realize they weren't in production;
>> > I just grabbed 6 more of the Hitachi DeskStar NAS edition 4tb drives,
>> > which are readily available. I probably should have ordered 6tb drives,
>> > as I'd end up with better seek times due to them not being fully
>> > utilized - the data would reside closer to the center of the platters.
>> >
>> Ah, DeskStar NAS, yes, those are still in production.
>>
>> I'd get more, smaller, faster HDDs instead.
>> HW cache on your controller can also help (depending on the model/FW and
>> whether it is used efficiently in JBOD mode).
>>
>> And since your space utilization is small (though of course that can and
>> will change over time), you may very well benefit from going SSD.
>>
>> SSD pools, if you think you can fit (economically) a set of your high
>> access data like the thumbnails on them.
>>
>> SSD cache tiers are a bit more dubious when it comes to rewards, but that
>> depends a lot on the hot data set.
>> Plenty of discussion in here about that.
>>
>> Regards,
>>
>> Christian
>>
>> > > Regards,
>> > >
>> > > Christian
>> > >
>> > > > MONs are co-located with OSD nodes, but the nodes are fairly beefy
>> > > > and have very low load.
>> > > > Drives are on an expander backplane, with an LSI SAS3008 controller.
>> > > >
>> > > > I have a fairly standard config as well:
>> > > >
>> > > > https://gist.github.com/kingcu/aae7373eb62ceb7579da
>> > > >
>> > > > I know that I don't have a ton of OSDs, but I'd expect a little
>> > > > better performance than this. Check out munin for my three nodes:
>> > > >
>> > > > http://munin.ridewithgps.com/ridewithgps.com/drexler.ridewithgps.com/index.html#disk
>> > > > http://munin.ridewithgps.com/ridewithgps.com/paley.ridewithgps.com/index.html#disk
>> > > > http://munin.ridewithgps.com/ridewithgps.com/lucy.ridewithgps.com/index.html#disk
>> > > >
>> > > > Any input would be appreciated, before I start trying to
>> > > > micro-optimize config params, as well as upgrading to Infernalis.
>> > > >
>> > > > Cheers,
>> > > >
>> > > > Cullen
>> > >
>> > > --
>> > > Christian Balzer        Network/Systems Engineer
>> > > ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> > > http://www.gol.com/
>>
>> --
>> Christian Balzer        Network/Systems Engineer
>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>> http://www.gol.com/
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com