Hi,

> On 24 May 2016, at 09:16, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> On Tue, 24 May 2016 07:03:25 +0000 Josef Johansson wrote:
>
>> Hi,
>>
>> You need to monitor latency instead of peak points. As Ceph writes to
>> two other nodes when you have 3 replicas, that is 4x the latency
>> compared to a single round trip from the client to the first OSD. So
>> smaller and more frequent IO means more latency pain.
>>
> While very true, I don't think latency (as in network or Ceph code
> related) is causing his problems.
>
> 30+ second slow requests tend to be nearly exclusively the domain of the
> I/O system being unable to keep up.
> And I'm pretty sure it's the HDD based EC pool.
>
> Not just monitoring Ceph counters but also the actual disk stats can be
> helpful, keeping in mind that it takes only one slow OSD, node, or wonky
> link to bring everything to a standstill.
>
I agree, more counters to identify the problem. Is there any way nowadays
to automatically find out which OSD is causing the slow requests? Like, a
graphical way of doing it :)

>> And the worst thing is that there is nothing that actually shows this,
>> AFAIK; ceph osd perf shows some latency. A slow CPU could hamper
>> performance even if it's showing no sign of it.
>>
>> I believe you can see which operations are running right now and what
>> they are waiting on; I think there's a thread on this ML regarding
>> deadlocks and slow requests that could be interesting.
>>
>> I did not see Christian's responses either, so maybe not a problem with
>> your client.
>>
> The list (archive) and I sure did:
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009464.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009744.html
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009747.html
>

So another Subject line for that issue. That explains it.

Regards,
Josef

> Christian
>
>> Regards,
>> Josef
>>
>> On Tue, 24 May 2016, 08:49 Peter Kerdisle, <peter.kerdi...@gmail.com>
>> wrote:
>>
>>> Hey Christian,
>>>
>>> I honestly haven't seen any replies to my earlier message. I will go
>>> back through my email and make sure I find it, my apologies.
>>>
>>> I am graphing everything with collectd and graphite, which is what
>>> makes it so frustrating, since I am not seeing any obvious pain points.
>>>
>>> I am basically using the pool in read-forward now, so there should be
>>> almost no promotion from EC to the SSD pool. I will see what options I
>>> have for adding some SSD journals to the OSD nodes to help speed
>>> things along.
>>>
>>> Thanks, and apologies again for missing your earlier replies.
>>>
>>> Peter
>>>
>>> On Tue, May 24, 2016 at 4:25 AM, Christian Balzer <ch...@gol.com>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> On Mon, 23 May 2016 10:45:41 +0200 Peter Kerdisle wrote:
>>>>
>>>>> Hey,
>>>>>
>>>>> Sadly I'm still battling this issue. I did notice one interesting
>>>>> thing.
>>>>>
>>>>> I changed the cache settings for my cache tier to add redundancy to
>>>>> the pool, which meant a lot of recovery activity on the cache.
>>>>> During all this there were absolutely no slow requests reported. Is
>>>>> there anything I can conclude from that information? Is it possible
>>>>> that not having any SSDs for journals could be the bottleneck on my
>>>>> erasure pool, and that's what is generating slow requests?
>>>>>
>>>> That's what I suggested in the first of my 3 replies to your original
>>>> thread, which seemingly got ignored judging by the lack of a reply.
>>>>
>>>>> I simply can't imagine why a request can be blocked for 30 or even
>>>>> 60 seconds. It's getting really frustrating not being able to fix
>>>>> this and I simply don't know what else I can do at this point.
>>>>>
>>>> Are you monitoring your cluster, especially the HDD nodes?
>>>> Permanently with collectd and graphite (or similar) and topically
>>>> (especially during tests) with atop (or iostat, etc.)?
>>>>
>>>> And one more time, your cache tier can only help you if it is fast
>>>> and large enough to sufficiently dis-engage you from your slow
>>>> backing storage.
>>>> And let's face it, an EC pool AND no journal SSDs will be quite slow.
>>>>
>>>> So if your cache is dirty all the time and has to flush (causing IO
>>>> on the backing storage) while there is also promotion from the
>>>> backing storage to the cache going on, you're basically down to the
>>>> base speed of your EC pool, at least for some of your ops.
>>>>
>>>> Christian
>>>>
>>>>> If anybody has anything I haven't tried before, please let me know.
>>>>>
>>>>> Peter
>>>>>
>>>>> On Thu, May 5, 2016 at 10:30 AM, Peter Kerdisle
>>>>> <peter.kerdi...@gmail.com> wrote:
>>>>>
>>>>>> Hey guys,
>>>>>>
>>>>>> I'm running into an issue with my cluster during high activity.
>>>>>>
>>>>>> I have two SSD cache servers (2 SSDs for journals, 7 SSDs for
>>>>>> data) with 2x10Gbit bonded each, and six OSD nodes with a 10Gbit
>>>>>> public and 10Gbit cluster network for the erasure pool (10x3TB
>>>>>> without separate journal). This is all on Jewel.
>>>>>>
>>>>>> It's working fine under normal load. However, when I force
>>>>>> increased activity by lowering the cache_target_dirty_ratio to
>>>>>> make sure my files are promoted, things start to go amiss.
>>>>>>
>>>>>> To give an example:
>>>>>> http://pastie.org/private/5k5ml6a8gqkivjshgjcedg
>>>>>>
>>>>>> This is especially concerning: pgs: 9
>>>>>> activating+undersized+degraded, 48 active+undersized+degraded,
>>>>>> 1 stale+active+clean, 27 peering.
>>>>>>
>>>>>> Here is another minute or so where I grepped for warnings:
>>>>>> http://pastie.org/private/bfv3kxl63cfcduafoaurog
>>>>>>
>>>>>> These warnings are generated all over the OSD nodes, not
>>>>>> specifically on one OSD or even one node.
>>>>>>
>>>>>> During this time the different OSD logs show various warnings:
>>>>>>
>>>>>> 2016-05-05 10:04:10.873603 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.873605 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.905997 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906000 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906022 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.906027 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.949894 7f1af3f2d700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af3f2d700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.956801 7f1afc73e700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.956833 7f1afaf3b700 1 heartbeat_map is_healthy 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>> 2016-05-05 10:04:10.957816 7f1af5f31700 1 heartbeat_map reset_timeout 'OSD::osd_op_tp thread 0x7f1af5f31700' had timed out after 15
>>>>>>
>>>>>> or
>>>>>>
>>>>>> 2016-05-05 10:03:27.269658 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:07.269651)
>>>>>> 2016-05-05 10:03:28.269838 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:08.269831)
>>>>>> 2016-05-05 10:03:29.269998 7f98cca35700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.269992)
>>>>>> 2016-05-05 10:03:29.801145 7f98b339a700 -1 osd.6 7235 heartbeat_check: no reply from osd.25 since back 2016-05-05 10:03:06.566276 front 2016-05-05 10:03:06.566276 (cutoff 2016-05-05 10:03:09.801142)
>>>>>> 2016-05-05 10:04:06.275237 7f98cca35700 0 log_channel(cluster) log [WRN] : 1 slow requests, 1 included below; oldest blocked for > 30.910385 secs
>>>>>> 2016-05-05 10:04:06.275252 7f98cca35700 0 log_channel(cluster) log [WRN] : slow request 30.910385 seconds old, received at 2016-05-05 10:03:35.364796: osd_op(osd.70.6640:6555588 4.1041c254 rbd_data.afd7564682858.0000000000014daa [copy-from ver 105993] snapc 0=[] ondisk+write+ignore_overlay+enforce_snapc+known_if_redirected e7235) currently commit_sent
>>>>>>
>>>>>> I've followed the instructions on
>>>>>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/
>>>>>> to hopefully find out what's happening, but as far as the hardware
>>>>>> is concerned everything looks fine. No SMART errors logged, iostat
>>>>>> shows some activity but nothing pegged at 100%, no messages in
>>>>>> dmesg, and the CPUs are only used at around 25% max.
>>>>>>
>>>>>> This makes me think it might be network related. I have a 10Gbit
>>>>>> public and 10Gbit cluster network, neither of which seems to hit
>>>>>> any limits. There is one thing that might be a problem: one of the
>>>>>> cache nodes has a bonded interface and no access to the cluster
>>>>>> network, while the other cache node has both a public and a
>>>>>> cluster interface.
>>>>>>
>>>>>> Could anybody give me some more steps I can take to further
>>>>>> discover where this bottleneck lies?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>
>>>> --
>>>> Christian Balzer        Network/Systems Engineer
>>>> ch...@gol.com           Global OnLine Japan/Rakuten Communications
>>>> http://www.gol.com/
>>>
>>> _______________________________________________
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
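
P.S. For what it's worth, a rough sketch of the commands I would reach for
to narrow down which OSD is behind the slow requests (from memory, on
Jewel; exact output fields may differ between releases, and osd.N below is
just a placeholder for whichever OSD you suspect):

  # which OSDs currently have blocked/slow requests; health detail names them
  ceph health detail | grep -i blocked

  # per-OSD commit/apply latency; a single outlier usually stands out here
  ceph osd perf

  # on the node hosting the suspect OSD: what its slowest recent ops and
  # current in-flight ops are waiting on (queued_for_pg, sub ops, commit_sent, ...)
  ceph daemon osd.N dump_historic_ops
  ceph daemon osd.N dump_ops_in_flight

  # and the raw disk picture on that node while the slow requests happen
  iostat -xk 2

Nothing graphical as far as I know, but graphing the per-OSD latencies from
ceph osd perf over time (e.g. with the collectd/graphite setup Peter already
has) makes a single slow OSD stand out fairly quickly.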
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com