>
>
>> I'm sure there are many more useful things to graph.  One of the things I'm
>> interested in (but haven't found time to research yet) is the journal
>> usage, with maybe some alerts if the journal is more than 90% full.
>>
>
> This is not likely to be an issue with the default journal config since
> the wbthrottle code is pretty aggressive about flushing the journal to
> avoid spiky client IO.  Having said that, I tend to agree that we need to
> do a better job of documenting everything from the perf counters to the
> states described in dump_historic_ops.  Even internally it can get
> confusing trying to keep track of what's going on where.
>
> Mark
>
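
Agreed that the perf counters and the dump_historic_ops states could use
better docs.  For what it's worth, dump_historic_ops over the admin socket is
what I've been poking at to see where slow ops spend their time.  A rough
sketch (assuming a local admin socket for osd.0, and guessing at the JSON
field names, which seem to vary by release):

    # Print duration and description for recent slow ops on a local OSD.
    # Assumes the ceph CLI is installed on the OSD host.
    import json
    import subprocess

    def historic_ops(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.{}".format(osd_id), "dump_historic_ops"])
        return json.loads(out)

    if __name__ == "__main__":
        dump = historic_ops(0)
        # Some releases key the list as "Ops", others as "ops".
        for op in dump.get("ops", dump.get("Ops", [])):
            print(op.get("duration"), op.get("description"))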

I've always had issues during deep-scrubbing, particularly when there is a
lot of deep-scrubbing going on for a long time.  For example, I left
nodeep-scrub set for a month, and things were pretty painful when I unset it.
Everything was fine at first, but after ~8 hours I started getting slow
requests, then OSDs were marked down for being unresponsive.
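
If I hit this again, one thing I might try is clamping the scrub knobs before
unsetting the flag, so the backlog drains more slowly.  A sketch, assuming
the stock option names osd_max_scrubs and osd_scrub_sleep (defaults and
behaviour differ a bit by release, so double-check before relying on it):

    # Push conservative scrub settings to every OSD, then re-enable deep scrub.
    import subprocess

    def injectargs(args):
        # "ceph tell osd.* injectargs ..." changes runtime config on all OSDs.
        subprocess.check_call(["ceph", "tell", "osd.*", "injectargs", args])

    injectargs("--osd_max_scrubs 1")     # at most one scrub per OSD at a time
    injectargs("--osd_scrub_sleep 0.1")  # short pause between scrub chunks
    subprocess.check_call(["ceph", "osd", "unset", "nodeep-scrub"])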

So "full journals" is just my most recent theory.  I haven't figured out
how to test my theory.  I've tested (and fixed) a lot of other issues,
which have made things better.
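
One way I might try to test it (and it would also cover the ">90% full"
alerting idea from earlier in the thread) is to poll the journal counters out
of "perf dump" on each local OSD.  A minimal sketch, assuming FileStore-style
counters like filestore/journal_queue_bytes (the names are from memory and
may differ by release, so check your own perf dump output first) and a
hypothetical 5 GB journal:

    # Poll the journal queue depth on a few local OSDs every 10 seconds.
    import json
    import subprocess
    import time

    JOURNAL_BYTES = 5 * 1024**3        # assumed journal size; adjust to yours

    def perf_dump(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.{}".format(osd_id), "perf", "dump"])
        return json.loads(out)

    while True:
        for osd_id in (0, 1, 2):       # whichever OSDs live on this host
            fs = perf_dump(osd_id).get("filestore", {})
            queued = fs.get("journal_queue_bytes", 0)
            print("osd.{}: journal queue at {:.0%} of {} bytes".format(
                osd_id, queued / float(JOURNAL_BYTES), JOURNAL_BYTES))
        time.sleep(10)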


It's less of a problem now with journals on SSDs, but it's something I ran
into several times when my journals were on the HDDs.  Even with the SSD
journals, if I do something that affects ~20% of my OSDs, I start having
issues.  I only have 5 nodes, and I can trigger this by re-formatting all of
the OSDs on one node.  I haven't (yet) had problems with smaller operations
that affect less than 5% of my OSDs.  My disks are 4 TB and ~70% full, and a
fresh format takes 24-48 hours to backfill.
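
The main thing that has helped with those big rebuilds is turning the
recovery/backfill knobs down before starting and raising them again
afterwards.  A sketch along the same lines as above, assuming the stock
option names (defaults differ by release):

    # Make backfill as gentle as possible so client IO wins over recovery.
    import subprocess

    for arg in ("--osd_max_backfills 1",         # one backfill per OSD
                "--osd_recovery_max_active 1",   # one active recovery op per OSD
                "--osd_recovery_op_priority 1"): # lowest recovery priority
        subprocess.check_call(["ceph", "tell", "osd.*", "injectargs", arg])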