>> I'm sure there are many more useful things to graph. One of the things
>> I'm interested in (but haven't found time to research yet) is the
>> journal usage, with maybe some alerts if the journal is more than 90%
>> full.
>
> This is not likely to be an issue with the default journal config since
> the wbthrottle code is pretty aggressive about flushing the journal to
> avoid spiky client IO. Having said that, I tend to agree that we need to
> do a better job of documenting everything from the perf counters to the
> states described in dump_historic_ops. Even internally it can get
> confusing trying to keep track of what's going on where.
>
> Mark
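For graphing, the closest thing I've found so far is the journal queue counters in `ceph daemon osd.N perf dump`. Here's a rough, untested sketch of the kind of check I had in mind; the counter names under the "filestore" section may differ between versions, and the OSD id and threshold are just placeholders:

#!/usr/bin/env python
# Untested sketch: poll one OSD's journal queue counters and warn when the
# queue is close to its configured maximum. Counter names come from the
# FileStore perf counters; exact keys may vary by Ceph version.
import json
import subprocess

OSD_ID = 0          # hypothetical: the local OSD to poll
THRESHOLD = 0.90    # warn when the journal queue is >90% of its max

out = subprocess.check_output(
    ["ceph", "daemon", "osd.%d" % OSD_ID, "perf", "dump"])
fs = json.loads(out)["filestore"]

used = fs["journal_queue_bytes"]
cap = fs["journal_queue_max_bytes"]

# cap can be 0 if the queue is unlimited; skip the check in that case.
if cap and float(used) / cap > THRESHOLD:
    print("osd.%d journal queue %d/%d bytes (>%.0f%%)"
          % (OSD_ID, used, cap, THRESHOLD * 100))

Wiring that into a cron job or a collectd/graphite exec plugin would give the graph and the alert in one go.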
I've always had issues during deep-scrubbing, particularly when there is a lot of deep-scrubbing going on for a long time. For example, I left nodeep-scrub set for a month, and things were pretty painful when I unset it. Everything was fine at first, but after ~8 hours I started getting slow requests, then OSDs marked down for being unresponsive.

So "full journals" is just my most recent theory; I haven't figured out how to test it yet. I've tested (and fixed) a lot of other issues, which have made things better. It's less of a problem now with journals on SSD, but it's something I ran into several times when my journals were on the HDDs.

Even with the SSD journals, if I do something that affects ~20% of my OSDs, I start having issues. I only have 5 nodes, and I can trigger this by re-formatting all of the OSDs on one node. I haven't (yet) had problems with smaller operations that affect less than 5% of my OSDs. My disks are 4 TB, ~70% full, and a fresh format takes 24-48 hours to backfill.
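If I hit the deep-scrub backlog again, my plan is to unset nodeep-scrub but ration the catch-up scrubs myself rather than letting a month of backlog run all at once. Something like this untested sketch; it assumes the `ceph pg dump --format json` output exposes a last_deep_scrub_stamp per PG (the JSON layout varies between versions), and the batch size and pause are made-up numbers:

#!/usr/bin/env python
# Untested sketch: deep-scrub the stalest PGs a few at a time instead of
# unsetting nodeep-scrub and letting the whole backlog start at once.
import json
import subprocess
import time

BATCH = 5        # hypothetical: PGs to kick off per pass
PAUSE = 600      # seconds to wait between batches

out = subprocess.check_output(["ceph", "pg", "dump", "--format", "json"])
dump = json.loads(out)

# Older versions put pg_stats at the top level, newer ones under pg_map.
stats = dump.get("pg_stats") or dump.get("pg_map", {}).get("pg_stats", [])

# Timestamps are "YYYY-MM-DD hh:mm:ss..." strings, so a lexicographic
# sort puts the oldest deep-scrubs first.
stats.sort(key=lambda pg: pg["last_deep_scrub_stamp"])

for i in range(0, len(stats), BATCH):
    for pg in stats[i:i + BATCH]:
        subprocess.check_call(["ceph", "pg", "deep-scrub", pg["pgid"]])
    time.sleep(PAUSE)

That should keep the concurrent deep-scrub load bounded while the stamps catch up, which is roughly what I suspect the cluster can't absorb when everything starts at once.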