Question #244564 on Graphite changed: https://answers.launchpad.net/graphite/+question/244564
Travis Groth posted a new comment:

That’s not a very helpful answer or comment, Jason. I was hoping to get information on how to debug the issue, not from-the-hip ranting about configuration details that should not matter.

- Network storage shouldn’t break carbon, assuming it can keep up (but we’re not using it anyway).

- Setting a cache size max shouldn’t be a “bad” idea; memory bounding is a good thing. We have configured our system both with and without a limit. In this case, a non-inf limit appears broken due to a very long-standing yet unfixed bug: https://github.com/graphite-project/carbon/issues/167. If I turn on a cache size limit now, carbon becomes unresponsive almost immediately. So, granted, non-inf is a bad idea, but only because cache sizing is currently broken in 0.9.12. (The relevant knobs are in the first snippet at the end of this mail.)

- Running in a VM has no impact on carbon’s stability as long as the I/O keeps up, and in our case it does: our I/O wait was under 20% at any given time.

I have taken straces during the issue, and will do so again if needed, but they revealed nothing meaningful to me or the team here (mostly futex calls).

That said, we’ve done further research, and the “bug” isn’t so much a bug as unintuitive behavior when carbon-cache is CPU bound. It accepts and caches metrics immediately (seemingly as the highest-priority activity/task/event/callback/whatever), but since a carbon instance is effectively limited to roughly one CPU, it may not have enough processing time left to actually flush the cache to the back end. This produces a very dramatic tipping point in CPU usage that is hard to observe from the OS: if that “100%” CPU isn’t balanced between accepting metrics and actually doing something with them, you continuously fall behind and your memory footprint keeps growing. (Carbon’s own carbon.agents.* instrumentation shows this far more directly than OS-level numbers; see the second snippet below.)

This turns into a failure in two ways:

1) As the cache grows, carbon slows down until it is unusable and/or gets OOM-killed.
2) Past some cache size it never actually flushes to disk, even when you stop the daemon with no MAX_UPDATES_PER_SECOND_ON_SHUTDOWN set. We might have 5 GB of metrics cached (say, two hours of data), and when we stop the process we lose most of those two hours.

So we wound up configuring a carbon-cache instance for every vCPU on the system and then putting carbon-relay in front of all of them with consistent hashing (third snippet below). We now appear to be keeping up, though carbon-relay itself is approaching being CPU bound. Once that happens, I imagine our configuration will get even more complicated; the last snippet sketches what that next layer would probably look like.

It would be very nice if carbon could, itself, scale to multiple CPUs without the administrator orchestrating haproxy, multiple layers of processes, port differentiation, hashing, etc.
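
For reference, here are the cache-bounding knobs in the [cache] section of carbon.conf. The values below are illustrative, not a recommendation; on 0.9.12 any finite MAX_CACHE_SIZE trips the issue #167 behavior for us:

    [cache]
    # Upper bound on datapoints held in memory; "inf" disables the bound.
    # Any finite value here is what renders carbon unresponsive for us.
    MAX_CACHE_SIZE = 2000000
    # Writer throttle. With USE_FLOW_CONTROL enabled, carbon pauses its
    # receivers once the cache is full instead of growing without bound.
    MAX_UPDATES_PER_SECOND = 500
    USE_FLOW_CONTROL = True
    # We leave MAX_UPDATES_PER_SECOND_ON_SHUTDOWN unset.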
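
On “hard to observe from the OS”: the numbers that do show the tipping point are carbon’s self-instrumentation, which each instance publishes under carbon.agents.<host>-<instance>. Two render-API queries worth graphing (hostname illustrative):

    # Cached datapoints per instance; sustained growth means you are
    # falling behind, long before the OS shows anything unusual.
    http://graphite.example.com/render?target=carbon.agents.*.cache.size&from=-4hours

    # Intake vs. flush rate; when metricsReceived persistently outruns
    # committedPoints, the cache can only grow.
    http://graphite.example.com/render?target=carbon.agents.*.metricsReceived&target=carbon.agents.*.committedPoints&from=-4hours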
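
Here is the shape of the configuration we ended up with, trimmed to two cache instances for brevity (the real box has one [cache:X] section per vCPU; all ports are illustrative). The relay consistent-hashes each metric name across DESTINATIONS and forwards over the instances’ pickle ports:

    # carbon.conf
    [cache:a]
    LINE_RECEIVER_PORT = 2103
    PICKLE_RECEIVER_PORT = 2104
    CACHE_QUERY_PORT = 7102

    [cache:b]
    LINE_RECEIVER_PORT = 2203
    PICKLE_RECEIVER_PORT = 2204
    CACHE_QUERY_PORT = 7202

    [relay]
    LINE_RECEIVER_PORT = 2003
    PICKLE_RECEIVER_PORT = 2014
    RELAY_METHOD = consistent-hashing
    DESTINATIONS = 127.0.0.1:2104:a, 127.0.0.1:2204:b

Each writer is started with its instance name, e.g. “carbon-cache.py --instance=a start”, and the webapp needs a matching CARBONLINK_HOSTS list in local_settings.py (["127.0.0.1:7102:a", "127.0.0.1:7202:b"]) so queries can pull hot data from the right instance’s cache.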
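
Once the relay saturates its core, the obvious (if ugly) next layer is several identically configured relay instances behind a dumb TCP balancer; because every relay computes the same consistent hash, a metric can enter through any relay and still land on the right cache instance. A hypothetical haproxy front end, purely a sketch with made-up ports:

    # relays moved off 2003; haproxy takes over the well-known port
    frontend graphite_line_in
        bind *:2003
        mode tcp
        default_backend carbon_relays

    backend carbon_relays
        mode tcp
        balance roundrobin
        server relay_a 127.0.0.1:2013 check
        server relay_b 127.0.0.1:2023 check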

