Gehel added a comment.

Comments inline. Keep in mind that my understanding of GC is limited; I am most probably wrong about a lot of what I write below. And my understanding of G1 is even more limited...

So I took a look at our logs from Sep 29 with http://gceasy.io/ and came to the following conclusions (could be totally misguided, please tell me if so):

  • The log file covers about 5 to 6 hours. I think this should be enough, but if there are daily patterns it could be useful to have a larger data set to analyze as a whole. We could probably do this by bumping the rotated file size from 20M to ~100M.

gceasy allows you to upload multiple files in a .tar.gz archive, so there is no need to change the config.
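For example, something along these lines should do (the log path and file pattern are assumptions, to be adjusted to wherever Blazegraph writes its GC logs):

  # bundle all rotated GC log files into one archive and upload that to gceasy
  cd /var/log/wdqs
  tar czf wdqs-gc-logs.tar.gz gc.*.log*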

  • GC consumes 7-8% of CPU time. I have no idea whether that's normal for a Java service or not.

No, it is not. Anything above 5% is suspicious. For comparison, elastic1023 is at a little bit over 1%.
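As a back-of-the-envelope figure: over the ~5.5 hours covered by the log, 7-8% means roughly 25 minutes spent in GC pauses; at elastic1023's ~1%, the same window would be about 3 minutes.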

  • There are four long-pause GC events I found in the logs for the last day. One took 40s. From what I see, three of them were in quick succession (around 3:08 am) - which indicates the same failure scenario I alluded to before - instead of failing the allocation, the JVM starts obsessively GCing, trying to eke out that last byte. This is the failure mode I am most concerned about. Not sure what the alternative is here - as I understand it, Java has no safe out-of-memory failure mode? It's either slow death by GC or fast death by OOM?

Actually, there are many failure modes (what fun would it be without them).

When no heap is available for an allocation (either not enough space, or too much fragmentation), a GC is run. If, after a certain amount of effort, not enough memory can be freed to satisfy the allocation, an OOME is raised. That's the easy scenario.
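If we do hit that easy scenario, it would be nice to capture the heap at that exact moment so we can see what is filling it. These are standard HotSpot flags; the dump path is an assumption and needs enough free disk space:

  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/var/log/wdqs/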

As long as the GC can free up memory for allocations, no OOME occurs. But if the application keeps asking for allocations, GC will start again. Trouble begins when the GC rate is close to the allocation rate: the GC can only just keep up, and throughput decreases as GC pressure rises, keeping things mostly in balance.
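A simple way to see whether we are in that regime is to watch the occupancy of the generations and the GC counters live while the server is under load, for example with jstat (the pid lookup is only an illustration):

  # sample eden / survivor / old occupancy and accumulated GC time every 5 seconds
  jstat -gcutil $(pgrep -f blazegraph) 5s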

Consecutive full GCs are yet another problem. The main reason for a full GC (as far as I understand) is that the old gen needs to be collected. A full GC implies a longer stop-the-world phase. The cause is usually premature promotion, or an old gen that is too large (and thus more expensive to collect). It does not necessarily mean that no heap is available, but rather that it isn't available in the right place. To minimize this, we should ensure that only long-lived objects make it to the old gen and that no premature promotion occurs. Note that too much allocation can result in the same symptoms, with new allocations taking all the new gen space and pushing too much to the old gen.
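To check whether premature promotion is actually happening, we could add a bit more detail to the GC logging (JDK 8 style flags, assuming that is what we run):

  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintTenuringDistribution    # age distribution in the survivor spaces, shows how fast objects are promoted
  -XX:+PrintAdaptiveSizePolicy      # shows why G1 resizes the young gen and triggers collections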

  • Cross-referencing these events with the WDQS dashboards shows that they blocked updates, causing a several-minute lag in updates. Which means the server was essentially in a coma for several minutes, but then recovered. The good news is that it did recover; the bad news is that it should never happen in the first place.

This makes me think that we have an increased allocation rate for some time (a few queries allocating much more, or allocating objects more expensive to collect). Once the workload is back to normal, the JVM recovers. That's only an assumption, and I'm not sure how to invalidate it...
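One cheap way to test it would be to grab a class histogram during one of those lag episodes and compare it with a quiet period (jcmd ships with the JDK; the pid lookup is only an illustration, and note that the histogram counts live objects, so it can itself trigger a GC):

  jcmd $(pgrep -f blazegraph) GC.class_histogram > histogram-$(date +%s).txt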

  • Heap usage after GC is reported just under 6G. I think that means making heap size less than 6G is asking for trouble. There are several peaks reaching 12G.

What is important is the minima of heap usage after GC, which in the logs I sampled seem to be fairly consistently between 2 and 3 GB. The minima are important because, in most cases, GC will only free what it needs, leaving a lot of garbage in place. The minima are when we are closest to having no garbage left... By slowly reducing the overall heap size, we will see better what the reasonable limit is.
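If we want the actual numbers rather than eyeballing graphs, the per-pause heap summary can be pulled straight out of the logs (this assumes the JDK 8 G1 -XX:+PrintGCDetails format, where each pause logs a "Heap: before(capacity)->after(capacity)" line; the file name is a placeholder):

  # extract the heap occupancy before and after every pause
  grep -o 'Heap: [^]]*' gc.log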

  • GC times mostly stay under 5 sec, but there are many over 1 sec; it would probably be a good idea to keep them down.

I'm used to GC taking less than 100ms in most cases. Again, elastic1023 for comparison: 99.849% of GCs in less than 1 sec, avg GC time 50ms, longest GC ~3 seconds.

  • Old Gen size stays mostly between 5-6G, which seems to indicate that this is what long-term Blazegraph data structures need (unless I misunderstand what Old Gen means). There are several spikes, which I suspect are due to long-running queries that take so much time that normally transient structures get promoted to old gen, but I may be wrong here. This conclusion is also supported by the fact that the peaks are generally short-lived (which I interpret as a query finally finishing or timing out and its memory allocation being released).

Again, what is interesting are the minima of old gen after GC, which I see between 2 and 3 GB.

  • Young gen size goes from 6-7G before GC to under 1G after GC. That means there's probably some space for reducing the heap size if we want to have more frequent but less costly cleanups.

Agreed!
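A simple way to run that experiment is to step the heap down gradually and compare the GC metrics at each step. Assuming, for the sake of illustration, that we currently run with a 16G heap (to be checked against our actual config), something like:

  # current (assumed): -Xms16g -Xmx16g
  # step 1:            -Xms12g -Xmx12g
  # step 2:            -Xms10g -Xmx10g   # staying well above the 2-3 GB minima observed after GC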

  • Not sure why our main GC cause is "GCLocker Initiated GC". Maybe because Blazegraph uses the native process heap for its internal memory manager, instead of the Java object heap, so each time it does something there it's a JNI call?

I don't understand that one either and it looks suspicious to me. elastic1023 has "GCLocker Initiated GC" as a cause for 0.05% of GCs.
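If we want to confirm where those GCLocker-initiated collections come from, I believe HotSpot has a flag that logs when an allocation is stalled because a thread sits in a JNI critical section (worth double-checking that it exists on our JDK version):

  -XX:+PrintJNIGCStalls    # print a message when an allocation is stalled by a JNI critical section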

Some ideas:

  • Can we play with -XX:MaxTenuringThreshold and see if it gets any better results? Old gen size shows some volatility, which IMHO may mean that some objects which are not supposed to be "old" get into the "old" space. From my understanding of how "old" is supposed to work, old gen space should be more stable size-wise?

Yes, but only after tuning the overall heap size to something reasonable.
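For reference, the change itself is a one-liner, and if I remember correctly the G1 default is already 15 (the maximum), so in practice there is only room to lower it; the tenuring distribution logging mentioned above should tell us what a sensible value would be (the 10 below is just a placeholder):

  -XX:MaxTenuringThreshold=10    # range 0-15; lower values promote objects to the old gen sooner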

  • Should we adjust -Xmn and/or -XX:NewRatio to maybe give more space to old gen without growing young gen space too much?

My intuition tells me that we should probably reduce old gen and increase young to avoid premature promotion. But yes, we can play with it. Again, only after having a reasonable overall heap size.
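As a concrete illustration of what these knobs mean (the sizes are examples, not a recommendation):

  # -Xmx12g -XX:NewRatio=2  ->  young gen ~4g, old gen ~8g (old:young = 2:1)
  # -Xmx12g -Xmn6g          ->  young gen fixed at 6g, regardless of NewRatio

Note that with G1, pinning the young gen size is generally discouraged because it limits G1's ability to meet its pause-time goal, so whatever we try here should be measured carefully.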

  • Should we set -XX:MaxGCPauseMillis? I know it's not a hard limit but maybe hinting the VM that 40s GC is not OK would help?

I think it is very unlikely that this will help. Basically, setting a max pause time will decrease throughput to improve latency, and we are already not doing well on throughput.

  • Should we adjust XX:InitiatingHeapOccupancyPercent so that GC would try to collect young gen sooner, thus potentially resulting in shorter GCs?

Maybe worth playing with...
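For reference, the G1 default is -XX:InitiatingHeapOccupancyPercent=45, i.e. the concurrent marking cycle starts when overall heap occupancy crosses 45%. Lowering it makes marking (and thus old gen cleanup) start earlier, at the cost of more concurrent work, for example:

  -XX:InitiatingHeapOccupancyPercent=30    # start concurrent marking earlier than the default 45%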

In the end, we also need to understand why this GC is working so much harder and needs more tuning. There might be good reasons for it, but without some understanding of the kind of workload the GC is seeing, it is hard to know what to do. Or to understand whether the GC issues we are seeing are indeed a cause, or just a consequence.


TASK DETAIL
https://phabricator.wikimedia.org/T175919
