Gehel added a comment.

Using a demo version of jClarity Censum, I see the following:

Problems

Premature promotion:

There are a number of possible causes for this problem:

  1. Survivor spaces are too small.
  2. Young gen is too small.
  3. The -XX:MaxTenuringThreshold flag may have been set too low.

There are a number of possible solutions for this problem:

  1. Alter the size of the young space via the -XX:NewRatio property.
  2. Alter the size of Survivor Spaces (relative to Eden) via the -XX:SurvivorRatio=<N> property using the information provided by the Tenuring graphs. This flag works to divide Young into N + 2 chunks. N chunks will be assigned to Eden and 1 chunk each will be assigned to the To and From spaces respectively.
  3. Alter the -XX:MaxTenuringThreshold property.
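
For reference, a minimal sketch of what those adjustments could look like on the JVM command line. With G1 (which we are running, judging from the failure events graph below), fixing the young generation size this way overrides the adaptive sizing driven by the pause-time goal, and the actual values would have to come from the tenuring graphs; the numbers here are purely illustrative, not a recommendation:

  -XX:NewRatio=2              # illustrative: young gen = 1/(2+1) of the heap
  -XX:SurvivorRatio=6         # illustrative: young gen split into 6+2 parts, Eden 6/8, each survivor space 1/8
  -XX:MaxTenuringThreshold=8  # illustrative: cap promotion at 8 survived young collections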

Application Throughput:

The recommendation is a bit generic:

How you tune will depend upon your performance requirements and the environment into which the application is deployed. In general terms, smaller live set sizes combined with smaller memory pools will reduce GC pause times. However, smaller memory pools will increase the frequency of garbage collections and place you at risk of running out of memory, as will high rates of allocation. Consequently, the general recommendation is to first work to minimize allocation rates and the volume of data that needs to be retained in the Java heap. The second step is to tune the garbage collector to reduce GC overhead and minimize pause times.

Minimizing allocation rate looks like a good idea, but not something easy to do :)
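
Since Censum works from GC logs, we presumably already collect most of this, but for reference these are the JDK 8 logging flags that produce the data this kind of allocation rate / pause time analysis needs (the log path is only an example, not our actual configuration):

  -Xloggc:/var/log/wdqs/gc.log        # hypothetical path, adjust to the real one
  -XX:+PrintGCDetails
  -XX:+PrintGCDateStamps
  -XX:+PrintTenuringDistribution      # needed for the tenuring / survivor space analysis above
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=10
  -XX:GCLogFileSize=20M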

High kernel times:

This looks suspicious... and indicates that other applications are competing for resources. Surprisingly, it seems that IO contention can be an issue during GC, but it could be something else.
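
A few standard host-level checks could confirm or rule out the usual suspects for high kernel times during GC (swapping, IO contention, transparent huge pages); nothing here is WDQS-specific:

  vmstat 5                                          # si/so columns show swapping, wa shows IO wait
  iostat -x 5                                       # per-device utilisation during the GC pauses
  cat /sys/kernel/mm/transparent_hugepage/enabled   # THP defrag is a known cause of high sys time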

Informative

Heap Too Small Indicators

4 "to-space exhausted" events were seen. 1 is just after application startup and is probably not significant. The other 3 occur at roughly the same time, preceding a full GC. It looks like there is a change of pattern at that time, which probably indicates that we have two problems:

  1. high GC overhead all the time
  2. a sudden rise in allocation rate at specific times, probably related to more expensive queries (I might be entirely wrong about this one)

F10000336: G1 Failure Events.png
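
If the to-space exhaustion turns out to be the real problem, the usual G1 knobs are the ones below. The values are purely illustrative, not a recommendation for our hosts:

  -Xmx16g                                  # illustrative: simply give G1 more headroom
  -XX:G1ReservePercent=15                  # keep more of the heap in reserve for evacuation (default is 10)
  -XX:InitiatingHeapOccupancyPercent=35    # start concurrent marking earlier (default is 45)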

There is an allocation rate graph in the report, but I have no idea how to read it...
F10000332: allocation rate.png
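
Presumably (an assumption on my part, the documentation would need to confirm it) the graph is derived from consecutive young collections, roughly:

  allocation rate ≈ (Eden occupancy before this collection - Eden occupancy after the previous one) / time between the two collections

so, for example, 500MB allocated in the 2s between two young GCs would show up as 250MB/s.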


TASK DETAIL
https://phabricator.wikimedia.org/T175919
