We have a 4-node ES cluster (2 client-only nodes, and 2 data/master nodes with 25 GB of memory allocated to ES and 12 cores each), storing an index with 16 shards, ~200 GB of data, and 1 replica.
Recently, while running scan/scroll requests to dump data along with some faceting requests, the nodes disconnected from each other and we ended up with a split-brain condition. All requests were being run sequentially, not in parallel. From the logs we noticed that for 5-10 minutes only the young GC was running, even though old-gen and total memory usage were over 75% (which I believe is the default CMSInitiatingOccupancyFraction=75). When the old GC finally ran, it took 1.7 minutes, which caused the nodes to disconnect (3 pings of 30 seconds each failed). Below are some of the last traces from the log:

[2014-04-15 22:07:33,526][INFO ][monitor.jvm] [ny1.node2] [gc][young][149283][9270] duration [782ms], collections [1]/[1s], total [782ms]/[8.2m], memory [21.5gb]->[20.8gb]/[24.8gb], all_pools {[young] [1gb]->[147mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.3gb]->[20.5gb]/[23.5gb]}
[2014-04-15 22:07:35,264][INFO ][monitor.jvm] [ny1.node2] [gc][young][149284][9271] duration [743ms], collections [1]/[1.7s], total [743ms]/[8.2m], memory [20.8gb]->[20.8gb]/[24.8gb], all_pools {[young] [147mb]->[10.1mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.5gb]->[20.6gb]/[23.5gb]}
[2014-04-15 22:07:36,814][INFO ][monitor.jvm] [ny1.node2] [gc][young][149285][9272] duration [786ms], collections [1]/[1.5s], total [786ms]/[8.2m], memory [20.8gb]->[20.9gb]/[24.8gb], all_pools {[young] [10.1mb]->[2.8mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.6gb]->[20.8gb]/[23.5gb]}
[2014-04-15 22:07:38,880][INFO ][monitor.jvm] [ny1.node2] [gc][young][149287][9273] duration [835ms], collections [1]/[1s], total [835ms]/[8.2m], memory [21.5gb]->[21.1gb]/[24.8gb], all_pools {[young] [655.9mb]->[1.2mb]/[1.1gb]}{[survivor] [149.7mb]->[149.7mb]/[149.7mb]}{[old] [20.7gb]->[20.9gb]/[23.5gb]}
[2014-04-15 22:09:24,215][INFO ][monitor.jvm] [ny1.node2] [gc][young][149290][9274] duration [786ms], collections [1]/[1.7m], total [786ms]/[8.3m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}
[2014-04-15 22:09:24,215][WARN ][monitor.jvm] [ny1.node2] [gc][old][149290][25] duration [1.7m], collections [2]/[1.7m], total [1.7m]/[1.7m], memory [21.7gb]->[2.4gb]/[24.8gb], all_pools {[young] [727.2mb]->[13.8mb]/[1.1gb]}{[survivor] [149.7mb]->[0b]/[149.7mb]}{[old] [20.8gb]->[2.4gb]/[23.5gb]}

CPU usage was pretty low on the machine, but it's confusing why the old GC was held off for so long, and why it took so much time when it finally ran. We currently use Java 7 update 10; would upgrading to the latest update (51) help? Would switching to G1 help? Thanks!
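For what it's worth, here is a sketch of the JVM flags we are thinking of enabling to get more detail on the pauses, and to force CMS to honor the occupancy threshold rather than its own heuristics. This assumes a standard Oracle HotSpot JVM, and that our startup script passes ES_JAVA_OPTS through to the java command (the log path is just an example):

```shell
# Sketch only -- assumes HotSpot JVM and a startup script that honors ES_JAVA_OPTS.

# 1. Verbose GC logging, so we can see exactly when/why CMS cycles start:
export ES_JAVA_OPTS="$ES_JAVA_OPTS \
  -XX:+PrintGCDetails \
  -XX:+PrintGCDateStamps \
  -XX:+PrintGCApplicationStoppedTime \
  -Xloggc:/var/log/elasticsearch/gc.log"

# 2. Make CMS kick in at the configured occupancy instead of its own estimate,
#    which might explain why old GC was delayed past 75%:
export ES_JAVA_OPTS="$ES_JAVA_OPTS \
  -XX:CMSInitiatingOccupancyFraction=75 \
  -XX:+UseCMSInitiatingOccupancyOnly"

# 3. Alternative we are considering: G1 instead of CMS (available since Java 7u4),
#    with a pause target well under the 30s ping timeout:
# export ES_JAVA_OPTS="$ES_JAVA_OPTS -XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```

Without -XX:+UseCMSInitiatingOccupancyOnly, CMS only uses the fraction as a hint for its first cycle and then relies on its own statistics, which is one possible explanation for the delayed old-gen collection.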