We're seeing the same thing. ES 1.1.0, JDK 7u55 on Ubuntu 12.04, 5 data nodes, 3 separate masters, all on 15GB hosts with 7.5GB heaps; storage is SSD. The data set is ~1.6TB according to Marvel.
Our daily indices are roughly 33GB in size, with 5 shards and 2 replicas. I'm still investigating what happened yesterday, but I do see in Marvel a large spike in the "Indices Current Merges" graph just before the node dies, and a corresponding increase in JVM heap. When the heap hits 99%, everything grinds to a halt. Restarting the node "fixes" the issue, but this is the third or fourth time it's happened. I'm still researching how to deal with this, but a couple of things I am looking at are:

- increasing the number of shards so that the individual segment merges stay smaller (is that even a legitimate sentence?). I'm still reading through the Index Modules Merge page <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html> for more details.
- store-level throttling <http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling>. There's a rough sketch of both ideas below my sign-off.

I would love to get some feedback on my ramblings. If I find anything more I'll update this thread.

cheers
mike
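P.S. In case it helps the discussion, this is the kind of change I have in mind, written out with the official Python elasticsearch client. It's only a sketch: the host, the index name, and the actual byte values are placeholders, and I'd want to confirm the merge-policy setting is dynamically updatable on 1.1.0 before leaning on it.

from elasticsearch import Elasticsearch

es = Elasticsearch()  # defaults to localhost:9200; point it at one of your nodes

# Cluster-wide store throttling: cap how fast merges may write to disk so a
# large merge can't monopolise I/O while the heap is already under pressure.
# 20mb is the documented default; Dominiek's log below shows it raised to 1gb.
es.cluster.put_settings(body={
    "transient": {
        "indices.store.throttle.type": "merge",
        "indices.store.throttle.max_bytes_per_sec": "20mb",
    }
})

# Per daily index: keep individual merged segments smaller (tiered merge
# policy), which is what I meant by keeping the segment merges smaller.
# "logstash-2014.06.19" is just a placeholder for one of our daily indices.
es.indices.put_settings(
    index="logstash-2014.06.19",
    body={"index.merge.policy.max_merged_segment": "2gb"},  # default is 5gb
)

The intent is simply to trade the occasional huge merge for more, smaller ones; I haven't verified yet that it actually changes the heap behaviour.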
On Thursday, June 19, 2014 4:05:54 PM UTC-4, Bruce Ritchie wrote:
>
> Java 8 with G1GC perhaps? It'll have more overhead, but perhaps it'll be more consistent wrt pauses.
>
> On Wednesday, June 18, 2014 2:02:24 PM UTC-4, Eric Brandes wrote:
>>
>> I'd just like to chime in with a "me too". Is the answer just more nodes? In my case this is happening every week or so.
>>
>> On Monday, April 21, 2014 9:04:33 PM UTC-5, Brian Flad wrote:
>>
>> My dataset currently is 100GB across a few "daily" indices (~5-6GB and 15 shards each). Data nodes are 12 CPU, 12GB RAM (6GB heap).
>>
>> On Mon, Apr 21, 2014 at 6:33 PM, Mark Walkom <ma...@campaignmonitor.com> wrote:
>>
>> How big are your data sets? How big are your nodes?
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>>
>> On 22 April 2014 00:32, Brian Flad <bfla...@gmail.com> wrote:
>>
>> We're seeing the same behavior with 1.1.1, JDK 7u55, 3 master nodes (2 min master), and 5 data nodes. Interestingly, we see the repeated young GCs only on a node or two at a time. Cluster operations (such as recovering unassigned shards) grind to a halt. After restarting a GCing node, everything returns to normal operation in the cluster.
>>
>> Brian F
>>
>> On Wed, Apr 16, 2014 at 8:00 PM, Mark Walkom <ma...@campaignmonitor.com> wrote:
>>
>> In both your instances, if you can, have 3 master-eligible nodes, as that will reduce the likelihood of a split cluster because you will always have a majority quorum. Also look at discovery.zen.minimum_master_nodes to go with that.
>> However, you may just be reaching the limit of your nodes, which means the best option is to add another node (which also neatly solves your split brain!).
>>
>> Ankush, it would help if you can update Java; most people recommend u25, but we run u51 with no problems.
>>
>> Regards,
>> Mark Walkom
>>
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>>
>> On 17 April 2014 07:31, Dominiek ter Heide <domin...@gmail.com> wrote:
>>
>> We are seeing the same issue here.
>>
>> Our environment:
>>
>> - 2 nodes
>> - 30GB heap allocated to ES
>> - ~140GB of data
>> - 639 indices, 10 shards per index
>> - ~48M documents
>>
>> After starting ES everything is good, but after a couple of hours we see the heap build up towards 96% on one node and 80% on the other. We then see the GC take very long on the 96% node:
>>
>> TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])
>> [2014-04-16 12:04:27,845][INFO ][discovery              ] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA
>> [2014-04-16 12:04:27,850][INFO ][http                   ] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}
>> [2014-04-16 12:04:27,851][INFO ][node                   ] [elasticsearch2.trend1] started
>> [2014-04-16 12:04:32,669][INFO ][indices.store          ] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]
>> [2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>> [2014-04-16 12:04:32,670][INFO ][indices.recovery       ] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]
>> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>> [2014-04-16 15:25:21,409][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}
>> [2014-04-16 16:02:32,523][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}
>> [2014-04-16 16:14:12,386][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}
>> [2014-04-16 16:17:55,480][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}
>> [2014-04-16 16:21:17,950][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][14857][161] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[7.2m], memory [28.6gb]->[24.5gb]/[29.9gb], all_pools {[young] [27.7mb]->[154.8mb]/[665.6mb]}{[survivor] [83.1mb]->[0b]/[83.1mb]}{[old] [28.5gb]->[24.3gb]/[29.1gb]}
>> [2014-04-16 16:24:48,776][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][14978][164] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[8.6m], memory [29.4gb]->[24.7gb]/[29.9gb], all_pools {[young] [475.5mb]->[125.1mb]/[665.6mb]}{[survivor] [68.9mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
>> [2014-04-16 16:26:54,801][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][15021][165] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[9.9m], memory [29.3gb]->[24.8gb]/[29.9gb], all_pools {[young] [391.8mb]->[151.1mb]/[665.6mb]}{[survivor] [62.4mb]->[0b]/[83.1mb]}{[old] [28.9gb]->[24.6gb]/[29.1gb]}
>> [2014-04-16 16:30:45,393][WARN ][monitor.jvm            ] [elasticsearch2.trend1] [gc][old][15170][168] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[11.3m], memory [29.4gb]->[24.6gb]/[29.9gb], all_pools {[young] [320.3mb]->[186.7mb
>>
>> ...