Re: ES v1.1 continuous young gc pauses old gc, stops the world when old gc happens and splits cluster
I'd just like to chime in with a "me too". Is the answer just more nodes? In my case this is happening every week or so.

On Monday, April 21, 2014 9:04:33 PM UTC-5, Brian Flad wrote:
>
> My dataset currently is 100GB across a few "daily" indices (~5-6GB and 15
> shards each). Data nodes are 12 CPU, 12GB RAM (6GB heap).
>
> On Mon, Apr 21, 2014 at 6:33 PM, Mark Walkom wrote:
>
> How big are your data sets? How big are your nodes?
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 22 April 2014 00:32, Brian Flad wrote:
>
> We're seeing the same behavior with 1.1.1, JDK 7u55, 3 master nodes (2 min
> master), and 5 data nodes. Interestingly, we see the repeated young GCs
> only on a node or two at a time. Cluster operations (such as recovering
> unassigned shards) grind to a halt. After restarting a GCing node,
> everything returns to normal operation in the cluster.
>
> Brian F
>
> On Wed, Apr 16, 2014 at 8:00 PM, Mark Walkom wrote:
>
> In both your instances, if you can, have 3 master-eligible nodes, as it
> will reduce the likelihood of a split cluster because you will always have a
> majority quorum. Also look at discovery.zen.minimum_master_nodes to go with
> that.
> However, you may just be reaching the limit of your nodes, which means the
> best option is to add another node (which also neatly solves your split
> brain!).
>
> Ankush, it would help if you can update Java; most people recommend u25 but
> we run u51 with no problems.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
> On 17 April 2014 07:31, Dominiek ter Heide wrote:
>
> We are seeing the same issue here.
>
> Our environment:
>
> - 2 nodes
> - 30GB Heap allocated to ES
> - ~140GB of data
> - 639 indices, 10 shards per index
> - ~48M documents
>
> After starting ES everything is good, but after a couple of hours we see
> the Heap build up towards 96% on one node and 80% on the other. We then see
> the GC take very long on the 96% node:
>
> TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])
>
> [2014-04-16 12:04:27,845][INFO ][discovery] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA
>
> [2014-04-16 12:04:27,850][INFO ][http ] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}
>
> [2014-04-16 12:04:27,851][INFO ][node ] [elasticsearch2.trend1] started
>
> [2014-04-16 12:04:32,669][INFO ][indices.store] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]
>
> [2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][indices.recovery ] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 15:25:21,409][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}
>
> [2014-04-16 16:02:32,523][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}
>
> [2014-04-16 16:14:12,386][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}
>
> [2014-04-16 16:17:55,480][WARN ][monitor.jvm ] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}
>
> [2014-04-16 16:21:17,950][WARN ][monitor.
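The `[gc][old]` warnings quoted in this message pack the pause length and the heap occupancy before and after collection into one line. As a rough aid for reading them, here is a minimal Python sketch that extracts those figures. The regular expression and the name `parse_gc_line` are my own and assume the line layout seen in this thread, with heap sizes reported in `gb`; this is not an Elasticsearch tool.

```python
import re

# Assumed layout, taken from the monitor.jvm lines quoted in this thread:
# [gc][old][<seq>][<count>] duration [1.1m], ... memory [28.7gb]->[22gb]/[29.9gb]
GC_LINE = re.compile(
    r"\[gc\]\[(?P<gen>young|old)\]\[\d+\]\[\d+\]\s+"
    r"duration \[(?P<duration>[\d.]+)(?P<unit>ms|s|m)\].*?"
    r"memory \[(?P<before>[\d.]+)gb\]->\[(?P<after>[\d.]+)gb\]/\[(?P<max>[\d.]+)gb\]"
)

# Seconds per duration unit used in the log output.
UNIT_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def parse_gc_line(line):
    """Return pause length and heap figures for one GC warning, or None."""
    match = GC_LINE.search(line)
    if match is None:
        return None
    after = float(match.group("after"))
    heap_max = float(match.group("max"))
    return {
        "generation": match.group("gen"),
        "pause_seconds": float(match.group("duration")) * UNIT_SECONDS[match.group("unit")],
        "heap_before_gb": float(match.group("before")),
        "heap_after_gb": after,
        "heap_max_gb": heap_max,
        # Share of the heap still occupied once the collection has finished.
        "heap_after_pct": round(100 * after / heap_max, 1),
    }


# First old-generation warning from the log excerpt in this message.
sample = (
    "[2014-04-16 15:25:21,409][WARN ][monitor.jvm ] [elasticsearch2.trend1] "
    "[gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], "
    "total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb]"
)
print(parse_gc_line(sample))
```

On the sample line the pause is over a minute and the heap is still roughly three quarters full afterwards, which may suggest the node is holding live data rather than garbage; that reading is an interpretation, not something the log states.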
Re: Random node disconnects in Azure, no resource issues as near as I can tell
The three nodes are connected by an Azure virtual network. They are all part of a single cloud service, operating in a load-balanced set. I am not currently using any kind of FQDN, so the unicast host names are "es-machine-1", "es-machine-2" etc. No domain suffix whatsoever. As far as I know, that bypasses the public load balancer (since none of those hostnames are publicly accessible to machines outside the virtual network). But I've been wrong before :)

I actually can't find any kind of fully qualified domain name for those machines, other than the public-facing cloudapp.net one, so I assume this is OK?

I've also tried using the internal virtual network IP addresses on a similarly specced development cluster, and I see the same timeouts there.

On Friday, May 30, 2014 1:40:47 AM UTC-5, Michael Delaney wrote:
>
> Are you using internal fully qualified domain names, e.g.
> es01.myelasticsearcservice.f3.internal.net
> If you use public load balancer end points you'll get timeouts.

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/dd26798c-66ef-4881-88ea-72d9df2e16a0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
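One quick way to check whether the short host names resolve and whether the transport port answers inside the virtual network is a plain TCP connect from each node to the others. The sketch below uses only the Python standard library. The function name `transport_port_reachable` is hypothetical, port 9300 is the default Elasticsearch transport port, and a third host name such as "es-machine-3" would be a guess based on the naming pattern above.

```python
import socket


def transport_port_reachable(host, port=9300, timeout=3.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections and timeouts alike.
        return False


def report(hosts, port=9300):
    """Print one reachability line per host."""
    for host in hosts:
        state = "reachable" if transport_port_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} {state}")


# Example call, run from one of the nodes (host names as used in this thread):
# report(["es-machine-1", "es-machine-2"])
```

A successful connect only shows that the path works at that moment; it does not rule out connections being dropped later, so it is a first check rather than a diagnosis.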
Re: Random node disconnects in Azure, no resource issues as near as I can tell
I'm using the unicast list of nodes at the moment. I have multicast turned off as well. I have not changed the default ping timeout or anything.

On Thursday, May 29, 2014 7:37:38 PM UTC-5, David Pilato wrote:
>
> Just checking: are you using azure cloud plugin or unicast list of nodes?
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 30 May 2014 at 02:12, Eric Brandes wrote:
>
> I have a 3 node cluster running ES 1.0.1 in Azure. They're Windows VMs
> with 7GB of RAM. The JVM heap size is allocated at 4GB per node. There is
> a single index in the cluster with 50 shards and 1 replica. The total
> number of documents on primary shards is 29 million with a store size of
> 60gb (including replicas).
>
> Almost every day now I get a random node disconnecting from the cluster.
> The usual suspect is a ping timeout. The longest GC in the logs is about 1
> sec, and the boxes don't look resource constrained really at all. CPU never
> goes above 20%. The used JVM heap size never goes above 6gb (the total on
> the cluster is 12gb) and the field data cache never gets over 1gb. The
> node that drops out is different every day. I have
> discovery.zen.minimum_master_nodes set so there's not any kind of split brain
> scenario, but there are times where the disconnected node NEVER rejoins
> until I bounce the process.
>
> Has anyone seen this before? Is it an Azure networking issue? How can I
> tell? If it's resource problems, what's the best way for me to turn on
> logging to diagnose them? What else can I tell you or what other steps can
> I take to figure this out? It's really quite maddening :(
Random node disconnects in Azure, no resource issues as near as I can tell
I have a 3 node cluster running ES 1.0.1 in Azure. They're Windows VMs with 7GB of RAM. The JVM heap size is allocated at 4GB per node. There is a single index in the cluster with 50 shards and 1 replica. The total number of documents on primary shards is 29 million with a store size of 60gb (including replicas).

Almost every day now I get a random node disconnecting from the cluster. The usual suspect is a ping timeout. The longest GC in the logs is about 1 sec, and the boxes don't look resource constrained really at all. CPU never goes above 20%. The used JVM heap size never goes above 6gb (the total on the cluster is 12gb) and the field data cache never gets over 1gb. The node that drops out is different every day. I have discovery.zen.minimum_master_nodes set so there's not any kind of split brain scenario, but there are times where the disconnected node NEVER rejoins until I bounce the process.

Has anyone seen this before? Is it an Azure networking issue? How can I tell? If it's resource problems, what's the best way for me to turn on logging to diagnose them? What else can I tell you or what other steps can I take to figure this out? It's really quite maddening :(
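For reference, the split-brain protection mentioned above depends on setting the minimum master count to a majority of the master-eligible nodes. A small Python sketch of that arithmetic follows; the function name is my own, and the rule is the usual majority formula of half the master-eligible nodes, rounded down, plus one.

```python
def minimum_master_nodes(master_eligible):
    """Majority quorum: strictly more than half of the master-eligible nodes."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


for nodes in (2, 3, 5):
    quorum = minimum_master_nodes(nodes)
    # Number of master-eligible nodes that can be lost while a master can still be elected.
    tolerated = nodes - quorum
    print(f"{nodes} master-eligible nodes: quorum {quorum}, tolerates {tolerated} lost")
```

For the three-node cluster described here the quorum is 2, so one node can drop out without the remaining two losing the ability to elect a master. With only two master-eligible nodes the quorum is also 2, which leaves no tolerance for a lost node.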
Re: Windows Elasticsearch cluster performance tuning
Interesting - so in general would you recommend consolidating all 400 indexes into a single index and using aliases/filters to address them? (They're currently broken out by user, and all operations are scoped to a specific user.)

If I were to consolidate to a single index, how many shards would be recommended?

On Sunday, March 23, 2014 2:00:18 PM UTC-5, David Pilato wrote:
>
> Forgot to say that you should use extra large instances and not large.
> With larges, you could suffer from noisy neighbors.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 23 March 2014 at 19:54, David Pilato wrote:
>
> IMHO 800 shards per node is far too much. And with only 4gb of memory...
>
> I guess you have a lot of GC or you forgot to disable swap.
>
> My 2 cents.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 23 March 2014 at 18:08, Eric Brandes wrote:
>
> Hey all, I have a 3 node Elasticsearch 1.0.1 cluster running on Windows
> Server 2012 (in Azure). There are about 20 million documents that take up a
> total of 40GB (including replicas). There are about 400 indexes in total,
> with some having millions of documents and some having just a few. Each
> index is set to have 3 shards and 1 replica. The main cluster is running
> on three 4 core machines with 7GB of RAM. The min/max JVM heap size is
> set to 4GB.
>
> The primary use case for this cluster is faceting/aggregations over the
> documents. There's almost no full text searching, so everything is pretty
> much based on exact values (which are stored but not analyzed at index time).
>
> When doing some term facets on a few of these indexes (the biggest one
> contains about 8 million documents) I'm seeing really long response times
> (> 5 sec). There are potentially thousands of distinct values for the term
> I'm faceting on, but I would have still expected faster performance.
>
> So my goal is to speed up these queries to get the responses sub second if
> possible. To that end I had some questions:
> 1) Would switching to Linux give me better performance in general?
> 2) I could collapse almost all of these 400 indexes into a single big
> index and use aliases + filters instead. Would this be advisable?
> 3) Would mucking with the field data cache yield any better results?
>
> If I can add any more data to this discussion please let me know!
> Thanks!
> Eric
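To make the "single index plus aliases and filters" idea concrete, here is a minimal Python sketch that builds the request body for the Elasticsearch `_aliases` endpoint, with one filtered alias per user. The index name `documents`, the alias name `user_42` and the field `user_id` are hypothetical placeholders; the thread does not say how the documents identify their owner. The sketch only prints the JSON and does not contact a cluster.

```python
import json


def filtered_alias_action(index, alias, field, value, routing=None):
    """Build one 'add' action that exposes only documents where field == value."""
    action = {
        "index": index,
        "alias": alias,
        # Term filter on a not-analyzed field holding the exact owner value.
        "filter": {"term": {field: value}},
    }
    if routing is not None:
        # Optional: only useful if documents are also indexed with this routing value.
        action["routing"] = routing
    return {"add": action}


# Hypothetical names: one shared index, one alias per user.
body = {"actions": [filtered_alias_action("documents", "user_42", "user_id", 42)]}

# This JSON would be sent as the body of a POST to /_aliases.
print(json.dumps(body, indent=2))
```

Searches sent to the alias then see only that user's documents, so application code that is already scoped to one user could keep addressing a per-user name. Whether this performs better than 400 separate indexes on this cluster would still need to be measured.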
Windows Elasticsearch cluster performance tuning
Hey all, I have a 3 node Elasticsearch 1.0.1 cluster running on Windows Server 2012 (in Azure). There are about 20 million documents that take up a total of 40GB (including replicas). There are about 400 indexes in total, with some having millions of documents and some having just a few. Each index is set to have 3 shards and 1 replica. The main cluster is running on three 4 core machines with 7GB of RAM. The min/max JVM heap size is set to 4GB.

The primary use case for this cluster is faceting/aggregations over the documents. There's almost no full text searching, so everything is pretty much based on exact values (which are stored but not analyzed at index time).

When doing some term facets on a few of these indexes (the biggest one contains about 8 million documents) I'm seeing really long response times (> 5 sec). There are potentially thousands of distinct values for the term I'm faceting on, but I would have still expected faster performance.

So my goal is to speed up these queries to get the responses sub second if possible. To that end I had some questions:
1) Would switching to Linux give me better performance in general?
2) I could collapse almost all of these 400 indexes into a single big index and use aliases + filters instead. Would this be advisable?
3) Would mucking with the field data cache yield any better results?

If I can add any more data to this discussion please let me know!
Thanks!
Eric
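The figures in this post determine how many shards each node has to carry, which is where the "800 shards per node" number in the reply in this thread comes from. A short Python sketch of the arithmetic, with a helper name of my own choosing and an even spread of shards across nodes assumed:

```python
def shards_per_node(indexes, primaries_per_index, replicas, nodes):
    """Return (total shard copies, average shard copies per node)."""
    # Every primary shard has `replicas` additional copies.
    total = indexes * primaries_per_index * (1 + replicas)
    return total, total / nodes


# Numbers from the post: 400 indexes, 3 shards each, 1 replica, 3 nodes.
total, per_node = shards_per_node(indexes=400, primaries_per_index=3, replicas=1, nodes=3)
print(f"{total} shard copies in total, about {per_node:.0f} per node")
```

Each shard is a separate Lucene index with its own overhead, so with a 4GB heap per node the per-shard share of memory is small, which is consistent with the concern raised in the reply.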