Re: ES v1.1 continuous young gc pauses old gc, stops the world when old gc happens and splits cluster

2014-06-18 Thread Eric Brandes
I'd just like to chime in with a "me too".  Is the answer just more nodes?  
In my case this is happening every week or so.

On Monday, April 21, 2014 9:04:33 PM UTC-5, Brian Flad wrote:
>
> My dataset currently is 100GB across a few "daily" indices (~5-6GB and 15 
> shards each). Data nodes are 12 CPU, 12GB RAM (6GB heap).
>
>
> On Mon, Apr 21, 2014 at 6:33 PM, Mark Walkom wrote:
>
> How big are your data sets? How big are your nodes?
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>
>
> On 22 April 2014 00:32, Brian Flad wrote:
>
> We're seeing the same behavior with 1.1.1, JDK 7u55, 3 master nodes (2 min 
> master), and 5 data nodes. Interestingly, we see the repeated young GCs 
> only on a node or two at a time. Cluster operations (such as recovering 
> unassigned shards) grind to a halt. After restarting a GCing node, 
> everything returns to normal operation in the cluster.
>
> Brian F
>
>
> On Wed, Apr 16, 2014 at 8:00 PM, Mark Walkom wrote:
>
> In both your instances, if you can, have 3 master-eligible nodes, as that 
> will reduce the likelihood of a split cluster: you will always have a 
> majority quorum. Also look at discovery.zen.minimum_master_nodes to go with 
> that.
> However, you may just be reaching the limit of your nodes, which means the 
> best option is to add another node (which also neatly solves your split 
> brain!).
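> As a rough sketch (assuming three master-eligible nodes, so the quorum is 
> 2), the relevant elasticsearch.yml lines would be something like:
>
>   # quorum = (number of master-eligible nodes / 2) + 1
>   discovery.zen.minimum_master_nodes: 2
>   # on the three master-eligible nodes themselves
>   node.master: true
>
> With that set, a single isolated node can never elect itself master.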
>
> Ankush, it would help if you could update Java; most people recommend u25, but 
> we run u51 with no problems.
>
>
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>
>
> On 17 April 2014 07:31, Dominiek ter Heide wrote:
>
> We are seeing the same issue here. 
>
> Our environment:
>
> - 2 nodes
> - 30GB Heap allocated to ES
> - ~140GB of data
> - 639 indices, 10 shards per index
> - ~48M documents
>
> After starting ES everything is good, but after a couple of hours we see 
> the heap build up towards 96% on one node and 80% on the other. We then see 
> GC take a very long time on the 96% node:
>
> [earlier log lines truncated in the archive; the excerpt picks up mid-line]
>
> ...TOuKgmlzaVaFVA][elasticsearch1.trend1.bottlenose.com][inet[/192.99.45.125:9300]]])
>
> [2014-04-16 12:04:27,845][INFO ][discovery] [elasticsearch2.trend1] trend1/I3EHG_XjSayz2OsHyZpeZA
>
> [2014-04-16 12:04:27,850][INFO ][http] [elasticsearch2.trend1] bound_address {inet[/0.0.0.0:9200]}, publish_address {inet[/192.99.45.126:9200]}
>
> [2014-04-16 12:04:27,851][INFO ][node] [elasticsearch2.trend1] started
>
> [2014-04-16 12:04:32,669][INFO ][indices.store] [elasticsearch2.trend1] updating indices.store.throttle.max_bytes_per_sec from [20mb] to [1gb], note, type is [MERGE]
>
> [2014-04-16 12:04:32,669][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][indices.recovery] [elasticsearch2.trend1] updating [indices.recovery.max_bytes_per_sec] from [200mb] to [2gb]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 12:04:32,670][INFO ][cluster.routing.allocation.decider] [elasticsearch2.trend1] updating [cluster.routing.allocation.node_initial_primaries_recoveries] from [4] to [50]
>
> [2014-04-16 15:25:21,409][WARN ][monitor.jvm] [elasticsearch2.trend1] [gc][old][11876][106] duration [1.1m], collections [1]/[1.1m], total [1.1m]/[1.4m], memory [28.7gb]->[22gb]/[29.9gb], all_pools {[young] [67.9mb]->[268.9mb]/[665.6mb]}{[survivor] [60.5mb]->[0b]/[83.1mb]}{[old] [28.6gb]->[21.8gb]/[29.1gb]}
>
> [2014-04-16 16:02:32,523][WARN ][monitor.jvm] [elasticsearch2.trend1] [gc][old][13996][144] duration [1.4m], collections [1]/[1.4m], total [1.4m]/[3m], memory [28.8gb]->[23.5gb]/[29.9gb], all_pools {[young] [21.8mb]->[238.2mb]/[665.6mb]}{[survivor] [82.4mb]->[0b]/[83.1mb]}{[old] [28.7gb]->[23.3gb]/[29.1gb]}
>
> [2014-04-16 16:14:12,386][WARN ][monitor.jvm] [elasticsearch2.trend1] [gc][old][14603][155] duration [1.3m], collections [2]/[1.3m], total [1.3m]/[4.4m], memory [29.2gb]->[23.9gb]/[29.9gb], all_pools {[young] [289mb]->[161.3mb]/[665.6mb]}{[survivor] [58.3mb]->[0b]/[83.1mb]}{[old] [28.8gb]->[23.8gb]/[29.1gb]}
>
> [2014-04-16 16:17:55,480][WARN ][monitor.jvm] [elasticsearch2.trend1] [gc][old][14745][158] duration [1.3m], collections [1]/[1.3m], total [1.3m]/[5.7m], memory [29.7gb]->[24.1gb]/[29.9gb], all_pools {[young] [633.8mb]->[149.7mb]/[665.6mb]}{[survivor] [68.6mb]->[0b]/[83.1mb]}{[old] [29gb]->[24gb]/[29.1gb]}
>
> [2014-04-16 16:21:17,950][WARN ][monitor. [remainder of the message truncated in the archive]

Re: Random node disconnects in Azure, no resource issues as near as I can tell

2014-05-30 Thread Eric Brandes
The three nodes are connected by an Azure virtual network. They are all 
part of a single cloud service, operating in a load balanced set.  I am not 
currently using any kind of FQDN, so the unicast host names are 
"es-machine-1", "es-machine-2" etc. No domain suffix whatsoever.  As far as 
I know that bypasses the public load balancer (since none of those 
hostnames are publicly accessible to machines outside the virtual 
network).  But I've been wrong before :)  I actually can't find any kind of 
fully qualified domain name for those machines, other than the public 
facing cloudapp.net one, so I assume this is OK?  I've also tried using the 
internal virtual network IP addresses on a similarly specced development 
cluster, and I see the same timeouts there.
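
For reference, the discovery settings boil down to roughly this in 
elasticsearch.yml (a sketch; the third hostname just follows the same 
pattern as the first two):

  discovery.zen.ping.multicast.enabled: false
  discovery.zen.ping.unicast.hosts: ["es-machine-1", "es-machine-2", "es-machine-3"]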

On Friday, May 30, 2014 1:40:47 AM UTC-5, Michael Delaney wrote:
>
> Are you using internal fully qualified domain names, e.g. 
> es01.myelasticsearcservice.f3.internal.net? 
> If you use public load balancer endpoints you'll get timeouts. 



Re: Random node disconnects in Azure, no resource issues as near as I can tell

2014-05-29 Thread Eric Brandes
I'm using the unicast list of nodes at the moment. I have multicast turned 
off as well.  I have not changed the default ping timeout or anything.
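
If it comes to tweaking, the settings I understand to be relevant are the 
zen fault-detection knobs, along these lines (a sketch only; I have not 
applied these, and the values are just illustrative):

  discovery.zen.fd.ping_timeout: 60s    # default is 30s
  discovery.zen.fd.ping_retries: 6      # default is 3
  discovery.zen.fd.ping_interval: 5s    # default is 1s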

On Thursday, May 29, 2014 7:37:38 PM UTC-5, David Pilato wrote:
>
> Just checking: are you using azure cloud plugin or unicast list of nodes?
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
>
> On 30 May 2014 at 02:12, Eric Brandes wrote:
>
> I have a 3 node cluster running ES 1.0.1 in Azure.  They're Windows VMs 
> with 7GB of RAM.  The JVM heap size is allocated at 4GB per node.  There is 
> a single index in the cluster with 50 shards and 1 replica.  The total 
> number of documents on primary shards is 29 million with a store size of 
> 60gb (including replicas).
>
> Almost every day now I get a random node disconnecting from the cluster.  
> The usual suspect is a ping timeout.  The longest GC in the logs is about 1 
> sec, and the boxes really don't look resource-constrained at all. CPU never 
> goes above 20%. The used JVM heap size never goes above 6gb (the total on 
> the cluster is 12gb) and the field data cache never gets over 1gb.  The 
> node that drops out is different every day.  I have 
> discovery.zen.minimum_master_nodes set so there's not any kind of split-brain 
> scenario, but there are times where the disconnected node NEVER rejoins 
> until I bounce the process.
>
> Has anyone seen this before?  Is it an Azure networking issue?  How can I 
> tell?  If it's resource problems, what's the best way for me to turn on 
> logging to diagnose them?  What else can I tell you or what other steps can 
> I take to figure this out?  It's really quite maddening :(
>



Random node disconnects in Azure, no resource issues as near as I can tell

2014-05-29 Thread Eric Brandes
I have a 3 node cluster running ES 1.0.1 in Azure.  They're Windows VMs 
with 7GB of RAM.  The JVM heap size is allocated at 4GB per node.  There is 
a single index in the cluster with 50 shards and 1 replica.  The total 
number of documents on primary shards is 29 million with a store size of 
60gb (including replicas).

Almost every day now I get a random node disconnecting from the cluster.  
The usual suspect is a ping timeout.  The longest GC in the logs is about 1 
sec, and the boxes really don't look resource-constrained at all. CPU never 
goes above 20%. The used JVM heap size never goes above 6gb (the total on 
the cluster is 12gb) and the field data cache never gets over 1gb.  The 
node that drops out is different every day.  I have 
discovery.zen.minimum_master_nodes set so there's not any kind of split-brain 
scenario, but there are times where the disconnected node NEVER rejoins 
until I bounce the process.

Has anyone seen this before?  Is it an Azure networking issue?  How can I 
tell?  If it's resource problems, what's the best way for me to turn on 
logging to diagnose them?  What else can I tell you or what other steps can 
I take to figure this out?  It's really quite maddening :(
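
In case it helps whoever answers: the two things I'm thinking of turning on 
are GC logging on the JVM side and more verbose discovery logging on the ES 
side, roughly like this (a sketch; the log file name is just a placeholder):

  # JVM options for the Elasticsearch service (standard Java 7 HotSpot flags)
  -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log

  # raise discovery logging at runtime via the cluster settings API
  curl -XPUT http://localhost:9200/_cluster/settings -d '{
    "transient": { "logger.discovery": "DEBUG" }
  }'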



Re: Windows Elasticsearch cluster performance tuning

2014-03-23 Thread Eric Brandes
Interesting - so in general would you recommend consolidating all 400 
indexes into a single index and using aliases/filters to address them?  
(they're currently broken out by user, and all operations are scoped to a 
specific user)

If I were to consolidate to a single index, how many shards would be 
recommended?
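
Just so I'm picturing the alias idea correctly: one big index plus a 
filtered alias per user, created with something like this (index and field 
names here are placeholders, not my real mapping):

  curl -XPOST http://localhost:9200/_aliases -d '{
    "actions": [
      { "add": {
          "index":   "all-users",
          "alias":   "user-42",
          "filter":  { "term": { "user_id": 42 } },
          "routing": "42"
      }}
    ]
  }'

Searches against user-42 would then only see that user's documents, and the 
routing value keeps them on a single shard.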

On Sunday, March 23, 2014 2:00:18 PM UTC-5, David Pilato wrote:
>
> Forgot to say that you should use extra large instances and not large ones.
> With large instances, you could suffer from noisy neighbors.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 23 March 2014 at 19:54, David Pilato wrote:
>
> IMHO 800 shards per node is far too much. And with only 4gb of memory...
>
> I guess you have a lot of GC or you forgot to disable swap.
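> If swap is the suspect, the usual sketch is to lock the heap in RAM and 
> then verify it actually took effect (whether the lock works on Windows 
> with this version is worth double checking):
>
>   bootstrap.mlockall: true   # in elasticsearch.yml
>
>   curl http://localhost:9200/_nodes/process?pretty   # look for "mlockall" : true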
>
> My 2 cents.
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
> On 23 March 2014 at 18:08, Eric Brandes wrote:
>
> Hey all, I have a 3 node Elasticsearch 1.0.1 cluster running on Windows 
> Server 2012 (in Azure).  There's about 20 million documents that take up a 
> total of 40GB (including replicas).  There's about 400 indexes in total, 
> with some having millions of documents and some having just a few.  Each 
> index is set to have 3 shards and 1 replica.   The main cluster is running 
> on three 4-core machines with 7GB of RAM.  The min/max JVM heap size is 
> set to 4GB.  
>
> The primary use case for this cluster is faceting/aggregations over the 
> documents.  There's almost no full text searching, so everything is pretty 
> much based on exact values (which are stored but not analyzed at index time).
>
> When doing some term facets on a few of these indexes (the biggest one 
> contains about 8 million documents) I'm seeing really long response times 
> (> 5 sec).  There are potentially thousands of distinct values for the term 
> I'm faceting on, but I would have still expected faster performance.
>
> So my goal is to speed up these queries to get the responses sub second if 
> possible.  To that end I had some questions:
> 1) Would switching to Linux give me better performance in general?
> 2) I could collapse almost all of these 400 indexes into a single big 
> index and use aliases + filters instead.  Would this be advisable?
> 3) Would mucking with the field data cache yield any better results?
>
>
> If I can add any more data to this discussion please let me know!
> Thanks!
> Eric
>



Windows Elasticsearch cluster performance tuning

2014-03-23 Thread Eric Brandes
Hey all, I have a 3 node Elasticsearch 1.0.1 cluster running on Windows 
Server 2012 (in Azure).  There's about 20 million documents that take up a 
total of 40GB (including replicas).  There's about 400 indexes in total, 
with some having millions of documents and some having just a few.  Each 
index is set to have 3 shards and 1 replica.   The main cluster is running 
on three 4-core machines with 7GB of RAM.  The min/max JVM heap size is 
set to 4GB.  

The primary use case for this cluster is faceting/aggregations over the 
documents.  There's almost no full text searching, so everything is pretty 
much based on exact values (which are stored but not analyzed at index time).

When doing some term facets on a few of these indexes (the biggest one 
contains about 8 million documents) I'm seeing really long response times 
(> 5 sec).  There are potentially thousands of distinct values for the term 
I'm faceting on, but I would have still expected faster performance.
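
For concreteness, the slow requests are essentially just a term facet over 
one not_analyzed field, along these lines (index and field names are 
stand-ins for the real ones):

  curl -XPOST "http://localhost:9200/some-user-index/_search?search_type=count" -d '{
    "query":  { "match_all": {} },
    "facets": {
      "by_category": {
        "terms": { "field": "category", "size": 50 }
      }
    }
  }'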

So my goal is to speed up these queries to get the responses sub second if 
possible.  To that end I had some questions:
1) Would switching to Linux give me better performance in general?
2) I could collapse almost all of these 400 indexes into a single big 
index and use aliases + filters instead.  Would this be advisable?
3) Would mucking with the field data cache yield any better results?
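
On (3), the setting I'd be poking at is roughly this in elasticsearch.yml 
(a sketch; the value is a guess, not a recommendation), plus the stats 
endpoint to see how much field data each node is actually holding:

  indices.fielddata.cache.size: 40%   # cap field data so it can't fill the heap

  curl "http://localhost:9200/_nodes/stats/indices/fielddata?pretty"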


If I can add any more data to this discussion please let me know!
Thanks!
Eric
