Re: Marvel creating disk usage imbalance
I think it's related to this: https://github.com/elasticsearch/elasticsearch/pull/8270, which I believe was released with 1.4. We see the same thing, with hot spots on some nodes. You can poke the cluster into rebalancing itself (which #8270 fixes permanently) using curl -XPOST localhost:9200/_cluster/reroute. That doesn't always sort it out, though, and this issue (https://github.com/elasticsearch/elasticsearch/issues/8149) is our primary one. AFAIK it's not just Marvel; any index can get into this situation. Right now I have a few nodes with 1TB of free disk and others with 400GB, and Marvel is in another cluster entirely.

cheers
mike

On Tuesday, November 11, 2014 4:15:33 AM UTC-5, Duncan Innes wrote:

I now know that Marvel creates a lot of data per day of monitoring - in our case around 1GB. What I'm just starting to get my head around is the imbalance of disk usage that this caused on my 5-node cluster. I've now removed Marvel and deleted the indexes for now (great tool, but I don't have the disk space to spare on this proof of concept), and my disk usage for the 12 months of rsyslog data has equalised across all the nodes in my cluster.

When the Marvel data was sitting there, not only was I using far too much disk space, but I was also seeing significant differences between nodes. At least one node would be using nearly all of its 32GB, where other nodes would sit at half that or even less. Is there something intrinsically different about Marvel's indexes that makes them prone to such wild differences?

Thanks
Duncan

--
You received this message because you are subscribed to the Google Groups elasticsearch group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/477f9c6c-b359-4776-83be-e8ac5ac8401a%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
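A quick way to quantify the kind of hot-spotting described above is to compare free disk per node from the _cat/allocation API. This is a minimal sketch only; the node names and sizes below are made-up sample data, not figures from either cluster in this thread. In practice you'd capture the real output with something like: curl -s 'localhost:9200/_cat/allocation?h=node,disk.avail'

```python
# Sketch: flag disk-usage hot spots from `_cat/allocation`-style output.
# The sample lines below are invented for illustration.
sample = """\
node1 1tb
node2 1tb
node3 400gb
node4 950gb
node5 420gb
"""

UNITS = {"gb": 1, "tb": 1024}  # normalise everything to GB

def to_gb(value):
    """Convert a size string like '400gb' or '1tb' to a float in GB."""
    for suffix, factor in UNITS.items():
        if value.endswith(suffix):
            return float(value[: -len(suffix)]) * factor
    raise ValueError("unrecognised size: %s" % value)

avail = {}
for line in sample.strip().splitlines():
    node, disk = line.split()
    avail[node] = to_gb(disk)

# Spread between the most-free and least-free node; a large spread is the
# imbalance the thread is talking about.
spread = max(avail.values()) - min(avail.values())
print("free-disk spread: %.0f GB" % spread)
```

Watching this spread over time (e.g. from cron) shows whether a _cluster/reroute poke actually evened things out.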
Re: hardware recommendation for dedicated client node
I have dedicated client nodes for some really intense queries and aggregations. Ours typically have 2GB of heap, and in our experience that is sufficient - the client node doesn't do a whole lot; the bulk of the work is done on the data nodes.

cheers
mike

On Monday, November 10, 2014 11:24:41 AM UTC-5, Nikolas Everett wrote:

I don't use client nodes so I can't speak from experience here. Most of the gathering steps I can think of amount to merging sorted lists, which isn't particularly intense. I think aggregations (another thing I don't use) can be more intense at the client node, but I'm not sure. My recommendation is to start by sending requests directly to the data nodes, and only start to investigate client nodes if you have trouble with that and diagnose that trouble as something that would move to a client node if you had them. It's a nice thing to have in your back pocket, but it just hasn't come up for me.

Nik

On Mon, Nov 10, 2014 at 11:17 AM, Terence Tung ter...@teambanjo.com wrote:

can anyone please help me?
Re: elasticsearch high cpu usage
A few things I can think of to look at. During this high CPU load, what's the:

- search rate
- index rate
- GC status (old and young), both count and duration
- IOPS

Are these nodes VMs? If so, is there something else running on the other VMs? That CPU load doesn't look too bad to me; it doesn't appear to be running flat out.

mike

On Thursday, July 3, 2014 7:37:57 AM UTC-4, vincent Park wrote:

Hi, I have 5 clustered nodes and each index has 1 replica. Total document size is 216 M across 853,000 docs. I am suffering from very high CPU usage every hour, and every early morning from about 05:00 to 09:00. You can see my cacti graph: http://elasticsearch-users.115913.n3.nabble.com/file/n4059189/cpuhigh.jpg There is only elasticsearch on this server, so I thought there was something wrong with the es process, but there are only a few server requests at CPU peak time, and there is no cron job either. I don't know what elasticsearch is doing at those times!! Somebody help me, tell me what happened in there. please..
$ ./elasticsearch -v
Version: 1.1.1, Build: f1585f0/2014-04-16T14:27:12Z, JVM: 1.7.0_55

$ java -version
java version "1.7.0_55"
Java(TM) SE Runtime Environment (build 1.7.0_55-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.55-b03, mixed mode)

and I installed these plugins on elasticsearch: HQ, bigdesk, head, kopf, sense.

es log at cpu peak time:

[2014-07-03 08:01:00,045][DEBUG][action.search.type] [node1] [search][4], node[GJjzCrLvQQ-ZRRoqL13MrQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@451f9e7c] lastShard [true]
org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution (queue capacity 300) on org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$4@68ab486b
    at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:62)
    at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:293)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:300)
    at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.start(TransportSearchTypeAction.java:190)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:59)
    at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction.doExecute(TransportSearchQueryThenFetchAction.java:49)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:108)
    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:43)
    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:63)
    at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:92)
    at org.elasticsearch.client.support.AbstractClient.search(AbstractClient.java:212)
    at org.elasticsearch.rest.action.search.RestSearchAction.handleRequest(RestSearchAction.java:98)
    at org.elasticsearch.rest.RestController.executeHandler(RestController.java:159)
    at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:142)
    at org.elasticsearch.http.HttpServer.internalDispatchRequest(HttpServer.java:121)
    at org.elasticsearch.http.HttpServer$Dispatcher.dispatchRequest(HttpServer.java:83)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:291)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:43)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
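That EsRejectedExecutionException (queue capacity 300) means the search thread pool's queue is full and the node is shedding load; the usual client-side response is to back off and retry rather than hammer the cluster harder. A rough sketch of that pattern, where `do_search` is a stand-in for whatever client call you actually make (not a real elasticsearch client API):

```python
import time

def search_with_backoff(do_search, retries=4, base_delay=0.5):
    """Retry a rejected search with exponential backoff.

    `do_search` is assumed to return (ok, result): ok is False when the
    cluster rejected the request (queue full), True otherwise.
    """
    for attempt in range(retries):
        ok, result = do_search()
        if ok:
            return result
        # Back off: 0.5s, 1s, 2s, 4s with the defaults above.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("search still rejected after %d attempts" % retries)
```

This doesn't fix the underlying overload (too many concurrent searches for the cluster's capacity), but it stops a burst from turning into a wall of failures.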
Re: Visibility
I strongly recommend Marvel (and I don't work for elasticsearch); it's quite detailed and you can get insight into exactly what elasticsearch is doing. The only thing it doesn't have full visibility into is the detailed GC stats; for those you'll have to enable GC logging and use a GC log viewer to investigate. I also have collectd running with the python module enabled and then this plugin: https://github.com/phobos182/collectd-elasticsearch but that's only to tie it into our alerting system.

mike

On Wednesday, July 2, 2014 10:33:46 PM UTC-4, smonasco wrote:

I currently record basically everything in bigdesk: all the numerics from cluster health, cluster state, nodes info, node stats, index status and segments. I want memory allocated on a per-shard level for Lucene-level actions, query-level actions (outside field and filter cache), and hooks into events like nodes entering and exiting the cluster, new indexes, alias and other administrative changes, and master elections. Basically, when it comes to memory I'd like to have all parts of the heap accounted for. Field + filter cache is not accounting for whatever process is spiking, nor does it explain most of the heap. At 29 gigs being used and garbage collection taking minutes but not getting anything, elastic is only reporting 7 gigs in cache. We can discuss my particular memory problems and solutions, but mostly I'm after the visibility.

--Shannon Monasco

On Jul 2, 2014 5:50 PM, Mark Walkom ma...@campaignmonitor.com wrote:

Depends what you want to do really. There are plugins like ElasticHQ, Marvel, kopf and bigdesk that will give you some info. You can also hook collectd into the stack and take metrics, or use plugins from nagios etc. What monitoring platforms do you have in place now?
Regards,
Mark Walkom
Infrastructure Engineer, Campaign Monitor
email: ma...@campaignmonitor.com
web: www.campaignmonitor.com

On 3 July 2014 07:49, smonasco smon...@gmail.com wrote:

Hi, I'm trying to get a lot more visibility and metrics into what's going on under the hood. Occasionally we see spikes in memory. I'd like to get heap mem used on a per-shard basis. If I'm not mistaken, somewhere somehow, this Lucene index that is a shard is using memory in the heap, and I'd like to collect that metric. It may also be an operation somewhere higher up at the elasticsearch level where we are merging results from shards or results from indexes (maybe elasticsearch doesn't bother to merge twice but merges once); that's also a mem space I'd like to collect data on. I think per-query mem use would also be interesting, though perhaps obviously too much to keep up with for every query (maybe a future opt-in feature, unless it's already there and I'm missing it). Other cluster events like nodes entering and exiting the cluster or the changing of the master would be nice to collect. I'm guessing some of this isn't available and some of it is, but my Google-Fu seems to be lacking. I'm pretty sure I can poll to figure out when the events happened, but I was wondering if there was something in the java client node where I could get a Future or some other hook to turn it into a push instead of a pull. Any help will be appreciated. I'm aware it's a wide net though.

--Shannon Monasco
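For the polling side of the visibility question above, per-node heap can be pulled from the _nodes/stats API (in practice: curl -s 'localhost:9200/_nodes/stats/jvm'). The sketch below parses a heavily trimmed, illustrative subset of that response; the node IDs and byte counts are invented sample values, and the real body carries many more fields.

```python
import json

# Trimmed-down, invented sample of a `_nodes/stats/jvm` response body.
sample = json.loads("""
{"nodes": {
  "abc123": {"name": "node1",
             "jvm": {"mem": {"heap_used_in_bytes": 7516192768,
                             "heap_max_in_bytes": 8589934592}}},
  "def456": {"name": "node2",
             "jvm": {"mem": {"heap_used_in_bytes": 3221225472,
                             "heap_max_in_bytes": 8589934592}}}
}}
""")

# Per-node heap utilisation as a percentage; nodes near 100% are the ones
# about to stall in GC.
heap_pct = {}
for node in sample["nodes"].values():
    mem = node["jvm"]["mem"]
    heap_pct[node["name"]] = (
        100.0 * mem["heap_used_in_bytes"] / mem["heap_max_in_bytes"]
    )
```

Polling this on a schedule and shipping the numbers to graphite/statsd is essentially what the collectd plugin linked above does.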
Re: Alerting in ELK stack?
We use Nagios for alerting. I originally used the nsca output plugin for logstash, but found that it took close to a second to execute the command-line nsca client, and if we got flooded with alert messages, logstash would fall behind. I've since switched to using the http output and sending json to the nagios-api server (https://github.com/zorkian/nagios-api). That seems to scale a lot better. We also have metrics sent from logstash to statsd/graphite, but mostly so I can see message rates.

mike

On Monday, June 23, 2014 4:50:22 AM UTC-4, Siddharth Trikha wrote:

We are using the ELK stack (logstash, elasticsearch, kibana) to analyze our logs. So far, so good. But now we want notification generation on some particular kinds of logs, e.g. when a "login failed" log comes in more than 5 times (threshold crossed), an email should be sent to the sysadmin. I looked around online and heard about statsd, riemann, nagios, and the metric filter (logstash) to achieve our requirement. Can anyone suggest which fits best with the ELK stack? I am new to this. Thanks
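The "more than 5 login failures" requirement is sliding-window thresholding, which the tools mentioned (riemann, the logstash metric filter) implement for you. A minimal sketch of just the logic, with the window length and threshold as arbitrary example choices rather than anything from this thread:

```python
from collections import deque

def make_detector(threshold=5, window_seconds=60):
    """Return a callable that records event timestamps and reports whether
    more than `threshold` events fell inside the trailing window."""
    events = deque()

    def seen(timestamp):
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and timestamp - events[0] > window_seconds:
            events.popleft()
        return len(events) > threshold  # True -> fire the alert

    return seen
```

Wiring the True case to an email or to nagios-api is then the easy part; the hard operational question is picking a window and threshold that don't page you at 3am for noise.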
Re: Stress Free Guide To Expanding a Cluster
Try setting indices.recovery.max_bytes_per_sec much higher for faster recovery. The default is 20mb/s, and there's a bug in versions prior to 1.2 that rate-limits recovery to even less than that. You didn't specify how big your indices are, so I can't accurately predict how long it'll take for the cluster to go green with that parameter.

mike

On Wednesday, June 25, 2014 8:20:02 AM UTC-4, Nikolas Everett wrote:

On Wed, Jun 25, 2014 at 8:05 AM, James Carr james@gmail.com wrote:

I launched two new EC2 instances to join the cluster and watched. Some shards began relocating, no big deal. Six hours later I checked in and some shards were still relocating, one shard was recovering. Weird, but whatever... the cluster health is still green and searches are working fine.

I add new nodes every once in a while and it can take a few hours for everything to balance out, but six hours is a bit long. It's possible. Do you have graphs of the count of relocating shards? Something like this can really help you figure out if everything balanced out at some point and then unbalanced. Example: http://ganglia.wikimedia.org/latest/graph_all_periods.php?c=Elasticsearch%20cluster%20eqiadh=elastic1001.eqiad.wmnetr=hourz=defaultjr=js=st=1403698335v=0m=es_relocating_shardsvl=shardsti=es_relocating_shardsz=large

Then I got an alert at 2:30am that the cluster state is now yellow, and found that we have 3 shards marked as recovering and 2 shards that are unassigned. The cluster still technically works, but 24 hours after the new nodes were added I feel like my only choice to get a green cluster again will be to simply launch 5 fresh nodes and replay all the data from backups into it. Ugh.

This sounds like one of the nodes bounced. It can take a long time to recover from that. It's something that is being worked on. Check the logs and see if you see anything about it. One thing to make sure of is that you set the number of master nodes correctly on all nodes.
If you have five master-eligible nodes then set it to 3. If the two new nodes aren't master eligible (so you have three master-eligible nodes) then set it to 2. SERIOUSLY!

What can I do to prevent this? I feel like I am missing something, because I always heard the strength of elasticsearch is its ease of scaling out, but it feels like every time I try it falls to the floor. :-(

It's always been pretty painless for me. I did have trouble when I added nodes that were broken: one time I added nodes without SSDs to a cluster with SSDs. Another time I didn't set the heap size on the new nodes and they worked until some shards moved to them. Then they fell over.

Nik
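To put numbers on the recovery-throttle advice at the top of this thread: recovery time is roughly data-to-move divided by indices.recovery.max_bytes_per_sec. A back-of-envelope sketch; the 200 GB figure is an example, not a number from this thread:

```python
def recovery_minutes(gb_to_move, mb_per_sec=20):
    """Minutes to stream `gb_to_move` GB of shard data at `mb_per_sec` MB/s
    (20 MB/s is the pre-1.2-era default throttle)."""
    return gb_to_move * 1024 / mb_per_sec / 60

# Example: moving ~200 GB of shards onto new nodes at the default throttle
# takes close to three hours; raising the setting to 200mb cuts that ~10x
# (assuming disks and network can actually sustain it).
default = recovery_minutes(200)        # at 20 MB/s
raised = recovery_minutes(200, 200)    # at 200 MB/s
```

This ignores concurrency (several shards recover in parallel, each throttled separately), so treat it as an upper-bound sanity check, not a prediction.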
Re: ES v1.1 continuous young gc pauses; old gc stops the world when old gc happens and splits cluster
Removing the -XX:+UseCMSInitiatingOccupancyOnly flag extended the time it took before the JVM started full GCs from about 2 hours to 7 hours in my cluster, but now it's back to constant full GCs. I'm out of ideas. Suggestions?

mike

On Monday, June 23, 2014 10:25:20 AM UTC-4, Michael Hart wrote:

My nodes are in Rackspace, so they are VMs, but they are configured without swap. I'm not entirely sure what the searches are up to; I'm going to investigate that further. I did correlate a rapid increase in heap used, number of segments (up from the norm of ~400 to 15,000), and consequently old GC counts when the cluster attempts to merge a 5GB segment. It seems that in spite of my really fast disks, the merge of a 5GB segment takes up to 15 minutes.

I've made two changes this morning, namely set these:

index.merge.scheduler.max_thread_count: 3
index.merge.policy.max_merged_segment: 1gb

The first is in the hope that while a large segment merge is underway, the two other threads can still keep the small segment merges going. The second is to keep the larger segment merges under control. I was ending up with two 5GB segments and a long tail of smaller ones. A quick model shows that by dropping this to 1GB I'll have 12 x 1GB segments and a similar long tail of smaller segments (about 50?).

I've also enabled GC logging on one node. I'll leave it running for the day, and tomorrow remove the -XX:+UseCMSInitiatingOccupancyOnly flag (used by default by elasticsearch) and see if there's any difference. I'll report back here in case this is of any use for anyone.

thanks
mike

On Friday, June 20, 2014 6:31:54 PM UTC-4, Clinton Gormley wrote:

* Do you have swap disabled? (any swap plays havoc with GC)
* You said you're doing scan/scroll - how many documents are you retrieving at a time?
Consider reducing the number.
* Are you running on a VM? That can cause you to swap even though your VM guest thinks that swap is disabled, or steal CPU (slowing down GC).

Essentially, if all of the above are false, you shouldn't be getting slow GCs unless you're under very heavy memory pressure (and I see that your old gen is not too bad, so that doesn't look likely).

On 20 June 2014 16:03, Michael Hart hart...@gmail.com wrote:

Thanks. I do see the GC warnings in the logs, such as:

[2014-06-19 20:17:06,603][WARN ][monitor.jvm] [redacted] [gc][old][179386][22718] duration [11.4s], collections [1]/[12.2s], total [11.4s]/[25.2m], memory [7.1gb]->[6.9gb]/[7.2gb], all_pools {[young] [158.7mb]->[7.4mb]/[266.2mb]}{[survivor] [32.4mb]->[0b]/[33.2mb]}{[old] [6.9gb]->[6.9gb]/[6.9gb]}

CPU idle is around 50% when the merge starts, and drops to zero by the time that first GC old warning is logged. During recovery my SSDs sustain 2400 IOPS, and during yesterday's outage I only saw about 800 IOPS before ES died. While I can throw more hardware at it, I'd prefer to do some tuning first if possible. The reason I was thinking of adding more shards is that the largest segment is 4.9GB (just under the default maximum set by index.merge.policy.max_merged_segment). I suppose the other option is to reduce the index.merge.policy.max_merged_segment setting to something smaller, but I have no idea what the implications are. Thoughts?

mike

On Friday, June 20, 2014 9:47:22 AM UTC-4, Ankush Jhalani wrote:

Mike - The above sounds like it happened due to machines sending too many indexing requests and merging being unable to keep pace. Usual suspects would be not enough CPU or disk bandwidth. This doesn't sound related to the memory constraints posted in the original issue of this thread. Do you see memory GC traces in the logs?

On Friday, June 20, 2014 9:40:48 AM UTC-4, Michael Hart wrote:

We're seeing the same thing.
ES 1.1.0, JDK 7u55 on Ubuntu 12.04, 5 data nodes, 3 separate masters, all are 15GB hosts with 7.5GB heaps, storage is SSD. Data set is ~1.6TB according to Marvel. Our daily indices are roughly 33GB in size, with 5 shards and 2 replicas.

I'm still investigating what happened yesterday, but I do see in Marvel a large spike in the Indices Current Merges graph just before the node dies, and a corresponding increase in JVM heap. When heap hits 99% everything grinds to a halt. Restarting the node fixes the issue, but this is the third or fourth time it's happened. I'm still researching how to deal with this, but a couple of things I am looking at are:

- increase the number of shards so that the segment merges stay smaller (is that even a legitimate sentence?). I'm still reading through the Index Modules Merge page http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-merge.html for more details.
- look at store level throttling http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling .

I would love
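The "quick model" mentioned earlier in this thread - predicting how many max-size segments a shard ends up with once index.merge.policy.max_merged_segment caps merge size - is essentially integer division, ignoring the long tail of smaller segments. A sketch with an assumed 12 GB shard (an illustrative figure, not one reported in the thread):

```python
def max_size_segments(shard_gb, max_merged_segment_gb):
    """Rough count of segments that hit the max_merged_segment ceiling in a
    shard of `shard_gb` GB. Ignores the long tail of sub-maximum segments
    that tiered merging also leaves behind."""
    return int(shard_gb // max_merged_segment_gb)

# With the default 5gb cap, a 12 GB shard settles around two 5 GB segments
# plus a tail; capping at 1gb trades those for ~12 smaller ones, so each
# individual merge is cheaper (shorter heap spikes) at the cost of more
# open segments per shard.
```

That trade-off (merge cost vs. segment count) is exactly what the two settings quoted above are tuning.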
G1 Garbage Collector with Elasticsearch >= 1.1
I'm running into a lot of issues with large heaps of >= 8GB and full GCs, as are a lot of others on this forum. Everything from Oracle/Sun indicates that the G1 garbage collector is supposed to deal with large heaps better, or at least give more consistency in terms of GC pauses, than the CMS garbage collector. Earlier posts in this forum indicate that there were bugs with the G1 collector and Trove that have now been fixed. Is there updated information and/or recommendations from Elasticsearch about using the G1 collector with Java 7u55 and Elasticsearch 1.1 or 1.2?

thanks
mike
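For anyone who wants to experiment despite the lack of an official recommendation: the usual approach in the 1.x era is to strip the CMS flags out of the startup script and substitute G1 ones. This is a hypothetical, test-cluster-only config sketch, not a supported configuration; the pause-time goal is an arbitrary example value.

```shell
# Experiment only -- not an official elasticsearch recommendation.
# First remove the CMS flags the 1.x startup script (bin/elasticsearch.in.sh)
# sets by default:
#   -XX:+UseConcMarkSweepGC
#   -XX:CMSInitiatingOccupancyFraction=75
#   -XX:+UseCMSInitiatingOccupancyOnly
# then enable G1 with a pause-time goal via the extra-opts hook:
export ES_JAVA_OPTS="-XX:+UseG1GC -XX:MaxGCPauseMillis=200"
```

Either way, keep GC logging on while testing, since the whole question is whether the pause profile actually improves under your workload.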