Sum aggregation with results from other aggregations?

2015-06-01 Thread Josh Harrison
Is it possible to create an aggregation where I can do a sum on the results of a sub bucket? I'm working on twitter data. In this data I have a bunch of retweets of different users. Say that user A has 10 tweets that are retweeted a hundred times in my dataset. I want to find the maximum
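The snippet above is cut off, so the exact goal is not known. As a rough sketch of the general idea (summing a numeric field inside each sub-bucket and surfacing the bucket with the largest sum), a `terms` aggregation can be ordered by a nested `sum` aggregation. The field names `retweeted_status.user.screen_name` and `retweet_count`, the aggregation names, and the helper function are all assumptions for illustration and are not taken from the thread.

```python
import json


def build_retweet_sum_agg(user_field="retweeted_status.user.screen_name",
                          count_field="retweet_count", size=1):
    """Build a search body that buckets by user and sums a field per bucket.

    Field names are hypothetical; adjust them to the actual mapping.
    """
    return {
        "size": 0,  # return aggregation results only, no hits
        "aggs": {
            "by_user": {
                "terms": {
                    "field": user_field,
                    "size": size,
                    # order buckets by the nested sum, largest first
                    "order": {"total_retweets": "desc"},
                },
                "aggs": {
                    "total_retweets": {"sum": {"field": count_field}}
                },
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_retweet_sum_agg(), indent=2))
```

With `size=1` only the bucket with the largest sum comes back; a larger `size` returns the top few users instead.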

Elasticsearch twitter river filtered stream question

2014-07-07 Thread Josh Harrison
Quick question about the ES twitter river at https://github.com/elasticsearch/elasticsearch-river-twitter. The twitter streaming API allows you to filter, and you apparently get up to 1% of the stream total with your search queries. So, if I were filtering for coffee, I'd get coffee tweets that

Automatic index balancing plugin or other solution?

2014-04-08 Thread Josh Harrison
I have heard that ideally, you want to have a similar number of documents per shard for optimal search times, is that correct? I have data volumes that are just all over the place, from 100k to tens of millions in a week. I'm thinking about a river plugin that could: Take a mapping object as a

Please explain the flow of data?

2014-03-21 Thread Josh Harrison
I'm trying to build a basic understanding of how indexing and searching work; hopefully someone can either point me to good resources or explain! I'm trying to figure out what having multiple coordinator nodes as defined in the elasticsearch.yml would do, and what having multiple search load

Re: Please explain the flow of data?

2014-03-21 Thread Josh Harrison
web: www.campaignmonitor.com On 22 March 2014 08:25, Josh Harrison hij...@gmail.com wrote: I'm trying to build a basic understanding of how indexing and searching works, hopefully someone can either point me to good resources or explain! I'm trying to figure out what having

Re: elasticsearch-py usage

2014-03-13 Thread Josh Harrison
It doesn't look like the elasticsearch-py API covers the river use case. When I've run into things like this I've always just run a manual curl request, or if I need to do it from within a script I just do a basic command with requests, à la
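The example from the original message is cut off in this listing. As a hedged sketch of the approach described (a plain HTTP call instead of the client library), the following builds a PUT request against the river `_meta` endpoint. It uses the standard-library `urllib` rather than `requests` so that it runs without extra dependencies. The host, the river name, the payload, and the helper function are placeholders and not taken from the thread.

```python
import json
import urllib.request


def build_river_request(river_name, meta, host="http://localhost:9200"):
    """Build (but do not send) a PUT request that registers a river.

    river_name and meta are placeholders; host is an assumed local node.
    """
    url = "{}/_river/{}/_meta".format(host.rstrip("/"), river_name)
    data = json.dumps(meta).encode("utf-8")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    req = build_river_request("my_river", {"type": "dummy"})
    print(req.get_method(), req.full_url)
    # To actually send it against a running node:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.read().decode("utf-8"))
```

Building the request separately from sending it makes the URL and body easy to inspect before anything touches the cluster.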

Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ river. I would like to stream data to cluster B as well. Do I just create a secondary outbound AMQ channel and subscribe cluster B to it, or is there a decent way to have a live copy of data going two places at once?

Re: Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
Analytics Solr Elasticsearch Support * http://sematext.com/ On Wednesday, March 12, 2014 2:55:58 PM UTC-4, Josh Harrison wrote: Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ river. I would like to stream data to cluster B as well. Do I just create a secondary

Too many nodes started up on some data nodes - best approach to fix?

2014-02-26 Thread Josh Harrison
I restarted my cluster the other day, but something odd stuck, resulting in 15/16 data nodes starting up an extra ES instance in the same cluster. This ended badly as there were two nodes with identical display names, the system locked up, etc. When restarting again, to my horror, we were

Random scan results?

2014-02-19 Thread Josh Harrison
I need to be able to pull 100s of thousands to millions of random documents from my indexes. Normally, to pull data this large I'd do a scan query, but they don't support sorting, so the suggestions I've seen online for randomizing your results don't work (such as those discussed here:

Re: Random scan results?

2014-02-19 Thread Josh Harrison
On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison hij...@gmail.com wrote: I need to be able to pull 100s of thousands to millions of random documents from my indexes. Normally, to pull data this large I'd do a scan query, but they don't support sorting, so the suggestions I've seen

Documents per shard

2014-02-14 Thread Josh Harrison
I've got indexes storing the same kind of data split into weekly chunks - there has been some fairly substantial variation in data volume. I've got a mapping change I need to make across all the back data, and I'm thinking it might make sense to try to rebalance the documents per shard so that

Re: Data Loss

2014-02-12 Thread Josh Harrison
I'm sure it isn't the case for everyone that is having data/shard problems, but I had some real trouble doing a full cluster restart on an 18 node cluster. Kinda nightmarish, actually, shards failing all over the place, lost data because of lost shards, etc. I finally realized that the

Re: Data Loss

2014-02-12 Thread Josh Harrison
This particular cluster is 16 data nodes with SSD RAIDs connected to each other and the two master nodes with infiniband. Under 100 indexes and usually 3 shards per index with 1 replica. Overall data volume is in the 1TB range. I haven't tweaked the shard allocation settings from default. -Josh

Re: Plugin development guidance

2014-02-11 Thread Josh Harrison
Great, thanks Jörg! I'll start fiddling around with the langdetect plugin to see if I can get it going with our library. On Tue, Feb 11, 2014 at 1:18 PM, joergpra...@gmail.com joergpra...@gmail.com wrote: An analyzer plugin is the right thing. Adding the recognized/extracted terms needs

Stress testing queries

2014-01-30 Thread Josh Harrison
Are there any decent ES specific stress testing tools out there that would allow me to test what kinds of simultaneous load my cluster can handle with concurrent users making queries? Searched around a bit and didn't see anything. Figured I'd ask before I come up with a test approach of my own!

Re: Stress testing queries

2014-01-30 Thread Josh Harrison
In our case, we're just interested in query stress testing. We've got a web app that queries our indexes that are organized based on weeks of the year, with a bunch of aliases making it so specific portions of the data can be reached easily. Questions about scaling the app have come up. In our

Re: Stress testing queries

2014-01-30 Thread Josh Harrison
like I said, it'd be in python since that's my language of choice. So it wouldn't be as optimal a testing platform as a native Java app, I guess, but still useful as a proof of concept. On Thursday, January 30, 2014 4:41:06 PM UTC-8, Josh Harrison wrote: In our case, we're just interested

Re: Multiple nodes on a powerful system?

2014-01-29 Thread Josh Harrison
Thanks Jörg, Mark and Nikolas, some great information here. The 6x6 configuration was something of a worst case example, the farthest we'd probably stretch it would be 3 nodes per host on 16-18 hosts, which should be a little more reasonable. Hopefully we'll be able to do a support contract

Scan/scroll facet data?

2014-01-28 Thread Josh Harrison
I've got fields that have a few hundred thousand+ unique values that I'd like to be able to facet on. Is there some way of essentially streaming the exhaustive list of facet results, like I can search hits?

Design practices for hosting multiple clusters/on-demand cluster creation?

2014-01-07 Thread Josh Harrison
While ES is still in a pre deployment stage at my job, there is growing interest in it. For various reasons, a monster cluster holding everyone's stuff is simply not possible. Individual projects require complete control over their data and the culture and security requirements here are such

Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
The subject says it all pretty much, is it possible to turn off the reporting of version data in response to GET http://localhost:9200? Thanks, Josh

Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
that you can disable returning the version field. -- Ivan On Thu, Dec 19, 2013 at 12:27 PM, Josh Harrison hij...@gmail.com wrote: The subject says it all pretty much, is it possible to turn off the reporting of version data in response to GET http://localhost:9200? Thanks, Josh