Sum aggregation with results from other aggregations?

2015-06-01 Thread Josh Harrison
Is it possible to create an aggregation where I can do a sum on the results of a sub bucket? I'm working on twitter data. In this data I have a bunch of retweets of different users. Say that user A has 10 tweets that are retweeted a hundred times in my dataset. I want to find the maximum retwee

Elasticsearch twitter river filtered stream question

2014-07-07 Thread Josh Harrison
Quick question about the ES twitter river at https://github.com/elasticsearch/elasticsearch-river-twitter The twitter streaming API allows you to filter, and you apparently get up to 1% of the stream total, with our search queries. So, if I were filtering for "coffee", I'd get "coffee" tweets th

Re: Automatic index balancing plugin or other solution?

2014-04-10 Thread Josh Harrison
ou mean equal distribution of docs with same average size across > shards. This means a search does not have to wait for nodes that must > search in larger shards. > > I do not think this needs a river plugin, since equal distribution of docs > over the shards is the default. >

Automatic index balancing plugin or other solution?

2014-04-08 Thread Josh Harrison
I have heard that ideally, you want to have a similar number of documents per shard for optimal search times, is that correct? I have data volumes that are just all over the place, from 100k to tens of millions in a week. I'm thinking about a river plugin that could: Take a mapping object as a

Re: Please explain the flow of data?

2014-03-21 Thread Josh Harrison
. > > Regards, > Mark Walkom > > Infrastructure Engineer > Campaign Monitor > email: ma...@campaignmonitor.com > web: www.campaignmonitor.com > > > On 22 March 2014 08:25, Josh Harrison >wrote: > >> I'm trying to build a basic understanding of how inde

Please explain the flow of data?

2014-03-21 Thread Josh Harrison
I'm trying to build a basic understanding of how indexing and searching works, hopefully someone can either point me to good resources or explain! I'm trying to figure out what having multiple "coordinator" nodes as defined in the elasticsearch.yml would do, and what having multiple "search load

Re: elasticsearch-py usage

2014-03-13 Thread Josh Harrison
It doesn't look like the elasticsearch-py API covers the river use case. When I've run into things like this I've always just run a manual curl request, or if I need to do it from within a script I just do a basic command with requests, à la requests.put("http://localhost:9200/_river/mydocs/_meta
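A minimal sketch of the kind of manual PUT described in the snippet above, using only the standard library so that it runs without a cluster. The helper name `build_river_meta_request` and the `{"type": "dummy"}` body are assumptions for illustration; the original message is cut off before its actual river settings appear.

```python
import json
import urllib.request


def build_river_meta_request(river_name, config, host="http://localhost:9200"):
    """Build (but do not send) a PUT request to a river's _meta document."""
    url = f"{host}/_river/{river_name}/_meta"
    body = json.dumps(config).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical river definition; a real river type needs its own settings.
request = build_river_meta_request("mydocs", {"type": "dummy"})
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it; with the requests library the equivalent is a `requests.put` call against the same URL with the same JSON body.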

Re: Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
onitoring * Log Analytics * Search Analytics > Solr & Elasticsearch Support * http://sematext.com/ > > > On Wednesday, March 12, 2014 2:55:58 PM UTC-4, Josh Harrison wrote: >> >> Say I have clusters A and B. Cluster A is consuming data using an >> ActiveM

Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ river. I would like to stream data to cluster B as well. Do I just create a secondary outbound AMQ channel and subscribe cluster B to it, or is there a decent way to have a live copy of data going two places at once? --

Too many nodes started up on some data nodes - best approach to fix?

2014-02-26 Thread Josh Harrison
I restarted my cluster the other day, but something odd stuck, resulting in 15/16 data nodes starting up an extra ES instance in the same cluster. This ended badly as there were two nodes with identical display names, the system locked up, etc. When restarting again, to my horror, we were missin

Re: Random scan results?

2014-02-19 Thread Josh Harrison
e possible to read sequentially anymore. > > > On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison > > wrote: > >> I need to be able to pull 100s of thousands to millions of random >> documents from my indexes. Normally, to pull data this large I'd do a scan >> qu

Random scan results?

2014-02-19 Thread Josh Harrison
I need to be able to pull 100s of thousands to millions of random documents from my indexes. Normally, to pull data this large I'd do a scan query, but they don't support sorting, so the suggestions I've seen online for randomizing your results don't work (such as those discussed here: http://s

Documents per shard

2014-02-14 Thread Josh Harrison
I've got indexes storing the same kind of data split into weekly chunks - there has been some fairly substantial variation in data volume. I've got a mapping change I need to make across all the back data, and I'm thinking it might make sense to try to rebalance the documents per shard so that

Re: Data Loss

2014-02-12 Thread Josh Harrison
This particular cluster is 16 data nodes with SSD RAIDs connected to each other and the two master nodes with infiniband. Under 100 indexes and usually 3 shards per index with 1 replica. Overall data volume is in the 1TB range. I haven't tweaked the shard allocation settings from default. -Josh

Re: Data Loss

2014-02-12 Thread Josh Harrison
I'm sure it isn't the case for everyone that is having data/shard problems, but I had some real trouble doing a full cluster restart on an 18 node cluster. Kinda nightmarish, actually, shards failing all over the place, lost data because of lost shards, etc. I finally realized that the gateway.r

Re: Plugin development guidance

2014-02-11 Thread Josh Harrison
Great, thanks Jörg! I'll start fiddling around with the langdetect plugin to see if I can get it going with our library. On Tue, Feb 11, 2014 at 1:18 PM, joergpra...@gmail.com < joergpra...@gmail.com> wrote: > An analyzer plugin is the right thing. Adding the recognized/extracted > terms needs a

Plugin development guidance

2014-02-11 Thread Josh Harrison
Hi all, We've got an internal Java library that allows us to do keyword extraction that seems like a great thing to turn into an integrated elasticsearch function. Ultimately, I want to be able to access the result of this library from search results/etc, but I wanted to do a sanity check to ma

Re: Stress testing queries

2014-01-30 Thread Josh Harrison
many systems as you like. This is just a first pass at the idea, there may be some dumb mistakes in logic or oversights about test cases, but I think an app like this could be pretty useful. Heck, you could have a GUI on it, or just make it run off a yaml file or something. If it gets into my

Re: Stress testing queries

2014-01-30 Thread Josh Harrison
In our case, we're just interested in query stress testing. We've got a web app that queries our indexes that are organized based on weeks of the year, with a bunch of aliases making it so specific portions of the data can be reached easily. Questions about scaling the app have come up. In our c

Stress testing queries

2014-01-30 Thread Josh Harrison
Are there any decent ES-specific stress testing tools out there that would allow me to test what kind of simultaneous load my cluster can handle with concurrent users making queries? Searched around a bit and didn't see anything. Figured I'd ask before I come up with a test approach of my own!

[Hadoop] capability clarification questions

2014-01-30 Thread Josh Harrison
In looking around I haven't been able to find explicit answers to these questions - though the questions may entirely be because I'm a hadoop newbie. If we were to deploy ES within a hadoop environment: The primary benefit is allowing direct interaction with ES from Hadoop, running queries or i

Re: Multiple nodes on a powerful system?

2014-01-29 Thread Josh Harrison
Thanks Jörg, Mark and Nikolas, some great information here. The 6x6 configuration was something of a worst case example, the farthest we'd probably stretch it would be 3 nodes per host on 16-18 hosts, which should be a little more reasonable. Hopefully we'll be able to do a support contract wit

Multiple nodes on a powerful system?

2014-01-29 Thread Josh Harrison
Other than the resource footprint, is there any reason we should avoid running multiple node instances of a cluster on the same machine, assuming all the shard awareness stuff is in place to keep all the copies of a given shard from being stored on those nodes that are all resident on a single

Scan/scroll facet data?

2014-01-28 Thread Josh Harrison
I've got fields that have a few hundred thousand+ unique values that I'd like to be able to facet on. Is there some way of essentially streaming the exhaustive list of facet results, like I can search hits? -- You received this message because you are subscribed to the Google Groups "elastics

Get counts of new users per day

2014-01-23 Thread Josh Harrison
I'm working with the twitter river data. Trying to figure out how to construct a query that would let me generate a count of all new users within a given time period, where in this case, new means "user has not had a post captured before the start of this query window". So basically, I want to
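The definition of "new" in the snippet above can be pinned down with a small client-side sketch. This is not an Elasticsearch query, and the function name and the `(user, timestamp)` tuple layout are assumptions for illustration only: a user counts as new when their earliest captured post falls inside the query window.

```python
from datetime import datetime


def count_new_users(posts, window_start, window_end):
    """Count users whose earliest captured post lies in [window_start, window_end)."""
    first_seen = {}
    for user, timestamp in posts:
        # Keep only the earliest timestamp seen for each user.
        if user not in first_seen or timestamp < first_seen[user]:
            first_seen[user] = timestamp
    return sum(
        1 for ts in first_seen.values() if window_start <= ts < window_end
    )


# Hypothetical sample data: alice posted before the window, carol after it.
posts = [
    ("alice", datetime(2014, 1, 1)),
    ("alice", datetime(2014, 1, 10)),
    ("bob", datetime(2014, 1, 9)),
    ("carol", datetime(2014, 1, 12)),
]
new_users = count_new_users(posts, datetime(2014, 1, 8), datetime(2014, 1, 11))
print(new_users)
```

Doing the same thing inside the cluster would require the earliest post per user across all captured data, which is why a simple date-range filter on posts alone is not enough.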

Cross index filtering

2014-01-10 Thread Josh Harrison
I've got a backlog of usernames. I need to pull data associated with those usernames into a separate index. I want a query that will let me do a terms facet on user counts in the first index, with a facet filter excluding the users that already exist in the second. Is there a simple way to do t

Design practices for hosting multiple clusters/on-demand cluster creation?

2014-01-07 Thread Josh Harrison
While ES is still in a pre deployment stage at my job, there is growing interest in it. For various reasons, a monster cluster holding everyone's stuff is simply not possible. Individual projects require complete control over their data and the culture and security requirements here are such th

Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
on/admin/cluster/node/stats/RestNodesStatsAction.java > > The main action does not really have specific request/response classes. > You can try raising an issue or even submitting a pull request yourself, > but I do not see this issue as being very important. That is just my guess.

Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
rity risk. Sigh. Honestly, if it doesn't break something else, I wouldn't mind if there was just a way to turn off that default response entirely. That'd do it too. On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote: > > From what I can tell from the code,

Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
The subject pretty much says it all: is it possible to turn off the reporting of version data in response to GET http://localhost:9200? Thanks, Josh

Re: ES data on Glusterfs

2013-12-17 Thread Josh Harrison
Cool, thanks. It looks like I've conflated gluster and lustre which are unfortunately totally unrelated. We're running lustre. On Tuesday, December 17, 2013 5:58:39 PM UTC-8, Jörg Prante wrote: > > Use Gluster on its native protocol, not on NFS and the like. > > If you want backup/restore, Glust

Auto aliases?

2013-12-17 Thread Josh Harrison
So I know how to set up default mappings. Is it possible to set up default aliases for indexes with a certain name format? I'm pulling in data on a weekly basis, so say I've got 2013_41, and all my filtered aliases (based on the contents of the document) set up. When 2013_42 comes along, do I ne

ES data on Glusterfs

2013-12-17 Thread Josh Harrison
Is it a dumb idea to have my ES data directory on a GlusterFS/Lustre storage node? We've got some BIG data that we'd like to index, but it's too big for the local storage on our test cluster.