Is it possible to create an aggregation where I can do a sum on the results
of a sub-bucket?
I'm working on twitter data. In this data I have a bunch of retweets of
different users.
Say that user A has 10 tweets that are retweeted a hundred times in my
dataset. I want to find the maximum retwee
Quick question about the ES twitter river at
https://github.com/elasticsearch/elasticsearch-river-twitter
The twitter streaming API allows you to filter, and you apparently get up
to 1% of the stream total, with our search queries. So, if I were filtering
for "coffee", I'd get "coffee" tweets th
ou mean equal distribution of docs with same average size across
> shards. This means a search does not have to wait for nodes that must
> search in larger shards.
>
> I do not think this needs a river plugin, since equal distribution of docs
> over the shards is the default.
>
>
I have heard that ideally, you want to have a similar number of documents
per shard for optimal search times, is that correct?
I have data volumes that are just all over the place, from 100k to tens of
millions in a week.
I'm thinking about a river plugin that could:
Take a mapping object as a
.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
>
>
> On 22 March 2014 08:25, Josh Harrison wrote:
>
>> I'm trying to build a basic understanding of how inde
I'm trying to build a basic understanding of how indexing and searching
works, hopefully someone can either point me to good resources or explain!
I'm trying to figure out what having multiple "coordinator" nodes as
defined in the elasticsearch.yml would do, and what having multiple "search
load
It doesn't look like the elasticsearch-py API covers the river use case.
When I've run into things like this I've always just run a manual curl
request, or, if I need to do it from within a script, I just do a basic
command with requests, à la
requests.put("http://localhost:9200/_river/mydocs/_meta
onitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Wednesday, March 12, 2014 2:55:58 PM UTC-4, Josh Harrison wrote:
>>
>> Say I have clusters A and B. Cluster A is consuming data using an
>> ActiveM
Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ
river. I would like to stream data to cluster B as well. Do I just create a
secondary outbound AMQ channel and subscribe cluster B to it, or is there a
decent way to have a live copy of data going two places at once?
--
I restarted my cluster the other day, but something odd stuck, resulting in
15/16 data nodes starting up an extra ES instance in the same cluster. This
ended badly as there were two nodes with identical display names, the
system locked up, etc.
When restarting again, to my horror, we were missin
e possible to read sequentially anymore.
>
>
> On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison wrote:
>
>> I need to be able to pull 100s of thousands to millions of random
>> documents from my indexes. Normally, to pull data this large I'd do a scan
>> qu
I need to be able to pull 100s of thousands to millions of random documents
from my indexes. Normally, to pull data this large I'd do a scan query, but
they don't support sorting, so the suggestions I've seen online for
randomizing your results don't work (such as those discussed here:
http://s
I've got indexes storing the same kind of data split into weekly chunks -
there has been some fairly substantial variation in data volume.
I've got a mapping change I need to make across all the back data, and I'm
thinking it might make sense to try to rebalance the documents per shard so
that
This particular cluster is 16 data nodes with SSD RAIDs connected to each
other and the two master nodes with infiniband.
Under 100 indexes and usually 3 shards per index with 1 replica. Overall
data volume is in the 1TB range.
I haven't tweaked the shard allocation settings from default.
-Josh
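
As a rough back-of-the-envelope check on the figures above, the shard count per data node can be estimated as below. This is only a sketch: it treats "under 100 indexes" as exactly 100 (so it is an upper bound), assumes every index has 3 primaries and 1 replica, and assumes shards are spread evenly across the 16 data nodes. The function name `shard_layout` is made up for illustration.

```python
def shard_layout(indexes, primaries_per_index, replicas, data_nodes):
    """Return (total shard copies, average shard copies per data node)."""
    # Each primary shard has `replicas` additional copies.
    total_shards = indexes * primaries_per_index * (1 + replicas)
    return total_shards, total_shards / data_nodes


# Assumed upper-bound figures taken from the description above.
total, per_node = shard_layout(
    indexes=100, primaries_per_index=3, replicas=1, data_nodes=16
)
print(total, per_node)
```

With these assumptions the cluster holds at most 600 shard copies, or roughly 37 to 38 per data node.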
I'm sure it isn't the case for everyone that is having data/shard problems,
but I had some real trouble doing a full cluster restart on an 18 node
cluster. Kinda nightmarish, actually, shards failing all over the place,
lost data because of lost shards, etc.
I finally realized that the gateway.r
Great, thanks Jörg!
I'll start fiddling around with the langdetect plugin to see if I can get
it going with our library.
On Tue, Feb 11, 2014 at 1:18 PM, joergpra...@gmail.com <
joergpra...@gmail.com> wrote:
> An analyzer plugin is the right thing. Adding the recognized/extracted
> terms needs a
Hi all,
We've got an internal Java library that allows us to do keyword extraction
that seems like a great thing to turn into an integrated elasticsearch
function.
Ultimately, I want to be able to access the result of this library from
search results/etc, but I wanted to do a sanity check to ma
many systems as you like.
This is just a first pass at the idea, there may be some dumb mistakes in
logic or oversights about test cases, but I think an app like this could be
pretty useful. Heck, you could have a GUI on it, or just make it run off a
yaml file or something.
If it gets into my
In our case, we're just interested in query stress testing. We've got a web
app that queries our indexes that are organized based on weeks of the year,
with a bunch of aliases making it so specific portions of the data can be
reached easily. Questions about scaling the app have come up. In our c
Are there any decent ES specific stress testing tools out there that would
allow me to test what kinds of simultaneous load my cluster can handle with
concurrent users making queries? Searched around a bit and didn't see
anything.
Figured I'd ask before I come up with a test approach of my own!
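
One possible home-grown approach, sketched here only as a starting point, is to simulate concurrent users with a thread pool and record per-query latency. Everything in it is an assumption for illustration: `run_load_test` and `fake_query` are hypothetical names, and `fake_query` just sleeps instead of talking to a cluster. In a real test it would be replaced by a function that sends one search request.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(query_fn, concurrent_users, queries_per_user):
    """Run query_fn from several threads and summarize the latencies."""

    def one_user(_user_id):
        latencies = []
        for _ in range(queries_per_user):
            start = time.perf_counter()
            query_fn()  # one simulated query
            latencies.append(time.perf_counter() - start)
        return latencies

    # One worker thread per simulated user.
    with ThreadPoolExecutor(max_workers=concurrent_users) as pool:
        per_user = list(pool.map(one_user, range(concurrent_users)))

    all_latencies = sorted(lat for user in per_user for lat in user)
    return {
        "count": len(all_latencies),
        "min": all_latencies[0],
        # Middle element of the sorted list (upper middle for even counts).
        "median": all_latencies[len(all_latencies) // 2],
        "max": all_latencies[-1],
    }


def fake_query():
    # Hypothetical stand-in for a real search request.
    time.sleep(0.001)


stats = run_load_test(fake_query, concurrent_users=4, queries_per_user=5)
print(stats["count"], stats["median"])
```

Raising `concurrent_users` step by step and watching how the median and max latency change gives a first impression of where the cluster starts to struggle.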
In looking around I haven't been able to find explicit answers to these
questions - though the questions may entirely be because I'm a Hadoop
newbie.
If we were to deploy ES within a Hadoop environment:
The primary benefit is allowing direct interaction with ES from Hadoop,
running queries or i
Thanks Jörg, Mark and Nikolas, some great information here. The 6x6
configuration was something of a worst case example, the farthest we'd
probably stretch it would be 3 nodes per host on 16-18 hosts, which should
be a little more reasonable. Hopefully we'll be able to do a support
contract wit
Other than the resource footprint, is there any reason we should avoid
running multiple node instances of a cluster on the same machine, assuming
all the shard awareness stuff is in place to keep all the copies of a given
shard from being stored on those nodes that are all resident on a single
I've got fields that have a few hundred thousand+ unique values that I'd
like to be able to facet on. Is there some way of essentially streaming the
exhaustive list of facet results, like I can search hits?
--
You received this message because you are subscribed to the Google Groups
"elastics
I'm working with the twitter river data. Trying to figure out how to
construct a query that would let me generate a count of all new users
within a given time period, where in this case, new means "user has not had
a post captured before the start of this query window".
So basically, I want to
I've got a backlog of usernames. I need to pull data associated with those
usernames into a separate index.
I want a query that will let me do a terms facet on user counts in the
first index, with a facet filter excluding the users that already exist in
the second. Is there a simple way to do t
While ES is still in a pre-deployment stage at my job, there is growing
interest in it. For various reasons, a monster cluster holding everyone's
stuff is simply not possible. Individual projects require complete control
over their data and the culture and security requirements here are such
th
on/admin/cluster/node/stats/RestNodesStatsAction.java
>
> The main action does not really have specific request/response classes.
> You can try raising an issue or even submitting a pull request yourself,
> but I do not see this issue as being very important. That is just my guess.
>
rity risk.
Sigh.
Honestly, if it doesn't break something else, I wouldn't mind if there was
just a way to turn off that default response entirely. That'd do it too.
On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote:
>
> From what I can tell from the code,
The subject pretty much says it all: is it possible to turn off the
reporting of version data in response to GET http://localhost:9200?
Thanks,
Josh
Cool, thanks. It looks like I've conflated Gluster and Lustre, which are
unfortunately totally unrelated. We're running Lustre.
On Tuesday, December 17, 2013 5:58:39 PM UTC-8, Jörg Prante wrote:
>
> Use Gluster on its native protocol, not on NFS and the like.
>
> If you want backup/restore, Glust
So I know how to set up default mappings. Is it possible to set up default
aliases for indexes with a certain name format?
I'm pulling in data on a weekly basis, so say I've got 2013_41, and all my
filtered aliases (based on the contents of the document) set up. When
2013_42 comes along, do I ne
Is it a dumb idea to have my ES data directory on a GlusterFS/Lustre
storage node? We've got some BIG data that we'd like to index, but
it's too big for the local storage on our test cluster.
33 matches