Sum aggregation with results from other aggregations?

2015-06-01 Thread Josh Harrison
Is it possible to create an aggregation where I can do a sum on the results 
of a sub bucket?

I'm working on twitter data. In this data I have a bunch of retweets of 
different users.
Say that user A has 10 tweets that are retweeted a hundred times in my 
dataset. I want to find the maximum retweet_count for each individual 
tweet, and then I want to find the sum of all of those maximums from an 
individual user.
This is the base query structure I'm working with: 

{
  "aggs": {
    "user_id": {
      "terms": {
        "field": "retweet_user_id"
      },
      "aggs": {
        "tweet_ids": {
          "terms": {
            "field": "retweet_id",
            "order": { "max_tweet.value": "desc" }
          },
          "aggs": {
            "max_tweet": {
              "max": {
                "field": "retweet_count"
              }
            }
          }
        }
      }
    }
  }
}



Importantly here, I don't want to just take a sum of "retweet_count" for a 
given retweet_user_id - this doesn't give the max value per tweet.


Essentially, is it possible for me to take a sum of the agg results at 
user_id.tweet_ids.max_tweet.value, and use that as an "order" term in the 
user_id terms agg?
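
For reference, the sum_bucket pipeline aggregation (added in Elasticsearch 2.0, 
so not available on earlier clusters) computes exactly this kind of sum over a 
sibling aggregation's buckets. A sketch reusing the fields above, with a 
hypothetical index name; whether the user_id terms buckets can then be ordered 
by that pipeline value is a separate question:

curl -XPOST 'localhost:9200/tweets/_search' -d '{
  "size": 0,
  "aggs": {
    "user_id": {
      "terms": { "field": "retweet_user_id" },
      "aggs": {
        "tweet_ids": {
          "terms": { "field": "retweet_id" },
          "aggs": {
            "max_tweet": { "max": { "field": "retweet_count" } }
          }
        },
        "sum_of_maxes": {
          "sum_bucket": { "buckets_path": "tweet_ids>max_tweet" }
        }
      }
    }
  }
}'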





Elasticsearch twitter river filtered stream question

2014-07-07 Thread Josh Harrison
Quick question about the ES twitter river at 
https://github.com/elasticsearch/elasticsearch-river-twitter
The Twitter streaming API allows you to filter, and you apparently get up 
to 1% of the total stream matching your search queries. So, if I were filtering 
for "coffee", I'd get "coffee" tweets that I wouldn't get if I were just 
capturing the 1% sample stream passively.
Does the Twitter river use this filter functionality, or does it do its 
filtering on the ingestion side, ingesting the normal 1% stream and 
discarding anything that doesn't match?
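
If I'm reading the plugin's README right, its _meta settings describe a filter 
section (field names may vary by plugin version, and the required oauth block 
is omitted here), which is the part the question hinges on. A hedged sketch of 
registering the river with a track filter:

curl -XPUT 'localhost:9200/_river/my_twitter_river/_meta' -d '{
  "type": "twitter",
  "twitter": {
    "filter": {
      "tracks": "coffee"
    }
  },
  "index": {
    "index": "my_twitter_river",
    "type": "status"
  }
}'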



Re: Automatic index balancing plugin or other solution?

2014-04-10 Thread Josh Harrison
Interesting, ok. A colleague went to training with Elasticsearch and was 
told that, given a default index with N shards, keeping index sizes similar was 
critical for maintaining consistent search performance. I guess that could 
play out as a two-billion-record index having a huge number of unique terms, 
while a smaller, say, 100k-record index would have a substantially smaller 
set of terms, right?
Dealing with content from sources like the Twitter public API, I would 
anticipate fairly linear growth of unique terms and overall index size. 
This ultimately leads to the scenario I described initially, where a larger 
index is comparatively slower to search due to its necessarily larger 
dictionary. It seems as though there'd still be room for the kind of 
automatic scaling via a template system described above?

On Wednesday, April 9, 2014 7:38:35 AM UTC-7, Jörg Prante wrote:
>
> The number of documents is not relevant to the search time.
>
> Important factors for search time are the type of query, shard size, the 
> number of unique terms (the dictionary size), the number of segments, 
> network latency, disk drive latency, ...
>
> Maybe you mean equal distribution of docs with same average size across 
> shards. This means a search does not have to wait for nodes that must 
> search in larger shards.
>
> I do not think this needs a river plugin, since equal distribution of docs 
> over the shards is the default.
>
> Jörg
>
>
> On Tue, Apr 8, 2014 at 9:03 PM, Josh Harrison wrote:
>
>> I have heard that ideally, you want to have a similar number of documents 
>> per shard for optimal search times, is that correct?
>>
>> I have data volumes that are just all over the place, from 100k to tens 
>> of millions in a week.
>>
>> I'm thinking about a river plugin that could:
>> Take a mapping object as a template
>> Define a template for child index names (project_YYYY_MM_DD_NNN = 
>> project_2014_04_08_000, etc)
>> Define index shard count (5)
>> Define maximum index size (1,000,000)
>> Define a listening endpoint of some sort
>>
>> Documents would stream into the listening endpoint however you wanted, 
>> rivers, bulk loads using an API, etc. They would be automatically routed to 
>> the lowest numbered not-full index. So on a given day you could end up with 
>> fifteen indexes, or eighty, or two, but they'd all be a maximum of N 
>> records.
>>
>> A plugin seems desirable in this case, as it frees you from needing to 
>> write the load balancing into every ingestion stream you've got.
>>
>> Is this a reasonable solution to this problem? Am I overcomplicating 
>> things? 
>>  



Automatic index balancing plugin or other solution?

2014-04-08 Thread Josh Harrison
I have heard that ideally, you want to have a similar number of documents 
per shard for optimal search times, is that correct?

I have data volumes that are just all over the place, from 100k to tens of 
millions in a week.

I'm thinking about a river plugin that could:
Take a mapping object as a template
Define a template for child index names (project_YYYY_MM_DD_NNN = 
project_2014_04_08_000, etc)
Define index shard count (5)
Define maximum index size (1,000,000)
Define a listening endpoint of some sort

Documents would stream into the listening endpoint however you wanted, 
rivers, bulk loads using an API, etc. They would be automatically routed to 
the lowest numbered not-full index. So on a given day you could end up with 
fifteen indexes, or eighty, or two, but they'd all be a maximum of N 
records.
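
The decision the plugin would automate is itself simple - roughly this, per 
batch, using the hypothetical index names from the template above (the "doc" 
type and document body are placeholders):

# how full is the current child index?
curl -XGET 'localhost:9200/project_2014_04_08_000/_count'

# if it has hit the configured maximum, create the next child...
curl -XPUT 'localhost:9200/project_2014_04_08_001' -d '{
  "settings": { "number_of_shards": 5 }
}'

# ...and route new documents to the lowest numbered not-full index
curl -XPOST 'localhost:9200/project_2014_04_08_001/doc' -d '{"foo": "bar"}'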

A plugin seems desirable in this case, as it frees you from needing to 
write the load balancing into every ingestion stream you've got.

Is this a reasonable solution to this problem? Am I overcomplicating 
things? 



Re: Please explain the flow of data?

2014-03-21 Thread Josh Harrison
Awesome, ok, thank you.
Is the logic behind not allowing storage on master nodes twofold:
to take advantage of a system with limited storage resources,
and
to have a dedicated results aggregator/search handler?

I can imagine that with a particularly badly written, gnarly search, trying 
to deal with the results on a master while also fielding queries at the same 
time could be bad.

So in a 16 node cluster you'd want to have 9 nodes allowed to be masters, 
(n/2)+1?

Thanks again!
Josh


On Friday, March 21, 2014 3:20:24 PM UTC-7, Mark Walkom wrote:
>
> A couple of things;
>
>1. You should have n/2+1 masters in your cluster, where n = number of 
>nodes. This helps prevent split brain situations and is best practise.
>2. Your master nodes can store data, this way you don't need to add 
>more nodes to fulfil the above. 
>
> Your indexing scenario is correct. 
> For searching, replicas and primaries can be queried.
> For both - Adding more masters adds redundancy as per the first two 
> points. Adding more search nodes won't do much though other than reduce the 
> load on your masters (unless someone else can add anything I don't know :p).
>
> And for your final question, yes that is correct.
>
> To give you an idea of practical application, we don't use search nodes 
> but have 3 non-data masters that handle all queries, and a bunch of data 
> only nodes for storing everything.
>
> Regards,
> Mark Walkom
>
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com 
> web: www.campaignmonitor.com
>
>
> On 22 March 2014 08:25, Josh Harrison wrote:
>
>> I'm trying to build a basic understanding of how indexing and searching 
>> works, hopefully someone can either point me to good resources or explain!
>> I'm trying to figure out what having multiple "coordinator" nodes as 
>> defined in the elasticsearch.yml would do, and what having multiple "search 
>> load balancer" nodes would do. Both in the context of indexing and 
>> searching.
>> Is there a functional difference between a "coordinator" node and a 
>> "search load balancer" node, beyond the fact that a "search load balancer" 
>> node can't be elected master?
>>
>>
>> Say I have a 4 node cluster. There's a master only "coordinator" node, 
>> that doesn't store data, named "master". 
>> node.master: true
>> node.data: false
>>
>> There are three data only nodes, "A", "B" and "C" 
>> node.master: false
>> node.data: true
>>
>> I have an index "test" with two shards and one replica. Primary shard 0 
>> lives on A, primary shard 1 lives on C, replica shard 0 lives on B, replica 
>> shard 1 lives on A.
>>
>> I send the command
>> curl -XPOST http://master:9200/test/test -d '{"foo":"bar"}'
>>
>> A connection is made to master, and the data is sent to master to be 
>> indexed. Master randomly decides to place this document in shard 1, so it 
>> gets sent to the primary shard 1 on C and replica shard 1 on A, right? This 
>> is where routing can come in, I can say that that document really should go 
>> to shard 0 because I said so.
>>
>> So this is a fairly simple scenario, assuming I'm correct.
>>
>> What benefit do I get to indexing when I add more "coordinator" nodes?
>> node.master: true
>> node.data: false
>>
>> What about if I add "search load balancer" nodes?
>> node.master: false
>> node.data: false
>>
>>
>>
>> How about on the searching side of things?
>> I send a search to master,
>> curl -XPOST http://master:9200/test/test/_search -d 
>> '{"query":{"match_all":{}}}'
>>
>> Master sends these queries off to A, B and C, who each generate their own 
>> results and return them to master. Each data node queries all the relevant 
>> shards that are present locally and then combines those results for 
>> delivery to master. Do only primary shards get queried, or are replica 
>> shards queried too? 
>> Master takes these combined results from all the relevant nodes and 
>> combines them into the final query response.
>>
>> Same questions:
>> What benefit do I get to searching when I add more nodes that are like 
>> master?
>> node.master: true
>> node.data: false
>>
>> What about if I add "search load balancer" nodes?
>> node.master: false
>> node.data: false
>>  
>>

Please explain the flow of data?

2014-03-21 Thread Josh Harrison
I'm trying to build a basic understanding of how indexing and searching 
works, hopefully someone can either point me to good resources or explain!
I'm trying to figure out what having multiple "coordinator" nodes as 
defined in the elasticsearch.yml would do, and what having multiple "search 
load balancer" nodes would do. Both in the context of indexing and 
searching.
Is there a functional difference between a "coordinator" node and a "search 
load balancer" node, beyond the fact that a "search load balancer" node 
can't be elected master?


Say I have a 4 node cluster. There's a master only "coordinator" node, that 
doesn't store data, named "master". 
node.master: true
node.data: false

There are three data only nodes, "A", "B" and "C" 
node.master: false
node.data: true

I have an index "test" with two shards and one replica. Primary shard 0 
lives on A, primary shard 1 lives on C, replica shard 0 lives on B, replica 
shard 1 lives on A.

I send the command
curl -XPOST http://master:9200/test/test -d '{"foo":"bar"}'

A connection is made to master, and the data is sent to master to be 
indexed. Master randomly decides to place this document in shard 1, so it 
gets sent to the primary shard 1 on C and replica shard 1 on A, right? This 
is where routing can come in, I can say that that document really should go 
to shard 0 because I said so.
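
(For reference, that override is just the routing parameter on the index 
request - the value below is a made-up example, and whatever you pass is 
hashed to pick the shard:)

curl -XPOST 'http://master:9200/test/test?routing=user42' -d '{"foo":"bar"}'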

So this is a fairly simple scenario, assuming I'm correct.

What benefit do I get to indexing when I add more "coordinator" nodes?
node.master: true
node.data: false

What about if I add "search load balancer" nodes?
node.master: false
node.data: false



How about on the searching side of things?
I send a search to master,
curl -XPOST http://master:9200/test/test/_search -d 
'{"query":{"match_all":{}}}'

Master sends these queries off to A, B and C, who each generate their own 
results and return them to master. Each data node queries all the relevant 
shards that are present locally and then combines those results for 
delivery to master. Do only primary shards get queried, or are replica 
shards queried too? 
Master takes these combined results from all the relevant nodes and 
combines them into the final query response.
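
(On the primary-versus-replica point, the search preference parameter lets 
the caller influence which copies are used - for example _primary, 
_primary_first or _local; by default either copy of a shard may serve the 
request:)

curl -XPOST 'http://master:9200/test/test/_search?preference=_primary' -d 
'{"query":{"match_all":{}}}'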

Same questions:
What benefit do I get to searching when I add more nodes that are like 
master?
node.master: true
node.data: false

What about if I add "search load balancer" nodes?
node.master: false
node.data: false


Is the only difference between a 
node.master: true
node.data: false
and a
node.master: false
node.data: false
that the node is a candidate to be a master, should it be elected?



Re: elasticsearch-py usage

2014-03-13 Thread Josh Harrison
It doesn't look like the elasticsearch-py API covers the river use case. 
When I've run into things like this I've always just run a manual curl 
request, or if I need to do it from within a script I just do a basic 
call with requests, a la
requests.put("http://localhost:9200/_river/mydocs/_meta", data='{"type": 
"fs", "fs": {"url": "/tmp", "update_rate": 90, "includes": 
"*.doc,*.pdf", "excludes": "resume"}}')
Not the most elegant approach, but it works!

On Thursday, March 13, 2014 1:57:55 PM UTC-7, Kent Tenney wrote:
>
> From the fsriver doc: 
>
> curl -XPUT 'localhost:9200/_river/mydocs/_meta' -d '{ 
>   "type": "fs", 
>   "fs": { 
>  "url": "/tmp", 
>  "update_rate": 90, 
>  "includes": "*.doc,*.pdf", 
>  "excludes": "resume" 
>} 
> }' 
>
> How does this translate to the Python API? 
>
> Thanks, 
> Kent 
>



Re: Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
Kafka looks interesting, though at this point we're actively trying to 
reduce the number of moving parts, so I think an AMQ based approach is what 
we'll ultimately go for.
Seems like there might be room here for an 
elasticsearch-to-elasticsearch river plugin or something - to do one- or 
two-way, close-to-real-time replication of a selected set of indexes between 
separate clusters. That way you could easily mirror prod data to a dev 
environment without depending on the ability to do the duplication earlier 
in the pipeline, or depending on scripts to move the data around.


On Wednesday, March 12, 2014 2:46:19 PM UTC-7, Otis Gospodnetic wrote:
>
> Consider Kafka 0.8.1.  It comes with a MirrorMaker tool that mirrors Kafka 
> data (to multiple DCs).  Once data is local, you can feed your ES from the 
> local Kafka broker.
>
> Otis
> --
> Performance Monitoring * Log Analytics * Search Analytics
> Solr & Elasticsearch Support * http://sematext.com/
>
>
> On Wednesday, March 12, 2014 2:55:58 PM UTC-4, Josh Harrison wrote:
>>
>> Say I have clusters A and B. Cluster A is consuming data using an 
>> ActiveMQ river. I would like to stream data to cluster B as well. Do I just 
>> create a secondary outbound AMQ channel and subscribe cluster B to it, or 
>> is there a decent way to have a live copy of data going two places at once?
>>
>



Best way to duplicate data across clusters live?

2014-03-12 Thread Josh Harrison
Say I have clusters A and B. Cluster A is consuming data using an ActiveMQ 
river. I would like to stream data to cluster B as well. Do I just create a 
secondary outbound AMQ channel and subscribe cluster B to it, or is there a 
decent way to have a live copy of data going two places at once?



Too many nodes started up on some data nodes - best approach to fix?

2014-02-26 Thread Josh Harrison
I restarted my cluster the other day, but something odd stuck, resulting in 
15/16 data nodes starting up an extra ES instance in the same cluster. This 
ended badly as there were two nodes with identical display names, the 
system locked up, etc.
When restarting again, to my horror, we were missing shards. I quickly 
figured out that the missing shards had gotten moved into the second 
instance storage location.
What is the best way to resolve this? Should we spawn second ES 
instances on the culprit machines (with different instance names), or can a 
simple 
mv escluster/nodes/1/indices/data1/* escluster/nodes/0/indices/data1/ 
do the job?

Thanks!



Re: Random scan results?

2014-02-19 Thread Josh Harrison
Darn ok. Thank you.
If I'm retrieving large numbers of random, largish documents (twitter river 
records), is there a particular pattern I should use for searching? That 
is, does it make sense to send 20 sequential queries with size 10,000 and 
random sorting, or a single query with a size of 200,000? What about up 
into the millions? Obviously we're risking duplication of results when 
sending multiple smaller queries, but that is OK for our purposes, or can 
be dealt with at another stage of the process outside ES.
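
For reference, the usual workaround from that StackOverflow thread is a 
function_score query with random_score, which works with ordinary from/size 
paging (not scan); a sketch, assuming the version in use supports it, with a 
made-up index name - changing the seed between requests changes the ordering, 
at the cost of the duplicate risk noted above:

curl -XPOST 'localhost:9200/my_index/_search?size=10000' -d '{
  "query": {
    "function_score": {
      "query": { "match_all": {} },
      "random_score": { "seed": 42 }
    }
  }
}'
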
Thanks,
Josh

On Wednesday, February 19, 2014 12:41:58 PM UTC-8, Adrien Grand wrote:
>
> Hi Josh,
>
> In order to run efficiently, scan queries read records sequentially on 
> disk and keep a cursor that is used to maintain state between successive 
> pages. It would not be possible to get records in a random order as it 
> would not be possible to read sequentially anymore.
>
>
> On Wed, Feb 19, 2014 at 9:04 PM, Josh Harrison wrote:
>
>> I need to be able to pull 100s of thousands to millions of random 
>> documents from my indexes. Normally, to pull data this large I'd do a scan 
>> query, but they don't support sorting, so the suggestions I've seen online 
>> for randomizing your results don't work (such as those discussed here: 
>> http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch
>> ).
>> Is there a way to introduce randomness into a basic scan query? 
>>
>
>
>
> -- 
> Adrien Grand
>  



Random scan results?

2014-02-19 Thread Josh Harrison
I need to be able to pull 100s of thousands to millions of random documents 
from my indexes. Normally, to pull data this large I'd do a scan query, but 
they don't support sorting, so the suggestions I've seen online for 
randomizing your results don't work (such as those discussed here: 
http://stackoverflow.com/questions/9796470/random-order-pagination-elasticsearch).
Is there a way to introduce randomness into a basic scan query? 



Documents per shard

2014-02-14 Thread Josh Harrison
I've got indexes storing the same kind of data split into weekly chunks - 
there has been some fairly substantial variation in data volume. 
I've got a mapping change I need to make across all the back data, and I'm 
thinking it might make sense to try to rebalance the documents per shard so 
that I have around 1 shard per N documents. 
Is that a worthwhile time investment in terms of query performance, or 
should I just stick with the 3 shards per index I've been using so far? I'd 
keep 3 shards as a minimum, so if there's a week with 10 documents it would 
still have 3 shards.

If I have an index that would end up with more than one shard per data 
node, does it make more sense to limit the number of shards to the number 
of data nodes, or go ahead and follow the 1 shard per N documents pattern?

Thanks!
Josh



Re: Data Loss

2014-02-12 Thread Josh Harrison
This particular cluster is 16 data nodes with SSD RAIDs connected to each 
other and the two master nodes with infiniband.
Under 100 indexes and usually 3 shards per index with 1 replica. Overall 
data volume is in the 1TB range.
I haven't tweaked the shard allocation settings from default.
-Josh

On Wednesday, February 12, 2014 1:36:53 PM UTC-8, Tony Su wrote:
>
> Josh,
>
> Your experience about recovering in only about 10 minutes is very 
> interesting.
> Because my little 5-node cluster/15GB data/3500 indices is taking about an 
> hour to recover and i know the bottleneck is the disk subsystem I'm 
> currently on,
>
> Am curious
> - What is the total size data in your cluster?
> - how many indices?
> - Are the shard numbers pretty typical (5 shards per index, 1 replica for 
> every shard)
> - Are you storing your data on a SAN, SCSI array or something else, and if 
> the disks are SSD?
>
> Thx,
> Tony
>
>



Re: Data Loss

2014-02-12 Thread Josh Harrison
I'm sure it isn't the case for everyone that is having data/shard problems, 
but I had some real trouble doing a full cluster restart on an 18 node 
cluster. Kinda nightmarish, actually, shards failing all over the place, 
lost data because of lost shards, etc.
I finally realized that the gateway.recover_after_nodes, 
gateway.expected_nodes and gateway.recover_after_time config properties 
were critical to avoiding my situation. Before the gateway configuration 
stuff was in there, it would take literally hours and a lot of work to get 
everything back to green. We dreaded a full cluster restart. 
After the gateway configuration stuff, a full cluster restart, from service 
restart on all systems to full green, takes anywhere from 2-10 minutes 
total. The root cause in my situation was a few nodes coming up in the 
cluster, and seeing a severely degraded state and trying to "fix" 
everything, resulting in chaos as more nodes came up.
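
For anyone who lands here later, those are node-level settings in 
elasticsearch.yml; the values below are only an illustration for a cluster 
around this size, not a recommendation:

# wait for most of the cluster to be present before starting recovery
gateway.recover_after_nodes: 14
gateway.expected_nodes: 18
gateway.recover_after_time: 5m
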
Hopefully this is helpful to someone! 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html
-Josh

On Wednesday, February 12, 2014 11:49:29 AM UTC-8, Tony Su wrote:
>
> IMO evaluating this issue starts with applying the CAP Theorem which in 
> summary states that networked clusters with multiple nodes can offer only 2 
> of the following 3 desirable objectives
>  
> Consistency
> Availability
> Partition tolerance (data distributed across nodes).
>  
> ES clearly does the last two so in theory cannot guarantee the first.
> Of course "guarantee" is not the same as "best effort" which as expected 
> is being done.
> And, this Theorem applies to  multi-node cluster technologies of 
> which ES is one.
>  
> Tony
>  
>  
>
> On Wednesday, February 12, 2014 8:09:58 AM UTC-8, Brad Lhotsky wrote:
>
>> Appreciated, but keep in mind large installations can’t just constantly 
>> upgrade.  And if ES is being used in critical infrastructure upgrading may 
>> mean many hours of recertification work with auditors and assessors.  The 
>> project is still relatively young, but "just upgrade" isn’t always 
>> plausible.  It takes over 2 hours for a cluster to go back to green when a 
>> single node restarts for my logging cluster.  I have 15 nodes now, which 
>> means a safe upgrade path may take literally 1 working week.  That assumes 
>> I can have nodes with different versions in the cluster.  Or I have to lose 
>> data while I restart the *whole* cluster, which a whole cluster restart is 
>> also ~ 4 hours.
>>
>> -- 
>> Brad Lhotsky
>>
>> On 12 Feb 2014 at 16:07:20, Binh Ly (bi...@hibalo.com) wrote:
>>
>> FYI, ES has very frequent releases to fix bugs discovered by the 
>> community. If you find a data loss problem in your current install (and 
>> assuming it is indeed an ES problem), please try the latest build and see 
>> if it fixes it. Chances are it has already been discovered and fixed in the 
>> latest release.
>>
>>



Re: Plugin development guidance

2014-02-11 Thread Josh Harrison
Great, thanks Jörg!
I'll start fiddling around with the langdetect plugin to see if I can get
it going with our library.


On Tue, Feb 11, 2014 at 1:18 PM, joergpra...@gmail.com <
joergpra...@gmail.com> wrote:

> An analyzer plugin is the right thing. Adding the recognized/extracted
> terms needs access to ES mapping service. There are a few plugins out there
> which work in this manner, for example, the attachment mapper plugin.
>
> Or the lang-detect plugin, it adds the recognized language(s) as a keyword
> code into a neighbor field for filtering or faceting:
> https://github.com/jprante/elasticsearch-langdetect
>
> Also, I developed a similar plugin that works with recognition techniques,
> it can recognize ISBN or other standard number in a text, and injects extra
> tokens into the token stream to identify these numbers:
> https://github.com/jprante/elasticsearch-analysis-standardnumber
>
> Jörg
>



Plugin development guidance

2014-02-11 Thread Josh Harrison
Hi all,
We've got an internal Java library that allows us to do keyword extraction 
that seems like a great thing to turn into an integrated elasticsearch 
function. 
Ultimately, I want to be able to access the result of this library from 
search results/etc, but I wanted to do a sanity check to make sure my 
approach was right - or if I should be looking at doing a custom analyzer 
or something instead.

Given a string field, the field would become a multi-field, with {name} and 
keywords/phrases as sub-fields. A plugin would be written to handle the 
keywords field, run the strings through the library, and return a list of 
strings like:
"my_data":"Jack and Jill went up the hill, Jack fell down and bumped his 
crown, and Jill came tumbling after."
"my_data.keywords":["Jack", "Jack fell"]
That's a trivial example, of course, and the algorithm is more complex than 
the standard stopword filtering.
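
To make the layout concrete, the multi-field mapping itself could look 
roughly like this (1.0-style multi-field syntax) regardless of which route 
provides the analysis - the index/type names and the my_keyword_extractor 
analyzer are placeholders for whatever the plugin would register:

curl -XPUT 'localhost:9200/my_index/my_type/_mapping' -d '{
  "my_type": {
    "properties": {
      "my_data": {
        "type": "string",
        "fields": {
          "keywords": { "type": "string", "analyzer": "my_keyword_extractor" }
        }
      }
    }
  }
}'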

Ultimately, I want to be able to expose the my_data.keywords field as an 
actual list like above, so that we can use it in other things like facets 
down the line. 

So is a custom type plugin the right way to go here, or should I be looking 
at developing a more complex analyzer/tokenizer/stopword combo?



Re: Stress testing queries

2014-01-30 Thread Josh Harrison
I think a basic tool that is relatively independent of cluster specifics 
could be pretty useful.
I'm imagining a tool that allows you to do load testing against any cluster 
you point it at to:
Test indexing by selecting the complexity of data objects you're interested 
in - ie, create X test indices with X shards and Y replicas each, and send 
either a custom object with fields that could be defined for length and 
type of random variables, or basic objects at various sizes (an example 
tweet, log record, simple data point, etc). 
As a better example, if I wanted to see how my system did with objects that 
looked like {"test1":18, "test2":[{"test3":"XnksdjfknSeorjOJosdimkn 
skjnfds sdfsidfun ifsdfosdmfo"}, {"test3":"dslkmlkdsfnUFIDNSiufndsfn 
DFISnu"}]}, we could just pass something like a mapping: 
{"test1":{"type":"int", "maxSize":1000, 
"minSize":-1029420},"test2":{"type":"nested", "minSize":0, "maxSize":20, 
"properties":{"test3":{"type":"string", "maxSize": 160, "minSize":20}}}}
Random objects would be generated according to the specifications of the 
template. Maybe you could also pick a type "english", "french", "russian", 
etc., to generate strings that are actual language, based on a dictionary 
of terms (or be able to define custom "types" pointing at text files).
You could then indicate how many objects you want created in any given 
index, also a min/max range, the number of workers writing to any given 
index, the total number of workers you want writing data, the total number 
of indexes, etc. 
Data could all be sent over the native API, I'm a Python guy so I'll say 
Python, or over HTTP using something like the requests module. This could 
allow for interesting comparisons of the APIs.

Test queries by doing pretty much the same thing. Track the query response 
times per worker, and other relevant stats. Have definable "max requests 
per second per worker", maybe, so you can replicate your worst case user 
behavior in each process.

This would be step 1 of the process, step 2 would be developing something 
so that a central test system could allocate testing jobs and collect stats 
across a number of client test systems. So I'd set up a test service on 
host A, and test clients on hosts B and C. B and C would be sent job 
properties by A, B and C would then launch, track their own stats, and send 
them to A to aggregate. Scale out to as many systems as you like.

This is just a first pass at the idea, there may be some dumb mistakes in 
logic or oversights about test cases, but I think an app like this could be 
pretty useful. Heck, you could have a GUI on it, or just make it run off a 
yaml file or something.

If it gets into my head enough maybe I'll try to write this up, though like 
I said, it'd be in python since that's my language of choice. So it 
wouldn't be as optimal a testing platform as a native Java app, I guess, 
but still useful as a proof of concept.


On Thursday, January 30, 2014 4:41:06 PM UTC-8, Josh Harrison wrote:
>
> In our case, we're just interested in query stress testing. We've got a 
> web app that queries our indexes that are organized based on weeks of the 
> year, with a bunch of aliases making it so specific portions of the data 
> can be reached easily. Questions about scaling the app have come up. In our 
> case, that means testing through the app itself, which so far only makes 
> queries. I figure we should load test our cluster directly too, so we can 
> see if there is a bottleneck somewhere in the app, if any eventual 
> bottlenecks are on the cluster itself.
>
> So far I haven't been able to really max out the indexing rate on a system 
> that is adequately equipped with resources, that I can tell. I've had 32 
> sub-process Python workers happily sending, I think, ~5+ million records an 
> hour to our cluster with no problem in indexing speed or other response 
> time when backloading some data.
> My current strategy is to get the ugliest heavy queries the application 
> runs and simply use ABS or something similar to run queries over http with 
> variables that are in a reasonable range. If I can make my cluster crash by 
> doing that, I know that'll be my upper limit!
>
>
>
>
> On Thursday, January 30, 2014 3:59:19 PM UTC-8, Jörg Prante wrote:
>>
>> Just a few questions, because I'm also interested in load testing.
>>
>> What kind of stress do you think of? Random data? Wikipedia? Logfiles? 
>> Just query? What about indexing? And what client? Java? Other script 
>> languages? How should the c

Re: Stress testing queries

2014-01-30 Thread Josh Harrison
In our case, we're just interested in query stress testing. We've got a web 
app that queries our indexes that are organized based on weeks of the year, 
with a bunch of aliases making it so specific portions of the data can be 
reached easily. Questions about scaling the app have come up. In our case, 
that means testing through the app itself, which so far only makes queries. 
I figure we should load test our cluster directly too, so we can see if 
there is a bottleneck somewhere in the app, if any eventual bottlenecks are 
on the cluster itself.

So far I haven't been able to really max out the indexing rate on a system 
that is adequately equipped with resources, that I can tell. I've had 32 
sub-process Python workers happily sending, I think, ~5+ million records an 
hour to our cluster with no problem in indexing speed or other response 
time when backloading some data.
My current strategy is to get the ugliest heavy queries the application 
runs and simply use ABS or something similar to run queries over http with 
variables that are in a reasonable range. If I can make my cluster crash by 
doing that, I know that'll be my upper limit!




On Thursday, January 30, 2014 3:59:19 PM UTC-8, Jörg Prante wrote:
>
> Just a few questions, because I'm also interested in load testing.
>
> What kind of stress do you think of? Random data? Wikipedia? Logfiles? 
> Just query? What about indexing? And what client? Java? Other script 
> languages? How should the cluster be configured, one node? two or more 
> nodes? Index shards? Replica? etc. etc.
>
> There are so many variants and options out there, I believe this is one of 
> the reason why a compelling load testing tool is still missing. 
>
> It would be nice to have a tool to upload ES performance profiles to a 
> public web site, for checking how well an ES cluster is tuned in comparison 
> to others. A measure unit for comparing performance is needed to be 
> defined, e.g. "this cluster performs with a power factor of 1.0, this 
> cluster has power factor 1.5, 2.0, ..."
>
> That's only possible when all software and hardware characteristics are 
> properly taken into account, plus "application profiles" for a typical 
> workloads, so it can be decided which configuration is best for what 
> purpose.
>
> Jörg
>
>



Stress testing queries

2014-01-30 Thread Josh Harrison
Are there any decent ES specific stress testing tools out there that would 
allow me to test what kinds of simultaneous load my cluster can handle with 
concurrent users making queries? Searched around a bit and didn't see 
anything.
Figured I'd ask before I come up with a test approach of my own!
Thanks



[Hadoop] capability clarification questions

2014-01-30 Thread Josh Harrison
In looking around I haven't been able to find explicit answers to these 
questions - though the questions may entirely be because I'm a hadoop 
newbie. 
If we were to deploy ES within a hadoop environment:
The primary benefit is allowing direct interaction with ES from Hadoop, 
running queries or indexing data, is that right? 
Are there explicit benefits to search speed and capability when run through 
the normal REST or other client APIs? That is to say, if I have a set of N 
documents and a query that takes T seconds to run on a normal cluster 
through curl, would there be a marked improvement in T when running the 
same query through curl against a hadoop enabled cluster?
Are the ideal architecture designs for a hadoop enabled ES cluster the 
same, or similar to, a "regular" cluster?
If they're the same, does a hadoop enabled cluster need to be designed as 
such from the start, or can that functionality be tacked on to an already 
functioning cluster with data? Situation is, we're on a cluster of machines 
running hadoop, but the ES nodes are just running on the compute nodes like 
a regular service. Wondering what it would take to enable the hadoop 
capabilities.

Thanks!



Re: Multiple nodes on a powerful system?

2014-01-29 Thread Josh Harrison
Thanks Jörg, Mark and Nikolas, some great information here. The 6x6 
configuration was something of a worst case example, the farthest we'd 
probably stretch it would be 3 nodes per host on 16-18 hosts, which should 
be a little more reasonable. Hopefully we'll be able to do a support 
contract with the commercial side of ES and get some help building out a 
system that meets our exact needs.

On Wednesday, January 29, 2014 5:02:05 PM UTC-8, Jörg Prante wrote:
>
> You should consider the RAM to CPU/Disk ratio. On systems with huge 
> memory, CPUs have the tendency to become weak, and the I/O subsystem must 
> push data with higher pressure from RAM to drive (spindle or SSD).
> Huge RAM helps for caching strategies but also creates headaches, large 
> caches must be long lived and must not collapse, which is hard in a  large 
> JVM heap, and JVM garbage collection will take more resources and time. 
>
> Running multiple JVMs on a single host only looks like a viable solution, 
> but that is not how ES scales. ES scales horizontally over many machines, 
> not vertically over RAM size.
>
> So you should take care that your CPU performance is not suffering. There 
> is overhead also on the OS layer and it depends on the setup.
>
> A 36 node cluster on 6 machines adds another challenge. You must tell ES 
> how your nodes are organized, in order to get a reliable green/yellow/red 
> cluster health for your shard allocation.
>
> Jörg
>
>



Multiple nodes on a powerful system?

2014-01-29 Thread Josh Harrison
Other than the resource footprint, is there any reason we should avoid 
running multiple node instances of a cluster on the same machine, assuming 
all the shard awareness stuff is in place to keep all the copies of a given 
shard from being stored on those nodes that are all resident on a single 
physical box?
Basically, if I had a cluster of, say, six machines with 512GB of RAM 
apiece, is it reasonable to run six instances of ES per machine with 30GB of 
heap allocated per instance, resulting in a 36 node cluster and a bit 
over a terabyte of memory footprint across the cluster?
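
(For the "shard awareness stuff", the relevant knob is a one-line setting in 
each instance's elasticsearch.yml:)

# keep a primary and its replicas off ES instances sharing the same physical host
cluster.routing.allocation.same_shard.host: true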



Scan/scroll facet data?

2014-01-28 Thread Josh Harrison
I've got fields that have a few hundred thousand+ unique values that I'd 
like to be able to facet on. Is there some way of essentially streaming the 
exhaustive list of facet results, the way I can with search hits? 



Get counts of new users per day

2014-01-23 Thread Josh Harrison
I'm working with the twitter river data. Trying to figure out how to 
construct a query that would let me generate a count of all new users 
within a given time period, where in this case, new means "user has not had 
a post captured before the start of this query window". 
So basically, I want to get a facet result of user.screen_name where a name 
will be dropped from the facet entirely if an instance of it occurs before 
the start of the specified time range.
I've got no idea where to start. Anyone have any pointers?



Cross index filtering

2014-01-10 Thread Josh Harrison
I've got a backlog of usernames. I need to pull data associated with those 
usernames into a separate index. 
I want a query that will let me do a terms facet on user counts in the 
first index, with a facet filter excluding the users that already exist in 
the second. Is there a simple way to do this?
There could end up being a couple hundred million unique users in the 
secondary index, in theory. For that scale, what's the best way to approach 
the problem? 



Design practices for hosting multiple clusters/on-demand cluster creation?

2014-01-07 Thread Josh Harrison
While ES is still in a pre deployment stage at my job, there is growing 
interest in it. For various reasons, a monster cluster holding everyone's 
stuff is simply not possible. Individual projects require complete control 
over their data and the culture and security requirements here are such 
that doing something like always naming project 1's indexes 
PROJECT_1_ will not fly.
We have a fairly beefy hadoop cluster hosting our content currently, along 
with a separate head node acting as the master.
In this situation, is it simply a matter of starting up new processes on 
each node pointed at different configuration profiles and tying specific 
ports to specific projects/clusters?

Basically, is there an established way to build on-demand clusters, given a 
set of resources? We'll layer something in front of it to deal with access 
control/etc.
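
Concretely, I'm picturing each project getting its own configuration profile 
along these lines (the cluster name, ports and paths below are made up):

cluster.name: project_1
http.port: 9201
transport.tcp.port: 9301
path.data: /data/elasticsearch/project_1
path.logs: /var/log/elasticsearch/project_1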

Thanks!
-Josh



Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
Yeah, it's been looking like a proxy is the way to go. If it was an already 
existing functionality allowing me to suppress the version info from /, I'd 
have been happy to use that, but I agree - it isn't worth anyone's time to 
add this.
Thanks Ivan!
-Josh

On Thursday, December 19, 2013 3:04:15 PM UTC-8, Ivan Brusic wrote:
>
> Just having the REST endpoint open is a security risk. :) You can always 
> put a proxy in front of elasticsearch that intercepts certain calls such as 
> PUT, POST, DELETE or simply / in your case.
>
> Normally in elasticsearch, a request is built with various parameters via 
> a builder and then the resulting response will have the correct fields. You 
> can see an example with the nodes stats:
>
>
> https://github.com/elasticsearch/elasticsearch/blob/master/src/main/java/org/elasticsearch/rest/action/admin/cluster/node/stats/RestNodesStatsAction.java
>
> The main action does not really have specific request/response classes. 
> You can try raising an issue or even submitting a pull request yourself, 
> but I do not see this issue as being very important. That is just my guess.
>
> -- 
> Ivan
>
>
> On Thu, Dec 19, 2013 at 2:52 PM, Josh Harrison wrote:
>
>> To clarify, when I go to http://localhost:9200, I want to get back
>>
>> {
>>   "ok" : true,
>>   "status" : 200,
>>   "name" : "Stem Cell",
>>   "tagline" : "You Know, for Search"
>> }
>>
>>
>> Not
>>
>> {
>>   "ok" : true,
>>   "status" : 200,
>>   "name" : "Stem Cell",
>>   "version" : {
>> "number" : "0.90.5",
>> "build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
>> "build_timestamp" : "2013-09-17T13:09:46Z",
>> "build_snapshot" : false,
>> "lucene_version" : "4.4"
>>   },
>>   "tagline" : "You Know, for Search"
>> }
>>
>>
>> I poked around in the code and the only place I find "You Know, for 
>> Search" is
>>
>> https://github.com/elasticsearch/elasticsearch/blob/c20d4bb69ed29cf11a747f0fdc40ce4237f79ce4/src/main/java/org/elasticsearch/rest/action/main/RestMainAction.java
>> There doesn't appear to be an explicit flag that would allow me to 
>> suppress that, but perhaps that's somewhere else? My IT folks are in a 
>> tizzy that version information is being displayed, saying it's a major 
>> security risk. Sigh.
>> Honestly, if it doesn't break something else, I wouldn't mind if there 
>> was just a way to turn off that default response entirely. That'd do it too.
>>
>>
>> On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote:
>>
>>> From what I can tell from the code, it appears that you can disable 
>>> returning the version field.
>>>
>>> -- 
>>> Ivan
>>>
>>>
>>> On Thu, Dec 19, 2013 at 12:27 PM, Josh Harrison wrote:
>>>
>>>> The subject says it all pretty much, is it possible to turn off the 
>>>> reporting of version data in response to GET http://localhost:9200?
>>>> Thanks,
>>>> Josh
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to elasticsearc...@googlegroups.com.
>>>>
>>>> To view this discussion on the web visit https://groups.google.com/d/
>>>> msgid/elasticsearch/7962249a-610f-4ee6-9496-a1cf14df8d95%
>>>> 40googlegroups.com.
>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>
>>>
>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/dbd5cd20-6b39-46f8-bab8-b6c37de21c26%40googlegroups.com
>> .
>>
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>



Re: Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
To clarify, when I go to http://localhost:9200, I want to get back

{
  "ok" : true,
  "status" : 200,
  "name" : "Stem Cell",
  "tagline" : "You Know, for Search"
}


Not

{
  "ok" : true,
  "status" : 200,
  "name" : "Stem Cell",
  "version" : {
"number" : "0.90.5",
"build_hash" : "c8714e8e0620b62638f660f6144831792b9dedee",
"build_timestamp" : "2013-09-17T13:09:46Z",
"build_snapshot" : false,
"lucene_version" : "4.4"
  },
  "tagline" : "You Know, for Search"
}


I poked around in the code and the only place I find "You Know, for 
Search" is
https://github.com/elasticsearch/elasticsearch/blob/c20d4bb69ed29cf11a747f0fdc40ce4237f79ce4/src/main/java/org/elasticsearch/rest/action/main/RestMainAction.java
There doesn't appear to be an explicit flag that would allow me to suppress 
that, but perhaps that's somewhere else? My IT folks are in a tizzy that 
version information is being displayed, saying it's a major security risk. 
Sigh.
Honestly, if it doesn't break something else, I wouldn't mind if there was 
just a way to turn off that default response entirely. That'd do it too.

On Thursday, December 19, 2013 12:50:29 PM UTC-8, Ivan Brusic wrote:
>
> From what I can tell from the code, it appears that you can disable 
> returning the version field.
>
> -- 
> Ivan
>
>
> On Thu, Dec 19, 2013 at 12:27 PM, Josh Harrison wrote:
>
>> The subject says it all pretty much, is it possible to turn off the 
>> reporting of version data in response to GET http://localhost:9200?
>> Thanks,
>> Josh
>>
>
>



Possible to turn off/suppress version data in response to GET http://localhost:9200?

2013-12-19 Thread Josh Harrison
The subject says it all pretty much, is it possible to turn off the 
reporting of version data in response to GET http://localhost:9200?
Thanks,
Josh



Re: ES data on Glusterfs

2013-12-17 Thread Josh Harrison
Cool, thanks. It looks like I've conflated Gluster and Lustre, which are 
unfortunately totally unrelated. We're running Lustre. 
On Tuesday, December 17, 2013 5:58:39 PM UTC-8, Jörg Prante wrote:
>
> Use Gluster on its native protocol, not on NFS and the like.
>
> If you want backup/restore, Gluster support was announced as a post 1.0 ES 
> feature.
>
> Jörg
>
>



Auto aliases?

2013-12-17 Thread Josh Harrison
So I know how to set up default mappings. Is it possible to set up default 
aliases for indexes with a certain name format?
I'm pulling in data on a weekly basis, so say I've got 2013_41 and all my 
filtered aliases (based on the contents of the documents) set up. When 
2013_42 comes along, do I need to manually create the aliases, or is there 
a better way?
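
For reference, later releases can do this automatically: index templates 
gained an "aliases" section in 1.4, so a template matching the weekly naming 
pattern can attach the aliases at index creation time. A sketch, with a 
made-up lang field for the filtered alias:

curl -XPUT 'localhost:9200/_template/weekly_aliases' -d '{
  "template": "2013_*",
  "aliases": {
    "all_weeks": {},
    "english_only": { "filter": { "term": { "lang": "en" } } }
  }
}'
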
Thanks!



ES data on Glusterfs

2013-12-17 Thread Josh Harrison
Is it a dumb idea to have my ES data directory on a GlusterFS/Lustre 
storage node? Because we've got some BIG data that we'd like to index, but 
it's too big for the local storage on our test cluster.
