Re: Connecting Kibana to Elasticsearch on Kubernetes

2015-04-29 Thread Nils Dijk
Hi,

Have you tried defining a Kubernetes service for Kibana? You can add a 
public IP of one of your minions to this service so that you can reach it 
more easily from outside the cluster.

Here is an example from k8s on how to set this up: 
https://github.com/GoogleCloudPlatform/kubernetes/tree/master/examples/guestbook#step-six-create-the-guestbook-service
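
A minimal service definition along those lines might look like the sketch 
below (v1beta3-era JSON, matching the API version used in your proxy URLs; 
the selector, port, and public IP are placeholders I made up, and field 
names may differ slightly between API versions):

```
{
  "kind": "Service",
  "apiVersion": "v1beta3",
  "metadata": { "name": "kibana-logging" },
  "spec": {
    "selector": { "name": "kibana-logging" },
    "ports": [ { "port": 5601, "targetPort": 5601 } ],
    "publicIPs": [ "10.240.0.5" ]
  }
}
```

With something like that in place you can hit Kibana directly on the 
minion's IP instead of going through the master's proxy.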

-- Nils

On Tuesday, April 28, 2015 at 11:31:05 PM UTC+2, Satnam Singh wrote:
>
> Hello,
>
> I've upgraded to Elasticsearch 1.5.2 and Kibana 4.0.2, which I am deploying 
> in a Kubernetes cluster.
> Specifically, I am running Elasticsearch in one "pod" (a Kubernetes 
> container with its own IP) and Kibana in another pod (again 
> with a distinct IP address).
> The Kubernetes cluster runs a DNS service which will map the name 
> "elasticsearch-logging.default:9200" to the pod running Elasticsearch.
> This works fine: I can exec into any Docker container in a pod and run 
> "curl http://elasticsearch-logging.default:9200" and the right thing 
> happens.
> I've configured Kibana to let it know where Elasticsearch is running:
>
> elasticsearch_url: "http://elasticsearch-logging.default:9200"
> elasticsearch_preserve_host: true
>
> Since I want to access Kibana from outside the cluster I use a proxy 
> running on the master node of the cluster (after adding certificates to my 
> browser for the SSL connection) e.g.
>
>
> https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/
>
> Sadly, this does not work. I get the error:
>
> Error: Unable to check for Kibana index ".kibana"
> Error: unknown error
> at respond 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:81693:15)
> at checkRespForFailure 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:81659:7)
> at 
> https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:80322:7
> at deferred.promise.then.wrappedErrback 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:20897:78)
> at deferred.promise.then.wrappedErrback 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:20897:78)
> at deferred.promise.then.wrappedErrback 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:20897:78)
> at 
> https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:21030:76
> at Scope.$get.Scope.$eval 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:22017:28)
> at Scope.$get.Scope.$digest 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:21829:31)
> at Scope.$get.Scope.$apply 
> (https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/index.js?_b=6004:22121:24)
>
>
> And this is what the console shows:
>
>
> 
>
> Visiting 
> https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/kibana-logging/elasticsearch
> works (i.e. it returns the status of Elasticsearch).
> So does visiting Elasticsearch via the proxy, i.e. 
> https://104.197.26.147/api/v1beta3/proxy/namespaces/default/services/elasticsearch-logging/
>
> I wonder if anyone has some advice about what is going wrong here?
> Thank you kindly.
>
> Cheers,
>
> Satnam
>  
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/78759c24-37b9-4528-aaeb-d0ba06b8012f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Inconsistent results when aggregate by field from array

2015-04-14 Thread Nils Dijk
Hi,

You want to look at nested objects: 
http://www.elastic.co/guide/en/elasticsearch/guide/master/nested-objects.html

-- Nils
Tip: try formatting your post, it was hard to read.
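
Concretely, a sketch with the field names from the quoted post ("mytype" is 
a placeholder type name, and the mapping/query are untested): first map 
arrayField as nested, then wrap the terms/stats in a nested aggregation so 
each array entry is aggregated on its own instead of being flattened:

```
{
  "mappings": {
    "mytype": {
      "properties": {
        "arrayField": {
          "type": "nested",
          "properties": {
            "groupByField": { "type": "string", "index": "not_analyzed" },
            "statsField": { "type": "long" }
          }
        }
      }
    }
  }
}
```

and then query with:

```
{
  "size": 0,
  "aggregations": {
    "groups": {
      "nested": { "path": "arrayField" },
      "aggregations": {
        "terms": {
          "terms": { "field": "arrayField.groupByField" },
          "aggregations": {
            "stats": { "stats": { "field": "arrayField.statsField" } }
          }
        }
      }
    }
  }
}
```

That should give you sum = 30 for value1 and sum = 15 for value2, as you 
expected.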

On Tuesday, March 31, 2015 at 4:37:40 AM UTC+2, Iana Bondarskaia wrote:
>
> Hi All,
>
> I have an array of objects in each document in the index. I want to group 
> by and retrieve stats based on fields from this array. But the stats are 
> calculated based on the sum of all values in this array. Could you please 
> suggest: is there a mistake in my query, or is this the expected behavior 
> for now?
>
> Example of my document:
>
> {
>   "city": "London",
>   "arrayField": [
>     { "groupByField": "value1", "statsField": 10 },
>     { "groupByField": "value1", "statsField": 20 },
>     { "groupByField": "value2", "statsField": 10 },
>     { "groupByField": "value2", "statsField": 5 }
>   ]
> }
>
> Example of my query:
>
> {
>   "size": 0,
>   "aggregations": {
>     "filter": {
>       "filter": { "bool": { "must": { "match_all": {} } } },
>       "aggregations": {
>         "terms": {
>           "terms": { "field": "arrayField.groupByField", "size": 10 },
>           "aggregations": {
>             "districts.population": {
>               "stats": { "field": "arrayField.statsField" }
>             }
>           }
>         }
>       }
>     }
>   }
> }
>
> I expect to get: for group1: sum = 30; for group2: sum = 15.
> I actually get: for group1: sum = 45; for group2: sum = 45.
> --
> View this message in context: Inconsistent results when aggregate by 
> field from array 
> 
> Sent from the ElasticSearch Users mailing list archive 
>  at Nabble.com.
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/de0251ab-333f-4c3d-a5ec-480106ccbb39%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Incorrect Aggregations returned from ES

2015-04-14 Thread Nils Dijk
Hi,

To me this sounds a lot like an issue that was happening to me a week 
before the release of 1.0.0. This issue was related to internal memory 
reuse within Elasticsearch before the result was read out. The issue is 
documented here: https://github.com/elastic/elasticsearch/issues/5021

What I did back then was create a reproducible test that showed the issue.

I doubt it has to do with your replicas being inconsistent, especially 
since you turned replication off and then back on: this copies the files 
from your primary shards to the replicas. 

Here is the test I created back then: 
https://gist.github.com/thanodnl/8803745

-- Nils

On Wednesday, March 25, 2015 at 11:57:15 PM UTC+1, MC wrote:
>
> I am seeing some erroneous behavior in my ES cluster when performing 
> aggregations.  Originally, I thought this was specific to a histogram as 
> that is where the error first appeared (in a K3 graph - see my post 
> https://groups.google.com/forum/#!topic/elasticsearch/iY-lKjtW7PM for 
> reference) but I have been able to re-create the exception with a simple 
> max aggregation.  The details are as follows:
>
> ES Version: 1.4.4
> Topology: 5 nodes, 5 shards per index, 2 replicas
> OS: Redhat Linux
>
> To create the issue I execute the following query against the cluster:
>
> {
>   "query": {
> "term": {
>   "metric": "used"
> }
>   },
>   "aggs": {
> "max_val": {
>   "max": {
> "field": "metric_value"
>   }
> }
>   }
> }
>
> Upon executing this query multiple times, I get different responses.  One 
> time I get the expected result:
> ...
> "took": 13,
> "timed_out": false,
> "_shards": {
> "total": 5,
> "successful": 5,
> "failed": 0
> },
> "hits": {
> "total": 11712,
> "max_score": 9.361205,
> ...
> "aggregations": { "max_val": { "value": 18096380}}
>
> whereas on another request with the same query I get the following bad 
> response:
>
> "took": 8,
> "timed_out": false,
> "_shards": {
> "total": 5,
> "successful": 5,
> "failed": 0
> },
> "hits": {
> "total": 11712,
> "max_score": 9.361205,
> ...
> "aggregations": { "max_val": { "value": 4697741490703565000}}
>
>
>
> Some possibly relevant observations:
> 1.  In my first set of tests, I was consistently getting the correct 
> results for the first 2 requests and the bad result on the 3rd request 
> (with no one else executing this query at that point in time)
> 2.  Flushing the cache did not correct the issue
> 3.  I reduced the number of replicas to 0 and was consistently getting the 
> same result (which happened to be the correct one)
> 4.  After increasing the replica count back to 2 and waiting until ES 
> reported that the replication was complete, I tried the same experiment.  
> This time, the 1st request retrieved the correct result and the next 2 
> requests retrieved incorrect results.  In this case the incorrect results 
> were not the same but were both huge and of the same order of magnitude.
>
>
> Other info:
> - The size of the index was about 3.3Gb with ~ 50M documents in it
> - This is one of many date based indices (i.e. similar to the logstash 
> index setup), but the only one in this installation that exhibited the 
> issue.  I believe we saw something similar in a UAT environment as well 
> where 1 or 2 of the indices acted in this weird manner
> - ES reported the entire cluster as green
>
>
> It seems that some shard(s)/replica(s) were being corrupted on the 
> replication and we were being routed to that one every 3rd hit.  (Is this 
> somehow correlated to the number of replicas?)
>
> So, my questions are:
>
> 1. Has anyone seen this type of behavior before?  
> 2. Can it somehow be data dependent?
> 3. Is there any way to figure out what happened/what is happening?
> 4. Why does ES report the cluster state as green?
> 5. How can I debug this?
> 6. How can I prevent/correct this?
>
>
> Any and all help/pointers would be greatly appreciated.
>
> Thanks in advance,
> MC
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/14c3bff7-b17b-4fa8-938f-cf8e13c80a29%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Disk Decommission

2015-03-03 Thread Nils Dijk
You might want to tune the disk-based shard 
allocator: 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-allocation.html#disk

"cluster.routing.allocation.disk.watermark.high controls the high 
watermark. It defaults to 90%, meaning ES will attempt to relocate shards 
to another node if the node disk usage rises above 90%."
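
For reference, these watermarks are dynamic settings, so they can also be 
adjusted at runtime through the cluster settings API; a sketch, assuming a 
node listening on localhost:9200 (the values shown are the defaults):

```
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```

Note this works per node, though, not per individual disk within a node.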

-- Nils

On Sunday, March 1, 2015 at 7:38:41 AM UTC, Prasanth R wrote:
>
> Dear Mark,
>   Thanks for your answers... So.. The only way I have is deleting old 
> docs...
>
> Thanks
> Prasanth Rajan
> On Mar 1, 2015 1:04 PM, "Mark Walkom" wrote:
>
>> It doesn't work like that though :)
>>
>> On 1 March 2015 at 17:30, Prasanth R wrote:
>>
>>> Dear Mark,
>>>
>>> Each node roughly have more than  One TB data.. So moving to another 
>>> node is not easy.. If elasticsearch stop indexing on disks those are 
>>> reaching certain percentage will be more useful..
>>>
>>> Thanks
>>> Prasanth Rajan
>>> On Mar 1, 2015 11:53 AM, "Mark Walkom" wrote:
>>>
 Are the disks not the same size?

 You will need to make sure the shards that this node has are on other 
 nodes, or move shards to other nodes, then replace the disk.

 On 1 March 2015 at 16:32, Prasanth R wrote:

> Dear Mark,
>
> Our path.data pointing to different directories and each is different 
> hard disk.. One of the hard disc reached more than 90%.. Please suggest 
> me 
> what to do..
>
> Thanks
> Prasanth Rajan
> On Mar 1, 2015 9:42 AM, "Mark Walkom" wrote:
>
>> If you are using multiple data paths and you replace a disk, you will 
>> lose all data on that node.
>>
>> "Data can be saved to multiple directories, and if each directory is 
>> mounted on a different hard drive, this is a simple and effective way to 
>> set up a software RAID 0. Elasticsearch will automatically stripe data 
>> between the different directories, boosting performance"
>>
>> http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/_important_configuration_changes.html#_paths
>>
>> On 27 February 2015 at 21:48, Prasanth R wrote:
>>
>>> Dear All,
>>>
>>>   Is it possible to decommission a disc from nodes? 
>>>   is it possible to stop indexing/storing data on particular disc?.
>>>
>>> Thanks
>>> Prasanth Rajan
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, 
>>> send an email to elasticsearc...@googlegroups.com .
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/04017b82-a798-4258-bb94-079e7b2c6267%40googlegroups.com
>>>  
>>> 
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  -- 
>> You received this message because you are subscribed to a topic in 
>> the Google Groups "elasticsearch" group.
>> To unsubscribe from this topic, visit 
>> https://groups.google.com/d/topic/elasticsearch/RY8OanGRV_c/unsubscribe
>> .
>> To unsubscribe from this group and all its topics, send an email to 
>> elasticsearc...@googlegroups.com .
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/elasticsearch/CAEYi1X-Jyn8bN4HwiEiS2PsY7E5c9U_06sF6EPSMAsAakhNHOA%40mail.gmail.com
>>  
>> 
>> .
>> For more options, visit https://groups.google.com/d/optout.
>>
>  -- 
> You received this message because you are subscribed to the Google 
> Groups "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send 
> an email to elasticsearc...@googlegroups.com .
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/elasticsearch/CAJLGCR2GFiFMowYUtEc35g_PFcRLDAqV_M27gx7GJN3BMBeQcg%40mail.gmail.com
>  
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>

  -- 
 You received this message because you are subscribed to a topic in the 
 Google Groups "elasticsearch" group.
 To unsubscribe from this topic, visit 
 https://groups.google.com/d/topic/elasticsearch/RY8OanGRV_c/unsubscribe
 .
 To unsubscribe from this group and all its topics, send an email to 
 elasticsearc...@googlegroups.com .
 To view this discussion on the web visit 

Re: Is ElasticSearch truly scalable for analytics?

2015-01-15 Thread Nils Dijk
Adding a 'node reduce phase' to aggregations is something I'm very 
interested in, and am also investigating for the project I'm currently 
working on.

"If you introduce an extra reduction phase (for multiple shards on the same 
node) you introduce further potential for inaccuracies in the final 
results."

This is true if you only reduce the top-k items per shard, but I was 
thinking of reducing the complete set of buckets locally. This takes a bit 
more CPU and memory, but my guess is that this is negligible compared to 
the work already being done by the aggregation framework. If you reduce the 
buckets on the node before sending them to the coordinator, it will 
actually increase the accuracy of aggregations!

"how many of these sorts of use cases generate sufficiently large trees of 
results where a node-level merging would be beneficial"

It is primarily beneficial for bigger installations with many shards per 
machine. Say 40 machines with ~100 shards each: in the current strategy, 
where every shard sends its own result, a lot of bandwidth is used on the 
coordinating node, since it receives 4000 responses, while it could do 
with 40 responses (1 per machine).

I acknowledge it is a highly specialised use-case which not very many 
people run into, but it is a case I'm currently working on.

"How hard would it to be to implement such a feature?"

I have been looking into this, and it is not trivial. It needs to be 
implemented in/around the SearchService, which is where I found the 
different search strategies (e.g. DFS) to be implemented. Unlike the rest 
of Elasticsearch, it does not seem to consist of modules that implement 
the different search strategies.

Regarding the accuracy of top-k lists: I think the above, both the 'node 
reduce phase' and making the search strategy pluggable, will be the 
groundwork for implementations of TJA or TPUT strategies, as discussed in 
an old issue [1] about the accuracy of facets.

The order of steps to take before reaching the ultimate goal would be:
1) Make search strategies (e.g. query-then-fetch, DFS-query-then-fetch) 
more modular.
2) Make a search strategy with a 'node reduce phase' for the aggregations. 
Start with a complete reduce on the node. If that takes too much 
memory/time, you can use TJA or TPUT locally on the node to get a reliable 
top-k list.
3a) Make a search strategy that executes TJA on the cluster, coordinated 
by the coordinating node.
3b) Make a separate strategy that executes TPUT on the cluster, 
coordinated by the coordinating node.

I would say that 3a and 3b are 'easy' if doing a complete reduce in step 2 
does not consume too many resources.

Adding strategies for both TJA and TPUT gives ultimate control to the 
user, as TPUT is not suited for reliably sorting on sums where the field 
might contain a negative value, but TPUT has better latency than TJA.

I would love to get an opinion from Adrien concerning the feasibility of 
such an approach.

-- Nils

[1] https://github.com/elasticsearch/elasticsearch/issues/1305

On Wednesday, January 14, 2015 at 7:47:07 PM UTC+1, Elliott Bradshaw wrote:
>
> How hard would it to be to implement such a feature?  Even if there are 
> only a handful of use cases, it could prove very helpful in these.  
> Particularly since very large trees are the ones that will struggle the 
> most with bandwidth issues.
>
>
> On Wednesday, January 14, 2015 at 1:36:53 PM UTC-5, Mark Harwood wrote:
>>
>> Understood, but what about cases where size is set to unlimited?  
>>> Inaccuracies are not a concern in that case, correct?
>>>
>>
>> Correct. But if we only consider the scenarios where the key sets are 
>> complete and accuracy is not put at risk by merging (i.e. there is no "top 
>> N" type filtering in play), how many of these sorts of use cases generate 
>> sufficiently large trees of results where a node-level merging would be 
>> beneficial? 
>>  
>>
>>>
>>> On Wednesday, January 14, 2015 at 1:09:48 PM UTC-5, Mark Harwood wrote:

 If you introduce an extra reduction phase (for multiple shards on the 
 same node) you introduce further potential for inaccuracies in the final 
 results.
 Consider the role of 'size' and 'shard_size' in the "terms" aggregation 
 [1] and the effects they have on accuracy. You'd arguably need a 
 'node_size' setting to also control the size of this new intermediate 
 collection. All stages that reduce the volumes of data processed can 
 introduce an approximation with the potential for inaccuracies upstream 
 when merging.


 [1] 
 http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-aggregations-bucket-terms-aggregation.html#_shard_size

 On Wednesday, January 14, 2015 at 5:44:47 PM UTC, Elliott Bradshaw 
 wrote:
>
> Adrien,
>
> I get the feeling that you're a pretty heavy contributor to the 
> aggregation module.  In your experience, w

Re: Timezones in date histograms, redux

2014-05-02 Thread Nils Dijk
What you are looking for is the only not-so-sensible default I have found 
so far in Elasticsearch. By default it offsets the times to UTC to easily 
create the day buckets. By setting 'pre_zone_adjust_large_interval' to 
true, the keys of your buckets will be in the requested timezone.

This also works for the newer aggregations.
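
For example, taking the facet from the quoted question and assuming a user 
at UTC-5 (EST), a sketch along these lines (untested):

```
{
  "facets": {
    "histo": {
      "date_histogram": {
        "field": "datetime",
        "interval": "1d",
        "pre_zone": "-05:00",
        "pre_zone_adjust_large_interval": true
      }
    }
  }
}
```

With pre_zone_adjust_large_interval set, the daily buckets start at 
midnight EST rather than midnight UTC.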

-- Nils

On Thursday, May 1, 2014 10:48:49 PM UTC+2, Mal Curtis wrote:
>
> Hi Andrew,
>
> Did you ever figure this one out?
>
> Currently we're thinking we're going to store date information both as the 
> UTC, and as the local time AS utc (i.e. 12.30 in +1200 becomes 12.30 UTC). 
> So we're denormalizing the time into two fields with different 
> representations of the same data.
>
> -Mal
>
> On Saturday, 15 June 2013 04:44:12 UTC+12, Andrew Clegg wrote:
>>
>> Hi,
>>
>> I've seen a lot of discussion of this in various old threads, and I've 
>> spent about two hours going over the docs and code and brainstorming with 
>> my colleagues, but for the life of me: I still can't make this work right.
>>
>> Basically, we are doing date ranges with date histogram facets, and need 
>> to support users in different timezones querying the same data set.
>>
>> For each user, I want to show hourly, daily, weekly or monthly facets, 
>> relative to their OWN timezone, starting at the range filter's lower bound.
>>
>> The filter's boundaries are always exactly on an hour (in the user's time 
>> zone), and for bucket sizes of daily or greater, the they'll will be at 
>> 00:00:00 (in the user's time zone).
>>
>> So, I might come along and say "I want to see all data between 3 Jan and 
>> 8 Jan, by day" -- and if I'm in GMT it's easy:
>>
>> {
>>   "query": {
>>     "filtered" : {
>>       "query" : {
>>         "match_all" : {}
>>       },
>>       "filter" : {
>>         "range": {
>>           "datetime": {
>>             "gte": "2012-01-03T00:00:00Z", 
>>             "lt": "2012-01-07T00:00:00Z"
>>           }
>>         }
>>       }
>>     }
>>   },
>>   "facets": {
>>     "histo" : {
>>       "date_histogram" : {
>>         "field" : "datetime",
>>         "interval" : "1d"
>>       }
>>     }
>>   }
>> }
>>
>>
>> This returns evenly-spaced days starting on Jan 3rd.
>>
>> But, I am too dumb to make this work if the lower bound of the range 
>> filter isn't on a midnight UTC moment. For example if it's midnight 
>> EST/05:00UTC. Or for that matter, midnight British Summer Time. The 
>> server-side UTC calculations always give us unexpected bucket boundaries.
>>
>> We've been juggling very possible combination of pre_zone, post_zone, 
>> pre_offset, post_offset etc., but just can't find a combination that will 
>> give us regular hourly/weekly/daily/monthly intervals starting at a given 
>> point in time like that.
>>
>> Please help. I know it's possible but this stuff makes my head hurt.
>>
>> Thanks :-)
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/09458726-cc58-4ad5-bb19-cb53b2542b80%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Aggregations sort on doc_count of a filter

2014-02-14 Thread Nils Dijk
Hi,

Aggregations ROCK!

But...
Imagine we have a document with a number (n) in the range of -10 to 10, 
and a field with a couple of terms.

Now I want to create a view with two columns: the first for the number of 
documents having a positive value, the second for documents having a 
negative value. This is easily done with an aggregation like this:

{
  "size": 0,
  "aggs": {
"label": {
  "terms": {
"field": "somefield"
  },
  "aggs": {
"positive": {
  "filter": {
"range": {
  "n": { "gt": 0 }
}
  }
},
"negative": {
  "filter": {
"range": {
  "n": { "lt": 0 }
}
  }
}
  }
}
  }
}


The counts of the positive and negative documents can be found in 
'positive.doc_count' and 'negative.doc_count'. Everything is fine.

Now you want to sort your aggregation on the label with the most positive 
documents, so we add the following to the 'terms':
...
{
  "terms": {
"field": "somefield",
"order": { "positive": "desc" }
  },
  ...
}
...


We get an error back saying:
SearchPhaseExecutionException[Failed to execute phase [query], all shards 
failed; shardFailures {[RwghWCxzQ-S9SyjTAX119A][X][8]: 
AggregationExecutionException[terms aggregation [label] is configured to 
order by sub-aggregation [positive] which is is not a metrics aggregation. 
Terms aggregation order can only refer to metrics aggregations]}

This error tells me that I can't order on the doc_count of a filter, 
because filter is not a metrics aggregation. It would be really helpful to 
be able to sort on filtered aggregations as well. I would even go as far 
as sorting on sub-aggregations of filters! That way you could sum the 
values of the positive documents and sort on that sum as well (although I 
would not know what that number is supposed to represent now).

This might not be trivial to implement, but I think it is worth looking 
into.

At least I found a way around this problem in the short term, by summing 
the value of a script which does the heavy lifting of checking whether the 
document should be counted. But we have some configuration files now with 
100+ columns representing counts of documents with corresponding filters, 
which you cannot really port automatically to a value_script of a sum.

If people are interested in the workaround, this is the query:
{
  "size": 0,
  "aggs": {
"label": {
  "terms": {
"field": "somefield",
"order": { "positive": "desc" }
  },
  "aggs": {
"positive": {
  "sum": {
"field": "n",
"script": "_value>0?1:0"
  }
},
"negative": {
  "sum": {
"field": "n",
"script": "_value<0?1:0"
  }
}
  }
}
  }
}


Could we get the sorting on filtered aggregations as a feature request? Or 
is this simply impossible to achieve (with decent performance) in the 
aggregations framework?

-- Nils

On a side note: the above queries have not been tested against 
Elasticsearch, so they could contain errors when copy/pasted into an 
Elasticsearch request.

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/971d14e9-e9e7-45f3-bc54-5c5c2c3603fe%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-07 Thread Nils Dijk
Hi Adrien,

Good news! The problem is solved.
Can't wait for the release containing the fix, but for now I will use my 
own build :)

On Thursday, February 6, 2014 5:25:11 PM UTC+1, Nils Dijk wrote:
>
> Yay!
>
> I will try this somewhere tomorrow. Thanks for fixing, much appreciated!
>
> Seems like it was difficult to find, since it only happens when a 'page' 
> gets recycled internally.
>
> On Thursday, February 6, 2014 3:53:46 PM UTC+1, Adrien Grand wrote:
>>
>> It took me some time but I finally managed to understand the cause and to 
>> write a fix:
>>   https://github.com/elasticsearch/elasticsearch/pull/5039
>>
>> Thanks very much for reporting this and for your help reproducing and 
>> debugging this issue!
>>
>>
>> On Thu, Feb 6, 2014 at 2:08 PM, Nils Dijk  wrote:
>>
>>> Good,
>>>
>>> It is always easier to fix when it's on your own machine.
>>>
>>> I tried your .patch, but it did not fix the problem. I also tried your 
>>> config, although I did not really get where to put the setting, I ended up 
>>> putting the setting on the index. This also did not fix the problem.
>>>
>>> I also tried with a bigger shard_size in the agg. Yet again no 
>>> difference.
>>>
>>> To test some more around aggs I loaded a complete production set into 
>>> both my local ES RC2 (osx) and one on a linux server with ES RC2. I have a 
>>> hunch it could be in the sorting of the terms. When I do a sub agg and sort 
>>> on it I see all kind of weird results that are even lower than the ones I 
>>> see when I do not sort on the sub agg.
>>>
>>> If you need me to test some more I am keeping a close watch on this 
>>> thread.
>>>
>>> -- Nils
>>>
>>> On Thursday, February 6, 2014 1:19:40 PM UTC+1, Adrien Grand wrote:
>>>
>>>> OK, I finally managed to reproduce it on both mac and linux by 
>>>> increasing the number of shards to 20, will keep you posted
>>>>
>>>>
>>>> On Thu, Feb 6, 2014 at 9:29 AM, Adrien Grand wrote:
>>>>
>>>> On Wed, Feb 5, 2014 at 6:42 PM, Nils Dijk  wrote:
>>>>>
>>>>>> Ok, I was preparing to do a long bisecting session, but I started 
>>>>>> with the commit you highlighted below 
>>>>>> (4271d573d60f39564c458e2d3fb7c14afb82d4d8) 
>>>>>> and the commit before that one (6481a2fde858520988f2ce28c02a1
>>>>>> 5be3fe108e4). And as it turns out, it is the breaking commit.
>>>>>>
>>>>>> If I build the commit of yours from December 3 it fails my test suite.
>>>>>> If I build the commit of Nik from January 6 it still passes my test.
>>>>>>
>>>>>> I also tried reverting your commit on the v1.0.0.RC1 tag, but it gave 
>>>>>> me all kinds of conflicts so I could not test RC1 without your commit.
>>>>>>
>>>>>> If you would like I can still do a full bisect, but I suspect I end 
>>>>>> up at your commit since I tested that one, and the one before.
>>>>>>
>>>>>> Would it be possible for you to send a .patch without the unsafe 
>>>>>> stuff, so I can apply that to a commit and make a build?
>>>>>>
>>>>>
>>>>> Thanks Nils for your work, this is much appreciated.
>>>>>
>>>>> Here is a simple patch attached that short-circuits the use of Unsafe 
>>>>> to do string comparisons.
>>>>>
>>>>> Maybe you could also try to set the `cache.recycler.page.type` setting 
>>>>> to `none` to see if that changes anything.
>>>>>
>>>>> -- 
>>>>> Adrien Grand
>>>>>  
>>>>
>>>>
>>>>
>>>> -- 
>>>> Adrien Grand
>>>>  
>>>  -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "elasticsearch" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to elasticsearc...@googlegroups.com.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/elasticsearch/af8e91d8-4a97-42d3-9dd5-8a980ded493e%40googlegroups.com
>>> .
>>>
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>
>>
>>
>> -- 
>> Adrien Grand
>>  
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/c6acf6ac-3f47-49e4-8240-57c4c697c635%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-06 Thread Nils Dijk
Yay!

I will try this somewhere tomorrow. Thanks for fixing, much appreciated!

Seems like it was difficult to find, since it only happens when a 'page' 
gets recycled internally.

On Thursday, February 6, 2014 3:53:46 PM UTC+1, Adrien Grand wrote:
>
> It took me some time but I finally managed to understand the cause and to 
> write a fix:
>   https://github.com/elasticsearch/elasticsearch/pull/5039
>
> Thanks very much for reporting this and for your help reproducing and 
> debugging this issue!
>
>
> On Thu, Feb 6, 2014 at 2:08 PM, Nils Dijk wrote:
>
>> Good,
>>
>> It is always easier to fix when it's on your own machine.
>>
>> I tried your .patch, but it did not fix the problem. I also tried your 
>> config, although I did not really get where to put the setting, I ended up 
>> putting the setting on the index. This also did not fix the problem.
>>
>> I also tried with a bigger shard_size in the agg. Yet again no difference.
>>
>> To test some more around aggs I loaded a complete production set into 
>> both my local ES RC2 (osx) and one on a linux server with ES RC2. I have a 
>> hunch it could be in the sorting of the terms. When I do a sub agg and sort 
>> on it I see all kind of weird results that are even lower than the ones I 
>> see when I do not sort on the sub agg.
>>
>> If you need me to test some more I am keeping a close watch on this 
>> thread.
>>
>> -- Nils
>>
>> On Thursday, February 6, 2014 1:19:40 PM UTC+1, Adrien Grand wrote:
>>
>>> OK, I finally managed to reproduce it on both mac and linux by 
>>> increasing the number of shards to 20, will keep you posted
>>>
>>>
>>> On Thu, Feb 6, 2014 at 9:29 AM, Adrien Grand wrote:
>>>
>>> On Wed, Feb 5, 2014 at 6:42 PM, Nils Dijk  wrote:
>>>>
>>>>> Ok, I was preparing to do a long bisecting session, but I started with 
>>>>> the commit you highlighted below 
>>>>> (4271d573d60f39564c458e2d3fb7c14afb82d4d8) 
>>>>> and the commit before that one (6481a2fde858520988f2ce28c02a1
>>>>> 5be3fe108e4). And as it turns out, it is the breaking commit.
>>>>>
>>>>> If I build the commit of yours from December 3 it fails my test suite.
>>>>> If I build the commit of Nik from January 6 it still passes my test.
>>>>>
>>>>> I also tried reverting your commit on the v1.0.0.RC1 tag, but it gave 
>>>>> me all kinds of conflicts so I could not test RC1 without your commit.
>>>>>
>>>>> If you would like I can still do a full bisect, but I suspect I end up 
>>>>> at your commit since I tested that one, and the one before.
>>>>>
>>>>> Would it be possible for you to send a .patch without the unsafe 
>>>>> stuff, so I can apply that to a commit and make a build?
>>>>>
>>>>
>>>> Thanks Nils for your work, this is much appreciated.
>>>>
>>>> Here is a simple patch attached that short-circuits the use of Unsafe 
>>>> to do string comparisons.
>>>>
>>>> Maybe you could also try to set the `cache.recycler.page.type` setting 
>>>> to `none` to see if that changes anything.
>>>>
>>>> -- 
>>>> Adrien Grand
>>>>  
>>>
>>>
>>>
>>> -- 
>>> Adrien Grand
>>>  
>
>
>
> -- 
> Adrien Grand
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/0e604ec8-05a8-4697-b6bf-28d8bda756ee%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-06 Thread Nils Dijk
Good,

It is always easier to fix when it's on your own machine.

I tried your .patch, but it did not fix the problem. I also tried your 
config suggestion, although I did not really understand where the setting 
should go; I ended up putting it on the index. This also did not fix the problem.
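For what it's worth, `cache.recycler.page.type` appears to be a node-level setting rather than an index setting, so it would normally go into each node's elasticsearch.yml; note that this placement is my assumption here, not something confirmed in the thread:

```yaml
# elasticsearch.yml (node-level configuration; restart the node afterwards)
# Assumed placement for the setting Adrien suggested in this thread:
cache.recycler.page.type: none
```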

I also tried with a bigger shard_size in the agg. Yet again no difference.

To test aggs some more, I loaded a complete production set into both 
my local ES RC2 (OS X) and an ES RC2 instance on a Linux server. I have a hunch 
the problem could be in the sorting of the terms: when I do a sub agg and sort 
on it, I see all kinds of weird results, with counts even lower than the ones 
I see when I do not sort on the sub agg.
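Sorting on a sub agg here means ordering the terms aggregation by a metric sub-aggregation, along these lines (the field and aggregation names are illustrative, not taken from the actual dataset):

```json
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": {
        "field": "screen_name",
        "size": 10,
        "order": { "max_ts": "desc" }
      },
      "aggs": {
        "max_ts": { "max": { "field": "timestamp" } }
      }
    }
  }
}
```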

If you need me to test some more I am keeping a close watch on this thread.

-- Nils

On Thursday, February 6, 2014 1:19:40 PM UTC+1, Adrien Grand wrote:
>
> OK, I finally managed to reproduce it on both mac and linux by increasing 
> the number of shards to 20, will keep you posted
>
>
> On Thu, Feb 6, 2014 at 9:29 AM, Adrien Grand wrote:
>
>> On Wed, Feb 5, 2014 at 6:42 PM, Nils Dijk wrote:
>>
>>> Ok, I was preparing to do a long bisecting session, but I started with 
>>> the commit you highlighted below (4271d573d60f39564c458e2d3fb7c14afb82d4d8) 
>>> and the commit before that one (6481a2fde858520988f2ce28c02a15be3fe108e4). 
>>> And as it turns out, it is the breaking commit.
>>>
>>> If I build the commit of yours from December 3 it fails my test suite.
>>> If I build the commit of Nik from January 6 it still passes my test.
>>>
>>> I also tried reverting your commit on the v1.0.0.RC1 tag, but it gave me 
>>> all kinds of conflicts so I could not test RC1 without your commit.
>>>
>>> If you would like I can still do a full bisect, but I suspect I end up 
>>> at your commit since I tested that one, and the one before.
>>>
>>> Would it be possible for you to send a .patch without the unsafe stuff, 
>>> so I can apply that to a commit and make a build?
>>>
>>
>> Thanks Nils for your work, this is much appreciated.
>>
>> Here is a simple patch attached that short-circuits the use of Unsafe to 
>> do string comparisons.
>>
>> Maybe you could also try to set the `cache.recycler.page.type` setting to 
>> `none` to see if that changes anything.
>>
>> -- 
>> Adrien Grand
>>  
>
>
>
> -- 
> Adrien Grand
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/af8e91d8-4a97-42d3-9dd5-8a980ded493e%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
Hi Jörg,

Glad you could reproduce with my updated gist.

cb.

On Wednesday, February 5, 2014 8:18:39 PM UTC+1, Jörg Prante wrote:
>
> Nils, I ran the test on my Mac, and I can reproduce the issue. And also on 
> Linux.
>
> Unfortunately the Mac locked up and I had to cold reboot, and my 
> copy/paste logs are gone with all the numbers, but anyway.
>
> As a matter of fact, your aggregates demo is daunting.
>
> On the Mac, it shows different counts even between the first and the 
> subsequent executions. The counts of the first are lower, and also, even 
> different terms show up. On Linux, I do not observe different counts 
> between runs.
>

The issue you describe for Mac is the issue I discussed here.

>
> But, what's more bothering is, I observed different results in regard to 
> the shard count, and that is both on Mac and Linux. The more the hit count 
> is on top of the buckets, the more the counts match, only the lower buckets 
> differ, so the deviating counts are somewhat hard to notice.
>

That the counts differ when you change the number of shards is a long-known 
problem of Elasticsearch, and it was also a problem in faceting. A long thread 
about the nature of this problem can be found here: 
https://github.com/elasticsearch/elasticsearch/issues/1305.

It is an issue which you can circumvent easily in one of two ways:

   1. Use the term you aggregate on as a routing key. This forces identical 
   terms onto the same shard, and thus always returns the exact count. However, 
   this only works if you do this kind of analytics over a single field.
   2. Increase the shard_size for the terms aggregation. This way the shards 
   internally create bigger lists, which then have a better chance of 
   containing the actual top terms. 
   
http://www.elasticsearch.org/guide/en/elasticsearch/reference/1.x/search-aggregations-bucket-terms-aggregation.html#_size_amp_shard_size
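Option 2 boils down to adding a shard_size to the terms aggregation request, roughly like this (the field name and sizes are illustrative):

```json
{
  "size": 0,
  "aggs": {
    "a": {
      "terms": {
        "field": "screen_name",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}
```

Each shard then reports its local top 100 terms instead of only its top 10, and the final top 10 is computed from those larger partial lists.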


> I use Java 8 FCS, but since you observe this issue also on Java 7, I think 
> it is not an issue of Java 8. And it's both on Mac and Linux, but with 
> different symptoms.
>

That makes Mac OS X the only factor common to all the failing setups, and 
this holds across Java versions: I tested both 1.7 and 1.6. It is unfortunate 
that Adrien wasn't able to reproduce it on OS X.
 

>
> ES 1.0.0.RC2
> Mac OS X 10.8.5
> Darwin Jorg-Prantes-MacBook-Pro.local 12.5.0 Darwin Kernel Version 12.5.0: 
> Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
> java version "1.8.0"
> Java(TM) SE Runtime Environment (build 1.8.0-b128)
> Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
> G1GC enabled
>
> ES 1.0.0.RC2
> RHEL 6.3
> Linux zephyros 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 
> x86_64 x86_64 x86_64 GNU/Linux
> java version "1.8.0"
> Java(TM) SE Runtime Environment (build 1.8.0-b128)
> Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
> G1GC enabled
>
> Here are two Linux examples. Note, the last three terms and counts are 
> different.
>
> shards=10
>
> {
>   "took" : 143,
>   "timed_out" : false,
>   "_shards" : {
> "total" : 10,
> "successful" : 10,
> "failed" : 0
>   },
>   "hits" : {
> "total" : 1060387,
> "max_score" : 0.0,
> "hits" : [ ]
>   },
>   "aggregations" : {
> "a" : {
>   "buckets" : [ {
> "key" : "totaltrafficbos",
> "doc_count" : 3599
>   }, {
> "key" : "mai93thm",
> "doc_count" : 2517
>   }, {
> "key" : "mai90thm",
> "doc_count" : 2207
>   }, {
> "key" : "mai95thm",
> "doc_count" : 2207
>   }, {
> "key" : "totaltrafficnyc",
> "doc_count" : 1660
>   }, {
> "key" : "confessions",
> "doc_count" : 1534
>   }, {
> "key" : "incidentreports",
> "doc_count" : 1468
>   }, {
> "key" : "nji80thm",
> "doc_count" : 1071
>   }, {
> "key" : "pai76thm",
> "doc_count" : 1039
>   }, {
> "key" : "txi35thm",
> "doc_count" : 357
>   } ]
> }
>   }
> }
>
> shards=5
>
> {
>   "took" : 172,
>   "timed_out" : false,
>   "_shards" : {
> "total" : 5,
> "successful" : 5,
> "failed" : 0
>   },
>   "hits" : {
> "total" : 1060387,
> "max_score" : 0.0,
> "hits" : [ ]
>   },
>   "aggregations" : {
> "a" : {
>   "buckets" : [ {
> "key" : "totaltrafficbos",
> "doc_count" : 3599
>   }, {
> "key" : "mai93thm",
> "doc_count" : 2517
>   }, {
> "key" : "mai90thm",
> "doc_count" : 2207
>   }, {
> "key" : "mai95thm",
> "doc_count" : 2207
>   }, {
> "key" : "totaltrafficnyc",
> "doc_count" : 1660
>   }, {
> "key" : "confessions",
> "doc_count" : 1534
>   }, {
> "key" : "incidentreports",
> "doc_count" : 1468
>   }, {
> "key" : "nji80thm",
> "doc_count" : 1180
>   }, {
> "key" : "pai76thm",
>  

Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
Ok, I was preparing to do a long bisecting session, but I started with the 
commit you highlighted below (4271d573d60f39564c458e2d3fb7c14afb82d4d8) and 
the commit before that one (6481a2fde858520988f2ce28c02a15be3fe108e4). And 
as it turns out, it is the breaking commit.

If I build your commit from December 3, it fails my test suite.
If I build Nik's commit from January 6, it still passes my test.

I also tried reverting your commit on the v1.0.0.RC1 tag, but it gave me 
all kinds of conflicts so I could not test RC1 without your commit.

If you would like, I can still do a full bisect, but I suspect I would end up 
at your commit, since I tested that one and the one before it.

Would it be possible for you to send a .patch without the unsafe stuff, so 
I can apply that to a commit and make a build?

Thanks in advance,

On Wednesday, February 5, 2014 6:10:35 PM UTC+1, Adrien Grand wrote:
>
>
> On Wed, Feb 5, 2014 at 6:01 PM, Nils Dijk wrote:
>>
>> I was trying to find out if I could disable this unsafe 
>> string comparisons, but could not really find where that should be 
>> disabled. Is there an easy way for me to switch back that change? Do you 
>> know on what commit this was changed so I can revert that commit in my 
>> local clone of the repo, do a build to see if the problem is solved that 
>> way?
>>
>
> Sure, this was changed in 4271d573d60f39564c458e2d3fb7c14afb82d4d8 However 
> I also just read that you can't reproduce the issue with one shard although 
> this shouldn't be relevant.
>  
>
>> For reproducing I do not really see what could impact this besides from 
>> the OS and java version. And the other OSX machine was a different version 
>> of OS AND java, and still having the same results.
>>
>> I am however a bit more relaxed with the issue not showing up on our 
>> production machines, that would have killed the ES migration we are 
>> currently doing. Although it is unfortunate that we can not test our stuff 
>> on our developement machines (all showing the issue here).
>>
>> Do you have any thoughts on what could be different between our setups 
>> that we are having the issue, and you don't?
>>
>
> I wish I had ideas! :-)
>
> Since the issue seems to reproduce consistently for you, something that 
> would be super helpful would be to git bisect in order to find the commit 
> that broke aggregations in your setup (Beta2 commit is 296cfbe3 and rc1 
> commit is 2c8ee3fb).
>  
>
>> To make sure, you use my scripts to load it in? Since Jörg seemed to load 
>> the data on a different way (different shardcount and different mapping) 
>> which did not show the issues here.
>>
>
> Yes, I used your scripts, exactly as described in the README.
>  
> -- 
> Adrien Grand
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/ab8f000d-d0ee-4be8-aaa5-46d0718c56e8%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
Thanks for the effort.

I tried running on 1.7.0_51, and it gave me the same issue.

I was trying to find out if I could disable these unsafe string comparisons, 
but could not really find where that should be done. Is there an easy way for 
me to switch that change back? Do you know in which commit this was changed, 
so I can revert it in my local clone of the repo and do a build to see if the 
problem is solved that way?

As for reproducing: I do not really see what could affect this besides the 
OS and Java version. And the other OS X machine runs a different version of 
both the OS and Java, yet shows the same results.

I am, however, a bit more relaxed now that the issue does not show up on our 
production machines; it would have killed the ES migration we are currently 
doing. It is unfortunate, though, that we cannot test our stuff on our 
development machines (which all show the issue here).

Do you have any thoughts on what could be different between our setups, such 
that we hit the issue and you don't?

Just to make sure: did you use my scripts to load the data? Jörg seemed to 
load the data in a different way (different shard count and different 
mapping), which did not show the issue here.

On Wednesday, February 5, 2014 5:40:10 PM UTC+1, Adrien Grand wrote:
>
> I just installed 1.7u25 on a mac with maverick to try to reproduce the 
> issue, but without success (on 1.0.0-RC2).
>
>
> On Wed, Feb 5, 2014 at 4:49 PM, Nils Dijk wrote:
>
>> Hi Adrien,
>>
>> I'm using OSX (Mavericks) and java: (having the issue)
>>
>> $ java -version
>> java version "1.7.0_25"
>> Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
>> Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)
>>
>> My colleague is running OSX (Lion) and java: (having the issue)
>>
>> $ java -version
>> java version "1.6.0_26"
>> Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11D50)
>> Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)
>>
>> A server soon to be used for production Ubuntu 12.04 LTS with java: (Not 
>> having the issue)
>>
>> $ java -version
>> java version "1.7.0_45"
>> Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
>> Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)
>>
>> Could this be an issue with Java on OS X, then?
>>
>> On Wednesday, February 5, 2014 4:38:36 PM UTC+1, Adrien Grand wrote:
>>
>>> I didn't manage to reproduce the issue locally either. What JVM / OS are 
>>> you using (RC1 introduced Unsafe to perform String comparisons in terms 
>>> aggs so I'm wondering if that could be related to your issue)?
>>>
>>>
>>> On Wed, Feb 5, 2014 at 4:33 PM, Nils Dijk  wrote:
>>>
>>>>  I did only test it with 1 and with 10 shards, indeed with 1 shard it 
>>>> did not have any issues, with 10 shards it has issues all the time.
>>>> I also had a colleague testing it with the two scripts in the gist 
>>>> (which uses 10 shards).
>>>>
>>>> Also I do not think the analyzer _should_ have impact, since it would 
>>>> only index more terms on that field if it tokenizes it. Can you use the 
>>>> aggsbug.load.sh to load the data? And than use aggsbug.test.sh to run 
>>>> the test? It should give you a field analyzed with the default analyzer 
>>>> and 
>>>> 10 shards.
>>>>
>>>> I'll try out some different analyzers, and loading the data in 3 shards 
>>>> now to see if that changes things.
>>>>
>>>> On Wednesday, February 5, 2014 4:02:54 PM UTC+1, Jörg Prante wrote:
>>>>>
>>>>> Also the same with shards = 3 and analyzer = standard. Stable results.
>>>>>
>>>>> {
>>>>>   "took" : 240,
>>>>>   "timed_out" : false,
>>>>>   "_shards" : {
>>>>> "total" : 3,
>>>>> "successful" : 3,
>>>>> "failed" : 0
>>>>>   },
>>>>>   "hits" : {
>>>>> "total" : 1060387,
>>>>> "max_score" : 0.0,
>>>>> "hits" : [ ]
>>>>>   },
>>>>>   "aggregations" : {
>>>>> "a" : {
>>>>>   "buckets" : [ {
>>>>> "key" : "totaltrafficbos",
>>>>> "doc_count" : 3599
>>>>>   }, {
>>>>>   

Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
Hi Adrien,

I'm using OSX (Mavericks) and java: (having the issue)

$ java -version
java version "1.7.0_25"
Java(TM) SE Runtime Environment (build 1.7.0_25-b15)
Java HotSpot(TM) 64-Bit Server VM (build 23.25-b01, mixed mode)

My colleague is running OSX (Lion) and java: (having the issue)

$ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03-383-11D50)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02-383, mixed mode)

A server soon to be used for production Ubuntu 12.04 LTS with java: (Not 
having the issue)

$ java -version
java version "1.7.0_45"
Java(TM) SE Runtime Environment (build 1.7.0_45-b18)
Java HotSpot(TM) 64-Bit Server VM (build 24.45-b08, mixed mode)

Could this be an issue with Java on OS X, then?

On Wednesday, February 5, 2014 4:38:36 PM UTC+1, Adrien Grand wrote:
>
> I didn't manage to reproduce the issue locally either. What JVM / OS are 
> you using (RC1 introduced Unsafe to perform String comparisons in terms 
> aggs so I'm wondering if that could be related to your issue)?
>
>
> On Wed, Feb 5, 2014 at 4:33 PM, Nils Dijk wrote:
>
>> I did only test it with 1 and with 10 shards, indeed with 1 shard it did 
>> not have any issues, with 10 shards it has issues all the time.
>> I also had a colleague testing it with the two scripts in the gist (which 
>> uses 10 shards).
>>
>> Also I do not think the analyzer _should_ have impact, since it would 
>> only index more terms on that field if it tokenizes it. Can you use the 
>> aggsbug.load.sh to load the data? And than use aggsbug.test.sh to run 
>> the test? It should give you a field analyzed with the default analyzer and 
>> 10 shards.
>>
>> I'll try out some different analyzers, and loading the data in 3 shards 
>> now to see if that changes things.
>>
>> On Wednesday, February 5, 2014 4:02:54 PM UTC+1, Jörg Prante wrote:
>>>
>>> Also the same with shards = 3 and analyzer = standard. Stable results.
>>>
>>> {
>>>   "took" : 240,
>>>   "timed_out" : false,
>>>   "_shards" : {
>>> "total" : 3,
>>> "successful" : 3,
>>> "failed" : 0
>>>   },
>>>   "hits" : {
>>> "total" : 1060387,
>>> "max_score" : 0.0,
>>> "hits" : [ ]
>>>   },
>>>   "aggregations" : {
>>> "a" : {
>>>   "buckets" : [ {
>>> "key" : "totaltrafficbos",
>>> "doc_count" : 3599
>>>   }, {
>>> "key" : "mai93thm",
>>> "doc_count" : 2517
>>>   }, {
>>> "key" : "mai90thm",
>>> "doc_count" : 2207
>>>   }, {
>>> "key" : "mai95thm",
>>> "doc_count" : 2207
>>>   }, {
>>> "key" : "totaltrafficnyc",
>>> "doc_count" : 1660
>>>   }, {
>>> "key" : "confessions",
>>> "doc_count" : 1534
>>>   }, {
>>> "key" : "incidentreports",
>>> "doc_count" : 1468
>>>   }, {
>>> "key" : "nji80thm",
>>> "doc_count" : 1180
>>>   }, {
>>> "key" : "pai76thm",
>>> "doc_count" : 1142
>>>   }, {
>>> "key" : "txi35thm",
>>> "doc_count" : 379
>>>   } ]
>>> }
>>>   }
>>> }
>>>
>>> You should examine your log files if your ES cluster was able to process 
>>> all the docs correctly while indexing or searching, maybe you encountered 
>>> OOMs or other subtle issues.
>>>
>>> Jörg
>>>
>>>
>>>  -- 
>>
>
>
>
> -- 
> Adrien Grand
>  

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/8f0f80b7-fbf2-4747-90d4-725a06560938%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
I only tested it with 1 and with 10 shards; indeed, with 1 shard it did 
not have any issues, while with 10 shards it has issues all the time.
I also had a colleague test it with the two scripts in the gist (which 
use 10 shards).

Also, I do not think the analyzer _should_ have an impact, since it would only 
index more terms on that field if it tokenizes it. Can you use 
aggsbug.load.sh to load the data, and then aggsbug.test.sh to run the 
test? It should give you a field analyzed with the default analyzer and 10 
shards.

I'll now try out some different analyzers and load the data into 3 shards 
to see if that changes things.

On Wednesday, February 5, 2014 4:02:54 PM UTC+1, Jörg Prante wrote:
>
> Also the same with shards = 3 and analyzer = standard. Stable results.
>
> {
>   "took" : 240,
>   "timed_out" : false,
>   "_shards" : {
> "total" : 3,
> "successful" : 3,
> "failed" : 0
>   },
>   "hits" : {
> "total" : 1060387,
> "max_score" : 0.0,
> "hits" : [ ]
>   },
>   "aggregations" : {
> "a" : {
>   "buckets" : [ {
> "key" : "totaltrafficbos",
> "doc_count" : 3599
>   }, {
> "key" : "mai93thm",
> "doc_count" : 2517
>   }, {
> "key" : "mai90thm",
> "doc_count" : 2207
>   }, {
> "key" : "mai95thm",
> "doc_count" : 2207
>   }, {
> "key" : "totaltrafficnyc",
> "doc_count" : 1660
>   }, {
> "key" : "confessions",
> "doc_count" : 1534
>   }, {
> "key" : "incidentreports",
> "doc_count" : 1468
>   }, {
> "key" : "nji80thm",
> "doc_count" : 1180
>   }, {
> "key" : "pai76thm",
> "doc_count" : 1142
>   }, {
> "key" : "txi35thm",
> "doc_count" : 379
>   } ]
> }
>   }
> }
>
> You should examine your log files if your ES cluster was able to process 
> all the docs correctly while indexing or searching, maybe you encountered 
> OOMs or other subtle issues.
>
> Jörg
>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/7c74c649-8a4a-46c5-aaec-b6f3254cc0d9%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-05 Thread Nils Dijk
Hi,

I have now updated the gist with a file in bulk-index format.
I also split the loading from the testing phase, so you can run the test 
multiple times in a row.
I also added a README.md that explains how to run the test.

I'm also filing a bug, as described here: 
http://www.elasticsearch.org/blog/0-90-11-1-0-0-rc2-released/.

On Wednesday, February 5, 2014 9:49:40 AM UTC+1, Jörg Prante wrote:
>
> Sorry, but your file at  https://gist.github.com/8803745.git is broken, 
> it contains invalid JSON, so it can not be processed.
>
> It would be helpful to provide a script with escaped JSON in bulk format.
>
> From what I suspect, you do not use keyword analyzer for faceting/agg'ing, 
> so you will get all kinds of unwanted results. If that explains your 
> fluctuating aggs results, I can not tell. It is rather uncommon to use 
> "facets" and "aggs" side by side.
>
> Jörg
>
>
>
> On Tue, Feb 4, 2014 at 3:01 PM, Nils Dijk wrote:
>
>> To follow up,
>>
>> I have a contained test suite for this problem at 
>> https://gist.github.com/thanodnl/8803745. It contains two files:
>>
>>1. aggsbug.sh
>>2. aggsbug.json
>>
>> The .json file contains ~1M documents newline separated to load into the 
>> database, I was not able to create a curl request to load them directly 
>> into the index.
>> The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh) 
>> contains the instructions for recreating this behavior.
>>
>> I have ran these against the following version:
>>
>>1. 1.0.0.Beta2
>>2. 1.0.0.RC1
>>3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit 
>>0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763 
>>
>> When ran on 1.0.0.Beta2 it gives the same output consistently when I run 
>> the _search over and over again.
>> When ran on 1.0.0.RC1 it will give me multiple different outcomes 
>> comparable to the numbers I posted earlier in the thread,
>> When ran on 1.0.0-SNAPSHOT it behaves the same as in 1.0.0.RC1.
>>
>> That it still was working on 1.0.0.Beta2 proves to me that it is a bug 
>> that got into RC1. I could not find any related ticket on the issues page 
>> of the github repository. Hopefully this is enough information to recreate 
>> the problem.
>>
>> The json file is quite big and could bug when you open the gist it in a 
>> browser. A clone of the gist locally will work best:
>> $ git clone https://gist.github.com/8803745.git
>>
>> I do not really know how to move on from here. Do you want me to open an 
>> issue for this problem at github.com/elasticsearch/elasticsearch? It 
>> would be nice to fix this problem before a release of 1.0.0 since that is 
>> the first release containing the aggregations for analytics.
>>
>> On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:
>>
>>> I've loaded the same dataset in ES1.0.0.Beta2 with the same index 
>>> configuration as in the topic start.
>>>
>>> However now the numbers are consistent if I call the same aggregation 
>>> multiple times in a row AND the number match the numbers of the facets. 
>>> This leads me to the conclusion something is broken from Beta2 to RC1!
>>>
>>> I would like to test this on master, but I could not find any nightly 
>>> builds of elasticsearch. Is there a location where they are stored or 
>>> should I compile it myself?
>>>
>>> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>>>
>>>> Hi Binh Ly,
>>>>
>>>> Thanks for the response.
>>>>
>>>> I'm aware that the numbers are not exact (hence the link to issue #1305 
>>>> in my initial post), and have been advocating slightly incorrect numbers 
>>>> with my colleges and customers for some time already to prepare them for 
>>>> the moment we provide analytics with ES. But what bothers me is that they 
>>>> are *inconsistent*.
>>>>
>>>> If you look at my gist you see that I ran the same aggs 3 times right 
>>>> after each other. If we just look at the top item we see the following 
>>>> results:
>>>>
>>>>1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>>>2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>>>3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>>>
>>>> These results are taken within seconds without any change to the num

Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-04 Thread Nils Dijk
To follow up,

I have a contained test suite for this problem at 
https://gist.github.com/thanodnl/8803745. It contains two files:

   1. aggsbug.sh
   2. aggsbug.json

The .json file contains ~1M newline-separated documents to load into the 
database; I was not able to create a curl request to load them directly 
into the index.
The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh) 
contains the instructions for recreating this behavior.

I have run these against the following versions:

   1. 1.0.0.Beta2
   2. 1.0.0.RC1
   3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit 
   0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763

When run on 1.0.0.Beta2, it gives the same output consistently when I run 
the _search over and over again.
When run on 1.0.0.RC1, it gives me multiple different outcomes, 
comparable to the numbers I posted earlier in the thread.
When run on 1.0.0-SNAPSHOT, it behaves the same as on 1.0.0.RC1.

That it still works on 1.0.0.Beta2 proves to me that this is a bug that 
got into RC1. I could not find any related ticket on the issues page of the 
GitHub repository. Hopefully this is enough information to recreate the 
problem.

The .json file is quite big and may misbehave if you open the gist in a 
browser. Cloning the gist locally will work best:
$ git clone https://gist.github.com/8803745.git

I do not really know how to move on from here. Do you want me to open an 
issue for this problem at github.com/elasticsearch/elasticsearch? It would 
be nice to fix it before the 1.0.0 release, since that is the first release 
containing the aggregations for analytics.

On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:

> I've loaded the same dataset in ES1.0.0.Beta2 with the same index 
> configuration as in the topic start.
>
> However now the numbers are consistent if I call the same aggregation 
> multiple times in a row AND the number match the numbers of the facets. 
> This leads me to the conclusion something is broken from Beta2 to RC1!
>
> I would like to test this on master, but I could not find any nightly 
> builds of elasticsearch. Is there a location where they are stored or 
> should I compile it myself?
>
> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>
>> Hi Binh Ly,
>>
>> Thanks for the response.
>>
>> I'm aware that the numbers are not exact (hence the link to issue #1305 
>> in my initial post), and have been advocating slightly incorrect numbers 
>> with my colleges and customers for some time already to prepare them for 
>> the moment we provide analytics with ES. But what bothers me is that they 
>> are *inconsistent*.
>>
>> If you look at my gist you see that I ran the same aggs 3 times right 
>> after each other. If we just look at the top item we see the following 
>> results:
>>
>>1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>
>> These results are taken within seconds without any change to the number of 
>> documents in the index. If I run them even more you see that it rotates 
>> between a hand full of numbers. Is this also behavior one would expect from 
>> the aggs? And if so, why do the facets show the same number over and over 
>> again?
>>
>> Anyway, I will try to work myself through the aggs code this weekend to get 
>> a better hang of what we could do with it, and what not.
>>
>> -- Nils
>>
>> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>>
>>> Nils,
>>>
>>> This is just the nature of splitting data around in shards. Actually the 
>>> terms facet has the same limitations (i.e. it will also give "approximate 
>>> counts"). Neither the terms facet nor the terms aggregation is better or 
>>> worse than the other - they are both approximations (using different 
>>> implementations). It is correct that if you put all your data in 1 shard, 
>>> then all the counts are exact. If you need to shard, you can increase the 
>>> "shard_size" parameter inside the terms aggregation to "improve accuracy". 
>>> Play with that number until it suits your purposes but the important thing 
>>> is they are just approximations the more documents you have in the index - 
>>> so just don't expect absolute numbers from them if you have more than 1 
>>> shard.
>>>
>>> {
>>>   "size": 0,
>>>   "aggs": {
>>> "a": {
>>>   "terms": {
>>> "field": "actor.displayName",
>>> "shard_size": 1
>>>   }
>>> }
>>>   }
>>> }

Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-02-04 Thread Nils Dijk
I've loaded the same dataset in ES1.0.0.Beta2 with the same index 
configuration as in the topic start.

However, now the numbers are consistent if I call the same aggregation 
multiple times in a row AND the numbers match those of the facets. 
This leads me to the conclusion that something broke between Beta2 and RC1!

I would like to test this on master, but I could not find any nightly 
builds of elasticsearch. Is there a location where they are stored or 
should I compile it myself?

On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>
> Hi Binh Ly,
>
> Thanks for the response.
>
> I'm aware that the numbers are not exact (hence the link to issue #1305 in 
> my initial post), and have been advocating slightly incorrect numbers with 
> my colleagues and customers for some time already to prepare them for the 
> moment we provide analytics with ES. But what bothers me is that they are 
> *inconsistent*.
>
> If you look at my gist you see that I ran the same aggs 3 times right 
> after each other. If we just look at the top item we see the following 
> results:
>
>1. { "key": "totaltrafficbos", "doc_count": 2880 }
>2. { "key": "totaltrafficbos", "doc_count": 2552 }
>3. { "key": "totaltrafficbos", "doc_count": 2179 }
>
> These results are taken within seconds without any change to the number of 
> documents in the index. If I run them more often, you see that it rotates 
> between a handful of numbers. Is this behavior one would expect from 
> the aggs? And if so, why do the facets show the same number over and over 
> again?
>
> Anyway, I will try to work my way through the aggs code this weekend to get a 
> better grasp of what we can do with it, and what not.
>
> -- Nils
>
> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>
>> Nils,
>>
>> This is just the nature of splitting data around in shards. Actually the 
>> terms facet has the same limitations (i.e. it will also give "approximate 
>> counts"). Neither the terms facet nor the terms aggregation is better or 
>> worse than the other - they are both approximations (using different 
>> implementations). It is correct that if you put all your data in 1 shard, 
>> then all the counts are exact. If you need to shard, you can increase the 
>> "shard_size" parameter inside the terms aggregation to "improve accuracy". 
>> Play with that number until it suits your purposes but the important thing 
>> is they are just approximations the more documents you have in the index - 
>> so just don't expect absolute numbers from them if you have more than 1 
>> shard.
>>
>> {
>>   "size": 0,
>>   "aggs": {
>> "a": {
>>   "terms": {
>> "field": "actor.displayName",
>> "shard_size": 1
>>   }
>> }
>>   }
>> }
>>
>

-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/6bee2ff8-ae78-4837-91f5-77ee80f55d34%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.


Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-01-31 Thread Nils Dijk
Hi Binh Ly,

Thanks for the response.

I'm aware that the numbers are not exact (hence the link to issue #1305 in 
my initial post), and have been advocating slightly incorrect numbers with 
my colleagues and customers for some time already to prepare them for the 
moment we provide analytics with ES. But what bothers me is that they are 
*inconsistent*.

If you look at my gist you see that I ran the same aggs 3 times right after 
each other. If we just look at the top item we see the following results:

   1. { "key": "totaltrafficbos", "doc_count": 2880 }
   2. { "key": "totaltrafficbos", "doc_count": 2552 }
   3. { "key": "totaltrafficbos", "doc_count": 2179 }
   
These results are taken within seconds without any change to the number of 
documents in the index. If I run them more often, you see that it rotates between 
a handful of numbers. Is this behavior one would expect from the aggs? 
And if so, why do the facets show the same number over and over again?

Anyway, I will try to work my way through the aggs code this weekend to get a 
better grasp of what we can do with it, and what not.

-- Nils

On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>
> Nils,
>
> This is just the nature of splitting data around in shards. Actually the 
> terms facet has the same limitations (i.e. it will also give "approximate 
> counts"). Neither the terms facet nor the terms aggregation is better or 
> worse than the other - they are both approximations (using different 
> implementations). It is correct that if you put all your data in 1 shard, 
> then all the counts are exact. If you need to shard, you can increase the 
> "shard_size" parameter inside the terms aggregation to "improve accuracy". 
> Play with that number until it suits your purposes but the important thing 
> is they are just approximations the more documents you have in the index - 
> so just don't expect absolute numbers from them if you have more than 1 
> shard.
>
> {
>   "size": 0,
>   "aggs": {
> "a": {
>   "terms": {
> "field": "actor.displayName",
> "shard_size": 1
>   }
> }
>   }
> }
>
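Applied to the query from this thread, the suggestion above would look roughly like the request below (a sketch: the "participants" aggregation and "actor.displayName" field are taken from the thread; the shard_size value of 100 is an arbitrary illustrative choice, to be tuned against your own data and accuracy needs):

```json
{
  "size": 0,
  "aggs": {
    "participants": {
      "terms": {
        "field": "actor.displayName",
        "size": 10,
        "shard_size": 100
      }
    }
  }
}
```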



Re: Inconsistent responses from aggregations (ES1.0.0RC1)

2014-01-31 Thread Nils Dijk
I finished indexing the same dataset in an index with only one shard.

$ curl 'http://localhost:9200/52b1e8c1f8b9d7313004/_search?pretty=true' 
-d '{
   "size": 0,
   "facets": {
  "participants": {
 "terms": {
"field": "actor.displayName",
"size": 10
 }
  }
   },
   "aggs": {
  "participants": {
 "terms": {
"field": "actor.displayName",
"size": 10
 }
  }
   }
}'
{
  "took" : 1377,
  "timed_out" : false,
  "_shards" : {
"total" : 1,
"successful" : 1,
"failed" : 0
  },
  "hits" : {
"total" : 1060387,
"max_score" : 0.0,
"hits" : [ ]
  },
  "facets" : {
"participants" : {
  "_type" : "terms",
  "missing" : 0,
  "total" : 1129848,
  "other" : 270,
  "terms" : [ {
"term" : "totaltrafficbos",
"count" : 3599
  }, {
"term" : "mai93thm",
"count" : 2517
  }, {
"term" : "mai95thm",
"count" : 2207
  }, {
"term" : "mai90thm",
"count" : 2207
  }, {
"term" : "totaltrafficnyc",
"count" : 1660
  }, {
"term" : "confessions",
"count" : 1534
  }, {
"term" : "incidentreports",
"count" : 1468
  }, {
"term" : "nji80thm",
"count" : 1180
  }, {
"term" : "pai76thm",
"count" : 1142
  }, {
"term" : "txi35thm",
"count" : 1064
  } ]
}
  },
  "aggregations" : {
"participants" : {
  "buckets" : [ {
"key" : "totaltrafficbos",
"doc_count" : 3599
  }, {
"key" : "mai93thm",
"doc_count" : 2517
  }, {
"key" : "mai90thm",
"doc_count" : 2207
  }, {
"key" : "mai95thm",
"doc_count" : 2207
  }, {
"key" : "totaltrafficnyc",
"doc_count" : 1660
  }, {
"key" : "confessions",
"doc_count" : 1534
  }, {
"key" : "incidentreports",
"doc_count" : 1468
  }, {
"key" : "nji80thm",
"doc_count" : 1180
  }, {
"key" : "pai76thm",
"doc_count" : 1142
  }, {
"key" : "txi35thm",
"doc_count" : 1064
  } ]
}
  }
}


Now the counts are the same as with faceting and, more importantly, they are 
consistent.

It seems the problem resides in aggregations over multiple shards. How should I 
proceed from here?

-- Nils
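The single-shard result matching the facet counts is consistent with how distributed top-N term counting works: each shard reports only its local top terms, so counts for terms that miss a shard's list are silently dropped. A minimal, self-contained sketch of that effect (plain Python, not Elasticsearch code; the uniform 50-term dataset and shard counts are invented for illustration):

```python
import random
from collections import Counter

def sharded_top_terms(docs, n_shards, size, shard_size, seed=1):
    """Schematic model of a distributed terms aggregation: each shard
    returns only its local top `shard_size` terms, and the coordinating
    node sums those partial counts to pick the final top `size`."""
    rng = random.Random(seed)
    shards = [Counter() for _ in range(n_shards)]
    for term in docs:
        # Route each document to a shard (randomly here, standing in
        # for routing by hashed document id).
        shards[rng.randrange(n_shards)][term] += 1
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count  # counts outside a shard's top list are lost
    return merged.most_common(size)

# 100k documents spread evenly over 50 terms: every term occurs 2000 times.
docs = [f"term{i % 50}" for i in range(100_000)]

exact = Counter(docs).most_common(3)  # one "shard": exact counts
small = sharded_top_terms(docs, n_shards=10, size=3, shard_size=3)
large = sharded_top_terms(docs, n_shards=10, size=3, shard_size=50)

print(exact)  # every count is 2000
print(small)  # undercounted: each shard reported only 3 of its ~50 terms
print(large)  # shard_size covers all terms, so counts are exact again
```

With shard_size at least the number of distinct terms, nothing is truncated and the merged counts are exact, which is why a single shard (or a generous shard_size) restores the facet numbers. It does not, however, explain counts that change between identical requests on an unchanged index, which is why this still looks like a bug.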

On Friday, January 31, 2014 4:30:55 PM UTC+1, Nils Dijk wrote:
>
> Hi,
>
> I have been tinkering with Elasticsearch 1.0.0RC1 for a bit, especially the 
> aggregations part. When looking more closely at the responses of the aggregations, I 
> noticed the numbers fluctuated all the time.
>
> I have an index:
>   shards: 10
>   replicas: 0
>   documents: ~1M
>
> Currently I'm not ingesting data anymore.
>
> When I try to recreate the terms facet in aggregations I came up with the 
> following:
>
> {
>"size": 0,
>"facets": {
>   "participants": {
>  "terms": {
> "field": "actor.displayName",
> "size": 10
>  }
>   }
>},
>"aggs": {
>   "participants": {
>  "terms": {
> "field": "actor.displayName",
> "size": 10
>  }
>   }
>}
> }
>
>
> This should give me roughly the top 10 
> (*<https://github.com/elasticsearch/elasticsearch/issues/1305>) 
> occurring terms in the 'actor.displayName' field. The terms facet gives the 
> same counts over and over again, which is what is expected. However, the 
> counts from the aggregations return different numbers every time I invoke 
> it. Results of 3 consecutive runs: 
> https://gist.github.com/thanodnl/8733837.
>
> Currently I'm reindexing all the documents in an index with only one shard 
> to see if that makes a difference.
> This would only solve the problem short term, but our production load is 
> too big to fit in one shard.
>
> -- Nils
>



Inconsistent responses from aggregations (ES1.0.0RC1)

2014-01-31 Thread Nils Dijk
Hi,

I have been tinkering with Elasticsearch 1.0.0RC1 for a bit, especially the 
aggregations part. When looking more closely at the responses of the aggregations, I 
noticed the numbers fluctuated all the time.

I have an index:
  shards: 10
  replicas: 0
  documents: ~1M

Currently I'm not ingesting data anymore.

When I try to recreate the terms facet in aggregations I came up with the 
following:

{
   "size": 0,
   "facets": {
  "participants": {
 "terms": {
"field": "actor.displayName",
"size": 10
 }
  }
   },
   "aggs": {
  "participants": {
 "terms": {
"field": "actor.displayName",
"size": 10
 }
  }
   }
}


This should give me roughly the top 10 
(*<https://github.com/elasticsearch/elasticsearch/issues/1305>) 
occurring terms in the 'actor.displayName' field. The terms facet gives the 
same counts over and over again, which is what is expected. However, the 
counts from the aggregations return different numbers every time I invoke 
it. Results of 3 consecutive runs: https://gist.github.com/thanodnl/8733837.

Currently I'm reindexing all the documents in an index with only one shard 
to see if that makes a difference.
This would only solve the problem short term, but our production load is 
too big to fit in one shard.

-- Nils
