Re: slow performance on phrase queries in should clause

2014-12-05 Thread Kireet Reddy
I spent some more time debugging this yesterday, and it started driving me 
a little crazy. To test my theory, I reduced the number of terms in my must 
filter from ~100 to 1. If the should clause were executing over all 
documents, the query should have remained slow, but it ended up executing 
quickly! So I am a little lost as to what's going on. Does 
elasticsearch/lucene use any heuristics about which clause to execute first 
that might cause this? I am using 1.3.5.

I'll ask our ops guys about setting up an installation of the master branch 
to see if there's any improvement. Would I need to change the query at all? 
In the meantime, is there anything I can do on the 1.3 branch? Should I 
split the should clauses off into a separate bool filter and wrap everything 
in an and filter? I.e. something like:
AND of
  + bool filter with the selective terms filter
  + bool filter with the must filters
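In query DSL terms, roughly this (untested sketch; the placeholders stand in for the real terms filter and phrase text from the original query):

{
  "query": {
    "constant_score": {
      "filter": {
        "and": [
          { "bool": { "must": { "terms": { -- selective terms filter -- } } } },
          { "bool": { "should": { "query": { "match": { "text": { "query": "…", "type": "phrase" } } } } } }
        ]
      }
    }
  }
}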

Also, I've run into a few of these performance issues, and it would have been 
really helpful to have something like an explain plan for database 
queries, or to be able to set an explain-type option on the query so it would 
collect performance info at each step while processing the query and send 
it back with the results. Right now it's really kind of a black box for me, 
especially with caching kicking in at times. Has there ever been any 
thought about implementing something like this in lucene/elasticsearch?

Thanks
Kireet

On Friday, December 5, 2014 3:12:49 AM UTC-8, Michael McCandless wrote:
>
> It's likely the should is (stupidly) being fully expanded before being 
> AND'd with the must ... but there are improvements here 
> (XBooleanFilter.java) to this in master, are you able to test and see if 
> it's still slow?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
> 2014-12-04 19:21 GMT-05:00 Kireet Reddy wrote:
>
>> Our system is normally very responsive, but very occasionally people 
>> submit long phrase queries which time out and cause high system load. Not 
>> all long phrase queries cause issues, but I have been debugging one that 
>> I've found.[1]
>>
>> The query is in the filter section of a constant_score query, as below. 
>> This form times out. However, if I move the query out of the should section 
>> and into the must section, the query runs very quickly (in the full query, 
>> there was another filter in the should section). Converting this to an AND 
>> filter is also fast. Is there a reason for this? Are should filters 
>> executed on the full set and not short-circuited with the results of must 
>> filters?
>>
>> {
>>   "query": {
>>     "constant_score": {
>>       "filter": {
>>         "bool": {
>>           "must": { "terms": { -- selective terms filter -- } },
>>           "should": { "query": { "match": { "text": { "query": "…", "type": "phrase" } } } }
>>         }
>>       }
>>     }
>>   }
>> }
>>
>>
>>
>>
>>
>> [1] query 
>> -- 
>> ぶ新サービスは2015年春にリリースの予定。IoTのハードウェアそのものではなく、SDKやデータベース、解析、IDといったバックグラウンド環境をサービスとして提供するというものだ。発表後、松本氏は「例えばイケてる時計型のプロダクトを作ったとして、(機能面では)単体での価値は1〜2割だったりする。でも本当に重要なのはバックエンド。しかしユーザーから見てみれば時計というプロダクトそのものに大きな価値を感じることが多い。そうであれば、IoTのバックエンドをBaaS(Backend
>>  
>> as a 
>> Service:ユーザーの登録や管理、データ保管といったバックエンド環境をサービスとして提供すること)のように提供できればプロダクトの開発に集中できると思う。クラウドが出てネットサービスの開発が手軽になったのと同じような環境を提供したい」とサービスについて語ってくれた。
>>  
>>
>
>



slow performance on phrase queries in should clause

2014-12-04 Thread Kireet Reddy
Our system is normally very responsive, but very occasionally people submit 
long phrase queries which time out and cause high system load. Not all long 
phrase queries cause issues, but I have been debugging one that I've 
found.[1]

The query is in the filter section of a constant_score query, as below. This 
form times out. However, if I move the query out of the should section and 
into the must section, the query runs very quickly (in the full query, 
there was another filter in the should section). Converting this to an AND 
filter is also fast. Is there a reason for this? Are should filters 
executed on the full set and not short-circuited with the results of must 
filters?

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": { "terms": { -- selective terms filter -- } },
          "should": { "query": { "match": { "text": { "query": "…", "type": "phrase" } } } }
        }
      }
    }
  }
}
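
For comparison, the variant that runs quickly simply moves the phrase match into the must clause (sketch only, same placeholders as above):

{
  "query": {
    "constant_score": {
      "filter": {
        "bool": {
          "must": [
            { "terms": { -- selective terms filter -- } },
            { "query": { "match": { "text": { "query": "…", "type": "phrase" } } } }
          ]
        }
      }
    }
  }
}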





[1] query 
-- 
ぶ新サービスは2015年春にリリースの予定。IoTのハードウェアそのものではなく、SDKやデータベース、解析、IDといったバックグラウンド環境をサービスとして提供するというものだ。発表後、松本氏は「例えばイケてる時計型のプロダクトを作ったとして、(機能面では)単体での価値は1〜2割だったりする。でも本当に重要なのはバックエンド。しかしユーザーから見てみれば時計というプロダクトそのものに大きな価値を感じることが多い。そうであれば、IoTのバックエンドをBaaS(Backend
 
as a 
Service:ユーザーの登録や管理、データ保管といったバックエンド環境をサービスとして提供すること)のように提供できればプロダクトの開発に集中できると思う。クラウドが出てネットサービスの開発が手軽になったのと同じような環境を提供したい」とサービスについて語ってくれた。
 



slow execution of nested boolean filter

2014-09-16 Thread Kireet Reddy
I have a query with a nested boolean (a boolean within a boolean) filter with 
a should clause that performs really terribly. But if I move the nested 
query up to the top level, it performs as much as 50x faster. I am struggling 
to understand why this is the case. Here are the 2 forms: 

https://gist.github.com/anonymous/cdbfbb940fa8da81019f

You can see the "flattened" version simply repeats the query filter for 
each of the nested should clauses in the "nested" version. The initial 
"groups" terms filter is very selective. So I am thinking that the nested 
boolean filter must somehow work on all documents rather than just the 
ones filtered by that initial clause? Why would that be the case?
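
To give a rough idea of the two shapes (purely illustrative, with placeholders instead of the real clauses): the nested form looks like the sketch below, while the "flattened" version instead repeats the "groups" terms filter inside a bool must alongside each of the should clauses.

{
  "bool": {
    "must": { "terms": { "groups": [ -- selective groups -- ] } },
    "should": {
      "bool": {
        "should": [
          { -- clause A -- },
          { -- clause B -- }
        ]
      }
    }
  }
}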




Re: slow filter execution

2014-07-31 Thread Kireet Reddy
Quick update, I found that if I explicitly set _cache to true, things seem 
to work more as expected, i.e. subsequent executions of the query sped up. 
I looked at DateFieldMapper.rangeFilter() and to me it looks like if a 
number is passed, caching will be disabled unless it's explicitly set to 
true. Not sure if this has been fixed in 1.3.x yet or not. This meshes with 
my observed behavior. 
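
For anyone else hitting this, the workaround that seems to help is explicitly setting the cache flag on the range filter, e.g. (using the same timestamp as in my earlier test query):

{
  "range": {
    "published": { "to": 1406064191883 },
    "_cache": true
  }
}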

On Wednesday, July 30, 2014 8:59:37 AM UTC-7, Kireet Reddy wrote:
>
> Thanks for the detailed reply. 
>
> I am a bit confused about and vs bool filter execution. I read this post 
> <http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/> 
> on 
> the elasticsearch blog. From that, I thought the bool filter would work by 
> basically creating a bitset for the entire segment(s) being examined. If 
> the filter value changes every time, will this still be cheaper than an AND 
> filter that will just examine the matching docs? My segments can be very 
> big, and this query, for example, only matched one document.
>
> There is no match_all query filter; there is a "match" query filter on a 
> field named "all". :)
>
> Based on your feedback, I moved all filters, including the query filter, 
> into the bool filter. However it didn't change things: the query is an 
> order of magnitude slower with the range filter, unless I set execution to 
> fielddata. I am using 1.2.2; I tried the strategy anyway and it didn't 
> make a difference.
>
> {
>   "query": {
>     "filtered": {
>       "query": { "match_all": {} },
>       "filter": {
>         "bool": {
>           "must": [
>             { "terms": { "source_id": ["s1", "s2", "s3"] } },
>             { "query": { "match": { "all": { "query": "foo" } } } },
>             { "range": { "published": { "to": 1406064191883 } } }
>           ]
>         }
>       }
>     }
>   },
>   "sort": [
>     { "crawlDate": { "order": "desc" } }
>   ]
> }
>
> On Wednesday, July 30, 2014 4:30:10 AM UTC-7, Clinton Gormley wrote:
>>
>> Don't use the `and` filter - use the `bool` filter instead.  They have 
>> different execution modes and the `bool` filter works best with bitset 
>> filters (but also knows how to handle non-bitset filters like geo etc).  
>>
>> Just remove the `and`, `or` and `not` filters from your DSL vocabulary.
>>
>> Also, not sure why you are ANDing with a match_all filter - that doesn't 
>> make much sense.
>>
>> Depending on which version of ES you're using, you may be encountering a 
>> bug in the filtered query which ended up always running the query first, 
>> instead of the filter. This was fixed in v1.2.0 
>> https://github.com/elasticsearch/elasticsearch/issues/6247 .  If you are 
>> on an earlier version you can force filter-first execution manually by 
>> specifying a "strategy" of "random_access_100".  See 
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html#_filter_strategy
>>
>> In summary, (and taking your less granular datetime clause into account) 
>> your query would be better written as:
>>
>> GET /_search
>> {
>>   "query": {
>> "filtered": {
>>   "strategy": "random_access_100",   pre 1.2 only
>>   "filter": {
>> "bool": {
>>   "must": [
>> {
>>   "terms": {
>> "source_id": [ "s1", "s2", "s3&qu

Re: slow filter execution

2014-07-30 Thread Kireet Reddy
Thanks for the detailed reply. 

I am a bit confused about and vs bool filter execution. I read this post 
(http://www.elasticsearch.org/blog/all-about-elasticsearch-filter-bitsets/) on 
the elasticsearch blog. From that, I thought the bool filter would work by 
basically creating a bitset for the entire segment(s) being examined. If 
the filter value changes every time, will this still be cheaper than an AND 
filter that will just examine the matching docs? My segments can be very 
big, and this query, for example, only matched one document.

There is no match_all query filter; there is a "match" query filter on a 
field named "all". :)

Based on your feedback, I moved all filters, including the query filter, 
into the bool filter. However it didn't change things: the query is an 
order of magnitude slower with the range filter, unless I set execution to 
fielddata. I am using 1.2.2; I tried the strategy anyway and it didn't 
make a difference.

{
  "query": {
    "filtered": {
      "query": { "match_all": {} },
      "filter": {
        "bool": {
          "must": [
            { "terms": { "source_id": ["s1", "s2", "s3"] } },
            { "query": { "match": { "all": { "query": "foo" } } } },
            { "range": { "published": { "to": 1406064191883 } } }
          ]
        }
      }
    }
  },
  "sort": [
    { "crawlDate": { "order": "desc" } }
  ]
}
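
The fielddata variant of the range filter that I mentioned looks roughly like this (sketch; I believe this is the 1.x syntax for the execution option):

{
  "range": {
    "published": { "to": 1406064191883 },
    "execution": "fielddata"
  }
}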

On Wednesday, July 30, 2014 4:30:10 AM UTC-7, Clinton Gormley wrote:
>
> Don't use the `and` filter - use the `bool` filter instead.  They have 
> different execution modes and the `bool` filter works best with bitset 
> filters (but also knows how to handle non-bitset filters like geo etc).  
>
> Just remove the `and`, `or` and `not` filters from your DSL vocabulary.
>
> Also, not sure why you are ANDing with a match_all filter - that doesn't 
> make much sense.
>
> Depending on which version of ES you're using, you may be encountering a 
> bug in the filtered query which ended up always running the query first, 
> instead of the filter. This was fixed in v1.2.0 
> https://github.com/elasticsearch/elasticsearch/issues/6247 .  If you are 
> on an earlier version you can force filter-first execution manually by 
> specifying a "strategy" of "random_access_100".  See 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filtered-query.html#_filter_strategy
>
> In summary, (and taking your less granular datetime clause into account) 
> your query would be better written as:
>
> GET /_search
> {
>   "query": {
> "filtered": {
>   "strategy": "random_access_100",   pre 1.2 only
>   "filter": {
> "bool": {
>   "must": [
> {
>   "terms": {
> "source_id": [ "s1", "s2", "s3" ]
>   }
> },
> {
>   "range": {
> "published": {
>   "gte": "now-1d/d"   coarse grained, cached
> }
>   }
> },
> {
>   "range": {
> "published": {
>   "gte": "now-30m"  fine grained, not cached, 
> could use fielddata too
> },
> "_cache": false
>   }
> }
>   ]
> }
>   }
> }
>   }
> }
>
>
>
>
>
> On 30 July 2014 10:55, David Pilato wrote:
>
>> May be a stupid question: why did you put that filter inside a query and 
>> not within the same filter you have at the end?
>>
>>
>> For my test case it's the same every time. In the "real" query it will 
>>> change every time, but I planned to not cache this filter and have a less 
>>> granular date filter in the bool filter that would be cached. However while 
>>> debugging I noticed slowness with the date range filters even while testing 
>>> with the same value repeatedly.
>>>

Re: slow filter execution

2014-07-30 Thread Kireet Reddy
For my test case it's the same every time. In the "real" query it will
change every time, but I planned to not cache this filter and have a less
granular date filter in the bool filter that would be cached. However while
debugging I noticed slowness with the date range filters even while testing
with the same value repeatedly.
On Jul 29, 2014 10:49 PM, "David Pilato"  wrote:

> Any chance your filter value changes for every call?
> Or are you using exactly the same value each time?
>
> --
> David ;-)
> Twitter : @dadoonet / @elasticsearchfr / @scrutmydocs
>
>
> On Jul 30, 2014, at 05:03, Kireet Reddy wrote:
>
> One of my queries has been consistently taking 500ms-1s and I can't figure
> out why. Here is the query
> <https://gist.github.com/anonymous/d98fb2c46d9a7755e882> (it looks a bit
> strange as I have removed things that didn't seem to affect execution
> time). When I remove the range filter, the query consistently takes < 10ms.
> The query itself only returns 1 hit with or without the range filter, so I
> am not sure why simply including this filter adds so much time. My nodes
> are not experiencing any filter cache evictions. I also tried moving it to
> the bool section with no luck. Changing execution to "fielddata" does
> improve execution time to < 10ms though. Since I am sorting on the same
> field, I suppose this should be fine. But I would like to understand why
> the slowdown occurs. The published field is a date type and has eager field
> data loading enabled.
>
> Thanks
> Kireet
>
>
>



slow filter execution

2014-07-29 Thread Kireet Reddy
One of my queries has been consistently taking 500ms-1s and I can't figure 
out why. Here is the query: 
https://gist.github.com/anonymous/d98fb2c46d9a7755e882 (it looks a bit 
strange as I have removed things that didn't seem to affect execution 
time). When I remove the range filter, the query consistently takes < 10ms. 
The query itself only returns 1 hit with or without the range filter, so I 
am not sure why simply including this filter adds so much time. My nodes 
are not experiencing any filter cache evictions. I also tried moving it to 
the bool section with no luck. Changing execution to "fielddata" does 
improve execution time to < 10ms though. Since I am sorting on the same 
field, I suppose this should be fine. But I would like to understand why 
the slowdown occurs. The published field is a date type and has eager field 
data loading enabled.

Thanks
Kireet




Re: Constant High (~99%) CPU on 1 of 5 Nodes in Cluster

2014-07-29 Thread Kireet Reddy
We've had a very similar issue, but haven't been able to figure out what 
the problem is. How do you "fix" the problem? Will a node restart fix the 
problem immediately or do you need to restart the whole machine? 

On Tuesday, July 29, 2014 1:59:52 PM UTC-7, mic...@modernmast.com wrote:
>
> Hey guys,
>
> We've been running a 5 node cluster for our index (5 shards, 1 replica, 
> evenly distributed on 5 nodes), and are running into a problem with one of 
> the nodes in the cluster. It is not unique to any specific node, and can 
> happen sporadically on any of the nodes.
>
> One of the machines starts spiking up close to 100% CPU Load, and close to 
> 8 OS Load (which is amusing, considering there are only 4 CPU cores on the 
> machine), while all the other machines operate normally way below those 
> figures. Naturally, this behavior is accompanied by extremely high write 
> times, and read times, as well.
>
> Here's what Marvel looks like:
>
>
> 
>
> Here's all the information we could gather:
>
>- Full thread dump from while this occurred: 
>https://gist.github.com/danielschonfeld/ff6c3744197f2c748632
>- GET _nodes/stats: 
>https://gist.github.com/schonfeld/693c8dbf0dd57e4cff7c
>- GET _nodes/hot_threads: 
>https://gist.github.com/schonfeld/766d771d211e452a7100
>- GET _cluster/stats: 
>https://gist.github.com/schonfeld/d5395f97e3a87745cc1f
>
>
> Thoughts? insights? Any clues would be greatly appreciated. 
>



Re: Use java Api to set a document's field as _id

2014-07-27 Thread Kireet Reddy
You should check out the IndexRequestBuilder class. It helps simplify 
creating indexing requests and has a setId() method.

On Friday, July 25, 2014 4:22:42 PM UTC-7, Chia-Eng Chang wrote:
>
> I want to ask if the unique field _id be assigned by certain field within 
> document. I see with Rest, it can achieve by "path":
>
> {
> "tweet" : {
> "_id" : {
>"path" : "post_id"
>}
> }
> }
>
> But if I want to do it with java API, is there any way to achieve it?
>
> Map MapA= new HashMap();
> MapA=MapProcessor(MapA);
>
> client.prepareIndex("index","type").setSource(MapA).execute().actionGet();
>
> How could I modify my code to assign certain field in Map to become _id of 
> this type?
>



Re: Clustering/Sharding impact on query performance

2014-07-21 Thread Kireet Reddy
My working assumption had been that elasticsearch executes queries across all 
shards in parallel and then merges the results, so maybe shards <= cpu cores 
would help in this case where there is only one concurrent query. But I have 
never tested this assumption. Out of curiosity, during the 20 shard test did you 
still see only 1 cpu being used? Did you try 2 shards and get the same results?

On Jul 20, 2014, at 1:01 AM, 'Fin Sekun' via elasticsearch 
 wrote:

> Hi Kireet, thanks for your answer and sorry for the late response. More 
> shards doesn't help. It will slow down the system because each shard takes 
> quite some overhead to maintain a Lucene index and, the smaller the shards, 
> the bigger the overhead. Having more shards enhances the indexing performance 
> and allows to distribute a big index across machines, but I don't have a 
> cluster with a lot of machines. I could observe this negative effects while 
> testing with 20 shards.
> 
> It would be very cool if somebody could answer/comment to the question 
> summarized at the end of my post. Thanks again.
> 
> 
> 
> 
> 
> On Friday, July 11, 2014 3:02:50 AM UTC+2, Kireet Reddy wrote:
> I would test using multiple primary shards on a single machine. Since your 
> dataset seems to fit into RAM, this could help for these longer latency 
> queries.
> 
> On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote:
> Any hints?
> 
> 
> 
> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:
> 
> Hi,
> 
> 
> SCENARIO
> 
> Our Elasticsearch database has ~2.5 million entries. Each entry has the three 
> analyzed fields "match", "sec_match" and "thi_match" (all contains 3-20 
> words) that will be used in this query:
> https://gist.github.com/anonymous/a8d1142512e5625e4e91
> 
> 
> ES runs on two types of servers:
> (1) Real servers (system has direct access to real CPUs, no virtualization) 
> of newest generation - Very performant!
> (2) Cloud servers with virtualized CPUs - Poor CPUs, but this is generic for 
> cloud services.
> 
> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and (2) 
> CPU details.
> 
> 
> ES settings:
> ES version 1.2.0 (jdk1.8.0_05)
> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results)
> vm.max_map_count = 262144
> ulimit -n 64000
> ulimit -l unlimited
> index.number_of_shards: 1
> index.number_of_replicas: 0
> index.store.type: mmapfs
> threadpool.search.type: fixed
> threadpool.search.size: 75
> threadpool.search.queue_size: 5000
> 
> 
> Infrastructure:
> As you can see above, we don't use the cluster feature of ES (1 shard, 0 
> replicas). The reason is that our hosting infrastructure is based on 
> different providers.
> Upside: We aren't dependent on a single hosting provider. Downside: Our 
> servers aren't in the same LAN.
> 
> This means:
> - We cannot use ES sharding, because synchronisation via WAN (internet) seems 
> not a useful solution.
> - So, every ES-server has the complete dataset and we configured only one 
> shard and no replicas for higher performance.
> - We have a distribution process that updates the ES data on every host 
> frequently. This process is fine for us, because updates aren't very often 
> and perfect just-in-time ES synchronisation isn't necessary for our business 
> case.
> - If a server goes down/crashs, the central loadbalancer removes it (the 
> resulting minimal packet lost is acceptable).
>  
> 
> 
> 
> PROBLEM
> 
> For long query terms (6 and more keywords), we have very high CPU loads, even 
> on the high performance server (1), and this leads to high response times: 
> 1-4sec on server (1), 8-20sec on server (2). The system parameters while 
> querying:
> - Very high load (usually 100%) for the thread responsible CPU (the other 
> CPUs are idle in our test scenario)
> - No I/O load (the harddisks are fine)
> - No RAM bottlenecks
> 
> So, we think the file caching is working fine, because we have no I/O 
> problems and the garbage collector seams to be happy (jstat shows very few 
> GCs). The CPU is the problem, and ES hot-threads point to the Scorer module:
> https://gist.github.com/anonymous/9cecfd512cb533114b7d 
> 
> 
> 
> 
> SUMMARY/ASSUMPTIONS
> 
> - Our database size isn't very big and the query not very complex.
> - ES is designed for huge amount of data, but the key is clustering/sharding: 
> Data distribution to many servers means smaller indices, smaller indices 
> leads to fewer CPU load and short response times.
> - So, our database isn't big, but too big for a single CPU, and this means 
> especially low performance (virtual) CPUs can only be used in sharding 
> environments.

Re: excessive merging/small segment sizes

2014-07-13 Thread Kireet Reddy
We did the test with ES still running and indexing data, ES still 
running/not indexing, and ES stopped. All three showed the poor i/o rate. 
Then after a few minutes, the copy i/o rate somehow increased again. It was 
really strange. We still have some digging to do to figure out the problem 
there.

On Sunday, July 13, 2014 2:33:00 AM UTC-7, Michael McCandless wrote:
>
> On Fri, Jul 11, 2014 at 7:35 PM, Kireet Reddy  > wrote:
>
>> The problem reappeared. We did some tests today around copying a large 
>> file on the nodes to test i/o throughput. On the loaded node, the copy was 
>> really slow, maybe 30x slower. So it seems your suspicion around something 
>> external interfering with I/O was right in the end even though nothing else 
>> is running on the machines. We will investigate our setup further but this 
>> doesn't seem like a lucene/elasticsearch issue in the end.
>>
>
> Hmm but ES was still running on the node?  So it could still be something 
> about ES/Lucene that's putting heaving IO load on the box?
>  
>
>> For the index close, I didn't issue any command, elasticsearch seemed to 
>> do that on its own. The code is in IndexingMemoryController. The triggering 
>> event seems to be the ram buffer size change, this triggers a call to 
>> InternalEngine.updateIndexingBufferSize() which then calls flush with type 
>> NEW_WRITER. That seems to close the lucene IndexWriter.
>>
>
> Ahh, thanks for the clarification, yes ES sometimes closes & opens a new 
> writer to make "non-live" settings changes take effect. However, changing 
> RAM buffer size for indexing is a live setting so it should not require the 
> close/open yet indeed (InternalEngine.updateIndexingBufferSize) it does ... 
> I'll dig.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>



Re: Frequent OOM

2014-07-10 Thread Kireet Reddy
Does it seem related to search activity? Merge activity? What does the hot 
threads endpoint show before running out of memory? I might try to cap the 
max segment size, or use more shards so the segments stay smaller than the heap 
size (maybe target 2GB?).
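
If it helps, I believe the maximum merged segment size can be capped with a merge policy setting along these lines (sketch; the 2gb value is just the target I mentioned, and I think the default is 5gb):

{
  "index.merge.policy.max_merged_segment": "2gb"
}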

On Wednesday, July 9, 2014 5:00:29 AM UTC-7, Andrey Perminov wrote:
>
> I've tried to diable bloom filter for older indexes with no luck. I cannot 
> close indexes, because all of them are used.
>
>



Re: Clustering/Sharding impact on query performance

2014-07-10 Thread Kireet Reddy
I would test using multiple primary shards on a single machine. Since your 
dataset seems to fit into RAM, this could help for these longer latency 
queries.
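
Concretely, I mean creating the index with several primary shards on the one box, e.g. something like (illustrative values only):

{
  "settings": {
    "number_of_shards": 4,
    "number_of_replicas": 0
  }
}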

On Thursday, July 10, 2014 12:24:26 AM UTC-7, Fin Sekun wrote:
>
> Any hints?
>
>
>
> On Monday, July 7, 2014 3:51:19 PM UTC+2, Fin Sekun wrote:
>>
>>
>> Hi,
>>
>>
>> *SCENARIO*
>>
>> Our Elasticsearch database has ~2.5 million entries. Each entry has the 
>> three analyzed fields "match", "sec_match" and "thi_match" (all contains 
>> 3-20 words) that will be used in this query:
>> https://gist.github.com/anonymous/a8d1142512e5625e4e91
>>
>>
>> ES runs on two *types of servers*:
>> (1) Real servers (system has direct access to real CPUs, no 
>> virtualization) of newest generation - Very performant!
>> (2) Cloud servers with virtualized CPUs - Poor CPUs, but this is generic 
>> for cloud services.
>>
>> See https://gist.github.com/anonymous/3098b142c2bab51feecc for (1) and 
>> (2) CPU details.
>>
>>
>> *ES settings:*
>> ES version 1.2.0 (jdk1.8.0_05)
>> ES_HEAP_SIZE = 512m (we also tested with 1024m with same results)
>> vm.max_map_count = 262144
>> ulimit -n 64000
>> ulimit -l unlimited
>> index.number_of_shards: 1
>> index.number_of_replicas: 0
>> index.store.type: mmapfs
>> threadpool.search.type: fixed
>> threadpool.search.size: 75
>> threadpool.search.queue_size: 5000
>>
>>
>> *Infrastructure*:
>> As you can see above, we don't use the cluster feature of ES (1 shard, 0 
>> replicas). The reason is that our hosting infrastructure is based on 
>> different providers.
>> Upside: We aren't dependent on a single hosting provider. Downside: Our 
>> servers aren't in the same LAN.
>>
>> This means:
>> - We cannot use ES sharding, because synchronisation via WAN (internet) 
>> seems not a useful solution.
>> - So, every ES-server has the complete dataset and we configured only one 
>> shard and no replicas for higher performance.
>> - We have a distribution process that updates the ES data on every host 
>> frequently. This process is fine for us, because updates aren't very often 
>> and perfect just-in-time ES synchronisation isn't necessary for our 
>> business case.
>> - If a server goes down/crashs, the central loadbalancer removes it (the 
>> resulting minimal packet lost is acceptable).
>>  
>>
>>
>>
>> *PROBLEM*
>>
>> For long query terms (6 and more keywords), we have very high CPU loads, 
>> even on the high performance server (1), and this leads to high response 
>> times: 1-4sec on server (1), 8-20sec on server (2). The system parameters 
>> while querying:
>> - Very high load (usually 100%) for the thread responsible CPU (the other 
>> CPUs are idle in our test scenario)
>> - No I/O load (the harddisks are fine)
>> - No RAM bottlenecks
>>
>> So, we think the file caching is working fine, because we have no I/O 
>> problems and the garbage collector seams to be happy (jstat shows very few 
>> GCs). The CPU is the problem, and ES hot-threads point to the Scorer module:
>> https://gist.github.com/anonymous/9cecfd512cb533114b7d 
>>
>>
>>
>>
>> *SUMMARY/ASSUMPTIONS*
>>
>> - Our database size isn't very big and the query not very complex.
>> - ES is designed for huge amount of data, but the key is 
>> clustering/sharding: Data distribution to many servers means smaller 
>> indices, smaller indices leads to fewer CPU load and short response times.
>> - So, our database isn't big, but too big for a single CPU and this means 
>> especially low performance (virtual) CPUs can only be used in sharding 
>> environments.
>>
>> If we don't want to lost the provider independency, we have only the 
>> following two options:
>>
>> 1) Simpler query (I think not possible in our case)
>> 2) Smaller database
>>
>>
>>
>>
>> *QUESTIONS*
>>
>> Are our assumptions correct? Especially:
>>
>> - Is clustering/sharding (also small indices) the main key to 
>> performance, that means the only possibility to prevent overloaded 
>> (virtual) CPUs?
>> - Is it right that clustering is only useful/possible in LANs?
>> - Do you have any ES configuration or architecture hints regarding our 
>> preference for using multiple hosting providers?
>>
>>
>>
>> Thank you. Rgds
>> Fin
>>
>>



Re: excessive merging/small segment sizes

2014-07-09 Thread Kireet Reddy
Sorry, forgot the link

https://www.dropbox.com/sh/3s6m0bhz4eshi6m/AAD8g3Ukq1UW0IbPV-a-CrBGa/1229.txt

On Wednesday, July 9, 2014 1:05:56 PM UTC-7, Kireet Reddy wrote:
>
> The problem is happening again, this time on node 5. I have captured a few 
> hot thread requests here. I also included one from node 6 (which is now 
> fine).There are merge related stacks, but it seems like everything is 
> taking a lot more cpu than usual. I did a few type=wait and type=block 
> dumps and the result was always 0% usage there. Also young gen gc activity 
> has again gone crazy (old gen/heap size seems fine). Would hot thread 
> measurements be distorted if gc activity is very high?
>
> It seems strange to me that this would only happen on one node while we 
> have replica set to at least 1 for all our indices. It seems like the 
> problems should happen on a couple nodes simultaneously.
>
> --Kireet
>
> On Monday, July 7, 2014 3:51:35 PM UTC-7, Michael McCandless wrote:
>>
>> Could you pull all hot threads next time the problem happens?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>  
>>
>> On Mon, Jul 7, 2014 at 3:47 PM, Kireet Reddy  wrote:
>>
>>> All that seems correct (except I think this is for node 6, not node 5). 
>>> We don't delete documents, but we do some updates. The vast majority of 
>>> documents get indexed into the large shards, but the smaller ones take some 
>>> writes as well.
>>>
>>> We aren't using virtualized hardware and elasticsearch is the only thing 
>>> running on the machines, no scheduled jobs, etc. I don't think something is 
>>> interfering, actually overall disk i/o rate and operations on the machine 
>>> go down quite a bit during the problematic period, which is consistent with 
>>> your observations about things taking longer.
>>>
>>> I went back and checked all our collected metrics again. I noticed that 
>>> even though the heap usage and gc count seems smooth during the period in 
>>> question, gc time spent goes way up. Also active indexing threads goes up, 
>>> but since our ingest rate didn't go up I assumed this was a side effect. 
>>> During a previous occurrence a few days ago on node5, I stopped all 
>>> indexing activity for 15 minutes. Active merges and indexing requests went 
>>> to zero as expected. Then I re-enabled indexing and immediately the 
>>> increased cpu/gc/active merges went back up to the problematic rates.
>>>
>>> Overall this is pretty confusing to me as to what is a symptom vs a root 
>>> cause here. A summary of what I think I know:
>>>
>>>1. Every few days, cpu usage on a node goes way above the other 
>>>nodes and doesn't recover. We've let the node run in the elevated cpu 
>>> state 
>>>for a day with no improvement. 
>>>2. It doesn't seem likely that it's data related. We use replicas=1 
>>>and no other nodes have issues.
>>>3. It doesn't seem hardware related. We run on a dedicated h/w with 
>>>elasticsearch being the only thing running. Also the problem appears on 
>>>various nodes and machine load seems tied directly to the elasticsearch 
>>>process. 
>>>4. During the problematic period: cpu usage, active merge threads, 
>>>active bulk (indexing) threads, and gc time are elevated.
>>>5. During the problematic period: i/o ops and i/o throughput 
>>>decrease. 
>>>6. overall heap usage size seems to smoothly increase, the extra gc 
>>>time seems to be spent on the new gen. Interestingly, the gc count 
>>> didn't 
>>>seem to increase.
>>>7. In the hours beforehand, gc behavior of the problematic node was 
>>>similar to the other nodes. 
>>>8. If I pause indexing, machine load quickly returns to normal, 
>>>merges and indexing requests complete.  if I then restart indexing the 
>>>problem reoccurs immediately.
>>>9. If I disable automatic refreshes, the problem disappears within 
>>>an hour or so. 
>>>10. hot threads show merging activity as the hot threads.
>>>
>>> The first few points make me think the increased active merges is 
>>> perhaps a side effect, but then the last 3 make me think merging is the 
>>> root cause. The only additional things I can think of that may be relevant 
>>> are:
>>>
>>>1. Our documents can vary greatly in size, they average a couple KB 
>>>but

Re: excessive merging/small segment sizes

2014-07-07 Thread Kireet Reddy
All that seems correct (except I think this is for node 6, not node 5). We 
don't delete documents, but we do some updates. The vast majority of 
documents get indexed into the large shards, but the smaller ones take some 
writes as well.

We aren't using virtualized hardware and elasticsearch is the only thing 
running on the machines, no scheduled jobs, etc. I don't think something is 
interfering, actually overall disk i/o rate and operations on the machine 
go down quite a bit during the problematic period, which is consistent with 
your observations about things taking longer.

I went back and checked all our collected metrics again. I noticed that 
even though the heap usage and gc count seems smooth during the period in 
question, gc time spent goes way up. Also active indexing threads goes up, 
but since our ingest rate didn't go up I assumed this was a side effect. 
During a previous occurrence a few days ago on node5, I stopped all 
indexing activity for 15 minutes. Active merges and indexing requests went 
to zero as expected. Then I re-enabled indexing and immediately the 
increased cpu/gc/active merges went back up to the problematic rates.

Overall this is pretty confusing to me as to what is a symptom vs a root 
cause here. A summary of what I think I know:

   1. Every few days, cpu usage on a node goes way above the other nodes 
   and doesn't recover. We've let the node run in the elevated cpu state for a 
   day with no improvement.
   2. It doesn't seem likely that it's data related. We use replicas=1 and 
   no other nodes have issues.
   3. It doesn't seem hardware related. We run on a dedicated h/w with 
   elasticsearch being the only thing running. Also the problem appears on 
   various nodes and machine load seems tied directly to the elasticsearch 
   process.
   4. During the problematic period: cpu usage, active merge threads, 
   active bulk (indexing) threads, and gc time are elevated.
   5. During the problematic period: i/o ops and i/o throughput decrease.
   6. overall heap usage size seems to smoothly increase, the extra gc time 
   seems to be spent on the new gen. Interestingly, the gc count didn't seem 
   to increase.
   7. In the hours beforehand, gc behavior of the problematic node was 
   similar to the other nodes.
   8. If I pause indexing, machine load quickly returns to normal, merges 
   and indexing requests complete.  if I then restart indexing the problem 
   reoccurs immediately.
   9. If I disable automatic refreshes, the problem disappears within an 
   hour or so.
   10. hot threads show merging activity as the hot threads.

The first few points make me think the increased active merges is perhaps a 
side effect, but then the last 3 make me think merging is the root cause. 
The only additional things I can think of that may be relevant are:

   1. Our documents can vary greatly in size, they average a couple KB but 
   can rarely be several MB. 
   2. we do use language analysis plugins, perhaps one of these is acting 
   up? 
   3. We eagerly load one field into the field data cache. But the cache 
   size is ok and the overall heap behavior is ok so I don't think this is the 
   problem.

That's a lot of information, but I am not sure where to go next from here...
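
For completeness, disabling the automatic refreshes mentioned in point 9 was just an index settings update along these lines:

{
  "index": {
    "refresh_interval": "-1"
  }
}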

On Monday, July 7, 2014 8:23:20 AM UTC-7, Michael McCandless wrote:
>
> Indeed there are no big merges during that time ...
>
> I can see on node5, ~14:45 suddenly merges are taking a long time, refresh 
> is taking much longer (4-5 seconds instead of < .4 sec), commit time goes 
> up from < 0.5 sec to ~1-2 sec, etc., but other metrics are fine e.g. total 
> merging GB, number of commits/refreshes is very low during this time.
>
> Each node has 2 biggish (~17 GB) shards and then ~50 tiny shards.  The 
> biggish shards are indexing at a very slow rate and only have ~1% 
> deletions.  Are you explicitly deleting docs?
>
> I suspect something is suddenly cutting into the IO perf of this box, and 
> because merging/refreshing is so IO intensive, it causes these operations 
> to run slower / backlog.
>
> Are there any scheduled jobs, e.g. backups/snapshots, that start up?  Are 
> you running on virtualized hardware?
>
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>  
>
> On Sun, Jul 6, 2014 at 8:23 PM, Kireet Reddy  > wrote:
>
>> Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 
>> 02:10. I included a couple hours before and after in the logs.
>>
>>
>> On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>>>
>>> They are linked below (node5 is the log of the normal node, node6 is the 
>>> log of the problematic node). 
>>>
>>> I don't think it was doing big merges, otherwise during the high load 
>>> period, the merges 

Re: excessive merging/small segment sizes

2014-07-06 Thread Kireet Reddy
Just to reiterate, the problematic period is from 07/05 14:45 to 07/06 
02:10. I included a couple hours before and after in the logs.

On Sunday, July 6, 2014 5:17:06 PM UTC-7, Kireet Reddy wrote:
>
> They are linked below (node5 is the log of the normal node, node6 is the 
> log of the problematic node). 
>
> I don't think it was doing big merges, otherwise during the high load 
> period, the merges graph line would have had a "floor" > 0, similar to the 
> time period after I disabled refresh. We don't do routing and use mostly 
> default settings. I think the only settings we changed are:
>
> indices.memory.index_buffer_size: 50%
> index.translog.flush_threshold_ops: 5
>
> We are running on a 6 cpu/12 cores machine with a 32GB heap and 96GB of 
> memory with 4 spinning disks. 
>
> node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
> node 6 log (high load) 
> <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>
>
> On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>>
>> Can you post the IndexWriter infoStream output?  I can see if anything 
>> stands out.
>>
>> Maybe it was just that this node was doing big merges?  I.e., if you 
>> waited long enough, the other shards would eventually do their big merges 
>> too?
>>
>> Have you changed any default settings, do custom routing, etc.?  Is there 
>> any reason to think that the docs that land on this node are "different" in 
>> any way?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy  wrote:
>>
>>>  From all the information I’ve collected, it seems to be the merging 
>>> activity:
>>>
>>>
>>>1. We capture the cluster stats into graphite and the current merges 
>>>stat seems to be about 10x higher on this node. 
>>>2. The actual node that the problem occurs on has happened on 
>>>different physical machines so a h/w issue seems unlikely. Once the 
>>> problem 
>>>starts it doesn't seem to stop. We have blown away the indices in the 
>>> past 
>>>and started indexing again after enabling more logging/stats. 
>>>3. I've stopped executing queries so the only thing that's happening 
>>>on the cluster is indexing.
>>>4. Last night when the problem was ongoing, I disabled refresh 
>>>(index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>>returned to normal. The merge activity seemed to reduce, it seems like 2 
>>>very long running merges are executing but not much else. 
>>>5. I grepped an hour of logs of the 2 machines for "add merge=", it 
>>>was 540 on the high load node and 420 on a normal node. I pulled out the 
>>>size value from the log message and the merges seemed to be much smaller 
>>> on 
>>>the high load node. 
>>>
>>> I just created the indices a few days ago, so the shards of each index 
>>> are balanced across the nodes. We have external metrics around document 
>>> ingest rate and there was no spike during this time period. 
>>>
>>>
>>>
>>> Thanks
>>> Kireet
>>>
>>>
>>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>>
>>>> It's perfectly normal/healthy for many small merges below the floor 
>>>> size to happen.
>>>>
>>>> I think you should first figure out why this node is different from the 
>>>> others?  Are you sure it's merging CPU cost that's different?
>>>>
>>>> Mike McCandless
>>>>
>>>> http://blog.mikemccandless.com
>>>>
>>>>
>>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:
>>>>
>>>>>  We have a situation where one of the four nodes in our cluster seems 
>>>>> to get caught up endlessly merging.  However it seems to be high CPU 
>>>>> activity and not I/O constrainted. I have enabled the IndexWriter info 
>>>>> stream logs, and often times it seems to do merges of quite small 
>>>>> segments 
>>>>> (100KB) that are much below the floor size (2MB). I suspect this is due 
>>>>> to 
>>>>> frequent refreshes and/or using lots of threads concurrently to do 
>>>>> indexing. Is this true?
>>>>>
>>>>> My supposition is that this i

Re: excessive merging/small segment sizes

2014-07-06 Thread Kireet Reddy
They are linked below (node5 is the log of the normal node, node6 is the 
log of the problematic node). 

I don't think it was doing big merges, otherwise during the high load 
period, the merges graph line would have had a "floor" > 0, similar to the 
time period after I disabled refresh. We don't do routing and use mostly 
default settings. I think the only settings we changed are:

indices.memory.index_buffer_size: 50%
index.translog.flush_threshold_ops: 5

We are running on a 6 cpu/12 cores machine with a 32GB heap and 96GB of 
memory with 4 spinning disks. 

node 5 log (normal) <https://www.dropbox.com/s/uf76m58nf87mdmw/node5.zip>
node 6 log (high load) <https://www.dropbox.com/s/w7qm2v9qpdttd69/node6.zip>

On Sunday, July 6, 2014 4:23:19 PM UTC-7, Michael McCandless wrote:
>
> Can you post the IndexWriter infoStream output?  I can see if anything 
> stands out.
>
> Maybe it was just that this node was doing big merges?  I.e., if you 
> waited long enough, the other shards would eventually do their big merges 
> too?
>
> Have you changed any default settings, do custom routing, etc.?  Is there 
> any reason to think that the docs that land on this node are "different" in 
> any way?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, Jul 6, 2014 at 6:48 PM, Kireet Reddy  > wrote:
>
>>  From all the information I’ve collected, it seems to be the merging 
>> activity:
>>
>>
>>1. We capture the cluster stats into graphite and the current merges 
>>stat seems to be about 10x higher on this node. 
>>2. The actual node that the problem occurs on has happened on 
>>different physical machines so a h/w issue seems unlikely. Once the 
>> problem 
>>starts it doesn't seem to stop. We have blown away the indices in the 
>> past 
>>and started indexing again after enabling more logging/stats. 
>>3. I've stopped executing queries so the only thing that's happening 
>>on the cluster is indexing.
>>4. Last night when the problem was ongoing, I disabled refresh 
>>(index.refresh_interval = -1) around 2:10am. Within 1 hour, the load 
>>returned to normal. The merge activity seemed to reduce, it seems like 2 
>>very long running merges are executing but not much else. 
>>5. I grepped an hour of logs of the 2 machines for "add merge=", it 
>>was 540 on the high load node and 420 on a normal node. I pulled out the 
>>size value from the log message and the merges seemed to be much smaller 
>> on 
>>the high load node. 
>>
>> I just created the indices a few days ago, so the shards of each index 
>> are balanced across the nodes. We have external metrics around document 
>> ingest rate and there was no spike during this time period. 
>>
>>
>>
>> Thanks
>> Kireet
>>
>>
>> On Sunday, July 6, 2014 1:32:00 PM UTC-7, Michael McCandless wrote:
>>
>>> It's perfectly normal/healthy for many small merges below the floor size 
>>> to happen.
>>>
>>> I think you should first figure out why this node is different from the 
>>> others?  Are you sure it's merging CPU cost that's different?
>>>
>>> Mike McCandless
>>>
>>> http://blog.mikemccandless.com
>>>
>>>
>>> On Sat, Jul 5, 2014 at 9:51 PM, Kireet Reddy  wrote:
>>>
>>>>  We have a situation where one of the four nodes in our cluster seems 
>>>> to get caught up endlessly merging.  However it seems to be high CPU 
>>>> activity and not I/O constrainted. I have enabled the IndexWriter info 
>>>> stream logs, and often times it seems to do merges of quite small segments 
>>>> (100KB) that are much below the floor size (2MB). I suspect this is due to 
>>>> frequent refreshes and/or using lots of threads concurrently to do 
>>>> indexing. Is this true?
>>>>
>>>> My supposition is that this is leading to the merge policy doing lots 
>>>> of merges of very small segments into another small segment which will 
>>>> again require a merge to even reach the floor size. My index has 64 
>>>> segments and 25 are below the floor size. I am wondering if there should 
>>>> be 
>>>> an exception for the maxMergesAtOnce parameter for the first level so that 
>>>> many small segments could be merged at once in this case.
>>>>
>>>> I am considering changing the other parameters (wider tiers, lower 
>>>> floor size, more concurre

excessive merging/small segment sizes

2014-07-05 Thread Kireet Reddy
We have a situation where one of the four nodes in our cluster seems to get 
caught up endlessly merging. However it seems to be high CPU activity and 
not I/O constrained. I have enabled the IndexWriter info stream logs, and 
often it seems to do merges of quite small segments (100KB) that are 
much below the floor size (2MB). I suspect this is due to frequent 
refreshes and/or using lots of threads concurrently to do indexing. Is this 
true?

My supposition is that this is leading to the merge policy doing lots of 
merges of very small segments into another small segment which will again 
require a merge to even reach the floor size. My index has 64 segments and 
25 are below the floor size. I am wondering if there should be an exception 
for the maxMergesAtOnce parameter for the first level so that many small 
segments could be merged at once in this case.

I am considering changing the other parameters (wider tiers, lower floor 
size, more concurrent merges allowed) but these all seem to have side 
effects I may not necessarily want. Is there a good solution here?
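
For reference, these are the knobs I am referring to, with what I believe are the defaults in 1.x (the merge scheduler thread count actually depends on core count, so treat these values as approximate):

{
  "index.merge.policy.floor_segment": "2mb",
  "index.merge.policy.max_merge_at_once": 10,
  "index.merge.policy.segments_per_tier": 10,
  "index.merge.scheduler.max_thread_count": 3
}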



Re: node failures

2014-06-17 Thread Kireet Reddy
As soon as we restarted indexing, we saw a lot of merge activity and the 
deleted documents percentage went to around 25%. Does indexing activity trigger 
merges? Currently there is not much merge activity, but some indices still 
have high deleted document counts, e.g. one index has a document count of around 
17m with about 15m deleted, yet no merge activity. I am wondering if merges aren't 
scheduled for that index because writes to that index are infrequent.

On Jun 16, 2014, at 3:16 PM, Mark Walkom  wrote:

> TTL does use a lot of resources as it constantly scans for expired docs. It'd 
> be more efficient to switch to daily indexes and then drop them, though that 
> might not fit your business requirements.
> 
> You can try forcing an optimise on an index, 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-optimize.html,
>  it's very resource intensive though but it if reduces your segment count 
> then it may allude to where the problem lies.
> 
> Regards,
> Mark Walkom
> 
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
> 
> 
> On 17 June 2014 07:07, Kireet Reddy  wrote:
> java version is 1.7.0_55. the servers have a 32GB heap, 96GB of memory, 12 
> logical cores, and 4 spinning disks.
> 
> Currently we have about 450GB of data on each machine, average doc size is 
> about 1.5KB. We create an index (4 shards, 1 replica) every N days. Right now 
> we have 12 indices, meaning about 24 shards/node (12*4*2 / 4). 
> 
> Looking at ElasticHQ, I noticed some warnings around documents deleted. Our 
> percentages are in the 70s and the pass level is 10% (!). Due to our business 
> requirements, we have to use TTL. My understanding is this leads to a lot of 
> document deletions and increased merge activity. However it seems that maybe 
> segments with lots of deletes aren't being merged? We stopped indexing 
> temporarily and there are no merges occurring anywhere in the system so it's 
> not a throttling issue. We are using almost all default settings, but is 
> there some setting in particular I should look at?
> 
> On Jun 10, 2014, at 3:41 PM, Mark Walkom  wrote:
> 
>> Are you using a monitoring plugin such as marvel or elastichq? If not then 
>> installing those will give you a better insight into your cluster.
>> You can also check the hot threads end point to check each node - 
>> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
>> 
>> Providing a bit more info on your cluster setup may help as well, index size 
>> and count, server specs, java version, that sort of thing.
>> 
>> Regards,
>> Mark Walkom
>> 
>> Infrastructure Engineer
>> Campaign Monitor
>> email: ma...@campaignmonitor.com
>> web: www.campaignmonitor.com
>> 
>> 
>> On 11 June 2014 00:41, Kireet Reddy  wrote:
>> On our 4 node test cluster (1.1.2), seemingly out of the blue we had one 
>> node experience very high cpu usage and become unresponsive and then after 
>> about 8 hours another node experienced the same issue. The processes 
>> themselves stayed alive, gc activity was normal, they didn't experience an 
>> OutOfMemoryError. The nodes left the cluster though, perhaps due to the 
>> unresponsiveness. The only errors in the log files were a bunch of messages 
>> like:
>> 
>> org.elasticsearch.search.SearchContextMissingException: No search context 
>> found for id ...
>> 
>> and errors about the search queue being full. We see the 
>> SearchContextMissingException occasionally during normal operation, but 
>> during the high cpu period it happened quite a bit.
>> 
>> I don't think we had an unusually high number of queries during that time 
>> because the other 2 nodes had normal cpu usage and for the prior week things 
>> ran smoothly.
>> 
>> We are going to restart testing, but is there anything we can do to better 
>> understand what happened? Maybe change a particular log level or do 
>> something while the problem is happening, assuming we can reproduce the 
>> issue?
>> 

Re: node failures

2014-06-16 Thread Kireet Reddy
Java version is 1.7.0_55. The servers have a 32GB heap, 96GB of memory, 12 
logical cores, and 4 spinning disks.

Currently we have about 450GB of data on each machine, and the average doc 
size is about 1.5KB. We create an index (4 shards, 1 replica) every N days. 
Right now we have 12 indices, meaning about 24 shards/node (12*4*2 / 4). 

Looking at ElasticHQ, I noticed some warnings around deleted documents. Our 
percentages are in the 70s and the pass level is 10% (!). Due to our business 
requirements, we have to use TTL. My understanding is that this leads to a lot 
of document deletions and increased merge activity. However, it seems that 
segments with lots of deletes aren't being merged. We stopped indexing 
temporarily and there are no merges occurring anywhere in the system, so it's 
not a throttling issue. We are using almost all default settings, but is there 
some setting in particular I should look at?
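For reference, the deleted document percentages above came from ElasticHQ, but 
the same numbers can be pulled straight from the APIs; a quick sketch (exact 
column names and response fields may vary by version):

# docs.count vs docs.deleted per index
curl 'http://localhost:9200/_cat/indices?v'

# Full index stats, including doc counts and merge totals
curl 'http://localhost:9200/_stats?pretty'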

On Jun 10, 2014, at 3:41 PM, Mark Walkom  wrote:

> Are you using a monitoring plugin such as marvel or elastichq? If not then 
> installing those will give you a better insight into your cluster.
> You can also check the hot threads end point to check each node - 
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/cluster-nodes-hot-threads.html
> 
> Providing a bit more info on your cluster setup may help as well, index size 
> and count, server specs, java version, that sort of thing.
> 
> Regards,
> Mark Walkom
> 
> Infrastructure Engineer
> Campaign Monitor
> email: ma...@campaignmonitor.com
> web: www.campaignmonitor.com
> 
> 
> On 11 June 2014 00:41, Kireet Reddy  wrote:
> On our 4 node test cluster (1.1.2), seemingly out of the blue we had one node 
> experience very high cpu usage and become unresponsive and then after about 8 
> hours another node experienced the same issue. The processes themselves 
> stayed alive, gc activity was normal, they didn't experience an 
> OutOfMemoryError. The nodes left the cluster though, perhaps due to the 
> unresponsiveness. The only errors in the log files were a bunch of messages 
> like:
> 
> org.elasticsearch.search.SearchContextMissingException: No search context 
> found for id ...
> 
> and errors about the search queue being full. We see the 
> SearchContextMissingException occasionally during normal operation, but 
> during the high cpu period it happened quite a bit.
> 
> I don't think we had an unusually high number of queries during that time 
> because the other 2 nodes had normal cpu usage and for the prior week things 
> ran smoothly.
> 
> We are going to restart testing, but is there anything we can do to better 
> understand what happened? Maybe change a particular log level or do something 
> while the problem is happening, assuming we can reproduce the issue?
> 



node failures

2014-06-10 Thread Kireet Reddy
On our 4 node test cluster (1.1.2), seemingly out of the blue one node 
experienced very high CPU usage and became unresponsive, and then about 8 
hours later another node experienced the same issue. The processes themselves 
stayed alive, GC activity was normal, and they didn't experience an 
OutOfMemoryError. The nodes left the cluster though, perhaps due to the 
unresponsiveness. The only errors in the log files were a bunch of messages 
like:

org.elasticsearch.search.SearchContextMissingException: No search context 
found for id ...

and errors about the search queue being full. We see the 
SearchContextMissingException occasionally during normal operation, but 
during the high CPU period it happened quite a bit.

I don't think we had an unusually high number of queries during that time, 
because the other 2 nodes had normal CPU usage and things had run smoothly 
for the prior week.

We are going to restart testing, but is there anything we can do to better 
understand what happened? Maybe change a particular log level or do 
something while the problem is happening, assuming we can reproduce the 
issue?
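One low-cost thing that could be captured while the problem is happening is a 
periodic sample of the nodes hot threads output, to see what the affected node 
is actually doing. A rough sketch (host, interval, and log file are 
placeholders):

# Sample hot threads every 30 seconds until interrupted
while true; do
  date >> hot_threads.log
  curl -s 'http://localhost:9200/_nodes/hot_threads' >> hot_threads.log
  sleep 30
done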
