es search optimizing question

2015-05-28 Thread Jay Danielian
I know it's dangerous to ask for general answers when talking about 
performance, as the answer is usually "it depends". But I am going to try 
anyway :) My question is: as a general rule of thumb, is it better to store 
a list of items in an array field, so that the query only has to issue a 
single matching term? Or to store a single value per document and generate 
the various candidate terms, passing them in as an array in the query?

My example use case is this. I am trying to find contacts by name and 
email. Emails usually fall into several common patterns (first.last@domain, 
first_last@domain, firstinitial_last@domain, etc.), so I want to be able to 
search against all of those possible combinations when trying to find a 
contact in our index. The queries are all term filters, no wildcards, etc. 
The fields are all not_analyzed, so it's basically an exact term match that 
I am looking for. So I can either store the extra possible combinations in 
the document and have the query pass in only one term (as the stored field 
is an array), or I can pass the multiple combinations in a terms array in 
the query and search against the single email we have stored in the index.
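
To make the two options concrete, here is a minimal Python sketch of the 
request bodies being compared. The field names `email` and `email_variants`, 
the helper names, and the exact set of patterns are illustrative assumptions 
and are not taken from the real mapping. The bodies use the ES 1.x 
constant_score / term / terms filter syntax.

```python
def email_variants(first, last, domain):
    """Generate a few common address patterns for one contact.

    The pattern list is illustrative only; a real list would be longer.
    """
    first, last, domain = first.lower(), last.lower(), domain.lower()
    local_parts = [
        f"{first}.{last}",     # first.last@domain
        f"{first}_{last}",     # first_last@domain
        f"{first[0]}_{last}",  # firstinitial_last@domain
    ]
    return [f"{local}@{domain}" for local in local_parts]


def query_single_term(email):
    """Option A: variants were stored in an array field at index time,
    so the query only sends one term."""
    return {
        "query": {
            "constant_score": {
                "filter": {"term": {"email_variants": email}}
            }
        }
    }


def query_many_terms(first, last, domain):
    """Option B: one email stored per document, so the query sends
    every generated variant in a terms filter."""
    return {
        "query": {
            "constant_score": {
                "filter": {"terms": {"email": email_variants(first, last, domain)}}
            }
        }
    }
```

The sketch only shows the shapes of the two approaches. Which one performs 
better is exactly the open question and would need to be measured against 
the real index.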

I know there is never a perfect answer, but even a general rule-of-thumb 
response from someone with deep internal knowledge of Lucene/ES would be 
appreciated. 

Thanks!

J

-- 
Please update your bookmarks! We have moved to https://discuss.elastic.co/
--- 
You received this message because you are subscribed to the Google Groups 
elasticsearch group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/elasticsearch/ef1b1d61-96b6-4dcd-a658-1385aa3f380f%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: ElasticSearch search performance question

2015-02-18 Thread Jay Danielian
Just to update the thread.

I added code to disable caching on all the term filters we were using, and 
it made a huge performance improvement. Now we are able to service the 
queries with an average response time under two seconds, which is excellent 
(we are bundling several searches using _msearch, so 2 seconds total 
response is good). The search requests / sec metric is still peaking at 
around 600 / sec, however our CPU only spikes to about 65% now, so I think 
we can add more search threads to our config as we are no longer maxing out 
CPU. I also see a bit of disk read activity now, which, against our 
non-RAID EBS drive, means we may be able to squeeze out more if we switch 
disk setup.

It seems like having these filters add cache items was wasting CPU on cache 
eviction and cache lookups (cache misses really) for each query - which 
really only shows up when trying to push some load through.
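
For readers wondering what "disable cache on the term filters" looks like, 
here is a minimal sketch in Python. It assumes the ES 1.x filter syntax, 
where a term filter accepts a `_cache` flag. The field names and the overall 
bool structure are guesses based on the (last_name AND (phone_number OR 
email)) description in this thread, not the actual query from the gist.

```python
def uncached_term(field, value):
    """ES 1.x term filter with per-filter caching switched off."""
    return {"term": {field: value, "_cache": False}}


def contact_query(last_name, phone_number, email):
    """constant_score query for last_name AND (phone_number OR email).

    Field names are hypothetical; every term filter opts out of the
    filter cache because the values are unlikely to be reused.
    """
    return {
        "query": {
            "constant_score": {
                "filter": {
                    "bool": {
                        "must": [
                            uncached_term("last_name", last_name),
                            {
                                "bool": {
                                    "should": [
                                        uncached_term("phone_number", phone_number),
                                        uncached_term("email", email),
                                    ]
                                }
                            },
                        ]
                    }
                }
            }
        }
    }
```

Serializing the result with `json.dumps` gives a body that can be sent as 
one search, or as one entry of an _msearch batch.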

Thanks for everyone's suggestions!!

J

On Friday, February 13, 2015 at 11:55:52 AM UTC-5, Jay Danielian wrote:

 Thanks to all for these great suggestions. I haven't had a chance to 
 change the syntax yet, as that is a risky thing for me to quickly change 
 against our production setup. My plan is to try that this weekend (so I can 
 properly test the new syntax is returning the same results). However, is 
 there a way to turn filter caching off globally via config or elsewhere?

 Thanks!

 J

 On Friday, February 13, 2015 at 11:25:20 AM UTC-5, Mark Harwood wrote:

 So I can see in the hot threads dump the initialization requests for 
 those FixedBitSets I was talking about.
 Looking at the number of docs in your index I estimate each Term to be 
 allocating 140mb of memory in total for all these bitsets across all shards 
 given the 1bn docs in your index. Remember that you are probably setting 
 only a single bit in each of these large structures. 
 Another stat (if I read it correctly) shows 5m evictions of these cached 
 filters given their low reusability. It's fair to say you have some cache 
 churn going on :)
 Did you try my earlier suggestion of queries not filters?
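
A back-of-the-envelope check of that memory estimate, as a sketch: it 
assumes a FixedBitSet-style layout of one bit per document packed into 
64-bit words, and ignores object overhead and how the documents are split 
across shards and segments. The function name is made up for illustration.

```python
def bitset_bytes(num_docs):
    """Bytes needed for one bit per document, packed into 64-bit words."""
    words = (num_docs + 63) // 64  # round up to whole 64-bit words
    return words * 8


# 1bn docs, the figure used in the estimate above
total = bitset_bytes(1_000_000_000)
print(total / 1_000_000, "MB per cached term filter")
```

This gives 125 MB for 1bn documents, which is the same order of magnitude 
as the 140mb quoted above, all to record what is probably a single matching 
document.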




 On Friday, February 13, 2015 at 2:29:42 PM UTC, Jay Danielian wrote:

 As requested here is a dump of the hot threads output. 

 Thanks!

 J

 On Thursday, February 12, 2015 at 6:45:23 PM UTC-5, Nikolas Everett wrote:

 You might want to try hitting hot threads while putting your load on it 
 and seeing what you see.  Or posting it.

 Nik
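
As a pointer for anyone following along, hot threads is exposed through the 
nodes hot threads REST endpoint. The sketch below only builds the URL; the 
base URL and the helper name are assumptions, and the `threads` and 
`interval` values are just examples.

```python
from urllib.parse import urlencode


def hot_threads_url(base_url, threads=3, interval="500ms"):
    """Build the URL for the nodes hot threads API.

    base_url is a placeholder such as http://localhost:9200.
    """
    query = urlencode({"threads": threads, "interval": interval})
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


print(hot_threads_url("http://localhost:9200"))
```

Fetching that URL (for example with `urllib.request.urlopen`) while the 
load test is running returns a plain-text dump of the busiest threads on 
each node.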

 On Thu, Feb 12, 2015 at 4:44 PM, Jay Danielian jay.da...@circleback.com wrote:

 Mark,

 Thanks for the initial reply. Yes, your assumption about these things 
 being very specific and thus not likely to have any re-use with regards to 
 caching is correct. I have attached some screenshots from the BigDesk 
 plugin which showed a decent snapshot of what the server looked like while 
 my tests were running. You can see the spikes in CPU, that essentially 
 covered the duration when the JMeter tests were running. 

 At a high level, the only thing that seems to be really stressed on the 
 server is CPU. But that makes me think that there is something in my 
 setup, query syntax, or perhaps the cache eviction rate, etc. that is 
 causing it to spike so high. I also have concerns about the non-RAID 0 EBS 
 volumes, as I know that having one large volume does not maximize 
 throughput - however, just looking at the stats it doesn't seem like IO is 
 really a bottleneck.

 Here is a sample query structure = 
 https://gist.github.com/jaydanielian/c2be885987f344031cfc

 Also this is one query - in reality we use _msearch to pipeline 
 several of these queries in one batch. The queries also include custom 
 routing / route key to make sure we only hit one shard.
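
For illustration, an _msearch request body is newline-delimited JSON: a 
header line followed by a body line for each search, with a trailing 
newline. The sketch below assumes the routing key goes in the header line; 
the index name, routing keys, and helper name are placeholders rather than 
values from this setup.

```python
import json


def msearch_body(index, searches):
    """Build an _msearch payload.

    searches is an iterable of (routing_key, query_body) pairs; each pair
    becomes one header line and one body line.
    """
    lines = []
    for routing, body in searches:
        lines.append(json.dumps({"index": index, "routing": routing}))
        lines.append(json.dumps(body))
    return "\n".join(lines) + "\n"  # the payload must end with a newline


payload = msearch_body(
    "contacts",  # hypothetical index name
    [
        ("route-a", {"query": {"match_all": {}}}),
        ("route-b", {"query": {"match_all": {}}}),
    ],
)
print(payload)
```

The resulting string would be POSTed to the _msearch endpoint as a single 
request.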

 Thanks!

 J


 On Thursday, February 12, 2015 at 4:22:29 PM UTC-5, Mark Walkom wrote:

 It'd help if you could gist/pastebin/etc a query example.

 Also your current ES and Java need updating, there are known issues with 
 Java 1.7u55, and you will always see performance boosts running the latest 
 version of ES.

 That aside, what is your current resource utilisation like?  Are you 
 seeing lots of cache evictions, high heap use, high CPU, IO delays?


Re: ElasticSearch search performance question

2015-02-13 Thread Jay Danielian
We have three nodes and 6 shards total, with 1 replica configured. Here 
are the settings for the index:


index : {
    number_of_replicas : 1,
    number_of_shards : 6,
    refresh_interval : 60,
    version : {
        created : 1030399
    },
    merge : {
        policy : {
            merge_factor : 30
        }
    }
}



On Friday, February 13, 2015 at 9:40:20 AM UTC-5, 
christian...@elasticsearch.com wrote:

 How many replicas do you have configured for the index?

 Christian




Re: ElasticSearch search performance question

2015-02-13 Thread Jay Danielian
As requested here is a dump of the hot threads output. 

Thanks!

J



Re: ElasticSearch search performance question

2015-02-13 Thread Jay Danielian
Thanks to all for these great suggestions. I haven't had a chance to change 
the syntax yet, as that is a risky thing for me to quickly change against 
our production setup. My plan is to try that this weekend (so I can 
properly test the new syntax is returning the same results). However, is 
there a way to turn filter caching off globally via config or elsewhere?

Thanks!

J


ElasticSearch search performance question

2015-02-12 Thread Jay Danielian
I know this is difficult to answer, the real answer is always It Depends 
:) But I am going to go ahead and hope I get some feedback here.

We are mainly using ES to issue terms searches against fields that are 
non-analyzed. We are using ES like a key value store, where once the match 
is found we parse the _source JSON and return our model. We are doing 
contact lookups, searching against (last_name AND (phone_number OR email)). 
We are issuing constant_score queries with term filters for the terms 
mentioned above. No aggregations, no sorting, no scripts, etc. Using 
JMeter, we were maxing out at around 500 search requests / sec. Average 
request time was taking around 7 seconds to complete. When the test would 
fire up, the ThreadPool Search Queue would spike to 1000 on each node and 
CPU would be maxed out, then once it finished everything would return to 
normal. So it appears healthy, and we wouldn't get any errors - just 
nowhere close to the performance we are looking for.

Setup details
- Index size 100GB with two different document mappings in the index. 
Roughly 500M documents
- three nodes c3.4xl instances on EC2 using pIOPS SSD EBS disks (although 
NOT RAID 0 - just one big volume)
- each server node on EC2 has 30GB RAM, 16GB on heap, rest for OS
- we have set mlockall on our instances
- 3 nodes are split into 6 shards for the main index
- Index is read only after it is loaded - we don't update the index ever, 
it is only for querying
- ES version 1.3.3 Java 1.7.0_51
- each server has 16 cores / node and 48 search threads with queue length 
of 1000

Assuming no stemming or free text queries - just term matching - how can we 
increase the throughput and decrease the response time for the ES queries? 
Is 500 requests / sec at the top end?
Do we just need many more servers if we really want 3000 requests / sec? I 
have read that scaling out is better for ES vs scaling up. But it feels 
that the current server farm should deliver better performance. 
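
As a rough sanity check on the numbers above, Little's law (requests in 
flight = throughput x average latency) can be compared with the search 
thread and queue slots in the cluster. This is only a sketch: it assumes 
the 500 requests / sec and 7 second figures describe the same unit of work 
at steady state, and that each search occupies one slot. The function names 
are made up for illustration.

```python
def in_flight(throughput_per_sec, avg_latency_sec):
    """Little's law: average number of requests in the system."""
    return throughput_per_sec * avg_latency_sec


def search_slots(nodes, threads_per_node, queue_per_node):
    """Total running plus queued search slots across the cluster."""
    return nodes * (threads_per_node + queue_per_node)


observed = in_flight(500, 7)         # requests in flight during the test
capacity = search_slots(3, 48, 1000)  # 3 nodes, 48 threads, queue of 1000
print(observed, capacity)
```

Under those assumptions the test keeps about 3500 requests in flight 
against roughly 3144 slots, which fits the picture of full queues and most 
of the 7 seconds being spent waiting rather than executing.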

Any help or tuning advice would be really appreciated. We have looked at 
many slideshares, blog posts from found.no, elasticsearch.org, etc. - and 
can't really pinpoint a way to improve our setup. 

Thanks!

JD

