ES search optimizing question
I know it's dangerous to ask for general answers when talking about performance, as the answer is usually "it depends." But I am going to try anyway. :)

My question, as a general rule of thumb: is it better to store a list of items in an array in the document, so the query only has to issue a single matching term? Or to store a single value per document, generate the various terms in an array, and pass those generated terms in the query?

My example use case is this. I am trying to find contacts by name and email. Emails usually fall into several common patterns (first.last@domain, first_last@domain, firstinitial_last@domain, etc.), so I want to be able to search against all of those possible combinations when trying to find a contact in our index. The queries are all filter terms, no wildcards, etc. The fields are all not_analyzed, so it's basically an exact term match I am looking for.

So I can either store the extra possible combinations in the document and have the query pass in only one term (since the stored field is an array), or I can pass the multiple combinations as a terms array in the query and search against the single email we have stored in the index.

I know there is never a perfect answer, but even a general rule-of-thumb response from someone with deep internal knowledge of Lucene/ES would be appreciated. Thanks!

J

--
Please update your bookmarks! We have moved to https://discuss.elastic.co/
---
You received this message because you are subscribed to the Google Groups "elasticsearch" group. To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/ef1b1d61-96b6-4dcd-a658-1385aa3f380f%40googlegroups.com. For more options, visit https://groups.google.com/d/optout.
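To make the two options concrete, here is a minimal sketch, assuming hypothetical names (`email_variants`, field names `emails`/`email`) that are illustrative only, not from any real codebase. Option A stores the generated variants in the document so the query sends one term; Option B stores one email and expands the variants into a terms filter at query time.

```python
def email_variants(first, last, domain):
    """Generate the common email patterns mentioned above (illustrative set)."""
    first, last = first.lower(), last.lower()
    return [
        f"{first}.{last}@{domain}",    # first.last@domain
        f"{first}_{last}@{domain}",    # first_last@domain
        f"{first[0]}{last}@{domain}",  # firstinitial+last@domain
        f"{first[0]}_{last}@{domain}", # firstinitial_last@domain
    ]

# Option A: store all variants in an array field; the query issues a single term.
doc_option_a = {"name": "Jane Doe", "emails": email_variants("Jane", "Doe", "example.com")}
query_option_a = {"query": {"constant_score": {"filter": {
    "term": {"emails": "jane.doe@example.com"}}}}}

# Option B: store one email; the query expands to a terms filter over the variants.
doc_option_b = {"name": "Jane Doe", "email": "jane.doe@example.com"}
query_option_b = {"query": {"constant_score": {"filter": {
    "terms": {"email": email_variants("Jane", "Doe", "example.com")}}}}}
```

Either way the work is the same kind of exact term lookup; the trade-off is index size (Option A) versus a slightly larger query body (Option B).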
Re: ElasticSearch search performance question
Just to update the thread: I added code to disable the cache on all the term filters we were using, and it made a huge performance improvement. We are now able to service the queries with an average response time under two seconds, which is excellent (we are bundling several searches using _msearch, so two seconds total response is good). The search requests/sec metric is still peaking at around 600/sec, but our CPU now only spikes to about 65%, so I think we can add more search threads to our config, as we are no longer maxing out CPU. I also see a bit of disk read activity now, which, against our non-RAID EBS drive, means we may be able to squeeze out more if we switch the disk setup.

It seems that having these filters add cache entries was wasting CPU on cache evictions and cache lookups (cache misses, really) on every query, which only really shows up when trying to push some load through. Thanks for everyone's suggestions!

J
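For reference, "disable cache on the term filters" in the Elasticsearch 1.x filter DSL means setting the `_cache` flag to false on each filter. A minimal sketch, with a hypothetical field name and value:

```python
# Sketch only: assumes the ES 1.x filter DSL, where term filters accept a
# "_cache" flag; the "email" field and its value are illustrative.
query_no_cache = {
    "query": {
        "constant_score": {
            "filter": {
                # _cache: False tells ES 1.x not to build and cache a bitset
                # for this filter (pointless here, since these exact-value
                # filters are almost never reused across queries).
                "term": {"email": "jane.doe@example.com", "_cache": False}
            }
        }
    }
}
```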
Re: ElasticSearch search performance question
We have three nodes and 6 shards total, with 1 replica configured. Here are the settings for the index:

    index: {
      number_of_replicas: 1,
      number_of_shards: 6,
      refresh_interval: 60,
      version: { created: 1030399 },
      merge: { policy: { merge_factor: 30 } }
    }

On Friday, February 13, 2015 at 9:40:20 AM UTC-5, christian...@elasticsearch.com wrote: How many replicas do you have configured for the index? Christian
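As a quick sanity check on what those settings imply for data layout (simple arithmetic, not a statement about ES internals): 6 primaries with 1 replica each means 12 shard copies balanced over the 3 nodes.

```python
primaries = 6
replicas_per_primary = 1
nodes = 3

# Each primary shard has one replica copy, so the cluster holds
# primaries * (1 + replicas) physical shard copies in total.
total_shard_copies = primaries * (1 + replicas_per_primary)

# With even allocation, each node hosts this many shard copies.
copies_per_node = total_shard_copies // nodes

print(total_shard_copies, copies_per_node)  # 12 shard copies, 4 per node
```

So every query fans out over shards that collectively occupy all three machines, even with custom routing narrowing each individual search to one shard.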
Re: ElasticSearch search performance question
As requested, here is a dump of the hot threads output. Thanks!

J

On Thursday, February 12, 2015 at 6:45:23 PM UTC-5, Nikolas Everett wrote: You might want to try hitting hot threads while putting your load on it and seeing what you see. Or posting it. Nik

On Thu, Feb 12, 2015 at 4:44 PM, Jay Danielian jay.da...@circleback.com wrote: Mark, thanks for the initial reply. Yes, your assumption is correct: these terms are very specific and thus not likely to see any reuse from caching. I have attached some screenshots from the BigDesk plugin, which give a decent snapshot of what the server looked like while my tests were running. You can see the spikes in CPU covering the duration of the JMeter tests. At a high level, the only resource that seems really stressed on the server is CPU. That makes me think something in my setup, query syntax, or perhaps the cache eviction rate is causing it to spike so high. I also have concerns about the non-RAID-0 EBS volumes, as I know that one large volume does not maximize throughput; however, the stats don't suggest IO is really a bottleneck. Here is a sample query structure: https://gist.github.com/jaydanielian/c2be885987f344031cfc Also, this is one query; in reality we use _msearch to pipeline several of these queries in one batch. The queries also include a custom routing key to make sure we only hit one shard. Thanks! J

On Thursday, February 12, 2015 at 4:22:29 PM UTC-5, Mark Walkom wrote: It'd help if you could gist/pastebin/etc. a query example. Also, your current ES and Java need updating; there are known issues with Java 1.7u55, and you will always see performance boosts running the latest version of ES. That aside, what is your current resource utilisation like? Are you seeing lots of cache evictions, high heap use, high CPU, IO delays?
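The _msearch batching mentioned above is just newline-delimited JSON: each search contributes a header line (which can carry the routing key) followed by a query body line. A hypothetical sketch, with illustrative index name and routing values:

```python
import json

# Hypothetical lookups; "contacts" and the routing values are illustrative.
lookups = [
    ("route-a", {"query": {"constant_score": {"filter": {"term": {"email": "a@example.com"}}}}}),
    ("route-b", {"query": {"constant_score": {"filter": {"term": {"email": "b@example.com"}}}}}),
]

# Build the NDJSON body: header line, then body line, for each search.
msearch_body = "".join(
    json.dumps({"index": "contacts", "routing": routing}) + "\n" + json.dumps(body) + "\n"
    for routing, body in lookups
)
# POST msearch_body to /_msearch (Content-Type: application/x-ndjson)
```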
Re: ElasticSearch search performance question
Thanks to all for these great suggestions. I haven't had a chance to change the syntax yet, as that is a risky thing to change quickly against our production setup. My plan is to try it this weekend (so I can properly test that the new syntax returns the same results). However, is there a way to turn filter caching off globally, via config or elsewhere? Thanks!

J

On Friday, February 13, 2015 at 11:25:20 AM UTC-5, Mark Harwood wrote: So I can see in the hot threads dump the initialization requests for those FixedBitSets I was talking about. Looking at the number of docs in your index, I estimate each term to be allocating 140 MB of memory in total for all these bitsets across all shards, given the 1bn docs in your index. Remember that you are probably setting only a single bit in each of these large structures. Another stat (if I read it correctly) shows 5m evictions of these cached filters, given their low reusability. It's fair to say you have some cache churn going on :) Did you try my earlier suggestion of queries, not filters?
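Mark's estimate is easy to sanity-check: a Lucene FixedBitSet allocates roughly one bit per document, so for about a billion docs the bitsets for a single cached filter come to on the order of 120 MiB across all shards, in the same ballpark as the 140 MB figure quoted above. (A rough back-of-the-envelope check, not an exact accounting of ES internals.)

```python
docs = 1_000_000_000          # ~1bn docs across all shards, per the thread

bitset_bytes = docs // 8      # FixedBitSet: one bit per document
bitset_mib = bitset_bytes / (1024 ** 2)

print(round(bitset_mib))      # roughly 119 MiB allocated per cached filter
```

Paying that allocation on nearly every query, for filters that are almost never reused, is exactly the churn the eviction numbers point to.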
ElasticSearch search performance question
I know this is difficult to answer; the real answer is always "It Depends." :) But I am going to go ahead and hope I get some feedback here.

We are mainly using ES to issue term searches against fields that are not analyzed. We are using ES like a key-value store: once a match is found, we parse the _source JSON and return our model. We are doing contact lookups, searching against (last_name AND (phone_number OR email)). We are issuing constant_score queries with term filters for the terms mentioned above. No aggregations, no sorting, no scripts, etc.

Using JMeter, we were maxing out at around 500 search requests/sec, with an average request time of around 7 seconds. When the test fired up, the thread pool search queue would spike to 1000 on each node and CPU would be maxed out; once it finished, everything returned to normal. So the cluster appears healthy and we don't get any errors; we are just nowhere close to the performance we are looking for.

Setup details:
- Index size 100 GB, with two different document mappings in the index; roughly 500M documents
- Three c3.4xlarge nodes on EC2 using pIOPS SSD EBS disks (NOT RAID 0, just one big volume)
- Each node has 30 GB RAM, 16 GB on heap, the rest for the OS; we have set mlockall on our instances
- The main index is split into 6 shards across the 3 nodes
- The index is read-only after it is loaded; we never update it, only query it
- ES version 1.3.3, Java 1.7.0_51
- Each server has 16 cores, 48 search threads, and a search queue length of 1000

Assuming no stemming and no free-text queries, just term matching: how can we increase the throughput and decrease the response time of the ES queries? Is 500 requests/sec the top end? Do we just need many more servers if we really want 3000 requests/sec? I have read that scaling out is better for ES than scaling up, but it feels like the current server farm should deliver better performance. Any help or tuning advice would be really appreciated. We have looked at many slideshares and blog posts from found.no, elasticsearch.org, etc., and can't really pinpoint a way to improve our setup. Thanks!

JD
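The search shape described above, (last_name AND (phone_number OR email)) as a constant_score query over term filters, might look roughly like this in the ES 1.x DSL. A sketch with hypothetical field values; see the gist linked earlier in the thread for the real structure:

```python
# Illustrative only: field values are made up; assumes ES 1.x bool filter syntax.
contact_query = {
    "query": {
        "constant_score": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"last_name": "doe"}},
                        # nested bool so the should clauses form a required OR:
                        # at least one of phone_number / email must match
                        {"bool": {"should": [
                            {"term": {"phone_number": "5551234567"}},
                            {"term": {"email": "jane.doe@example.com"}},
                        ]}},
                    ]
                }
            }
        }
    }
}
```

The nested bool is deliberate: putting the should clauses beside a must clause in the same bool would make them optional, which would not express the AND/OR combination described above.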