Re: Performance optimization of Proximity/Wildcard searches
Hi,

Yes, assuming you didn't change the index files, say by optimizing the index, the hot portions of the index should remain in the OS cache unless something else kicked them out.

Re other thread - I don't think I have those messages any more.

Otis
---
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Mon, February 7, 2011 2:49:44 AM
Subject: Re: Performance optimization of Proximity/Wildcard searches

> Only couple of thousand documents are added daily so the old OS cache should still be useful since old documents remain same, right?
>
> Also can you please comment on my other thread related to Term Vectors?
>
> Thanks!
Re: Performance optimization of Proximity/Wildcard searches
Only a couple of thousand documents are added daily, so the old OS cache should still be useful since the old documents remain the same, right?

Also, can you please comment on my other thread related to Term Vectors?

Thanks!

On Sat, Feb 5, 2011 at 8:40 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> Yes, OS cache mostly remains (obviously index files that are no longer around are going to remain in the OS cache for a while, but will be useless and gradually replaced by new index files).
>
> How long warmup takes is not relevant here, but what queries you use to warm up the index and how much you auto-warm the caches.
>
> Otis
Re: Performance optimization of Proximity/Wildcard searches
Correct me if I am wrong. A commit on the index flushes the SOLR cache, but of course the OS cache would still be useful? If an index is updated every hour then a warm up that takes less than 5 mins should be more than enough, right?

On Sat, Feb 5, 2011 at 7:42 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> Salman,
>
> Warming up may be useful if your caches are getting decent hit ratios. Plus, you are warming up the OS cache when you warm up.
>
> Otis

--
Regards,
Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
Since all queries return the total count as well, I can say that on average a query matches 10% of the total documents. The index I am talking about holds around 13 million documents, so that means around 1.3 million documents match on average. Of course the matches won't all overlap, so I am guessing that around 30-50% of the documents do match the daily queries.

I tried hard to find out whether you can tell SOLR to stop searching after a certain count. I don't mean the number of rows, but something like MySQL's LIMIT, so that it doesn't have to spend time calculating the total count when it is only returning a few rows to the UI. We are OK with showing the count as 1000+ (if it's more than 1000), but I couldn't find any way.

On Sat, Feb 5, 2011 at 7:45 AM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> Heh, I'm not sure if this is valid thinking. :)
>
> By *matching* doc distribution I meant: what proportion of your millions of documents actually ever get matched and then how many of those make it to the UI. If you have 1000 queries in a day and they all end up matching only 3 of your docs, the system will need less RAM than a system where 1000 queries match 5 different docs.
>
> Otis

--
Regards,
Salman Akram
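A quick check of the arithmetic in this message, plus a display-only version of the "1000+" idea, as a minimal Python sketch. Both function names are hypothetical, and `display_count` only changes what the UI shows: it does not make Solr stop counting, which is the part the thread could not find a way to do.

```python
def estimated_matches(total_docs, match_fraction):
    """Average number of documents matched per query."""
    return round(total_docs * match_fraction)


def display_count(count, cap=1000):
    """Format a hit count for the UI, collapsing large values to 'cap+'.

    Hypothetical helper: this is presentation only and saves no search time.
    """
    return f"{cap}+" if count > cap else str(count)


if __name__ == "__main__":
    matches = estimated_matches(13_000_000, 0.10)
    print(matches)                 # 1300000
    print(display_count(matches))  # 1000+
    print(display_count(42))       # 42
```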
Re: Performance optimization of Proximity/Wildcard searches
Yes, OS cache mostly remains (obviously index files that are no longer around are going to remain in the OS cache for a while, but will be useless and gradually replaced by new index files).

How long warmup takes is not relevant here, but what queries you use to warm up the index and how much you auto-warm the caches.

Otis

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Sat, February 5, 2011 4:06:54 AM
Subject: Re: Performance optimization of Proximity/Wildcard searches

> Correct me if I am wrong. Commit in index flushes SOLR cache but of course OS cache would still be useful? If an index is updated every hour then a warm up that takes less than 5 mins should be more than enough, right?
Re: Performance optimization of Proximity/Wildcard searches
Salman,

I only skimmed your email, but wanted to say that this part sounds a little suspicious:

> Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with all the indexing update every

It sounds like this will make warmup take a long time, assuming you have more than a handful distinct queries in your logs.

Otis

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
Sent: Tue, January 25, 2011 6:32:48 AM
Subject: Re: Performance optimization of Proximity/Wildcard searches

> By warmed index you only mean warming the SOLR cache or OS cache? As I said our index is updated every hour so I am not sure how much SOLR cache would be helpful but OS cache should still be helpful, right?
>
> I haven't compared the results with a proper script but from manual testing here are some of the observations.
>
> 'Recent' queries which are in cache of course return immediately (only if they are exactly same - even if they took 3-4 mins first time). I will need to test how many recent queries stay in cache but still this would work only for very common queries. User can run different queries and I want at least them to be at 'acceptable' level (5-10 secs) even if not very fast.
>
> Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with all the indexing update every hour after that) and today when I executed some of the same queries again their time seemed a little less (around 15-20%), I am not sure if this means anything. However, still their time is not acceptable.
>
> What do you think is the best way to compare results? First run all the warm up queries and then execute same randomly and compare?
>
> We are using Windows server, would it make a big difference if we move to Linux? Our load is not high but some queries are really complex.
>
> Also I was hoping to move to SSD in last after trying out all software options. Is that an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) would be slow and it can be only improved by cache warm up and better hardware? Otherwise with an index of around 150GB such queries will take more than a min?
>
> If that's the case I know this question is very subjective but if a single query takes 2 min on SAS 10K RPM what would its approx time be on a good SSD (everything else same)?
>
> Thanks!
>
> On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:
>
> > On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:
> > > Cache warming is a good option too but the index get updated every hour so not sure how much would that help.
> >
> > What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactory, then one answer is to upgrade your underlying storage.
> >
> > As always for IO-caused performance problem in Lucene/Solr-land, SSD is the answer.
>
> --
> Regards,
> Salman Akram
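To illustrate the concern about warm-up time, here is a minimal Python sketch of picking warm-up queries from a log while capping how many are replayed. It is not the script discussed in the thread: `select_warmup_queries`, `min_count`, and `max_queries` are hypothetical names, and treating the threshold as "at least 5" is an assumption, since the thread does not say whether it is inclusive.

```python
from collections import Counter


def select_warmup_queries(logged_queries, min_count=5, max_queries=100):
    """Return the most frequent distinct queries, most frequent first.

    Only queries seen at least `min_count` times are kept (assumed
    threshold), and at most `max_queries` are returned so that the
    warm-up run stays bounded no matter how large the log grows.
    """
    counts = Counter(q.strip() for q in logged_queries if q.strip())
    frequent = [q for q, n in counts.most_common() if n >= min_count]
    return frequent[:max_queries]


if __name__ == "__main__":
    # Made-up log lines, for illustration only.
    log = ['"big data"~5'] * 8 + ["solr*"] * 6 + ["rare phrase"] * 2
    print(select_warmup_queries(log, min_count=5, max_queries=10))
```

The cap is the point of the sketch: without it, the number of warm-up queries grows with the log, which is what makes the warm-up run long.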
Re: Performance optimization of Proximity/Wildcard searches
Hi,

> Sharding is an option too but that too comes with limitations so want to keep that as a last resort but I think there must be other things coz 150GB is not too big for one drive/server with 32GB Ram.

Hmm what makes you think 32 GB is enough for your 150 GB index? It depends on queries and distribution of matching documents, for example. What's yours like?

Otis

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 4:20:34 AM
Subject: Performance optimization of Proximity/Wildcard searches

> Hi,
>
> I am facing performance issues in three types of queries (and their combination). Some of the queries take more than 2-3 mins. Index size is around 150GB.
>
> - Wildcard
> - Proximity
> - Phrases (with common words)
>
> I know CommonGrams and Stop words are a good way to resolve such issues but they don't fulfill our functional requirements (Common Grams seem to have issues with phrase proximity, stop words have issues with exact match etc).
>
> Sharding is an option too but that too comes with limitations so want to keep that as a last resort but I think there must be other things coz 150GB is not too big for one drive/server with 32GB Ram.
>
> Cache warming is a good option too but the index get updated every hour so not sure how much would that help.
>
> What are the other main tips that can help in performance optimization of the above queries?
>
> Thanks
>
> --
> Regards,
> Salman Akram
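To put rough numbers on the 32 GB versus 150 GB question, a small Python sketch of the upper bound on how much of the index the OS cache could hold. The function name and the `reserved_gb` parameter (memory taken by the JVM heap and the OS itself) are assumptions for illustration; the 8 GB figure in the example is made up, not a number from the thread.

```python
def cacheable_fraction(index_gb, ram_gb, reserved_gb=0.0):
    """Upper bound on the share of the index that fits in the OS cache.

    `reserved_gb` is memory assumed unavailable for file caching,
    for example the JVM heap. The result is capped at 1.0.
    """
    if index_gb <= 0:
        raise ValueError("index_gb must be positive")
    available = max(ram_gb - reserved_gb, 0.0)
    return min(available / index_gb, 1.0)


if __name__ == "__main__":
    print(f"{cacheable_fraction(150, 32):.0%}")                 # 21%
    print(f"{cacheable_fraction(150, 32, reserved_gb=8):.0%}")  # 16%
```

Whether about a fifth of the index is enough depends, as the reply says, on the queries and on how the matching documents are distributed.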
Re: Performance optimization of Proximity/Wildcard searches
I know, so we are not really using it for regular warm-ups (in any case the index is updated on an hourly basis). I just tried it a few times to compare results. The issue is I am not even sure if warming up is useful with such regular updates.

On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> Salman,
>
> I only skimmed your email, but wanted to say that this part sounds a little suspicious:
>
> > Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with all the indexing update every
>
> It sounds like this will make warmup take a long time, assuming you have more than a handful distinct queries in your logs.
>
> Otis

--
Regards,
Salman Akram
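One way to make the cold-versus-warm comparison less anecdotal is to time the same set of queries in both states and look at the per-query reduction. The Python sketch below is only an outline under assumptions: `run_query` stands for whatever function sends a query to the search server (not shown here), and all names are hypothetical.

```python
import statistics
import time


def time_queries(run_query, queries, clock=time.perf_counter):
    """Run each query once and return {query: elapsed seconds}."""
    timings = {}
    for q in queries:
        start = clock()
        run_query(q)
        timings[q] = clock() - start
    return timings


def percent_reduction(cold, warm):
    """Per-query drop in time from the cold run to the warm run, in percent."""
    return {
        q: 100.0 * (cold[q] - warm[q]) / cold[q]
        for q in cold
        if q in warm and cold[q] > 0
    }


def median_reduction(cold, warm):
    """Median of the per-query reductions; less sensitive to one outlier."""
    return statistics.median(percent_reduction(cold, warm).values())


if __name__ == "__main__":
    # Made-up timings in seconds, for illustration only.
    cold = {"q1": 120.0, "q2": 60.0}
    warm = {"q1": 100.0, "q2": 48.0}
    print(round(median_reduction(cold, warm), 1))  # 18.3
```

Since exactly repeated queries can come straight back from the query cache, timing queries that were not part of the warm-up set gives a fairer picture of what warming does for new queries.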
Re: Performance optimization of Proximity/Wildcard searches
Well, I assume many people out there have indexes larger than 100GB, and I don't think you would normally have more than 32GB or 64GB of RAM!

As I mentioned, the queries are mostly phrase, proximity, wildcard and combinations of these.

What exactly do you mean by distribution of documents? On this index our documents are not more than a few hundred KB on average (file system size) and there are around 14 million documents. 80% of the index size is taken up by the position file. I am not sure if this is what you asked?

On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

> Hi,
>
> > Sharding is an option too but that too comes with limitations so want to keep that as a last resort but I think there must be other things coz 150GB is not too big for one drive/server with 32GB Ram.
>
> Hmm what makes you think 32 GB is enough for your 150 GB index? It depends on queries and distribution of matching documents, for example. What's yours like?
>
> Otis

--
Regards,
Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
Salman,

Warming up may be useful if your caches are getting decent hit ratios. Plus, you are warming up the OS cache when you warm up.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Fri, February 4, 2011 3:33:41 PM
Subject: Re: Performance optimization of Proximity/Wildcard searches

I know, so we are not really using it for regular warm-ups (in any case the index is updated on an hourly basis). I just tried it a few times to compare results. The issue is that I am not even sure whether warming up is useful with such regular updates.

On Fri, Feb 4, 2011 at 5:16 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Salman, I only skimmed your email, but wanted to say that this part sounds a little suspicious:

Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with all the indexing update every

It sounds like this will make warmup take a long time, assuming you have more than a handful of distinct queries in your logs.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org; t...@statsbiblioteket.dk
Sent: Tue, January 25, 2011 6:32:48 AM
Subject: Re: Performance optimization of Proximity/Wildcard searches

By warmed index, do you mean warming the SOLR cache or the OS cache? As I said, our index is updated every hour, so I am not sure how much the SOLR cache would help, but the OS cache should still be helpful, right?

I haven't compared the results with a proper script, but from manual testing here are some observations.

'Recent' queries which are in the cache of course return immediately (only if they are exactly the same, and even if they took 3-4 mins the first time). I will need to test how many recent queries stay in the cache, but this would still work only for very common queries. Users can run different queries, and I want at least those to be at an 'acceptable' level (5-10 secs) even if not very fast.

Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with index updates every hour after that), and today when I executed some of the same queries again their time seemed a little lower (around 15-20%). I am not sure if this means anything. However, their time is still not acceptable.

What do you think is the best way to compare results? First run all the warm up queries, then execute the same queries randomly and compare?

We are using Windows Server; would it make a big difference if we moved to Linux? Our load is not high but some queries are really complex.

Also, I was hoping to move to SSD last, after trying out all the software options.

Is it an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) will be slow, and that this can only be improved by cache warm up and better hardware? Otherwise, with an index of around 150GB, will such queries take more than a minute?

If that's the case, I know this question is very subjective, but if a single query takes 2 min on SAS 10K RPM, what would its approximate time be on a good SSD (everything else the same)?

Thanks!

On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:

Cache warming is a good option too but the index gets updated every hour, so I am not sure how much that would help.

What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactorily, then one answer is to upgrade your underlying storage. As always for IO-caused performance problems in Lucene/Solr-land, SSD is the answer.

-- Regards, Salman Akram
-- Regards, Salman Akram
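To make the warm-up concern above concrete, here is a minimal sketch of picking warm-up queries from a log while capping how many are replayed, so that warm-up time stays bounded. It is not the script used in the thread: the function name, the `min_count` threshold (read here as "seen at least this many times", since the message only says "count 5") and the `max_queries` cap are all assumptions for illustration.

```python
from collections import Counter


def select_warmup_queries(log_queries, min_count=5, max_queries=100):
    """Return the most frequent distinct queries, most frequent first.

    log_queries: iterable of raw query strings, one per logged request.
    min_count:   hypothetical threshold; keep queries seen at least this often.
    max_queries: hypothetical cap so warm-up cannot grow without bound.
    """
    # Count distinct queries, ignoring blank log entries.
    counts = Counter(q.strip() for q in log_queries if q.strip())
    # most_common() yields (query, count) pairs sorted by descending count.
    frequent = [q for q, n in counts.most_common() if n >= min_count]
    return frequent[:max_queries]
```

With a cap in place, the warm-up cost is roughly the cap times the typical query time, rather than growing with every distinct query that ever appears in the logs.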
Re: Performance optimization of Proximity/Wildcard searches
Heh, I'm not sure if this is valid thinking. :)

By *matching* doc distribution I meant: what proportion of your millions of documents actually ever get matched, and then how many of those make it to the UI. If you have 1000 queries in a day and they all end up matching only 3 of your docs, the system will need less RAM than a system where 1000 queries match 5 different docs.

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Fri, February 4, 2011 3:38:55 PM
Subject: Re: Performance optimization of Proximity/Wildcard searches

Well, I assume many people out there have indexes larger than 100GB, and I don't think you will normally have more RAM than 32GB or 64GB!

As I mentioned, the queries are mostly phrase, proximity, wildcard and combinations of these.

What exactly do you mean by distribution of documents? On this index our documents are no more than a few hundred KB on average (file system size) and there are around 14 million documents. 80% of the index size is taken up by the position file. I am not sure if this is what you asked?

On Fri, Feb 4, 2011 at 5:19 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi,

Sharding is an option too but that too comes with limitations so want to keep that as a last resort but I think there must be other things coz 150GB is not too big for one drive/server with 32GB Ram.

Hmm, what makes you think 32 GB is enough for your 150 GB index? It depends on queries and distribution of matching documents, for example. What's yours like?

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

----- Original Message -----
From: Salman Akram salman.ak...@northbaysolutions.net
To: solr-user@lucene.apache.org
Sent: Tue, January 25, 2011 4:20:34 AM
Subject: Performance optimization of Proximity/Wildcard searches

Hi,

I am facing performance issues in three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB.

- Wildcard
- Proximity
- Phrases (with common words)

I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seem to have issues with phrase proximity, stop words have issues with exact match, etc).

Sharding is an option too, but that also comes with limitations, so I want to keep it as a last resort. I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM.

Cache warming is a good option too but the index gets updated every hour, so I am not sure how much that would help.

What are the other main tips that can help in performance optimization of the above queries?

Thanks

-- Regards, Salman Akram
-- Regards, Salman Akram
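One way to read the "distribution of matching documents" point above is as the size of the set of distinct documents that a day's queries actually touch. The sketch below only illustrates that reading: the function name and the input shape (a mapping from query string to matched document ids, which would have to be collected from your own search results) are hypothetical, and the count is at best a rough proxy, since memory pressure really depends on which parts of the index files are read.

```python
def matched_working_set(matches_by_query):
    """Return the set of distinct document ids matched across all queries.

    matches_by_query: hypothetical mapping of query string -> iterable of
    matched document ids.
    """
    working_set = set()
    for doc_ids in matches_by_query.values():
        working_set.update(doc_ids)
    return working_set


# 1000 queries that all match the same 3 documents ...
narrow = {f"q{i}": [1, 2, 3] for i in range(1000)}
# ... versus 1000 queries that each match 5 documents of their own.
wide = {f"q{i}": range(i * 5, i * 5 + 5) for i in range(1000)}

narrow_size = len(matched_working_set(narrow))
wide_size = len(matched_working_set(wide))
```

The two workloads have the same query volume but very different working sets, which is why index size and RAM alone do not settle whether 32 GB is enough.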
Performance optimization of Proximity/Wildcard searches
Hi,

I am facing performance issues in three types of queries (and their combinations). Some of the queries take more than 2-3 mins. Index size is around 150GB.

- Wildcard
- Proximity
- Phrases (with common words)

I know CommonGrams and stop words are a good way to resolve such issues, but they don't fulfill our functional requirements (CommonGrams seem to have issues with phrase proximity, stop words have issues with exact match, etc).

Sharding is an option too, but that also comes with limitations, so I want to keep it as a last resort. I think there must be other things, because 150GB is not too big for one drive/server with 32GB RAM.

Cache warming is a good option too but the index gets updated every hour, so I am not sure how much that would help.

What are the other main tips that can help in performance optimization of the above queries?

Thanks

-- Regards, Salman Akram
Re: Performance optimization of Proximity/Wildcard searches
On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:

Cache warming is a good option too but the index gets updated every hour, so I am not sure how much that would help.

What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactorily, then one answer is to upgrade your underlying storage. As always for IO-caused performance problems in Lucene/Solr-land, SSD is the answer.
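The cold-versus-warm question above can be measured with a small timing harness. The following is a sketch under stated assumptions, not a tool from the thread: `run_query` is a hypothetical callable that you supply to send one query to your Solr instance and wait for the response, and both function names are made up for illustration.

```python
import time
from statistics import median


def time_queries(run_query, queries):
    """Run each query once and return {query: elapsed seconds}.

    run_query is a caller-supplied (hypothetical) function that executes a
    single query against the search server and blocks until it completes.
    """
    timings = {}
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        timings[q] = time.perf_counter() - start
    return timings


def compare_passes(cold, warm):
    """Return (median cold time, median warm time, per-query speedup)."""
    speedup = {q: cold[q] / warm[q] for q in cold if q in warm and warm[q] > 0}
    return median(cold.values()), median(warm.values()), speedup
```

For the comparison to say anything about the OS cache, the first pass has to start genuinely cold (for example right after a restart), and the timed queries should differ from the ones used for warming; otherwise exact repeats are likely to be answered from Solr's own caches and the result mostly measures cache hits.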
Re: Performance optimization of Proximity/Wildcard searches
By warmed index, do you mean warming the SOLR cache or the OS cache? As I said, our index is updated every hour, so I am not sure how much the SOLR cache would help, but the OS cache should still be helpful, right?

I haven't compared the results with a proper script, but from manual testing here are some observations.

'Recent' queries which are in the cache of course return immediately (only if they are exactly the same, and even if they took 3-4 mins the first time). I will need to test how many recent queries stay in the cache, but this would still work only for very common queries. Users can run different queries, and I want at least those to be at an 'acceptable' level (5-10 secs) even if not very fast.

Our warm up script currently executes all distinct queries in our logs having count 5. It was run yesterday (with index updates every hour after that), and today when I executed some of the same queries again their time seemed a little lower (around 15-20%). I am not sure if this means anything. However, their time is still not acceptable.

What do you think is the best way to compare results? First run all the warm up queries, then execute the same queries randomly and compare?

We are using Windows Server; would it make a big difference if we moved to Linux? Our load is not high but some queries are really complex.

Also, I was hoping to move to SSD last, after trying out all the software options.

Is it an agreed fact that on large indexes (which don't fit in RAM) proximity/wildcard/phrase queries (on common words) will be slow, and that this can only be improved by cache warm up and better hardware? Otherwise, with an index of around 150GB, will such queries take more than a minute?

If that's the case, I know this question is very subjective, but if a single query takes 2 min on SAS 10K RPM, what would its approximate time be on a good SSD (everything else the same)?

Thanks!
On Tue, Jan 25, 2011 at 3:44 PM, Toke Eskildsen t...@statsbiblioteket.dk wrote:

On Tue, 2011-01-25 at 10:20 +0100, Salman Akram wrote:

Cache warming is a good option too but the index get updated every hour so not sure how much would that help.

What is the time difference between queries with a warmed index and a cold one? If the warmed index performs satisfactory, then one answer is to upgrade your underlying storage. As always for IO-caused performance problem in Lucene/Solr-land, SSD is the answer.

-- Regards, Salman Akram