Re: search performance
On Mon, 2014-06-02 at 08:51 +0200, Jamie wrote:
> [200GB, 150M documents] With NRT enabled, search speed is roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.

5 minutes is extremely long. Is that really the right number? I do not see a hardware upgrade changing that, given the fine machine you're using.

What is your search speed if you disable continuous updates? When you restart the searcher, how long does the first search take?

- Toke Eskildsen, State and University Library, Denmark

- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org
Re: search performance
Toke

Thanks for the comment. Unfortunately, in this instance, it is a live production system, so we cannot conduct experiments. The number is definitely accurate. We have many different systems with a similar load that observe the same performance issue. To my knowledge, the Lucene integration code is fairly well optimized. I've requested access to the indexes so that we can perform further testing.

Regards

Jamie

On 2014/06/03, 8:09 AM, Toke Eskildsen wrote: [...]
Re: search performance
Can you take thread stack traces (repeatedly) during those 5-minute searches? That might give you (or someone on the mailing list) a clue where all that time is spent. You could try using jstack for that: http://docs.oracle.com/javase/7/docs/technotes/tools/share/jstack.html

Regards

Christoph

On 03.06.2014 08:17, Jamie wrote: [...]
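If attaching jstack to the production JVM is awkward, the same information is available in-process via the standard `Thread.getAllStackTraces()` API. A minimal sketch (class name and sampling interval are illustrative, not from the thread):

```java
import java.util.Map;

public class StackSampler {

    // Print every live thread's current stack, jstack-style.
    public static void dumpAllStacks() {
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            System.out.println("Thread: " + e.getKey().getName());
            for (StackTraceElement frame : e.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Sample a few times, one second apart, while a slow search runs;
        // frames that recur across samples are where the time goes.
        for (int i = 0; i < 3; i++) {
            dumpAllStacks();
            Thread.sleep(1000);
        }
    }
}
```

Triggering this from a servlet or JMX hook during a slow search gives repeated samples without shell access to the box.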
Re: search performance
On Tue, 2014-06-03 at 08:17 +0200, Jamie wrote:
> Unfortunately, in this instance, it is a live production system, so we cannot conduct experiments. The number is definitely accurate. We have many different systems with a similar load that observe the same performance issue. To my knowledge, the Lucene integration code is fairly well optimized.

It is possible that the extreme slowness is a combination of factors, but with a bit of luck it will boil down to a single thing. Standard procedure is to disable features until it performs well, so:

- Disable running updates
- Limit page size
- Limit lookup of returned fields
- Disable highlighting
- Use simpler queries
- Whatever else you might think of

At some point along the way I would expect a sharp increase in performance.

> I've requested access to the indexes so that we can perform further testing.

Great.

- Toke Eskildsen, State and University Library, Denmark
Re: search performance
Toke

Thanks for the contact. See below:

On 2014/06/03, 9:17 AM, Toke Eskildsen wrote:
> - Disable running updates
No can do.
> - Limit page size
Done this.
> - Limit lookup of returned fields
Done this.
> - Disable highlighting
No highlighting.
> - Use simpler queries
They are as simple as possible.
> - Whatever else you might think of
Our application has been using Lucene for seven years. It has been constantly optimized over that period. I'll conduct further testing...

Regards

Jamie
Re: search performance
Something doesn't quite add up.

> TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max, true, false, false, true);
> We use pagination, so only returning 1000 documents or so at a time.

You say you are using pagination, yet the API you are using to create your collector isn't how you would utilize Lucene's built-in pagination feature (unless I misunderstand the API). If the max in the snippet above is 1000, then you're simply returning the top 1000 docs every time you execute your search. Otherwise... well, could you actually post a bit more of your code, in particular the part that runs the search?

Assuming that the max is much larger than 1000, you could call fieldCollector.topDocs(int, int) after accumulating hits using this collector, but this won't work multiple times per query execution, according to the javadoc. So you either have to re-execute the full search and then get the next chunk of ScoreDocs, or use the proper API for this, one that accepts as a parameter the end of the previous page of results, i.e. IndexSearcher.searchAfter(ScoreDoc, ...)
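The searchAfter cursor pattern mentioned above can be sketched as follows against a small in-memory index (Lucene 4.x-era API, matching the 4.7.x version discussed in this thread; field names, page size, and class names are illustrative):

```java
import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class SearchAfterDemo {

    // Walk the whole result set one page at a time; each call to
    // searchAfter resumes from the last hit of the previous page,
    // so nothing before the cursor has to be cached or re-collected.
    static int countPages(IndexSearcher searcher, int pageSize) throws Exception {
        int pages = 0;
        TopDocs page = searcher.search(new MatchAllDocsQuery(), pageSize);
        while (page.scoreDocs.length > 0) {
            pages++;
            ScoreDoc last = page.scoreDocs[page.scoreDocs.length - 1];
            page = searcher.searchAfter(last, new MatchAllDocsQuery(), pageSize);
        }
        return pages;
    }

    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriterConfig cfg = new IndexWriterConfig(
                Version.LUCENE_47, new WhitespaceAnalyzer(Version.LUCENE_47));
        try (IndexWriter writer = new IndexWriter(dir, cfg)) {
            for (int i = 0; i < 25; i++) {
                Document doc = new Document();
                doc.add(new TextField("body", "doc " + i, Field.Store.YES));
                writer.addDocument(doc);
            }
        }
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            System.out.println("pages: " + countPages(searcher, 10));
        }
    }
}
```

The cursor only supports moving forward; jumping to an arbitrary page still requires stepping through the intervening pages, which is what the discussion below turns on.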
Re: search performance
Vitaly

Thanks for the contribution. Unfortunately, we cannot use Lucene's pagination function, because in reality the user can skip pages to start the search at any point, not just from the end of the previous search. Even the first search (without any pagination), with a max of 1000 hits, takes 5 minutes to complete.

Regards

Jamie

On 2014/06/03, 10:54 AM, Vitaly Funstein wrote: [...]
Re: search performance
Hi Jamie,

What is included in the 5 minutes? Just the call to the searcher, i.e. searcher.search(...)? Can you show a bit more of the code you use?

On Tue, Jun 3, 2014 at 11:32 AM, Jamie ja...@mailarchiva.com wrote: [...]
Re: search performance
Sure... see below:

protected void search(Query query, Filter queryFilter, Sort sort)
        throws BlobSearchException {
  try {
    logger.debug("start search {searchquery='" + getSearchQuery() +
        "',query='" + query.toString() + "',filterQuery='" + queryFilter +
        "',sort='" + sort + "'}");
    Thread.currentThread().setPriority(Thread.MAX_PRIORITY);
    results.clear();
    int max;
    if (getPagination()) {
      max = start + length;
    } else {
      max = getMaxResults();
    }
    // release the old volume searchers
    IndexReader indexReader = initIndexReader();
    searcher = new IndexSearcher(indexReader, executor);
    TopFieldCollector fieldCollector = TopFieldCollector.create(sort, max,
        true, false, false, true);
    searcher.search(query, queryFilter, fieldCollector);
    TopDocs topDocs;
    if (getPagination()) {
      topDocs = fieldCollector.topDocs(start, length);
    } else {
      topDocs = fieldCollector.topDocs();
    }
    int count = 0;
    for (int i = 0; i < topDocs.scoreDocs.length; i++) {
      if ((getMaxResults() > 0 && count >= getMaxResults()) ||
          (getPagination() && count++ >= length)) {
        break;
      }
      results.add(topDocs.scoreDocs[i]);
    }
    totalHits = fieldCollector.getTotalHits();
    logger.debug("search executed successfully {query='" + getSearchQuery() +
        "',returnedresults='" + results.size() + "'}");
  } catch (Exception io) {
    throw new BlobSearchException("failed to execute search query {searchquery='" +
        getSearchQuery() + "'}", io, logger, ChainedException.Level.DEBUG);
  }
}

On 2014/06/03, 11:41 AM, Rob Audenaerde wrote: [...]
Re: search performance
FYI: We are also using a MultiReader to search over multiple index readers. Searches over fewer than a million documents yield good response times. When you get into 60M+ territory, search slows to a crawl.

On 2014/06/03, 11:47 AM, Jamie wrote: [...]
Re: search performance
A couple of questions.

1. What are you trying to achieve by setting the current thread's priority to the max possible value? Is it grabbing as much CPU time as possible? In my experience, mucking with thread priorities like this is at best futile, and at worst quite detrimental to responsiveness and overall performance of the system as a whole. I would remove that line.

2. This seems suspicious:

   if (getPagination()) {
     max = start + length;
   } else {
     max = getMaxResults();
   }

If start is at 100M, and length is 1000 - what do you think Lucene will try and do when you pass this max to the collector?

On Tue, Jun 3, 2014 at 2:55 AM, Jamie ja...@mailarchiva.com wrote: [...]
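For context on why a huge max hurts: a top-N collector maintains a bounded priority queue of size N, so asking for the top start + length hits means allocating and sifting a heap with millions of slots. A stdlib-only sketch of that collection pattern (illustrative of the shape, not Lucene's actual collector code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {

    // Keep the N largest scores seen, using a min-heap of size N -
    // the smallest retained score sits at the root and is evicted
    // whenever a better candidate arrives.
    public static List<Double> topN(double[] scores, int n) {
        PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap
        for (double s : scores) {
            if (heap.size() < n) {
                heap.add(s);
            } else if (heap.peek() < s) {
                heap.poll();     // evict current smallest
                heap.add(s);
            }
        }
        List<Double> out = new ArrayList<>(heap);
        out.sort((a, b) -> Double.compare(b, a)); // best first
        return out;
    }
}
```

With n = 1000 the heap stays tiny regardless of how many hits stream through; with n = start + length at deep offsets, the heap itself becomes the bottleneck, which is Vitaly's point.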
Re: search performance
Vitaly

See below:

On 2014/06/03, 12:09 PM, Vitaly Funstein wrote:
> 1. What are you trying to achieve by setting the current thread's priority to the max possible value? [...] I would remove that line.

Yes, you are right to be worried about this, especially since thread priorities behave differently on different platforms.

> 2. This seems suspicious: [...] If start is at 100M, and length is 1000 - what do you think Lucene will try and do when you pass this max to the collector?

I don't see the problem here. The collector will collect from zero to max results. I agree that from a performance perspective, it's not ideal to return all results from the beginning of the search, but the Lucene API leaves us with no choice. I simply do not know the ScoreDoc to start from. If I did keep a record of it, then I would need to store all ScoreDocs for the entire result set. When there are 60M+ results, this can be problematic in terms of memory consumption. It would be far nicer if there were a searchAfter function that took a position as an integer.

Regards

Jamie
Re: search performance
Check and make sure you are not opening an IndexReader for every search. Be sure you don't do that.

On Mon, Jun 2, 2014 at 2:51 AM, Jamie ja...@mailarchiva.com wrote:
> Greetings
>
> Despite following all the recommended optimizations (as described at http://wiki.apache.org/lucene-java/ImproveSearchingSpeed), in some of our installations, search performance has reached the point where it is unacceptably slow. For instance, in one environment, the total index size is 200GB, with 150 million documents indexed. With NRT enabled, search speed is roughly 5 minutes on average. The server resources are: 2x6 Core Intel CPU, 128GB, 2 SSD for index and RAID 0, with Linux.
>
> The only thing we haven't yet done is to upgrade Lucene from 4.7.x to 4.8.x. Is this likely to make any noticeable difference in performance?
>
> Clearly, longer term, we need to move to a distributed search model. We thought to take advantage of the distributed search features offered in Solr; however, our solution is very tightly integrated into Lucene directly (since Solr didn't exist when we started out). Moving to Solr now seems like a daunting prospect. We've also been following the Katta project with interest, but it doesn't appear to support distributed indexing, and development on it seems to have stalled. It would be nice if there were a distributed search project at the Lucene level that we could use.
>
> I realize this is a rather vague question, but are there any further suggestions on ways to improve search performance? We need cheap and dirty ideas, as well as longer-term advice on a possible path forward.
>
> Much appreciated
>
> Jamie
Re: search performance
Jamie,

What if you were to forget for a moment the whole pagination idea, and always capped your search at 1000 results, for testing purposes only? This is just to try and pinpoint the bottleneck; if, regardless of the query parameters, the search latency stays roughly the same and well below 5 min, you have your answer - the problem is your naive implementation of pagination, which results in snowballing result numbers and search times the closer you get to the end of the results range. Otherwise, I would focus on your query and filter next.

On Tue, Jun 3, 2014 at 3:21 AM, Jamie ja...@mailarchiva.com wrote: [...]
Fwd: Reader reopen
Hello,

If I have an AtomicReader and an IndexSearcher, can I reopen the index to get the new documents? Like here: http://lucene.apache.org/core/3_0_3/api/all/org/apache/lucene/index/IndexReader.html#reopen%28%29

Is there any workaround?

Thanks,
Gergő

P.S.: I accidentally sent this to the general list too. Sorry for that.
Re: Reader reopen
Sure, just use DirectoryReader.openIfChanged.

Mike McCandless
http://blog.mikemccandless.com

On Tue, Jun 3, 2014 at 6:36 AM, Gergő Törcsvári torcsvari.ge...@gmail.com wrote: [...]
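The openIfChanged pattern can be sketched as below (Lucene 4.x API; the class and method names are illustrative). The key detail is that openIfChanged returns null when the index has not changed, so the caller keeps the old reader in that case and closes it only once a replacement exists:

```java
import java.io.IOException;
import org.apache.lucene.index.DirectoryReader;

public class ReaderRefresher {

    // Returns a current reader: the old one if the index is unchanged,
    // otherwise a freshly opened one (the stale reader is closed here).
    public static DirectoryReader refresh(DirectoryReader reader)
            throws IOException {
        DirectoryReader newReader = DirectoryReader.openIfChanged(reader);
        if (newReader == null) {
            return reader;      // index unchanged; keep the old reader
        }
        reader.close();         // release the superseded reader
        return newReader;
    }
}
```

If other threads may still be searching the old reader, closing it immediately is unsafe; SearcherManager (mentioned later in this thread) handles that reference counting for you.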
Re: search performance
Vitaly / Robert

I wouldn't go so far as to call our pagination naive!? Sub-optimal, yes. Unless I am mistaken, the Lucene library's pagination mechanism makes the assumption that you will cache the ScoreDocs for the entire result set. This is not practical when you have a result set that exceeds 60M. In any case, as stated earlier, it is the first query that is slow.

We do open index readers, since we are using NRT search and documents are being added to the indexes on a continuous basis. When the user clicks on the Search button, the user will expect to see the latest result set. With regards to NRT search, my understanding is that we do need to open the index readers on each search operation to see the latest changes. Thus, on each search, we combine the index readers into a MultiReader, opening each reader from its corresponding writer:

protected IndexReader initIndexReader() throws IOException {
    List<IndexReader> readers = new LinkedList<>();
    for (Writer writer : writers) {
        readers.add(DirectoryReader.open(writer, true));
    }
    return new MultiReader(readers.toArray(new IndexReader[0]), true);
}

Thank you for your ideas/suggestions.

Regards

Jamie

On 2014/06/03, 12:29 PM, Vitaly Funstein wrote: [...]
Re: search performance
No, you are incorrect. The point of a search engine is to return the top-N most relevant. If you insist you need to open an IndexReader on every single search, and then return huge amounts of docs, maybe you should use a database instead.

On Tue, Jun 3, 2014 at 6:42 AM, Jamie ja...@mailarchiva.com wrote: [...]
Re: search performance
Robert

Hmmm... why did Mike go to all the trouble of implementing NRT search, if we are not supposed to be using it? The user simply wants the latest result set. To me, this doesn't appear out of scope for the Lucene project.

Jamie

On 2014/06/03, 1:17 PM, Robert Muir wrote: [...]
Re: search performance
Robert

FYI: I've modified the code to utilize the experimental function:

DirectoryReader dirReader = DirectoryReader.openIfChanged(cachedDirectoryReader, writer, true);

In this case, the IndexReader won't be reopened on each search unless absolutely necessary.

Regards

Jamie

On 2014/06/03, 1:25 PM, Jamie wrote: [...]
Re: search performance
Reopening for every search is not a good idea. This will have an extremely high cost (not as high as what you are doing with paging, but still not good).

Instead, consider making it near-real-time by doing this every second or so instead. Look at SearcherManager for code that helps you do this.

On Tue, Jun 3, 2014 at 7:25 AM, Jamie ja...@mailarchiva.com wrote: [...]
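The SearcherManager approach can be sketched roughly as follows (Lucene 4.x API; class name, refresh interval, and the countDocs example method are illustrative). Searches acquire a searcher and release it when done; a background task refreshes at its own pace, so individual searches never pay the reopen cost:

```java
import java.io.IOException;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherManager;

public class NrtSearch {

    private final SearcherManager manager;

    public NrtSearch(IndexWriter writer) throws IOException {
        // true = apply deletes; null = default SearcherFactory
        this.manager = new SearcherManager(writer, true, null);
    }

    // Call from a scheduled background task every second or so.
    public void refresh() throws IOException {
        manager.maybeRefresh();
    }

    // Every search must pair acquire() with release().
    public int countDocs() throws IOException {
        IndexSearcher searcher = manager.acquire();
        try {
            return searcher.getIndexReader().numDocs();
        } finally {
            manager.release(searcher);
        }
    }
}
```

Under this design a search may be up to one refresh interval stale, which is the usual NRT trade-off: bounded staleness in exchange for reopen costs amortized across all queries.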
Re: search performance
Robert

Thanks, I've already done a similar thing. Results on my test platform are encouraging...

On 2014/06/03, 2:41 PM, Robert Muir wrote: [...]
Re: search performance
With regards to pagination, is there a way for you to cache the IndexSearcher, Query, and TopDocs between user pagination requests (a lot of webapp frameworks have object caching mechanisms)? If so, you may have luck with code like this: void ensureTopDocs(final int rank) throws IOException { if (StartDocIndex rank) { Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW); StartDocIndex = 0; } int len = Docs.scoreDocs.length; while (StartDocIndex + len = rank) { StartDocIndex += len; Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1], SearchQuery, TOP_DOCS_WINDOW); len = Docs.scoreDocs.length; } } StartDocIndex is a member variable denoting the current rank of the first item in TopDocs (Docs) window. I call this function before each Document retrieval. The common case--of the user looking at the first page of results or the user advancing to the next page--is quite fast. But it still supports random access, albeit not in constant time. OTOH, if your app is concurrent, most search queries will probably be returned very quickly so the odd query that wants to jump deep into the result set will have more of the server's resources available to it. Also, given the size of your result sets, you have to allocate a lot of memory upfront which will then get gc'd after some time. From query to query, you will have a decent amount of memory churn. This isn't free. My guess is using Lucene's linear (search() searchAfter()) pagination will perform faster than your current approach just based upon not having to create such large arrays. I'm not the Lucene expert that Robert is, but this has worked alright for me. cheers, Jon On Tue, Jun 3, 2014 at 8:47 AM, Jamie ja...@mailarchiva.com wrote: Robert. Thanks, I've already done a similar thing. Results on my test platform are encouraging.. On 2014/06/03, 2:41 PM, Robert Muir wrote: Reopening for every search is not a good idea. this will have an extremely high cost (not as high as what you are doing with paging but still not good). 
Instead, consider making it near-real-time by doing this every second or so instead. Look at SearcherManager for code that helps you do this.

--
Jon Stewart, Principal
(646) 719-0317 | j...@lightboxtechnologies.com | Arlington, VA
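Robert's SearcherManager suggestion can be sketched roughly as below. This is a minimal sketch, not code from the thread: the class name, the once-per-second scheduling, and the wiring around the writer are my own illustrative choices; SearcherManager, maybeRefresh(), and acquire()/release() are the actual Lucene API being recommended.

```java
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TopDocs;

// Keep one SearcherManager on top of the IndexWriter and refresh it about
// once per second, instead of reopening a reader for every search.
public class NrtSearchService {
    private final SearcherManager manager;
    private final ScheduledExecutorService refresher =
        Executors.newSingleThreadScheduledExecutor();

    public NrtSearchService(IndexWriter writer) throws IOException {
        // true = apply deletes; a SearcherFactory can warm new searchers.
        this.manager = new SearcherManager(writer, true, new SearcherFactory());
        refresher.scheduleWithFixedDelay(() -> {
            try {
                manager.maybeRefresh();   // cheap no-op if nothing changed
            } catch (IOException e) {
                // log and keep going; the previous searcher remains usable
            }
        }, 1, 1, TimeUnit.SECONDS);
    }

    public TopDocs search(Query query, int n) throws IOException {
        IndexSearcher searcher = manager.acquire();  // ref-counted
        try {
            return searcher.search(query, n);
        } finally {
            manager.release(searcher);               // never close it yourself
        }
    }
}
```

Every search pays only the acquire/release cost; new segments become visible within about a second of being written.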
Re: search performance
Thanks Jon

I'll investigate your idea further. It would be nice if, in future, the Lucene API could provide a searchAfter that takes a position (int).

Regards

Jamie

On 2014/06/03, 3:24 PM, Jon Stewart wrote:

With regards to pagination, is there a way for you to cache the IndexSearcher, Query, and TopDocs between user pagination requests (a lot of webapp frameworks have object caching mechanisms)? If so, you may have luck with code like this:

  void ensureTopDocs(final int rank) throws IOException {
    if (StartDocIndex > rank) {
      Docs = Searcher.search(SearchQuery, TOP_DOCS_WINDOW);
      StartDocIndex = 0;
    }
    int len = Docs.scoreDocs.length;
    while (StartDocIndex + len <= rank) {
      StartDocIndex += len;
      Docs = Searcher.searchAfter(Docs.scoreDocs[len - 1], SearchQuery, TOP_DOCS_WINDOW);
      len = Docs.scoreDocs.length;
    }
  }

StartDocIndex is a member variable denoting the current rank of the first item in the TopDocs (Docs) window. I call this function before each Document retrieval. The common case -- the user looking at the first page of results, or advancing to the next page -- is quite fast, but it still supports random access, albeit not in constant time. OTOH, if your app is concurrent, most search queries will probably return very quickly, so the odd query that wants to jump deep into the result set will have more of the server's resources available to it.

Also, given the size of your result sets, your current approach has to allocate a lot of memory upfront, which will then get gc'd after some time. From query to query, you will have a decent amount of memory churn. This isn't free. My guess is that using Lucene's linear (search() + searchAfter()) pagination will perform faster than your current approach, just based upon not having to create such large arrays. I'm not the Lucene expert that Robert is, but this has worked alright for me.
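Jon's sliding-window pagination can be exercised without Lucene at all. The following self-contained sketch is my own reconstruction of the same control flow over a fake ranked list: the class, field names, and the stand-in fetch() method are illustrative, not from the thread; only the rewind/slide logic mirrors ensureTopDocs.

```java
import java.util.Arrays;

// Keep a window of results and advance it with "search after the last item"
// calls until the window covers the requested rank. The int array stands in
// for Lucene's ranked hits; only the control flow is the point.
public class WindowPager {
    static final int WINDOW = 5;          // TOP_DOCS_WINDOW stand-in
    final int[] allHits;                  // pretend ranked result ids
    int startDocIndex = 0;                // rank of first item in the window
    int[] window;

    public WindowPager(int[] allHits) {
        this.allHits = allHits;
        this.window = fetch(0);           // initial "search"
    }

    // Fake Searcher.search / searchAfter: up to WINDOW hits starting at 'from'.
    int[] fetch(int from) {
        return Arrays.copyOfRange(allHits, from, Math.min(from + WINDOW, allHits.length));
    }

    // Mirror of ensureTopDocs(rank): rewind if behind, slide forward if ahead.
    public int get(int rank) {
        if (startDocIndex > rank) {
            window = fetch(0);
            startDocIndex = 0;
        }
        while (startDocIndex + window.length <= rank) {
            startDocIndex += window.length;
            window = fetch(startDocIndex); // searchAfter stand-in
        }
        return window[rank - startDocIndex];
    }

    public static void main(String[] args) {
        int[] hits = new int[23];
        for (int i = 0; i < hits.length; i++) hits[i] = 100 + i;
        WindowPager pager = new WindowPager(hits);
        System.out.println(pager.get(0));   // first page: fast path
        System.out.println(pager.get(7));   // slides the window forward
        System.out.println(pager.get(2));   // rewinds, then slides
    }
}
```

Sequential access touches each window once; jumping backward restarts from the top, which is the non-constant-time random access Jon describes.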
RE: search performance
Jamie [ja...@mailarchiva.com] wrote:

It would be nice if, in future, the Lucene API could provide a searchAfter that takes a position (int).

It would not really help with large result sets. At least not with the current underlying implementations. This is tied into your current performance problem, if I understand it correctly. We seem to have isolated your performance problems to large (10M+) result sets, right?

Requesting the top X results in Lucene works internally by adding to a Priority Queue. The problem with PQs is that they work really well for small result sets and really badly for large result sets (note that "result set" refers to the collected documents, not to the number of matching documents). A PQ rearranges its internal structure each time a hit is entered that has a score >= the lowest known score. With millions of documents in the result set, this happens all the time. Abstractly there is little difference between small result sets and large: O(n * log n) is fine scaling. In reality, the rearrangement of the internal heap structure only works well while the heap fits in CPU cache.

To test this, I created the tiny project https://github.com/tokee/luso

It simulates the workflow (for an extremely loose value of 'simulates') you described, with extraction of a large result set, by filling a PQ of a given size with docIDs (ints) and scores (floats) and then extracting the ordered docIDs. Running it with different sizes shows how the PQ deteriorates on a 4-core i7 with 8MB level 2 cache:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 1 1000 1 100 100 500 1000 2000 3000 4000"
Starting 1 threads with extraction method pq
 1,000 docs in mean 15 ms, 66 docs/ms.
 10,000 docs in mean 47 ms, 212 docs/ms.
 100,000 docs in mean 65 ms, 1,538 docs/ms.
 500,000 docs in mean 385 ms, 1,298 docs/ms.
 1,000,000 docs in mean 832 ms, 1,201 docs/ms.
 5,000,000 docs in mean 7,566 ms, 660 docs/ms.
 10,000,000 docs in mean 16,482 ms, 606 docs/ms.
 20,000,000 docs in mean 39,481 ms, 506 docs/ms.
 30,000,000 docs in mean 80,293 ms, 373 docs/ms.
 40,000,000 docs in mean 109,537 ms, 365 docs/ms.

As can be seen, relative performance (docs/ms) drops significantly as the document count increases. To add insult to injury, this deterioration pattern is optimistic, as the test was the only heavy job on my computer. Running 4 of these tests in parallel (1 per core) we would ideally expect about the same speed, but instead we get:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="pq 4 1000 1 10 50 100 500 1000 2000 3000 4000"
Starting 4 threads with extraction method pq
 1,000 docs in mean 34 ms, 29 docs/ms.
 10,000 docs in mean 70 ms, 142 docs/ms.
 100,000 docs in mean 102 ms, 980 docs/ms.
 500,000 docs in mean 1,340 ms, 373 docs/ms.
 1,000,000 docs in mean 2,564 ms, 390 docs/ms.
 5,000,000 docs in mean 19,464 ms, 256 docs/ms.
 10,000,000 docs in mean 49,985 ms, 200 docs/ms.
 20,000,000 docs in mean 112,321 ms, 178 docs/ms.
(I got tired of waiting and stopped after 20M docs)

The conclusion seems clear enough: using a PQ for millions of results will take a long time. So what can be done? I added an alternative implementation where all the docIDs and scores are collected in two parallel arrays, then merge sorted after collection. That gave these results:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 1 1000 1 10 50 100 500 1000 2000 3000 4000"
Starting 1 threads with extraction method ip
 1,000 docs in mean 15 ms, 66 docs/ms.
 10,000 docs in mean 52 ms, 192 docs/ms.
 100,000 docs in mean 73 ms, 1,369 docs/ms.
 500,000 docs in mean 363 ms, 1,377 docs/ms.
 1,000,000 docs in mean 780 ms, 1,282 docs/ms.
 5,000,000 docs in mean 4,634 ms, 1,078 docs/ms.
 10,000,000 docs in mean 9,708 ms, 1,030 docs/ms.
 20,000,000 docs in mean 20,818 ms, 960 docs/ms.
 30,000,000 docs in mean 32,413 ms, 925 docs/ms.
 40,000,000 docs in mean 44,235 ms, 904 docs/ms.

Notice how the deterioration of relative speed is a lot less than for PQ. Running this with 4 threads gets us:

MAVEN_OPTS=-Xmx4g mvn -q exec:java -Dexec.args="ip 4 1000 1 10 50 100 500 1000 2000 3000 4000"
Starting 4 threads with extraction method ip
 1,000 docs in mean 35 ms, 28 docs/ms.
 10,000 docs in mean 221 ms, 45 docs/ms.
 100,000 docs in mean 162 ms, 617 docs/ms.
 500,000 docs in mean 639 ms, 782 docs/ms.
 1,000,000 docs in mean 1,388 ms, 720 docs/ms.
 5,000,000 docs in mean 8,372 ms, 597 docs/ms.
 10,000,000 docs in mean 17,933 ms, 557 docs/ms.
 20,000,000 docs in mean 36,031 ms, 555 docs/ms.
 30,000,000 docs in mean 58,257 ms, 514 docs/ms.
 40,000,000
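The two collection strategies Toke benchmarks can be reproduced in miniature with plain Java. The sketch below is my own toy reconstruction, not the luso code: a bounded min-heap that is reshuffled on every competitive hit versus collecting everything and sorting once at the end (plain Arrays.sort here rather than luso's merge sort over parallel docID/score arrays). Both must agree on the ranked scores.

```java
import java.util.Arrays;
import java.util.PriorityQueue;
import java.util.Random;

// Toy reconstruction of "pq" vs "ip": per-hit heap maintenance versus
// one bulk sort after collection. Scores stand in for (docID, score) hits.
public class TopDocsStrategies {

    // Strategy "pq": bounded min-heap; every competitive hit reshuffles it.
    static float[] topScoresViaPq(float[] scores, int k) {
        PriorityQueue<Float> heap = new PriorityQueue<>(k);  // min-heap
        for (float s : scores) {
            if (heap.size() < k) {
                heap.add(s);
            } else if (s > heap.peek()) {                    // competitive hit
                heap.poll();
                heap.add(s);
            }
        }
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();                            // emit descending
        }
        return top;
    }

    // Strategy "ip": collect everything, sort once after collection.
    static float[] topScoresViaSort(float[] scores, int k) {
        float[] copy = scores.clone();
        Arrays.sort(copy);                                   // ascending
        float[] top = new float[Math.min(k, copy.length)];
        for (int i = 0; i < top.length; i++) {
            top[i] = copy[copy.length - 1 - i];              // take top k
        }
        return top;
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        float[] scores = new float[100_000];
        for (int i = 0; i < scores.length; i++) scores[i] = rnd.nextFloat();
        // Both strategies must produce the same ranked scores.
        System.out.println(Arrays.equals(topScoresViaPq(scores, 1000),
                                         topScoresViaSort(scores, 1000)));
    }
}
```

The heap touches scattered memory on every competitive insert, which is exactly the cache behaviour Toke's numbers expose once the heap outgrows the CPU cache; the sort-after-collection variant streams through memory instead.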
How to approach indexing source code?
Hi,

I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking). I'm wondering how to approach indexing source code. I can see two possible approaches:

* Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file. The file could use a simple line-based format:

  myFunction,1:12-1:22,my-package,defined-here,more-metadata
  myFunction,5:11-5:21,my-package,used-here,more-metadata
  ...

  The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.

* Use an IndexWriter to create a Document directly, as done here: http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene, so I can't quite tell which approach is more likely to work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

- Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and unqualified name (e.g. myFunction) for a term.

-- Johan
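On the last point: in Lucene, a token whose position increment is 0 occupies the same position as the previous token, which is how analyzers stack synonyms (via PositionIncrementAttribute in a real TokenStream). The dependency-free sketch below is my own illustration of that convention applied to qualified names; the class, the Token record, and the tokenize method are hypothetical helpers, not Lucene API.

```java
import java.util.ArrayList;
import java.util.List;

// Emit "module.myFunction" and "myFunction" as one logical position:
// the qualified name advances the position (increment 1), the unqualified
// variant is stacked on top of it (increment 0), synonym-style.
public class QualifiedNameTokens {
    // (term, positionIncrement) pair, mirroring CharTermAttribute plus
    // PositionIncrementAttribute in a real Lucene TokenStream.
    record Token(String term, int posIncrement) {}

    static List<Token> tokenize(String qualifiedName) {
        List<Token> out = new ArrayList<>();
        out.add(new Token(qualifiedName, 1));              // advances position
        int dot = qualifiedName.lastIndexOf('.');
        if (dot >= 0) {
            // Stacked at the same position as the qualified form, so a
            // phrase or exact match on either variant hits this position.
            out.add(new Token(qualifiedName.substring(dot + 1), 0));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("module.myFunction"));
        System.out.println(tokenize("myFunction"));        // no stacked variant
    }
}
```

Either spelling then matches at that position, which is the behaviour wanted for qualified/unqualified lookups.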
Re: How to approach indexing source code?
The first question for any search app should always be: how do you intend to query the data? That will in large part determine how you should index the data. IOW, how do you intend to use the data? Be specific. Provide some sample queries and then work backwards to how the data needs to be indexed.

-- Jack Krupansky

-----Original Message----- From: Johan Tibell Sent: Tuesday, June 3, 2014 9:32 PM To: java-user@lucene.apache.org Subject: How to approach indexing source code?

Hi,

I'd like to index (Haskell) source code. I've run the source code through a compiler (GHC) to get rich information about each token (its type, fully qualified name, etc.) that I want to index (and later use when ranking). I'm wondering how to approach indexing source code. I can see two possible approaches:

* Create a file containing all the metadata and write a custom tokenizer/analyzer that processes the file. The file could use a simple line-based format:

  myFunction,1:12-1:22,my-package,defined-here,more-metadata
  myFunction,5:11-5:21,my-package,used-here,more-metadata
  ...

  The tokenizer would use CharTermAttribute to write the function name, OffsetAttribute to write the source span, etc.

* Use an IndexWriter to create a Document directly, as done here: http://www.onjava.com/pub/a/onjava/2006/01/18/using-lucene-to-search-java-source.html?page=3

I'm new to Lucene, so I can't quite tell which approach is more likely to work well. Which way would you recommend?

Other things I'd like to do that might influence the answer:

- Index several tokens at the same position, so I can index both the fully qualified name (e.g. module.myFunction) and unqualified name (e.g. myFunction) for a term.

-- Johan