[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Marr updated LUCENE-1690: - Attachment: LUCENE-1690.patch This is the latest version. I wasn't working on it at quite such a rediculous hour this time so it should be better. It includes - fixed cache logic, a few comments, LRU object applied in the right place, and some test cases demonstrating things behave as expected. I'll do some more testing when I have a free evening. I have some questions: a) org.apache.lucene.search.similar doesn't seem like the right place for a generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead? b) Having the cache dependent on both the MLT object and the IndexReader object seems a bit... odd. I suspect the right place for this cache is in the IndexReader, but suspect that would be a can of worms. Comments? Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736525#action_12736525 ] Richard Marr commented on LUCENE-1690: -- There's also another problem I've just noticed. Please ignore the latest patch. Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LruCache.patch, LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Marr updated LUCENE-1690: - Attachment: LruCache.patch Attached is a draft of an implementation that uses a WeakHashMap to bind the cache to the IndexReader instance, and a LinkedHashMap to provide LRU functionality. Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or bad style in this implementation. I also need to check it meets the project coding standards. Anybody up for giving me some feedback in the meantime? Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LruCache.patch, LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733237#action_12733237 ] Richard Marr commented on LUCENE-1690: -- Okay, so the ideal solution is an LRU cache binding to a specific IndexReader instance. I think I can handle that. Carl, do you have any data on how this has changed performance in your system? My use case is a limited vocabulary so the performance gain was large. Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719653#action_12719653 ] Richard Marr commented on LUCENE-1690: -- Sounds reasonable although that'll take a little longer for me to do. I'll have a think about it. Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Richard Marr updated LUCENE-1690: - Attachment: LUCENE-1690.patch This patch implements a basic hashmap term frequency cache. It shouldn't affect any applications that don't opt-in to using it, and applications that do should see an order of magnitude performance improvement for MLT queries. This cache implementation is tied to the MLT object but can be cleared on demand. Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor Attachments: LUCENE-1690.patch Original Estimate: 2h Remaining Estimate: 2h The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
Morelikethis queries are very slow compared to other search types - Key: LUCENE-1690 URL: https://issues.apache.org/jira/browse/LUCENE-1690 Project: Lucene - Java Issue Type: Improvement Components: contrib/* Affects Versions: 2.4.1 Reporter: Richard Marr Priority: Minor The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches. For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them. I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org