Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
2009/7/30 Michael McCandless:
> Good question...

Good answer. Thanks.

I guess the next step then is to understand why the TermInfo cache isn't getting the performance to where it could be. It'll take me a while to get to the point where I can answer that question. If anyone's in a hurry it'd probably be worth someone looking at it.

Rich

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
Yeah, having this stuff stored centrally behind the IndexReader seems like a better idea than having it in client classes. My shallow knowledge of the code isn't helping me explain why it's not performing, though. Out of interest, how come it's a per-thread cache? I don't understand all the issues involved, but that surprised me.

2009/7/30 Michael McCandless (JIRA):
> [ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737059#action_12737059 ]
>
> Michael McCandless commented on LUCENE-1690:
> --------------------------------------------
>
> OK now I feel silly -- this cache is in fact very similar to the caching that Lucene already does, internally! Sorry I didn't catch this overlap sooner.
>
> In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, that holds recently retrieved terms and their TermInfo. It uses oal.util.cache.SimpleLRUCache.
>
> There are some important differences from this new cache in MLT. EG, it holds the entire TermInfo, not just the docFreq. Plus, it's a central cache for any & all term lookups that go through the SegmentReader. Also, it's stored in thread-private storage, so each thread has its own cache.
>
> But, now I'm confused: how come you are not already seeing the benefits of this cache? You ought to see MLT queries going faster. This core cache was first added in 2.4.x; it looks like you were testing against 2.4.1 (from the "Affects Version" on this issue).
>> Morelikethis queries are very slow compared to other search types
>> -----------------------------------------------------------------
>>
>> Key: LUCENE-1690
>> URL: https://issues.apache.org/jira/browse/LUCENE-1690
>> Project: Lucene - Java
>> Issue Type: Improvement
>> Components: contrib/*
>> Affects Versions: 2.4.1
>> Reporter: Richard Marr
>> Priority: Minor
>> Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch
>>
>> Original Estimate: 2h
>> Remaining Estimate: 2h
>>
>> The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches.
>> For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them.
>> I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.
>
> --
> This message is automatically generated by JIRA.
> You can reply to this email to add a comment to the issue online.

--
Richard Marr
richard.m...@gmail.com
07976 910 515
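The core cache Mike describes — a thread-private LRU of default size 1024 in TermInfosReader — can be sketched roughly as below. This is an illustrative reconstruction, not Lucene's actual SimpleLRUCache code; the class and method names are made up for the sketch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch of a thread-private LRU term cache in the style described above:
// each thread gets its own map, so lookups need no synchronization.
class ThreadPrivateTermCache {
    private static final int MAX_SIZE = 1024;  // the default size mentioned in the thread

    // Lazily create one LRU map per thread.
    private final ThreadLocal<Map<String, Integer>> cache =
        ThreadLocal.withInitial(() ->
            new LinkedHashMap<String, Integer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > MAX_SIZE;
                }
            });

    Integer get(String term)           { return cache.get().get(term); }
    void put(String term, int docFreq) { cache.get().put(term, docFreq); }
}
```

The trade-off behind the per-thread design, which is what Richard's question is getting at, is memory for concurrency: every thread carries its own copy of the cache in exchange for lock-free reads.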
[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------
Attachment: LUCENE-1690.patch

This is the latest version. I wasn't working on it at quite such a ridiculous hour this time, so it should be better. It includes: fixed cache logic, a few comments, the LRU object applied in the right place, and some test cases demonstrating that things behave as expected. I'll do some more testing when I have a free evening.

I have some questions:
a) org.apache.lucene.search.similar doesn't seem like the right place for a generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?
b) Having the cache dependent on both the MLT object and the IndexReader object seems a bit... odd. I suspect the right place for this cache is in the IndexReader, but suspect that would be a can of worms. Comments?
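The generic "LRU LinkedHashMap wrapper" Richard asks about in (a) is usually written with the standard LinkedHashMap idiom sketched below. This is an illustration of that idiom, not the code in the attached patch; the class name and capacity are illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal generic LRU map: a LinkedHashMap in access order whose eldest
// entry is evicted once capacity is exceeded.
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true);  // accessOrder=true makes iteration follow recency of access
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;  // called after every put; true evicts the LRU entry
    }
}
```

Because accessOrder is true, a get() counts as a "touch", so removeEldestEntry always evicts the least recently used entry rather than the oldest insertion.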
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736525#action_12736525 ]

Richard Marr commented on LUCENE-1690:
--------------------------------------

There's also another problem I've just noticed. Please ignore the latest patch.
[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------
Attachment: LruCache.patch

Attached is a draft of an implementation that uses a WeakHashMap to bind the cache to the IndexReader instance, and a LinkedHashMap to provide LRU functionality.

Disclaimer: I'm not fluent in Java or OSS contribution, so there may be holes or bad style in this implementation. I also need to check it meets the project coding standards. Anybody up for giving me some feedback in the meantime?
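The WeakHashMap idea described above can be sketched as follows. Here `Object` stands in for org.apache.lucene.index.IndexReader, and none of the names are taken from LruCache.patch; the point is that keying the outer map weakly lets a reader's cache be garbage-collected as soon as the reader itself is.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Sketch: per-reader term-frequency caches, keyed weakly by the reader object.
// The patch pairs this with an LRU LinkedHashMap as the inner map; a plain
// HashMap is used here to keep the sketch short.
class PerReaderFreqCache {
    private final Map<Object, Map<String, Integer>> byReader =
        Collections.synchronizedMap(new WeakHashMap<>());

    // Returns the cache bound to this reader, creating it on first use.
    Map<String, Integer> cacheFor(Object reader) {
        return byReader.computeIfAbsent(reader, r -> new HashMap<>());
    }
}
```

This sidesteps Richard's question (b) to a degree: the cache still lives outside the IndexReader, but its lifetime follows the reader's, so stale entries die with the reader they were computed against.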
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733237#action_12733237 ]

Richard Marr commented on LUCENE-1690:
--------------------------------------

Okay, so the ideal solution is an LRU cache bound to a specific IndexReader instance. I think I can handle that.

Carl, do you have any data on how this has changed performance in your system? My use case is a limited vocabulary, so the performance gain was large.
[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719653#action_12719653 ]

Richard Marr commented on LUCENE-1690:
--------------------------------------

Sounds reasonable although that'll take a little longer for me to do. I'll have a think about it.
[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
[ https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Richard Marr updated LUCENE-1690:
---------------------------------
Attachment: LUCENE-1690.patch

This patch implements a basic hashmap term frequency cache. It shouldn't affect any applications that don't opt in to using it, and applications that do should see an order-of-magnitude performance improvement for MLT queries. This cache implementation is tied to the MLT object but can be cleared on demand.
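The opt-in behaviour this patch note describes — off by default, cleared on demand — can be sketched like this. DocFreqSource is a hypothetical stand-in for IndexReader.docFreq(Term), and none of these names come from the actual patch; the real change wraps lookups inside MoreLikeThis itself.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for IndexReader.docFreq(Term).
interface DocFreqSource { int docFreq(String term); }

// Sketch of an opt-in term-frequency cache in the spirit of the patch:
// disabled by default so existing applications are unaffected.
class CachingFreqLookup {
    private final DocFreqSource source;
    private final Map<String, Integer> cache = new HashMap<>();
    private boolean useCache = false;

    CachingFreqLookup(DocFreqSource source) { this.source = source; }

    void setCache(boolean on) { useCache = on; }  // opt in explicitly
    void flushCache()         { cache.clear(); }  // clear on demand, e.g. after reopening

    int docFreq(String term) {
        if (!useCache) return source.docFreq(term);       // default path: straight through
        return cache.computeIfAbsent(term, source::docFreq);
    }
}
```

With the cache enabled, repeated lookups of the same term hit the map instead of the index, which is where the roughly 10x gain reported on this issue comes from.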
[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types
Morelikethis queries are very slow compared to other search types
-----------------------------------------------------------------

Key: LUCENE-1690
URL: https://issues.apache.org/jira/browse/LUCENE-1690
Project: Lucene - Java
Issue Type: Improvement
Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor

The MoreLikeThis object performs term frequency lookups for every query. From my testing that's what seems to take up the majority of time for MoreLikeThis searches.

For some (I'd venture many) applications it's not necessary for term statistics to be looked up every time. A fairly naive opt-in caching mechanism tied to the life of the MoreLikeThis object would allow applications to cache term statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I get a minute. From my testing this can improve performance by a factor of around 10.

--
This message is automatically generated by JIRA.
Re: MoreLikeThisQuery term frequency caching
The cache is currently being stored as a static HashMap on the MLT object and expired at the discretion of the application code using a static MLT.flushCache() method. Use of the cache at all is opt-in, using a non-static MLT.setCache(true) and a new constructor signature on MLTQuery that includes a useCache parameter. It's not pretty, but it's enough for our use case. Feel free to suggest nicer solutions if you've got them.

2009/4/10 Grant Ingersoll:
> What was your approach to handling stale cache entries? Did you flush it when you opened a new reader?
>
> On Apr 7, 2009, at 2:28 AM, Richard Marr wrote:
>
>> Hi all,
>>
>> I've been exploring MoreLikeThisQuery as part of a recent project and something that came out of that might be useful to others here.
>>
>> I found that using MoreLikeThisQuery could be quite slow for my use case, but that most of the time involved was spent looking up term frequencies to calculate weightings. Since those term frequencies usually don't need to be anywhere near real-time I found that caching them in a hashmap had a very good cost/benefit ratio for my application, speeding up MLT queries by an order of magnitude.
>>
>> My use case was possibly unusual in that I was looking at a limited vocabulary rather than full English, but in theory other applications that make use of the MLT class could benefit.
>>
>> So at this point I have some questions: (1) Have others experienced similar performance characteristics for MLT code? (2) Am I missing some fatal flaw in this approach? (3) Are the modifications worth sharing?
>> Cheers,
>>
>> Rich

--
Richard Marr
richard.m...@gmail.com
07976 910 515
Re: MoreLikeThisQuery term frequency caching
Thanks Mike, I'll leave it a few days to give people time to respond, then start looking into creating a Jira ticket and a patch.

2009/4/7 Michael McCandless:
> I don't have direct experience with MLT, but this sounds like a great improvement, so in answer to (3) I would say "definitely!".
>
> Mike

--
Richard Marr
richard.m...@gmail.com
07976 910 515
MoreLikeThisQuery term frequency caching
Hi all,

I've been exploring MoreLikeThisQuery as part of a recent project and something that came out of that might be useful to others here.

I found that using MoreLikeThisQuery could be quite slow for my use case, but that most of the time involved was spent looking up term frequencies to calculate weightings. Since those term frequencies usually don't need to be anywhere near real-time, I found that caching them in a hashmap had a very good cost/benefit ratio for my application, speeding up MLT queries by an order of magnitude.

My use case was possibly unusual in that I was looking at a limited vocabulary rather than full English, but in theory other applications that make use of the MLT class could benefit.

So at this point I have some questions: (1) Have others experienced similar performance characteristics for MLT code? (2) Am I missing some fatal flaw in this approach? (3) Are the modifications worth sharing?

Cheers,

Rich