Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr
2009/7/30 Michael McCandless:
> Good question...

Good answer. Thanks.

I guess the next step then is to understand why the TermInfo cache
isn't getting the performance to where it could be. It'll take me a
while to get to the point where I can answer that question. If
anyone's in a hurry, it'd probably be worth someone else looking at it
in the meantime.

Rich

-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr
Yeah, having this stuff stored centrally behind the IndexReader seems
like a better idea than having it in client classes. My shallow
knowledge of the code isn't helping me explain why it's not performing
though.

Out of interest, how come it's a per-thread cache? I don't understand
all the issues involved but that surprised me.
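
For anyone else following along, here is my rough mental model of what
"thread-private storage" means for an LRU cache like the one described
in the quoted comment below. It is only a sketch: PerThreadLruCache and
its methods are names I made up, it uses a plain LinkedHashMap rather
than oal.util.cache.SimpleLRUCache, and it is not the TermInfosReader
code. It assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not Lucene's actual implementation.
class PerThreadLruCache<K, V> {
    // Each thread lazily gets its own private map, so lookups need no locking.
    private final ThreadLocal<Map<K, V>> local;

    PerThreadLruCache(final int maxSize) {
        this.local = ThreadLocal.withInitial(() ->
            // accessOrder = true makes iteration order least-recently-used first.
            new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    // Called after each put; drop the LRU entry once over capacity.
                    return size() > maxSize;
                }
            });
    }

    V get(K key) {
        return local.get().get(key);
    }

    void put(K key, V value) {
        local.get().put(key, value);
    }

    int sizeForCurrentThread() {
        return local.get().size();
    }
}
```

In this sketch nothing is shared between threads, so there is no lock
to contend on, but every thread has to warm up its own copy and the
memory cost is multiplied by the number of threads.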




2009/7/30 Michael McCandless (JIRA):
>
>    [ 
> https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12737059#action_12737059
>  ]
>
> Michael McCandless commented on LUCENE-1690:
> 
>
> OK now I feel silly -- this cache is in fact very similar to the caching that 
> Lucene already does, internally!  Sorry I didn't catch this overlap sooner.
>
> In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, 
> that holds recently retrieved terms and their TermInfo.  It uses 
> oal.util.cache.SimpleLRUCache.
>
> There are some important differences from this new cache in MLT.  EG, it 
> holds the entire TermInfo, not just the docFreq.  Plus, it's a central cache 
> for any & all term lookups that go through the SegmentReader.  Also, it's 
> stored in thread-private storage, so each thread has its own cache.
>
> But, now I'm confused: how come you are not already seeing the benefits of 
> this cache?  You ought to see MLT queries going faster.  This core cache was 
> first added in 2.4.x; it looks like you were testing against 2.4.1 (from the 
> "Affects Version" on this issue).
>
>> Morelikethis queries are very slow compared to other search types
>> -
>>
>>                 Key: LUCENE-1690
>>                 URL: https://issues.apache.org/jira/browse/LUCENE-1690
>>             Project: Lucene - Java
>>          Issue Type: Improvement
>>          Components: contrib/*
>>    Affects Versions: 2.4.1
>>            Reporter: Richard Marr
>>            Priority: Minor
>>         Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch
>>
>>   Original Estimate: 2h
>>  Remaining Estimate: 2h
>>
>> The MoreLikeThis object performs term frequency lookups for every query.  
>> From my testing that's what seems to take up the majority of time for 
>> MoreLikeThis searches.
>> For some (I'd venture many) applications it's not necessary for term 
>> statistics to be looked up every time. A fairly naive opt-in caching 
>> mechanism tied to the life of the MoreLikeThis object would allow 
>> applications to cache term statistics for the duration that suits them.
>> I've got this working in my test code. I'll put together a patch file when I 
>> get a minute. From my testing this can improve performance by a factor of 
>> around 10.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.



-- 
Richard Marr
richard.m...@gmail.com
07976 910 515




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This is the latest version. I wasn't working on it at quite such a ridiculous 
hour this time so it should be better.

It includes fixed cache logic, a few comments, the LRU object applied in the 
right place, and some test cases demonstrating that things behave as expected. 
I'll do some more testing when I have a free evening.

I have some questions:

 a) org.apache.lucene.search.similar doesn't seem like the right place for a 
generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

 b) Having the cache dependent on both the MLT object and the IndexReader 
object seems a bit... odd. I suspect the right place for this cache is in the 
IndexReader, but suspect that would be a can of worms. Comments?






[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-29 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736525#action_12736525
 ] 

Richard Marr commented on LUCENE-1690:
--

There's also another problem I've just noticed. Please ignore the latest patch.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-28 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LruCache.patch

Attached is a draft of an implementation that uses a WeakHashMap to bind the 
cache to the IndexReader instance, and a LinkedHashMap to provide LRU 
functionality.
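
To make the idea concrete without pasting the whole patch, here is a
stripped-down sketch of the same combination. It is not the attached
code: ReaderBoundLruCache and its method names are made up for this
message, the type parameter R stands in for IndexReader so that it runs
without Lucene on the classpath, and it assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical sketch; R stands in for IndexReader.
class ReaderBoundLruCache<R, K, V> {
    private final int maxEntriesPerReader;

    // Weak keys: once a reader is no longer referenced anywhere else,
    // its per-reader cache becomes eligible for garbage collection.
    private final Map<R, Map<K, V>> perReader = new WeakHashMap<>();

    ReaderBoundLruCache(int maxEntriesPerReader) {
        this.maxEntriesPerReader = maxEntriesPerReader;
    }

    private Map<K, V> cacheFor(R reader) {
        return perReader.computeIfAbsent(reader, r ->
            // accessOrder = true gives least-recently-used eviction order.
            new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntriesPerReader;
                }
            });
    }

    // Coarse locking keeps the sketch simple; neither map is thread-safe on its own.
    synchronized V get(R reader, K key) {
        return cacheFor(reader).get(key);
    }

    synchronized void put(R reader, K key, V value) {
        cacheFor(reader).put(key, value);
    }

    synchronized int entryCount(R reader) {
        return cacheFor(reader).size();
    }
}
```

One thing to watch with this shape: the cached values must not hold a
strong reference back to the reader, otherwise the weak key never
becomes collectable.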

Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or 
bad style in this implementation. I also need to check it meets the project 
coding standards.

Anybody up for giving me some feedback in the meantime?




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733237#action_12733237
 ] 

Richard Marr commented on LUCENE-1690:
--

Okay, so the ideal solution is an LRU cache binding to a specific IndexReader 
instance. I think I can handle that.

Carl, do you have any data on how this has changed performance in your system?  
My use case is a limited vocabulary so the performance gain was large.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-15 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719653#action_12719653
 ] 

Richard Marr commented on LUCENE-1690:
--

Sounds reasonable although that'll take a little longer for me to do. I'll have 
a think about it.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-12 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This patch implements a basic HashMap term frequency cache. It shouldn't affect 
any applications that don't opt in to using it, and applications that do should 
see an order of magnitude performance improvement for MLT queries.

This cache implementation is tied to the MLT object but can be cleared on 
demand.




[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-12 Thread Richard Marr (JIRA)
Morelikethis queries are very slow compared to other search types
-

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor


The MoreLikeThis object performs term frequency lookups for every query.  From 
my testing that's what seems to take up the majority of time for MoreLikeThis 
searches.  

For some (I'd venture many) applications it's not necessary for term statistics 
to be looked up every time. A fairly naive opt-in caching mechanism tied to the 
life of the MoreLikeThis object would allow applications to cache term 
statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I 
get a minute. From my testing this can improve performance by a factor of 
around 10.




Re: MoreLikeThisQuery term frequency caching

2009-04-15 Thread Richard Marr
The cache is currently being stored as a static HashMap on the MLT
object and expired at the discretion of the application code using a
static MLT.flushCache() method. Use of the cache at all is opt-in,
using a non-static MLT.setCache(true) and a new constructor signature
on MLTQuery that includes a useCache parameter.
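
Roughly, the shape is something like the sketch below. This is not the
real code: CachingMlt is a made-up stand-in for the modified MLT class,
the ToIntFunction stands in for the actual index lookup, and I've
swapped the plain HashMap for a ConcurrentHashMap here because the map
is static and shared. Only setCache() and flushCache() correspond to
the methods mentioned above. It assumes Java 8 or later.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.ToIntFunction;

// Hypothetical stand-in for the modified MoreLikeThis class.
class CachingMlt {
    // Static, so it is shared by every instance until explicitly flushed.
    private static final Map<String, Integer> DOC_FREQ_CACHE = new ConcurrentHashMap<>();

    // Stands in for the (slow) per-term lookup against the index.
    private final ToIntFunction<String> docFreqLookup;

    // Off by default: callers have to opt in.
    private boolean useCache = false;

    CachingMlt(ToIntFunction<String> docFreqLookup) {
        this.docFreqLookup = docFreqLookup;
    }

    void setCache(boolean useCache) {
        this.useCache = useCache;
    }

    // Expiry is left entirely to the application.
    static void flushCache() {
        DOC_FREQ_CACHE.clear();
    }

    int docFreq(String term) {
        if (!useCache) {
            return docFreqLookup.applyAsInt(term);
        }
        // Only hits the index the first time a term is seen.
        return DOC_FREQ_CACHE.computeIfAbsent(term, t -> docFreqLookup.applyAsInt(t));
    }
}
```

With this shape, stale entries are the application's problem: the
natural place to call flushCache() is wherever the application opens a
new reader.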

It's not pretty but it's enough for our use case.

Feel free to suggest nicer solutions if you've got them.



2009/4/10 Grant Ingersoll:
> What was your approach to handling stale cache entries?  Did you flush it
> when you opened a new reader?
>
> On Apr 7, 2009, at 2:28 AM, Richard Marr wrote:
>
>> Hi all,
>>
>> I've been exploring MoreLikeThisQuery as part of a recent project and
>> something that came out of that might be useful to others here.
>>
>> I found that using MoreLikeThisQuery could be quite slow for my use
>> case, but that most of the time involved was spent looking up term
>> frequencies to calculate weightings. Since those term frequencies
>> usually don't need to be anywhere near real-time I found that caching
>> them in a hashmap had a very good cost/benefit ratio for my
>> application, speeding up MLT queries by an order of magnitude.
>>
>> My use case was possibly unusual in that I was looking at a limited
>> vocabulary rather than full English, but in theory other applications
>> that make use of the MLT class could benefit.
>>
>> So at this point I have some questions: (1) Have others experienced
>> similar performance characteristics for MLT code? (2) Am I missing
>> some fatal flaw in this approach? (3) Are the modifications worth
>> sharing?
>>
>> Cheers,
>>
>> Rich
>>






Re: MoreLikeThisQuery term frequency caching

2009-04-07 Thread Richard Marr
Thanks Mike,

I'll leave it a few days to give people time to respond then start
looking into creating a Jira ticket and a patch.


2009/4/7 Michael McCandless:
> I don't have direct experience with MLT, but this sounds like a great
> improvement, so in answer to (3) I would say "definitely!".
>
> Mike
>
> On Tue, Apr 7, 2009 at 2:28 AM, Richard Marr wrote:
>> Hi all,
>>
>> I've been exploring MoreLikeThisQuery as part of a recent project and
>> something that came out of that might be useful to others here.
>>
>> I found that using MoreLikeThisQuery could be quite slow for my use
>> case, but that most of the time involved was spent looking up term
>> frequencies to calculate weightings. Since those term frequencies
>> usually don't need to be anywhere near real-time I found that caching
>> them in a hashmap had a very good cost/benefit ratio for my
>> application, speeding up MLT queries by an order of magnitude.
>>
>> My use case was possibly unusual in that I was looking at a limited
>> vocabulary rather than full English, but in theory other applications
>> that make use of the MLT class could benefit.
>>
>> So at this point I have some questions: (1) Have others experienced
>> similar performance characteristics for MLT code? (2) Am I missing
>> some fatal flaw in this approach? (3) Are the modifications worth
>> sharing?
>>
>> Cheers,
>>
>> Rich
>>






MoreLikeThisQuery term frequency caching

2009-04-06 Thread Richard Marr
Hi all,

I've been exploring MoreLikeThisQuery as part of a recent project and
something that came out of that might be useful to others here.

I found that using MoreLikeThisQuery could be quite slow for my use
case, but that most of the time involved was spent looking up term
frequencies to calculate weightings. Since those term frequencies
usually don't need to be anywhere near real-time I found that caching
them in a hashmap had a very good cost/benefit ratio for my
application, speeding up MLT queries by an order of magnitude.
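
To show where the time goes, here is a small sketch of the kind of
weighting I mean. It is an illustration, not the MLT source:
TermWeightSketch and its methods are made-up names, the lookup function
stands in for the per-term call into the index, and the idf formula is
the one I remember from DefaultSimilarity, so treat it as an
assumption. It assumes Java 8 or later.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

// Hypothetical sketch of tf * idf term weighting.
class TermWeightSketch {
    // Assumed formula: log(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // termFreqs: how often each term occurs in the source document.
    // docFreqLookup: how many documents in the index contain the term.
    static Map<String, Double> weigh(Map<String, Integer> termFreqs,
                                     int numDocs,
                                     ToIntFunction<String> docFreqLookup) {
        Map<String, Double> weights = new HashMap<>();
        for (Map.Entry<String, Integer> e : termFreqs.entrySet()) {
            // One index lookup per distinct term: this is the part worth caching.
            int docFreq = docFreqLookup.applyAsInt(e.getKey());
            weights.put(e.getKey(), e.getValue() * idf(docFreq, numDocs));
        }
        return weights;
    }
}
```

The per-term lookup inside the loop is the call that dominated my
profiles, and its result only changes when the index does, which is why
a simple map in front of it pays off.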

My use case was possibly unusual in that I was looking at a limited
vocabulary rather than full English, but in theory other applications
that make use of the MLT class could benefit.

So at this point I have some questions: (1) Have others experienced
similar performance characteristics for MLT code? (2) Am I missing
some fatal flaw in this approach? (3) Are the modifications worth
sharing?

Cheers,

Rich
