[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This is the latest version. I wasn't working on it at quite such a ridiculous 
hour this time, so it should be better.

It includes fixed cache logic, a few comments, the LRU object applied in the 
right place, and some test cases demonstrating that things behave as expected. 
I'll do some more testing when I have a free evening.

I have some questions:

 a) org.apache.lucene.search.similar doesn't seem like the right place for a 
generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

 b) Having the cache dependent on both the MLT object and the IndexReader 
object seems a bit... odd. I suspect the right place for this cache is in the 
IndexReader, but suspect that would be a can of worms. Comments?
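
For reference on question (a), here is a minimal sketch of what a generic LRU 
wrapper over LinkedHashMap could look like. The class name LruCache, the demo 
class, and the capacity of 2 are illustrative assumptions; this is not the code 
from the attached patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical generic LRU cache; not the class from the attached patch. */
class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    LruCache(int maxEntries) {
        // accessOrder=true keeps entries ordered from least to most recently accessed.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // LinkedHashMap calls this after each insertion; returning true
        // evicts the least recently accessed entry.
        return size() > maxEntries;
    }
}

class LruCacheDemo {
    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<String, Integer>(2);
        cache.put("lucene", 10);
        cache.put("solr", 5);
        cache.get("lucene");   // touch "lucene" so "solr" becomes the eldest entry
        cache.put("nutch", 1); // exceeds the capacity, so "solr" is evicted
        System.out.println(cache.keySet()); // prints [lucene, nutch]
    }
}
```

Note that LinkedHashMap is not thread-safe, and in access-order mode even get() 
modifies the map, so a shared instance would need external synchronization.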



 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-29 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12736525#action_12736525
 ] 

Richard Marr commented on LUCENE-1690:
--

There's also another problem I've just noticed. Please ignore the latest patch.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-28 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LruCache.patch

Attached is a draft of an implementation that uses a WeakHashMap to bind the 
cache to the IndexReader instance, and a LinkedHashMap to provide LRU 
functionality.

Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or 
bad style in this implementation. I also need to check it meets the project 
coding standards.

Anybody up for giving me some feedback in the meantime?
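
As an illustration of the approach described above, here is a minimal sketch of 
a WeakHashMap keyed by the reader, with one LRU LinkedHashMap per reader. The 
class name PerReaderCache, the use of Object as a stand-in for IndexReader, and 
the String-to-Integer value type are all assumptions made to keep the example 
self-contained; this is not the code from LruCache.patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical sketch: per-reader frequency caches held through weak keys. */
class PerReaderCache {
    private final int maxEntriesPerReader;

    // Weak keys: once a reader is no longer strongly referenced elsewhere,
    // its entry (and the cache stored under it) becomes eligible for collection.
    // The cached values must not hold a strong reference back to the reader.
    private final Map<Object, Map<String, Integer>> caches =
            new WeakHashMap<Object, Map<String, Integer>>();

    PerReaderCache(int maxEntriesPerReader) {
        this.maxEntriesPerReader = maxEntriesPerReader;
    }

    /** Returns the LRU map for this reader, creating it on first use. */
    synchronized Map<String, Integer> forReader(Object reader) {
        Map<String, Integer> cache = caches.get(reader);
        if (cache == null) {
            final int max = maxEntriesPerReader;
            // Access-ordered LinkedHashMap that evicts its eldest entry when full.
            cache = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                    return size() > max;
                }
            };
            caches.put(reader, cache);
        }
        return cache;
    }
}
```

Only the lookup of the per-reader map is synchronized here; the returned map is 
a plain LinkedHashMap, so callers sharing it across threads would need their 
own locking.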




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733237#action_12733237
 ] 

Richard Marr commented on LUCENE-1690:
--

Okay, so the ideal solution is an LRU cache binding to a specific IndexReader 
instance. I think I can handle that.

Carl, do you have any data on how this has changed performance in your system?  
My use case is a limited vocabulary so the performance gain was large.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-15 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719653#action_12719653
 ] 

Richard Marr commented on LUCENE-1690:
--

Sounds reasonable although that'll take a little longer for me to do. I'll have 
a think about it.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-13 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This patch implements a basic HashMap-based term frequency cache. It shouldn't 
affect any applications that don't opt in to using it, and applications that do 
should see an order of magnitude performance improvement for MLT queries.

This cache implementation is tied to the MLT object but can be cleared on 
demand.
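
To make the opt-in behaviour concrete, here is a minimal sketch of a cache that 
is off by default, memoizes lookups once enabled, and can be cleared on demand. 
The DocFreqSource interface and the CachingDocFreq class are hypothetical 
stand-ins for the index lookup and the MLT object; they are not taken from the 
patch or from the Lucene API.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the per-term index lookup. */
interface DocFreqSource {
    int docFreq(String field, String text);
}

/** Hypothetical opt-in cache in front of a DocFreqSource. */
class CachingDocFreq {
    private final DocFreqSource source;
    private final Map<String, Integer> cache = new HashMap<String, Integer>();
    private boolean enabled = false; // off by default, so existing callers see no change

    CachingDocFreq(DocFreqSource source) {
        this.source = source;
    }

    /** Applications opt in explicitly. */
    void setCacheEnabled(boolean enabled) {
        this.enabled = enabled;
    }

    /** Drops all cached statistics, e.g. after the index has changed. */
    void clearCache() {
        cache.clear();
    }

    int docFreq(String field, String text) {
        if (!enabled) {
            return source.docFreq(field, text); // uncached path, same as before
        }
        String key = field + '\0' + text; // NUL separator avoids field/text ambiguity
        Integer cached = cache.get(key);
        if (cached == null) {
            cached = source.docFreq(field, text);
            cache.put(key, cached);
        }
        return cached;
    }
}
```

Because this cache is unbounded and never invalidates itself, the application 
decides when the statistics are stale and calls clearCache() accordingly.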




[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-12 Thread Richard Marr (JIRA)
Morelikethis queries are very slow compared to other search types
-

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor


The MoreLikeThis object performs term frequency lookups for every query.  From 
my testing that's what seems to take up the majority of time for MoreLikeThis 
searches.  

For some (I'd venture many) applications it's not necessary for term statistics 
to be looked up every time. A fairly naive opt-in caching mechanism tied to the 
life of the MoreLikeThis object would allow applications to cache term 
statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I 
get a minute. From my testing this can improve performance by a factor of 
around 10.
