MoreLikeThis: fieldNames and Query

2010-02-13 Thread Shay Banon
Hi,

  I have a few questions regarding more like this:

1. In MoreLikeThis, the check for fieldNames being null (and fetching them
from the reader in that case) does not seem to be done in all of the like
methods. For example, it does not appear to be done at all in like(Reader r),
whereas it is done in like(File f).

2. In the MoreLikeThisQuery rewrite method, there is an unnecessary conversion
to bytes and back to a string. I think this:

BooleanQuery bq = (BooleanQuery) mlt.like(
    new ByteArrayInputStream(likeText.getBytes()));

should be replaced with:

BooleanQuery bq = (BooleanQuery) mlt.like(
    new StringReader(likeText));

What do you think?
-shay.banon
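
The second point can be illustrated outside Lucene. The sketch below is not
the MoreLikeThisQuery code; it only assumes that the byte stream is decoded
back into characters somewhere downstream. The class and method names
(LikeTextRoundTrip, viaBytes, viaStringReader) are made up for this example,
and the two explicit charsets stand in for a platform default that differs
between the encoding and decoding side.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Lucene.
class LikeTextRoundTrip {

    // Drains a Reader into a String so the two paths can be compared.
    static String readAll(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // String -> bytes -> Reader. The explicit charsets stand in for whatever
    // the default charset happens to be on the encoding and decoding side.
    static String viaBytes(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return readAll(new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith));
    }

    // String -> Reader directly; no charset is involved at all.
    static String viaStringReader(String text) {
        return readAll(new StringReader(text));
    }

    public static void main(String[] args) {
        String likeText = "caf\u00e9";
        // Direct path: the text arrives unchanged.
        System.out.println(likeText.equals(viaStringReader(likeText)));
        // Byte path with mismatched charsets: the accented character is lost.
        System.out.println(likeText.equals(
            viaBytes(likeText, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8)));
    }
}
```

With matching charsets the byte path is merely redundant; with mismatched
ones it can silently alter non-ASCII text, which is a second reason to prefer
the StringReader form.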


[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-12-15 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790889#action_12790889
 ] 

Otis Gospodnetic commented on LUCENE-1910:
--

* I'll second Mark's suggestion to extract the Information Gain piece of the 
patch into separate class(es), so we can reuse it in other places.  It looks 
like it's currently an integral part of MoreLikeThisUsingTags class.  Would 
that be possible?

* I noticed the code needs ASL (the Apache Software License) added.

* Also, could you please use the Lucene code format? (Eclipse/IntelliJ 
templates are at the bottom of 
http://wiki.apache.org/lucene-java/HowToContribute )


 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch


 I would like to contribute a class based on the MoreLikeThis class in
 contrib/queries that generates a query based on the tags associated
 with a document. The class assumes that documents are tagged with a
 set of tags (which are stored in the index in a separate Field). The
 class determines the top document terms associated with a given tag
 using the information gain metric.
 While generating a MoreLikeThis query for a document, the tags
 associated with the document are used to determine the terms in the query.
 This class is useful for finding documents similar to a document that
 does not have many relevant terms but was tagged.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor




[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783010#action_12783010
 ] 

Thomas D'Silva commented on LUCENE-1910:


Mark,

I refactored the code so that the tag and document probabilities are computed 
during the index creation phase and used to find the most important document 
terms corresponding to a given tag term. These most important document terms 
(ranked by information gain) for a given tag term are stored as meta 
information in the index when the index is created. I added a class, 
TagIndexWriter, which extends IndexWriter and is used to create an index on 
which MoreLikeThisUsingTags queries can be run. 

I recreated a test index with one million documents, and assigned tags 
(tag_0, ..., tag_4) to 10%, 20%, and so on of the documents. 

The time taken to generate a query on an index created using TagIndexWriter:
tag name, number of documents, time in ms
tag_0, 10134, 22
tag_1, 19996, 29
tag_2, 30010, 6
tag_3, 39907, 6
tag_4, 50148, 9

Since the document terms corresponding to a tag term are computed during the 
indexing phase, the time taken to generate a MoreLikeThisUsingTags query is 
effectively constant, independent of the number of documents that carry the 
tag. 

Thanks,
Thomas
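
The effect described above (ranking work done once at index time, query
generation reduced to a lookup) can be sketched as follows. This is not the
code from the patch and does not use TagIndexWriter or any Lucene API; the
class TagTermTable and its methods are hypothetical, and an in-memory map
stands in for the meta information stored in the index.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the per-tag term lists stored at index time.
class TagTermTable {
    private final Map<String, List<String>> topTermsByTag = new HashMap<>();

    // Index-time step: remember the already ranked terms for a tag,
    // keeping at most maxTerms of them.
    void put(String tag, List<String> rankedTerms, int maxTerms) {
        int end = Math.min(maxTerms, rankedTerms.size());
        topTermsByTag.put(tag, new ArrayList<>(rankedTerms.subList(0, end)));
    }

    // Query-time step: one map lookup per tag of the source document.
    // The cost does not depend on how many documents carry each tag.
    List<String> queryTerms(Collection<String> documentTags) {
        LinkedHashSet<String> terms = new LinkedHashSet<>();
        for (String tag : documentTags) {
            terms.addAll(topTermsByTag.getOrDefault(tag, Collections.emptyList()));
        }
        return new ArrayList<>(terms);
    }
}
```

The terms returned by queryTerms would then be turned into query clauses; how
the patch does that is not shown in this thread.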

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Assigned: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless reassigned LUCENE-1993:
--

Assignee: Michael McCandless

 MoreLikeThis - allow to exclude terms that appear in too many documents 
 (patch included)
 

 Key: LUCENE-1993
 URL: https://issues.apache.org/jira/browse/LUCENE-1993
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Christian Steinert
Assignee: Michael McCandless
 Attachments: MoreLikeThis.java.patch

   Original Estimate: 0.17h
  Remaining Estimate: 0.17h

 The MoreLikeThis class generates a likeness query based on a given 
 document. So far, it has been impossible to suppress words that appear in 
 almost all documents from the likeness query, which makes it necessary to 
 use extensive lists of stop words.
 I therefore suggest allowing words to be excluded when they exceed a certain 
 absolute document count or a certain percentage of documents. Depending on 
 the corpus of text, words that appear in more than 50 or even 70% of 
 documents can usually be considered insignificant for classifying a document. 
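
The proposed exclusion rule can be sketched in a few lines. This is an
illustration only, not the attached MoreLikeThis.java.patch: the class
DocFreqFilter and its methods are hypothetical, and a plain map of term to
document frequency stands in for what would normally be read from the index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not the code from the patch.
class DocFreqFilter {

    // Converts a percentage threshold into an absolute document count.
    static int maxDocFreqFromPercent(int numDocs, int percent) {
        return (int) ((long) numDocs * percent / 100);
    }

    // Keeps only the terms that appear in at most maxDocFreq documents.
    // The result is sorted so that it does not depend on map iteration order.
    static List<String> keepRareEnough(Map<String, Integer> docFreqs, int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() <= maxDocFreq) {
                kept.add(e.getKey());
            }
        }
        Collections.sort(kept);
        return kept;
    }
}
```

For a corpus of 1000 documents, a 50% threshold corresponds to an absolute
limit of 500 documents, so a term found in 510 documents would be dropped
while one found in 40 would be kept.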
  




[jira] Commented: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767781#action_12767781
 ] 

Michael McCandless commented on LUCENE-1993:


Patch looks good... I'll commit shortly.

 MoreLikeThis - allow to exclude terms that appear in too many documents 
 (patch included)
 

 Key: LUCENE-1993
 URL: https://issues.apache.org/jira/browse/LUCENE-1993
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Christian Steinert
Assignee: Michael McCandless
 Attachments: MoreLikeThis.java.patch

   Original Estimate: 0.17h
  Remaining Estimate: 0.17h




[jira] Resolved: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-20 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1993.


   Resolution: Fixed
Fix Version/s: 3.0

Thanks Christian!

 MoreLikeThis - allow to exclude terms that appear in too many documents 
 (patch included)
 

 Key: LUCENE-1993
 URL: https://issues.apache.org/jira/browse/LUCENE-1993
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Christian Steinert
Assignee: Michael McCandless
 Fix For: 3.0

 Attachments: MoreLikeThis.java.patch

   Original Estimate: 0.17h
  Remaining Estimate: 0.17h




[jira] Updated: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-19 Thread Christian Steinert (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Christian Steinert updated LUCENE-1993:
---

Attachment: MoreLikeThis.java.patch

suggested patch against current SVN head

 MoreLikeThis - allow to exclude terms that appear in too many documents 
 (patch included)
 

 Key: LUCENE-1993
 URL: https://issues.apache.org/jira/browse/LUCENE-1993
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Christian Steinert
 Attachments: MoreLikeThis.java.patch

   Original Estimate: 0.17h
  Remaining Estimate: 0.17h




[jira] Created: (LUCENE-1993) MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)

2009-10-19 Thread Christian Steinert (JIRA)
MoreLikeThis - allow to exclude terms that appear in too many documents (patch 
included)


 Key: LUCENE-1993
 URL: https://issues.apache.org/jira/browse/LUCENE-1993
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.9
Reporter: Christian Steinert
 Attachments: MoreLikeThis.java.patch

The MoreLikeThis class generates a likeness query based on a given 
document. So far, it has been impossible to suppress words that appear in 
almost all documents from the likeness query, which makes it necessary to use 
extensive lists of stop words.

I therefore suggest allowing words to be excluded when they exceed a certain 
absolute document count or a certain percentage of documents. Depending on 
the corpus of text, words that appear in more than 50 or even 70% of documents 
can usually be considered insignificant for classifying a document.  




[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-05 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12762290#action_12762290
 ] 

Mark Harwood commented on LUCENE-1910:
--

 2 minutes to create a query based on 10,000 documents?

Unfortunately, I can't see this being generally useful until the performance is 
improved dramatically.


 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Issue Comment Edited: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-04 Thread Thomas D'Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761997#action_12761997
 ] 

Thomas D'Silva edited comment on LUCENE-1910 at 10/4/09 10:21 AM:
--

Mark,

I refactored the class to include more descriptive variable names. I also 
modified the code so that, while calculating information gain, only terms 
belonging to documents that have been tagged with the given tag are used (and 
not all the terms in the index). 
I tested this class on a test index containing one million documents. The 
documents were tagged with five tags (tag_0...tag_4). tag_0 was assigned to 
approximately 10% of the documents, tag_1 to 20% and so on. 

tag name, number of documents, time in ms
tag_0, 10134, 137314
tag_1, 19996, 219527
tag_2, 30010, 315336
tag_3, 39907, 413615
tag_4, 50148, 507350

The time taken to generate the query for a tag depends on the number of 
documents in the index containing the tag and scales linearly with the number 
of documents. 
The top document terms for a given tag are cached in a HashMap once they have 
been generated, in order to speed up subsequent lookups.

Thanks,
Thomas
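
For readers unfamiliar with the metric, the sketch below computes information
gain in its usual textbook form: the reduction in entropy of "document has the
tag" once it is known whether the document contains the term. The thread does
not show the exact formula used in the patch, so this is an assumption, and
the class InformationGain and its parameter names are hypothetical.

```java
// Hypothetical helper; the patch's own formula is not shown in this thread.
class InformationGain {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // Entropy in bits of a yes/no event with probability p.
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0;
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    // numDocs:  all documents considered
    // tagDocs:  documents carrying the tag
    // termDocs: documents containing the term
    // bothDocs: documents carrying the tag and containing the term
    static double gain(long numDocs, long tagDocs, long termDocs, long bothDocs) {
        double hTag = entropy((double) tagDocs / numDocs);
        double pTerm = (double) termDocs / numDocs;
        long noTermDocs = numDocs - termDocs;
        // Entropy of the tag among documents that contain the term.
        double hGivenTerm = termDocs == 0 ? 0.0 : entropy((double) bothDocs / termDocs);
        // Entropy of the tag among documents that do not contain the term.
        double hGivenNoTerm = noTermDocs == 0
            ? 0.0
            : entropy((double) (tagDocs - bothDocs) / noTermDocs);
        return hTag - (pTerm * hGivenTerm + (1.0 - pTerm) * hGivenNoTerm);
    }
}
```

A term that occurs in exactly the tagged documents gets the maximum gain
(the full entropy of the tag), while a term that is spread evenly across
tagged and untagged documents gets a gain of zero.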

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch, LUCENE-1910.patch





[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-04 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-21 Thread Mark Harwood (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757924#action_12757924
 ] 

Mark Harwood commented on LUCENE-1910:
--

Hi Thomas,
Following your request for feedback, some initial thoughts from a very quick 
look.

* The Information Gain algo could use a little more explanation e.g. using 
variable names other than num1 and num2 and could perhaps be extracted into 
a utility class

* Is this scalable? It looks like in initialize it is loading this:
{code:title=MoreLikeThisUsingTags.java|borderStyle=solid}
/**
  * All terms in the index
  */
protected HashSet docTerms=new HashSet();
{code} 
...that seems a little scary!
It's also running a separate BooleanQuery for every item in this list (and 
repeated for 1 tag?). That looks like a lot of searches.

I need to spend a little more time looking at it before I understand it in more 
detail.
Before then - have you tested this on a big (millions of docs/terms) index? 
Some performance figures would be useful to accompany this.

Cheers,
Mark


 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-14 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Priority: Minor  (was: Major)

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
 Attachments: LUCENE-1910.patch





[jira] Created: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)
Extension to MoreLikeThis to use tag information


 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva


I would like to contribute a class based on the MoreLikeThis class in
contrib/queries that generates a query based on the tags associated
with a document. The class assumes that documents are tagged with a
set of tags (which are stored in the index in a separate Field). The
class determines the top document terms associated with a given tag
using the information gain metric.

While generating a MoreLikeThis query for a document, the tags
associated with the document are used to determine the terms in the query.
This class is useful for finding documents similar to a document that
does not have many relevant terms but was tagged.




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
 Attachments: LUCENE-1910.patch





[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
 Attachments: LUCENE-1910.patch





[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
 Attachments: LUCENE-1910.patch





MoreLikeThis Extension for documents that have tags

2009-09-03 Thread Thomas D'Silva
Hi,

I would like to contribute a class based on the MoreLikeThis class in
contrib/queries that generates a query based on the tags associated
with a document. The class assumes that documents are tagged with a
set of tags (which are stored in the index in a separate Field). The
class determines the top document terms associated with a given tag
using the information gain metric.

While generating a MoreLikeThis query for a document the tags
associated with the document are used to determine the terms in the query.
This class is useful for finding similar documents to a document that
does not have many relevant terms but was tagged.

I have attached the class and a test class and would appreciate any feedback.

Thank you,
Thomas




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This is the latest version. I wasn't working on it at quite such a ridiculous 
hour this time so it should be better.

It includes fixed cache logic, a few comments, the LRU object applied in the 
right place, and some test cases demonstrating that things behave as expected. I'll 
do some more testing when I have a free evening.

I have some questions:

 a) org.apache.lucene.search.similar doesn't seem like the right place for a 
generic LRU LinkedHashMap wrapper. Is there an existing class I can use instead?

 b) Having the cache dependent on both the MLT object and the IndexReader 
object seems a bit... odd. I suspect the right place for this cache is in the 
IndexReader, but suspect that would be a can of worms. Comments?



 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737059#action_12737059
 ] 

Michael McCandless commented on LUCENE-1690:


OK now I feel silly -- this cache is in fact very similar to the caching that 
Lucene already does, internally!  Sorry I didn't catch this overlap sooner.

In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, that 
holds recently retrieved terms and their TermInfo.  It uses 
oal.util.cache.SimpleLRUCache.

There are some important differences from this new cache in MLT.  E.g., it holds 
the entire TermInfo, not just the docFreq.  Plus, it's a central cache for any 
& all term lookups that go through the SegmentReader.  Also, it's stored in 
thread-private storage, so each thread has its own cache.
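
To make the "thread-private storage" point concrete, here is a minimal JDK-only sketch. It is not the TermInfosReader code: the class name ThreadPrivateTermCache is hypothetical, and a String -> Integer mapping stands in for Term -> TermInfo. It only shows that an entry cached by one thread is invisible to another.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; this is not Lucene's TermInfosReader. */
final class ThreadPrivateTermCache {
    private static final int MAX_ENTRIES = 1024; // default size quoted in the comment

    // Each thread lazily receives its own access-ordered (LRU) map.
    private final ThreadLocal<Map<String, Integer>> perThread =
        new ThreadLocal<Map<String, Integer>>() {
            @Override
            protected Map<String, Integer> initialValue() {
                return new LinkedHashMap<String, Integer>(16, 0.75f, true) {
                    @Override
                    protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                        return size() > MAX_ENTRIES;
                    }
                };
            }
        };

    /** Returns the value cached by the calling thread, or null if absent. */
    Integer get(String term) {
        return perThread.get().get(term);
    }

    /** Caches a value for the calling thread only. */
    void put(String term, int docFreq) {
        perThread.get().put(term, docFreq);
    }
}
```

With this layout no synchronization is needed on lookups, at the cost of every thread warming (and holding) its own copy of the same data.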

But, now I'm confused: how come you are not already seeing the benefits of this 
cache?  You ought to see MLT queries going faster.  This core cache was first 
added in 2.4.x; it looks like you were testing against 2.4.1 (from the Affects 
Version on this issue).

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr
Yeah, having this stuff stored centrally behind the IndexReader seems
like a better idea than having it in client classes. My shallow
knowledge of the code isn't helping me explain why it's not performing
though.

Out of interest, how come it's a per-thread cache? I don't understand
all the issues involved but that surprised me.




2009/7/30 Michael McCandless (JIRA) <j...@apache.org>:

    [ 
 https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737059#action_12737059
  ]

 Michael McCandless commented on LUCENE-1690:
 

 OK now I feel silly -- this cache is in fact very similar to the caching that 
 Lucene already does, internally!  Sorry I didn't catch this overlap sooner.

 In oal.index.TermInfosReader.java there's an LRU cache, default size 1024, 
 that holds recently retrieved terms and their TermInfo.  It uses 
 oal.util.cache.SimpleLRUCache.

 There are some important differences from this new cache in MLT.  E.g., it 
 holds the entire TermInfo, not just the docFreq.  Plus, it's a central cache 
 for any & all term lookups that go through the SegmentReader.  Also, it's 
 stored in thread-private storage, so each thread has its own cache.

 But, now I'm confused: how come you are not already seeing the benefits of 
 this cache?  You ought to see MLT queries going faster.  This core cache was 
 first added in 2.4.x; it looks like you were testing against 2.4.1 (from the 
 Affects Version on this issue).

 Morelikethis queries are very slow compared to other search types
 -

                 Key: LUCENE-1690
                 URL: https://issues.apache.org/jira/browse/LUCENE-1690
             Project: Lucene - Java
          Issue Type: Improvement
          Components: contrib/*
    Affects Versions: 2.4.1
            Reporter: Richard Marr
            Priority: Minor
         Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.






-- 
Richard Marr
richard.m...@gmail.com
07976 910 515




Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Michael McCandless
On Thu, Jul 30, 2009 at 6:28 AM, Richard Marr <richard.m...@gmail.com> wrote:
 Yeah, having this stuff stored centrally behind the IndexReader seems
 like a better idea than having it in client classes. My shallow
 knowledge of the code isn't helping me explain why it's not performing
 though.

 Out of interest, how come it's a per-thread cache? I don't understand
 all the issues involved but that surprised me.

Good question... making it thread private seems rather wasteful since
at heart this information (Term -> TermInfo) is constant across
threads and so we're wasting RAM.

Also, it's a non-trivial amount of RAM that we're tying up once the
cache is full: 1024 entries times maybe ~120 bytes per TermInfo on a
64-bit JRE = ~120 KB per thread, and it's somewhat devilish/unexpected
(principle of least surprise) for Lucene to do this to any threads that
come through it.
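
The estimate above is simple multiplication; the sketch below just spells it out. The 1024 entries and ~120 bytes per entry are the figures from this message; the thread count in the usage note is a made-up number for illustration, and the class name is hypothetical.

```java
/** Back-of-the-envelope footprint of per-thread caches; all inputs are estimates. */
final class TermInfoCacheFootprint {
    /** Approximate bytes held by one full thread-private cache. */
    static long bytesPerThread(int entries, int bytesPerEntry) {
        return (long) entries * bytesPerEntry;
    }

    /** Approximate bytes held when every one of the given threads has a full cache. */
    static long totalBytes(int threads, int entries, int bytesPerEntry) {
        return threads * bytesPerThread(entries, bytesPerEntry);
    }
}
```

For example, bytesPerThread(1024, 120) gives 122,880 bytes (120 KiB), and a hypothetical 50 searching threads would tie up 6,144,000 bytes in total for the same Term -> TermInfo data.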

I think one reason was to avoid having to synchronize on the lookups,
though with magic similar to LUCENE-1607 we could presumably make it
lockless.

Plus, the original motivation for this (LUCENE-1195) was because
queries in general look up the same term at least 2 times during their
execution (weight (idf computation), get postings), and so I think we
wanted to ensure that a single thread doing its query would not see
its terms evicted (due to many other threads coming through) by the
2nd time it needed to use them.  But if we made the central cache
large enough, perhaps growing if it detects many threads, then this
(other threads evicted my entries before I finished my query)
shouldn't be a problem in practice.
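
As a rough picture of the shared alternative, here is a sketch of one central cache built on java.util.concurrent.ConcurrentHashMap. This is an assumption-laden illustration, not the LUCENE-1607 technique and not a proposal for TermInfosReader: the class name SharedTermCache and the loader function are hypothetical, String -> Integer stands in for Term -> TermInfo, and the map is unbounded, so the sizing and eviction questions raised above are left out.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical shared cache: one map for all threads, no single global lock. */
final class SharedTermCache {
    private final ConcurrentHashMap<String, Integer> cache =
        new ConcurrentHashMap<String, Integer>();
    private final Function<String, Integer> loader; // stands in for the real term lookup
    private final AtomicInteger loads = new AtomicInteger();

    SharedTermCache(Function<String, Integer> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only on a miss. */
    int docFreq(String term) {
        return cache.computeIfAbsent(term, t -> {
            loads.incrementAndGet();
            return loader.apply(t);
        });
    }

    /** Number of times the underlying lookup actually ran. */
    int loadCount() {
        return loads.get();
    }
}
```

A query that looks up the same term twice (once for idf, once for postings) pays for the underlying lookup only once here, provided nothing evicts the entry in between, which is exactly the property the per-thread design was protecting.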

Mike




Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Richard Marr
2009/7/30 Michael McCandless <luc...@mikemccandless.com>:
 Good question...

Good answer. Thanks.

I guess the next step then is to understand why the TermInfo cache
isn't getting the performance to where it could be. It'll take me a
while to get to the point where I can answer that question. If
anyone's in a hurry it'd probably be worth someone looking at it.

Rich




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Carl Austin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12737107#action_12737107
 ] 

Carl Austin commented on LUCENE-1690:
-

The cache in TermInfosReader is for everything, as you say. I do a lot of stuff 
with terms, and those terms will get pushed out of this LRU cache very quickly. 
I have a separate cache on my version of the MLT. This has the advantage of 
those terms only being pushed out by other MLT queries, and not by everything 
else I am doing that is not MLT related. 
A lot of MLTs use the same terms, and I have a good-sized cache for it, meaning 
most terms I use in MLT can be retrieved from there. Seeing as MLT in my 
circumstance is one of the slower bits, this can give me a good advantage.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




Re: [jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-30 Thread Michael Busch

On 7/30/09 4:10 AM, Michael McCandless wrote:

Plus, the original motivation for this (LUCENE-1195) was because
queries in general look up the same term at least 2 times during their
execution (weight (idf computation), get postings), and so I think we
wanted to ensure that a single thread doing its query would not see
its terms evicted (due to many other threads coming through) by the
2nd time it needed to use them.  But if we made the central cache
large enough, perhaps growing if it detects many threads, then this
(other threads evicted my entries before I finished my query)
shouldn't be a problem in practice.

Mike



   


Yes this was part of the motivation. Especially wildcard or range 
queries could wipe out the entire cache before another thread does its 
second term lookup.


If we had a lock-less cache then I agree simply making it larger would 
probably be better than having separate caches per thread.
Also we should probably optimize the most common cases... if in rare 
situations certain queries wipe out the cache it might not be such a big 
deal.


 Michael




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-29 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736525#action_12736525
 ] 

Richard Marr commented on LUCENE-1690:
--

There's also another problem I've just noticed. Please ignore the latest patch.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-28 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LruCache.patch

Attached is a draft of an implementation that uses a WeakHashMap to bind the 
cache to the IndexReader instance, and a LinkedHashMap to provide LRU 
functionality.
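
For readers without the attachment, the combination described above can be sketched roughly as follows. This is not the contents of LruCache.patch: the class name PerReaderLruCache and its methods are hypothetical, and a generic type parameter R stands in for IndexReader so the example runs on the JDK alone.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.WeakHashMap;

/** Hypothetical sketch: one bounded LRU map per reader, held via weak keys. */
final class PerReaderLruCache<R, K, V> {
    private final int maxEntries;
    // Weak keys: once a reader is no longer referenced elsewhere, its cache can be collected.
    private final Map<R, Map<K, V>> byReader = new WeakHashMap<R, Map<K, V>>();

    PerReaderLruCache(int maxEntries) {
        this.maxEntries = maxEntries;
    }

    private Map<K, V> cacheFor(R reader) {
        Map<K, V> lru = byReader.get(reader);
        if (lru == null) {
            // accessOrder=true makes iteration order least-recently-used first.
            lru = new LinkedHashMap<K, V>(16, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                    return size() > maxEntries;
                }
            };
            byReader.put(reader, lru);
        }
        return lru;
    }

    synchronized V get(R reader, K key) {
        return cacheFor(reader).get(key);
    }

    synchronized void put(R reader, K key, V value) {
        cacheFor(reader).put(key, value);
    }

    synchronized int sizeFor(R reader) {
        return cacheFor(reader).size();
    }
}
```

One caveat with this layout: the cached values must not hold a strong reference back to the reader, otherwise the WeakHashMap entry can never be cleared. The methods are synchronized because an access-ordered LinkedHashMap is structurally modified even by get.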

Disclaimer: I'm not fluent in Java or OSS contribution so there may be holes or 
bad style in this implementation. I also need to check it meets the project 
coding standards.

Anybody up for giving me some feedback in the meantime?

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-28 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12736149#action_12736149
 ] 

Michael McCandless commented on LUCENE-1690:


The getTermFrequency method looks like it'll incorrectly put 0 into the cache, 
when the field was in the top-level cache but the term text wasn't in the 2nd 
level cache?
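
The patch itself is not reproduced in this thread, so the following is only a guess at the shape of the problem: a two-level map (field -> term text -> frequency) where a miss at the second level must fall through to the real lookup rather than being recorded as 0. The class TwoLevelDocFreqCache and the DocFreqSource interface are hypothetical names; DocFreqSource stands in for the IndexReader lookup.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical two-level cache; a miss at either level consults the source. */
final class TwoLevelDocFreqCache {
    /** Stand-in for the real index lookup. */
    interface DocFreqSource {
        int docFreq(String field, String text);
    }

    private final Map<String, Map<String, Integer>> byField =
        new HashMap<String, Map<String, Integer>>();
    private final DocFreqSource source;

    TwoLevelDocFreqCache(DocFreqSource source) {
        this.source = source;
    }

    int docFreq(String field, String text) {
        Map<String, Integer> byText = byField.get(field);
        if (byText == null) {
            // First-level miss: create the per-field map.
            byText = new HashMap<String, Integer>();
            byField.put(field, byText);
        }
        Integer cached = byText.get(text);
        if (cached == null) {
            // Second-level miss: look the value up before caching it,
            // instead of storing a default of 0.
            cached = Integer.valueOf(source.docFreq(field, text));
            byText.put(text, cached);
        }
        return cached.intValue();
    }
}
```

The point of the sketch is that "field already cached" and "term already cached" are separate conditions, and only the second one justifies skipping the underlying lookup.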

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LruCache.patch, LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Closed: (LUCENE-1697) MoreLikeThis should use the new Token API

2009-07-24 Thread Michael Busch (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Busch closed LUCENE-1697.
-

Resolution: Duplicate

This will be fixed as part of LUCENE-1460.

 MoreLikeThis should use the new Token API
 -

 Key: LUCENE-1697
 URL: https://issues.apache.org/jira/browse/LUCENE-1697
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Assignee: Michael Busch
Priority: Minor
 Fix For: 2.9


 The MoreLikeThis functionality needs to be converted to use the new 
 TokenStream API.
 See also LUCENE-1695.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Carl Austin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733234#action_12733234
 ] 

Carl Austin commented on LUCENE-1690:
-

The cache used for this is a HashMap and this is unbounded.  Perhaps this 
should be an LRU cache with a settable maximum number of entries to stop it 
growing forever if you do a lot of more like this queries on large indexes with many 
unique terms.
Otherwise a nice addition; it has sped up my more like this queries a bit.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733237#action_12733237
 ] 

Richard Marr commented on LUCENE-1690:
--

Okay, so the ideal solution is an LRU cache binding to a specific IndexReader 
instance. I think I can handle that.

Carl, do you have any data on how this has changed performance in your system?  
My use case is a limited vocabulary so the performance gain was large.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-07-20 Thread Carl Austin (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12733238#action_12733238
 ] 

Carl Austin commented on LUCENE-1690:
-

I wasn't all that scientific I am afraid, just noting that it improved 
performance enough once warmed up to keep on using it. Sorry.
However, after just 3 or 4 more like this queries I am seeing a definite 
improvement, as the majority of freetext is standard vocab, and the unique 
terms only make up a small amount of the rest of the text.


 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Resolved: (LUCENE-1272) Support for boost factor in MoreLikeThis

2009-07-14 Thread Michael McCandless (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael McCandless resolved LUCENE-1272.


Resolution: Fixed

Thanks Jonathan!

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Michael McCandless
Priority: Minor
 Fix For: 2.9

 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 besides the relevancy returned by MoreLikeThis. This is helpful when having 
 more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Created: (LUCENE-1697) MoreLikeThis should use the new Token API

2009-06-16 Thread Grant Ingersoll (JIRA)
MoreLikeThis should use the new Token API
-

 Key: LUCENE-1697
 URL: https://issues.apache.org/jira/browse/LUCENE-1697
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


The MoreLikeThis functionality needs to be converted to use the new TokenStream 
API.

See also LUCENE-1695.




[jira] Commented: (LUCENE-1697) MoreLikeThis should use the new Token API

2009-06-16 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12720265#action_12720265
 ] 

Mark Miller commented on LUCENE-1697:
-

I'm trying to get all of 2.9 assigned. If you don't want this one Grant, we 
should assign to Michael as this is a part of LUCENE-1460.

 MoreLikeThis should use the new Token API
 -

 Key: LUCENE-1697
 URL: https://issues.apache.org/jira/browse/LUCENE-1697
 Project: Lucene - Java
  Issue Type: Improvement
Reporter: Grant Ingersoll
Priority: Minor
 Fix For: 2.9


 The MoreLikeThis functionality needs to be converted to use the new 
 TokenStream API.
 See also LUCENE-1695.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-15 Thread Richard Marr (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719653#action_12719653
 ] 

Richard Marr commented on LUCENE-1690:
--

Sounds reasonable although that'll take a little longer for me to do. I'll have 
a think about it.

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Updated: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-13 Thread Richard Marr (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Marr updated LUCENE-1690:
-

Attachment: LUCENE-1690.patch

This patch implements a basic hashmap term frequency cache. It shouldn't affect 
any applications that don't opt-in to using it, and applications that do should 
see an order of magnitude performance improvement for MLT queries.

This cache implementation is tied to the MLT object but can be cleared on 
demand.
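
For readers following the thread, here is a minimal sketch of what an opt-in, clearable term frequency cache of this kind might look like. It is not the code in LUCENE-1690.patch: the class name CachedDocFreq, its methods, and the ToIntFunction standing in for the index lookup are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Hypothetical opt-in cache; NOT the code from LUCENE-1690.patch. */
class CachedDocFreq {
    // Stands in for the (expensive) document-frequency lookup against the index.
    private final ToIntFunction<String> lookup;
    private final Map<String, Integer> cache = new HashMap<>();
    private boolean enabled = false; // off by default, so existing callers are unaffected
    private int lookups = 0;         // counts how often the underlying lookup ran

    CachedDocFreq(ToIntFunction<String> lookup) {
        this.lookup = lookup;
    }

    /** Applications opt in explicitly; opting out also drops cached values. */
    void setCacheEnabled(boolean enabled) {
        this.enabled = enabled;
        if (!enabled) {
            cache.clear();
        }
    }

    /** The cache lives as long as this object but can be cleared on demand. */
    void clearCache() {
        cache.clear();
    }

    int docFreq(String term) {
        if (!enabled) {
            lookups++;
            return lookup.applyAsInt(term);
        }
        Integer cached = cache.get(term);
        if (cached == null) {
            lookups++;
            cached = lookup.applyAsInt(term);
            cache.put(term, cached);
        }
        return cached;
    }

    int lookups() {
        return lookups;
    }
}
```

With the cache disabled every call reaches the underlying lookup; once enabled, repeated terms are served from the map until clearCache() is called. The sketch is not thread-safe.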

 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Commented: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-13 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12719103#action_12719103
 ] 

Michael McCandless commented on LUCENE-1690:


This sounds good!

Could we include the IndexReader in the cache key?  Then it'd be functionally 
equivalent and we could enable it by default?
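
One way to read this suggestion, sketched under stated assumptions: key the cache by reader first and by term second, so that statistics from one reader are never served for another. The class PerReaderDocFreqCache, the type parameter R (standing in for IndexReader), and the ToIntBiFunction lookup are hypothetical and do not come from the patch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;
import java.util.function.ToIntBiFunction;

/** Hypothetical sketch: document frequencies cached per reader, then per term. */
class PerReaderDocFreqCache<R> {
    // Weak keys let a reader's entries disappear once the reader itself is unreachable.
    private final Map<R, Map<String, Integer>> byReader = new WeakHashMap<>();
    // Stands in for the real index lookup; an assumption for illustration.
    private final ToIntBiFunction<R, String> lookup;

    PerReaderDocFreqCache(ToIntBiFunction<R, String> lookup) {
        this.lookup = lookup;
    }

    int docFreq(R reader, String term) {
        return byReader
            .computeIfAbsent(reader, r -> new HashMap<>())
            .computeIfAbsent(term, t -> lookup.applyAsInt(reader, t));
    }
}
```

Because a reopened index yields a different reader object, its lookups start from an empty map, which is what would make the cached path behave like the uncached one. The sketch is not thread-safe.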



 Morelikethis queries are very slow compared to other search types
 -

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor
 Attachments: LUCENE-1690.patch

   Original Estimate: 2h
  Remaining Estimate: 2h

 The MoreLikeThis object performs term frequency lookups for every query.  
 From my testing that's what seems to take up the majority of time for 
 MoreLikeThis searches.  
 For some (I'd venture many) applications it's not necessary for term 
 statistics to be looked up every time. A fairly naive opt-in caching 
 mechanism tied to the life of the MoreLikeThis object would allow 
 applications to cache term statistics for the duration that suits them.
 I've got this working in my test code. I'll put together a patch file when I 
 get a minute. From my testing this can improve performance by a factor of 
 around 10.




[jira] Created: (LUCENE-1690) Morelikethis queries are very slow compared to other search types

2009-06-12 Thread Richard Marr (JIRA)
Morelikethis queries are very slow compared to other search types
-

 Key: LUCENE-1690
 URL: https://issues.apache.org/jira/browse/LUCENE-1690
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Affects Versions: 2.4.1
Reporter: Richard Marr
Priority: Minor


The MoreLikeThis object performs term frequency lookups for every query.  From 
my testing that's what seems to take up the majority of time for MoreLikeThis 
searches.  

For some (I'd venture many) applications it's not necessary for term statistics 
to be looked up every time. A fairly naive opt-in caching mechanism tied to the 
life of the MoreLikeThis object would allow applications to cache term 
statistics for the duration that suits them.

I've got this working in my test code. I'll put together a patch file when I 
get a minute. From my testing this can improve performance by a factor of 
around 10.




[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis

2009-06-03 Thread Jonathan Leibiusky (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Leibiusky updated LUCENE-1272:
---

Attachment: (was: morelikethis_boostfactor.patch)

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 2.9

 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 in addition to the relevancy returned by MoreLikeThis. This is helpful when 
 having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis

2009-06-03 Thread Jonathan Leibiusky (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Leibiusky updated LUCENE-1272:
---

Attachment: morelikethis_boostfactor.patch

Updated to work with trunk

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 2.9

 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 in addition to the relevancy returned by MoreLikeThis. This is helpful when 
 having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Commented: (LUCENE-1272) Support for boost factor in MoreLikeThis

2009-06-02 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=1271#action_1271
 ] 

Otis Gospodnetic commented on LUCENE-1272:
--

Jonathan, would it be possible for you to update this patch to work with the 
trunk, so I can apply it?  Thanks!

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 2.9

 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 in addition to the relevancy returned by MoreLikeThis. This is helpful when 
 having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Resolved: (LUCENE-896) Let users set Similarity for MoreLikeThis

2008-11-12 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic resolved LUCENE-896.
-

   Resolution: Fixed
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])

Actually, my copy of MLT already takes Similarity in ctor and has 
set/getSimilarity, so no patch is needed.  You want/need that isNoise method 
protected?


 Let users set Similarity for MoreLikeThis
 -

 Key: LUCENE-896
 URL: https://issues.apache.org/jira/browse/LUCENE-896
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Ryan McKinley
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: LUCENE-896-MoreLikeThisSimilarity.patch


 Let users set Similarity used for MoreLikeThis
 For discussion, see:
 http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html




[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis

2008-11-12 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-1272:
-

 Priority: Minor  (was: Major)
Lucene Fields: [New, Patch Available]  (was: [Patch Available, New])
Fix Version/s: 2.9
 Assignee: Otis Gospodnetic

I don't see any harm in this, I'll make the change later this week.

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
Assignee: Otis Gospodnetic
Priority: Minor
 Fix For: 2.9

 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 in addition to the relevancy returned by MoreLikeThis. This is helpful when 
 having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Resolved: (LUCENE-1298) MoreLikeThis ignores custom similarity

2008-06-04 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1298.
-

   Resolution: Fixed
Lucene Fields:   (was: [New])

Committed revision 663054.

 MoreLikeThis ignores custom similarity
 --

 Key: LUCENE-1298
 URL: https://issues.apache.org/jira/browse/LUCENE-1298
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-1298.patch


 MoreLikeThis only allows the use of the DefaultSimilarity.  Patch shortly




[jira] Created: (LUCENE-1298) MoreLikeThis ignores custom similarity

2008-06-03 Thread Grant Ingersoll (JIRA)
MoreLikeThis ignores custom similarity
--

 Key: LUCENE-1298
 URL: https://issues.apache.org/jira/browse/LUCENE-1298
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor


MoreLikeThis only allows the use of the DefaultSimilarity.  Patch shortly




[jira] Updated: (LUCENE-1298) MoreLikeThis ignores custom similarity

2008-06-03 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1298?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1298:


Attachment: LUCENE-1298.patch

Patch

 MoreLikeThis ignores custom similarity
 --

 Key: LUCENE-1298
 URL: https://issues.apache.org/jira/browse/LUCENE-1298
 Project: Lucene - Java
  Issue Type: Bug
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Minor
 Attachments: LUCENE-1298.patch


 MoreLikeThis only allows the use of the DefaultSimilarity.  Patch shortly




[jira] Resolved: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-06-02 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll resolved LUCENE-1295.
-

   Resolution: Fixed
Lucene Fields: [New]  (was: [Patch Available, New])

Committed revision 662413.

 Make retrieveTerms(int docNum) public in MoreLikeThis
 -

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Attachments: LUCENE-1295.patch


 It would be useful if 
 {code}
 private PriorityQueue retrieveTerms(int docNum) throws IOException {
 {code}
 were public, since it is similar in use to 
 {code}
 public PriorityQueue retrieveTerms(Reader r) throws IOException {
 {code}
 It also seems useful to add 
 {code}
 public String [] retrieveInterestingTerms(int docNum) throws IOException{
 {code}
 to mirror the one that works on Reader.




[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-05-30 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12601340#action_12601340
 ] 

Otis Gospodnetic commented on LUCENE-1295:
--

I think cosmetic changes are OK if:
* they are not mixed with functional changes
* there are no patches for the cleaned-up class(es) in JIRA

In this case I see only a couple of MLT issues, all of which look like we can 
take care of them quickly, and then somebody can clean up a little if we feel 
like it.  Anyhow...


 Make retrieveTerms(int docNum) public in MoreLikeThis
 -

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Attachments: LUCENE-1295.patch


 It would be useful if 
 {code}
 private PriorityQueue retrieveTerms(int docNum) throws IOException {
 {code}
 were public, since it is similar in use to 
 {code}
 public PriorityQueue retrieveTerms(Reader r) throws IOException {
 {code}
 It also seems useful to add 
 {code}
 public String [] retrieveInterestingTerms(int docNum) throws IOException{
 {code}
 to mirror the one that works on Reader.




[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-05-29 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600780#action_12600780
 ] 

Grant Ingersoll commented on LUCENE-1295:
-

{quote}
Perque no.
{quote}

Huh?

{quote}
I see MLT is full of tabs, should you feel like fixing the formatting.
{quote}

Yeah, I noticed that too, and it is quite egregious. I thought we avoided 
formatting changes, but I am happy to make an exception here.  

 Make retrieveTerms(int docNum) public in MoreLikeThis
 -

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Attachments: LUCENE-1295.patch


 It would be useful if 
 {code}
 private PriorityQueue retrieveTerms(int docNum) throws IOException {
 {code}
 were public, since it is similar in use to 
 {code}
 public PriorityQueue retrieveTerms(Reader r) throws IOException {
 {code}
 It also seems useful to add 
 {code}
 public String [] retrieveInterestingTerms(int docNum) throws IOException{
 {code}
 to mirror the one that works on Reader.




[jira] Created: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-05-28 Thread Grant Ingersoll (JIRA)
Make retrieveTerms(int docNum) public in MoreLikeThis
-

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial


It would be useful if 
{code}
private PriorityQueue retrieveTerms(int docNum) throws IOException {
{code}

were public, since it is similar in use to 
{code}
public PriorityQueue retrieveTerms(Reader r) throws IOException {
{code}

It also seems useful to add 
{code}
public String [] retrieveInterestingTerms(int docNum) throws IOException{
{code}
to mirror the one that works on Reader.
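
To make the proposed symmetry concrete, here is a self-contained sketch of the three methods side by side. It only mirrors the signatures quoted above: the class InterestingTermsSketch, the in-memory document list standing in for the index, and the frequency-only ranking are assumptions for illustration, not MoreLikeThis internals.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch of the docNum/Reader symmetry; not the real MoreLikeThis. */
class InterestingTermsSketch {
    // Stand-in for stored document text, addressed by docNum (an assumption).
    private final List<String> docs;

    InterestingTermsSketch(List<String> docs) {
        this.docs = docs;
    }

    /** Counts lower-cased, whitespace-separated tokens; most frequent term is at the head. */
    public PriorityQueue<Map.Entry<String, Integer>> retrieveTerms(Reader r) throws IOException {
        Map<String, Integer> freq = new HashMap<>();
        BufferedReader in = new BufferedReader(r);
        for (String line; (line = in.readLine()) != null; ) {
            for (String tok : line.toLowerCase().split("\\s+")) {
                if (!tok.isEmpty()) {
                    freq.merge(tok, 1, Integer::sum);
                }
            }
        }
        // Higher frequency first; ties broken alphabetically so the order is deterministic.
        Comparator<Map.Entry<String, Integer>> byFreqThenTerm = (a, b) -> {
            int byFreq = Integer.compare(b.getValue(), a.getValue());
            return byFreq != 0 ? byFreq : a.getKey().compareTo(b.getKey());
        };
        PriorityQueue<Map.Entry<String, Integer>> pq = new PriorityQueue<>(byFreqThenTerm);
        pq.addAll(freq.entrySet());
        return pq;
    }

    /** The docNum variant, public like the Reader variant and delegating to it. */
    public PriorityQueue<Map.Entry<String, Integer>> retrieveTerms(int docNum) throws IOException {
        return retrieveTerms(new StringReader(docs.get(docNum)));
    }

    /** Mirrors the Reader-based convenience method: drains the queue into an array. */
    public String[] retrieveInterestingTerms(int docNum) throws IOException {
        PriorityQueue<Map.Entry<String, Integer>> pq = retrieveTerms(docNum);
        List<String> terms = new ArrayList<>();
        while (!pq.isEmpty()) {
            terms.add(pq.poll().getKey());
        }
        return terms.toArray(new String[0]);
    }
}
```

In the sketch the docNum variants simply delegate to the Reader variant, so a caller holding only a document number gets the same ranking without reaching into private methods.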






[jira] Updated: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-05-28 Thread Grant Ingersoll (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Grant Ingersoll updated LUCENE-1295:


Attachment: LUCENE-1295.patch

I'll commit in a day or two.

 Make retrieveTerms(int docNum) public in MoreLikeThis
 -

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Attachments: LUCENE-1295.patch


 It would be useful if 
 {code}
 private PriorityQueue retrieveTerms(int docNum) throws IOException {
 {code}
 were public, since it is similar in use to 
 {code}
 public PriorityQueue retrieveTerms(Reader r) throws IOException {
 {code}
 It also seems useful to add 
 {code}
 public String [] retrieveInterestingTerms(int docNum) throws IOException{
 {code}
 to mirror the one that works on Reader.




[jira] Commented: (LUCENE-1295) Make retrieveTerms(int docNum) public in MoreLikeThis

2008-05-28 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12600679#action_12600679
 ] 

Otis Gospodnetic commented on LUCENE-1295:
--

Perque no.  I see MLT is full of tabs, should you feel like fixing the 
formatting.


 Make retrieveTerms(int docNum) public in MoreLikeThis
 -

 Key: LUCENE-1295
 URL: https://issues.apache.org/jira/browse/LUCENE-1295
 Project: Lucene - Java
  Issue Type: Improvement
  Components: contrib/*
Reporter: Grant Ingersoll
Assignee: Grant Ingersoll
Priority: Trivial
 Attachments: LUCENE-1295.patch


 It would be useful if 
 {code}
 private PriorityQueue retrieveTerms(int docNum) throws IOException {
 {code}
 were public, since it is similar in use to 
 {code}
 public PriorityQueue retrieveTerms(Reader r) throws IOException {
 {code}
 It also seems useful to add 
 {code}
 public String [] retrieveInterestingTerms(int docNum) throws IOException{
 {code}
 to mirror the one that works on Reader.




[jira] Updated: (LUCENE-896) Let users set Similarity for MoreLikeThis

2008-05-16 Thread Otis Gospodnetic (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Otis Gospodnetic updated LUCENE-896:


Lucene Fields: [New, Patch Available]  (was: [New])
 Assignee: Otis Gospodnetic

Seems very reasonable.  I'll commit on Monday.


 Let users set Similarity for MoreLikeThis
 -

 Key: LUCENE-896
 URL: https://issues.apache.org/jira/browse/LUCENE-896
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Ryan McKinley
Assignee: Otis Gospodnetic
Priority: Minor
 Attachments: LUCENE-896-MoreLikeThisSimilarity.patch


 Let users set Similarity used for MoreLikeThis
 For discussion, see:
 http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html




[jira] Created: (LUCENE-1272) Support for boost factor in MoreLikeThis

2008-04-24 Thread Jonathan Leibiusky (JIRA)
Support for boost factor in MoreLikeThis


 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky


This is a patch I made to be able to boost the terms with a specific factor 
in addition to the relevancy returned by MoreLikeThis. This is helpful when 
having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
can be boosted more than words in field B (e.g. Description).
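
As an illustration of the idea only, the sketch below scales each term's relevancy score by a caller-supplied factor after normalising by the best score. The class TermBoostSketch, the method name, and the formula boostFactor * score / bestScore are assumptions for this example and are not taken from morelikethis_boostfactor.patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a per-query boost factor; not the attached patch. */
class TermBoostSketch {
    /**
     * Returns a boost per term: the term's score relative to the best score,
     * multiplied by boostFactor. Insertion order of the input is preserved.
     */
    static Map<String, Float> boosts(Map<String, Float> scores, float boostFactor) {
        float best = 0f;
        for (float s : scores.values()) {
            best = Math.max(best, s);
        }
        Map<String, Float> out = new LinkedHashMap<>();
        for (Map.Entry<String, Float> e : scores.entrySet()) {
            // Guard against an all-zero score map to avoid dividing by zero.
            out.put(e.getKey(), best == 0f ? 0f : boostFactor * e.getValue() / best);
        }
        return out;
    }
}
```

Running the Title terms through boosts(..., 2f) and the Description terms through boosts(..., 1f) would then weight Title matches twice as heavily once the two sets of terms are combined in one query.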




[jira] Updated: (LUCENE-1272) Support for boost factor in MoreLikeThis

2008-04-24 Thread Jonathan Leibiusky (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jonathan Leibiusky updated LUCENE-1272:
---

Attachment: morelikethis_boostfactor.patch

 Support for boost factor in MoreLikeThis
 

 Key: LUCENE-1272
 URL: https://issues.apache.org/jira/browse/LUCENE-1272
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Jonathan Leibiusky
 Attachments: morelikethis_boostfactor.patch


 This is a patch I made to be able to boost the terms with a specific factor 
 in addition to the relevancy returned by MoreLikeThis. This is helpful when 
 having more than one MoreLikeThis in the query, so words in field A (e.g. Title) 
 can be boosted more than words in field B (e.g. Description).




[jira] Created: (LUCENE-896) Let users set Similarity for MoreLikeThis

2007-05-30 Thread Ryan McKinley (JIRA)
Let users set Similarity for MoreLikeThis
-

 Key: LUCENE-896
 URL: https://issues.apache.org/jira/browse/LUCENE-896
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Ryan McKinley
Priority: Minor


Let users set Similarity used for MoreLikeThis

For discussion, see:
http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html




[jira] Updated: (LUCENE-896) Let users set Similarity for MoreLikeThis

2007-05-30 Thread Ryan McKinley (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan McKinley updated LUCENE-896:
-

Attachment: LUCENE-896-MoreLikeThisSimilarity.patch

This adds a constructor and accessors for Similarity.

This also fixes a couple javadoc typos and makes isNoiseWord() protected
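
For anyone who has not opened the patch, the shape of the change can be sketched roughly as follows. Everything here is a stand-in: SimilaritySketch models only an idf hook, MoreLikeThisSketch is not the real class, and both the default idf formula and the noise-word rule are assumptions chosen to keep the example self-contained.

```java
/** Hypothetical stand-in for a similarity; only the idf hook is modelled. */
interface SimilaritySketch {
    float idf(int docFreq, int numDocs);
}

/** Hypothetical sketch of a constructor plus accessors for a pluggable similarity. */
class MoreLikeThisSketch {
    /** Assumed default: a log-based idf, 1 + ln(numDocs / (docFreq + 1)). */
    static final SimilaritySketch DEFAULT =
        (docFreq, numDocs) -> (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);

    private SimilaritySketch similarity;

    MoreLikeThisSketch() {
        this(DEFAULT);
    }

    MoreLikeThisSketch(SimilaritySketch similarity) {
        this.similarity = similarity;
    }

    SimilaritySketch getSimilarity() {
        return similarity;
    }

    void setSimilarity(SimilaritySketch similarity) {
        this.similarity = similarity;
    }

    /** Term score as term frequency times whatever idf the current similarity reports. */
    float score(int termFreq, int docFreq, int numDocs) {
        return termFreq * similarity.idf(docFreq, numDocs);
    }

    /** Protected so subclasses can override it; the length rule here is made up. */
    protected boolean isNoiseWord(String term) {
        return term.length() < 3;
    }
}
```

A caller who wants different term weighting passes its own similarity to the constructor or calls setSimilarity(...), and a subclass can replace isNoiseWord without touching the scoring code.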

 Let users set Similarity for MoreLikeThis
 -

 Key: LUCENE-896
 URL: https://issues.apache.org/jira/browse/LUCENE-896
 Project: Lucene - Java
  Issue Type: Improvement
  Components: Other
Reporter: Ryan McKinley
Priority: Minor
 Attachments: LUCENE-896-MoreLikeThisSimilarity.patch


 Let users set Similarity used for MoreLikeThis
 For discussion, see:
 http://www.nabble.com/MoreLikeThis-API-changes--tf3838535.html




MoreLikeThis

2006-04-16 Thread Dean Hoover
Hi,
   
  Lucene is completely new to me. I just downloaded 1.9.1 and started 
experimenting with it. I am a bit confused though. I want to use the 
MoreLikeThis class, which appears in the javadoc, but does not exist in code. 
Where can I find it?
   
  Dean


Re: MoreLikeThis

2006-04-16 Thread Chris Hostetter

:   Lucene is completely new to me. I just downloaded 1.9.1 and started
: experimenting with it. I am a bit confused though. I want to use the
: MoreLikeThis class, which appears in the javadoc, but does not exist in
: code. Where can I find it?

if you look at the way the main javadoc index is arranged, you'll notice
that some packages are listed as core and others are divided up in
sections with contrib: in their header...

http://lucene.apache.org/java/docs/api/overview-summary.html

contrib code can be found in the contrib/ directory, broken up by
module -- each module is built into a separate jar.



-Hoss

