[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)

 Extension to MoreLikeThis to use tag information
 

 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor

 I would like to contribute a class based on the MoreLikeThis class in
 contrib/queries that generates a query based on the tags associated
 with a document. The class assumes that documents are tagged with a
 set of tags (which are stored in the index in a separate Field). The
 class determines the top document terms associated with a given tag
 using the information gain metric.
 When generating a MoreLikeThis query for a document, the tags
 associated with the document are used to determine the terms in the
 query. This class is useful for finding documents similar to one that
 does not have many relevant terms but has been tagged.
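
The ranking step described above can be sketched as follows. This is a minimal illustration, assuming only per-term document counts are available. The class name TagInformationGain, its method names, and the count layout are hypothetical and are not taken from the attached patch. Information gain is computed here as the entropy of "document has the tag" minus its entropy once the presence or absence of the term is known.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene or of the LUCENE-1910 patch.
final class TagInformationGain {

    static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    /** Entropy in bits of a yes/no event that occurs with probability p. */
    static double entropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0;
        }
        return -(p * log2(p) + (1 - p) * log2(1 - p));
    }

    /**
     * Information gain (in bits) about "document carries the tag" obtained by
     * observing whether the document contains the term.
     */
    static double informationGain(long totalDocs, long docsWithTag,
                                  long docsWithTerm, long docsWithBoth) {
        long docsWithoutTerm = totalDocs - docsWithTerm;
        double pTerm = (double) docsWithTerm / totalDocs;
        double hTag = entropy((double) docsWithTag / totalDocs);
        double hGivenTerm = docsWithTerm == 0
                ? 0.0 : entropy((double) docsWithBoth / docsWithTerm);
        double hGivenNoTerm = docsWithoutTerm == 0
                ? 0.0 : entropy((double) (docsWithTag - docsWithBoth) / docsWithoutTerm);
        return hTag - (pTerm * hGivenTerm + (1 - pTerm) * hGivenNoTerm);
    }

    /**
     * Returns the k terms with the highest information gain for one tag.
     * countsByTerm maps term -> {docsWithTerm, docsWithTermAndTag}.
     */
    static List<String> topTerms(Map<String, long[]> countsByTerm,
                                 long totalDocs, long docsWithTag, int k) {
        List<String> terms = new ArrayList<>(countsByTerm.keySet());
        terms.sort(Comparator.comparingDouble((String term) -> {
            long[] counts = countsByTerm.get(term);
            return informationGain(totalDocs, docsWithTag, counts[0], counts[1]);
        }).reversed());
        return new ArrayList<>(terms.subList(0, Math.min(k, terms.size())));
    }

    public static void main(String[] args) {
        // Made-up counts for a 100-document index in which 50 documents carry the tag.
        Map<String, long[]> counts = Map.of(
                "lucene", new long[] {50, 50},
                "index", new long[] {50, 40},
                "the", new long[] {50, 25});
        System.out.println(topTerms(counts, 100, 50, 2));
    }
}
```

A term that appears in exactly the tagged documents gets the maximum gain, while a term spread evenly over tagged and untagged documents gets a gain of zero, so common words drop out of the generated query.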

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783010#action_12783010
 ] 

Thomas D'Silva commented on LUCENE-1910:


Mark,

I refactored the code so that the tag and document probabilities are computed 
during the index creation phase and used there to find the most important 
document terms for each tag term. These terms (ranked by information gain) are 
stored as meta information in the index when the index is created. I added a 
class TagIndexWriter, which extends IndexWriter and creates an index against 
which MoreLikeThisUsingTags queries can be run. 

I recreated a test index with one million documents and assigned the tags 
tag_0 ... tag_4 to 10%, 20%, and so on of the documents. 

The time taken to generate a query on an index created using TagIndexWriter:

tag name   number of documents   time in ms
tag_0      10134                 22
tag_1      19996                 29
tag_2      30010                 6
tag_3      39907                 6
tag_4      50148                 9

Since the document terms corresponding to a tag term are computed during the 
indexing phase, the time taken to generate a MoreLikeThisUsingTags query is 
roughly constant: it no longer grows with the number of tagged documents. 

Thanks,
Thomas
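
The idea in the comment above, moving the expensive ranking to index time so that query generation becomes a lookup, can be sketched as follows. This is an illustration only: the class PrecomputedTagTerms and its methods are hypothetical, and an in-memory map stands in for the meta information that the patch stores in the index. How TagIndexWriter actually persists the terms is not shown in this thread.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for the per-tag term lists stored at index time.
final class PrecomputedTagTerms {

    private final Map<String, List<String>> topTermsByTag = new HashMap<>();

    /** Index time: record the already ranked top terms for one tag. */
    void store(String tag, List<String> rankedTerms) {
        topTermsByTag.put(tag, List.copyOf(rankedTerms));
    }

    /**
     * Query time: collect the stored terms for each of the document's tags,
     * keeping rank order, dropping duplicates, and stopping at maxTerms.
     */
    Set<String> queryTerms(List<String> documentTags, int maxTerms) {
        Set<String> terms = new LinkedHashSet<>();
        for (String tag : documentTags) {
            for (String term : topTermsByTag.getOrDefault(tag, List.of())) {
                if (terms.size() >= maxTerms) {
                    return terms;
                }
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        PrecomputedTagTerms store = new PrecomputedTagTerms();
        store.store("tag_0", List.of("lucene", "index"));
        store.store("tag_1", List.of("index", "search"));
        System.out.println(store.queryTerms(List.of("tag_0", "tag_1"), 10));
    }
}
```

The query-time work depends on the number of tags on the document and the number of stored terms per tag, not on how many documents carry each tag. This is consistent with the flat timings reported above.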




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-11-26 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch




[jira] Issue Comment Edited: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-04 Thread Thomas D'Silva (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12761997#action_12761997
 ] 

Thomas D'Silva edited comment on LUCENE-1910 at 10/4/09 10:21 AM:
--

Mark,

I refactored the class to include more descriptive variable names. I also 
modified the code so that, while calculating information gain, only terms 
belonging to documents that have been tagged with the given tag are used (and 
not all the terms in the index). 
I tested this class on a test index containing one million documents. The 
documents were tagged with five tags (tag_0 ... tag_4). tag_0 was assigned to 
approximately 10% of the documents, tag_1 to 20%, and so on. 

tag name   number of documents   time in ms
tag_0      10134                 137314
tag_1      19996                 219527
tag_2      30010                 315336
tag_3      39907                 413615
tag_4      50148                 507350

The time taken to generate the query for a tag depends on the number of 
documents in the index containing the tag and scales linearly with that 
number. 
The top document terms for a given tag are cached in a HashMap once they have 
been generated, in order to speed up subsequent lookups.

Thanks,
Thomas
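
The caching mentioned in the comment above can be sketched as a simple memoizing wrapper. This is a minimal illustration, assuming the expensive ranking is available as a function from tag to term list. The class TopTermCache and its members are hypothetical names, not code from the patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical memoizing wrapper around an expensive per-tag ranking function.
// Not thread-safe: callers sharing one instance across threads must synchronize.
final class TopTermCache {

    private final Map<String, List<String>> cache = new HashMap<>();
    private final Function<String, List<String>> ranker;
    private int computations = 0;

    TopTermCache(Function<String, List<String>> ranker) {
        this.ranker = ranker;
    }

    /** Returns the top terms for a tag, running the ranker only on the first request. */
    List<String> topTerms(String tag) {
        List<String> terms = cache.get(tag);
        if (terms == null) {
            computations++;
            terms = List.copyOf(ranker.apply(tag));
            cache.put(tag, terms);
        }
        return terms;
    }

    /** Number of times the underlying ranker has actually been invoked. */
    int computations() {
        return computations;
    }

    public static void main(String[] args) {
        // The lambda stands in for the information gain ranking.
        TopTermCache cache = new TopTermCache(tag -> List.of(tag + "_term_a", tag + "_term_b"));
        cache.topTerms("tag_0");
        cache.topTerms("tag_0");
        System.out.println("ranker invocations: " + cache.computations());
    }
}
```

A cache like this only helps repeated queries for the same tag within one process. The first query for each tag still pays the full cost shown in the table, and the cached lists go stale if the index changes.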




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-10-04 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-14 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Priority: Minor  (was: Major)




[jira] Created: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)
Extension to MoreLikeThis to use tag information


 Key: LUCENE-1910
 URL: https://issues.apache.org/jira/browse/LUCENE-1910
 Project: Lucene - Java
  Issue Type: New Feature
  Components: contrib/*
Reporter: Thomas D'Silva


I would like to contribute a class based on the MoreLikeThis class in
contrib/queries that generates a query based on the tags associated
with a document. The class assumes that documents are tagged with a
set of tags (which are stored in the index in a separate Field). The
class determines the top document terms associated with a given tag
using the information gain metric.

When generating a MoreLikeThis query for a document, the tags
associated with the document are used to determine the terms in the
query. This class is useful for finding documents similar to one that
does not have many relevant terms but has been tagged.




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: (was: LUCENE-1910.patch)




[jira] Updated: (LUCENE-1910) Extension to MoreLikeThis to use tag information

2009-09-13 Thread Thomas D'Silva (JIRA)

 [ 
https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thomas D'Silva updated LUCENE-1910:
---

Attachment: LUCENE-1910.patch
