[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790889#action_12790889 ]

Otis Gospodnetic commented on LUCENE-1910:

* I'll second Mark's suggestion to extract the Information Gain piece of the patch into a separate class (or classes), so we can reuse it in other places. It looks like it's currently an integral part of the MoreLikeThisUsingTags class. Would that be possible?
* I noticed the code needs the ASL (the Apache Software License) header added.
* Also, could you please use the Lucene code format? (Eclipse/IntelliJ templates are at the bottom of http://wiki.apache.org/lucene-java/HowToContribute )

Extension to MoreLikeThis to use tag information

Key: LUCENE-1910
URL: https://issues.apache.org/jira/browse/LUCENE-1910
Project: Lucene - Java
Issue Type: New Feature
Components: contrib/*
Reporter: Thomas D'Silva
Priority: Minor
Attachments: LUCENE-1910.patch

I would like to contribute a class based on the MoreLikeThis class in contrib/queries that generates a query based on the tags associated with a document. The class assumes that documents are tagged with a set of tags (which are stored in the index in a separate Field). The class determines the top document terms associated with a given tag using the information gain metric. While generating a MoreLikeThis query for a document, the tags associated with the document are used to determine the terms in the query. This class is useful for finding documents similar to a document that does not have many relevant terms but was tagged.

--
This message is automatically generated by JIRA.
- You can reply to this email to add a comment to the issue online.
- To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
  For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783010#action_12783010 ]

Thomas D'Silva commented on LUCENE-1910:

Mark,

I refactored the code so that the tag and document probabilities are computed during the index creation phase and used to find the most important document terms corresponding to a given tag term. These most important document terms (ranked by information gain) for a given tag term are stored as meta information in the index when the index is created. I added a class, TagIndexWriter, which extends IndexWriter and is used to create an index that can run MoreLikeThisUsingTags queries.

I recreated a test index with one million documents and assigned tags (tag_0, ..., tag_4) to 10%, 20%, and so on of the documents. The time taken to generate a query on an index created using TagIndexWriter:

tag name, number of documents, time in ms
tag_0, 10134, 22
tag_1, 19996, 29
tag_2, 30010, 6
tag_3, 39907, 6
tag_4, 50148, 9

Since the document terms corresponding to a tag term are computed during the indexing phase, the time taken to generate a MoreLikeThisUsingTags query is constant.

Thanks,
Thomas
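The index-time precomputation Thomas describes can be sketched in plain Java. This is only an illustration under stated assumptions: TagTermTable, put and queryTerms are hypothetical names, the per-term scores are assumed to be information gain values computed elsewhere during indexing, and an in-memory map stands in for the meta information that the patch stores in the index. It does not use or reproduce TagIndexWriter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper, not part of the LUCENE-1910 patch. */
final class TagTermTable {
    // tag -> its k best document terms, fixed once at indexing time
    private final Map<String, List<String>> topTermsByTag = new HashMap<>();

    /** Index-time step: rank a tag's terms by score and keep the best k. */
    void put(String tag, Map<String, Double> termScores, int k) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<Map.Entry<String, Double>> ranked = new ArrayList<>(termScores.entrySet());
        // Highest score (assumed to be information gain) first.
        ranked.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> top = new ArrayList<>();
        for (Map.Entry<String, Double> entry : ranked.subList(0, Math.min(k, ranked.size()))) {
            top.add(entry.getKey());
        }
        topTermsByTag.put(tag, Collections.unmodifiableList(top));
    }

    /** Query-time step: one hash lookup per tag, no scan over tagged documents. */
    List<String> queryTerms(List<String> documentTags) {
        Set<String> terms = new LinkedHashSet<>(); // de-duplicates, keeps order
        for (String tag : documentTags) {
            terms.addAll(topTermsByTag.getOrDefault(tag, Collections.emptyList()));
        }
        return new ArrayList<>(terms);
    }
}
```

With a table like this, the query-time cost depends on the number of tags on the document and on k, not on how many documents carry each tag, which is consistent with the flat timings reported above.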
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12762290#action_12762290 ]

Mark Harwood commented on LUCENE-1910:

2 minutes to create a query based on 10,000 documents? Unfortunately, I can't see this being generally useful until the performance is improved dramatically.
[jira] Commented: (LUCENE-1910) Extension to MoreLikeThis to use tag information
[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12757924#action_12757924 ]

Mark Harwood commented on LUCENE-1910:

Hi Thomas,

Following your request for feedback, some initial thoughts from a very quick look.

* The Information Gain algo could use a little more explanation, e.g. using variable names other than num1 and num2, and could perhaps be extracted into a utility class.
* Is this scalable? It looks like in initialize it is loading this:

{code:title=MoreLikeThisUsingTags.java|borderStyle=solid}
/**
 * All terms in the index
 */
protected HashSet docTerms=new HashSet();
{code}

...that seems a little scary! It's also doing a separate BooleanQuery for all items in this list (and repeated for 1 tag?). That looks like a lot of searches.

I need to spend a little more time looking at it before I understand it in more detail. Before then: have you tested this on a big (millions of docs/terms) index? Some performance figures would be useful to accompany this.

Cheers,
Mark
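As an illustration of the first bullet above (descriptive names, a standalone utility class), here is a minimal sketch of an information gain computation. It assumes the common binary formulation, where term presence and tag presence are treated as two yes/no variables over the documents and the score is their mutual information in bits. The class name, method name and parameters are hypothetical and are not taken from the patch, whose exact formula may differ.

```java
/** Hypothetical utility, not part of the LUCENE-1910 patch. */
final class InformationGain {
    private InformationGain() {}

    /**
     * Information gain (in bits) between "document contains the term"
     * and "document carries the tag", from four document counts.
     */
    static double compute(long docsWithTermAndTag, long docsWithTerm,
                          long docsWithTag, long totalDocs) {
        long termOnly = docsWithTerm - docsWithTermAndTag;
        long tagOnly = docsWithTag - docsWithTermAndTag;
        long neither = totalDocs - docsWithTerm - docsWithTag + docsWithTermAndTag;
        if (totalDocs <= 0 || docsWithTermAndTag < 0
                || termOnly < 0 || tagOnly < 0 || neither < 0) {
            throw new IllegalArgumentException("inconsistent document counts");
        }
        double n = totalDocs;
        double pTerm = docsWithTerm / n; // P(term present)
        double pTag = docsWithTag / n;   // P(tag present)
        // Sum over the four cells of the term/tag contingency table.
        return cell(docsWithTermAndTag / n, pTerm, pTag)
             + cell(termOnly / n, pTerm, 1.0 - pTag)
             + cell(tagOnly / n, 1.0 - pTerm, pTag)
             + cell(neither / n, 1.0 - pTerm, 1.0 - pTag);
    }

    /** One term of the sum: p(x,y) * log2(p(x,y) / (p(x) * p(y))). */
    private static double cell(double joint, double pTermSide, double pTagSide) {
        if (joint == 0.0) {
            return 0.0; // convention: 0 * log 0 = 0
        }
        return joint * (Math.log(joint / (pTermSide * pTagSide)) / Math.log(2.0));
    }
}
```

A term that appears independently of the tag scores 0, and a term that appears in exactly the tagged half of the index scores 1 bit, so ranking terms by this value favours those that best separate tagged from untagged documents.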