[ https://issues.apache.org/jira/browse/LUCENE-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12757924#action_12757924 ]
Mark Harwood commented on LUCENE-1910:
--------------------------------------

Hi Thomas,

Following your request for feedback, some initial thoughts from a very quick look.

* The "Information Gain" algorithm could use a little more explanation, e.g. using variable names other than "num1" and "num2", and could perhaps be extracted into a utility class.
* Is this scalable? It looks like in initialize it is loading this:

{code:title=MoreLikeThisUsingTags.java|borderStyle=solid}
/**
 * All terms in the index
 */
protected HashSet docTerms=new HashSet();
{code}

...that seems a little scary! It's also doing a separate BooleanQuery for all items in this list (and repeated for >1 tag?). That looks like a lot of searches.

I need to spend a little more time looking at it before I understand it in more detail. Before then, have you tested this on a big (millions of docs/terms) index? Some performance figures would be useful to accompany this.

Cheers,
Mark

> Extension to MoreLikeThis to use tag information
> ------------------------------------------------
>
>                 Key: LUCENE-1910
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1910
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Thomas D'Silva
>            Priority: Minor
>         Attachments: LUCENE-1910.patch
>
>
> I would like to contribute a class based on the MoreLikeThis class in
> contrib/queries that generates a query based on the tags associated
> with a document. The class assumes that documents are tagged with a
> set of tags (which are stored in the index in a separate Field). The
> class determines the top document terms associated with a given tag
> using the information gain metric.
> While generating a MoreLikeThis query for a document, the tags
> associated with the document are used to determine the terms in the query.
> This class is useful for finding similar documents to a document that
> does not have many relevant terms but was tagged.
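
Editorial sketch (not taken from LUCENE-1910.patch): the suggestion above to give the "Information Gain" calculation descriptive variable names and move it into a utility class could look roughly like the following. The class name `InformationGain`, its method names, and the choice of the standard textbook definition (entropy of the tag minus the entropy of the tag conditioned on term presence, computed from document counts) are all assumptions made for illustration; the patch may compute the metric differently.

```java
/**
 * Hypothetical utility class (not part of the patch) showing the textbook
 * information gain of a term with respect to a tag, from document counts.
 */
final class InformationGain {

    private InformationGain() {
        // static helpers only
    }

    /** Base-2 logarithm. */
    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    /** Entropy in bits of a yes/no event that occurs with probability p. */
    static double binaryEntropy(double p) {
        if (p <= 0.0 || p >= 1.0) {
            return 0.0; // a certain outcome carries no uncertainty
        }
        return -(p * log2(p) + (1.0 - p) * log2(1.0 - p));
    }

    /**
     * Information gain of observing a term about whether a document has a tag.
     *
     * @param totalDocs          number of documents in the index
     * @param docsWithTag        documents carrying the tag
     * @param docsWithTerm       documents containing the term
     * @param docsWithTermAndTag documents containing the term and carrying the tag
     */
    static double informationGain(long totalDocs, long docsWithTag,
                                  long docsWithTerm, long docsWithTermAndTag) {
        if (totalDocs <= 0) {
            return 0.0;
        }
        long docsWithoutTerm = totalDocs - docsWithTerm;
        long docsWithTagWithoutTerm = docsWithTag - docsWithTermAndTag;

        // Uncertainty about the tag before looking at the term.
        double tagEntropy = binaryEntropy((double) docsWithTag / totalDocs);

        // Remaining uncertainty within each half of the split on the term.
        double entropyGivenTerm = docsWithTerm == 0
                ? 0.0
                : binaryEntropy((double) docsWithTermAndTag / docsWithTerm);
        double entropyGivenNoTerm = docsWithoutTerm == 0
                ? 0.0
                : binaryEntropy((double) docsWithTagWithoutTerm / docsWithoutTerm);

        double probabilityOfTerm = (double) docsWithTerm / totalDocs;
        return tagEntropy
                - (probabilityOfTerm * entropyGivenTerm
                   + (1.0 - probabilityOfTerm) * entropyGivenNoTerm);
    }
}
```

In this sketch a term that occurs in exactly the tagged documents gets the maximum gain for that tag, while a term spread evenly across tagged and untagged documents gets a gain of zero. The four counts are plain arguments, so the sketch says nothing about how they would be collected from the index, which is the scalability question raised above.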