[ https://issues.apache.org/jira/browse/LUCENE-1993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767781#action_12767781 ]
Michael McCandless commented on LUCENE-1993:
--------------------------------------------

Patch looks good... I'll commit shortly.

> MoreLikeThis - allow to exclude terms that appear in too many documents (patch included)
> ----------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1993
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1993
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/*
>    Affects Versions: 2.9
>            Reporter: Christian Steinert
>            Assignee: Michael McCandless
>         Attachments: MoreLikeThis.java.patch
>
>   Original Estimate: 0.17h
>  Remaining Estimate: 0.17h
>
> The MoreLikeThis class allows to generate a likeness query based on a given
> document. So far, it is impossible to suppress words from the likeness query,
> that appear in almost all documents, making it necessary to use extensive
> lists of stop words.
> Therefore I suggest to allow excluding words for which a certain absolute
> document count or a certain percentage of documents is exceeded. Depending on
> the corpus of text, words that appear in more than 50 or even 70% of
> documents can usually be considered insignificant for classifying a document.
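
The filtering idea described in the issue can be illustrated with a small standalone sketch. It does not use Lucene and is not taken from the attached patch; the class name `DocFreqFilter`, the methods `maxDocFreqFromPercent` and `selectTerms`, and the sample document frequencies are all hypothetical, chosen only to show how an absolute or percentage-based document-frequency limit might drop overly common terms.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not the code from MoreLikeThis.java.patch.
public class DocFreqFilter {

    /** Converts a percentage of the corpus into an absolute document count. */
    public static int maxDocFreqFromPercent(int percent, int numDocs) {
        return (int) ((long) percent * numDocs / 100);
    }

    /** Keeps only the terms whose document frequency does not exceed maxDocFreq. */
    public static List<String> selectTerms(Map<String, Integer> docFreqs, int maxDocFreq) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() <= maxDocFreq) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        int numDocs = 1000; // assumed corpus size

        // Assumed document frequencies: term -> number of documents containing it.
        Map<String, Integer> docFreqs = new LinkedHashMap<>();
        docFreqs.put("the", 990);
        docFreqs.put("search", 620);
        docFreqs.put("lucene", 45);
        docFreqs.put("likeness", 3);

        // Exclude terms that appear in more than 50% of the documents.
        int limit = maxDocFreqFromPercent(50, numDocs);
        System.out.println(selectTerms(docFreqs, limit)); // prints [lucene, likeness]
    }
}
```

With a limit like this, very common words fall out of the likeness query without needing to appear in a hand-maintained stop word list; the right threshold depends on the corpus, as the issue description notes.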