[ https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mark Harwood updated LUCENE-725:
--------------------------------
    Attachment: NovelAnalyzer.java

Updated to work with Lucene 4 APIs.

> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all "boilerplate" text
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-725
>                 URL: https://issues.apache.org/jira/browse/LUCENE-725
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Mark Harwood
>            Assignee: Otis Gospodnetic
>            Priority: Minor
>         Attachments: NovelAnalyzer.java, NovelAnalyzer.java, NovelAnalyzer.java, NovelAnalyzer.java
>
>
> This is a class I have found to be useful for analyzing small (in the
> hundreds) collections of documents and removing any duplicate content such
> as standard disclaimers or repeated text in an exchange of emails.
> This has applications in sampling query results to identify key phrases,
> improving speed-reading of results with similar content (e.g. email
> threads/forum messages) or just removing duplicated noise from a search index.
> To be more generally useful it needs to scale to millions of documents - in
> which case an alternative implementation is required. See the notes in the
> Javadocs for this class for more discussion on this.
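
The duplicate-removal idea described in the issue can be illustrated with a minimal sketch. This is not the attached NovelAnalyzer and does not use any Lucene API; the class name `NovelTextFilter`, the method `filter`, and the three-word window are all assumptions made for illustration. It remembers every run of three consecutive words seen in earlier documents and drops words that fall inside a run it has seen before. Keeping every run in memory is only practical for small collections, which is in line with the scaling caveat in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Illustrative sketch only (hypothetical class, not the attached NovelAnalyzer).
 * Words covered by a word run already seen in an EARLIER document are dropped.
 */
public class NovelTextFilter {
    // Assumed window size: number of consecutive words forming one "shingle".
    private static final int SHINGLE_SIZE = 3;

    // Every shingle seen in documents processed so far.
    private final Set<String> seenShingles = new HashSet<>();

    /** Returns the lower-cased words of text that are not part of a previously seen shingle. */
    public List<String> filter(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        String[] words = trimmed.split("\\s+");
        boolean[] repeated = new boolean[words.length];
        List<String> currentShingles = new ArrayList<>();

        for (int i = 0; i + SHINGLE_SIZE <= words.length; i++) {
            String shingle = String.join(" ", Arrays.copyOfRange(words, i, i + SHINGLE_SIZE));
            if (seenShingles.contains(shingle)) {
                // Mark every word inside the repeated run as boilerplate.
                for (int j = i; j < i + SHINGLE_SIZE; j++) {
                    repeated[j] = true;
                }
            }
            currentShingles.add(shingle);
        }
        // Register this document's shingles only after scanning it,
        // so a document never filters its own content.
        seenShingles.addAll(currentShingles);

        List<String> novel = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            if (!repeated[i]) {
                novel.add(words[i]);
            }
        }
        return novel;
    }
}
```

Feeding two emails that share a disclaimer through the same `NovelTextFilter` instance keeps the first email whole and strips the shared disclaimer from the second. The order in which documents arrive therefore decides which copy of the repeated text survives.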