[
https://issues.apache.org/jira/browse/LUCENE-725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Mark Harwood updated LUCENE-725:
--------------------------------
Attachment: NovelAnalyzer.java
Updated to work with Lucene 4 APIs.
> NovelAnalyzer - wraps your choice of Lucene Analyzer and filters out all
> "boilerplate" text
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-725
> URL: https://issues.apache.org/jira/browse/LUCENE-725
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/analysis
> Reporter: Mark Harwood
> Assignee: Otis Gospodnetic
> Priority: Minor
> Attachments: NovelAnalyzer.java, NovelAnalyzer.java,
> NovelAnalyzer.java, NovelAnalyzer.java
>
>
> This is a class I have found useful for analyzing small collections of
> documents (in the hundreds) and removing duplicate content, such as
> standard disclaimers or repeated text in an exchange of emails.
> This has applications in sampling query results to identify key phrases,
> improving speed-reading of results with similar content (e.g. email
> threads or forum messages), or simply removing duplicated noise from a
> search index.
> To be more generally useful it needs to scale to millions of documents, in
> which case an alternative implementation is required. See the notes in the
> Javadocs for this class for more discussion of this.
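
To illustrate the general idea of filtering out text already seen in earlier
documents, here is a minimal sketch in plain Java. It is not the attached
NovelAnalyzer and does not use the Lucene Analyzer API. The class name
NovelTextFilter, the shingleSize parameter, and the choice of word shingles
(runs of N consecutive words) as the unit of duplication are all assumptions
made for this example. Like the description above, it keeps everything in
memory and so only suits small collections.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/**
 * Hypothetical sketch: drops words that belong to a run of N consecutive
 * words (a "shingle") already seen in a previously filtered document.
 * Not the attached NovelAnalyzer; names and approach are assumptions.
 */
public class NovelTextFilter {
    private final int shingleSize;
    // Every shingle seen so far, held in memory (fine for hundreds of docs).
    private final Set<String> seenShingles = new HashSet<>();

    public NovelTextFilter(int shingleSize) {
        if (shingleSize < 1) {
            throw new IllegalArgumentException("shingleSize must be >= 1");
        }
        this.shingleSize = shingleSize;
    }

    /**
     * Returns the lower-cased words of text that are not covered by any
     * shingle recorded from earlier calls, then records this text's shingles.
     */
    public List<String> filter(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        String[] tokens = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        boolean[] repeated = new boolean[tokens.length];
        List<String> shingles = new ArrayList<>();

        for (int i = 0; i + shingleSize <= tokens.length; i++) {
            String shingle = String.join(" ", Arrays.copyOfRange(tokens, i, i + shingleSize));
            shingles.add(shingle);
            if (seenShingles.contains(shingle)) {
                // Mark every word covered by the previously seen shingle.
                Arrays.fill(repeated, i, i + shingleSize, true);
            }
        }
        // Record only after scanning, so repeats within one document survive.
        seenShingles.addAll(shingles);

        List<String> novel = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (!repeated[i]) {
                novel.add(tokens[i]);
            }
        }
        return novel;
    }
}
```

A caller would create one NovelTextFilter per document collection and pass
each document's text through filter() in turn. The first document comes back
unchanged apart from lower-casing, and later documents lose any run of
shingleSize or more words that also appeared in an earlier one, such as a
shared disclaimer.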