Feed Parser
-----------

                 Key: TIKA-466
                 URL: https://issues.apache.org/jira/browse/TIKA-466
             Project: Tika
          Issue Type: New Feature
          Components: parser
            Reporter: Julien Nioche
            Priority: Minor
         Attachments: TIKA-466.patch

We currently have no parsers for feeds in Tika, and since we are progressively getting rid of our legacy parsers in Nutch, I thought it would make sense to have one. The attached patch is based on the ROME feed parser (https://rome.dev.java.net/), which is released under the Apache License. ROME provides a unified API for different feed formats and seems well maintained.

The implementation of the FeedParser is by no means complete but should serve as a basis for further improvements. It currently extracts the title and description of the feed and stores them in the metadata, and uses the following XHTML representation for the entries:

<A href="ENTRY_URL">ENTRY_TITLE</A>
<P> ENTRY_DESCRIPTION </P>

This is pretty basic but should at least allow us to retrieve the outlinks in Nutch as well as some text.

J.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
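
For illustration only, here is a minimal sketch of the per-entry XHTML representation described above. It is not the code from TIKA-466.patch: it uses only the Java standard library, and the names FeedEntryXhtml, Entry, escape, renderEntry and renderFeed are hypothetical. A real parser would presumably read the entries through ROME and emit SAX events to Tika rather than build strings. The sketch writes the element names in lowercase, since XHTML element names are case-sensitive, and it escapes markup characters, which is an added assumption rather than something the description states.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: renders feed entries as <a href="URL">TITLE</a><p>DESCRIPTION</p>.
class FeedEntryXhtml {

    /** Hypothetical stand-in for a feed entry (not a ROME or Tika type). */
    static final class Entry {
        final String url;
        final String title;
        final String description;

        Entry(String url, String title, String description) {
            this.url = url;
            this.title = title;
            this.description = description;
        }
    }

    /** Escapes the characters that are significant in XHTML text and attribute values. */
    static String escape(String s) {
        if (s == null) {
            return "";
        }
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one entry as a link (the outlink) followed by a paragraph of text. */
    static String renderEntry(Entry e) {
        return "<a href=\"" + escape(e.url) + "\">" + escape(e.title) + "</a>"
                + "<p>" + escape(e.description) + "</p>";
    }

    /** Renders all entries, one per line. */
    static String renderFeed(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        for (Entry e : entries) {
            sb.append(renderEntry(e)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Entry> entries = Arrays.asList(
                new Entry("http://example.com/a?x=1&y=2", "Hello <World>", "Some text"),
                new Entry("http://example.com/b", "Second entry", "More text"));
        System.out.print(renderFeed(entries));
    }
}
```

With this shape, a consumer such as Nutch could pick up each href as an outlink and index the title and description as plain text.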