[ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated TIKA-466:
-------------------------------

    Attachment: TIKA-466.patch

> Feed Parser
> -----------
>
>                 Key: TIKA-466
>                 URL: https://issues.apache.org/jira/browse/TIKA-466
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Julien Nioche
>            Priority: Minor
>         Attachments: TIKA-466.patch
>
>
> We currently have no parsers for feeds in Tika, and since we are progressively
> getting rid of our legacy parsers in Nutch, I thought it would make sense to
> have one.
> The attached patch is based on the ROME feed parser
> (https://rome.dev.java.net/), which is under the Apache License. ROME provides a
> unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve
> as a basis for further improvements. It currently stores the feed's title and
> description in the metadata and uses the following XHTML representation for
> the entries:
> <A href="ENTRY_URL">ENTRY_TITLE</A>
> <P>
> ENTRY_DESCRIPTION
> </P>
> This is pretty basic but should at least allow us to retrieve the outlinks in
> Nutch as well as some text.
> J.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
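
The entry representation described in the issue (a link carrying the entry URL and title, followed by a paragraph with the description) can be illustrated with a small sketch. This is not the code from TIKA-466.patch, and it does not use ROME or the Tika Parser API: it assumes a plain RSS 2.0 document and reads it with the JDK's built-in DOM parser so that it runs without extra dependencies. The class name FeedXhtmlSketch and its methods are hypothetical, and the tags are written in lowercase, as XHTML requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only: a JDK-only stand-in for the ROME-based
// FeedParser described in the issue, limited to RSS 2.0 <item> elements.
class FeedXhtmlSketch {

    /** Escapes characters that are significant in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Returns the trimmed text of the first descendant element with this name, or "". */
    static String childText(Element parent, String name) {
        NodeList nodes = parent.getElementsByTagName(name);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    /** Renders every RSS item as a link line followed by a description paragraph. */
    static String render(String rssXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Feeds usually come from untrusted sources, so refuse DOCTYPE declarations.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            StringBuilder out = new StringBuilder();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                out.append("<a href=\"").append(escape(childText(item, "link"))).append("\">")
                   .append(escape(childText(item, "title"))).append("</a>\n");
                out.append("<p>").append(escape(childText(item, "description"))).append("</p>\n");
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse feed", e);
        }
    }

    public static void main(String[] args) {
        String rss = "<rss version=\"2.0\"><channel><title>Example feed</title>"
                + "<item><title>First entry</title><link>http://example.com/1</link>"
                + "<description>Some text</description></item>"
                + "</channel></rss>";
        System.out.print(render(rss));
    }
}
```

Because each entry becomes an ordinary anchor element, a consumer such as Nutch could pick up the outlinks with the same link extraction it applies to any other XHTML output.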