[jira] Updated: (TIKA-466) Feed Parser

Julien Nioche (JIRA) Fri, 16 Jul 2010 04:23:22 -0700

     [ https://issues.apache.org/jira/browse/TIKA-466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Julien Nioche updated TIKA-466:
-------------------------------

    Attachment: TIKA-466.patch

> Feed Parser
> -----------
>
>                 Key: TIKA-466
>                 URL: https://issues.apache.org/jira/browse/TIKA-466
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Julien Nioche
>            Priority: Minor
>         Attachments: TIKA-466.patch
>
>
> Tika currently has no parser for feeds, and since we are progressively 
> retiring our legacy parsers in Nutch, I thought it would make sense to 
> have one.
> The attached patch is based on the ROME feed parser 
> (https://rome.dev.java.net/), which is under the Apache License. ROME 
> provides a unified API for different feed formats and seems well maintained.
> The implementation of the FeedParser is by no means complete but should serve 
> as a basis for further improvements. It currently stores the feed's title and 
> description in the metadata and uses the following XHTML representation for 
> each entry:
> <A href="ENTRY_URL">ENTRY_TITLE</A>
> <P>
> ENTRY_DESCRIPTION
> </P> 
> This is pretty basic but should at least allow us to retrieve the outlinks in 
> Nutch as well as some text. 
> J. 
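
For illustration only, the per-entry XHTML layout described above could be produced roughly as follows. This is a minimal sketch, not the code from TIKA-466.patch: the class FeedEntryXhtmlSketch, the Entry holder, and the escape/render helpers are hypothetical names, and the sketch builds plain strings instead of going through ROME or Tika's content handlers. Element names are written in lowercase here, as XHTML requires.

```java
import java.util.Arrays;
import java.util.List;

public class FeedEntryXhtmlSketch {

    /** Hypothetical holder for the three per-entry fields named in the description. */
    static final class Entry {
        final String url;
        final String title;
        final String description;

        Entry(String url, String title, String description) {
            this.url = url;
            this.title = title;
            this.description = description;
        }
    }

    /** Escapes the characters that are unsafe in XHTML text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one entry as a link followed by a paragraph with its description. */
    static String render(Entry e) {
        return "<a href=\"" + escape(e.url) + "\">" + escape(e.title) + "</a>\n"
             + "<p>" + escape(e.description) + "</p>\n";
    }

    /** Renders all entries in feed order. */
    static String renderAll(List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        for (Entry e : entries) {
            sb.append(render(e));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample entries, used only to exercise the sketch.
        List<Entry> entries = Arrays.asList(
            new Entry("http://example.org/post/1", "First post", "Some text."),
            new Entry("http://example.org/post/2?a=1&b=2", "Second <post>", "More text."));
        System.out.print(renderAll(entries));
    }
}
```

Because every entry becomes an anchor element, a consumer such as Nutch can collect the href values as outlinks, while the paragraph supplies the indexable text.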

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
