[ https://issues.apache.org/jira/browse/NUTCH-874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897190#action_12897190 ]

Julien Nioche commented on NUTCH-874:
-------------------------------------

{quote}
I think Jukka already worked on something really similar to the ExtParser in 
Tika. See: 
http://tika.apache.org/0.7/api/org/apache/tika/parser/ExternalParser.html
{quote}
Yes, that's the one I had in mind.

One of the plugins which hasn't been ported yet is the feed parser. We could 
rely on the one we recently added to Tika, but there is a substantial 
difference: the Tika feed parser generates a simple XHTML representation of 
the document in which the feeds are simply represented as anchors, whereas 
the Nutch version created new documents for each feed.
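
To make the difference concrete, here is a minimal sketch in plain Java of the 
two output shapes. It does not use the Tika or Nutch APIs: FeedShapes, Entry 
and both method names are hypothetical, and the exact markup shown is an 
assumption, not what either parser actually emits.

```java
import java.util.ArrayList;
import java.util.List;

public class FeedShapes {

    /** Hypothetical stand-in for one parsed feed item (not a Tika or Nutch type). */
    static final class Entry {
        final String title;
        final String link;

        Entry(String title, String link) {
            this.title = title;
            this.link = link;
        }
    }

    /** Tika-like shape (assumed markup): one XHTML document, one anchor per item. */
    static String asSingleXhtml(String feedTitle, List<Entry> entries) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head><title>").append(escape(feedTitle))
          .append("</title></head><body><ul>");
        for (Entry e : entries) {
            sb.append("<li><a href=\"").append(escape(e.link)).append("\">")
              .append(escape(e.title)).append("</a></li>");
        }
        sb.append("</ul></body></html>");
        return sb.toString();
    }

    /** Nutch-like shape (assumed): one separate {url, title} document per item. */
    static List<String[]> asSeparateDocuments(List<Entry> entries) {
        List<String[]> docs = new ArrayList<>();
        for (Entry e : entries) {
            docs.add(new String[] { e.link, e.title });
        }
        return docs;
    }

    /** Minimal XML escaping for text and attribute values. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        List<Entry> entries = new ArrayList<>();
        entries.add(new Entry("First post", "http://example.org/1"));
        entries.add(new Entry("Second post", "http://example.org/2"));

        // Shape 1: a single document containing only links.
        System.out.println(asSingleXhtml("Example feed", entries));

        // Shape 2: one document per item, each with its own URL.
        for (String[] doc : asSeparateDocuments(entries)) {
            System.out.println(doc[0] + " -> " + doc[1]);
        }
    }
}
```

With the first shape the items only reach the index if the crawler follows the 
anchors as outlinks. With the second, each item is a document of its own as 
soon as the feed is parsed.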

There is also the parse-rss plugin in Nutch, which is quite similar. What is 
the difference between it and the feed plugin again? Since the Tika parser 
would handle all sorts of feed formats, why not simply rely on it?

> Make sure all plugins in src/plugin are compatible with Nutch 2.0 and Gora
> --------------------------------------------------------------------------
>
>                 Key: NUTCH-874
>                 URL: https://issues.apache.org/jira/browse/NUTCH-874
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>         Environment: Nutch 2.0
>            Reporter: Chris A. Mattmann
>            Assignee: Chris A. Mattmann
>            Priority: Critical
>             Fix For: 2.0
>
>
> I just noticed while fixing NUTCH-564 that the ExtParser hasn't been brought 
> up to date with Nutch 2.0 trunk. We should review the plugins in src/plugin 
> to make sure they all work with Gora/Nutchbase now.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
