[
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris A. Mattmann updated NUTCH-444:
------------------------------------
Attachment: NUTCH-444.Mattmann.061707.patch.txt
Hi Folks,
Here is a patch that brings this issue up-to-date. The patch takes Doğacan's
initial patch, and cleans it up in many places, e.g.:
* changed the status returned on a failed parse to ParseStatus.STATUS_FAILURE
(was ParseStatus.STATUS_SUCCESS) - line 271
* reformatted code to conform to project style
* removed magic strings
* added in Apache license
* added in unit test
* fixed build.xml to include refs to the nutch-extensionpoints dependency
during the unit test
While I think there are a few minor open questions moving forward, I don't see
any of them blocking the commit of this patch. To answer the question I raised
earlier on this issue: overall, the feed plugin provided here offers a superset
of the functionality of parse-rss. So, I am +1 for removing parse-rss. Some
things to consider going forward:
1. I did find one difference in semantics between the parse-rss plugin and the
feed plugin: the feed plugin uses the URL pointer to the channel file as the
Text key in the <Text, Parse> map provided by the ParseResult class. While
this is probably the correct thing to do, it initially caused my unit test to
fail. The test expected to receive the URL http://test.channel.com, the
channel URL declared in the rsstest.rss file that is provided as sample input
for the unit test. However, the feed plugin parser takes the *actual* URL
pointer to the channel file (e.g., file:/some/path/on/your/system/rsstest.rss)
rather than the declared channel URL, whereas the old parse-rss plugin took
the channel URL. I thought about this, and it's not a major hurdle: I think
the semantics of simply taking the URL pointer to the channel file that was
used (even if it is a file: pointer) are fine.
2. It might be a good idea to factor out the desired index/parse properties
taken from the feed and allow them to be specified by a configuration file to
this plugin. In other words, wouldn't it be nice to tell the plugin which
fields we want to extract (e.g., author, published date, etc.)? This would be
an improvement to this plugin later on.
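
The idea in item 2 could look roughly like the sketch below. It is not part of
the attached patch and uses no Nutch API: the class name FeedFieldSelector, the
comma-separated property value, and the plain String maps standing in for feed
entry metadata are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: picks which feed entry fields to keep. */
public class FeedFieldSelector {

  /** Parses a comma-separated field list such as "author, published". */
  static Set<String> parseFields(String propertyValue) {
    Set<String> fields = new LinkedHashSet<>();
    if (propertyValue == null) {
      return fields;
    }
    for (String name : propertyValue.split(",")) {
      String trimmed = name.trim().toLowerCase(Locale.ROOT);
      if (!trimmed.isEmpty()) {
        fields.add(trimmed);
      }
    }
    return fields;
  }

  /** Keeps only the entries whose keys (case-insensitive) were requested. */
  static Map<String, String> select(Map<String, String> entry,
                                    Set<String> wanted) {
    Map<String, String> kept = new LinkedHashMap<>();
    for (Map.Entry<String, String> e : entry.entrySet()) {
      if (wanted.contains(e.getKey().toLowerCase(Locale.ROOT))) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // Assumed configuration value; a real plugin would read it from its
    // configuration file under a property name still to be decided.
    Set<String> wanted = parseFields("author, published");

    Map<String, String> entry = new LinkedHashMap<>();
    entry.put("author", "someone");
    entry.put("title", "an entry title");
    entry.put("published", "2007-06-17");

    System.out.println(select(entry, wanted));
  }
}
```

Lower-casing both the configured names and the entry keys keeps the match
case-insensitive, so a configuration typo in capitalization would not silently
drop a field.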
OK, so here it is. If there are no objections, I'd like to commit this in
the next 48 hours. I'd also like feedback from folks like Andrzej and Doğacan
regarding removing parse-rss from the sources.
Thanks!
Cheers,
Chris
> Possibly use a different library to parse RSS feed for improved performance
> and compatibility
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.0.0
>
> Attachments: feed.tar.bz2, NUTCH-444.Mattmann.061707.patch.txt,
> NUTCH-444.patch, parse-feed-v2.tar.bz2, parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers