[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472099 ]
nutch.newbie commented on NUTCH-444:
------------------------------------
Otis:
Thanks for the info. For my part I am going with parse-feed. I would also like
to give a StAX-based solution a try.
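A minimal sketch of what a StAX-based approach could look like, assuming Atom 1.0 input with plain-text titles. The class StaxFeedSketch and the method entryTitles are hypothetical names used only for illustration; they are not part of Nutch or parse-feed. The sketch streams the feed with the JDK's javax.xml.stream API and collects entry titles without first building a JDOM tree:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical illustration class; not part of Nutch or parse-feed.
class StaxFeedSketch {
    static final String ATOM_NS = "http://www.w3.org/2005/Atom";

    // Collects the text of every atom:title found inside an atom:entry.
    static List<String> entryTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Feeds are untrusted input, so DTDs and external entities are switched off.
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            boolean inEntry = false;
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && ATOM_NS.equals(reader.getNamespaceURI())) {
                    String name = reader.getLocalName();
                    if ("entry".equals(name)) {
                        inEntry = true;
                    } else if (inEntry && "title".equals(name)) {
                        // Assumes a text-only title; getElementText() throws on nested markup.
                        titles.add(reader.getElementText().trim());
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && ATOM_NS.equals(reader.getNamespaceURI())
                        && "entry".equals(reader.getLocalName())) {
                    inEntry = false;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("feed is not well-formed XML", e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String feed = "<feed xmlns=\"http://www.w3.org/2005/Atom\">"
            + "<title>Example blog</title>"
            + "<entry><title>First post</title></entry>"
            + "<entry><title>Second post</title></entry>"
            + "</feed>";
        System.out.println(entryTitles(feed)); // [First post, Second post]
    }
}
```

Because the reader only holds the current event, memory use should stay roughly flat as the feed grows. A real parser would also need to handle RSS and titles with type="html" or type="xhtml".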
Dogacan:
It's working rather well with parse-feed. However, I would be glad if you could
do a quick check of my parse-plugins.xml modifications, because this setup also
throws an error during dedup (when magic is false in nutch-site.xml). I would
like to know whether I am doing something wrong or whether it is some other bug.
I am thinking of doing a test run later tonight with 10 000 feeds, so I would
be glad if you could clarify the following case. (It only happens when there
is just 1 url.)
- urls.txt file contains 1 url, which is http://blog.foofactory.fi/atom.xml
- bin/nutch crawl with depth 1 gives me the following error during dedup
2007-02-11 13:32:26,846 WARN mapred.LocalJobRunner - job_k9e9c2
java.lang.ArrayIndexOutOfBoundsException: -1
    at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
    at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
    at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)
The parse phase for the same feed logs the following:
2007-02-11 13:32:09,673 DEBUG http.Http - fetched 208 bytes from http://blog.foofactory.fi/robots.txt
2007-02-11 13:32:09,674 DEBUG http.Http - fetching http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,560 INFO mapred.JobClient - map 100% reduce 0%
2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,965 DEBUG parse.ParseUtil - Parsing [http://blog.foofactory.fi/atom.xml] with [EMAIL PROTECTED]
2007-02-11 13:32:11,292 INFO mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s,
2007-02-11 13:32:11,627 INFO crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-11 13:32:11,654 WARN fetcher.Fetcher - Error parsing: http://blog.foofactory.fi/atom.xml: failed(2,200): java.lang.NullPointerException
2007-02-11 13:32:12,293 INFO mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,
2007-02-11 13:32:12,306 DEBUG mapred.MapTask - opened spill0.out
2007-02-11 13:32:12,381 INFO mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,
Below are my parse-plugins.xml changes:
<mimeType name="application/rss+xml">
  <plugin id="parse-feed" />
</mimeType>
<mimeType name="text/xml">
  <plugin id="parse-feed" />
</mimeType>
<alias name="parse-feed"
       extension-id="org.apache.nutch.parse.feed.FeedParser" />
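One way to sanity-check a fragment like this is to confirm that every plugin id referenced by a mimeType entry also has a matching alias. The sketch below does only that, using the JDK's DOM parser. ParsePluginsCheck and unaliased are hypothetical names, and the enclosing parse-plugins and aliases elements in the sample are assumptions about the surrounding file. It checks internal consistency only and says nothing about the cause of the NullPointerException or the dedup error:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration; not part of Nutch.
class ParsePluginsCheck {
    // Returns "mimeType -> pluginId" for each referenced plugin id that has no alias.
    static List<String> unaliased(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

            // Collect every alias name declared anywhere in the document.
            Set<String> aliases = new HashSet<>();
            NodeList aliasNodes = doc.getElementsByTagName("alias");
            for (int i = 0; i < aliasNodes.getLength(); i++) {
                aliases.add(((Element) aliasNodes.item(i)).getAttribute("name"));
            }

            // Report each mimeType/plugin pair whose id is not among the aliases.
            List<String> missing = new ArrayList<>();
            NodeList mimeNodes = doc.getElementsByTagName("mimeType");
            for (int i = 0; i < mimeNodes.getLength(); i++) {
                Element mime = (Element) mimeNodes.item(i);
                NodeList plugins = mime.getElementsByTagName("plugin");
                for (int j = 0; j < plugins.getLength(); j++) {
                    String id = ((Element) plugins.item(j)).getAttribute("id");
                    if (!aliases.contains(id)) {
                        missing.add(mime.getAttribute("name") + " -> " + id);
                    }
                }
            }
            return missing;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse the fragment", e);
        }
    }

    public static void main(String[] args) {
        // Sample wraps the two mappings and the alias in assumed enclosing elements.
        String xml = "<parse-plugins>"
            + "<mimeType name=\"application/rss+xml\"><plugin id=\"parse-feed\"/></mimeType>"
            + "<mimeType name=\"text/xml\"><plugin id=\"parse-feed\"/></mimeType>"
            + "<aliases><alias name=\"parse-feed\""
            + " extension-id=\"org.apache.nutch.parse.feed.FeedParser\"/></aliases>"
            + "</parse-plugins>";
        System.out.println(unaliased(xml)); // [] : every referenced id has an alias
    }
}
```

An empty result here suggests the mappings and alias are at least consistent with each other, so the failure may lie elsewhere.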
I have also mapped text/xml in parse-feed/plugin.xml, because most of the time I
get text/xml rather than application/rss+xml as the content type. Also, you
mentioned that you are using this for testing: what does your test
configuration look like? Can you reproduce my problem?
Thanks again for the plugin and many thanks for your help. I look forward to
contributing in terms of index-feed and query-feed.
> Possibly use a different library to parse RSS feed for improved performance
> and compatibility
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-444
> URL: https://issues.apache.org/jira/browse/NUTCH-444
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 0.9.0
> Reporter: Renaud Richardet
> Priority: Minor
> Fix For: 0.9.0
>
> Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome
> - Informa
> - custom implementation based on Stax
> - ??
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers