[Nutch-dev] [jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

nutch.newbie (JIRA) Sun, 11 Feb 2007 04:50:34 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472099
 ]


nutch.newbie commented on NUTCH-444:
------------------------------------

Otis:

Thanks for the info. For now I am going with parse-feed, but I would also like 
to give a StAX-based solution a try. 

Dogacan: 

It's working rather well with parse-feed. However, I would be glad if you could 
do a quick check of my parse-plugins.xml modifications, because I also get an 
error during dedup (when magic is false in nutch-site.xml). I would like to 
know whether I am doing something wrong or whether it is some other bug. 

I am thinking of doing a test run later tonight with 10 000 feeds, so I would 
be glad if you could clarify the following case. (It only happens when there 
is just 1 URL.)

- urls.txt file contains 1 url, which is http://blog.foofactory.fi/atom.xml
- bin/nutch crawl with depth 1 gives me the following error during dedup

2007-02-11 13:32:26,846 WARN  mapred.LocalJobRunner - job_k9e9c2
java.lang.ArrayIndexOutOfBoundsException: -1
        at org.apache.lucene.index.MultiReader.isDeleted(MultiReader.java:109)
        at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
        at org.apache.hadoop.mapred.MapTask$2.next(MapTask.java:166)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:183)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:109)

and the parse phase for the above blog gives me the following:

2007-02-11 13:32:09,673 DEBUG http.Http - fetched 208 bytes from http://blog.foofactory.fi/robots.txt
2007-02-11 13:32:09,674 DEBUG http.Http - fetching http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,560 INFO  mapred.JobClient -  map 100% reduce 0%
2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from http://blog.foofactory.fi/atom.xml
2007-02-11 13:32:10,965 DEBUG parse.ParseUtil - Parsing [http://blog.foofactory.fi/atom.xml] with [EMAIL PROTECTED]
2007-02-11 13:32:11,292 INFO  mapred.LocalJobRunner - 0 pages, 0 errors, 0.0 pages/s, 0 kb/s, 
2007-02-11 13:32:11,627 INFO  crawl.SignatureFactory - Using Signature impl: org.apache.nutch.crawl.MD5Signature
2007-02-11 13:32:11,654 WARN  fetcher.Fetcher - Error parsing: http://blog.foofactory.fi/atom.xml: failed(2,200): java.lang.NullPointerException
2007-02-11 13:32:12,293 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s, 
2007-02-11 13:32:12,306 DEBUG mapred.MapTask - opened spill0.out
2007-02-11 13:32:12,381 INFO  mapred.LocalJobRunner - 1 pages, 0 errors, 0.3 pages/s, 99 kb/s,
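
For a larger run such as the planned 10 000-feed test, parse failures like the 
"Error parsing" line above could be collected from the log with a small 
script. The following is only a minimal sketch: the regular expression is 
modelled on the single WARN line quoted here, the names PARSE_ERROR and 
parse_failures are hypothetical, and it assumes each log record sits on one 
unwrapped line. Other Nutch versions may format the message differently.

```python
import re

# Pattern modelled on the one "Error parsing" line quoted in this message.
# It is an assumption that other failures follow the same layout.
PARSE_ERROR = re.compile(
    r"WARN\s+fetcher\.Fetcher - Error parsing: (?P<url>\S+): "
    r"failed\((?P<major>\d+),(?P<minor>\d+)\): (?P<reason>.+)$"
)


def parse_failures(lines):
    """Return (url, reason) pairs for every parse failure found in lines."""
    failures = []
    for line in lines:
        match = PARSE_ERROR.search(line.rstrip("\n"))
        if match:
            failures.append((match.group("url"), match.group("reason")))
    return failures


# Two records copied from the log excerpt above, joined onto single lines.
sample = [
    "2007-02-11 13:32:10,769 DEBUG http.Http - fetched 38151 bytes from "
    "http://blog.foofactory.fi/atom.xml",
    "2007-02-11 13:32:11,654 WARN  fetcher.Fetcher - Error parsing: "
    "http://blog.foofactory.fi/atom.xml: failed(2,200): "
    "java.lang.NullPointerException",
]

for url, reason in parse_failures(sample):
    print(url, reason)
```

To scan a real log, the same function can be given an open file object 
instead of the sample list.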

Below are my parse-plugins.xml changes:

        <mimeType name="application/rss+xml">
                <plugin id="parse-feed" />
        </mimeType>

        <mimeType name="text/xml">
                <plugin id="parse-feed" />
        </mimeType>

        <alias name="parse-feed"
                extension-id="org.apache.nutch.parse.feed.FeedParser" />

I have also mapped text/xml in parse-feed/plugin.xml, because most of the time 
I get text/xml rather than application/rss+xml as the content type. Also, you 
mentioned you are using this for testing: what does your test configuration 
look like, and can you reproduce my problem? 
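
One way to sanity-check mappings like these before a long crawl is to load the 
file and confirm that every plugin id referenced by a mimeType also has an 
alias. The sketch below is illustrative only: the function name 
check_parse_plugins is hypothetical, and the <parse-plugins> root and 
<aliases> wrapper in the sample are assumed from the usual layout of 
parse-plugins.xml rather than taken from the fragment above.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample; the root element and <aliases> wrapper are assumptions.
SAMPLE = """
<parse-plugins>
  <mimeType name="application/rss+xml">
    <plugin id="parse-feed" />
  </mimeType>
  <mimeType name="text/xml">
    <plugin id="parse-feed" />
  </mimeType>
  <aliases>
    <alias name="parse-feed"
           extension-id="org.apache.nutch.parse.feed.FeedParser" />
  </aliases>
</parse-plugins>
"""


def check_parse_plugins(xml_text):
    """Return (mime type -> plugin ids, plugin ids that have no alias)."""
    root = ET.fromstring(xml_text.strip())
    aliases = {alias.get("name") for alias in root.iter("alias")}
    mapping = {}
    for mime in root.iter("mimeType"):
        mapping[mime.get("name")] = [p.get("id") for p in mime.findall("plugin")]
    used = {plugin_id for ids in mapping.values() for plugin_id in ids}
    return mapping, used - aliases


mapping, missing = check_parse_plugins(SAMPLE)
print(mapping)
print("plugin ids without an alias:", sorted(missing))
```

An empty "without an alias" list only shows that the file is internally 
consistent; it says nothing about whether the parser itself handles the feed.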

Thanks again for the plugin and many thanks for your help. I look forward to 
contributing in terms of index-feed and query-feed.

> Possibly use a different library to parse RSS feed for improved performance 
> and compatibility
> ---------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-444
>                 URL: https://issues.apache.org/jira/browse/NUTCH-444
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>            Priority: Minor
>             Fix For: 0.9.0
>
>         Attachments: parse-feed.tar.bz2
>
>
> As discussed by Nutch Newbie, Gal, and Chris on NUTCH-443, the current 
> library (feedparser) has the following issues:
> - OutOfMemory when parsing > 100k feeds, since it has to convert the feed to 
> jdom first
> - no support for Atom 1.0
> - there has been no development in the last year
> Alternatives are:
> - Rome 
> - Informa
> - custom implementation based on Stax
> - ??

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
