parse-rss null pointer exception
--------------------------------

         Key: NUTCH-89
         URL: http://issues.apache.org/jira/browse/NUTCH-89
     Project: Nutch
        Type: Bug
  Components: fetcher  
    Versions: 0.7, 0.8-dev    
    Reporter: Michael Nebel


The rss-parser throws a NullPointerException. The cause is a syntax error in the page. 
When it hits such a page, the parser tries to add an outlink with "null" as the anchor.  
The anchor of an outlink must not be null. 

java.lang.NullPointerException
        at org.apache.nutch.io.UTF8.writeString(UTF8.java:236)
        at org.apache.nutch.parse.Outlink.write(Outlink.java:51)
        at org.apache.nutch.parse.ParseData.write(ParseData.java:111)
        at org.apache.nutch.io.SequenceFile$Writer.append(SequenceFile.java:137)
        at org.apache.nutch.io.MapFile$Writer.append(MapFile.java:127)
        at org.apache.nutch.io.ArrayFile$Writer.append(ArrayFile.java:39)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.outputPage(Fetcher.java:281)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:261)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
Exception in thread "main" java.lang.RuntimeException: SEVERE error logged.  Exiting fetcher.
        at org.apache.nutch.fetcher.Fetcher.run(Fetcher.java:354)
        at org.apache.nutch.fetcher.Fetcher.main(Fetcher.java:488)
        at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:140)
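
The trace shows the failure surfacing when the outlink is written out, not when it is created. A minimal, self-contained sketch of that pattern follows. SketchOutlink and failsOnWrite are hypothetical stand-ins, not the Nutch classes, and java.io.DataOutputStream.writeUTF is used only as an assumed analogue of a string writer that dereferences its argument:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class NullAnchorSketch {
    /** Hypothetical stand-in for an outlink; NOT the Nutch Outlink class. */
    static final class SketchOutlink {
        private final String toUrl;
        private final String anchor;

        SketchOutlink(String toUrl, String anchor) {
            // No validation here, so a null anchor is accepted silently.
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        void write(DataOutputStream out) throws IOException {
            out.writeUTF(toUrl);
            out.writeUTF(anchor); // NullPointerException if anchor is null
        }
    }

    /** Returns true if serializing the link fails with a NullPointerException. */
    static boolean failsOnWrite(String url, String anchor) {
        SketchOutlink link = new SketchOutlink(url, anchor);
        DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
        try {
            link.write(out);
            return false;
        } catch (NullPointerException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(failsOnWrite("http://example.com/", null)); // true
        System.out.println(failsOnWrite("http://example.com/", ""));   // false
    }
}
```

Because construction succeeds, the error only appears later in the fetcher thread, far from the parser that produced the bad value.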

I suggest the following patch:

Index: src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java
===================================================================
--- src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java     (revision 279397)
+++ src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss/RSSParser.java     (working copy)
@@ -157,11 +157,13 @@
                 if (r.getLink() != null) {
                     try {
                         // get the outlink
-                        theOutlinks.add(new Outlink(r.getLink(), r
-                                .getDescription()));
+                       if (r.getDescription()!= null ) {
+                           theOutlinks.add(new Outlink(r.getLink(), r.getDescription()));
+                       } else {
+                           theOutlinks.add(new Outlink(r.getLink(), ""));
+                       }
                     } catch (MalformedURLException e) {
-                        LOG
-                                .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+                        LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
                                         + r.getLink()
                                         + ": Attempting to continue processing outlinks");
                         e.printStackTrace();
@@ -185,12 +187,13 @@
 
                     if (whichLink != null) {
                         try {
-                            theOutlinks.add(new Outlink(whichLink, theRSSItem
-                                    .getDescription()));
-
+                           if (theRSSItem.getDescription()!=null) {
+                               theOutlinks.add(new Outlink(whichLink, theRSSItem.getDescription()));
+                           } else {
+                               theOutlinks.add(new Outlink(whichLink, ""));
+                           }
                         } catch (MalformedURLException e) {
-                            LOG
-                                    .info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
+                            LOG.info("nutch:parse-rss:RSSParser Exception: MalformedURL: "
                                             + whichLink
                                             + ": Attempting to continue processing outlinks");
                             e.printStackTrace();
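
Both hunks repeat the same null check. One possible simplification, offered as a sketch and not part of the proposed patch, is to put the check in a small helper so each call site could read `new Outlink(r.getLink(), anchorOrEmpty(r.getDescription()))`. The class AnchorDefaults and the method anchorOrEmpty are hypothetical names:

```java
class AnchorDefaults {
    /**
     * Returns the item description for use as anchor text,
     * or the empty string when the feed supplied none.
     * Hypothetical helper; not part of RSSParser.
     */
    static String anchorOrEmpty(String description) {
        return description != null ? description : "";
    }

    public static void main(String[] args) {
        System.out.println("[" + anchorOrEmpty(null) + "]");          // []
        System.out.println("[" + anchorOrEmpty("Latest news") + "]"); // [Latest news]
    }
}
```

The behavior matches the patch: a missing description becomes an empty anchor, and a present one is passed through unchanged.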


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira
