Erik Hatcher wrote:
Never mind... I had inadvertently removed the parse-html plugin from my nutch-site.xml, which caused this issue.

However, it'd be nice to get a clearer error message for this situation. Thoughts?

Perhaps in ParserFactory a synchronized HashMap should be used in place of a Hashtable, so that a null parser for a content type can still be cached. Then Fetcher.java should be changed to gracefully turn a null parser into a ParseStatus.STATUS_NOTPARSED. Does that sound right?
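To make the suggestion concrete, here is a minimal Java sketch of the idea. It is not the actual ParserFactory or Fetcher code: the class NullParserCacheSketch, the nested Parser interface, and the lookup, getParser, parseOrStatus and hashtableRejectsNull methods are all hypothetical names made up for illustration, and the "NOTPARSED" string only stands in for ParseStatus.STATUS_NOTPARSED. It relies on two documented JDK behaviors: Hashtable.put rejects null values with a NullPointerException, while a HashMap wrapped by Collections.synchronizedMap accepts them.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

class NullParserCacheSketch {
    // Hypothetical stand-in for a parser plugin; NOT the Nutch Parser interface.
    interface Parser {
        String parse(String content);
    }

    // A synchronized HashMap can hold null values, so "no parser for this
    // content type" can be cached like any other lookup result.
    private static final Map<String, Parser> CACHE =
            Collections.synchronizedMap(new HashMap<String, Parser>());

    // Hypothetical plugin lookup: assume only text/plain has a parser configured.
    static Parser lookup(String contentType) {
        return "text/plain".equals(contentType) ? content -> content.trim() : null;
    }

    static Parser getParser(String contentType) {
        // The wrapper locks on itself, so this block makes the
        // check-then-put sequence atomic with respect to other callers.
        synchronized (CACHE) {
            if (CACHE.containsKey(contentType)) {
                return CACHE.get(contentType); // may be a cached null
            }
            Parser parser = lookup(contentType);
            CACHE.put(contentType, parser); // null is accepted here
            return parser;
        }
    }

    // Caller side: turn a null parser into a "not parsed" result
    // instead of letting an exception escape.
    static String parseOrStatus(String contentType, String content) {
        Parser parser = getParser(contentType);
        if (parser == null) {
            return "NOTPARSED: no parser for " + contentType;
        }
        return parser.parse(content);
    }

    // Shows the failure mode from the stack trace: Hashtable rejects null values.
    static boolean hashtableRejectsNull() {
        try {
            new Hashtable<String, Parser>().put("text/html", null);
            return false;
        } catch (NullPointerException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(hashtableRejectsNull());
        System.out.println(parseOrStatus("text/html", "<p>hi</p>"));
    }
}
```

Using containsKey before get is what lets a cached null be told apart from a content type that has not been looked up yet.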

Doug


On Aug 10, 2005, at 2:25 PM, Erik Hatcher wrote:

On Aug 10, 2005, at 11:51 AM, Andrzej Bialecki wrote:

There is a plugin hook in HTML parser, where it calls HTML filters (HtmlParser.java:207). These filters can add/modify anything collected so far. You could implement an HTMLFilter plugin similar to the creativecommons plugin, which would be automatically called here to add outlinks.


I had it working with a custom HTML parser, but after your mail I changed to the filter approach as that is much cleaner. However, I'm now getting this when running a fetch:

050810 142101 parsing: /Users/erik/dev/arp/nines/build/plugins/nines-rdfmap/plugin.xml
050810 142101 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.nines.nutch.RdfLinkMapFilter
050810 142102 Configured Client
java.lang.NullPointerException
        at java.util.Hashtable.put(Hashtable.java:396)
        at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:91)
        at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:59)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:253)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050810 142104 fetch okay, but can't parse http://www.rossettiarchive.org/docs/1-1847.s244.raw.html, reason: failed (2,200): java.lang.NullPointerException

My plugin.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="nines-rdfmap"
   name="NINES HTML Parse Filter"
   version="1.0.0"
   provider-name="nines.org">

   <extension-point
      id="org.apache.nutch.parse.HtmlParseFilter"
      name="HTML Parse Filter"/>


   <runtime>
      <library name="nines-rdfmap.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="org.nines.nutch.RdfLinkMapFilter"
              name="RDF Link Map Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RdfLinkMapFilter"
                      class="org.nines.nutch.RdfLinkMapFilter"/>
   </extension>

</plugin>

I'm sure something simple is awry, but I'm not seeing it currently. What am I missing?

Perhaps some improvements to the error message could be made to help out in this type of scenario?

Thanks,
    Erik





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
