Erik Hatcher wrote:
Never mind... I had inadvertently removed the parse-html plugin from my
nutch-site.xml, which caused this issue.
However, it'd be nice to get a clearer error message for this
situation. Thoughts?
Perhaps in ParserFactory a synchronized HashMap should be used in place
of a Hashtable, so that a null parser for a content type can still be
cached. Then Fetcher.java should be changed to gracefully turn a null
parser into a ParseStatus.STATUS_NOTPARSED. Does that sound right?
Doug
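A minimal sketch of the idea above, not the actual Nutch code: a synchronized HashMap caches the lookup result even when it is null, and the caller turns a null parser into a "not parsed" status instead of throwing. Every name here (ParserLookupSketch, Parser, ParserLocator, Status, Result) is a hypothetical stand-in; only the Hashtable-versus-HashMap behaviour and the STATUS_NOTPARSED idea come from the thread.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

class ParserLookupSketch {

    /** Hypothetical stand-in for a parser plugin. */
    interface Parser {
        String parse(String content);
    }

    /** Hypothetical stand-in for the plugin lookup; may return null. */
    interface ParserLocator {
        Parser find(String contentType);
    }

    /** Hypothetical stand-in for the parse status codes. */
    enum Status { SUCCESS, NOTPARSED }

    static class Result {
        final Status status;
        final String text;

        Result(Status status, String text) {
            this.status = status;
            this.text = text;
        }
    }

    // HashMap accepts null values; Hashtable.put throws NullPointerException.
    private final Map<String, Parser> cache =
        Collections.synchronizedMap(new HashMap<String, Parser>());
    private final ParserLocator locator;
    int locatorCalls = 0; // counts real lookups, to show that null is cached

    ParserLookupSketch(ParserLocator locator) {
        this.locator = locator;
    }

    Parser getParser(String contentType) {
        // Lock the map so the containsKey/put pair is atomic.
        synchronized (cache) {
            if (cache.containsKey(contentType)) {
                return cache.get(contentType); // may be a cached null
            }
            locatorCalls++;
            Parser parser = locator.find(contentType);
            cache.put(contentType, parser);
            return parser;
        }
    }

    /** What the fetcher side might do with a missing parser. */
    Result parse(String contentType, String content) {
        Parser parser = getParser(contentType);
        if (parser == null) {
            return new Result(Status.NOTPARSED,
                "no parser plugin enabled for content type " + contentType);
        }
        return new Result(Status.SUCCESS, parser.parse(content));
    }

    public static void main(String[] args) {
        Parser trim = c -> c.trim();
        ParserLookupSketch lookup =
            new ParserLookupSketch(t -> "text/plain".equals(t) ? trim : null);
        Result r = lookup.parse("text/html", "<p>hello</p>");
        System.out.println(r.status + ": " + r.text);
    }
}
```

With this shape, a missing parse-html plugin would surface as a message naming the content type rather than as a NullPointerException from inside the cache.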
On Aug 10, 2005, at 2:25 PM, Erik Hatcher wrote:
On Aug 10, 2005, at 11:51 AM, Andrzej Bialecki wrote:
There is a plugin hook in HTML parser, where it calls HTML filters
(HtmlParser.java:207). These filters can add/modify anything
collected so far. You could implement an HTMLFilter plugin similar
to the creativecommons plugin, which would be automatically called
here to add outlinks.
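To illustrate the hook described above, here is a rough, self-contained sketch of a parser that hands its result to a chain of filters, each of which may add outlinks. It does not use the real Nutch interfaces: HtmlFilterHookSketch, ParseData, HtmlFilter and RdfLinkFilter are all hypothetical names, and the ".html to .rdf" mapping is invented purely for the example.

```java
import java.util.ArrayList;
import java.util.List;

class HtmlFilterHookSketch {

    /** Hypothetical holder for what the parser has collected so far. */
    static class ParseData {
        final List<String> outlinks = new ArrayList<>();
    }

    /** Hypothetical filter contract: receive the parse, return it (possibly modified). */
    interface HtmlFilter {
        ParseData filter(String url, String html, ParseData parse);
    }

    /** Illustrative filter: adds one outlink derived from the page URL. */
    static class RdfLinkFilter implements HtmlFilter {
        public ParseData filter(String url, String html, ParseData parse) {
            String suffix = ".html";
            if (url.endsWith(suffix)) {
                String base = url.substring(0, url.length() - suffix.length());
                parse.outlinks.add(base + ".rdf"); // invented mapping, for illustration only
            }
            return parse;
        }
    }

    /** The hook: after the parser's own work, run every configured filter in order. */
    static ParseData parse(String url, String html, List<HtmlFilter> filters) {
        ParseData parse = new ParseData();
        // The parser's own link extraction would happen here (omitted in this sketch).
        for (HtmlFilter f : filters) {
            parse = f.filter(url, html, parse);
        }
        return parse;
    }

    public static void main(String[] args) {
        List<HtmlFilter> filters = new ArrayList<>();
        filters.add(new RdfLinkFilter());
        ParseData p = parse("http://example.org/doc.html", "<html></html>", filters);
        System.out.println(p.outlinks);
    }
}
```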
I had it working with a custom HTML parser, but after your mail I
changed to the filter approach as that is much cleaner. However, I'm
now getting this when running a fetch:
050810 142101 parsing: /Users/erik/dev/arp/nines/build/plugins/nines-rdfmap/plugin.xml
050810 142101 impl: point=org.apache.nutch.parse.HtmlParseFilter class=org.nines.nutch.RdfLinkMapFilter
050810 142102 Configured Client
java.lang.NullPointerException
        at java.util.Hashtable.put(Hashtable.java:396)
        at org.apache.nutch.parse.ParserFactory.getExtension(ParserFactory.java:91)
        at org.apache.nutch.parse.ParserFactory.getParser(ParserFactory.java:59)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.handleFetch(Fetcher.java:253)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:148)
050810 142104 fetch okay, but can't parse http://www.rossettiarchive.org/docs/1-1847.s244.raw.html, reason: failed(2,200): java.lang.NullPointerException
My plugin.xml looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="nines-rdfmap"
   name="NINES HTML Parse Filter"
   version="1.0.0"
   provider-name="nines.org">

   <extension-point
      id="org.apache.nutch.parse.HtmlParseFilter"
      name="HTML Parse Filter"/>

   <runtime>
      <library name="nines-rdfmap.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="org.nines.nutch.RdfLinkMapFilter"
              name="RDF Link Map Filter"
              point="org.apache.nutch.parse.HtmlParseFilter">
      <implementation id="RdfLinkMapFilter"
                      class="org.nines.nutch.RdfLinkMapFilter"/>
   </extension>

</plugin>
I'm sure something simple is awry, but I'm not seeing it currently.
What am I missing?
Perhaps some improvements to the error message could be made to help
out in this type of scenario?
Thanks,
Erik
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general