Hi,
The Neko HTML parser setup is done inside a silent try/catch block (Nutch 0.9:
HtmlParser.java:248-259). The problem is that the first feature being set
throws an exception, so the rest of the setup block is skipped. The catch block
does nothing, which is probably why nobody has noticed this.
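To illustrate the failure mode, here is a minimal sketch that uses the JDK's built-in SAX XMLReader rather than Neko, so it runs without extra jars. The class name, the helper names and the bogus feature URI are made up for the demonstration and are not taken from Nutch. It only shows that when the first setFeature() call throws, the calls after it in the same try block never run.

```java
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class FeatureSetupDemo {

    // Hypothetical URI that no parser recognizes; it stands in for the
    // wrong feature name in the original setup block.
    static final String BOGUS = "http://example.invalid/features/does-not-exist";

    // Standard SAX2 feature that every conforming XMLReader must recognize.
    static final String NAMESPACES = "http://xml.org/sax/features/namespaces";

    // Creates a plain JDK XMLReader (assumption: used here instead of Neko).
    static XMLReader newReader() {
        try {
            return SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        } catch (Exception e) {
            throw new IllegalStateException("Could not create XMLReader", e);
        }
    }

    // Sets all features inside ONE try block, like the setup code in question,
    // and returns how many features were actually applied.
    static int setInOneBlock(XMLReader reader, String... features) {
        int applied = 0;
        try {
            for (String feature : features) {
                reader.setFeature(feature, true);
                applied++;
            }
        } catch (SAXException e) {
            // Reporting the exception is what makes the problem visible;
            // an empty catch block would hide it completely.
            System.err.println("Exception while setting parser features: " + e);
        }
        return applied;
    }

    public static void main(String[] args) {
        // Bad feature first: the valid feature after it is never set.
        System.out.println(setInOneBlock(newReader(), BOGUS, NAMESPACES));
        // Valid feature first: it is set before the bad one throws.
        System.out.println(setInOneBlock(newReader(), NAMESPACES, BOGUS));
    }
}
```

With the unrecognized feature first, nothing at all gets applied, which matches what happens to the Neko settings here.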
I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk
contains the same code.
The patch does the following:
1. Fixes the URI of the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar
at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Logs a warning when an exception is caught.
Please note that a lot of messages now go to the console (not the log4j log),
because the "report-errors" feature is now actually being set. Shouldn't it be removed?
Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
DOMFragmentParser parser = new DOMFragmentParser();
// some plugins, e.g., creativecommons, need to examine html comments
try {
- parser.setFeature("http://apache.org/xml/features/include-comments",
- true);
- parser.setFeature("http://apache.org/xml/features/augmentations",
+// parser.setFeature("http://apache.org/xml/features/include-comments",
+// true);
+ parser.setFeature("http://cyberneko.org/html/features/augmentations",
true);
parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
false);
@@ -256,7 +256,9 @@
true);
parser.setFeature("http://cyberneko.org/html/features/report-errors",
true);
-} catch (SAXException e) {}
+} catch (SAXException e) {
+ LOG.warn("Exception while setting Neko parser settings.", e);
+}
// convert Document to DocumentFragment
HTMLDocumentImpl doc = new HTMLDocumentImpl();
doc.setErrorChecking(false);