Hi,
The Neko HTML parser setup is done inside a silent try/catch block (Nutch 0.9:
HtmlParser.java:248-259). The problem is that the first feature being set
throws an exception, so the whole setup block is skipped. The catch block
does nothing, so probably nobody has noticed this.
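To illustrate the effect, here is a minimal sketch. FakeParser and its rule for rejecting feature ids are hypothetical stand-ins made up for this example, not Neko's actual behaviour; only the two feature ids are taken from HtmlParser.java.

```java
import java.util.ArrayList;
import java.util.List;

public class SilentCatchDemo {

    /** Hypothetical stand-in for the parser: rejects any feature id it does not know. */
    static class FakeParser {
        final List<String> applied = new ArrayList<>();

        void setFeature(String id, boolean value) throws Exception {
            // Assumption for this sketch only: ids outside the cyberneko.org namespace are unknown.
            if (!id.startsWith("http://cyberneko.org/html/features/")) {
                throw new Exception("unrecognized feature: " + id);
            }
            applied.add(id);
        }
    }

    /** All calls share one try block, as in HtmlParser.java, and the catch is silent. */
    static List<String> configureInOneBlock(String... ids) {
        FakeParser parser = new FakeParser();
        try {
            for (String id : ids) {
                parser.setFeature(id, true);
            }
        } catch (Exception e) {
            // swallowed: nothing reports that the setup stopped early
        }
        return parser.applied;
    }

    public static void main(String[] args) {
        List<String> applied = configureInOneBlock(
            "http://apache.org/xml/features/include-comments",
            "http://cyberneko.org/html/features/report-errors");
        // The first call throws, so the second is never reached.
        System.out.println("features applied: " + applied.size());
    }
}
```

Running main prints "features applied: 0": the valid second feature is never set, and nothing is logged.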
I attach a patch which fixes this. It was made against Nutch 0.9, but SVN trunk
contains the same code.
The patch does the following:
1. Fixes the augmentations feature id (it now uses the cyberneko.org namespace).
2. Disables (comments out) the include-comments feature, because I couldn't find
anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Logs a warning when an exception is caught.
Please note that a lot of messages now go to the console (not the log4j log),
because the "report-errors" feature is actually being set. Shouldn't it be removed?
Cheers,
Marcin
--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-04-03 05:44:21.000000000 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-05-21 12:33:46.000000000 +0200
@@ -246,9 +246,9 @@
DOMFragmentParser parser = new DOMFragmentParser();
// some plugins, e.g., creativecommons, need to examine html comments
try {
- parser.setFeature("http://apache.org/xml/features/include-comments",
- true);
- parser.setFeature("http://apache.org/xml/features/augmentations",
+// parser.setFeature("http://apache.org/xml/features/include-comments",
+// true);
+ parser.setFeature("http://cyberneko.org/html/features/augmentations",
true);
parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
false);
@@ -256,7 +256,9 @@
true);
parser.setFeature("http://cyberneko.org/html/features/report-errors",
true);
- } catch (SAXException e) {}
+ } catch (SAXException e) {
+ LOG.warn("Exception while setting Neko parser settings.", e);
+ }
// convert Document to DocumentFragment
HTMLDocumentImpl doc = new HTMLDocumentImpl();
doc.setErrorChecking(false);