Hi,
The Neko HTML parser setup is done inside a silent try/catch block (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the rest of the setup block is skipped. The catch block 
does nothing, which is probably why nobody has noticed.
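
To illustrate the failure mode, here is a minimal sketch. It is not the Nutch code: it uses the JDK's own XMLReader as a stand-in for Neko's DOMFragmentParser, and the class name FeatureSetup, the helper setFeatures and the feature URI under example.invalid are hypothetical. Each feature gets its own try/catch, so one rejected feature is reported and does not stop the others from being set.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;

public class FeatureSetup {

    // Hypothetical helper: sets every feature in its own try/catch and
    // returns the URIs the parser rejected, instead of silently giving up
    // after the first failure.
    static List<String> setFeatures(XMLReader reader, String[] uris, boolean[] values) {
        List<String> rejected = new ArrayList<>();
        for (int i = 0; i < uris.length; i++) {
            try {
                reader.setFeature(uris[i], values[i]);
            } catch (SAXException e) {
                // Report the problem; an empty catch would hide it.
                System.err.println("WARN: could not set " + uris[i] + ": " + e);
                rejected.add(uris[i]);
            }
        }
        return rejected;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in parser (assumption): the JDK SAX reader, not Neko.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String[] uris = {
            "http://example.invalid/features/no-such-feature", // hypothetical, expected to be rejected
            "http://xml.org/sax/features/namespaces"           // standard SAX feature
        };
        boolean[] values = {true, true};
        System.out.println("Rejected features: " + setFeatures(reader, uris, values));
    }
}
```

With a single try/catch around all the calls, as in HtmlParser.java, the second feature in this sketch would never be set.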

I am attaching a patch that fixes this. It was made against Nutch 0.9, but SVN 
trunk contains the same code.

The patch does the following:
1. Fixes the augmentations feature URI (it now uses the cyberneko.org prefix 
instead of the apache.org one).
2. Comments out the include-comments feature, because I couldn't find anything 
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Logs a warning when an exception is caught.

Please note that a lot of messages now go to the console (not to the log4j log), 
because the "report-errors" feature is now actually being set. Shouldn't it be 
removed?

Cheers,
Marcin
--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-04-03 05:44:21.000000000 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-05-21 12:33:46.000000000 +0200
@@ -246,9 +246,9 @@
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
-      parser.setFeature("http://apache.org/xml/features/include-comments", 
-              true);
-      parser.setFeature("http://apache.org/xml/features/augmentations", 
+//      parser.setFeature("http://apache.org/xml/features/include-comments", 
+//              true);
+      parser.setFeature("http://cyberneko.org/html/features/augmentations", 
               true);
       parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
               false);
@@ -256,7 +256,9 @@
               true);
       parser.setFeature("http://cyberneko.org/html/features/report-errors",
               true);
-    } catch (SAXException e) {}
+    } catch (SAXException e) {
+    	LOG.warn("Exception while setting Neko parser settings.", e);
+    }
     // convert Document to DocumentFragment
     HTMLDocumentImpl doc = new HTMLDocumentImpl();
     doc.setErrorChecking(false);
