Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
> I would suggest that you open a JIRA issue and attach the patch there.
> For this case, there is a similar issue (with patch) at NUTCH-369.

Done - NUTCH-487

Marcin


Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Doğacan Güney

Hi,

On 5/21/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:

> Hi,
> The Neko HTML parser setup is done inside a silent try/catch statement
> (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature
> being set throws an exception, so the rest of the setup block is skipped and
> the parser runs on its defaults. The catch block does nothing, which is
> probably why nobody noticed this.
>
> I am attaching a patch that fixes this. It was made against Nutch 0.9, but
> SVN trunk contains the same code.
>
> The patch does the following:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything
> similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Logs a warning when the exception is caught.
>
> Please note that a lot of messages now go to the console (not to the log4j
> log), because the "report-errors" feature is actually being set. Shouldn't
> it be removed?


I would suggest that you open a JIRA issue and attach the patch there.
For this case, there is a similar issue (with patch) at NUTCH-369.



> Cheers,
> Marcin




--
Doğacan Güney


Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
Hi,
The Neko HTML parser setup is done inside a silent try/catch statement
(Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature
being set throws an exception, so the rest of the setup block is skipped and
the parser runs on its defaults. The catch block does nothing, which is
probably why nobody noticed this.
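
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use NekoHTML: FakeParser is a hypothetical stand-in that simply rejects the first feature ID, mirroring what is reported above, and the class and method names are made up for this example. It contrasts one silent try/catch around all calls with one try/catch per call.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;

public class SilentCatchDemo {

    // Hypothetical stand-in for a parser: it accepts only the IDs it was given.
    static class FakeParser {
        private final Set<String> recognized;
        final Map<String, Boolean> applied = new LinkedHashMap<>();

        FakeParser(Set<String> recognized) {
            this.recognized = recognized;
        }

        void setFeature(String id, boolean value) throws SAXException {
            if (!recognized.contains(id)) {
                throw new SAXNotRecognizedException(id);
            }
            applied.put(id, value);
        }
    }

    // Assumption for this sketch: the stand-in rejects include-comments.
    static FakeParser newFakeParser() {
        return new FakeParser(new HashSet<>(Arrays.asList(
            "http://cyberneko.org/html/features/augmentations",
            "http://cyberneko.org/html/features/report-errors")));
    }

    // Features in the order they are set; the first one is the rejected one.
    static Map<String, Boolean> sampleFeatures() {
        Map<String, Boolean> features = new LinkedHashMap<>();
        features.put("http://apache.org/xml/features/include-comments", true);
        features.put("http://cyberneko.org/html/features/augmentations", true);
        features.put("http://cyberneko.org/html/features/report-errors", true);
        return features;
    }

    // Same shape as the problematic block: one try around everything, empty catch.
    static void configureSilently(FakeParser parser, Map<String, Boolean> features) {
        try {
            for (Map.Entry<String, Boolean> f : features.entrySet()) {
                parser.setFeature(f.getKey(), f.getValue());
            }
        } catch (SAXException e) {
            // Swallowed: every call after the failing one is skipped unnoticed.
        }
    }

    // One try per feature: a failure is reported and the remaining calls still run.
    static int configureEach(FakeParser parser, Map<String, Boolean> features) {
        int failures = 0;
        for (Map.Entry<String, Boolean> f : features.entrySet()) {
            try {
                parser.setFeature(f.getKey(), f.getValue());
            } catch (SAXException e) {
                failures++;
                System.err.println("Could not set feature " + f.getKey() + ": " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        FakeParser silent = newFakeParser();
        configureSilently(silent, sampleFeatures());
        System.out.println("silent block applied: " + silent.applied.keySet());

        FakeParser each = newFakeParser();
        int failures = configureEach(each, sampleFeatures());
        System.out.println("per-feature applied: " + each.applied.keySet()
            + ", failures: " + failures);
    }
}
```

With this stand-in, the silent variant applies no features at all, while the per-feature variant still applies the two accepted ones and reports the rejected one. Whether a real parser should catch per feature or once for the whole block is a design choice; the point here is only that an empty catch hides the skipped calls.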

I am attaching a patch that fixes this. It was made against Nutch 0.9, but
SVN trunk contains the same code.

The patch does the following:
1. Fixes the augmentations feature (it now uses the cyberneko.org feature URI).
2. Removes the include-comments feature, because I couldn't find anything
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Logs a warning when the exception is caught.

Please note that with the patch applied, a lot of messages now go to the
console (not to the log4j log), because the "report-errors" feature is
actually being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
 DOMFragmentParser parser = new DOMFragmentParser();
 // some plugins, e.g., creativecommons, need to examine html comments
 try {
-  parser.setFeature("http://apache.org/xml/features/include-comments", 
-  true);
-  parser.setFeature("http://apache.org/xml/features/augmentations", 
+//  parser.setFeature("http://apache.org/xml/features/include-comments", 
+//  true);
+  parser.setFeature("http://cyberneko.org/html/features/augmentations", 
   true);
   parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
   false);
@@ -256,7 +256,9 @@
   true);
   parser.setFeature("http://cyberneko.org/html/features/report-errors",
   true);
-} catch (SAXException e) {}
+} catch (SAXException e) {
+	LOG.warn("Exception while setting Neko parser settings.", e);
+}
 // convert Document to DocumentFragment
 HTMLDocumentImpl doc = new HTMLDocumentImpl();
 doc.setErrorChecking(false);