[jira] Updated: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-25:
---

    Attachment: NUTCH-25_draft.patch

Well, something like this should work...

+ Adds a new configurable property, parser.charset.autodetect.min.confidence. Nutch will set the encoding to the detected encoding if the detection confidence is greater than this value. Auto-detection is disabled if the value is negative.
+ Adds charset auto-detection logic to Content.java. Uses icu4j (so you need to put icu4j's jar under lib to try this).
+ If auto-detection is confident enough, it puts the detected encoding into Content's Metadata. The parse-html plugin is updated to see this and set the encoding accordingly.
+ Uses some code from NUTCH-487 and NUTCH-369 (thanks, Renaud Richardet and Marcin Okraszewski). There is a bug in the current parse-html code: if an HTML page specifies an encoding, Neko ignores the auto-detected encoding and assumes that the encoding specified in the page is correct.

I didn't want to do auto-detection in parse-html because other plugins (like XML feed parsing plugins) may also need this. Also, IMHO, doing it in ParseSegment or ParseUtil wouldn't work, because I may not use those.

> needs 'character encoding' detector
> ---
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: Wish
>            Reporter: Stefan Groschupf
>            Priority: Trivial
>         Attachments: NUTCH-25_draft.patch
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset' from the meta tag).
> Although we can cover a lot of ground using the 'C-T' header field in the
> HTTP header and the corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different 'parsing'), in the wild,
> there are a lot of documents without any information about the character
> encoding used. Browsers like Mozilla and search engines like Google use
> character encoding detectors to deal with these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and we might be able to
> port it to Java. Unfortunately, it's not fool-proof. However, along with
> some other heuristics used by Mozilla and elsewhere, it'll be possible to
> achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we also need to detect
> the language of a document, which is even harder and should be a separate
> bug (although it's related).

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
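
The thresholding rule described in the update above (use the detected encoding only when confidence exceeds parser.charset.autodetect.min.confidence, and treat a negative value as "disabled") could be sketched roughly as follows. This is not the code from NUTCH-25_draft.patch: the class EncodingChooser, its method, the fallback argument and the 0-100 confidence scale are all assumptions made for illustration, and the detector itself (icu4j in the patch) is left out so the sketch needs only the JDK.

```java
import java.nio.charset.Charset;

/** Hypothetical helper; not part of Nutch or of the attached patch. */
final class EncodingChooser {

    /**
     * Picks the encoding to parse with.
     *
     * @param declared      encoding named by the HTTP header or meta tag (may be null)
     * @param detected      encoding guessed by a detector (may be null)
     * @param confidence    detector confidence, assumed here to be on a 0-100 scale
     * @param minConfidence threshold; a negative value disables auto-detection
     * @param fallback      encoding returned when neither candidate is usable
     */
    static String choose(String declared, String detected, int confidence,
                         int minConfidence, String fallback) {
        boolean autoDetectEnabled = minConfidence >= 0;
        if (autoDetectEnabled && isUsable(detected) && confidence > minConfidence) {
            // A confident detection wins, even over a charset the page declares.
            return detected;
        }
        if (isUsable(declared)) {
            return declared;
        }
        return fallback;
    }

    /** True if the name is non-null, legal, and supported by this JVM. */
    private static boolean isUsable(String name) {
        try {
            return name != null && Charset.isSupported(name);
        } catch (IllegalArgumentException e) { // thrown for an illegal charset name
            return false;
        }
    }
}
```

Letting a confident detection override the declared charset is one reading of the update; it matches the complaint that Neko trusts the charset named in the page, but the actual patch may order these checks differently.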
[jira] Commented: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525 ]

Ken Krugler commented on NUTCH-25:
--

I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this. They have a charset detector - see http://krugle.com/kse/files/cvs/source.icu-project.org/icu/icu4j/src/com/ibm/icu/text/CharsetDetector.java. I don't know how well it compares to jchardet, though.
[jira] Commented: (NUTCH-25) needs 'character encoding' detector
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507 ]

Doug Cook commented on NUTCH-25:

We might want to think about raising the priority of this. I've seen encoding problems affect quite a few documents. Sometimes this is obvious, because it shows up in the abstract, but often it is subtle, and simply affects recall.

Here's an example. I have indexed the document:
http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1

This document is in UTF-8, but the header says it is in iso-8859-1 (this seems fairly common!). Because of this, a few characters get screwed up, and if I search for "Les Vignes du Soir", I won't find it, because it is being indexed as “Les Vignes du Soirâ€, since it uses curly quotes.

I've seen enough instances of problems like this to make me worry that it is causing significant recall problems.

If anyone has a ready solution for this, please let me know. If not, I'll try to get to it (and contribute back the changes once I get the chance...). Is jchardet still the best Java option out there?
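
The mislabeling Doug Cook describes (UTF-8 bytes decoded as ISO-8859-1) can be reproduced with the JDK alone. The sketch below is only an illustration: the class and method names are made up, and it does not touch Nutch's indexing path. It shows that each curly quote, one character in the original, turns into three unrelated characters once the wrong charset is applied, so the indexed text no longer contains the phrase as written.

```java
import java.nio.charset.StandardCharsets;

/** Illustration only; names are hypothetical and unrelated to Nutch. */
final class MislabeledEncodingDemo {

    /** Decodes UTF-8 bytes as if they were ISO-8859-1, as a mislabeled page would be. */
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Reverses misdecode(); lossless because ISO-8859-1 maps every byte to one char. */
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Curly quotes (U+201C, U+201D) around the wine name from the example.
        String phrase = "\u201CLes Vignes du Soir\u201D";
        String garbled = misdecode(phrase);
        // Each curly quote grew from one char to three.
        System.out.println(garbled.length() - phrase.length());
        System.out.println(repair(garbled).equals(phrase));
    }
}
```

The repair step works only because the wrong charset here is ISO-8859-1, which never drops a byte; with other mislabelings the damage may not be reversible, which is one reason to detect the encoding before decoding.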
Re: Bug (with fix): Neko HTML parser goes on defaults.
> I would suggest that you open a JIRA issue and attach the patch there.
> For this case, there is a similar issue (with patch) at NUTCH-369.

Done - NUTCH-487

Marcin
[jira] Created: (NUTCH-487) Neko HTML parser goes on default settings.
Neko HTML parser goes on default settings.
--

                 Key: NUTCH-487
                 URL: https://issues.apache.org/jira/browse/NUTCH-487
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.9.0
         Environment: Linux, Java 1.5.0.
            Reporter: Marcin Okraszewski
         Attachments: neko_setup.patch

The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk contains the same code.

The patch does the following:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), because the "report-errors" feature is being set. Shouldn't it be removed?
[jira] Updated: (NUTCH-487) Neko HTML parser goes on default settings.
[ https://issues.apache.org/jira/browse/NUTCH-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcin Okraszewski updated NUTCH-487:
-

    Attachment: neko_setup.patch

Patch for Nutch 0.9, which fixes the problem.
Re: Bug (with fix): Neko HTML parser goes on defaults.
Hi,

On 5/21/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:
> Hi,
> The Neko HTML parser setup is done in a silent try / catch statement
> (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first
> feature being set throws an exception, so the whole setup block is
> skipped. The catch statement does nothing, so probably nobody noticed this.
>
> I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk
> contains the same code.
>
> The patch does the following:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything
> similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Prints a warning message when an exception is caught.
>
> Please note that a lot of messages now go to the console (not the log4j
> log), because the "report-errors" feature is being set. Shouldn't it be
> removed?

I would suggest that you open a JIRA issue and attach the patch there. For this case, there is a similar issue (with patch) at NUTCH-369.

> Cheers,
> Marcin

--
Doğacan Güney
Bug (with fix): Neko HTML parser goes on defaults.
Hi,

The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: HtmlParser.java:248-259). The problem is that the first feature being set throws an exception, so the whole setup block is skipped. The catch statement does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk contains the same code.

The patch does the following:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), because the "report-errors" feature is being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java 2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
     DOMFragmentParser parser = new DOMFragmentParser();
     // some plugins, e.g., creativecommons, need to examine html comments
     try {
-      parser.setFeature("http://apache.org/xml/features/include-comments",
-              true);
-      parser.setFeature("http://apache.org/xml/features/augmentations",
+//      parser.setFeature("http://apache.org/xml/features/include-comments",
+//              true);
+      parser.setFeature("http://cyberneko.org/html/features/augmentations",
               true);
       parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
               false);
@@ -256,7 +256,9 @@
               true);
       parser.setFeature("http://cyberneko.org/html/features/report-errors",
               true);
-    } catch (SAXException e) {}
+    } catch (SAXException e) {
+      LOG.warn("Exception while setting Neko parser settings.", e);
+    }
     // convert Document to DocumentFragment
     HTMLDocumentImpl doc = new HTMLDocumentImpl();
     doc.setErrorChecking(false);
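
The failure mode reported here (one unrecognized feature id inside a shared try block silently cancels every later setFeature call) can be shown without Neko on the classpath. In the sketch below, FakeParser is a hypothetical stand-in that rejects unknown feature ids the way a SAX-style parser would, and java.util.logging replaces Nutch's LOG; none of these names come from the patch. The second method goes one step beyond the patch by catching per feature, so that a single bad id cannot disable the others.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.logging.Level;
import java.util.logging.Logger;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;

/** Illustration only; FakeParser is a hypothetical stand-in, not the Neko API. */
final class FeatureSetupDemo {
    private static final Logger LOG = Logger.getLogger(FeatureSetupDemo.class.getName());

    /** Rejects feature ids it does not know; records the ones it accepts. */
    static final class FakeParser {
        private final Set<String> known;
        final List<String> applied = new ArrayList<>();

        FakeParser(Set<String> known) {
            this.known = known;
        }

        void setFeature(String id, boolean value) throws SAXException {
            if (!known.contains(id)) {
                throw new SAXNotRecognizedException(id);
            }
            applied.add(id);
        }
    }

    /** One try block around everything: the first failure skips every later feature. */
    static void setupAllOrNothing(FakeParser parser, Map<String, Boolean> features) {
        try {
            for (Map.Entry<String, Boolean> f : features.entrySet()) {
                parser.setFeature(f.getKey(), f.getValue());
            }
        } catch (SAXException e) {
            // Logging at least makes the skipped setup visible.
            LOG.log(Level.WARNING, "Exception while setting parser features.", e);
        }
    }

    /** One try block per feature: a bad id is reported and the rest still apply. */
    static void setupEach(FakeParser parser, Map<String, Boolean> features) {
        for (Map.Entry<String, Boolean> f : features.entrySet()) {
            try {
                parser.setFeature(f.getKey(), f.getValue());
            } catch (SAXException e) {
                LOG.log(Level.WARNING, "Could not set feature " + f.getKey(), e);
            }
        }
    }
}
```

Passing the features in an insertion-ordered map (for example a LinkedHashMap) with the unknown id first reproduces the report: setupAllOrNothing applies nothing, while setupEach applies everything except the unknown id.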
bug in SegmentReader
It seems that there is a bug in the SegmentReader class, in the get() method. Here is part of the code:

    int cnt = 0;
    do {
      try {
        Thread.sleep(5000);
      } catch (Exception e) {};
      it = threads.iterator();
      while (it.hasNext()) {
        if (((Thread)it.next()).isAlive()) cnt++;
      }
      if ((cnt > 0) && (LOG.isDebugEnabled())) {
        LOG.debug("(" + cnt + " to retrieve)");
      }
    } while (cnt > 0);

The variable cnt can never decrease in the body of the do-while loop, so once it becomes nonzero the loop is infinite. I think cnt must be reset to 0 at the beginning of each iteration. The declaration has to stay outside the loop, because in Java a variable declared inside the do body is not in scope in the while condition:

    int cnt;
    do {
      cnt = 0;
      try {
        Thread.sleep(5000);
      etc...
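
A self-contained version of the polling loop with the counter reset on every round might look like the sketch below. It is not the SegmentReader code: ThreadPoller, waitForAll and the short poll interval are assumptions chosen so that the example runs quickly on its own.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only; not taken from SegmentReader. */
final class ThreadPoller {

    /**
     * Blocks until no thread in the list is alive, polling every pollMillis.
     * Returns the number of polling rounds performed.
     */
    static int waitForAll(List<Thread> threads, long pollMillis) throws InterruptedException {
        int rounds = 0;
        int alive; // declared outside the loop so the while condition can see it
        do {
            alive = 0; // reset each round; otherwise the count can only grow
            Thread.sleep(pollMillis);
            for (Thread t : threads) {
                if (t.isAlive()) {
                    alive++;
                }
            }
            rounds++;
        } while (alive > 0);
        return rounds;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Thread> workers = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            final long work = 50L * (i + 1);
            Thread t = new Thread(() -> {
                try {
                    Thread.sleep(work); // simulated retrieval work
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            workers.add(t);
        }
        System.out.println("rounds: " + waitForAll(workers, 20));
    }
}
```

If the per-round progress message is not needed, calling Thread.join() on each worker would avoid polling altogether; the loop form is kept here only because it mirrors the code under discussion.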