date:20070521

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread JIRA


 [ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-25:
---

Attachment: NUTCH-25_draft.patch

Well, something like this should work...

+ Adds a new configuration property, parser.charset.autodetect.min.confidence. 
Nutch sets the encoding to the detected encoding if the detection confidence is 
greater than this value. Auto-detection is disabled if the value is negative.

+ Adds charset auto-detection logic to Content.java. Uses icu4j (so you need to 
put icu4j's jar under lib to try this).

+ If auto-detection is confident enough, it puts the detected encoding into 
Content's Metadata. The parse-html plugin is updated to read this and set the 
encoding accordingly.

+ Uses some code from NUTCH-487 and NUTCH-369 (Thanks, Renaud Richardet and 
Marcin Okraszewski). There is a bug in the current parse-html code: if an HTML 
page specifies an encoding, Neko ignores the auto-detected encoding and assumes 
that the encoding specified in the page is correct.
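
A minimal sketch of the threshold rule from the first bullet, assuming a detector that returns a charset name together with an integer confidence. EncodingChoice, Detection, and choose are hypothetical names used only for illustration; they are not taken from NUTCH-25_draft.patch, and the fallback argument stands in for whatever encoding Nutch would otherwise use.

```java
/** Hypothetical helper illustrating the min-confidence rule; not the patch code. */
final class EncodingChoice {

  /** A detector result: a charset name plus a confidence score. */
  static final class Detection {
    final String charset;
    final int confidence;

    Detection(String charset, int confidence) {
      this.charset = charset;
      this.confidence = confidence;
    }
  }

  /**
   * Returns the detected charset when auto-detection is enabled and the
   * confidence is strictly greater than minConfidence; otherwise the fallback.
   */
  static String choose(Detection detected, int minConfidence, String fallback) {
    if (minConfidence < 0 || detected == null) {
      return fallback; // a negative threshold disables auto-detection
    }
    return detected.confidence > minConfidence ? detected.charset : fallback;
  }

  public static void main(String[] args) {
    Detection d = new Detection("UTF-8", 80);
    System.out.println(choose(d, 50, "windows-1252"));
    System.out.println(choose(d, -1, "windows-1252"));
  }
}
```

The comparison is strict ("greater than"), matching the wording of the bullet, so a confidence exactly equal to the threshold keeps the fallback.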

I didn't want to do auto-detection in parse-html because other plugins (like 
xml feed parsing plugins) may also need this. Also, IMHO, doing it in 
ParseSegment or ParseUtil wouldn't work, because I may not use those.

> needs 'character encoding' detector
> ---
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
>  Issue Type: Wish
>Reporter: Stefan Groschupf
>Priority: Trivial
> Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but a different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristic used by Mozilla and elsewhere, it'll be
> possible to achieve a high rate of the detection. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
 ] 

Ken Krugler commented on NUTCH-25:
--

I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this. 
They have a charset detector - see 
http://krugle.com/kse/files/cvs/source.icu-project.org/icu/icu4j/src/com/ibm/icu/text/CharsetDetector.java.
 I don't know how well it compares to jchardet, though.


[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
 ] 

Doug Cook commented on NUTCH-25:


We might want to think about raising the priority of this. I've seen encoding 
problems affect quite a few documents. Sometimes this is obvious, because it 
shows up in the abstract, but often it is subtle, and simply affects recall.

Here's an example.

I have indexed the document: 
http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1

This document is in UTF-8, but the header says it is in iso-8859-1 (this seems 
fairly common!). Because of this, a few characters get screwed up, and if I 
search for "Les Vignes du Soir", I won't find it, because it is being indexed 
as â€œLes Vignes du Soirâ€, since it uses curly quotes.
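
The effect can be reproduced with a short sketch, assuming the page bytes are UTF-8 and the decoder trusts an ISO-8859-1 label. MislabeledUtf8 and its methods are hypothetical names, not Nutch code. The strict UTF-8 check is only a heuristic hint that a label may be wrong, not a full charset detector, and the exact garbage characters differ depending on whether the label is treated as strict ISO-8859-1 or as windows-1252.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of UTF-8 content decoded under a wrong ISO-8859-1 label. */
final class MislabeledUtf8 {

  /** Decodes the bytes the way a parser trusting the header would. */
  static String decodeAsLatin1(byte[] raw) {
    return new String(raw, StandardCharsets.ISO_8859_1);
  }

  /** True if the bytes are well-formed UTF-8; a hint, not proof, of the real encoding. */
  static boolean isWellFormedUtf8(byte[] raw) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(raw));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    // The phrase wrapped in curly quotes (U+201C, U+201D), as on the example page.
    String original = "\u201cLes Vignes du Soir\u201d";
    byte[] raw = original.getBytes(StandardCharsets.UTF_8);

    String garbled = decodeAsLatin1(raw);
    System.out.println("matches original: " + garbled.equals(original));
    System.out.println("looks like UTF-8: " + isWellFormedUtf8(raw));

    // ISO-8859-1 maps every byte to one char, so the damage is reversible.
    String repaired = new String(garbled.getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8);
    System.out.println("repaired: " + repaired.equals(original));
  }
}
```

Each curly quote is three bytes in UTF-8, so the mis-decoded text carries three junk characters on each side of the phrase, which is why the indexed terms no longer match the query.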

I've seen enough instances of problems like this to make me worry that it is 
causing significant recall problems.

If anyone has a ready solution for this, please let me know. If not, I'll try 
to get to it (and contribute back the changes once I get the chance...). Is 
jchardet still the best Java option out there?


Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski

> I would suggest that you open a JIRA issue and attach the patch there.
> For this case, there is a similar issue(with patch) at NUTCH-369.

Done - NUTCH-487

Marcin

[jira] Created: (NUTCH-487) Neko HTML parser goes on default settings.

2007-05-21 Thread Marcin Okraszewski (JIRA)

Neko HTML parser goes on default settings.
--

 Key: NUTCH-487
 URL: https://issues.apache.org/jira/browse/NUTCH-487
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Linux, Java 1.5.0.
Reporter: Marcin Okraszewski
 Attachments: neko_setup.patch

The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk 
contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar 
at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the "report-errors" feature is being set. Shouldn't it be removed?
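
The failure mode described above (one rejected feature silently skipping every later one) can be sketched without Neko. FeatureSetup, FeatureSink, and applyAll are hypothetical names; FeatureSink merely stands in for a parser's setFeature(String, boolean) method. Setting each feature in its own try block is one possible design, not what the attached patch does: the patch keeps a single block and logs a warning.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch: apply parser features so one failure does not hide the rest. */
final class FeatureSetup {
  private static final Logger LOG = Logger.getLogger(FeatureSetup.class.getName());

  /** Stand-in for a parser's setFeature(String, boolean). */
  interface FeatureSink {
    void setFeature(String name, boolean value) throws Exception;
  }

  /** Sets every feature in its own try block and returns the names that were rejected. */
  static List<String> applyAll(FeatureSink sink, Map<String, Boolean> features) {
    List<String> rejected = new ArrayList<>();
    for (Map.Entry<String, Boolean> feature : features.entrySet()) {
      try {
        sink.setFeature(feature.getKey(), feature.getValue());
      } catch (Exception e) {
        // Never swallow the exception silently: log it and keep going.
        LOG.log(Level.WARNING, "Could not set parser feature " + feature.getKey(), e);
        rejected.add(feature.getKey());
      }
    }
    return rejected;
  }
}
```

With this shape, a feature name the parser does not recognize is reported once, and the remaining features are still applied.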


[jira] Updated: (NUTCH-487) Neko HTML parser goes on default settings.

2007-05-21 Thread Marcin Okraszewski (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-487:
-

Attachment: neko_setup.patch

Patch for Nutch 0.9, which fixes the problem.


Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Doğacan Güney


Hi,

On 5/21/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:
> Hi,
> The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: 
> HtmlParser.java:248-259). The problem is that the first feature being set 
> throws an exception, so the whole setup block is skipped. The catch statement 
> does nothing, so probably nobody noticed this.
>
> I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk 
> contains the same code.
>
> The patch:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything similar 
> at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Prints a warning message when an exception is caught.
>
> Please note that a lot of messages now go to the console (not the log4j log), 
> because the "report-errors" feature is being set. Shouldn't it be removed?

I would suggest that you open a JIRA issue and attach the patch there.
For this case, there is a similar issue (with patch) at NUTCH-369.

> Cheers,
> Marcin




--
Doğacan Güney

Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski

Hi,
The Neko HTML parser setup is done in a silent try/catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was done on Nutch 0.9, but SVN trunk 
contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything similar 
at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the "report-errors" feature is being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
 DOMFragmentParser parser = new DOMFragmentParser();
 // some plugins, e.g., creativecommons, need to examine html comments
 try {
-  parser.setFeature("http://apache.org/xml/features/include-comments", 
-  true);
-  parser.setFeature("http://apache.org/xml/features/augmentations", 
+//  parser.setFeature("http://apache.org/xml/features/include-comments", 
+//  true);
+  parser.setFeature("http://cyberneko.org/html/features/augmentations", 
   true);
   parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
   false);
@@ -256,7 +256,9 @@
   true);
   parser.setFeature("http://cyberneko.org/html/features/report-errors",
   true);
-} catch (SAXException e) {}
+} catch (SAXException e) {
+	LOG.warn("Exception while setting Neko parser settings.", e);
+}
 // convert Document to DocumentFragment
 HTMLDocumentImpl doc = new HTMLDocumentImpl();
 doc.setErrorChecking(false);

bug in SegmentReader

2007-05-21 Thread Ilya Vishnevsky

It seems that there is a bug in the SegmentReader class, in the get() method.

Here is the relevant part of the code:

int cnt = 0;
do {
  try {
    Thread.sleep(5000);
  } catch (Exception e) {};
  it = threads.iterator();
  while (it.hasNext()) {
    if (((Thread)it.next()).isAlive()) cnt++;
  }
  if ((cnt > 0) && (LOG.isDebugEnabled())) {
    LOG.debug("(" + cnt + " to retrieve)");
  }
} while (cnt > 0);

The variable cnt can never decrease in the body of the do-while loop, so as 
soon as it becomes positive the loop is infinite. I think cnt must be reset 
to 0 at the beginning of each iteration (it still has to be declared outside 
the loop so that the while condition can see it):

int cnt;
do {
  cnt = 0;
  try {
    Thread.sleep(5000);
etc...
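
A self-contained sketch of the corrected polling loop, assuming plain worker threads and a much shorter poll interval than the 5000 ms in SegmentReader. WaitForThreads and awaitAll are hypothetical names, not Nutch code; the sketch only illustrates resetting the counter on every pass.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a polling loop that resets its counter each pass. */
final class WaitForThreads {

  /** Polls until no thread is alive; returns how many polls were made. */
  static int awaitAll(List<Thread> threads, long pollMillis) {
    int polls = 0;
    int cnt; // declared outside the loop so the while condition can see it
    do {
      cnt = 0; // reset on every pass, otherwise the loop can never end
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt and stop waiting
        return polls;
      }
      for (Thread t : threads) {
        if (t.isAlive()) cnt++;
      }
      polls++;
    } while (cnt > 0);
    return polls;
  }

  public static void main(String[] args) {
    List<Thread> threads = new ArrayList<>();
    for (int i = 1; i <= 3; i++) {
      final long work = 50L * i;
      Thread t = new Thread(() -> {
        try {
          Thread.sleep(work); // simulated retrieval work
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      t.start();
      threads.add(t);
    }
    System.out.println("polls needed: " + awaitAll(threads, 20));
  }
}
```

If the counter were initialized once before the loop, the first pass that finds a live thread would leave it positive forever, which is the behaviour reported above.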
