[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-25:
---

Attachment: NUTCH-25_draft.patch

Well, something like this should work...

+ Adds a new configuration property, parser.charset.autodetect.min.confidence. 
Nutch sets the encoding to the detected encoding if the detection confidence is 
greater than this value. Auto-detection is disabled if the value is negative.

+ Adds charset auto-detection logic to Content.java. It uses icu4j (so you need 
to put icu4j's jar under lib to try this).

+ If auto-detection is confident enough, it puts the detected encoding into 
Content's Metadata. The parse-html plugin is updated to look for this and set 
the encoding accordingly.

+ Uses some code from NUTCH-487 and NUTCH-369 (thanks, Renaud Richardet and 
Marcin Okraszewski). There is a bug in the current parse-html code: if an HTML 
page specifies an encoding, Neko ignores the auto-detected encoding and assumes 
that the encoding specified in the page is correct.

I didn't want to do auto-detection in parse-html because other plugins (like 
xml feed parsing plugins) may also need this. Also, IMHO, doing it in 
ParseSegment or ParseUtil wouldn't work, because I may not use those.
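
The threshold rule described above can be sketched roughly as follows. This is 
only an illustration, not the code from NUTCH-25_draft.patch: the class 
EncodingChooser, the method choose and its parameters are hypothetical names, 
and the detector is represented only by its result (a charset name plus a 
confidence, which icu4j's CharsetMatch reports on a 0-100 scale).

```java
class EncodingChooser {
  /**
   * Returns the detected charset when auto-detection is enabled and its
   * confidence is strictly greater than minConfidence; otherwise returns
   * the fallback (for example, the encoding declared by the page).
   * All names here are hypothetical, chosen for illustration only.
   */
  static String choose(String detected, int confidence,
                       int minConfidence, String fallback) {
    if (minConfidence < 0) {
      return fallback;            // negative threshold: auto-detection disabled
    }
    if (detected == null) {
      return fallback;            // detector produced no candidate
    }
    return confidence > minConfidence ? detected : fallback;
  }

  public static void main(String[] args) {
    System.out.println(choose("UTF-8", 80, 50, "ISO-8859-1")); // UTF-8
    System.out.println(choose("UTF-8", 30, 50, "ISO-8859-1")); // ISO-8859-1
    System.out.println(choose("UTF-8", 99, -1, "ISO-8859-1")); // ISO-8859-1
  }
}
```

Keeping the decision in one small function means any parser plugin could apply 
the same rule to whatever encoding it finds in the content metadata.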

> needs 'character encoding' detector
> ---
>
> Key: NUTCH-25
> URL: https://issues.apache.org/jira/browse/NUTCH-25
> Project: Nutch
> Issue Type: Wish
> Reporter: Stefan Groschupf
> Priority: Trivial
> Attachments: NUTCH-25_draft.patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> This is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> (Content-Type) header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in the case
> of XML, we have to use similar but different
> 'parsing'), in the wild there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents.
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it should be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
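
For reference, reading the charset label from a Content-Type value (the case 
that already works when the label is present) might look like the sketch 
below. The class and method names are hypothetical and the regular expression 
is a simplification, not Nutch's actual parsing code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ContentTypeCharset {
  // Matches charset=VALUE, with optional whitespace and optional quotes.
  private static final Pattern CHARSET = Pattern.compile(
      "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

  /** Returns the charset label, or null when the value carries none. */
  static String fromContentType(String contentType) {
    if (contentType == null) {
      return null;
    }
    Matcher m = CHARSET.matcher(contentType);
    return m.find() ? m.group(1) : null;
  }

  public static void main(String[] args) {
    System.out.println(fromContentType("text/html; charset=ISO-8859-1")); // ISO-8859-1
    System.out.println(fromContentType("text/html"));                     // null
  }
}
```

A null result is exactly the 'unlabelled' case for which a detector is needed.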

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Ken Krugler (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497525
 ] 

Ken Krugler commented on NUTCH-25:
--

I use [ICU|http://krugle.com/kse/projects/BYfaaku] for most issues like this. 
They have a charset detector - see 
http://krugle.com/kse/files/cvs/source.icu-project.org/icu/icu4j/src/com/ibm/icu/text/CharsetDetector.java.
 I don't know how well it compares to jchardet, though.




[jira] Commented: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread Doug Cook (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497507
 ] 

Doug Cook commented on NUTCH-25:


We might want to think about raising the priority of this. I've seen encoding 
problems affect quite a few documents. Sometimes this is obvious, because it 
shows up in the abstract, but often it is subtle, and simply affects recall.

Here's an example.

I have indexed the document: 
http://www.winereviewonline.com/wine_reviews.cfm?nCountryID=2&archives=1

This document is in UTF-8, but the header says it is in iso-8859-1 (this seems 
fairly common!). Because of this, a few characters get screwed up, and if I 
search for "Les Vignes du Soir", I won't find it, because the page uses curly 
quotes and the phrase is being indexed as “Les Vignes du Soirâ€.
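
The effect can be reproduced with a few lines of Java. This is just a sketch 
of the general mechanism (UTF-8 bytes decoded with a single-byte charset), 
using a hypothetical class name; it does not touch Nutch itself.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
  /** Decodes the UTF-8 bytes of the text as if they were ISO-8859-1. */
  static String misdecode(String text) {
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    return new String(utf8, StandardCharsets.ISO_8859_1);
  }

  /** Reverses misdecode: ISO-8859-1 maps every byte to one char, so nothing is lost. */
  static String repair(String garbled) {
    byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
    return new String(raw, StandardCharsets.UTF_8);
  }

  public static void main(String[] args) {
    // Curly quotes written as escapes so the source file encoding does not matter.
    String page = "\u201CLes Vignes du Soir\u201D";
    String indexed = misdecode(page);
    System.out.println(page.length());                // 20
    System.out.println(indexed.length());             // 24: each quote became three chars
    System.out.println(repair(indexed).equals(page)); // true
  }
}
```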

I've seen enough instances of problems like this to make me worry that it is 
causing significant recall problems.

If anyone has a ready solution for this, please let me know. If not, I'll try 
to get to it (and contribute back the changes once I get the chance...). Is 
jchardet still the best Java option out there?




Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
> I would suggest that you open a JIRA issue and attach the patch there.
> For this case, there is a similar issue (with patch) at NUTCH-369.

Done - NUTCH-487

Marcin


[jira] Created: (NUTCH-487) Neko HTML parser goes on default settings.

2007-05-21 Thread Marcin Okraszewski (JIRA)
Neko HTML parser goes on default settings.
--

 Key: NUTCH-487
 URL: https://issues.apache.org/jira/browse/NUTCH-487
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.9.0
 Environment: Linux, Java 1.5.0.
Reporter: Marcin Okraszewski
 Attachments: neko_setup.patch

The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was made against Nutch 0.9, but SVN trunk 
contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything 
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the "report-errors" feature is being set. Shouldn't it be removed?
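
The failure mode can be illustrated with a small stand-alone sketch. StubParser 
and the feature names "feature-a", "feature-b" and "feature-c" are made up for 
this example; they stand in for the real parser and its feature URIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.xml.sax.SAXException;
import org.xml.sax.SAXNotRecognizedException;

class FeatureSetupDemo {
  /** Hypothetical stand-in for a parser that knows only two features. */
  static class StubParser {
    private final Set<String> known =
        new HashSet<>(Arrays.asList("feature-b", "feature-c"));
    final List<String> applied = new ArrayList<>();

    void setFeature(String name, boolean value) throws SAXException {
      if (!known.contains(name)) {
        throw new SAXNotRecognizedException(name);
      }
      applied.add(name);
    }
  }

  /** One try block with an empty catch: the first failure skips the rest, unnoticed. */
  static List<String> silentSetup(String... features) {
    StubParser parser = new StubParser();
    try {
      for (String f : features) {
        parser.setFeature(f, true);
      }
    } catch (SAXException e) {}
    return parser.applied;
  }

  /** Same block, but the failure is recorded so that somebody can notice it. */
  static List<String> loggedSetup(List<String> warnings, String... features) {
    StubParser parser = new StubParser();
    try {
      for (String f : features) {
        parser.setFeature(f, true);
      }
    } catch (SAXException e) {
      warnings.add("Exception while setting parser settings: " + e.getMessage());
    }
    return parser.applied;
  }

  public static void main(String[] args) {
    // The unknown first feature throws, so nothing at all is applied.
    System.out.println(silentSetup("feature-a", "feature-b", "feature-c")); // []
    List<String> warnings = new ArrayList<>();
    loggedSetup(warnings, "feature-a", "feature-b", "feature-c");
    System.out.println(warnings.size());                                    // 1
    // With the offending feature removed, the remaining ones are applied.
    System.out.println(silentSetup("feature-b", "feature-c")); // [feature-b, feature-c]
  }
}
```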




[jira] Updated: (NUTCH-487) Neko HTML parser goes on default settings.

2007-05-21 Thread Marcin Okraszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcin Okraszewski updated NUTCH-487:
-

Attachment: neko_setup.patch

Patch for Nutch 0.9, which fixes the problem.




Re: Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Doğacan Güney

Hi,

On 5/21/07, Marcin Okraszewski <[EMAIL PROTECTED]> wrote:

> Hi,
> The Neko HTML parser setup is done in a silent try / catch statement (Nutch 
> 0.9: HtmlParser.java:248-259). The problem is that the first feature being 
> set throws an exception, so the whole setup block is skipped. The catch 
> statement does nothing, so probably nobody noticed this.
>
> I attach a patch which fixes this. It was made against Nutch 0.9, but SVN 
> trunk contains the same code.
>
> The patch:
> 1. Fixes the augmentations feature.
> 2. Removes the include-comments feature, because I couldn't find anything 
> similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
> 3. Prints a warning message when an exception is caught.
>
> Please note that a lot of messages now go to the console (not the log4j 
> log), because the "report-errors" feature is being set. Shouldn't it be 
> removed?


I would suggest that you open a JIRA issue and attach the patch there.
For this case, there is a similar issue (with patch) at NUTCH-369.



> Cheers,
> Marcin




--
Doğacan Güney


Bug (with fix): Neko HTML parser goes on defaults.

2007-05-21 Thread Marcin Okraszewski
Hi,
The Neko HTML parser setup is done in a silent try / catch statement (Nutch 0.9: 
HtmlParser.java:248-259). The problem is that the first feature being set 
throws an exception, so the whole setup block is skipped. The catch statement 
does nothing, so probably nobody noticed this.

I attach a patch which fixes this. It was made against Nutch 0.9, but SVN trunk 
contains the same code.

The patch:
1. Fixes the augmentations feature.
2. Removes the include-comments feature, because I couldn't find anything 
similar at http://people.apache.org/~andyc/neko/doc/html/settings.html
3. Prints a warning message when an exception is caught.

Please note that a lot of messages now go to the console (not the log4j log), 
because the "report-errors" feature is being set. Shouldn't it be removed?

Cheers,
Marcin

--- nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-04-03 05:44:21.0 +0200
+++ ../nutch-0.9/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	2007-05-21 12:33:46.0 +0200
@@ -246,9 +246,9 @@
 DOMFragmentParser parser = new DOMFragmentParser();
 // some plugins, e.g., creativecommons, need to examine html comments
 try {
-  parser.setFeature("http://apache.org/xml/features/include-comments", 
-  true);
-  parser.setFeature("http://apache.org/xml/features/augmentations", 
+//  parser.setFeature("http://apache.org/xml/features/include-comments", 
+//  true);
+  parser.setFeature("http://cyberneko.org/html/features/augmentations", 
   true);
   parser.setFeature("http://cyberneko.org/html/features/balance-tags/ignore-outside-content",
   false);
@@ -256,7 +256,9 @@
   true);
   parser.setFeature("http://cyberneko.org/html/features/report-errors",
   true);
-} catch (SAXException e) {}
+} catch (SAXException e) {
+	LOG.warn("Exception while setting Neko parser settings.", e);
+}
 // convert Document to DocumentFragment
 HTMLDocumentImpl doc = new HTMLDocumentImpl();
 doc.setErrorChecking(false);


bug in SegmentReader

2007-05-21 Thread Ilya Vishnevsky
It seems that there is a bug in the SegmentReader class, in the get() method.

Here is the relevant part of the code:

int cnt = 0;
do {
  try {
    Thread.sleep(5000);
  } catch (Exception e) {};
  it = threads.iterator();
  while (it.hasNext()) {
    if (((Thread)it.next()).isAlive()) cnt++;
  }
  if ((cnt > 0) && (LOG.isDebugEnabled())) {
    LOG.debug("(" + cnt + " to retrieve)");
  }
} while (cnt > 0);

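The variable cnt can never decrease in the body of the do-while loop, so as
soon as it is incremented once the loop becomes infinite. I think cnt must be
reset to 0 at the beginning of each iteration (the declaration has to stay
outside the loop so that the while condition can still see it):

int cnt;
do {
  cnt = 0;
  try {
    Thread.sleep(5000);
etc...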
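
A self-contained version of the corrected loop might look like the sketch 
below. ThreadPoller, waitForAll and the pollMillis parameter are hypothetical 
names used only for this example; the real SegmentReader code sleeps for a 
fixed 5000 ms and logs through LOG.

```java
import java.util.ArrayList;
import java.util.List;

class ThreadPoller {
  /**
   * Polls until none of the given threads is alive and returns the number
   * of polling passes. The counter is reset on every pass, so it reflects
   * only the threads that are alive right now.
   */
  static int waitForAll(List<Thread> threads, long pollMillis) {
    int passes = 0;
    int cnt;                      // declared outside so the while condition can see it
    do {
      cnt = 0;                    // the fix: start each pass from zero
      try {
        Thread.sleep(pollMillis);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return passes;            // stop waiting if we are interrupted
      }
      for (Thread t : threads) {
        if (t.isAlive()) cnt++;
      }
      passes++;
    } while (cnt > 0);
    return passes;
  }

  public static void main(String[] args) {
    List<Thread> threads = new ArrayList<>();
    for (int i = 0; i < 3; i++) {
      Thread t = new Thread(() -> {
        try {
          Thread.sleep(100);      // simulated work
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      t.start();
      threads.add(t);
    }
    System.out.println("passes: " + waitForAll(threads, 30));
  }
}
```

Without the reset, the count from an earlier pass would keep the condition 
true even after every thread has finished.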