[ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12517170
 ] 

Doğacan Güney edited comment on NUTCH-25 at 8/2/07 2:16 AM:
------------------------------------------------------------

> At a very quick look, one potential drawback of the private EncodingClue +
> addClue/clearClues interface is that because EncodingDetector now keeps
> internal state, it is no longer safe to call the same EncodingDetector from
> different threads (though I'm not sure if ICU4J's CharsetDetector is
> thread-safe anyway, so this may already have been a potential problem). Not
> sure if this is an issue with the parsers or not, but will take a look.

Good point. It may be an issue if parsing during fetching is enabled (I think 
multiple threads parse content if fetcher is run in parsing mode). It should be 
enough to change 'clues' (and CharsetDetector if need be) to be a ThreadLocal, 
right?
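
A rough sketch of the ThreadLocal idea, not taken from the patch: the class
name ThreadLocalClues, the fields of EncodingClue and the clueCount() helper
are all hypothetical, and ThreadLocal.withInitial assumes Java 8 or later
(older JDKs would override initialValue() instead). Each thread that touches
the detector gets its own clue list, so addClue/clearClues on one thread
cannot disturb another.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the patch's EncodingDetector; names are illustrative only.
public class ThreadLocalClues {

    // Hypothetical clue record: a candidate charset name plus where it came from.
    static final class EncodingClue {
        final String value;
        final String source;

        EncodingClue(String value, String source) {
            this.value = value;
            this.source = source;
        }
    }

    // Each thread lazily gets its own list, so no clue list is shared across threads.
    private final ThreadLocal<List<EncodingClue>> clues =
        ThreadLocal.withInitial(ArrayList::new);

    public void addClue(String value, String source) {
        clues.get().add(new EncodingClue(value, source));
    }

    public void clearClues() {
        clues.get().clear();
    }

    // Number of clues recorded by the calling thread only.
    public int clueCount() {
        return clues.get().size();
    }
}
```

If the fetcher reuses threads from a pool, clearClues() would still have to be
called before each document, since the per-thread list outlives a single parse.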


 was:
> At a very quick look, one potential drawback of the private EncodingClue +
> addClue/clearClues interface is that because EncodingDetector now keeps
> internal state, it is no longer safe to call the same EncodingDetector from
> different threads (though I'm not sure if ICU4J's CharsetDetector is
> thread-safe anyway, so this may already have been a potential problem). Not
> sure if this is an issue with the parsers or not, but will take a look.

Good point. It may be an issue if parsing during fetching is enabled (I think 
multiple threads parse content if fetcher is run in parsing mode). It should be 
enough to change 'clues' to be a ThreadLocal, right?

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch, 
> NUTCH-25_draft.patch, NUTCH-25_v2.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> (Content-Type) header field in the HTTP header and the
> corresponding meta tag in HTML documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate.
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
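
For illustration only, the "labelled" part of the description above could be
sketched as follows: look for a charset in the HTTP Content-Type header, then
in a meta tag near the start of the body, and report null when the document is
unlabelled so that a statistical detector can take over. The class name
DeclaredCharset, the 2000-byte window and the regular expressions are
assumptions made for this sketch, not Nutch code, and the XML declaration case
is left out.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of Nutch or of the attached patches.
public class DeclaredCharset {

    // charset=... inside a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
        "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // charset=... anywhere inside a <meta ...> tag.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the declared charset name, or null when the document is unlabelled.
    public static String find(String contentTypeHeader, byte[] content) {
        if (contentTypeHeader != null) {
            Matcher h = HEADER_CHARSET.matcher(contentTypeHeader);
            if (h.find()) {
                return h.group(1);
            }
        }
        // Inspect only the start of the body (assumed window of 2000 bytes).
        // ISO-8859-1 maps every byte to one char, so decoding cannot fail.
        int len = Math.min(content.length, 2000);
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }
}
```

A null result is the case the issue is about: neither label is present, so the
encoding has to be guessed from the bytes themselves.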


