[Nutch-dev] [jira] Issue Comment Edited: (NUTCH-25) needs 'character encoding' detector

JIRA Wed, 01 Aug 2007 12:10:30 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516912
 ]


Doğacan Güney edited comment on NUTCH-25 at 8/1/07 12:09 PM:
-------------------------------------------------------------

I cleaned up your latest patch and updated it for the latest trunk (and also 
made some changes):

* Uses Java 5 generics.

* Respects 80 char boundary (for EncodingDetector).

* Moves parseCharacterEncoding and resolveEncodingAlias from StringUtil to 
EncodingDetector. I think they make more sense in EncodingDetector. 

* EncodingClue class is no longer public. 

* Adds EncodingDetector.addClue methods instead. EncodingDetector.addClue 
discards null values, calls resolveEncodingAlias, and stores the 'resolved' 
alias.

* Clients must now call EncodingDetector.clearClues before asking 
EncodingDetector to detect the encoding of new content. Otherwise, older clues 
may affect EncodingDetector's judgement.

* I also moved 'header' detection to EncodingDetector.autoDetectClues. 
Extracting the charset from the header is needed in a couple of plugins, so 
this eliminates some code duplication.

* I removed the stripGarbage method for now. As I said before, I am not sure 
how it will behave when given documents in UTF-16 (or other non-byte-oriented 
encodings). So I changed EncodingDetector to use icu4j's own filtering 
function. However, Doug, if your tests show that stripGarbage performs better, 
feel free to add it back.

* Updates the parse-html, feed and parse-text plugins to use EncodingDetector.
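
To illustrate the addClue/clearClues contract described in the list above, here 
is a minimal sketch. It is not the EncodingDetector from the patch: the class 
name ClueCollector, the getClues accessor and the resolveAlias helper are 
hypothetical, and alias resolution here simply borrows the JDK's own charset 
alias table as a stand-in for resolveEncodingAlias, whose actual mapping may 
differ.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a clue collector; not the class from the patch. */
final class ClueCollector {
  private final List<String> clues = new ArrayList<String>();

  /** Ignores null or unknown labels and stores the resolved charset name. */
  void addClue(String value) {
    if (value == null) {
      return;
    }
    String resolved = resolveAlias(value.trim());
    if (resolved != null) {
      clues.add(resolved);
    }
  }

  /** Must be called before reusing the collector for new content. */
  void clearClues() {
    clues.clear();
  }

  /** Returns a copy so callers cannot modify the internal list. */
  List<String> getClues() {
    return new ArrayList<String>(clues);
  }

  /** Assumption: the JDK alias table stands in for resolveEncodingAlias. */
  static String resolveAlias(String name) {
    try {
      return Charset.forName(name).name();
    } catch (IllegalArgumentException e) {
      // Covers both illegal and unsupported charset names.
      return null;
    }
  }
}
```

A caller that reuses one collector across documents would call clearClues 
first, then addClue for each hint (HTTP header, meta tag, and so on), so that 
clues from the previous document cannot leak into the next decision.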


 was:
I cleaned up your latest patch and updated it for the latest trunk (and also 
made some changes):

* Uses Java 5 generics.

* Respects 80 char boundary (for EncodingDetector).

* Moves parseCharacterEncoding and resolveEncodingAlias from StringUtil to 
EncodingDetector. I think they make more sense in EncodingDetector. 

* EncodingClue class is no longer public. 

* Adds EncodingDetector.addClue methods instead. EncodingDetector.addClue 
discards null values, calls resolveEncodingAlias, and stores the 'resolved' 
alias.

* Clients must now call EncodingDetector.clearClues before asking 
EncodingDetector to detect the encoding of new content. Otherwise, older clues 
may affect EncodingDetector's judgement.

* I also moved 'header' detection to EncodingDetector.autoDetectClues. 
Extracting the charset from the header is needed in a couple of plugins, so 
this eliminates some code duplication.

* Updates the parse-html, feed and parse-text plugins to use EncodingDetector.

> needs 'character encoding' detector
> -----------------------------------
>
>                 Key: NUTCH-25
>                 URL: https://issues.apache.org/jira/browse/NUTCH-25
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Stefan Groschupf
>            Assignee: Doğacan Güney
>             Fix For: 1.0.0
>
>         Attachments: EncodingDetector.java, NUTCH-25.patch, 
> NUTCH-25_draft.patch, NUTCH-25_v2.patch, patch
>
>
> transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=995730&group_id=59548&atid=491356
> submitted by:
> Jungshik Shin
> this is a follow-up to bug 993380 (figure out 'charset'
> from the meta tag).
> Although we can cover a lot of ground using the 'C-T'
> header field in the HTTP header and the
> corresponding meta tag in html documents (and in case
> of XML, we have to use a similar but different
> 'parsing'), in the wild, there are a lot of documents
> without any information about the character encoding
> used. Browsers like Mozilla and search engines like
> Google use character encoding detectors to deal with
> these 'unlabelled' documents. 
> Mozilla's character encoding detector is GPL/MPL'd and
> we might be able to port it to Java. Unfortunately,
> it's not fool-proof. However, along with some other
> heuristics used by Mozilla and elsewhere, it'll be
> possible to achieve a high detection rate. 
> The following page has links to some other related pages.
> http://trainedmonkey.com/week/2004/26
> In addition to the character encoding detection, we
> also need to detect the language of a document, which
> is even harder and should be a separate bug (although
> it's related).
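
For the labelled case mentioned in the description (the 'C-T' header field or 
the corresponding meta tag), the charset label can be pulled out with a small 
regular expression. The sketch below is only an illustration: the class 
HeaderCharset and the method fromContentType are hypothetical names, not 
Nutch's parseCharacterEncoding, and the pattern is a simplification of the 
real header grammar.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper; a simplified stand-in, not Nutch code. */
final class HeaderCharset {
  // Matches charset=LABEL, with optional whitespace and optional quotes.
  private static final Pattern CHARSET = Pattern.compile(
      "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

  /** Returns the charset label from a Content-Type value, or null if none. */
  static String fromContentType(String contentType) {
    if (contentType == null) {
      return null;
    }
    Matcher m = CHARSET.matcher(contentType);
    return m.find() ? m.group(1) : null;
  }
}
```

A null result corresponds to the 'unlabelled' documents described above, which 
is where a statistical detector would have to take over.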

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


