[jira] Updated: (TIKA-335) TXTParser should use incoming charset

Ken Krugler (JIRA) Wed, 25 Nov 2009 10:51:04 -0800

     [ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ken Krugler updated TIKA-335:
-----------------------------

       Priority: Minor  (was: Major)
    Description: 
The incoming charset (if any) from metadata should be passed to 
CharsetDetector.setDeclaredEncoding().


  was:
In looking at how TXTParser uses CharsetDetector, I see the following issues:

1. The incoming charset (if any) from metadata should be passed to 
CharsetDetector.setDeclaredEncoding().
2. The first supported charset should be used, not the last. These are returned 
in confidence order, from best to worst.
3. The current code might also wind up setting a language from one result, and 
the charset from another.

So the biggest change is to bail out of the loop once a supported charset has 
been found. 

     Issue Type: Improvement  (was: Bug)
        Summary: TXTParser should use incoming charset  (was: TXTParser use of 
CharsetDetector has several bugs)
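
Items 2 and 3 of the earlier description (use the first supported charset, and
take the language from that same result) can be illustrated with a rough sketch.
This is not the TXTParser code: the Candidate class and the firstSupported()
helper are hypothetical stand-ins for the detector's results, and the list is
assumed to arrive in confidence order, best first. Only
java.nio.charset.Charset.isSupported() is a real API here.

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

public class CharsetChoice {

    /** Hypothetical stand-in for one detector result. */
    static final class Candidate {
        final String charset;
        final String language;
        final int confidence;

        Candidate(String charset, String language, int confidence) {
            this.charset = charset;
            this.language = language;
            this.confidence = confidence;
        }
    }

    /**
     * Returns the first candidate whose charset this JVM supports, or null
     * if none is. The list is assumed to be ordered best to worst, so
     * returning early keeps the highest-confidence usable result, and the
     * caller reads charset and language from the same object.
     */
    static Candidate firstSupported(List<Candidate> bestToWorst) {
        for (Candidate c : bestToWorst) {
            if (isSupported(c.charset)) {
                return c; // bail out of the loop on the first match
            }
        }
        return null;
    }

    /** Treats null or syntactically illegal charset names as unsupported. */
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<Candidate> results = Arrays.asList(
                new Candidate("x-not-a-real-charset", "de", 60),
                new Candidate("UTF-8", "en", 40),
                new Candidate("ISO-8859-1", "fr", 20));

        Candidate chosen = firstSupported(results);
        System.out.println(chosen.charset + " / " + chosen.language);
    }
}
```

Returning the whole candidate, rather than tracking charset and language in
separate variables, is one way to avoid mixing values from two different
results.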

> TXTParser should use incoming charset
> -------------------------------------
>
>                 Key: TIKA-335
>                 URL: https://issues.apache.org/jira/browse/TIKA-335
>             Project: Tika
>          Issue Type: Improvement
>    Affects Versions: 0.5
>            Reporter: Ken Krugler
>            Priority: Minor
>
> The incoming charset (if any) from metadata should be passed to 
> CharsetDetector.setDeclaredEncoding().

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
