[ https://issues.apache.org/jira/browse/TIKA-335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler updated TIKA-335:
-----------------------------
Priority: Minor (was: Major)
Description:
The incoming charset (if any) from metadata should be passed to
CharsetDetector.setDeclaredEncoding().
was:
In looking at how TXTParser uses CharsetDetector, I see the following issues:
1. The incoming charset (if any) from metadata should be passed to
CharsetDetector.setDeclaredEncoding().
2. The first supported charset should be used, not the last. These are returned
in confidence order, from best to worst.
3. The current code might also wind up setting a language from one result, and
the charset from another.
So the biggest change is to bail out of the loop once a supported charset has
been found.
Issue Type: Improvement (was: Bug)
Summary: TXTParser should use incoming charset (was: TXTParser use of
CharsetDetector has several bugs)
> TXTParser should use incoming charset
> -------------------------------------
>
> Key: TIKA-335
> URL: https://issues.apache.org/jira/browse/TIKA-335
> Project: Tika
> Issue Type: Improvement
> Affects Versions: 0.5
> Reporter: Ken Krugler
> Priority: Minor
>
> The incoming charset (if any) from metadata should be passed to
> CharsetDetector.setDeclaredEncoding().
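
For illustration, the selection rule from the earlier description (use the first supported charset, and take language and charset from the same result) might look roughly like the sketch below. It is not the TXTParser code. Candidate and CharsetChoiceSketch are hypothetical stand-ins so the example runs on the JDK alone. In the parser the list would presumably come from CharsetDetector.detectAll(), after the incoming charset has been passed to CharsetDetector.setDeclaredEncoding().

```java
import java.nio.charset.Charset;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for one detector result: a charset name plus the
// language guessed alongside it (the language may be null).
final class Candidate {
    final String charsetName;
    final String language;

    Candidate(String charsetName, String language) {
        this.charsetName = charsetName;
        this.language = language;
    }
}

// Hypothetical helper class; the names are for illustration only.
final class CharsetChoiceSketch {

    // Candidates are assumed to be ordered best-first, so the first one the
    // JVM supports wins. Returning the whole candidate keeps the charset and
    // the language from the same result. Returns null if nothing is supported.
    static Candidate firstSupported(List<Candidate> bestFirst) {
        for (Candidate c : bestFirst) {
            if (isSupported(c.charsetName)) {
                return c; // bail out: later entries have lower confidence
            }
        }
        return null;
    }

    // Charset.isSupported throws on null or syntactically illegal names;
    // treat those as "not supported" rather than failing the parse.
    static boolean isSupported(String name) {
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<Candidate> results = Arrays.asList(
                new Candidate("x-not-a-real-charset", "de"),
                new Candidate("ISO-8859-1", "en"),
                new Candidate("UTF-8", null));
        Candidate chosen = firstSupported(results);
        System.out.println(chosen.charsetName + " (" + chosen.language + ")");
    }
}
```

Returning from inside the loop is the "bail out" mentioned above. A loop that keeps iterating and overwrites its choice would end up with the last supported charset instead, and could pair it with a language taken from a different result.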