Tika needs to support diverse character encodings.
--------------------------------------------------
Key: TIKA-40
URL: https://issues.apache.org/jira/browse/TIKA-40
Project: Tika
Issue Type: New Feature
Components: general
Affects Versions: 0.1-incubator
Reporter: Keith R. Bennett
Fix For: 0.1-incubator
Currently, the text parser implementation uses the Java runtime's default
encoding when it instantiates a Reader for the input stream passed to it, so
text in any other encoding may be decoded incorrectly. We need to support
other encodings as well.
It would be helpful to be able to specify an encoding in the call to the parse
method.
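As a rough illustration of the request above, here is a minimal sketch of a
parse method that takes an encoding name. The class name TextParseSketch and
the parse(InputStream, String) signature are hypothetical and are not Tika's
actual API; StandardCharsets assumes Java 7 or later. Passing null falls back
to the runtime default, which is the current behavior described above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Tika's actual parser API.
class TextParseSketch {

    /**
     * Reads the whole stream as text using the named encoding.
     * A null encoding falls back to the JVM default charset.
     */
    static String parse(InputStream stream, String encoding) throws IOException {
        Charset charset = (encoding == null)
                ? Charset.defaultCharset()
                : Charset.forName(encoding); // unchecked exception if the name is unknown
        Reader reader = new InputStreamReader(stream, charset);
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int read;
        while ((read = reader.read(buffer)) != -1) {
            text.append(buffer, 0, read);
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // "caf\u00e9" encoded as UTF-8 occupies five bytes; the last two encode U+00E9.
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        // Decoding with the matching encoding recovers the original four characters.
        System.out.println(parse(new ByteArrayInputStream(utf8), "UTF-8").length());
        // Decoding the same bytes as ISO-8859-1 yields five characters (mojibake).
        System.out.println(parse(new ByteArrayInputStream(utf8), "ISO-8859-1").length());
    }
}
```

The same bytes produce different text depending on the charset handed to the
Reader, which is why relying on the runtime default is fragile.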
Ideally, Tika would also be able to determine the encoding automatically from
the data stream. Unicode files may begin with a byte order mark
(http://unicode.org/faq/utf_bom.html#BOM), but I don't know whether other
encodings can be inferred from content.
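The byte order mark case can be sketched as follows. This is only an
illustration under stated assumptions: the class name BomSniffer and its
detect method are hypothetical, the result is returned as an encoding name
rather than a Charset, and a stream without a BOM simply yields null, so the
caller would still need a fallback.

```java
// Hypothetical helper, not part of Tika: maps a leading byte order mark to an
// encoding name.
class BomSniffer {

    /** Returns the encoding name implied by a leading BOM, or null if there is none. */
    static String detect(byte[] head) {
        // The UTF-32LE BOM starts with the UTF-16LE BOM, so test the 4-byte marks first.
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** True if data begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x68, 0x69};
        System.out.println(detect(sample));
    }
}
```

One known ambiguity: the bytes FF FE 00 00 could also be UTF-16LE text whose
first character is U+0000, so the sketch's preference for UTF-32LE is a
judgment call. Encodings that carry no BOM would need some heuristic
inspection of the content, which this sketch does not attempt.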