Tika needs to support diverse character encodings.
--------------------------------------------------

                 Key: TIKA-40
                 URL: https://issues.apache.org/jira/browse/TIKA-40
             Project: Tika
          Issue Type: New Feature
          Components: general
    Affects Versions: 0.1-incubator
            Reporter: Keith R. Bennett
             Fix For: 0.1-incubator


Currently, the text parser implementation uses the default encoding of the Java 
runtime when instantiating a Reader for the passed input stream.  We need to 
support other encodings as well.  

It would be helpful to support the specification of an encoding in the parse 
method.  

Ideally, Tika would also provide the ability to determine the encoding 
automatically based on the data stream.  (Unicode files may have byte order 
marks (http://unicode.org/faq/utf_bom.html#BOM), but I don't know if other 
encodings can be inferred from content.)


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to