Shuai Liu created TIKA-1437:
-------------------------------

             Summary: encoding issue in AutoDetectReader
                 Key: TIKA-1437
                 URL: https://issues.apache.org/jira/browse/TIKA-1437
             Project: Tika
          Issue Type: Bug
          Components: detector, parser
    Affects Versions: 1.6
         Environment: Windows 8
            Reporter: Shuai Liu
            Priority: Critical


We are having an encoding problem with Tika's AutoDetectReader. Our TSVParser 
wraps the input stream passed to its parse method in an AutoDetectReader and 
calls AutoDetectReader.readLine() to extract the string values. We expected 
AutoDetectReader to detect the correct encoding of the stream, but we are 
seeing several garbled characters in the converted files that our parser 
outputs. We traced the problem to UniversalEncodingDetector, which is called 
by AutoDetectReader: it returns UTF-8, so AutoDetectReader reads the stream as 
UTF-8, which is incorrect. The correct encoding is ISO-8859-1.

I am attaching a screenshot of what I am talking about. It shows a raw TSV 
file in which the byte with hex code E9 appears as a character between "M" and 
"xico". I believe it is an accented 'e' ('é', which is the single byte E9 in 
ISO-8859-1).
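
The effect can be reproduced with the JDK alone, without Tika. The following 
is a minimal sketch, assuming the relevant bytes in the file are M, E9, x, i, 
c, o as described above; the class name E9Demo and its method names are 
hypothetical and not part of Tika or of the attached test program.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: shows how the single byte 0xE9 decodes
// under ISO-8859-1 versus UTF-8.
class E9Demo {
    // Assumed sample: "México" as ISO-8859-1 stores it (0xE9 is 'é').
    static final byte[] SAMPLE = {'M', (byte) 0xE9, 'x', 'i', 'c', 'o'};

    // ISO-8859-1 maps every byte to the code point of the same value.
    static String asLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    // new String(..., UTF_8) silently replaces malformed input with U+FFFD.
    static String asUtf8Lenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // A strict decoder reports malformed input instead of replacing it.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ISO-8859-1: " + asLatin1(SAMPLE));
        System.out.println("UTF-8     : " + asUtf8Lenient(SAMPLE));
        System.out.println("valid UTF-8? " + isValidUtf8(SAMPLE));
    }
}
```

In UTF-8, E9 is the lead byte of a three-byte sequence, so an E9 followed by 
a plain ASCII 'x' is malformed and is replaced with U+FFFD. That would be 
consistent with the garbled characters we see.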

The problem is that AutoDetectReader decodes the characters with the wrong 
encoding. 
BTW, we were able to work around this problem with CharsetDetector, which for 
the moment seems to return a valid encoding that we can use to read the TSV 
file properly.

However, this means we cannot use AutoDetectReader; we had to create our own 
TSVAutoDetectReader that uses CharsetDetector in its detect method. The 
AutoDetectReader class is not flexible enough for us to extend: many of its 
methods are private, so we cannot manually set the encoding or override the 
existing encoding detection.
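
To make the request concrete, here is a rough sketch of the kind of reader we 
would like to be able to build: one where the caller can force the encoding, 
and where the detection step can be replaced. It uses only the JDK. The class 
FallbackCharsetReader and its methods are hypothetical, and the detect method 
is a simple stand-in (strict UTF-8 check, falling back to ISO-8859-1), not 
Tika's CharsetDetector or our actual TSVAutoDetectReader. It also buffers the 
whole stream in memory, which may not suit large files.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical reader: the charset is either detected or supplied by the caller.
class FallbackCharsetReader extends BufferedReader {
    private final Charset charset;

    private FallbackCharsetReader(byte[] data, Charset charset) {
        super(new InputStreamReader(new ByteArrayInputStream(data), charset));
        this.charset = charset;
    }

    // Detect the charset from the stream content.
    static FallbackCharsetReader open(InputStream in) throws IOException {
        byte[] data = readAll(in);
        return new FallbackCharsetReader(data, detect(data));
    }

    // Skip detection and use the charset the caller asks for.
    static FallbackCharsetReader open(InputStream in, Charset forced) throws IOException {
        return new FallbackCharsetReader(readAll(in), forced);
    }

    Charset getCharset() {
        return charset;
    }

    // Stand-in detector (an assumption, not Tika's logic): accept UTF-8 only
    // if the bytes decode without errors, otherwise fall back to ISO-8859-1.
    static Charset detect(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return StandardCharsets.ISO_8859_1;
        }
    }

    // Read the whole stream into memory so it can be inspected, then decoded.
    private static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}
```

With something like this, our parser could call 
FallbackCharsetReader.open(stream) in the normal case and 
FallbackCharsetReader.open(stream, StandardCharsets.ISO_8859_1) when it 
already knows the encoding.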

In addition, I am not fully confident in CharsetDetector either, as I am 
seeing CharsetDetector and AutoDetectReader produce different encodings for 
different TSV files. For now we can live with CharsetDetector, since it solves 
the current encoding problem.

Finally, I am attaching my test program (PFA: EncodingProblem.java). It reads 
a directory of TSV files and, for each file, displays the encodings produced 
by AutoDetectReader, UniversalEncodingDetector (which is called by 
AutoDetectReader) and CharsetDetector, so you can see that they produce 
different encodings for some TSV files.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
