[jira] [Updated] (TIKA-1437) encoding issue in AutoDetectReader

Shuai Liu (JIRA) Sun, 05 Oct 2014 13:25:31 -0700

     [ 
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Shuai Liu updated TIKA-1437:
----------------------------
    Attachment: EncodingProblem.java

Encoding that reads a bunch of tsv files from a directory, and print out the 
encoding being produced by following tika encoding auto detection impl.
1) AutoDetectReader
2) UniversalEncodingDetector (producing same same as 1 as AutoDetectReader is 
calling UniversalEncodingDetector to get the encoding to read streams).
3) CharsetDetector (we are usng this detector to work around the current 
encoding issue)

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Shuai Liu
>            Priority: Critical
>         Attachments: EncodingProblem.java
>
>
> We are having an encoding problem with Tika AutoDetectReader;
> we are using AutoDetectReader to read an stream to extract the string values 
> by calling readLine()::AutoDetectReader. We find that the Encoding problem is 
> happening in UniversalEncodingDetector being called by AutoDetectReader when 
> reading the input stream being passed as one of the arguments in our 
> TSVParser’s parse method. 
> We are using AutoDetectReader in our parser and we believed it was able auto 
> detect an correct encoding from the input stream being passed to it, but we 
> are seeing several garbled chars bubbling up in our outputted and converted 
> files from our parser; we find out that the encoding problem is happening in 
> the UniversalEncodingDetector, which returns an UTF-8 and AutoDetectReader is 
> reading the stream with UTF-8 which is incorrect encoding; and the correct 
> encoding is ISO-8859-1.
> I am attaching the screenshot of what I am talking about, the following is a 
> raw tsv file; you can see the hex code E9 is presented as a char between M 
> and xico, I believe it is a ‘e’ but in different encoding/language.
> The problem is that the AutoDetectReader is decoding and reading the chars 
> with incorrect encoding. 
> BTW, We were able to work around this problem with CharsetDetector, which 
> seems to generate a valid encoding for the moment with which we can use to 
> read the tsv file properly.
> However, the problem is we cannot use AutoDetectReader, we have to create our 
> own TSVAutoDetectReader incorporated with CharsetDetector in the detect 
> method; AutoDetectReader class seems to be less flexible for us to extend its 
> functions, many of its methods are restricted with private constraints, we 
> cannot manually set encoding or override the existing implementation for 
> detecting encoding.
> In addition, I am also not confident about CharsetDetector either; as I am 
> seeing different encodings produced by CharsetDetector and AutoDetectReader 
> for different tsv files; But for now, we might live with CharsetDetector, as 
> CharsetDetector is solving the current encoding problem.
> Finally, I would like to please give you my test program (PFA: 
> EncodingProblem.java) that reads an inputted tsv directory and displays a 
> list of encodings for each of the tsv files in the directory produced by 
> AutoDetectReader, UniversalEncodingDetector(which is being called by 
> AutoDetectReader) and CharsetDetector; so you could probably see the 
> difference, they are producing different encodings for some tsv files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (TIKA-1437) encoding issue in AutoDetectReader

Reply via email to