[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Shuai Liu updated TIKA-1437:
----------------------------
    Attachment: EncodingProblem.java

A test program that reads a set of TSV files from a directory and prints the
encoding produced by each of the following Tika encoding auto-detection
implementations:
1) AutoDetectReader
2) UniversalEncodingDetector (produces the same result as 1, because
   AutoDetectReader calls UniversalEncodingDetector to get the encoding it
   uses to read streams)
3) CharsetDetector (we are using this detector to work around the current
   encoding issue)

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Shuai Liu
>            Priority: Critical
>         Attachments: EncodingProblem.java
>
>
> We are having an encoding problem with Tika's AutoDetectReader. We use
> AutoDetectReader to read a stream and extract string values by calling
> AutoDetectReader.readLine(). The problem occurs in
> UniversalEncodingDetector, which AutoDetectReader calls when reading the
> input stream passed as one of the arguments to our TSVParser’s parse method.
> We use AutoDetectReader in our parser because we believed it could
> auto-detect the correct encoding of the input stream passed to it, but we
> are seeing several garbled characters in the converted files our parser
> outputs. We found that UniversalEncodingDetector returns UTF-8, so
> AutoDetectReader reads the stream as UTF-8, which is incorrect; the correct
> encoding is ISO-8859-1.
> I am attaching a screenshot of what I am talking about. The following is a
> raw TSV file; you can see that the hex code E9 appears as a character
> between M and xico. I believe it is an ‘e’ in a different
> encoding/language.
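To make the mismatch described above concrete, here is a minimal sketch that uses only the JDK and no Tika classes. The class name, the helper methods, and the sample bytes are assumptions made for illustration and are not taken from the attached files. It decodes the byte sequence M, 0xE9, x, i, c, o once as ISO-8859-1 and once as UTF-8, and also checks whether the bytes are well-formed in each charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class E9DecodingDemo {
    // Hypothetical sample (not from the attached TSV files): the bytes of
    // "México" in ISO-8859-1, where the single byte 0xE9 encodes 'é'.
    static final byte[] SAMPLE = {'M', (byte) 0xE9, 'x', 'i', 'c', 'o'};

    // Lenient decoding: malformed input is replaced with U+FFFD, not rejected.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    // Strict check: true only if the bytes are well-formed in the charset.
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                   .onMalformedInput(CodingErrorAction.REPORT)
                   .onUnmappableCharacter(CodingErrorAction.REPORT)
                   .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (Charset cs : new Charset[] {StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8}) {
            System.out.println(cs.name()
                    + " decoded=" + decode(SAMPLE, cs)
                    + " wellFormed=" + decodesCleanly(SAMPLE, cs));
        }
    }
}
```

In UTF-8, 0xE9 is the lead byte of a three-byte sequence, so a following ASCII byte such as 'x' makes the input malformed; a lenient decoder substitutes the replacement character U+FFFD, which is one way a garbled character of the kind described can arise. This does not show why the detector chose UTF-8 for the reporter's files.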
> The problem is that AutoDetectReader decodes and reads the characters with
> the incorrect encoding.
> BTW, we were able to work around this problem with CharsetDetector, which
> for the moment seems to produce a valid encoding that we can use to read
> the TSV file properly.
> However, this means we cannot use AutoDetectReader; we have to create our
> own TSVAutoDetectReader that uses CharsetDetector in its detect method.
> The AutoDetectReader class is hard for us to extend: many of its methods
> are private, so we cannot manually set the encoding or override the
> existing encoding detection implementation.
> In addition, I am not confident about CharsetDetector either, as I am
> seeing different encodings produced by CharsetDetector and AutoDetectReader
> for different TSV files. For now we can live with CharsetDetector, since it
> solves the current encoding problem.
> Finally, I would like to share my test program (PFA:
> EncodingProblem.java). It reads an input TSV directory and displays, for
> each TSV file in the directory, the encodings produced by AutoDetectReader,
> UniversalEncodingDetector (which is called by AutoDetectReader) and
> CharsetDetector, so you can see the difference: they produce different
> encodings for some TSV files.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)