[ https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison resolved TIKA-1437.
-------------------------------
    Resolution: Cannot Reproduce

Accents seem to work as expected with trunk. This may have been fixed since the original issue was opened. Please reopen if trunk isn't working for you.

> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Luke sh
>            Priority: Critical
>         Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv,
> e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader. We use
> AutoDetectReader to read a stream and extract the string values by calling
> AutoDetectReader.readLine(). The problem occurs in
> UniversalEncodingDetector, which AutoDetectReader calls when reading the
> input stream passed as one of the arguments to our TSVParser’s parse method.
>
> We use AutoDetectReader in our parser because we believed it could
> auto-detect the correct encoding of the input stream passed to it, but we
> are seeing several garbled characters in the converted output files from
> our parser. UniversalEncodingDetector returns UTF-8, so AutoDetectReader
> reads the stream as UTF-8, which is incorrect; the correct encoding is
> ISO-8859-1.
>
> I am attaching screenshots of the character differences between the input
> tsv file and the converted output file: e9.jpg and ef.jpg. Please read the
> description for details.
>
> The problem is that AutoDetectReader decodes and reads the characters with
> the incorrect encoding.
>
> BTW, we were able to work around this problem with CharsetDetector, which
> for the moment seems to produce a valid encoding that we can use to read
> the tsv file properly.
>
> However, this means we cannot use AutoDetectReader; we had to create our
> own TSVAutoDetectReader that uses CharsetDetector in its detect method.
> AutoDetectReader is hard for us to extend: many of its methods are private,
> so we cannot set the encoding manually or override the existing
> encoding-detection implementation.
>
> In addition, I am not confident about CharsetDetector either, as I am
> seeing CharsetDetector and AutoDetectReader produce different encodings
> for different tsv files. For now we can live with CharsetDetector, as it
> solves the current encoding problem.
>
> Finally, I am attaching my test program (EncodingProblem.java), which reads
> a directory of tsv files and displays the encodings produced for each file
> by AutoDetectReader, UniversalEncodingDetector (which AutoDetectReader
> calls) and CharsetDetector, so you can see the difference: they produce
> different encodings for some tsv files.
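
For anyone hitting the same symptom, the failure mode described in the report can be sketched with the Java standard library alone. This is not Tika code and not the reporter's TSVAutoDetectReader: the class name StrictUtf8Fallback and the sample bytes are hypothetical, and the choice of byte 0xE9 is only a guess based on the attachment name e9.jpg. The sketch shows that lenient UTF-8 decoding turns an ISO-8859-1 byte into the replacement character U+FFFD, and that a strict UTF-8 decoder can be used to fall back to ISO-8859-1 instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; it is not part of Tika.
class StrictUtf8Fallback {

    // Decode as UTF-8 only if every byte sequence is valid UTF-8;
    // otherwise assume ISO-8859-1 (an assumption, not a detection).
    static String decode(byte[] bytes) {
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a character, so this cannot fail.
            return new String(bytes, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // "caf" followed by 0xE9, which is 'e' with an acute accent in ISO-8859-1.
        byte[] latin1 = {'c', 'a', 'f', (byte) 0xE9};

        // Lenient decoding silently substitutes U+FFFD for the invalid byte.
        String lenient = new String(latin1, StandardCharsets.UTF_8);
        System.out.println("lenient UTF-8 contains U+FFFD: "
                + (lenient.indexOf('\uFFFD') >= 0));

        // Strict decoding rejects the input, so the fallback is used.
        System.out.println("fallback recovered the accent: "
                + decode(latin1).equals("caf\u00e9"));
    }
}
```

Falling back to ISO-8859-1 is itself an assumption: that decode never fails, so it cannot distinguish ISO-8859-1 from other single-byte encodings such as windows-1252. A statistical detector is still needed when several legacy encodings are possible.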