[jira] [Comment Edited] (TIKA-1437) encoding issue in AutoDetectReader

Shuai Liu (JIRA) Mon, 06 Oct 2014 17:45:07 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161285#comment-14161285
 ]


Shuai Liu edited comment on TIKA-1437 at 10/7/14 12:43 AM:
-----------------------------------------------------------

Thanks Tim. I have embedded my responses below; I hope your "no sure" can be 
clarified.


[Tim]: No encoding detector will be perfect. 
[Luke]: True, that is why I am opening this ticket. As mentioned already, Tika 
provides two ways to auto-detect encoding (as far as I know for the moment): 
UniversalEncodingDetector and CharsetDetector. I need to understand which one 
is more accurate. For the moment CharsetDetector is giving the correct 
encoding, but UniversalEncodingDetector, which is called by AutoDetectReader, 
is not.

[Tim]: Are you sure that the encoding of the attached is not UTF-8? Internet 
Explorer "guesses" ISO-8859-1, which is clearly not right. When I tell IE to 
use UTF-8, the accented characters are correctly displayed.
[Luke]: I am sure about my test results, but I am not sure whether you ran the 
attached program and understood it.
One simple question I need clarified: please confirm whether you get the same 
result as mine after running the attached program with the problem tsv file. 
I am seeing two different encodings produced by the two encoding 
auto-detection implementations in Tika. Why are they giving two different 
encodings? In my case, one works and the other doesn't. 
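
One way to settle whether the file is valid UTF-8, independently of either 
detector, is a strict decode that fails on the first malformed byte sequence. 
The following is only a minimal sketch using the plain JDK; the class name 
StrictUtf8Check and the sample bytes are made up for illustration and are not 
taken from the attached EncodingProblem.java or from Tika.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or of the attached test program.
final class StrictUtf8Check {

    /** Returns true only if the bytes decode as UTF-8 with no malformed sequences. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)      // throw instead of substituting U+FFFD
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed sample text; the real input would be the bytes of the tsv file.
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1); // ends in the single byte 0xE9
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);        // ends in the bytes 0xC3 0xA9

        System.out.println(isValidUtf8(latin1)); // false
        System.out.println(isValidUtf8(utf8));   // true
    }
}
```

If the problem tsv passes such a check, UTF-8 is at least a possible answer; 
if it fails, UTF-8 can be ruled out. A pass does not prove the file is UTF-8, 
since pure ASCII content is valid under both encodings.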

Thanks a lot in advance for your help in looking into this issue. My question 
is simple: it would be great if we could just use AutoDetectReader without 
actually worrying about the encoding. I guess that is the intent of the 
AutoDetectReader class, as many methods inside it cannot be overridden and, 
e.g., I cannot add my own auto-detect algorithm to it.

Anyway, if possible, please kindly run the attached program with the problem 
tsv and let me know how it goes. 

Thanks a lot for your kind help; it will be appreciated.


> encoding issue in AutoDetectReader
> ----------------------------------
>
>                 Key: TIKA-1437
>                 URL: https://issues.apache.org/jira/browse/TIKA-1437
>             Project: Tika
>          Issue Type: Bug
>          Components: detector, parser
>    Affects Versions: 1.6
>         Environment: Windows 8
>            Reporter: Shuai Liu
>            Priority: Critical
>         Attachments: EncodingProblem.java, computrabajo-ar-20121108.tsv, 
> e9.jpg, ef.jpg
>
>
> We are having an encoding problem with Tika's AutoDetectReader.
> We are using AutoDetectReader to read a stream and extract the string values 
> by calling AutoDetectReader's readLine(). We find that the encoding problem 
> occurs in UniversalEncodingDetector, which is called by AutoDetectReader when 
> reading the input stream that is passed as one of the arguments to our 
> TSVParser’s parse method. 
> We are using AutoDetectReader in our parser and we believed it was able to 
> auto-detect the correct encoding from the input stream passed to it, but we 
> are seeing several garbled chars in the converted files output by our parser. 
> We found that the encoding problem occurs in the UniversalEncodingDetector, 
> which returns UTF-8, so AutoDetectReader reads the stream as UTF-8, which is 
> the incorrect encoding; the correct encoding is ISO-8859-1.
> I am attaching screenshots of the char differences we are seeing between the 
> input tsv file and the converted/output file. They are e9.jpg and ef.jpg; 
> please read the description for details.
> The problem is that AutoDetectReader is decoding and reading the chars with 
> an incorrect encoding. 
> BTW, we were able to work around this problem with CharsetDetector, which 
> seems to produce a valid encoding for the moment, one with which we can read 
> the tsv file properly.
> However, the problem is that we cannot use AutoDetectReader; we have to 
> create our own TSVAutoDetectReader that uses CharsetDetector in its detect 
> method. The AutoDetectReader class is not flexible enough for us to extend: 
> many of its methods are private, so we cannot manually set the encoding or 
> override the existing implementation for detecting the encoding.
> In addition, I am not confident about CharsetDetector either, as I am seeing 
> different encodings produced by CharsetDetector and AutoDetectReader for 
> different tsv files. But for now we might live with CharsetDetector, as it 
> solves the current encoding problem.
> Finally, I would also like to give you my test program (please find attached: 
> EncodingProblem.java), which reads an input tsv directory and displays, for 
> each tsv file in the directory, the encodings produced by AutoDetectReader, 
> UniversalEncodingDetector (which is called by AutoDetectReader) and 
> CharsetDetector, so you can see the difference: they produce different 
> encodings for some tsv files.
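
The symptom described in the report (ISO-8859-1 bytes read as UTF-8) can be 
reproduced with only the JDK. This is a rough sketch under stated assumptions: 
the class name DecodeMismatchDemo, the helper readFirstLine and the sample row 
are all made up for illustration and do not come from the attached program, 
the tsv file, or Tika.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Tika or of the attached test program.
final class DecodeMismatchDemo {

    /** Reads the first line of the given bytes, decoding them with the given charset. */
    static String readFirstLine(byte[] bytes, Charset charset) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), charset))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample row: an accented char stored as the single ISO-8859-1 byte 0xE9.
        byte[] row = "caf\u00e9\tBuenos Aires\n".getBytes(StandardCharsets.ISO_8859_1);

        String asUtf8 = readFirstLine(row, StandardCharsets.UTF_8);
        String asLatin1 = readFirstLine(row, StandardCharsets.ISO_8859_1);

        // The lone 0xE9 is malformed UTF-8, so the reader substitutes U+FFFD.
        System.out.println(asUtf8.contains("\uFFFD"));                  // true
        // Decoding with the charset the bytes were written in restores the text.
        System.out.println(asLatin1.equals("caf\u00e9\tBuenos Aires")); // true
    }
}
```

Reading with an explicitly chosen charset, as in the second call, is the same 
kind of workaround the report describes: the detection step is bypassed and the 
reader is told which encoding to use.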



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
