[jira] [Commented] (TIKA-3048) Tika unable to parse html files with non UTF-8 charset

Tim Allison (Jira) Thu, 20 Feb 2020 09:55:16 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041188#comment-17041188
 ]


Tim Allison commented on TIKA-3048:
-----------------------------------

By default, Tika currently trusts the character encoding declared in the HTML 
meta header over its own byte-level charset detection.  When you remove the 
incorrect charset meta header, charset detection works, and you get decent text.
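
To illustrate the failure mode (this is not Tika code), here is a minimal 
sketch using only the JDK.  It assumes the file's bytes are really UTF-8 while 
the meta header declares iso-8859-1; both choices are assumptions made for the 
demo, and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it does not use or mirror any Tika API.
final class MetaHeaderTrustDemo {

    // Decode the raw bytes with whichever charset the caller chose to trust.
    static String decode(byte[] raw, Charset trusted) {
        return new String(raw, trusted);
    }

    public static void main(String[] args) {
        // Two Chinese characters, written as escapes so the source file's own
        // encoding cannot interfere with the demo.
        String original = "\u4e2d\u6587";

        // Assumption: the bytes on disk are UTF-8.
        byte[] raw = original.getBytes(StandardCharsets.UTF_8);

        // Assumption: the meta header (wrongly) declares iso-8859-1.
        Charset declared = StandardCharsets.ISO_8859_1;

        System.out.println("trusting the meta header: " + decode(raw, declared));
        System.out.println("using the real encoding:  "
                + decode(raw, StandardCharsets.UTF_8));
    }
}
```

Decoding with the declared charset never throws here, because every byte is a 
valid iso-8859-1 character, so the wrong choice silently produces junk rather 
than an error.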

You can turn off the reliance on the HTML meta header by specifying which 
charset detectors to run (I will give a link/pointer in the next comment).  Your 
mileage will vary, and you may want to run each charset detector and then use 
tika-eval stats to tell you which one works best.
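
As a very crude, JDK-only stand-in for "try each candidate and compare" (it is 
not one of Tika's charset detectors and does not compute tika-eval stats), the 
sketch below strictly decodes the bytes with each candidate charset and keeps 
the first one that decodes without errors.  The class name, method names, and 
candidate list are all hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; a stand-in for real detectors, not a Tika API.
final class StrictDecodeProbe {

    // True if the bytes decode under cs with no malformed or unmappable input.
    static boolean decodesCleanly(byte[] raw, Charset cs) {
        try {
            cs.newDecoder()
              .onMalformedInput(CodingErrorAction.REPORT)
              .onUnmappableCharacter(CodingErrorAction.REPORT)
              .decode(ByteBuffer.wrap(raw));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Returns the first candidate that decodes cleanly, or null if none does.
    static Charset firstCleanCandidate(byte[] raw, List<Charset> candidates) {
        for (Charset cs : candidates) {
            if (decodesCleanly(raw, cs)) {
                return cs;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Order matters: iso-8859-1 accepts every byte sequence, so it goes last.
        List<Charset> candidates =
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        // "se\u00f1al" encoded as iso-8859-1 is not valid UTF-8.
        byte[] latin1Bytes = "se\u00f1al".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstCleanCandidate(latin1Bytes, candidates));
    }
}
```

A clean decode only shows that the bytes are legal in that charset, not that 
the resulting text is sensible, which is why a text-quality comparison such as 
tika-eval is the better judge.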

> Tika unable to parse html files with non UTF-8 charset
> ------------------------------------------------------
>
>                 Key: TIKA-3048
>                 URL: https://issues.apache.org/jira/browse/TIKA-3048
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.9
>            Reporter: Akash
>            Priority: Major
>         Attachments: ChineseFile.html
>
>
> Tika returns junk characters when parsing Chinese characters inside an HTML 
> file. The HTML file explicitly declares its charset as GB2312.
> <head><meta http-equiv=Content-Type content="text/html; charset=gb2312"><meta 
> name=Generator content="Microsoft Word 15 (filtered medium)">
>  
> If we remove this charset from the html meta tag, then parsing works fine.
>  
> A similar issue is observed for Arabic, Russian, Korean, Japanese, Hungarian 
> and Spanish.
> Charset declared for each language:
> Hungarian - iso-8859-1
> Chinese - gb2312
> Spanish - iso-8859-1
> Russian - koi8-r
> Korean - ks_c_5601-1987
> Japanese - iso-2022-jp
> Arabic - windows-1256



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
