RE: Correction: Garbled text when Apache Tika reads EUC or Shift-JIS encoded HTML

Allison, Timothy B. Wed, 14 Sep 2016 08:16:39 -0700

Ha, thank you for running google translate for me. :)

If the question is: "If I don't know the encoding before I send it to Tika, how 
does Tika determine the encoding?"


Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.

You can modify the service loading file to run the detectors in a different 
order or to specify your own encoding detector.
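
To make the "first non-null wins" rule and the effect of ordering concrete, 
here is a minimal plain-Java sketch. It does not use Tika: CharsetGuesser, 
DetectorChain and the two toy guessers are hypothetical names invented for 
illustration, and the fallback argument is an assumption of the sketch rather 
than a statement about Tika's own default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for an encoding detector; this is NOT Tika's interface. */
interface CharsetGuesser {
    /** Returns a guess, or null for "no opinion". */
    Charset guess(byte[] head);
}

class DetectorChain {
    private static final Pattern META_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_.:+-]*)",
            Pattern.CASE_INSENSITIVE);

    // Two orderings of the same toy guessers, to show that order matters.
    static final List<CharsetGuesser> META_FIRST =
            Arrays.asList(DetectorChain::fromMetaTag, DetectorChain::fromBom);
    static final List<CharsetGuesser> BOM_FIRST =
            Arrays.asList(DetectorChain::fromBom, DetectorChain::fromMetaTag);

    /** Toy guesser 1: trust a charset declared in the markup, if any. */
    static Charset fromMetaTag(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives intact.
        Matcher m = META_CHARSET.matcher(new String(head, StandardCharsets.ISO_8859_1));
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    /** Toy guesser 2: recognise a UTF-8 byte order mark. */
    static Charset fromBom(byte[] head) {
        boolean bom = head.length >= 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        return bom ? StandardCharsets.UTF_8 : null;
    }

    /** Runs the guessers in list order; the first non-null answer wins. */
    static Charset detect(List<CharsetGuesser> chain, byte[] head, Charset fallback) {
        for (CharsetGuesser g : chain) {
            Charset cs = g.guess(head);
            if (cs != null) {
                return cs;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // A contrived input that carries both a UTF-8 BOM and an EUC-JP declaration.
        byte[] html = "\u00EF\u00BB\u00BF<meta charset=\"EUC-JP\"><p>hi</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(detect(META_FIRST, html, StandardCharsets.UTF_8));
        System.out.println(detect(BOM_FIRST, html, StandardCharsets.UTF_8));
    }
}
```

The two calls in main see the same bytes but can return different charsets, 
because each chain stops at the first guesser that has an opinion.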

If the question is, "Why can't Tika get it right?"  Well, there are limits to 
statistical inference on only a few observations (a small number of bytes). :)
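
As a rough illustration of why a few bytes are not enough, the sketch below 
checks which candidate charsets can decode a short byte sequence without 
error. It uses only the JDK; AmbiguousBytes and its methods are hypothetical 
names, and it assumes the JDK in use ships the EUC-JP and Shift_JIS charsets. 
It is not how Tika, UniversalChardet or ICU4J actually score candidates.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

class AmbiguousBytes {
    /** True if the bytes decode under the named charset with no malformed or unmappable input. */
    static boolean decodesCleanly(byte[] data, String charsetName) {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns the candidates, in the order given, that decode the bytes cleanly. */
    static List<String> plausible(byte[] data, String... candidates) {
        List<String> result = new ArrayList<>();
        for (String name : candidates) {
            if (decodesCleanly(data, name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Four hiragana characters (written as Unicode escapes) encoded as EUC-JP. */
    static byte[] sample() {
        return "\u305F\u306E\u3057\u3044".getBytes(Charset.forName("EUC-JP"));
    }

    public static void main(String[] args) {
        // More than one charset accepts these bytes, so validity alone cannot pick a winner.
        System.out.println(plausible(sample(), "UTF-8", "EUC-JP", "Shift_JIS"));
    }
}
```

When several charsets accept the same bytes, a detector has to fall back on 
statistics about which characters are likely, and short inputs give it very 
little to count.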

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Wednesday, September 14, 2016 11:06 AM
To: [email protected]
Cc: Allison, Timothy B. <[email protected]>
Subject: Re: Correction: Garbled text when Apache Tika reads EUC or Shift-JIS encoded HTML

Thank you for your answer.

I cannot determine in advance whether a file's character encoding is EUC, 
Shift-JIS, UTF-8, and so on.
I would like a Java library, or Tika, to determine it for me.
I would like to know how that determination is made.


Technical beginner



> Again, relying on Google translate.
> 
> The problem with these files is that they don't self-identify their encoding 
> via an http-equiv meta header, and they contain very little content, so 
> Mozilla's UniversalChardet and ICU4J don't have enough to work with.  
> Internet Explorer, Chrome and Firefox all fail on these files, too.
> 
> If you know that a file is EUC_JP, you can send a hint via the metadata 
> before the call to parse:
> 
> 
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> parser.parse(new FileInputStream(document), handler, metadata,
>         new ParseContext());
> String plainText = handler.toString();
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]]
> Sent: Wednesday, September 14, 2016 7:37 AM
> To: [email protected]
> Subject: Correction: Garbled text when Apache Tika reads EUC or Shift-JIS encoded HTML
> 
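> The file that comes out garbled when read with Tika is the one attached to 
> this email.
> 
> Note: the file attached to my previous email seems to have had its 
>    character encoding changed when I saved it in Hidemaru Editor, so it 
>    does not come out garbled.
> 
> -----------------------------------------------------
> Hello.
> 
> I am stuck on a problem.
> 
> When Apache Tika reads HTML encoded in EUC or Shift-JIS, the text comes out 
> garbled.
> 
> What is the cause, and is there a workaround (a Tika setting, perhaps)?
> 
> * I am attaching an HTML file that comes out garbled when read.
>       Note: it is an EUC-encoded file (according to Hidemaru Editor's 
>       detection).
> 
> Source: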
> -----------------------------------------------------
> File document = new File("/usr/local/sample.pdf");
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> Metadata metadata = new Metadata();
> parser.parse(new FileInputStream(document), handler, metadata,
>         new ParseContext());
> String plainText = handler.toString();
> System.out.println(plainText);
> -----------------------------------------------------
> 
> 
> --
> Technical beginner
