Re: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika

question.answer...@gmail.com Wed, 14 Sep 2016 08:07:44 -0700

Thank you for your answer.

I cannot tell in advance whether a file's character encoding is EUC,
Shift-JIS, UTF-8, or something else.
I would like a Java library or Tika to determine the encoding for me.
I would like to know how that detection can be done.
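
One rough way to approach this with only the JDK is to try each candidate encoding with a strict decoder and keep the first one that accepts every byte. The sketch below is an illustration under that assumption, not how Tika, ICU4J, or UniversalChardet detect encodings; the class name `CharsetGuess` and the method `guess` are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical helper, not part of Tika: returns the first candidate charset
// that decodes the bytes without any malformed or unmappable sequence.
class CharsetGuess {

    static Optional<Charset> guess(byte[] data, String... candidateNames) {
        for (String name : candidateNames) {
            if (!Charset.isSupported(name)) {
                continue; // this JVM does not provide the charset; skip it
            }
            Charset candidate = Charset.forName(name);
            try {
                // REPORT makes the decoder throw instead of silently
                // substituting replacement characters for bad bytes.
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // "konnichiwa" in hiragana, written with Unicode escapes so the
        // source file compiles the same way under any source encoding.
        byte[] utf8 = "\u3053\u3093\u306b\u3061\u306f"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(guess(utf8, "UTF-8", "EUC-JP", "Shift_JIS"));
    }
}
```

The order of the candidates matters: ASCII-only input is valid in every candidate, and a short file may decode cleanly in more than one encoding, so the result is only a guess. When the guess is trusted, its name could be passed to Tika as the charset hint in the content type.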


A technical beginner



> Again, relying on Google translate.
> 
> The problem with these files is that they don't self identify their encoding 
> via http metaheaders, and they contain very little content so Mozilla's 
> UniversalChardet and ICU4J don't have enough to work with.  IE, Chrome and 
> Firefox all fail on these files, too.
> 
> If you know that a file is EUC_JP, you can send a hint via the metadata 
> before the call to parse:
> 
> 
> Metadata metadata = new Metadata();
> metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> parser.parse(new FileInputStream(document), handler, metadata,
>         new ParseContext());
> String plainText = handler.toString();
> 
> 
> -----Original Message-----
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> Sent: Wednesday, September 14, 2016 7:37 AM
> To: user@tika.apache.org
> Subject: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
> 
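> The file that comes out garbled when read with Tika is the one attached
> to this email.
> 
> Note: the file attached to my previous email apparently had its character
>    encoding changed when I saved it in Hidemaru Editor, so it does not
>    come out garbled.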
> 
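> -----------------------------------------------------
> Hello.
> 
> I am stuck on a problem.
> 
> With Apache Tika, the text comes out garbled when I read HTML encoded in
> EUC or Shift-JIS.
> 
> What is the cause, and is there a fix (a Tika setting, perhaps)?
> 
> * I am attaching an HTML file that comes out garbled when read.
>       Note: it is an EUC-encoded file (according to Hidemaru Editor's
>       detection).
> 
> Source: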
> -----------------------------------------------------
> File document = new File("/usr/local/sample.pdf");
> Parser parser = new AutoDetectParser();
> ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> Metadata metadata = new Metadata();
> parser.parse(new FileInputStream(document), handler, metadata,
>         new ParseContext());
> String plainText = handler.toString();
> System.out.println(plainText);
> -----------------------------------------------------
> 
> 
> -- 
> A technical beginner
