RE: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

Allison, Timothy B. Wed, 14 Sep 2016 08:53:29 -0700

Sorry, can't tell what the question is?

-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika


Hi :)

How should I use the following in my Tika program?

---------------------------------------------------------
Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.
---------------------------------------------------------
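
A minimal sketch of the selection rule quoted above ("the first detector that returns a non-null value"), written against the JDK only so it runs without Tika on the classpath. The names SimpleDetector, detectFromMeta, detectFromBom and detectFirstNonNull are hypothetical and are not Tika's API; Tika's real detectors implement org.apache.tika.detect.EncodingDetector and are considerably more involved than these toy versions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class FirstNonNullDetection {

    /** Hypothetical stand-in for a detector (not Tika's interface). */
    interface SimpleDetector {
        /** Returns a charset, or null when this detector cannot decide. */
        Charset detect(byte[] bytes);
    }

    /** Toy detector: honours a charset=... declaration near the start of the bytes. */
    static Charset detectFromMeta(byte[] bytes) {
        int len = Math.min(bytes.length, 1024);
        // ISO-8859-1 maps every byte to a char, so this scan never fails to decode.
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        int i = head.indexOf("charset=");
        if (i < 0) {
            return null;
        }
        int start = i + "charset=".length();
        while (start < head.length()
                && (head.charAt(start) == '"' || head.charAt(start) == '\'')) {
            start++; // skip an optional opening quote
        }
        int end = start;
        while (end < head.length() && isNameChar(head.charAt(end))) {
            end++;
        }
        try {
            return Charset.forName(head.substring(start, end));
        } catch (IllegalArgumentException e) {
            return null; // empty, malformed or unsupported name: cannot decide
        }
    }

    static boolean isNameChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    /** Toy detector: recognises only a UTF-8 byte order mark. */
    static Charset detectFromBom(byte[] b) {
        boolean bom = b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF;
        return bom ? StandardCharsets.UTF_8 : null;
    }

    /** Asks each detector in order and returns the first non-null answer. */
    static Charset detectFirstNonNull(List<SimpleDetector> chain, byte[] bytes, Charset fallback) {
        for (SimpleDetector detector : chain) {
            Charset charset = detector.detect(bytes);
            if (charset != null) {
                return charset;
            }
        }
        return fallback; // no detector could decide
    }

    public static void main(String[] args) {
        List<SimpleDetector> chain = new ArrayList<>();
        chain.add(FirstNonNullDetection::detectFromMeta);
        chain.add(FirstNonNullDetection::detectFromBom);

        byte[] declared = "<meta charset=\"utf-8\"><p>hi</p>".getBytes(StandardCharsets.US_ASCII);
        byte[] undeclared = "<p>hi</p>".getBytes(StandardCharsets.US_ASCII);

        System.out.println(detectFirstNonNull(chain, declared, StandardCharsets.ISO_8859_1));
        System.out.println(detectFirstNonNull(chain, undeclared, StandardCharsets.ISO_8859_1));
    }
}
```

With the two sample inputs in main, the first is resolved by the declaration-based detector, while the second falls through to the fallback because neither toy detector can decide. That second case is the situation described in this thread: a file with no declared encoding leaves the later detectors to guess.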


-- 
question.answer...@gmail.com <question.answer...@gmail.com>



> Ha, thank you for running google translate for me. :)
> 
> If the question is: "If I don't know the encoding before I send it to Tika, 
> how does Tika determine the encoding?"
> 
> Tika applies the following detectors in this order:
> 
> org.apache.tika.parser.html.HtmlEncodingDetector
> org.apache.tika.parser.txt.UniversalEncodingDetector
> org.apache.tika.parser.txt.Icu4jEncodingDetector
> 
> These are specified in 
> META-INF/services/org.apache.tika.detect.EncodingDetector
> 
> Tika selects the first detector that returns a non-null value.
> 
> You can modify the service loading file to run the detectors in a different 
> order or to specify your own encoding detector.
> 
> If the question is, "Why can't Tika get it right?"  Well, there are limits to 
> statistical inference on only a few observations (a small number of bytes). :)
> 
> -----Original Message-----
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> Sent: Wednesday, September 14, 2016 11:06 AM
> To: user@tika.apache.org
> Cc: Allison, Timothy B. <talli...@mitre.org>
> Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> 
> Thank you for your answer.
> 
> I cannot determine in advance whether a file's character encoding is EUC, 
> Shift-JIS, UTF-8, etc.
> I would like a Java library or Tika to determine it for me.
> I want to know how that determination is made.
> 
> 
> Technical beginner
> 
> 
> 
> > Again, relying on Google translate.
> > 
> > The problem with these files is that they don't self-identify their 
> > encoding via HTTP meta headers, and they contain very little content so 
> > Mozilla's UniversalChardet and ICU4J don't have enough to work with.  IE, 
> > Chrome and Firefox all fail on these files, too.
> > 
> > If you know that a file is EUC_JP, you can send a hint via the metadata 
> > before the call to parse:
> > 
> > 
> > Metadata metadata = new Metadata();
> > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > parser.parse(new FileInputStream(document), handler, metadata,
> >         new ParseContext());
> > String plainText = handler.toString();
> > 
> > 
> > -----Original Message-----
> > From: question.answer...@gmail.com 
> > [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 7:37 AM
> > To: user@tika.apache.org
> > Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > 
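> > The file that comes out garbled when read with Tika is the one attached 
> > to this email.
> > 
> > * The file attached to my previous email apparently had its character 
> >   encoding changed when I saved it in Hidemaru Editor, so it does not 
> >   come out garbled.
> > 
> > ---------------------------------------------------------
> > Hello.
> > 
> > I am stuck on a problem.
> > 
> > With Apache Tika, reading HTML encoded in EUC or Shift-JIS produces 
> > garbled text.
> > 
> > What is the cause, and is there a workaround (a Tika setting, for example)?
> > 
> > * I am attaching an HTML file that comes out garbled when read.
> >       * It is an EUC-encoded file (according to Hidemaru Editor's detection).
> > 
> > Source: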
> > -----------------------------------------------------
> > File document = new File("/usr/local/sample.pdf");
> > Parser parser = new AutoDetectParser();
> > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > Metadata metadata = new Metadata();
> > parser.parse(new FileInputStream(document), handler, metadata,
> >         new ParseContext());
> > String plainText = handler.toString();
> > System.out.println(plainText);
> > -----------------------------------------------------
> > 
> > 
> > --
> > Technical beginner
