Sorry, I can't tell what the question is.

-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

Hi :)

How should I use the following in a Tika program?

---------------------------------------------------------
Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in
META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.
---------------------------------------------------------

--
question.answer...@gmail.com <question.answer...@gmail.com>

> Ha, thank you for running Google Translate for me. :)
>
> If the question is, "If I don't know the encoding before I send it to Tika,
> how does Tika determine the encoding?"
>
> Tika applies the following detectors in this order:
>
> org.apache.tika.parser.html.HtmlEncodingDetector
> org.apache.tika.parser.txt.UniversalEncodingDetector
> org.apache.tika.parser.txt.Icu4jEncodingDetector
>
> These are specified in
> META-INF/services/org.apache.tika.detect.EncodingDetector
>
> Tika selects the first detector that returns a non-null value.
>
> You can modify the service loading file to run the detectors in a different
> order or to specify your own encoding detector.
>
> If the question is, "Why can't Tika get it right?" Well, there are limits to
> statistical inference on only a few observations (a small number of bytes). :)
>
> -----Original Message-----
> From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> Sent: Wednesday, September 14, 2016 11:06 AM
> To: user@tika.apache.org
> Cc: Allison, Timothy B. <talli...@mitre.org>
> Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
>
> Thank you for your answer.
>
> I, character code of the file can not be determined EUC or Shift-JIS, UTF-8,
> etc. in advance.
> I, or JAVA library, I want you to determine to Tika.
> I want to know the determination method.
>
> I cannot tell in advance whether a file's character encoding is EUC,
> Shift-JIS, UTF-8, and so on.
> I would like a Java library, or Tika, to determine it for me.
> I would like to know how that determination is made.
>
>
> Tech beginner
>
> > Again, relying on Google Translate.
> >
> > The problem with these files is that they don't self-identify their
> > encoding via HTTP meta headers, and they contain very little content, so
> > Mozilla's UniversalChardet and ICU4J don't have enough to work with. IE,
> > Chrome and Firefox all fail on these files, too.
> >
> > If you know that a file is EUC_JP, you can send a hint via the metadata
> > before the call to parse:
> >
> > Metadata metadata = new Metadata();
> > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > parser.parse(new FileInputStream(document), handler, metadata,
> >         new ParseContext());
> > String plainText = handler.toString();
> >
> > -----Original Message-----
> > From: question.answer...@gmail.com
> > [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 7:37 AM
> > To: user@tika.apache.org
> > Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> >
> > The file that comes out garbled when read with Tika is the one attached
> > to this email.
> >
> > Note: the file attached to my previous email apparently had its encoding
> > changed when I saved it in Hidemaru Editor, so it does not come out garbled.
> >
> > ---------------------------------
> > Hello.
> >
> > I am stuck.
> >
> > With Apache Tika, text comes out garbled when reading HTML encoded in EUC
> > or Shift-JIS.
> >
> > What is the cause, and is there a workaround (a Tika setting, perhaps)?
> >
> > ■ I am attaching an HTML file that comes out garbled when read.
> > Note: it is an EUC-encoded file (according to Hidemaru Editor's detection).
> >
> > Source:
> > -----------------------------------------------------
> > File document = new File("/usr/local/sample.pdf");
> > Parser parser = new AutoDetectParser();
> > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > Metadata metadata = new Metadata();
> > parser.parse(new FileInputStream(document), handler, metadata,
> >         new ParseContext());
> > String plainText = handler.toString();
> > System.out.println(plainText);
> > -----------------------------------------------------
> >
> > --
> > Tech beginner
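
The rule quoted in this thread, "Tika selects the first detector that returns a non-null value", can be illustrated with a small sketch that uses only the JDK. This is not Tika's code: the `Detector` interface, `META_TAG`, `STRICT_UTF8` and `detectFirst` are hypothetical names made up for the illustration, and the two stand-in detectors are much simpler than Tika's (one scans for a declared charset, the other only checks whether the bytes are valid UTF-8, with no statistical guessing).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DetectorChainSketch {

    /** Hypothetical stand-in for a detector: returns null when it cannot decide. */
    interface Detector {
        Charset detect(byte[] data);
    }

    private static final Pattern CHARSET_DECL = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    /** Looks for a declared charset in the markup, e.g. a meta tag. */
    static final Detector META_TAG = data -> {
        // ISO-8859-1 maps every byte to one char, so ASCII markup can be scanned safely.
        int len = Math.min(data.length, 8192);
        String head = new String(data, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_DECL.matcher(head);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null; // no usable declaration: let the next detector try
    };

    /** Accepts UTF-8 only if every byte sequence decodes without error. */
    static final Detector STRICT_UTF8 = data -> {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null;
        }
    };

    /** Runs the detectors in order and returns the first non-null answer. */
    static Charset detectFirst(List<Detector> detectors, byte[] data) {
        for (Detector d : detectors) {
            Charset cs = d.detect(data);
            if (cs != null) {
                return cs;
            }
        }
        return null; // nobody could decide
    }

    public static void main(String[] args) {
        List<Detector> chain = List.of(META_TAG, STRICT_UTF8);
        byte[] declared = "<meta charset=\"UTF-8\"><p>hi</p>"
                .getBytes(StandardCharsets.US_ASCII);
        // Two bytes with no declaration that are not valid UTF-8.
        byte[] undeclared = {(byte) 0xA4, (byte) 0xB3};
        System.out.println(detectFirst(chain, declared));   // UTF-8
        System.out.println(detectFirst(chain, undeclared)); // null
    }
}
```

When the chain returns null, as for the second sample, there is nothing left but a guess or an explicit hint, which is the situation the Content-Type hint in the thread addresses for files known to be EUC_JP.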