Here is a picture of what I want to do. Which part of the program do I need to change?

files         do                          result
PDF   -->
HTML  -->     Tika does the analysis -->  String (Java UTF-8)
TXT   -->
 ^      ^       ^           ^
EUC   UTF-8   Shift-JIS   GB2312
  :

> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------

I want to ingest various kinds of files with Tika.
However, depending on a file's character encoding, the text is ingested garbled.
What should I do?

--
question.answer...@gmail.com <question.answer...@gmail.com>

> Is there any way for Tika to read HTML and PDF files that are encoded in
> EUC, Shift-JIS, or UTF-8?
> I also want the text to end up in a String, converted to UTF-8 and not garbled.
> Please tell me how to write the program.
>
> When I parse these files with Tika, the text comes out garbled.
>
> --
> question.answer...@gmail.com <question.answer...@gmail.com>
>
> > Sorry, can't tell what the question is?
> >
> > -----Original Message-----
> > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 11:50 AM
> > To: Allison, Timothy B. <talli...@mitre.org>
> > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> >
> > Hi :)
> >
> > How should I use the following in my Tika program?
> >
> > ---------------------------------------------------------
> > Tika applies the following detectors in this order:
> >
> > org.apache.tika.parser.html.HtmlEncodingDetector
> > org.apache.tika.parser.txt.UniversalEncodingDetector
> > org.apache.tika.parser.txt.Icu4jEncodingDetector
> >
> > These are specified in
> > META-INF/services/org.apache.tika.detect.EncodingDetector
> >
> > Tika selects the first detector that returns a non-null value.
> > ---------------------------------------------------------
> >
> > --
> > question.answer...@gmail.com <question.answer...@gmail.com>
> >
> > > Ha, thank you for running google translate for me. :)
> > >
> > > If the question is: "If I don't know the encoding before I send it to
> > > Tika, how does Tika determine the encoding?"
> > >
> > > Tika applies the following detectors in this order:
> > >
> > > org.apache.tika.parser.html.HtmlEncodingDetector
> > > org.apache.tika.parser.txt.UniversalEncodingDetector
> > > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > >
> > > These are specified in
> > > META-INF/services/org.apache.tika.detect.EncodingDetector
> > >
> > > Tika selects the first detector that returns a non-null value.
> > >
> > > You can modify the service loading file to run the detectors in a
> > > different order or to specify your own encoding detector.
> > >
> > > If the question is, "Why can't Tika get it right?" Well, there are
> > > limits to statistical inference on only a few observations (a small
> > > number of bytes). :)
> > >
> > > -----Original Message-----
> > > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> > > Sent: Wednesday, September 14, 2016 11:06 AM
> > > To: user@tika.apache.org
> > > Cc: Allison, Timothy B. <talli...@mitre.org>
> > > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > >
> > > Thank you for your answer.
> > >
> > > I cannot tell in advance whether a file's character encoding is EUC,
> > > Shift-JIS, UTF-8, or something else.
> > > I would like a Java library, or Tika itself, to determine it for me.
> > > I would like to know how that determination is made.
> > >
> > > Technical beginner
> > >
> > > > Again, relying on Google translate.
> > > >
> > > > The problem with these files is that they don't self-identify their
> > > > encoding via HTTP meta headers, and they contain very little content, so
> > > > Mozilla's UniversalChardet and ICU4J don't have enough to work with.
> > > > IE, Chrome and Firefox all fail on these files, too.
> > > >
> > > > If you know that a file is EUC_JP, you can send a hint via the metadata
> > > > before the call to parse:
> > > >
> > > > Metadata metadata = new Metadata();
> > > > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > >
> > > > -----Original Message-----
> > > > From: question.answer...@gmail.com
> > > > [mailto:question.answer...@gmail.com]
> > > > Sent: Wednesday, September 14, 2016 7:37 AM
> > > > To: user@tika.apache.org
> > > > Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > > >
> > > > The file that comes out garbled when read with Tika is the one
> > > > attached to this email.
> > > >
> > > > Note: the file attached to my previous email seems to have had its
> > > > character encoding changed when I saved it in Hidemaru Editor, so it
> > > > does not come out garbled.
> > > >
> > > > -----------------------------------------------------
> > > > Hello.
> > > >
> > > > I am stuck.
> > > >
> > > > With Apache Tika, reading HTML encoded in EUC or Shift-JIS produces
> > > > garbled text.
> > > >
> > > > What is the cause, and is there a workaround (a Tika setting, perhaps)?
> > > >
> > > > ■ I am attaching the HTML file that comes out garbled when it is read.
> > > > Note: it is an EUC-encoded file (according to Hidemaru Editor's detection).
> > > >
> > > > Source:
> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------
> > > >
> > > > --
> > > > Technical beginner
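
The rule discussed in this thread (try the detectors in order, take the first non-null answer, and let a caller-supplied charset hint settle the matter when the file does not declare its own encoding) can be sketched with the JDK alone. This is not Tika code. The class name CharsetChainDemo and the methods fromHint, fromDeclaration, detect and decode are hypothetical names made up for illustration, the two detectors are much simpler than Tika's, the UTF-8 fallback is an assumption of this sketch, and the example assumes the JDK in use ships the EUC-JP charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; none of these names come from Tika.
class CharsetChainDemo {

    // Matches charset=NAME as written in an HTML meta tag or a Content-Type value.
    private static final Pattern DECLARED = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
            Pattern.CASE_INSENSITIVE);

    // Detector 1: uses a caller-supplied hint such as "text/html; charset=EUC-JP".
    // Returns null when no usable hint was given.
    static Function<byte[], Charset> fromHint(String contentType) {
        return data -> contentType == null ? null : parse(contentType);
    }

    // Detector 2: looks for a charset declaration in the first 1024 bytes.
    static Charset fromDeclaration(byte[] data) {
        int n = Math.min(data.length, 1024);
        // ISO-8859-1 maps every byte to a char, so this never fails on odd input.
        return parse(new String(data, 0, n, StandardCharsets.ISO_8859_1));
    }

    // Returns the declared charset if one is found and supported, else null.
    private static Charset parse(String text) {
        Matcher m = DECLARED.matcher(text);
        if (m.find() && Charset.isSupported(m.group(1))) {
            return Charset.forName(m.group(1));
        }
        return null;
    }

    // The first detector that returns a non-null charset wins.
    static Charset detect(byte[] data, List<Function<byte[], Charset>> detectors) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset cs = detector.apply(data);
            if (cs != null) {
                return cs;
            }
        }
        return StandardCharsets.UTF_8; // last-resort guess (assumption of this sketch)
    }

    // Decodes raw bytes into a Java String using the detector chain.
    static String decode(byte[] data, String hint) {
        Function<byte[], Charset> declared = CharsetChainDemo::fromDeclaration;
        List<Function<byte[], Charset>> chain = List.of(fromHint(hint), declared);
        return new String(data, detect(data, chain));
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        String body = "\u65e5\u672c\u8a9e"; // three kanji, written as escapes
        // An EUC-JP page that does not declare its encoding anywhere.
        byte[] page = ("<html><body>" + body + "</body></html>").getBytes(eucJp);

        // No hint and no declaration: the UTF-8 fallback garbles the text.
        System.out.println(decode(page, null).contains(body)); // false

        // With a hint the text survives and can be re-encoded as UTF-8.
        String text = decode(page, "text/html; charset=EUC-JP");
        System.out.println(text.contains(body)); // true
        System.out.println(body.getBytes(StandardCharsets.UTF_8).length); // 9
    }
}
```

Once the text is in a Java String it is no longer tied to the source encoding, so writing it out as UTF-8 is only a matter of calling getBytes(StandardCharsets.UTF_8) or using a UTF-8 writer. The hard part, as noted earlier in the thread, is choosing the right charset when the file gives no clue.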