Here is a picture of what I want to do.
Which part of the program do I need to change?

Files              Processing                  Result
PDF  -->
HTML -->   Tika does the analysis.   -->   String (Java UTF-8)
TXT  -->
 ^                                            ^
 ^                                            ^
EUC                                         UTF-8
Shift-JIS
GB2312
 :
> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata,
> > > >         new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------
In Japanese (translated):
I want to import various kinds of files with Tika.
However, depending on a file's character encoding, the text is imported garbled.
What should I do?
--
[email protected] <[email protected]>
> How can I get Tika to read HTML and PDF files that are encoded in EUC,
> Shift-JIS, or UTF-8?
> I also want the text converted to UTF-8 and stored in a String without
> garbling.
> Please tell me how to write the program.
>
> At the moment, the text is garbled when I parse these files with Tika.
>
> --
> [email protected] <[email protected]>
>
>
>
> > Sorry, can't tell what the question is?
> >
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]]
> > Sent: Wednesday, September 14, 2016 11:50 AM
> > To: Allison, Timothy B. <[email protected]>
> > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> >
> > Hi :)
> >
> > How should I use the following in a Tika program?
> >
> > ---------------------------------------------------------
> > Tika applies the following detectors in this order:
> >
> > org.apache.tika.parser.html.HtmlEncodingDetector
> > org.apache.tika.parser.txt.UniversalEncodingDetector
> > org.apache.tika.parser.txt.Icu4jEncodingDetector
> >
> > These are specified in
> > META-INF/services/org.apache.tika.detect.EncodingDetector
> >
> > Tika selects the first detector that returns a non-null value.
> > ---------------------------------------------------------
> >
> >
> > --
> > [email protected] <[email protected]>
> >
> >
> >
> > > Ha, thank you for running Google Translate for me. :)
> > >
> > > If the question is: "If I don't know the encoding before I send it to
> > > Tika, how does Tika determine the encoding?"
> > >
> > > Tika applies the following detectors in this order:
> > >
> > > org.apache.tika.parser.html.HtmlEncodingDetector
> > > org.apache.tika.parser.txt.UniversalEncodingDetector
> > > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > >
> > > These are specified in
> > > META-INF/services/org.apache.tika.detect.EncodingDetector
> > >
> > > Tika selects the first detector that returns a non-null value.
> > >
> > > You can modify the service loading file to run the detectors in a
> > > different order or to specify your own encoding detector.
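
The rule that the first detector returning a non-null value wins can be sketched in plain Java. This is only an illustration under stated assumptions: `CharsetGuesser`, `FirstNonNullChain`, and the two toy guessers are hypothetical names invented for this sketch, not Tika classes, and Tika's real detectors use far more sophisticated heuristics.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a detector: returns a charset, or null for "no opinion".
interface CharsetGuesser {
    Charset guess(byte[] head);
}

final class FirstNonNullChain {

    // Toy guesser 1: a UTF-8 byte order mark (EF BB BF) at the start of the data.
    static final CharsetGuesser BOM = head ->
            head.length >= 3
                    && (head[0] & 0xFF) == 0xEF
                    && (head[1] & 0xFF) == 0xBB
                    && (head[2] & 0xFF) == 0xBF
                    ? StandardCharsets.UTF_8 : null;

    // Toy guesser 2: the bytes contain non-ASCII data and decode as strict UTF-8.
    static final CharsetGuesser STRICT_UTF8 = head -> {
        boolean hasNonAscii = false;
        for (byte b : head) {
            if (b < 0) { hasNonAscii = true; break; }
        }
        if (!hasNonAscii) {
            return null; // pure ASCII is valid in many encodings, so no opinion
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(head));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8
        }
    };

    // The order of this list plays the role of the order in the service loading file.
    static final List<CharsetGuesser> DEFAULT_ORDER = Arrays.asList(BOM, STRICT_UTF8);

    // Runs the guessers in order and returns the first non-null answer, or null if none answers.
    static Charset detect(List<CharsetGuesser> chain, byte[] head) {
        for (CharsetGuesser guesser : chain) {
            Charset charset = guesser.guess(head);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u65E5\u672C\u8A9E".getBytes(StandardCharsets.UTF_8);
        byte[] notUtf8 = {(byte) 0xC6, (byte) 0xFC}; // one EUC-JP character, invalid as UTF-8
        System.out.println(detect(DEFAULT_ORDER, utf8));
        System.out.println(detect(DEFAULT_ORDER, notUtf8));
    }
}
```

Reordering the list, or adding another guesser to it, changes which answer is returned first. That is the effect of editing the service loading file.
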
> > >
> > > If the question is, "Why can't Tika get it right?": well, there are
> > > limits to statistical inference on only a few observations (a small
> > > number of bytes). :)
> > >
> > > -----Original Message-----
> > > From: [email protected] [mailto:[email protected]]
> > > Sent: Wednesday, September 14, 2016 11:06 AM
> > > To: [email protected]
> > > Cc: Allison, Timothy B. <[email protected]>
> > > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > >
> > > Thank you for your answer.
> > >
> > > I cannot tell in advance whether a file is encoded in EUC, Shift-JIS,
> > > UTF-8, or something else.
> > > I would like a Java library or Tika to determine it for me.
> > > I want to know how to have it determined.
> > >
> > >
> > > Technical beginner
> > >
> > >
> > >
> > > > Again, relying on Google Translate.
> > > >
> > > > The problem with these files is that they don't self-identify their
> > > > encoding via HTTP meta headers, and they contain very little content, so
> > > > Mozilla's UniversalChardet and ICU4J don't have enough to work with.
> > > > IE, Chrome and Firefox all fail on these files, too.
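
To illustrate why very short content is hard to classify, here is a small JDK-only sketch (no Tika involved). `AmbiguousBytes` and `strictDecode` are hypothetical names for this example, and it assumes the JVM ships the EUC-JP and Shift_JIS charsets, as standard JDK builds normally do. The same two bytes decode without error in both Japanese encodings, so a detector that sees nothing else has no basis for choosing between them.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

final class AmbiguousBytes {

    // Strictly decodes raw with the named charset; returns null if the bytes are invalid in it.
    static String strictDecode(byte[] raw, String charsetName) {
        try {
            return Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // 0xA4 0xA2 is hiragana "a" (U+3042) in EUC-JP, but two half-width
        // punctuation characters (U+FF64, U+FF62) in Shift_JIS.
        byte[] raw = {(byte) 0xA4, (byte) 0xA2};
        for (String name : new String[] {"EUC-JP", "Shift_JIS", "UTF-8"}) {
            String text = strictDecode(raw, name);
            System.out.println(name + " -> " + (text == null ? "invalid" : text));
        }
    }
}
```

With more text, byte patterns that are legal in only one encoding become more likely to appear, which is what statistical detectors rely on.
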
> > > >
> > > > If you know that a file is EUC_JP, you can send a hint via the metadata
> > > > before the call to parse:
> > > >
> > > >
> > > > Metadata metadata = new Metadata();
> > > > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > > > parser.parse(new FileInputStream(document), handler, metadata,
> > > >         new ParseContext());
> > > > String plainText = handler.toString();
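
To show what a charset hint of this form buys you, the following JDK-only sketch decodes the same bytes with and without one. It does not use Tika and makes no claim about how Tika processes the hint internally. `CharsetHintDemo` and its methods are hypothetical names, the sample bytes are the EUC-JP encoding of 日本語, and the JVM is assumed to support EUC-JP.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

final class CharsetHintDemo {

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=EUC_JP". Returns null if it is missing or unknown to this JVM.
    static Charset charsetFromContentType(String contentType) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return null; // illegal or unsupported charset name
                }
            }
        }
        return null;
    }

    // Decodes raw bytes with the hinted charset, falling back to UTF-8 when there is no usable hint.
    static String decode(byte[] raw, String contentType) {
        Charset charset = charsetFromContentType(contentType);
        return new String(raw, charset != null ? charset : StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] eucJp = {(byte) 0xC6, (byte) 0xFC, (byte) 0xCB, (byte) 0xDC, (byte) 0xB8, (byte) 0xEC};
        System.out.println(decode(eucJp, "text/html; charset=EUC_JP")); // decoded with the hint
        System.out.println(decode(eucJp, "text/html"));                 // no hint: garbled
    }
}
```

Once the bytes have been decoded with the right charset, the resulting Java String can be written out in any encoding, including UTF-8.
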
> > > >
> > > >
> > > > -----Original Message-----
> > > > From: [email protected]
> > > > [mailto:[email protected]]
> > > > Sent: Wednesday, September 14, 2016 7:37 AM
> > > > To: [email protected]
> > > > Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > > >
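> > > > The file that becomes garbled when read with Tika is the one attached
> > > > to this email.
> > > >
> > > > Note: the file attached to my previous email does not become garbled,
> > > > apparently because its character encoding changed when I saved it in
> > > > Hidemaru Editor.
> > > >
> > > > -----------------------------------------------------
> > > > Hello.
> > > >
> > > > I am having trouble.
> > > >
> > > > When Apache Tika reads HTML files encoded in EUC or Shift-JIS, the text
> > > > is garbled.
> > > >
> > > > What is the cause, and is there a workaround (a Tika setting, perhaps)?
> > > >
> > > > * I have attached an HTML file that becomes garbled when read.
> > > > Note: it is an EUC-encoded file (according to Hidemaru Editor's detection).
> > > >
> > > > Source: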
> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata,
> > > >         new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------
> > > >
> > > >
> > > > --
> > > > Technical beginner
> >
>