Re: I want to parse Then garbled in Tika. Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

question.answer...@gmail.com Wed, 14 Sep 2016 18:38:34 -0700

Here is a picture of what I want to do.
Where in the program do I need to make changes?


files               processing                        result

PDF     -->
HTML   -->       Tika parses the file  -->  String (Java UTF-8)
TXT     -->
  ^                                                        ^
  ^                                                        ^
EUC                                                   UTF-8
Shift-JIS
GB2312
   :


> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------


In Japanese (translated):
I want to import various kinds of files with Tika.
However, depending on a file's character encoding, the text is imported garbled.
What should I do?
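
For reference, here is a minimal sketch of the conversion in the picture above, using only the JDK and no Tika. It assumes the input encoding is already known, which is exactly the part Tika has to guess. The class name DecodeThenReencode and its decode method are made up for illustration. Note that a Java String holds Unicode text internally (UTF-16); UTF-8 only comes into play when the String is written back out as bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of Tika.
final class DecodeThenReencode {

    /** Decodes raw bytes using the named charset. */
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "konnichiwa" in hiragana, written with escapes so the source file's
        // own encoding does not matter.
        String original = "\u3053\u3093\u306B\u3061\u306F";

        // Simulate the contents of a file that was saved as EUC-JP.
        byte[] eucBytes = original.getBytes(Charset.forName("EUC-JP"));

        // Decoding with the matching charset recovers the text.
        String right = decode(eucBytes, "EUC-JP");
        // Decoding the same bytes as Shift_JIS yields different characters.
        String wrong = decode(eucBytes, "Shift_JIS");

        System.out.println("matching charset recovers text: " + right.equals(original));
        System.out.println("wrong charset recovers text:    " + wrong.equals(original));

        // UTF-8 is chosen only at output time, when the String becomes bytes again.
        byte[] utf8 = right.getBytes(StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte count: " + utf8.length);
    }
}
```

The garbling in this sketch comes entirely from decoding with the wrong charset; once the bytes are decoded correctly, writing UTF-8 is the easy part.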



-- 
question.answer...@gmail.com <question.answer...@gmail.com>


> Is there any way for Tika to read HTML and PDF files encoded in EUC,
> Shift-JIS, or UTF-8?
> I also want the text converted to UTF-8 and stored in a String without garbling.
> Please tell me how to write the program.
> 
> When I parse these files with Tika, the text comes out garbled.
> 
> -- 
> question.answer...@gmail.com <question.answer...@gmail.com>
> 
> 
> 
> > Sorry, I can't tell what the question is.
> > 
> > -----Original Message-----
> > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> > Sent: Wednesday, September 14, 2016 11:50 AM
> > To: Allison, Timothy B. <talli...@mitre.org>
> > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > 
> > Hi :)
> > 
> > How should I use the following in my Tika program?
> > 
> > ---------------------------------------------------------
> > Tika applies the following detectors in this order:
> > 
> > org.apache.tika.parser.html.HtmlEncodingDetector
> > org.apache.tika.parser.txt.UniversalEncodingDetector
> > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > 
> > These are specified in 
> > META-INF/services/org.apache.tika.detect.EncodingDetector
> > 
> > Tika selects the first detector that returns a non-null value.
> > ---------------------------------------------------------
> > 
> > 
> > -- 
> > question.answer...@gmail.com <question.answer...@gmail.com>
> > 
> > 
> > 
> > > Ha, thank you for running google translate for me. :)
> > > 
> > > If the question is: "If I don't know the encoding before I send it to 
> > > Tika, how does Tika determine the encoding?"
> > > 
> > > Tika applies the following detectors in this order:
> > > 
> > > org.apache.tika.parser.html.HtmlEncodingDetector
> > > org.apache.tika.parser.txt.UniversalEncodingDetector
> > > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > > 
> > > These are specified in 
> > > META-INF/services/org.apache.tika.detect.EncodingDetector
> > > 
> > > Tika selects the first detector that returns a non-null value.
> > > 
> > > You can modify the service loading file to run the detectors in a 
> > > different order or to specify your own encoding detector.
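
The "first detector that returns a non-null value wins" rule quoted above can be sketched in plain Java. This is only an illustration of the ordering rule, not Tika's code: the three detectors below are deliberately simple, hypothetical stand-ins, and the class name FirstNonNullDetection is invented.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of an ordered detector chain; not Tika's implementation.
final class FirstNonNullDetection {

    /** Returns the answer of the first detector that does not return null, or null if all decline. */
    static Charset detect(List<Function<byte[], Charset>> detectors, byte[] input) {
        for (Function<byte[], Charset> detector : detectors) {
            Charset guess = detector.apply(input);
            if (guess != null) {
                return guess; // later detectors are never consulted
            }
        }
        return null;
    }

    /** Stand-in detector: answers only when the input starts with a UTF-8 byte order mark. */
    static Charset bomDetector(byte[] b) {
        boolean hasBom = b.length >= 3
                && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF;
        return hasBom ? StandardCharsets.UTF_8 : null;
    }

    /** Stand-in detector: answers US-ASCII only when every byte is 7-bit. */
    static Charset asciiDetector(byte[] b) {
        for (byte x : b) {
            if (x < 0) {
                return null; // high bit set, so this detector declines
            }
        }
        return StandardCharsets.US_ASCII;
    }

    /** Stand-in detector: always answers, so it only makes sense at the end of the chain. */
    static Charset fallbackDetector(byte[] b) {
        return StandardCharsets.ISO_8859_1;
    }

    /** The order of this list plays the role of the order in the service loading file. */
    static List<Function<byte[], Charset>> defaultChain() {
        return List.of(
                FirstNonNullDetection::bomDetector,
                FirstNonNullDetection::asciiDetector,
                FirstNonNullDetection::fallbackDetector);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x61};
        byte[] plainAscii = {0x61, 0x62, 0x63};
        byte[] highBytes = {(byte) 0xA4, (byte) 0xA2};

        System.out.println(detect(defaultChain(), withBom));
        System.out.println(detect(defaultChain(), plainAscii));
        System.out.println(detect(defaultChain(), highBytes));
    }
}
```

Because the chain stops at the first answer, moving a detector earlier in the list changes which guess is used even when a later detector would have answered differently.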
> > > 
> > > If the question is, "Why can't Tika get it right?"  Well, there are 
> > > limits to statistical inference on only a few observations (a small 
> > > number of bytes). :)
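
To make the point about few observations concrete, here is a small JDK-only sketch showing that one short byte sequence can be perfectly valid in more than one Japanese encoding, so validity alone cannot settle which one was meant. The class name AmbiguousBytes and the helper strictDecode are hypothetical, and the sketch assumes the JDK's extended charsets (EUC-JP, Shift_JIS) are installed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

// Hypothetical illustration; not how Tika or ICU4J score candidate encodings.
final class AmbiguousBytes {

    /** Decodes strictly; returns null if the bytes are not valid in the named charset. */
    static String strictDecode(byte[] raw, String charsetName) {
        try {
            return Charset.forName(charsetName)
                    .newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // Two bytes: in EUC-JP they form one hiragana character (U+3042);
        // in Shift_JIS each byte is a half-width punctuation character.
        byte[] raw = {(byte) 0xA4, (byte) 0xA2};

        for (String name : new String[] {"EUC-JP", "Shift_JIS", "UTF-8"}) {
            String decoded = strictDecode(raw, name);
            System.out.println(name + ": "
                    + (decoded == null ? "invalid" : "valid, " + decoded.length() + " char(s)"));
        }
    }
}
```

With more text, a statistical detector can weigh which reading looks like real language, but with only a handful of bytes both readings stay plausible.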
> > > 
> > > -----Original Message-----
> > > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
> > > Sent: Wednesday, September 14, 2016 11:06 AM
> > > To: user@tika.apache.org
> > > Cc: Allison, Timothy B. <talli...@mitre.org>
> > > Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > > 
> > > Thank you for your answer.
> > > 
> > > I cannot tell in advance whether a file's character encoding is EUC, 
> > > Shift-JIS, UTF-8, or something else.
> > > I would like a Java library or Tika to determine it.
> > > I would like to know how that determination is made.
> > > 
> > > 
> > > Technical beginner
> > > 
> > > 
> > > 
> > > > Again, relying on Google translate.
> > > > 
> > > > The problem with these files is that they don't self identify their 
> > > > encoding via http metaheaders, and they contain very little content so 
> > > > Mozilla's UniversalChardet and ICU4J don't have enough to work with.  
> > > > IE, Chrome and Firefox all fail on these files, too.
> > > > 
> > > > If you know that a file is EUC_JP, you can send a hint via the metadata 
> > > > before the call to parse:
> > > > 
> > > > 
> > > > Metadata metadata = new Metadata();
> > > > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: question.answer...@gmail.com 
> > > > [mailto:question.answer...@gmail.com]
> > > > Sent: Wednesday, September 14, 2016 7:37 AM
> > > > To: user@tika.apache.org
> > > > Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > > > 
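> > > > The file that gets garbled when read with Tika is the one attached to this email.
> > > > 
> > > > * The file attached to my earlier email does not get garbled; its character
> > > >   encoding seems to have changed when I saved it in Hidemaru Editor.
> > > > 
> > > > -----------------------------------------------------
> > > > Hello.
> > > > 
> > > > I am having trouble.
> > > > 
> > > > With Apache Tika, text is garbled when reading HTML encoded in EUC or Shift-JIS.
> > > > 
> > > > What is the cause, and is there a fix (a Tika setting, for example)?
> > > > 
> > > > ■ I am attaching an HTML file that gets garbled when read.
> > > >       * It is an EUC-encoded file (according to Hidemaru Editor's detection).
> > > > 
> > > > Source: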
> > > > -----------------------------------------------------
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -----------------------------------------------------
> > > > 
> > > > 
> > > > --
> > > > Technical beginner
> > 
>
