Garbled text when parsing with Tika. Re: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika

[email protected] Wed, 14 Sep 2016 09:06:57 -0700

Is there any way for Tika to read HTML and PDF files encoded in EUC, Shift-JIS, 
or UTF-8?
I also want the extracted text as a String, converted to UTF-8, without garbling.
Please show me how to write the program.


When I parse with Tika, the text comes out garbled.

-- 
[email protected] <[email protected]>



> Sorry, can't tell what the question is?
> 
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] 
> Sent: Wednesday, September 14, 2016 11:50 AM
> To: Allison, Timothy B. <[email protected]>
> Subject: Re: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
> 
> Hi :)
> 
> How should I use the following in a Tika program?
> 
> ---------------------------------------------------------
> Tika applies the following detectors in this order:
> 
> org.apache.tika.parser.html.HtmlEncodingDetector
> org.apache.tika.parser.txt.UniversalEncodingDetector
> org.apache.tika.parser.txt.Icu4jEncodingDetector
> 
> These are specified in 
> META-INF/services/org.apache.tika.detect.EncodingDetector
> 
> Tika selects the first detector that returns a non-null value.
> ---------------------------------------------------------
> 
> 
> -- 
> [email protected] <[email protected]>
> 
> 
> 
> > Ha, thank you for running google translate for me. :)
> > 
> > If the question is: "If I don't know the encoding before I send it to Tika, 
> > how does Tika determine the encoding?"
> > 
> > Tika applies the following detectors in this order:
> > 
> > org.apache.tika.parser.html.HtmlEncodingDetector
> > org.apache.tika.parser.txt.UniversalEncodingDetector
> > org.apache.tika.parser.txt.Icu4jEncodingDetector
> > 
> > These are specified in 
> > META-INF/services/org.apache.tika.detect.EncodingDetector
> > 
> > Tika selects the first detector that returns a non-null value.
> > 
> > You can modify the service loading file to run the detectors in a different 
> > order or to specify your own encoding detector.
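
The "first detector that returns a non-null value wins" rule described above can be sketched in plain Java. This is only an illustration that uses the JDK, not Tika's classes: the `SimpleDetector` interface, the two toy detectors, and `detectFirst` are hypothetical names invented for this sketch, and it assumes the JDK in use ships the EUC-JP charset.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of a "first non-null answer wins" detector chain; not Tika's code. */
final class DetectorChainSketch {

    /** Hypothetical stand-in for an encoding detector: returns null when it cannot decide. */
    interface SimpleDetector {
        Charset detect(byte[] bytes);
    }

    private static final Pattern CHARSET_DECL =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Toy detector: trusts a charset=... declaration found near the start of the bytes. */
    static final SimpleDetector META_TAG = bytes -> {
        int len = Math.min(bytes.length, 8192);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable.
        String head = new String(bytes, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET_DECL.matcher(head);
        if (!m.find()) {
            return null;
        }
        try {
            return Charset.forName(m.group(1));
        } catch (IllegalArgumentException e) {
            return null; // illegal or unsupported charset name: let the next detector try
        }
    };

    /** Toy detector: answers UTF-8 only if the bytes decode as UTF-8 without errors. */
    static final SimpleDetector STRICT_UTF8 = bytes -> {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return null;
        }
    };

    /** Runs the detectors in list order and returns the first non-null answer, else null. */
    static Charset detectFirst(List<SimpleDetector> detectors, byte[] bytes) {
        for (SimpleDetector detector : detectors) {
            Charset charset = detector.detect(bytes);
            if (charset != null) {
                return charset;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] declared = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=EUC-JP\">"
                .getBytes(StandardCharsets.US_ASCII);
        // Same input, different order: the first detector with an answer decides.
        System.out.println(detectFirst(Arrays.asList(META_TAG, STRICT_UTF8), declared));
        System.out.println(detectFirst(Arrays.asList(STRICT_UTF8, META_TAG), declared));
    }
}
```

Running the two chains over the same input shows why the order matters: whichever detector comes first and has an opinion decides the result, even if a later detector would have answered differently.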
> > 
> > If the question is, "Why can't Tika get it right?"  Well, there are limits 
> > to statistical inference on only a few observations (a small number of 
> > bytes). :)
> > 
> > -----Original Message-----
> > From: [email protected] [mailto:[email protected]] 
> > Sent: Wednesday, September 14, 2016 11:06 AM
> > To: [email protected]
> > Cc: Allison, Timothy B. <[email protected]>
> > Subject: Re: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > 
> > Thank you for your answer.
> > 
> > I cannot determine in advance whether a file's character encoding is EUC, 
> > Shift-JIS, UTF-8, etc.
> > I would like a Java library or Tika to determine it for me.
> > I want to know how that determination is made.
> > 
> > 
> > Technical beginner
> > 
> > 
> > 
> > > Again, relying on Google translate.
> > > 
> > > The problem with these files is that they don't self identify their 
> > > encoding via http metaheaders, and they contain very little content so 
> > > Mozilla's UniversalChardet and ICU4J don't have enough to work with.  IE, 
> > > Chrome and Firefox all fail on these files, too.
> > > 
> > > If you know that a file is EUC_JP, you can send a hint via the metadata 
> > > before the call to parse:
> > > 
> > > 
> > > Metadata metadata = new Metadata();
> > > metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
> > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > String plainText = handler.toString();
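
The reason a charset hint matters can be shown with the JDK alone. The following is a minimal sketch with no Tika calls, assuming the JDK in use ships the EUC-JP charset; the class name `CharsetDecodeSketch` and the helper `toUtf8` are hypothetical names chosen for this example. It decodes the same bytes once with the correct charset and once with the wrong one, then re-encodes the correctly decoded text as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: the same bytes decoded with the right and the wrong charset (JDK only, no Tika). */
final class CharsetDecodeSketch {

    /** Decodes raw bytes with the given source charset, then re-encodes the text as UTF-8. */
    static byte[] toUtf8(byte[] raw, Charset source) {
        String text = new String(raw, source);
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Charset eucJp = Charset.forName("EUC-JP");
        // Three kanji written as Unicode escapes so the source file encoding does not matter.
        String original = "\u65e5\u672c\u8a9e";
        byte[] raw = original.getBytes(eucJp);

        // Correct charset: the text round-trips.
        String right = new String(raw, eucJp);
        // Wrong charset: malformed sequences are replaced, which shows up as garbled text.
        String wrong = new String(raw, StandardCharsets.UTF_8);

        System.out.println("decoded as EUC-JP matches original: " + right.equals(original));
        System.out.println("decoded as UTF-8 matches original:  " + wrong.equals(original));

        // Once decoded correctly, producing UTF-8 bytes is a plain re-encode.
        byte[] utf8 = toUtf8(raw, eucJp);
        System.out.println("UTF-8 round trip matches original:  "
                + new String(utf8, StandardCharsets.UTF_8).equals(original));
    }
}
```

A Java `String` holds decoded characters rather than bytes in a particular encoding, so "converting to UTF-8" only comes into play when the text is written back out as bytes, as `toUtf8` does here.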
> > > 
> > > 
> > > -----Original Message-----
> > > From: [email protected] 
> > > [mailto:[email protected]]
> > > Sent: Wednesday, September 14, 2016 7:37 AM
> > > To: [email protected]
> > > Subject: Correction: Garbled characters when reading EUC or Shift-JIS encoded HTML with Apache Tika
> > > 
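> > > The file that gets garbled when read with Tika is the one attached to this email.
> > > 
> > > * The file attached to my previous email apparently had its character encoding
> > >   changed when I saved it in Hidemaru Editor, so it does not come out garbled.
> > > 
> > > -----------------------------------------------------
> > > Hello.
> > > 
> > > I am stuck on a problem.
> > > 
> > > With Apache Tika, reading HTML encoded in EUC or Shift-JIS produces garbled text.
> > > 
> > > What is the cause, and is there a workaround (a Tika setting, for example)?
> > > 
> > > - I am attaching an HTML file that gets garbled when read.
> > >       * It is an EUC-encoded file (according to Hidemaru Editor's detection).
> > > 
> > > Source: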
> > > -----------------------------------------------------
> > > File document = new File("/usr/local/sample.pdf");
> > > Parser parser = new AutoDetectParser();
> > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > Metadata metadata = new Metadata();
> > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > String plainText = handler.toString();
> > > System.out.println(plainText);
> > > -----------------------------------------------------
> > > 
> > > 
> > > --
> > > Technical beginner
>
