RE: Parsing with Tika produces garbled text. Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-15 Thread Allison, Timothy B.
Yes, if I understand correctly, Tika should be doing all the work for you.

As I pointed out in an earlier email, sometimes Tika or its dependencies fail 
in any number of ways.

When Tika fails, there are some things we can fix, and there are some things we 
cannot fix.

It looks like your physicality.pdf doc was already fixed by PDFBox.  The 
mojibake document, however, is not fixable with our current encoding 
detectors: there are just too few non-ASCII bytes for encoding detection to 
work.
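
To illustrate why a handful of non-ASCII bytes is not enough, here is a minimal
sketch that uses only the JDK's charset decoders, not Tika's actual detectors.
The class and method names are made up for this example, and it assumes the JDK
in use ships the EUC-JP and Shift_JIS charsets. The same two bytes decode
cleanly under both Japanese encodings, so nothing in the bytes themselves says
which one was intended.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of Tika.
final class AmbiguousBytesDemo {

    /** Returns the candidate charsets under which the bytes decode without any error. */
    static List<String> plausibleCharsets(byte[] data, String... candidateNames) {
        List<String> plausible = new ArrayList<>();
        for (String name : candidateNames) {
            try {
                // Charset.forName throws an unchecked exception if the JDK lacks the charset.
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                plausible.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset, so it is ruled out.
            }
        }
        return plausible;
    }

    public static void main(String[] args) {
        // 0xB1 0xB2 is one double-byte character in EUC-JP,
        // two half-width katakana in Shift_JIS, and malformed in UTF-8.
        byte[] data = {(byte) 0xB1, (byte) 0xB2};
        System.out.println(plausibleCharsets(data, "EUC-JP", "Shift_JIS", "UTF-8"));
    }
}
```

With more text, the candidates start to produce decoding errors or implausible
character sequences, which is the signal that statistical detectors rely on.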


Re: Parsing with Tika produces garbled text. Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread question.answer...@gmail.com

Here is a diagram of what I want to do.
Is there somewhere in the program that I need to fix?

files                  do                       result

PDF  -->
HTML -->   Tika does the analysis.   -->   String (Java UTF-8)
TXT  -->
 ^                                            ^
 ^                                            ^
EUC                                         UTF-8
Shift-JIS
GB2312
 :


> > > > -
> > > > File document = new File("/usr/local/sample.pdf");
> > > > Parser parser = new AutoDetectParser();
> > > > ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
> > > > Metadata metadata = new Metadata();
> > > > parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
> > > > String plainText = handler.toString();
> > > > System.out.println(plainText);
> > > > -


Translated from Japanese:
I want to ingest various kinds of files with Tika.
However, depending on a file's character encoding, the text is imported in a garbled state.
What should I do?



-- 
question.answer...@gmail.com 



Re: Parsing with Tika produces garbled text. Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread question.answer...@gmail.com


files                  do                       result

PDF  -->
HTML -->   Tika does the analysis.   -->   String (Java UTF-8)
TXT  -->



-- 
question.answer...@gmail.com 



Parsing with Tika produces garbled text. Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread question.answer...@gmail.com
Is there any way for Tika to read HTML and PDF files encoded in EUC, 
Shift-JIS, or UTF-8?
I also want the text converted to UTF-8 and stored in a String without garbling.
Please tell me how to write the program.

When I parse with Tika, the text comes out garbled.

-- 
question.answer...@gmail.com 







RE: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread Allison, Timothy B.
Sorry, I can't tell what the question is.

-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. 
Subject: Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

Hi :)

How should I use the following in a Tika program?

-
Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.
-


-- 
question.answer...@gmail.com 







RE: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread Allison, Timothy B.
Ha, thank you for running Google Translate for me. :)

If the question is: "If I don't know the encoding before I send it to Tika, how 
does Tika determine the encoding?"

Tika applies the following detectors in this order:

org.apache.tika.parser.html.HtmlEncodingDetector
org.apache.tika.parser.txt.UniversalEncodingDetector
org.apache.tika.parser.txt.Icu4jEncodingDetector

These are specified in META-INF/services/org.apache.tika.detect.EncodingDetector

Tika selects the first detector that returns a non-null value.

You can modify the service loading file to run the detectors in a different 
order or to specify your own encoding detector.

If the question is, "Why can't Tika get it right?"  Well, there are limits to 
statistical inference on only a few observations (a small number of bytes). :)
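
The "first detector that returns a non-null value wins" rule can be sketched
with plain JDK code. This is only an illustration of the selection logic: the
Detector interface and the two toy detectors below are hypothetical stand-ins
and are not Tika's EncodingDetector or its real detectors.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types belong to Tika.
final class DetectorChainSketch {

    /** Simplified stand-in for an encoding detector; null means "no opinion". */
    interface Detector {
        Charset detect(byte[] data);
    }

    /** Toy detector: trusts a charset=... declaration found in the markup. */
    static final class MetaTagDetector implements Detector {
        private static final Pattern CHARSET = Pattern.compile(
                "charset\\s*=\\s*[\"']?([A-Za-z0-9][A-Za-z0-9_\\-]*)",
                Pattern.CASE_INSENSITIVE);

        @Override
        public Charset detect(byte[] data) {
            // ISO-8859-1 maps every byte to a char, so ASCII markup stays readable.
            Matcher m = CHARSET.matcher(new String(data, StandardCharsets.ISO_8859_1));
            if (m.find() && Charset.isSupported(m.group(1))) {
                return Charset.forName(m.group(1));
            }
            return null;
        }
    }

    /** Toy detector: recognizes only a UTF-8 byte order mark. */
    static final class Utf8BomDetector implements Detector {
        @Override
        public Charset detect(byte[] data) {
            boolean bom = data.length >= 3
                    && (data[0] & 0xFF) == 0xEF
                    && (data[1] & 0xFF) == 0xBB
                    && (data[2] & 0xFF) == 0xBF;
            return bom ? StandardCharsets.UTF_8 : null;
        }
    }

    /** Runs the detectors in list order and returns the first non-null answer. */
    static Charset detectFirst(List<Detector> detectors, byte[] data) {
        for (Detector detector : detectors) {
            Charset charset = detector.detect(data);
            if (charset != null) {
                return charset;
            }
        }
        return null; // every detector abstained
    }

    public static void main(String[] args) {
        byte[] html = "<meta charset=\"ISO-8859-1\"><p>hello</p>"
                .getBytes(StandardCharsets.ISO_8859_1);
        List<Detector> chain = List.of(new MetaTagDetector(), new Utf8BomDetector());
        System.out.println(detectFirst(chain, html));
    }
}
```

Because the first answer wins, reordering the list changes the result whenever
two detectors disagree, which is why the order in the service file matters.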




Re: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread question.answer...@gmail.com
Thank you for your answer.

I cannot determine in advance whether a file's character encoding is EUC, 
Shift-JIS, UTF-8, etc.
I would like a Java library or Tika to determine it for me.
I want to know how that determination is made.


A technical beginner






RE: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

2016-09-14 Thread Allison, Timothy B.
Again, relying on Google translate.

The problem with these files is that they don't self-identify their encoding 
via http metaheaders, and they contain very little content, so Mozilla's 
UniversalChardet and ICU4J don't have enough to work with.  IE, Chrome and 
Firefox all fail on these files, too.

If you know that a file is EUC_JP, you can send a hint via the metadata before 
the call to parse:


Metadata metadata = new Metadata();
metadata.set(Metadata.CONTENT_TYPE, "text/html; charset=EUC_JP");
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
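
Why a correct hint matters can be shown with the JDK alone. The sketch below is
not how Tika processes the hint internally; the class and helper names are
hypothetical, and it assumes the JDK ships the EUC-JP and Shift_JIS charsets.
It reads the charset parameter out of a Content-Type value and decodes the same
bytes with it, once with the right declaration and once with a wrong one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; it does not call Tika.
final class CharsetHintSketch {

    /** Extracts the charset parameter from a Content-Type value, or returns the fallback. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                try {
                    return Charset.forName(p.substring(8).trim());
                } catch (IllegalArgumentException e) {
                    return fallback; // illegal or unsupported charset name
                }
            }
        }
        return fallback;
    }

    /** Decodes the bytes with the charset declared in the Content-Type value (UTF-8 if none). */
    static String decodeWithHint(byte[] data, String contentType) {
        return new String(data, charsetFromContentType(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // three kanji, written as escapes
        byte[] eucBytes = original.getBytes(Charset.forName("EUC-JP"));

        // Correct declaration: the text round-trips.
        System.out.println(original.equals(decodeWithHint(eucBytes, "text/html; charset=EUC_JP")));
        // Wrong declaration: the same bytes turn into mojibake.
        System.out.println(original.equals(decodeWithHint(eucBytes, "text/html; charset=Shift_JIS")));
    }
}
```

The hint only helps when the encoding really is known; a wrong hint garbles the
text just as a wrong guess would.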


-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com] 
Sent: Wednesday, September 14, 2016 7:37 AM
To: user@tika.apache.org
Subject: Correction: Garbled text when reading EUC or Shift-JIS encoded HTML with Apache Tika

The file that comes out garbled when read with Tika is the one attached to this email.

Note: the file attached to my previous email apparently had its character
      encoding changed when I saved it in Hidemaru Editor, so it does not come out garbled.

-
Hello.

I am having trouble.

When I read EUC or Shift-JIS encoded HTML with Apache Tika, the text comes out garbled.

What is the cause, and is there a workaround (a Tika setting, for example)?

* I am attaching an HTML file that comes out garbled when read.
  Note: it is an EUC-encoded file (according to Hidemaru Editor's detection).

Source:
-
File document = new File("/usr/local/sample.pdf");
Parser parser = new AutoDetectParser();
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Metadata metadata = new Metadata();
parser.parse(new FileInputStream(document), handler, metadata, new ParseContext());
String plainText = handler.toString();
System.out.println(plainText);
-


-- 
A technical beginner