Re: I want to parse Then garbled in Tika. Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread question.answer...@gmail.com
> > what the question is?
> >
> > -----Original Message-----
> > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 11:50 AM
> > To: Allison, Timothy B. <talli...@mitre.org>
> > Subject: Re: 訂正 :A

Re: I want to parse Then garbled in Tika. Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread question.answer...@gmail.com
> > Sorry, can't tell what the question is?
> >
> > -----Original Message-----
> > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 11:50 AM
> > To: Allison, Timothy B

I want to parse Then garbled in Tika. Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread question.answer...@gmail.com
> > to statistical inference on only a few observations (small amount of bytes). :)
> >
> > -----Original Message-----
> > From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
> > Sent: Wednesday, September 14, 2016 11:06 AM
> > To: user@tika.apa

RE: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread Allison, Timothy B.
Sorry, can't tell what the question is?
-----Original Message-----
From: question.answer...@gmail.com [mailto:question.answer...@gmail.com]
Sent: Wednesday, September 14, 2016 11:50 AM
To: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字

RE: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread Allison, Timothy B.
ilto:question.answer...@gmail.com]
Sent: Wednesday, September 14, 2016 11:06 AM
To: user@tika.apache.org
Cc: Allison, Timothy B. <talli...@mitre.org>
Subject: Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け
Thank you for your answer. I, character code of the file can not be determined EUC

Re: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread question.answer...@gmail.com
Thank you for your answer. I cannot determine in advance whether a file's character encoding is EUC, Shift-JIS, UTF-8, or something else. I would like a Java library or Tika to determine it for me, and I want to know how that detection is done.

RE: 訂正 :Apache Tikaで、EUCやshift-jisコードのhtmlの読込みで文字化け

2016-09-14 Thread Allison, Timothy B.
Again, relying on Google Translate. The problem with these files is that they don't self-identify their encoding via http meta headers, and they contain very little content, so Mozilla's UniversalChardet and ICU4J don't have enough to work with. IE, Chrome and Firefox all fail on these files,
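
To illustrate why very little content is a problem for any detector, here is a minimal sketch in plain Java. It is not how Tika, UniversalChardet, or ICU4J work internally; it only tries to strictly decode the bytes under each candidate charset and keeps the candidates that decode without error. The class `CharsetTrial` and its methods are hypothetical names made up for this example, and it assumes the JDK in use ships the `Shift_JIS` and `EUC-JP` charsets.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of Tika.
class CharsetTrial {

    // Returns true if the bytes decode under the named charset with no
    // malformed or unmappable sequences.
    static boolean decodesCleanly(byte[] data, String charsetName) {
        try {
            Charset.forName(charsetName).newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false; // bytes are not valid in this charset
        } catch (IllegalArgumentException e) {
            return false; // charset name is illegal or not supported by this JDK
        }
    }

    // Keeps every candidate charset under which the bytes decode cleanly.
    static List<String> plausibleCharsets(byte[] data, String... candidates) {
        List<String> plausible = new ArrayList<>();
        for (String name : candidates) {
            if (decodesCleanly(data, name)) {
                plausible.add(name);
            }
        }
        return plausible;
    }

    public static void main(String[] args) {
        String[] candidates = {"UTF-8", "Shift_JIS", "EUC-JP"};

        // ASCII-only content (typical of a nearly empty HTML page).
        byte[] ascii = "<html></html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("ASCII only: " + plausibleCharsets(ascii, candidates));

        // A few Japanese characters encoded as Shift_JIS (written as escapes
        // so the source file encoding does not matter).
        byte[] sjis = "\u6587\u5b57\u5316\u3051".getBytes(Charset.forName("Shift_JIS"));
        System.out.println("Shift_JIS sample: " + plausibleCharsets(sjis, candidates));
    }
}
```

ASCII-only input decodes cleanly under every candidate, so a validity check alone cannot choose between them. Statistical detectors go further and compare byte patterns against per-encoding models, which is why they need a reasonable number of non-ASCII bytes before the guess becomes reliable.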