RE: Garbled characters when parsing with Tika. Re: Correction: Garbled characters in Apache Tika when reading HTML encoded in EUC or Shift-JIS

2016-09-15 Thread Allison, Timothy B.
Yes, if I understand correctly, Tika should be doing all the work for you. As I pointed out in an earlier email, Tika or its dependencies can sometimes fail in any number of ways. When Tika fails, there are some things we can fix, and there are some things we cannot fix. It looks like your physicali

RE: When parsing with Tika, some characters in the body appear consecutively. Re: Apache Tika: characters in the PDF body text appear consecutively

2016-09-15 Thread Allison, Timothy B.
I just tested this with PDFBox 2.0.3-rc1 (which should be released soon), and I got this: 物性目录的用法(6) 关于耐药品性, 耐热水性, 耐湿热性 DB So, I think this problem will be fixed in the next version of Tika. After we upgrade to 2.0.3 you can also get a nightly build. -Original Message- From: questi

Garbled characters when importing a Chinese PDF.

2016-09-15 Thread question.answer...@gmail.com
I get garbled characters when I import a Chinese PDF. (in EUC, Shift-JIS, ) I want to read it in UTF-8. Or which encoding should I use? Below is my current program. - File document = new File(strFile_fullpath); ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE); Meta

Re: How to parse PDF files effectively with Tika

2016-09-15 Thread Sergey Beryozkin
Hi On 12/09/16 22:19, Sergey Beryozkin wrote: Hi Tim This is very helpful, thanks. I'll experiment with the code below. By the way, I've found out AutoDetectParser may not work if the (pdf) stream is an attachment stream which may not support a mark. I've been wondering, would it make sense to
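
The point above about streams that do not support a mark can be illustrated with a minimal sketch that uses only the Java standard library. It is not Tika code: `MarkSupportDemo`, `NoMarkStream`, `ensureMarkSupported` and `peek` are hypothetical names invented for illustration, and `NoMarkStream` only stands in for an attachment stream. The idea is that a detector needs to read the first bytes and then rewind, so a stream without mark/reset is wrapped in a `BufferedInputStream` before it is handed on.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MarkSupportDemo {

    /** Hypothetical stand-in for an attachment stream that cannot mark/reset. */
    static class NoMarkStream extends FilterInputStream {
        NoMarkStream(byte[] data) {
            super(new ByteArrayInputStream(data));
        }

        @Override
        public boolean markSupported() {
            return false;
        }

        @Override
        public synchronized void mark(int readlimit) {
            // Intentionally a no-op: this stream cannot remember a position.
        }

        @Override
        public synchronized void reset() throws IOException {
            throw new IOException("mark/reset not supported");
        }
    }

    /** Hypothetical helper: keep mark-capable streams as-is, buffer the rest. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n bytes and rewinds, the way a type detector peeks at a header. */
    static byte[] peek(InputStream in, int n) throws IOException {
        in.mark(n);
        byte[] head = new byte[n];
        int read = 0;
        while (read < n) {
            int r = in.read(head, read, n - read);
            if (r < 0) {
                break; // end of stream before n bytes were available
            }
            read += r;
        }
        in.reset();
        return Arrays.copyOf(head, read);
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfLike = "%PDF-1.4 example".getBytes(StandardCharsets.US_ASCII);
        InputStream safe = ensureMarkSupported(new NoMarkStream(pdfLike));
        byte[] head = peek(safe, 5);
        System.out.println(new String(head, StandardCharsets.US_ASCII));
        // After the peek the stream is back at the start, so a parser sees every byte.
        System.out.println(safe.read() == '%');
    }
}
```

In a real application the wrapped stream would then be passed to the parser. Tika also ships `TikaInputStream.get(InputStream)`, which may be worth checking for the same purpose; it is not shown here.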