Yes, if I understand correctly, Tika should be doing all the work for you.
As I pointed out in an earlier email, sometimes Tika or its dependencies fail
in any number of ways.
When Tika fails, there are some things we can fix, and there are some things we
cannot fix.
It looks like your physicali
I just tested this with PDFBox 2.0.3-rc1 (which should be released soon), and I
got this:
物性目录的用法(6) 关于耐药品性, 耐热水性, 耐湿热性 DB
So I think this problem will be fixed in the next version of Tika. After we
upgrade to PDFBox 2.0.3, you can also get a nightly build.
-Original Message-
From: questi
I get garbled characters when I import a Chinese PDF (in EUC, Shift-JIS, etc.).
I want to read it in UTF-8.
Or which encoding should I use?
Below is my current program.
-
File document = new File(strFile_fullpath);
ContentHandler handler = new BodyContentHandler(Integer.MAX_VALUE);
Meta
Hi
On 12/09/16 22:19, Sergey Beryozkin wrote:
Hi Tim
This is very helpful, thanks.
I'll experiment with the code below.
By the way, I've found that AutoDetectParser may not work if the (PDF)
stream is an attachment stream, which may not support mark.
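
One possible workaround, sketched here with plain java.io only: wrap a stream that reports markSupported() == false in a BufferedInputStream before passing it on. The class name MarkSupportSketch, the NoMarkStream stand-in for the attachment stream, and the helpers ensureMarkSupported and peek are all hypothetical names made up for this illustration; nothing here is taken from the Tika code base.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class MarkSupportSketch {

    /** Hypothetical stand-in for an attachment stream without mark/reset support. */
    static final class NoMarkStream extends InputStream {
        private final InputStream in;

        NoMarkStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            return in.read();
        }

        @Override
        public boolean markSupported() {
            return false;
        }
    }

    /** Returns the stream unchanged if it supports mark/reset, otherwise a buffered wrapper. */
    static InputStream ensureMarkSupported(InputStream in) {
        return in.markSupported() ? in : new BufferedInputStream(in);
    }

    /** Reads up to n leading bytes and rewinds, roughly what type detection needs to do. */
    static byte[] peek(InputStream in, int n) {
        try {
            in.mark(n);
            byte[] buf = new byte[n];
            int total = 0;
            while (total < n) {
                int r = in.read(buf, total, n - total);
                if (r < 0) {
                    break; // end of stream reached before n bytes
                }
                total += r;
            }
            in.reset();
            return Arrays.copyOf(buf, total);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "%PDF-1.4 sample".getBytes(StandardCharsets.US_ASCII);
        InputStream raw = new NoMarkStream(new ByteArrayInputStream(data));

        InputStream safe = ensureMarkSupported(raw);
        byte[] header = peek(safe, 4);
        System.out.println(new String(header, StandardCharsets.US_ASCII));

        // The wrapped stream is rewound, so a parser would still see the full content.
        System.out.println(safe.read() == '%');
    }
}
```

Tika's own TikaInputStream.get(InputStream) may be the more natural wrapper in real code; the plain BufferedInputStream version above only illustrates the mark/reset point.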
I've been wondering, would it make sense to