Hi,
On 23.07.2015 at 18:08, Tilman Hausherr wrote:
On 23.07.2015 at 17:10, Chris Clark wrote:
Hi all,
I have been using PDFBox 2.0 to parse a number of scholarly documents,
which has in general been working great. Version 2.0 is definitely a big
step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
trouble parsing and I wanted to run them by you to see if they could be
fixed or if I am missing something on my end. They are:
http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
This PDF gets parsed fine by Preview on OS X, and I can copy the text
out of Preview without a problem. pdftotext also parses this PDF
without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
it I get lots of warnings and junk output.
Adobe Reader can't extract the text either. Maybe OS X Preview is making a guess?
http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
filed PDFBOX-2845 for this problem, but I realize I should have gone to the
mailing list first.
That was OK, I saw it... there just hasn't been anyone who has volunteered to
make a change. I did have a look at that issue at the time... it looks like
this is a malformed PDF, and the problem looked too complex for me: it involved
a reference between ordinary PDF objects and compressed PDF object streams. (We
do handle many malformed PDFs, but not all.)
This looks like the file attached to PDFBOX-2845, which works with the most
recent trunk.
BR
Andreas
Ask yourself: is this really important to you, i.e. do you have many such files?
Or is this just one of many files that you tried in order to see what happens?
Tilman
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]