The PDFs were a sample from a larger corpus, but I haven't tested the entire corpus yet. From what I can tell, IOExceptions are very rare, so being able to handle these cases is not a big deal as far as I am concerned. I am not sure how common the text parsing error is. It is a bit surprising that Adobe Reader can't extract the text when Preview and pdftotext can, but if that is the case I am not too worried about getting that PDF right either. I just wanted to check in, in case either of these issues was due to a bug that could be easily resolved.

Thanks,
Chris

On Thu, Jul 23, 2015 at 9:08 AM, Tilman Hausherr <[email protected]> wrote:

> On 23.07.2015 at 17:10, Chris Clark wrote:
>
>> Hi all,
>>
>> I have been using PDFBox 2.0 to parse a number of scholarly documents,
>> which has in general been working great. Version 2.0 is definitely a big
>> step up from 1.8.9. I ran into a couple of PDFs that PDFBox seemed to have
>> trouble parsing, and I wanted to run them by you to see if they could be
>> fixed or if I am missing something on my end. They are:
>>
>> http://vortex.cs.wayne.edu/papers/Limited_precision_weights_preprint.pdf
>> This PDF gets parsed fine by Preview from OS X, and I can copy the text
>> out of Preview without a problem. pdftotext also parses this PDF
>> without a problem. However, when I run the TextExtractor from PDFBox 2.0 on
>> it I get a lot of warnings and junk output.
>
> Adobe Reader can't extract the text either. Maybe OS X Preview is making a
> guess?
>
>> http://www.cs.princeton.edu/~chongw/papers/RanganathWangBleiXing2013.pdf
>> Here I get an IOException when using PDFBox 2.0 (but not in 1.8.9). I
>> filed PDFBOX-2845 for this problem, but I realize I should have gone to
>> the mailing list first.
>
> That was OK, I saw it... there just hasn't been anyone who has volunteered
> to make a change. I did have a look at that issue at that time... it looks
> like this is a malformed PDF, and the problem looked too complex for me; it
> involved a reference between ordinary PDF objects and compressed PDF object
> streams. (We do handle many malformed PDFs, but not all.)
>
> Ask yourself: is this really important to you, i.e. do you have many such
> files? Or is this just one of many files that you tried, to see what happens?
>
> Tilman

