TextExtraction only working after uncompressing with pdftk

Jonas Karlsson Mon, 28 Apr 2014 08:01:49 -0700

Hello,

We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these PDFs, PDFBox returns only a few
empty lines. We get this result both from our own code and from the
ExtractText command line tool.


If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream


If I uncompress the file with pdftk first, PDFBox is able to extract the
text.
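
For reference, the pdftk step can be sketched roughly as follows. This is a minimal illustration, assuming pdftk is on the PATH; the helper names and the file names are hypothetical placeholders, not part of our actual code.

```python
import subprocess
from pathlib import Path


def pdftk_uncompress_command(src, dst):
    """Build the pdftk command line that rewrites src with uncompressed page streams."""
    # "uncompress" is pdftk's output option for removing page stream compression.
    return ["pdftk", str(src), "output", str(dst), "uncompress"]


def uncompress(src, dst):
    """Run pdftk on src and return the path of the uncompressed copy.

    Raises subprocess.CalledProcessError if pdftk exits with a non-zero status.
    """
    subprocess.run(pdftk_uncompress_command(src, dst), check=True)
    return Path(dst)
```

Calling uncompress("in.pdf", "out.pdf") and then running text extraction on out.pdf corresponds to the workaround described above.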

Is it possible to perform the same uncompression with PDFBox? When I try
the WriteDecodedDoc command, I get an error:

java.io.StreamCorruptedException: Error: data is null
        at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)


The PDF appears to have been generated by Aspose.Words for .NET 10.0.0.0.
Unfortunately, I'm not authorized to share the file.


I realize my description of the problem does not give much to go on, but
I would appreciate any suggestions.


Thanks!


_jonas
