Hello,

We have a user with PDFs generated by a commercial transcription service. When we try to extract text from these PDFs, PDFBox returns only a few empty lines. We get this result both from our own code and from the ExtractText command-line tool.
If I specify the non-sequential parser with the -nonSeq flag, the following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream

If I uncompress the file with pdftk, PDFBox is able to successfully extract the text. Is it possible to perform this same uncompression with PDFBox? When I try the WriteDecodedDoc command, I get an error:

java.io.StreamCorruptedException: Error: data is null
    at org.apache.pdfbox.filter.LZWFilter.decode(LZWFilter.java:82)

The PDF looks like it has been generated by Aspose.Words for .NET 10.0.0.0. Unfortunately, I'm not authorized to share the file. I realize there is not a lot to go on in my description of the problem, but I appreciate any suggestions.

Thanks!
_jonas

