[ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr resolved PDFBOX-2048.
-------------------------------------

    Resolution: Fixed

> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>   Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these PDFs, PDFBox returns a few empty
> lines. We get this result both from our own code and when using the
> ExtractText command line tool.
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
> If I uncompress the file with pdftk, PDFBox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only", so don't
> pass it around, and avoid quoting details from the file. The file is also
> not rendering. The lengths of the streams are 0.
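
For readers unfamiliar with the failure mode described above (a stream dictionary declaring /Length 0 while the stream actually contains data), here is a minimal Python sketch of the general idea behind a length check and a fallback. It is not PDFBox's code, and it does not reproduce what NonSequentialPDFParser does internally. The function names and the sample bytes are hypothetical. It assumes the fallback is a plain scan for the "endstream" keyword, which can misfire if the stream data itself contains those bytes.

```python
import re

ENDSTREAM = b"endstream"


def find_stream_data_start(pdf: bytes, obj_offset: int = 0) -> int:
    """Return the offset of the first data byte after the 'stream' keyword.

    Assumes obj_offset points at or before the object's 'stream' keyword.
    """
    match = re.compile(rb"stream\r?\n").search(pdf, obj_offset)
    if match is None:
        raise ValueError("no 'stream' keyword found")
    return match.end()


def declared_length_is_valid(pdf: bytes, data_start: int, declared_length: int) -> bool:
    """Check that 'endstream' follows the data if we trust the declared length."""
    end = data_start + declared_length
    tail = pdf[end:end + 2 + len(ENDSTREAM)]
    # An end-of-line marker may sit between the data and 'endstream'.
    return tail.lstrip(b"\r\n").startswith(ENDSTREAM)


def recover_length(pdf: bytes, data_start: int) -> int:
    """Fallback: derive the length by scanning forward for 'endstream'.

    Caveat: binary stream data may contain these bytes by coincidence.
    """
    end = pdf.find(ENDSTREAM, data_start)
    if end < 0:
        raise ValueError("no 'endstream' keyword found")
    # Drop a single end-of-line marker that precedes the keyword.
    if end - 2 >= data_start and pdf[end - 2:end] == b"\r\n":
        end -= 2
    elif end - 1 >= data_start and pdf[end - 1:end] in (b"\n", b"\r"):
        end -= 1
    return end - data_start


# Hypothetical object with a wrong /Length of 0, as in the report.
sample = b"1 0 obj\n<< /Length 0 >>\nstream\nBT (Hello) Tj ET\nendstream\nendobj\n"
start = find_stream_data_start(sample)
if not declared_length_is_valid(sample, start, 0):
    length = recover_length(sample, start)
    content = sample[start:start + length]
```

A parser that trusts /Length 0 without such a check would read an empty content stream, which would be consistent with the empty extraction output and the failed rendering reported here.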