[ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13984050#comment-13984050 ]
Tilman Hausherr edited comment on PDFBOX-2048 at 4/29/14 5:58 AM:
------------------------------------------------------------------

Change committed in the trunk in rev 1590873, and rev 1590874 in the 1.8 branch.

Jonas, you can find a new jar file at
https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox/1.8.6-SNAPSHOT/
within a few hours. However it will be a few months before this will be released officially.

I will set to resolve after the release of 1.8.5 (which will not include this change, because the cut was already done).

was (Author: tilman):
    Change committed in the trunk in rev 1590873, and rev 1590874 in the 1.8 branch.

    Jonas, you can find a new jar file at
    https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-war/1.8.6-SNAPSHOT/
    within a few hours. However it will be a few months before this will be released officially.

    I will set to resolve after the release of 1.8.5 (which will not include this change, because the cut was already done).

> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these pdfs, pdfbox returns a few empty
> lines. We get this result both from our own code, and when using the
> ExtractText command line tool
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
> If I uncompress the file with pdftk, pdfbox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only". So don't
> pass it around, avoid quoting details from the file. The file is also not
> rendering. The lengths of the streams are 0.

--
This message was sent by Atlassian JIRA
(v6.2#6252)