[jira] [Resolved] (PDFBOX-2048) TextExtraction only working after uncompressing with pdftk

Tilman Hausherr (JIRA) Fri, 02 May 2014 01:44:36 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr resolved PDFBOX-2048.
-------------------------------------

    Resolution: Fixed

> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>            Assignee: Tilman Hausherr
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these PDFs, PDFBox returns a few empty
> lines. We get this result both from our own code and when using the
> ExtractText command line tool.
> If I specify the non-sequential parser with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using workaround to read the stream
> If I uncompress the file with pdftk, PDFBox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file as "committers only", so
> please don't pass it around, and avoid quoting details from the file. The
> file also does not render. The lengths of the streams are 0.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
