[jira] [Updated] (PDFBOX-2048) TextExtraction only working after uncompressing with pdftk

Tilman Hausherr (JIRA) Mon, 28 Apr 2014 14:13:37 -0700

     [ https://issues.apache.org/jira/browse/PDFBOX-2048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr updated PDFBOX-2048:
------------------------------------

    Description: 
From Jonas Karlsson on the user list:
===
We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these pdfs, pdfbox returns a few empty
lines. We get this result both from our own code, and when using the
ExtractText command line tool

If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using
workaround to read the stream

If I uncompress the file with pdftk, pdfbox is able to successfully extract
the text.
===

I have been given permission to attach the file "committers only". So don't 
pass it around, and avoid quoting details from the file. The file is also not 
rendering. The lengths of the streams are 0.

  was:
From Jonas Karlsson on the user list:
===
We have a user with PDFs generated by a commercial transcription service.
When we try to extract text from these pdfs, pdfbox returns a few empty
lines. We get this result both from our own code, and when using the
ExtractText command line tool

If I specify the non-sequential parser, with the -nonSeq flag, the
following error is produced:

Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
validateStreamLength

SEVERE: The end of the stream doesn't point to the correct offset, using
workaround to read the stream

If I uncompress the file with pdftk, pdfbox is able to successfully extract
the text.
===

I will attach the file "committers only". Don't pass it around, avoid quoting 
details from the file. The file is also not rendering. The lengths of the 
streams are 0.


> TextExtraction only working after uncompressing with pdftk
> ----------------------------------------------------------
>
>                 Key: PDFBOX-2048
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-2048
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing, Rendering, Text extraction
>    Affects Versions: 2.0.0
>            Reporter: Tilman Hausherr
>         Attachments: PDFBOX-2048.pdf
>
>
> From Jonas Karlsson on the user list:
> ===
> We have a user with PDFs generated by a commercial transcription service.
> When we try to extract text from these pdfs, pdfbox returns a few empty
> lines. We get this result both from our own code, and when using the
> ExtractText command line tool
> If I specify the non-sequential parser, with the -nonSeq flag, the
> following error is produced:
> Apr 28, 2014 10:35:11 AM org.apache.pdfbox.pdfparser.NonSequentialPDFParser
> validateStreamLength
> SEVERE: The end of the stream doesn't point to the correct offset, using
> workaround to read the stream
> If I uncompress the file with pdftk, pdfbox is able to successfully extract
> the text.
> ===
> I have been given permission to attach the file "committers only". So don't 
> pass it around, and avoid quoting details from the file. The file is also not 
> rendering. The lengths of the streams are 0.
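
For readers unfamiliar with the symptom: a stream whose dictionary declares a
length of 0 (or any other wrong value) can only be read by ignoring the declared
length and locating the "endstream" keyword instead. The sketch below
illustrates that general idea in plain Java. It is not PDFBox's
validateStreamLength code; the class and method names (StreamLengthCheck,
lengthIsPlausible, scanForLength, effectiveLength) are hypothetical, and it
assumes the whole file is in memory and that the caller already knows where
the stream data starts.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox: checks a declared stream length
// and falls back to scanning for the "endstream" keyword when it is wrong.
final class StreamLengthCheck {
    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    private StreamLengthCheck() {
    }

    // True if "endstream" (optionally preceded by CR, LF or CRLF) begins
    // exactly at dataStart + declaredLength.
    public static boolean lengthIsPlausible(byte[] pdf, int dataStart, int declaredLength) {
        if (dataStart < 0 || declaredLength < 0) {
            return false;
        }
        long end = (long) dataStart + declaredLength;
        if (end > pdf.length) {
            return false;
        }
        int pos = (int) end;
        if (pos < pdf.length && pdf[pos] == '\r') {
            pos++;
        }
        if (pos < pdf.length && pdf[pos] == '\n') {
            pos++;
        }
        return matchesAt(pdf, pos, ENDSTREAM);
    }

    // Fallback: scan forward for the first "endstream" and return the number
    // of data bytes before it (excluding one trailing EOL), or -1 if absent.
    // Limitation: binary data that happens to contain "endstream" fools this.
    public static int scanForLength(byte[] pdf, int dataStart) {
        for (int i = Math.max(dataStart, 0); i + ENDSTREAM.length <= pdf.length; i++) {
            if (matchesAt(pdf, i, ENDSTREAM)) {
                int end = i;
                if (end > dataStart && pdf[end - 1] == '\n') {
                    end--;
                }
                if (end > dataStart && pdf[end - 1] == '\r') {
                    end--;
                }
                return end - dataStart;
            }
        }
        return -1;
    }

    // Use the declared length when it checks out, otherwise the scanned one.
    public static int effectiveLength(byte[] pdf, int dataStart, int declaredLength) {
        return lengthIsPlausible(pdf, dataStart, declaredLength)
                ? declaredLength
                : scanForLength(pdf, dataStart);
    }

    private static boolean matchesAt(byte[] data, int pos, byte[] keyword) {
        if (pos < 0 || pos + keyword.length > data.length) {
            return false;
        }
        for (int i = 0; i < keyword.length; i++) {
            if (data[pos + i] != keyword[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Calling effectiveLength(pdf, dataStart, 0) on a non-empty stream makes the
plausibility check fail, so the scanned length is returned instead. A genuinely
empty stream passes the check and keeps its declared length of 0.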



--
This message was sent by Atlassian JIRA
(v6.2#6252)
