[
https://issues.apache.org/jira/browse/PDFBOX-1299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated PDFBOX-1299:
---------------------------------------
Attachment: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf
Test PDF showing the problem. I got the PDF from
http://acrobatusers.com/gallery/pdf_portfolio_gallery, specifically
http://acrobatusers.com/assets/uploads/gallery/Tracey_Prather_31-Dec-2010_211843_2011Portfolio.pdf
In this PDF, at offset=446726, we have a "4 0 obj" stream, with
Length=368286.
If you skip ahead by that length, the next object is "5 0 obj".
But, unfortunately, within those bytes is an "endstream" on its own
line, just before offset=714247 (this "belongs" to the embedded PDF),
and that causes readUntilEndOfStream to stop too early, leading to
this IOException when running ExtractText (on current trunk):
{noformat}
Exception in thread "main" java.io.IOException: Unknown dir object c=']' cInt=93 peek=']' peekInt=93 757109
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1215)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionaryValue(BaseParser.java:216)
at org.apache.pdfbox.pdfparser.BaseParser.parseCOSDictionary(BaseParser.java:342)
at org.apache.pdfbox.pdfparser.BaseParser.parseDirObject(BaseParser.java:1117)
at org.apache.pdfbox.pdfparser.PDFParser.parseObject(PDFParser.java:557)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:187)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1090)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:1055)
at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:980)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:196)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:76)
{noformat}
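To make the failure mode concrete, here is a minimal sketch of a naive scan for "endstream". It is not PDFBox's actual readUntilEndOfStream; the class EndstreamScanSketch and the helper indexOfEndstream are hypothetical names made up for illustration, and the whole file is assumed to be in memory as a byte array.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; this is not code from PDFBox.
final class EndstreamScanSketch {

    private static final byte[] TOKEN =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // Returns the index of the first "endstream" token at or after 'from',
    // or -1 if the token does not occur.
    static int indexOfEndstream(byte[] data, int from) {
        outer:
        for (int i = Math.max(from, 0); i <= data.length - TOKEN.length; i++) {
            for (int j = 0; j < TOKEN.length; j++) {
                if (data[i + j] != TOKEN[j]) {
                    continue outer; // mismatch, try the next start position
                }
            }
            return i; // first match wins, even if it lies inside the stream data
        }
        return -1;
    }
}
```

When the stream data is itself a PDF, the first match is the embedded file's own "endstream", so a scan like this ends the outer stream well before the declared Length has been consumed.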
> BaseParser.readUntilEndOfStream can stop too early, causing IOException on
> valid PDFs
> -------------------------------------------------------------------------------------
>
> Key: PDFBOX-1299
> URL: https://issues.apache.org/jira/browse/PDFBOX-1299
> Project: PDFBox
> Issue Type: Bug
> Affects Versions: 1.6.0
> Reporter: Michael McCandless
> Attachments: TX0819_2009-07-27_Windstream-TCG_Agreement.pdf
>
>
> The purpose of BaseParser.readUntilEndOfStream is to scan ahead,
> copying bytes to the output, stopping once it sees "endstream".
> The problem with this approach is that the stream data itself sometimes
> contains "endstream", causing readUntilEndOfStream to stop too early.
> This can legitimately happen when the stream is an embedded PDF; I'll
> attach a test PDF showing this.
> However, the stream dict declares the stream length (in bytes)... so
> it seems like we should respect that length (if present) and
> simply copy over that many bytes, instead of scanning the stream bytes
> for "endstream"? This should be a lot faster too...
> I imagine we always scan so that we are more robust if the length is
> missing/invalid? Is that why this method was used? (I don't know the
> history here...). If so, maybe we can have an option to use
> the declared stream length if present.
> I have a patch to use the declared stream length (if present); with it,
> at least this test PDF parses correctly.
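The idea in the description could be sketched roughly as follows. This is not the attached patch: the class DeclaredLengthSketch and its methods are hypothetical, the file is assumed to be in memory as a byte array, and the extra check that "endstream" really follows the declared length (with a fallback to scanning when it does not) is an assumption added here to address the robustness concern raised above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; this is not the patch attached to the issue.
final class DeclaredLengthSketch {

    private static final byte[] ENDSTREAM =
            "endstream".getBytes(StandardCharsets.ISO_8859_1);

    // True if 'token' occurs in 'data' starting exactly at 'pos'.
    static boolean startsWith(byte[] data, int pos, byte[] token) {
        if (pos < 0 || pos + token.length > data.length) {
            return false;
        }
        for (int j = 0; j < token.length; j++) {
            if (data[pos + j] != token[j]) {
                return false;
            }
        }
        return true;
    }

    // Skips any CR/LF bytes starting at 'pos' and returns the new position.
    static int skipEol(byte[] data, int pos) {
        while (pos < data.length && (data[pos] == '\r' || data[pos] == '\n')) {
            pos++;
        }
        return pos;
    }

    // 'start' is the first byte of stream data; 'declaredLength' is the
    // Length entry from the stream dict, or -1 if it is missing.
    static byte[] readStreamBody(byte[] data, int start, long declaredLength) {
        if (declaredLength >= 0 && start + declaredLength <= data.length) {
            int end = (int) (start + declaredLength);
            // Assumption: trust the length only if "endstream" follows it.
            if (startsWith(data, skipEol(data, end), ENDSTREAM)) {
                return Arrays.copyOfRange(data, start, end);
            }
        }
        // Fallback for a missing or invalid length: scan for the first
        // "endstream". Any EOL before the token stays in the result.
        for (int i = start; i <= data.length - ENDSTREAM.length; i++) {
            if (startsWith(data, i, ENDSTREAM)) {
                return Arrays.copyOfRange(data, start, i);
            }
        }
        return Arrays.copyOfRange(data, start, data.length);
    }
}
```

With a valid length the copy is a single range copy and never looks inside the stream data, so an embedded "endstream" is harmless; with a missing or wrong length the sketch degrades to the scanning behaviour.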