[ https://issues.apache.org/jira/browse/PDFBOX-1769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13849150#comment-13849150 ]
Timo Boehme commented on PDFBOX-1769:
-------------------------------------
In my opinion this fix is very specific to this PDF. For example, if a file
contains an uncompressed stream with PDF markup in it, the 'obj' that is found
could come from inside that stream. I think a more general solution would be to
parse the PDF sequentially, collect possible object starts and ends, and use
this information to deduce the correct end of a stream, etc. However, this
fallback parsing is not implemented yet.
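
To illustrate the sequential scan described above, here is a minimal sketch.
It is not PDFBox code: the class ObjectStartScanner and the method
findObjectStarts are hypothetical names, and the sketch assumes the whole file
fits in memory. It only collects candidates; as noted above, a candidate may
still come from inside an uncompressed stream, so a real fallback parser would
have to validate each one (object ends could be collected the same way).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class ObjectStartScanner {

    // Matches "<object number> <generation> obj" as a standalone token.
    private static final Pattern OBJ_START =
            Pattern.compile("(?<![0-9A-Za-z])(\\d+)\\s+(\\d+)\\s+obj(?![0-9A-Za-z])");

    /**
     * Scans the raw bytes front to back and returns, for each key
     * "objNr genNr", every byte offset at which such an object header
     * appears. All candidates are kept, in file order.
     */
    static Map<String, List<Integer>> findObjectStarts(byte[] pdf) {
        // ISO-8859-1 maps each byte to exactly one char,
        // so match positions in the string equal byte offsets in the file.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        Map<String, List<Integer>> candidates = new LinkedHashMap<>();
        Matcher m = OBJ_START.matcher(text);
        while (m.find()) {
            String key = m.group(1) + " " + m.group(2);
            candidates.computeIfAbsent(key, k -> new ArrayList<>()).add(m.start());
        }
        return candidates;
    }
}
```

The offsets collected this way could then be compared with the entries of the
xref table to spot addresses that do not point at an object header.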
> Fix crash on invalid xref
> -------------------------
>
> Key: PDFBOX-1769
> URL: https://issues.apache.org/jira/browse/PDFBOX-1769
> Project: PDFBox
> Issue Type: Wish
> Components: Parsing
> Affects Versions: 1.8.2
> Reporter: William Palmer
> Assignee: Andreas Lehmkühler
> Fix For: 1.8.4, 2.0.0
>
>
> Need to search for a correct xref start address
> Example file:
> http://digitalcorpora.org/corp/nps/files/govdocs1/020/020747.pdf
> Exception in thread "main" java.io.IOException: Error: Expected an integer type, actual='ref'
>     at org.apache.pdfbox.pdfparser.BaseParser.readInt(BaseParser.java:1622)
> Using the code:
> PDFTextStripper ts = new PDFTextStripper();
> PrintWriter out = new PrintWriter(new FileWriter(new File(pFile + ".txt")));
> RandomAccess scratchFile =
>     new RandomAccessFile(File.createTempFile("pdfbox-", ".tmp"), "rw");
> PDDocument doc = PDDocument.loadNonSeq(new File(pFile), scratchFile);
> ts.setForceParsing(true);
> ts.writeText(doc, out);
> Related: PDFBOX-1757
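
The search for a correct xref start address mentioned in the description could
look roughly like the sketch below. This is an assumption for illustration and
not the fix committed for this issue: XrefLocator and findLastXrefOffset are
hypothetical names, the file is assumed to fit in memory, and only the classic
'xref' keyword is handled (PDF 1.5 cross-reference streams have no such
keyword).

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper for illustration only; not part of PDFBox. */
class XrefLocator {

    /**
     * Returns the byte offset of the last standalone "xref" keyword,
     * skipping the "xref" that is part of "startxref", or -1 if none exists.
     */
    static int findLastXrefOffset(byte[] pdf) {
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int pos = text.lastIndexOf("xref");
        while (pos >= 0) {
            boolean partOfStartxref = pos >= 5 && text.startsWith("start", pos - 5);
            if (!partOfStartxref) {
                return pos;
            }
            // Continue searching before the current hit.
            pos = text.lastIndexOf("xref", pos - 1);
        }
        return -1;
    }
}
```

If the address stored after 'startxref' does not point at an 'xref' keyword,
the offset found by such a search could be tried instead.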