RE: RandomAccessFile for PDFBox

Kevin Jackson Wed, 20 Apr 2011 16:51:23 -0700

Basing the parser on RandomAccess would significantly improve the
accuracy of the parser.  I have found many files which can be read by
Adobe Reader but fail with PDFBox.  The problem was that bytes that are
unreachable from the xref table can cause parsing errors.  Using the
force flag helped but didn't solve everything.  I also found documents
where the unreachable data was just valid enough to cause reachable
valid data to be skipped.  This resulted in null pointer exceptions
later.


Reading PDF files sequentially, as PDFBox currently does, works fine
most of the time, but basing the parse on the offsets in the xref
table would be more reliable.

Kevin Jackson

-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Wednesday, April 20, 2011 11:57 AM
To: [email protected]
Subject: Re: RandomAccessFile for PDFBox

Yeah, I'll create a JIRA issue for the conforming parser, which uses a
RandomAccess, and I'll attach what I have done so far.  Right now the
new parser can read in the trailer and xref table, which allows it to
jump directly to any object which needs to be loaded.  Per the PDF spec,
it starts at the end of the file and goes backward to get the EOF
marker, xref location, and trailer information.
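
For illustration, here is a minimal Java sketch of that backward scan.
It is not the code from the conforming parser: the class name
XrefLocator, the method findStartXref, and the 1024-byte tail window
are assumptions made for this example.  It only locates the offset that
follows the last "startxref" keyword and does not parse the xref table
or the trailer dictionary.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
class XrefLocator {
    // Number of trailing bytes to scan. 1024 is an assumed window size.
    private static final int TAIL_SIZE = 1024;

    /**
     * Returns the byte offset written after the last "startxref" keyword,
     * or -1 if the keyword, the %%EOF marker, or the number is missing.
     */
    static long findStartXref(RandomAccessFile raf) throws IOException {
        long length = raf.length();
        int tailLen = (int) Math.min(TAIL_SIZE, length);
        byte[] tail = new byte[tailLen];

        // Jump straight to the end of the file instead of reading sequentially.
        raf.seek(length - tailLen);
        raf.readFully(tail);

        // ISO-8859-1 maps one byte to one char, so string indexes equal byte indexes.
        String text = new String(tail, StandardCharsets.ISO_8859_1);
        int eof = text.lastIndexOf("%%EOF");
        if (eof < 0) {
            return -1;
        }
        int keyword = text.lastIndexOf("startxref", eof);
        if (keyword < 0) {
            return -1;
        }
        String digits = text.substring(keyword + "startxref".length(), eof).trim();
        try {
            return Long.parseLong(digits);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            System.out.println("xref table starts at byte " + findStartXref(raf));
        }
    }
}
```

With the returned offset, a parser can seek to the xref table and from
there to individual objects, without touching bytes that the table
never references.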

Thomas has the right idea: read in the minimum number of objects
possible and then read the rest if/when the user requests them.  Last I
remember, I was still working on parsing all the different types of
objects so I could read the Root entry.

---- 
Thanks,
Adam



From: Thomas Chojecki <[email protected]>
To: [email protected]
Date: 04/20/2011 07:27
Subject: RandomAccessFile for PDFBox



Hi all,
I checked the PDSignature object and came across the methods
getContents and getSignedContent. I think both of these methods really
need a better solution.

I also recall that the saveIncremental method needs a rewrite as well.

This will only be possible with something like a RandomAccessFile:
a file from which we can create Input- and OutputStreams to work
with.

Adam wrote in the PDFBOX-912 issue (from 03/Jan/11 00:56) [1] that he
is implementing such a structure. *@Adam, how far are you with the
implementation? Can I help you? Or can you provide me the code, and I
will try to continue your work?*

My idea is to rewrite the PDDocument load method so that it writes the
input to a file instead of using an InputStream, then provides the
PDFParser with an InputStream created from the RandomAccessFile and
parses the document.
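
A rough sketch of that first idea, assuming Java 7's
java.nio.file.Files is available.  SpooledInput, spool, and streamFrom
are hypothetical names, not existing PDFBox API, and how the PDFParser
would consume the stream is left out.

```java
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.channels.Channels;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;

// Hypothetical helper for illustration; not part of PDFBox.
class SpooledInput {

    /** Copies the whole stream into a temp file and opens it for random access. */
    static RandomAccessFile spool(InputStream in) throws IOException {
        File tmp = File.createTempFile("pdf-spool", ".tmp");
        tmp.deleteOnExit();
        // createTempFile already created the file, so it must be replaced.
        Files.copy(in, tmp.toPath(), StandardCopyOption.REPLACE_EXISTING);
        return new RandomAccessFile(tmp, "rw");
    }

    /**
     * Returns an InputStream that starts reading at the given byte offset.
     * The stream shares the file pointer with the RandomAccessFile, and
     * closing the stream also closes the underlying channel.
     */
    static InputStream streamFrom(RandomAccessFile raf, long offset) throws IOException {
        raf.seek(offset);
        return Channels.newInputStream(raf.getChannel());
    }
}
```

Because the stream and the RandomAccessFile share one file pointer,
only one reader should be active at a time in this sketch.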

The second idea is to use this file for the COSDictionaries:
something like parsing a dictionary only when the user accesses
it, so the dictionary only carries a reference to the object. I don't
know if this is possible or whether it will slow down the parse
process, so I will sit down and analyse the possible optimizations.
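
One way the deferred loading could look, as a sketch only.  LazyObject
is a hypothetical class, not a PDFBox type, and it assumes the object's
byte length is already known.  The xref table only stores offsets, so a
real parser would have to find the end of the object itself, for
example by scanning to the endobj keyword.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical holder for illustration; not part of PDFBox.
class LazyObject {
    private final RandomAccessFile file;
    private final long offset;  // where the object starts in the file
    private final int length;   // assumed to be known in this sketch
    private byte[] raw;         // stays null until the first access

    LazyObject(RandomAccessFile file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    /** True once the bytes have been read from the file. */
    synchronized boolean isLoaded() {
        return raw != null;
    }

    /** Reads the object's bytes on first use and caches them afterwards. */
    synchronized byte[] get() throws IOException {
        if (raw == null) {
            byte[] buf = new byte[length];
            file.seek(offset);
            file.readFully(buf);
            raw = buf;
        }
        return raw;
    }
}
```

The cost moves from load time to first access, so a document whose
objects are all visited would see no saving, only the extra seeks.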

Maybe someone has other nice ideas for RandomAccessFile usage.

Best regards
Thomas

[1] https://issues.apache.org/jira/browse/PDFBOX-912?focusedCommentId=12976589&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12976589








