It'll be faster, but I'm not so certain it'll be more reliable. For example, I know the xref section can be missing or completely inaccurate and Adobe Reader will still open it as if nothing is wrong. So either Adobe Reader is not a conforming reader, or it has a huge amount of code dedicated to detecting and recovering from non-conforming PDFs. Either way, it ignores the xref table at least some of the time (and perhaps all of the time).
I think the only way this will reduce parsing errors is if you're not accessing the part of the document which is non-conforming. For example, if page 5 is corrupt/non-conforming in a 10 page PDF, and you only read the first page, you'd avoid the error. On the other hand, if you process every page, you'll still run into it, and PDFBox may be able to auto-recover, or it might throw an exception.

At any rate, I'll try to get what I have out there either later tonight or tomorrow night.

----
Thanks,
Adam

From: "Kevin Jackson" <[email protected]>
To: <[email protected]>
Date: 04/20/2011 16:51
Subject: RE: RandomAccessFile for PDFBox

Basing the parser on RandomAccess would significantly improve the accuracy of the parser. I have found many files which can be read by Adobe Reader and fail with PDFBox. The problem was that bytes that are unreachable from the xref table can cause parsing errors. Using the force flag helped but didn't solve everything. I also found documents where the unreachable stuff was just valid enough to cause reachable valid data to be skipped. This resulted in null pointer exceptions later. Reading PDF files sequentially like PDFBox currently does works fine most of the time, but basing the parse on the offsets in the xref table would be more reliable.

Kevin Jackson

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, April 20, 2011 11:57 AM
To: [email protected]
Subject: Re: RandomAccessFile for PDFBox

Yeah, I'll create a JIRA issue for the conforming parser, which uses a RandomAccess, and I'll attach what I have done so far. Right now the new parser can read in the trailer and xref table, which allows it to jump directly to any object which needs to be loaded. Per the PDF spec, it starts at the end of the file and goes backward to get the EOF marker, xref location, and trailer information.
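To make the "start at the end of the file and go backward" step concrete, here is a minimal sketch in plain Java. It is not Adam's parser or any PDFBox API: the class name `StartXrefLocator`, both method names, and the 1024-byte tail size are assumptions made for illustration. It only handles the classic case where `startxref` points at an `xref` table; a cross-reference stream would not begin with that keyword.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of PDFBox.
class StartXrefLocator {
    // Number of trailing bytes to scan; an arbitrary choice for this sketch.
    private static final int TAIL_SIZE = 1024;

    /** Returns the offset written after the last "startxref", or -1 if none is found. */
    static long findStartXref(File pdf) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf, "r")) {
            long length = raf.length();
            int tailLength = (int) Math.min(TAIL_SIZE, length);
            byte[] tail = new byte[tailLength];
            raf.seek(length - tailLength);   // jump straight to the tail of the file
            raf.readFully(tail);
            String text = new String(tail, StandardCharsets.ISO_8859_1);

            int eof = text.lastIndexOf("%%EOF");
            if (eof < 0) {
                return -1;
            }
            int keyword = text.lastIndexOf("startxref", eof);
            if (keyword < 0) {
                return -1;
            }
            // Whatever sits between the keyword and %%EOF should be the decimal offset.
            String digits = text.substring(keyword + "startxref".length(), eof).trim();
            try {
                return Long.parseLong(digits);
            } catch (NumberFormatException e) {
                return -1;
            }
        }
    }

    /** Seeks to the given offset and checks whether a classic "xref" table starts there. */
    static boolean pointsAtXrefTable(File pdf, long offset) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(pdf, "r")) {
            if (offset < 0 || offset + 4 > raf.length()) {
                return false;
            }
            byte[] keyword = new byte[4];
            raf.seek(offset);
            raf.readFully(keyword);
            return "xref".equals(new String(keyword, StandardCharsets.ISO_8859_1));
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.out.println("usage: java StartXrefLocator <file.pdf>");
            return;
        }
        File pdf = new File(args[0]);
        long offset = findStartXref(pdf);
        System.out.println("startxref offset: " + offset);
        System.out.println("points at xref table: " + pointsAtXrefTable(pdf, offset));
    }
}
```

A file whose `startxref` value is missing or wrong makes `findStartXref` return -1 or `pointsAtXrefTable` return false. That is the situation discussed above, where a reader has to fall back to some recovery strategy instead of trusting the xref offsets.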
Thomas has the right idea: read in the minimum number of objects possible and then read the rest if/when the user requests them. Last I remember, I was still working on parsing all the different types of objects so I could read the Root entry.

----
Thanks,
Adam

From: Thomas Chojecki <[email protected]>
To: [email protected]
Date: 04/20/2011 07:27
Subject: RandomAccessFile for PDFBox

Hi all,

I checked the PDSignature object and came across the methods getContents and getSignedContent. I think both of these methods really need a better solution. I also recall that the saveIncremental method needs a rewrite as well. This will only be possible with something like the RandomAccessFile: some file from which we can create Input- and OutputStreams to work with.

Adam wrote in the PDFBOX-912 issue (from 03/Jan/11 00:56) [1] that he is implementing such a structure.

*@Adam, how far are you with the implementation? Can I help you? Or can you provide me the code and I will try to continue your work?*

My idea is to rewrite the PDDocument load method and write the input to a file instead of using an InputStream. Then provide the PDFParser with an InputStream created from the RandomAccessFile and parse the document.

The second idea is to use this file for the COSDictionaries: something like parsing the dictionary only when the user accesses it, so the dictionary only carries a reference to the object. I don't know if this is possible or whether it will slow down the parse process, so I will sit down and analyse the possible optimizations.

Maybe someone has other nice ideas for RandomAccessFile usage.
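The second idea above (an object that only carries a reference and is parsed when first accessed) could look roughly like the following sketch. `LazyObjectRef` is a made-up class, not a PDFBox type. It assumes the object's offset and byte length are already known, whereas a real parser would have to find the end of the object itself. "Parsing" is reduced here to reading the raw bytes as text.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical lazy reference for illustration; not part of PDFBox.
class LazyObjectRef {
    private final File file;
    private final long offset;
    private final int length;   // assumed known here; a real parser would scan for "endobj"
    private String cached;      // null until the object is first requested
    private int loadCount;      // how many times the file was actually read

    LazyObjectRef(File file, long offset, int length) {
        this.file = file;
        this.offset = offset;
        this.length = length;
    }

    /** Reads the object from disk on first access and returns the cached copy afterwards. */
    synchronized String get() throws IOException {
        if (cached == null) {
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                byte[] raw = new byte[length];
                raf.seek(offset);       // jump directly to the object, skipping everything else
                raf.readFully(raw);
                cached = new String(raw, StandardCharsets.ISO_8859_1);
                loadCount++;
            }
        }
        return cached;
    }

    synchronized int loadCount() {
        return loadCount;
    }
}
```

Creating a `LazyObjectRef` costs no I/O. Only the first `get()` touches the file, so objects the user never asks for are never read. Whether this speeds up or slows down a full parse is the open question raised above, since every deferred object costs a seek later.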
Best regards
Thomas

[1] https://issues.apache.org/jira/browse/PDFBOX-912?focusedCommentId=12976589&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-12976589

- FHA 203b; 203k; HECM; VA; USDA; Conventional
- Warehouse Lines; FHA-Authorized Originators
- Lending and Servicing in over 45 States
www.swmc.com - www.simplehecmcalculator.com
Visit www.swmc.com/resources for helpful links on Training, Webinars, Lender Alerts and Submitting Conditions

This email and any content within or attached hereto from Sun West Mortgage Company, Inc. is confidential and/or legally privileged. The information is intended only for the use of the individual or entity named on this email. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or taking any action in reliance on the contents of this email information is strictly prohibited, and that the documents should be returned to this office immediately by email. Receipt by anyone other than the intended recipient is not a waiver of any privilege. Please do not include your social security number, account number, or any other personal or financial information in the content of the email. Should you have any questions, please call (800) 453 7884.
