If the PDF contains incremental updates then there will be multiple %%EOF - that’s fine.
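
To check a file by hand, a small stand-alone scan can help. The following is only a sketch and not PDFBox code: the class `EofScan` and its methods are hypothetical names made up for this illustration, and the "stray data" test is a rough heuristic that assumes a new revision begins with an object number (as in "16 0 obj"). It counts the %%EOF markers and reports whether something other than whitespace or a digit follows each one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class EofScan {
    private static final byte[] MARKER = "%%EOF".getBytes(StandardCharsets.ISO_8859_1);

    /** Returns the index of the next %%EOF at or after 'from', or -1 if there is none. */
    static int indexOfMarker(byte[] data, int from) {
        for (int i = Math.max(from, 0); i <= data.length - MARKER.length; i++) {
            int j = 0;
            while (j < MARKER.length && data[i + j] == MARKER[j]) {
                j++;
            }
            if (j == MARKER.length) {
                return i;
            }
        }
        return -1;
    }

    /** Counts %%EOF markers; more than one usually means incremental updates. */
    static int countEofMarkers(byte[] data) {
        int count = 0;
        for (int i = indexOfMarker(data, 0); i >= 0; i = indexOfMarker(data, i + MARKER.length)) {
            count++;
        }
        return count;
    }

    /**
     * Heuristic (an assumption of this sketch): after the marker and any
     * whitespace we expect either end of data or a digit starting "N G obj".
     * Anything else is reported as stray data.
     */
    static boolean hasStrayDataAfter(byte[] data, int markerIndex) {
        int i = markerIndex + MARKER.length;
        while (i < data.length
                && (data[i] == ' ' || data[i] == '\r' || data[i] == '\n' || data[i] == '\t')) {
            i++;
        }
        return i < data.length && (data[i] < '0' || data[i] > '9');
    }

    public static void main(String[] args) {
        // Made-up miniature input mimicking the structure reported in this thread.
        byte[] sample = ("%PDF-1.4\n1 0 obj\nendobj\nstartxref\n9\n%%EOF\n@\n"
                + "16 0 obj\nendobj\nstartxref\n40\n%%EOF\n")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("%%EOF markers: " + countEofMarkers(sample));
        for (int i = indexOfMarker(sample, 0); i >= 0; i = indexOfMarker(sample, i + MARKER.length)) {
            System.out.println("offset " + i + ", stray data follows: " + hasStrayDataAfter(sample, i));
        }
    }
}
```

For a real file the bytes could be read with java.nio.file.Files.readAllBytes and passed to the same methods. The scan does not validate the PDF in any way; it only shows where the markers are.
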
BR
Maruan

On 20.10.2014 at 13:50, Vomlel Jan <[email protected]> wrote:

> Hi Maruan,
>
> I created a patch for bug PDFBOX-2436.
>
> After %%EOF it skips data up to the next object.
>
> I don't know whether such data are allowed by the specification, but some
> Czech portal creates them and Acrobat has no problem with them.
>
> I changed org.apache.pdfbox.pdfparser.PDFParser near line 584, branch 1.8.
> Can you commit it and fix this bug?
>
>
>                     pdfSource.unread(eof.getBytes("ISO-8859-1"));
>                 }
>             }
>         }
>         isEndOfFile = true;
>
>         //PDFBOX-2436 - some files contain binary data after %%EOF.
>         skipToNextObj();
>     }
> }
> //we are going to parse an normal object
> else
>
> Thank you, Jan
>
> -----Original Message-----
> From: Vomlel Jan
> Sent: Friday, October 17, 2014 9:12 AM
> To: [email protected]; [email protected]
> Subject: RE: problem with pdf eof
>
> I reported the parsing error for the load function:
> https://issues.apache.org/jira/browse/PDFBOX-2436
> Jan
>
> -----Original Message-----
> From: Maruan Sahyoun [mailto:[email protected]]
> Sent: Thursday, October 16, 2014 8:23 PM
> To: [email protected]; [email protected]
> Subject: Re: problem with pdf eof
>
> sorry if that has been unclear - as of now, if you’d like to sign you have
> to use load(); loadNonSeq() is not an option!
>
> For all other cases use loadNonSeq() and, if that fails, load() as a
> fallback.
>
> We are working on getting the missing signing support into nonSeq() but that
> will probably be after 2.0.
>
> Now if you have parsing issues with load() please open an issue in Jira and
> attach the PDFs together with code to reproduce it. Same if you have parsing
> issues with loadNonSeq().
>
> Of course if someone is willing to help getting that in … patches are welcome.
>
> Maruan
>
> On 16.10.2014 at 20:13, Brzrk One <[email protected]> wrote:
>
>> I hear dual advice here...
>> - don't use NonSeq for signatures
>> - but use NonSeq for multiple EOFs
>> Files with both multiple EOFs and signatures will have problems...
>> unless you mean we should parse 2x?
>>
>> On Thu, Oct 16, 2014 at 12:12 PM, Maruan Sahyoun <[email protected]> wrote:
>>
>>> depends on the parser being used. NonSeq does follow the Xref
>>> information and handles multiple EOFs (incremental updates) when parsing.
>>>
>>> BR
>>> Maruan
>>>
>>> On 16.10.2014 at 17:01, Brzrk One <[email protected]> wrote:
>>>
>>> I've noticed that when there are multiple EOFs in the file, PDFBox
>>> parsing is less reliable.
>>>
>>>
>>> On Thu, Oct 16, 2014 at 9:44 AM, Vomlel Jan <[email protected]> wrote:
>>>
>>> When I use load instead of loadNonSeq, the signatures are valid in this
>>> case.
>>>
>>> But for some documents the load function does not read the complete
>>> document. That is why I used loadNonSeq. Some signatures are then missing.
>>>
>>> See:
>>> http://leteckaposta.cz/831516385
>>> h1.pdf - original file (signature and timestamp)
>>> h2.pdf - first signature added by pdfbox (timestamp is missing)
>>> h3.pdf - second signature added by pdfbox (timestamp and previous
>>>          signature are missing)
>>>
>>> Jan
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Thursday, October 16, 2014 2:37 PM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> when signing please make sure that you load the pdf using
>>> PDDocument.load instead of PDDocument.loadNonSeq.
>>>
>>>
>>> On 16.10.2014 at 11:57, Vomlel Jan <[email protected]> wrote:
>>>
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Thursday, October 16, 2014 11:55 AM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> when you say invalid do you mean it’s corrupted or e.g. you get a
>>> warning sign in Adobe Reader? Would you have a sample PDF?
>>>
>>> When you sign a document and sign it again the first signature points to
>>> a different document revision as you have changed the document’s
>>> content afterwards.
>>> So invalid in that context could mean that the warning you might be
>>> getting is only reflecting that fact. Would need to see the document to
>>> understand what’s going on.
>>>
>>> BR
>>> Maruan
>>>
>>> On 16.10.2014 at 11:48, Vomlel Jan <[email protected]> wrote:
>>>
>>> Hi Maruan and others,
>>>
>>> I created a signature and it seems OK.
>>> But when I create a second signature (loadNonSeq, addSignature,
>>> saveIncremental again), the first signature becomes invalid.
>>>
>>> I think the problem may be that the first page is updated (the signature
>>> is invisible), but I don't understand it well enough.
>>>
>>> Jan
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Monday, October 13, 2014 4:09 PM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> Hi Jan,
>>>
>>> there are samples in the examples package for various ways to sign a
>>> document [1]. Signing a document needs incremental saving.
>>>
>>> OTOH choosing the right solution should not be based on whether there is
>>> a license fee or not.
>>>
>>> Maruan Sahyoun
>>>
>>> [1] http://svn.apache.org/viewvc/pdfbox/trunk/examples/src/main/java/org/apache/pdfbox/examples/signature/
>>>
>>> On 13.10.2014 at 16:02, Vomlel Jan <[email protected]> wrote:
>>>
>>> Hi Maruan (and others),
>>>
>>> I would like to use pdfbox and bouncycastle for managing pdf
>>> signatures: parsing, validation, timestamping (PAdES LTV).
>>>
>>> We used itext for it, but it is under a commercial licence.
>>> Parsing signatures seems to be working (thanks to your advice). So I
>>> will try to create a timestamp.
>>>
>>> Is it possible with pdfbox? I found the save method on PDDocument, but
>>> I’m afraid that it can change the byte representation of the pdf, so that
>>> the signatures become invalid. Is that true? What is the right way to
>>> create a signature or timestamp with pdfbox?
>>>
>>> Jan
>>>
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Friday, October 10, 2014 10:44 AM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> Hi Jan,
>>>
>>> choosing the right technology is very important so I do understand
>>> your concerns. I had to make such a decision about using PDFBox in the
>>> past too.
>>>
>>> It can
>>> If you have specific issues I can answer I’m happy to try to do so.
>>> As a general statement PDFBox is used in production environments today
>>> (as an example we ourselves are using it for a banking customer to
>>> process account statements, an airline company to preprocess
>>> archiving documents and various other customers).
>>>
>>> PDFBox is continuously enhancing the parsing as we try to deal with
>>> real world PDF files which are not always in line with the PDF
>>> specification. Currently the best approach is to use
>>> PDDocument.loadNonSeq (which parses documents according to the Xref
>>> information) and in case of an exception PDDocument.load (which
>>> parses sequentially). The Apache Tika project, which uses PDFBox for
>>> parsing PDFs, is running the parsing and text extraction against 50k
>>> PDFs being made available via http://digitalcorpora.org
>>>
>>> What is the application you would like to be using PDFBox for? Text
>>> extraction, image conversion …. - I might be able to give you more
>>> specific information for your use case.
>>>
>>> BR
>>> Maruan
>>>
>>> On 10.10.2014 at 10:10, Vomlel Jan <[email protected]> wrote:
>>>
>>> Thank you Maruan, this function loads the document.
>>>
>>> I have read https://pdfbox.apache.org/ideas.html "Replace/Enhance
>>> PDF parsing". I think correct parsing is very important, and I have
>>> some doubts whether I can use pdfbox in production. Can you say
>>> something to reassure me :-).
>>>
>>> Jan
>>>
>>> -----Original Message-----
>>> From: Maruan Sahyoun [mailto:[email protected]]
>>> Sent: Friday, October 10, 2014 9:25 AM
>>> To: [email protected]
>>> Subject: Re: problem with pdf eof
>>>
>>> Hi
>>>
>>> you can try PDDocument.loadNonSeq(InputStream is, null)
>>>
>>> BR
>>> Maruan
>>>
>>> On 10.10.2014 at 09:09, Vomlel Jan <[email protected]> wrote:
>>>
>>> Hello,
>>> I use the PDFBox 1.8.7 PDDocument.load(InputStream is) method to parse
>>> the PDF document in the attachment.
>>>
>>> The method returns without an exception, but the document model is
>>> incomplete.
>>>
>>> The problem is in the characters after EOF (offset 22939):
>>> startxref
>>> 22449
>>> %%EOF
>>> @
>>> 16 0 obj
>>> <<
>>> /Type /Catalog
>>>
>>> PDFBox creates an internal IOException and ignores it with the comment:
>>> /*
>>>  * PDF files may have random data after the EOF marker. Ignore errors if
>>>  * last object processed is EOF.
>>>  */
>>>
>>> Is this PDF construction valid?
>>> Which parser in PDFBox is correct? I tried ConformingPDFParser, but
>>> another error occurred.
>>>
>>> Jan
>>>
>>>
>>> Neither this e-mail nor any of the attached files constitutes acceptance
>>> of an offer to conclude a contract, unless expressly stated therein.
>>> Otherwise, they cannot be regarded as an act that would give rise to any
>>> claims against AiP Safe. This e-mail is intended only for the stated
>>> recipient and other persons named as recipients, and its content,
>>> including the content of all attached files, is confidential. If you are
>>> not an authorized recipient, please refrain from any form of disclosure,
>>> reproduction, copying, distribution or dissemination of its content,
>>> including the content of all attached files. If you have received this
>>> e-mail in error, please notify the sender immediately and delete the
>>> e-mail, including all attached files. All e-mails addressed to, received
>>> by or sent by AiP Safe s.r.o. or employees of AiP Safe s.r.o. are
>>> regarded as work e-mails as a matter of principle. Accordingly, the
>>> sender or recipient of these e-mails agrees that they may be read by
>>> employees of AiP Safe s.r.o. other than the given recipient or sender, in
>>> order to ensure the continuity of work activities and to enable their
>>> supervision.

