Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
Hi

I finally had a chance to look at this. It looks like there’s a long-standing bug in PdfParser::ReadXRefStreamContents.

Once called, the method assumes that all cross-reference information found by following the “Prev” keys is stored as cross-reference streams (XRefStm).

The IRS test documents use a mix of old-style cross-reference tables (xref) and cross-reference streams (XRefStm) in the Prev chain. I’m guessing they’ve been through a couple of different PDF editors.

PdfTokenizer::GetNextNumber() throws an error because, when it reads an xref table that it assumes is an XRefStm, the next token is “xref” instead of a number.

Given that fixing this might uncover more problems, and it’s very close to release day, I’d suggest keeping r1648 for the moment and I’ll submit a patch after the release.

Does that sound ok?

Cheers
Mark

Mark Rogers - mark.rog...@powermapper.com
PowerMapper Software Ltd - www.powermapper.com
Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL

From: Dennis Jenkins [mailto:dennis.jenkins...@gmail.com]
Sent: 30 June 2014 21:31
To: zyx
Cc: podofo-users@lists.sourceforge.net
Subject: Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

On Mon, Jun 30, 2014 at 3:10 PM, zyx z...@litepdf.cz wrote:

Hi,
thanks for the quick testing. I committed the patch as r1648 [1]. If you find time to give it more thorough testing by Friday, that would be great (you know, just in case it has any side effects).

Thanks again and bye,
zyx

[1] http://sourceforge.net/p/podofo/code/1648

Hello,

r1648 works fine for me, for both my quick parser test and for my full suite of unit tests for my own project.

Thank you!
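The mixed Prev chain Mark describes could be handled by looking at what actually sits at each offset before choosing how to parse it. The following is only a minimal sketch of that check, assuming the file contents are already in memory as a std::string. ClassifyXRefSection and XRefKind are hypothetical names for illustration, not PoDoFo API, and this is not the actual PoDoFo fix.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names for illustration only; not part of PoDoFo.
enum class XRefKind { ClassicTable, IndirectObject, Unknown };

// Skip whitespace starting at pos (std::isspace only approximates
// the PDF whitespace set; it omits NUL, for example).
static std::size_t SkipWhitespace(const std::string& data, std::size_t pos) {
    while (pos < data.size() &&
           std::isspace(static_cast<unsigned char>(data[pos]))) {
        ++pos;
    }
    return pos;
}

// Advance pos over one or more digits; returns false if there were none.
static bool SkipDigits(const std::string& data, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < data.size() &&
           std::isdigit(static_cast<unsigned char>(data[pos]))) {
        ++pos;
    }
    return pos > start;
}

// Decide what kind of cross-reference section starts at offset:
// the keyword "xref" (old-style table) or an "N G obj" header
// (an indirect object, which is where an xref stream would live).
XRefKind ClassifyXRefSection(const std::string& data, std::size_t offset) {
    if (offset >= data.size()) {
        return XRefKind::Unknown;
    }
    std::size_t pos = SkipWhitespace(data, offset);
    if (data.compare(pos, 4, "xref") == 0) {
        return XRefKind::ClassicTable;
    }
    // Expect: object number, whitespace, generation number, whitespace, "obj".
    if (!SkipDigits(data, pos)) return XRefKind::Unknown;
    std::size_t next = SkipWhitespace(data, pos);
    if (next == pos) return XRefKind::Unknown;
    pos = next;
    if (!SkipDigits(data, pos)) return XRefKind::Unknown;
    next = SkipWhitespace(data, pos);
    if (next == pos) return XRefKind::Unknown;
    if (data.compare(next, 3, "obj") == 0) {
        return XRefKind::IndirectObject;
    }
    return XRefKind::Unknown;
}
```

A caller walking the Prev chain would dispatch on the result at every step rather than assuming the previous section has the same form as the current one. An IndirectObject result only says an object header is present; the object's dictionary would still need to be checked before treating it as an xref stream.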
___
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users
Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
On Wed, Jul 2, 2014 at 1:47 PM, Mark Rogers mark.rog...@powermapper.com wrote:

Hi

I finally had a chance to look at this. It looks like there’s a long-standing bug in PdfParser::ReadXRefStreamContents.

Once called, the method assumes that all cross-reference information found by following the “Prev” keys is stored as cross-reference streams (XRefStm).

The IRS test documents use a mix of old-style cross-reference tables (xref) and cross-reference streams (XRefStm) in the Prev chain. I’m guessing they’ve been through a couple of different PDF editors.

PdfTokenizer::GetNextNumber() throws an error because, when it reads an xref table that it assumes is an XRefStm, the next token is “xref” instead of a number.

Given that fixing this might uncover more problems, and it’s very close to release day, I’d suggest keeping r1648 for the moment and I’ll submit a patch after the release.

Does that sound ok?

Cheers
Mark

+1

That sounds like a very sensible plan to me!
Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
On Mon, Jun 30, 2014 at 2:29 PM, zyx z...@litepdf.cz wrote:

On Sun, 2014-06-29 at 18:56 +0200, zyx wrote:

I am thinking of reverting the patch, to support those probably broken files, but I'd like to hear from you too, whether the file is truly broken.

Hi, Dennis,
could you try the attached patch, preferably on current trunk, please? It seems to survive the file you gave a link to, but I only tried to open it, not to modify it or read its objects.

Thanks and bye,
zyx

Hello Zyx,

With your patch applied to a clean checkout of rev 1646, my test suite can now open every PDF that I have (various tax forms from 2009 to current). I have not attempted to make use of the contents of the files that previously failed to parse, so I do not know if they are fully intact (in PoDoFo's internal model). My quick+dirty testing tool can count the number of pages in these PDFs though (seems ok).
Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
On Mon, 2014-06-30 at 14:57 -0500, Dennis Jenkins wrote:

With your patch applied to a clean checkout of rev 1646, my test suite can now open every PDF that I have (various tax forms from 2009 to current). I have not attempted to make use of the contents of the files that previously failed to parse, so I do not know if they are fully intact (in PoDoFo's internal model). My quick+dirty testing tool can count the number of pages in these PDFs though (seems ok).

Hi,
thanks for the quick testing. I committed the patch as r1648 [1]. If you find time to give it more thorough testing by Friday, that would be great (you know, just in case it has any side effects).

Thanks again and bye,
zyx

[1] http://sourceforge.net/p/podofo/code/1648

-- 
http://www.litePDF.cz i...@litepdf.cz
Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
On Mon, Jun 30, 2014 at 3:10 PM, zyx z...@litepdf.cz wrote:

Hi,
thanks for the quick testing. I committed the patch as r1648 [1]. If you find time to give it more thorough testing by Friday, that would be great (you know, just in case it has any side effects).

Thanks again and bye,
zyx

[1] http://sourceforge.net/p/podofo/code/1648

Hello,

r1648 works fine for me, for both my quick parser test and for my full suite of unit tests for my own project.

Thank you!
Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs
On Sun, 2014-06-22 at 23:26 -0500, Dennis Jenkins wrote:

Hello All,

I recently noticed that PoDoFo (svn rev 1642) was unable to parse several older PDFs (all obtained from the USA IRS for tax years 2011 and before). These PDFs were made with professional Adobe products, so I expect them to be conformant. I narrowed down the version of PoDoFo that causes the failure, but I have not analyzed the source code diff yet.

These PDFs parsed without error under PoDoFo svn rev 1586, but failed on rev 1587 (2014-04-01, change to PdfParser.cpp). Attempting to open the document with PoDoFo::PdfMemDocument() throws ePdfError_NoNumber.

I have a total of 6 IRS tax forms for various years that all fail to open in PoDoFo (they all throw the same exception [2]), but for now, I'll just focus on one. This [1] PDF was created with Adobe LiveCycle Designer ES 8.2 on 2010-11-22 (October 2010 revision of the 941 tax form).

I suspect that the PDFs are conformant (unproven hunch) and that PoDoFo 1587+ is buggy. Thoughts? Analysis?

[1] http://www.irs.gov/pub/irs-prior/f941--2010.pdf

[2] The following stack trace is from PoDoFo rev 1587:

PoDoFo encounter an error. Error: 14 ePdfError_NoNumber
    Error Description: A number was expected but not found.
    Callstack:
    #0 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:226
       Information: Unable to load objects from file.
    #1 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:289
       Information: Unable to skip xref dictionary.
    #2 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:738
    #3 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:551
       Information: Unable to load /XRefStm xref stream.
    #4 Error Source: /tmp/podofo/src/src/base/PdfParserObject.cpp:109
       Information: Object and generation number cannot be read.
    #5 Error Source: /tmp/podofo/src/src/base/PdfTokenizer.cpp:365
       Information: xref

Hi Mark,
I tried to investigate the above issue, which appeared after your fix for reading XRefStm streams at r1587 ( http://sourceforge.net/p/podofo/code/1587 ).
The file Dennis gave a link to at [1] above seems fine with respect to references to /XRefStm, but it seems that one of the streams contains a reference to an object which is out of position: instead of pointing to some '1234 0 obj', the offset points to an 'xref' tag. Here is the backtrace from gdb:

#0  PoDoFo::PdfTokenizer::GetNextNumber (this=0x7fffd1d0) at src/base/PdfTokenizer.cpp:366
#1  0x004af132 in PoDoFo::PdfParserObject::ReadObjectNumber (this=0x7fffd180) at src/base/PdfParserObject.cpp:105
#2  0x004af459 in PoDoFo::PdfParserObject::ParseFile (this=0x7fffd180, pEncrypt=0x0, bIsTrailer=false) at src/base/PdfParserObject.cpp:134
#3  0x004d1da1 in PoDoFo::PdfXRefStreamParserObject::Parse (this=0x7fffd180) at src/base/PdfXRefStreamParserObject.cpp:60
#4  0x004a9597 in PoDoFo::PdfParser::ReadXRefStreamContents (this=0x7b19d0, lOffset=203913, bReadOnlyTrailer=false) at src/base/PdfParser.cpp:824
#5  0x004a9690 in PoDoFo::PdfParser::ReadXRefStreamContents (this=0x7b19d0, lOffset=204202, bReadOnlyTrailer=false) at src/base/PdfParser.cpp:840
#6  0x004a84ae in PoDoFo::PdfParser::ReadNextTrailer (this=0x7b19d0) at src/base/PdfParser.cpp:549
#7  0x004a8f9a in PoDoFo::PdfParser::ReadXRefContents (this=0x7b19d0, lOffset=204376, bPositionAtEnd=true) at src/base/PdfParser.cpp:734
#8  0x004a6ba0 in PoDoFo::PdfParser::ReadDocumentStructure (this=0x7b19d0) at src/base/PdfParser.cpp:287
#9  0x004a6853 in PoDoFo::PdfParser::ParseFile (this=0x7b19d0, rDevice=..., bLoadOnDemand=true) at src/base/PdfParser.cpp:213
#10 0x004a6604 in PoDoFo::PdfParser::ParseFile (this=0x7b19d0, pszFilename=0x531b73 f941--2010.pdf, bLoadOnDemand=true) at src/base/PdfParser.cpp:157
#11 0x004878e6 in PoDoFo::PdfMemDocument::Load (this=0x7aa5b0, pszFilename=0x531b73 f941--2010.pdf) at src/doc/PdfMemDocument.cpp:186
#12 0x0047b435 in main () at test.cpp:69

I am thinking of reverting the patch, to support those probably broken files, but I'd like to hear from you too, whether the file is truly broken.
Thanks and bye,
zyx

-- 
http://www.litePDF.cz i...@litepdf.cz
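The failure at the top of the gdb backtrace can be reproduced in miniature: a reader that expects an object header such as '1234 0 obj' asks for a number and finds the keyword 'xref' instead. The sketch below is a simplified stand-in, not PoDoFo's tokenizer. NextToken and NextNumber are hypothetical helpers, and std::runtime_error takes the place of ePdfError_NoNumber.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not PoDoFo code): return the next
// whitespace-delimited token, advancing pos past it.
std::string NextToken(const std::string& data, std::size_t& pos) {
    while (pos < data.size() &&
           std::isspace(static_cast<unsigned char>(data[pos]))) {
        ++pos;
    }
    const std::size_t start = pos;
    while (pos < data.size() &&
           !std::isspace(static_cast<unsigned char>(data[pos]))) {
        ++pos;
    }
    return data.substr(start, pos - start);
}

// Hypothetical helper: read the next token as a non-negative integer.
// Throws if the token is missing or contains a non-digit, including
// the offending token in the message.
long NextNumber(const std::string& data, std::size_t& pos) {
    const std::string token = NextToken(data, pos);
    if (token.empty()) {
        throw std::runtime_error("number expected, found end of data");
    }
    for (char c : token) {
        if (!std::isdigit(static_cast<unsigned char>(c))) {
            throw std::runtime_error("number expected, found: " + token);
        }
    }
    return std::stol(token);
}
```

Called at an offset that holds '1234 0 obj', two successive NextNumber calls return the object and generation numbers. Called at an offset that holds an xref table, the first call throws and names 'xref' as the unexpected token, which matches the "Information: xref" line in Dennis's stack trace.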