Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-07-02 Thread Mark Rogers
Hi

I finally had a chance to look at this. It looks like there’s a long-standing 
bug in PdfParser::ReadXRefStreamContents.

Once called, the method assumes that all cross reference information found by 
following the “Prev” keys is stored as cross ref streams (XRefStm). The IRS 
test documents use a mix of old style cross-ref tables (xref) and cross ref 
streams (XRefStm) in the Prev chain. I’m guessing they’ve been through a couple 
of different PDF editors.

PdfTokenizer::GetNextNumber() is throwing an error because the next token is 
“xref” instead of a number when it reads an xref table it assumes is an XRefStm.
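
To illustrate the kind of check that appears to be missing (a sketch only, not 
PoDoFo code and not the eventual patch): before treating a Prev offset as an 
XRefStm, a parser could peek at what is actually stored there. The names 
XRefKind and classifyXRefSection are made up for this example, and the input is 
assumed to be the whole file held in a std::string.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical names for illustration; none of this is PoDoFo API.
enum class XRefKind { Table, Stream, Unknown };

// Advance past whitespace (std::isspace approximates PDF whitespace here).
static std::size_t skipSpace(const std::string& buf, std::size_t pos) {
    while (pos < buf.size() &&
           std::isspace(static_cast<unsigned char>(buf[pos]))) {
        ++pos;
    }
    return pos;
}

// Advance past a run of digits; returns false if there were none.
static bool skipDigits(const std::string& buf, std::size_t& pos) {
    const std::size_t start = pos;
    while (pos < buf.size() &&
           std::isdigit(static_cast<unsigned char>(buf[pos]))) {
        ++pos;
    }
    return pos > start;
}

// Decide what kind of cross-reference section starts at `offset`:
// the keyword "xref" (old style table) or "N G obj" (an indirect object,
// which is how a cross-reference stream is stored).
XRefKind classifyXRefSection(const std::string& buf, std::size_t offset) {
    if (offset > buf.size()) {
        return XRefKind::Unknown;
    }
    std::size_t pos = skipSpace(buf, offset);
    if (buf.compare(pos, 4, "xref") == 0) {
        return XRefKind::Table;
    }
    // Expect: object number, whitespace, generation number, "obj".
    if (!skipDigits(buf, pos)) {
        return XRefKind::Unknown;
    }
    const std::size_t afterGap = skipSpace(buf, pos);
    if (afterGap == pos) {
        return XRefKind::Unknown;
    }
    pos = afterGap;
    if (!skipDigits(buf, pos)) {
        return XRefKind::Unknown;
    }
    pos = skipSpace(buf, pos);
    // A real parser would also confirm /Type /XRef in the object's dictionary.
    return buf.compare(pos, 3, "obj") == 0 ? XRefKind::Stream
                                           : XRefKind::Unknown;
}
```

A caller walking the Prev chain could then dispatch to the table reader or the 
stream reader based on the result, instead of assuming every previous section 
is the same kind as the current one.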

Given that fixing this might uncover more problems, and it’s very close to 
release day, I’d suggest keeping r1648 for the moment and I’ll submit a patch 
after the release.

Does that sound ok?

Cheers
Mark

Mark Rogers - mark.rog...@powermapper.com
PowerMapper Software Ltd - www.powermapper.com
Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL

From: Dennis Jenkins [mailto:dennis.jenkins...@gmail.com]
Sent: 30 June 2014 21:31
To: zyx
Cc: podofo-users@lists.sourceforge.net
Subject: Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs


On Mon, Jun 30, 2014 at 3:10 PM, zyx z...@litepdf.cz wrote:

Hi,
thanks for the quick testing. I committed the patch as r1648 [1]. If
you find time to give it more thorough testing by Friday, that
would be great (you know, just in case it has any side-effects).
Thanks again and bye,
zyx

[1] http://sourceforge.net/p/podofo/code/1648

Hello,
   r1648 works fine for me, for both my quick parser test and for my full suite 
of unit tests for my own project.  Thank you!
--
Open source business process management suite built on Java and Eclipse
Turn processes into business applications with Bonita BPM Community Edition
Quickly connect people, data, and systems into organized workflows
Winner of BOSSIE, CODIE, OW2 and Gartner awards
http://p.sf.net/sfu/Bonitasoft
___
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users


Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-07-02 Thread Dennis Jenkins
On Wed, Jul 2, 2014 at 1:47 PM, Mark Rogers mark.rog...@powermapper.com
wrote:

 Hi

 I finally had a chance to look at this. It looks like there’s a
 long-standing bug in PdfParser::ReadXRefStreamContents.

 Once called, the method assumes that all cross reference information found
 by following the “Prev” keys is stored as cross ref streams (XRefStm). The
 IRS test documents use a mix of old style cross-ref tables (xref) and
 cross ref streams (XRefStm) in the Prev chain. I’m guessing they’ve been
 through a couple of different PDF editors.

 PdfTokenizer::GetNextNumber() is throwing an error because the next token
 is “xref” instead of a number when it reads an xref table it assumes is an
 XRefStm.

 Given that fixing this might uncover more problems, and it’s very close to
 release day, I’d suggest keeping r1648 for the moment and I’ll submit a
 patch after the release.

 Does that sound ok?

 Cheers
 Mark


+1   That sounds like a very sensible plan to me!


Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-06-30 Thread Dennis Jenkins
On Mon, Jun 30, 2014 at 2:29 PM, zyx z...@litepdf.cz wrote:

 On Sun, 2014-06-29 at 18:56 +0200, zyx wrote:
  I am thinking of reverting the patch, to support those probably
  broken files, but I'd like to hear from you too on whether the file
  is truly broken.

 Hi,
 Dennis, could you try with the attached patch, preferably on current
 trunk, please? It seems to survive on the file you gave a link to, but
 I only tried to open it, not to modify it or read its objects.
 Thanks and bye,
 zyx


Hello Zyx,

With your patch applied to a clean checkout of rev 1646, my test suite
can now open every PDF that I have (various tax forms from 2009 to
current).  I have not attempted to make use of the contents of the files
that previously failed to parse, so I do not know if they are fully intact
(in PoDoFo's internal model).  My quick+dirty testing tool can count the
number of pages in these PDFs though (seems ok).


Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-06-30 Thread zyx
On Mon, 2014-06-30 at 14:57 -0500, Dennis Jenkins wrote:
 With your patch applied to a clean checkout of rev 1646, my test 
 suite can now open every PDF that I have (various tax forms from 
 2009 to current).  I have not attempted to make use of the contents 
 of the files that previously failed to parse, so I do not know if 
 they are fully intact (in PoDoFo's internal model).  My quick+dirty 
 testing tool can count the number of pages in these PDFs though 
 (seems ok).

Hi,
thanks for the quick testing. I committed the patch as r1648 [1]. If 
you find time to give it more thorough testing by Friday, that 
would be great (you know, just in case it has any side-effects).
Thanks again and bye,
zyx

[1] http://sourceforge.net/p/podofo/code/1648


-- 
http://www.litePDF.cz i...@litepdf.cz




Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-06-30 Thread Dennis Jenkins
On Mon, Jun 30, 2014 at 3:10 PM, zyx z...@litepdf.cz wrote:


 Hi,
 thanks for the quick testing. I committed the patch as r1648 [1]. If
 you find time to give it more thorough testing by Friday, that
 would be great (you know, just in case it has any side-effects).
 Thanks again and bye,
 zyx

 [1] http://sourceforge.net/p/podofo/code/1648


Hello,

   r1648 works fine for me, for both my quick parser test and for my full
suite of unit tests for my own project.  Thank you!


Re: [Podofo-users] SVN commit 1587 broke ability to parse several PDFs

2014-06-29 Thread zyx
On Sun, 2014-06-22 at 23:26 -0500, Dennis Jenkins wrote:
 Hello All,
 
 I recently noticed that PoDoFo (svn rev 1642) was unable to 
 parse several older PDFs (all obtained from the USA IRS for tax 
 years 2011 and before).  These PDFs were made with professional Adobe 
 products, so I expect them to be conformant.
 
 I narrowed down the version of PoDoFo that causes the failure, 
 but I have not analyzed the source code diff yet.  These PDFs parsed 
 without error under PoDoFo svn rev 1586, but failed on rev 1587 
 (2014-04-01, change to PdfParser.cpp).  Attempting to open the 
 document with PoDoFo::PdfMemDocument() throws ePdfError_NoNumber.
 
 I have a total of 6 IRS tax forms for various years that all 
 fail to open in PoDoFo (they all throw the same exception [2]), but 
 for now, I'll just focus on one.  This [1] PDF was created with 
 Adobe LiveCycle Designer ES 8.2 on 2010-11-22. (October 2010 
 revision of the 941 tax form).
 
 I suspect that the PDFs are conformant (unproven hunch) and that 
 PoDoFo 1587+ is buggy.
 
 Thoughts?  Analysis?
 
 
 [1]   http://www.irs.gov/pub/irs-prior/f941--2010.pdf
 
 [2]  The following stack trace is from PoDoFo rev 1587:
 PoDoFo encounter an error. Error: 14 ePdfError_NoNumber
 Error Description: A number was expected but not found.
 Callstack:
 #0 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:226
 Information: Unable to load objects from file.
 #1 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:289
 Information: Unable to skip xref dictionary.
 #2 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:738
 #3 Error Source: /tmp/podofo/src/src/base/PdfParser.cpp:551
 Information: Unable to load /XRefStm xref stream.
 #4 Error Source: /tmp/podofo/src/src/base/PdfParserObject.cpp:109
 Information: Object and generation number cannot be read.
 #5 Error Source: /tmp/podofo/src/src/base/PdfTokenizer.cpp:365
 Information: xref
 
 
 

Hi Mark,
I tried to investigate the above issue, which appeared after your fix 
for reading XRefStm streams in r1587 ( 
http://sourceforge.net/p/podofo/code/1587 ). The file Dennis gave a 
link to at [1] above seems fine with respect to references to 
/XRefStm, but it seems that one of the streams contains a reference to 
an object which is out of position: instead of pointing to some 
"1234 0 obj", the offset points to the 'xref' tag. Here is the 
backtrace from gdb:

#0  PoDoFo::PdfTokenizer::GetNextNumber (this=0x7fffd1d0)
    at src/base/PdfTokenizer.cpp:366
#1  0x004af132 in PoDoFo::PdfParserObject::ReadObjectNumber (this=0x7fffd180)
    at src/base/PdfParserObject.cpp:105
#2  0x004af459 in PoDoFo::PdfParserObject::ParseFile (this=0x7fffd180,
    pEncrypt=0x0, bIsTrailer=false) at src/base/PdfParserObject.cpp:134
#3  0x004d1da1 in PoDoFo::PdfXRefStreamParserObject::Parse (this=0x7fffd180)
    at src/base/PdfXRefStreamParserObject.cpp:60
#4  0x004a9597 in PoDoFo::PdfParser::ReadXRefStreamContents (this=0x7b19d0,
    lOffset=203913, bReadOnlyTrailer=false) at src/base/PdfParser.cpp:824
#5  0x004a9690 in PoDoFo::PdfParser::ReadXRefStreamContents (this=0x7b19d0,
    lOffset=204202, bReadOnlyTrailer=false) at src/base/PdfParser.cpp:840
#6  0x004a84ae in PoDoFo::PdfParser::ReadNextTrailer (this=0x7b19d0)
    at src/base/PdfParser.cpp:549
#7  0x004a8f9a in PoDoFo::PdfParser::ReadXRefContents (this=0x7b19d0,
    lOffset=204376, bPositionAtEnd=true) at src/base/PdfParser.cpp:734
#8  0x004a6ba0 in PoDoFo::PdfParser::ReadDocumentStructure (this=0x7b19d0)
    at src/base/PdfParser.cpp:287
#9  0x004a6853 in PoDoFo::PdfParser::ParseFile (this=0x7b19d0, rDevice=...,
    bLoadOnDemand=true) at src/base/PdfParser.cpp:213
#10 0x004a6604 in PoDoFo::PdfParser::ParseFile (this=0x7b19d0,
    pszFilename=0x531b73 f941--2010.pdf, bLoadOnDemand=true)
    at src/base/PdfParser.cpp:157
#11 0x004878e6 in PoDoFo::PdfMemDocument::Load (this=0x7aa5b0,
    pszFilename=0x531b73 f941--2010.pdf) at src/doc/PdfMemDocument.cpp:186
#12 0x0047b435 in main () at test.cpp:69


I am thinking of reverting the patch, to support those probably 
broken files, but I'd like to hear from you too on whether the file 
is truly broken.

Thanks and bye,
zyx


-- 
http://www.litePDF.cz i...@litepdf.cz

