Hi-

I have a question regarding the limitation on entering getLength() for a second 
time while parsing a pdf.

I understand that it is possible to create a malicious pdf which sends the 
parser into an infinite loop by having it parse nested streams that refer to 
each other.  I do not believe this to be the case with these files (they are 
from well-known corporate book publishers).

As far as I can tell, pdfbox prohibits this nesting behavior by passing a 
boolean flag around: it sets the inGetLength member variable when it first 
enters getLength() and clears it upon exit.

I have several pdfs which open fine in Acrobat and Google Chrome (which is 
based on the pdfium engine), yet when I try to open them using pdfbox they 
throw the "Object must be defined and must not be compressed object" error.

By observation, pdfium seems to get around this issue by keeping a counter of 
recursion depth (they use a max of 64).  This allows shallow nesting, but they 
throw an exception if the nesting gets too deep, thereby preventing malicious 
pdfs from looping indefinitely.

I have forked pdfbox on Github and made a few minor changes that replace the 
boolean inGetLength flag with an integer counter and a constant max depth 
instead.  This would allow pdfbox to continue to process a compressed stream 
provided the depth does not exceed the max depth.
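
Roughly, the idea looks like the sketch below.  It is a simplified stand-in 
rather than the code in the pull request: LengthResolver, LengthLookup and 
MAX_GET_LENGTH_DEPTH are hypothetical names, and the limit of 2 is only an 
assumed value for illustration.

```java
import java.io.IOException;

// Hypothetical stand-in for the parser; this is not actual pdfbox code.
class LengthResolver {
    // Assumed limit for illustration; the real constant may differ.
    static final int MAX_GET_LENGTH_DEPTH = 2;

    // Replaces the boolean "inGetLength" flag with a nesting counter.
    private int getLengthDepth = 0;

    // Hypothetical callback standing in for "resolve the /Length object",
    // which may itself need to call getLength() again.
    interface LengthLookup {
        int resolve(LengthResolver resolver) throws IOException;
    }

    int getLength(LengthLookup lookup) throws IOException {
        // Refuse to go deeper once the limit is reached, so streams that
        // refer to each other cannot recurse indefinitely.
        if (getLengthDepth >= MAX_GET_LENGTH_DEPTH) {
            throw new IOException(
                "getLength() nested deeper than " + MAX_GET_LENGTH_DEPTH);
        }
        getLengthDepth++;
        try {
            return lookup.resolve(this);
        } finally {
            // Always undo the increment, even when an exception is thrown.
            getLengthDepth--;
        }
    }
}
```

With a limit of 2, one level of nesting inside getLength() is accepted and a 
third nested call is rejected.  The finally block restores the counter so a 
failed lookup does not leave the parser stuck at a non-zero depth.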

For all of the pdfs that were failing this test before, simply allowing a depth 
of 2 instead of 1 seemed to be enough to allow pdfbox to process the files 
without throwing the exception.

If you would be so kind as to take a look and comment on it, I would be most 
appreciative.


I am hoping that this tweak is ok.  The intent is to continue to prevent 
malicious looping in the pdfs, but still allow shallow nesting to get through.

https://github.com/santoch/pdfbox/pull/1

Please let me know what you think-
Thanks-
Steve
