Hi

Loop detection may be worth doing, but it doesn’t fix the problem when objects 
contain no loops, but are so deeply nested they trigger stack overflow. Valid 
files with no loops, containing deeply nested objects can cause stack overflow. 
This is similar to how valid files with large objects cause out-of-memory 
conditions. This is expected in ISO 32000, and described in Annex C. 3 – Memory 
Limits.

An example is a nested array – if you add enough square brackets to MediaBox 
below it triggers stack overflow because PdfTokenizer::ReadArray is called 
recursively for each ‘[‘ token.

%PDF-1.0
1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Kids[3 0 
R]/Count 1>>endobj 3 0 obj<</Type/Page/MediaBox[[[[[[[[[[[0 0 3 3]]]]]]]]]]

Nested arrays are allowed in ISO 32000 7.3.6 which says “Arrays of higher 
dimension can be constructed using arrays as elements of arrays, nested to any 
depth.”

In a release build on x86/x64 each recursive ReadArray call loop uses about 400 
bytes of stack. In a debug build each recursive ReadArray call loop uses about 
1200-2048 bytes depending on which debug options are enabled (enabling ASAN 
makes objects on stack about 3x bigger).

How quickly this uses up the stack and causes a segment violation depends on 
the OS, CPU and compiler:


  *   Windows x86 apps – 1 MB max stack (stack overflows with 2620 ‘[‘ 
characters in a release build, or 506 ‘[‘ characters in a debug build)
  *   Windows x64 apps – 1 MB max stack (stack overflows with 2620 ‘[‘ 
characters)
  *   Windows IIS 32-bit worker processes – 256 KB max stack (stack overflows 
with 655 ‘[‘ characters)
  *   Windows IIS 64-bit worker processes – 512 KB max stack  (stack overflows 
with 1310 ‘[‘ characters)
  *   macOS main thread – 8MB max stack (stack overflows with 20960 ‘[‘ 
characters)
  *   macOS secondary threads - 512 KB max stack (stack overflows with 1310 ‘[‘ 
characters)
  *   modern Linux distros - 8MB for main and secondary threads (stack 
overflows with 20960 ‘[‘ characters)
Note that if user sets `ulimit -s unlimited` the secondary thread stack size 
drops to 2MB on x86/x64 - see 
https://man7.org/linux/man-pages/man3/pthread_create.3.html#NOTES

The same problem happens with dictionaries - Go-To Actions have target 
dictionaries that may be nested recursively with no specified limit (ISO 32000 
12.6.4.4) so a valid file with very deeply nested target dictionaries can cause 
stack overflow due to recursive calls to PdfTokenizer::ReadDictionary.

We have a simple patch for the tokenizer stack overflow issues (adding 
PdfRecursionGuard guard(m_nRecursionDepth) to PdfTokenizer::GetNextVariant) – 
but it needs PdfRecursionGuard moved to a header (like the patch for ticket 25).

The same problem happens with outlines – if they’re nested deeply enough and 
have no loops they still trigger stack overflow in 
PdfOutlineItem::PdfOutlineItem.

I had a look at the patch in https://sourceforge.net/p/podofo/tickets/25/#51b9. 
That’s a simpler solution than the changes I proposed for PdfRecursionGuard.

In addition to the patch for tokenizer recursion I’ll write some new unit tests 
for loops and deeply nested structures

BTW I’m the original author of PdfRecursionGuard - 
https://sourceforge.net/p/podofo/tickets/7/#df09 and the PdfParser unit tests 
https://sourceforge.net/p/podofo/mailman/message/36298123/

Cheers
Mark

--
Mark Rogers - mark.rog...@powermapper.com<mailto:mark.rog...@powermapper.com>
PowerMapper Software Ltd - www.powermapper.com
Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL





From: Christopher Creutzig <ccreu...@mathworks.com>
Date: Thursday, 28 October 2021 at 07:50
To: Mark Rogers <mark.rog...@powermapper.com>, 
"podofo-users@lists.sourceforge.net" <podofo-users@lists.sourceforge.net>
Subject: RE: PoDoFo and recursive stack consumption CVEs

From: Mark Rogers <mark.rog...@powermapper.com>
> tl;dr
> PoDoFo needs a general mechanism to prevent recursive stack consumption 
> because this can happen in many different places in PDFs. Even if the issues 
> above are fixed there will still be other stack overflow issues in PoDoFo.

I agree. But I would like to propose yet another potential solution:

4) Parsing PDF does not create an unbounded number of different objects. Quite 
the opposite: Resolving the same indirect reference for the second time results 
in exactly the same object as the first time; semantically, the PDF file just 
has another edge coming into it.

Therefore, we may want to have a cache of resolved references. If there are 
cases where the same object in the PDF file needs to be represented by multiple 
PoDoFo classes, the cache may need to be able to store these several 
representations. Whether trying to resolve a reference currently under 
construction should throw a “recursive definition” error or provide a handle 
that is going to update as needed is something to discuss.

Does this make sense?


Cheers,
Christopher

The MathWorks GmbH | Friedlandstr.18 | 52064 Aachen | District Court Aachen | 
HRB 8082 | Managing Directors: Bertrand Dissler, Steven D. Barbo, Jeanne O’Keefe


_______________________________________________
Podofo-users mailing list
Podofo-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/podofo-users

Reply via email to