Hi Loop detection may be worth doing, but it doesn’t fix the problem when objects contain no loops, but are so deeply nested they trigger stack overflow. Valid files with no loops, containing deeply nested objects can cause stack overflow. This is similar to how valid files with large objects cause out-of-memory conditions. This is expected in ISO 32000, and described in Annex C. 3 – Memory Limits.
An example is a nested array – if you add enough square brackets to MediaBox below it triggers stack overflow because PdfTokenizer::ReadArray is called recursively for each ‘[‘ token. %PDF-1.0 1 0 obj<</Type/Catalog/Pages 2 0 R>>endobj 2 0 obj<</Type/Pages/Kids[3 0 R]/Count 1>>endobj 3 0 obj<</Type/Page/MediaBox[[[[[[[[[[[0 0 3 3]]]]]]]]]] Nested arrays are allowed in ISO 32000 7.3.6 which says “Arrays of higher dimension can be constructed using arrays as elements of arrays, nested to any depth.” In a release build on x86/x64 each recursive ReadArray call loop uses about 400 bytes of stack. In a debug build each recursive ReadArray call loop uses about 1200-2048 bytes depending on which debug options are enabled (enabling ASAN makes objects on stack about 3x bigger). How quickly this uses up the stack and causes a segment violation depends on the OS, CPU and compiler: * Windows x86 apps – 1 MB max stack (stack overflows with 2620 ‘[‘ characters in a release build, or 506 ‘[‘ characters in a debug build) * Windows x64 apps – 1 MB max stack (stack overflows with 2620 ‘[‘ characters) * Windows IIS 32-bit worker processes – 256 KB max stack (stack overflows with 655 ‘[‘ characters) * Windows IIS 64-bit worker processes – 512 KB max stack (stack overflows with 1310 ‘[‘ characters) * macOS main thread – 8MB max stack (stack overflows with 20960 ‘[‘ characters) * macOS secondary threads - 512 KB max stack (stack overflows with 1310 ‘[‘ characters) * modern Linux distros - 8MB for main and secondary threads (stack overflows with 20960 ‘[‘ characters) Note that if user sets `ulimit -s unlimited` the secondary thread stack size drops to 2MB on x86/x64 - see https://man7.org/linux/man-pages/man3/pthread_create.3.html#NOTES The same problem happens with dictionaries - Go-To Actions have target dictionaries that may be nested recursively with no specified limit (ISO 32000 12.6.4.4) so a valid file with very deeply nested target dictionaries can cause stack overflow due to recursive calls to PdfTokenizer::ReadDictionary. We have a simple patch for the tokenizer stack overflow issues (adding PdfRecursionGuard guard(m_nRecursionDepth) to PdfTokenizer::GetNextVariant) – but it needs PdfRecursionGuard moved to a header (like the patch for ticket 25). The same problem happens with outlines – if they’re nested deeply enough and have no loops they still trigger stack overflow in PdfOutlineItem::PdfOutlineItem. I had a look at the patch in https://sourceforge.net/p/podofo/tickets/25/#51b9. That’s a simpler solution than the changes I proposed for PdfRecursionGuard. In addition to the patch for tokenizer recursion I’ll write some new unit tests for loops and deeply nested structures BTW I’m the original author of PdfRecursionGuard - https://sourceforge.net/p/podofo/tickets/7/#df09 and the PdfParser unit tests https://sourceforge.net/p/podofo/mailman/message/36298123/ Cheers Mark -- Mark Rogers - mark.rog...@powermapper.com<mailto:mark.rog...@powermapper.com> PowerMapper Software Ltd - www.powermapper.com Registered in Scotland No 362274 Quartermile 2 Edinburgh EH3 9GL From: Christopher Creutzig <ccreu...@mathworks.com> Date: Thursday, 28 October 2021 at 07:50 To: Mark Rogers <mark.rog...@powermapper.com>, "podofo-users@lists.sourceforge.net" <podofo-users@lists.sourceforge.net> Subject: RE: PoDoFo and recursive stack consumption CVEs From: Mark Rogers <mark.rog...@powermapper.com> > tl;dr > PoDoFo needs a general mechanism to prevent recursive stack consumption > because this can happen in many different places in PDFs. Even if the issues > above are fixed there will still be other stack overflow issues in PoDoFo. I agree. But I would like to propose yet another potential solution: 4) Parsing PDF does not create an unbounded number of different objects. Quite the opposite: Resolving the same indirect reference for the second time results in exactly the same object as the first time; semantically, the PDF file just has another edge coming into it. Therefore, we may want to have a cache of resolved references. If there are cases where the same object in the PDF file needs to be represented by multiple PoDoFo classes, the cache may need to be able to store these several representations. Whether trying to resolve a reference currently under construction should throw a “recursive definition” error or provide a handle that is going to update as needed is something to discuss. Does this make sense? Cheers, Christopher The MathWorks GmbH | Friedlandstr.18 | 52064 Aachen | District Court Aachen | HRB 8082 | Managing Directors: Bertrand Dissler, Steven D. Barbo, Jeanne O’Keefe
_______________________________________________ Podofo-users mailing list Podofo-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/podofo-users