Re: Stack overflow issue
Hi, On 2024-04-17 14:39:14 +0300, Alexander Korotkov wrote: > On Wed, Apr 17, 2024 at 2:37 PM Alexander Korotkov > wrote: > > I've invested more time into polishing this. I'm intended to push > > this. Could you, please, take a look before? > > Just after sending this I spotted a typo s/untill/until/. The updated > version is attached. Nice, I see you moved the code back to "where it was", the diff to 16 got smaller this way. > + /* > + * Repeatedly call CommitTransactionCommandInternal() until all the work > + * is done. > + */ > + while (!CommitTransactionCommandInternal()); Personally I'd use { } instead of just ; here. The above scans weirdly for me. But it's also not important. Greetings, Andres Freund
Re: Stack overflow issue
On Wed, Apr 17, 2024 at 2:37 PM Alexander Korotkov wrote: > On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov > wrote: > > On Tue, Apr 16, 2024 at 6:35 PM Andres Freund wrote: > > > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote: > > > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund > > > > wrote: > > > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > > > > > > 0001 Turn tail recursion into iteration in > > > > > > CommitTransactionCommand() > > > > > > I did minor revision of comments and code blocks order to improve > > > > > > the > > > > > > readability. > > > > > > > > > > After sending > > > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de > > > > > I looked some more at important areas where changes didn't have code > > > > > coverage. One thing I noticed was that the "non-internal" part of > > > > > AbortCurrentTransaction() is uncovered: > > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 > > > > > > > > > > Which made me try to understand fefd9a3fed2. I'm a bit confused > > > > > about why > > > > > some parts are handled in > > > > > CommitCurrentTransaction()/AbortCurrentTransaction() > > > > > and others are in the *Internal functions. > > > > > > > > > > I understand that fefd9a3fed2 needed to remove the recursion in > > > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't > > > > > understand > > > > > why that means having some code in in the non-internal and some in the > > > > > internal functions? Wouldn't it be easier to just have all the state > > > > > handling > > > > > code in the Internal() function and just break after the > > > > > CleanupSubTransaction() calls? > > > > > > > > I'm not sure I correctly get what you mean. Do you think the attached > > > > patch matches the direction you're pointing? The patch itself is not > > > > final, it requires cleanup and comments revision, just to check the > > > > direction. > > > > > > Something like that, yea. The split does seem less confusing that way to > > > me, > > > but also not 100% certain. > > > > Thank you for your feedback. I'm going to go ahead and polish this patch. > > I've invested more time into polishing this. I'm intended to push > this. Could you, please, take a look before? Just after sending this I spotted a typo s/untill/until/. The updated version is attached. -- Regards, Alexander Korotkov v3-0001-Refactoring-for-CommitTransactionCommand-AbortCur.patch Description: Binary data
Re: Stack overflow issue
On Tue, Apr 16, 2024 at 7:42 PM Alexander Korotkov wrote: > > On Tue, Apr 16, 2024 at 6:35 PM Andres Freund wrote: > > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote: > > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund wrote: > > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > > > > I did minor revision of comments and code blocks order to improve the > > > > > readability. > > > > > > > > After sending > > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de > > > > I looked some more at important areas where changes didn't have code > > > > coverage. One thing I noticed was that the "non-internal" part of > > > > AbortCurrentTransaction() is uncovered: > > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 > > > > > > > > Which made me try to understand fefd9a3fed2. I'm a bit confused about > > > > why > > > > some parts are handled in > > > > CommitCurrentTransaction()/AbortCurrentTransaction() > > > > and others are in the *Internal functions. > > > > > > > > I understand that fefd9a3fed2 needed to remove the recursion in > > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't > > > > understand > > > > why that means having some code in in the non-internal and some in the > > > > internal functions? Wouldn't it be easier to just have all the state > > > > handling > > > > code in the Internal() function and just break after the > > > > CleanupSubTransaction() calls? > > > > > > I'm not sure I correctly get what you mean. Do you think the attached > > > patch matches the direction you're pointing? The patch itself is not > > > final, it requires cleanup and comments revision, just to check the > > > direction. > > > > Something like that, yea. The split does seem less confusing that way to me, > > but also not 100% certain. > > Thank you for your feedback. I'm going to go ahead and polish this patch. I've invested more time into polishing this. I'm intended to push this. Could you, please, take a look before? -- Regards, Alexander Korotkov v2-0001-Refactoring-for-CommitTransactionCommand-AbortCur.patch Description: Binary data
Re: Stack overflow issue
On Tue, Apr 16, 2024 at 6:35 PM Andres Freund wrote: > On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote: > > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund wrote: > > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > > > I did minor revision of comments and code blocks order to improve the > > > > readability. > > > > > > After sending > > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de > > > I looked some more at important areas where changes didn't have code > > > coverage. One thing I noticed was that the "non-internal" part of > > > AbortCurrentTransaction() is uncovered: > > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 > > > > > > Which made me try to understand fefd9a3fed2. I'm a bit confused about why > > > some parts are handled in > > > CommitCurrentTransaction()/AbortCurrentTransaction() > > > and others are in the *Internal functions. > > > > > > I understand that fefd9a3fed2 needed to remove the recursion in > > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't > > > understand > > > why that means having some code in in the non-internal and some in the > > > internal functions? Wouldn't it be easier to just have all the state > > > handling > > > code in the Internal() function and just break after the > > > CleanupSubTransaction() calls? > > > > I'm not sure I correctly get what you mean. Do you think the attached > > patch matches the direction you're pointing? The patch itself is not > > final, it requires cleanup and comments revision, just to check the > > direction. > > Something like that, yea. The split does seem less confusing that way to me, > but also not 100% certain. Thank you for your feedback. I'm going to go ahead and polish this patch. -- Regards, Alexander Korotkov
Re: Stack overflow issue
Hi, On 2024-04-16 15:45:42 +0300, Alexander Korotkov wrote: > On Tue, Apr 16, 2024 at 1:48 AM Andres Freund wrote: > > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > > I did minor revision of comments and code blocks order to improve the > > > readability. > > > > After sending > > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de > > I looked some more at important areas where changes didn't have code > > coverage. One thing I noticed was that the "non-internal" part of > > AbortCurrentTransaction() is uncovered: > > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 > > > > Which made me try to understand fefd9a3fed2. I'm a bit confused about why > > some parts are handled in > > CommitCurrentTransaction()/AbortCurrentTransaction() > > and others are in the *Internal functions. > > > > I understand that fefd9a3fed2 needed to remove the recursion in > > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand > > why that means having some code in in the non-internal and some in the > > internal functions? Wouldn't it be easier to just have all the state > > handling > > code in the Internal() function and just break after the > > CleanupSubTransaction() calls? > > I'm not sure I correctly get what you mean. Do you think the attached > patch matches the direction you're pointing? The patch itself is not > final, it requires cleanup and comments revision, just to check the > direction. Something like that, yea. The split does seem less confusing that way to me, but also not 100% certain. Greetings, Andres Freund
Re: Stack overflow issue
On Tue, Apr 16, 2024 at 1:48 AM Andres Freund wrote: > On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > I did minor revision of comments and code blocks order to improve the > > readability. > > After sending > https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de > I looked some more at important areas where changes didn't have code > coverage. One thing I noticed was that the "non-internal" part of > AbortCurrentTransaction() is uncovered: > https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 > > Which made me try to understand fefd9a3fed2. I'm a bit confused about why > some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction() > and others are in the *Internal functions. > > I understand that fefd9a3fed2 needed to remove the recursion in > CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand > why that means having some code in in the non-internal and some in the > internal functions? Wouldn't it be easier to just have all the state handling > code in the Internal() function and just break after the > CleanupSubTransaction() calls? I'm not sure I correctly get what you mean. Do you think the attached patch matches the direction you're pointing? The patch itself is not final, it requires cleanup and comments revision, just to check the direction. -- Regards, Alexander Korotkov abort_commit_transaction.patch Description: Binary data
Re: Stack overflow issue
Hi, On 2024-03-06 14:17:23 +0200, Alexander Korotkov wrote: > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > I did minor revision of comments and code blocks order to improve the > readability. After sending https://www.postgresql.org/message-id/20240414223305.m3i5eju6zylabvln%40awork3.anarazel.de I looked some more at important areas where changes didn't have code coverage. One thing I noticed was that the "non-internal" part of AbortCurrentTransaction() is uncovered: https://anarazel.de/postgres/cov/16-vs-HEAD-2024-04-14/src/backend/access/transam/xact.c.gcov.html#L3403 Which made me try to understand fefd9a3fed2. I'm a bit confused about why some parts are handled in CommitCurrentTransaction()/AbortCurrentTransaction() and others are in the *Internal functions. I understand that fefd9a3fed2 needed to remove the recursion in CommitTransactionCommand()/AbortCurrentTransaction(). But I don't understand why that means having some code in in the non-internal and some in the internal functions? Wouldn't it be easier to just have all the state handling code in the Internal() function and just break after the CleanupSubTransaction() calls? That's of course largely unrelated to the coverage aspects. I just got curious. Greetings, Andres Freund
Re: Stack overflow issue
On Thu, Mar 7, 2024 at 11:07 AM Alexander Korotkov wrote: > On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin wrote: > > > > > 6 march 2024 г., at 19:17, Alexander Korotkov > > > wrote: > > > > > > The revised set of remaining patches is attached. > > > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > > I did minor revision of comments and code blocks order to improve the > > > readability. > > > > > > 0002 Avoid stack overflow in ShowTransactionStateRec() > > > I didn't notice any issues, leave this piece as is. > > > > > > 0003 Avoid recursion in MemoryContext functions > > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > > > which I think is a bit more intuitive. Also I fixed > > > MemoryContextMemConsumed(), which was still trying to use the removed > > > argument "print" of MemoryContextStatsInternal() function. > > > > > > Generally, I think this patchset fixes important stack overflow holes. > > > It is quite straightforward, clear and the code has a good shape. I'm > > > going to push this if no objections. > > > > I have tested the scripts from message [1]. After applying these patches > > and Tom Lane’s patch from message [2], all of the above mentioned functions > > no longer caused the server to crash. I also tried increasing the values in > > the presented scripts, which also did not lead to server crashes. Thank you! > > Also, I would like to clarify something. Will fixes from message [3] and > > others be backported to all other branches, not just the master branch? As > > far as I remember, Tom Lane made corrections to all branches. For example > > [4]. > > > > Links: > > 1. > > https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru > > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us > > 3. > > https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com > > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b > > Thank you for your feedback! > > Initially I didn't intend to backpatch any of these. But on second > thought with the references you provided, I think we should backpatch > simple check_stack_depth() checks from d57b7cc333 to all supported > branches, but apply refactoring of memory contextes and transaction > commit/abort just to master. Opinions? I've just backpatched check_stack_depth() checks to all supported branches. -- Regards, Alexander Korotkov
Re: Stack overflow issue
On Thu, Mar 7, 2024 at 1:49 AM Tom Lane wrote: > > Alexander Korotkov writes: > > Sorry for tediousness, but isn't pre-order a variation of depth-first order > > [1]? > > To me, depth-first implies visiting children before parents. > Do I have the terminology wrong? According to Wikipedia, depth-first is a general term describing the tree traversal algorithm, which goes as deep as possible in one branch before visiting other branches. The order of between parents and children, and between siblings specifies the variation of depth-first search, and pre-order is one of them. But "pre-order" is the most accurate term for MemoryContextTraverseNext() anyway. -- Regards, Alexander Korotkov
Re: Stack overflow issue
Hi, Egor! On Thu, Mar 7, 2024 at 9:53 AM Egor Chindyaskin wrote: > > > 6 march 2024 г., at 19:17, Alexander Korotkov wrote: > > > > The revised set of remaining patches is attached. > > > > 0001 Turn tail recursion into iteration in CommitTransactionCommand() > > I did minor revision of comments and code blocks order to improve the > > readability. > > > > 0002 Avoid stack overflow in ShowTransactionStateRec() > > I didn't notice any issues, leave this piece as is. > > > > 0003 Avoid recursion in MemoryContext functions > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > > which I think is a bit more intuitive. Also I fixed > > MemoryContextMemConsumed(), which was still trying to use the removed > > argument "print" of MemoryContextStatsInternal() function. > > > > Generally, I think this patchset fixes important stack overflow holes. > > It is quite straightforward, clear and the code has a good shape. I'm > > going to push this if no objections. > > I have tested the scripts from message [1]. After applying these patches and > Tom Lane’s patch from message [2], all of the above mentioned functions no > longer caused the server to crash. I also tried increasing the values in the > presented scripts, which also did not lead to server crashes. Thank you! > Also, I would like to clarify something. Will fixes from message [3] and > others be backported to all other branches, not just the master branch? As > far as I remember, Tom Lane made corrections to all branches. For example [4]. > > Links: > 1. > https://www.postgresql.org/message-id/343ff14f-3060-4f88-9cc6-efdb390185df%40mail.ru > 2. https://www.postgresql.org/message-id/386032.1709765547%40sss.pgh.pa.us > 3. > https://www.postgresql.org/message-id/CAPpHfduZqAjF%2B7rDRP-RGNHjOXy7nvFROQ0MGS436f8FPY5DpQ%40mail.gmail.com > 4. https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=e07ebd4b Thank you for your feedback! Initially I didn't intend to backpatch any of these. But on second thought with the references you provided, I think we should backpatch simple check_stack_depth() checks from d57b7cc333 to all supported branches, but apply refactoring of memory contextes and transaction commit/abort just to master. Opinions? -- Regards, Alexander Korotkov
Re: Stack overflow issue
Alexander Korotkov writes: > Sorry for tediousness, but isn't pre-order a variation of depth-first order > [1]? To me, depth-first implies visiting children before parents. Do I have the terminology wrong? regards, tom lane
Re: Stack overflow issue
On Thu, Mar 7, 2024 at 12:52 AM Tom Lane wrote: > Alexander Korotkov writes: > > The revised set of remaining patches is attached. > > ... > > 0003 Avoid recursion in MemoryContext functions > > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > > which I think is a bit more intuitive. Also I fixed > > MemoryContextMemConsumed(), which was still trying to use the removed > > argument "print" of MemoryContextStatsInternal() function. > > This patch still doesn't compile for me --- MemoryContextMemConsumed > got modified some more by commit 743112a2e, and needs minor fixes. > > I initially didn't like the definition of MemoryContextTraverseNext > because it requires two copies of the "process node" logic. However, > that seems fine for most of the callers, and even where we are > duplicating logic it's just a line or so, so I guess it's ok. > However, MemoryContextTraverseNext seems undercommented to me, plus > the claim that it traverses in depth-first order is just wrong. > > I found some bugs in MemoryContextStatsInternal too: the old > logic assumed that ichild exceeding max_children was the only > way to get into the summarization logic, but now ichild minus > max_children could very well be negative. Fortunately we can > just reset ichild to zero and not worry about having any > connection between the first loop and the second. > > Here's a v5 of 0003 with those issues and some more-cosmetic ones > cleaned up. I didn't look at 0001 or 0002. > Tom, thank you for your revision of this patch! Sorry for tediousness, but isn't pre-order a variation of depth-first order [1]? Links. 1. https://en.wikipedia.org/wiki/Tree_traversal#Depth-first_search -- Regards, Alexander Korotkov
Re: Stack overflow issue
Alexander Korotkov writes: > The revised set of remaining patches is attached. > ... > 0003 Avoid recursion in MemoryContext functions > I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), > which I think is a bit more intuitive. Also I fixed > MemoryContextMemConsumed(), which was still trying to use the removed > argument "print" of MemoryContextStatsInternal() function. This patch still doesn't compile for me --- MemoryContextMemConsumed got modified some more by commit 743112a2e, and needs minor fixes. I initially didn't like the definition of MemoryContextTraverseNext because it requires two copies of the "process node" logic. However, that seems fine for most of the callers, and even where we are duplicating logic it's just a line or so, so I guess it's ok. However, MemoryContextTraverseNext seems undercommented to me, plus the claim that it traverses in depth-first order is just wrong. I found some bugs in MemoryContextStatsInternal too: the old logic assumed that ichild exceeding max_children was the only way to get into the summarization logic, but now ichild minus max_children could very well be negative. Fortunately we can just reset ichild to zero and not worry about having any connection between the first loop and the second. Here's a v5 of 0003 with those issues and some more-cosmetic ones cleaned up. I didn't look at 0001 or 0002. regards, tom lane diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index 1a615becae..5d426795d9 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -145,9 +145,10 @@ MemoryContext CurTransactionContext = NULL; /* This is a transient link to the active portal's memory context: */ MemoryContext PortalContext = NULL; +static void MemoryContextDeleteOnly(MemoryContext context); static void MemoryContextCallResetCallbacks(MemoryContext context); static void MemoryContextStatsInternal(MemoryContext context, int level, - bool print, int max_children, + int max_level, int max_children, MemoryContextCounters *totals, bool print_to_stderr); static void MemoryContextStatsPrint(MemoryContext context, void *passthru, @@ -219,6 +220,50 @@ GetMemoryChunkHeader(const void *pointer) return header; } +/* + * MemoryContextTraverseNext + * Helper function to traverse all descendants of a memory context + * without recursion. + * + * Recursion could lead to out-of-stack errors with deep context hierarchies, + * which would be unpleasant in error cleanup code paths. + * + * To process 'context' and all its descendants, use a loop like this: + * + * + * for (MemoryContext curr = context->firstchild; + * curr != NULL; + * curr = MemoryContextTraverseNext(curr, context)) + * { + * + * } + * + * This visits all the contexts in pre-order, that is a node is visited + * before its children. + */ +static MemoryContext +MemoryContextTraverseNext(MemoryContext curr, MemoryContext top) +{ + /* After processing a node, traverse to its first child if any */ + if (curr->firstchild != NULL) + return curr->firstchild; + + /* + * After processing a childless node, traverse to its next sibling if + * there is one. If there isn't, traverse back up to the parent (which + * has already been visited, and now so have all its descendants). We're + * done if that is "top", otherwise traverse to its next sibling if any, + * otherwise repeat moving up. + */ + while (curr->nextchild == NULL) + { + curr = curr->parent; + if (curr == top) + return NULL; + } + return curr->nextchild; +} + /* * Support routines to trap use of invalid memory context method IDs * (from calling pfree or the like on a bogus pointer). As a possible @@ -375,14 +420,13 @@ MemoryContextResetOnly(MemoryContext context) void MemoryContextResetChildren(MemoryContext context) { - MemoryContext child; - Assert(MemoryContextIsValid(context)); - for (child = context->firstchild; child != NULL; child = child->nextchild) + for (MemoryContext curr = context->firstchild; + curr != NULL; + curr = MemoryContextTraverseNext(curr, context)) { - MemoryContextResetChildren(child); - MemoryContextResetOnly(child); + MemoryContextResetOnly(curr); } } @@ -392,21 +436,60 @@ MemoryContextResetChildren(MemoryContext context) * allocated therein. * * The type-specific delete routine removes all storage for the context, - * but we have to recurse to handle the children. - * We must also delink the context from its parent, if it has one. + * but we have to deal with descendant nodes here. */ void MemoryContextDelete(MemoryContext context) +{ + MemoryContext curr; + + Assert(MemoryContextIsValid(context)); + + /* + * Delete subcontexts from the bottom up. + * + * Note: Do not use recursion here. A "stack depth limit exceeded" error + * would be unpleasant if we're already in the process of cleaning up
Re: Stack overflow issue
On Wed, Feb 14, 2024 at 2:00 PM Alexander Korotkov wrote: > On Fri, Jan 12, 2024 at 11:00 PM Robert Haas wrote: > > On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > > > Here's one goto-free attempt. It adds a local loop to where the > > > recursion was, so that if you have a chain of subtransactions that need > > > to be aborted in CommitTransactionCommand, they are aborted iteratively. > > > The TBLOCK_SUBCOMMIT case already had such a loop. > > > > > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > > > to explain why I changed some things. They are to be removed before > > > committing. > > > > > > I'm not sure if this is better than a goto. In fact, even if we commit > > > this, I think I'd still prefer to replace the remaining recursive calls > > > with a goto. Recursion feels a weird to me, when we're unwinding the > > > states from the stack as we go. > > > > I'm not able to quickly verify whether this version is correct, but I > > do think the code looks nicer this way. > > > > I understand that's a question of opinion rather than fact, though. > > I'd like to revive this thread. The attached 0001 patch represents my > attempt to remove recursion in > CommitTransactionCommand()/AbortCurrentTransaction() by adding a > wrapper function. This method doesn't use goto, doesn't require much > code changes and subjectively provides good readability. > > Regarding ShowTransactionStateRec() and memory context function, as I > get from this thread they are called in states where abortion can lead > to a panic. So, it's preferable to change them into loops too rather > than just adding check_stack_depth(). The 0002 and 0003 patches by > Heikki posted in [1] look good to me. Can we accept them? > > Also there are a number of recursive functions, which seem to be not > used in critical states where abortion can lead to a panic. I've > extracted them from [2] into an attached 0002 patch. I'd like to push > it if there is no objection. The revised set of remaining patches is attached. 0001 Turn tail recursion into iteration in CommitTransactionCommand() I did minor revision of comments and code blocks order to improve the readability. 0002 Avoid stack overflow in ShowTransactionStateRec() I didn't notice any issues, leave this piece as is. 0003 Avoid recursion in MemoryContext functions I've renamed MemoryContextTraverse() => MemoryContextTraverseNext(), which I think is a bit more intuitive. Also I fixed MemoryContextMemConsumed(), which was still trying to use the removed argument "print" of MemoryContextStatsInternal() function. Generally, I think this patchset fixes important stack overflow holes. It is quite straightforward, clear and the code has a good shape. I'm going to push this if no objections. -- Regards, Alexander Korotkov From d9c2a7e8eae2706ff1431ed090146e3a26ce07dc Mon Sep 17 00:00:00 2001 From: Alexander Korotkov Date: Wed, 6 Mar 2024 13:59:41 +0200 Subject: [PATCH v4 2/3] Avoid stack overflow in ShowTransactionStateRec() The function recurses, but didn't perform stack-depth checks. It's just a debugging aid, so instead of the usual check_stack_depth() call, stop the printing if we'd risk stack overflow. Here's an example of how to test this: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null In the passing, swap building the list of child XIDs and recursing to parent. That saves memory while recursing, reducing the risk of out of memory errors with lots of subtransactions. The saving is not very significant in practice, but this order seems more logical anyway. Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru Author: Heikki Linnakangas Reviewed-by: Robert Haas, Andres Freund, Alexander Korotkov --- src/backend/access/transam/xact.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 12f8bec1aeb..2f80111e18d 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -5583,8 +5583,22 @@ ShowTransactionStateRec(const char *str, TransactionState s) { StringInfoData buf; - initStringInfo(); + if (s->parent) + { + /* + * Since this function recurses, it could be driven to stack overflow. + * This is just a debugging aid, so we can leave out some details + * instead of erroring out with check_stack_depth(). + */ + if (stack_is_too_deep()) + ereport(DEBUG5, + (errmsg_internal("%s(%d): parent omitted to avoid stack overflow", + str, s->nestingLevel))); + else + ShowTransactionStateRec(str, s->parent); + } + initStringInfo(); if (s->nChildXids > 0) { int i; @@ -5593,10 +5607,6 @@ ShowTransactionStateRec(const char *str, TransactionState s) for (i =
Re: Stack overflow issue
Hi! On Fri, Jan 12, 2024 at 11:00 PM Robert Haas wrote: > On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > > Here's one goto-free attempt. It adds a local loop to where the > > recursion was, so that if you have a chain of subtransactions that need > > to be aborted in CommitTransactionCommand, they are aborted iteratively. > > The TBLOCK_SUBCOMMIT case already had such a loop. > > > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > > to explain why I changed some things. They are to be removed before > > committing. > > > > I'm not sure if this is better than a goto. In fact, even if we commit > > this, I think I'd still prefer to replace the remaining recursive calls > > with a goto. Recursion feels a weird to me, when we're unwinding the > > states from the stack as we go. > > I'm not able to quickly verify whether this version is correct, but I > do think the code looks nicer this way. > > I understand that's a question of opinion rather than fact, though. I'd like to revive this thread. The attached 0001 patch represents my attempt to remove recursion in CommitTransactionCommand()/AbortCurrentTransaction() by adding a wrapper function. This method doesn't use goto, doesn't require much code changes and subjectively provides good readability. Regarding ShowTransactionStateRec() and memory context function, as I get from this thread they are called in states where abortion can lead to a panic. So, it's preferable to change them into loops too rather than just adding check_stack_depth(). The 0002 and 0003 patches by Heikki posted in [1] look good to me. Can we accept them? Also there are a number of recursive functions, which seem to be not used in critical states where abortion can lead to a panic. I've extracted them from [2] into an attached 0002 patch. I'd like to push it if there is no objection. Links. 1. https://www.postgresql.org/message-id/6b48c746-9704-46dc-b9be-01fe4137c824%40iki.fi 2. https://www.postgresql.org/message-id/4530546a-3216-eaa9-4c92-92d33290a211%40mail.ru -- Regards, Alexander Korotkov 0002-Add-missing-check_stack_depth-to-some-recursive-f-v3.patch Description: Binary data 0001-Turn-tail-recursion-into-iteration-in-CommitTrans-v3.patch Description: Binary data
Re: Stack overflow issue
On Fri, Jan 12, 2024 at 10:12 AM Heikki Linnakangas wrote: > Here's one goto-free attempt. It adds a local loop to where the > recursion was, so that if you have a chain of subtransactions that need > to be aborted in CommitTransactionCommand, they are aborted iteratively. > The TBLOCK_SUBCOMMIT case already had such a loop. > > I added a couple of comments in the patch marked with "REVIEWER NOTE", > to explain why I changed some things. They are to be removed before > committing. > > I'm not sure if this is better than a goto. In fact, even if we commit > this, I think I'd still prefer to replace the remaining recursive calls > with a goto. Recursion feels a weird to me, when we're unwinding the > states from the stack as we go. I'm not able to quickly verify whether this version is correct, but I do think the code looks nicer this way. I understand that's a question of opinion rather than fact, though. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
On 11/01/2024 19:37, Robert Haas wrote: On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas wrote: The problem with CommitTransactionCommand (or rather AbortCurrentTransaction() which has the same problem) and ShowTransactionStateRec is that they get called in a state where aborting can lead to a panic. If you add a "check_stack_depth()" to them and try to reproducer scripts that Egor posted, you still get a panic. Hmm, that's unfortunate. I'm not sure what to do about that. But I'd still suggest looking for a goto-free approach. Here's one goto-free attempt. It adds a local loop to where the recursion was, so that if you have a chain of subtransactions that need to be aborted in CommitTransactionCommand, they are aborted iteratively. The TBLOCK_SUBCOMMIT case already had such a loop. I added a couple of comments in the patch marked with "REVIEWER NOTE", to explain why I changed some things. They are to be removed before committing. I'm not sure if this is better than a goto. In fact, even if we commit this, I think I'd still prefer to replace the remaining recursive calls with a goto. Recursion feels a weird to me, when we're unwinding the states from the stack as we go. Of course we could use a "for (;;) { ... continue }" construct around the whole function, instead of a goto, but I don't think that's better than a goto in this case. -- Heikki Linnakangas Neon (https://neon.tech) From 44568e6feef79d604bbe168c0839ab2e03c99f8a Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 12 Jan 2024 17:07:25 +0200 Subject: [PATCH v2 1/1] Turn tail recursion into iteration in AbortCurrentTransaction() Usually the compiler will optimize away the tail recursion anyway, but if it doesn't, you can drive the function into stack overflow. For example: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null The usual fix would be to add a check_stack_depth() to the function, but that's not good in AbortCurrentTransaction(), as you are already trying to clean up after a failure. ereporting there leads to a panic. Fix CurrentTransactionCommand() in a similar fashion, although ereporting would be less bad there. Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru --- src/backend/access/transam/xact.c | 141 -- 1 file changed, 75 insertions(+), 66 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 464858117e0..329a139c991 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -3194,47 +3194,62 @@ CommitTransactionCommand(void) CommitSubTransaction(); s = CurrentTransactionState; /* changed by pop */ } while (s->blockState == TBLOCK_SUBCOMMIT); + /* + * REVIEWER NOTE: We used to have duplicated code here, to handle + * the TBLOCK_END and TBLOCK_PREPARE cases exactly the same way + * that we do above. I replaced that with a recursive call to + * CommitTransactionCommand(), to make this consistent with the + * TBLOCK_SUBABORT_END and TBLOCK_SUBABORT_PENDING handling. This + * can only recurse once, so there's no risk of stack overflow. + */ /* If we had a COMMIT command, finish off the main xact too */ - if (s->blockState == TBLOCK_END) - { -Assert(s->parent == NULL); -CommitTransaction(); -s->blockState = TBLOCK_DEFAULT; -if (s->chain) -{ - StartTransaction(); - s->blockState = TBLOCK_INPROGRESS; - s->chain = false; - RestoreTransactionCharacteristics(); -} - } - else if (s->blockState == TBLOCK_PREPARE) - { -Assert(s->parent == NULL); -PrepareTransaction(); -s->blockState = TBLOCK_DEFAULT; - } - else + if (s->blockState != TBLOCK_END && s->blockState != TBLOCK_PREPARE) elog(ERROR, "CommitTransactionCommand: unexpected state %s", BlockStateAsString(s->blockState)); + CommitTransactionCommand(); break; /* - * The current already-failed subtransaction is ending due to a - * ROLLBACK or ROLLBACK TO command, so pop it and recursively - * examine the parent (which could be in any of several states). + * The current subtransaction is ending due to a ROLLBACK or + * ROLLBACK TO command. Pop it, as well as any of it parents that + * are also being rolled back. */ case TBLOCK_SUBABORT_END: - CleanupSubTransaction(); - CommitTransactionCommand(); - break; + case TBLOCK_SUBABORT_PENDING: + do { +/* + * The difference between SUBABORT_END and SUBABORT_PENDING is + * whether the subtransaction has already been aborted and we + * just need to clean it up, or if we need to also abort it + * first. + */ +if (s->blockState == TBLOCK_SUBABORT_PENDING) + AbortSubTransaction(); +CleanupSubTransaction(); +s =
Re: Stack overflow issue
On Wed, Jan 10, 2024 at 4:25 PM Heikki Linnakangas wrote: > The problem with CommitTransactionCommand (or rather > AbortCurrentTransaction() which has the same problem) > and ShowTransactionStateRec is that they get called in a state where > aborting can lead to a panic. If you add a "check_stack_depth()" to them > and try to reproducer scripts that Egor posted, you still get a panic. Hmm, that's unfortunate. I'm not sure what to do about that. But I'd still suggest looking for a goto-free approach. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
On 05/01/2024 19:23, Robert Haas wrote: On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas wrote: What do you think? At least for 0001 and 0002, I think we should just add the stack depth checks. With regard to 0001, CommitTransactionCommand() and friends are hard enough to understand as it is; they need "goto" like I need an extra hole in my head. With regard to 0002, this function isn't sufficiently important to justify adding special-case code for an extremely rare event. We should just handle it the way we do in general. I agree that in the memory-context case it might be worth expending some more code to be more clever. But I probably wouldn't do that for MemoryContextStats(); check_stack_depth() seems fine for that one. In general, I think we should try to keep the number of places that handle stack overflow in "special" ways as small as possible. The problem with CommitTransactionCommand (or rather AbortCurrentTransaction() which has the same problem) and ShowTransactionStateRec is that they get called in a state where aborting can lead to a panic. If you add a "check_stack_depth()" to them and try to reproducer scripts that Egor posted, you still get a panic. I'm not sure if MemoryContextStats() could safely elog(ERROR). But at least it would mask the "out of memory" that caused the stats to be printed in the first place. -- Heikki Linnakangas Neon (https://neon.tech)
Re: Stack overflow issue
On Fri, Jan 5, 2024 at 3:16 PM Andres Freund wrote: > On 2024-01-05 12:23:25 -0500, Robert Haas wrote: > > I agree that in the memory-context case it might be worth expending > > some more code to be more clever. But I probably wouldn't do that for > > MemoryContextStats(); check_stack_depth() seems fine for that one. > > We run MemoryContextStats() when we fail to allocate memory, including during > abort processing after a previous error. So I think it qualifies for being > somewhat special. OK. > Thus I suspect check_stack_depth() wouldn't be a good idea - > but we could make the stack_is_too_deep() path simpler and just return in the > existing MemoryContextStatsInternal() when that's the case. Since this kind of code will be exercised so rarely, it's highly vulnerable to bugs, so I favor keeping it as simple as we can. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
Hi, On 2024-01-05 12:23:25 -0500, Robert Haas wrote: > I agree that in the memory-context case it might be worth expending > some more code to be more clever. But I probably wouldn't do that for > MemoryContextStats(); check_stack_depth() seems fine for that one. We run MemoryContextStats() when we fail to allocate memory, including during abort processing after a previous error. So I think it qualifies for being somewhat special. Thus I suspect check_stack_depth() wouldn't be a good idea - but we could make the stack_is_too_deep() path simpler and just return in the existing MemoryContextStatsInternal() when that's the case. Greetings, Andres Freund
Re: Stack overflow issue
On Fri, Nov 24, 2023 at 10:47 AM Heikki Linnakangas wrote: > What do you think? At least for 0001 and 0002, I think we should just add the stack depth checks. With regard to 0001, CommitTransactionCommand() and friends are hard enough to understand as it is; they need "goto" like I need an extra hole in my head. With regard to 0002, this function isn't sufficiently important to justify adding special-case code for an extremely rare event. We should just handle it the way we do in general. I agree that in the memory-context case it might be worth expending some more code to be more clever. But I probably wouldn't do that for MemoryContextStats(); check_stack_depth() seems fine for that one. In general, I think we should try to keep the number of places that handle stack overflow in "special" ways as small as possible. -- Robert Haas EDB: http://www.enterprisedb.com
Re: Stack overflow issue
On 24/11/2023 21:14, Heikki Linnakangas wrote: What do you think? Hello! Thank you for researching the problem! I'm more of a tester than a developer, so I was able to check the patches from that side. I've configured the server with CFLAGS=" -O0" and cassert enabled and checked the following queries: #CommitTransactionCommand (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null #ShowTransactionStateRec (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null #MemoryContextCheck (n=100; printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release s1;" ) | psql >/dev/null #MemoryContextStatsInternal (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SELECT pg_log_backend_memory_contexts(pg_backend_pid())") | psql >/dev/null On my system, every of that queries led to a server crash at a number of savepoints in the range from 174,400 to 174,700. With your patches applied, the savepoint counter goes well beyond these values, I settled on an amount of approximately 300,000 savepoints. Your patches look good to me. Best regards, Egor Chindyaskin Postgres Professional: http://postgrespro.com/
Re: Stack overflow issue
On 21/06/2023 16:45, Egor Chindyaskin wrote: Hello! In continuation of the topic I would like to suggest solution. This patch adds several checks to the vulnerable functions above. I looked at this last patch. The depth checks are clearly better than segfaulting, but I think we can also avoid the recursions and having to error out. That seems nice especially for MemoryContextDelete(), which is called at transaction cleanup. 1. CommitTransactionCommand This is just tail recursion. The compiler will almost certainly optimize it away unless you're using -O0. We can easily turn it into iteration ourselves to avoid that hazard, per attached 0001-Turn-tail-recursion-into-iteration-in-CommitTransact.patch. 2. ShowTransactionStateRec Since this is just a debugging aid, I think we can just stop recursing if we're about to run out of stack space. Seems nicer than erroring out, although it can still error if you run out of memory. See 0002-Avoid-stack-overflow-in-ShowTransactionStateRec.patch. 3. All the MemoryContext functions I'm reluctant to add stack checks to these, because they are called in places like cleaning up after transaction abort. MemoryContextDelete() in particular. If you ereport an error, it's not clear that you can recover cleanly; you'll leak memory if nothing else. Fortunately MemoryContext contains pointers to parent and siblings, so we can traverse a tree of MemoryContexts iteratively, without using stack. MemoryContextStats() is a bit tricky, but we can put a limit on how much it recurses, and just print a summary line if the limit is reached. That's what we already do if a memory context has a lot of children. (Actually, if we didn't try keep track of the # of children at each level, to trigger the summarization, we could traverse the tree without using stack. But a limit seems useful.) What do you think? -- Heikki Linnakangas Neon (https://neon.tech) From 543e6569a3f4730f3105379cc946cc6b54525179 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 24 Nov 2023 14:13:10 +0200 Subject: [PATCH 1/4] Turn tail recursion into iteration in CommitTransactionCommand() Usually the compiler will optimize away the tail recursion anyway, but if it doesn't, you can drive the function into stack overflow. For example: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "ERROR; COMMIT;") | psql >/dev/null Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru --- src/backend/access/transam/xact.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index 536edb3792..e16b73a9f1 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -3033,12 +3033,15 @@ RestoreTransactionCharacteristics(const SavedTransactionCharacteristics *s) void CommitTransactionCommand(void) { - TransactionState s = CurrentTransactionState; + TransactionState s; SavedTransactionCharacteristics savetc; +recurse: + s = CurrentTransactionState; /* Must save in case we need to restore below */ SaveTransactionCharacteristics(); + s = CurrentTransactionState; switch (s->blockState) { /* @@ -3226,8 +3229,7 @@ CommitTransactionCommand(void) */ case TBLOCK_SUBABORT_END: CleanupSubTransaction(); - CommitTransactionCommand(); - break; + goto recurse; /* * As above, but it's not dead yet, so abort first. @@ -3235,8 +3237,7 @@ CommitTransactionCommand(void) case TBLOCK_SUBABORT_PENDING: AbortSubTransaction(); CleanupSubTransaction(); - CommitTransactionCommand(); - break; + goto recurse; /* * The current subtransaction is the target of a ROLLBACK TO -- 2.39.2 From 1da20313b6252346fbc3f46cdfde612532e98c84 Mon Sep 17 00:00:00 2001 From: Heikki Linnakangas Date: Fri, 24 Nov 2023 15:06:53 +0200 Subject: [PATCH 2/4] Avoid stack overflow in ShowTransactionStateRec() The function recurses, but didn't perform stack-depth checks. It's just a debugging aid, so instead of the usual check_stack_depth() call, stop the printing if we'd risk stack overflow. Here's an example of how to test this: (n=100; printf "BEGIN;"; for ((i=1;i<=$n;i++)); do printf "SAVEPOINT s$i;"; done; printf "SET log_min_messages = 'DEBUG5'; SAVEPOINT sp;") | psql >/dev/null In the passing, swap building the list of child XIDs and recursing to parent. That saves memory while recursing, reducing the risk of out of memory errors with lots of subtransactions. The saving is not very significant in practice, but this order seems more logical anyway. Report by Egor Chindyaskin and Alexander Lakhin. Discussion: https://www.postgresql.org/message-id/1672760457.940462079%40f306.i.mail.ru --- src/backend/access/transam/xact.c | 21 +++-- 1 file changed, 15 insertions(+), 6
Re: Stack overflow issue
Hello! In continuation of the topic I would like to suggest solution. This patch adds several checks to the vulnerable functions above.diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index d85e313908..102d0e1574 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -3043,6 +3043,9 @@ CommitTransactionCommand(void) TransactionState s = CurrentTransactionState; SavedTransactionCharacteristics savetc; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + SaveTransactionCharacteristics(); switch (s->blockState) @@ -5500,6 +5503,9 @@ ShowTransactionStateRec(const char *str, TransactionState s) { StringInfoData buf; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + initStringInfo(); if (s->nChildXids > 0) diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index 7acf654bf8..58a1d70b8a 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -76,6 +76,7 @@ #include "commands/trigger.h" #include "commands/typecmds.h" #include "funcapi.h" +#include "miscadmin.h" #include "nodes/nodeFuncs.h" #include "parser/parsetree.h" #include "rewrite/rewriteRemove.h" @@ -524,6 +525,11 @@ findDependentObjects(const ObjectAddress *object, if (stack_address_present_add_flags(object, objflags, stack)) return; + /* since this function recurses, it could be driven to stack overflow, + * because of the deep dependency tree, not only due to dependency loops. + */ + check_stack_depth(); + /* * It's also possible that the target object has already been completely * processed and put into targetObjects. If so, again we just add the diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 4f006820b8..a88550913d 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -552,6 +552,9 @@ CheckAttributeType(const char *attname, char att_typtype = get_typtype(atttypid); Oid att_typelem; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (att_typtype == TYPTYPE_PSEUDO) { /* diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index 7c697a285b..f22da79c65 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -6684,6 +6684,9 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, TupleDesc tupdesc; FormData_pg_attribute *aattr[] = {}; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* At top level, permission check was done in ATPrepCmd, else do it */ if (recursing) ATSimplePermissions((*cmd)->subtype, rel, ATT_TABLE | ATT_FOREIGN_TABLE); @@ -8383,6 +8386,10 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName, /* Initialize addrs on the first invocation */ Assert(!recursing || addrs != NULL); + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (!recursing) addrs = new_object_addresses(); @@ -10839,6 +10846,9 @@ ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel, Oid refrelid; bool changed = false; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + currcon = (Form_pg_constraint) GETSTRUCT(contuple); conoid = currcon->oid; refrelid = currcon->confrelid; @@ -11839,6 +11849,9 @@ ATExecDropConstraint(Relation rel, const char *constrName, bool is_no_inherit_constraint = false; char contype; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* At top level, permission check was done in ATPrepCmd, else do it */ if (recursing) ATSimplePermissions(AT_DropConstraint, rel, ATT_TABLE | ATT_FOREIGN_TABLE); diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c index aa584848cf..77a5eb526c 100644 --- a/src/backend/optimizer/util/clauses.c +++ b/src/backend/optimizer/util/clauses.c @@ -2318,6 +2318,10 @@ static Node * eval_const_expressions_mutator(Node *node, eval_const_expressions_context *context) { + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (node == NULL) return NULL; switch (nodeTag(node)) diff --git a/src/backend/utils/adt/jsonpath_exec.c b/src/backend/utils/adt/jsonpath_exec.c index b561f0e7e8..dc7ab387ea 100644 --- a/src/backend/utils/adt/jsonpath_exec.c +++ b/src/backend/utils/adt/jsonpath_exec.c @@ -1232,6 +1232,9 @@ executeBoolItem(JsonPathExecContext *cxt, JsonPathItem *jsp, JsonPathBool res; JsonPathBool res2; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (!canHaveNext && jspHasNext(jsp)) elog(ERROR, "boolean jsonpath item
Re: Stack overflow issue
03.01.2023 22:45, Sascha Kuhl writes: Great work. Max Stack depth is memory dependent? Processor dependent? Hello! These situations are not specific to the x86_64 architecture, but also manifest themselves, for example, on aarch64 architecture. For example this query, ran on aarch64, (n=100;printf "begin;"; for ((i=1;i<=$n;i++)); do printf "savepoint s$i;"; done; printf "release s1;" ) | psql > /dev/null crashed the server on the savepoint174617 with the following stacktrace: Core was generated by `postgres: test test [local] SAVEPOINT '. Program terminated with signal SIGSEGV, Segmentation fault. #0 AllocSetCheck (context=at address 0xe2397fe8>) at aset.c:1409 1409 { (gdb) bt #0 AllocSetCheck (context=at address 0xe2397fe8>) at aset.c:1409 #1 0xd78c38c4 in MemoryContextCheck (context=0xaaab39ee16a0) at mcxt.c:740 #2 0xd78c38dc in MemoryContextCheck (context=0xaaab39edf690) at mcxt.c:742 #3 0xd78c38dc in MemoryContextCheck (context=0xaaab39edd680) at mcxt.c:742 #4 0xd78c38dc in MemoryContextCheck (context=0xaaab39edb670) at mcxt.c:742 #5 0xd78c38dc in MemoryContextCheck (context=0xaaab39ed9660) at mcxt.c:742 #6 0xd78c38dc in MemoryContextCheck (context=0xaaab39ed7650) at mcxt.c:742 #7 0xd78c38dc in MemoryContextCheck (context=0xaaab39ed5640) at mcxt.c:742 #8 0xd78c38dc in MemoryContextCheck (context=0xaaab39ed3630) at mcxt.c:742 #9 0xd78c38dc in MemoryContextCheck (context=0xaaab39ed1620) at mcxt.c:742 #10 0xd78c38dc in MemoryContextCheck (context=0xaaab39ecf610) at mcxt.c:742 ... #174617 0xd78c38dc in MemoryContextCheck (context=0xe47994b0) at mcxt.c:742 #174618 0xd78c38dc in MemoryContextCheck (context=0xe476dcd0) at mcxt.c:742 #174619 0xd78c38dc in MemoryContextCheck (context=0xe46ead50) at mcxt.c:742 #174620 0xd76c7e24 in finish_xact_command () at postgres.c:2739 #174621 0xd76c55b8 in exec_simple_query (query_string=0xe46f0540 "savepoint s174617;") at postgres.c:1238 #174622 0xd76ca7a4 in PostgresMain (argc=1, argv=0xe2b96898, dbname=0xe471c098 "test", username=0xe471c078 "test") at postgres.c:4508 #174623 0xd75e263c in BackendRun (port=0xe4711470) at postmaster.c:4530 #174624 0xd75e1f70 in BackendStartup (port=0xe4711470) at postmaster.c:4252 #174625 0xd75dd4c0 in ServerLoop () at postmaster.c:1745 #174626 0xd75dcd3c in PostmasterMain (argc=3, argv=0xe46eacb0) at postmaster.c:1417 #174627 0xd74d462c in main (argc=3, argv=0xe46eacb0) at main.c:209
Re: Stack overflow issue
Hello! In continuation of the topic, I, under the leadership of Alexander Lakhin, prepared patches that fix these problems. We decided that these checks would be enough and put them in the places we saw fit.diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c index fd1a7ac5d5..318f620ed5 100644 --- a/src/backend/access/transam/xact.c +++ b/src/backend/access/transam/xact.c @@ -2812,6 +2812,9 @@ CommitTransactionCommand(void) { TransactionState s = CurrentTransactionState; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + switch (s->blockState) { /* @@ -5194,6 +5197,9 @@ ShowTransactionStateRec(const char *str, TransactionState s) { StringInfoData buf; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + initStringInfo(); if (s->nChildXids > 0) diff --git a/src/backend/catalog/dependency.c b/src/backend/catalog/dependency.c index f8510a1483..d1bbc11d3a 100644 --- a/src/backend/catalog/dependency.c +++ b/src/backend/catalog/dependency.c @@ -74,6 +74,7 @@ #include "nodes/nodeFuncs.h" #include "parser/parsetree.h" #include "rewrite/rewriteRemove.h" +#include "miscadmin.h" #include "storage/lmgr.h" #include "utils/fmgroids.h" #include "utils/guc.h" @@ -489,6 +490,11 @@ findDependentObjects(const ObjectAddress *object, if (stack_address_present_add_flags(object, objflags, stack)) return; + /* since this function recurses, it could be driven to stack overflow, + * because of the deep dependency tree, not only due to dependency loops. + */ + check_stack_depth(); + /* * It's also possible that the target object has already been completely * processed and put into targetObjects. If so, again we just add the diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c index 846c530154..2f243bcf5e 100644 --- a/src/backend/catalog/heap.c +++ b/src/backend/catalog/heap.c @@ -513,6 +513,9 @@ CheckAttributeType(const char *attname, char att_typtype = get_typtype(atttypid); Oid att_typelem; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (att_typtype == TYPTYPE_PSEUDO) { /* diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c index f56219973e..4cf08a1e04 100644 --- a/src/backend/commands/tablecmds.c +++ b/src/backend/commands/tablecmds.c @@ -5578,6 +5578,9 @@ ATExecAddColumn(List **wqueue, AlteredTableInfo *tab, Relation rel, AclResult aclresult; ObjectAddress address; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* At top level, permission check was done in ATPrepCmd, else do it */ if (recursing) ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE); @@ -7004,6 +7007,9 @@ ATExecDropColumn(List **wqueue, Relation rel, const char *colName, ObjectAddress object; bool is_expr; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* At top level, permission check was done in ATPrepCmd, else do it */ if (recursing) ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE); @@ -8563,6 +8569,9 @@ ATExecAlterConstrRecurse(Constraint *cmdcon, Relation conrel, Relation tgrel, Oid conoid; bool changed = false; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + currcon = (Form_pg_constraint) GETSTRUCT(contuple); conoid = HeapTupleGetOid(contuple); @@ -9552,6 +9561,9 @@ ATExecDropConstraint(Relation rel, const char *constrName, bool is_no_inherit_constraint = false; char contype; + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* At top level, permission check was done in ATPrepCmd, else do it */ if (recursing) ATSimplePermissions(rel, ATT_TABLE | ATT_FOREIGN_TABLE); diff --git a/src/backend/optimizer/util/clauses.c b/src/backend/optimizer/util/clauses.c index edb56ab3a4..67b3a1f4bf 100644 --- a/src/backend/optimizer/util/clauses.c +++ b/src/backend/optimizer/util/clauses.c @@ -2650,6 +2650,10 @@ static Node * eval_const_expressions_mutator(Node *node, eval_const_expressions_context *context) { + + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + if (node == NULL) return NULL; switch (nodeTag(node)) diff --git a/src/backend/utils/mmgr/mcxt.c b/src/backend/utils/mmgr/mcxt.c index 22da98c19d..800c1ffac0 100644 --- a/src/backend/utils/mmgr/mcxt.c +++ b/src/backend/utils/mmgr/mcxt.c @@ -216,6 +216,9 @@ MemoryContextDelete(MemoryContext context) /* And not CurrentMemoryContext, either */ Assert(context != CurrentMemoryContext); + /* since this function recurses, it could be driven to stack overflow */ + check_stack_depth(); + /* save a function call in common case where there are no children */ if (context->firstchild != NULL)
Re: Stack overflow issue
Great work. Max Stack depth is memory dependent? Processor dependent? Егор Чиндяскин schrieb am Mi., 24. Aug. 2022, 11:51: > Hello, I recently got a server crash (bug #17583 [1]) caused by a stack > overflow. > > Tom Lane and Richard Guo, in a discussion of this bug, suggested that > there could be more such places. > Therefore, Alexander Lakhin and I decided to deal with this issue and > Alexander developed a methodology. We processed src/backend/*/*.c with > "clang -emit-llvm ... | opt -analyze -print-calgraph" to find all the > functions that call themselves directly. I checked each of them for > features that protect against stack overflows. > We analyzed 4 catalogs: regex, tsearch, snowball and adt. > Firstly, we decided to test the regex catalog functions and found 6 of > them that lack the check_stach_depth() call. > > zaptreesubs > markst > next > nfatree > numst > repeat > > We have tried to exploit the recursion in the function zaptreesubs(): > select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', > 11000) || ')?'); > > ERROR: invalid regular expression: regular expression is too complex > > repeat(): > select regexp_match('abc01234xyz',repeat('a{0,2}',11)); > > ERROR: invalid regular expression: regular expression is too complex > > numst(): > select regexp_match('abc01234xyz',repeat('(.)\1e',11)); > > ERROR: invalid regular expression: regular expression is too complex > > markst(): > markst is called in the code after v->tree = parse(...); > it is necessary that the tree be successfully parsed, but with a nesting > level of about 100,000 this will not work - stack protection will work > during parsing and v->ntree = numst(...); is also there. > > next(): > we were able to crash the server with the following query: > (printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<100;i++)); do > printf "(?#)"; done; printf "b')" ) | psql > > Secondly, we have tried to exploit the recursion in the adt catalog > functions and Alexander was able to crash the server with the following > query: > > regex_selectivity_sub(): > SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 20) || > 'b)'); > > And this query: > > (n=10; > printf "SELECT polygon '((0,0),(0,100))' <@ polygon > '((-20,100),"; > for ((i=1;i<$n;i++)); do printf "(10,$(( 30 + > $i))),(-10,$((80 + $i))),"; done; > printf "(20,90),(20,0))';" > ) | psql > > Thirdly, the snowball catalog, Alexander has tried to exploit the > recursion in the r_stem_suffix_chain_before_ki function and crashed a > server using this query: > > r_stem_suffix_chain_before_ki(): > SELECT ts_lexize('turkish_stem', repeat('lerdeki', 100)); > > The last one is the tsearch catalog. We have found 4 functions that didn't > have check_stach_depth() function: > > SplitToVariants > mkANode > mkSPNode > LexizeExec > > We have tried to exploit the recursion in the SplitToVariants function and > Alexander crashed a server using this: > > SplitToVariants(): > CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, > DictFile=ispell_sample,AffFile=ispell_sample); > SELECT ts_lexize('ispell', repeat('bally', 1)); > > After trying to exploit the recursion in the LexizeExec function Alexander > made this conlusion: > > LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and > "ld->curDictId != InvalidOid" (multiword mode) - we start with the first > one, then make recursive call to switch to the multiword mode, but then we > return to the usual mode again. > > mkANode and mkSPNode deal with the dictionary structs, not with > user-supplied data, so we believe these functions are not vulnerable. > > [1] > https://www.postgresql.org/message-id/flat/CAMbWs499ytQiH4mLMhRxRWP-iEUz3-DSinpAD-cUCtVo_23Wtg%40mail.gmail.com#03ad703cf4bc8d28ccba69913e1e8106 >
Re: Stack overflow issue
>Среда, 26 октября 2022, 21:47 +07:00 от Egor Chindyaskin : > >24.08.2022 20:58, Tom Lane writes: >> Nice work! I wonder if you can make the regex crashes reachable by >> reducing the value of max_stack_depth enough that it's hit before >> reaching the "regular expression is too complex" limit. >> >> regards, tom lane Hello everyone! It's been a while since me and Alexander >> Lakhin have >published a list of functions that have a stack overflow illness. We are >back to tell you more about such places. >During our analyze we made a conclusion that some functions can be >crashed without changing any of the parameters and some can be crashed >only if we change some stuff. > >The first function crashes without any changes: > ># CheckAttributeType > >(n=6; printf "create domain dint as int; create domain dinta0 as >dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as >dinta$(( $i - 1 ))[]; "; done; ) | psql >psql -c "create table t(f1 dinta6[]);" > >Some of the others crash if we change "max_locks_per_transaction" >parameter: > ># findDependentObjects > >max_locks_per_transaction = 200 > >(n=1; printf "create table t (i int); create view v0 as select * >from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * >from v$(( $i - 1 )); "; done; ) | psql >psql -c "drop table t" > ># ATExecDropColumn > >max_locks_per_transaction = 300 > >(n=5; printf "create table t0 (a int, b int); "; for >((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 >))); "; done; printf "alter table t0 drop b;" ) | psql > ># ATExecDropConstraint > >max_locks_per_transaction = 300 > >(n=5; printf "create table t0 (a int, b int, constraint bc check (b > > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() >inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop >constraint bc;" ) | psql > ># ATExecAddColumn > >max_locks_per_transaction = 200 > >(n=5; printf "create table t0 (a int, b int);"; for >((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 >))); "; done; printf "alter table t0 add column c int;" ) | psql > ># ATExecAlterConstrRecurse > >max_locks_per_transaction = 300 > >(n=5; >printf "create table t(a int primary key); create table pt (a int >primary key, foreign key(a) references t) partition by range (a);"; >printf "create table pt0 partition of pt for values from (0) to (10) >partition by range (a);"; >for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( >$i - 1 )) for values from ($i) to (10) partition by range (a); "; done; >printf "alter table pt alter constraint pt_a_fkey deferrable initially >deferred;" >) | psql > >This is where the fun begins. According to Tom Lane, a decrease in >max_stack_depth could lead to new crashes, but it turned out that >Alexander was able to find new crashes precisely due to the increase in >this parameter. Also, we had ulimit -s set to 8MB as the default value. > ># eval_const_expressions_mutator > >max_stack_depth = '7000kB' > >(n=1; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " >collate \"C\" "; done; ) | psql > >If you didn’t have a crash, like me, when Alexander shared his find, >then probably you configured your cluster with an optimization flag -Og. >In the process of trying to break this function, we came to the >conclusion that the maximum stack depth depends on the optimization flag >(-O0/-Og). As it turned out, when optimizing, the function frame on the >stack becomes smaller and because of this, the limit is reached more >slowly, therefore, the system can withstand more torment. Therefore, >this query will fail if you have a cluster configured with the -O0 >optimization flag. > >The crash of the next function not only depends on the optimization >flag, but also on a number of other things. While researching, we >noticed that postgres enforces a distance ~400kB from max_stack_depth to >ulimit -s. We thought we could hit the max_stack_depth limit and then >hit the OS limit as well. Therefore, Alexander wrote a recursive SQL >function, that eats up a stack within max_stack_depth, including a query >that eats up the remaining ~400kB. And this causes a crash. > ># executeBoolItem > >max_stack_depth = '7600kB' > >create function infinite_recurse(i int) returns int as $$ >begin > raise notice 'Level %', i; > begin > perform jsonb_path_query('{"a":[1]}'::jsonb, ('$.a[*] ? (' || >repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath); > exception > when others then raise notice 'jsonb_path_query error at level %, >%', i, sqlerrm; > end; > begin > select infinite_recurse(i + 1) into i; > exception > when others then raise notice 'Max stack depth reached at level %, >%', i, sqlerrm; > end; > return i; >end; >$$ language plpgsql; > >select infinite_recurse(1); > >To sum it all up, we have not yet decided on a general approach to such >functions. Some functions are definitely subject to stack
Re: Stack overflow issue
24.08.2022 20:58, Tom Lane writes: Nice work! I wonder if you can make the regex crashes reachable by reducing the value of max_stack_depth enough that it's hit before reaching the "regular expression is too complex" limit. regards, tom lane Hello everyone! It's been a while since me and Alexander Lakhin have published a list of functions that have a stack overflow illness. We are back to tell you more about such places. During our analyze we made a conclusion that some functions can be crashed without changing any of the parameters and some can be crashed only if we change some stuff. The first function crashes without any changes: # CheckAttributeType (n=6; printf "create domain dint as int; create domain dinta0 as dint[];"; for ((i=1;i<=$n;i++)); do printf "create domain dinta$i as dinta$(( $i - 1 ))[]; "; done; ) | psql psql -c "create table t(f1 dinta6[]);" Some of the others crash if we change "max_locks_per_transaction" parameter: # findDependentObjects max_locks_per_transaction = 200 (n=1; printf "create table t (i int); create view v0 as select * from t;"; for ((i=1;i<$n;i++)); do printf "create view v$i as select * from v$(( $i - 1 )); "; done; ) | psql psql -c "drop table t" # ATExecDropColumn max_locks_per_transaction = 300 (n=5; printf "create table t0 (a int, b int); "; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop b;" ) | psql # ATExecDropConstraint max_locks_per_transaction = 300 (n=5; printf "create table t0 (a int, b int, constraint bc check (b > 0));"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 drop constraint bc;" ) | psql # ATExecAddColumn max_locks_per_transaction = 200 (n=5; printf "create table t0 (a int, b int);"; for ((i=1;i<=$n;i++)); do printf "create table t$i() inherits(t$(( $i - 1 ))); "; done; printf "alter table t0 add column c int;" ) | psql # ATExecAlterConstrRecurse max_locks_per_transaction = 300 (n=5; printf "create table t(a int primary key); create table pt (a int primary key, foreign key(a) references t) partition by range (a);"; printf "create table pt0 partition of pt for values from (0) to (10) partition by range (a);"; for ((i=1;i<=$n;i++)); do printf "create table pt$i partition of pt$(( $i - 1 )) for values from ($i) to (10) partition by range (a); "; done; printf "alter table pt alter constraint pt_a_fkey deferrable initially deferred;" ) | psql This is where the fun begins. According to Tom Lane, a decrease in max_stack_depth could lead to new crashes, but it turned out that Alexander was able to find new crashes precisely due to the increase in this parameter. Also, we had ulimit -s set to 8MB as the default value. # eval_const_expressions_mutator max_stack_depth = '7000kB' (n=1; printf "select 'a' "; for ((i=1;i<$n;i++)); do printf " collate \"C\" "; done; ) | psql If you didn’t have a crash, like me, when Alexander shared his find, then probably you configured your cluster with an optimization flag -Og. In the process of trying to break this function, we came to the conclusion that the maximum stack depth depends on the optimization flag (-O0/-Og). As it turned out, when optimizing, the function frame on the stack becomes smaller and because of this, the limit is reached more slowly, therefore, the system can withstand more torment. Therefore, this query will fail if you have a cluster configured with the -O0 optimization flag. The crash of the next function not only depends on the optimization flag, but also on a number of other things. While researching, we noticed that postgres enforces a distance ~400kB from max_stack_depth to ulimit -s. We thought we could hit the max_stack_depth limit and then hit the OS limit as well. Therefore, Alexander wrote a recursive SQL function, that eats up a stack within max_stack_depth, including a query that eats up the remaining ~400kB. And this causes a crash. # executeBoolItem max_stack_depth = '7600kB' create function infinite_recurse(i int) returns int as $$ begin raise notice 'Level %', i; begin perform jsonb_path_query('{"a":[1]}'::jsonb, ('$.a[*] ? (' || repeat('!(', 4800) || '@ == @' || repeat(')', 4800) || ')')::jsonpath); exception when others then raise notice 'jsonb_path_query error at level %, %', i, sqlerrm; end; begin select infinite_recurse(i + 1) into i; exception when others then raise notice 'Max stack depth reached at level %, %', i, sqlerrm; end; return i; end; $$ language plpgsql; select infinite_recurse(1); To sum it all up, we have not yet decided on a general approach to such functions. Some functions are definitely subject to stack overflow. Some are definitely not. This can be seen from the code where the recurse flag is passed, or a function that checks the stack is
Re: Stack overflow issue
On Wed, Aug 31, 2022 at 6:57 AM Tom Lane wrote: > I wrote: > > The upstream recommendation, which seems pretty sane to me, is to > > simply reject any string exceeding some threshold length as not > > possibly being a word. Apparently it's common to use thresholds > > as small as 64 bytes, but in the attached I used 1000 bytes. > > On further thought: that coding treats anything longer than 1000 > bytes as a stopword, but maybe we should just accept it unmodified. > The manual says "A Snowball dictionary recognizes everything, whether > or not it is able to simplify the word". While "recognizes" formally > includes the case of "recognizes as a stopword", people might find > this behavior surprising. We could alternatively do it as attached, > which accepts overlength words but does nothing to them except > case-fold. This is closer to the pre-patch behavior, but gives up > the opportunity to avoid useless downstream processing of long words. This patch looks good to me. It avoids overly-long words (> 1000 bytes) going through the stemmer so the stack overflow issue in Turkish stemmer should not exist any more. Thanks Richard
Re: Stack overflow issue
I wrote: > The upstream recommendation, which seems pretty sane to me, is to > simply reject any string exceeding some threshold length as not > possibly being a word. Apparently it's common to use thresholds > as small as 64 bytes, but in the attached I used 1000 bytes. On further thought: that coding treats anything longer than 1000 bytes as a stopword, but maybe we should just accept it unmodified. The manual says "A Snowball dictionary recognizes everything, whether or not it is able to simplify the word". While "recognizes" formally includes the case of "recognizes as a stopword", people might find this behavior surprising. We could alternatively do it as attached, which accepts overlength words but does nothing to them except case-fold. This is closer to the pre-patch behavior, but gives up the opportunity to avoid useless downstream processing of long words. regards, tom lane diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c index 68c9213f69..1d5dfff5a0 100644 --- a/src/backend/snowball/dict_snowball.c +++ b/src/backend/snowball/dict_snowball.c @@ -275,8 +275,24 @@ dsnowball_lexize(PG_FUNCTION_ARGS) char *txt = lowerstr_with_len(in, len); TSLexeme *res = palloc0(sizeof(TSLexeme) * 2); - if (*txt == '\0' || searchstoplist(&(d->stoplist), txt)) + /* + * Do not pass strings exceeding 1000 bytes to the stemmer, as they're + * surely not words in any human language. This restriction avoids + * wasting cycles on stuff like base64-encoded data, and it protects us + * against possible inefficiency or misbehavior in the stemmer. (For + * example, the Turkish stemmer has an indefinite recursion, so it can + * crash on long-enough strings.) However, Snowball dictionaries are + * defined to recognize all strings, so we can't reject the string as an + * unknown word. + */ + if (len > 1000) + { + /* return the lexeme lowercased, but otherwise unmodified */ + res->lexeme = txt; + } + else if (*txt == '\0' || searchstoplist(&(d->stoplist), txt)) { + /* empty or stopword, so report as stopword */ pfree(txt); } else
Re: Stack overflow issue
I wrote: >> I think most likely we should report this to Snowball upstream >> and see what they think is an appropriate fix. > Done at [1], and I pushed the other fixes. Thanks again for the report! The upstream recommendation, which seems pretty sane to me, is to simply reject any string exceeding some threshold length as not possibly being a word. Apparently it's common to use thresholds as small as 64 bytes, but in the attached I used 1000 bytes. regards, tom lane diff --git a/src/backend/snowball/dict_snowball.c b/src/backend/snowball/dict_snowball.c index 68c9213f69..aaf4ff72b6 100644 --- a/src/backend/snowball/dict_snowball.c +++ b/src/backend/snowball/dict_snowball.c @@ -272,11 +272,25 @@ dsnowball_lexize(PG_FUNCTION_ARGS) DictSnowball *d = (DictSnowball *) PG_GETARG_POINTER(0); char *in = (char *) PG_GETARG_POINTER(1); int32 len = PG_GETARG_INT32(2); - char *txt = lowerstr_with_len(in, len); TSLexeme *res = palloc0(sizeof(TSLexeme) * 2); + char *txt; + /* + * Reject strings exceeding 1000 bytes, as they're surely not words in any + * human language. This restriction avoids wasting cycles on stuff like + * base64-encoded data, and it protects us against possible inefficiency + * or misbehavior in the stemmers (for example, the Turkish stemmer has an + * indefinite recursion so it can crash on long-enough strings). + */ + if (len <= 0 || len > 1000) + PG_RETURN_POINTER(res); + + txt = lowerstr_with_len(in, len); + + /* txt is probably not zero-length now, but we'll check anyway */ if (*txt == '\0' || searchstoplist(&(d->stoplist), txt)) { + /* empty or stopword, so reject */ pfree(txt); } else
Re: Stack overflow issue
I wrote: > I think most likely we should report this to Snowball upstream > and see what they think is an appropriate fix. Done at [1], and I pushed the other fixes. Thanks again for the report! regards, tom lane [1] https://lists.tartarus.org/pipermail/snowball-discuss/2022-August/001734.html
Re: Stack overflow issue
=?UTF-8?B?0JXQs9C+0YAg0KfQuNC90LTRj9GB0LrQuNC9?= writes: > Firstly, we decided to test the regex catalog functions and found 6 of them > that lack the check_stach_depth() call. > zaptreesubs > markst > next > nfatree > numst > repeat I took a closer look at these. I think the markst, numst, and nfatree cases are non-issues. They are recursing on a subre tree that was just built by parse(), so parse() must have successfully recursed the same number of levels. parse() surely has a larger stack frame, and it does have a stack overflow guard (in subre()), so it would have failed cleanly before making a data structure that could be hazardous here. Also, having markst error out would be problematic for the reasons discussed in its comment, so I'm disinclined to try to add checks that have no use. BTW, I wonder why your test didn't notice freesubre()? But the same analysis applies there, as does the concern that we can't just error out. Likewise, zaptreesubs() can't recurse more levels than cdissect() did, and that has a stack check, so I'm not very excited about adding another one there. I believe that repeat() is a non-issue because (a) the number of recursion levels in it is limited by DUPMAX, which is generally going to be 255, or at least not enormous, and (b) it will recurse at most once before calling dupnfa(), which contains stack checks. I think next() is a legit issue, although your example doesn't crash for me. I suppose that's because my compiler turned the tail recursion into a loop, and I suggest that we fix it by doing that manually. (Richard's proposed fix is wrong anyway: we can't just throw elog(ERROR) in the regex code without creating memory leaks.) regards, tom lane
Re: Stack overflow issue
Richard Guo writes: > Attached adds the checks in these places. But I'm not sure about the > snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly? No, that file is generated code, as it says right at the top. I think most likely we should report this to Snowball upstream and see what they think is an appropriate fix. regards, tom lane
Re: Stack overflow issue
=?UTF-8?B?0JXQs9C+0YAg0KfQuNC90LTRj9GB0LrQuNC9?= writes: > Therefore, Alexander Lakhin and I decided to deal with this issue and > Alexander developed a methodology. We processed src/backend/*/*.c with "clang > -emit-llvm ... | opt -analyze -print-calgraph" to find all the functions > that call themselves directly. I checked each of them for features that > protect against stack overflows. > We analyzed 4 catalogs: regex, tsearch, snowball and adt. > Firstly, we decided to test the regex catalog functions and found 6 of them > that lack the check_stach_depth() call. Nice work! I wonder if you can make the regex crashes reachable by reducing the value of max_stack_depth enough that it's hit before reaching the "regular expression is too complex" limit. regards, tom lane
Re: Stack overflow issue
Hi Richard, Patch is looking good to me. Would request others to take a look at it as well. Thanks, Mahendrakar. On Wed, 24 Aug 2022 at 17:24, Richard Guo wrote: > > On Wed, Aug 24, 2022 at 7:12 PM Richard Guo > wrote: > >> >> On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera >> wrote: >> >>> On 2022-Aug-24, mahendrakar s wrote: >>> >>> > Hi, >>> > Can we have a parameter to control the recursion depth in these cases >>> to >>> > avoid crashes? >>> >>> We already have one (max_stack_depth). The problem is lack of calling >>> the control function in a few places. >> >> >> Thanks Egor and Alexander for the work! I think we can just add >> check_stack_depth checks in these cases. >> > > Attached adds the checks in these places. But I'm not sure about the > snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly? > > Thanks > Richard >
Re: Stack overflow issue
On Wed, Aug 24, 2022 at 7:12 PM Richard Guo wrote: > > On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera > wrote: > >> On 2022-Aug-24, mahendrakar s wrote: >> >> > Hi, >> > Can we have a parameter to control the recursion depth in these cases to >> > avoid crashes? >> >> We already have one (max_stack_depth). The problem is lack of calling >> the control function in a few places. > > > Thanks Egor and Alexander for the work! I think we can just add > check_stack_depth checks in these cases. > Attached adds the checks in these places. But I'm not sure about the snowball case. Can we edit src/backend/snowball/libstemmer/*.c directly? Thanks Richard v1-0001-add-check_stack_depth-in-more-places.patch Description: Binary data
Re: Stack overflow issue
On Wed, Aug 24, 2022 at 6:49 PM Alvaro Herrera wrote: > On 2022-Aug-24, mahendrakar s wrote: > > > Hi, > > Can we have a parameter to control the recursion depth in these cases to > > avoid crashes? > > We already have one (max_stack_depth). The problem is lack of calling > the control function in a few places. Thanks Egor and Alexander for the work! I think we can just add check_stack_depth checks in these cases. Thanks Richard
Re: Stack overflow issue
Thanks. On Wed, 24 Aug, 2022, 4:19 pm Alvaro Herrera, wrote: > On 2022-Aug-24, mahendrakar s wrote: > > > Hi, > > Can we have a parameter to control the recursion depth in these cases to > > avoid crashes? > > We already have one (max_stack_depth). The problem is lack of calling > the control function in a few places. > > -- > Álvaro Herrera 48°01'N 7°57'E — > https://www.EnterpriseDB.com/ >
Re: Stack overflow issue
On 2022-Aug-24, mahendrakar s wrote: > Hi, > Can we have a parameter to control the recursion depth in these cases to > avoid crashes? We already have one (max_stack_depth). The problem is lack of calling the control function in a few places. -- Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
Re: Stack overflow issue
Hi, Can we have a parameter to control the recursion depth in these cases to avoid crashes? Just a thought. Thanks, Mahendrakar. On Wed, 24 Aug, 2022, 3:21 pm Егор Чиндяскин, wrote: > Hello, I recently got a server crash (bug #17583 [1]) caused by a stack > overflow. > > Tom Lane and Richard Guo, in a discussion of this bug, suggested that > there could be more such places. > Therefore, Alexander Lakhin and I decided to deal with this issue and > Alexander developed a methodology. We processed src/backend/*/*.c with > "clang -emit-llvm ... | opt -analyze -print-calgraph" to find all the > functions that call themselves directly. I checked each of them for > features that protect against stack overflows. > We analyzed 4 catalogs: regex, tsearch, snowball and adt. > Firstly, we decided to test the regex catalog functions and found 6 of > them that lack the check_stach_depth() call. > > zaptreesubs > markst > next > nfatree > numst > repeat > > We have tried to exploit the recursion in the function zaptreesubs(): > select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', > 11000) || ')?'); > > ERROR: invalid regular expression: regular expression is too complex > > repeat(): > select regexp_match('abc01234xyz',repeat('a{0,2}',11)); > > ERROR: invalid regular expression: regular expression is too complex > > numst(): > select regexp_match('abc01234xyz',repeat('(.)\1e',11)); > > ERROR: invalid regular expression: regular expression is too complex > > markst(): > markst is called in the code after v->tree = parse(...); > it is necessary that the tree be successfully parsed, but with a nesting > level of about 100,000 this will not work - stack protection will work > during parsing and v->ntree = numst(...); is also there. > > next(): > we were able to crash the server with the following query: > (printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<100;i++)); do > printf "(?#)"; done; printf "b')" ) | psql > > Secondly, we have tried to exploit the recursion in the adt catalog > functions and Alexander was able to crash the server with the following > query: > > regex_selectivity_sub(): > SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 20) || > 'b)'); > > And this query: > > (n=10; > printf "SELECT polygon '((0,0),(0,100))' <@ polygon > '((-20,100),"; > for ((i=1;i<$n;i++)); do printf "(10,$(( 30 + > $i))),(-10,$((80 + $i))),"; done; > printf "(20,90),(20,0))';" > ) | psql > > Thirdly, the snowball catalog, Alexander has tried to exploit the > recursion in the r_stem_suffix_chain_before_ki function and crashed a > server using this query: > > r_stem_suffix_chain_before_ki(): > SELECT ts_lexize('turkish_stem', repeat('lerdeki', 100)); > > The last one is the tsearch catalog. We have found 4 functions that didn't > have check_stach_depth() function: > > SplitToVariants > mkANode > mkSPNode > LexizeExec > > We have tried to exploit the recursion in the SplitToVariants function and > Alexander crashed a server using this: > > SplitToVariants(): > CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, > DictFile=ispell_sample,AffFile=ispell_sample); > SELECT ts_lexize('ispell', repeat('bally', 1)); > > After trying to exploit the recursion in the LexizeExec function Alexander > made this conlusion: > > LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and > "ld->curDictId != InvalidOid" (multiword mode) - we start with the first > one, then make recursive call to switch to the multiword mode, but then we > return to the usual mode again. > > mkANode and mkSPNode deal with the dictionary structs, not with > user-supplied data, so we believe these functions are not vulnerable. > > [1] > https://www.postgresql.org/message-id/flat/CAMbWs499ytQiH4mLMhRxRWP-iEUz3-DSinpAD-cUCtVo_23Wtg%40mail.gmail.com#03ad703cf4bc8d28ccba69913e1e8106 >
Stack overflow issue
Hello, I recently got a server crash (bug #17583 [1]) caused by a stack overflow. Tom Lane and Richard Guo, in a discussion of this bug, suggested that there could be more such places. Therefore, Alexander Lakhin and I decided to deal with this issue and Alexander developed a methodology. We processed src/backend/*/*.c with "clang -emit-llvm ... | opt -analyze -print-calgraph" to find all the functions that call themselves directly. I checked each of them for features that protect against stack overflows. We analyzed 4 catalogs: regex, tsearch, snowball and adt. Firstly, we decided to test the regex catalog functions and found 6 of them that lack the check_stach_depth() call. zaptreesubs markst next nfatree numst repeat We have tried to exploit the recursion in the function zaptreesubs(): select regexp_matches('a' || repeat(' a', 11000), '(.)(' || repeat(' \1', 11000) || ')?'); ERROR: invalid regular expression: regular expression is too complex repeat(): select regexp_match('abc01234xyz',repeat('a{0,2}',11)); ERROR: invalid regular expression: regular expression is too complex numst(): select regexp_match('abc01234xyz',repeat('(.)\1e',11)); ERROR: invalid regular expression: regular expression is too complex markst(): markst is called in the code after v->tree = parse(...); it is necessary that the tree be successfully parsed, but with a nesting level of about 100,000 this will not work - stack protection will work during parsing and v->ntree = numst(...); is also there. next(): we were able to crash the server with the following query: (printf "SELECT regexp_match('abc', 'a"; for ((i=1;i<100;i++)); do printf "(?#)"; done; printf "b')" ) | psql Secondly, we have tried to exploit the recursion in the adt catalog functions and Alexander was able to crash the server with the following query: regex_selectivity_sub(): SELECT * FROM pg_proc WHERE proname ~ ('(a' || repeat('|', 20) || 'b)'); And this query: (n=10; printf "SELECT polygon '((0,0),(0,100))' <@ polygon '((-20,100),"; for ((i=1;i<$n;i++)); do printf "(10,$(( 30 + $i))),(-10,$((80 + $i))),"; done; printf "(20,90),(20,0))';" ) | psql Thirdly, the snowball catalog, Alexander has tried to exploit the recursion in the r_stem_suffix_chain_before_ki function and crashed a server using this query: r_stem_suffix_chain_before_ki(): SELECT ts_lexize('turkish_stem', repeat('lerdeki', 100)); The last one is the tsearch catalog. We have found 4 functions that didn't have check_stach_depth() function: SplitToVariants mkANode mkSPNode LexizeExec We have tried to exploit the recursion in the SplitToVariants function and Alexander crashed a server using this: SplitToVariants(): CREATE TEXT SEARCH DICTIONARY ispell (Template=ispell, DictFile=ispell_sample,AffFile=ispell_sample); SELECT ts_lexize('ispell', repeat('bally', 1)); After trying to exploit the recursion in the LexizeExec function Alexander made this conlusion: LexizeExec has two branches "ld->curDictId == InvalidOid" (usual mode) and "ld->curDictId != InvalidOid" (multiword mode) - we start with the first one, then make recursive call to switch to the multiword mode, but then we return to the usual mode again. mkANode and mkSPNode deal with the dictionary structs, not with user-supplied data, so we believe these functions are not vulnerable. [1] https://www.postgresql.org/message-id/flat/CAMbWs499ytQiH4mLMhRxRWP-iEUz3-DSinpAD-cUCtVo_23Wtg%40mail.gmail.com#03ad703cf4bc8d28ccba69913e1e8106