Re: Assert triggered during RE_compile_and_cache

2021-08-05 Thread Mark Dilger



> On Aug 5, 2021, at 1:38 PM, Mark Dilger  wrote:
> 
> +select 'vyrlvyrlwvqko' ~ '(?:(?:((.((\2)\1.){0,0}?';

I've boiled it down a bit more:

+select '' ~ '()\1{0}';
+ ?column? 
+--
+ t
+(1 row)
+
+select '' ~ '()(\1){0}';
+ ?column? 
+--
+ t
+(1 row)
+
+select '' ~ '(())\2{0}';
+ ?column? 
+--
+ t
+(1 row)
+
+select '' ~ '(())(\2){0}';
+server closed the connection unexpectedly
+   This probably means the server terminated abnormally
+   before or while processing the request.
+connection to server was lost

Any ideas?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-05 Thread Tom Lane
Mark Dilger  writes:
> +select '' ~ '(())(\2){0}';
> +server closed the connection unexpectedly

> Any ideas?

Huh.  This seems like some deficiency in the part of parseqatom
starting with

/* annoying special case:  {0} or {0,0} cancels everything */
if (m == 0 && n == 0)

but I don't immediately see what's different about your failing case
versus the not-failing ones.

regards, tom lane




Re: Assert triggered during RE_compile_and_cache

2021-08-05 Thread Mark Dilger


> On Aug 5, 2021, at 3:15 PM, Tom Lane  wrote:
> 
> I don't immediately see what's different about your failing case
> versus the not-failing ones.

I have now found lots of cases of this failure.  I *believe* the backreference 
is always greater than 1, and it is always in a capture group which then has 
the {0} or {0,0} applied to it.

You can find lots of cases using the attached regex generating script I whipped 
up for testing your work.  (Note this is just a quick and dirty tool for 
hacking, not anything refined.)

#!/usr/bin/perl

use strict;
use warnings;

our @alphabet = ('a'..'z');

sub rand_num
{
	my $result = 0;
	$result++ while(int(rand(3)));
	return $result;
}

sub rand_char
{
	return $alphabet[int(rand(@alphabet))];
}

our @strings;
sub rand_string
{
	if (scalar(@strings))
	{
		my $dice = int(rand(3));
		return $strings[int(rand(@strings))] if ($dice == 0);
		shift(@strings) if ($dice == 1);
		pop(@strings) if ($dice == 2);
	}

	my $result = join('', map { rand_char() } (1..rand_num()));
	push (@strings, $result) if (int(rand(2)));

	return $result;
}

sub rand_long_string
{
	my $result = "";
	$result .= rand_string() while(int(rand(10)));

	return $result;
}

sub rand_quantifier
{
	my $dice = int(rand(12));

	return "*"if ($dice == 0);
	return "+"if ($dice == 1);
	return "?"if ($dice == 2);
	return "*?"   if ($dice == 3);
	return "+?"   if ($dice == 4);
	return "??"   if ($dice == 5);

	my $beg = rand_num();
	return "{$beg}"   if ($dice == 6);
	return "{$beg,}"  if ($dice == 7);
	return "{$beg}?"  if ($dice == 8);
	return "{$beg,}?" if ($dice == 9);

	my $end = rand_num() + $beg;
	return "{$beg,$end}"  if ($dice == 10);
	return "{$beg,$end}?" if ($dice == 11);
	return "";
}

sub rand_escape
{
	my $dice = int(rand(5));

	return '\\0'if ($dice == 0);
	return '\\' . rand_char()   if ($dice == 1);
	return '\\' . uc(rand_char())   if ($dice == 2);
	return '\\' . rand_string() if ($dice == 3);
	return '\\' . uc(rand_string()) if ($dice == 4);

	return "";
}

our $max_capture = 0;

sub rand_rgx
{
	my ($depth) = @_;

	$depth = 0 unless defined $depth;

	# Choose option, but limit the choice if we're in danger of deep recursion
	my $dice = int(rand($depth < 5 ? 100 : 20));

	# Base cases
	return "" if ($dice == 0);
	return rand_escape()  if ($dice == 2);
	return rand_char()if ($dice < 5);
	if ($dice < 10 && $max_capture)
	{
		my $capgroup = 1 + int(rand($max_capture));
		return '\\' . $capgroup;
	}
	return "."if ($dice < 20);

	# Recursive cases
	return '[' . rand_escape() . ']'  if ($dice == 20);
	return '[^' . rand_escape() . ']' if ($dice == 21);
	return '[' . rand_string() . ']'  if ($dice == 22);
	return '[^' . rand_string() . ']' if ($dice == 23);
	if ($dice < 60)
	{
		my $result = '(' . rand_rgx($depth+1) . ')';
		$max_capture++;
		return $result;
	}
	return '(?:' . rand_rgx($depth+1) . ')'   if ($dice < 70);
	return '(?=' . rand_rgx($depth+1) . ')'   if ($dice == 71);
	return '(?!' . rand_rgx($depth+1) . ')'   if ($dice == 72);
	return '(?<=' . rand_rgx($depth+1) . ')'  if ($dice == 73);
	return '(?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company





Re: Assert triggered during RE_compile_and_cache

2021-08-05 Thread Tom Lane
Mark Dilger  writes:
> I have now found lots of cases of this failure.  I *believe* the 
> backreference is always greater than 1, and it is always in a capture group 
> which then has the {0} or {0,0} applied to it.

Hm.  I thought that this might be an old issue, but I'm not seeing the
crash in pre-v14 branches.  That means it's some bug introduced in
the performance improvements I did a few months ago.  Open item added.

regards, tom lane




Re: Assert triggered during RE_compile_and_cache

2021-08-05 Thread Tom Lane
I wrote:
> Hm.  I thought that this might be an old issue, but I'm not seeing the
> crash in pre-v14 branches.  That means it's some bug introduced in
> the performance improvements I did a few months ago.  Open item added.

git bisect says the trouble started with

ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4 is the first bad commit
commit ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4
Author: Tom Lane 
Date:   Sat Feb 20 19:26:41 2021 -0500

Avoid generating extra subre tree nodes for capturing parentheses.

I probably won't be able to dig deeper for a little while.

regards, tom lane




Re: Assert triggered during RE_compile_and_cache

2021-08-07 Thread Tom Lane
I wrote:
>> Hm.  I thought that this might be an old issue, but I'm not seeing the
>> crash in pre-v14 branches.  That means it's some bug introduced in
>> the performance improvements I did a few months ago.  Open item added.

> git bisect says the trouble started with
> ea1268f6301cc7adce571cc9c5ebe8d9342a2ef4 is the first bad commit

Here's a draft fix for this.  The issue seems to be that parseqatom
records where a previous capture group is by saving a pointer to
the "subre" that parse() returned for it.  That was okay before
said commit because we didn't do anything further to that subre:
we immediately wrapped it into a parent subre, and then any further
hacking that parseqatom did was done on the parent subre.  But after
ea1268f63, in most cases we'd hack right on that same subre, thus
potentially modifying the portion of the NFA that would be copied
by a subsequent back-reference.

The particular thing that's asserting when you wrap {0} around such
a back-ref is just accidental road kill: it happens to notice that
the NFA is no longer consistent, because there's no path leading
from the start to the end of the copied sub-NFA, thanks to the copying
having been done between a pair of states that weren't actually
appropriate to reference.  What surprises me about this is not that
crash, but that we didn't see fundamental breakage of backref-using
patterns all over the place.  It evidently breaks only in corner
cases, but I'm not quite sure why the effects are so limited.

Anyway, the fix seems pretty straightforward.  We don't really need
anything about the subre as such except for its starting and ending
NFA states, so the attached modifies things to record only those
state pointers.  I'm not done testing this by any means, but it
does fix the two cases you showed, and I've been running that perl
script for some time with no further crashes.

I suspect the assertion you hit with the REG_NOSUB patch was just
further fallout from this same basic bug, but I've not yet rebased
that patch over this to check.

regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 9f71177d31..afa6dd44c6 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -233,6 +233,13 @@ static int	cmp(const chr *, const chr *, size_t);
 static int	casecmp(const chr *, const chr *, size_t);
 
 
+/* info we need during compilation about a known capturing subexpression */
+struct subinfo
+{
+	struct state *left;			/* left end of its sub-NFA */
+	struct state *right;		/* right end of its sub-NFA */
+};
+
 /* internal variables, bundled for easy passing around */
 struct vars
 {
@@ -245,10 +252,10 @@ struct vars
 	int			nexttype;		/* type of next token */
 	chr			nextvalue;		/* value (if any) of next token */
 	int			lexcon;			/* lexical context type (see regc_lex.c) */
-	int			nsubexp;		/* subexpression count */
-	struct subre **subs;		/* subRE pointer vector */
-	size_t		nsubs;			/* length of vector */
-	struct subre *sub10[10];	/* initial vector, enough for most */
+	int			nsubexp;		/* number of known capturing subexpressions */
+	struct subinfo *subs;		/* info about known capturing subexpressions */
+	size_t		nsubs;			/* allocated length of subs[] vector */
+	struct subinfo sub10[10];	/* initial vector, enough for most */
 	struct nfa *nfa;			/* the NFA */
 	struct colormap *cm;		/* character color map */
 	color		nlcolor;		/* color of newline */
@@ -368,7 +375,7 @@ pg_regcomp(regex_t *re,
 	v->subs = v->sub10;
 	v->nsubs = 10;
 	for (j = 0; j < v->nsubs; j++)
-		v->subs[j] = NULL;
+		v->subs[j].left = v->subs[j].right = NULL;
 	v->nfa = NULL;
 	v->cm = NULL;
 	v->nlcolor = COLORLESS;
@@ -504,13 +511,13 @@ pg_regcomp(regex_t *re,
 }
 
 /*
- * moresubs - enlarge subRE vector
+ * moresubs - enlarge capturing-subexpressions vector
  */
 static void
 moresubs(struct vars *v,
 		 int wanted)			/* want enough room for this one */
 {
-	struct subre **p;
+	struct subinfo *p;
 	size_t		n;
 
 	assert(wanted > 0 && (size_t) wanted >= v->nsubs);
@@ -518,13 +525,13 @@ moresubs(struct vars *v,
 
 	if (v->subs == v->sub10)
 	{
-		p = (struct subre **) MALLOC(n * sizeof(struct subre *));
+		p = (struct subinfo *) MALLOC(n * sizeof(struct subinfo));
 		if (p != NULL)
 			memcpy(VS(p), VS(v->subs),
-   v->nsubs * sizeof(struct subre *));
+   v->nsubs * sizeof(struct subinfo));
 	}
 	else
-		p = (struct subre **) REALLOC(v->subs, n * sizeof(struct subre *));
+		p = (struct subinfo *) REALLOC(v->subs, n * sizeof(struct subinfo));
 	if (p == NULL)
 	{
 		ERR(REG_ESPACE);
@@ -532,7 +539,7 @@ moresubs(struct vars *v,
 	}
 	v->subs = p;
 	for (p = &v->subs[v->nsubs]; v->nsubs < n; p++, v->nsubs++)
-		*p = NULL;
+		p->left = p->right = NULL;
 	assert(v->nsubs == n);
 	assert((size_t) wanted < v->nsubs);
 }
@@ -982,8 +989,10 @@ parseqatom(struct vars *v,
 			NOERR();
 			if (cap)
 			{
-assert(v->subs[subno] == NULL);
-v->subs[subno] = atom;
+/* save the 

Re: Assert triggered during RE_compile_and_cache

2021-08-07 Thread Tom Lane
I wrote:
> ... What surprises me about this is not that
> crash, but that we didn't see fundamental breakage of backref-using
> patterns all over the place.  It evidently breaks only in corner
> cases, but I'm not quite sure why the effects are so limited.

Ah-hah, I've sussed it.  I'd supposed that the issue was that parseqatom
could replace the subre atom's endpoints in the bit where it starts
to "prepare a general-purpose state skeleton".  However, it seems that
that's not what makes it go off the rails.  Rather, in the specific
combination we've got in the test case, the next outer recursion level
of parseqatom() decides that it can free the subre that was returned
by the inner level, *which is the same subre that v->subs[n] is pointing
to*.  So this is really a use-after-free problem.  freesrnode() won't
actually free() the node, just stick it back onto the chain for possible
recycling --- and it doesn't clear the state pointer fields when it
does that.  So there's no use-after-free that ordinary tools could detect.
We only see a problem manifest if the subre node is reused later, which
explains why '(())(\2){0}' fails while '(())\2{0}' does not.

So the problem boils down to parseqatom deciding it can free child
subre nodes that it knows little about the provenance of.  It would
be safer to make that optimization by instead freeing the "top"
node passed in from parsebranch, which we know has got no other
interesting linkages.  That requires tweaking the API of parseqatom,
which why I'd not done it like that to begin with --- but that's not
a hard or complicated change, just a mite tedious.

Hence, the attached patch.

While this is sufficient to make the problem go away, I'm
inclined to apply both changesets.  Even if it accidentally
works right now to have later backrefs consult the outer s/s2
state pair rather than the original pair, that seems far too
convoluted and bug-prone.  The outer states should be strictly
the concern of the iteration setup logic in the outer invocation
of parseqatom.

(I'm sort of wondering now whether the outer s/s2 states are
even really necessary anymore ... maybe Spencer put those in
as a way of preventing some prehistoric version of this bug.
But I'm not excited about messing with that right now.)

regards, tom lane

diff --git a/src/backend/regex/regcomp.c b/src/backend/regex/regcomp.c
index 9f71177d31..3d7f11af8c 100644
--- a/src/backend/regex/regcomp.c
+++ b/src/backend/regex/regcomp.c
@@ -43,7 +43,7 @@ static int	freev(struct vars *, int);
 static void makesearch(struct vars *, struct nfa *);
 static struct subre *parse(struct vars *, int, int, struct state *, struct state *);
 static struct subre *parsebranch(struct vars *, int, int, struct state *, struct state *, int);
-static void parseqatom(struct vars *, int, int, struct state *, struct state *, struct subre *);
+static struct subre *parseqatom(struct vars *, int, int, struct state *, struct state *, struct subre *);
 static void nonword(struct vars *, int, struct state *, struct state *);
 static void word(struct vars *, int, struct state *, struct state *);
 static void charclass(struct vars *, enum char_classes,
@@ -756,7 +756,7 @@ parsebranch(struct vars *v,
 		seencontent = 1;
 
 		/* NB, recursion in parseqatom() may swallow rest of branch */
-		parseqatom(v, stopper, type, lp, right, t);
+		t = parseqatom(v, stopper, type, lp, right, t);
 		NOERRN();
 	}
 
@@ -777,8 +777,12 @@ parsebranch(struct vars *v,
  * The bookkeeping near the end cooperates very closely with parsebranch();
  * in particular, it contains a recursion that can involve parsing the rest
  * of the branch, making this function's name somewhat inaccurate.
+ *
+ * Usually, the return value is just "top", but in some cases where we
+ * have parsed the rest of the branch, we may deem "top" redundant and
+ * free it, returning some child subre instead.
  */
-static void
+static struct subre *
 parseqatom(struct vars *v,
 		   int stopper,			/* EOS or ')' */
 		   int type,			/* LACON (lookaround subRE) or PLAIN */
@@ -818,84 +822,84 @@ parseqatom(struct vars *v,
 			if (v->cflags & REG_NLANCH)
 ARCV(BEHIND, v->nlcolor);
 			NEXT();
-			return;
+			return top;
 			break;
 		case '$':
 			ARCV('$', 1);
 			if (v->cflags & REG_NLANCH)
 ARCV(AHEAD, v->nlcolor);
 			NEXT();
-			return;
+			return top;
 			break;
 		case SBEGIN:
 			ARCV('^', 1);		/* BOL */
 			ARCV('^', 0);		/* or BOS */
 			NEXT();
-			return;
+			return top;
 			break;
 		case SEND:
 			ARCV('$', 1);		/* EOL */
 			ARCV('$', 0);		/* or EOS */
 			NEXT();
-			return;
+			return top;
 			break;
 		case '<':
 			wordchrs(v);
 			s = newstate(v->nfa);
-			NOERR();
+			NOERRN();
 			nonword(v, BEHIND, lp, s);
 			word(v, AHEAD, s, rp);
 			NEXT();
-			return;
+			return top;
 			break;
 		case '>':
 			wordchrs(v);
 			s = newstate(v->nfa);
-			NOERR();
+			NOERRN();
 			word(v, BEHIND, lp, s);
 			nonword(v, AHEAD, s, rp);
 			NEXT();
-			return;

Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Tom Lane
I wrote:
> While this is sufficient to make the problem go away, I'm
> inclined to apply both changesets.  Even if it accidentally
> works right now to have later backrefs consult the outer s/s2
> state pair rather than the original pair, that seems far too
> convoluted and bug-prone.  The outer states should be strictly
> the concern of the iteration setup logic in the outer invocation
> of parseqatom.

> (I'm sort of wondering now whether the outer s/s2 states are
> even really necessary anymore ... maybe Spencer put those in
> as a way of preventing some prehistoric version of this bug.
> But I'm not excited about messing with that right now.)

I realized that the earlier patch is actually a bad idea, because
it breaks the possibility of updating the subre to mark it as
being referenced by a later backref, as the REG_NOSUB patch needs
to do.  However, on closer study, the outer s/s2 states being
added by the "prepare a general-purpose state skeleton" stanza
really are duplicative of the ones we already made for a
parenthesized atom, so we can just get rid of them.  Done that
way.

regards, tom lane




Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 8, 2021, at 9:31 AM, Tom Lane  wrote:
> 
> I realized that the earlier patch is actually a bad idea

Applying only your  to master, the following 
which did not crash begins to crash:

 select regexp_split_to_array('rjgy', '(?:(.))((\1\1.))');

It also changes the results for another one:

 select 
regexp_split_to_array('jdpveoarcnsarcnsarcnszieqbqbqbqbiufdlywphbnrxtdoboouuzcqiqmenj',
 '()((.)){1}?\3?', 'itx');
-   
  regexp_split_to_array 
 
-
- 
{"","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","","",""}
+  regexp_split_to_array   
+--
+ {jdpveoarcnsarcnsarcnszieqbqbqbqbiufdlywphbnrxtdoboouuzcqiqmenj}

I'm not sure what *should* be returned here, only that it is a behavioral 
change.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 7, 2021, at 6:03 PM, Tom Lane  wrote:
> 
> That requires tweaking the API of parseqatom,
> which why I'd not done it like that to begin with --- but that's not
> a hard or complicated change, just a mite tedious.
> 
> Hence, the attached patch.

Applying your  to master changes the 
outcome of some regular expression queries, but I *think* it changes them for 
the better.

These three look to me exactly correct after the patch, and wrong before:

 select regexp_matches('vnrungnajjjgkaaeaper', '((.))(((\1)))((?:\5..))', 'mx');
- regexp_matches 
-
-(0 rows)
+ regexp_matches  
+-
+ {j,j,j,j,j,jgk}
+(1 row)

 select 
regexp_match('kgkgeganlifykxhfspjtgluwluwluwdfdfbbdjvrxjvrxedndrkaxxvbtqdj', 
'((.))\2');
  regexp_match
 --
- 
+ {b,b}
 (1 row)

 select regexp_split_to_array('uuuzkodphfbfbfb', '((.))(\1\2)', 'ntw');
  regexp_split_to_array
 ---
- {uuuzkodphfbfbfb}
+ {"",zkodphfbfbfb}
 (1 row)

But these next two look to me correct before the patch and wrong after:

 select regexp_matches('ircecpbgyiggvtruqgxzigxzigxzisdbkuhbkuhrvl', 
'(((.)))(?:(\3))[^\f]');
  regexp_matches
 
-(0 rows)
+ {g,g,g,g}
+(1 row)

 select regexp_matches('fhgxnvbvjaej', '(((.)).)((\3)((.)))', 'csx');
- regexp_matches 
-
-(0 rows)
+  regexp_matches   
+---
+ {vb,v,v,vj,v,j,j}
+(1 row)

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 8, 2021, at 10:15 AM, Mark Dilger  wrote:
> 
> But these next two look to me correct before the patch and wrong after:
> 
> select regexp_matches('ircecpbgyiggvtruqgxzigxzigxzisdbkuhbkuhrvl', 
> '(((.)))(?:(\3))[^\f]');
>  regexp_matches
> 
> -(0 rows)
> + {g,g,g,g}
> +(1 row)
> 
> select regexp_matches('fhgxnvbvjaej', '(((.)).)((\3)((.)))', 'csx');
> - regexp_matches 
> -
> -(0 rows)
> +  regexp_matches   
> +---
> + {vb,v,v,vj,v,j,j}
> +(1 row)

Scratch that.  On further thought, these also look correct.  I wasn't doing the 
capturing in my head correctly.  So I think the patch is working!  I'll test a 
bit longer.

Is this patch (alternate-backref-corner-case-fix-1.patch) the current state of 
your patch set?  I'd like to be sure I'm testing your latest work.  Thanks.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Tom Lane
Mark Dilger  writes:
> Applying your  to master changes 
> the outcome of some regular expression queries, but I *think* it changes them 
> for the better.

[ squint... ] You sure you applied the patch properly?  All these examples
give me the same results in HEAD (with said patches as-committed) and v13.

regards, tom lane




Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 8, 2021, at 10:29 AM, Tom Lane  wrote:
> 
> Mark Dilger  writes:
>> Applying your  to master changes 
>> the outcome of some regular expression queries, but I *think* it changes 
>> them for the better.
> 
> [ squint... ] You sure you applied the patch properly?  All these examples
> give me the same results in HEAD (with said patches as-committed) and v13.
> 
>   regards, tom lane

I'm not sure what you mean by "as-committed".  Did you commit one of these 
recently?

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 8, 2021, at 10:32 AM, Mark Dilger  wrote:
> 
> I'm not sure what you mean by "as-committed".  Did you commit one of these 
> recently?

Nevermind, I see the commit now.

I'll rerun my tests with the new master.  I was still using the code that I 
pulled yesterday.

—
Mark Dilger
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company







Re: Assert triggered during RE_compile_and_cache

2021-08-08 Thread Mark Dilger



> On Aug 8, 2021, at 10:34 AM, Mark Dilger  wrote:
> 
> I'll rerun my tests with the new master.  I was still using the code that I 
> pulled yesterday.

I am now testing REL_13_STABLE (ba9f665a4413f855bbf4b544481db326f7b9bd73) vs 
master (c1132aae336c41cf9d316222e525d8d593c2b5d2).  The regular expressions we 
discussed upthread show no differences.

There are a few changes which appear correct to me, but I don't know if you 
expected them:

 select regexp_match('hko', '?http://www.enterprisedb.com
The Enterprise PostgreSQL Company