Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Heikki Linnakangas Mon, 25 Aug 2014 07:42:47 -0700

On 07/12/2014 05:16 AM, Jeff Davis wrote:

On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:

Jeff Davis <pg...@j-davis.com> writes:

Attached is a small patch to $SUBJECT.
In master, only single-byte characters are allowed as an escape. Of
course, with the patch it must still be a single character, but it may
be multi-byte.


I'm concerned about the performance cost of this patch.  Have you done
any measurements about what kind of overhead you are putting on the
inner loop of similar_escape?


I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.

I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:

     explain analyze
       select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
 from t;

which was the worst reasonable case I could think of.


Actually, that gets optimized to a constant in the planner:

postgres=# explain  verbose select

similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')from t;QUERY PLAN


--------------------------------------------------------------------------------
----------
 Seq Scan on public.t  (cost=0.00..144247.85 rows=9999985 width=0)

Output:'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ

)$'::text
 Planning time: 0.033 ms
(3 rows)

With a working test case:

create table t (pattern text);

insert into t select'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' fromgenerate_series(1, 1000000);

vacuum t;

explain (analyze) select similar_escape(pattern,'#') from t;

your patch seems to be about 2x-3x as slow as unpatched master. So thisneeds some optimization. A couple of ideas:

1. If the escape string is in fact a single-byte character, you canproceed with the loop just as it is today, without the pg_mblen calls.

2. Since pg_mblen() will always return an integer between 1-6, it wouldprobably be faster to replace the memcpy() and memcmp() calls withsimple for-loops iterating byte-by-byte.

In very brief testing, with the 1. change above, the performance withthis patch is back to what it's without the patch. See attached.


- Heikki

diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index caf45ef..dd46325 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -688,11 +688,16 @@ similar_escape(PG_FUNCTION_ARGS)
 		elen = VARSIZE_ANY_EXHDR(esc_text);
 		if (elen == 0)
 			e = NULL;			/* no escape character */
-		else if (elen != 1)
-			ereport(ERROR,
-					(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
-					 errmsg("invalid escape string"),
-				  errhint("Escape string must be empty or one character.")));
+		else
+		{
+			int			escape_mblen = pg_mbstrlen_with_len(e, elen);
+
+			if (escape_mblen > 1)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+						 errmsg("invalid escape string"),
+						 errhint("Escape string must be empty or one character.")));
+		}
 	}
 
 	/*----------
@@ -723,59 +728,94 @@ similar_escape(PG_FUNCTION_ARGS)
 	while (plen > 0)
 	{
 		char		pchar = *p;
+		int			mblen;
 
-		if (afterescape)
+		/*
+		 * If the escape string is single-byte character, we can process the
+		 * the pattern one byte at a time, ignoring multi-byte characters.
+		 * (This works because all server-encodings have the property that the
+		 * a non-first byte of a multi-byte characters always has the high-bit
+		 * set, and hence we cannot be fooled by a byte in the middle of a
+		 * multi-byte character.)
+		 */
+		if (elen == 1 || (mblen = pg_mblen(p)))
 		{
-			if (pchar == '"' && !incharclass)	/* for SUBSTRING patterns */
-				*r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
-			else
+			if (afterescape)
+			{
+				if (pchar == '"' && !incharclass)	/* for SUBSTRING patterns */
+					*r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
+				else
+				{
+					*r++ = '\\';
+					*r++ = pchar;
+				}
+				afterescape = false;
+			}
+			else if (e && pchar == *e)
+			{
+				/* SQL99 escape character; do not send to output */
+				afterescape = true;
+			}
+			else if (incharclass)
+			{
+				if (pchar == '\\')
+					*r++ = '\\';
+				*r++ = pchar;
+				if (pchar == ']')
+					incharclass = false;
+			}
+			else if (pchar == '[')
+			{
+				*r++ = pchar;
+				incharclass = true;
+			}
+			else if (pchar == '%')
+			{
+				*r++ = '.';
+				*r++ = '*';
+			}
+			else if (pchar == '_')
+				*r++ = '.';
+			else if (pchar == '(')
+			{
+				/* convert to non-capturing parenthesis */
+				*r++ = '(';
+				*r++ = '?';
+				*r++ = ':';
+			}
+			else if (pchar == '\\' || pchar == '.' ||
+					 pchar == '^' || pchar == '$')
 			{
 				*r++ = '\\';
 				*r++ = pchar;
 			}
-			afterescape = false;
-		}
-		else if (e && pchar == *e)
-		{
-			/* SQL99 escape character; do not send to output */
-			afterescape = true;
+			else
+				*r++ = pchar;
+			p++, plen--;
 		}
-		else if (incharclass)
+		else
 		{
-			if (pchar == '\\')
+			if (afterescape)
+			{
 				*r++ = '\\';
-			*r++ = pchar;
-			if (pchar == ']')
-				incharclass = false;
-		}
-		else if (pchar == '[')
-		{
-			*r++ = pchar;
-			incharclass = true;
-		}
-		else if (pchar == '%')
-		{
-			*r++ = '.';
-			*r++ = '*';
-		}
-		else if (pchar == '_')
-			*r++ = '.';
-		else if (pchar == '(')
-		{
-			/* convert to non-capturing parenthesis */
-			*r++ = '(';
-			*r++ = '?';
-			*r++ = ':';
-		}
-		else if (pchar == '\\' || pchar == '.' ||
-				 pchar == '^' || pchar == '$')
-		{
-			*r++ = '\\';
-			*r++ = pchar;
+				memcpy(r, p, mblen);
+				r += mblen;
+				afterescape = false;
+			}
+			else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
+			{
+				/* SQL99 escape character; do not send to output */
+				afterescape = true;
+			}
+			else
+			{
+				memcpy(r, p, mblen);
+				r += mblen;
+			}
+
+			p += mblen;
+			plen -= mblen;
 		}
-		else
-			*r++ = pchar;
-		p++, plen--;
 	}
 
 	*r++ = ')';

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] Allow multi-byte characters as escape in SIMILAR TO and SUBSTRING

Reply via email to