On 07/12/2014 05:16 AM, Jeff Davis wrote:
On Fri, 2014-07-11 at 11:51 -0400, Tom Lane wrote:
Jeff Davis <pg...@j-davis.com> writes:
Attached is a small patch to $SUBJECT.
In master, only single-byte characters are allowed as an escape. Of
course, with the patch it must still be a single character, but it may
be multi-byte.
I'm concerned about the performance cost of this patch. Have you done
any measurements about what kind of overhead you are putting on the
inner loop of similar_escape?
I didn't consider this very performance critical, because this is
looping through the pattern, which I wouldn't expect to be a long
string. On my machine using en_US.UTF-8, the difference is imperceptible
for a SIMILAR TO ... ESCAPE query.
I was able to see about a 2% increase in runtime when using the
similar_escape function directly. I made a 10M tuple table and did:
explain analyze
select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
which was the worst reasonable case I could think of.
Actually, that gets optimized to a constant in the planner:
postgres=# explain verbose select
similar_escape('ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ','#')
from t;
QUERY PLAN
--------------------------------------------------------------------------------
----------
Seq Scan on public.t (cost=0.00..144247.85 rows=9999985 width=0)
Output:
'^(?:ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ
)$'::text
Planning time: 0.033 ms
(3 rows)
With a working test case:
create table t (pattern text);
insert into t select
'ΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣΣ' from
generate_series(1, 1000000);
vacuum t;
explain (analyze) select similar_escape(pattern,'#') from t;
your patch seems to be about 2x-3x as slow as unpatched master. So this
needs some optimization. A couple of ideas:
1. If the escape string is in fact a single-byte character, you can
proceed with the loop just as it is today, without the pg_mblen calls.
2. Since pg_mblen() will always return an integer between 1-6, it would
probably be faster to replace the memcpy() and memcmp() calls with
simple for-loops iterating byte-by-byte.
In very brief testing, with the 1. change above, the performance with
this patch is back to what it's without the patch. See attached.
- Heikki
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index caf45ef..dd46325 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -688,11 +688,16 @@ similar_escape(PG_FUNCTION_ARGS)
elen = VARSIZE_ANY_EXHDR(esc_text);
if (elen == 0)
e = NULL; /* no escape character */
- else if (elen != 1)
- ereport(ERROR,
- (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
- errmsg("invalid escape string"),
- errhint("Escape string must be empty or one character.")));
+ else
+ {
+ int escape_mblen = pg_mbstrlen_with_len(e, elen);
+
+ if (escape_mblen > 1)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_ESCAPE_SEQUENCE),
+ errmsg("invalid escape string"),
+ errhint("Escape string must be empty or one character.")));
+ }
}
/*----------
@@ -723,59 +728,94 @@ similar_escape(PG_FUNCTION_ARGS)
while (plen > 0)
{
char pchar = *p;
+ int mblen;
- if (afterescape)
+ /*
+ * If the escape string is single-byte character, we can process the
+ * the pattern one byte at a time, ignoring multi-byte characters.
+ * (This works because all server-encodings have the property that the
+ * a non-first byte of a multi-byte characters always has the high-bit
+ * set, and hence we cannot be fooled by a byte in the middle of a
+ * multi-byte character.)
+ */
+ if (elen == 1 || (mblen = pg_mblen(p)))
{
- if (pchar == '"' && !incharclass) /* for SUBSTRING patterns */
- *r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
- else
+ if (afterescape)
+ {
+ if (pchar == '"' && !incharclass) /* for SUBSTRING patterns */
+ *r++ = ((nquotes++ % 2) == 0) ? '(' : ')';
+ else
+ {
+ *r++ = '\\';
+ *r++ = pchar;
+ }
+ afterescape = false;
+ }
+ else if (e && pchar == *e)
+ {
+ /* SQL99 escape character; do not send to output */
+ afterescape = true;
+ }
+ else if (incharclass)
+ {
+ if (pchar == '\\')
+ *r++ = '\\';
+ *r++ = pchar;
+ if (pchar == ']')
+ incharclass = false;
+ }
+ else if (pchar == '[')
+ {
+ *r++ = pchar;
+ incharclass = true;
+ }
+ else if (pchar == '%')
+ {
+ *r++ = '.';
+ *r++ = '*';
+ }
+ else if (pchar == '_')
+ *r++ = '.';
+ else if (pchar == '(')
+ {
+ /* convert to non-capturing parenthesis */
+ *r++ = '(';
+ *r++ = '?';
+ *r++ = ':';
+ }
+ else if (pchar == '\\' || pchar == '.' ||
+ pchar == '^' || pchar == '$')
{
*r++ = '\\';
*r++ = pchar;
}
- afterescape = false;
- }
- else if (e && pchar == *e)
- {
- /* SQL99 escape character; do not send to output */
- afterescape = true;
+ else
+ *r++ = pchar;
+ p++, plen--;
}
- else if (incharclass)
+ else
{
- if (pchar == '\\')
+ if (afterescape)
+ {
*r++ = '\\';
- *r++ = pchar;
- if (pchar == ']')
- incharclass = false;
- }
- else if (pchar == '[')
- {
- *r++ = pchar;
- incharclass = true;
- }
- else if (pchar == '%')
- {
- *r++ = '.';
- *r++ = '*';
- }
- else if (pchar == '_')
- *r++ = '.';
- else if (pchar == '(')
- {
- /* convert to non-capturing parenthesis */
- *r++ = '(';
- *r++ = '?';
- *r++ = ':';
- }
- else if (pchar == '\\' || pchar == '.' ||
- pchar == '^' || pchar == '$')
- {
- *r++ = '\\';
- *r++ = pchar;
+ memcpy(r, p, mblen);
+ r += mblen;
+ afterescape = false;
+ }
+ else if (e && elen == mblen && memcmp(e, p, mblen) == 0)
+ {
+ /* SQL99 escape character; do not send to output */
+ afterescape = true;
+ }
+ else
+ {
+ memcpy(r, p, mblen);
+ r += mblen;
+ }
+
+ p += mblen;
+ plen -= mblen;
}
- else
- *r++ = pchar;
- p++, plen--;
}
*r++ = ')';
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers