Re: Boyer-More-Horspool searching LIKE queries

Atsushi Ogawa Fri, 14 Jan 2022 06:41:37 -0800

Thanks for the comments.

> > The conditions under which B-M-H can be used are as follows.
> >
> > (1) B-M-H in LIKE search supports only single-byte character sets and
UTF8.
> > Multibyte character sets does not support, because it may contain
another
> > characters in the byte sequence. For UTF-8, it works fine, because in
> > UTF-8 the byte sequence of one character cannot contain another
character.
>
> You can make it work with any encoding, if you check for that case after
> you find a match. See how text_position() does it.


I saw the text_position(). I would like to do the same.

> > (3) The pattern string should be at least 4 characters.
> > For example, '%AB%' can use B-M-H.
>
> To be precise, the patch checks that the pattern string is at least 4
> *bytes* long. A pattern like E'%\U0001F418%' would benefit too.

*bytes* is precise. I will revise the comment of code.

> If I'm reading the code correctly, it doesn't account for escapes
> correctly. It will use B-M-H for a pattern like '%\\%', even though
> that's just searching for a single backslash and won't benefit from B-M-H.

You are correct. I will fix it.

> > (4) The first and last character of the pattern string should be '%'.
>
> I wonder if we can do better than that. If you have a pattern like
> '%foo%bar', its pretty obvious (to a human) that you can quickly check
> if the string ends in 'bar', and then check if it also contains the
> substring 'foo'. Is there some way to generalize that?

I think the following optimizations are possible.

(1)%foo%bar
 Check if the string ends with 'bar' and search for 'foo' by B-M-H.

(2)foo%bar%
 Check if the string starts with 'foo' and search for 'bar' by B-M-H.

(3)foo%bar%baz
 Check if the string starts with 'foo' and string ends with 'baz' and
 search for 'bar' by B-M-H.

> Looking at MatchText() in like.c, there is this piece of code:
>
> >               else if (*p == '%')
> >               {
> >                       char            firstpat;
> >
> >                       /*
> >                        * % processing is essentially a search for a
text position at
> >                        * which the remainder of the text matches the
remainder of the
> >                        * pattern, using a recursive call to check each
potential match.
> >                        *
> >                        * If there are wildcards immediately following
the %, we can skip
> >                        * over them first, using the idea that any
sequence of N _'s and
> >                        * one or more %'s is equivalent to N _'s and one
% (ie, it will
> >                        * match any sequence of at least N text
characters).  In this way
> >                        * we will always run the recursive search loop
using a pattern
> >                        * fragment that begins with a literal
character-to-match, thereby
> >                        * not recursing more than we have to.
> >                        */
> >                       NextByte(p, plen);
> >
> >                       while (plen > 0)
> >                       {
> >                               if (*p == '%')
> >                                       NextByte(p, plen);
> >                               else if (*p == '_')
> >                               {
> >                                       /* If not enough text left to
match the pattern, ABORT */
> >                                       if (tlen <= 0)
> >                                               return LIKE_ABORT;
> >                                       NextChar(t, tlen);
> >                                       NextByte(p, plen);
> >                               }
> >                               else
> >                                       break;          /* Reached a
non-wildcard pattern char */
> >                       }
>
> Could we use B-M-H to replace that piece of code?

For example, in a pattern such as %foo%bar%, it is possible to first search
for 'foo' by B-M-H, and then search for 'bar' by B-M-H. It would be nice if
such a
process could be generalized to handle various LIKE search patterns.

> How does the performance compare with regular expressions? Would it be
> possible to use this for trivial regular expressions too? Or could we
> speed up the regexp engine to take advantage of B-M-H, and use it for
> LIKE? Or something like that?

I think regular expressions in postgresql is slower than LIKE.
I compared it with the following two SQLs.

(1)LIKE: execution time is about 0.8msec
select count(*) from pg_catalog.pg_description d where d.description like
'%greater%';

(2)regular expression: execution time is about 3.1 msec
select count(*) from pg_catalog.pg_description d where d.description ~
'greater';

For trivial regular expressions, it may be better to use LIKE.

> > I have measured the performance with the following query.
>
> Setting up the B-M-H table adds some initialization overhead, so this
> would be a loss for cases where the LIKE is executed only once, and/or
> the haystack strings are very small. That's probably OK, the overhead is
> probably small, and those cases are probably not performance-critical.
> But would be nice to measure that too.

I tried to measure the case where LIKE is executed only once and
the haystack string are very small.

---------------------------------------------------------------
SET client_min_messages TO notice;

\timing

DO $$
DECLARE
  cnt integer := 0;
  total integer := 0;
BEGIN
  FOR i IN 1..10000 LOOP
    select count(*) into cnt from pg_class where oid = 2662 and relname
like '%cl%';
    total := total + cnt;
  END LOOP;

  RAISE NOTICE 'TOTAL: %', total;
END
$$
;
---------------------------------------------------------------

without patch: 74.499msec
with patch 77.321msec

In this case, the patched version will be a few percent slower, but I think
the overhead is small.

Regards,
Atsushi Ogawa



2022年1月12日(水) 5:17 Heikki Linnakangas <hlinn...@iki.fi>:

> On 11/01/2022 15:55, Atsushi Ogawa wrote:
> > I have created a patch to enable the Boyer-More-Horspool search
> > algorithm (B-M-H) for LIKE queries.
>
> Cool!
>
> > The conditions under which B-M-H can be used are as follows.
> >
> > (1) B-M-H in LIKE search supports only single-byte character sets and
> UTF8.
> > Multibyte character sets does not support, because it may contain another
> > characters in the byte sequence. For UTF-8, it works fine, because in
> > UTF-8 the byte sequence of one character cannot contain another
> character.
>
> You can make it work with any encoding, if you check for that case after
> you find a match. See how text_position() does it.
>
> > (3) The pattern string should be at least 4 characters.
> > For example, '%AB%' can use B-M-H.
>
> To be precise, the patch checks that the pattern string is at least 4
> *bytes* long. A pattern like E'%\U0001F418%' would benefit too.
>
> If I'm reading the code correctly, it doesn't account for escapes
> correctly. It will use B-M-H for a pattern like '%\\%', even though
> that's just searching for a single backslash and won't benefit from B-M-H.
>
> > (4) The first and last character of the pattern string should be '%'.
>
> I wonder if we can do better than that. If you have a pattern like
> '%foo%bar', its pretty obvious (to a human) that you can quickly check
> if the string ends in 'bar', and then check if it also contains the
> substring 'foo'. Is there some way to generalize that?
>
> Looking at MatchText() in like.c, there is this piece of code:
>
> >               else if (*p == '%')
> >               {
> >                       char            firstpat;
> >
> >                       /*
> >                        * % processing is essentially a search for a text
> position at
> >                        * which the remainder of the text matches the
> remainder of the
> >                        * pattern, using a recursive call to check each
> potential match.
> >                        *
> >                        * If there are wildcards immediately following
> the %, we can skip
> >                        * over them first, using the idea that any
> sequence of N _'s and
> >                        * one or more %'s is equivalent to N _'s and one
> % (ie, it will
> >                        * match any sequence of at least N text
> characters).  In this way
> >                        * we will always run the recursive search loop
> using a pattern
> >                        * fragment that begins with a literal
> character-to-match, thereby
> >                        * not recursing more than we have to.
> >                        */
> >                       NextByte(p, plen);
> >
> >                       while (plen > 0)
> >                       {
> >                               if (*p == '%')
> >                                       NextByte(p, plen);
> >                               else if (*p == '_')
> >                               {
> >                                       /* If not enough text left to
> match the pattern, ABORT */
> >                                       if (tlen <= 0)
> >                                               return LIKE_ABORT;
> >                                       NextChar(t, tlen);
> >                                       NextByte(p, plen);
> >                               }
> >                               else
> >                                       break;          /* Reached a
> non-wildcard pattern char */
> >                       }
>
> Could we use B-M-H to replace that piece of code?
>
> How does the performance compare with regular expressions? Would it be
> possible to use this for trivial regular expressions too? Or could we
> speed up the regexp engine to take advantage of B-M-H, and use it for
> LIKE? Or something like that?
>
> > I have measured the performance with the following query.
>
> Setting up the B-M-H table adds some initialization overhead, so this
> would be a loss for cases where the LIKE is executed only once, and/or
> the haystack strings are very small. That's probably OK, the overhead is
> probably small, and those cases are probably not performance-critical.
> But would be nice to measure that too.
>
> - Heikki
>

Re: Boyer-More-Horspool searching LIKE queries

Reply via email to