Thanks for the comments. > > The conditions under which B-M-H can be used are as follows. > > > > (1) B-M-H in LIKE search supports only single-byte character sets and UTF8. > > Multibyte character sets does not support, because it may contain another > > characters in the byte sequence. For UTF-8, it works fine, because in > > UTF-8 the byte sequence of one character cannot contain another character. > > You can make it work with any encoding, if you check for that case after > you find a match. See how text_position() does it.
I saw the text_position(). I would like to do the same. > > (3) The pattern string should be at least 4 characters. > > For example, '%AB%' can use B-M-H. > > To be precise, the patch checks that the pattern string is at least 4 > *bytes* long. A pattern like E'%\U0001F418%' would benefit too. *bytes* is precise. I will revise the comment of code. > If I'm reading the code correctly, it doesn't account for escapes > correctly. It will use B-M-H for a pattern like '%\\%', even though > that's just searching for a single backslash and won't benefit from B-M-H. You are correct. I will fix it. > > (4) The first and last character of the pattern string should be '%'. > > I wonder if we can do better than that. If you have a pattern like > '%foo%bar', its pretty obvious (to a human) that you can quickly check > if the string ends in 'bar', and then check if it also contains the > substring 'foo'. Is there some way to generalize that? I think the following optimizations are possible. (1)%foo%bar Check if the string ends with 'bar' and search for 'foo' by B-M-H. (2)foo%bar% Check if the string starts with 'foo' and search for 'bar' by B-M-H. (3)foo%bar%baz Check if the string starts with 'foo' and string ends with 'baz' and search for 'bar' by B-M-H. > Looking at MatchText() in like.c, there is this piece of code: > > > else if (*p == '%') > > { > > char firstpat; > > > > /* > > * % processing is essentially a search for a text position at > > * which the remainder of the text matches the remainder of the > > * pattern, using a recursive call to check each potential match. > > * > > * If there are wildcards immediately following the %, we can skip > > * over them first, using the idea that any sequence of N _'s and > > * one or more %'s is equivalent to N _'s and one % (ie, it will > > * match any sequence of at least N text characters). In this way > > * we will always run the recursive search loop using a pattern > > * fragment that begins with a literal character-to-match, thereby > > * not recursing more than we have to. > > */ > > NextByte(p, plen); > > > > while (plen > 0) > > { > > if (*p == '%') > > NextByte(p, plen); > > else if (*p == '_') > > { > > /* If not enough text left to match the pattern, ABORT */ > > if (tlen <= 0) > > return LIKE_ABORT; > > NextChar(t, tlen); > > NextByte(p, plen); > > } > > else > > break; /* Reached a non-wildcard pattern char */ > > } > > Could we use B-M-H to replace that piece of code? For example, in a pattern such as %foo%bar%, it is possible to first search for 'foo' by B-M-H, and then search for 'bar' by B-M-H. It would be nice if such a process could be generalized to handle various LIKE search patterns. > How does the performance compare with regular expressions? Would it be > possible to use this for trivial regular expressions too? Or could we > speed up the regexp engine to take advantage of B-M-H, and use it for > LIKE? Or something like that? I think regular expressions in postgresql is slower than LIKE. I compared it with the following two SQLs. (1)LIKE: execution time is about 0.8msec select count(*) from pg_catalog.pg_description d where d.description like '%greater%'; (2)regular expression: execution time is about 3.1 msec select count(*) from pg_catalog.pg_description d where d.description ~ 'greater'; For trivial regular expressions, it may be better to use LIKE. > > I have measured the performance with the following query. > > Setting up the B-M-H table adds some initialization overhead, so this > would be a loss for cases where the LIKE is executed only once, and/or > the haystack strings are very small. That's probably OK, the overhead is > probably small, and those cases are probably not performance-critical. > But would be nice to measure that too. I tried to measure the case where LIKE is executed only once and the haystack string are very small. --------------------------------------------------------------- SET client_min_messages TO notice; \timing DO $$ DECLARE cnt integer := 0; total integer := 0; BEGIN FOR i IN 1..10000 LOOP select count(*) into cnt from pg_class where oid = 2662 and relname like '%cl%'; total := total + cnt; END LOOP; RAISE NOTICE 'TOTAL: %', total; END $$ ; --------------------------------------------------------------- without patch: 74.499msec with patch 77.321msec In this case, the patched version will be a few percent slower, but I think the overhead is small. Regards, Atsushi Ogawa 2022年1月12日(水) 5:17 Heikki Linnakangas <hlinn...@iki.fi>: > On 11/01/2022 15:55, Atsushi Ogawa wrote: > > I have created a patch to enable the Boyer-More-Horspool search > > algorithm (B-M-H) for LIKE queries. > > Cool! > > > The conditions under which B-M-H can be used are as follows. > > > > (1) B-M-H in LIKE search supports only single-byte character sets and > UTF8. > > Multibyte character sets does not support, because it may contain another > > characters in the byte sequence. For UTF-8, it works fine, because in > > UTF-8 the byte sequence of one character cannot contain another > character. > > You can make it work with any encoding, if you check for that case after > you find a match. See how text_position() does it. > > > (3) The pattern string should be at least 4 characters. > > For example, '%AB%' can use B-M-H. > > To be precise, the patch checks that the pattern string is at least 4 > *bytes* long. A pattern like E'%\U0001F418%' would benefit too. > > If I'm reading the code correctly, it doesn't account for escapes > correctly. It will use B-M-H for a pattern like '%\\%', even though > that's just searching for a single backslash and won't benefit from B-M-H. > > > (4) The first and last character of the pattern string should be '%'. > > I wonder if we can do better than that. If you have a pattern like > '%foo%bar', its pretty obvious (to a human) that you can quickly check > if the string ends in 'bar', and then check if it also contains the > substring 'foo'. Is there some way to generalize that? > > Looking at MatchText() in like.c, there is this piece of code: > > > else if (*p == '%') > > { > > char firstpat; > > > > /* > > * % processing is essentially a search for a text > position at > > * which the remainder of the text matches the > remainder of the > > * pattern, using a recursive call to check each > potential match. > > * > > * If there are wildcards immediately following > the %, we can skip > > * over them first, using the idea that any > sequence of N _'s and > > * one or more %'s is equivalent to N _'s and one > % (ie, it will > > * match any sequence of at least N text > characters). In this way > > * we will always run the recursive search loop > using a pattern > > * fragment that begins with a literal > character-to-match, thereby > > * not recursing more than we have to. > > */ > > NextByte(p, plen); > > > > while (plen > 0) > > { > > if (*p == '%') > > NextByte(p, plen); > > else if (*p == '_') > > { > > /* If not enough text left to > match the pattern, ABORT */ > > if (tlen <= 0) > > return LIKE_ABORT; > > NextChar(t, tlen); > > NextByte(p, plen); > > } > > else > > break; /* Reached a > non-wildcard pattern char */ > > } > > Could we use B-M-H to replace that piece of code? > > How does the performance compare with regular expressions? Would it be > possible to use this for trivial regular expressions too? Or could we > speed up the regexp engine to take advantage of B-M-H, and use it for > LIKE? Or something like that? > > > I have measured the performance with the following query. > > Setting up the B-M-H table adds some initialization overhead, so this > would be a loss for cases where the LIKE is executed only once, and/or > the haystack strings are very small. That's probably OK, the overhead is > probably small, and those cases are probably not performance-critical. > But would be nice to measure that too. > > - Heikki >