Re: [HACKERS] WIP: index support for regexp search

Heikki Linnakangas Mon, 26 Nov 2012 04:57:13 -0800

On 25.11.2012 22:55, Alexander Korotkov wrote:

On Tue, Nov 20, 2012 at 1:43 PM, Heikki Linnakangas<hlinnakan...@vmware.com

wrote:

Glad to see this patch hasn't been totally forgotten. Being able to use
indexes for regular expressions would be really cool!

Back in January, I asked for some high-level description of how the
algorithm works (http://archives.postgresql.**
org/message-id/4F187D5C.30701@**enterprisedb.com<http://archives.postgresql.org/message-id/4f187d5c.30...@enterprisedb.com>).
That's still sorely needed. Googling around, I found the slides for your
presentation on this from PGConf.EU - it would be great to have the
information from that presentation included in the patch.



New version of patch is attached. The changes are following:
1) A big comment with high-level description of what is going on.
2) Regression tests.
3) Documetation update.
4) Some more refactoring.


Great, that top-level comment helped tremendously! I feel enlightened.

I fixed some spelling, formatting etc. trivial stuff while readingthrough the patch, see attached. Below is some feedback on the details:

* I don't like the PG_TRY/CATCH trick. It's not generally safe to catchan error, without propagating it further or rolling back the whole(sub)transation. It might work in this case, as you're only suppressingerrors with the special sqlcode that are used in the same file, but itnevertheless feels naughty. I believe none of the limits that are beingchecked are strict; it's OK to exceed the limits somewhat, as long asyou terminate the processing in a reasonable time, in case ofpathological input. I'd suggest putting an explicit check for the limitssomewhere, and not rely on ereport(). Something like this, in the codethat recurses:


if (trgmCNFA->arcsCount > MAX_RESULT_ARCS ||
    hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
{
        trgmCNFA->overflowed = true;
        return;
}

And then check for the overflowed flag at the top level.

* This part of the high-level comment was not clear to me:

 * States of the graph produced in the first stage are marked with "keys". Key 
is a pair
 * of a "prefix" and a state of the original automaton. "Prefix" is a last
 * characters. So, knowing the prefix is enough to know what is a trigram when 
we read some new
 * character. However, we can know single character of prefix or don't know any
 * characters of it. Each state of resulting graph have an "enter key" (with 
that
 * key we've entered this state) and a set of keys which are reachable without
 * reading any predictable trigram. The algorithm of processing each state
 * of resulting graph are so:
 * 1) Add all keys which achievable without reading of any predictable trigram.
 * 2) Add outgoing arcs labeled with trigrams.
 * Step 2 leads to creation of new states and recursively processing them. So,
 * we use depth-first algorithm.

I didn't understand that. Can you elaborate? It might help to workthrough an example, with some ascii art depicting the graph.

* It would be nice to add some comments to TrgmCNFA struct, explainingwhich fields are valid at which stages. For example, it seems that'trgms' array is calculated only after building the CNFA, bygetTrgmVector() function, while arcsCount is updated on the fly, whilerecursing in the getState() function.


* What is the representation used for the path matrix? Needs a comment.

* What do the getColorinfo() and scanColorMap() functions do? Whatexactly does a color represent? What's the tradeoff in choosingMAX_COLOR_CHARS?


- Heikki

diff --git a/contrib/pg_trgm/Makefile b/contrib/pg_trgm/Makefile
index 64fd69f..8033733 100644
--- a/contrib/pg_trgm/Makefile
+++ b/contrib/pg_trgm/Makefile
@@ -1,7 +1,7 @@
 # contrib/pg_trgm/Makefile
 
 MODULE_big = pg_trgm
-OBJS = trgm_op.o trgm_gist.o trgm_gin.o
+OBJS = trgm_op.o trgm_gist.o trgm_gin.o trgm_regexp.o
 
 EXTENSION = pg_trgm
 DATA = pg_trgm--1.0.sql pg_trgm--unpackaged--1.0.sql
diff --git a/contrib/pg_trgm/expected/pg_trgm.out b/contrib/pg_trgm/expected/pg_trgm.out
index 81d0ca8..ee0131f 100644
--- a/contrib/pg_trgm/expected/pg_trgm.out
+++ b/contrib/pg_trgm/expected/pg_trgm.out
@@ -54,7 +54,7 @@ select similarity('wow',' WOW ');
 (1 row)
 
 CREATE TABLE test_trgm(t text);
-\copy test_trgm from 'data/trgm.data
+\copy test_trgm from 'data/trgm.data'
 select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % 'qwertyu0988' order by sml desc, t;
       t      |   sml    
 -------------+----------
@@ -3515,6 +3515,47 @@ select * from test2 where t ilike 'qua%';
  quark
 (1 row)
 
+select * from test2 where t ~ '[abc]{3}';
+   t    
+--------
+ abcdef
+(1 row)
+
+select * from test2 where t ~ 'a[bc]+d';
+   t    
+--------
+ abcdef
+(1 row)
+
+select * from test2 where t ~* 'DEF';
+   t    
+--------
+ abcdef
+(1 row)
+
+select * from test2 where t ~ 'dEf';
+ t 
+---
+(0 rows)
+
+select * from test2 where t ~* '^q';
+   t   
+-------
+ quark
+(1 row)
+
+select * from test2 where t ~* '[abc]{3}[def]{3}';
+   t    
+--------
+ abcdef
+(1 row)
+
+select * from test2 where t ~ 'q.*rk$';
+   t   
+-------
+ quark
+(1 row)
+
 drop index test2_idx_gin;
 create index test2_idx_gist on test2 using gist (t gist_trgm_ops);
 set enable_seqscan=off;
diff --git a/contrib/pg_trgm/pg_trgm--1.0.sql b/contrib/pg_trgm/pg_trgm--1.0.sql
index 8067bd6..ca9bcaa 100644
--- a/contrib/pg_trgm/pg_trgm--1.0.sql
+++ b/contrib/pg_trgm/pg_trgm--1.0.sql
@@ -163,4 +163,6 @@ AS
 
 ALTER OPERATOR FAMILY gin_trgm_ops USING gin ADD
         OPERATOR        3       pg_catalog.~~ (text, text),
-        OPERATOR        4       pg_catalog.~~* (text, text);
+        OPERATOR        4       pg_catalog.~~* (text, text),
+        OPERATOR        5       pg_catalog.~ (text, text),
+        OPERATOR        6       pg_catalog.~* (text, text);
diff --git a/contrib/pg_trgm/sql/pg_trgm.sql b/contrib/pg_trgm/sql/pg_trgm.sql
index 81ab1e7..7d8a151 100644
--- a/contrib/pg_trgm/sql/pg_trgm.sql
+++ b/contrib/pg_trgm/sql/pg_trgm.sql
@@ -13,7 +13,7 @@ select similarity('wow',' WOW ');
 
 CREATE TABLE test_trgm(t text);
 
-\copy test_trgm from 'data/trgm.data
+\copy test_trgm from 'data/trgm.data'
 
 select t,similarity(t,'qwertyu0988') as sml from test_trgm where t % 'qwertyu0988' order by sml desc, t;
 select t,similarity(t,'gwertyu0988') as sml from test_trgm where t % 'gwertyu0988' order by sml desc, t;
@@ -52,6 +52,14 @@ select * from test2 where t like '%bcd%';
 select * from test2 where t like E'%\\bcd%';
 select * from test2 where t ilike '%BCD%';
 select * from test2 where t ilike 'qua%';
+
+select * from test2 where t ~ '[abc]{3}';
+select * from test2 where t ~ 'a[bc]+d';
+select * from test2 where t ~* 'DEF';
+select * from test2 where t ~ 'dEf';
+select * from test2 where t ~* '^q';
+select * from test2 where t ~* '[abc]{3}[def]{3}';
+select * from test2 where t ~ 'q.*rk$';
 drop index test2_idx_gin;
 create index test2_idx_gist on test2 using gist (t gist_trgm_ops);
 set enable_seqscan=off;
diff --git a/contrib/pg_trgm/trgm.h b/contrib/pg_trgm/trgm.h
index 067f29d..6ec9345 100644
--- a/contrib/pg_trgm/trgm.h
+++ b/contrib/pg_trgm/trgm.h
@@ -7,7 +7,6 @@
 #include "access/gist.h"
 #include "access/itup.h"
 #include "storage/bufpage.h"
-#include "utils/builtins.h"
 
 /* options */
 #define LPADDING		2
@@ -28,6 +27,8 @@
 #define DistanceStrategyNumber		2
 #define LikeStrategyNumber			3
 #define ILikeStrategyNumber			4
+#define RegExpStrategyNumber		5
+#define RegExpStrategyNumberICase	6
 
 
 typedef char trgm[3];
@@ -46,8 +47,10 @@ uint32		trgm2int(trgm *ptr);
 
 #ifdef KEEPONLYALNUM
 #define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && (isalnum( *(unsigned char*)(a) ) || *(unsigned char*)(a)==' ') )
+#define ISWORDCHR(c)	(t_isalpha(c) || t_isdigit(c))
 #else
 #define ISPRINTABLECHAR(a)	( isascii( *(unsigned char*)(a) ) && isprint( *(unsigned char*)(a) ) )
+#define ISWORDCHR(c)	(!t_isspace(c))
 #endif
 #define ISPRINTABLETRGM(t)	( ISPRINTABLECHAR( ((char*)(t)) ) && ISPRINTABLECHAR( ((char*)(t))+1 ) && ISPRINTABLECHAR( ((char*)(t))+2 ) )
 
@@ -99,11 +102,21 @@ typedef char *BITVECP;
 #define GETARR(x)		( (trgm*)( (char*)x+TRGMHDRSIZE ) )
 #define ARRNELEM(x) ( ( VARSIZE(x) - TRGMHDRSIZE )/sizeof(trgm) )
 
+typedef struct
+{
+	int count;
+	char data[0];
+} PackedTrgmPaths;
+
 extern float4 trgm_limit;
 
+#define ERRCODE_TRGM_REGEX_TOO_COMPLEX MAKE_SQLSTATE('T','M','0','0','0')
+
 TRGM	   *generate_trgm(char *str, int slen);
 TRGM	   *generate_wildcard_trgm(const char *str, int slen);
 float4		cnt_sml(TRGM *trg1, TRGM *trg2);
 bool		trgm_contained_by(TRGM *trg1, TRGM *trg2);
+void		cnt_trigram(trgm *trgmptr, char *str, int bytelen);
+TRGM	   *createTrgmCNFA(text *text_re, MemoryContext context, PackedTrgmPaths **paths);
 
 #endif   /* __TRGM_H__ */
diff --git a/contrib/pg_trgm/trgm_gin.c b/contrib/pg_trgm/trgm_gin.c
index 114fb78..9f53644 100644
--- a/contrib/pg_trgm/trgm_gin.c
+++ b/contrib/pg_trgm/trgm_gin.c
@@ -75,19 +75,20 @@ gin_extract_value_trgm(PG_FUNCTION_ARGS)
 Datum
 gin_extract_query_trgm(PG_FUNCTION_ARGS)
 {
-	text	   *val = (text *) PG_GETARG_TEXT_P(0);
-	int32	   *nentries = (int32 *) PG_GETARG_POINTER(1);
-	StrategyNumber strategy = PG_GETARG_UINT16(2);
+	text			*val = (text *) PG_GETARG_TEXT_P(0);
+	int32			*nentries = (int32 *) PG_GETARG_POINTER(1);
+	StrategyNumber	 strategy = PG_GETARG_UINT16(2);
 
-	/* bool   **pmatch = (bool **) PG_GETARG_POINTER(3); */
-	/* Pointer	  *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
-	/* bool   **nullFlags = (bool **) PG_GETARG_POINTER(5); */
-	int32	   *searchMode = (int32 *) PG_GETARG_POINTER(6);
-	Datum	   *entries = NULL;
-	TRGM	   *trg;
-	int32		trglen;
-	trgm	   *ptr;
-	int32		i;
+	/* bool        **pmatch = (bool **) PG_GETARG_POINTER(3); */
+	Pointer		   **extra_data = (Pointer **) PG_GETARG_POINTER(4);
+	/* bool        **nullFlags = (bool **) PG_GETARG_POINTER(5); */
+	int32			*searchMode = (int32 *) PG_GETARG_POINTER(6);
+	Datum			*entries = NULL;
+	TRGM			*trg;
+	int32			 trglen;
+	trgm			*ptr;
+	int32			 i;
+	PackedTrgmPaths	*paths;
 
 	switch (strategy)
 	{
@@ -107,6 +108,32 @@ gin_extract_query_trgm(PG_FUNCTION_ARGS)
 			 */
 			trg = generate_wildcard_trgm(VARDATA(val), VARSIZE(val) - VARHDRSZ);
 			break;
+		case RegExpStrategyNumberICase:
+#ifndef IGNORECASE
+			elog(ERROR, "cannot handle ~* with case-sensitive trigrams");
+#endif
+			/* FALL THRU */
+		case RegExpStrategyNumber:
+			trg = createTrgmCNFA(val, fcinfo->flinfo->fn_mcxt, &paths);
+			if (trg && ARRNELEM(trg) > 0)
+			{
+				/*
+				 * Successful of regex processing: store path matrix as an
+				 * extra_data.
+				 */
+				*extra_data = (Pointer *)palloc0(sizeof(Pointer) *
+																ARRNELEM(trg));
+				for (i = 0; i < ARRNELEM(trg); i++)
+					(*extra_data)[i] = (Pointer)paths;
+			}
+			else
+			{
+				/* No result: have to do full index scan. */
+				*nentries = 0;
+				*searchMode = GIN_SEARCH_MODE_ALL;
+				PG_RETURN_POINTER(entries);
+			}
+			break;
 		default:
 			elog(ERROR, "unrecognized strategy number: %d", strategy);
 			trg = NULL;			/* keep compiler quiet */
@@ -147,11 +174,15 @@ gin_trgm_consistent(PG_FUNCTION_ARGS)
 	/* text    *query = PG_GETARG_TEXT_P(2); */
 	int32		nkeys = PG_GETARG_INT32(3);
 
-	/* Pointer	  *extra_data = (Pointer *) PG_GETARG_POINTER(4); */
+	Pointer	   *extra_data = (Pointer *) PG_GETARG_POINTER(4);
 	bool	   *recheck = (bool *) PG_GETARG_POINTER(5);
-	bool		res;
-	int32		i,
+	bool		res, f;
+	int32		i, j,
 				ntrue;
+	PackedTrgmPaths *paths;
+	char	   *path;
+	int			pathCount, bitmaskLength = (nkeys + 7) / 8;
+
 
 	/* All cases served by this function are inexact */
 	*recheck = true;
@@ -189,6 +220,37 @@ gin_trgm_consistent(PG_FUNCTION_ARGS)
 				}
 			}
 			break;
+		case RegExpStrategyNumber:
+		case RegExpStrategyNumberICase:
+			if (nkeys < 1)
+			{
+				/* Regex processing gives no result: do full index scan */
+				res = true;
+				break;
+			}
+			/* Try to find path conforming this set of trigrams */
+			paths = (PackedTrgmPaths *)extra_data[0];
+			pathCount = paths->count;
+			res = false;
+			for (i = 0; i < pathCount; i++)
+			{
+				path = &paths->data[bitmaskLength * i];
+				f = true;
+				for (j = 0; j < nkeys; j++)
+				{
+					if (GETBIT(path, j) && !check[j])
+					{
+						f = false;
+						break;
+					}
+				}
+				if (f)
+				{
+					res = true;
+					break;
+				}
+			}
+			break;
 		default:
 			elog(ERROR, "unrecognized strategy number: %d", strategy);
 			res = false;		/* keep compiler quiet */
diff --git a/contrib/pg_trgm/trgm_op.c b/contrib/pg_trgm/trgm_op.c
index 87dffd1..71aa938 100644
--- a/contrib/pg_trgm/trgm_op.c
+++ b/contrib/pg_trgm/trgm_op.c
@@ -77,12 +77,6 @@ unique_array(trgm *a, int len)
 	return curend + 1 - a;
 }
 
-#ifdef KEEPONLYALNUM
-#define iswordchr(c)	(t_isalpha(c) || t_isdigit(c))
-#else
-#define iswordchr(c)	(!t_isspace(c))
-#endif
-
 /*
  * Finds first word in string, returns pointer to the word,
  * endword points to the character after word
@@ -92,7 +86,7 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
 {
 	char	   *beginword = str;
 
-	while (beginword - str < lenstr && !iswordchr(beginword))
+	while (beginword - str < lenstr && !ISWORDCHR(beginword))
 		beginword += pg_mblen(beginword);
 
 	if (beginword - str >= lenstr)
@@ -100,7 +94,7 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
 
 	*endword = beginword;
 	*charlen = 0;
-	while (*endword - str < lenstr && iswordchr(*endword))
+	while (*endword - str < lenstr && ISWORDCHR(*endword))
 	{
 		*endword += pg_mblen(*endword);
 		(*charlen)++;
@@ -109,8 +103,7 @@ find_word(char *str, int lenstr, char **endword, int *charlen)
 	return beginword;
 }
 
-#ifdef USE_WIDE_UPPER_LOWER
-static void
+void
 cnt_trigram(trgm *tptr, char *str, int bytelen)
 {
 	if (bytelen == 3)
@@ -131,7 +124,6 @@ cnt_trigram(trgm *tptr, char *str, int bytelen)
 		CPTRGM(tptr, &crc);
 	}
 }
-#endif
 
 /*
  * Adds trigrams from words (already padded).
@@ -287,7 +279,7 @@ get_wildcard_part(const char *str, int lenstr,
 	{
 		if (in_escape)
 		{
-			if (iswordchr(beginword))
+			if (ISWORDCHR(beginword))
 				break;
 			in_escape = false;
 			in_leading_wildcard_meta = false;
@@ -298,7 +290,7 @@ get_wildcard_part(const char *str, int lenstr,
 				in_escape = true;
 			else if (ISWILDCARDCHAR(beginword))
 				in_leading_wildcard_meta = true;
-			else if (iswordchr(beginword))
+			else if (ISWORDCHR(beginword))
 				break;
 			else
 				in_leading_wildcard_meta = false;
@@ -341,7 +333,7 @@ get_wildcard_part(const char *str, int lenstr,
 		clen = pg_mblen(endword);
 		if (in_escape)
 		{
-			if (iswordchr(endword))
+			if (ISWORDCHR(endword))
 			{
 				memcpy(s, endword, clen);
 				(*charlen)++;
@@ -369,7 +361,7 @@ get_wildcard_part(const char *str, int lenstr,
 				in_trailing_wildcard_meta = true;
 				break;
 			}
-			else if (iswordchr(endword))
+			else if (ISWORDCHR(endword))
 			{
 				memcpy(s, endword, clen);
 				(*charlen)++;
diff --git a/contrib/pg_trgm/trgm_regexp.c b/contrib/pg_trgm/trgm_regexp.c
new file mode 100644
index 0000000..be7d846
--- /dev/null
+++ b/contrib/pg_trgm/trgm_regexp.c
@@ -0,0 +1,1209 @@
+/*
+ * contrib/pg_trgm/trgm_regexp.c - regular expression matching using trigrams
+ *
+ * The general idea of index support for a regular expression (regex) search
+ * is to transform regex to a logical expression on trigrams. For example:
+ *
+ *   (ab|cd)efg => ((abe & bef) | (cde & def)) & efg
+ *
+ * If a string matches the regex, then it must match the logical expression of
+ * trigrams. The opposite is not necessary true, however: a string that matches
+ * the logical expression might not match the original regex. Such false
+ * positives are removed during recheck.
+ *
+ * The algorithm to convert a regex to a logical expression is based on
+ * analysis of an automaton that corresponds to regex. The algorithm consists
+ * of two stages:
+ * 1) Transform the automaton to an automaton-like graph of trigrams.
+ * 2) Collect all minimal paths of that graph into a matrix.
+ *
+ * Automaton we have after processing a regular expression is a graph where
+ * vertices are "states" and arcs are labeled with characters. There are
+ * two special states: "initial" and "final". If you can traverse from the
+ * initial state to the final state, and type given string by arc labels then
+ * the string matches the regular expression.
+ *
+ * We use CNFA (colored non-deterministic finite-state automaton) produced by
+ * the PostgreSQL regex library. CNFA means that:
+ * 1) Characters are grouped into colors, so arcs are labeled with colors.
+ * 2) Multiple outgoing arcs from same state can be labeled with the same color.
+ *    This makes the automaton non-deterministic, because it can be in many
+ *    states simultaneously.
+ * 3) It has finite number of states (actually infinite-state automata are
+ *    almost never considered).
+ *
+ * As result of the first stage we have a CNFA-like graph with the following
+ * property: If you can traverse from the initial state to the final state, via
+ * arcs labeled with trigrams that are present in the string, then the string
+ * might match the regex. Otherwise, it does not. Actually, this graph is a
+ * form of representation of logical expression we need.
+ *
+ * States of the graph produced in the first stage are marked with "keys". Key is a pair
+ * of a "prefix" and a state of the original automaton. "Prefix" is a last
+ * characters. So, knowing the prefix is enough to know what is a trigram when we read some new
+ * character. However, we can know single character of prefix or don't know any
+ * characters of it. Each state of resulting graph have an "enter key" (with that
+ * key we've entered this state) and a set of keys which are reachable without
+ * reading any predictable trigram. The algorithm of processing each state
+ * of resulting graph are so:
+ * 1) Add all keys which achievable without reading of any predictable trigram.
+ * 2) Add outgoing arcs labeled with trigrams.
+ * Step 2 leads to creation of new states and recursively processing them. So,
+ * we use depth-first algorithm.
+ *
+ * At the second stage we collect all the paths in graph from first stage. We
+ * only need "minimal" paths. For example, if we have two paths "abc & bcd" and
+ * "abc & bcd & cde" then then second one doesn't matter because the first one
+ * includes all the strings which the second one does. In order to evade
+ * enumeration of too many non-minimal paths we use breadth-first search with
+ * keeping matrix of minimal paths in each state.
+ */
+#include "postgres.h"
+
+#include "trgm.h"
+
+#include "catalog/pg_collation.h"
+#include "fmgr.h"
+#include "miscadmin.h"
+#include "mb/pg_wchar.h"
+#include "nodes/pg_list.h"
+#include "regex/regex.h"
+#undef INFINITY /* avoid conflict of INFINITY definition */
+#include "regex/regguts.h"
+#include "tsearch/ts_locale.h"
+#include "utils/hsearch.h"
+
+/*
+ * Uncomment to print intermediate stages, for exploring and debugging the
+ * algorithm implementation.
+ */
+/* #define TRGM_REGEXP_DEBUG */
+
+/* How big colors we're considering as separate arcs */
+#define MAX_COLOR_CHARS		4
+
+/*---
+ * Following group of parameters are used in order to limit our computations.
+ * Otherwise regex processing could be too slow and memory-consuming.
+ *
+ *  MAX_RESULT_STATES - How many states we allow in result CNFA-like graph
+ *  MAX_RESULT_ARCS - How many arcs we allow in result CNFA-like graph
+ *  MAX_RESULT_PATHS - How many paths we allow in single path matrix
+ */
+#define MAX_RESULT_STATES	128
+#define MAX_RESULT_ARCS		1024
+#define MAX_RESULT_PATHS	256
+
+/*
+ * Widechar trigram datatype for holding trigram before possible conversion into
+ * CRC32
+ */
+typedef pg_wchar Trgm[3];
+
+/*
+ * Maximum length of multibyte encoding character is 4. So, we can hold it in
+ * 32 bit integer for handling simplicity.
+ */
+typedef uint32 mb_char;
+
+/*----
+ * Attributes of CNFA colors:
+ *
+ *  containAlpha - flag indicates if color might contain alphanumeric characters
+ *                 (which are extracted into trigrams)
+ *  charsCount   - count of characters in color
+ *  chars        - characters of color
+ *
+ * All non alphanumeric character are treated as same zero character, because
+ * there are no difference between them for trigrams. Exact value of
+ * "charsCount" and value of "chars" is meaningful only when
+ * charsCount <= MAX_COLOR_CHARS. When charsCount > MAX_COLOR_CHARS we can
+ * expect any characters from this color.
+ */
+typedef struct
+{
+	int			charsCount;
+	bool		containAlpha;
+	mb_char		chars[MAX_COLOR_CHARS];
+} ColorInfo;
+
+/*
+ * Prefix is information about last two read characters when coming into
+ * specific CNFA state. "s" contain that characters itself. But s[0] or both
+ * s[0] and s[1] could be zeros. "ambiguity" flag tells us how to treat that.
+ * If "ambiguity" is false then zeros in s indicates start of trigram. When
+ * "ambiguity" is true then zeros in s indicates that it could be any
+ * characters. Having "ambiguity" flag this structure actually express a class
+ * of prefixes.
+ */
+typedef struct
+{
+	mb_char	s[2];
+	bool	ambiguity;
+} TrgmPrefix;
+
+/*
+ * "Key" of resulting state: pair of prefix and source CNFA state.
+ */
+typedef struct
+{
+	TrgmPrefix prefix;
+	int        nstate;
+} TrgmStateKey;
+
+/*---
+ * State of resulting graph.
+ *
+ *  enterKey - a key with which we can enter this state
+ *  keys     - all keys achievable without reading any predictable trigram
+ *  arcs     - outgoing arcs of this state
+ *  fin      - flag indicated this state if final
+ *  queued   - flag indicated this states is queued in path matrix collection
+ *             algorithm
+ */
+typedef struct
+{
+	TrgmStateKey	enterKey;
+	List		   *keys;
+	List		   *arcs;
+	List		   *path;
+	bool			fin;
+	bool			queued;
+} TrgmState;
+
+/*
+ * Arc of trigram CNFA-like structure. Arc is labeled with trigram.
+ */
+typedef struct
+{
+	TrgmState  *target;
+	Trgm		trgm;
+} TrgmArc;
+
+/*---
+ * A single path is a path matrix of some state.
+ *
+ *  queued - flag indicated that this path is new since last processing of this
+ *           state and it's queued for processing.
+ *  path   - bit array representing a path itself.
+ */
+typedef struct
+{
+	bool queued;
+	char path[0];
+} TrgmStatePath;
+
+/*---
+ * Data structure representing all the data we need during regex processing.
+ *
+ *  states     - hash of states of resulting graph
+ *  cnfa       - source CFNA of regex
+ *  colorInfo  - processed information of regex colors
+ *  initState  - pointer to initial state of resulting graph
+ *  trgms      - array of all trigrams presented in graph
+ *  trgmCount  - count of that trigrams
+ *  arcsCount  - total number of arcs of resulting graph (for resouces limiting)
+ *  bitmaskLen - length of bitmask representing single path in path matrix
+ *  path       - resulting path matrix
+ *  queue      - queue for path matrix producing
+ */
+typedef struct
+{
+	HTAB	       *states;
+	struct cnfa    *cnfa;
+	ColorInfo	   *colorInfo;
+	TrgmState	   *initState;
+	Trgm		   *trgms;
+	int				trgmCount;
+	int				arcsCount;
+	int				bitmaskLen;
+	List		   *path;
+	List		   *queue;
+} TrgmCNFA;
+
+static TrgmState *getState(TrgmCNFA *trgmCNFA, TrgmStateKey *key);
+
+/*
+ * Convert pg_wchar to multibyte character.
+ */
+static mb_char
+convertPgWchar(pg_wchar c)
+{
+	/*
+	 * "s" has enough of space for, at maximum 4 byte multibyte character and
+	 * a zero-byte at the end.
+	 */
+	char		s[5];
+	char	   *lowerCased;
+	mb_char		result;
+
+	if (c == 0)
+		return 0;
+
+	MemSet(s, 0, sizeof(s));
+	pg_wchar2mb_with_len(&c, s, 1);
+
+	/* Convert to lowercase if needed */
+#ifdef IGNORECASE
+	lowerCased = lowerstr(s);
+#else
+	lowerCased = s;
+#endif
+	strncpy((char *)&result, lowerCased, 4);
+	return result;
+}
+
+/*
+ * Recursive function of colormap scanning.
+ */
+static void
+scanColorMap(union tree	tree[NBYTS], union tree *t, ColorInfo *colorInfos,
+			 int level, pg_wchar p)
+{
+	int			i;
+
+	check_stack_depth();
+
+	if (level < NBYTS - 1)
+	{
+		for (i = 0; i < BYTTAB; i++)
+		{
+			/*
+			 * This condition checks if all underlying levels express zero
+			 * color. Zero color uses multiple links to same color table. So,
+			 * avoid scanning it because it's expensive.
+			 */
+			if (t->tptr[i] == &tree[level + 1])
+				continue;
+			/* Recursive scanning of next level color table */
+			scanColorMap(tree, t->tptr[i], colorInfos, level + 1, (p << 8) | i);
+		}
+	}
+	else
+	{
+		p <<= 8;
+		for (i = 0; i < BYTTAB; i++)
+		{
+			ColorInfo *colorInfo = &colorInfos[t->tcolor[i]];
+			int j;
+			pg_wchar c;
+
+			/* Convert to multibyte character */
+			c = convertPgWchar(p | i);
+
+			/* Update color attributes according to next character */
+			if (ISWORDCHR((char *)&c))
+				colorInfo->containAlpha = true;
+			else
+				c = 0;
+			if (colorInfo->charsCount <= MAX_COLOR_CHARS)
+			{
+				bool found = false;
+				for (j = 0; j < colorInfo->charsCount; j++)
+				{
+					if (colorInfo->chars[j] == c)
+					{
+						found = true;
+						break;
+					}
+				}
+				if (found)
+					continue;
+			}
+			if (colorInfo->charsCount < MAX_COLOR_CHARS)
+				colorInfo->chars[colorInfo->charsCount] = c;
+			colorInfo->charsCount++;
+		}
+	}
+}
+
+/*
+ * Obtain attributes of colors.
+ */
+static ColorInfo *
+getColorInfo(regex_t *regex)
+{
+	struct guts *g;
+	struct colormap *cm;
+	ColorInfo  *result;
+	int			colorsCount;
+
+	g = (struct guts *) regex->re_guts;
+	cm = &g->cmap;
+	colorsCount = cm->max + 1;
+
+	result = (ColorInfo *) palloc0(colorsCount * sizeof(ColorInfo));
+
+	/*
+	 * Zero color is a default color which contains all character which aren't
+	 * in explicitly expressed classes. Mark that we can expect everything
+	 * from it.
+	 */
+	result[0].containAlpha = true;
+	result[0].charsCount = MAX_COLOR_CHARS + 1;
+	scanColorMap(cm->tree, &cm->tree[0], result, 0, 0);
+
+#ifdef TRGM_REGEXP_DEBUG
+	{
+		int i;
+		for (i = 0; i < colorsCount; i++)
+		{
+			ColorInfo *colorInfo = &result[i];
+			elog(NOTICE, "COLOR %d %d %d %08X %08X %08X %08X", i,
+				colorInfo->charsCount, colorInfo->containAlpha,
+				colorInfo->chars[0], colorInfo->chars[1],
+				colorInfo->chars[2], colorInfo->chars[3]);
+		}
+	}
+#endif
+	return result;
+}
+
+/*
+ * Check if prefix1 "contains" prefix2. "contains" mean that any exact prefix
+ * (which no ambiguity) which satisfy to prefix2 also satisfy to prefix1.
+ */
+static bool
+prefixContains(TrgmPrefix *prefix1, TrgmPrefix *prefix2)
+{
+	if (prefix1->ambiguity)
+	{
+		if (prefix1->s[1] == 0)
+		{
+			/* Fully ambiguous prefix contains everything */
+			return true;
+		}
+		else
+		{
+			/*
+			 * Prefix with only first ambiguous characters contains every prefix
+			 * with same second character.
+			 */
+			if (prefix1->s[1] == prefix2->s[1])
+				return true;
+			else
+				return false;
+		}
+	}
+	else
+	{
+		/* Exact prefix contains only exactly same prefix */
+		if (prefix1->s[0] == prefix2->s[0] && prefix1->s[1] == prefix2->s[1]
+			&& !prefix2->ambiguity)
+			return true;
+		else
+			return false;
+	}
+}
+
+/*
+ * Add all keys which can be achieved without reading any trigram to state of
+ * CNFA-like graph on trigrams.
+ */
+static void
+addKeys(TrgmCNFA *trgmCNFA, TrgmState *state, TrgmStateKey *key)
+{
+	struct carc *s;
+	TrgmStateKey destKey;
+	ListCell *cell, *prev, *next;
+	TrgmStateKey *keyCopy;
+
+	MemSet(&destKey, 0, sizeof(TrgmStateKey));
+
+	/* Adjust list of keys with new one */
+	prev = NULL;
+	cell = list_head(state->keys);
+	while (cell)
+	{
+		TrgmStateKey *listKey = (TrgmStateKey *) lfirst(cell);
+		next = lnext(cell);
+		if (listKey->nstate == key->nstate)
+		{
+			if (prefixContains(&listKey->prefix, &key->prefix))
+			{
+				/* Already had this key: nothing to do */
+				return;
+			}
+			if (prefixContains(&key->prefix, &listKey->prefix))
+				state->keys = list_delete_cell(state->keys, cell, prev);
+			else
+				prev = cell;
+		}
+		else
+			prev = cell;
+		cell = next;
+	}
+	keyCopy = (TrgmStateKey *) palloc(sizeof(TrgmStateKey));
+	memcpy(keyCopy, key, sizeof(TrgmStateKey));
+	state->keys = lappend(state->keys, keyCopy);
+
+	/* Mark final state */
+	if (key->nstate == trgmCNFA->cnfa->post)
+	{
+		state->fin = true;
+		return;
+	}
+
+	s = trgmCNFA->cnfa->states[key->nstate];
+	while (s->co != COLORLESS)
+	{
+		ColorInfo *colorInfo;
+
+		if (s->co == trgmCNFA->cnfa->bos[1])
+		{
+			/* Start of line (^) */
+			destKey.nstate = s->to;
+
+			/* Mark prefix as start of new trigram */
+			destKey.prefix.s[0] = 0;
+			destKey.prefix.s[1] = 0;
+			destKey.prefix.ambiguity = false;
+
+			/* Add key to this state */
+			addKeys(trgmCNFA, state, &destKey);
+			if (state->fin)
+				return;
+		}
+		else if (s->co == trgmCNFA->cnfa->eos[1])
+		{
+			/* End of string ($) */
+			if (key->prefix.ambiguity)
+			{
+				destKey.nstate = s->to;
+
+				/*
+				 * Let's think prefix to become ambiguous (in order to prevent
+				 * latter fiddling around with keys).
+				 */
+				destKey.prefix.s[1] = 0;
+				destKey.prefix.s[0] = 0;
+				destKey.prefix.ambiguity = true;
+
+				/* Prefix is ambiguous, add key to the same state */
+				addKeys(trgmCNFA, state, &destKey);
+				if (state->fin)
+					return;
+			}
+		}
+		else
+		{
+			/* Regular color */
+			colorInfo = &trgmCNFA->colorInfo[s->co];
+
+			if (colorInfo->charsCount > 0 &&
+				colorInfo->charsCount <= MAX_COLOR_CHARS)
+			{
+				/* We can enumerate characters of this color */
+				int i;
+				for (i = 0; i < colorInfo->charsCount; i++)
+				{
+					mb_char c = colorInfo->chars[i];
+
+					/*
+					 * ----
+					 * Create new prefix
+					 * 1 (end of prefix)   - current character "c"
+					 * 0 (start of prefix) - end of previous prefix
+					 *
+					 * Zero "c" means non alphanumeric character, this means
+					 * new prefix should indicate start of new trigram.
+					 */
+					destKey.prefix.s[1] = c;
+					destKey.prefix.s[0] = c ? key->prefix.s[1] : 0;
+
+					destKey.nstate = s->to;
+
+					if (destKey.prefix.s[0] || c == 0)
+						destKey.prefix.ambiguity = false;
+					else
+						destKey.prefix.ambiguity = key->prefix.ambiguity;
+
+					if (key->prefix.ambiguity ||
+											(key->prefix.s[1] == 0 && c == 0))
+					{
+						/*
+						 * If we have ambiguity or start of new trigram then add
+						 * corresponding keys to same state.
+						 */
+						addKeys(trgmCNFA, state, &destKey);
+						if (state->fin)
+							return;
+					}
+				}
+			}
+			else
+			{
+				/*
+				 * Can't enumerate characters. Add corresponding key to this
+				 * state.
+				 */
+				destKey.nstate = s->to;
+				destKey.prefix.s[0] = 0;
+				destKey.prefix.s[1] = 0;
+				destKey.prefix.ambiguity = colorInfo->containAlpha;
+				addKeys(trgmCNFA, state, &destKey);
+				if (state->fin)
+					return;
+			}
+		}
+		s++;
+	}
+}
+
+/*
+ * Add outgoing arc from state if needed.
+ */
+static void
+addArc(TrgmCNFA *trgmCNFA, TrgmState *state, TrgmStateKey *key,
+	   TrgmStateKey *destKey, mb_char c)
+{
+	TrgmArc *arc;
+	ListCell *cell2;
+
+	if (key->prefix.ambiguity || (key->prefix.s[1] == 0 && c == 0))
+		return;
+
+	/* If we have the same key here, we don't need to add new arc */
+	foreach(cell2, state->keys)
+	{
+		TrgmStateKey *key2 = (TrgmStateKey *) lfirst(cell2);
+		if (key2->nstate == destKey->nstate &&
+			prefixContains(&key2->prefix, &destKey->prefix))
+		{
+			return;
+		}
+	}
+
+	/* Not found, add new arc */
+	arc = (TrgmArc *) palloc(sizeof(TrgmArc));
+	arc->target = getState(trgmCNFA, destKey);
+	arc->trgm[0] = key->prefix.s[0];
+	arc->trgm[1] = key->prefix.s[1];
+	arc->trgm[2] = c;
+	state->arcs = lappend(state->arcs, arc);
+	trgmCNFA->arcsCount++;
+	if (trgmCNFA->arcsCount > MAX_RESULT_ARCS)
+		ereport(ERROR,
+				(errcode(ERRCODE_TRGM_REGEX_TOO_COMPLEX),
+				 errmsg("Too many resulting arcs.")));
+}
+
+/*
+ * Add outgoing arcs from given state.
+ */
+static void
+addArcs(TrgmCNFA *trgmCNFA, TrgmState *state)
+{
+	struct carc *s;
+	TrgmStateKey destKey;
+	ListCell *cell;
+
+	MemSet(&destKey, 0, sizeof(TrgmStateKey));
+
+	/*
+	 * Iterate over keys associated with this output state.
+	 */
+	foreach(cell, state->keys)
+	{
+		TrgmStateKey *key = (TrgmStateKey *) lfirst(cell);
+		s = trgmCNFA->cnfa->states[key->nstate];
+		while (s->co != COLORLESS)
+		{
+			ColorInfo *colorInfo;
+			if (s->co == trgmCNFA->cnfa->bos[1])
+			{
+				/* Should be already handled by addKeys. */
+			}
+			else if (s->co == trgmCNFA->cnfa->eos[1])
+			{
+				/* End of the string ($) */
+				destKey.nstate = s->to;
+
+				/* Assume prefix to become ambiguous after end of the string */
+				destKey.prefix.s[1] = 0;
+				destKey.prefix.s[0] = 0;
+				destKey.prefix.ambiguity = true;
+
+				addArc(trgmCNFA, state, key, &destKey, 0);
+			}
+			else
+			{
+				/* Regular color */
+				colorInfo = &trgmCNFA->colorInfo[s->co];
+
+				if (colorInfo->charsCount > 0
+					&& colorInfo->charsCount <= MAX_COLOR_CHARS)
+				{
+					/* We can enumerate characters of this color */
+					int i;
+					for (i = 0; i < colorInfo->charsCount; i++)
+					{
+						mb_char c = colorInfo->chars[i];
+
+						/*
+						 * ----
+						 * Create new prefix
+						 * 1 (end of prefix)   - current character "c"
+						 * 0 (start of prefix) - end of previous prefix
+						 *
+						 * Zero "c" means non alphanumeric character, this means
+						 * new prefix should indicate start of new trigram.
+						 */
+						destKey.prefix.s[1] = c;
+						destKey.prefix.s[0] = c ? key->prefix.s[1] : 0;
+
+						destKey.nstate = s->to;
+
+						if (destKey.prefix.s[0] || c == 0)
+							destKey.prefix.ambiguity = false;
+						else
+							destKey.prefix.ambiguity = key->prefix.ambiguity;
+
+						addArc(trgmCNFA, state, key, &destKey, c);
+					}
+				}
+			}
+			s++;
+		}
+	}
+}
+
+/*
+ * Get state of trigram CNFA-like graph of given enter key and process it if
+ * needed.
+ */
+static TrgmState *
+getState(TrgmCNFA *trgmCNFA, TrgmStateKey *key)
+{
+	bool		found;
+	TrgmState  *state;
+
+	state = hash_search(trgmCNFA->states, key, HASH_ENTER, &found);
+	if (found)
+		return state;
+	else
+	{
+		if (hash_get_num_entries(trgmCNFA->states) > MAX_RESULT_STATES)
+			ereport(ERROR,
+					(errcode(ERRCODE_TRGM_REGEX_TOO_COMPLEX),
+					 errmsg("Too many resulting states.")));
+		state->arcs = NIL;
+		state->keys = NIL;
+		state->path = NIL;
+		state->fin = false;
+		state->queued = false;
+		addKeys(trgmCNFA, state, key);
+		if (!state->fin)
+			addArcs(trgmCNFA, state);
+	}
+	return state;
+}
+
+#ifdef TRGM_REGEXP_DEBUG
+/*
+ * Log source CNFA.
+ */
+static void
+printCNFA(struct cnfa *cnfa)
+{
+	int state;
+	for (state = 0; state < cnfa->nstates; state++)
+	{
+		struct carc *s;
+		elog(NOTICE, "STATE %d", state);
+		s = cnfa->states[state];
+		while (s->co != COLORLESS)
+		{
+			elog(NOTICE, "ARC %d %d", s->co, s->to);
+			s++;
+		}
+	}
+}
+
+/*
+ * Log resulting trigram-based CNFA-like structure.
+ */
+static void
+printTrgmCNFA(TrgmCNFA *trgmCNFA)
+{
+	HASH_SEQ_STATUS scan_status;
+	TrgmState *state;
+
+	elog(NOTICE, "INITSTATE %p", (void *)trgmCNFA->initState);
+
+	hash_seq_init(&scan_status, trgmCNFA->states);
+	while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
+	{
+		ListCell *cell;
+		elog(NOTICE, "STATE %p %d", (void *)state, (int)state->fin);
+		foreach(cell, state->keys)
+		{
+			TrgmStateKey *key = (TrgmStateKey *) lfirst(cell);
+			elog(NOTICE, "KEY %08X %08X %d %d", key->prefix.s[0],
+				key->prefix.s[1], (int)key->prefix.ambiguity, key->nstate);
+		}
+
+		foreach(cell, state->arcs)
+		{
+			TrgmArc *arc = (TrgmArc *) lfirst(cell);
+			elog(NOTICE, "ARC %p %08X %08X %08X", (void *)arc->target,
+					(uint32)arc->trgm[0], (uint32)arc->trgm[1], arc->trgm[2]);
+		}
+	}
+}
+
+/*
+ * Log path matrix on trigrams.
+ */
+static void
+printTrgm(TRGM *trg, char *paths)
+{
+	int			i,
+				pathsCount,
+				trgmCount,
+				bitmaskSize,
+				j;
+	for (i = 0; i < ARRNELEM(trg); i++)
+	{
+		elog(NOTICE, "TRGM %c %c %c",
+			 GETARR(trg)[i][0], GETARR(trg)[i][1], GETARR(trg)[i][2]);
+	}
+
+	trgmCount = ARRNELEM(trg);
+	bitmaskSize = (trgmCount + 7) / 8;
+	pathsCount = *((int *)paths);
+	for (i = 0; i < pathsCount; i++)
+	{
+		char *path = paths + sizeof(int) + i * bitmaskSize;
+		char msg[1024], *p;
+		p = msg;
+
+		for (j = 0; j < trgmCount; j++)
+		{
+			if (GETBIT(path, j))
+				*p++ = '1';
+			else
+				*p++ = '0';
+		}
+		*p = 0;
+		elog(NOTICE, "%s", msg);
+	}
+}
+#endif
+
+/*
+ * Compare function for sorting of Trgm datatype.
+ */
+static int
+trgmCmp(const void *p1, const void *p2)
+{
+	return memcmp(p1, p2, sizeof(Trgm));
+}
+
+/*
+ * Calculate vector of all unique trigrams which are used
+ */
+static void
+getTrgmVector(TrgmCNFA *trgmCNFA)
+{
+	HASH_SEQ_STATUS scan_status;
+	int			totalCount = 0, i = 0;
+	Trgm	   *p1, *p2;
+	TrgmState  *state;
+
+	hash_seq_init(&scan_status, trgmCNFA->states);
+	while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
+	{
+		ListCell *cell;
+		foreach(cell, state->arcs)
+		{
+			totalCount++;
+		}
+	}
+	trgmCNFA->trgms = (Trgm *) palloc(sizeof(Trgm) * totalCount);
+
+	hash_seq_init(&scan_status, trgmCNFA->states);
+	while ((state = (TrgmState *) hash_seq_search(&scan_status)) != NULL)
+	{
+		ListCell *cell;
+		foreach(cell, state->arcs)
+		{
+			TrgmArc *arc = (TrgmArc *) lfirst(cell);
+			memcpy(&trgmCNFA->trgms[i], &arc->trgm, sizeof(Trgm));
+			i++;
+		}
+	}
+
+	if (totalCount < 2)
+	{
+		trgmCNFA->trgmCount = totalCount;
+		return;
+	}
+
+	qsort(trgmCNFA->trgms, totalCount, sizeof(Trgm), trgmCmp);
+
+	/* Remove duplicates */
+	p1 = trgmCNFA->trgms;
+	p2 = trgmCNFA->trgms;
+	while (p1 - trgmCNFA->trgms < totalCount)
+	{
+		if (memcmp(p1, p2, sizeof(Trgm)) != 0)
+		{
+			p2++;
+			memcpy(p2, p1, sizeof(Trgm));
+		}
+		p1++;
+	}
+	trgmCNFA->trgmCount = p2 + 1 - trgmCNFA->trgms;
+}
+
+/*
+ * Compare two paths in trigram CNFA-like structure. "contain" means that "new"
+ * path contain all trigrams from "origin" path. "contained" means that "origin"
+ * cotain all trigrams from "new" path.
+ */
+static void
+compareMasks(char *new, char *origin, int len, bool *contain, bool *contained)
+{
+	int			i;
+	*contain   = true;
+	*contained = true;
+	for (i = 0; i < len; i++)
+	{
+		if ((~new[i]) & origin[i])
+			*contain = false;
+		if (new[i] & (~origin[i]))
+			*contained = false;
+	}
+}
+
+/*
+ * Add new path into path matrix.
+ */
+static void
+addPath(List **pathMatrix, TrgmStatePath *path, int len, bool *modify)
+{
+	ListCell   *cell, *prev, *next;
+	int			count = 0;
+
+	*modify = false;
+	prev = NULL;
+	cell = list_head(*pathMatrix);
+	while(cell)
+	{
+		bool contain, contained;
+
+		next = lnext(cell);
+		compareMasks(path->path,
+					 ((TrgmStatePath *) lfirst(cell))->path,
+					 len, &contain, &contained);
+
+		if (contain)
+		{
+			/* We already have same or wider path in matrix. Nothing to do. */
+			return;
+		}
+		if (contained)
+		{
+			/* New path is wider than other path in matrix. Delete latter. */
+			*pathMatrix = list_delete_cell(*pathMatrix, cell, prev);
+		}
+		else
+		{
+			prev = cell;
+			count++;
+		}
+
+		cell = next;
+	}
+	*modify = true;
+	if (count >= MAX_RESULT_PATHS)
+		ereport(ERROR,
+				(errcode(ERRCODE_TRGM_REGEX_TOO_COMPLEX),
+				 errmsg( "Too many paths.")));
+	*pathMatrix = lappend(*pathMatrix, path);
+}
+
+/*
+ * Add path to path matrix of corresponding path.
+ */
+static void
+adjustPaths(TrgmCNFA *trgmCNFA, TrgmState *state, TrgmStatePath *path)
+{
+	bool		modify;
+
+	if (state->fin)
+	{
+		/*
+		 * It's a final state. Add path to the final path matrix.
+		 */
+		addPath(&trgmCNFA->path, path, trgmCNFA->bitmaskLen, &modify);
+	}
+	else
+	{
+		addPath(&state->path, path, trgmCNFA->bitmaskLen, &modify);
+
+		/* Did we actually change anything? */
+		if (!modify)
+			return;
+
+		/*
+		 * Plan to scan outgoing arcs from this state if it's not done already.
+		 */
+		if (!state->queued)
+		{
+			trgmCNFA->queue = lappend(trgmCNFA->queue, state);
+			state->queued = true;
+		}
+	}
+}
+
+/*
+ * Process queue of trigram CNFA-like states until we collect all the paths.
+ */
+static void
+processQueue(TrgmCNFA *trgmCNFA)
+{
+	while (trgmCNFA->queue != NIL)
+	{
+		TrgmState *state = (TrgmState *) linitial(trgmCNFA->queue);
+		ListCell *arcCell, *pathCell;
+
+		state->queued = false;
+		trgmCNFA->queue = list_delete_first(trgmCNFA->queue);
+
+		foreach(pathCell, state->path)
+		{
+			TrgmStatePath *path = (TrgmStatePath *) lfirst(pathCell);
+			if (!path->queued)
+				continue;
+			foreach(arcCell, state->arcs)
+			{
+				TrgmArc *arc = (TrgmArc *) lfirst(arcCell);
+				int index;
+				size_t size;
+				TrgmStatePath *pathCopy;
+
+				/* Create path according to arc (with corresponding bit set) */
+				size = sizeof(bool) + sizeof(char) * trgmCNFA->bitmaskLen;
+				pathCopy = (TrgmStatePath *)palloc(size);
+				memcpy(pathCopy, path, size);
+				index = (Trgm *)bsearch(&arc->trgm, trgmCNFA->trgms,
+									trgmCNFA->trgmCount, sizeof(Trgm), trgmCmp)
+						- trgmCNFA->trgms;
+				SETBIT(pathCopy->path, index);
+				adjustPaths(trgmCNFA, arc->target, pathCopy);
+			}
+			path->queued = false;
+		}
+	}
+}
+
+/*
+ * Convert trigram into trgm datatype.
+ */
+static void
+fillTrgm(trgm *ptrgm, Trgm trgm)
+{
+	char		text[14], *p;
+	int			i;
+
+	/* Write multibyte string into "text". */
+	p = text;
+	for (i = 0; i < 3; i++)
+	{
+		int len;
+		if (trgm[i] != 0)
+		{
+			len = strnlen((char *)&trgm[i], 4);
+			memcpy(p, &trgm[i], len);
+			p += len;
+		}
+		else
+		{
+			*p++ = ' ';
+		}
+	}
+	*p = 0;
+
+	/* Extract trigrams from "text" */
+	cnt_trigram(ptrgm, text, p - text);
+}
+
+/*
+ * Final pack of path matrix: convert trigrams info trgm datatype and remove
+ * empty columns from the matrix.
+ */
+static TRGM *
+finalPack(TrgmCNFA *trgmCNFA, MemoryContext context, PackedTrgmPaths **pathsOut)
+{
+	ListCell   *cell;
+	char	   *unionPath;
+	int			i, nonEmptyCount = 0, j;
+	TRGM	   *trg;
+	trgm	   *trgms;
+	int			newBitMaskLen, pathIndex = 0;
+	PackedTrgmPaths *newPaths;
+
+	/* Calculate union of all path of order to detect empty columns */
+	unionPath = (char *) palloc0(sizeof(char) * trgmCNFA->bitmaskLen);
+	foreach(cell, trgmCNFA->path)
+	{
+		char *path = ((TrgmStatePath *) lfirst(cell))->path;
+		for (i = 0; i < trgmCNFA->bitmaskLen; i++)
+			unionPath[i] |= path[i];
+	}
+
+	/* Count non-empty columns */
+	for (i = 0; i < trgmCNFA->trgmCount; i++)
+	{
+		if (GETBIT(unionPath, i))
+			nonEmptyCount++;
+	}
+
+	/* Convert trigrams into trgm datatype */
+	trg = (TRGM *) palloc0(TRGMHDRSIZE + nonEmptyCount * sizeof(trgm));
+	trg->flag = ARRKEY;
+	SET_VARSIZE(trg, CALCGTSIZE(ARRKEY, nonEmptyCount));
+	trgms = GETARR(trg);
+	j = 0;
+	for (i = 0; i < trgmCNFA->trgmCount; i++)
+	{
+		if (GETBIT(unionPath, i))
+		{
+			fillTrgm(&trgms[j], trgmCNFA->trgms[i]);
+			j++;
+		}
+	}
+
+	/* Fill new matrix without empty columns */
+	newBitMaskLen = (nonEmptyCount + 7) / 8;
+	newPaths = (PackedTrgmPaths *)
+		MemoryContextAllocZero(context,
+							   offsetof(PackedTrgmPaths, data)
+			+ newBitMaskLen * sizeof(char) * list_length(trgmCNFA->path));
+	newPaths->count = list_length(trgmCNFA->path);
+	foreach(cell, trgmCNFA->path)
+	{
+		char *path = ((TrgmStatePath *) lfirst(cell))->path;
+		char *path2 = &newPaths->data[newBitMaskLen * pathIndex];
+		j = 0;
+		for (i = 0; i < trgmCNFA->trgmCount; i++)
+		{
+			if (GETBIT(unionPath, i))
+			{
+				if (GETBIT(path, i))
+					SETBIT(path2, j);
+				j++;
+			}
+		}
+		pathIndex++;
+	}
+	*pathsOut = newPaths;
+	return trg;
+}
+
+/*
+ * Main function of regex processing. Returns an array of trigrams and a paths
+ * matrix of those trigrams.
+ */
+TRGM *
+createTrgmCNFA(text *text_re, MemoryContext context, PackedTrgmPaths **pathsOut)
+{
+	HASHCTL		   hashCtl;
+	struct guts   *g;
+	TrgmStateKey   key;
+	TrgmCNFA	   trgmCNFA;
+	TrgmStatePath *path;
+	regex_t		  *regex;
+	TRGM		  *trg;
+	MemoryContext  ccxt = CurrentMemoryContext;
+
+	PG_TRY();
+	{
+#ifdef IGNORECASE
+		regex = RE_compile_and_cache(text_re, REG_ADVANCED | REG_ICASE, DEFAULT_COLLATION_OID);
+#else
+		regex = RE_compile_and_cache(text_re, REG_ADVANCED, DEFAULT_COLLATION_OID);
+#endif
+		g = (struct guts *) regex->re_guts;
+
+		/* Collect color info */
+		trgmCNFA.cnfa = &g->search;
+		trgmCNFA.path = NIL;
+		trgmCNFA.queue = NIL;
+		trgmCNFA.colorInfo = getColorInfo(regex);
+		trgmCNFA.arcsCount = 0;
+
+#ifdef TRGM_REGEXP_DEBUG
+		printCNFA(&g->search);
+#endif
+
+		/* Create hash of states */
+		hashCtl.keysize = sizeof(TrgmStateKey);
+		hashCtl.entrysize = sizeof(TrgmState);
+		hashCtl.hcxt = CurrentMemoryContext;
+		hashCtl.hash = tag_hash;
+		hashCtl.match = memcmp;
+		trgmCNFA.states = hash_create("Trigram CNFA",
+		   1024,
+		   &hashCtl,
+		   HASH_ELEM | HASH_CONTEXT
+		   | HASH_FUNCTION | HASH_COMPARE);
+
+		/* Create initial state of CNFA-like graph on trigrams */
+		MemSet(&key, 0, sizeof(TrgmStateKey));
+		key.prefix.s[0] = 0;
+		key.prefix.s[1] = 0;
+		key.nstate = trgmCNFA.cnfa->pre;
+		key.prefix.ambiguity = true;
+
+		/* Recursively build CNFA-like graph on trigrams */
+		trgmCNFA.initState = getState(&trgmCNFA, &key);
+
+#ifdef TRGM_REGEXP_DEBUG
+		printTrgmCNFA(&trgmCNFA);
+#endif
+
+		/* Get vector of unique trigrams */
+		getTrgmVector(&trgmCNFA);
+		trgmCNFA.bitmaskLen = (trgmCNFA.trgmCount + 7) / 8;
+
+		/*
+		 * Create add empty path to initial state, and create matrix by
+		 * processing queue (BFS search).
+		 */
+		path = (TrgmStatePath *) palloc0(sizeof(bool) +
+											sizeof(char) * trgmCNFA.bitmaskLen);
+		path->queued = true;
+		adjustPaths(&trgmCNFA, trgmCNFA.initState, path);
+		processQueue(&trgmCNFA);
+
+		/* Final pack of paths matrix */
+		trg = finalPack(&trgmCNFA, context, pathsOut);
+
+#ifdef TRGM_REGEXP_DEBUG
+		printTrgm(trg, *pathsOut);
+#endif
+	}
+	PG_CATCH();
+ 	{
+		ErrorData  *errdata;
+		MemoryContext ecxt;
+
+		ecxt = MemoryContextSwitchTo(ccxt);
+		errdata = CopyErrorData();
+		if (errdata->sqlerrcode == ERRCODE_TRGM_REGEX_TOO_COMPLEX)
+		{
+			trg = NULL;
+			FlushErrorState();
+		}
+		else
+		{
+			MemoryContextSwitchTo(ecxt);
+			PG_RE_THROW();
+		}
+ 	}
+ 	PG_END_TRY();
+	return trg;
+}
diff --git a/doc/src/sgml/pgtrgm.sgml b/doc/src/sgml/pgtrgm.sgml
index 30e5355..c4105e1 100644
--- a/doc/src/sgml/pgtrgm.sgml
+++ b/doc/src/sgml/pgtrgm.sgml
@@ -145,9 +145,10 @@
    operator classes that allow you to create an index over a text column for
    the purpose of very fast similarity searches.  These index types support
    the above-described similarity operators, and additionally support
-   trigram-based index searches for <literal>LIKE</> and <literal>ILIKE</>
-   queries.  (These indexes do not support equality nor simple comparison
-   operators, so you may need a regular B-tree index too.)
+   trigram-based index searches for <literal>LIKE</>, <literal>ILIKE</>,
+   <literal>~</> and <literal>~*</> queries.  (These indexes do not
+   support equality nor simple comparison operators, so you may need a
+   regular B-tree index too.)
   </para>
 
   <para>
@@ -203,6 +204,23 @@ SELECT * FROM test_trgm WHERE t LIKE '%foo%bar';
   </para>
 
   <para>
+   Beginning in <productname>PostgreSQL</> 9.3, these index types also support
+   index searches for <literal>~</> and <literal>~*</> (regular expression
+   matching), for example
+<programlisting>
+SELECT * FROM test_trgm WHERE t ~ '(foo|bar)';
+</programlisting>
+   The index search works by extracting trigrams from the regular expression
+   and then looking these up in the index.  The more trigrams could be
+   extracted from regular expression, the more effective the index search is. 
+   Unlike B-tree based searches, the search string need not be left-anchored.
+   However, sometimes regular expression is too complex for analysis, then
+   it performs the same as when no trigrams can be extracted from regular
+   expression. In this situation full index scan or sequential scan will
+   be performed depending on query plan.   
+  </para>
+  
+  <para>
    The choice between GiST and GIN indexing depends on the relative
    performance characteristics of GiST and GIN, which are discussed elsewhere.
    As a rule of thumb, a GIN index is faster to search than a GiST index, but
diff --git a/src/backend/utils/adt/regexp.c b/src/backend/utils/adt/regexp.c
index 92dfbb1..b30e753 100644
--- a/src/backend/utils/adt/regexp.c
+++ b/src/backend/utils/adt/regexp.c
@@ -128,7 +128,7 @@ static Datum build_regexp_split_result(regexp_matches_ctx *splitctx);
  * Pattern is given in the database encoding.  We internally convert to
  * an array of pg_wchar, which is what Spencer's regex package wants.
  */
-static regex_t *
+regex_t *
 RE_compile_and_cache(text *text_re, int cflags, Oid collation)
 {
 	int			text_re_len = VARSIZE_ANY_EXHDR(text_re);
diff --git a/src/include/regex/regex.h b/src/include/regex/regex.h
index 616c2c6..7e19e8a 100644
--- a/src/include/regex/regex.h
+++ b/src/include/regex/regex.h
@@ -171,5 +171,6 @@ extern int	pg_regprefix(regex_t *, pg_wchar **, size_t *);
 extern void pg_regfree(regex_t *);
 extern size_t pg_regerror(int, const regex_t *, char *, size_t);
 extern void pg_set_regex_collation(Oid collation);
+regex_t *RE_compile_and_cache(text *text_re, int cflags, Oid collation);
 
 #endif   /* _REGEX_H_ */

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

Reply via email to