On 15.11.2012 20:39, Fujii Masao wrote: > Hi, > > I'd like to propose to extend pg_trgm so that it can compare a partial-match > query key to a GIN index. IOW, I'm thinking to implement the 'comparePartial' > GIN method for pg_trgm. > > Currently, when the query key is less than three characters, we cannot use > a GIN index (+ pg_trgm) efficiently, because pg_trgm doesn't support a > partial-match method. In this case, seq scan or index full scan would be > executed, and its response time would be very slow. I'd like to alleviate this > problem. > > Note that we cannot do a partial-match if KEEPONLYALNUM is disabled, > i.e., if query key contains multibyte characters. In this case, byte length of > the trigram string might be larger than three, and its CRC is used as a > trigram key instead of the trigram string itself. Because of using CRC, we > cannot do a partial-match. Attached patch extends pg_trgm so that it > compares a partial-match query key only when KEEPONLYALNUM is > enabled. > > Attached patch is WIP yet. What I should do next is: > > * version up pg_trgm from 1.0 to 1.1, i.e., create pg_trgm--1.1.sql, etc. > * write the regression test > > Comments? Review? Objection?
Hi, I've done a quick review of the current patch: (1) It applies cleanly on the current master and builds fine. (2) In pg_trgm--1.0.sql the gin_trgm_compare_partial is indented differently (using tabs instead of spaces). (3) In trgm_gin.c, function gin_extract_value_trgm contains #ifdef KEEPONLYALNUM, although trgm_pmatch is not used at all. (4) The patch removes some commented-out variables, but there still remain various commented-out variables. Will this be cleaned too? (5) I've done basic functionality of the patch, it really seems to work: CREATE EXTENSION pg_trgm ; CREATE TABLE TEST (val TEXT); INSERT INTO test SELECT md5(i::text) FROM generate_series(1,1000000) s(i); CREATE INDEX trgm_idx ON test USING gin (val gin_trgm_ops); ANALYZE test; EXPLAIN SELECT * FROM test WHERE val LIKE '%aa%'; QUERY PLAN ------------------------------------------------------------------------ Bitmap Heap Scan on test (cost=1655.96..11757.63 rows=141414 width=33) Recheck Cond: (val ~~ '%aa%'::text) -> Bitmap Index Scan on trgm_idx (cost=0.00..1620.61 rows=141414 width=0) Index Cond: (val ~~ '%aa%'::text) (4 rows) Without the patch, this gives a seq scan (as expected). Do you expect to update the docs too? IMHO it's worth mentioning that the pg_trgm can handle even patterns shorter than 2 chars ... regards Tomas Vondra -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers