From: "Kevin Grittner" <kevin.gritt...@wicourts.gov>
"MauMau" <maumau...@gmail.com> wrote:
For information, what kind of breakage would occur?

I imagined removing KEEPONLYALNUM would just accept
non-alphanumeric characters and cause no harm to those who use
only alphanumeric characters.

This would break our current usages because of the handling of
trigrams at the "edges" of groups of qualifying characters.  It
would make similarity (and distance) values less useful for our
current name searches using it.  To simulate the effect, I used an
'8' in place of a comma instead of recompiling with the suggested
change.

test=# select show_trgm('smith,john');
                        show_trgm
-----------------------------------------------------------
{"  j","  s"," jo"," sm","hn ",ith,joh,mit,ohn,smi,"th "}
(1 row)

test=# select show_trgm('smith8john');
                     show_trgm
-----------------------------------------------------
{"  s"," sm",8jo,h8j,"hn ",ith,joh,mit,ohn,smi,th8}
(1 row)

test=# select similarity('smith,john', 'jon smith');
similarity
------------
  0.615385
(1 row)

test=# select similarity('smith8john', 'jon smith');
similarity
------------
    0.3125
(1 row)

So making the proposed change unconditionally could indeed hurt
current users of the technique.  On the other hand, if there was
fine-grained control of this, it might make trigrams useful for
searching statute cites (using all characters) as well as names
(using the current character set); so I wouldn't want it to just be
controlled by a global GUC.

Thank you for your concise examples. I probably got it.

From your examples, I thought KEEPONLYALNUM controls whether
non-alphanumeric characters are included in trigrams, though I haven't read the code of pg_trgm. So, removing KEEPONLYALNUM definition produces trigrams unnecessary for users who handle only alphanumeric text. That would lead to undesirable query results.

Then, I wonder what would be the ideal specification...to add alphanumeric/non-alphanumeric boolean switch to similarity() function, add non-alphanumeric version of operators (ex. %* and <->*) and non-alphanumeric version of operator classes (ex. gin_allchars_trgm_ops)? At least, I understood the fix is not appropriate for minor releases.

Regards
MauMau


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Reply via email to