Re: [HACKERS] WIP: index support for regexp search

Alexander Korotkov Tue, 22 Jan 2013 23:37:27 -0800

Hi!

Some quick answers to the part of notes/issues. I will provide rest of
answers soon.

On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

> The biggest problem is that I really don't care for the idea of
> contrib/pg_trgm being this cozy with the innards of regex_t.  Sooner
> or later we are going to want to split the regex code off again as a
> standalone library, and we *must* have a cleaner division of labor if
> that is ever to happen.  Not sure what a suitable API would look like
> though.
>

The only option I see now is to provide a method like "export_cnfa" which
would export corresponding CNFA in fixed format.

> I think the assumption that all MB characters fit in 4 bytes is
> unacceptable; someday we'll want to support wider Unicode characters
> than we do now, and this code seems utterly unable to handle it.  It's
> especially bad that the code isn't even bothering to defend itself
> against the possibility of wider characters.
>

In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use
it in type definition. Is this way ok?

> Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version
> and an upgrade script.
>

Fixed.

> Comments and documentation still need a lot of copy-editing, also I
> think a lot of the comment blocks will not survive pg_indent.  It'd be
> a good idea to run trgm_regexp.c through pg_indent as soon as you have
> that fixed.
>

Fixed.

> New file trgm_regexp.c lacks a copyright notice
>

Fixed.

Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good
> enough; need to pass through the actual collation for the regex
> operator.
>

We have collation passed to gin_extract_query in ginscan.c. I noticed
that gincost_pattern don't pass collation to gin_extract_query. Is it a
bug? Anyway this is not collation we need. We need collation used in
operator clause. In attached patch I introduce additional argument to
gin_extract_query which represent collation of operator clause. Do you
think it is reasonable change in GIN interface? If so, I will provide it as
separate patch.

> Not too happy with convertPgWchar: aside from hard-wired, unchecked
> assumption about maximum length of pg_wchar2mb_with_len result, why is
> it that this is doing a lowercase conversion?  Surely the regex stuff
> dealt with that already?
>

Trigrams are already lowercased. We can simplify our calculations by
excluding uppercased characters from consideration.

------
With best regards,
Alexander Korotkov.

trgm-regexp-0.10.patch.gz
Description: GNU Zip compressed data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] WIP: index support for regexp search

Reply via email to