Hi! Some quick answers to the part of notes/issues. I will provide rest of answers soon.
On Wed, Jan 23, 2013 at 6:08 AM, Tom Lane <t...@sss.pgh.pa.us> wrote: > The biggest problem is that I really don't care for the idea of > contrib/pg_trgm being this cozy with the innards of regex_t. Sooner > or later we are going to want to split the regex code off again as a > standalone library, and we *must* have a cleaner division of labor if > that is ever to happen. Not sure what a suitable API would look like > though. > The only option I see now is to provide a method like "export_cnfa" which would export corresponding CNFA in fixed format. > I think the assumption that all MB characters fit in 4 bytes is > unacceptable; someday we'll want to support wider Unicode characters > than we do now, and this code seems utterly unable to handle it. It's > especially bad that the code isn't even bothering to defend itself > against the possibility of wider characters. > In attached patch I introduce MAX_MULTIBYTE_CHARACTER_LENGTH macro and use it in type definition. Is this way ok? > Can't just modify pg_trgm--1.0.sql in place, must create a "1.1" version > and an upgrade script. > Fixed. > Comments and documentation still need a lot of copy-editing, also I > think a lot of the comment blocks will not survive pg_indent. It'd be > a good idea to run trgm_regexp.c through pg_indent as soon as you have > that fixed. > Fixed. > New file trgm_regexp.c lacks a copyright notice > Fixed. Calling RE_compile_and_cache with DEFAULT_COLLATION_OID is not good > enough; need to pass through the actual collation for the regex > operator. > We have collation passed to gin_extract_query in ginscan.c. I noticed that gincost_pattern don't pass collation to gin_extract_query. Is it a bug? Anyway this is not collation we need. We need collation used in operator clause. In attached patch I introduce additional argument to gin_extract_query which represent collation of operator clause. Do you think it is reasonable change in GIN interface? If so, I will provide it as separate patch. > Not too happy with convertPgWchar: aside from hard-wired, unchecked > assumption about maximum length of pg_wchar2mb_with_len result, why is > it that this is doing a lowercase conversion? Surely the regex stuff > dealt with that already? > Trigrams are already lowercased. We can simplify our calculations by excluding uppercased characters from consideration. ------ With best regards, Alexander Korotkov.
trgm-regexp-0.10.patch.gz
Description: GNU Zip compressed data
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers