I gave a quick look through the patch and noticed a few minor things while trying to understand it.
I think the test corpus isn't particularly interesting for how big it is. I'd rather have (a) a small corpus (say 100 words) with which to do detailed regression testing, and (b) some larger document for more extensive testing. I'm not sure (b) is actually necessary. Overall I think the new functions could stand a lot more commentary. > --- 104,125 ---- > #define GETARR(x) ( (trgm*)( (char*)x+TRGMHDRSIZE ) ) > #define ARRNELEM(x) ( ( VARSIZE(x) - TRGMHDRSIZE )/sizeof(trgm) ) > > + #ifdef DIVUNION > + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) ((len1) + > (len2) - (count))) > + #else > + #define CALCSML(count, len1, len2) ((float4) (count)) / ((float4) (((len1) > > (len2)) ? (len1) : (len2))) > + #endif This macro has a multiple evaluation gotcha. Maybe we don't really care because of the limited applicability but I'd put a comment on top. > /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ > ! res = (*(int *) &tmpsml == *(int *) &trgm_limit > || tmpsml > trgm_limit) ? true : false; > } > else if (ISALLTRUE(key)) > { /* non-leaf > contains signature */ > --- 288,304 ---- > switch (strategy) > { > case SimilarityStrategyNumber: > ! case SubstringSimilarityStrategyNumber: > ! /* Similarity search is exact. Substring similarity > search is inexact */ > ! *recheck = (strategy == > SubstringSimilarityStrategyNumber) ? true : false; > ! nlimit = (strategy == SimilarityStrategyNumber) ? > trgm_limit : trgm_substring_limit; > > if (GIST_LEAF(entry)) > { /* all leafs > contains orig trgm */ > ! float4 tmpsml = cnt_sml(qtrg, key, > *recheck); > > /* strange bug at freebsd 5.2.1 and gcc 3.3.3 */ > ! res = (*(int *) &tmpsml == *(int *) &nlimit || > tmpsml > nlimit) ? true : false; What's the compiler bug about? This code and comment were introduced in cbfa4092bb (May 2004) without any explanation. Do we still need to keep it, if &nlimit is now a local variable instead of a global? FWIW the oldest GCC in the buildfarm is 3.4.2/3.4.3 (except for Gaur which uses GCC 2.95 under HPPA) BTW you don't need a ternary op just to say "? true : false" -- just make it res = foo == bar; which immediately assigns the correct boolean value. (This pattern occurs in many places both in the original and in your submitted code. I see no reason to perpetuate it.) > + Datum > + set_substring_limit(PG_FUNCTION_ARGS) > + { > + float4 nlimit = PG_GETARG_FLOAT4(0); > + > + if (nlimit < 0 || nlimit > 1.0) > + elog(ERROR, "wrong limit, should be between 0 and 1"); Should be an ereport(). > ! static void > ! protect_out_of_mem(int slen) > ! { > ! /* > ! * Guard against possible overflow in the palloc requests below. (We > ! * don't worry about the additive constants, since palloc can detect > ! * requests that are a little above MaxAllocSize --- we just need to > ! * prevent integer overflow in the multiplications.) > ! */ > ! if ((Size) (slen / 2) >= (MaxAllocSize / (sizeof(trgm) * 3)) || > ! (Size) slen >= (MaxAllocSize / > pg_database_encoding_max_length())) > ! ereport(ERROR, > ! (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED), > ! errmsg("out of memory"))); > ! } I think the comment should now be above the function, not inside it. > *************** > *** 254,259 **** generate_trgm(char *str, int slen) > --- 292,534 ---- > return trg; > } > > + /* Trigram with position */ > + typedef struct > + { > + trgm t; > + int i; > + } ptrgm; This struct definition could stand a better name, and better member names. Also, our policy (not always enforced) is to have these near the top of the file rather than in the middle. If the struct is used in many places then IMO it's better that it go at the top; if it's used only in one function I'd say it's okay to have it just after that function. > + /* > + * Make array of positional trigrams from two trigram arrays. The first > array > + * is required substing which positions don't matter and replaced with -1. > + * The second array is haystack where we search and have to store its > + * positions. > + */ > + static ptrgm * > + make_positional_trgm(trgm *trg1, int len1, trgm *trg2, int len2) > + { > + ptrgm *result; > + int i, len = len1 + len2; > + > + result = (ptrgm *) palloc(sizeof(ptrgm) * len); > + > + for (i = 0; i < len1; i++) > + { > + memcpy(&result[i].t, &trg1[i], sizeof(trgm)); > + result[i].i = -1; > + } > + > + for (i = 0; i < len2; i++) > + { > + memcpy(&result[i + len1].t, &trg2[i], sizeof(trgm)); > + result[i + len1].i = i; > + } > + > + return result; > + } So the output trigram array is the concatenation of both input trigrams? That's not quite evident from the comment above the function ... I think some rephrasing there is in order. > --- 856,869 ---- > } > } > > ! /* if inexact then len2 is equal to count */ > ! if (inexact) > ! return CALCSML(count, len1, count); > ! else > ! return CALCSML(count, len1, len2); This is where a ternary op would make better code. CALCSML(count, len1, inexact ? count : len2); but a better comment would actually explain why we do this, rather than just stating we do. > + <row> > + > <entry><function>show_substring_limit()</function><indexterm><primary>show_substring_limit</primary></indexterm></entry> > + <entry><type>real</type></entry> > + <entry> > + Returns the current similarity threshold used by the > <literal><%</> > + operator. This sets the minimum substring similarity between > + two phrases. > + </entry> > + </row> > + <row> > + > <entry><function>set_substring_limit(real)</function><indexterm><primary>set_substring_limit</primary></indexterm></entry> > + <entry><type>real</type></entry> > + <entry> > + Sets the current substring similarity threshold that is used by > + the <literal><%</> operator. The threshold must be between > + 0 and 1 (default is 0.6). Returns the same value passed in. > + </entry> I don't quite understand why aren't we using a custom GUC variable here. These already have SHOW and SET support ... -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers