On Thu, Jan 16, 2014 at 3:34 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorot...@gmail.com> writes: > > Revised version of patch with necessary comments. > > I looked at this patch a bit. It seems like this: > > + * BLANK_COLOR_SIZE - How much blank character is more frequent than > + * other character in average > + #define BLANK_COLOR_SIZE 32 > > is a drastic overestimate. Using the program Erik provided to generate > some sample data, I find that blanks in trigrams are maybe about three > times more common than other characters. Possibly Erik's data isn't > very representative of real text, but if that's the case we'd better > get a more representative sample case before we start optimizing ... > > Now maybe the above number is okay for this purpose anyway, but if so > it needs some different justification; maybe call it a "penalty"? > But nonetheless it seems like a darn large penalty, particularly for " x" > type trigrams where the value would effectively be squared. Compared to > the proposed values of MAX_TRGM_SIZE and WISH_TRGM_SIZE, it seems like > you might as well forget the "sizing" approach and just flat out reject > trigrams containing COLOR_BLANK, because they'll never get through the > size filter. It seems to be right decision to me when we have other trigrams can reject them. But there are cases when we can't. > I'm inclined to think you need a larger MAX_TRGM_SIZE; > and WISH_TRGM_SIZE seems mighty small as well, compared to what the > code would have done before. > OK, I would like to make more reasoning for penalty. Let's consider word of length n. It contains n+1 trigrams including: 1 __x trigram 1 _xx trigram 1 xx_ trigram n - 2 xxx trigrams Assume alphabet of size m those trigrams have following average frequencies: __x 1/((n+1)*m) _xx 1/((n+1)*m^2) xx_ 1/((n+1)*m^2) xxx (n-2)/((n+1)*m^3) The ratio of this frequencies with blanks to frequency of xxx is: __x m^2/(n-2) _xx and xx_ m/(n-2) In order to estimate n I've analyzed some classics: Ernest Hemingway "A farewell to arms" - 3.8 Leo Tolstoy "War and Peace" - 5.0 In english with alphabet size = 26 we have __x m^2/(n-2) = 375 _xx and xx_ m/(n-2) = 14.4 In russian with alphabet size = 33 we have __x m^2/(n-2) = 363 _xx and xx_ m/(n-2) = 11 Assuming these calculations is approximate, we can safely use same values for both languages. Does such reasoning make sense? Also, surely this bit: > > ! trgmNFA->totalTrgmCount = (int) totalTrgmSize; > > is flat out wrong? expandColorTrigrams() expects that > trgmNFA->totalTrgmCount is the truth, not some penalty-bloated > abstraction. The fact that the patch hasn't failed on you completely > demonstrates that you've not tested any cases where blank-containing > trigrams got through the filter, reinforcing my feeling that it's > probably too strict now. > Oh, I wonder how can I leave such weird bug in patch :-( > I wonder if it would be more appropriate to continue to measure the limit > MAX_TRGM_COUNT in terms of actual trigram counts, and just use the > penalized "size" as the sort key while determining which color trigrams > to discard first. > Agree. But I would like to save some equivalent of WISH_TRGM_SIZE. Another thought is that it's not clear that you should apply the same > penalty to blanks in all three positions. Because of the padding > settings, any one word will normally lead to padded strings " a", > " ab", "yz ", so that spaces in the first position are about twice > as common as spaces in the other positions. (It's a little less > than that because single-letter words produce only " a" and " a "; > but I'd think that such words aren't very common in text that trigrams > are effective for.) So I'm thinking the penalty ought to take that > into account. > > I'm also inclined to think that it might be worth adding a size > field to ColorTrgmInfo rather than repeatedly calculating the > size estimate. We only allow a thousand and some ColorTrgmInfos > at most, so the extra space wouldn't be that large, and I'm concerned > by the expense the current patch adds to the sort comparator. > Agree. > > Another point is that this comment: > > * Note: per-color-trigram counts cannot overflow an int so long as > * COLOR_COUNT_LIMIT is not more than the cube root of INT_MAX, ie > about > * 1290. However, the grand total totalTrgmCount might conceivably > * overflow an int, so we use int64 for that within this routine. > > is no longer valid, or at least it fails to demonstrate that the result > of getColorTrigramSize() can't overflow an int; indeed I rather fear it > can. > The patch failed to even update the variable name in this comment, let > alone address the substantive question. > > There are some other cosmetic things I didn't like, for instance > colorTrgmInfoCountCmp() is no longer comparing counts but you didn't > rename it. That's about it for substantive comments though. > Thanks, will be fixed. ------ With best regards, Alexander Korotkov.