On 29/05/10 12:34, Jesper Krogh wrote: > On 2010-05-28 23:47, Jan Urbański wrote: >> On 28/05/10 22:22, Tom Lane wrote: >> Now I tried to substitute some numbers there, and so assuming the >> English language has ~1e6 words H(W) is around 6.5. Let's assume the >> statistics target to be 100. >> >> I chose s as 1/(st + 10)*H(W) because the top 10 English words will most >> probably be stopwords, so we will never see them in the input. >> > I think you should skip the assumption about stop-words, users may > use something where they are needed in the index or have a language > than the typical. (and they dont seem to influcence the math that much).
Turns out it has nearly linear influence on the bucket width and the frequency necessary to survive the final pruning. I put some data in a spreadsheet, results below. > Isn't it the same "type" of logic that is used for collecting statistics > for > "array-types", say integer-arrays and text arrays? AFAIK statistics for everything other than tsvectors are built based on the values of whole rows. ts_typanalyze is the only typanalyze function that takes the trouble of looping over the actual contents of each cell, all the others just compare whole arrays (which means that for a text[] field you will probably a quite useless MCV entry). >> Using the above estimate s ends up being 6.5/(100 + 10) = 0.06 >> >> We then do LC, pruning the D structure every w = 1/0.006 = 167 lexemes >> > > Im not sure I get this one.. does this mean that we prune everytime > we have collected 167 new datapoints .. that would seem too often > for me since that would roughly be once per "row". Hm, if we pick s to be 0.06, we say that the K'th word in the English language will have a frequency of 0.06, so if we want to have statistics with an error of s/10, we can prune every 167 lexemes (K is the statistics target, possibly +top_stopwords). Hm, I am now thinking that maybe this theory is flawed, because tsvecors contain only *unique* words, and Zipf's law is talking about words in documents in general. Normally a word like "the" would appear lots of times in a document, but (even ignoring the fact that it's a stopword and so won't appear at all) in a tsvector it will be present only once. This may or may not be a problem, not sure if such "squashing" of occurences as tsvectors do skewes the distribution away from Zipfian or not. Anyway, figuring that out would require some more math and thinking, and to fix the problem at hand we can say Zipf is good enough. >> After that, we remove lexemes with f< 0.9 * 0.06 * N = 0.054*N >> >> So assuming that on average a tsvector has 154 elements and that we went >> through 35017 rows (as it would be in Jesper's case, before he raised >> the stats target from 100 to 1000), we will remove lexemes with f< >> 0.054 * 35017 * 154 that is f< 291201.37 >> >> I wonder what would happen if Jasper's case if we did that... And I >> wonder how sound that maths is >> > > If it means that I would get an accurate MCE-histogram for all > things that have an occourance of more than 5.4% of the rows > (given the samples chosen), then I think that would be really > reasonable. Here's the spreadsheet spat out. The variables are: * the statistics target * top stopwords * error factor Where top stopwords is the number of top words in the English language that would be stopwords. You can also think about it as the smudge factor determinig how well do we trust that the distribution is Zipfian. Theoretically if you want to keep X values in the MCE array, you should discard inputs with frequency lower than the frequency of the X'th value in a Zipfian distribution. If you would write out all English words and their frequencies (according to Zipf's law), the top Y of them would be stopwords. We want to discard words with frequency that's lower than X + Y, and then we probably want to have some breathing space as well. That cutoff frequency is called s in the LC algorithm. Error factor determines the relation between s and e, since apparently we want e to be proportional to s (e is the error from the LC algorithm). It directly determines the bucket width, since the larger the bucket, the more accurate the results will be, as there will be less pruning going on. There are also constants: H(len(eng)) is the harmonic number from Zipf's law, that assuming 1e6 words in English is 6.5. tsvector length and rows in sample are just some values to get concrete numbers out. They influence the final pruning frequency, because the rule is f < (s-e)N and N is the total number of lexemes seen The results are attached in a text (CSV) file, to preserve formatting. Based on them I'd like to propose top_stopwords and error_factor to be 100. With your dataset this would mean pruning every 3076 lexemes and discarding from the result all lexemes with < 173507 occurrences. With statistics target set to 1000 it would change to 16923 and 31546, respectively. > I can "fairly easy" try out patches or do other kind of testing. I'll try to come up with a patch for you to try and fiddle with these values before Monday. Cheers, Jan
"statistics target","H(len(eng))","top stopwords","tsvector length","rows in sample","error factor",,"s","e","bucket width","final prune" 100,6.5,10,154,35017,100,,0.0590909090909091,0.000590909090909091,1692,315468 100,6.5,100,154,35017,10,,0.032500000,0.003250000,307,157734 100,6.5,10,154,35017,100,,0.0590909090909091,0.000590909090909091,1692,315468 100,6.5,100,154,35017,100,,0.032500000,0.000325000,3076,173507 100,6.5,0,154,35017,100,,0.065000000,0.000650000,1538,347014 100,6.5,20,154,35017,50,,0.0541666666666667,0.00108333333333333,923,286258 100,6.5,50,154,35017,50,,0.0433333333333333,0.000866666666666667,1153,229006 1000,6.5,0,154,35017,100,,0.006500000,0.000065000,15384,34701 1000,6.5,100,154,35017,100,,0.00590909090909091,0.0000590909090909091,16923,31546
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers