Jan Urbański <wulc...@wulczer.org> writes:
> On 29/05/10 17:09, Tom Lane wrote:
>> There is definitely something wrong with your math there.  It's not
>> possible for the 100'th most common word to have a frequency as high
>> as 0.06 --- the ones above it presumably have larger frequencies,
>> which makes the total quite a lot more than 1.0.
> Upf... hahaha, I computed this as 1/(st + 10)*H(W), where it should be
> 1/((st + 10)*H(W))... So s would be 1/(110*6.5) = 0.0014

Um, apparently I can't do simple arithmetic first thing in the morning
either, cause I got my number wrong too ;-)

After a bit more research: if you use the basic form of Zipf's law
with a 1/k distribution, the first frequency has to be about 0.07 to
make the total come out to 1.0 for a reasonable number of words.  So we
could use s = 0.07 / K when we wanted a final list of K words.  Some
people (including the LC paper) prefer a higher exponent, ie 1/k^S with
S around 1.25.  That makes the F1 value around 0.22 which seems awfully
high for the type of data we're working with, so I think the 1/k rule
is probably what we want here.

			regards, tom lane

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
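
The figures quoted above (F1 of about 0.07 for the 1/k rule, about 0.22
for S = 1.25, and s = 0.07 / K) can be checked numerically.  What follows
is only a rough sketch, not code from the thread: the function names are
made up for illustration, and the vocabulary size of 1,000,000 words is an
assumed stand-in for "a reasonable number of words".

```python
import math


def zipf_first_frequency(n_words: int, exponent: float = 1.0) -> float:
    """Return F1 such that the frequencies F1 / k**exponent for
    k = 1..n_words sum to 1.0 (hypothetical helper, for illustration)."""
    normaliser = math.fsum(1.0 / k ** exponent for k in range(1, n_words + 1))
    return 1.0 / normaliser


def min_frequency(k_words: int, f1: float = 0.07) -> float:
    """Threshold s = F1 / K: the frequency the K'th most common word
    would have under the plain 1/k rule (hypothetical helper)."""
    return f1 / k_words


if __name__ == "__main__":
    n = 1_000_000  # assumed vocabulary size, not a figure from the thread

    # Plain Zipf (1/k): F1 should come out near 0.07.
    print("F1 for 1/k:     ", round(zipf_first_frequency(n), 3))

    # Higher exponent (1/k^1.25): F1 should come out near 0.22.
    print("F1 for 1/k^1.25:", round(zipf_first_frequency(n, 1.25), 3))

    # Threshold for a final list of K = 100 words.
    print("s for K = 100:  ", min_frequency(100))

    # Sanity check on the original objection: if the 100th word had
    # frequency 0.06, the top 100 alone would total at least 100 * 0.06.
    print("lower bound on total:", 100 * 0.06)
```

With the 1/k rule the normaliser is just the harmonic number H(N), which
grows only logarithmically, so the F1 value is not very sensitive to the
exact vocabulary size assumed here.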