The best rule is to try several cases.  The L-1 and L-2 distances, each
with and without normalization, are the most important ones to compare.
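
To make the four cases concrete, here is a rough sketch in plain Python.
It is only an illustration, not anything from Mahout: the function names
are made up, and I am assuming that "normalization" means scaling each
vector to unit length under the same norm that is used for the distance.

```python
import math

def l1_distance(a, b):
    # Sum of absolute coordinate differences (Manhattan distance).
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    # Square root of summed squared differences (Euclidean distance).
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalize(v, p):
    # Scale v to unit L-p norm; an all-zero vector is returned unchanged.
    norm = sum(abs(x) ** p for x in v) ** (1.0 / p)
    return [x / norm for x in v] if norm > 0 else list(v)

def compare_cases(a, b):
    # Assumption: each distance is paired with normalization by its own norm.
    results = {}
    for name, dist, p in (("L1", l1_distance, 1), ("L2", l2_distance, 2)):
        results[(name, "raw")] = dist(a, b)
        results[(name, "normalized")] = dist(normalize(a, p), normalize(b, p))
    return results
```

Calling compare_cases([1, 0, 2], [2, 0, 4]) shows the effect: the raw
distances are nonzero, but both normalized distances are (numerically)
zero because the two vectors point in the same direction.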

K-means clustering assumes that you have already done any term
weighting.  You should experiment a little there as well, but the
standard IDF measure is probably fine.  The only question is whether you
should limit the weight of singleton terms somewhat.  With large corpora,
that is less critical.  Also, if you don't use L-2 normalization, then what
you do with very rare terms will matter much less, since they probably won't
ever match anything and thus won't contribute to dot products.
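
A small sketch of that last point, again in plain Python and with
made-up names.  I am assuming IDF = log(N / df), treating a "singleton"
as a term that occurs in only one document, and using a simple cap
(max_idf) as one hypothetical way to limit its weight.

```python
import math
from collections import Counter

def idf_weights(docs, max_idf=None):
    # Assumed IDF form: log(N / df). max_idf is a hypothetical cap that
    # limits the weight of very rare (e.g. singleton) terms.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = {}
    for term, count in df.items():
        w = math.log(n / count)
        weights[term] = w if max_idf is None else min(w, max_idf)
    return weights

def weighted_vector(doc, weights):
    # Sparse TF * IDF vector stored as {term: weight}.
    return {t: c * weights.get(t, 0.0) for t, c in Counter(doc).items()}

def l2_normalize(vec):
    # Scale to unit Euclidean length; an all-zero vector is left unchanged.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm > 0 else dict(vec)

def dot(u, v):
    # Only terms present in both vectors contribute.
    return sum(w * v[t] for t, w in u.items() if t in v)

docs = [["a", "b"], ["a", "c"], ["a", "b", "zzz"]]  # toy corpus
plain = idf_weights(docs)
capped = idf_weights(docs, max_idf=1.0)

# Without normalization the singleton "zzz" never matches, so the cap
# does not change the dot product between documents 0 and 2.
raw_plain = dot(weighted_vector(docs[0], plain), weighted_vector(docs[2], plain))
raw_capped = dot(weighted_vector(docs[0], capped), weighted_vector(docs[2], capped))

# With L-2 normalization the singleton inflates the norm of document 2,
# so its weight does change the similarity.
cos_plain = dot(l2_normalize(weighted_vector(docs[0], plain)),
                l2_normalize(weighted_vector(docs[2], plain)))
cos_capped = dot(l2_normalize(weighted_vector(docs[0], capped)),
                 l2_normalize(weighted_vector(docs[2], capped)))
```

In this toy corpus raw_plain and raw_capped come out the same, while
cos_capped is larger than cos_plain, which is the sense in which the
treatment of rare terms matters mostly when you normalize.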

On Thu, Jan 7, 2010 at 5:20 AM, Grant Ingersoll <[email protected]> wrote:

> I'm sure others can chime in w/ more of their experience.




-- 
Ted Dunning, CTO
DeepDyve
