The best rule is to try several cases. The most important are L-1 and L-2, each with and without normalization.
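As a rough sketch of what "trying several cases" could look like, the snippet below compares two term-weight vectors under L-1 and L-2 distance, once on the raw vectors and once after scaling each to unit L-2 length. It assumes that L-1 and L-2 refer to distance measures and that the normalization in question is unit-length scaling. The function names and the toy vectors are made up for illustration and are not part of any clustering library.

```python
import math

def l1_distance(a, b):
    # Manhattan distance: sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2_distance(a, b):
    # Euclidean distance: square root of summed squared differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def l2_normalize(v):
    # Scale a vector to unit L-2 length; an all-zero vector is returned as a copy.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def compare(a, b):
    # The four cases: each distance on raw and on normalized vectors.
    an, bn = l2_normalize(a), l2_normalize(b)
    return {
        "l1_raw": l1_distance(a, b),
        "l2_raw": l2_distance(a, b),
        "l1_normalized": l1_distance(an, bn),
        "l2_normalized": l2_distance(an, bn),
    }

# Hypothetical weighted term vectors: doc_b has the same term
# proportions as doc_a but is three times as long.
doc_a = [1.0, 2.0, 0.0]
doc_b = [3.0, 6.0, 0.0]

results = compare(doc_a, doc_b)
for name, value in results.items():
    print(f"{name}: {value:.4f}")
```

On the raw vectors the two documents look far apart purely because of length. After normalization they are essentially identical, which is the kind of difference worth checking on your own data before settling on a measure.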
K-means clustering assumes that you have already done any term weighting. You should experiment a little there as well, but the standard IDF measure is probably fine. The only question is whether you should limit the weight of singleton terms somewhat. With large corpora, that is less critical. Also, if you don't use L-2 normalization, then what you do with very rare terms will matter much less, since they probably won't ever match anything and thus won't contribute to dot products.

On Thu, Jan 7, 2010 at 5:20 AM, Grant Ingersoll <[email protected]> wrote:

> I'm sure others can chime in w/ more of their experience.

-- 
Ted Dunning, CTO
DeepDyve
