One thing that may be happening here is that the scale of your data varies from place to place.
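To make that concrete, here is a minimal sketch on made-up 2-D data. It assumes Euclidean distance rather than tanimoto or cosine, and a simplified single-threshold canopy pass (T1 == T2 == t). The function name `canopies` and the toy points are hypothetical, this is not Mahout's implementation, and whether your documents behave this way is only a guess.

```python
import math

def canopies(points, t):
    """Simplified canopy pass with T1 == T2 == t (Euclidean distance).

    Takes the first remaining point as a center, groups every remaining
    point within t of it, removes them, and repeats.
    """
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        members = [p for p in remaining if math.dist(center, p) <= t]
        result.append(members)
        remaining = [p for p in remaining if math.dist(center, p) > t]
    return result

# Hypothetical toy data: two dense groups (spacing 0.01) that sit close
# together, plus one sparse group (spacing 1.0) far away.
dense_a = [(0.01 * i, 0.01 * j) for i in range(5) for j in range(5)]
dense_b = [(1.0 + 0.01 * i, 0.01 * j) for i in range(5) for j in range(5)]
sparse = [(100.0 + i, 100.0 + j) for i in range(5) for j in range(5)]
points = dense_a + dense_b + sparse

for t in (0.005, 0.1, 10.0):
    sizes = [len(c) for c in canopies(points, t)]
    print(f"t={t}: {len(sizes)} canopies, largest has {max(sizes)} points")
```

On this toy data, t=0.1 recovers the two dense groups but splinters the sparse group into singletons, while t=10.0 holds the sparse group together but merges the two dense groups into one. No single threshold recovers all three groups.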
Have you tried the upcoming k-means stuff?

On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <p...@farfetchers.com> wrote:

> One problem I have is that virtually any value for T gives me a very large
> number of canopies--on the order of 2-5 docs per cluster. Whether I create
> clusters using random seeds or canopies they are of poor quality to my eye.
> A few are good but many are silly. I've tried a wide range of vectorizing
> knobs including L2 norm, n-grams with a high ml, and doing a custom Lucene
> filter to filter out numbers and do stemming to little avail. Using your
> method of t1==t2 I get 2 docs per cluster with t=0.3 (tanimoto or cosine)
> and 5 docs per cluster with t = 0.95. This is telling me that the docs are
> not really clusterable contrary to