One thing that may be happening here is that the scale of your data varies
from place to place.
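
To make that concrete, here is a rough sketch of what I mean. It is not
Mahout's canopy code, just a simplified single-threshold version (T1 == T2)
in Python. The data, the `canopy` helper and the use of Euclidean distance
are all made up for illustration; your runs used tanimoto or cosine.

```python
import math

def canopy(points, t):
    """Simplified canopy pass with T1 == T2 == t (illustrative only)."""
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining[0]  # take the next unassigned point as a center
        # Everything within t of the center joins this canopy...
        members = [p for p in remaining if math.dist(center, p) <= t]
        canopies.append(members)
        # ...and is removed from further consideration.
        remaining = [p for p in remaining if math.dist(center, p) > t]
    return canopies

# Hypothetical data with two very different scales.
dense = [(0.01 * i, 0.0) for i in range(20)]           # neighbours 0.01 apart
sparse = [(100.0 + 5.0 * i, 0.0) for i in range(20)]   # neighbours 5.0 apart
points = dense + sparse

for t in (0.045, 1.0, 22.0):
    sizes = sorted(len(c) for c in canopy(points, t))
    print(f"t={t}: {len(sizes)} canopies, sizes {sizes}")
```

With toy data like this, a small t splits the spread-out region into
singletons, and a t large enough to group that region swallows the tight
region whole. No single threshold suits both scales.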

Have you tried the upcoming k-means stuff?

On Sat, May 12, 2012 at 8:53 AM, Pat Ferrel <> wrote:

> One problem I have is that virtually any value for T gives me a very large
> number of canopies--on the order of 2-5 docs per cluster. Whether I create
> clusters using random seeds or canopies they are of poor quality to my eye.
> A few are good but many are silly. I've tried a wide range of vectorizing
> knobs including L2 norm, n-grams with a high ml, and doing a custom Lucene
> filter to filter out numbers and do stemming, to little avail. Using your
> method of t1==t2, I get 2 docs per cluster with t=0.3 (tanimoto or cosine)
> and 5 docs per cluster with t = 0.95. This is telling me that the docs are
> not really clusterable contrary to
