I get your point. Thank you. I am using Euclidean distance.
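
To make the sensitivity discussed in the quoted message concrete, here is a minimal sketch in plain Python. It is not Mahout's implementation: the function names, the toy term-count vectors, and the threshold values are all made up for illustration, and the canopy pass is a simplified version of the usual two-threshold procedure (t1 >= t2). It only shows that Euclidean thresholds depend on vector magnitudes, while cosine distance stays on 0..1 for non-negative vectors such as term counts.

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance; grows with the magnitude of the vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cos(angle); lies on 0..1 when all components are non-negative.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canopies(points, t1, t2, distance):
    # Simplified canopy pass (illustrative, not Mahout's code):
    # take the next candidate as a center, put every point within t1
    # into its canopy, and drop candidates within t2 from further
    # consideration as centers.
    if t1 < t2:
        raise ValueError("t1 must be >= t2")
    candidates = list(points)
    result = []
    while candidates:
        center = candidates.pop(0)
        members = [p for p in points if distance(center, p) < t1]
        result.append((center, members))
        candidates = [p for p in candidates if distance(center, p) >= t2]
    return result

# Hypothetical term-count vectors: two pairs of documents, where each
# pair has the same direction but different lengths.
docs = [(3, 0, 1), (6, 0, 2), (0, 4, 1), (0, 8, 2)]

# Euclidean: the canopy count changes as the thresholds change.
tight = canopies(docs, 3.0, 2.0, euclidean_distance)
medium = canopies(docs, 6.0, 4.5, euclidean_distance)
loose = canopies(docs, 12.0, 9.0, euclidean_distance)

# Cosine: document length is ignored, so the thresholds live on 0..1.
by_angle = canopies(docs, 0.5, 0.3, cosine_distance)
```

On this toy data the Euclidean runs yield 4, 2, and 1 canopies as the thresholds grow, and the cosine run yields 2, so the thresholds still have to be chosen by hand for either measure.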
--shashi

On Thu, May 14, 2009 at 1:51 AM, Jeff Eastman <[email protected]> wrote:

> I think the "optimum" value for these parameters is pretty subjective. You
> may find some estimation procedures that will give you values you like some
> times, but canopy will put every point into a cluster so the number of
> clusters is very sensitive to these values. I don't think normalizing your
> vectors will help, since you need to normalize all vectors in your corpus by
> the same amount. You might then find t1 and t2 values always on 0..1 but the
> number of clusters will still be sensitive to your choices on this range and
> you will be dealing with decimal values.
>
> It really depends upon how "similar" the documents in your corpus are and
> how fine a distinction you want to draw between documents before declaring
> them "different". What kind of distance measure are you using? A cosine
> distance measure will always give you distances on 0..1.
>
> Jeff
>
