I think the "optimum" value for these parameters is pretty subjective. You may find estimation procedures that sometimes give you values you like, but canopy will put every point into a cluster, so the number of clusters is very sensitive to these values. I don't think normalizing your vectors will help, since you would need to normalize every vector in your corpus by the same amount. You might then find t1 and t2 values always on 0..1, but the number of clusters will still be sensitive to your choices on that range, and you will be dealing with decimal values.
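For what it's worth, here is a minimal sketch of the canopy assignment loop in plain Java (just an illustration, not the Mahout code; the thresholds and points are made up) showing why every point ends up covered by some canopy and why the canopy count is so sensitive to t1 and t2:

import java.util.ArrayList;
import java.util.List;

public class CanopySketch {

    // Plain Euclidean distance for the sketch; Mahout lets you plug in other measures.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double t1 = 3.0;   // loose threshold: points within t1 join the canopy
        double t2 = 1.0;   // tight threshold: points within t2 stop seeding new canopies
        double[][] points = { {0, 0}, {0.5, 0.5}, {4, 4}, {4.2, 4.1}, {10, 10} };

        List<double[]> pool = new ArrayList<>();
        for (double[] p : points) {
            pool.add(p);
        }

        List<double[]> canopyCenters = new ArrayList<>();
        while (!pool.isEmpty()) {
            double[] center = pool.remove(0);   // pick an arbitrary remaining point
            canopyCenters.add(center);

            int members = 0;
            for (double[] p : points) {
                if (distance(center, p) < t1) {
                    members++;                  // within t1: belongs to this canopy
                }
            }
            // Only points within t2 are removed from the pool, so a point can sit
            // in several canopies, and every point is eventually covered by one.
            pool.removeIf(p -> distance(center, p) < t2);
            System.out.println("canopy with " + members + " member(s)");
        }
        System.out.println("total canopies: " + canopyCenters.size());
        // Shrink t1/t2 well below the typical pairwise distance and nearly every
        // point seeds its own canopy; grow them and the canopies collapse together.
    }
}

Run it with smaller and smaller thresholds and you'll see the canopy count climb toward one per point, which is the same effect you're seeing on your documents.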

It really depends upon how "similar" the documents in your corpus are and how fine a distinction you want to draw between documents before declaring them "different". What kind of distance measure are you using? A cosine distance measure will always give you distances on 0..1.
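As a quick standalone illustration (toy vectors, not Mahout's distance measure class), cosine distance between term-frequency vectors lands in 0..1 because the components are non-negative:

public class CosineDistanceExample {

    // 1 - cosine similarity; for non-negative term-frequency vectors this is in [0, 1].
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] doc1 = {2, 0, 1, 3};   // toy term frequencies
        double[] doc2 = {1, 1, 0, 2};
        System.out.println(cosineDistance(doc1, doc2));   // prints roughly 0.13, always in 0..1
    }
}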

Jeff


Shashikant Kore wrote:
Thank you, Jeff. Unfortunately, I don't have an option of using EC2.

Yes, t1 and t2 values were low.  Increasing these values helps. From
my observations, the values of t1 and t2 need to be tuned depending on
the data set. If the values of t1 and t2 chosen for 100 documents are used
for a set of 1000 documents, the runtime is affected.

Is there any algorithm to find the "optimum" t1 and t2 values for a
given data set?  Ideally, if all the distances are normalized (say, in
the range of 1 to 100), using the same distance thresholds across data sets
of various sizes should work fine.  Is this statement correct?

More questions as I dig deeper.

--shashi

On Tue, May 12, 2009 at 3:22 AM, Jeff Eastman
<[email protected]> wrote:
I don't see anything obviously canopy-related in the logs. Canopy serializes
the vectors but the storage representation should not be too inefficient.

If T1 and T2 are too small relative to your observed distance measures you
will get a LOT of canopies, potentially one per document. How many did you
get in your run? For 1000 vectors of 100 terms, however, it does seem that
something unusual is going on here. I've run canopy (on a 12-node cluster) with
millions of 30-element DenseVector input points and not seen these sorts of
numbers. It is possible you are thrashing your RAM. Have you thought about
getting an EC2 instance or two? I think we are currently OK with Elastic
MapReduce too, but I have not tried that yet.
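One rough way to see where your T1/T2 sit relative to the observed distances (again, just illustrative Java, not a Mahout utility, and the random vectors below stand in for your real document vectors) is to sample pairwise distances and look at a few percentiles:

import java.util.Arrays;
import java.util.Random;

public class DistanceSampler {

    static double euclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        int numDocs = 1000;
        int numTerms = 100;
        int samples = 5000;

        // Random stand-ins for the real document vectors.
        double[][] vectors = new double[numDocs][numTerms];
        for (double[] v : vectors) {
            for (int i = 0; i < numTerms; i++) {
                v[i] = rnd.nextDouble();
            }
        }

        // Sample random pairs and record their distances.
        double[] dists = new double[samples];
        for (int s = 0; s < samples; s++) {
            dists[s] = euclidean(vectors[rnd.nextInt(numDocs)], vectors[rnd.nextInt(numDocs)]);
        }
        Arrays.sort(dists);
        System.out.println("10th percentile: " + dists[samples / 10]);
        System.out.println("median:          " + dists[samples / 2]);
        System.out.println("90th percentile: " + dists[9 * samples / 10]);
        // If T2 sits well below the 10th percentile, almost every document will
        // seed its own canopy, which is exactly the "one canopy per document" case.
    }
}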

I would not expect the reducer to start until all the mappers are done.

I'm back stateside Wednesday from Oz and will be able to take a look later
in the week. I also notice canopy still has the combiner problem we fixed in
kMeans and won't work if the combiner does not run. It's darned unfortunate
there isn't an option to require the combiner. More to think about...

Jeff




