If I understand correctly, CosineDistanceMeasure has a range of [0,1],
at least for non-negative vectors such as TF-IDF.
So shouldn't canopy clustering return a single cluster if 1 is used for
both T1 and T2, as in the example below? Every point is within distance
1 of the random starting point and should therefore be removed from
the list of candidate canopy centers.
Yet it returns multiple clusters. Are my assumptions wrong? Can someone
help me understand this behavior?
CanopyDriver.run(new Path("tfidf-vectors"), new Path("canopy_centroids"),
    new CosineDistanceMeasure(), 1, 1, 0.0, true);
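
To make the expectation concrete, here is a minimal, self-contained sketch of the canopy pass as described above. It does not use Mahout: the class CanopySketch and its methods are hypothetical names, cosine distance is assumed to be 1 minus cosine similarity, and the strictT2 flag is an illustrative switch, since whether the T2 comparison is strict ("<") or inclusive ("<=") is an assumption that has not been checked against Mahout's source. T1 is left out because in this simplified model it only decides canopy membership, not how many canopies are created.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not Mahout's implementation.
class CanopySketch {

    // Cosine distance taken as 1 - cosine similarity (assumption).
    // For non-negative vectors such as TF-IDF this lies in [0, 1].
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0; // assumption: a zero vector is treated as maximally distant
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    // Simplified canopy pass: take the first remaining point as a center,
    // then drop every remaining point that falls within T2 of that center.
    // strictT2 chooses between "distance < t2" and "distance <= t2".
    static int countCanopies(List<double[]> points, double t2, boolean strictT2) {
        List<double[]> remaining = new ArrayList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);
            canopies++;
            remaining.removeIf(p -> {
                double d = cosineDistance(center, p);
                return strictT2 ? d < t2 : d <= t2;
            });
        }
        return canopies;
    }
}
```

With the three points [1,0], [0,1] and [1,1] and T2 = 1, countCanopies gives one canopy for the inclusive comparison and two for the strict one, because [1,0] and [0,1] share no terms and are exactly distance 1 apart. If the real implementation compares strictly, documents with no terms in common would behave the same way.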