I went to run the syntheticcontrol example (which uses MR) but there is
no fuzzy version. It might be easier to debug if one was created from
the kmeans job. It really feels to me like the problem lies somewhere in
Writable handling and not in the ClusterBase refactoring.
Jeff Eastman wrote:
Hmm, OK, I'm confused now too. The Writable methods do the right thing
and centroid is just a (non-thread-safe) lazy eval mechanism. I made
the same changes, deleting SoftCluster.center (and get/set) and
superclassing to ClusterBase. This required using the get/setters to
touch center but both work correctly when run from DisplayFuzzyKmeans
which runs in memory. So, the problem only manifests in the MR version.
Hadoop reuses the *same* instance whenever it uses readFields and I've
been bitten more than once by assuming otherwise. That could certainly
cause all the values to be identical (and especially if they are not
all zeros).
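A minimal sketch of that reuse pitfall, using only java.io and no Hadoop
classes. ReuseDemo, Center, serialize and read are hypothetical names made
up for illustration; Center is a stand-in for a Writable cluster, not
Mahout's SoftCluster. One instance is refilled by readFields for every
record, so keeping references to it leaves every entry showing the last
record, while copying keeps the values distinct.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseDemo {

  /** Hypothetical stand-in for a Writable cluster; not Mahout's SoftCluster. */
  public static final class Center {
    public double[] values = new double[0];

    /** Overwrites this instance's state in place, as a Writable's readFields does. */
    public void readFields(DataInput in) throws IOException {
      int n = in.readInt();
      double[] fresh = new double[n];
      for (int i = 0; i < n; i++) {
        fresh[i] = in.readDouble();
      }
      values = fresh;
    }

    /** Returns an independent copy that later readFields calls cannot touch. */
    public Center copy() {
      Center c = new Center();
      c.values = values.clone();
      return c;
    }
  }

  /** Writes each center as a length followed by its components. */
  public static byte[] serialize(double[][] centers) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      for (double[] c : centers) {
        out.writeInt(c.length);
        for (double v : c) {
          out.writeDouble(v);
        }
      }
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Reads count records into ONE reused instance, mimicking the reuse described above. */
  public static List<Center> read(byte[] data, int count, boolean copy) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      Center reused = new Center();
      List<Center> kept = new ArrayList<>();
      for (int i = 0; i < count; i++) {
        reused.readFields(in);
        // Without the copy, every list entry aliases the same object.
        kept.add(copy ? reused.copy() : reused);
      }
      return kept;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = serialize(new double[][] {{1.0, 1.0}, {5.0, 5.0}, {9.0, 9.0}});
    for (Center c : read(data, 3, false)) {
      System.out.println("aliased: " + c.values[0]);
    }
    for (Center c : read(data, 3, true)) {
      System.out.println("copied:  " + c.values[0]);
    }
  }
}
```

If the MR code holds on to the cluster objects it is handed without
copying them, this is the kind of symptom one would expect to see.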
Jeff
Robin Anil wrote:
On Tue, Feb 16, 2010 at 10:25 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:
Looks to me like the unit tests are the only calls to recomputeCenter,
which is where the center is set. The clusterer seems to be calling
computeCentroid, which sets the centroid, instead. I'm not sure why it
needs both instance variables, as the pointProbSum and weightedPointTotal
variables take the place of the single pointTotal in ClusterBase. I think
perhaps center and centroid need to be merged?
In k-means and canopy, the center is the (read-only) current centroid
which is used for the distance calculations during an iteration, and it
is recomputed by computeCentroid (using pointTotal and numPoints) at the
end of the iteration.
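The two bookkeeping schemes can be sketched side by side in plain Java.
CentroidSketch, hardCentroid and softCentroid are hypothetical names, and
arrays stand in for Mahout's Vector. The k-means form divides pointTotal by
numPoints as described above; the fuzzy form dividing weightedPointTotal by
pointProbSum is an assumption based on the variable names, not a copy of
the actual SoftCluster code.

```java
public final class CentroidSketch {

  private CentroidSketch() {}

  /** k-means style: centroid = pointTotal / numPoints. */
  public static double[] hardCentroid(double[][] points) {
    if (points.length == 0) {
      throw new IllegalArgumentException("no points observed");
    }
    int dim = points[0].length;
    double[] pointTotal = new double[dim];
    for (double[] p : points) {
      for (int i = 0; i < dim; i++) {
        pointTotal[i] += p[i];
      }
    }
    double[] centroid = new double[dim];
    for (int i = 0; i < dim; i++) {
      centroid[i] = pointTotal[i] / points.length;
    }
    return centroid;
  }

  /** Fuzzy style (assumed): centroid = weightedPointTotal / pointProbSum. */
  public static double[] softCentroid(double[][] points, double[] weights) {
    if (points.length == 0 || points.length != weights.length) {
      throw new IllegalArgumentException("need one weight per point");
    }
    int dim = points[0].length;
    double[] weightedPointTotal = new double[dim];
    double pointProbSum = 0.0;
    for (int k = 0; k < points.length; k++) {
      pointProbSum += weights[k];
      for (int i = 0; i < dim; i++) {
        weightedPointTotal[i] += weights[k] * points[k][i];
      }
    }
    double[] centroid = new double[dim];
    for (int i = 0; i < dim; i++) {
      centroid[i] = weightedPointTotal[i] / pointProbSum;
    }
    return centroid;
  }

  public static void main(String[] args) {
    double[][] points = {{0.0, 0.0}, {2.0, 4.0}};
    System.out.println(java.util.Arrays.toString(hardCentroid(points)));
    System.out.println(java.util.Arrays.toString(softCentroid(points, new double[] {1.0, 3.0})));
  }
}
```

Either way there is a single derived value per cluster, which is the
argument for merging center and centroid.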
So just writing the result of computeCentroid should do, right? Which is
what it's doing.
  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(clusterId);
    out.writeBoolean(converged);
    Vector vector = computeCentroid();
    VectorWritable.writeVector(out, vector);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    clusterId = in.readInt();
    converged = in.readBoolean();
    VectorWritable temp = new VectorWritable();
    temp.readFields(in);
    setCenter(temp.get());
    this.pointProbSum = 0;
    this.weightedPointTotal = getCenter().like();
  }
Jeff
Robin Anil wrote:
I have been trying to convert the FuzzyKMeans SoftCluster (which should
ideally be named FuzzyKmeansCluster) to use ClusterBase.
I am getting *the same center* for all the clusters. For the conversion,
all I did was remove the center vector from the SoftCluster class and
reuse the one from ClusterBase. These changes make essentially no
difference to the tests, which still pass.
So I am questioning whether the implementation keeps the average center
at all. Has anyone who has used FuzzyKMeans experienced this?
Robin