I went to run the syntheticcontrol example (which uses MR) but there is no fuzzy version. It might be easier to debug if one was created from the kmeans job. It really feels to me like the problem lies somewhere in Writable handling and not in the ClusterBase refactoring.

Jeff Eastman wrote:
Hmm, ok, I'm confused now too. The Writable methods do the right thing and centroid is just a (non-thread-safe) lazy eval mechanism. I made the same changes, deleting SoftCluster.center (and get/set) and making ClusterBase the superclass. This required using the getters/setters to touch center, but both work correctly when run from DisplayFuzzyKmeans, which runs in memory. So, the problem only manifests in the MR version.

Hadoop reuses the *same* instance whenever it uses readFields and I've been bitten more than once by assuming otherwise. That could certainly cause all the values to be identical (and especially if they are not all zeros).
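
A minimal sketch of that pitfall, without any Hadoop dependency. ToyCluster and ReuseDemo are hypothetical names made up for illustration and are not the Mahout classes; the loop in readAll only imitates a framework that calls readFields on one shared instance. Holding on to the reused reference leaves every list entry showing the last record read, while copying each record preserves the distinct centers.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a cluster Writable; not the Mahout SoftCluster.
class ToyCluster {
  int clusterId;
  double[] center = new double[0];

  ToyCluster() {}

  ToyCluster(int clusterId, double[] center) {
    this.clusterId = clusterId;
    this.center = center.clone();
  }

  void write(DataOutput out) throws IOException {
    out.writeInt(clusterId);
    out.writeInt(center.length);
    for (double v : center) {
      out.writeDouble(v);
    }
  }

  // Overwrites this instance's state in place, as Writable.readFields does.
  void readFields(DataInput in) throws IOException {
    clusterId = in.readInt();
    center = new double[in.readInt()];
    for (int i = 0; i < center.length; i++) {
      center[i] = in.readDouble();
    }
  }

  ToyCluster copy() {
    return new ToyCluster(clusterId, center);
  }
}

class ReuseDemo {
  static byte[] serialize(ToyCluster... clusters) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    for (ToyCluster c : clusters) {
      c.write(out);
    }
    out.flush();
    return bytes.toByteArray();
  }

  // Imitates a framework that deserializes every record into ONE instance.
  static List<ToyCluster> readAll(byte[] data, int count, boolean copyEach)
      throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    ToyCluster reused = new ToyCluster();
    List<ToyCluster> result = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      reused.readFields(in);
      // Without the copy, every list entry is the same object.
      result.add(copyEach ? reused.copy() : reused);
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    byte[] data = serialize(
        new ToyCluster(0, new double[] {1.0, 1.0}),
        new ToyCluster(1, new double[] {5.0, 5.0}));
    List<ToyCluster> buggy = readAll(data, 2, false);
    List<ToyCluster> fixed = readAll(data, 2, true);
    System.out.println("buggy: " + buggy.get(0).center[0] + ", " + buggy.get(1).center[0]);
    System.out.println("fixed: " + fixed.get(0).center[0] + ", " + fixed.get(1).center[0]);
  }
}
```

In the buggy case both entries report the second cluster's center, which matches the "same center for all the clusters" symptom; whether that is the actual cause here is still only a guess.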

Jeff

Robin Anil wrote:
On Tue, Feb 16, 2010 at 10:25 PM, Jeff Eastman <j...@windwardsolutions.com> wrote:

Looks to me like the unit tests are the only calls to recomputeCenter,
which is where the center is set. The clusterer seems to be calling
computeCentroid, which sets the centroid, instead. I'm not sure why it needs
both instance variables, as the pointProbSum and weightedPointTotal
variables take the place of the single pointTotal in ClusterBase. I think
perhaps center and centroid need to be merged?

In k-means and canopy, the center is the (read-only) current centroid which
is used for the distance calculations during an iteration, and it is
recomputed by computeCentroid (using pointTotal and numPoints) at the end of
the iteration.
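
As a rough illustration of the center/centroid split described above, here is a toy sketch using plain arrays. ToyWeightedCluster and CentroidDemo are hypothetical names, not Mahout code. The weight parameter is an assumption about how pointProbSum and weightedPointTotal generalize pointTotal and numPoints: with every weight equal to 1.0 it reduces to the plain k-means average. Returning the old center when no points were observed is also an assumption.

```java
// Hypothetical toy cluster; only illustrates the bookkeeping, not Mahout's API.
class ToyWeightedCluster {
  private final double[] center;              // read-only during an iteration
  private final double[] weightedPointTotal;  // running sum of weight * point
  private double pointProbSum;                // running sum of weights

  ToyWeightedCluster(double[] center) {
    this.center = center.clone();
    this.weightedPointTotal = new double[center.length];
  }

  double[] getCenter() {
    return center.clone();
  }

  // Accumulates a point; weight 1.0 corresponds to the k-means case.
  void addPoint(double[] point, double weight) {
    for (int i = 0; i < weightedPointTotal.length; i++) {
      weightedPointTotal[i] += weight * point[i];
    }
    pointProbSum += weight;
  }

  // New centroid from the accumulated totals; center itself is not modified.
  double[] computeCentroid() {
    if (pointProbSum == 0.0) {
      return center.clone(); // assumption: keep the old center if nothing was added
    }
    double[] centroid = new double[weightedPointTotal.length];
    for (int i = 0; i < centroid.length; i++) {
      centroid[i] = weightedPointTotal[i] / pointProbSum;
    }
    return centroid;
  }
}

class CentroidDemo {
  public static void main(String[] args) {
    ToyWeightedCluster cluster = new ToyWeightedCluster(new double[] {0.0, 0.0});
    cluster.addPoint(new double[] {2.0, 4.0}, 1.0);
    cluster.addPoint(new double[] {4.0, 8.0}, 3.0);
    double[] centroid = cluster.computeCentroid();
    // Weighted average: (2*1 + 4*3) / 4 = 3.5 and (4*1 + 8*3) / 4 = 7.0
    System.out.println(centroid[0] + ", " + centroid[1]);
  }
}
```

The point of keeping center fixed while the totals accumulate is that all distance calculations within one iteration see the same center; only at the end is the centroid promoted to become the next center.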

So just writing the result of computeCentroid should do, right? Which is what it's doing.

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeInt(clusterId);
    out.writeBoolean(converged);
    Vector vector = computeCentroid();
    VectorWritable.writeVector(out, vector);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    clusterId = in.readInt();
    converged = in.readBoolean();
    VectorWritable temp = new VectorWritable();
    temp.readFields(in);
    setCenter(temp.get());
    this.pointProbSum = 0;
    this.weightedPointTotal = getCenter().like();
  }

Jeff

Robin Anil wrote:

I have been trying to convert the FuzzyKMeans SoftCluster (which should
ideally be named FuzzyKmeansCluster) to use ClusterBase.

I am getting *the same center* for all the clusters. For the conversion,
all I did was remove the center vector from the SoftCluster class and
reuse the one from ClusterBase. This essentially changes nothing in the
tests, which pass correctly.

So I am questioning whether the implementation keeps the average center at
all. Has anyone who has used FuzzyKMeans experienced this?


Robin