Hi Jeff,
I've been trying out the CDbwDriver today, and I'm having a problem
running it on the clusters I've generated from my data, whereby I get
all 0s printed for the following lines in CDbwDriver.job():
System.out.println("CDbw = " + evaluator.getCDbw());
System.out.println("Intra-cluster density = " + evaluator.intraClusterDensity());
System.out.println("Inter-cluster density = " + evaluator.interClusterDensity());
System.out.println("Separation = " + evaluator.separation());
Stepping through this, I found a problem at these lines in
CDbwEvaluator.setStDev():
Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);
double d = std.zSum() / std.size();
'd' was being set to NaN for one of my clusters because
"s2.times(s0).minus(s1.times(s1))" produced a negative value, so the
subsequent sqrt returned NaN. Looking at the cluster which had the
problem, I saw that it only contained one point. However, 'repPts' in
setStDev() contained 3 points, which were in fact 3 copies of that
single point. This appeared to be causing the std calculation to fail,
I guess because of floating point inaccuracies: for identical points the
expression should be exactly zero, so rounding error can push it
slightly negative.
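
To make the suspected failure mode concrete, here is a minimal sketch in plain Java of the same running-sums formula applied to one vector component. It does not use the Mahout classes; the class name, the method naiveStd() and the coordinate value are made up for illustration.

```java
public class RunningSumsStd {
    // Same shape as the expression in setStDev(), for a single component:
    // sqrt(s2 * s0 - s1 * s1) / s0
    static double naiveStd(double s0, double s1, double s2) {
        return Math.sqrt(s2 * s0 - s1 * s1) / s0;
    }

    public static void main(String[] args) {
        double x = 0.1; // hypothetical coordinate of the sole point
        double s0 = 0, s1 = 0, s2 = 0;
        for (int i = 0; i < 3; i++) { // three copies of the same point
            s0 += 1;
            s1 += x;
            s2 += x * x;
        }
        // Mathematically zero; in floating point it may come out slightly
        // above or below zero depending on the value of x.
        System.out.println("radicand = " + (s2 * s0 - s1 * s1));
        System.out.println("std      = " + naiveStd(s0, s1, s2));

        // Any radicand below zero, however small, turns into NaN.
        System.out.println(naiveStd(3, 3.0, 2.9999999));
    }
}
```

Whether a given coordinate actually rounds to a negative radicand depends on its value, which would explain why only some clusters are affected.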
I then started digging back further to see why there were 3 copies of
the same point in 'repPts'. FYI I had specified numIterations = 2 to
CDbwMapper.runJob(). Stepping through the code, I see the following
happening:
- CDbwDriver.writeInitialState() writes out the cluster centroids to
"representatives-0", with this particular point in question being
written out as the representative for its cluster.
- CDbwMapper loads these into 'representativePoints' via
setup()/getRepresentativePoints().
- When CDbwMapper.map() is called with this point, it will be added to
'mostDistantPoints'.
- CDbwReducer loads the mapper's 'representativePoints' into
'referencePoints' via setup()/CDbwMapper.getRepresentativePoints().
- CDbwReducer writes out the same point twice, once by writing it out as
a most distant point in reduce(), and then again while writing it out as
a reference/representative point in cleanup().
- The process repeats, and an additional copy of the point is written
out by the reducer during each iteration, on top of those from the
previous iteration.
- Later on, the evaluator fails in the std calculation as described above.
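
The steps above can be modelled with a toy sketch that only counts copies of the point. This is not the Mahout code: the class and method names are hypothetical, and a String stands in for the vector.

```java
import java.util.ArrayList;
import java.util.List;

public class RepresentativeGrowth {
    // One iteration: the existing representatives are written out again
    // (as in cleanup()), plus the most distant point (as in reduce()),
    // which for a one-point cluster is the same point.
    static List<String> iterate(List<String> reps, String solePoint) {
        List<String> next = new ArrayList<>(reps);
        next.add(solePoint);
        return next;
    }

    static List<String> run(String solePoint, int numIterations) {
        List<String> reps = new ArrayList<>();
        reps.add(solePoint); // initial centroid, as in writeInitialState()
        for (int i = 0; i < numIterations; i++) {
            reps = iterate(reps, solePoint);
        }
        return reps;
    }

    public static void main(String[] args) {
        System.out.println(run("p", 2).size()); // prints 3
    }
}
```

With numIterations = 2 this gives the 3 copies I saw in 'repPts'.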
I'm wondering if the quickest solution would be to change the following
statement in CDbwDriver.writeInitialState():
if (!(cluster instanceof DirichletCluster) || ((DirichletCluster) cluster).getTotalCount() > 0) {
to ignore clusters which only contain one point? The mapper would then
need to check whether 'representativePoints' has an entry for the
cluster id key before doing anything with the point.
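
A rough sketch of the kind of mapper-side check I mean, assuming the representative points are held in a map keyed by cluster id. The class name, the method shouldProcess() and the use of double[] for points are all made up for illustration and are not the actual CDbwMapper fields.

```java
import java.util.List;
import java.util.Map;

public class RepresentativeGuard {
    // Returns true only if the cluster has at least one representative,
    // i.e. it was not skipped when the initial state was written.
    static boolean shouldProcess(Map<Integer, List<double[]>> representativePoints,
                                 int clusterId) {
        List<double[]> reps = representativePoints.get(clusterId);
        return reps != null && !reps.isEmpty();
    }
}
```

The mapper would then skip any point whose cluster id fails this check.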
Does the issue also point to a separate problem with the std
calculation, in that it's possible that negative numbers are passed to
sqrt()?
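
If so, one possible guard (only a sketch on plain doubles, not a patch against the Mahout Vector API) would be to clamp the value at zero before taking the square root. The class and method names here are hypothetical.

```java
public class ClampedStd {
    // Same formula as before for one component, but a radicand that has
    // been pushed below zero by rounding is treated as zero.
    static double clampedStd(double s0, double s1, double s2) {
        double radicand = Math.max(0.0, s2 * s0 - s1 * s1);
        return Math.sqrt(radicand) / s0;
    }

    public static void main(String[] args) {
        System.out.println(clampedStd(3, 3.0, 2.9999999)); // prints 0.0
        System.out.println(clampedStd(2, 2.0, 4.0));       // prints 1.0
    }
}
```

This would hide the symptom in the evaluator even if the duplicate representative points are still being written.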
Thanks,
Derek