Hi Jeff,
I made a quick change in CDbwDriver.writeInitialState(), changing:
if (!(cluster instanceof DirichletCluster)
    || ((DirichletCluster) cluster).getTotalCount() > 0) {
to:
if ((cluster instanceof DirichletCluster
    && ((DirichletCluster) cluster).getTotalCount() > 1)
    || cluster.getNumPoints() > 1) {
while also adding a null test in the mapper, and I now get 4 non-zero
values printed at the end of the evaluator, as expected. However, I'm
not sure the if statement change is the correct solution, because
getTotalCount() and getNumPoints() return the number of points observed
while building the clusters, not the number of points from the set
passed to the mapper that were actually clustered. In this particular
case, it so happens that number observed = number clustered = 1, but
I guess this may not be the case with other data/clusters.
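
The difference between the two conditions can be sketched as below. This is only an illustration: ClusterLike and DirichletLike are hypothetical stand-ins for the Mahout cluster types, and the int return types are an assumption.

```java
// Hypothetical stand-ins for the Mahout types; not the real API.
final class ClusterFilter {
    interface ClusterLike { int getNumPoints(); }
    interface DirichletLike extends ClusterLike { int getTotalCount(); }

    // Original test: keeps every non-Dirichlet cluster, and any
    // Dirichlet cluster that observed at least one point.
    static boolean original(ClusterLike c) {
        return !(c instanceof DirichletLike)
            || ((DirichletLike) c).getTotalCount() > 0;
    }

    // Revised test: keeps only clusters that observed more than one point.
    static boolean revised(ClusterLike c) {
        return (c instanceof DirichletLike
                && ((DirichletLike) c).getTotalCount() > 1)
            || c.getNumPoints() > 1;
    }

    public static void main(String[] args) {
        ClusterLike singlePoint = () -> 1;
        System.out.println(original(singlePoint)); // prints true
        System.out.println(revised(singlePoint));  // prints false
    }
}
```

Both tests look at counts recorded while the clusters were built, so neither says how many of the points later passed to the mapper land in the cluster.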
Regarding the std calculation issue, I had a problem running Dirichlet
at the weekend, in that pdf was being calculated as NaN after a number
of iterations. It might be a similar problem. I'll take a look at it
again and let you know if I find anything.
Thanks,
Derek
On 21/09/10 16:50, Jeff Eastman wrote:
Hi Derek,
Thanks for taking the time to look into CDbw. This is the first time
anybody besides me has looked at it afaict and it is still quite
experimental. I agree with your analysis and have seen this occur
myself. It's a pathological case which is not handled well and your
proposed fix may in fact be the best solution.
On the std calculation itself, it is correct for scalar values of s0,
s1 and s2 but I'm not as confident that it extrapolates to vectors. It
also has potential overflow, underflow and rounding issues but the
running sums method is convenient and is used throughout the
clustering code via AbstractCluster.computeParameters(). Most
clustering doesn't really rely on the std (Dirichlet does to compute
pdf for Gaussian models) and this is the only situation where I've
seen this error.
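
For the scalar case, the running sums calculation can be sketched as follows. This is a minimal illustration, assuming s0 is the number of observations, s1 their sum and s2 the sum of their squares. The class RunningStd and its methods are made up for this example and are not Mahout code.

```java
// Illustrative only: scalar running-sums standard deviation.
final class RunningStd {
    private double s0; // assumed: number of observations
    private double s1; // assumed: sum of observations
    private double s2; // assumed: sum of squared observations

    void observe(double x) {
        s0 += 1;
        s1 += x;
        s2 += x * x;
    }

    // Population std: sqrt(s0*s2 - s1^2) / s0.
    // No guard here: returns NaN if rounding makes the radicand negative.
    double std() {
        return Math.sqrt(s0 * s2 - s1 * s1) / s0;
    }

    public static void main(String[] args) {
        RunningStd r = new RunningStd();
        for (double x : new double[] {2, 4, 4, 4, 5, 5, 7, 9}) {
            r.observe(x);
        }
        System.out.println(r.std()); // prints 2.0
    }
}
```

For this input s0 = 8, s1 = 40 and s2 = 232, so the radicand is 1856 - 1600 = 256 and the result is 16 / 8 = 2.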
Finally, checking the computeParameters() implementation, it does not
perform the std computation unless s0>1 so ignoring clusters with zero
or one point is probably the right thing to do. Does it fix the
problems you are seeing? I will write up a test today and commit a
change if it does.
Jeff
On 9/21/10 10:39 AM, Derek O'Callaghan wrote:
Hi Jeff,
I've been trying out the CDbwDriver today, and I'm having a problem
running it on the clusters I've generated from my data, whereby I get
all 0s printed for the following lines in CDbwDriver.job():
System.out.println("CDbw = " + evaluator.getCDbw());
System.out.println("Intra-cluster density = " + evaluator.intraClusterDensity());
System.out.println("Inter-cluster density = " + evaluator.interClusterDensity());
System.out.println("Separation = " + evaluator.separation());
Stepping through this, I found a problem at these lines in
CDbwEvaluator.setStDev():
Vector std = s2.times(s0).minus(s1.times(s1))
    .assign(new SquareRootFunction()).divide(s0);
double d = std.zSum() / std.size();
'd' was being set to NaN for one of my clusters, caused by
"s2.times(s0).minus(s1.times(s1))" returning a negative value, so
the subsequent sqrt failed. Looking at the cluster which had the
problem, I saw that it only contained one point. However, 'repPts' in
setStDev() contained 3 points, which were in fact 3 copies of the
cluster's sole point. This appeared to be causing the std calculation
to fail, I guess because of floating point inaccuracies.
I then started digging back further to see why there were 3 copies of
the same point in 'repPts'. FYI I had specified numIterations = 2 to
CDbwMapper.runJob(). Stepping through the code, I see the following
happening:
- CDbwDriver.writeInitialState() writes out the cluster centroids to
"representatives-0", with this particular point in question being
written out as the representative for its cluster.
- CDbwMapper loads these into 'representativePoints' via
setup()/getRepresentativePoints()
- When CDbwMapper.map() is called with this point, it will be added
to 'mostDistantPoints'
- CDbwReducer loads the mapper 'representativePoints' into
'referencePoints' via setup()/CDbwMapper.getRepresentativePoints()
- CDbwReducer writes out the same point twice, once by writing it out
as a most distant point in reduce(), and then again while writing it
out as a reference/representative point in cleanup()
- The process repeats, and an additional copy of the point is written
out by the reducer during each iteration, on top of those from the
previous iteration.
- Later on, the evaluator fails in the std calculation as described
above.
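
The accumulation described in the steps above can be mimicked with a toy simulation. This is a sketch only: it models a cluster with a single point as a string and the representatives file as a list, and none of the names below come from the Mahout code.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the representative-point duplication; not Mahout code.
final class RepPointDuplication {
    // Returns how many copies of the sole cluster point exist after
    // the given number of mapper/reducer iterations.
    static int copiesAfter(int numIterations) {
        String point = "p";
        List<String> representatives = new ArrayList<>();
        representatives.add(point); // initial state: centroid is the sole point

        for (int i = 0; i < numIterations; i++) {
            List<String> written = new ArrayList<>();
            written.add(point);              // written as the most distant point
            written.addAll(representatives); // existing representatives written again
            representatives = written;
        }
        return representatives.size();
    }

    public static void main(String[] args) {
        System.out.println(copiesAfter(2)); // prints 3
    }
}
```

With numIterations = 2 the model yields 3 copies, which matches the 3 identical points seen in 'repPts'.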
I'm wondering if the quickest solution would be to change the
following statement in CDbwDriver.writeInitialState():
if (!(cluster instanceof DirichletCluster)
    || ((DirichletCluster) cluster).getTotalCount() > 0) {
to ignore clusters which only contain one point? The mapper would
then need to check if there was an entry for the cluster id key in
representative points before doing anything with the point.
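
A minimal sketch of such a mapper-side check, assuming the representative points are held in a map keyed by cluster id. The class, field and method names are hypothetical, and the real mapper would do its distance bookkeeping where this sketch only counts points.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the mapper's lookup; not the Mahout class.
final class RepresentativeLookup {
    private final Map<Integer, List<double[]>> representativePoints = new HashMap<>();
    private int handled; // number of points that passed the check

    void addRepresentative(int clusterId, double[] point) {
        representativePoints
            .computeIfAbsent(clusterId, k -> new ArrayList<>())
            .add(point);
    }

    // Returns false and does nothing when the cluster has no entry,
    // e.g. because it was filtered out when the initial state was written.
    boolean map(int clusterId, double[] point) {
        List<double[]> reps = representativePoints.get(clusterId);
        if (reps == null) {
            return false;
        }
        handled++; // placeholder for the real per-point work
        return true;
    }

    int handledCount() {
        return handled;
    }
}
```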
Does the issue also point to a separate problem with the std
calculation, in that it's possible that negative numbers are passed
to sqrt()?
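
One possible guard is to clamp each radicand at zero before taking the square root. The sketch below uses plain double arrays in place of Mahout vectors and a made-up class name. Clamping is an assumption made for illustration, not a fix agreed in this thread.

```java
// Illustrative only: element-wise std with a clamp against negative radicands.
final class SafeStd {
    // s0: number of points, s1: per-dimension sums,
    // s2: per-dimension sums of squares (assumed meanings).
    static double[] std(double s0, double[] s1, double[] s2) {
        double[] out = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            double radicand = s0 * s2[i] - s1[i] * s1[i];
            // Rounding can push a true zero slightly below zero;
            // clamp so that sqrt never sees a negative value.
            out[i] = Math.sqrt(Math.max(0.0, radicand)) / s0;
        }
        return out;
    }

    public static void main(String[] args) {
        // Three copies of the same point, accumulated as running sums.
        double[] p = {0.1, 0.7};
        double[] s1 = new double[2];
        double[] s2 = new double[2];
        for (int n = 0; n < 3; n++) {
            for (int i = 0; i < p.length; i++) {
                s1[i] += p[i];
                s2[i] += p[i] * p[i];
            }
        }
        for (double v : std(3, s1, s2)) {
            System.out.println(Double.isNaN(v)); // prints false for each dimension
        }
    }
}
```

The clamp hides the symptom only. It does not address the overflow, underflow and rounding limits of the running sums approach itself.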
Thanks,
Derek