I'm coming to the same conclusion. In situations where the number of clusteredPoints is smaller than the number of representative points being requested, some of the points in the representative points output will be duplicated. Since the cluster center is always the first representative point, it is the likely duplicate. I think the representative point job is doing things correctly. What I see inside the evaluator, however, is some brittleness in these situations.

I'm writing some tests to try to duplicate these errors, building off of the TestCDbwEvaluator.testCDbw1() test. I can duplicate your exception but don't yet have a solution.


On 9/21/10 12:34 PM, Derek O'Callaghan wrote:
Hi Jeff,

I made a quick change in CDbwDriver.writeInitialState(), changing:

if (!(cluster instanceof DirichletCluster) || ((DirichletCluster) cluster).getTotalCount() > 0) {

to:

if ((cluster instanceof DirichletCluster && ((DirichletCluster) cluster).getTotalCount() > 1) || cluster.getNumPoints() > 1) {

while also adding a null test in the mapper, and I now get 4 non-zero values printed at the end of the evaluator as expected. However, I'm not sure the if-statement change is the correct solution, given that getTotalCount() and getNumPoints() return the number of points observed while building the clusters, not the actual number of clustered points from the set that's passed to the mapper. In this particular case it happens that the number observed = number clustered = 1, but I guess this may not hold with other data/clusters.

Regarding the std calculation issue, I had a problem running Dirichlet at the weekend, in that pdf was being calculated as NaN after a number of iterations. It might be a similar problem, I'll take a look at it again and let you know if I find anything.

Thanks,

Derek

On 21/09/10 16:50, Jeff Eastman wrote:
 Hi Derek,

Thanks for taking the time to look into CDbw. This is the first time anybody besides me has looked at it afaict and it is still quite experimental. I agree with your analysis and have seen this occur myself. It's a pathological case which is not handled well and your proposed fix may in fact be the best solution.

On the std calculation itself, it is correct for scalar values of s0, s1 and s2 but I'm not as confident that it extrapolates to vectors. It also has potential overflow, underflow and rounding issues but the running sums method is convenient and is used throughout the clustering code via AbstractCluster.computeParameters(). Most clustering doesn't really rely on the std (Dirichlet does to compute pdf for Gaussian models) and this is the only situation where I've seen this error.

Finally, checking the computeParameters() implementation, it does not perform the std computation unless s0>1 so ignoring clusters with zero or one point is probably the right thing to do. Does it fix the problems you are seeing? I will write up a test today and commit a change if it does.
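For reference, here's a minimal standalone sketch (assumed, not the actual AbstractCluster code) of the running-sums pattern, where s0 is the point count, s1 the sum and s2 the sum of squares; the s0 > 1 guard mirrors what computeParameters() does, since with one point the variance term is identically zero:

```java
// Hypothetical illustration of the running-sums std; scalar only, whereas
// Mahout works on Vectors via s2.times(s0).minus(s1.times(s1)).
public class RunningSums {
  public static void main(String[] args) {
    double[] points = {2.0, 4.0, 6.0};
    double s0 = 0, s1 = 0, s2 = 0;
    for (double p : points) { s0++; s1 += p; s2 += p * p; }
    double mean = s1 / s0;
    // skip the std computation for clusters with zero or one point
    double std = s0 > 1 ? Math.sqrt(s2 * s0 - s1 * s1) / s0 : 0.0;
    System.out.println(String.format(java.util.Locale.ROOT, "%.3f %.3f", mean, std));
  }
}
```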

Jeff


On 9/21/10 10:39 AM, Derek O'Callaghan wrote:
Hi Jeff,

I've been trying out the CDbwDriver today, and I'm having a problem running it on the clusters I've generated from my data, whereby I get all 0s printed for the following lines in CDbwDriver.job():

System.out.println("CDbw = " + evaluator.getCDbw());
System.out.println("Intra-cluster density = " + evaluator.intraClusterDensity());
System.out.println("Inter-cluster density = " + evaluator.interClusterDensity());
System.out.println("Separation = " + evaluator.separation());

Stepping through this, I found a problem at these lines in CDbwEvaluator.setStDev():

Vector std = s2.times(s0).minus(s1.times(s1)).assign(new SquareRootFunction()).divide(s0);
double d = std.zSum() / std.size();

'd' was being set to NaN for one of my clusters, caused by "s2.times(s0).minus(s1.times(s1))" returning a negative number, so the subsequent sqrt returned NaN. Looking at the cluster which had the problem, I saw that it only contained one point. However, 'repPts' in setStDev() contained 3 points, in fact 3 copies of the same sole cluster inhabitant point. This appeared to be causing the std calculation to fail, I guess from floating point inaccuracies.
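As a hedged illustration (plain Java, not the Mahout vector code): for n copies of the same point, s2*s0 - s1*s1 is exactly zero in real arithmetic, so rounding can push it slightly negative and Math.sqrt() then yields NaN. Clamping at zero before the sqrt would avoid that:

```java
// Hypothetical standalone sketch: the running-sums variance term is
// mathematically >= 0, but may round just below zero in doubles.
public class StdSketch {
  static double stdFromSums(double s0, double s1, double s2) {
    double term = s2 * s0 - s1 * s1;            // may round below zero
    return Math.sqrt(Math.max(0.0, term)) / s0; // clamp before sqrt
  }

  public static void main(String[] args) {
    double x = 0.1;                              // three copies of the same point
    double s0 = 3.0, s1 = 3.0 * x, s2 = 3.0 * x * x;
    double std = stdFromSums(s0, s1, s2);
    System.out.println(Double.isNaN(std) ? "NaN" : "ok"); // clamped: never NaN
  }
}
```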

I then started digging back further to see why there were 3 copies of the same point in 'repPts'. FYI I had specified numIterations = 2 to CDbwMapper.runJob(). Stepping through the code, I see the following happening:

- CDbwDriver.writeInitialState() writes out the cluster centroids to "representatives-0", with this particular point in question being written out as the representative for its cluster.
- CDbwMapper loads these into 'representativePoints' via setup()/getRepresentativePoints().
- When CDbwMapper.map() is called with this point, it will be added to 'mostDistantPoints'.
- CDbwReducer loads the mapper 'representativePoints' into 'referencePoints' via setup()/CDbwMapper.getRepresentativePoints().
- CDbwReducer writes out the same point twice, once by writing it out as a most distant point in reduce(), and then again while writing it out as a reference/representative point in cleanup().
- The process repeats, and an additional copy of the point is written out by the reducer during each iteration, on top of those from the previous iteration.
- Later on, the evaluator fails in the std calculation as described above.
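A toy, non-Hadoop sketch of that growth (names are illustrative, not the actual job code): starting from the single centroid, each iteration re-emits the carried-over reference points plus the same point again as "most distant", so with numIterations = 2 the list ends up holding 3 copies:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models a singleton cluster whose sole point is both
// the reference point and the "most distant" point in every iteration.
public class DuplicationSketch {
  public static void main(String[] args) {
    List<String> reps = new ArrayList<>(List.of("p")); // representatives-0
    int numIterations = 2;
    for (int i = 0; i < numIterations; i++) {
      String mostDistant = reps.get(0);            // only candidate is the point itself
      List<String> next = new ArrayList<>(reps);   // reference points written in cleanup()
      next.add(mostDistant);                       // same point written again in reduce()
      reps = next;
    }
    System.out.println(reps.size()); // 1 initial + 1 duplicate per iteration
  }
}
```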

I'm wondering if the quickest solution would be to change the following statement in CDbwDriver.writeInitialState():

if (!(cluster instanceof DirichletCluster) || ((DirichletCluster) cluster).getTotalCount() > 0) {

to ignore clusters which only contain one point? The mapper would then need to check if there was an entry for the cluster id key in representative points before doing anything with the point.
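Something like the following hedged sketch shows the idea (plain collections with illustrative names, not the actual CDbwDriver/CDbwMapper code): singleton clusters get no representative points written out, and the mapper skips cluster ids that have no entry:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the proposed fix: writeInitialState() skips
// single-point clusters, so the mapper must null-test before using a key.
public class GuardSketch {
  public static void main(String[] args) {
    Map<Integer, Integer> clusterSizes = new HashMap<>();
    clusterSizes.put(1, 5); // normal cluster
    clusterSizes.put(2, 1); // singleton: skipped at write time

    Map<Integer, List<String>> representativePoints = new HashMap<>();
    for (Map.Entry<Integer, Integer> e : clusterSizes.entrySet()) {
      if (e.getValue() > 1) { // analogous to the getNumPoints() > 1 guard
        List<String> reps = new ArrayList<>();
        reps.add("centroid-" + e.getKey());
        representativePoints.put(e.getKey(), reps);
      }
    }
    // mapper-side null test before doing anything with the point
    for (int clusterId : clusterSizes.keySet()) {
      List<String> reps = representativePoints.get(clusterId);
      if (reps == null) continue; // singleton cluster: nothing to do
      System.out.println("cluster " + clusterId + " has " + reps.size() + " rep(s)");
    }
  }
}
```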

Does the issue also point to a separate problem with the std calculation, in that it's possible that negative numbers are passed to sqrt()?

Thanks,

Derek





