One general technique that can help with these kinds of problems (std <= 0) is to do the calculation for std assuming a prior distribution on the standard deviation. In practice, this comes down to assuming that you have some number of prior observations with non-zero deviation. You can implement this by starting the sum at epsilon > 0 and then adding that epsilon to the number of observations that you divide by at the end. If using an online computation, you just start the initial estimate at something slightly positive and start the count of the number of items at a small positive number that is << 1. This will cause negligible bias once real data is observed, but will prevent the variance from ever being negative.
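The epsilon-prior trick can be sketched in plain Java (the class and names below are illustrative, not Mahout code): seed both the count s0 and the sum of squares s2 with a tiny positive epsilon, so the computed variance stays strictly positive even when every observed point is identical.

```java
// Sketch of the prior-observation trick: seed the running sums with a tiny
// epsilon so the variance can never come out negative, at the cost of a
// negligible bias once real data arrives.
public class PriorStd {
    static final double EPSILON = 1.0e-9;

    static double std(double[] xs) {
        double s0 = EPSILON; // count of observations, seeded slightly positive
        double s1 = 0.0;     // running sum of values
        double s2 = EPSILON; // running sum of squares, started at epsilon > 0
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        // variance = (s2*s0 - s1*s1) / s0^2; the prior keeps it strictly > 0,
        // so the sqrt can never see a negative argument
        return Math.sqrt((s2 * s0 - s1 * s1) / (s0 * s0));
    }

    public static void main(String[] args) {
        // three copies of the same value: std is tiny but never NaN
        System.out.println(std(new double[] {2.0, 2.0, 2.0}));
    }
}
```

Without the epsilon seeds, three identical values give an exact radicand of zero, which floating-point rounding can push just below zero; with them, the radicand is guaranteed positive.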
On Tue, Sep 21, 2010 at 10:55 AM, Jeff Eastman <[email protected]> wrote:

> I'm coming to the same conclusion. In situations where the number of
> clusteredPoints is smaller than the number of representative points being
> requested, there will be duplication of some of the points in the
> representative points output. Since the cluster center is always the first
> representative point, it will be the likely one. I think the representative
> point job is doing things correctly. What I see inside the evaluator,
> however, is that it has some brittleness in some of these situations.
>
> I'm writing some tests to try to duplicate these errors, building off of
> the TestCDbwEvaluator.testCDbw1() test. I can duplicate your exception but
> don't yet have a solution.
>
> On 9/21/10 12:34 PM, Derek O'Callaghan wrote:
>
>> Hi Jeff,
>>
>> I made a quick change in CDbwDriver.writeInitialState(), changing:
>>
>> if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
>> cluster).getTotalCount() > 0) {
>>
>> to:
>>
>> if ((cluster instanceof DirichletCluster && ((DirichletCluster)
>> cluster).getTotalCount() > 1) || cluster.getNumPoints() > 1) {
>>
>> while also adding a null test in the mapper, and I get 4 non-zero values
>> printed at the end of the evaluator as expected. However, I'm not sure the
>> if statement change is the correct solution, given that getTotalCount()
>> and getNumPoints() return the number of points observed while building
>> the clusters, but not the actual number of clustered points from the set
>> that's passed to the mapper? In this particular case, it so happens that
>> number observed = number clustered = 1, but I guess it's possible that
>> this may not be the case with other data/clusters.
>>
>> Regarding the std calculation issue, I had a problem running Dirichlet at
>> the weekend, in that pdf was being calculated as NaN after a number of
>> iterations.
It might be a similar problem, I'll take a look at it again and
>> let you know if I find anything.
>>
>> Thanks,
>>
>> Derek
>>
>> On 21/09/10 16:50, Jeff Eastman wrote:
>>
>>> Hi Derek,
>>>
>>> Thanks for taking the time to look into CDbw. This is the first time
>>> anybody besides me has looked at it afaict and it is still quite
>>> experimental. I agree with your analysis and have seen this occur
>>> myself. It's a pathological case which is not handled well and your
>>> proposed fix may in fact be the best solution.
>>>
>>> On the std calculation itself, it is correct for scalar values of s0,
>>> s1 and s2 but I'm not as confident that it extrapolates to vectors. It
>>> also has potential overflow, underflow and rounding issues, but the
>>> running sums method is convenient and is used throughout the clustering
>>> code via AbstractCluster.computeParameters(). Most clustering doesn't
>>> really rely on the std (Dirichlet does, to compute the pdf for Gaussian
>>> models) and this is the only situation where I've seen this error.
>>>
>>> Finally, checking the computeParameters() implementation, it does not
>>> perform the std computation unless s0 > 1, so ignoring clusters with
>>> zero or one point is probably the right thing to do. Does it fix the
>>> problems you are seeing? I will write up a test today and commit a
>>> change if it does.
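For readers following the s0/s1/s2 discussion, here is a minimal plain-double sketch (not the Mahout Vector code) of the running-sums std and its failure mode: for n copies of one point, the radicand s2*s0 - s1*s1 is exactly zero in real arithmetic, so floating-point rounding can land it a hair below zero and the sqrt then yields NaN. Clamping the radicand at zero is one defensive fix.

```java
// Plain-double sketch of the running-sums std from CDbwEvaluator.setStDev(),
// with a clamp on the radicand. Mathematically s2*s0 - s1*s1 >= 0 always
// holds (Cauchy-Schwarz), but rounding can make it slightly negative when
// all points coincide, which is exactly the duplicated-point case in the
// thread.
public class StdGuard {
    static double stdOf(double[] xs) {
        double s0 = 0.0; // count
        double s1 = 0.0; // sum
        double s2 = 0.0; // sum of squares
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        // clamp: never hand a (tiny) negative value to sqrt
        double radicand = Math.max(0.0, s2 * s0 - s1 * s1);
        return Math.sqrt(radicand) / s0;
    }

    public static void main(String[] args) {
        // three copies of the sole point of a one-point cluster,
        // as described in the thread
        System.out.println(stdOf(new double[] {0.1, 0.1, 0.1})); // tiny or 0.0, never NaN
    }
}
```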
>>> Jeff
>>>
>>> On 9/21/10 10:39 AM, Derek O'Callaghan wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I've been trying out the CDbwDriver today, and I'm having a problem
>>>> running it on the clusters I've generated from my data, whereby I get
>>>> all 0s printed for the following lines in CDbwDriver.job():
>>>>
>>>> System.out.println("CDbw = " + evaluator.getCDbw());
>>>> System.out.println("Intra-cluster density = "
>>>>     + evaluator.intraClusterDensity());
>>>> System.out.println("Inter-cluster density = "
>>>>     + evaluator.interClusterDensity());
>>>> System.out.println("Separation = " + evaluator.separation());
>>>>
>>>> Stepping through this, I found a problem at these lines in
>>>> CDbwEvaluator.setStDev():
>>>>
>>>> Vector std = s2.times(s0).minus(s1.times(s1))
>>>>     .assign(new SquareRootFunction()).divide(s0);
>>>> double d = std.zSum() / std.size();
>>>>
>>>> 'd' was being set to NaN for one of my clusters, caused by
>>>> "s2.times(s0).minus(s1.times(s1))" returning a negative number, so the
>>>> subsequent sqrt failed. Looking at the cluster which had the problem,
>>>> I saw that it only contained one point. However, 'repPts' in
>>>> setStDev() contained 3 points, in fact 3 copies of the cluster's sole
>>>> inhabitant. This appeared to be causing the std calculation to fail, I
>>>> guess from floating point inaccuracies.
>>>>
>>>> I then started digging back further to see why there were 3 copies of
>>>> the same point in 'repPts'. FYI I had specified numIterations = 2 to
>>>> CDbwMapper.runJob(). Stepping through the code, I see the following
>>>> happening:
>>>>
>>>> - CDbwDriver.writeInitialState() writes out the cluster centroids to
>>>> "representatives-0", with the point in question being written out as
>>>> the representative for its cluster.
>>>> - CDbwMapper loads these into 'representativePoints' via
>>>> setup()/getRepresentativePoints().
>>>> - When CDbwMapper.map() is called with this point, it will be added to
>>>> 'mostDistantPoints'.
>>>> - CDbwReducer loads the mapper 'representativePoints' into
>>>> 'referencePoints' via setup()/CDbwMapper.getRepresentativePoints().
>>>> - CDbwReducer writes out the same point twice: once as a most distant
>>>> point in reduce(), and again as a reference/representative point in
>>>> cleanup().
>>>> - The process repeats, and an additional copy of the point is written
>>>> out by the reducer during each iteration, on top of those from the
>>>> previous iterations.
>>>> - Later on, the evaluator fails in the std calculation as described
>>>> above.
>>>>
>>>> I'm wondering if the quickest solution would be to change the
>>>> following statement in CDbwDriver.writeInitialState():
>>>>
>>>> if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
>>>> cluster).getTotalCount() > 0) {
>>>>
>>>> to ignore clusters which only contain one point? The mapper would then
>>>> need to check if there was an entry for the cluster id key in
>>>> representativePoints before doing anything with the point.
>>>>
>>>> Does the issue also point to a separate problem with the std
>>>> calculation, in that it's possible that negative numbers are passed to
>>>> sqrt()?
>>>>
>>>> Thanks,
>>>>
>>>> Derek
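The guard proposed in the thread can be isolated as a standalone predicate. The Cluster and DirichletCluster interfaces below are minimal stand-ins for illustration, not the real Mahout classes; the condition itself is the one Derek suggested.

```java
// Standalone version of the proposed writeInitialState() guard: skip
// clusters that only ever observed one point, so a lone point is never
// duplicated into the representative-points output.
public class RepresentativeFilter {
    interface Cluster {
        int getNumPoints();
    }

    interface DirichletCluster extends Cluster {
        long getTotalCount();
    }

    // Derek's proposed condition, verbatim in structure
    static boolean shouldWrite(Cluster cluster) {
        return (cluster instanceof DirichletCluster
                && ((DirichletCluster) cluster).getTotalCount() > 1)
            || cluster.getNumPoints() > 1;
    }

    // tiny stand-in factories for demonstration only
    static Cluster plainCluster(int numPoints) {
        return () -> numPoints;
    }

    static DirichletCluster dirichletCluster(long totalCount, int numPoints) {
        return new DirichletCluster() {
            public long getTotalCount() { return totalCount; }
            public int getNumPoints() { return numPoints; }
        };
    }

    public static void main(String[] args) {
        System.out.println(shouldWrite(plainCluster(1))); // false: skip one-point cluster
    }
}
```

As noted in the thread, getTotalCount()/getNumPoints() reflect points observed during clustering rather than the points actually passed to the mapper, so the mapper still needs a null check for cluster ids that were filtered out here.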
