One general technique that can help with these kinds of problems (std <= 0) is to do the calculation for std assuming a prior distribution on the standard deviation. In practice, this comes down to assuming that you have some number of prior observations with non-zero deviation. You can implement this by starting the sum at epsilon > 0 and then adding that epsilon to the number of observations that you divide by at the end. If using an online computation, you just start the initial estimate at something slightly positive and start the count of the number of items at a small positive number that is << 1. This will cause negligible bias once real data is observed, but will prevent the variance from ever being negative.
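The epsilon-prior trick can be sketched in plain Java (the class and names below are illustrative, not Mahout code): seed both the count s0 and the sum of squares s2 with a tiny positive epsilon, so the computed variance stays strictly positive even when every observed point is identical.

```java
// Sketch of the prior-observation trick: seed the running sums with a tiny
// epsilon so the variance can never come out negative, at the cost of a
// negligible bias once real data arrives.
public class PriorStd {
    static final double EPSILON = 1.0e-9;

    static double std(double[] xs) {
        double s0 = EPSILON; // count of observations, seeded slightly positive
        double s1 = 0.0;     // running sum of values
        double s2 = EPSILON; // running sum of squares, started at epsilon > 0
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        // variance = (s2*s0 - s1*s1) / s0^2; the prior keeps it strictly > 0,
        // so the sqrt can never see a negative argument
        return Math.sqrt((s2 * s0 - s1 * s1) / (s0 * s0));
    }

    public static void main(String[] args) {
        // three copies of the same value: std is tiny but never NaN
        System.out.println(std(new double[] {2.0, 2.0, 2.0}));
    }
}
```

Without the epsilon seeds, three identical values give an exact radicand of zero, which floating-point rounding can push just below zero; with them, the radicand is guaranteed positive.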
On Tue, Sep 21, 2010 at 10:55 AM, Jeff Eastman <[email protected]> wrote:

> I'm coming to the same conclusion. In situations where the number of
> clusteredPoints is smaller than the number of representative points being
> requested, there will be duplication of some of the points in the
> representative points output. Since the cluster center is always the first
> representative point, it will be the likely one. I think the representative
> point job is doing things correctly. What I see inside the evaluator,
> however, is that it has some brittleness in some of these situations.
>
> I'm writing some tests to try to duplicate these errors, building off of
> the TestCDbwEvaluator.testCDbw1() test. I can duplicate your exception but
> don't yet have a solution.
>
> On 9/21/10 12:34 PM, Derek O'Callaghan wrote:
>
>> Hi Jeff,
>>
>> I made a quick change in CDbwDriver.writeInitialState(), changing:
>>
>> if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
>> cluster).getTotalCount() > 0) {
>>
>> to:
>>
>> if ((cluster instanceof DirichletCluster && ((DirichletCluster)
>> cluster).getTotalCount() > 1) || cluster.getNumPoints() > 1) {
>>
>> while also adding a null test in the mapper, and I get 4 non-zero values
>> printed at the end of the evaluator as expected. However, I'm not sure the
>> if statement change is the correct solution, given that getTotalCount()
>> and getNumPoints() return the number of points observed while building
>> the clusters, but not the actual number of clustered points from the set
>> that's passed to the mapper? In this particular case, it so happens that
>> number observed = number clustered = 1, but I guess it's possible that
>> this may not be the case with other data/clusters.
>>
>> Regarding the std calculation issue, I had a problem running Dirichlet at
>> the weekend, in that pdf was being calculated as NaN after a number of
>> iterations.
It might be a similar problem, I'll take a look at it again and
>> let you know if I find anything.
>>
>> Thanks,
>>
>> Derek
>>
>> On 21/09/10 16:50, Jeff Eastman wrote:
>>
>>> Hi Derek,
>>>
>>> Thanks for taking the time to look into CDbw. This is the first time
>>> anybody besides me has looked at it afaict and it is still quite
>>> experimental. I agree with your analysis and have seen this occur
>>> myself. It's a pathological case which is not handled well and your
>>> proposed fix may in fact be the best solution.
>>>
>>> On the std calculation itself, it is correct for scalar values of s0,
>>> s1 and s2 but I'm not as confident that it extrapolates to vectors. It
>>> also has potential overflow, underflow and rounding issues, but the
>>> running sums method is convenient and is used throughout the clustering
>>> code via AbstractCluster.computeParameters(). Most clustering doesn't
>>> really rely on the std (Dirichlet does, to compute the pdf for Gaussian
>>> models) and this is the only situation where I've seen this error.
>>>
>>> Finally, checking the computeParameters() implementation, it does not
>>> perform the std computation unless s0 > 1, so ignoring clusters with
>>> zero or one point is probably the right thing to do. Does it fix the
>>> problems you are seeing? I will write up a test today and commit a
>>> change if it does.
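For readers following the s0/s1/s2 discussion, here is a minimal plain-double sketch (not the Mahout Vector code) of the running-sums std and its failure mode: for n copies of one point, the radicand s2*s0 - s1*s1 is exactly zero in real arithmetic, so floating-point rounding can land it a hair below zero and the sqrt then yields NaN. Clamping the radicand at zero is one defensive fix.

```java
// Plain-double sketch of the running-sums std from CDbwEvaluator.setStDev(),
// with a clamp on the radicand. Mathematically s2*s0 - s1*s1 >= 0 always
// holds (Cauchy-Schwarz), but rounding can make it slightly negative when
// all points coincide, which is exactly the duplicated-point case in the
// thread.
public class StdGuard {
    static double stdOf(double[] xs) {
        double s0 = 0.0; // count
        double s1 = 0.0; // sum
        double s2 = 0.0; // sum of squares
        for (double x : xs) {
            s0 += 1.0;
            s1 += x;
            s2 += x * x;
        }
        // clamp: never hand a (tiny) negative value to sqrt
        double radicand = Math.max(0.0, s2 * s0 - s1 * s1);
        return Math.sqrt(radicand) / s0;
    }

    public static void main(String[] args) {
        // three copies of the sole point of a one-point cluster,
        // as described in the thread
        System.out.println(stdOf(new double[] {0.1, 0.1, 0.1})); // tiny or 0.0, never NaN
    }
}
```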
>>> Jeff
>>>
>>> On 9/21/10 10:39 AM, Derek O'Callaghan wrote:
>>>
>>>> Hi Jeff,
>>>>
>>>> I've been trying out the CDbwDriver today, and I'm having a problem
>>>> running it on the clusters I've generated from my data, whereby I get
>>>> all 0s printed for the following lines in CDbwDriver.job():
>>>>
>>>> System.out.println("CDbw = " + evaluator.getCDbw());
>>>> System.out.println("Intra-cluster density = "
>>>>     + evaluator.intraClusterDensity());
>>>> System.out.println("Inter-cluster density = "
>>>>     + evaluator.interClusterDensity());
>>>> System.out.println("Separation = " + evaluator.separation());
>>>>
>>>> Stepping through this, I found a problem at these lines in
>>>> CDbwEvaluator.setStDev():
>>>>
>>>> Vector std = s2.times(s0).minus(s1.times(s1))
>>>>     .assign(new SquareRootFunction()).divide(s0);
>>>> double d = std.zSum() / std.size();
>>>>
>>>> 'd' was being set to NaN for one of my clusters, caused by
>>>> "s2.times(s0).minus(s1.times(s1))" returning a negative number, so the
>>>> subsequent sqrt failed. Looking at the cluster which had the problem,
>>>> I saw that it only contained one point. However, 'repPts' in
>>>> setStDev() contained 3 points, in fact 3 copies of the cluster's sole
>>>> inhabitant. This appeared to be causing the std calculation to fail, I
>>>> guess from floating point inaccuracies.
>>>>
>>>> I then started digging back further to see why there were 3 copies of
>>>> the same point in 'repPts'. FYI I had specified numIterations = 2 to
>>>> CDbwMapper.runJob(). Stepping through the code, I see the following
>>>> happening:
>>>>
>>>> - CDbwDriver.writeInitialState() writes out the cluster centroids to
>>>> "representatives-0", with the point in question being written out as
>>>> the representative for its cluster.
>>>> - CDbwMapper loads these into 'representativePoints' via
>>>> setup()/getRepresentativePoints().
>>>> - When CDbwMapper.map() is called with this point, it will be added to
>>>> 'mostDistantPoints'.
>>>> - CDbwReducer loads the mapper 'representativePoints' into
>>>> 'referencePoints' via setup()/CDbwMapper.getRepresentativePoints().
>>>> - CDbwReducer writes out the same point twice: once as a most distant
>>>> point in reduce(), and again as a reference/representative point in
>>>> cleanup().
>>>> - The process repeats, and an additional copy of the point is written
>>>> out by the reducer during each iteration, on top of those from the
>>>> previous iterations.
>>>> - Later on, the evaluator fails in the std calculation as described
>>>> above.
>>>>
>>>> I'm wondering if the quickest solution would be to change the
>>>> following statement in CDbwDriver.writeInitialState():
>>>>
>>>> if (!(cluster instanceof DirichletCluster) || ((DirichletCluster)
>>>> cluster).getTotalCount() > 0) {
>>>>
>>>> to ignore clusters which only contain one point? The mapper would then
>>>> need to check if there was an entry for the cluster id key in
>>>> representativePoints before doing anything with the point.
>>>>
>>>> Does the issue also point to a separate problem with the std
>>>> calculation, in that it's possible that negative numbers are passed to
>>>> sqrt()?
>>>>
>>>> Thanks,
>>>>
>>>> Derek
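The guard proposed in the thread can be isolated as a standalone predicate. The Cluster and DirichletCluster interfaces below are minimal stand-ins for illustration, not the real Mahout classes; the condition itself is the one Derek suggested.

```java
// Standalone version of the proposed writeInitialState() guard: skip
// clusters that only ever observed one point, so a lone point is never
// duplicated into the representative-points output.
public class RepresentativeFilter {
    interface Cluster {
        int getNumPoints();
    }

    interface DirichletCluster extends Cluster {
        long getTotalCount();
    }

    // Derek's proposed condition, verbatim in structure
    static boolean shouldWrite(Cluster cluster) {
        return (cluster instanceof DirichletCluster
                && ((DirichletCluster) cluster).getTotalCount() > 1)
            || cluster.getNumPoints() > 1;
    }

    // tiny stand-in factories for demonstration only
    static Cluster plainCluster(int numPoints) {
        return () -> numPoints;
    }

    static DirichletCluster dirichletCluster(long totalCount, int numPoints) {
        return new DirichletCluster() {
            public long getTotalCount() { return totalCount; }
            public int getNumPoints() { return numPoints; }
        };
    }

    public static void main(String[] args) {
        System.out.println(shouldWrite(plainCluster(1))); // false: skip one-point cluster
    }
}
```

As noted in the thread, getTotalCount()/getNumPoints() reflect points observed during clustering rather than the points actually passed to the mapper, so the mapper still needs a null check for cluster ids that were filtered out here.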
