[ https://issues.apache.org/jira/browse/MAHOUT-1368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13838638#comment-13838638 ]

Suneel Marthi edited comment on MAHOUT-1368 at 12/4/13 6:07 AM:
----------------------------------------------------------------

Ted, we need to hold off on committing this patch until we fix
ClusterQualitySummarizer, which breaks once this patch is applied.  I'll look
at it tomorrow; it's too late at night now to wrap my head around it.

Running ClusterQualitySummarizer (after applying this patch) on the output of
StreamingKMeans throws the following exception:

{Code}
Average distance in cluster 0 [4]: 18723.469424
Average distance in cluster 1 [1169]: 13974.466645
Average distance in cluster 2 [1932]: 1273.335898
Exception in thread "main" java.lang.IllegalArgumentException
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:76)
        at org.apache.mahout.math.stats.TDigest.quantile(TDigest.java:268)
        at org.apache.mahout.math.stats.OnlineSummarizer.getQuartile(OnlineSummarizer.java:83)
        at org.apache.mahout.math.stats.OnlineSummarizer.getMax(OnlineSummarizer.java:79)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:74)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.printSummaries(ClusterQualitySummarizer.java:66)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.run(ClusterQualitySummarizer.java:141)
        at org.apache.mahout.clustering.streaming.tools.ClusterQualitySummarizer.main(ClusterQualitySummarizer.java:281)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
        at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
        at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)

{Code}


was (Author: smarthi):
Ted, we need to hold off on committing this patch until we fix the issue with 
ClusterQualitySummarizer which is broken after applying this patch.  I'll look 
at it tomorrow, too late in the night to wrap my head around the issue.

> Convert OnlineSummarizer to use the new TDigest
> -----------------------------------------------
>
>                 Key: MAHOUT-1368
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1368
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Ted Dunning
>             Fix For: 0.9
>
>         Attachments: MAHOUT-1368.patch
>
>
> The new TDigest provides better accuracy for quartile estimation as well as 
> producing any other quantile you might like.  The current quartile estimation 
> of the OnlineSummarizer fails for highly skewed distributions and can't 
> really be extended to provide other quantiles.  The TDigest handles all of 
> this.



