[jira] [Comment Edited] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true

Suneel Marthi (JIRA) Fri, 21 Mar 2014 07:45:13 -0700

    [ 
https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13942847#comment-13942847
 ]


Suneel Marthi edited comment on MAHOUT-1469 at 3/21/14 2:43 PM:
----------------------------------------------------------------

The issue is with the code in StreamingKMeansthread.java which has the 
NUM_ESTIMATE_POINTS set to 1000 and if the number of centroids being clustered 
is < NUM_ESTIMATE_POINTS then no intermediateCentroids are returned to 
BallKMeans.

Seeing more issues, keeping notes lest I loose track, will submit a patch over 
the weekend to address these:

a) Always uses SquaredEuclideanDistanceMeasure, regardless of what's been 
provided via CLI. The original paper talks about using SquaredEuclidean but 
should we be honoring the user provided distanceMeasure. 

b) ClusteringUtils.estimateDistanceCutOff() always uses BruteSearch, regardless 
of what's been provided via CLI. Hence it always calls 
BruteSearch.searchFirst() when computing estimated minDistance.

c) Sequential Mode completely ignores -rskm flag. 

d) If the no. of datapoints is less than NUM_ESTIMATE_POINTS, then clustering 
fails when -rskm specified for MapReduce mode. This is because there are no 
more points to cluster once the estimated cutoff distance has been computed.


was (Author: smarthi):
The issue is with the code in StreamingKMeansthread.java which has the 
NUM_ESTIMATE_POINTS set to 1000 and if the number of centroids being clustered 
is < NUM_ESTIMATE_POINTS then no intermediateCentroids are returned to 
BallKMeans.

Also noticed that -rskm flag is not honored in Sequential mode execution of 
Streaming KMeans.

Will be submitting a patch to fix both the issues.

> Streaming KMeans fails when executed in MapReduce mode and 
> REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-1469
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1469
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.9
>            Reporter: Suneel Marthi
>            Assignee: Suneel Marthi
>             Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag 
> set. 
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient:  map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: > 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and 
> test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
>       at 
> com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
>       at 
> org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>       at 
> org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>       at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>       at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>       at 
> org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>       at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>       at 
> org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>       at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>       at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: 
> job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters 
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized 
> bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage 
> (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 
> 8.437816666666667)
> {Code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)

[jira] [Comment Edited] (MAHOUT-1469) Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true

Reply via email to