[
https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982406#comment-13982406
]
Maxim Arap edited comment on MAHOUT-1469 at 4/27/14 5:52 PM:
-------------------------------------------------------------
Have you considered splitting the StreamingKMeans implementation into the
following three components, which could be run independently or in
combination?
1. Streaming step
2. kmeans++
3. Ball k-means
It seems that this would not involve too much work, and it may provide:
1. useful flexibility for the user
2. improved running time with little or no loss in clustering quality
3. logical clarity
4. extensibility (other clustering algorithms, like the one Harry Lang
mentioned from Ke Chen's paper, could be added and used "plug-and-play" with
the other components)
In particular, the Streaming step seems unnecessary if the data is available in
its entirety when the program starts executing (which is true in standard uses
of clustering).
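To make the proposed split concrete, here is a minimal sketch of three pluggable stages. Everything in it is an assumption for illustration: the `Sketcher`, `Seeder`, and `Refiner` interfaces and the `cluster` method are hypothetical names, not Mahout APIs, and plain Lloyd iterations stand in for ball k-means. Passing `null` for the sketcher skips the streaming step, which is the case where all data is already available.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public final class ClusteringPipelineSketch {
    // Hypothetical stage interfaces; none of these exist in Mahout.
    public interface Sketcher { List<double[]> sketch(List<double[]> data); }
    public interface Seeder { List<double[]> seed(List<double[]> data, int k); }
    public interface Refiner { List<double[]> refine(List<double[]> data, List<double[]> centroids); }

    /** Runs the stages in order; a null sketcher or refiner means "skip this stage". */
    public static List<double[]> cluster(List<double[]> data, int k,
                                         Sketcher sketcher, Seeder seeder, Refiner refiner) {
        List<double[]> working = sketcher == null ? data : sketcher.sketch(data);
        List<double[]> centroids = seeder.seed(working, k);
        return refiner == null ? centroids : refiner.refine(working, centroids);
    }

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    /** kmeans++ seeding: each new seed is drawn with probability proportional to
     *  its squared distance from the nearest seed chosen so far. */
    public static Seeder kMeansPlusPlus(Random rng) {
        return (data, k) -> {
            List<double[]> seeds = new ArrayList<>();
            if (data.isEmpty() || k <= 0) {
                return seeds; // nothing to seed from; callers must handle an empty result
            }
            seeds.add(data.get(rng.nextInt(data.size())));
            double[] d2 = new double[data.size()];
            Arrays.fill(d2, Double.POSITIVE_INFINITY);
            int target = Math.min(k, data.size());
            while (seeds.size() < target) {
                double[] last = seeds.get(seeds.size() - 1);
                double total = 0;
                for (int i = 0; i < data.size(); i++) {
                    d2[i] = Math.min(d2[i], squaredDistance(data.get(i), last));
                    total += d2[i];
                }
                if (total == 0) {
                    break; // every point coincides with an existing seed
                }
                double r = rng.nextDouble() * total;
                double acc = 0;
                int chosen = data.size() - 1;
                for (int i = 0; i < data.size(); i++) {
                    acc += d2[i];
                    if (acc > r) { chosen = i; break; }
                }
                seeds.add(data.get(chosen));
            }
            return seeds;
        };
    }

    /** Plain Lloyd iterations, used here only as a stand-in for ball k-means. */
    public static Refiner lloyd(int iterations) {
        return (data, centroids) -> {
            List<double[]> current = new ArrayList<>(centroids);
            for (int it = 0; it < iterations && !current.isEmpty(); it++) {
                int dim = current.get(0).length;
                double[][] sums = new double[current.size()][dim];
                int[] counts = new int[current.size()];
                for (double[] p : data) {
                    int best = 0;
                    for (int c = 1; c < current.size(); c++) {
                        if (squaredDistance(p, current.get(c)) < squaredDistance(p, current.get(best))) {
                            best = c;
                        }
                    }
                    counts[best]++;
                    for (int d = 0; d < dim; d++) sums[best][d] += p[d];
                }
                List<double[]> next = new ArrayList<>();
                for (int c = 0; c < current.size(); c++) {
                    if (counts[c] == 0) { next.add(current.get(c)); continue; } // keep empty clusters in place
                    for (int d = 0; d < dim; d++) sums[c][d] /= counts[c];
                    next.add(sums[c]);
                }
                current = next;
            }
            return current;
        };
    }
}
```

With this shape, a caller holding all the data in memory could call `cluster(data, k, null, kMeansPlusPlus(rng), lloyd(10))`, while a streaming setup would supply a `Sketcher` that reduces the input first.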
> Streaming KMeans fails when executed in MapReduce mode and
> REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1469
> URL: https://issues.apache.org/jira/browse/MAHOUT-1469
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with -rskm flag
> set.
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient: map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
> at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
> at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
> at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
> at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
> at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
> at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient: File Input Format Counters
> 14/03/20 02:42:14 INFO mapred.JobClient: Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient: FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient: FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient: Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient: Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient: Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient: Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}
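The exception text in the log above still contains raw `%.1f` and `%d` placeholders, with the values appended in square brackets. That is what Guava's `Preconditions.checkArgument` does with placeholders other than `%s`. The sketch below uses plain Java to show the kind of train/test size check that fails when zero vectors reach the split, with the message formatted as presumably intended. `SplitGuardSketch` and `testSetSize` are hypothetical names, and the size computation is an assumption, not Mahout's actual `splitTrainTest` logic.

```java
import java.util.Locale;

public final class SplitGuardSketch {
    /**
     * Hypothetical helper: returns how many of numVectors go to the test set,
     * rejecting splits that would leave the training or test set empty.
     */
    public static int testSetSize(int numVectors, double testProbability) {
        int testSize = (int) (testProbability * numVectors);
        if (testSize <= 0 || testSize >= numVectors) {
            // String.format understands %.1f and %d, unlike Guava's %s-only templates.
            throw new IllegalArgumentException(String.format(Locale.ROOT,
                "Must have nonzero number of training and test vectors. "
                    + "Asked for %.1f %% of %d vectors for test",
                testProbability * 100, numVectors));
        }
        return testSize;
    }
}
```

Under these assumptions, a caller that may receive no centroids (as in the log, where `Reduce input records=0`) would need to check for an empty input before asking for a split.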