[ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982401#comment-13982401 ]
Maxim Arap commented on MAHOUT-1469:
------------------------------------
Suneel: I modified clusterInternal() in my local copy of trunk so that
distanceCutoff is updated every time numClusters is updated (namely,
distanceCutoff = distanceCutoff * numClustersOld / numClusters). I compiled
the new version and ran it in pseudo-distributed MapReduce mode on my laptop
against the Reuters dataset, to compare run times with and without the
distanceCutoff tweak; a sketch of the change is below.
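Concretely (a sketch rather than the exact diff; clusterLogFactor and
numProcessedDatapoints are my reading of the trunk code, and the surrounding
control flow is elided):
{Code}
// Inside clusterInternal(), at the point where numClusters gets bumped.
int numClustersOld = numClusters;
numClusters = Math.max(numClusters,
    (int) (clusterLogFactor * Math.log(numProcessedDatapoints)));
if (numClusters > numClustersOld) {
  // Tweak: shrink the cutoff in proportion to the growth of numClusters,
  // keeping the facility cost roughly proportional to 1 / numClusters.
  distanceCutoff = distanceCutoff * numClustersOld / numClusters;
}
{Code}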
For the Reuters dataset, if the final number of clusters is 10, the estimated
number of clusters for the streaming step is about 100: the dataset has about
21,600 documents, so with k = 10 and n = 21,600, k log n is about 100
(natural log).
Original distanceCutoff updating:
Initial numClusters: 100
Final numClusters: 215
Run time in minutes: 8.00858
Tweaked distanceCutoff updating:
Initial numClusters: 100
Final numClusters: 269
Run time in minutes: 8.12646
What if we give the programs an underestimated initial numClusters? Not much
changes:
Original distanceCutoff updating:
Initial numClusters: 25
Final numClusters: 271
Run time in minutes: 7.5418
Tweaked distanceCutoff updating:
Initial numClusters: 25
Final numClusters: 307
Run time in minutes: 7.77573
I did the analogous comparison in sequential mode and the situation is
similar. My experiment suggests that the bottleneck is not the distanceCutoff
update rule. Harry Lang and I suspect it lies instead in the parallelization
scheme (i.e., StreamingKMeansMapper.java and StreamingKMeansReducer.java).
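For reference, the exception in the original report (quoted below) is thrown
by a Guava precondition in BallKMeans.splitTrainTest() once zero centroids
reach the reducer. A minimal sketch of that style of check, with illustrative
argument names rather than Mahout's exact source:
{Code}
// Guava's Preconditions.checkArgument() substitutes only %s placeholders;
// specifiers like %.1f are left verbatim and unmatched arguments are appended
// in square brackets. That is why the log below shows the raw template
// followed by "[10.000000149011612, 0]": a 10% test split was requested of
// 0 vectors (and 10.000000149011612 suggests a float 0.1f widened to double).
Preconditions.checkArgument(
    datapoints.size() > 0,
    "Must have nonzero number of training and test vectors. "
        + "Asked for %.1f %% of %d vectors for test",
    testProbability * 100, datapoints.size());
{Code}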
> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1469
> URL: https://issues.apache.org/jira/browse/MAHOUT-1469
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with the -rskm flag set.
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient: map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}