[ https://issues.apache.org/jira/browse/MAHOUT-1469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13982401#comment-13982401 ]
Maxim Arap commented on MAHOUT-1469:
------------------------------------
Suneel: I modified clusterInternal() in my local copy of trunk so that
distanceCutoff is updated every time numClusters is updated (namely,
distanceCutoff = distanceCutoff * numClustersOld / numClusters). I compiled
the new version and ran it in pseudo-distributed MapReduce mode on my laptop
against the Reuters dataset, to compare run times with and without the
distanceCutoff tweak; a sketch of the change is below.
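Concretely (a sketch rather than the exact diff; clusterLogFactor and
numProcessedDatapoints are my reading of the trunk code, and the surrounding
control flow is elided):
{Code}
// Inside clusterInternal(), at the point where numClusters gets bumped.
int numClustersOld = numClusters;
numClusters = Math.max(numClusters,
    (int) (clusterLogFactor * Math.log(numProcessedDatapoints)));
if (numClusters > numClustersOld) {
  // Tweak: shrink the cutoff in proportion to the growth of numClusters,
  // keeping the facility cost roughly proportional to 1 / numClusters.
  distanceCutoff = distanceCutoff * numClustersOld / numClusters;
}
{Code}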
For the Reuters dataset, if the final number of clusters is 10, the estimated
number of clusters for the streaming step is about 100: the dataset has about
21,600 documents, so with k = 10 and n = 21,600, k log n is about 100
(natural log).
Original distanceCutoff updating:
Initial numClusters: 100
Final numClusters: 215
Run time in minutes: 8.00858
Tweaked distanceCutoff updating:
Initial numClusters: 100
Final numClusters: 269
Run time in minutes: 8.12646
What if we give the programs an underestimated initial numClusters? Not much
changes:
Original distanceCutoff updating:
Initial numClusters: 25
Final numClusters: 271
Run time in minutes: 7.5418
Tweaked distanceCutoff updating:
Initial numClusters: 25
Final numClusters: 307
Run time in minutes: 7.77573
I did the analogous comparison in sequential mode and the situation is
similar. My experiment suggests that the bottleneck is not the distanceCutoff
update rule. Harry Lang and I suspect it lies instead in the parallelization
scheme (i.e., StreamingKMeansMapper.java and StreamingKMeansReducer.java).
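For reference, the exception in the original report (quoted below) is thrown
by a Guava precondition in BallKMeans.splitTrainTest() once zero centroids
reach the reducer. A minimal sketch of that style of check, with illustrative
argument names rather than Mahout's exact source:
{Code}
// Guava's Preconditions.checkArgument() substitutes only %s placeholders;
// specifiers like %.1f are left verbatim and unmatched arguments are appended
// in square brackets. That is why the log below shows the raw template
// followed by "[10.000000149011612, 0]": a 10% test split was requested of
// 0 vectors (and 10.000000149011612 suggests a float 0.1f widened to double).
Preconditions.checkArgument(
    datapoints.size() > 0,
    "Must have nonzero number of training and test vectors. "
        + "Asked for %.1f %% of %d vectors for test",
    testProbability * 100, datapoints.size());
{Code}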
> Streaming KMeans fails when executed in MapReduce mode and REDUCE_STREAMING_KMEANS is set to true
> -------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1469
> URL: https://issues.apache.org/jira/browse/MAHOUT-1469
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.9
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 1.0
>
>
> Centroids are not being generated when executed in MR mode with the -rskm flag set.
> {Code}
> 14/03/20 02:42:12 INFO mapreduce.StreamingKMeansThread: Estimated Points: 282
> 14/03/20 02:42:12 INFO mapred.JobClient: map 100% reduce 0%
> 14/03/20 02:42:14 INFO mapreduce.StreamingKMeansReducer: Number of Centroids: 0
> 14/03/20 02:42:14 WARN mapred.LocalJobRunner: job_local1374896815_0001
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> 14/03/20 02:42:14 INFO mapred.JobClient: Job complete: job_local1374896815_0001
> 14/03/20 02:42:14 INFO mapred.JobClient: Counters: 16
> 14/03/20 02:42:14 INFO mapred.JobClient:   File Input Format Counters
> 14/03/20 02:42:14 INFO mapred.JobClient:     Bytes Read=17156391
> 14/03/20 02:42:14 INFO mapred.JobClient:   FileSystemCounters
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_READ=41925624
> 14/03/20 02:42:14 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=25974741
> 14/03/20 02:42:14 INFO mapred.JobClient:   Map-Reduce Framework
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output materialized bytes=956293
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map input records=21578
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce shuffle bytes=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Spilled Records=282
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output bytes=1788012
> 14/03/20 02:42:14 INFO mapred.JobClient:     Total committed heap usage (bytes)=217214976
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     SPLIT_RAW_BYTES=163
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce input groups=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Combine output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Reduce output records=0
> 14/03/20 02:42:14 INFO mapred.JobClient:     Map output records=282
> 14/03/20 02:42:14 INFO driver.MahoutDriver: Program took 506269 ms (Minutes: 8.437816666666667)
> {Code}