[ https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13855974#comment-13855974 ]
Hudson commented on MAHOUT-1358:
--------------------------------
SUCCESS: Integrated in Mahout-Quality #2381 (See [https://builds.apache.org/job/Mahout-Quality/2381/])
MAHOUT-1358 - the earlier fix for this issue threw a heap-space exception for large datasets during the Mapper phase; a new fix is now in place, plus code cleanup. (smarthi: rev 1553189)
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansDriver.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansMapper.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java
> StreamingKMeansThread throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true
> --------------------------------------------------------------------------------------------------
>
> Key: MAHOUT-1358
> URL: https://issues.apache.org/jira/browse/MAHOUT-1358
> Project: Mahout
> Issue Type: Bug
> Components: Clustering
> Affects Versions: 0.8
> Reporter: Suneel Marthi
> Assignee: Suneel Marthi
> Fix For: 0.9
>
> Attachments: MAHOUT-1358.patch
>
>
> Running StreamingKMeans clustering with REDUCE_STREAMING_KMEANS = true throws the following error:
> {code}
> java.lang.IllegalArgumentException: Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0]
>     at com.google.common.base.Preconditions.checkArgument(Preconditions.java:120)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176)
>     at org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73)
>     at org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37)
>     at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177)
>     at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> {code}
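> As a side note, the half-formatted message is expected behavior from Guava: Preconditions substitutes only %s placeholders, so the %.1f and %d stay verbatim and the unmatched arguments are appended in square brackets. The trailing 0 is therefore the number of vectors that actually reached BallKMeans. A minimal illustration (not Mahout code, just Guava's documented formatting):
> {code}
> import com.google.common.base.Preconditions;
>
> public class GuavaMessageDemo {
>   public static void main(String[] args) {
>     // Guava substitutes only %s; other specifiers are left as-is and
>     // leftover arguments are appended in square brackets.
>     Preconditions.checkArgument(false,
>         "Must have nonzero number of training and test vectors. Asked for %.1f %% of %d vectors for test",
>         10.000000149011612, 0);
>     // -> IllegalArgumentException: ... of %d vectors for test [10.000000149011612, 0]
>   }
> }
> {code}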
> The issue is caused by the following code in StreamingKMeansThread.call():
> {code}
> Iterator<Centroid> datapointsIterator = datapoints.iterator();
> if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
>   List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
>   while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
>     estimatePoints.add(datapointsIterator.next());
>   }
>   estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints, searcher.getDistanceMeasure());
> }
> StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
> while (datapointsIterator.hasNext()) {
>   clusterer.cluster(datapointsIterator.next());
> }
> {code}
> The code consumes the same iterator twice: the estimation loop drains up to NUM_ESTIMATE_POINTS points, and those points never reach the clustering loop. When the input holds no more than NUM_ESTIMATE_POINTS points, the iterator is exhausted before clustering starts, the clusterer emits no centroids, and the reducer's BallKMeans.splitTrainTest() then fails the nonzero-vectors precondition shown in the stack trace above.
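> One possible repair, sketched only from the snippet above (the actual fix committed in rev 1553189 may differ): keep the sampled points and feed them to the clusterer before draining the rest of the iterator, so no point is lost and memory stays bounded by NUM_ESTIMATE_POINTS.
> {code}
> Iterator<Centroid> datapointsIterator = datapoints.iterator();
> // Declared outside the if-block so the buffered sample is visible below.
> List<Centroid> estimatePoints = Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS);
> if (estimateDistanceCutoff == StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) {
>   while (datapointsIterator.hasNext() && estimatePoints.size() < NUM_ESTIMATE_POINTS) {
>     estimatePoints.add(datapointsIterator.next());
>   }
>   estimateDistanceCutoff = ClusteringUtils.estimateDistanceCutoff(estimatePoints, searcher.getDistanceMeasure());
> }
> StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters, estimateDistanceCutoff);
> // Cluster the buffered sample first so the estimate points are not dropped...
> for (Centroid estimatePoint : estimatePoints) {
>   clusterer.cluster(estimatePoint);
> }
> // ...then stream the remaining points as before.
> while (datapointsIterator.hasNext()) {
>   clusterer.cluster(datapointsIterator.next());
> }
> {code}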
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)