[ https://issues.apache.org/jira/browse/MAHOUT-1358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13826083#comment-13826083 ]
Hudson commented on MAHOUT-1358: -------------------------------- SUCCESS: Integrated in Mahout-Quality #2326 (See [https://builds.apache.org/job/Mahout-Quality/2326/]) MAHOUT-1358: StreamingKMeansReducer throws IllegalArgumentException when REDUCE_STREAMING_KMEANS is set to true (smarthi: rev 1543095) * /mahout/trunk/CHANGELOG * /mahout/trunk/core/src/main/java/org/apache/mahout/clustering/streaming/mapreduce/StreamingKMeansThread.java > StreamingKMeansThread throws IllegalArgumentException when > REDUCE_STREAMING_KMEANS is set to true > ------------------------------------------------------------------------------------------------- > > Key: MAHOUT-1358 > URL: https://issues.apache.org/jira/browse/MAHOUT-1358 > Project: Mahout > Issue Type: Bug > Components: Clustering > Affects Versions: 0.8 > Reporter: Suneel Marthi > Assignee: Suneel Marthi > Fix For: 0.9 > > Attachments: MAHOUT-1358.patch > > > Running StreamingKMeans Clustering with REDUCE_STREAMING_KMEANS = true, > throws the following error > {Code} > java.lang.IllegalArgumentException: Must have nonzero number of training and > test vectors. Asked for %.1f %% of %d vectors for test [10.000000149011612, 0] > at > com.google.common.base.Preconditions.checkArgument(Preconditions.java:120) > at > org.apache.mahout.clustering.streaming.cluster.BallKMeans.splitTrainTest(BallKMeans.java:176) > at > org.apache.mahout.clustering.streaming.cluster.BallKMeans.cluster(BallKMeans.java:192) > at > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.getBestCentroids(StreamingKMeansReducer.java:107) > at > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:73) > at > org.apache.mahout.clustering.streaming.mapreduce.StreamingKMeansReducer.reduce(StreamingKMeansReducer.java:37) > at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:177) > at > org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649) > at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418) > at > org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398) > {Code} > The issue is caused by the following code in StreamingKMeansThread.call() > {Code} > Iterator<Centroid> datapointsIterator = datapoints.iterator(); > if (estimateDistanceCutoff == > StreamingKMeansDriver.INVALID_DISTANCE_CUTOFF) { > List<Centroid> estimatePoints = > Lists.newArrayListWithExpectedSize(NUM_ESTIMATE_POINTS); > while (datapointsIterator.hasNext() && estimatePoints.size() < > NUM_ESTIMATE_POINTS) { > estimatePoints.add(datapointsIterator.next()); > } > estimateDistanceCutoff = > ClusteringUtils.estimateDistanceCutoff(estimatePoints, > searcher.getDistanceMeasure()); > } > StreamingKMeans clusterer = new StreamingKMeans(searcher, numClusters, > estimateDistanceCutoff); > while (datapointsIterator.hasNext()) { > clusterer.cluster(datapointsIterator.next()); > } > {Code} > The code is using the same iterator twice, and it fails on the second use for > obvious reasons. -- This message was sent by Atlassian JIRA (v6.1#6144)