IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
what does this exception mean?

14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
at org.apache.spark.mllib.clustering.KMeans$.fastSquaredDistance(KMeans.scala:398)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:372)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$findClosest$1.apply(KMeans.scala:366)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.mllib.clustering.KMeans$.findClosest(KMeans.scala:366)
at org.apache.spark.mllib.clustering.KMeans$.pointCost(KMeans.scala:389)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:269)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17$$anonfun$apply$7.apply(KMeans.scala:268)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.Range.foreach(Range.scala:141)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:268)
at org.apache.spark.mllib.clustering.KMeans$$anonfun$17.apply(KMeans.scala:267)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.Aggregator.combineValuesByKey(Aggregator.scala:58)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:96)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$1.apply(PairRDDFunctions.scala:95)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:582)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.Task.run(Task.scala:51)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:187)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:619)

my spark version: 1.0.0
Java: 1.7
my code:

JavaRDD<Vector> docVectors = generateDocVector(...);
int numClusters = 20;
int numIterations = 20;
KMeansModel clusters = KMeans.train(docVectors.rdd(), numClusters, numIterations);

another strange thing is that the mapPartitionsWithIndex() method call in
generateDocVector() is invoked 3 times...
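
(A guess, not confirmed in this thread: if docVectors is not cached, each pass
KMeans makes over the data recomputes the RDD's whole lineage, including the
mapPartitionsWithIndex() step. Caching before training should avoid the repeats:

docVectors.cache();  // sketch: keep the computed vectors in memory across passes
)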

2014-06-04 


bluejoe2008

Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread Xiangrui Meng
Could you check whether the vectors have the same size? -Xiangrui
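
For example, a quick check along these lines (a sketch, assuming the docVectors
RDD from your code):

import java.util.List;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vector;

// collect the distinct vector sizes; KMeans requires exactly one
List<Integer> sizes = docVectors
    .map(new Function<Vector, Integer>() {
        public Integer call(Vector v) { return v.size(); }
    })
    .distinct()
    .collect();
System.out.println("distinct vector sizes: " + sizes);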

On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
 what does this exception mean?

 [snip: full stack trace and code quoted above]

Re: Re: IllegalArgumentException on calling KMeans.train()

2014-06-04 Thread bluejoe2008
Thank you, 孟祥瑞!
With your help I solved the problem.

I constructed my SparseVectors in the wrong way: I mistook the first parameter
of the constructor SparseVector(int size, int[] indices, double[] values) for
the length of the values array, when it is actually the dimension of the whole
vector, so my vectors ended up with different sizes.
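
For anyone hitting the same error, a minimal sketch of the difference
(vocabularySize and the arrays are made-up examples):

import org.apache.spark.mllib.linalg.SparseVector;

int vocabularySize = 10000;         // dimension of the whole vector (hypothetical)
int[] indices = {3, 42, 571};       // positions of the non-zero entries
double[] values = {1.0, 2.0, 1.5};  // the non-zero entries themselves

// wrong: values.length is the number of non-zeros, not the vector's dimension,
// so vectors get different sizes and fastSquaredDistance's require() fails
// SparseVector bad = new SparseVector(values.length, indices, values);

// right: pass the full dimension so every vector has the same size
SparseVector ok = new SparseVector(vocabularySize, indices, values);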

2014-06-04 


bluejoe2008

From: Xiangrui Meng
Date: 2014-06-04 17:35
To: user
Subject: Re: IllegalArgumentException on calling KMeans.train()
Could you check whether the vectors have the same size? -Xiangrui

On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
 what does this exception mean?

 [snip: full stack trace and code quoted above]