[ https://issues.apache.org/jira/browse/SPARK-16857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15405050#comment-15405050 ]
Xusen Yin commented on SPARK-16857:
-----------------------------------

Using CrossValidator with KMeans should be supported. As a kind of external evaluation for KMeans, I think using MulticlassClassificationEvaluator with KMeans should also be supported. Why not send a PR, since it would be a quick fix?

CC [~yanboliang]

> CrossValidator and KMeans throws IllegalArgumentException
> ---------------------------------------------------------
>
>                 Key: SPARK-16857
>                 URL: https://issues.apache.org/jira/browse/SPARK-16857
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.1
>         Environment: spark-jobserver docker image. Spark 1.6.1 on Ubuntu,
>                      Hadoop 2.4
>            Reporter: Ryan Claussen
>
> I am attempting to use CrossValidator to train a KMeans model. When I attempt
> to fit the data, Spark throws an IllegalArgumentException as below, since the
> KMeans algorithm outputs an Integer into the prediction column instead of a
> Double. Before I go too far: is using CrossValidator with KMeans
> supported?
> Here's the exception:
> {quote}
> java.lang.IllegalArgumentException: requirement failed: Column prediction must be of type DoubleType but was actually IntegerType.
>     at scala.Predef$.require(Predef.scala:233)
>     at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
>     at org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator.evaluate(MulticlassClassificationEvaluator.scala:74)
>     at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:109)
>     at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:99)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>     at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:99)
>     at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.generateKMeans(SparkModelJob.scala:202)
>     at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:62)
>     at com.ibm.bpm.cloud.ci.cto.prediction.SparkModelJob$.runJob(SparkModelJob.scala:39)
>     at spark.jobserver.JobManagerActor$$anonfun$spark$jobserver$JobManagerActor$$getJobFuture$4.apply(JobManagerActor.scala:301)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
> {quote}
> Here is the code I'm using to set up my cross validator. As the stack trace
> above indicates it is failing at the fit step when
> {quote}
> ...
> val mpc = new KMeans().setK(2).setFeaturesCol("indexedFeatures")
> val labelConverter = new IndexToString()
>   .setInputCol("prediction")
>   .setOutputCol("predictedLabel")
>   .setLabels(labelIndexer.labels)
> val pipeline = new Pipeline()
>   .setStages(Array(labelIndexer, featureIndexer, mpc, labelConverter))
> val evaluator = new MulticlassClassificationEvaluator()
>   .setLabelCol("approvedIndex")
>   .setPredictionCol("prediction")
> val paramGrid = new ParamGridBuilder()
>   .addGrid(mpc.maxIter, Array(100, 200, 500))
>   .build()
> val cv = new CrossValidator()
>   .setEstimator(pipeline)
>   .setEvaluator(evaluator)
>   .setEstimatorParamMaps(paramGrid)
>   .setNumFolds(3)
> val cvModel = cv.fit(trainingData)
> {quote}
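
Until a fix lands, one possible workaround (a sketch only, not the project's fix) is to add a DoubleType copy of the KMeans prediction column inside the pipeline and point the evaluator at that copy, so the type check in the stack trace above passes. The sketch below uses SQLTransformer for the cast. The object name KMeansCvSketch and the column name predictionDouble are made up for illustration, and the input DataFrame is assumed to already contain the indexedFeatures and approvedIndex columns that the reporter's labelIndexer and featureIndexer stages would produce. Note that cluster ids are arbitrary, so a classification metric computed this way is only meaningful to the extent the ids happen to line up with the label indices.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.SQLTransformer
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Hypothetical helper object, not part of the original report.
object KMeansCvSketch {
  // Assumed name for the casted column; any name not already in the schema works.
  val PredictionAsDouble = "predictionDouble"

  def build(): CrossValidator = {
    val kmeans = new KMeans().setK(2).setFeaturesCol("indexedFeatures")

    // KMeans writes an IntegerType "prediction" column; append a DoubleType copy.
    val toDouble = new SQLTransformer().setStatement(
      s"SELECT *, CAST(prediction AS DOUBLE) AS $PredictionAsDouble FROM __THIS__")

    // The reporter's labelIndexer / featureIndexer stages would go before kmeans.
    val pipeline = new Pipeline()
      .setStages(Array[PipelineStage](kmeans, toDouble))

    // Evaluate against the casted column instead of the raw integer one.
    val evaluator = new MulticlassClassificationEvaluator()
      .setLabelCol("approvedIndex")
      .setPredictionCol(PredictionAsDouble)

    val paramGrid = new ParamGridBuilder()
      .addGrid(kmeans.maxIter, Array(100, 200, 500))
      .build()

    new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(evaluator)
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
  }
}
```

Usage would then be along the lines of KMeansCvSketch.build().fit(trainingData), with trainingData prepared as described above.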