Nick Pentreath created SPARK-14843:
--------------------------------------

             Summary: Error while encoding: java.lang.ClassCastException with LibSVMRelation
                 Key: SPARK-14843
                 URL: https://issues.apache.org/jira/browse/SPARK-14843
             Project: Spark
          Issue Type: Bug
          Components: ML, MLlib
            Reporter: Nick Pentreath


While trying to run some example ML linear regression code, I came across the following error. The same error occurs when running {{./bin/run-example ml.LinearRegressionWithElasticNetExample}}.

{code}
scala> import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.regression.LinearRegression

scala> import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.Vector

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val data = sqlContext.read.format("libsvm").load("data/mllib/sample_linear_regression_data.txt")
data: org.apache.spark.sql.DataFrame = [label: double, features: vector]

scala> val lr = new LinearRegression()

scala> val model = lr.fit(data)
{code}
Stack trace:
{code}
Driver stacktrace:
...
  at org.apache.spark.rdd.RDD$$anonfun$take$1.apply(RDD.scala:1276)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
  at org.apache.spark.rdd.RDD.take(RDD.scala:1250)
  at org.apache.spark.rdd.RDD$$anonfun$first$1.apply(RDD.scala:1290)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
  at org.apache.spark.rdd.RDD.withScope(RDD.scala:357)
  at org.apache.spark.rdd.RDD.first(RDD.scala:1289)
  at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:165)
  at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:69)
  at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
  ... 48 elided
Caused by: java.lang.RuntimeException: Error while encoding: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
if (input[0, org.apache.spark.sql.Row].isNullAt) null else newInstance(class org.apache.spark.mllib.linalg.VectorUDT).serialize
:- input[0, org.apache.spark.sql.Row].isNullAt
:  :- input[0, org.apache.spark.sql.Row]
:  +- 0
:- null
+- newInstance(class org.apache.spark.mllib.linalg.VectorUDT).serialize
   :- newInstance(class org.apache.spark.mllib.linalg.VectorUDT)
   +- input[0, org.apache.spark.sql.Row].get
      :- input[0, org.apache.spark.sql.Row]
      +- 0

  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:230)
  at org.apache.spark.ml.source.libsvm.DefaultSource$$anonfun$buildReader$1$$anonfun$8.apply(LibSVMRelation.scala:209)
  at org.apache.spark.ml.source.libsvm.DefaultSource$$anonfun$buildReader$1$$anonfun$8.apply(LibSVMRelation.scala:207)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.next(FileScanRDD.scala:90)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegen$$anonfun$7$$anon$1.hasNext(WholeStageCodegen.scala:362)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
  at org.apache.spark.scheduler.Task.run(Task.scala:85)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:254)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:227)
  ... 17 more
{code}

The error is triggered by line 163 of {{LinearRegression}}:
{code}
    val numFeatures = dataset.select(col($(featuresCol))).limit(1).rdd.map {
      case Row(features: Vector) => features.size
    }.first()
{code}

Using the above example, the following works:
{code}
scala> data.select("label").rdd.map { case Row(d: Double) => d }.first
res49: Double = -9.490009878824548
{code}
But this triggers the exception:
{code}
scala> data.select("features").rdd.map { case Row(d: Vector) => d }.first
16/04/22 11:25:20 ERROR Executor: Exception in task 0.0 in stage 87.0 (TID 98)
java.lang.RuntimeException: Error while encoding: java.lang.ClassCastException: java.lang.Double cannot be cast to org.apache.spark.mllib.linalg.Vector
...
{code}
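
For reference, the {{ClassCastException}} itself is just the JVM refusing a checked cast: the generated projection treats the column value as a {{Vector}}, while the row cell actually holds a boxed {{Double}}. A minimal standalone sketch of that failure mode (no Spark required; the {{Vector}} trait here is a stand-in for {{org.apache.spark.mllib.linalg.Vector}}, not the real class):
{code}
// Standalone sketch: a cell that really holds a boxed java.lang.Double is
// force-cast to a Vector type, mirroring the generated projection's cast.
object CastDemo {
  trait Vector { def size: Int } // stand-in for the real mllib Vector

  def main(args: Array[String]): Unit = {
    val cell: Any = java.lang.Double.valueOf(1.5) // what the row really contains
    println(cell.isInstanceOf[Vector])            // false: the cast cannot succeed
    try {
      cell.asInstanceOf[Vector].size              // throws, like the generated code
    } catch {
      case _: ClassCastException => println("ClassCastException")
    }
  }
}
{code}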



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
