Hi,

The code below gives me an unexpected result. I expected that
StandardScaler (in ml, not mllib) would take a specified column of an input
DataFrame, subtract the column's mean, and divide the difference by the
column's standard deviation.

However, Spark gives me an error saying that the input column must be of
type vector. That shouldn't be the case, should it? StandardScaler should
transform a numeric column (not a vector column) into a numeric column,
right? (The offending line in the Spark source code:
<https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L90>).
Am I missing something?

Reproducing the error (following the scaling example from python's sklearn docs
<http://scikit-learn.org/stable/modules/preprocessing.html>):

import org.apache.spark.ml.feature.StandardScaler

val ccdf = sqlContext.createDataFrame(Seq(
  ( 1.0, -1.0,  2.0),
  ( 2.0,  0.0,  0.0),
  ( 0.0,  1.0, -1.0)
)).toDF("c1", "c2", "c3")

val newccdf = new StandardScaler()
  .setInputCol("c1")
  .setOutputCol("c1_norm")
  .setWithMean(true)
  .setWithStd(true)
  .fit(ccdf)
  .transform(ccdf)

The error output (spark-shell, Spark 1.5.2):

java.lang.IllegalArgumentException: requirement failed: Input column c1
must be a vector column
(.....)
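Is the intended usage to first wrap the numeric column into a vector column
with VectorAssembler and only then apply StandardScaler? Below is a minimal
sketch of what I assume that would look like (the "c1_vec" column name is
just my own choice for illustration):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Wrap the numeric column c1 into a single-element vector column
val assembled = new VectorAssembler()
  .setInputCols(Array("c1"))
  .setOutputCol("c1_vec")
  .transform(ccdf)

// StandardScaler now accepts the vector column as input
val scaled = new StandardScaler()
  .setInputCol("c1_vec")
  .setOutputCol("c1_norm")
  .setWithMean(true)
  .setWithStd(true)
  .fit(assembled)
  .transform(assembled)

If that is the expected pattern, it seems rather roundabout for scaling a
single numeric column, so I'd appreciate confirmation.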

Thanks!
Kristina
