Hi Kristina,

The input column of StandardScaler must be of vector type, because it is
usually used for feature scaling before model training, and the features
column is a vector in most cases.
If you only want to standardize a single numeric column, you can wrap it
in a vector and feed it into StandardScaler.
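
For example, you can use VectorAssembler to do the wrapping. A minimal
sketch, reusing the ccdf DataFrame and column names from your example
below (untested, but the API is the same in the ml package):

import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}

// Wrap the numeric column "c1" into a single-element vector column.
val assembled = new VectorAssembler()
  .setInputCols(Array("c1"))
  .setOutputCol("c1_vec")
  .transform(ccdf)

// StandardScaler now accepts the vector column as input.
val scaled = new StandardScaler()
  .setInputCol("c1_vec")
  .setOutputCol("c1_norm")
  .setWithMean(true)
  .setWithStd(true)
  .fit(assembled)
  .transform(assembled)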

Thanks
Yanbo

2016-01-10 8:10 GMT+08:00 Kristina Rogale Plazonic <kpl...@gmail.com>:

> Hi,
>
> The code below gives me an unexpected result. I expected that
> StandardScaler (in ml, not mllib) would take a specified column of an input
> dataframe, subtract the column's mean, and divide the difference by the
> column's standard deviation.
>
> However, Spark gives me the error that the input column must be of type
> vector. This shouldn't be the case, as StandardScaler should transform
> a numeric column (not a vector column) into a numeric column, right?  (The
> offending line in Spark source code
> <https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala#L90>).
> Am I missing something?
>
> Reproducing the error (python's sklearn example
> <http://scikit-learn.org/stable/modules/preprocessing.html>):
>
> val ccdf = sqlContext.createDataFrame( Seq(
>           ( 1.0, -1.0,  2.0),
>           ( 2.0,  0.0,  0.0),
>           ( 0.0,  1.0, -1.0)
>           )).toDF("c1", "c2", "c3")
>
> val newccdf = new StandardScaler()
>                   .setInputCol("c1")
>                   .setOutputCol("c1_norm")
>                   .setWithMean(true)
>                   .setWithStd(true)
>                   .fit(ccdf)
>                   .transform(ccdf)
>
> The error output: (spark-shell, Spark 1.5.2)
>
> java.lang.IllegalArgumentException: requirement failed: Input column c1
> must be a vector column
> (.....)
>
> Thanks!
> Kristina
>
