Re: MLLib: feature standardization

Xiangrui Meng Mon, 09 Feb 2015 12:02:10 -0800

`mean()` and `variance()` are not defined in `Vector`. You can use the
mean and variance implementation from commons-math3
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html)
if you don't want to implement them. -Xiangrui
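
If you would rather implement them yourself, here is a minimal sketch in plain Scala. It assumes each row can be turned into an `Array[Double]` (for an MLlib `Vector`, `toArray` does that). The names `RowStandardizer` and `standardizeRow` are made up for illustration and are not part of MLlib. It uses the sample variance (dividing by n - 1), which as far as I know matches what `StatUtils.variance` in commons-math3 returns; `StatUtils.mean` and `StatUtils.variance` could be substituted for the hand-written parts. Note also that the loop in the quoted code would need `until` rather than `to`, and a `yield` (or a `map`), to actually produce the standardized values.

```scala
// Sketch only: standardizes one row (one sample's feature values) using that
// row's own mean and standard deviation. `RowStandardizer` and
// `standardizeRow` are hypothetical names, not MLlib API.
object RowStandardizer {
  /** Returns (x - mean) / std for each element, or all zeros when std is 0. */
  def standardizeRow(values: Array[Double]): Array[Double] = {
    val n = values.length
    if (n == 0) return Array.empty[Double]
    val mean = values.sum / n
    // Sample variance (divides by n - 1); a single-element row gets variance 0.
    val variance =
      if (n > 1) values.map(v => (v - mean) * (v - mean)).sum / (n - 1) else 0.0
    val std = math.sqrt(variance)
    if (std > 0.0) values.map(v => (v - mean) / std)
    else Array.fill(n)(0.0)
  }
}

// Assumed usage with the RDD[Vector] named `features` in the question:
//   val stdFeatures =
//     features.map(v => Vectors.dense(RowStandardizer.standardizeRow(v.toArray)))
```

Keeping the row logic in a plain function over `Array[Double]` means it can be checked without a Spark context and then passed to `map` unchanged.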


On Fri, Feb 6, 2015 at 12:50 PM, SK <skrishna...@gmail.com> wrote:
> Hi,
>
> I have a dataset in csv format and I am trying to standardize the features
> before using k-means clustering. The data does not have any labels but has
> the following format:
>
> s1, f12,f13,...
> s2, f21,f22,...
>
> where s is a string id, and f is a floating point feature value.
> To perform feature standardization, I need to compute the mean and
> variance/std deviation of the feature values in each element of the RDD
> (i.e. each row). However, the summary statistics library in Spark MLlib
> provides only a colStats() method that provides column-wise mean and
> variance. I tried to compute the mean and variance per row, using the code
> below but got a compilation error that there is no mean() or variance()
> method for a tuple or Vector object. Is there a Spark library to compute the
> row-wise mean and variance for an RDD, where each row (i.e. element) of the
> RDD is a Vector or tuple of N feature values?
>
> thanks
>
> My code for standardization is as follows:
>
> // read the data
> val data = sc.textFile(file_name)
>   .map(_.split(","))
>
> // extract the features. For this example I am using only 2 features, but
> // the data has more features
> val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))
>
> val std_features = features.map(f => {
>   val fmean = f.mean()                     // Error: NO MEAN() for a Vector or Tuple object
>   val fstd = scala.math.sqrt(f.variance()) // Error: NO variance() for a Vector or Tuple object
>   for (i <- 0 to f.length) { // standardize the features
>     var fs = 0.0
>     if (fstd > 0.0)
>       fs = (f(i) - fmean) / fstd
>     fs
>   }
> })
>
>
>
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
