Hi,

I have a dataset in CSV format and I am trying to standardize the features
before running k-means clustering. The data has no labels and is laid out
as follows:

s1, f11,f12,...
s2, f21,f22,...

where s is a string id and f is a floating-point feature value.
To standardize the features, I need to compute the mean and variance (or
standard deviation) of the feature values in each element of the RDD
(i.e., each row). However, the Statistics library in Spark MLlib provides
only a colStats() method, which returns the column-wise mean and variance.
I tried to compute the mean and variance per row using the code below, but
got a compilation error saying there is no mean() or variance() method for
a tuple or Vector object. Is there a Spark library that computes the
row-wise mean and variance for an RDD, where each row (i.e., element) of
the RDD is a Vector or tuple of N feature values?

Thanks

My code for standardization is as follows:

// read the data
val data = sc.textFile(file_name)
             .map(_.split(","))

// extract the features. For this example I am using only 2 features, but
// the data has more features
val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))

val std_features = features.map(f => {
  val fmean = f.mean()                       // Error: no mean() for a Vector or Tuple object
  val fstd  = scala.math.sqrt(f.variance())  // Error: no variance() for a Vector or Tuple object
  for (i <- 0 to f.length) {                 // standardize the features
    var fs = 0.0
    if (fstd > 0.0)
      fs = (f(i) - fmean) / fstd
    fs
  }
})
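
For comparison, the row-wise statistics can also be computed by hand on the
underlying array of each row. The following is only a minimal sketch under
stated assumptions, not a Spark API: standardizeRow is a hypothetical helper
name, it uses the population variance (dividing by N rather than N - 1), and,
like the code above, it maps a row with zero spread to all zeros.

```scala
// Hypothetical helper (not part of Spark): standardize one row of features
// using that row's own mean and population standard deviation.
def standardizeRow(row: Array[Double]): Array[Double] = {
  if (row.isEmpty) row
  else {
    val n = row.length
    val mean = row.sum / n
    // Population variance (divide by n); an assumption, use n - 1 for the sample variance.
    val variance = row.map(x => (x - mean) * (x - mean)).sum / n
    val std = math.sqrt(variance)
    // A constant row has zero spread; return zeros instead of dividing by zero.
    if (std > 0.0) row.map(x => (x - mean) / std) else Array.fill(n)(0.0)
  }
}

// Possible use with the RDD above (assumes `features` is an RDD of MLlib Vectors):
//   val std_features = features.map(v => Vectors.dense(standardizeRow(v.toArray)))
```

Keeping the helper free of Spark types means it can be checked on plain
arrays before it is applied inside features.map.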




