Hi, I have a dataset in CSV format and I am trying to standardize the features before running k-means clustering. The data has no labels and has the following format:
s1, f12, f13, ...
s2, f21, f22, ...

where s is a string id and f is a floating-point feature value.

To perform feature standardization, I need to compute the mean and variance (or standard deviation) of the feature values in each element of the RDD (i.e. each row). However, the summary statistics library in Spark MLlib provides only a colStats() method, which gives column-wise mean and variance. I tried to compute the mean and variance per row using the code below, but got a compilation error: there is no mean() or variance() method on a tuple or Vector object.

Is there a Spark library to compute the row-wise mean and variance for an RDD where each row (i.e. element) of the RDD is a Vector or tuple of N feature values?

Thanks. My code for standardization is as follows:

    // read the data
    val data = sc.textFile(file_name).map(_.split(","))

    // extract the features; for this example I am using only two features,
    // but the data has more
    val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))

    val std_features = features.map(f => {
      val fmean = f.mean()                      // ERROR: no mean() on a Vector or tuple
      val fstd = scala.math.sqrt(f.variance())  // ERROR: no variance() on a Vector or tuple
      // standardize the features
      for (i <- 0 until f.length) yield {
        if (fstd > 0.0) (f(i) - fmean) / fstd else 0.0
      }
    })

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
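Editor's note: since MLlib's dense Vector exposes its values via toArray, one workaround is to compute the per-row statistics by hand on a plain Array[Double]. The sketch below is illustrative only; the standardizeRow helper is hypothetical, not a Spark API, and it uses the population variance.

```scala
// Hypothetical helper: standardize one row of features by that row's own
// mean and standard deviation (illustration, not a Spark/MLlib API).
def standardizeRow(f: Array[Double]): Array[Double] = {
  val n = f.length.toDouble
  val mean = f.sum / n
  // population variance of the row's feature values
  val variance = f.map(x => (x - mean) * (x - mean)).sum / n
  val std = math.sqrt(variance)
  if (std > 0.0) f.map(x => (x - mean) / std)
  else f.map(_ => 0.0) // constant row: every value maps to 0
}
```

On the RDD it could then be applied per element, e.g. features.map(v => Vectors.dense(standardizeRow(v.toArray))), since org.apache.spark.mllib.linalg.Vector provides toArray.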