`mean()` and `variance()` are not defined on `Vector`. If you don't want to
implement them yourself, you can use the mean and variance implementations
from commons-math3
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html).
-Xiangrui
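As a rough sketch of that suggestion: assuming the commons-math3 jar is on the classpath and the row's features are available as a plain `Array[Double]` (e.g. via `Vector.toArray`), `StatUtils` could be applied per row like this. The `standardize` helper name is mine, not part of any library:

```scala
// Sketch of per-row standardization using commons-math3's StatUtils.
// Assumes the commons-math3 jar is on the classpath.
import org.apache.commons.math3.stat.StatUtils

def standardize(values: Array[Double]): Array[Double] = {
  val mean = StatUtils.mean(values)
  // StatUtils.variance computes the bias-corrected (sample) variance
  val std = math.sqrt(StatUtils.variance(values))
  // Guard against constant rows, where the standard deviation is zero
  if (std > 0.0) values.map(v => (v - mean) / std)
  else values.map(_ => 0.0)
}
```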
On Fri, Feb 6, 2015 at 12:50 PM, SK skrishna...@gmail.com wrote:
Hi,
I have a dataset in csv format and I am trying to standardize the features
before using k-means clustering. The data does not have any labels but has
the following format:
s1, f11,f12,...
s2, f21,f22,...
where s is a string id and f is a floating-point feature value.
To perform feature standardization, I need to compute the mean and
variance/standard deviation of the feature values in each element of the RDD
(i.e. each row). However, the summary-statistics support in Spark MLlib
provides only a colStats() method, which gives column-wise mean and
variance. I tried to compute the mean and variance per row using the code
below, but got a compilation error saying there is no mean() or variance()
method on a tuple or Vector object. Is there a Spark library that computes
the row-wise mean and variance for an RDD where each row (i.e. element) of
the RDD is a Vector or tuple of N feature values?
thanks
My code for standardization is as follows:
// read the data
val data = sc.textFile(file_name)
             .map(_.split(","))
// extract the features. For this example I am using only 2 features, but
// the data has more features
val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))
val std_features = features.map(f => {
  val fmean = f.mean()                      // Error: no mean() for a Vector or Tuple object
  val fstd = scala.math.sqrt(f.variance())  // Error: no variance() for a Vector or Tuple object
  // standardize the features
  for (i <- 0 to f.length) {
    var fs = 0.0
    if (fstd > 0.0)
      fs = (f(i) - fmean) / fstd
    fs
  }
})
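For what it's worth, the row-wise mean and standard deviation can also be computed in plain Scala with no extra dependency. A minimal sketch, where `standardizeRow` is a name I've made up and `fmean`/`fstd` mirror the variables above (this version uses the population variance, i.e. divide by n):

```scala
// Plain-Scala sketch of per-row feature standardization; no MLlib or
// commons-math3 calls needed once the features are an Array[Double].
def standardizeRow(features: Array[Double]): Array[Double] = {
  val n = features.length
  val fmean = features.sum / n
  // population variance: mean of squared deviations
  val fvar = features.map(v => (v - fmean) * (v - fmean)).sum / n
  val fstd = math.sqrt(fvar)
  // Guard against constant rows, where the standard deviation is zero
  if (fstd > 0.0) features.map(v => (v - fmean) / fstd)
  else features.map(_ => 0.0)
}
```

The RDD step would then be something like `features.map(f => Vectors.dense(standardizeRow(f.toArray)))`, assuming each `f` is an MLlib `Vector`.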
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org