Re: MLLib: feature standardization

2015-02-09 Thread Xiangrui Meng
`mean()` and `variance()` are not defined in `Vector`. You can use the
mean and variance implementation from commons-math3
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html)
if you don't want to implement them. -Xiangrui
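A dependency-free sketch of that approach (the per-row mean and population standard deviation computed by hand; commons-math3's `DescriptiveStatistics` would give the same numbers, and the helper name `standardizeRow` is just an illustration, not a Spark API):

```scala
// Standardize one row of feature values: subtract the row mean, divide by
// the row's population standard deviation.
def standardizeRow(f: Array[Double]): Array[Double] = {
  val mean = f.sum / f.length
  val variance = f.map(v => (v - mean) * (v - mean)).sum / f.length
  val std = math.sqrt(variance)
  if (std > 0.0) f.map(v => (v - mean) / std)
  else f.map(_ => 0.0) // constant row: define its standardized form as all zeros
}

// In Spark this plugs into the map from the quoted code below:
// val std_features = features.map(f => Vectors.dense(standardizeRow(f.toArray)))
```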

On Fri, Feb 6, 2015 at 12:50 PM, SK skrishna...@gmail.com wrote:
 Hi,

 I have a dataset in csv format and I am trying to standardize the features
 before using k-means clustering. The data does not have any labels but has
 the following format:

 s1, f11,f12,...
 s2, f21,f22,...

 where s is a string id, and f is a floating point feature value.
 To perform feature standardization, I need to compute the mean and
 variance/std deviation of the feature values in each element of the RDD
 (i.e., each row). However, the summary statistics utilities in Spark MLlib
 provide only a colStats() method, which computes column-wise mean and
 variance. I tried to compute the mean and variance per row using the code
 below, but got a compilation error that there is no mean() or variance()
 method for a tuple or Vector object. Is there a Spark library to compute the
 row-wise mean and variance for an RDD, where each row (i.e., element) of the
 RDD is a Vector or tuple of N feature values?

 thanks

 My code for standardization is as follows:

 // read the data
 val data = sc.textFile(file_name)
   .map(_.split(","))

 // extract the features. For this example I am using only 2 features,
 // but the data has more features
 val features = data.map(d => Vectors.dense(d(1).toDouble, d(2).toDouble))

 val std_features = features.map(f => {
   val fmean = f.mean()                     // Error: no mean() for a Vector or Tuple object
   val fstd = scala.math.sqrt(f.variance()) // Error: no variance() for a Vector or Tuple object
   for (i <- 0 until f.length) yield {      // standardize the features
     var fs = 0.0
     if (fstd > 0.0)
       fs = (f(i) - fmean) / fstd
     fs
   }
 })




 --
 View this message in context: 
 http://apache-spark-user-list.1001560.n3.nabble.com/MLLib-feature-standardization-tp21539.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



