Reply: Re: how to use DoubleRDDFunctions on mllib Vector?

prosp4300 Wed, 08 Jul 2015 21:05:16 -0700


It seems Feynman was pointing to the source code rather than the documentation. 
vectorMean is private; see
https://github.com/apache/spark/blob/v1.3.0/mllib/src/main/scala/org/apache/spark/mllib/clustering/GaussianMixture.scala


At 2015-07-09 10:10:58, "诺铁" <noty...@gmail.com> wrote:

Thanks, I understand now.
But I can't find mllib.clustering.GaussianMixture#vectorMean. What version 
of Spark do you use?


On Thu, Jul 9, 2015 at 1:16 AM, Feynman Liang <fli...@databricks.com> wrote:

An RDD[Double] is an abstraction for a large collection of doubles, possibly 
distributed across multiple nodes. The DoubleRDDFunctions are there for 
performing mean and variance calculations across this distributed dataset.


In contrast, a Vector is not distributed and fits on your local machine. You 
would be better off computing these quantities on the Vector directly (see 
mllib.clustering.GaussianMixture#vectorMean for an example of how to compute 
the mean of a vector).
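
As a rough illustration of computing these quantities locally, here is a minimal sketch in plain Scala. It assumes the values have already been pulled out of the Vector with `toArray` (which returns an `Array[Double]`). `LocalVectorStats` and its methods are hypothetical names made up for this example, not part of Spark or MLlib, and this is not the private `vectorMean` code from GaussianMixture. The variance here divides by n (population variance), which to my knowledge matches what `variance` on an RDD[Double] reports; divide by n - 1 instead if the sample variance is wanted.

```scala
// Hypothetical helper object for illustration; not part of Spark or MLlib.
object LocalVectorStats {

  /** Arithmetic mean of the values; NaN for an empty array. */
  def mean(values: Array[Double]): Double =
    if (values.isEmpty) Double.NaN else values.sum / values.length

  /** Population variance (divides by n); NaN for an empty array. */
  def variance(values: Array[Double]): Double =
    if (values.isEmpty) Double.NaN
    else {
      val m = mean(values)
      // Average of the squared deviations from the mean.
      values.map(x => (x - m) * (x - m)).sum / values.length
    }
}

object LocalVectorStatsExample extends App {
  // With MLlib on the classpath, this array would come from `vector.toArray`.
  val values = Array(1.0, 2.0, 3.0, 4.0)

  println(LocalVectorStats.mean(values))
  println(LocalVectorStats.variance(values))
}
```

If the data really has to go through the RDD API, `sc.parallelize(vector.toArray)` should give an RDD[Double], but that spends a Spark job on data that already fits in local memory.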


On Tue, Jul 7, 2015 at 8:26 PM, 诺铁 <noty...@gmail.com> wrote:

hi,


There are some useful functions in DoubleRDDFunctions, which I can use if I 
have an RDD[Double], e.g., mean and variance.


Vector doesn't have such methods. How can I convert a Vector to RDD[Double], or, 
maybe better, can I call mean directly on a Vector?
