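You could try flatMapping, i.e. if you have `data: RDD[(key, Iterable[Vector])]`, then `data.flatMap(_._2)` gives you an `RDD[Vector]`, which GaussianMixture can be run on. Note that this drops the keys, so you get a single mixture fitted over all vectors rather than one per key.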
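A minimal sketch of the flatMap route, assuming the key is a `String` and a component count `k` that you pick yourself (the helper name `fitPooled` is made up for illustration):

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: pools the vectors of all keys and fits ONE mixture.
// The String key type and the value of k are assumptions, not from the thread.
def fitPooled(data: RDD[(String, Iterable[Vector])], k: Int): GaussianMixtureModel = {
  // Flatten each group's Iterable[Vector]; the key is discarded here.
  val vectors: RDD[Vector] = data.flatMap(_._2)
  // GaussianMixture runs EM, which passes over the data repeatedly.
  vectors.cache()
  new GaussianMixture().setK(k).run(vectors)
}
```

The returned model exposes `weights` and `gaussians`, and `predict` can be called on an `RDD[Vector]` to assign points to components.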
If you want to partition by url first, I would create one RDD per url using `filter`, then run GMM on each of the filtered RDDs.

On Tue, Aug 11, 2015 at 5:43 AM, Fabian Böhnlein <fabian.boehnl...@gmail.com> wrote:

> Hi everyone,
>
> I am trying to use mllib.clustering.GaussianMixture, but am blocked by the
> fact that the API only accepts RDD[Vector].
>
> Broadly speaking I need to run the clustering on an RDD[(key,
> Iterable[Vector]), e.g. (fabricated):
>
> val WebsiteUserAgeRDD : RDD[url, userAgeVector]
>
> val ageClusterByUrl =
> WebsiteUserAgeRDD.groupby(_.url).mapValues(GaussianMixture.setK(x).run)
>
> This obviously does not work, as the mapValues function is called on
> Iterable[Vector] but requires RDD[Vector]
> As I see it, parallelizing this Iterable is not possible, would result in
> an RDD of RDDs?
>
> Anyone has an idea how to cluster an RDD of (key, Iterable[Vector]) like
> in above groupBy result?
>
> Many thanks,
> Fabian
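For the per-url route (one `filter` per url, then one GMM per filtered RDD), something along these lines might work. It is only a sketch: it assumes the data is still in its ungrouped `(url, vector)` form with `String` urls, that the number of distinct urls is small enough to collect to the driver, and that the same `k` is used for every url. The name `fitPerUrl` is made up.

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: fits one mixture per url by filtering the parent RDD.
// Assumes few distinct urls and a shared k; neither is stated in the thread.
def fitPerUrl(data: RDD[(String, Vector)], k: Int): Map[String, GaussianMixtureModel] = {
  // The parent RDD is scanned once per url, so keep it cached.
  data.cache()
  // Collect the distinct urls to the driver so we can loop over them there.
  val urls: Array[String] = data.map(_._1).distinct().collect()
  urls.map { url =>
    // One plain RDD[Vector] per url, which is what GaussianMixture accepts.
    val vectors: RDD[Vector] = data.filter(_._1 == url).map(_._2)
    url -> new GaussianMixture().setK(k).run(vectors)
  }.toMap
}
```

Each `filter` is a full pass over the parent RDD and the models are trained one after another on the driver's loop, so this gets slow as the number of urls grows.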