You could try flatMap: if you have data : RDD[(key, Iterable[Vector])],
then data.flatMap(_._2) has type RDD[Vector], which you can pass to
GaussianMixture. Note that this fits a single model over the vectors of
all keys.
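
A minimal sketch of the flatMap approach, assuming String keys and Spark's
mllib GaussianMixture. The helper name `fitPooled` and the parameter `k` are
made up for illustration:

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: pools the vectors of all keys and fits ONE mixture model.
def fitPooled(data: RDD[(String, Iterable[Vector])], k: Int): GaussianMixtureModel = {
  // Drop the keys and flatten each group into individual vectors.
  val vectors: RDD[Vector] = data.flatMap(_._2)
  // EM makes several passes over the data, so caching usually helps.
  vectors.cache()
  new GaussianMixture().setK(k).run(vectors)
}
```

The returned model exposes `weights` and `gaussians` (each with `mu` and
`sigma`), but the per-key grouping is lost.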

If you want to first partition by url, I would create one RDD per url
using `filter`, then run GMM on each of the filtered RDDs.
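
A rough sketch of the filter approach, assuming the ungrouped pairs are
available as RDD[(String, Vector)] and that the number of distinct urls is
small enough to collect on the driver. `fitPerKey` and `k` are made-up names:

```scala
import org.apache.spark.mllib.clustering.{GaussianMixture, GaussianMixtureModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Hypothetical helper: fits one mixture model per key, one Spark job at a time.
def fitPerKey(pairs: RDD[(String, Vector)], k: Int): Map[String, GaussianMixtureModel] = {
  // The same RDD is scanned once per key, so keep it in memory if possible.
  pairs.cache()
  // Assumption: the set of distinct keys fits comfortably on the driver.
  val keys: Array[String] = pairs.map(_._1).distinct().collect()
  keys.map { key =>
    // filter yields a regular RDD[Vector] for this key, which run() accepts.
    val subset: RDD[Vector] = pairs.filter(_._1 == key).map(_._2)
    key -> new GaussianMixture().setK(k).run(subset)
  }.toMap
}
```

The models are fitted sequentially from the driver, so this is only practical
for a modest number of urls. No groupBy is needed, since `filter` works on
the ungrouped pairs.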

On Tue, Aug 11, 2015 at 5:43 AM, Fabian Böhnlein <fabian.boehnl...@gmail.com
> wrote:

> Hi everyone,
>
> I am trying to use mllib.clustering.GaussianMixture, but am blocked by the
> fact that the API only accepts RDD[Vector].
>
> Broadly speaking I need to run the clustering on an RDD[(key,
> Iterable[Vector])], e.g. (fabricated):
>
> val WebsiteUserAgeRDD : RDD[(url, userAgeVector)]
>
> val ageClusterByUrl =
> WebsiteUserAgeRDD.groupBy(_.url).mapValues(GaussianMixture.setK(x).run)
>
> This obviously does not work, as the mapValues function is called on
> Iterable[Vector] but GaussianMixture requires RDD[Vector].
> As I see it, parallelizing this Iterable is not possible, since it would
> result in an RDD of RDDs?
>
> Does anyone have an idea how to cluster an RDD of (key, Iterable[Vector]),
> as in the groupBy result above?
>
> Many thanks,
> Fabian
>
