MLlib's k-means does not let you define your own distance function, so
you will need to convert the categorical features to numeric
dimensions. 1-of-n encoding can work OK, depending on your use case: a
feature that takes on 3 categorical values becomes 3 dimensions, all of
which are 0 except the one for the observed value, which is 1.
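
To make the encoding concrete, here is a rough sketch in plain Python
(no Spark needed to run it). The helper name one_hot_encode and the
sample rows are made up for illustration, and it assumes the full set
of category values can be collected from the data up front.

```python
def one_hot_encode(rows, categorical_cols):
    """Expand each categorical column into one 0/1 dimension per distinct value."""
    # Collect the distinct values of every categorical column,
    # sorted so the dimension order is stable across runs.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_cols
    }
    encoded = []
    for row in rows:
        vec = []
        for col, value in enumerate(row):
            if col in categories:
                # One dimension per category: 1.0 for the match, 0.0 otherwise.
                vec.extend(1.0 if value == cat else 0.0 for cat in categories[col])
            else:
                # Numeric columns pass through unchanged.
                vec.append(float(value))
        encoded.append(vec)
    return encoded


# Hypothetical sample data: one numeric column, one categorical column.
rows = [(1.5, "red"), (0.2, "green"), (3.0, "blue")]
vectors = one_hot_encode(rows, categorical_cols=[1])
# Dimension order is: numeric value, blue, green, red
```

Each resulting list is an ordinary numeric vector, so it should be
usable as input to MLlib's KMeans.train like any other feature vector,
which keeps the k-means|| initialization available.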

On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan <wen.p...@mac.com> wrote:
> Hi Folks,
>
> Does anyone have experience or recommendations on incorporating categorical
> features (attributes) into k-means clustering in Spark?  In other words, I 
> want to cluster on a set of attributes that include categorical variables.
>
> I know I could probably implement some custom code to parse and calculate my 
> own similarity function, but I wanted to reach out before I did so.  I’d also 
> prefer to take advantage of the k-means|| initialization feature of
> the model in MLlib, so an MLlib-based implementation would be preferred.
>
> Thanks in advance.
>
> Best,
>
> -Wen
