Since MLlib's k-means doesn't let you define your own distance function, you will need to convert the categorical features to numeric dimensions. 1-of-n (one-hot) encoding can work OK, depending on your use case: a feature that takes on 3 categorical values becomes 3 dimensions, all of which are 0 except for the one matching the value, which is 1.
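A rough sketch of what I mean by 1-of-n encoding, in plain Python so it runs without a cluster. The helper name `one_hot_encode`, the column layout, and the sample rows are all made up for illustration; they are not part of MLlib.

```python
def one_hot_encode(rows, categorical_cols):
    """Expand each categorical column into one 0/1 dimension per distinct value.

    Hypothetical helper for illustration only; not an MLlib API.
    """
    # Collect the distinct values seen in each categorical column,
    # sorted so the order of the new dimensions is stable.
    categories = {
        col: sorted({row[col] for row in rows}) for col in categorical_cols
    }

    encoded = []
    for row in rows:
        vec = []
        for col, value in enumerate(row):
            if col in categories:
                # One dimension per category: 1.0 for the match, 0.0 otherwise.
                vec.extend(1.0 if value == c else 0.0 for c in categories[col])
            else:
                # Numeric columns pass through unchanged.
                vec.append(float(value))
        encoded.append(vec)
    return encoded, categories


# Made-up sample data: one numeric feature and one categorical feature.
rows = [
    (1.5, "red"),
    (2.0, "green"),
    (0.5, "blue"),
]
vectors, categories = one_hot_encode(rows, categorical_cols=[1])
```

The resulting all-numeric vectors are the kind of input you could then hand to MLlib's k-means. Keep in mind that with Euclidean distance, numeric columns with large ranges can dominate the 0/1 dimensions, so some scaling may be needed depending on your data.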
On Fri, Jul 11, 2014 at 3:07 PM, Wen Phan <wen.p...@mac.com> wrote:
> Hi Folks,
>
> Does any one have experience or recommendations on incorporating categorical
> features (attributes) into k-means clustering in Spark? In other words, I
> want to cluster on a set of attributes that include categorical variables.
>
> I know I could probably implement some custom code to parse and calculate my
> own similarity function, but I wanted to reach out before I did so. I’d also
> prefer to take advantage of the k-means|| initialization feature of
> the model in MLlib, so an MLlib-based implementation would be preferred.
>
> Thanks in advance.
>
> Best,
>
> -Wen