Hi Rex,

In general (i.e., not Spark-specific), it's best to convert categorical data to a one-hot encoding rather than to integers. That way the algorithm doesn't use the ordering implicit in the integer representation.
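To illustrate the point about ordering, here is a minimal sketch in plain Python (it does not use Spark's API). The city values and the helper names `one_hot` and `sq_dist` are made up for the example. With integer codes, some pairs of categories look closer than others purely because of how the codes were assigned; with one-hot vectors, every pair of distinct categories is equally far apart.

```python
def one_hot(values):
    """Map each distinct category to a 0/1 indicator vector."""
    categories = sorted(set(values))  # fixes the column order
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        vectors.append(row)
    return categories, vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical sample values for a categorical "city" attribute.
cities = ["Austin", "Boston", "Chicago", "Boston"]
categories, encoded = one_hot(cities)

# Integer codes impose an arbitrary order: Austin=0, Boston=1, Chicago=2.
int_codes = [[categories.index(c)] for c in cities]

# With integer codes, Austin looks closer to Boston than to Chicago.
print(sq_dist(int_codes[0], int_codes[1]), sq_dist(int_codes[0], int_codes[2]))

# With one-hot vectors, both pairs are equally distant.
print(sq_dist(encoded[0], encoded[1]), sq_dist(encoded[0], encoded[2]))
```

The trade-off is dimensionality: an attribute with many distinct values (zip codes, say) turns into as many columns, so sparse vectors are usually the practical representation.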
-sujit

On Tue, Jun 16, 2015 at 1:17 PM, Rex X <dnsr...@gmail.com> wrote:

> Is it necessary to convert categorical data into integers?
>
> Any tips would be greatly appreciated!
>
> -Rex
>
> On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:
>
>> For clustering analysis, we need a way to measure distances.
>>
>> When the data contains different levels of measurement -
>> *binary / categorical (nominal), counts (ordinal), and ratio (scale)*
>>
>> To be concrete, for example, working with attributes of
>> *city, zip, satisfaction_level, price*
>>
>> In the meanwhile, the real data usually also contains string attributes,
>> for example, book titles. The distance between two strings can be measured
>> by minimum-edit-distance.
>>
>>
>> In SPSS, it provides Two-Step Cluster, which can handle both ratio scale
>> and ordinal numbers.
>>
>>
>> What is right algorithm to do hierarchical clustering analysis with all
>> these four-kind attributes above with *MLlib*?
>>
>>
>> If we cannot find a right metric to measure the distance, an alternative
>> solution is to do a topological data analysis (e.g. linkage, and etc).
>> Can we do such kind of analysis with *GraphX*?
>>
>>
>> -Rex