What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

Rex X Sun, 14 Jun 2015 10:06:11 -0700

For clustering analysis, we need a way to measure distances.

This is difficult when the data mixes different levels of measurement:
*binary / categorical (nominal), counts (ordinal), and ratio (scale)*


To be concrete, consider a dataset with the attributes
*city, zip, satisfaction_level, price*
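To illustrate what a distance over such a record could look like, here is a
minimal sketch of a Gower-style dissimilarity. It is only an assumption for
illustration, not something MLlib provides: the function name
mixed_distance, the sample records, and the value ranges are all
hypothetical. Categorical attributes contribute 0 or 1 (match or mismatch),
and ordered or numeric attributes contribute their absolute difference
divided by an assumed range.

```python
def mixed_distance(a, b, categorical, ranges):
    """Gower-style dissimilarity in [0, 1] between two records (dicts).

    categorical: names of attributes compared by simple equality.
    ranges: {name: span} for ordinal/numeric attributes; span is the
            assumed (max - min) of that attribute over the dataset.
    """
    total = 0.0
    for name in categorical:
        # Nominal attributes: 0 if equal, 1 otherwise.
        total += 0.0 if a[name] == b[name] else 1.0
    for name, span in ranges.items():
        # Ordinal/numeric attributes: range-normalized absolute difference.
        total += abs(a[name] - b[name]) / span if span else 0.0
    return total / (len(categorical) + len(ranges))


# Hypothetical records using the attributes named above.
r1 = {"city": "Austin", "zip": "78701", "satisfaction_level": 4, "price": 20.0}
r2 = {"city": "Austin", "zip": "78702", "satisfaction_level": 2, "price": 30.0}

categorical = ["city", "zip"]
# Assumed ranges: satisfaction on a 1..5 scale, prices spanning 100 units.
ranges = {"satisfaction_level": 4, "price": 100.0}

d = mixed_distance(r1, r2, categorical, ranges)
```

Treating zip as nominal is a design choice; it discards any geographic
closeness between neighboring zip codes.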

In addition, real data usually contains string attributes, for example
book titles. The distance between two strings can be measured by the
minimum edit distance.
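For reference, the minimum edit distance (Levenshtein distance) can be
sketched with the standard dynamic program below. The function names
edit_distance and normalized_edit_distance are hypothetical; the
normalization by the longer string's length is an assumption, added only so
that the value falls in [0, 1] like the other attribute distances.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning s into t."""
    # prev[j] holds the distance between the first i-1 chars of s
    # and the first j chars of t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to the empty string
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(s, t):
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(s), len(t))
    return edit_distance(s, t) / longest if longest else 0.0
```

For long titles, a token-level comparison may be more meaningful than a
character-level one; which is appropriate depends on the data.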


SPSS provides Two-Step Cluster, which can handle both ratio-scale and
ordinal numbers.


What is the right algorithm for hierarchical clustering over all four
kinds of attributes above with *MLlib*?


If we cannot find a suitable metric to measure the distance, an alternative
is topological data analysis (e.g. linkage). Can we do this kind of
analysis with *GraphX*?
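One way to read the linkage idea as a graph problem: cutting a
single-linkage hierarchy at height t gives the connected components of the
graph that joins every pair of items whose distance is at most t. The
sketch below shows this on a single machine with a union-find structure. It
is an illustration under assumptions, not GraphX code: threshold_clusters,
the sample points, and the threshold are hypothetical, and the all-pairs
loop would not scale to large data as written.

```python
def threshold_clusters(n, distance, threshold):
    """Group items 0..n-1 into the connected components of the graph
    that has an edge (i, j) whenever distance(i, j) <= threshold."""
    parent = list(range(n))

    def find(i):
        # Find the component root, compressing the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance(i, j) <= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Hypothetical one-dimensional data; any pairwise distance could be used.
points = [0.0, 0.5, 1.0, 5.0, 5.4]
clusters = threshold_clusters(
    len(points), lambda i, j: abs(points[i] - points[j]), threshold=0.6)
```

In a distributed setting, the same idea would amount to building the
thresholded edge list and running a connected-components computation on it,
which is an operation GraphX offers.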


-Rex
