You can try hashing to control the feature dimension. MLlib's k-means
implementation can handle sparse data efficiently if the number of
features is not huge. -Xiangrui
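
A minimal sketch of the hashing idea mentioned above, in plain Python rather than Spark. The function name `hash_features`, the bucket count of 1024, and the sample record are all made up for illustration; MLlib has its own hashing transformer (`HashingTF`), which this sketch does not use. Each "attribute=value" pair is hashed into a fixed number of buckets, so the vector width stays the same no matter how many distinct categorical values exist, and only the non-zero entries are stored.

```python
import zlib


def hash_features(record, num_features=1024):
    """Map {attribute: categorical value} to a sparse {index: weight} dict.

    num_features fixes the dimensionality up front (an assumed value here);
    distinct values may collide in the same bucket, which is the usual
    trade-off of the hashing trick.
    """
    vec = {}
    for attr, value in record.items():
        # Hash "attribute=value" so the same value under two different
        # attributes is treated as two different features.
        key = f"{attr}={value}".encode("utf-8")
        idx = zlib.crc32(key) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


# Hypothetical record with two categorical attributes.
row = {"city": "Austin", "zip": "78701"}
sparse = hash_features(row, num_features=1024)
```

The result holds at most one entry per attribute, so storage grows with the number of attributes per row, not with the number of distinct categories.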
On Tue, Jun 16, 2015 at 2:44 PM, Rex X dnsr...@gmail.com wrote:
Hi Sujit,
That's a good point. But 1-hot encoding would blow our data up from
terabytes to petabytes, because we have tens of categorical attributes, and
some of them take thousands of distinct values.
Is there any way to strike a good balance between data size and the right
representation of
Hi Rexx,
In general (i.e. not Spark-specific), it's best to convert categorical data to
1-hot encoding rather than integers - that way the algorithm doesn't use
the ordering implicit in the integer representation.
-sujit
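
To illustrate the point about implicit ordering, here is a small plain-Python sketch (not Spark code). The category list and the helper names `one_hot` and `sq_dist` are invented for the example. With integer codes, the squared Euclidean distance depends on the arbitrary order in which categories were numbered; with 1-hot vectors, every pair of distinct categories is equally far apart.

```python
def one_hot(value, categories):
    """Return a 1-hot list with a 1.0 at the position of `value`."""
    return [1.0 if value == c else 0.0 for c in categories]


def sq_dist(a, b):
    """Squared Euclidean distance, the quantity k-means minimizes."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical nominal attribute with three values.
cities = ["Austin", "Boston", "Chicago"]

# Integer encoding: Austin=0, Boston=1, Chicago=2 (arbitrary order).
int_code = {c: float(i) for i, c in enumerate(cities)}
int_ab = sq_dist([int_code["Austin"]], [int_code["Boston"]])
int_ac = sq_dist([int_code["Austin"]], [int_code["Chicago"]])

# 1-hot encoding: no order is implied.
hot_ab = sq_dist(one_hot("Austin", cities), one_hot("Boston", cities))
hot_ac = sq_dist(one_hot("Austin", cities), one_hot("Chicago", cities))
```

Under the integer codes Chicago ends up four times as far from Austin as Boston is, purely because of the numbering; under 1-hot encoding both distances are equal.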
On Tue, Jun 16, 2015 at 1:17 PM, Rex X dnsr...@gmail.com wrote:
Is it necessary to convert categorical data into integers?
Any tips would be greatly appreciated!
-Rex
On Sun, Jun 14, 2015 at 10:05 AM, Rex X dnsr...@gmail.com wrote:
For clustering analysis, we need a way to measure distances.
When the data contains different levels of measurement -
*binary / categorical (nominal), counts (ordinal), and ratio (scale)*
To be concrete, for example, working with attributes of
*city, zip, satisfaction_level, price*
In the