Re: What is the right algorithm to do cluster analysis with mixed numeric, categorical, and string value attributes?

Xiangrui Meng Wed, 17 Jun 2015 17:06:06 -0700

You can try feature hashing (the hashing trick) to cap the feature
dimension. MLlib's k-means implementation can handle sparse data
efficiently as long as the number of features is not huge. -Xiangrui
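
A minimal sketch of the hashing idea, in plain Python rather than MLlib, may
make the trade-off concrete. Each "attribute=value" pair is hashed into one of
a fixed number of buckets, so the vector width stays at num_features no matter
how many distinct category values exist. The helper names (bucket,
hash_features), the choice of MD5, and the sample record are illustrative
assumptions, not anything from MLlib; MLlib's own HashingTF applies the same
idea to sequences of terms.

```python
import hashlib


def bucket(token: str, num_features: int) -> int:
    """Deterministically map a token to a bucket index in [0, num_features)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_features(record: dict, num_features: int = 1024) -> dict:
    """Encode categorical attributes as a sparse {index: count} vector.

    The attribute name is part of the hashed token, so the same value in two
    different attributes usually lands in different buckets. Collisions are
    possible and become more likely as num_features shrinks.
    """
    vec = {}
    for name, value in record.items():
        idx = bucket(f"{name}={value}", num_features)
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


# Hypothetical record; only the non-zero entries are stored.
row = {"city": "Seattle", "zip": "98101", "satisfaction_level": "high"}
sparse = hash_features(row, num_features=1024)
```

The output holds at most one entry per categorical attribute, so storage grows
with the number of attributes rather than with the number of distinct values.
The price is that unrelated values can collide in the same bucket.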


On Tue, Jun 16, 2015 at 2:44 PM, Rex X <dnsr...@gmail.com> wrote:
> Hi Sujit,
>
> That's a good point. But 1-hot encoding would blow our data up from
> terabytes to petabytes, because we have tens of categorical attributes, and
> some of them take thousands of distinct values.
>
> Is there a way to balance data size against a faithful representation of
> the categories?
>
>
> -Rex
>
>
>
> On Tue, Jun 16, 2015 at 1:27 PM, Sujit Pal <sujitatgt...@gmail.com> wrote:
>>
>> Hi Rex,
>>
>> In general (i.e. not Spark-specific), it's best to convert categorical data
>> to 1-hot encoding rather than integers. That way the algorithm doesn't use
>> the ordering implicit in the integer representation.
>>
>> -sujit
>>
>>
>> On Tue, Jun 16, 2015 at 1:17 PM, Rex X <dnsr...@gmail.com> wrote:
>>>
>>> Is it necessary to convert categorical data into integers?
>>>
>>> Any tips would be greatly appreciated!
>>>
>>> -Rex
>>>
>>> On Sun, Jun 14, 2015 at 10:05 AM, Rex X <dnsr...@gmail.com> wrote:
>>>>
>>>> For clustering analysis, we need a way to measure distances.
>>>>
>>>> Our data contains attributes at different levels of measurement:
>>>> binary / categorical (nominal), counts (ordinal), and ratio (scale).
>>>>
>>>> To be concrete, consider attributes such as
>>>> city, zip, satisfaction_level, price
>>>>
>>>> Meanwhile, real data usually also contains string attributes, for
>>>> example book titles. The distance between two strings can be measured
>>>> by minimum edit distance.
>>>>
>>>>
>>>> SPSS provides Two-Step Cluster, which can handle both ratio-scale
>>>> and ordinal numbers.
>>>>
>>>>
>>>> What is the right algorithm for hierarchical clustering analysis over
>>>> all four kinds of attributes above with MLlib?
>>>>
>>>>
>>>> If we cannot find the right metric to measure distance, an alternative
>>>> is topological data analysis (e.g. linkage, etc.). Can we do that kind
>>>> of analysis with GraphX?
>>>>
>>>>
>>>> -Rex
>>>>
>>>
>>
>
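
Sujit's point about integer codes versus 1-hot encoding can be checked with a
few lines of plain Python. This is a sketch under assumed inputs: the three
city names and the helper functions are hypothetical, and Euclidean distance
is used because it is what k-means minimizes.

```python
import math

# Hypothetical nominal attribute with no natural order.
categories = ["Boston", "Chicago", "Denver"]


def integer_code(value: str) -> list:
    """Encode a category as its position in the list (imposes an order)."""
    return [float(categories.index(value))]


def one_hot(value: str) -> list:
    """Encode a category as an indicator vector, one slot per category."""
    return [1.0 if c == value else 0.0 for c in categories]


def euclidean(a: list, b: list) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Integer codes: Denver ends up twice as far from Boston as Chicago is,
# purely because of the arbitrary numbering.
int_bc = euclidean(integer_code("Boston"), integer_code("Chicago"))
int_bd = euclidean(integer_code("Boston"), integer_code("Denver"))

# 1-hot: every pair of distinct categories is equally far apart.
oh_bc = euclidean(one_hot("Boston"), one_hot("Chicago"))
oh_bd = euclidean(one_hot("Boston"), one_hot("Denver"))
```

With integer codes the distances depend on how the categories happened to be
numbered. With 1-hot vectors all distinct categories are equidistant, at the
cost of one dimension per category value.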

