Re: [Dev] [ML] Categorical or Numerical column?

Nirmal Fernando Fri, 14 Aug 2015 09:15:04 -0700

Combined solution;

* if a feature contains strings -> categorical
* Frequency of distinct values - if they repeat enough (80% default) and if
it doesn't have decimal values, then it is a categorical feature.


On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga <[email protected]> wrote:

> Hi all,
>
> +1 for a hybrid solution. But still a -1 for using skewness even in the
> hybrid solution :D
>
> One good example why we shouldn't use skenwness is the income distribution
> graph in [1]. There, regardless of whether Im using the raw data (then its
> a continuous feature) or whether Im breaking them in to intervals and
> categorized the income in to several levels, I would get the same shape for
> the distribution. i.e skewness would be significant.
>
> So the point  Im trying to make is, categorical features as well as a
> continuous features can be skewed/symmetric, and we cant really distinguish.
>
> [1]
> https://cdn2.vox-cdn.com/uploads/chorus_asset/file/2930990/Distribution_of_Annual_Household_Income_in_the_United_States_2012.0.png
>
>
> On Fri, Aug 14, 2015 at 1:03 AM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks Thushan. Good suggestion on the frequency.
>>
>> *solutions*
>>
>> 1. Categorical threshold: if # of distinct values are less than X, it is
>> a categorical feature.
>> 2. Make all features with only integers (no decimals) categorical.
>> 3. Skewness: if skewness of a distribution of a feature is less than X,
>> it is a categorical feature.
>> 4. Gaps between consecutive distinct values
>> 5. Frequency of distinct values - if they repeat enough, then it is a
>> categorical feature.
>> 6. Combined solution
>>
>> So, I guess as suggested by many of you, we need to build a combined
>> solution.
>>
>> On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <[email protected]>
>> wrote:
>>
>>> Moreover, I think a hybrid approach as follows might work well.
>>>
>>> 1. Select a sample
>>>
>>> 2. Filter columns by the data type and find potential categorical
>>> variables (integer / string)
>>>
>>> 3. Filter further by checking if same values are repeated multiple times
>>> in the dataset.
>>>
>>> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <[email protected]>
>>> wrote:
>>>
>>>> Thanks for all the input.
>>>>
>>>> So let me summarise;
>>>>
>>>> *the problem*
>>>>
>>>> * We need to determine whether a feature is a categorical one or not,
>>>> to draw certain graphs to explore a dataset, before a user starts to build
>>>> analyses (before user input).
>>>> * We can't get a 100% accuracy, hence it is of course a suggestion that
>>>> we do.
>>>> * Question is, what would be the most accurate method.
>>>>
>>>> *solutions*
>>>>
>>>> 1. Categorical threshold: if # of distinct values are less than X, it
>>>> is a categorical feature.
>>>> 2. Make all features with only integers (no decimals) categorical.
>>>> 3. Skewness: if skewness of a distribution of a feature is less than X,
>>>> it is a categorical feature.
>>>> 4. Gaps between consecutive distinct values
>>>> 5. Combined solution
>>>>
>>>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
>>>> [email protected]> wrote:
>>>>
>>>>> Another approach to distinguish between categorical and numerical
>>>>> features can be elaborated as follows:
>>>>>
>>>>> First, we take out the unique values from the column and sort them. If
>>>>> it's a categorical feature, then the gaps between the elements of this
>>>>> sorted list should be equal. In a numerical feature, this is extremely
>>>>> unlikely to happen. This behavior of valid in most scenarios, but there 
>>>>> are
>>>>> a few exceptions as well. eg: when a numerical ID is used as the
>>>>> categorical label - 19933, 19913, 18832, ...
>>>>>
>>>>> This is a very simple hack that can be easily implemented, but not a
>>>>> standard technique.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I mean current approach and skewness?
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Can we use a combination of both?
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> When a dataset is large, in general its said to be approximates to
>>>>>>>> a Normal Distribution. :)  True it Hypothetical, but the point they 
>>>>>>>> make
>>>>>>>> is, when the datasets are large, then properties of a distribution like
>>>>>>>> skewness, variance and etc. become closer to the properties Normal
>>>>>>>> Distribution in most cases..
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Supun,
>>>>>>>>>
>>>>>>>>> Thanks for the reply.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Nirmal,
>>>>>>>>>>
>>>>>>>>>> IMO don't think we would be able to use skewness in this case.
>>>>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>>>>> consider a numerical/continuous feature (not categorical) which is 
>>>>>>>>>> Normally
>>>>>>>>>> Distributed, then the skewness would be 0. Also for a categorical 
>>>>>>>>>> (encoded)
>>>>>>>>>> feature having a systematic distribution, then again the skewness 
>>>>>>>>>> would be
>>>>>>>>>> 0.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> What's the probability of you see a normal distribution of a real
>>>>>>>>> dataset? IMO it's very less and also since what we're doing here is a
>>>>>>>>> suggestion, do you see it as an issue?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>>>> we could determine whether a feature is categorical or Continuous. 
>>>>>>>>>> Usually
>>>>>>>>>> this is strictly dependent on the domain of the dataset (i.e. user 
>>>>>>>>>> have to
>>>>>>>>>> decide this with the knowledge about the data). That was the idea 
>>>>>>>>>> behind
>>>>>>>>>> letting user change the data type.. But since we needed a default 
>>>>>>>>>> option,
>>>>>>>>>> we had to go for the threshold thing, which was the olny option we 
>>>>>>>>>> could
>>>>>>>>>> come-up with. I did a bit of research on this too, but only to find 
>>>>>>>>>> no
>>>>>>>>>> other solution :(
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Supun
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]
>>>>>>>>>> > wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We have a feature in ML where we suggest a given data column of
>>>>>>>>>>> a dataset is categorical or numerical. Currently, how we determine 
>>>>>>>>>>> this is
>>>>>>>>>>> by using a threshold value (The maximum number of categories
>>>>>>>>>>> that can have in a non-string categorical feature. If exceeds,
>>>>>>>>>>> the feature will be treated as a numerical feature.). But this
>>>>>>>>>>> is not a successful measurement for most of the datasets.
>>>>>>>>>>>
>>>>>>>>>>> Can we use 'skewness' of a distribution as a measurement to
>>>>>>>>>>> determine this? Can we say, a column is numerical, if the modulus 
>>>>>>>>>>> of the
>>>>>>>>>>> skewness of the distribution is less than a certain threshold (say 
>>>>>>>>>>> 0.01) ?
>>>>>>>>>>>
>>>>>>>>>>> *References*:
>>>>>>>>>>>
>>>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Thanks & regards,
>>>>>>>>>>> Nirmal
>>>>>>>>>>>
>>>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>>>> Mobile: +94715779733
>>>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Supun Sethunga*
>>>>>>>>>> Software Engineer
>>>>>>>>>> WSO2, Inc.
>>>>>>>>>> http://wso2.com/
>>>>>>>>>> lean | enterprise | middleware
>>>>>>>>>> Mobile : +94 716546324
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Thanks & regards,
>>>>>>>>> Nirmal
>>>>>>>>>
>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>> Mobile: +94715779733
>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Supun Sethunga*
>>>>>>>> Software Engineer
>>>>>>>> WSO2, Inc.
>>>>>>>> http://wso2.com/
>>>>>>>> lean | enterprise | middleware
>>>>>>>> Mobile : +94 716546324
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> ============================
>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>> Phone: 0772360902
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> ============================
>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>> Phone: 0772360902
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Pruthuvi Maheshakya Wijewardena
>>>>> Software Engineer
>>>>> WSO2 : http://wso2.com/
>>>>> Email: [email protected]
>>>>> Mobile: +94711228855
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>>
>>>> Thanks & regards,
>>>> Nirmal
>>>>
>>>> Team Lead - WSO2 Machine Learner
>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>> Mobile: +94715779733
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> Dev mailing list
>>>> [email protected]
>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>>
>>>>
>>>
>>>
>>> --
>>> Regards,
>>>
>>> Thushan Ganegedara
>>> School of IT
>>> University of Sydney, Australia
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>> _______________________________________________
>> Dev mailing list
>> [email protected]
>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>
>>
>
>
> --
> *Supun Sethunga*
> Software Engineer
> WSO2, Inc.
> http://wso2.com/
> lean | enterprise | middleware
> Mobile : +94 716546324
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev

Re: [Dev] [ML] Categorical or Numerical column?

Reply via email to