Thanks Thushan. Good suggestion on the frequency.

*solutions*

1. Categorical threshold: if # of distinct values are less than X, it is a
categorical feature.
2. Make all features with only integers (no decimals) categorical.
3. Skewness: if skewness of a distribution of a feature is less than X, it
is a categorical feature.
4. Gaps between consecutive distinct values
5. Frequency of distinct values - if they repeat enough, then it is a
categorical feature.
6. Combined solution

So, I guess as suggested by many of you, we need to build a combined
solution.

On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <[email protected]>
wrote:

> Moreover, I think a hybrid approach as follows might work well.
>
> 1. Select a sample
>
> 2. Filter columns by the data type and find potential categorical
> variables (integer / string)
>
> 3. Filter further by checking if same values are repeated multiple times
> in the dataset.
>
> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks for all the input.
>>
>> So let me summarise;
>>
>> *the problem*
>>
>> * We need to determine whether a feature is a categorical one or not, to
>> draw certain graphs to explore a dataset, before a user starts to build
>> analyses (before user input).
>> * We can't get a 100% accuracy, hence it is of course a suggestion that
>> we do.
>> * Question is, what would be the most accurate method.
>>
>> *solutions*
>>
>> 1. Categorical threshold: if # of distinct values are less than X, it is
>> a categorical feature.
>> 2. Make all features with only integers (no decimals) categorical.
>> 3. Skewness: if skewness of a distribution of a feature is less than X,
>> it is a categorical feature.
>> 4. Gaps between consecutive distinct values
>> 5. Combined solution
>>
>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <
>> [email protected]> wrote:
>>
>>> Another approach to distinguish between categorical and numerical
>>> features can be elaborated as follows:
>>>
>>> First, we take out the unique values from the column and sort them. If
>>> it's a categorical feature, then the gaps between the elements of this
>>> sorted list should be equal. In a numerical feature, this is extremely
>>> unlikely to happen. This behavior of valid in most scenarios, but there are
>>> a few exceptions as well. eg: when a numerical ID is used as the
>>> categorical label - 19933, 19913, 18832, ...
>>>
>>> This is a very simple hack that can be easily implemented, but not a
>>> standard technique.
>>>
>>> WDYT?
>>>
>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]>
>>> wrote:
>>>
>>>> I mean current approach and skewness?
>>>>
>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]>
>>>> wrote:
>>>>
>>>>> Can we use a combination of both?
>>>>>
>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> When a dataset is large, in general its said to be approximates to a
>>>>>> Normal Distribution. :)  True it Hypothetical, but the point they make 
>>>>>> is,
>>>>>> when the datasets are large, then properties of a distribution like
>>>>>> skewness, variance and etc. become closer to the properties Normal
>>>>>> Distribution in most cases..
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Supun,
>>>>>>>
>>>>>>> Thanks for the reply.
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Nirmal,
>>>>>>>>
>>>>>>>> IMO don't think we would be able to use skewness in this case.
>>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>>> consider a numerical/continuous feature (not categorical) which is 
>>>>>>>> Normally
>>>>>>>> Distributed, then the skewness would be 0. Also for a categorical 
>>>>>>>> (encoded)
>>>>>>>> feature having a systematic distribution, then again the skewness 
>>>>>>>> would be
>>>>>>>> 0.
>>>>>>>>
>>>>>>>
>>>>>>> What's the probability of you see a normal distribution of a real
>>>>>>> dataset? IMO it's very less and also since what we're doing here is a
>>>>>>> suggestion, do you see it as an issue?
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> We did have this concern at the beginning as well, regarding how we
>>>>>>>> could determine whether a feature is categorical or Continuous. Usually
>>>>>>>> this is strictly dependent on the domain of the dataset (i.e. user 
>>>>>>>> have to
>>>>>>>> decide this with the knowledge about the data). That was the idea 
>>>>>>>> behind
>>>>>>>> letting user change the data type.. But since we needed a default 
>>>>>>>> option,
>>>>>>>> we had to go for the threshold thing, which was the olny option we 
>>>>>>>> could
>>>>>>>> come-up with. I did a bit of research on this too, but only to find no
>>>>>>>> other solution :(
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Supun
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> We have a feature in ML where we suggest a given data column of a
>>>>>>>>> dataset is categorical or numerical. Currently, how we determine this 
>>>>>>>>> is by
>>>>>>>>> using a threshold value (The maximum number of categories that
>>>>>>>>> can have in a non-string categorical feature. If exceeds, the
>>>>>>>>> feature will be treated as a numerical feature.). But this is not
>>>>>>>>> a successful measurement for most of the datasets.
>>>>>>>>>
>>>>>>>>> Can we use 'skewness' of a distribution as a measurement to
>>>>>>>>> determine this? Can we say, a column is numerical, if the modulus of 
>>>>>>>>> the
>>>>>>>>> skewness of the distribution is less than a certain threshold (say 
>>>>>>>>> 0.01) ?
>>>>>>>>>
>>>>>>>>> *References*:
>>>>>>>>>
>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Thanks & regards,
>>>>>>>>> Nirmal
>>>>>>>>>
>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>> Mobile: +94715779733
>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Supun Sethunga*
>>>>>>>> Software Engineer
>>>>>>>> WSO2, Inc.
>>>>>>>> http://wso2.com/
>>>>>>>> lean | enterprise | middleware
>>>>>>>> Mobile : +94 716546324
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Supun Sethunga*
>>>>>> Software Engineer
>>>>>> WSO2, Inc.
>>>>>> http://wso2.com/
>>>>>> lean | enterprise | middleware
>>>>>> Mobile : +94 716546324
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> ============================
>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>> Site: http://people.apache.org/~hemapani/
>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>> Phone: 0772360902
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> ============================
>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>> Site: http://people.apache.org/~hemapani/
>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>> Phone: 0772360902
>>>>
>>>
>>>
>>>
>>> --
>>> Pruthuvi Maheshakya Wijewardena
>>> Software Engineer
>>> WSO2 : http://wso2.com/
>>> Email: [email protected]
>>> Mobile: +94711228855
>>>
>>>
>>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>> _______________________________________________
>> Dev mailing list
>> [email protected]
>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>
>>
>
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev

Reply via email to