Thanks for all the input.

So let me summarise:

*the problem*

* We need to determine whether a feature is categorical or not, so that we
can draw certain graphs to explore a dataset before the user starts to
build analyses (i.e. before any user input).
* We can't achieve 100% accuracy, so whatever we come up with is only a
suggestion.
* The question is which method would be the most accurate.

*solutions*

1. Categorical threshold: if the number of distinct values is less than X,
it is a categorical feature.
2. Make all features with only integers (no decimals) categorical.
3. Skewness: if the modulus of the skewness of a feature's distribution is
less than X, it is a numerical feature (otherwise categorical).
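4. Gaps between consecutive distinct values: if the gaps between the sorted
distinct values are all equal, it is a categorical feature.
5. Combined solution: a combination of the above.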
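
To make the options easier to compare, here is a rough Python sketch of
each check. It is only an illustration, not what ML currently does: all
function names are made up for this mail, and the default of 20 for
max_categories is an arbitrary stand-in for X. The skewness function only
computes the value, since the thread has not settled how to apply a
threshold to it, and the combination shown at the end is just one possible
reading of option 5.

```python
from statistics import mean, pstdev


def few_distinct_values(values, max_categories=20):
    """Option 1: few distinct values suggests a categorical feature.

    max_categories is a hypothetical placeholder for the threshold X.
    """
    return len(set(values)) <= max_categories


def all_integers(values):
    """Option 2: every value is a whole number (no decimals)."""
    return all(float(v).is_integer() for v in values)


def skewness(values):
    """Option 3 input: population skewness (third standardised moment).

    Returns 0.0 for a constant column to avoid dividing by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return 0.0
    return mean(((v - mu) / sigma) ** 3 for v in values)


def equal_gaps(values, tolerance=1e-9):
    """Option 4: gaps between consecutive sorted distinct values are equal."""
    distinct = sorted(set(values))
    if len(distinct) < 3:
        return True  # fewer than two gaps, so there is nothing to compare
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return max(gaps) - min(gaps) <= tolerance


def suggest_categorical(values, max_categories=20):
    """Option 5 (assumed combination): options 1, 2 and 4 must all agree."""
    return (
        all_integers(values)
        and few_distinct_values(values, max_categories)
        and equal_gaps(values)
    )
```

With this sketch, a column such as 1, 2, 3, 1, 2, 3 is suggested as
categorical, while the numerical-ID example from the thread (19933, 19913,
18832) fails the equal-gaps check, which is exactly the exception
Maheshakya points out below.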

On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <[email protected]> wrote:

> Another approach to distinguish between categorical and numerical features
> can be elaborated as follows:
>
> First, we take the unique values from the column and sort them. If
> it's a categorical feature, then the gaps between the elements of this
> sorted list should be equal. In a numerical feature, this is extremely
> unlikely to happen. This behavior is valid in most scenarios, but there are
> a few exceptions as well, e.g. when a numerical ID is used as the
> categorical label - 19933, 19913, 18832, ...
>
> This is a very simple hack that can be easily implemented, but not a
> standard technique.
>
> WDYT?
>
> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]> wrote:
>
>> I mean the current approach and skewness?
>>
>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]> wrote:
>>
>>> Can we use a combination of both?
>>>
>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]> wrote:
>>>
>>>> When a dataset is large, it is generally said to approximate a
>>>> Normal Distribution. :)  True, it's hypothetical, but the point they make
>>>> is that when datasets are large, properties of a distribution such as
>>>> skewness, variance, etc. become closer to the properties of a Normal
>>>> Distribution in most cases.
>>>>
>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Supun,
>>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Nirmal,
>>>>>>
>>>>>> IMO I don't think we would be able to use skewness in this case.
>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>> Normally Distributed, then the skewness would be 0. Likewise, for a
>>>>>> categorical (encoded) feature with a symmetric distribution, the
>>>>>> skewness would again be 0.
>>>>>>
>>>>>
>>>>> What's the probability of seeing a normal distribution in a real
>>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>>> suggestion, do you see it as an issue?
>>>>>
>>>>>
>>>>>>
>>>>>> We did have this concern at the beginning as well, regarding how we
>>>>>> could determine whether a feature is categorical or continuous. Usually
>>>>>> this is strictly dependent on the domain of the dataset (i.e. the user
>>>>>> has to decide this with knowledge of the data). That was the idea
>>>>>> behind letting the user change the data type. But since we needed a
>>>>>> default option, we had to go for the threshold approach, which was the
>>>>>> only option we could come up with. I did a bit of research on this too,
>>>>>> but found no other solution :(
>>>>>>
>>>>>> Thanks,
>>>>>> Supun
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We have a feature in ML where we suggest whether a given data column
>>>>>>> of a dataset is categorical or numerical. Currently, we determine this
>>>>>>> using a threshold value (the maximum number of categories that a
>>>>>>> non-string categorical feature can have; if this is exceeded, the
>>>>>>> feature is treated as a numerical feature). But this is not a
>>>>>>> successful measure for most datasets.
>>>>>>>
>>>>>>> Can we use the 'skewness' of a distribution as a measure to determine
>>>>>>> this? Can we say that a column is numerical if the modulus of the
>>>>>>> skewness of the distribution is less than a certain threshold (say
>>>>>>> 0.01)?
>>>>>>>
>>>>>>> *References*:
>>>>>>>
>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> *Supun Sethunga*
>>>>>> Software Engineer
>>>>>> WSO2, Inc.
>>>>>> http://wso2.com/
>>>>>> lean | enterprise | middleware
>>>>>> Mobile : +94 716546324
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>>>
>>
>>
>>
>
>
>
> --
> Pruthuvi Maheshakya Wijewardena
> Software Engineer
> WSO2 : http://wso2.com/
> Email: [email protected]
> Mobile: +94711228855
>
>
>


-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev