Thanks for all the input. So let me summarise:

*The problem*

* We need to determine whether a feature is categorical or not, in order to draw certain graphs for exploring a dataset before the user starts to build analyses (i.e. before any user input).
* We can't get 100% accuracy, so whatever we do is only a suggestion.
* The question is which method would be the most accurate.

*Solutions*

1. Categorical threshold: if the number of distinct values is less than X, it is a categorical feature.
2. Make all features with only integer values (no decimals) categorical.
3. Skewness: if the modulus of the skewness of a feature's distribution is less than X, it is a numerical feature; otherwise it is categorical.
4. Gaps between consecutive distinct values: if the gaps between the sorted distinct values are equal, it is a categorical feature.
5. Combined solution: a combination of the above.

On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <[email protected]> wrote:

> Another approach to distinguish between categorical and numerical features
> can be elaborated as follows:
>
> First, we take the unique values from the column and sort them. If it's a
> categorical feature, then the gaps between the elements of this sorted
> list should be equal. In a numerical feature, this is extremely unlikely
> to happen. This behavior is valid in most scenarios, but there are a few
> exceptions as well, e.g. when a numerical ID is used as the categorical
> label: 19933, 19913, 18832, ...
>
> This is a very simple hack that can be easily implemented, but it is not
> a standard technique.
>
> WDYT?
>
> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]> wrote:
>
>> I mean the current approach and skewness?
>>
>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]> wrote:
>>
>>> Can we use a combination of both?
>>>
>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]> wrote:
>>>
>>>> When a dataset is large, it is generally said to approximate a
>>>> Normal Distribution. :) True, it's hypothetical, but the point they
>>>> make is that when datasets are large, properties of a distribution
>>>> such as skewness, variance, etc. become closer to the properties of a
>>>> Normal Distribution in most cases.
>>>>
>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]> wrote:
>>>>
>>>>> Hi Supun,
>>>>>
>>>>> Thanks for the reply.
>>>>>
>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]> wrote:
>>>>>
>>>>>> Hi Nirmal,
>>>>>>
>>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>> Normally Distributed, then the skewness would be 0. Also, for a
>>>>>> categorical (encoded) feature having a symmetric distribution, the
>>>>>> skewness would again be 0.
>>>>>
>>>>> What's the probability of seeing a normal distribution in a real
>>>>> dataset? IMO it's very low, and since what we're doing here is only a
>>>>> suggestion, do you see it as an issue?
>>>>>
>>>>>> We did have this concern at the beginning as well, regarding how we
>>>>>> could determine whether a feature is categorical or continuous.
>>>>>> Usually this is strictly dependent on the domain of the dataset
>>>>>> (i.e. the user has to decide this with knowledge about the data).
>>>>>> That was the idea behind letting the user change the data type. But
>>>>>> since we needed a default option, we had to go for the threshold
>>>>>> approach, which was the only option we could come up with. I did a
>>>>>> bit of research on this too, but found no other solution :(
>>>>>>
>>>>>> Thanks,
>>>>>> Supun
>>>>>>
>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> We have a feature in ML where we suggest whether a given data
>>>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>>>> determine this by using a threshold value (the maximum number of
>>>>>>> categories that a non-string categorical feature can have; if it
>>>>>>> is exceeded, the feature is treated as a numerical feature). But
>>>>>>> this is not a successful measure for most datasets.
>>>>>>>
>>>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>>>> determine this? Can we say a column is numerical if the modulus of
>>>>>>> the skewness of the distribution is less than a certain threshold
>>>>>>> (say 0.01)?
>>>>>>>
>>>>>>> *References*:
>>>>>>>
>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>> Thanks & regards,
>>>>>>> Nirmal
>>>>>>>
>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>> Mobile: +94715779733
>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>
>>>>>> --
>>>>>> *Supun Sethunga*
>>>>>> Software Engineer
>>>>>> WSO2, Inc.
>>>>>> http://wso2.com/
>>>>>> lean | enterprise | middleware
>>>>>> Mobile : +94 716546324
>>>>>
>>>>> --
>>>>>
>>>>> Thanks & regards,
>>>>> Nirmal
>>>>
>>>> --
>>>> *Supun Sethunga*
>>>
>>> --
>>> ============================
>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>> Site: http://people.apache.org/~hemapani/
>>> Photos: http://www.flickr.com/photos/hemapani/
>>> Phone: 0772360902
>
> --
> Pruthuvi Maheshakya Wijewardena
> Software Engineer
> WSO2 : http://wso2.com/
> Email: [email protected]
> Mobile: +94711228855

--

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
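PS: To make the options discussed in this thread a bit more concrete, here is a rough Python sketch of the individual heuristics and one possible way of combining them. It is only an illustration under stated assumptions, not the current ML implementation: all function names, the default thresholds (20 categories, 0.01 skewness) and the "at least 3 of 4 votes" rule are hypothetical choices, and the input is assumed to be a non-empty list of numbers.

```python
from statistics import fmean


def few_distinct_values(values, max_categories=20):
    """Threshold heuristic: fewer than max_categories distinct values."""
    return len(set(values)) < max_categories


def all_integers(values):
    """Integer heuristic: no value has a fractional part."""
    return all(float(v).is_integer() for v in values)


def skewness(values):
    """Population skewness m3 / m2^1.5 (0.0 for a constant column)."""
    n = len(values)
    mean = fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def equal_gaps(values, tol=1e-9):
    """Gap heuristic: sorted distinct values are evenly spaced."""
    distinct = sorted(set(values))
    if len(distinct) < 3:
        return True  # zero or one gap is trivially "equal"
    gaps = [b - a for a, b in zip(distinct, distinct[1:])]
    return max(gaps) - min(gaps) <= tol


def suggest_categorical(values, max_categories=20, skew_threshold=0.01):
    """Hypothetical combination: categorical if at least 3 of 4 agree."""
    votes = [
        few_distinct_values(values, max_categories),
        all_integers(values),
        equal_gaps(values),
        # Per the original proposal, |skewness| below the threshold
        # suggests numerical, so the categorical vote is the opposite.
        abs(skewness(values)) >= skew_threshold,
    ]
    return sum(votes) >= 3
```

For example, `equal_gaps([19933, 19913, 18832])` is false, which is exactly the numerical-ID exception mentioned for the gap approach, so a combination is less likely to be thrown off by any single rule. The thresholds would still need tuning against real datasets.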
_______________________________________________
Dev mailing list
[email protected]
http://wso2.org/cgi-bin/mailman/listinfo/dev
