Combined solution:

* If a feature contains strings, it is a categorical feature.
* Frequency of distinct values: if the values repeat enough (80% default) and the feature doesn't have decimal values, then it is a categorical feature.
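The combined rule above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions, not the WSO2 Machine Learner implementation: the function name `suggest_categorical` and the parameter `repeat_threshold` are hypothetical, the column is assumed to be already parsed into strings and numbers with no missing values, and "repeat enough" is read here as "at least 80% of the rows hold a value that occurs more than once", which the thread does not pin down.

```python
from collections import Counter
from typing import Iterable, Union

Value = Union[str, int, float]


def suggest_categorical(values: Iterable[Value], repeat_threshold: float = 0.8) -> bool:
    """Suggest whether a column looks categorical (hypothetical helper)."""
    column = list(values)
    if not column:
        return False  # nothing to base a suggestion on

    # Rule 1: a feature that contains strings is treated as categorical.
    if any(isinstance(v, str) for v in column):
        return True

    # Rule 2a: any value with a fractional part rules out "categorical".
    if any(not float(v).is_integer() for v in column):
        return False

    # Rule 2b (assumed reading of "repeat enough"): the share of rows whose
    # value occurs more than once in the column must reach the threshold.
    counts = Counter(column)
    repeated_rows = sum(c for c in counts.values() if c > 1)
    return repeated_rows / len(column) >= repeat_threshold
```

With these assumptions, `suggest_categorical([0, 1, 1, 0, 1])` returns `True`, while `suggest_categorical(range(1000))` returns `False` because no value repeats. The result is still only a suggestion that the user can override.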

On Fri, Aug 14, 2015 at 7:22 PM, Supun Sethunga <[email protected]> wrote:

> Hi all,
>
> +1 for a hybrid solution. But still a -1 for using skewness even in the
> hybrid solution :D
>
> One good example of why we shouldn't use skewness is the income
> distribution graph in [1]. There, regardless of whether I'm using the raw
> data (then it's a continuous feature) or breaking it into intervals and
> categorizing the income into several levels, I would get the same shape
> for the distribution, i.e. the skewness would be significant.
>
> So the point I'm trying to make is that categorical features as well as
> continuous features can be skewed or symmetric, and we can't really
> distinguish between them that way.
>
> [1]
> https://cdn2.vox-cdn.com/uploads/chorus_asset/file/2930990/Distribution_of_Annual_Household_Income_in_the_United_States_2012.0.png
>
> On Fri, Aug 14, 2015 at 1:03 AM, Nirmal Fernando <[email protected]> wrote:
>
>> Thanks Thushan. Good suggestion on the frequency.
>>
>> *solutions*
>>
>> 1. Categorical threshold: if the # of distinct values is less than X, it
>>    is a categorical feature.
>> 2. Make all features with only integers (no decimals) categorical.
>> 3. Skewness: if the skewness of the distribution of a feature is less
>>    than X, it is a categorical feature.
>> 4. Gaps between consecutive distinct values
>> 5. Frequency of distinct values: if they repeat enough, then it is a
>>    categorical feature.
>> 6. Combined solution
>>
>> So, I guess as suggested by many of you, we need to build a combined
>> solution.
>>
>> On Fri, Aug 14, 2015 at 10:29 AM, Thushan Ganegedara <[email protected]> wrote:
>>
>>> Moreover, I think a hybrid approach as follows might work well.
>>>
>>> 1. Select a sample
>>>
>>> 2. Filter columns by the data type and find potential categorical
>>>    variables (integer / string)
>>>
>>> 3. Filter further by checking whether the same values are repeated
>>>    multiple times in the dataset.
>>>
>>> On Fri, Aug 14, 2015 at 2:53 PM, Nirmal Fernando <[email protected]> wrote:
>>>
>>>> Thanks for all the input.
>>>>
>>>> So let me summarise:
>>>>
>>>> *the problem*
>>>>
>>>> * We need to determine whether a feature is a categorical one or not,
>>>>   in order to draw certain graphs to explore a dataset, before a user
>>>>   starts to build analyses (before user input).
>>>> * We can't get 100% accuracy, hence what we do is of course only a
>>>>   suggestion.
>>>> * The question is, what would be the most accurate method?
>>>>
>>>> *solutions*
>>>>
>>>> 1. Categorical threshold: if the # of distinct values is less than X,
>>>>    it is a categorical feature.
>>>> 2. Make all features with only integers (no decimals) categorical.
>>>> 3. Skewness: if the skewness of the distribution of a feature is less
>>>>    than X, it is a categorical feature.
>>>> 4. Gaps between consecutive distinct values
>>>> 5. Combined solution
>>>>
>>>> On Fri, Aug 14, 2015 at 9:33 AM, Maheshakya Wijewardena <[email protected]> wrote:
>>>>
>>>>> Another approach to distinguish between categorical and numerical
>>>>> features can be elaborated as follows:
>>>>>
>>>>> First, we take the unique values from the column and sort them. If
>>>>> it's a categorical feature, then the gaps between the elements of this
>>>>> sorted list should be equal. In a numerical feature, this is extremely
>>>>> unlikely to happen. This behavior is valid in most scenarios, but there
>>>>> are a few exceptions as well, e.g. when a numerical ID is used as the
>>>>> categorical label: 19933, 19913, 18832, ...
>>>>>
>>>>> This is a very simple hack that can be easily implemented, but it is
>>>>> not a standard technique.
>>>>>
>>>>> WDYT?
>>>>>
>>>>> On Fri, Aug 14, 2015 at 8:55 AM, Srinath Perera <[email protected]> wrote:
>>>>>
>>>>>> I mean current approach and skewness?
>>>>>>
>>>>>> On Fri, Aug 14, 2015 at 8:54 AM, Srinath Perera <[email protected]> wrote:
>>>>>>
>>>>>>> Can we use a combination of both?
>>>>>>>
>>>>>>> On Thu, Aug 13, 2015 at 8:46 PM, Supun Sethunga <[email protected]> wrote:
>>>>>>>
>>>>>>>> When a dataset is large, it is generally said to approximate a
>>>>>>>> Normal Distribution. :) True, it's hypothetical, but the point they
>>>>>>>> make is that when datasets are large, properties of a distribution
>>>>>>>> such as skewness, variance, etc. become closer to the properties of
>>>>>>>> a Normal Distribution in most cases.
>>>>>>>>
>>>>>>>> On Thu, Aug 13, 2015 at 11:07 AM, Nirmal Fernando <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Supun,
>>>>>>>>>
>>>>>>>>> Thanks for the reply.
>>>>>>>>>
>>>>>>>>> On Thu, Aug 13, 2015 at 8:09 PM, Supun Sethunga <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Nirmal,
>>>>>>>>>>
>>>>>>>>>> IMO, I don't think we would be able to use skewness in this case.
>>>>>>>>>> Skewness says how symmetric the distribution is. For example, if we
>>>>>>>>>> consider a numerical/continuous feature (not categorical) which is
>>>>>>>>>> Normally Distributed, then the skewness would be 0. Also, for a
>>>>>>>>>> categorical (encoded) feature having a symmetric distribution, the
>>>>>>>>>> skewness would again be 0.
>>>>>>>>>
>>>>>>>>> What's the probability of seeing a normal distribution in a real
>>>>>>>>> dataset? IMO it's very low, and since what we're doing here is a
>>>>>>>>> suggestion, do you see it as an issue?
>>>>>>>>>
>>>>>>>>>> We did have this concern at the beginning as well, regarding how
>>>>>>>>>> we could determine whether a feature is categorical or continuous.
>>>>>>>>>> Usually this is strictly dependent on the domain of the dataset
>>>>>>>>>> (i.e. the user has to decide this with knowledge about the data).
>>>>>>>>>> That was the idea behind letting the user change the data type.
>>>>>>>>>> But since we needed a default option, we had to go for the
>>>>>>>>>> threshold approach, which was the only option we could come up
>>>>>>>>>> with. I did a bit of research on this too, but found no other
>>>>>>>>>> solution :(
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Supun
>>>>>>>>>>
>>>>>>>>>> On Thu, Aug 13, 2015 at 1:49 AM, Nirmal Fernando <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> We have a feature in ML where we suggest whether a given data
>>>>>>>>>>> column of a dataset is categorical or numerical. Currently, we
>>>>>>>>>>> determine this by using a threshold value (the maximum number of
>>>>>>>>>>> categories that a non-string categorical feature can have; if it
>>>>>>>>>>> is exceeded, the feature is treated as a numerical feature). But
>>>>>>>>>>> this is not a successful measure for most datasets.
>>>>>>>>>>>
>>>>>>>>>>> Can we use the 'skewness' of a distribution as a measure to
>>>>>>>>>>> determine this? Can we say a column is numerical if the modulus
>>>>>>>>>>> of the skewness of the distribution is less than a certain
>>>>>>>>>>> threshold (say 0.01)?
>>>>>>>>>>>
>>>>>>>>>>> *References*:
>>>>>>>>>>>
>>>>>>>>>>> http://www.itrcweb.org/gsmc-1/Content/GW%20Stats/5%20Methods%20in%20indiv%20Topics/5%206%20Distributional%20Tests.htm
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Thanks & regards,
>>>>>>>>>>> Nirmal
>>>>>>>>>>>
>>>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>>>> Mobile: +94715779733
>>>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Supun Sethunga*
>>>>>>>>>> Software Engineer
>>>>>>>>>> WSO2, Inc.
>>>>>>>>>> http://wso2.com/
>>>>>>>>>> lean | enterprise | middleware
>>>>>>>>>> Mobile : +94 716546324
>>>>>>>
>>>>>>> --
>>>>>>> ============================
>>>>>>> Blog: http://srinathsview.blogspot.com twitter:@srinath_perera
>>>>>>> Site: http://people.apache.org/~hemapani/
>>>>>>> Photos: http://www.flickr.com/photos/hemapani/
>>>>>>> Phone: 0772360902
>>>>>
>>>>> --
>>>>> Pruthuvi Maheshakya Wijewardena
>>>>> Software Engineer
>>>>> WSO2 : http://wso2.com/
>>>>> Email: [email protected]
>>>>> Mobile: +94711228855
>>>>
>>>> _______________________________________________
>>>> Dev mailing list
>>>> [email protected]
>>>> http://wso2.org/cgi-bin/mailman/listinfo/dev
>>>
>>> --
>>> Regards,
>>>
>>> Thushan Ganegedara
>>> School of IT
>>> University of Sydney, Australia

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/
_______________________________________________ Dev mailing list [email protected] http://wso2.org/cgi-bin/mailman/listinfo/dev
