Moreover, I think a hybrid approach along the following lines might work well:

1. Select a sample of the dataset.
2. Filter columns by data type to find potential categorical variables
   (integer / string).
3. Filter further by checking whether the same values are repeated multiple
   times in the dataset.

On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara <thu...@gmail.com> wrote:

> Hi,
>
> Yes, no matter which approach is used, there are always going to be outliers
> that do not fit the defined rules. But for these corner cases, the user
> always has the opportunity to change the variable to numerical.
>
> One more approach is to introduce a measure of replication of values in a
> column. If the column shows the same values repeated many times, imo, it
> is a good indicator for detecting a categorical variable.
>
> On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara <thu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>>> This was mainly due to the detection of a numerical feature as a
>>>> categorical one.
>>>
>>> Oh, it makes sense now. Why don't we try taking a sample of data and, if
>>> the sample contains only integers (or doubles without any decimals) or
>>> strings, consider it a categorical variable?
>>>
>>
>> I tried that approach too, but there are some datasets, like the automobile
>> dataset's normalized-losses feature, which has integer values (0-164) but
>> is probably not categorical.
>>
>>>
>>>> We suggested increasing the categorical threshold as a work-around.
>>>> @thushan did it work?
>>>
>>> Yes, it worked, after increasing the threshold to 40.
>>>
>>> On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> This was mainly due to the detection of a numerical feature as a
>>>> categorical one.
>>>>
>>>> We suggested increasing the categorical threshold as a work-around.
>>>> @thushan did it work?
>>>>
>>>> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara <thu...@gmail.com>
>>>> wrote:
>>>>
>>>>> This issue occurs if I turn the response variable into a categorical
>>>>> variable. If I read the variable as a numerical variable, the values are
>>>>> read correctly.
>>>>>
>>>>> So I presume there is a fault in the categorical conversion of the
>>>>> variable.
>>>>>
>>>>> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara <thu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I still get the same result
>>>>>>
>>>>>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>>>>>> 1.0 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0
>>>>>> 12.0 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
>>>>>> 13.0 13.0
>>>>>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0
>>>>>> 14.0 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0
>>>>>> 15.0 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0
>>>>>> 16.0 16.0
>>>>>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0
>>>>>> 17.0 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0
>>>>>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0
>>>>>> 19.0 19.0
>>>>>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>>>>>> 19.0 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0
>>>>>> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0
>>>>>> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0
>>>>>> 5.0 5.0
>>>>>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>>>>>> 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>>>>>> 6.0 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0
>>>>>> 7.0 7.0
>>>>>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>> 3.0 3.0
>>>>>> 3.0 3.0 3.0 3.0
>>>>>>
>>>>>> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you try the following code?
>>>>>>>
>>>>>>> List<LabeledPoint> points = labeledPoints.collect();
>>>>>>> for (int i = 0; i < points.size(); i++) {
>>>>>>>     System.out.print(points.get(i).label() + "\t");
>>>>>>> }
>>>>>>>
>>>>>>> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara <
>>>>>>> thu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I used the following snippet
>>>>>>>>
>>>>>>>> for (int i = 0; i < labeledPoints.collect().size(); i++) {
>>>>>>>>     System.out.print(labeledPoints.collect().get(i).label()
>>>>>>>>             + "\t");
>>>>>>>> }
>>>>>>>>
>>>>>>>> in the public MLModel build() throws MLModelBuilderException method
>>>>>>>> of DeeplearningModelBuilder.java
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi thushan,
>>>>>>>>>
>>>>>>>>> We need more info. What exactly did you print, and where?
>>>>>>>>>
>>>>>>>>> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara <
>>>>>>>>> thu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I found the potential cause of the poor accuracy for the leaf
>>>>>>>>>> dataset. It seems the data read into ML is wrong.
>>>>>>>>>>
>>>>>>>>>> I have attached the data file as a CSV (classes are in the last
>>>>>>>>>> column).
>>>>>>>>>>
>>>>>>>>>> However, when I print out the labels of the read data (classes),
>>>>>>>>>> it looks something like below. Clearly there aren't this many "3.0"
>>>>>>>>>> classes, and there should be classes up to 36.0.
>>>>>>>>>>
>>>>>>>>>> Is this caused by a bug?
>>>>>>>>>>
>>>>>>>>>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>>>>>>>>>> 1.0 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0
>>>>>>>>>> 12.0 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0
>>>>>>>>>> 13.0 13.0
>>>>>>>>>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0
>>>>>>>>>> 14.0 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0
>>>>>>>>>> 15.0 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0
>>>>>>>>>> 16.0 16.0
>>>>>>>>>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0
>>>>>>>>>> 17.0 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0
>>>>>>>>>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0
>>>>>>>>>> 19.0 19.0
>>>>>>>>>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>>>>>>>>>> 19.0 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0
>>>>>>>>>> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0
>>>>>>>>>> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0
>>>>>>>>>> 5.0 5.0
>>>>>>>>>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>>>>>>>>>> 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>>>>>>>>>> 6.0 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0
>>>>>>>>>> 7.0 7.0
>>>>>>>>>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>>>>>>>> 3.0 3.0
>>>>>>>>>> 3.0 3.0 3.0 3.0
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Thushan Ganegedara
>>>>>>>>>> School of IT
>>>>>>>>>> University of Sydney, Australia
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Thanks & regards,
>>>>>>>>> Nirmal
>>>>>>>>>
>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>> Mobile: +94715779733
>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/

--
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia
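
A rough sketch of the hybrid approach listed at the top of this mail (sample, filter by data type, then check for repeated values) might look like the following. It is only an illustration and not existing ML code: the class name CategoricalDetector, the method isLikelyCategorical, and the sampleSize and maxDistinctRatio parameters are all hypothetical, and taking the first rows as the sample is a simplifying assumption.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of WSO2 ML; a sketch of the three-step heuristic.
final class CategoricalDetector {

    private CategoricalDetector() {
    }

    // Returns the parsed number, or null when the value is not numeric (a plain string).
    private static Double parseOrNull(String value) {
        try {
            return Double.valueOf(value.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /**
     * @param column           raw column values (assumed non-null)
     * @param sampleSize       number of leading values to inspect (assumed positive)
     * @param maxDistinctRatio upper bound on distinct/sampled values for a
     *                         column to still count as "repetitive"
     */
    static boolean isLikelyCategorical(List<String> column, int sampleSize,
                                       double maxDistinctRatio) {
        // Step 1: select a sample (here simply the first sampleSize values).
        List<String> sample = column.subList(0, Math.min(sampleSize, column.size()));
        if (sample.isEmpty()) {
            return false;
        }

        Set<String> distinct = new HashSet<>();
        for (String value : sample) {
            // Step 2: only strings and whole numbers (e.g. "3" or "3.0") qualify;
            // any number with a fractional part rules the column out.
            Double number = parseOrNull(value);
            if (number != null && !number.isNaN() && number != Math.rint(number)) {
                return false;
            }
            distinct.add(value.trim());
        }

        // Step 3: require that values repeat, i.e. few distinct values per sample.
        double distinctRatio = (double) distinct.size() / sample.size();
        return distinctRatio <= maxDistinctRatio;
    }
}
```

A call such as CategoricalDetector.isLikelyCategorical(values, 1000, 0.1) would then flag a column only when it passes both the type check and the repetition check. With a large enough sample, the repetition check may help with integer-valued features such as normalized-losses, although the user would still need the option to override the result.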
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev