Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Hi thushan,

We need more info. What did you exactly print and where?

On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara wrote:
> Hi,
>
> I found the potential cause of the poor accuracy for the leaf dataset. It
> seems the data read into ML is wrong.
>
> I have attached the data file as a CSV (classes are in the last column)
>
> However, when I print out the labels of the read data (classes), it looks
> something like below. Clearly there aren't this many "3.0" classes and
> there should be classes up to 36.0.
>
> Is this caused by a bug?
>
> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
> 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0
> 13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0
> 14.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0
> 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0
> 16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0
> 17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0
> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0
> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
> 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia

--
Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/

___
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
I used the following snippet

for (int i = 0; i < ...; i++) {
    System.out.print(labeledPoints.collect().get(i).label() + "\t");
}

in the public MLModel build() throws MLModelBuilderException in DeeplearningModelBuilder.java

On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando wrote:
> Hi thushan,
>
> We need more info. What did you exactly print and where?

--
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Can you try the following code?

List<LabeledPoint> points = labeledPoints.collect();
for (int i = 0; i < ...; i++) {
    System.out.print(points.get(i).label() + "\t");
}

On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara wrote:
> I used the following snippet
>
> for (int i = 0; i < ...; i++) {
>     System.out.print(labeledPoints.collect().get(i).label() + "\t");
> }
>
> in the public MLModel build() throws MLModelBuilderException in
> DeeplearningModelBuilder.java

--
Thanks & regards,
Nirmal
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
I still get the same result

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0 12.0
13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0 13.0
14.0 14.0 14.0 14.0 14.0 14.0 14.0 14.0
15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0 15.0
16.0 16.0 16.0 16.0 16.0 16.0 16.0 16.0
17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0 17.0
18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0 18.0
19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0 4.0
5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0

On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando wrote:
> Can you use following code and try;
>
> List<LabeledPoint> points = labeledPoints.collect();
> for (int i = 0; i < ...; i++) {
>     System.out.print(points.get(i).label() + "\t");
> }
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
This issue occurs if I turn the response variable into a categorical variable. If I read the variable as a numerical variable, the values are read correctly.

So I presume there is a fault in the categorical conversion of the variable.

On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara wrote:
> I still get the same result
> [...]
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
This was mainly due to the detection of a numerical feature as a categorical one.

We suggested increasing the categorical threshold as a work-around. @thushan did it work?

On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara wrote:
> This issue occurs, if I turn the response variable to a categorical
> variable. If I get the variable as a numerical variable, the values are
> read correctly.
>
> So I presume there is a fault in categorical conversion of the variable.
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Hi,

On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando wrote:
> This was mainly due to the detection of a numerical feature as a
> categorical one.

Oh, it makes sense now. Why don't we try taking a sample of the data and, if the sample contains only integers (or doubles without any decimals) or strings, consider it a categorical variable?

> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?

Yes, it worked after increasing the threshold to 40.
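The thread does not spell out how the categorical threshold is applied. A minimal sketch, assuming (this is not confirmed here) that a column counts as categorical when its number of distinct values does not exceed the threshold, could look like the following. The class name, method name, the 36 distinct labels, and the threshold of 20 are all hypothetical; only the value 40 comes from the thread.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only; this is not WSO2 ML's implementation. */
final class CategoricalThresholdSketch {

    /**
     * Assumed rule: a column is categorical when it has at most
     * {@code threshold} distinct values.
     */
    static boolean looksCategorical(List<String> column, int threshold) {
        Set<String> distinct = new HashSet<>();
        for (String value : column) {
            distinct.add(value.trim());
            if (distinct.size() > threshold) {
                return false; // too many distinct values: treated as numerical
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical label column: 36 distinct class labels, 10 rows each.
        List<String> labels = new ArrayList<>();
        for (int c = 1; c <= 36; c++) {
            for (int row = 0; row < 10; row++) {
                labels.add(c + ".0");
            }
        }
        System.out.println(looksCategorical(labels, 20)); // prints "false"
        System.out.println(looksCategorical(labels, 40)); // prints "true"
    }
}
```

Under that assumed rule, raising the threshold above the number of distinct class labels is what lets the label column be picked up as categorical.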
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara wrote:
> Hi,
>
>> This was mainly due to the detection of a numerical feature as a
>> categorical one.
>
> Oh, it makes sense now. Why don't we try taking a sample of data and if
> the sample contains only integers (or doubles without any decimals) or
> strings, consider it as a categorical variable.

I tried that approach too, but there are some datasets, like the normalized-losses feature of the automobile dataset, that have integer values (0-164) but are probably not categorical.

>> We suggested increasing the categorical threshold as a work-around.
>> @thushan did it work?
>
> Yes, it worked. After increasing the threshold to 40.
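The proposed rule and the objection to it can be illustrated with a short sketch. It assumes column values arrive as strings; the class and method names are hypothetical, and the normalized-losses values below are made-up integers in the 0-164 range mentioned above, not real rows from the automobile dataset.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: the "integers or strings means categorical" rule. */
final class IntegerOrStringRuleSketch {

    /** True for whole numbers ("3", "3.0") and for non-numeric strings. */
    static boolean isIntegerOrString(String value) {
        try {
            double d = Double.parseDouble(value.trim());
            // Numeric: accept only finite values with no fractional part.
            return !Double.isInfinite(d) && d == Math.rint(d);
        } catch (NumberFormatException e) {
            return true; // not a number, so treat it as a string
        }
    }

    /** Rule under discussion: every sampled value is a whole number or a string. */
    static boolean looksCategorical(List<String> sample) {
        if (sample.isEmpty()) {
            return false;
        }
        for (String value : sample) {
            if (!isIntegerOrString(value)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> classLabels = Arrays.asList("1.0", "12.0", "3.0", "3.0");
        List<String> measurements = Arrays.asList("0.72", "1.5", "3.0");
        // Made-up integer values in the 0-164 range.
        List<String> normalizedLosses = Arrays.asList("0", "65", "103", "164");

        System.out.println(looksCategorical(classLabels));      // prints "true"
        System.out.println(looksCategorical(measurements));     // prints "false"
        System.out.println(looksCategorical(normalizedLosses)); // prints "true"
    }
}
```

The last call shows the false positive raised above: an integer-valued but continuous feature passes the type check and would be treated as categorical.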
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Hi,

Yes, no matter which approach is used, there are always going to be outliers that do not fit the defined rules. But for these corner cases, the user always has the opportunity to change the variable to numerical.

One more approach is to introduce a measure of the replication of values in a column. If a column shows the same values repeated many times, IMO that is a good indicator of a categorical variable.

On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando wrote:
> I tried that approach too, but there're some datasets like automobile
> dataset normalized-losses feature, which has integer values (0-164) but
> which is probably not categorical.
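One possible reading of the replication measure is the average number of rows per distinct value. The thread does not define the measure, so the formula below (rows divided by distinct values) and all names are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Illustrative only: one way to quantify how often values repeat. */
final class RepetitionMeasureSketch {

    /**
     * Average occurrences per distinct value: rows / distinct values.
     * 1.0 means every value is unique; larger means more repetition.
     */
    static double repetitionFactor(List<String> column) {
        if (column.isEmpty()) {
            return 0.0;
        }
        int distinct = new HashSet<String>(column).size();
        return (double) column.size() / distinct;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("1", "1", "1", "2", "2", "2");
        List<String> ids = Arrays.asList("101", "102", "103", "104");

        System.out.println(repetitionFactor(labels)); // prints "3.0"
        System.out.println(repetitionFactor(ids));    // prints "1.0"
    }
}
```

A cut-off on this factor would still be a tunable parameter, so it refines the threshold idea rather than removing the need for one.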
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Moreover, I think a hybrid approach as follows might work well.

1. Select a sample

2. Filter columns by data type to find potential categorical variables (integer / string)

3. Filter further by checking whether the same values are repeated multiple times in the dataset.

On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara wrote:
> Hi,
>
> Yes, no mater which approach used, there's always going to be outliers
> which does not fit the defined rules. But for these corner cases, user
> always have to opportunity to change the variable to numerical.
>
> One more approach is to introduce a measure of replication of values in a
> column. If the column shows a repetition of same values many times, imo, it
> is a good indicator for detecting categorical variable.
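The three steps of the hybrid approach can be sketched as below. This is a rough illustration under stated assumptions, not an implementation from WSO2 ML: the sample is simply the first rows of the column, the repetition check is rows divided by distinct values, and the names `sampleSize` and `minRepetition` are hypothetical parameters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

/** Illustrative only: sample, filter by type, then check repetition. */
final class HybridCategoricalSketch {

    /** Step 1: take the first sampleSize values (a random sample would also do). */
    static List<String> sample(List<String> column, int sampleSize) {
        return column.subList(0, Math.min(sampleSize, column.size()));
    }

    /** Step 2: whole numbers ("3", "3.0") and non-numeric strings qualify. */
    static boolean isIntegerOrString(String value) {
        try {
            double d = Double.parseDouble(value.trim());
            return !Double.isInfinite(d) && d == Math.rint(d);
        } catch (NumberFormatException e) {
            return true; // not a number, so treat it as a string
        }
    }

    /** Step 3: average number of rows per distinct value. */
    static double repetitionFactor(List<String> values) {
        int distinct = new HashSet<String>(values).size();
        return distinct == 0 ? 0.0 : (double) values.size() / distinct;
    }

    static boolean looksCategorical(List<String> column, int sampleSize,
                                    double minRepetition) {
        List<String> sampled = sample(column, sampleSize);
        if (sampled.isEmpty()) {
            return false;
        }
        for (String value : sampled) {
            if (!isIntegerOrString(value)) {
                return false; // fails the data-type filter
            }
        }
        return repetitionFactor(sampled) >= minRepetition;
    }

    public static void main(String[] args) {
        List<String> labels = Arrays.asList("1", "1", "1", "2", "2", "2", "3", "3", "3");
        List<String> ids = Arrays.asList("101", "102", "103", "104", "105", "106");
        List<String> measurements = Arrays.asList("0.5", "0.5", "0.7", "0.7");

        System.out.println(looksCategorical(labels, 100, 2.0));       // prints "true"
        System.out.println(looksCategorical(ids, 100, 2.0));          // prints "false"
        System.out.println(looksCategorical(measurements, 100, 2.0)); // prints "false"
    }
}
```

Unique integer identifiers fail at the repetition step and fractional measurements fail at the type step, which are the two kinds of column the separate rules had trouble with on their own.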
Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)
Thushan, please send your suggestions to the other thread :)

On Fri, Aug 14, 2015 at 10:22 AM, Thushan Ganegedara wrote:

> Moreover, I think a hybrid approach as follows might work well.
>
> 1. Select a sample.
> 2. Filter columns by data type to find potential categorical variables
>    (integer / string).
> 3. Filter further by checking whether the same values are repeated
>    multiple times in the dataset.
>
> On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara wrote:
>
>> Hi,
>>
>> Yes, no matter which approach is used, there are always going to be
>> outliers that do not fit the defined rules. But for these corner cases,
>> the user always has the opportunity to change the variable to numerical.
>>
>> One more approach is to introduce a measure of replication of values in
>> a column. If the column shows a repetition of the same values many
>> times, imo, it is a good indicator for detecting a categorical variable.
>>
>> On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando wrote:
>>
>>> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara wrote:

Hi,

This was mainly due to the detection of a numerical feature as a categorical one.

Oh, it makes sense now. Why don't we try taking a sample of the data and, if the sample contains only integers (or doubles without any decimals) or strings, consider it a categorical variable?

>>> I tried that approach too, but there are some datasets, like the
>>> normalized-losses feature of the automobile dataset, which has integer
>>> values (0-164) but which is probably not categorical.

We suggested increasing the categorical threshold as a work-around. @thushan did it work?

Yes, it worked after increasing the threshold to 40.

On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando wrote:

> This was mainly due to the detection of a numerical feature as a
> categorical one.
>
> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?