Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-11 Thread Nirmal Fernando
Hi Thushan,

We need more info. What exactly did you print, and where?

On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara 
wrote:

> Hi,
>
> I found the potential cause of the poor accuracy for the leaf dataset. It
> seems the data read into ML is wrong.
>
> I have attached the data file as a CSV (classes are in the last column)
>
> However, when I print out the labels of the read data (classes), it looks
> something like below. Clearly there aren't this many "3.0" classes and
> there should be classes up to 36.0.
>
> Is this caused by a bug?
>
> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-11 Thread Thushan Ganegedara
I used the following snippet

for(int i=0;i<labeledPoints.collect().size();i++){
    System.out.print(labeledPoints.collect().get(i).label() + "\t");
}

in the public MLModel build() throws MLModelBuilderException in
DeeplearningModelBuilder.java

On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando  wrote:

> Hi thushan,
>
> We need more info. What did you exactly print and where?
>
> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara 
> wrote:
>
>> Hi,
>>
>> I found the potential cause of the poor accuracy for the leaf dataset. It
>> seems the data read into ML is wrong.
>>
>> I have attached the data file as a CSV (classes are in the last column)
>>
>> However, when I print out the labels of the read data (classes), it looks
>> something like below. Clearly there aren't this many "3.0" classes and
>> there should be classes up to 36.0.
>>
>> Is this caused by a bug?
>>
>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
>> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
>> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
>> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
>> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
>> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
>> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0
>>
>> --
>> Regards,
>>
>> Thushan Ganegedara
>> School of IT
>> University of Sydney, Australia
>>
>
>
>
> --
>
> Thanks & regards,
> Nirmal
>
> Team Lead - WSO2 Machine Learner
> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
> Mobile: +94715779733
> Blog: http://nirmalfdo.blogspot.com/
>
>
>


-- 
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia


Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-11 Thread Nirmal Fernando
Can you use following code and try;

List<LabeledPoint> points = labeledPoints.collect();
for(int i=0;i<points.size();i++){
    System.out.print(points.get(i).label() + "\t");
}

On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara  wrote:

> I used the following snippet
>
> for(int i=0;i<labeledPoints.collect().size();i++){
>     System.out.print(labeledPoints.collect().get(i).label() + "\t");
> }
>
> in the public MLModel build() throws MLModelBuilderException in
> DeeplearningModelBuilder.java
>
>
> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando  wrote:
>
>> Hi thushan,
>>
>> We need more info. What did you exactly print and where?
>>
>> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara 
>> wrote:
>>
>>> Hi,
>>>
>>> I found the potential cause of the poor accuracy for the leaf dataset.
>>> It seems the data read into ML is wrong.
>>>
>>> I have attached the data file as a CSV (classes are in the last column)
>>>
>>> However, when I print out the labels of the read data (classes), it
>>> looks something like below. Clearly there aren't this many "3.0" classes
>>> and there should be classes up to 36.0.
>>>
>>> Is this caused by a bug?
>>>
>>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>>> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
>>> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
>>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
>>> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
>>> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
>>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
>>> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
>>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
>>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>>> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
>>> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
>>> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
>>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>>> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>>> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
>>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0
>>>
>>> --
>>> Regards,
>>>
>>> Thushan Ganegedara
>>> School of IT
>>> University of Sydney, Australia
>>>
>>
>>
>>
>> --
>>
>> Thanks & regards,
>> Nirmal
>>
>> Team Lead - WSO2 Machine Learner
>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>> Mobile: +94715779733
>> Blog: http://nirmalfdo.blogspot.com/
>>
>>
>>
>
>
> --
> Regards,
>
> Thushan Ganegedara
> School of IT
> University of Sydney, Australia
>



-- 

Thanks & regards,
Nirmal

Team Lead - WSO2 Machine Learner
Associate Technical Lead - Data Technologies Team, WSO2 Inc.
Mobile: +94715779733
Blog: http://nirmalfdo.blogspot.com/


Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-11 Thread Thushan Ganegedara
I still get the same result

1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
3.0 3.0 3.0 3.0
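
A per-class frequency count is easier to scan than the raw label stream above.
The following is only a sketch, assuming the points list (List<LabeledPoint>,
Spark MLlib's org.apache.spark.mllib.regression.LabeledPoint) collected in the
snippet above, with java.util.Map/TreeMap imported:

// Sketch only: count how many rows fall under each label value instead of
// printing every label; a wrongly encoded class column shows up immediately.
Map<Double, Integer> labelCounts = new TreeMap<Double, Integer>();
for (LabeledPoint p : points) {
    Integer count = labelCounts.get(p.label());
    labelCounts.put(p.label(), count == null ? 1 : count + 1);
}
for (Map.Entry<Double, Integer> entry : labelCounts.entrySet()) {
    System.out.println(entry.getKey() + " -> " + entry.getValue());
}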

On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando  wrote:

> Can you use following code and try;
>
> List<LabeledPoint> points = labeledPoints.collect();
> for(int i=0;i<points.size();i++){
>     System.out.print(points.get(i).label() + "\t");
> }
>
> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara 
> wrote:
>
>> I used the following snippet
>>
>> for(int i=0;i<labeledPoints.collect().size();i++){
>>     System.out.print(labeledPoints.collect().get(i).label() + "\t");
>> }
>>
>> in the public MLModel build() throws MLModelBuilderException in
>> DeeplearningModelBuilder.java
>>
>>
>> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando  wrote:
>>
>>> Hi thushan,
>>>
>>> We need more info. What did you exactly print and where?
>>>
>>> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara 
>>> wrote:
>>>
 Hi,

 I found the potential cause of the poor accuracy for the leaf dataset.
 It seems the data read into ML is wrong.

 I have attached the data file as a CSV (classes are in the last column)

 However, when I print out the labels of the read data (classes), it
 looks something like below. Clearly there aren't this many "3.0" classes
 and there should be classes up to 36.0.

 Is this caused by a bug?

 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
 1.0 1.0 1.0 12.012.012.012.012.012.0
 12.012.012.012.013.013.013.013.013.0
 13.0
 13.013.013.013.014.014.014.014.014.0
 14.014.014.015.015.015.015.015.015.0
 15.015.015.015.015.015.016.016.016.0
 16.0
 16.016.016.016.017.017.017.017.017.0
 17.017.017.017.017.018.018.018.018.0
 18.018.018.018.018.018.018.019.019.0
 19.0
 19.019.019.019.019.019.019.019.019.0
 19.019.02.0 2.0 2.0 2.0 2.0 2.0 2.0
 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-11 Thread Thushan Ganegedara
This issue occurs if I turn the response variable into a categorical variable.
If I read it as a numerical variable, the values are read correctly.

So I presume there is a fault in the categorical conversion of the variable.

On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
wrote:

> I still get the same result
>
> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0
>
> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando  wrote:
>
>> Can you use following code and try;
>>
>> List<LabeledPoint> points = labeledPoints.collect();
>> for(int i=0;i<points.size();i++){
>>     System.out.print(points.get(i).label() + "\t");
>> }
>>
>> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara 
>> wrote:
>>
>>> I used the following snippet
>>>
>>> for(int i=0;i<labeledPoints.collect().size();i++){
>>>     System.out.print(labeledPoints.collect().get(i).label() + "\t");
>>> }
>>>
>>> in the public MLModel build() throws MLModelBuilderException in
>>> DeeplearningModelBuilder.java
>>>
>>>
>>> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando 
>>> wrote:
>>>
 Hi thushan,

 We need more info. What did you exactly print and where?

 On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara 
 wrote:

> Hi,
>
> I found the potential cause of the poor accuracy for the leaf dataset.
> It seems the data read into ML is wrong.
>
> I have attached the data file as a CSV (classes are in the last column)
>
> However, when I print out the labels of the read data (classes), it
> looks something like below. Clearly there aren't this many "3.0" classes
> and there should be classes up to 36.0.
>
> Is this caused by a bug?
>
> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
> 1.0 1.0 1.0 1.0 12.012.012.012.012.0
> 12.012.012.012.012.013.013.013.013.0
> 13.013.0
> 13.013.013.013.014.014.014.014.0
> 14.014.014.014.015.015.015.015.015.0
> 15.015.015.015.015.015.015.016.016.0
> 16.016.0
> 16.016.016.016.017.017.017.017.0
> 17.017.017.017.017.017.018.018.018.0
> 18.018.018.018.018.018.018.018.019.0
> 19.019.0
> 19.019.0

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Nirmal Fernando
This was mainly due to the detection of a numerical feature as a
categorical one.

We suggested increasing the categorical threshold as a work-around.
@thushan did it work?
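
For illustration only (hypothetical helper names, not the actual WSO2 ML
implementation): a categorical threshold of this kind is usually a cap on how
many distinct values a sampled column may have before it is no longer treated
as categorical. A minimal sketch of such a check:

// Sketch only, with hypothetical names (not the actual WSO2 ML code):
// treat a column as categorical when its sample has few enough distinct values.
// Assumes java.util.HashSet, java.util.List and java.util.Set are imported.
public static boolean looksCategorical(List<String> columnSample, int categoricalThreshold) {
    Set<String> distinct = new HashSet<String>(columnSample);
    return distinct.size() <= categoricalThreshold;
}

Under a rule like this, a 36-class response column needs the threshold to be at
least 36, which would be consistent with raising it to 40 as discussed later in
the thread.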

On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
wrote:

> This issue occurs, if I turn the response variable to a categorical
> variable. If I get the variable as a numerical variable, the values are
> read correctly.
>
> So I presume there is a fault in categorical conversion of the variable.
>
> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
> wrote:
>
>> I still get the same result
>>
>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
>> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
>> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
>> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
>> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
>> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
>> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0
>>
>> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando  wrote:
>>
>>> Can you use following code and try;
>>>
>>> List points = labeledPoints.collect();
>>> for(int i=0;i>>  System.out.print(points.get(i).label() + "\t");
>>> }
>>>
>>> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara 
>>> wrote:
>>>
 I used the following snippet

 for(int i=0;i>>> System.out.print(labeledPoints.collect().get(i).label() +
 "\t");
 }

 in the public MLModel build() throws MLModelBuilderException in
 DeeplearningModelBuilder.java


 On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando 
 wrote:

> Hi thushan,
>
> We need more info. What did you exactly print and where?
>
> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara  > wrote:
>
>> Hi,
>>
>> I found the potential cause of the poor accuracy for the leaf
>> dataset. It seems the data read into ML is wrong.
>>
>> I have attached the data file as a CSV (classes are in the last
>> column)
>>
>> However, when I print out the labels of the read data (classes), it
>> looks something like below. Clearly there aren't this many "3.0" classes
>> and there should be classes up to 36.0.
>>
>> Is this caused by a bug?
>>
>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>> 1.0 1.0 1.0 1.0 12.012.012.012.012.0
>> 12.012.012.012.012.013.013.013.013.0
>> 13.013.0
>> 13.013.013.013.014.014.014.014.0
>> 14.014.014.014.015.015.015.015.015.0
>>>

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Thushan Ganegedara
Hi,

> This was mainly due to the detection of a numerical feature as a
> categorical one.

Oh, it makes sense now. Why don't we try taking a sample of the data, and if
the sample contains only integers (or doubles without any decimals) or
strings, consider it a categorical variable?

> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?

Yes, it worked, after increasing the threshold to 40.
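
A rough sketch of the sampling idea above (hypothetical helper, not existing ML
code, assuming java.util.List is imported): a column whose sampled values are
all whole numbers or non-numeric strings becomes a categorical candidate.

// Sketch only (hypothetical helper): flag a column as a categorical candidate
// when every sampled value is either a whole number or a non-numeric string.
public static boolean allIntegersOrStrings(List<String> columnSample) {
    for (String value : columnSample) {
        try {
            double d = Double.parseDouble(value);
            if (d != Math.floor(d)) {
                return false; // fractional value -> treat the column as numerical
            }
        } catch (NumberFormatException e) {
            // not a number at all -> counts as a string value, still a candidate
        }
    }
    return true;
}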

On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando  wrote:

> This was mainly due to the detection of a numerical feature as a
> categorical one.
>
> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?
>
> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
> wrote:
>
>> This issue occurs, if I turn the response variable to a categorical
>> variable. If I get the variable as a numerical variable, the values are
>> read correctly.
>>
>> So I presume there is a fault in categorical conversion of the variable.
>>
>> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
>> wrote:
>>
>>> I still get the same result
>>>
>>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>>> 1.0 1.0 1.0 12.0 12.0 12.0 12.0 12.0 12.0
>>> 12.0 12.0 12.0 12.0 13.0 13.0 13.0 13.0 13.0 13.0
>>> 13.0 13.0 13.0 13.0 14.0 14.0 14.0 14.0 14.0
>>> 14.0 14.0 14.0 15.0 15.0 15.0 15.0 15.0 15.0
>>> 15.0 15.0 15.0 15.0 15.0 15.0 16.0 16.0 16.0 16.0
>>> 16.0 16.0 16.0 16.0 17.0 17.0 17.0 17.0 17.0
>>> 17.0 17.0 17.0 17.0 17.0 18.0 18.0 18.0 18.0
>>> 18.0 18.0 18.0 18.0 18.0 18.0 18.0 19.0 19.0 19.0
>>> 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0 19.0
>>> 19.0 19.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0
>>> 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
>>> 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
>>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>>> 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>>> 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
>>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0
>>>
>>> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando 
>>> wrote:
>>>
 Can you use following code and try;

 List points = labeledPoints.collect();
 for(int i=0;i>>>  System.out.print(points.get(i).label() + "\t");
 }

 On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara 
 wrote:

> I used the following snippet
>
> for(int i=0;i System.out.print(labeledPoints.collect().get(i).label() +
> "\t");
> }
>
> in the public MLModel build() throws MLModelBuilderException in
> DeeplearningModelBuilder.java
>
>
> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando 
> wrote:
>
>> Hi thushan,
>>
>> We need more info. What did you exactly print and where?
>>
>> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara <
>> thu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I found the potential cause of the poor accuracy for the leaf
>>> dataset. It seems the data read into ML is wrong.
>>>
>>> I have attached the data file as a CSV (classes are in the last
>>> column)
>>>
>>> However, when I 

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Nirmal Fernando
On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara 
wrote:

> Hi,
>
> This was mainly due to the detection of a numerical feature as a
> categorical one.
> Oh, it makes sense now. Why don't we try taking a sample of data and if
> the sample contains only integers (or doubles without any decimals) or
> strings, consider it as a categorical variable.
>

I tried that approach too, but there are datasets like the automobile dataset,
whose normalized-losses feature has integer values (0-164) but is probably not
categorical.

>
> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?
> Yes, it worked. After increasing the threshold to 40.
>
> On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando  wrote:
>
>> This was mainly due to the detection of a numerical feature as a
>> categorical one.
>>
>> We suggested increasing the categorical threshold as a work-around.
>> @thushan did it work?
>>
>> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
>> wrote:
>>
>>> This issue occurs, if I turn the response variable to a categorical
>>> variable. If I get the variable as a numerical variable, the values are
>>> read correctly.
>>>
>>> So I presume there is a fault in categorical conversion of the variable.
>>>
>>> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
>>> wrote:
>>>
 I still get the same result

 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
 1.0 1.0 1.0 12.012.012.012.012.012.0
 12.012.012.012.013.013.013.013.013.0
 13.0
 13.013.013.013.014.014.014.014.014.0
 14.014.014.015.015.015.015.015.015.0
 15.015.015.015.015.015.016.016.016.0
 16.0
 16.016.016.016.017.017.017.017.017.0
 17.017.017.017.017.018.018.018.018.0
 18.018.018.018.018.018.018.019.019.0
 19.0
 19.019.019.019.019.019.019.019.019.0
 19.019.02.0 2.0 2.0 2.0 2.0 2.0 2.0
 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0 4.0
 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0 5.0 5.0
 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0 7.0 7.0
 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
 3.0 3.0 3.0 3.0

 On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando 
 wrote:

> Can you use following code and try;
>
> List points = labeledPoints.collect();
> for(int i=0;i  System.out.print(points.get(i).label() + "\t");
> }
>
> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara 
> wrote:
>
>> I used the following snippet
>>
>> for(int i=0;i> System.out.print(labeledPoints.collect().get(i).label()
>> + "\t");
>> }
>>
>> in the public MLModel build() throws MLModelBuilderException in
>> DeeplearningModelBuilder.java
>>
>>
>> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando 
>> wrote:
>>
>>> Hi thushan,
>>>
>>> We need more info. What did you exactly print and where?
>>>
>>>

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Thushan Ganegedara
Hi,

Yes, no matter which approach is used, there will always be outliers that do
not fit the defined rules. But for these corner cases, the user always has the
opportunity to change the variable back to numerical.

One more approach is to introduce a measure of how often values repeat in a
column. If the column shows the same values repeated many times, IMO that is
a good indicator of a categorical variable.
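
A minimal sketch of such a repetition measure (hypothetical helper name,
assuming java.util imports): the fraction of sampled rows that repeat an
already-seen value; a high ratio points towards a categorical column.

// Sketch only (hypothetical helper): fraction of sampled rows whose value has
// already appeared earlier in the sample.
public static double repetitionRatio(List<String> columnSample) {
    if (columnSample.isEmpty()) {
        return 0.0;
    }
    Set<String> distinct = new HashSet<String>(columnSample);
    return 1.0 - ((double) distinct.size() / columnSample.size());
}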

On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando  wrote:

>
>
> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara 
> wrote:
>
>> Hi,
>>
>> This was mainly due to the detection of a numerical feature as a
>> categorical one.
>> Oh, it makes sense now. Why don't we try taking a sample of data and if
>> the sample contains only integers (or doubles without any decimals) or
>> strings, consider it as a categorical variable.
>>
>
> I tried that approach too, but there're some datasets like automobile
> dataset normalized-losses feature, which has integer values (0-164) but
> which is probably not categorical.
>
>>
>> We suggested increasing the categorical threshold as a work-around.
>> @thushan did it work?
>> Yes, it worked. After increasing the threshold to 40.
>>
>> On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando  wrote:
>>
>>> This was mainly due to the detection of a numerical feature as a
>>> categorical one.
>>>
>>> We suggested increasing the categorical threshold as a work-around.
>>> @thushan did it work?
>>>
>>> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
>>> wrote:
>>>
 This issue occurs, if I turn the response variable to a categorical
 variable. If I get the variable as a numerical variable, the values are
 read correctly.

 So I presume there is a fault in categorical conversion of the variable.

 On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
 wrote:

> I still get the same result
>
> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
> 1.0 1.0 1.0 1.0 12.012.012.012.012.0
> 12.012.012.012.012.013.013.013.013.0
> 13.013.0
> 13.013.013.013.014.014.014.014.0
> 14.014.014.014.015.015.015.015.015.0
> 15.015.015.015.015.015.015.016.016.0
> 16.016.0
> 16.016.016.016.017.017.017.017.0
> 17.017.017.017.017.017.018.018.018.0
> 18.018.018.018.018.018.018.018.019.0
> 19.019.0
> 19.019.019.019.019.019.019.019.0
> 19.019.019.02.0 2.0 2.0 2.0 2.0 2.0
> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0
> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0
> 5.0 5.0
> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
> 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
> 6.0 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0
> 7.0 7.0
> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
> 3.0 3.0
> 3.0 3.0 3.0 3.0
>
> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando 
> wrote:
>
>> Can you use following code and try;
>>
>> List points = labeledPoints.collect();
>> for(int i=0;i>  System.out.print(points.get(i).label() + "\t");
>

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Thushan Ganegedara
Moreover, I think a hybrid approach as follows might work well.

1. Select a sample

2. Filter columns by the data type and find potential categorical variables
(integer / string)

3. Filter further by checking whether the same values are repeated multiple
times in the dataset (a rough sketch of these steps follows below).
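
A rough sketch of the three steps above, reusing the two hypothetical helpers
sketched earlier in the thread; the sample size and the 0.5 repetition cut-off
are illustrative only:

// Sketch only: combine the sampling, type and repetition checks.
// allIntegersOrStrings() and repetitionRatio() are the hypothetical helpers
// sketched earlier; columnValues is the full column, sampleSize e.g. 1000.
public static boolean isLikelyCategorical(List<String> columnValues, int sampleSize) {
    List<String> sample = columnValues.subList(0, Math.min(sampleSize, columnValues.size())); // 1. sample
    return allIntegersOrStrings(sample)          // 2. only integers / strings
            && repetitionRatio(sample) > 0.5;    // 3. values repeat often
}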

On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara 
wrote:

> Hi,
>
> Yes, no mater which approach used, there's always going to be outliers
> which does not fit the defined rules. But for these corner cases, user
> always have to opportunity to change the variable to numerical.
>
> One more approach is to introduce a measure of replication of values in a
> column. If the column shows a repetition of same values many times, imo, it
> is a good indicator for detecting categorical variable.
>
> On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando  wrote:
>
>>
>>
>> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara 
>> wrote:
>>
>>> Hi,
>>>
>>> This was mainly due to the detection of a numerical feature as a
>>> categorical one.
>>> Oh, it makes sense now. Why don't we try taking a sample of data and if
>>> the sample contains only integers (or doubles without any decimals) or
>>> strings, consider it as a categorical variable.
>>>
>>
>> I tried that approach too, but there're some datasets like automobile
>> dataset normalized-losses feature, which has integer values (0-164) but
>> which is probably not categorical.
>>
>>>
>>> We suggested increasing the categorical threshold as a work-around.
>>> @thushan did it work?
>>> Yes, it worked. After increasing the threshold to 40.
>>>
>>> On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando 
>>> wrote:
>>>
 This was mainly due to the detection of a numerical feature as a
 categorical one.

 We suggested increasing the categorical threshold as a work-around.
 @thushan did it work?

 On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
 wrote:

> This issue occurs, if I turn the response variable to a categorical
> variable. If I get the variable as a numerical variable, the values are
> read correctly.
>
> So I presume there is a fault in categorical conversion of the
> variable.
>
> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara 
> wrote:
>
>> I still get the same result
>>
>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>> 1.0 1.0 1.0 1.0 12.012.012.012.012.0
>> 12.012.012.012.012.013.013.013.013.0
>> 13.013.0
>> 13.013.013.013.014.014.014.014.0
>> 14.014.014.014.015.015.015.015.015.0
>> 15.015.015.015.015.015.015.016.016.0
>> 16.016.0
>> 16.016.016.016.017.017.017.017.0
>> 17.017.017.017.017.017.018.018.018.0
>> 18.018.018.018.018.018.018.018.019.0
>> 19.019.0
>> 19.019.019.019.019.019.019.019.0
>> 19.019.019.02.0 2.0 2.0 2.0 2.0 2.0
>> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0
>> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0
>> 5.0 5.0
>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>> 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>> 6.0 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0
>> 7.0 7.0
>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.0
>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>> 3.0 3.

Re: [Dev] [ML] Issue while loading the leaf dataset (misreading classes)

2015-08-13 Thread Nirmal Fernando
Thushan, please send your suggestions to the other thread :)

On Fri, Aug 14, 2015 at 10:22 AM, Thushan Ganegedara 
wrote:

> Moreover, I think a hybrid approach as follows might work well.
>
> 1. Select a sample
>
> 2. Filter columns by the data type and find potential categorical
> variables (integer / string)
>
> 3. Filter further by checking if same values are repeated multiple times
> in the dataset.
>
> On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara 
> wrote:
>
>> Hi,
>>
>> Yes, no mater which approach used, there's always going to be outliers
>> which does not fit the defined rules. But for these corner cases, user
>> always have to opportunity to change the variable to numerical.
>>
>> One more approach is to introduce a measure of replication of values in a
>> column. If the column shows a repetition of same values many times, imo, it
>> is a good indicator for detecting categorical variable.
>>
>> On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando  wrote:
>>
>>>
>>>
>>> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara 
>>> wrote:
>>>
 Hi,

 This was mainly due to the detection of a numerical feature as a
 categorical one.
 Oh, it makes sense now. Why don't we try taking a sample of data and if
 the sample contains only integers (or doubles without any decimals) or
 strings, consider it as a categorical variable.

>>>
>>> I tried that approach too, but there're some datasets like automobile
>>> dataset normalized-losses feature, which has integer values (0-164) but
>>> which is probably not categorical.
>>>

 We suggested increasing the categorical threshold as a work-around.
 @thushan did it work?
 Yes, it worked. After increasing the threshold to 40.

 On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando 
 wrote:

> This was mainly due to the detection of a numerical feature as a
> categorical one.
>
> We suggested increasing the categorical threshold as a work-around.
> @thushan did it work?
>
> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara 
> wrote:
>
>> This issue occurs, if I turn the response variable to a categorical
>> variable. If I get the variable as a numerical variable, the values are
>> read correctly.
>>
>> So I presume there is a fault in categorical conversion of the
>> variable.
>>
>> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara > > wrote:
>>
>>> I still get the same result
>>>
>>> 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
>>> 1.0 1.0 1.0 1.0 12.012.012.012.012.0
>>> 12.012.012.012.012.013.013.013.013.0
>>> 13.013.0
>>> 13.013.013.013.014.014.014.014.0
>>> 14.014.014.014.015.015.015.015.015.0
>>> 15.015.015.015.015.015.015.016.016.0
>>> 16.016.0
>>> 16.016.016.016.017.017.017.017.0
>>> 17.017.017.017.017.017.018.018.018.0
>>> 18.018.018.018.018.018.018.018.019.0
>>> 19.019.0
>>> 19.019.019.019.019.019.019.019.0
>>> 19.019.019.02.0 2.0 2.0 2.0 2.0 2.0
>>> 2.0 2.0 2.0 2.0 2.0 2.0 2.0 3.0 3.0
>>> 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 4.0 4.0 4.0 4.0 4.0
>>> 4.0 4.0 4.0 4.0 4.0 4.0 4.0 5.0 5.0
>>> 5.0 5.0
>>> 5.0 5.0 5.0 5.0 5.0 5.0 5.0 5.0
>>> 5.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0 6.0
>>> 6.0 6.0 6.0 6.0 7.0 7.0 7.0 7.0 7.0
>>> 7.0 7.0
>>> 7.0 7.0 7.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>> 3.0 3.0
>>> 3.0 3.0 3.0 3.0 3.0 3.0 3.0 3.0
>>>