Moreover, I think the following hybrid approach might work well:

1. Select a sample.

2. Filter columns by data type to find potential categorical variables
(integer / string).

3. Filter further by checking whether the same values are repeated multiple
times in the dataset.
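
A minimal Java sketch of the three steps above, for a single column held as
strings. The class and method names, the "first N values" sampling, and the
maxDistinctRatio cut-off are all assumptions made for illustration; none of
this is existing WSO2 ML code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of WSO2 ML: sketches the three steps above.
class CategoricalDetector {

    // Step 2: a value is a candidate if it is a string or a whole number
    // (an integer, or a double without a fractional part such as "3.0").
    static boolean isCandidateValue(String raw) {
        try {
            double d = Double.parseDouble(raw.trim());
            return !Double.isInfinite(d) && d == Math.rint(d);
        } catch (NumberFormatException e) {
            return true; // not numeric, so treat it as a string
        }
    }

    // Steps 1-3 for one column. The sample is simply the first sampleSize
    // values (assumption: a real implementation would sample randomly).
    // maxDistinctRatio is an assumed tuning knob: the column counts as
    // repetitive when distinct / sampled <= maxDistinctRatio.
    static boolean isLikelyCategorical(List<String> column, int sampleSize,
                                       double maxDistinctRatio) {
        List<String> sample = column.subList(0, Math.min(sampleSize, column.size()));
        if (sample.isEmpty()) {
            return false;
        }
        Set<String> distinct = new HashSet<>();
        for (String value : sample) {
            if (!isCandidateValue(value)) {
                return false; // a fractional number rules the column out
            }
            distinct.add(value.trim());
        }
        return (double) distinct.size() / sample.size() <= maxDistinctRatio;
    }

    public static void main(String[] args) {
        List<String> classes = new ArrayList<>(); // few values, heavily repeated
        List<String> counts = new ArrayList<>();  // integers that never repeat
        for (int i = 0; i < 200; i++) {
            classes.add(String.valueOf(i % 5));
            counts.add(String.valueOf(i));
        }
        System.out.println(isLikelyCategorical(classes, 100, 0.2)); // true
        System.out.println(isLikelyCategorical(counts, 100, 0.2));  // false
    }
}
```

The ratio check is what separates an integer column that is really numerical
from one that holds class labels; the right cut-off would have to be tuned
against real datasets, and the user can still override the result.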

On Fri, Aug 14, 2015 at 2:48 PM, Thushan Ganegedara <thu...@gmail.com>
wrote:

> Hi,
>
> Yes, no matter which approach is used, there are always going to be
> outliers that do not fit the defined rules. But for these corner cases, the
> user always has the opportunity to change the variable to numerical.
>
> One more approach is to introduce a measure of how often values are
> replicated in a column. If a column repeats the same values many times,
> IMO that is a good indicator of a categorical variable.
>
> On Fri, Aug 14, 2015 at 2:41 PM, Nirmal Fernando <nir...@wso2.com> wrote:
>
>>
>>
>> On Fri, Aug 14, 2015 at 10:01 AM, Thushan Ganegedara <thu...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> > This was mainly due to the detection of a numerical feature as a
>>> > categorical one.
>>>
>>> Oh, it makes sense now. Why don't we try taking a sample of the data and,
>>> if the sample contains only integers (or doubles without any decimals) or
>>> strings, consider it a categorical variable?
>>>
>>
>> I tried that approach too, but there are some datasets, like the
>> automobile dataset with its normalized-losses feature, which has integer
>> values (0-164) but is probably not categorical.
>>
>>>
>>> > We suggested increasing the categorical threshold as a work-around.
>>> > @thushan did it work?
>>>
>>> Yes, it worked after increasing the threshold to 40.
>>>
>>> On Fri, Aug 14, 2015 at 2:21 PM, Nirmal Fernando <nir...@wso2.com>
>>> wrote:
>>>
>>>> This was mainly due to the detection of a numerical feature as a
>>>> categorical one.
>>>>
>>>> We suggested increasing the categorical threshold as a work-around.
>>>> @thushan did it work?
>>>>
>>>> On Tue, Aug 11, 2015 at 5:50 PM, Thushan Ganegedara <thu...@gmail.com>
>>>> wrote:
>>>>
>>>>> This issue occurs if I convert the response variable to a categorical
>>>>> variable. If I treat the variable as numerical, the values are read
>>>>> correctly.
>>>>>
>>>>> So I presume there is a fault in the categorical conversion of the
>>>>> variable.
>>>>>
>>>>> On Tue, Aug 11, 2015 at 7:11 PM, Thushan Ganegedara <thu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I still get the same result
>>>>>>
>>>>>> 1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0
>>>>>> 1.0     1.0     1.0     1.0     12.0    12.0    12.0    12.0    12.0
>>>>>> 12.0    12.0    12.0    12.0    12.0    13.0    13.0    13.0    13.0
>>>>>> 13.0    13.0
>>>>>> 13.0    13.0    13.0    13.0    14.0    14.0    14.0    14.0
>>>>>> 14.0    14.0    14.0    14.0    15.0    15.0    15.0    15.0    15.0
>>>>>> 15.0    15.0    15.0    15.0    15.0    15.0    15.0    16.0    16.0
>>>>>> 16.0    16.0
>>>>>> 16.0    16.0    16.0    16.0    17.0    17.0    17.0    17.0
>>>>>> 17.0    17.0    17.0    17.0    17.0    17.0    18.0    18.0    18.0
>>>>>> 18.0    18.0    18.0    18.0    18.0    18.0    18.0    18.0    19.0
>>>>>> 19.0    19.0
>>>>>> 19.0    19.0    19.0    19.0    19.0    19.0    19.0    19.0
>>>>>> 19.0    19.0    19.0    2.0     2.0     2.0     2.0     2.0     2.0
>>>>>> 2.0     2.0     2.0     2.0     2.0     2.0     2.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     4.0     4.0     4.0     4.0     4.0
>>>>>> 4.0     4.0     4.0     4.0     4.0     4.0     4.0     5.0     5.0
>>>>>> 5.0     5.0
>>>>>> 5.0     5.0     5.0     5.0     5.0     5.0     5.0     5.0
>>>>>> 5.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0
>>>>>> 6.0     6.0     6.0     6.0     7.0     7.0     7.0     7.0     7.0
>>>>>> 7.0     7.0
>>>>>> 7.0     7.0     7.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>> 3.0     3.0
>>>>>> 3.0     3.0     3.0     3.0
>>>>>>
>>>>>> On Tue, Aug 11, 2015 at 7:05 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Can you try the following code?
>>>>>>>
>>>>>>> // Collect the RDD once, then print every label.
>>>>>>> List<LabeledPoint> points = labeledPoints.collect();
>>>>>>> for (LabeledPoint point : points) {
>>>>>>>     System.out.print(point.label() + "\t");
>>>>>>> }
>>>>>>>
>>>>>>> On Tue, Aug 11, 2015 at 2:30 PM, Thushan Ganegedara <
>>>>>>> thu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> I used the following snippet
>>>>>>>>
>>>>>>>> for (int i = 0; i < labeledPoints.collect().size(); i++) {
>>>>>>>>     System.out.print(labeledPoints.collect().get(i).label() + "\t");
>>>>>>>> }
>>>>>>>>
>>>>>>>> inside the public MLModel build() throws MLModelBuilderException
>>>>>>>> method in DeeplearningModelBuilder.java
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Aug 11, 2015 at 6:17 PM, Nirmal Fernando <nir...@wso2.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi thushan,
>>>>>>>>>
>>>>>>>>> We need more info. What exactly did you print, and where?
>>>>>>>>>
>>>>>>>>> On Tue, Aug 11, 2015 at 12:47 PM, Thushan Ganegedara <
>>>>>>>>> thu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> I found the potential cause of the poor accuracy for the leaf
>>>>>>>>>> dataset. It seems the data read into ML is wrong.
>>>>>>>>>>
>>>>>>>>>> I have attached the data file as a CSV (classes are in the last
>>>>>>>>>> column)
>>>>>>>>>>
>>>>>>>>>> However, when I print out the labels of the read data (classes),
>>>>>>>>>> it looks something like below. Clearly there aren't this many
>>>>>>>>>> "3.0" classes, and there should be classes up to 36.0.
>>>>>>>>>>
>>>>>>>>>> Is this caused by a bug?
>>>>>>>>>>
>>>>>>>>>> 1.0     1.0     1.0     1.0     1.0     1.0     1.0     1.0
>>>>>>>>>> 1.0     1.0     1.0     1.0     12.0    12.0    12.0    12.0    12.0
>>>>>>>>>> 12.0    12.0    12.0    12.0    12.0    13.0    13.0    13.0    13.0
>>>>>>>>>> 13.0    13.0
>>>>>>>>>> 13.0    13.0    13.0    13.0    14.0    14.0    14.0    14.0
>>>>>>>>>> 14.0    14.0    14.0    14.0    15.0    15.0    15.0    15.0    15.0
>>>>>>>>>> 15.0    15.0    15.0    15.0    15.0    15.0    15.0    16.0    16.0
>>>>>>>>>> 16.0    16.0
>>>>>>>>>> 16.0    16.0    16.0    16.0    17.0    17.0    17.0    17.0
>>>>>>>>>> 17.0    17.0    17.0    17.0    17.0    17.0    18.0    18.0    18.0
>>>>>>>>>> 18.0    18.0    18.0    18.0    18.0    18.0    18.0    18.0    19.0
>>>>>>>>>> 19.0    19.0
>>>>>>>>>> 19.0    19.0    19.0    19.0    19.0    19.0    19.0    19.0
>>>>>>>>>> 19.0    19.0    19.0    2.0     2.0     2.0     2.0     2.0     2.0
>>>>>>>>>> 2.0     2.0     2.0     2.0     2.0     2.0     2.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     4.0     4.0     4.0     4.0     4.0
>>>>>>>>>> 4.0     4.0     4.0     4.0     4.0     4.0     4.0     5.0     5.0
>>>>>>>>>> 5.0     5.0
>>>>>>>>>> 5.0     5.0     5.0     5.0     5.0     5.0     5.0     5.0
>>>>>>>>>> 5.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0     6.0
>>>>>>>>>> 6.0     6.0     6.0     6.0     7.0     7.0     7.0     7.0     7.0
>>>>>>>>>> 7.0     7.0
>>>>>>>>>> 7.0     7.0     7.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0     3.0
>>>>>>>>>> 3.0     3.0
>>>>>>>>>> 3.0     3.0     3.0     3.0
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> Thushan Ganegedara
>>>>>>>>>> School of IT
>>>>>>>>>> University of Sydney, Australia
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Thanks & regards,
>>>>>>>>> Nirmal
>>>>>>>>>
>>>>>>>>> Team Lead - WSO2 Machine Learner
>>>>>>>>> Associate Technical Lead - Data Technologies Team, WSO2 Inc.
>>>>>>>>> Mobile: +94715779733
>>>>>>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>



-- 
Regards,

Thushan Ganegedara
School of IT
University of Sydney, Australia
_______________________________________________
Dev mailing list
Dev@wso2.org
http://wso2.org/cgi-bin/mailman/listinfo/dev
