[
https://issues.apache.org/jira/browse/MAHOUT-155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125194#comment-13125194
]
Joe Prasanna Kumar commented on MAHOUT-155:
-------------------------------------------
+*Problem:*+
Nominal attributes in ARFF format are not completely converted to vector
format. When a nominal value is mapped to 0, it is not reflected in the
vector. For example, consider the bank.ARFF file below, taken from the WEKA
site:
{color:blue}
@relation bank
@attribute age numeric
@attribute sex {MALE,FEMALE}
@attribute region {INNER_CITY,RURAL,TOWN,SUBURBAN}
@attribute income numeric
@attribute married {YES,NO}
@attribute children {YES,NO}
@attribute car {YES,NO}
@attribute mortgage {YES,NO}
@attribute pep {YES,NO}
@data
40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO
{color}
The nominal mappings for the above ARFF file are
{sex={FEMALE=1, MALE=0}, region={INNER_CITY=0, TOWN=2, RURAL=1, SUBURBAN=3} ,
children={YES=0, NO=1}, married={YES=0, NO=1},car={YES=0, NO=1},
mortgage={YES=0, NO=1}, pep={YES=0, NO=1}}
When this ARFF file is converted to vector format, the output is
{color:red}{0:40.0,2:2.0,3:30085.1,8:1.0}{color}
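Because the married attribute maps YES to 0 and NO to 1, this attribute
(attribute #5, vector index 4) doesn't show up in the vector. The same holds
for sex, children, car and mortgage (indices 1, 5, 6 and 7), whose values in
this row are also mapped to 0.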
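To make the symptom concrete, here is a minimal, self-contained sketch in plain Java. It does not use the Mahout classes; ZeroLabelDemo, bankRowZeroBased, renderNonZero and droppedIndices are hypothetical names made up for illustration, and the assumption is simply that any entry equal to 0 is left out of the index:value output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

public class ZeroLabelDemo {

    // The row 40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO encoded with the
    // 0-based nominal mappings listed above (MALE=0, TOWN=2, YES=0, NO=1).
    static double[] bankRowZeroBased() {
        return new double[] {40.0, 0.0, 2.0, 30085.1, 0.0, 0.0, 0.0, 0.0, 1.0};
    }

    // Renders index:value pairs and skips every entry equal to 0.
    // Assumption: this mirrors how the vector in this comment was printed.
    static String renderNonZero(double[] values) {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                out.add(i + ":" + values[i]);
            }
        }
        return out.toString();
    }

    // Indices whose value is 0 and which therefore vanish from the rendering.
    static List<Integer> droppedIndices(double[] values) {
        List<Integer> dropped = new ArrayList<>();
        for (int i = 0; i < values.length; i++) {
            if (values[i] == 0.0) {
                dropped.add(i);
            }
        }
        return dropped;
    }

    public static void main(String[] args) {
        double[] row = bankRowZeroBased();
        System.out.println(renderNonZero(row));
        System.out.println(droppedIndices(row));
    }
}
```

In this sketch a value of 0 cannot be told apart from a missing attribute once the row has been rendered, which is the ambiguity described above.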
+*Issue:*+
When a nominal attribute in ARFF format is converted to vector format,
ARFFIterator creates a dense vector by default. Since the nominal attribute
(here, married) has the value 0, the dense vector ignores this attribute.
+*Solution:*+
1. In ARFFVectorIterable, when we add the nominal attributes to the ARFFModel,
we start the class values from 1 instead of 0. This fixes the issue.
So for the above bank.ARFF, the nominal mappings would be
{sex={FEMALE=2, MALE=1}, region={INNER_CITY=1, TOWN=3, RURAL=2, SUBURBAN=4},
children={YES=1, NO=2}, married={YES=1, NO=2},car={YES=1, NO=2},
mortgage={YES=1, NO=2}, pep={YES=1, NO=2} }
and the output of the vector is {color:green}
{0:40.0,1:1.0,2:3.0,3:30085.1,4:1.0,5:1.0,6:1.0,7:1.0,8:2.0}
{color}
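A rough sketch of the proposed numbering, again in plain Java rather than against the real ARFFModel API. The names OneBasedNominalDemo, codes, encodeBankRow and render are hypothetical, and the first code is passed as a parameter only so the 0-based and 1-based variants can be compared side by side.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class OneBasedNominalDemo {

    // Hypothetical helper: assigns codes to labels in declaration order,
    // starting at firstCode (0 = current behaviour, 1 = proposed fix).
    static Map<String, Integer> codes(int firstCode, String... labels) {
        Map<String, Integer> mapping = new LinkedHashMap<>();
        for (String label : labels) {
            mapping.put(label, firstCode + mapping.size());
        }
        return mapping;
    }

    // Encodes the row 40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO from bank.ARFF.
    static double[] encodeBankRow(int firstCode) {
        Map<String, Integer> sex = codes(firstCode, "MALE", "FEMALE");
        Map<String, Integer> region =
            codes(firstCode, "INNER_CITY", "RURAL", "TOWN", "SUBURBAN");
        Map<String, Integer> yesNo = codes(firstCode, "YES", "NO");
        return new double[] {
            40.0, sex.get("MALE"), region.get("TOWN"), 30085.1,
            yesNo.get("YES"), yesNo.get("YES"), yesNo.get("YES"),
            yesNo.get("YES"), yesNo.get("NO")
        };
    }

    // Renders index:value pairs, skipping entries equal to 0
    // (assumed to match how the vectors in this comment were printed).
    static String render(double[] values) {
        StringJoiner out = new StringJoiner(",", "{", "}");
        for (int i = 0; i < values.length; i++) {
            if (values[i] != 0.0) {
                out.add(i + ":" + values[i]);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(encodeBankRow(0))); // current numbering
        System.out.println(render(encodeBankRow(1))); // proposed numbering
    }
}
```

With codes starting at 1, no nominal label ever encodes to 0, so every attribute of the row keeps its own entry in the output.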
If this issue and solution look right, I can upload a patch with the fix.
Please let me know your thoughts.
Joe.
> ARFF VectorIterable
> -------------------
>
> Key: MAHOUT-155
> URL: https://issues.apache.org/jira/browse/MAHOUT-155
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Grant Ingersoll
> Assignee: Grant Ingersoll
> Priority: Minor
>
> Convert ARFF to Vector. See http://www.cs.waikato.ac.nz/~ml/weka/arff.html
> Create a VectorIterable implementation for ARFF.