[ https://issues.apache.org/jira/browse/MAHOUT-155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125194#comment-13125194 ]
Joe Prasanna Kumar commented on MAHOUT-155:
-------------------------------------------

+*Problem:*+
Nominal attributes in ARFF format are not completely converted to vector format. When a nominal attribute value is mapped to 0, it is not reflected in the vector. For example, consider the bank.ARFF file below, from the WEKA site:

{color:blue}
@relation bank
@attribute age numeric
@attribute sex {MALE,FEMALE}
@attribute region {INNER_CITY,RURAL,TOWN,SUBURBAN}
@attribute income numeric
@attribute married {YES,NO}
@attribute children {YES,NO}
@attribute car {YES,NO}
@attribute mortgage {YES,NO}
@attribute pep {YES,NO}
@data
40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO
{color}

The nominal mappings for the above ARFF file are
{sex={FEMALE=1, MALE=0}, region={INNER_CITY=0, TOWN=2, RURAL=1, SUBURBAN=3}, children={YES=0, NO=1}, married={YES=0, NO=1}, car={YES=0, NO=1}, mortgage={YES=0, NO=1}, pep={YES=0, NO=1}}

When this ARFF file is converted to vector format, the output is
{color:red}{0:40.0,2:2.0,3:30085.1,8:1.0}{color}

Because the attribute married assigns 0 to YES and 1 to NO, this attribute (the fifth attribute, vector index 4) does not show up in the vector. The same is true of sex, children, car and mortgage (indices 1, 5, 6 and 7), whose values in this row also map to 0.

+*Issue:*+
When I try to convert a nominal attribute in ARFF format to vector format, ARFFIterator by default creates a Dense vector. Since the nominal attribute (here it is married) has value 0, the dense vector ignores this attribute.

+*Solution:*+
1. In ARFFVectorIterable, when we add the nominal attributes to the ARFFModel, we'll start the class values from 1 instead of 0. This will fix the issue, because no nominal label will then be encoded as 0.

So for the above bank.ARFF, the nominal mappings would be
{sex={FEMALE=2, MALE=1}, region={INNER_CITY=1, TOWN=3, RURAL=2, SUBURBAN=4}, children={YES=1, NO=2}, married={YES=1, NO=2}, car={YES=1, NO=2}, mortgage={YES=1, NO=2}, pep={YES=1, NO=2}}

and the output vector is
{color:green}{0:40.0,1:1.0,2:3.0,3:30085.1,4:1.0,5:1.0,6:1.0,7:1.0,8:2.0}{color}

If this issue and solution look right, I can upload a patch with the fix.
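To make the proposed change concrete, here is a minimal, self-contained Java sketch of the two numbering schemes applied to the bank.ARFF row. It does not use any Mahout classes: the class name NominalMappingSketch and the methods buildMapping, encodeRow and toSparseString are hypothetical names chosen for illustration, and the zero-skipping printer is only an assumption standing in for whatever step drops the 0 entries in the real code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative only: this is NOT Mahout's ARFFVectorIterable or ARFFModel.
public class NominalMappingSketch {

  /** Maps each label of a nominal attribute to a number, counting up from firstValue. */
  public static Map<String, Integer> buildMapping(String[] labels, int firstValue) {
    Map<String, Integer> mapping = new LinkedHashMap<>();
    for (int i = 0; i < labels.length; i++) {
      mapping.put(labels[i], firstValue + i);
    }
    return mapping;
  }

  /** Encodes one @data row; a null entry in nominalLabels marks a numeric attribute. */
  public static double[] encodeRow(String[] row, String[][] nominalLabels, int firstValue) {
    double[] values = new double[row.length];
    for (int i = 0; i < row.length; i++) {
      if (nominalLabels[i] == null) {
        values[i] = Double.parseDouble(row[i]);
      } else {
        // Assumes the label is declared for this attribute; an unknown label would throw.
        values[i] = buildMapping(nominalLabels[i], firstValue).get(row[i]);
      }
    }
    return values;
  }

  /** Formats index:value pairs and skips zeros (assumed behaviour of the output step). */
  public static String toSparseString(double[] values) {
    StringJoiner joiner = new StringJoiner(",", "{", "}");
    for (int i = 0; i < values.length; i++) {
      if (values[i] != 0.0) {
        joiner.add(i + ":" + values[i]);
      }
    }
    return joiner.toString();
  }

  public static void main(String[] args) {
    String[] yesNo = {"YES", "NO"};
    String[][] labels = {
        null, {"MALE", "FEMALE"}, {"INNER_CITY", "RURAL", "TOWN", "SUBURBAN"}, null,
        yesNo, yesNo, yesNo, yesNo, yesNo};
    String[] row = "40,MALE,TOWN,30085.1,YES,YES,YES,YES,NO".split(",");

    // Current scheme: labels numbered from 0, so MALE and YES disappear.
    System.out.println(toSparseString(encodeRow(row, labels, 0)));
    // Proposed scheme: labels numbered from 1, so every attribute is present.
    System.out.println(toSparseString(encodeRow(row, labels, 1)));
  }
}
```

With firstValue set to 0 the sketch reproduces the red vector above, and with firstValue set to 1 it reproduces the green one. In this sketch the only difference between the two runs is the starting offset passed to buildMapping.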
Please let me know your thoughts.

Joe.

> ARFF VectorIterable
> -------------------
>
>                 Key: MAHOUT-155
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-155
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Grant Ingersoll
>            Assignee: Grant Ingersoll
>            Priority: Minor
>
> Convert ARFF to Vector. See http://www.cs.waikato.ac.nz/~ml/weka/arff.html
> Create a VectorIterable implementation for ARFF.