[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll updated MAHOUT-953:
-----------------------------------

    Fix Version/s: 0.8

> ArffVectorIterable does not gracefully handle duplicate attribute name
> ----------------------------------------------------------------------
>
>                 Key: MAHOUT-953
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-953
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Integration
>    Affects Versions: 0.6
>            Reporter: Stuart Smith
>            Priority: Trivial
>             Fix For: 0.8
>
>
> If you have duplicate attribute names in your ARFF file, and you have
> non-sparse ARFF vectors, ARFFVectorIterable.computeNext will throw an
> ArrayIndexOutOfBoundsException: it allocates a DenseVector sized to the
> number of attribute labels (duplicates removed), but your ARFF vectors can
> have more values than that (if they supply a value at both indexes). This
> is a somewhat pathological ARFF file.
> Not sure if I should report the error (throw an exception) in computeNext()
> when the index is out of bounds, or when someone tries to add a duplicate
> label to the MapBackedArffModel.
> My first impulse would be to check in computeNext(), but addLabel() in
> MapBackedArffModel does something rather pathological in the case of
> duplicate attributes: it overwrites the labelBindings entry with the new
> index, while the idxLabel map holds a mapping from both indexes to the
> attribute name, so the two maps are out of sync. It may therefore be best
> to disallow duplicate attribute names altogether by throwing an
> IllegalArgumentException.
> For example:
>   @attribute my_attribute NUMERIC
>   @attribute my_attribute NUMERIC
> leads to:
>   addLabel()
>   addLabel()
>   labelBindings -> ('my_attribute', 1)
>   idxLabel -> (0, 'my_attribute'), (1, 'my_attribute')
> I'll happily submit a patch, just wondering if it should be in
> computeNext() or addLabel().
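
For illustration, the addLabel() option discussed in the report could look roughly like the sketch below. It is a self-contained stand-in, not Mahout's actual MapBackedArffModel: the class name LabelModelSketch, the addLabel(String, int) signature, and the labelCount()/indexCount() helpers are assumptions made for this example. Only the two map names, labelBindings and idxLabel, come from the report.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the label bookkeeping in the report; not Mahout's class. */
class LabelModelSketch {
  // attribute name -> column index
  private final Map<String, Integer> labelBindings = new HashMap<>();
  // column index -> attribute name
  private final Map<Integer, String> idxLabel = new HashMap<>();

  /** Registers an attribute, rejecting a name that is already bound to an index. */
  void addLabel(String label, int idx) {
    Integer existing = labelBindings.get(label);
    if (existing != null) {
      // Fail before touching either map, so they cannot drift out of sync.
      throw new IllegalArgumentException(
          "Duplicate attribute name '" + label + "' at index " + idx
              + " (already defined at index " + existing + ")");
    }
    labelBindings.put(label, idx);
    idxLabel.put(idx, label);
  }

  int labelCount() {
    return labelBindings.size();
  }

  int indexCount() {
    return idxLabel.size();
  }

  public static void main(String[] args) {
    LabelModelSketch model = new LabelModelSketch();
    model.addLabel("my_attribute", 0);
    try {
      model.addLabel("my_attribute", 1); // second @attribute with the same name
    } catch (IllegalArgumentException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

Rejecting the duplicate at registration time surfaces the problem while the header is being parsed, before any vector is sized from the label count, and it leaves both maps with the same number of entries.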