[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Suneel Marthi updated MAHOUT-953:
---------------------------------
    Status: Open  (was: Patch Available)

ArffVectorIterable does not gracefully handle duplicate attribute name
----------------------------------------------------------------------

                Key: MAHOUT-953
                URL: https://issues.apache.org/jira/browse/MAHOUT-953
            Project: Mahout
         Issue Type: Improvement
         Components: Integration
   Affects Versions: 0.6
           Reporter: Stuart Smith
           Priority: Trivial
            Fix For: Backlog

If you have duplicate attribute names in your ARFF file, and you have non-sparse ARFF vectors, ARFFVectorIterable.computeNext() will throw an ArrayIndexOutOfBoundsException. It allocates a DenseVector sized to the number of attribute labels (duplicates removed), but your ARFF vectors could have more values than that (if they reference the attribute at both indexes).

This is a somewhat pathological ARFF file.

Not sure if I should note the error (throw an exception) in computeNext() when it's out of bounds, or when someone tries to add a duplicate label to the MapBackedArffModel.

My first impulse would be to check in computeNext(), but addLabel() in MapBackedArffModel will do something rather pathological in the case of duplicate attributes: it overwrites the entry in the label map (labelBindings) with the new index, while the idxLabel map holds a mapping from both indexes to the attribute name, so the two maps are out of sync. It may therefore be best to disallow duplicate attribute names (IllegalArgumentException) altogether.

For example:

@attribute my_attribute NUMERIC
@attribute my_attribute NUMERIC

addLabel()
addLabel()

labelBindings - ('my_attribute', 1)
idxLabel - (0, 'my_attribute'), (1, 'my_attribute')

I'll happily submit a patch, just wondering if it should be in computeNext() or addLabel().

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
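The out-of-sync maps described in the issue can be reproduced with a small stand-alone sketch. This is not Mahout's code: the class LabelModelSketch and the methods addLabelUnchecked() and addLabelChecked() are hypothetical names, the two maps only borrow the names labelBindings and idxLabel from the report, and a plain double[] stands in for the DenseVector. It shows one possible shape of the addLabel() guard the reporter proposes, under those assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the model described in the issue; not Mahout's actual class. */
class LabelModelSketch {
  // attribute name -> column index
  final Map<String, Integer> labelBindings = new HashMap<>();
  // column index -> attribute name
  final Map<Integer, String> idxLabel = new HashMap<>();

  /** Unchecked insertion, mirroring the behaviour the issue describes. */
  void addLabelUnchecked(String label, int idx) {
    labelBindings.put(label, idx); // a duplicate name overwrites the earlier index
    idxLabel.put(idx, label);      // but every index still maps back to the name
  }

  /** Hypothetical guard: reject a duplicate attribute name before touching either map. */
  void addLabelChecked(String label, int idx) {
    if (labelBindings.containsKey(label)) {
      throw new IllegalArgumentException(
          "Duplicate attribute name '" + label + "' at index " + idx);
    }
    addLabelUnchecked(label, idx);
  }

  public static void main(String[] args) {
    LabelModelSketch unchecked = new LabelModelSketch();
    unchecked.addLabelUnchecked("my_attribute", 0);
    unchecked.addLabelUnchecked("my_attribute", 1);
    // The two maps now disagree about how many attributes exist (1 versus 2).
    System.out.println("labelBindings size: " + unchecked.labelBindings.size());
    System.out.println("idxLabel size: " + unchecked.idxLabel.size());

    // A dense row with two values does not fit storage sized by labelBindings.
    double[] dense = new double[unchecked.labelBindings.size()];
    double[] row = {1.0, 2.0};
    try {
      for (int i = 0; i < row.length; i++) {
        dense[i] = row[i];
      }
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("row does not fit: " + e.getMessage());
    }

    // With the guard, the second declaration fails fast instead.
    LabelModelSketch checked = new LabelModelSketch();
    checked.addLabelChecked("my_attribute", 0);
    try {
      checked.addLabelChecked("my_attribute", 1);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

Checking in the addLabel() equivalent reports the problem at the offending @attribute line, before any data row is read, whereas a check in computeNext() would only fire on the first dense row that is too long.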
[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robin Anil updated MAHOUT-953:
------------------------------
    Fix Version/s:     (was: 0.8)
                   Backlog

Bring it back to 0.8 queue if anyone is willing to do the work within the next week.

ArffVectorIterable does not gracefully handle duplicate attribute name
----------------------------------------------------------------------

                Key: MAHOUT-953
                URL: https://issues.apache.org/jira/browse/MAHOUT-953
            Project: Mahout
         Issue Type: Improvement
         Components: Integration
   Affects Versions: 0.6
           Reporter: Stuart Smith
           Priority: Trivial
            Fix For: Backlog
[jira] [Updated] (MAHOUT-953) ArffVectorIterable does not gracefully handle duplicate attribute name
[ https://issues.apache.org/jira/browse/MAHOUT-953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-953:
-----------------------------------
    Fix Version/s: 0.8

ArffVectorIterable does not gracefully handle duplicate attribute name
----------------------------------------------------------------------

                Key: MAHOUT-953
                URL: https://issues.apache.org/jira/browse/MAHOUT-953
            Project: Mahout
         Issue Type: Improvement
         Components: Integration
   Affects Versions: 0.6
           Reporter: Stuart Smith
           Priority: Trivial
            Fix For: 0.8