[ https://issues.apache.org/jira/browse/SYSTEMML-700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15536545#comment-15536545 ]
Niketan Pansare commented on SYSTEMML-700:
------------------------------------------

Pros of the existing approach (i.e. label transformation in PredictionUtils):
1. Addresses this JIRA and also allows string-based labels.
2. This feature is a must-have for attracting scikit-learn/Python users.

Cons of the existing approach:
1. Performance impact, as it requires an additional preprocessing pass to transform the labels.
2. Consistency of label conversion. For example, if an input fails or produces incorrect results from the command line, it should behave the same way through the API.

[~mboehm7] [~freiss] [~reinw...@us.ibm.com] Since it is important both to attract more users and to reduce performance overhead, how about going with the following solution? We add an additional parameter to the wrappers (i.e. encodeData), and we can have it turned on by default in Python.

> Inflexible category labels for Multinomial Logistic Regression
> --------------------------------------------------------------
>
>                 Key: SYSTEMML-700
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-700
>             Project: SystemML
>          Issue Type: Bug
>          Components: Algorithms
>            Reporter: Jeremy
>            Priority: Minor
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> The Logistic Regression algorithm requires that category labels be labeled as
> 0 up to the number of classes-1. It should be able to handle any set of
> category labels provided by the user. B_out should have the appropriate size
> regardless of the values of the labels given, and the algorithm should also
> preserve the original labeling for the user.
> Added detail:
> The solution I'm currently using is to transform the labels from whatever
> values they are to 0, 1, 2,... beforehand, and then transform them back to
> their original labels after the algorithm runs.
> Currently the algorithm doesn't handle class values that don't start at 0 or
> 1, and doesn't handle non-contiguous integers, both of which can come up.
> For example, class labels 4, 5, 6 return 5 sets of coefficients (the correct
> number is 2), and class labels -1, 0, 1 return just one set of coefficients
> (the correct number is 2).
> Handling frames with strings would be a really great user experience - that
> could look like R's coercion internally. Both glmnet and scikit-learn handle
> string label arguments, but both APIs are weakly typed as well.
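
The workaround described in the quoted issue (map arbitrary labels to 0, 1, 2,... before training, then map predictions back afterwards) can be sketched in plain Python as follows. This is only an illustration of the idea, not the code in PredictionUtils or any SystemML API; the helper names encode_labels and decode_labels are hypothetical, and the 0-based codes follow the issue description.

```python
def encode_labels(labels):
    """Map arbitrary labels to contiguous codes 0..K-1 (hypothetical helper)."""
    # Sorted distinct labels; position in this list is the integer code.
    classes = sorted(set(labels))
    index = {label: code for code, label in enumerate(classes)}
    codes = [index[label] for label in labels]
    return classes, codes


def decode_labels(classes, codes):
    """Map integer codes back to the original labels (hypothetical helper)."""
    return [classes[code] for code in codes]


# Labels that neither start at 0 nor need to be integers.
classes, codes = encode_labels([4, 6, 5, 4])
# classes == [4, 5, 6], codes == [0, 2, 1, 0]

# With K distinct classes, the issue expects K - 1 sets of coefficients.
expected_coefficient_sets = len(classes) - 1  # 2 for labels 4, 5, 6

# After prediction, codes are translated back to the user's labels.
restored = decode_labels(classes, codes)  # [4, 6, 5, 4]

# String labels work the same way.
str_classes, str_codes = encode_labels(["spam", "ham", "spam"])
```

The encoding step is the extra preprocessing pass listed under the cons above, which is why making it optional through a wrapper parameter is being proposed.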