[ https://issues.apache.org/jira/browse/SPARK-17476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15478517#comment-15478517 ]
Xin Ren commented on SPARK-17476:
---------------------------------

Hi I can try to work on this one, thanks :)

> Proper handling for unseen labels in logistic regression training.
> ------------------------------------------------------------------
>
>                 Key: SPARK-17476
>                 URL: https://issues.apache.org/jira/browse/SPARK-17476
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Seth Hendrickson
>
> Now that logistic regression supports multiclass, it is possible to train on
> data that has {{K}} classes, but one or more of the classes does not appear
> in training. For example,
> {code}
> (0.0, x1)
> (2.0, x2)
> ...
> {code}
> Currently, logistic regression assumes that the outcome classes in the above
> dataset have three levels: {{0, 1, 2}}. Since label 1 never appears, it
> should never be predicted. In theory, the coefficients should be zero and the
> intercept should be negative infinity. This can cause problems since we
> center the intercepts after training.
> We should discuss whether or not the intercepts actually tend to -infinity in
> practice, and whether or not we should even include them in training.
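
As a rough illustration of the question raised in the issue (do the intercepts actually tend to -infinity?), here is a minimal sketch in plain Python. It is not Spark's implementation: it swaps in ordinary gradient descent on an unregularized softmax loss with a single feature, and the toy data, the learning rate, and the helper names softmax and train are all assumptions made up for this example. The gradient for the unseen class's intercept is the mean predicted probability of that class, which is always positive, so each step pushes that intercept further down.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(data, num_classes, steps, lr=0.5):
    """Plain gradient descent on unregularized multinomial log-loss.

    Hypothetical helper for illustration only: one feature per row,
    one weight and one intercept per class, no pivoting or centering.
    """
    w = [0.0] * num_classes
    b = [0.0] * num_classes
    history = []  # intercepts of all classes after every step
    n = len(data)
    for _ in range(steps):
        grad_w = [0.0] * num_classes
        grad_b = [0.0] * num_classes
        for label, x in data:
            probs = softmax([w[k] * x + b[k] for k in range(num_classes)])
            for k in range(num_classes):
                # Gradient of the log-loss w.r.t. the class-k score.
                err = probs[k] - (1.0 if k == label else 0.0)
                grad_w[k] += err * x / n
                grad_b[k] += err / n
        w = [w[k] - lr * grad_w[k] for k in range(num_classes)]
        b = [b[k] - lr * grad_b[k] for k in range(num_classes)]
        history.append(list(b))
    return w, b, history


# Toy data (assumed): labels 0 and 2 appear, label 1 never does.
data = [(0, -1.0), (0, 0.5), (2, 1.0), (2, -0.5), (0, -2.0), (2, 2.0)]

w, b, history = train(data, num_classes=3, steps=200)

# Intercept of the unseen class (label 1) after each step.
unseen = [step[1] for step in history]
print("intercept of unseen class after first step:", unseen[0])
print("intercept of unseen class after last step: ", unseen[-1])
```

In this sketch the intercept for label 1 keeps decreasing for as long as training runs, only more and more slowly, which is consistent with the "negative infinity in theory" remark in the issue. Whether an optimizer with a convergence tolerance or regularization ever gets far along that path is a separate question that this toy example does not answer.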