[ 
https://issues.apache.org/jira/browse/SPARK-17151?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15429066#comment-15429066
 ] 

DB Tsai edited comment on SPARK-17151 at 8/19/16 11:49 PM:
-----------------------------------------------------------

Besides the zero-coefficient issue, the intercepts will also go to negative 
infinity for classes that are not seen at training time. This causes 
instabilities during optimization, so we should not train on those unseen 
classes. Instead, we need to keep track of which classes are seen at training 
time and optimize the coefficients only for them. Since we know all the 
possible classes, which users should be able to specify as part of the API, 
at prediction time we simply assign the unseen classes probability zero. 
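
To make the point concrete, here is a minimal Python sketch (not Spark code; 
all function names are hypothetical). It assumes an intercept-only model, 
where the maximum-likelihood intercept of each class equals the log of its 
empirical frequency up to an additive constant, so a class with zero training 
examples gets an intercept of negative infinity. The prediction helper then 
applies softmax over the seen classes only and assigns zero to the rest.

```python
import math

def class_counts(labels, num_classes):
    """Count training examples per class; labels are ints in [0, num_classes)."""
    counts = [0] * num_classes
    for y in labels:
        counts[y] += 1
    return counts

def log_prior_intercepts(counts):
    """Intercept-only fit: log of the empirical class frequency.
    A class with no examples has frequency 0, hence intercept -inf."""
    total = sum(counts)
    return [math.log(c / total) if c > 0 else float("-inf") for c in counts]

def predict_probabilities(seen_margins, seen_classes, num_classes):
    """Softmax over the seen classes only; unseen classes get probability 0."""
    shift = max(seen_margins)  # subtract the max for numerical stability
    exps = [math.exp(m - shift) for m in seen_margins]
    norm = sum(exps)
    probs = [0.0] * num_classes
    for cls, e in zip(seen_classes, exps):
        probs[cls] = e / norm
    return probs

# Hypothetical data: 4 possible classes, but only classes 0 and 2 appear.
labels = [0, 0, 2, 2, 2, 0]
num_classes = 4
counts = class_counts(labels, num_classes)
intercepts = log_prior_intercepts(counts)
seen_classes = [k for k, c in enumerate(counts) if c > 0]
probs = predict_probabilities(
    [intercepts[k] for k in seen_classes], seen_classes, num_classes
)
```

Only the seen classes enter the optimization, so no infinite values ever 
reach the solver, while the output vector still has one entry per possible 
class.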


> Decide how to handle inferring number of classes in Multinomial logistic 
> regression
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-17151
>                 URL: https://issues.apache.org/jira/browse/SPARK-17151
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>            Reporter: Seth Hendrickson
>            Priority: Minor
>
> This JIRA is to discuss how the number of label classes should be inferred 
> in multinomial logistic regression. Currently, MLOR checks the DataFrame 
> metadata and, if the number of classes is not specified there, infers it 
> from the maximum value seen in the label column. If the labels are not 
> properly indexed, this can produce a large number of zero coefficients and 
> potentially cause instabilities in model training.
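
The effect described in the issue can be illustrated with a small Python 
sketch (not Spark code; the helper names are hypothetical). It assumes the 
class count is taken as the maximum label plus one, which is one plausible 
reading of "uses the maximum value seen in the label column", and shows how 
sparse label values leave most classes empty until the labels are re-indexed.

```python
def infer_num_classes(labels):
    """Assumed rule: number of classes = largest label value + 1."""
    return int(max(labels)) + 1

def unseen_classes(labels, num_classes):
    """Class indices in [0, num_classes) that never occur in the labels."""
    seen = {int(y) for y in labels}
    return [k for k in range(num_classes) if k not in seen]

# Hypothetical label column that was never indexed to 0..K-1.
raw_labels = [0.0, 3.0, 10.0, 3.0]
num_classes = infer_num_classes(raw_labels)
empty = unseen_classes(raw_labels, num_classes)

# Re-indexing the distinct values to consecutive integers removes the gaps.
index_of = {y: i for i, y in enumerate(sorted(set(raw_labels)))}
indexed = [index_of[y] for y in raw_labels]
```

Here three distinct labels yield eleven inferred classes, eight of which have 
no training examples; after re-indexing, the same rule infers three classes.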


