[ 
https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16133483#comment-16133483
 ] 

Joseph K. Bradley commented on SPARK-21770:
-------------------------------------------

I vaguely recall discussing this before, but I forget where that discussion 
took place.  Overall, I'd vote for the uniform distribution:
* The "probability" column has a clear meaning: It should provide a predicted 
probability distribution over class labels.  An all-0 vector is not a valid 
probability distribution.
* It does not really make sense to say all classes are impossible.  When 
fitting a statistical model to predict from n classes, one makes the implicit 
assumption that each instance has a "true" class among those n, so at least 
one class must have nonzero probability.
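
To make the proposal concrete, here is a minimal sketch of the behavior being 
voted for.  It is plain Python for illustration only, not Spark's Scala 
implementation: the function name normalize_to_probabilities is hypothetical, 
and it assumes the raw predictions are non-negative, so a zero sum can only 
come from an all-0 vector.

```python
def normalize_to_probabilities(raw):
    """Turn non-negative raw scores into a probability distribution.

    Hypothetical helper for illustration; this is not Spark's API.
    Assumes every entry of `raw` is >= 0.
    """
    n = len(raw)
    if n == 0:
        raise ValueError("raw prediction vector must be non-empty")

    total = sum(raw)
    if total == 0:
        # All-0 input: no class is favored, so fall back to uniform 1/n
        # rather than returning an invalid all-0 "distribution".
        return [1.0 / n] * n

    # Usual case: scale the entries so that they sum to 1.
    return [value / total for value in raw]


print(normalize_to_probabilities([0.0, 0.0, 0.0, 0.0]))  # [0.25, 0.25, 0.25, 0.25]
print(normalize_to_probabilities([1.0, 3.0]))            # [0.25, 0.75]
```

With the uniform fallback, every output sums to 1, so downstream code can 
treat the "probability" column as a valid distribution without special-casing 
the all-0 input.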

However, I can see the argument for not changing current behavior (from a 
software engineering standpoint).  That said, if people are relying on this 
behavior, their application logic is probably incorrect from a statistical 
modeling perspective.

Any opinions [~sethah], [~yanboliang], [~dbtsai] ?

> ProbabilisticClassificationModel: Improve normalization of all-zero raw 
> predictions
> -----------------------------------------------------------------------------------
>
>                 Key: SPARK-21770
>                 URL: https://issues.apache.org/jira/browse/SPARK-21770
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML
>    Affects Versions: 2.3.0
>            Reporter: Siddharth Murching
>            Priority: Minor
>
> Given an n-element raw prediction vector of all zeros, 
> ProbabilisticClassificationModel.normalizeToProbabilitiesInPlace() should 
> output a probability vector in which every entry equals 1/n.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

