[ https://issues.apache.org/jira/browse/SPARK-12804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097008#comment-15097008 ]

DB Tsai commented on SPARK-12804:
---------------------------------

As commented in the code, when only positive or only negative labels are 
detected, we should return the solution immediately, as LiR (LinearRegression) 
does. This is slightly different from the issue in LiR: there, the question is 
how to handle standardization when `yStd == 0`. The problem here is that when 
all the samples have label zero, the histogram is built from a single label, 
so accessing the label-one entry throws an ArrayIndexOutOfBoundsException.

Note that this bug should not occur when all the labels are one, though.
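A minimal sketch of the failure mode and the proposed short-circuit. The object and method names here are illustrative, not Spark's actual internals; Spark initializes the intercept to log(count(1) / count(0)) when fitting an intercept, which is where the out-of-bounds access occurs on a single-label dataset:

```scala
object SingleLabelSketch {
  // Intercept-only initialization for binary logistic regression.
  // With both labels present, use log(count(1) / count(0)); with a single
  // label, short-circuit instead of indexing histogram(1), which would be
  // out of bounds when only label zero was observed.
  def initialIntercept(labels: Seq[Double]): Double = {
    val histogram = labels.groupBy(_.toInt).map { case (k, v) => k -> v.size.toDouble }
    (histogram.get(0), histogram.get(1)) match {
      case (Some(n0), Some(n1)) => math.log(n1 / n0)       // both labels seen
      case (None, Some(_))      => Double.PositiveInfinity // all label one
      case (Some(_), None)      => Double.NegativeInfinity // all label zero
      case _                    => 0.0                     // empty dataset
    }
  }

  def main(args: Array[String]): Unit = {
    println(initialIntercept(Seq(0.0, 0.0, 0.0))) // all-zero data
    println(initialIntercept(Seq(0.0, 1.0, 1.0)))
  }
}
```

With an infinite intercept the sigmoid saturates, so the returned model pins every prediction to the single observed label, which matches the short-circuit behaviour proposed in the issue.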

> ml.classification.LogisticRegression fails when FitIntercept with same-label 
> dataset
> ------------------------------------------------------------------------------------
>
>                 Key: SPARK-12804
>                 URL: https://issues.apache.org/jira/browse/SPARK-12804
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 1.6.0
>            Reporter: Feynman Liang
>            Assignee: Feynman Liang
>
> When training LogisticRegression on a dataset where the label is all 0 or all 
> 1, an array out of bounds exception is thrown. The problematic code is
> {code}
>       initialCoefficientsWithIntercept.toArray(numFeatures)
>         = math.log(histogram(1) / histogram(0))
>     }
> {code}
> The correct behaviour is to short-circuit training entirely when only a 
> single label is present (detectable from {{labelSummarizer}}) and return 
> a classifier which predicts that label for all inputs, i.e. one with an 
> infinite intercept weight.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
