[ 
https://issues.apache.org/jira/browse/SPARK-26883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16782099#comment-16782099
 ] 

Sean Owen commented on SPARK-26883:
-----------------------------------

My guess: something goes wrong when a partition contains 0 instances of the 
'rare' outcome. While Spark isn't going to give exactly the same output as 
scikit-learn, this difference is clearly too large. You could probably verify 
this by seeing what happens when all the data is in one partition.
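
A back-of-the-envelope check supports that guess (the partition count of 8 is 
an assumption that depends on the cluster's default parallelism; the ~1.8e-4 
positive rate comes from the describe() output in the report):

```python
import math

# Rare case from the report: ~1.8e-4 positive rate over 100k rows.
p = 1.8e-4
n = 100000
num_partitions = 8              # assumption; depends on spark.default.parallelism
rows_per_partition = n // num_partitions

# Probability that one partition contains zero positive labels.
p_zero = (1 - p) ** rows_per_partition
print(round(p_zero, 3))         # ~0.105

# Probability that at least one partition contains zero positives.
p_any_zero = 1 - (1 - p_zero) ** num_partitions
print(round(p_any_zero, 3))     # ~0.589
```

So under these assumptions it is more likely than not that at least one 
partition sees no positive labels at all.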

> Spark MLIB Logistic Regression with heavy class imbalance estimates 0 
> coefficients
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-26883
>                 URL: https://issues.apache.org/jira/browse/SPARK-26883
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.3.2
>            Reporter: GAURAV BHATIA
>            Priority: Major
>
> A minimal example is below.
> Basically, when the frequency of positives becomes low, the coefficients from 
> spark.ml.classification.LogisticRegression become 0, deviating from the 
> corresponding sklearn results.
> I have not been able to find any parameter setting or documentation that 
> explains why this happens or how to alter the behavior.
> I'd appreciate any help in debugging. Thanks in advance!
>  
> Here, we set up the code to create the two sample scenarios. In both cases a 
> binary outcome is fit to a single binary predictor using logistic regression. 
> The binary predictor increases the probability of a positive (1) outcome 
> roughly tenfold. The only difference between the "common" and "rare" cases is 
> the base frequency of the positive outcome: 0.01 in the "common" case, 1e-4 
> in the "rare" case.
>  
> {code:java}
> import pandas as pd
> import numpy as np
> import math
>
> def sampleLogistic(p0, p1, p1prev, size):
>     # Convert the base probabilities into an intercept and coefficient on the logit scale
>     intercept = -1*math.log(1/p0 - 1)
>     coefficient = -1*math.log(1/p1 - 1) - intercept
>     x = np.random.choice([0, 1], size=(size,), p=[1 - p1prev, p1prev])
>     freq = 1/(1 + np.exp(-1*(intercept + coefficient*x)))
>     y = (np.random.uniform(size=size) < freq).astype(int)
>     return pd.DataFrame({'x': x, 'y': y})
>
> df_common = sampleLogistic(0.01, 0.1, 0.1, 100000)
> df_rare = sampleLogistic(0.0001, 0.001, 0.1, 100000)
> {code}
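
As a sanity check on the generator (using the same parameters passed to 
sampleLogistic above), the expected positive rate in the rare case works out to 
1.9e-4, i.e. only about 19 positives out of 100,000 rows:

```python
# Expected P(y=1) in the rare case: a mixture over x ~ Bernoulli(p1prev)
p0, p1, p1prev, size = 0.0001, 0.001, 0.1, 100000
rate = p0 * (1 - p1prev) + p1 * p1prev   # 9e-5 + 1e-4 = ~1.9e-4
print(rate, rate * size)                 # ~0.00019 -> ~19 expected positives
```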
>  
> Using sklearn:
>  
> {code:java}
> from sklearn.linear_model import LogisticRegression
>
> l = 0.3
> skmodel = LogisticRegression(
>     fit_intercept=True,
>     penalty='l2',
>     C=1/l,
>     max_iter=100,
>     tol=1e-11,
>     solver='lbfgs',
>     verbose=1)
> skmodel.fit(df_common[['x']], df_common.y)
> print(skmodel.coef_, skmodel.intercept_)
> skmodel.fit(df_rare[['x']], df_rare.y)
> print(skmodel.coef_, skmodel.intercept_)
> {code}
> In one run of the simulation, this prints:
>  
> {noformat}
> [[ 2.39497867]] [-4.58143701] # the common case 
> [[ 1.84918485]] [-9.05090438] # the rare case{noformat}
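
For context on the parameter mapping used in the PySpark runs below (this is my 
reading of the two objectives, not something stated in the ticket): sklearn 
minimizes the summed log-loss plus (1/(2C))*||w||^2, while Spark minimizes the 
mean log-loss plus (regParam/2)*||w||^2, so dividing the sklearn objective by n 
gives the equivalent setting regParam = 1/(C*n) = l/n:

```python
n = 100000
l = 0.3           # the ticket uses C = 1/l in sklearn

# sklearn objective:  sum_i loss_i + (1/(2C)) * ||w||^2
# Spark objective:    (1/n) * sum_i loss_i + (regParam/2) * ||w||^2
# Dividing sklearn's objective by n  =>  regParam = 1/(C*n) = l/n
reg_param = l / n
print(reg_param)  # ~3e-06
```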
> Now, using PySpark for the common case:
>  
> {code:java}
> from pyspark.ml.classification import LogisticRegression
> from pyspark.ml.feature import VectorAssembler
> n = len(df_common.index)
> sdf_common = spark.createDataFrame(df_common)
> assembler = VectorAssembler(inputCols=['x'], outputCol="features")
> sdf_common = assembler.transform(sdf_common)
> # Make regularization 0.3/10=0.03
> lr = 
> LogisticRegression(regParam=l/n,labelCol='y',featuresCol='features',tol=1e-11/n,maxIter=100,standardization=False)
> model = lr.fit(sdf_common)
> print(model.coefficients, model.intercept)
> {code}
>  
> This prints:
>  
> {code:java}
> [2.39497214622] -4.5814342575166505  # nearly identical to the common case above
> {code}
> Pyspark for the rare case:
>  
> {code:java}
> n = len(df_rare.index)
> sdf_rare = spark.createDataFrame(df_rare)
> assembler = VectorAssembler(inputCols=['x'], outputCol="features")
> sdf_rare = assembler.transform(sdf_rare)
> # Match the sklearn penalty: regParam = l/n = 0.3/100000 = 3e-6
> lr = LogisticRegression(regParam=l/n, labelCol='y', featuresCol='features',
>                         tol=1e-11/n, maxIter=100, standardization=False)
> model = lr.fit(sdf_rare)
> print(model.coefficients, model.intercept)
> {code}
> This prints:
>  
> {noformat}
> [0.0] -8.62237369087212 # where does the 0 come from??
> {noformat}
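
One detail that points at a degenerate fit (my own observation, not from the 
ticket): the reported intercept is almost exactly the log-odds of the overall 
positive rate (mean y = 1.8e-4 in the describe() output below), i.e. the model 
has collapsed to an intercept-only fit with a zero coefficient:

```python
import math

# Overall positive rate of the rare data frame (18 positives in 100,000 rows)
p = 1.8e-4
intercept_only = math.log(p / (1 - p))
print(intercept_only)   # ~ -8.62237, vs. the reported -8.62237369087212
```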
>  
> To verify that the data frames have the properties described above:
> {code:java}
> sdf_common.describe().show()
> +-------+------------------+------------------+
> |summary|                 x|                 y|
> +-------+------------------+------------------+
> |  count|            100000|            100000|
> |   mean|           0.10055|           0.01927|
> | stddev|0.3007334399530905|0.1374731104200414|
> |    min|                 0|                 0|
> |    max|                 1|                 1|
> +-------+------------------+------------------+
> sdf_rare.describe().show()
> +-------+------------------+--------------------+
> |summary|                 x|                   y|
> +-------+------------------+--------------------+
> |  count|            100000|              100000|
> |   mean|           0.09997|              1.8E-4|
> | stddev|0.2999614956440055|0.013415267410454295|
> |    min|                 0|                   0|
> |    max|                 1|                   1|
> +-------+------------------+--------------------+
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
