[ https://issues.apache.org/jira/browse/SPARK-41008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17643255#comment-17643255 ]
Ahmed Mahran commented on SPARK-41008:
--------------------------------------

[~srowen] I think you are right. Repeated feature/x values are pooled into a single point whose label/y value is the weighted average of the corresponding label/y values. I checked the sklearn implementation: [https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b6/sklearn/isotonic.py#L281] and [https://github.com/scikit-learn/scikit-learn/blob/f3f51f9b611bf873bd5836748647221480071a87/sklearn/_isotonic.pyx#L66]. I then wrote a draft Scala version of the pooling function, and it gives the same results on a few different examples. I'd like to pick this up if possible. Also, should the new pooling always be applied, or should there be a new option?

> Isotonic regression result differs from sklearn implementation
> --------------------------------------------------------------
>
>                 Key: SPARK-41008
>                 URL: https://issues.apache.org/jira/browse/SPARK-41008
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 3.3.1
>            Reporter: Arne Koopman
>            Priority: Minor
>
> {code:python}
> import pandas as pd
> import pyspark.sql.functions as F  # needed below for F.col
> from pyspark.sql.types import DoubleType
> from sklearn.isotonic import IsotonicRegression as IsotonicRegression_sklearn
> from pyspark.ml.regression import IsotonicRegression as IsotonicRegression_pyspark
>
> # The P(positive | model_score):
> # 0.6   -> 0.5   (1 out of the 2 labels is positive)
> # 0.333 -> 0.333 (1 out of the 3 labels is positive)
> # 0.20  -> 0.25  (1 out of the 4 labels is positive)
> tc_pd = pd.DataFrame({
>     "model_score": [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20],
>     "label": [1, 0, 0, 1, 0, 1, 0, 0, 0],
>     "weight": 1,
> })
>
> # The fraction of positives for each of the distinct model_scores would be the best fit.
> # Resulting in the following expected calibrated model_scores:
> # "calibrated_model_score": [0.5, 0.5, 0.333, 0.333, 0.333, 0.25, 0.25, 0.25, 0.25]
>
> # The sklearn implementation of Isotonic Regression.
> tc_regressor_sklearn = IsotonicRegression_sklearn().fit(X=tc_pd['model_score'], y=tc_pd['label'], sample_weight=tc_pd['weight'])
> print("sklearn:", tc_regressor_sklearn.predict(tc_pd['model_score']))
> # >> sklearn: [0.5 0.5 0.33333333 0.33333333 0.33333333 0.25 0.25 0.25 0.25]
>
> # The pyspark implementation of Isotonic Regression.
> tc_df = spark.createDataFrame(tc_pd)
> tc_df = tc_df.withColumn('model_score', F.col('model_score').cast(DoubleType()))
> isotonic_regressor_pyspark = IsotonicRegression_pyspark(featuresCol='model_score', labelCol='label', weightCol='weight')
> tc_model = isotonic_regressor_pyspark.fit(tc_df)
> tc_pd = tc_model.transform(tc_df).toPandas()
> print("pyspark:", tc_pd['prediction'].values)
> # >> pyspark: [0.5 0.5 0.33333333 0.33333333 0.33333333 0. 0. 0. 0.]
>
> # The result from the pyspark implementation seems incorrect. Similar small toy examples lead to similarly unexpected results from the pyspark implementation.
> # Strangely enough, for 'large' datasets, the difference between the calibrated model_scores produced by the two implementations disappears.
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
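A minimal sketch of the tie-pooling step described in the comment above, in plain Python. This is not the actual sklearn or Spark code; `pool_ties` is a hypothetical helper that illustrates the idea: repeated feature/x values are collapsed into a single point whose y value is the weighted average of the corresponding labels. On the toy data from the issue, the pooled values are already monotone in x, so PAVA would leave them unchanged and they match the expected calibrated scores.

```python
def pool_ties(x, y, w):
    """Pool repeated x values into single points.

    Each distinct x keeps the total weight of its group, and its y
    becomes the weighted average of the group's y values.
    """
    pooled = {}  # x -> (sum of w*y, sum of w)
    for xi, yi, wi in zip(x, y, w):
        sy, sw = pooled.get(xi, (0.0, 0.0))
        pooled[xi] = (sy + wi * yi, sw + wi)
    xs = sorted(pooled)
    ys = [pooled[xi][0] / pooled[xi][1] for xi in xs]
    ws = [pooled[xi][1] for xi in xs]
    return xs, ys, ws

# The toy data from the issue.
x = [0.6, 0.6, 0.333, 0.333, 0.333, 0.20, 0.20, 0.20, 0.20]
y = [1, 0, 0, 1, 0, 1, 0, 0, 0]
w = [1.0] * len(x)

xs, ys, ws = pool_ties(x, y, w)
print(xs)  # [0.2, 0.333, 0.6]
print(ys)  # [0.25, 0.333..., 0.5] -- the expected calibrated scores
print(ws)  # [4.0, 3.0, 2.0]
```

Whether this pooling should always run before PAVA, or sit behind a new option, is the open question raised in the comment.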