[ 
https://issues.apache.org/jira/browse/SPARK-28958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhengruifeng updated SPARK-28958:
---------------------------------
    Attachment: ML_SYNC.pdf

> pyspark.ml function parity
> --------------------------
>
>                 Key: SPARK-28958
>                 URL: https://issues.apache.org/jira/browse/SPARK-28958
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
>         Attachments: ML_SYNC.pdf
>
>
> I looked into the hierarchy of both the Python and Scala sides, and found that 
> they are quite different, which damages the parity and makes the codebase hard 
> to maintain.
> The main inconvenience is that most models in pyspark do not support any 
> param getters or setters.
> On the Python side, I think we need to:
> 1. remove the setters generated by _shared_params_code_gen.py;
> 2. add common abstract classes matching the Scala side, such as 
> JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;
> 3. for each algorithm, add its params trait, such as LinearSVCParams;
> 4. since the shared params will no longer have setters, add setters in the 
> right places (see the sketches after the quoted description).
> Unfortunately, I notice that if we do 1 (remove the setters generated by 
> _shared_params_code_gen.py), all algorithms 
> (classification/regression/clustering/features/fpm/recommendation) need to be 
> modified in one batch.
> The Scala side also needs some small improvements, but I think those can be 
> left alone at first, to avoid a lot of MiMa failures.
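
For item 1, a minimal sketch (modeled on the existing pyspark.ml.param
machinery, not the actual output of _shared_params_code_gen.py) of what a
shared-param mixin would look like once the generated setter is dropped,
keeping only the Param definition and its getter:

    from pyspark.ml.param import Param, Params, TypeConverters

    class HasMaxIter(Params):
        """Mixin for param maxIter: getter only; setters move to the concrete classes."""

        maxIter = Param(Params._dummy(), "maxIter",
                        "max number of iterations (>= 0).",
                        typeConverter=TypeConverters.toInt)

        def getMaxIter(self):
            """Gets the value of maxIter or its default value."""
            return self.getOrDefault(self.maxIter)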
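
For items 2-4, a simplified sketch of how a per-algorithm params trait could be
shared by the estimator and its model, so the model inherits the param getters
while explicit setters are added where they belong. The class bodies, the
chosen params, and the _LinearSVCParams name are illustrative only; the real
classes would also mix in the Java wrapper and ML read/write helpers:

    from pyspark.ml.param.shared import HasMaxIter, HasRegParam, HasThreshold
    from pyspark.ml.wrapper import JavaEstimator, JavaModel

    class _LinearSVCParams(HasMaxIter, HasRegParam, HasThreshold):
        """Params shared by LinearSVC and LinearSVCModel (getters only)."""
        pass

    class LinearSVC(JavaEstimator, _LinearSVCParams):
        """Estimator: carries the explicit setters."""

        def setMaxIter(self, value):
            return self._set(maxIter=value)

        def setRegParam(self, value):
            return self._set(regParam=value)

    class LinearSVCModel(JavaModel, _LinearSVCParams):
        """Model: inherits getMaxIter()/getRegParam()/getThreshold() from the
        params trait, with only the setters that make sense on a fitted model."""

        def setThreshold(self, value):
            return self._set(threshold=value)

With this layout the model exposes getMaxIter(), getRegParam(), etc. without
any code-generated setters, matching the Scala hierarchy more closely.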



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
