zhengruifeng updated SPARK-28958:
---------------------------------
    Attachment: ML_SYNC.pdf

> pyspark.ml function parity
> --------------------------
>
>                 Key: SPARK-28958
>                 URL: https://issues.apache.org/jira/browse/SPARK-28958
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
>         Attachments: ML_SYNC.pdf
>
>
> I looked into the hierarchy of both the py and scala sides and found that they are quite different, which damages parity and makes the codebase hard to maintain.
> The main inconvenience is that most models in pyspark do not support any param getters and setters.
> On the py side, I think we need to:
> 1, remove the setters generated by _shared_params_code_gen.py;
> 2, add common abstract classes like the scala side, such as JavaPredictor/JavaClassificationModel/JavaProbabilisticClassifier;
> 3, for each alg, add its param trait, such as LinearSVCParams (a rough sketch follows below);
> 4, since shared params do not have setters, we need to add them in the right places;
> Unfortunately, I notice that if we do 1 (remove the setters generated by _shared_params_code_gen.py), all algs (classification/regression/clustering/features/fpm/recommendation) need to be modified in one batch.
> The scala side also needs some small improvements, but I think they can be left alone at first, to avoid a lot of MiMa failures.
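A minimal sketch of what items 3 and 4 could look like on the py side, assuming the existing pyspark.ml.param machinery (Param, Params, TypeConverters, and the shared-param mixins). The class names _LinearSVCParams, LinearSVCSketch, and LinearSVCModelSketch, and the choice of threshold as the algorithm-level param, are illustrative assumptions, not the final implementation:

    # Illustrative sketch only: class and param names below are assumptions,
    # not the actual pyspark.ml implementation.
    from pyspark.ml.param import Param, Params, TypeConverters
    from pyspark.ml.param.shared import HasMaxIter, HasRegParam


    class _LinearSVCParams(HasMaxIter, HasRegParam):
        """Param trait shared by the estimator and its model: getters only."""

        # algorithm-level param defined once in the trait; shared params
        # (maxIter, regParam) come from the mixins and already have getters
        threshold = Param(
            Params._dummy(), "threshold",
            "threshold in binary classification applied to the linear model prediction.",
            typeConverter=TypeConverters.toFloat)

        def getThreshold(self):
            """Gets the value of threshold or its default value."""
            return self.getOrDefault(self.threshold)


    class LinearSVCSketch(_LinearSVCParams):
        """Estimator side: setters live here, not in the param trait."""

        def setThreshold(self, value):
            """Sets the value of :py:attr:`threshold`."""
            return self._set(threshold=value)

        def setMaxIter(self, value):
            """Sets the value of the shared param :py:attr:`maxIter`."""
            return self._set(maxIter=value)


    class LinearSVCModelSketch(_LinearSVCParams):
        """Model side: mixing in the trait exposes the param getters."""
        pass


    # usage: setters chain on the estimator, getters work on anything
    # that mixes in the trait
    svc = LinearSVCSketch().setMaxIter(50).setThreshold(0.25)
    print(svc.getMaxIter(), svc.getThreshold())   # 50 0.25

With this layout the model class gains the getters simply by mixing in the trait, which addresses the "models do not support param getters" issue, while setters are added explicitly on the estimator (and on the model only where they make sense).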