[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938205#comment-16938205 ]
zhengruifeng commented on SPARK-29212: -------------------------------------- [~zero323] Would you like to help work on this? > Add common classes without using JVM backend > -------------------------------------------- > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark > Affects Versions: 3.0.0 > Reporter: zhengruifeng > Priority: Major > > copyed from [https://github.com/apache/spark/pull/25776.] > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > > Maciej's *Proposal*: > {code:java} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org