[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maciej Szymkiewicz updated SPARK-29212: --------------------------------------- Description: Copied from [https://github.com/apache/spark/pull/25776]. Maciej's *Concern*: *Use cases for public ML type hierarchy* * Add Python-only Transformer implementations: * ** I am Python user and want to implement pure Python ML classifier without providing JVM backend. ** I want this classifier to be meaningfully positioned in the existing type hierarchy. ** However I have access only to high level classes ({{Estimator}}, {{Model}}, {{MLReader}} / {{MLReadable}}). * Run time parameter validation for both user defined (see above) and existing class hierarchy, * ** I am a library developer who provides functions that are meaningful only for specific categories of {{Estimators}} - here classifiers. ** I want to validate that user passed argument is indeed a classifier: *** For built-in objects using "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and Scala counterparts are public, shouldn't these be too?). ** For user defined objects I can: *** Use duck typing (on {{setRawPredictionCol}} for classifier, on {{numClasses}} for classification model) but it hardly satisfying. *** Provide parallel non-abstract type hierarchy ({{Classifier}} or {{PythonClassifier}} and so on) and require users to implement such interfaces. That however would require separate logic for checking for built-in and and user-provided classes. *** Provide parallel abstract type hierarchy, register all existing built-in classes and require users to do the same. Clearly these are not satisfying solutions as they require either defensive programming or reinventing the same functionality for different 3rd party APIs. * Static type checking * ** I am either end user or library developer and want to use PEP-484 annotations to indicate components that require classifier or classification model. * ** Currently I can provide only imprecise annotations, [such as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... or try to narrow things down using structural subtyping: class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, value: str) -> Classifier: ... class Classifier(Protocol, Model): def setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> int: ... (...) * First of all nothing in the original API indicated this. On the contrary, the original API clearly suggests that non-Java path is supported, by providing base classes (Params, Transformer, Estimator, Model, ML \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, JavaML\{Reader,Writer}, JavaML {Readable,Writable} ). * Furthermore authoritative (IMHO) and open Python ML extensions exist (spark-sklearn is one of these, but if I recall correctly spark-deep-learning provides so pure-Python utilities). Personally I've seen quite a lot of private implementations, but that's just anecdotal evidence. Let us assume for the sake of argument that above observations are irrelevant. I will argue that having complete, public type hierarchy is still desired: * Two out three use cases I described, can narrowed down to Java implementation only, though there are less compelling if we do that. * More importantly, public type hierarchy with Java specific extensions, is pyspark.ml standard. There is no reason to treat this specific case as an exception, especially when the implementations, is far from utilitarian (for example implementation-free JavaClassifierParams and JavaProbabilisticClassifierParams save, as for now, no practical purpose whatsoever). Maciej's *Proposal*: {code:python} Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. class ClassifierParams: ... class Predictor(Estimator,PredictorParams): def setLabelCol(self, value): ... def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... class Classifier(Predictor, ClassifierParams): def setRawPredictionCol(self, value): ... class PredictionModel(Model,PredictorParams): def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... def numFeatures(self): ... def predict(self, value): ... and JVM interop should extend from this hierarchy, i.e. class JavaPredictionModel(PredictionModel): ... In other words it should be consistent with existing approach, where we have ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and Java* variants are their subclasses. {code} *Formally*: * Python ML type hierarchy should reflect Scala counterpart, as long as it makes sense give differences between languages (also minor changes are required due to specifics of the MRO algorithm). This can reduce development burden, overhead of porting user application between both languages and overall consistency of the project. * Classes which are public in Scala should be "public" and implementation independent in Python. No {{Java}} prefix should be used. * Classes which are {{private}} or {{private[ml]}} in Scala, should be explicitly marked as internal, using PEP8 syntax. i.e {code:python} class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): """ Java Classifier Params for classification tasks. """ pass {code} not {code:python} class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): """ (Private) Java Classifier Params for classification tasks. """ pass {code} * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}}, {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}}, {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if wrapper-specific implementation is needed and mixed into concrete implementations only at the bottom of the type hierarchy, i.e. {code:python} class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... class ClassifierParams(HasRawPredictionCol, PredictorParams): ... class PredictionModel(Model, PredictorParams): ... class ClassificationModel(PredictionModel, ClassifierParams): class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, ClassifierParams): ... class ProbabilisticClassificationModel(ClassificationModel, ProbabilisticClassifierParams): ... class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... {code} not {code:python} class JavaPredictionModel(JavaModel, JavaPredictorParams): ... class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ... class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ... class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams): ... class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams): ... class LogisticRegressionModel(JavaProbabilisticClassificationModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... {code} Internal compositions a level above intended leaf could, i.e. {code:python} class _JavaProbabilisticClassificationModel(JavaModel, ProbabilisticClassificationModel): ... {code} could be acceptable, but are hardly justified. - We can assume that the bottom built-in {{Transformers}} are designed as final, and don't have to be designed for inheritance by the 3rd party projects. - We should assume that Scala public intermediary traits mapped to Python mixins will be extended by the 3rd party projects, and design these accordingly. - (Optionally) Some of the existing types designed for JVM-interop ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't provide crucial information for the end user. Individual type checks like {code:python} issubclass(LogisticRegression, JavaWrapper) issubclass(LogisticRegression, Estimator) {code} or {code:python} isinstance(LogisticRegression(), Estimator) isinstance(LogisticRegression(), JavaWrapper) {code} are more likely in the user code than {code:python} isinstance(LogisticRegression(), JavaEstimator) {code} At least top level {{JavaWrapper}} *must* be public, as it provides crucial information for the final user (for example allows to distinguish between objects that can pickled, and such that require more complex serialization strategy). was: Copied from [https://github.com/apache/spark/pull/25776]. Maciej's *Concern*: *Use cases for public ML type hierarchy* * Add Python-only Transformer implementations: * ** I am Python user and want to implement pure Python ML classifier without providing JVM backend. ** I want this classifier to be meaningfully positioned in the existing type hierarchy. ** However I have access only to high level classes ({{Estimator}}, {{Model}}, {{MLReader}} / {{MLReadable}}). * Run time parameter validation for both user defined (see above) and existing class hierarchy, * ** I am a library developer who provides functions that are meaningful only for specific categories of {{Estimators}} - here classifiers. ** I want to validate that user passed argument is indeed a classifier: *** For built-in objects using "private" type hierarchy is not really satisfying (actually, what is the rationale behind making it "private"? If the goal is Scala API parity, and Scala counterparts are public, shouldn't these be too?). ** For user defined objects I can: *** Use duck typing (on {{setRawPredictionCol}} for classifier, on {{numClasses}} for classification model) but it hardly satisfying. *** Provide parallel non-abstract type hierarchy ({{Classifier}} or {{PythonClassifier}} and so on) and require users to implement such interfaces. That however would require separate logic for checking for built-in and and user-provided classes. *** Provide parallel abstract type hierarchy, register all existing built-in classes and require users to do the same. Clearly these are not satisfying solutions as they require either defensive programming or reinventing the same functionality for different 3rd party APIs. * Static type checking * ** I am either end user or library developer and want to use PEP-484 annotations to indicate components that require classifier or classification model. * ** Currently I can provide only imprecise annotations, [such as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... or try to narrow things down using structural subtyping: class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, value: str) -> Classifier: ... class Classifier(Protocol, Model): def setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> int: ... (...) * First of all nothing in the original API indicated this. On the contrary, the original API clearly suggests that non-Java path is supported, by providing base classes (Params, Transformer, Estimator, Model, ML \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, JavaML\{Reader,Writer}, JavaML {Readable,Writable} ). * Furthermore authoritative (IMHO) and open Python ML extensions exist (spark-sklearn is one of these, but if I recall correctly spark-deep-learning provides so pure-Python utilities). Personally I've seen quite a lot of private implementations, but that's just anecdotal evidence. Let us assume for the sake of argument that above observations are irrelevant. I will argue that having complete, public type hierarchy is still desired: * Two out three use cases I described, can narrowed down to Java implementation only, though there are less compelling if we do that. * More importantly, public type hierarchy with Java specific extensions, is pyspark.ml standard. There is no reason to treat this specific case as an exception, especially when the implementations, is far from utilitarian (for example implementation-free JavaClassifierParams and JavaProbabilisticClassifierParams save, as for now, no practical purpose whatsoever). Maciej's *Proposal*: {code:python} Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. class ClassifierParams: ... class Predictor(Estimator,PredictorParams): def setLabelCol(self, value): ... def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... class Classifier(Predictor, ClassifierParams): def setRawPredictionCol(self, value): ... class PredictionModel(Model,PredictorParams): def setFeaturesCol(self, value): ... def setPredictionCol(self, value): ... def numFeatures(self): ... def predict(self, value): ... and JVM interop should extend from this hierarchy, i.e. class JavaPredictionModel(PredictionModel): ... In other words it should be consistent with existing approach, where we have ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and Java* variants are their subclasses. {code} *Formally*: * Python ML type hierarchy should reflect Scala counterpart, as long as it makes sense give differences between languages (also minor changes are required due to specifics of the MRO algorithm). This can reduce development burden, overhead of porting user application between both languages and overall consistency of the project. * Classes which are public in Scala should be "public" and implementation independent in Python. No {{Java}} prefix should be used. * Classes which are {{private}} or {{private[ml]}} in Scala, should be explicitly marked as internal, using PEP8 syntax. i.e {code:python} class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): """ Java Classifier Params for classification tasks. """ pass {code} not {code:python} class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): """ (Private) Java Classifier Params for classification tasks. """ pass {code} * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}}, {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}}, {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if wrapper-specific implementation is needed and mixed into concrete implementations only at the bottom of the type hierarchy, i.e. {code:python} class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... class ClassifierParams(HasRawPredictionCol, PredictorParams): ... class PredictionModel(Model, PredictorParams): ... class ClassificationModel(PredictionModel, ClassifierParams): class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, ClassifierParams): ... class ProbabilisticClassificationModel(ClassificationModel, ProbabilisticClassifierParams): ... class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... {code} not {code:python} class JavaPredictionModel(JavaModel, JavaPredictorParams): ... class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ... class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ... class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, JavaClassifierParams): ... class JavaProbabilisticClassificationModel(JavaClassificationModel, JavaProbabilisticClassifierParams): ... class LogisticRegressionModel(JavaProbabilisticClassificationModel, JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... {code} Internal compositions a level above intended leaf could, i.e. {code:python} class _JavaProbabilisticClassificationModel(JavaModel, ProbabilisticClassificationModel): ... {code} could be acceptable, but are hardly justified. - We can assume that the bottom built-in {{Transformers}} are designed as final, and don't have to be designed for inheritance by the 3rd party projects. - We should assume that Scala public intermediary traits mapped to Python mixins will be extended by the 3rd party projects, and design these accordingly. - (Optionally) Some of the existing types designed for JVM-interop ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't provide crucial information for the end user. Individual type checks like {code:python} issubclass(LogisticRegression, JavaWrapper) issubclass(LogisticRegression, Estimator) {code} or {code:python} isinstance(LogisticRegression(), Estimator) isinstance(LogisticRegression(), JavaWrapper) {code} are more likely in the user code than {code:python} {code} > Add common classes without using JVM backend > -------------------------------------------- > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark > Affects Versions: 3.0.0 > Reporter: zhengruifeng > Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > (...) > * First of all nothing in the original API indicated this. On the contrary, > the original API clearly suggests that non-Java path is supported, by > providing base classes (Params, Transformer, Estimator, Model, ML > \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific > implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, > JavaML\{Reader,Writer}, JavaML > {Readable,Writable} > ). > * Furthermore authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides so pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume for the sake of argument that above observations are > irrelevant. I will argue that having complete, public type hierarchy is still > desired: > * Two out three use cases I described, can narrowed down to Java > implementation only, though there are less compelling if we do that. > * More importantly, public type hierarchy with Java specific extensions, is > pyspark.ml standard. There is no reason to treat this specific case as an > exception, especially when the implementations, is far from utilitarian (for > example implementation-free JavaClassifierParams and > JavaProbabilisticClassifierParams save, as for now, no practical purpose > whatsoever). > > Maciej's *Proposal*: > {code:python} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > and JVM interop should extend from this hierarchy, i.e. > class JavaPredictionModel(PredictionModel): ... > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > {code} > > *Formally*: > * Python ML type hierarchy should reflect Scala counterpart, as long as it > makes sense give differences between languages (also minor changes are > required due to specifics of the MRO algorithm). This can reduce development > burden, overhead of porting user application between both languages and > overall consistency of the project. > * Classes which are public in Scala should be "public" and implementation > independent in Python. No {{Java}} prefix should be used. > * Classes which are {{private}} or {{private[ml]}} in Scala, should be > explicitly marked as internal, using PEP8 syntax. i.e > {code:python} > class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): > """ > Java Classifier Params for classification tasks. > """ > pass > {code} > not > {code:python} > class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): > """ > (Private) Java Classifier Params for classification tasks. > """ > pass > {code} > * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}}, > {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}}, > {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if > wrapper-specific implementation is needed and mixed into concrete > implementations only at the bottom of the type hierarchy, i.e. > {code:python} > class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... > class ClassifierParams(HasRawPredictionCol, PredictorParams): ... > class PredictionModel(Model, PredictorParams): ... > class ClassificationModel(PredictionModel, ClassifierParams): > class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, > ClassifierParams): ... > class ProbabilisticClassificationModel(ClassificationModel, > ProbabilisticClassifierParams): ... > class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel, > JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... > {code} > not > {code:python} > class JavaPredictionModel(JavaModel, JavaPredictorParams): ... > class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... > class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ... > class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ... > class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, > JavaClassifierParams): ... > class JavaProbabilisticClassificationModel(JavaClassificationModel, > JavaProbabilisticClassifierParams): ... > class LogisticRegressionModel(JavaProbabilisticClassificationModel, > JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... > {code} > Internal compositions a level above intended leaf could, i.e. > {code:python} > class _JavaProbabilisticClassificationModel(JavaModel, > ProbabilisticClassificationModel): ... > {code} > could be acceptable, but are hardly justified. > - We can assume that the bottom built-in {{Transformers}} are designed as > final, and don't have to be designed for inheritance by the 3rd party > projects. > - We should assume that Scala public intermediary traits mapped to Python > mixins will be extended by the 3rd party projects, and design these > accordingly. > - (Optionally) Some of the existing types designed for JVM-interop > ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't > provide crucial information for the end user. Individual type checks like > {code:python} > issubclass(LogisticRegression, JavaWrapper) > issubclass(LogisticRegression, Estimator) > {code} > or > {code:python} > isinstance(LogisticRegression(), Estimator) > isinstance(LogisticRegression(), JavaWrapper) > {code} > are more likely in the user code than > {code:python} > isinstance(LogisticRegression(), JavaEstimator) > {code} > At least top level {{JavaWrapper}} *must* be public, as it provides crucial > information for the final user (for example allows to distinguish between > objects that can pickled, and such that require more complex serialization > strategy). > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org