[ https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946784#comment-16946784 ]
Sean R. Owen commented on SPARK-29212: -------------------------------------- I don't have a strong opinion on it. I'd make it all consistent first, before changing to another consistent state. Removing truly no-op classes is OK, unless we can foresee a use for them later. Pyspark does mean to wrap the JVM implemenatation, so "Java" related classes aren't inherently wrong even as part of a developer API. Refactoring is good, but weigh it against breaking existing extensions of these classes, even in Spark 3. > Add common classes without using JVM backend > -------------------------------------------- > > Key: SPARK-29212 > URL: https://issues.apache.org/jira/browse/SPARK-29212 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark > Affects Versions: 3.0.0 > Reporter: zhengruifeng > Priority: Major > > Copied from [https://github.com/apache/spark/pull/25776]. > > Maciej's *Concern*: > *Use cases for public ML type hierarchy* > * Add Python-only Transformer implementations: > * > ** I am Python user and want to implement pure Python ML classifier without > providing JVM backend. > ** I want this classifier to be meaningfully positioned in the existing type > hierarchy. > ** However I have access only to high level classes ({{Estimator}}, > {{Model}}, {{MLReader}} / {{MLReadable}}). > * Run time parameter validation for both user defined (see above) and > existing class hierarchy, > * > ** I am a library developer who provides functions that are meaningful only > for specific categories of {{Estimators}} - here classifiers. > ** I want to validate that user passed argument is indeed a classifier: > *** For built-in objects using "private" type hierarchy is not really > satisfying (actually, what is the rationale behind making it "private"? If > the goal is Scala API parity, and Scala counterparts are public, shouldn't > these be too?). > ** For user defined objects I can: > *** Use duck typing (on {{setRawPredictionCol}} for classifier, on > {{numClasses}} for classification model) but it hardly satisfying. > *** Provide parallel non-abstract type hierarchy ({{Classifier}} or > {{PythonClassifier}} and so on) and require users to implement such > interfaces. That however would require separate logic for checking for > built-in and and user-provided classes. > *** Provide parallel abstract type hierarchy, register all existing built-in > classes and require users to do the same. > Clearly these are not satisfying solutions as they require either defensive > programming or reinventing the same functionality for different 3rd party > APIs. > * Static type checking > * > ** I am either end user or library developer and want to use PEP-484 > annotations to indicate components that require classifier or classification > model. > * > ** Currently I can provide only imprecise annotations, [such > as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241] > def setClassifier(self, value: Estimator[M]) -> OneVsRest: ... > or try to narrow things down using structural subtyping: > class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, > value: str) -> Classifier: ... class Classifier(Protocol, Model): def > setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> > int: ... > (...) > * First of all nothing in the original API indicated this. On the contrary, > the original API clearly suggests that non-Java path is supported, by > providing base classes (Params, Transformer, Estimator, Model, ML > \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific > implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, > JavaML\{Reader,Writer}, JavaML > {Readable,Writable} > ). > * Furthermore authoritative (IMHO) and open Python ML extensions exist > (spark-sklearn is one of these, but if I recall correctly spark-deep-learning > provides so pure-Python utilities). Personally I've seen quite a lot of > private implementations, but that's just anecdotal evidence. > Let us assume for the sake of argument that above observations are > irrelevant. I will argue that having complete, public type hierarchy is still > desired: > * Two out three use cases I described, can narrowed down to Java > implementation only, though there are less compelling if we do that. > * More importantly, public type hierarchy with Java specific extensions, is > pyspark.ml standard. There is no reason to treat this specific case as an > exception, especially when the implementations, is far from utilitarian (for > example implementation-free JavaClassifierParams and > JavaProbabilisticClassifierParams save, as for now, no practical purpose > whatsoever). > > Maciej's *Proposal*: > {code:python} > Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e. > class ClassifierParams: ... > class Predictor(Estimator,PredictorParams): > def setLabelCol(self, value): ... > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > class Classifier(Predictor, ClassifierParams): > def setRawPredictionCol(self, value): ... > class PredictionModel(Model,PredictorParams): > def setFeaturesCol(self, value): ... > def setPredictionCol(self, value): ... > def numFeatures(self): ... > def predict(self, value): ... > {code} > and JVM interop should extend from this hierarchy, i.e. > {code:python} > class JavaPredictionModel(PredictionModel): ... > {code} > In other words it should be consistent with existing approach, where we have > ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and > Java* variants are their subclasses. > > *Formally*: > * Python ML type hierarchy should reflect Scala counterpart, as long as it > makes sense give differences between languages (also minor changes are > required due to specifics of the MRO algorithm). This can reduce development > burden, overhead of porting user application between both languages and > overall consistency of the project. > * Classes which are public in Scala should be "public" and implementation > independent in Python. No {{Java}} prefix should be used. > * Classes which are {{private}} or {{private[ml]}} in Scala, should be > explicitly marked as internal, using PEP8 syntax. i.e > {code:python} > class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): > """ > Java Classifier Params for classification tasks. > """ > pass > {code} > not > {code:python} > class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): > """ > (Private) Java Classifier Params for classification tasks. > """ > pass > {code} > * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}}, > {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}}, > {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if > wrapper-specific implementation is needed and mixed into concrete > implementations only at the bottom of the type hierarchy, i.e. > {code:python} > class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... > class ClassifierParams(HasRawPredictionCol, PredictorParams): ... > class PredictionModel(Model, PredictorParams): ... > class ClassificationModel(PredictionModel, ClassifierParams): > class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, > ClassifierParams): ... > class ProbabilisticClassificationModel(ClassificationModel, > ProbabilisticClassifierParams): ... > class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel, > JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... > {code} > not > {code:python} > class JavaPredictionModel(JavaModel, JavaPredictorParams): ... > class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ... > class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ... > class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ... > class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, > JavaClassifierParams): ... > class JavaProbabilisticClassificationModel(JavaClassificationModel, > JavaProbabilisticClassifierParams): ... > class LogisticRegressionModel(JavaProbabilisticClassificationModel, > JavaMLWritable, JavaMLReadable, HasTrainingSummary): ... > {code} > Internal compositions a level above intended leaf, i.e. > {code:python} > class _JavaProbabilisticClassificationModel(JavaModel, > ProbabilisticClassificationModel): ... > {code} > could be acceptable, but are hardly justified. > - We can assume that the bottom built-in {{Transformers}} are designed as > final, and don't have to be designed for inheritance by the 3rd party > projects. > - We should assume that Scala public intermediary traits mapped to Python > mixins will be extended by the 3rd party projects, and design these > accordingly. > - (Optionally) Some of the existing types designed for JVM-interop > ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't > provide crucial information for the end user. Individual type checks like > {code:python} > issubclass(LogisticRegression, JavaWrapper) > issubclass(LogisticRegression, Estimator) > {code} > or > {code:python} > isinstance(LogisticRegression(), Estimator) > isinstance(LogisticRegression(), JavaWrapper) > {code} > are more likely in the user code than > {code:python} > isinstance(LogisticRegression(), JavaEstimator) > {code} > At least top level {{JavaWrapper}} *must* be public, as it provides crucial > information for the final user (for example allows to distinguish between > objects that can pickled, and such that require more complex serialization > strategy). This is possibly a breaking change, hence should be implemented in > a major release. > - If the above suggestions are approved and implemented we'll see two > different usages of PEP8 internal naming - one for objects that are > {{private}} and not designed for extension (for example [different > compositions used for specific {{Estimator}} / {{Model}} > pairs|https://github.com/apache/spark/blob/8556710409d9f2fbaee9dbf76a2ea70218316693/python/pyspark/ml/clustering.py#L99]) > and these which are used for PySpark internals (primarily Java wrappers). > The latter one are not intended for end users, but can be still utilized by > 3rd party developers providing custom multilingual libraries. Additional > guidance in the docstrings would be advised. > *Non-goals*: > - Reducing overhead of maintaining wrappers in general (though [some > related|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] > ideas emerged here). > -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org