[
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
zhengruifeng reassigned SPARK-29212:
------------------------------------
Assignee: Maciej Szymkiewicz
> Add common classes without using JVM backend
> --------------------------------------------
>
> Key: SPARK-29212
> URL: https://issues.apache.org/jira/browse/SPARK-29212
> Project: Spark
> Issue Type: Sub-task
> Components: ML, PySpark
> Affects Versions: 3.0.0
> Reporter: zhengruifeng
> Assignee: Maciej Szymkiewicz
> Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>
> Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
> * Add Python-only Transformer implementations:
> *
> ** I am Python user and want to implement pure Python ML classifier without
> providing JVM backend.
> ** I want this classifier to be meaningfully positioned in the existing type
> hierarchy.
> ** However I have access only to high level classes ({{Estimator}},
> {{Model}}, {{MLReader}} / {{MLReadable}}).
> * Run time parameter validation for both user defined (see above) and
> existing class hierarchy,
> *
> ** I am a library developer who provides functions that are meaningful only
> for specific categories of {{Estimators}} - here classifiers.
> ** I want to validate that user passed argument is indeed a classifier:
> *** For built-in objects using "private" type hierarchy is not really
> satisfying (actually, what is the rationale behind making it "private"? If
> the goal is Scala API parity, and Scala counterparts are public, shouldn't
> these be too?).
> ** For user defined objects I can:
> *** Use duck typing (on {{setRawPredictionCol}} for classifier, on
> {{numClasses}} for classification model) but it hardly satisfying.
> *** Provide parallel non-abstract type hierarchy ({{Classifier}} or
> {{PythonClassifier}} and so on) and require users to implement such
> interfaces. That however would require separate logic for checking for
> built-in and and user-provided classes.
> *** Provide parallel abstract type hierarchy, register all existing built-in
> classes and require users to do the same.
> Clearly these are not satisfying solutions as they require either defensive
> programming or reinventing the same functionality for different 3rd party
> APIs.
> * Static type checking
> *
> ** I am either end user or library developer and want to use PEP-484
> annotations to indicate components that require classifier or classification
> model.
> *
> ** Currently I can provide only imprecise annotations, [such
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
> def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
> or try to narrow things down using structural subtyping:
> class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self,
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) ->
> int: ...
> (...)
> * First of all nothing in the original API indicated this. On the contrary,
> the original API clearly suggests that non-Java path is supported, by
> providing base classes (Params, Transformer, Estimator, Model, ML
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel,
> JavaML\{Reader,Writer}, JavaML
> {Readable,Writable}
> ).
> * Furthermore authoritative (IMHO) and open Python ML extensions exist
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning
> provides so pure-Python utilities). Personally I've seen quite a lot of
> private implementations, but that's just anecdotal evidence.
> Let us assume for the sake of argument that above observations are
> irrelevant. I will argue that having complete, public type hierarchy is still
> desired:
> * Two out three use cases I described, can narrowed down to Java
> implementation only, though there are less compelling if we do that.
> * More importantly, public type hierarchy with Java specific extensions, is
> pyspark.ml standard. There is no reason to treat this specific case as an
> exception, especially when the implementations, is far from utilitarian (for
> example implementation-free JavaClassifierParams and
> JavaProbabilisticClassifierParams save, as for now, no practical purpose
> whatsoever).
>
> Maciej's *Proposal*:
> {code:python}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
> def setLabelCol(self, value): ...
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
> def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
> def setFeaturesCol(self, value): ...
> def setPredictionCol(self, value): ...
> def numFeatures(self): ...
> def predict(self, value): ...
> {code}
> and JVM interop should extend from this hierarchy, i.e.
> {code:python}
> class JavaPredictionModel(PredictionModel): ...
> {code}
> In other words it should be consistent with existing approach, where we have
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and
> Java* variants are their subclasses.
>
> *Formally*:
> * Python ML type hierarchy should reflect Scala counterpart, as long as it
> makes sense give differences between languages (also minor changes are
> required due to specifics of the MRO algorithm). This can reduce development
> burden, overhead of porting user application between both languages and
> overall consistency of the project.
> * Classes which are public in Scala should be "public" and implementation
> independent in Python. No {{Java}} prefix should be used.
> * Classes which are {{private}} or {{private[ml]}} in Scala, should be
> explicitly marked as internal, using PEP8 syntax. i.e
> {code:python}
> class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
> """
> Java Classifier Params for classification tasks.
> """
> pass
> {code}
> not
> {code:python}
> class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
> """
> (Private) Java Classifier Params for classification tasks.
> """
> pass
> {code}
> * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}},
> {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}},
> {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if
> wrapper-specific implementation is needed and mixed into concrete
> implementations only at the bottom of the type hierarchy, i.e.
> {code:python}
> class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ...
> class ClassifierParams(HasRawPredictionCol, PredictorParams): ...
> class PredictionModel(Model, PredictorParams): ...
> class ClassificationModel(PredictionModel, ClassifierParams):
> class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds,
> ClassifierParams): ...
> class ProbabilisticClassificationModel(ClassificationModel,
> ProbabilisticClassifierParams): ...
> class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel,
> JavaMLWritable, JavaMLReadable, HasTrainingSummary): ...
> {code}
> not
> {code:python}
> class JavaPredictionModel(JavaModel, JavaPredictorParams): ...
> class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ...
> class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ...
> class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ...
> class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds,
> JavaClassifierParams): ...
> class JavaProbabilisticClassificationModel(JavaClassificationModel,
> JavaProbabilisticClassifierParams): ...
> class LogisticRegressionModel(JavaProbabilisticClassificationModel,
> JavaMLWritable, JavaMLReadable, HasTrainingSummary): ...
> {code}
> Internal compositions a level above intended leaf, i.e.
> {code:python}
> class _JavaProbabilisticClassificationModel(JavaModel,
> ProbabilisticClassificationModel): ...
> {code}
> could be acceptable, but are hardly justified.
> - We can assume that the bottom built-in {{Transformers}} are designed as
> final, and don't have to be designed for inheritance by the 3rd party
> projects.
> - We should assume that Scala public intermediary traits mapped to Python
> mixins will be extended by the 3rd party projects, and design these
> accordingly.
> - (Optionally) Some of the existing types designed for JVM-interop
> ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't
> provide crucial information for the end user. Individual type checks like
> {code:python}
> issubclass(LogisticRegression, JavaWrapper)
> issubclass(LogisticRegression, Estimator)
> {code}
> or
> {code:python}
> isinstance(LogisticRegression(), Estimator)
> isinstance(LogisticRegression(), JavaWrapper)
> {code}
> are more likely in the user code than
> {code:python}
> isinstance(LogisticRegression(), JavaEstimator)
> {code}
> At least top level {{JavaWrapper}} *must* be public, as it provides crucial
> information for the final user (for example allows to distinguish between
> objects that can pickled, and such that require more complex serialization
> strategy). This is possibly a breaking change, hence should be implemented in
> a major release.
> - If the above suggestions are approved and implemented we'll see two
> different usages of PEP8 internal naming - one for objects that are
> {{private}} and not designed for extension (for example [different
> compositions used for specific {{Estimator}} / {{Model}}
> pairs|https://github.com/apache/spark/blob/8556710409d9f2fbaee9dbf76a2ea70218316693/python/pyspark/ml/clustering.py#L99])
> and these which are used for PySpark internals (primarily Java wrappers).
> The latter one are not intended for end users, but can be still utilized by
> 3rd party developers providing custom multilingual libraries. Additional
> guidance in the docstrings would be advised.
> *Non-goals*:
> - Reducing overhead of maintaining wrappers in general (though [some
> related|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611]
> ideas emerged here).
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]