[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

Sean R. Owen (Jira) Tue, 08 Oct 2019 05:14:55 -0700


    [ 
https://issues.apache.org/jira/browse/SPARK-29212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946784#comment-16946784
 ]


Sean R. Owen commented on SPARK-29212:
--------------------------------------

I don't have a strong opinion on it. I'd make it all consistent first, before 
changing to another consistent state. Removing truly no-op classes is OK, 
unless we can foresee a use for them later. Pyspark does mean to wrap the JVM 
implemenatation, so "Java" related classes aren't inherently wrong even as part 
of a developer API. Refactoring is good, but weigh it against breaking existing 
extensions of these classes, even in Spark 3.

> Add common classes without using JVM backend
> --------------------------------------------
>
>                 Key: SPARK-29212
>                 URL: https://issues.apache.org/jira/browse/SPARK-29212
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, PySpark
>    Affects Versions: 3.0.0
>            Reporter: zhengruifeng
>            Priority: Major
>
> Copied from [https://github.com/apache/spark/pull/25776].
>  
>  Maciej's *Concern*:
> *Use cases for public ML type hierarchy*
>  * Add Python-only Transformer implementations:
>  * 
>  ** I am Python user and want to implement pure Python ML classifier without 
> providing JVM backend.
>  ** I want this classifier to be meaningfully positioned in the existing type 
> hierarchy.
>  ** However I have access only to high level classes ({{Estimator}}, 
> {{Model}}, {{MLReader}} / {{MLReadable}}).
>  * Run time parameter validation for both user defined (see above) and 
> existing class hierarchy,
>  * 
>  ** I am a library developer who provides functions that are meaningful only 
> for specific categories of {{Estimators}} - here classifiers.
>  ** I want to validate that user passed argument is indeed a classifier:
>  *** For built-in objects using "private" type hierarchy is not really 
> satisfying (actually, what is the rationale behind making it "private"? If 
> the goal is Scala API parity, and Scala counterparts are public, shouldn't 
> these be too?).
>  ** For user defined objects I can:
>  *** Use duck typing (on {{setRawPredictionCol}} for classifier, on 
> {{numClasses}} for classification model) but it hardly satisfying.
>  *** Provide parallel non-abstract type hierarchy ({{Classifier}} or 
> {{PythonClassifier}} and so on) and require users to implement such 
> interfaces. That however would require separate logic for checking for 
> built-in and and user-provided classes.
>  *** Provide parallel abstract type hierarchy, register all existing built-in 
> classes and require users to do the same.
>  Clearly these are not satisfying solutions as they require either defensive 
> programming or reinventing the same functionality for different 3rd party 
> APIs.
>  * Static type checking
>  * 
>  ** I am either end user or library developer and want to use PEP-484 
> annotations to indicate components that require classifier or classification 
> model.
>  * 
>  ** Currently I can provide only imprecise annotations, [such 
> as|https://github.com/zero323/pyspark-stubs/blob/dd5cfc9ef1737fc3ccc85c247c5116eaa4b9df18/third_party/3/pyspark/ml/classification.pyi#L241]
>  def setClassifier(self, value: Estimator[M]) -> OneVsRest: ...
>  or try to narrow things down using structural subtyping:
>  class Classifier(Protocol, Estimator[M]): def setRawPredictionCol(self, 
> value: str) -> Classifier: ... class Classifier(Protocol, Model): def 
> setRawPredictionCol(self, value: str) -> Model: ... def numClasses(self) -> 
> int: ...
> (...)
>  * First of all nothing in the original API indicated this. On the contrary, 
> the original API clearly suggests that non-Java path is supported, by 
> providing base classes (Params, Transformer, Estimator, Model, ML 
> \{Reader,Writer}, ML\{Readable,Writable}) as well as Java specific 
> implementations (JavaParams, JavaTransformer, JavaEstimator, JavaModel, 
> JavaML\{Reader,Writer}, JavaML
> {Readable,Writable}
> ).
>  * Furthermore authoritative (IMHO) and open Python ML extensions exist 
> (spark-sklearn is one of these, but if I recall correctly spark-deep-learning 
> provides so pure-Python utilities). Personally I've seen quite a lot of 
> private implementations, but that's just anecdotal evidence.
> Let us assume for the sake of argument that above observations are 
> irrelevant. I will argue that having complete, public type hierarchy is still 
> desired:
>  * Two out three use cases I described, can narrowed down to Java 
> implementation only, though there are less compelling if we do that.
>  * More importantly, public type hierarchy with Java specific extensions, is 
> pyspark.ml standard. There is no reason to treat this specific case as an 
> exception, especially when the implementations, is far from utilitarian (for 
> example implementation-free JavaClassifierParams and 
> JavaProbabilisticClassifierParams save, as for now, no practical purpose 
> whatsoever).
>   
> Maciej's *Proposal*:
> {code:python}
> Python ML hierarchy should reflect Scala hierarchy first (@srowen), i.e.
> class ClassifierParams: ...
> class Predictor(Estimator,PredictorParams):
>     def setLabelCol(self, value): ...
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
> class Classifier(Predictor, ClassifierParams):
>     def setRawPredictionCol(self, value): ...
> class PredictionModel(Model,PredictorParams):
>     def setFeaturesCol(self, value): ...
>     def setPredictionCol(self, value): ...
>     def numFeatures(self): ...
>     def predict(self, value): ...
> {code}
> and JVM interop should extend from this hierarchy, i.e.
>  {code:python}
> class JavaPredictionModel(PredictionModel): ...
>  {code}
> In other words it should be consistent with existing approach, where we have 
> ABCs reflecting Scala API (Transformer, Estimator, Model) and so on, and 
> Java* variants are their subclasses.
>  
> *Formally*:
>  * Python ML type hierarchy should reflect Scala counterpart, as long as it 
> makes sense give differences between languages (also minor changes are 
> required due to specifics of the MRO algorithm). This can reduce development 
> burden, overhead of porting user application between both languages and 
> overall consistency of the project.
>  * Classes which are public in Scala should be "public" and implementation 
> independent in Python. No {{Java}} prefix should be used.
>  * Classes which are {{private}} or {{private[ml]}} in Scala, should be 
> explicitly marked as internal, using PEP8 syntax. i.e
> {code:python}
> class _JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
>     """
>     Java Classifier Params for classification tasks.
>     """
>     pass
>  {code}
> not
> {code:python}
> class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams):
>     """
>     (Private) Java Classifier Params for classification tasks.
>     """
>     pass
>  {code}
> * High level Java-specific mixins ({{JavaEstimator}}, {{JavaModel}}, 
> {{JavaParams}}, {{JavaPredictor}}, {{JavaPredictorParams}}, 
> {{JavaPredictionModel}}, {{JavaWrapper}}) should be used only if 
> wrapper-specific implementation is needed and mixed into concrete 
> implementations only at the bottom of the type hierarchy, i.e.
> {code:python}
> class PredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ...
> class ClassifierParams(HasRawPredictionCol, PredictorParams): ...
> class PredictionModel(Model, PredictorParams): ...
> class ClassificationModel(PredictionModel, ClassifierParams):
> class ProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, 
> ClassifierParams): ...
> class ProbabilisticClassificationModel(ClassificationModel, 
> ProbabilisticClassifierParams): ...
> class LogisticRegressionModel(JavaModel, ProbabilisticClassificationModel, 
> JavaMLWritable, JavaMLReadable, HasTrainingSummary): ...
> {code}
> not
> {code:python}
> class JavaPredictionModel(JavaModel, JavaPredictorParams): ...
> class JavaPredictorParams(HasLabelCol, HasFeaturesCol, HasPredictionCol): ...
> class JavaClassifierParams(HasRawPredictionCol, JavaPredictorParams): ...
> class JavaClassificationModel(JavaPredictionModel, JavaClassifierParams): ...
> class JavaProbabilisticClassifierParams(HasProbabilityCol, HasThresholds, 
> JavaClassifierParams): ...
> class JavaProbabilisticClassificationModel(JavaClassificationModel, 
> JavaProbabilisticClassifierParams): ...
> class LogisticRegressionModel(JavaProbabilisticClassificationModel, 
> JavaMLWritable, JavaMLReadable, HasTrainingSummary): ...
>  {code}
> Internal compositions a level above intended leaf, i.e.
> {code:python}
> class _JavaProbabilisticClassificationModel(JavaModel, 
> ProbabilisticClassificationModel): ...
> {code}
> could be acceptable, but are hardly justified.
> - We can assume that the bottom built-in {{Transformers}} are designed as 
> final, and don't have to be designed for inheritance by the 3rd party 
> projects.
> - We should assume that Scala public intermediary traits mapped to Python 
> mixins will be extended by the 3rd party projects, and design these 
> accordingly.
> - (Optionally) Some of the existing types designed for JVM-interop 
> ({{JavaModel}}, {{JavaEstimator}}) could be made internal, as they don't 
> provide crucial information for the end user. Individual type checks like
> {code:python}
> issubclass(LogisticRegression, JavaWrapper)
> issubclass(LogisticRegression, Estimator) 
> {code}
> or
> {code:python}
> isinstance(LogisticRegression(), Estimator)
> isinstance(LogisticRegression(), JavaWrapper)
> {code}
> are more likely in the user code than
> {code:python}
> isinstance(LogisticRegression(), JavaEstimator)
> {code}
> At least top level {{JavaWrapper}} *must* be public, as it provides crucial 
> information for the final user (for example allows to distinguish between 
> objects that can pickled, and such that require more complex serialization 
> strategy). This is possibly a breaking change, hence should be implemented in 
> a major release.
> - If the above suggestions are approved and implemented we'll see two 
> different usages of PEP8 internal naming - one for objects that are 
> {{private}} and not designed for extension (for example [different 
> compositions used for specific {{Estimator}} / {{Model}} 
> pairs|https://github.com/apache/spark/blob/8556710409d9f2fbaee9dbf76a2ea70218316693/python/pyspark/ml/clustering.py#L99])
>  and these which are used for PySpark internals (primarily Java wrappers). 
> The latter one are not intended for end users, but can be still utilized by 
> 3rd party developers providing custom multilingual libraries. Additional 
> guidance in the docstrings would be advised.
> *Non-goals*:
> - Reducing overhead of maintaining wrappers in general (though [some 
> related|https://gist.github.com/zero323/ee36bce57ddeac82322e3ab4ef547611] 
> ideas emerged here).
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-29212) Add common classes without using JVM backend

Reply via email to