[https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488512#comment-16488512]

Felix Cheung edited comment on SPARK-24359 at 5/24/18 6:34 AM:
---------------------------------------------------------------

re this 
 * This also leads to the question of how the SparkML package APIs are going to 
depend on Spark versions. Are we only going to have code that depends on older 
Spark releases or are we going to have cases where we introduce the Java/Scala 
side code at the same time as the R API?

The mix and match of different Spark / SparkR / SparkML release versions might 
need to be checked and enforced more strongly.

For instance, what if someone has Spark 2.3.0, SparkR 2.3.1, and SparkML 2.4 - 
does it work at all? What kind of quality (as in, testing) would we assert for 
something like this?
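
One way to enforce this (just a sketch; SparkML follows the proposed package name, but check_spark_compat and the exact policy are illustrative assumptions) would be a check run once a Spark session is available, warning when the R package and the JVM disagree on the major.minor version:

{code:java}
# Sketch only: compare the installed R package version against the Spark
# version reported by the running JVM and warn on a major.minor mismatch.
check_spark_compat <- function() {
  jvm <- package_version(SparkR::sparkR.version())   # e.g. "2.3.0" from the JVM
  pkg <- utils::packageVersion("SparkML")             # installed SparkML version
  if (jvm[1, 1:2] != pkg[1, 1:2]) {
    warning(sprintf("SparkML %s has not been tested against Spark %s", pkg, jvm))
  }
}
{code}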

re release

I'm not sure I'd advocate a separate git repo, though, simply because of the 
uncertainty around licensing and governance. If it's intended to be a fully 
separate project (like sparklyr, or SparkR in the early days), then by all 
means. And to agree with Shivaram's point, one other possibility is a new git 
repo under the ASF. It might also be more agile / faster to iterate in a 
separate codebase.

A separate repo also adds to the complexity of release mix & match - what if we 
need to patch Spark & SparkR for a critical security issue? Should we always 
re-release SparkML even when there is "no change"? Or are we going to allow 
"patch release compatibility" (or semantic versioning)? What if we have to make 
a protocol change in a patch release (thus breaking semantic versioning)? 
Either way this is complicated.

re R package

I'd recommend (learning from the year (?!) we spent with SparkR) completely 
decoupling all R package content (package tests, vignettes) from any JVM 
dependency, i.e. the package could be made to run standalone without a Java 
JRE/JDK and without the Spark release jar. This would make getting it submitted 
to CRAN much, much easier...
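
For example (a sketch only; skip_if_no_spark is a hypothetical helper, not an existing API), the package's testthat setup could guard every JVM-dependent test so the suite still passes on a machine with no Java and no Spark jars - which is roughly the environment CRAN checks run in:

{code:java}
# Hypothetical helper (sketch): skip JVM-dependent tests when Spark/Java is absent.
skip_if_no_spark <- function() {
  have_java  <- nzchar(Sys.which("java"))
  have_spark <- nzchar(Sys.getenv("SPARK_HOME"))
  if (!have_java || !have_spark) {
    testthat::skip("Java/Spark not available; skipping JVM-dependent test")
  }
}

testthat::test_that("pipeline fit round-trips", {
  skip_if_no_spark()
  # ... tests that actually touch the JVM go here ...
})
{code}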

re naming

Sounds like you are planning to have these as S4 classes/generics/methods then.

Again, I'd strongly recommend against mixing the spark.name AND set_param 
styles. Not only is it inconsistent, the spark.name style is generally flagged 
(there is a lintr rule on this) and conflicts with S3 OO style / method 
dispatch ([http://adv-r.had.co.nz/OO-essentials.html#s3]), e.g. mean.a

(yes, I realize there are also many examples of . in method names)
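
To make the dispatch conflict concrete (a contrived sketch, not code from the proposal): once any object carries a class name that matches the suffix of a dotted function, S3 quietly treats that function as a method:

{code:java}
# Contrived example of why dots in function names collide with S3 dispatch.
mean.a <- function(x, ...) "surprise!"   # reads like an ordinary function name...

x <- structure(1:10, class = "a")
mean(x)   # ...but S3 dispatches here as the mean() method for class "a": "surprise!"
{code}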

 

That's it for now. More to add after review.

 


> SPIP: ML Pipelines in R
> -----------------------
>
>                 Key: SPARK-24359
>                 URL: https://issues.apache.org/jira/browse/SPARK-24359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.0.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly 
> API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/].
>  Since Spark 1.5 the (new) SparkML API which is based on [pipelines and 
> parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o]
>  has matured significantly. It allows users to build and maintain complicated 
> machine learning pipelines. A lot of this functionality is difficult to 
> expose using the simple formula-based API in SparkR.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as 
> part of Apache Spark. This new package will be built on top of SparkR’s APIs 
> to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions. Many of these shadow functions in 
> base and other popular CRAN packages. We think adding more functions to 
> SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R 
> users. sparklyr includes MLlib API wrappers, but to the best of our knowledge 
> they are not comprehensive. Also, we propose a code-gen approach for this 
> package to minimize the work needed to expose future MLlib APIs, but sparklyr's 
> API is manually written.
> h1. Target Personas
>  * Existing SparkR users who need a more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML 
> pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independently of SparkR
>  * After setting up a Spark session, R users can
>  ** create a pipeline by chaining individual components and specifying their 
> parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and 
> Transformers
> h1. Non-Goals
>  * Adding new algorithms to the SparkML R package that do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following 
> list of priorities. The API choice that addresses a higher priority goal will 
> be chosen.
>  # *Comprehensive coverage of MLlib API:* Design choices that make R coverage 
> of future ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. 
> Between conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual 
> maintenance or make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark components:* We will keep the R API 
> as thin as possible and keep all functionality implementation in JVM/Scala.
>  # *Being natural to R users:* The ultimate users of this package are R users, 
> and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as 
> the first argument of the method:  do_something(obj, arg1, arg2). All 
> constructors are dot separated (e.g., spark.logistic.regression()) and all 
> setters and getters are snake case (e.g., set_max_iter()). If a constructor 
> gets arguments, they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1){code}
> When calls need to be chained, like in the above example, the syntax can nicely 
> translate to a natural pipeline style with help from the very popular [magrittr 
> package|https://cran.r-project.org/web/packages/magrittr/index.html]. For 
> example:
> {code:java}
> > spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> 
> > lr{code}
> h2. Namespace
> All new APIs will be under a new CRAN package named SparkML. The package 
> should be usable without needing SparkR in the namespace. The package will 
> introduce a number of S4 classes that inherit from four basic classes. Here 
> we will list the basic types with a few examples. An object of any child 
> class can be instantiated with a function call whose name starts with spark.
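> As a rough sketch (the class and slot names below are illustrative assumptions, not part of this proposal), the S4 skeleton could look like:
> {code:java}
> # Sketch only: illustrative S4 hierarchy for the basic types listed in the
> # sections below, each wrapping a handle to its JVM counterpart.
> setClass("PipelineStage", slots = c(jobj = "ANY"))
> setClass("Transformer", contains = "PipelineStage")
> setClass("Estimator", contains = "PipelineStage")
> setClass("Evaluator", slots = c(jobj = "ANY"))
> setClass("Tokenizer", contains = "Transformer")
>
> # A constructor such as spark.tokenizer() would return a child-class object,
> # so standard S4 introspection applies: is(tokenizer, "Transformer")  # TRUE
> {code}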
> h2. Pipeline & PipelineStage
> A pipeline object contains one or more stages.  
> {code:java}
> > pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3){code}
> Here stage1, stage2, etc. are S4 objects of type PipelineStage, and pipeline is 
> an object of type Pipeline.
> h2. Transformers
> A Transformer is an algorithm that can transform one SparkDataFrame into 
> another SparkDataFrame.
> *Example API:*
> {code:java}
> > tokenizer <- spark.tokenizer() %>%
>             set_input_col("text") %>%
>             set_output_col("words")
> > tokenized.df <- tokenizer %>% transform(df) {code}
> h2. Estimators
> An Estimator is an algorithm which can be fit on a SparkDataFrame to produce 
> a Transformer. E.g., a learning algorithm is an Estimator which trains on a 
> DataFrame and produces a model.
> *Example API:*
> {code:java}
> lr <- spark.logistic.regression() %>%
>             set_max_iter(10) %>%
>             set_reg_param(0.001){code}
> h2. Evaluators
> An Evaluator computes metrics from predictions (model outputs) and returns a 
> scalar metric.
> *Example API:*
> {code:java}
> lr.eval <- spark.regression.evaluator(){code}
> h2. Miscellaneous Classes
> MLlib pipelines have a variety of miscellaneous classes that serve as helpers 
> and utilities. For example, an object of ParamGridBuilder is used to build a 
> grid search pipeline. Another example is ClusteringSummary.
> *Example API:*
> {code:java}
> > grid <- param.grid.builder() %>%
>             add_grid(reg_param(lr), c(0.1, 0.01)) %>%
>             add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
>             add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
>  > model <- train.validation.split() %>%
>             set_estimator(lr) %>%
>             set_evaluator(spark.regression.evaluator()) %>%
>             set_estimator_param_maps(grid) %>%
>             set_train_ratio(0.8) %>%
>             set_parallelism(2) %>%
>             fit(){code}
> h2. Pipeline Persistence
> The SparkML package will fix a longstanding issue with SparkR model persistence 
> (SPARK-15572). SparkML will directly wrap the MLlib pipeline persistence API. 
> *API example:*
> {code:java}
> > model <- pipeline %>% fit(training)
> > model %>% spark.write.pipeline(overwrite = TRUE, path = "..."){code}
> h1. Design Sketch
> We propose using code generation from Scala to produce comprehensive API 
> wrappers in R. For more details please see the attached design document.
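> As a rough illustration of the code-gen idea (a sketch under assumptions: the generated names and the use of SparkR's low-level JVM bridge are illustrative, not taken from the design document), the generator could emit one constructor plus one setter per Param along these lines:
> {code:java}
> # Sketch only: what a generated wrapper might look like; class, slot and
> # function names here are assumptions.
> setClass("LogisticRegression", slots = c(jobj = "ANY"))
>
> spark.logistic.regression <- function() {
>   jobj <- SparkR::sparkR.newJObject(
>     "org.apache.spark.ml.classification.LogisticRegression")
>   new("LogisticRegression", jobj = jobj)
> }
>
> set_max_iter <- function(x, value) {
>   SparkR::sparkR.callJMethod(x@jobj, "setMaxIter", as.integer(value))
>   x
> }
> {code}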


