Hossein Falaki created SPARK-24359:
--------------------------------------

             Summary: SPIP: ML Pipelines in R
                 Key: SPARK-24359
                 URL: https://issues.apache.org/jira/browse/SPARK-24359
             Project: Spark
          Issue Type: Improvement
          Components: SparkR
    Affects Versions: 3.0.0
            Reporter: Hossein Falaki
         Attachments: SparkML_ ML Pipelines in R.pdf

h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

The SparkR package contains ~300 functions. Many of these shadow functions in base R and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-generation approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written.
h1. Target Personas
 * Existing SparkR users who need a more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML 
pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independently of SparkR
 * After setting up a Spark session, R users can
 ** create a pipeline by chaining individual components and specifying their parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers

h1. Non-Goals
 * Adding new algorithms to the SparkML R package that do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing existing SparkR ML wrapping API

h1. Proposed API Changes
h2. Design goals

When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher-priority goal will be chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out.
 # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
 # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
 # *Interoperability with the rest of Spark components:* We will keep the R API as thin as possible and keep all functionality implementation in JVM/Scala.
 # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All constructors are dot-separated (e.g., spark.logistic.regression()) and all setters and getters are snake_case (e.g., set_max_iter()). If a constructor takes arguments, they will be named arguments. For example:

> lr <- set_reg_param(set_max_iter(spark.logistic.regression(), 10), 0.1)
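
Because constructor arguments are named, the same configuration could also be passed at construction time; the following line is a hypothetical illustration (the argument names are assumptions, not the proposed API):

> lr <- spark.logistic.regression(max_iter = 10, reg_param = 0.1)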

When setter calls need to be chained, as in the nested example above, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:

> spark.logistic.regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
h2. Namespace

All new APIs will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we list the basic types with a few examples; a rough sketch of the class hierarchy follows. An object of any child class can be instantiated with a function call that starts with spark.
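
As an illustration only (these definitions are not from the design document), the basic classes could be declared roughly as follows, assuming each object wraps a reference to its JVM counterpart in a jobj slot:

# Hypothetical sketch; class and slot names are assumptions, not the final API.
library(methods)

setClass("PipelineStage", slots = c(jobj = "ANY"))   # wraps a JVM object reference
setClass("Pipeline", contains = "PipelineStage")     # chains PipelineStages
setClass("Transformer", contains = "PipelineStage")  # transforms a SparkDataFrame
setClass("Estimator", contains = "PipelineStage")    # fit() produces a Transformer
setClass("Evaluator", slots = c(jobj = "ANY"))       # computes a scalar metric
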
h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.  

> pipeline <- spark.pipeline() %>% set_stages(stage1, stage2, stage3)

Here stage1, stage2, etc. are S4 objects of type PipelineStage, and pipeline is an object of type Pipeline.
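
Although this section only shows construction, the fitted pipeline would be used the same way as in the persistence example later in this document; a hypothetical end-to-end use, where training and test are assumed SparkDataFrames:

> model <- pipeline %>% fit(training)
> predictions <- model %>% transform(test)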
h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into 
another SparkDataFrame.

*Example API:*

> tokenizer <- spark.tokenizer() %>%
      set_input_col("text") %>%
      set_output_col("words")
> tokenized.df <- tokenizer %>% transform(df)
h2. Estimators

An Estimator is an algorithm which can be fit on a SparkDataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a SparkDataFrame and produces a model.

*Example API:*

> lr <- spark.logistic.regression() %>%
      set_max_iter(10) %>%
      set_reg_param(0.001)
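
As a usage illustration (not from the design document), fitting this estimator would produce a model, which is itself a Transformer; training.df is an assumed SparkDataFrame with the expected label and feature columns:

> lr.model <- lr %>% fit(training.df)
> predictions <- lr.model %>% transform(training.df)
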
h2. Evaluators

An Evaluator computes a scalar metric from predictions (model outputs).

*Example API:*

> lr.eval <- spark.regression.evaluator()
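
As an illustration, evaluating a set of predictions might look like the following; evaluate() and the predictions SparkDataFrame are assumptions modeled on MLlib’s Scala Evaluator API:

> rmse <- lr.eval %>% evaluate(predictions)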
h2. Miscellaneous Classes

MLlib pipelines have a variety of miscellaneous classes that serve as helpers and utilities. For example, an object of ParamGridBuilder is used to build a parameter grid for grid search. Another example is ClusteringSummary.

*Example API:*

> grid <- param.grid.builder() %>%
      add_grid(reg_param(lr), c(0.1, 0.01)) %>%
      add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
      add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))

> model <- train.validation.split() %>%
      set_estimator(lr) %>%
      set_evaluator(spark.regression.evaluator()) %>%
      set_estimator_param_maps(grid) %>%
      set_train_ratio(0.8) %>%
      set_parallelism(2) %>%
      fit(training)
h2. Pipeline Persistence

The SparkML package will fix a longstanding issue with SparkR model persistence (SPARK-15572). SparkML will directly wrap MLlib’s pipeline persistence API.

*API example:*

> model <- pipeline %>% fit(training)

> model %>% spark.write.pipeline(overwrite = TRUE, path = "...")
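
Only the write path is shown above; a matching read API would presumably mirror MLlib’s save/load symmetry. The function name below is a hypothetical placeholder, not part of the proposal:

> model <- spark.read.pipeline(path = "...")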
h1. Design Sketch

We propose using code generation from Scala to produce comprehensive API 
wrappers in R. For more details please see the attached design document.
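
To make the idea concrete, here is a rough sketch (an assumption, not the actual generator output) of what a generated setter could look like, using SparkR’s internal callJMethod bridge and the hypothetical jobj slot sketched earlier:

# Hypothetical generated wrapper for LogisticRegression's setMaxIter; the code
# generator would emit one such setter per Param declared in the Scala class.
set_max_iter <- function(obj, value) {
  obj@jobj <- SparkR:::callJMethod(obj@jobj, "setMaxIter", as.integer(value))
  obj  # return the modified object so calls can be chained with %>%
}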
