[ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joseph K. Bradley updated SPARK-24359:
--------------------------------------
    Description:
h1. Background and motivation

SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5 the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose using the simple formula-based API in SparkR.

We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality.

*Why not SparkR?*

The SparkR package contains ~300 functions. Many of these shadow functions in base and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder.

*Why not sparklyr?*

sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-gen approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr’s API is manually written.

h1. Target Personas
 * Existing SparkR users who need a more flexible SparkML API
 * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R

h1. Goals
 * R users can install SparkML from CRAN
 * R users will be able to import SparkML independently of SparkR
 * After setting up a Spark session R users can
 ** create a pipeline by chaining individual components and specifying their parameters
 ** tune a pipeline in parallel, taking advantage of Spark
 ** inspect a pipeline’s parameters and evaluation metrics
 ** repeatedly apply a pipeline
 * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers

h1. Non-Goals
 * Adding new algorithms to the SparkML R package which do not exist in Scala
 * Parallelizing existing CRAN packages
 * Changing the existing SparkR ML wrapping API

h1. Proposed API Changes

h2. Design goals

When encountering trade-offs in the API, we will choose based on the following list of priorities. The API choice that addresses a higher priority goal will be chosen.
 # *Comprehensive coverage of MLlib API:* Design choices that make R coverage of future ML algorithms difficult will be ruled out.
 # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
 # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
 # *Interoperability with the rest of Spark's components:* We will keep the R API as thin as possible and keep all functionality implementation in JVM/Scala.
 # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.

The API will follow familiar R function syntax, where the object is passed as the first argument of the method: do_something(obj, arg1, arg2). All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments.
For example:
{code:java}
> lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1)
{code}
When calls need to be chained, as in the example above, the syntax translates nicely to a natural pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:
{code:java}
> spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.01) -> lr
{code}

h2. Namespace

All new API will be under a new CRAN package, named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we will list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}.

h2. Pipeline & PipelineStage

A pipeline object contains one or more stages.
{code:java}
> pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3)
{code}
Here stage1, stage2, etc. are S4 objects of class PipelineStage, and pipeline is an object of class Pipeline.

h2. Transformers

A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.

*Example API:*
{code:java}
> tokenizer <- spark_tokenizer() %>%
    set_input_col("text") %>%
    set_output_col("words")
> tokenized.df <- tokenizer %>% transform(df)
{code}

h2. Estimators

An Estimator is an algorithm which can be fit on a SparkDataFrame to produce a Transformer. E.g., a learning algorithm is an Estimator which trains on a DataFrame and produces a model.

*Example API:*
{code:java}
lr <- spark_logistic_regression() %>%
    set_max_iter(10) %>%
    set_reg_param(0.001)
{code}

h2. Evaluators

An evaluator computes metrics from predictions (model outputs) and returns a scalar metric.

*Example API:*
{code:java}
lr.eval <- spark_regression_evaluator()
{code}

h2. Miscellaneous Classes

MLlib pipelines have a variety of miscellaneous classes that serve as helpers and utilities.
For example, an object of ParamGridBuilder is used to build a grid search pipeline. Another example is ClusteringSummary.

*Example API:*
{code:java}
> grid <- param_grid_builder() %>%
    add_grid(reg_param(lr), c(0.1, 0.01)) %>%
    add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
    add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> model <- train_validation_split() %>%
    set_estimator(lr) %>%
    set_evaluator(spark_regression_evaluator()) %>%
    set_estimator_param_maps(grid) %>%
    set_train_ratio(0.8) %>%
    set_parallelism(2) %>%
    fit()
{code}

h2. Pipeline Persistence

The SparkML package will fix a longstanding issue with SparkR model persistence (SPARK-15572). SparkML will directly wrap the MLlib pipeline persistence API.

*Example API:*
{code:java}
> model <- pipeline %>% fit(training)
> model %>% spark_write_pipeline(overwrite = TRUE, path = "...")
{code}

h1. Design Sketch

We propose using code generation from Scala to produce comprehensive API wrappers in R. For more details please see the attached design document.

> SPIP: ML Pipelines in R
> -----------------------
>
> Key: SPARK-24359
> URL: https://issues.apache.org/jira/browse/SPARK-24359
> Project: Spark
> Issue Type: Improvement
> Components: SparkR
> Affects Versions: 3.0.0
> Reporter: Hossein Falaki
> Priority: Major
> Labels: SPIP
> Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R.pdf
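
The calling convention described under Design goals (the object is the first argument, and setters return the object so that calls can be chained) can be tried out without Spark. The sketch below is only an illustration under stated assumptions: the S4 classes, the placeholder default values, and the function bodies are hypothetical stand-ins that borrow names from the proposal, not code from the proposed SparkML package. It uses R's native pipe {{|>}} (R 4.1 or later) rather than magrittr's {{%>%}} so that it needs no extra packages.

```r
# Toy, Spark-free sketch of the proposed calling convention.
# Everything defined here is a hypothetical stand-in, not the SparkML package.
library(methods)

# Assumed base classes: a stage keeps its parameters in a named list.
setClass("PipelineStage", representation(params = "list"))
setClass("Estimator", contains = "PipelineStage")
setClass("LogisticRegression", contains = "Estimator")

# Constructor using the proposed spark_ prefix; defaults are placeholders.
spark_logistic_regression <- function() {
  new("LogisticRegression", params = list(max_iter = 100L, reg_param = 0))
}

# Setters take the object first and return a modified copy,
# which is what makes nesting and piping interchangeable.
setGeneric("set_max_iter",
           function(obj, value) standardGeneric("set_max_iter"))
setMethod("set_max_iter", "PipelineStage", function(obj, value) {
  obj@params$max_iter <- as.integer(value)
  obj
})

setGeneric("set_reg_param",
           function(obj, value) standardGeneric("set_reg_param"))
setMethod("set_reg_param", "PipelineStage", function(obj, value) {
  obj@params$reg_param <- value
  obj
})

# Nested call style.
lr_nested <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1)

# Equivalent piped style using the native pipe.
lr_piped <- spark_logistic_regression() |>
  set_max_iter(10) |>
  set_reg_param(0.1)
```

Both styles build the same object here. With magrittr attached, {{%>%}} could be used in place of {{|>}} for these calls.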