Re: [PROPOSAL] sparklyr

Dave Fisher Sat, 19 Oct 2019 09:37:49 -0700

Hi -

An interesting proposal. I am concerned about the very small size of the
Initial Committer list: only 3 individuals, one of whom I see only small
contributions from on https://github.com/rstudio/sparklyr/graphs/contributors


Do the Mentors intend to be active participants in the community?

Also, Sean will need to join the IPMC, which is easy for him to request.

Regards,
Dave

> On Oct 19, 2019, at 8:53 AM, Kevin Kuo <keviny...@gmail.com> wrote:
> 
> Greetings!
> 
> We are proposing to enter sparklyr (https://spark.rstudio.com/), an open
> source R package for interfacing with Apache Spark, into incubation. Please
> see the proposal below.
> 
> ======
> 
> = Abstract =
> 
> sparklyr is an open source R package providing an interface to Apache
> Spark, a system for large-scale data analysis on clusters. It provides a
> dplyr interface for manipulating Spark DataFrames, supports the Spark ML
> and Structured Streaming components, and offers a developer API to create
> extensions.
> 
> = Proposal =
> 
> The sparklyr project, along with the ecosystem of extensions it supports,
> aims to democratize the capabilities of Apache Spark for R users, who
> represent a significant portion of data scientists today. The API is
> designed to reduce friction for users transitioning from local, “small
> data” workflows to computing on clusters, while preserving the flexibility
> of Apache Spark as much as possible. Some features include:
> 
> - It is compatible with the tidyverse ecosystem of packages, which is a
> popular collection of libraries for data science in R. Specifically, one
> can use `dplyr` verbs to manipulate Spark DataFrames. However, one can also
> use sparklyr without using tidyverse packages.
> - It features an extensions API that allows users to easily wrap existing
> Spark packages written in Scala. This has enabled the development of
> sparkxgb (interface for xgboost4j), graphframes (interface for
> GraphFrames), mleap (interface for MLeap), and sparktf (interface for Spark
> TensorFlow connector), to name a few.
> 
> = Rationale =
> 
> By becoming an Apache project, sparklyr can better align with the Apache
> Spark project, and encourage stronger collaboration among users and
> contributors in the R and Apache communities. Culturally, sparklyr is also
> a good fit for ASF: the development of the project has adhered to the
> Apache Way since inception, and the current contributors are committed to
> upholding those values.
> 
> = Initial Goals =
> 
> The initial goals will be to move the existing codebase to Apache and the
> documentation from the RStudio domain to Apache.
> 
> = Current Status =
> 
> == Meritocracy ==
> 
> The sparklyr project has operated on meritocratic principles since
> inception. We have accepted major patches from developers outside RStudio,
> and have operated with the implicit expectation that contributors to major
> features maintain those features.
> 
> == Community ==
> 
> The sparklyr project currently has 699 stars on GitHub, 52 direct
> contributors, ~1,400 issues (approximately 500 of those are open), and
> approximately 194,000 downloads from CRAN each month. The documentation
> website spark.rstudio.com achieves ~15k visitors per month. There are also
> more than 15 open source extensions written that implement features such as
> genomic analysis and interoperability with databases.
> 
> = Known Risks =
> 
> == Reliance on Salaried Developers ==
> 
> sparklyr is currently maintained by salaried developers at RStudio and
> receives some ongoing contributions from the community, although all
> committers are employed by RStudio. We hope that by becoming an Apache
> project, the project will garner additional developer interest and expand
> the diversity of committers.
> 
> = Documentation =
> 
> Documentation of the project can be found at https://spark.rstudio.com/ and
> https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. There is
> also a free online book, available at https://therinspark.com/, that can be
> used as a reference.
> 
> = Initial Source =
> 
> The sparklyr codebase is currently hosted on GitHub:
> https://github.com/rstudio/sparklyr. sparklyr has been Apache 2.0 licensed
> since inception. RStudio currently maintains CLAs from all significant
> contributors. RStudio does not own the copyright of sparklyr and it is not
> a trademark.
> 
> = External Dependencies =
> 
> We note that `sparklyr` imports some R packages whose licenses are not
> Apache-compatible; however, these packages are not distributed with the
> project. Note, for example, that R itself is GPLv2 licensed.
> 
> = Required Resources =
> 
> - Mailing lists: {users, dev, commits}@sparklyr.incubator.apache.org
> - GitHub repo
> - If possible, we would like to continue using GitHub for issue tracking,
> as it is much more familiar to the R community than JIRA.
> 
> = Project Name =
> 
> There is sufficient goodwill built around the package so we would like to
> keep the name. sparklyr is pronounced spark-lee-R, i.e., it does not rhyme with
> the data manipulation package dplyr, and is never capitalized. Incorrect
> spellings include SparklyR and sparklyR.
> 
> = Initial Committers =
> 
> Javier Luraschi <jav...@rstudio.com> (RStudio)
> Kevin Kuo <keviny...@gmail.com> (RStudio)
> Hossein Falaki <hoss...@databricks.com> (Databricks)
> 
> = Sponsors =
> 
> == Champion ==
> 
> Xiangrui Meng
> 
> == Nominated Mentors ==
> 
> Xiangrui Meng
> Felix Cheung
> Sean R. Owen


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
