Re: [PROPOSAL] sparklyr

2019-10-26 Thread Kevin Kuo
A big thanks to all who have left feedback! After much deliberation, we
have decided to withdraw this proposal for the time being. The questions
around licenses are delicate, and we are currently not ready to navigate
them.

Cheers,
Kevin

On Mon, Oct 21, 2019 at 11:52 PM 申远 wrote:

> You can also read the documentation [1] on which licenses are allowed in
> ASF projects.
>
> [1] https://apache.org/legal/resolved.html#category-a
>
> Best Regards,
> YorkShen
>
> 申远
>
>
> On Tue, Oct 22, 2019 at 2:49 PM 申远 wrote:
>
> > Based on my experience (wearing my Apache Weex hat), GPL/LGPL
> > dependencies are not compatible with ASF policy, and you may want to fix
> > the license problem at the beginning, even before entering the Incubator.
> > Otherwise, GPL/LGPL dependencies will give you a lot more pain than you'd
> > ever expect.
> >
> > Best Regards,
> > YorkShen
> >
> > 申远
> >
> >
> > On Tue, Oct 22, 2019 at 2:55 AM Javier Luraschi wrote:
> >
> >> Regarding licenses, dplyr is under MIT, see:
> >> https://github.com/tidyverse/dplyr/blob/master/LICENSE.md. However, other
> >> packages are under GPL2.
> >>
> >> Here are all the packages that sparklyr currently depends on and their
> >> associated licenses. (This was retrieved from
> >> https://CRAN.R-project.org/package=, since the R package repository
> >> (CRAN) requires each package's license to be clearly defined.)
> >>
> >> assertthat: GPL-3
> >> base64enc: GPL-2 | GPL-3
> >> config: GPL-3
> >> DBI: LGPL-2 | LGPL-2.1 | LGPL-3
> >> dplyr: MIT
> >> dbplyr: MIT
> >> digest: GPL-2 | GPL-3
> >> forge: Apache
> >> generics: GPL-2
> >> httr: MIT
> >> jsonlite: MIT
> >> openssl: MIT
> >> purrr: GPL-3
> >> r2d3: BSD-3
> >> rappdirs: MIT
> >> rlang: GPL-3
> >> rprojroot: GPL-3
> >> rstudioapi: MIT
> >> tibble: MIT
> >> tidyr: MIT
> >> withr: GPL-2 | GPL-3
> >> xml2: GPL-2 | GPL-3
> >> ellipsis: GPL-3
> >>
> >>
> >> On Mon, Oct 21, 2019 at 1:12 AM Justin Mclean wrote:
> >>
> >> > Hi,
> >> >
> >> > I am also concerned that the initial committer list only contains 3
> >> > committers. Why have you not included others in the community who have
> >> > made contributions?
> >> >
> >> > I don’t know if this is an issue or not, but I’m bringing it up just in
> >> > case you are not aware. I can see that some of the tidyverse packages are
> >> > under GPL2, and the GPL license is not compatible with the ALv2. I’m not
> >> > 100% sure what license dplyr is under. I can see that sparklyr depends on
> >> > several (10+) GPL-licensed pieces of software. Do you see this causing
> >> > any issues, given that GPL code can’t be included in an Apache source
> >> > release and can’t be a non-optional dependency of an ASF project? Have
> >> > you discussed this with your champion or proposed mentors, and have they
> >> > flagged this as a possible issue?
> >> >
> >> > I can see that one of the proposed mentors is not an IPMC member (which
> >> > is required) and another seems not very active in signing off reports or
> >> > voting on releases. Do you think the existing mentors will provide your
> >> > project with enough support?
> >> >
> >> > Thanks,
> >> > Justin
> >> >
> >> > 1. https://github.com/tidyverse/dplyr/blob/master/LICENSE
> >> >
> >> >
> >>
> >
>


[PROPOSAL] sparklyr

2019-10-19 Thread Kevin Kuo
Greetings!

We are proposing to enter sparklyr (https://spark.rstudio.com/), an open
source R package for interfacing with Apache Spark, into incubation. Please
see the proposal below.

==

= Abstract =

sparklyr is an open source R package providing an interface to Apache
Spark, a system for large-scale data analysis on clusters. It provides a
dplyr interface for manipulating Spark DataFrames, supports the Spark ML
and Structured Streaming components, and offers a developer API to create
extensions.

= Proposal =

The sparklyr project, along with the ecosystem of extensions it supports,
aims to democratize the capabilities of Apache Spark for R users, who
represent a significant portion of data scientists today. The API is
designed to reduce friction for users transitioning from local, “small
data” workflows to computing on clusters, while preserving the flexibility
of Apache Spark as much as possible. Some features include:

- It is compatible with the tidyverse ecosystem of packages, which is a
popular collection of libraries for data science in R. Specifically, one
can use `dplyr` verbs to manipulate Spark DataFrames. However, one can also
use sparklyr without using tidyverse packages.
- It features an extensions API that allows users to easily wrap existing
Spark packages written in Scala. This has enabled the development of
sparkxgb (interface for xgboost4j), graphframes (interface for
GraphFrames), mleap (interface for MLeap), and sparktf (interface for Spark
TensorFlow connector), to name a few.

= Rationale =

By becoming an Apache project, sparklyr can better align with the Apache
Spark project, and encourage stronger collaboration among users and
contributors in the R and Apache communities. Culturally, sparklyr is also
a good fit for ASF: the development of the project has adhered to the
Apache way since inception, and the current contributors are committed to
upholding those values.

= Initial Goals =

The initial goals will be to move the existing codebase to Apache and the
documentation from the RStudio domain to Apache.

= Current Status =

== Meritocracy ==

The sparklyr project has operated on meritocratic principles since
inception. We have accepted major patches from developers outside RStudio,
and have operated with the implicit expectation that contributors to major
features maintain those features.

== Community ==

The sparklyr project currently has 699 stars on GitHub, 52 direct
contributors, ~1,400 issues (approximately 500 of those are open), and
approximately 194,000 downloads from CRAN each month. The documentation
website spark.rstudio.com achieves ~15k visitors per month. There are also
more than 15 open source extensions written that implement features such as
genomic analysis and interoperability with databases.

= Known Risks =

== Reliance on Salaried Developers ==

sparklyr is currently maintained by salaried developers at RStudio and
receives some ongoing contributions from the community, although all
committers are employed by RStudio. We hope that by becoming an Apache
project, the project will garner additional developer interest and expand
the diversity of committers.

= Documentation =

Documentation of the project can be found at https://spark.rstudio.com/ and
https://cran.r-project.org/web/packages/sparklyr/sparklyr.pdf. There is
also a free online book, available at https://therinspark.com/, that can be
used as a reference.

= Initial Source =

The sparklyr codebase is currently hosted on GitHub:
https://github.com/rstudio/sparklyr. sparklyr has been Apache 2.0 licensed
since inception. RStudio currently maintains CLAs from all significant
contributors. RStudio does not own the copyright of sparklyr and it is not
a trademark.

= External Dependencies =

We remark that `sparklyr` imports some R packages whose licenses are not
Apache-compatible; however, these packages are not distributed with the
project. Note, for example, that R itself is GPLv2 licensed.

= Required Resources =

- Mailing lists: {users, dev, commits}@sparklyr.incubator.apache.org
- GitHub repo
- If possible, we would like to continue using GitHub for issue tracking,
as it is much more familiar to the R community than JIRA.

= Project Name =

There is sufficient goodwill built around the package so we would like to
keep the name. sparklyr is pronounced spark-lee-R, i.e. does not rhyme with
the data manipulation package dplyr, and is never capitalized. Incorrect
spellings include SparklyR and sparklyR.

= Initial Committers =

Javier Luraschi (RStudio)
Kevin Kuo (RStudio)
Hossein Falaki (Databricks)

= Sponsors =

== Champion ==

Xiangrui Meng

== Nominated Mentors ==

Xiangrui Meng
Felix Cheung
Sean R. Owen