+1 (non-binding)

This seems like a really interesting project.

Q: Is Spark just a framework/API, or does it also have some tools implemented for data analytics?

--Pei

> -----Original Message-----
> From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
> Sent: Friday, May 31, 2013 2:04 PM
> To: general@incubator.apache.org
> Subject: [PROPOSAL] Apache Spark for the Incubator
>
> Hi Folks,
>
> I'm pleased to bring you a proposal to the Apache Incubator for the Apache
> Spark project: https://wiki.apache.org/incubator/SparkProposal
>
> The work originates from the Berkeley AMPLab, a number of industry
> participants, and other institutions. Spark is a framework for large-scale
> data analysis on clusters, with a particular focus on low latency
> operations. The source code is written in Scala, and provides a number of
> APIs and bindings in various programming languages.
>
> The proposal text is copied to the bottom of this email. I'm going to leave
> this thread open for the next week for discussion. Once it's died down, I'll
> call an official VOTE.
>
> Suresh, Ross G. -- heads up -- this project may be of interest to you both
> and we would welcome you guys as additional mentors. We currently have 3
> mentors committed to the project, but would love to have more. People
> interested in contributing should declare their interest here on the
> general@incubator thread and those potential contributors will be discussed
> by the incoming Spark community.
>
> Questions -- let's hear 'em! :)
>
> Cheers,
> Chris
> ("Champion", incoming Apache Spark)
>
> === Abstract ===
> Spark is an open source system for large-scale data analysis on clusters.
>
> === Proposal ===
> Spark is an open source system for fast and flexible large-scale data
> analysis. Spark provides a general purpose runtime that supports low-latency
> execution in several forms. These include interactive exploration of very
> large datasets, near real-time stream processing, and ad-hoc SQL analytics
> (through higher layer extensions).
> Spark interfaces with HDFS, HBase, Cassandra and several other storage
> layers, and exposes APIs in Scala, Java and Python.
>
> === Background ===
> Spark started as a U.C. Berkeley research project, designed to efficiently
> run machine learning algorithms on large datasets. Over time, it has evolved
> into a general computing engine as outlined above. Spark's developer
> community has also grown to include additional institutions, such as
> universities, research labs, and corporations. Funding has been provided by
> various institutions including the U.S. National Science Foundation, DARPA,
> and a number of industry sponsors. See:
> https://amplab.cs.berkeley.edu/sponsors/ for full details.
>
> === Rationale ===
> As the number of contributors to Spark has grown, we have sought a
> long-term home for the project, and we believe the Apache foundation
> would be a great fit. Spark is a natural fit for the Apache foundation: Spark
> already interoperates with several existing Apache projects (HDFS, HBase,
> Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
> with the Apache process and subscribes to the Apache mission - the
> team includes multiple Apache committers already. Finally, joining Apache
> will help coordinate the development effort of the growing number of
> organizations which contribute to Spark.
>
> == Initial Goals ==
> The initial goals will most likely be to move the existing codebase to Apache
> and integrate with the Apache development process. Furthermore, we plan
> for incremental development and releases in line with the Apache
> guidelines.
>
> === Current Status ===
> == Meritocracy ==
> The Spark project already operates on meritocratic principles. Today, Spark
> has several developers and has accepted multiple major patches from
> outside of U.C. Berkeley.
> While this process has remained mostly informal (we do not have an official
> committer list), an implicit organization exists in which individuals who
> contribute major components act as maintainers for those modules. If
> accepted, the Spark project would include several of these participants as
> committers from the outset. We will work to identify all committers and
> PPMC members for the project and to operate under the ASF meritocratic
> principles.
>
> === Community ===
> Acceptance into the Apache foundation would bolster the already strong
> user and developer community around Spark. That community includes
> dozens of contributors from several institutions, a meetup group with
> several hundred members, and an active mailing list composed of hundreds
> of users.
>
> === Core Developers ===
> The core developers of our project are listed in our contributors and initial
> PPMC below. Though many are at UC Berkeley, there is a representative
> cross-section of other organizations including Quantifind, Microsoft,
> Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.
>
> === Alignment ===
> Our proposed effort aligns with several ongoing BIGDATA and U.S. National
> priority funding interests including the NSF and its Expeditions program, and
> the DARPA XDATA project. Our industry partners and collaborators are well
> aligned with our code base.
>
> There are also a number of related Apache projects and dependencies, which
> are discussed in the Relationship with Other Apache Products section.
>
> == Known Risks ==
>
> === Orphaned Products ===
> Given the current level of investment in Spark, the risk of the project being
> abandoned is minimal. There are several constituents who are highly
> incentivized to continue development. The U.C. Berkeley AMPLab relies on
> Spark as a platform for a large number of long-term research projects.
> Several companies have built verticalized products which are tightly
> dependent on Spark.
> Other companies have devoted significant internal infrastructure
> investment to Spark.
>
> === Inexperience with Open Source ===
> Spark has existed as a healthy open source project for several years.
> During that time, Matei and others have curated an open-source community
> successfully, attracting developers from a diverse group of companies
> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, and
> Webtrends.
>
> === Homogenous Developers ===
> The initial list of committers includes developers from several institutions,
> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, and
> Webtrends.
>
> === Reliance on Salaried Developers ===
> Like most open source projects, Spark receives substantial support from
> salaried developers. A large fraction of Spark development is supported by
> graduate students at U.C. Berkeley in the course of research degrees - this is
> more a "volunteer" relationship, since in most cases students contribute
> vastly more than is necessary to immediately support research.
> In addition, those working from within corporations often devote "after
> hours" or spare time to the project - and these come from several
> organizations. We will work to ensure that the project can continue to be
> stewarded and to move forward independent of salaried developers.
>
> === Relationship with Other Apache Products ===
> Spark inter-operates with several existing Apache products by supporting
> them as storage layers: Apache Cassandra, Apache HBase, and Apache Hadoop
> (HDFS). It also uses several Apache components internally including Apache
> Maven and several Apache Commons libraries. Finally, Shark (a higher layer
> framework built on Spark) inter-operates with Apache Hive.
> We will explore the relationship between Spark and Apache Gora, which also
> provides in-memory object storage (Champion Mattmann was the Champion for
> Apache Gora, so we expect alignment and cross-pollination between our
> efforts).
>
> Spark offers an alternative computation engine to Apache Hadoop
> (MapReduce). Unlike MapReduce, Spark is designed for lower-latency and
> interactive workloads. This makes the projects complementary: many users
> run MapReduce and Spark side-by-side.
>
> === An Excessive Fascination with the Apache Brand ===
> Spark is already a healthy and relatively well known open source project.
> This proposal is not for the purpose of generating publicity. Rather, the
> primary benefits to joining Apache are those outlined in the Rationale
> section.
>
> === Documentation ===
> The reader will find these websites highly relevant:
> * Spark website: http://spark-project.org/
> * Spark documentation: http://spark-project.org/documentation/
> * Issue tracking: https://spark-project.atlassian.net/
> * Codebase: https://github.com/mesos/spark
> * User group: https://groups.google.com/group/spark-users
>
> == Initial Source ==
> The Spark codebase is currently hosted on Github:
> https://github.com/mesos/spark. This is the exact codebase that we would
> migrate to the Apache foundation.
>
> == Source and Intellectual Property Submission Plan ==
> Currently, the Spark codebase is distributed under a BSD license. The vast
> majority of code has copyright held by the University of California. Upon
> entering Apache, Spark will migrate to an Apache License with all copyright
> assigned to the Apache Foundation. The University of California will transfer
> all copyright to the Apache Foundation. In certain cases where individuals
> hold copyright, we will have individuals sign over copyright to the Apache
> foundation as well.
>
> Going forward, all commits would assign copyright directly to the Apache
> foundation through our signed Individual Contributor License Agreements
> for all initial committers on the project.
>
> == External Dependencies ==
> To the best of our knowledge, all dependencies of Spark are distributed
> under Apache compatible licenses. Upon acceptance to the incubator, we
> would begin a thorough analysis of all transitive dependencies to verify this
> fact and introduce license checking into the build and release process (for
> instance integrating Apache Rat).
>
> == Required Resources ==
> === Mailing list ===
> We will migrate the existing Spark mailing lists as follows:
>
> * spark-users@googlegroups --> us...@spark.incubator.apache.org
> * spark-developers@googlegroups --> d...@spark.incubator.apache.org
> * spark-commits are hosted on Github, so we would request
>   comm...@spark.incubator.apache.org
>
> The latter is to be consistent with the new PIAO naming scheme for podlings.
>
> === Source control ===
> The Spark team would like to use Git for source control, due to our current
> use of Git. We request a writeable Git repo for Spark, and mirroring to be
> set up to Github through INFRA. Champion Mattmann can assist with creating
> INFRA tickets for this.
>
> === Issue Tracking ===
> Spark currently uses a hosted JIRA deployment for issue tracking. We will
> migrate to the Apache JIRA.
> http://issues.apache.org/jira/browse/SPARK
>
> == Initial Committers ==
> * Matei Zaharia <ma...@apache.org>
> * Ankur Dave <ankurd...@gmail.com>
> * Tathagata Das <t...@eecs.berkeley.edu>
> * Haoyuan Li <haoy...@cs.berkeley.edu>
> * Josh Rosen <joshro...@cs.berkeley.edu>
> * Reynold Xin <r...@cs.berkeley.edu>
> * Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> * Mosharaf Chowdhury <mosha...@cs.berkeley.edu>
> * Charles Reiss <char...@eecs.berkeley.edu>
> * Andy Konwinski <andykonwin...@gmail.com>
> * Patrick Wendell <pwend...@eecs.berkeley.edu>
> * Imran Rashid <im...@quantifind.com>
> * Ryan LeCompte <lecom...@gmail.com>
> * Ravi Pandya <ra...@exchange.microsoft.com>
> * Ram Sriharsha <harsh...@yahoo-inc.com>
> * Robert Evans <ev...@yahoo-inc.com>
> * Mridul Muralidharan <mrid...@yahoo-inc.com>
> * Thomas Dudziak <to...@clearstorydata.com>
> * Mark Hamstra <m...@clearstorydata.com>
> * Stephen Haberman <stephen.haber...@gmail.com>
> * Shane Huang <shannie.hu...@gmail.com>
> * Andrew Xia <xiajunl...@gmail.com>
> * Nick Pentreath <nick.pentre...@gmail.com>
> * Sean McNamara <sean.mcnam...@webtrends.com>
>
> == Affiliations ==
> The initial committers are from nine organizations: UC Berkeley, Quantifind,
> Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Mxit and Webtrends.
>
> * Matei Zaharia (UCB)
> * Ankur Dave (UCB)
> * Tathagata Das (UCB)
> * Haoyuan Li (UCB)
> * Josh Rosen (UCB)
> * Reynold Xin (UCB)
> * Shivaram Venkataraman (UCB)
> * Mosharaf Chowdhury (UCB)
> * Charles Reiss (UCB)
> * Andy Konwinski (UCB)
> * Patrick Wendell (UCB)
> * Imran Rashid (Quantifind)
> * Ryan LeCompte (Quantifind)
> * Ravi Pandya (Microsoft)
> * Ram Sriharsha (Yahoo!)
> * Robert Evans (Yahoo!)
> * Mridul Muralidharan (Yahoo!)
> * Thomas Dudziak (ClearStory)
> * Mark Hamstra (ClearStory)
> * Stephen Haberman (Bizo)
> * Shane Huang (Intel)
> * Andrew Xia (Intel)
> * Nick Pentreath (Mxit)
> * Sean McNamara (Webtrends)
>
> == Sponsors ==
> === Champion ===
> * Chris Mattmann
>
> === Nominated Mentors ===
> * Chris Mattmann
> * Paul Ramirez
> * Andrew Hart
>
> === Sponsoring Entity ===
> The Apache Incubator
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: chris.a.mattm...@nasa.gov
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org