Re: [PROPOSAL] Apache Spark for the Incubator

Mattmann, Chris A (398J) Fri, 31 May 2013 11:16:45 -0700

Guys, I've added Thomas Dudziak as a mentor to the proposal
at his request. He is a member of the ASF and should be granted
IPMC access soon.


Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Mattmann>, jpluser <chris.a.mattm...@jpl.nasa.gov>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Friday, May 31, 2013 11:03 AM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: [PROPOSAL] Apache Spark for the Incubator

>Hi Folks,
>
>I'm pleased to bring you a proposal to the Apache Incubator for the Apache
>Spark project: https://wiki.apache.org/incubator/SparkProposal
>
>The work originates from the Berkeley AMPLab, a number of industry
>participants, and other institutions. Spark is a framework for large-scale
>data analysis on clusters, with a particular focus on low-latency
>operations. The source code is written in Scala and provides a number of
>APIs and bindings in various programming languages.
>
>The proposal text is copied to the bottom of this email. I'm going to
>leave this thread open for the next week for discussion. Once it's died
>down, I'll call an official VOTE.
>
>Suresh, Ross G. -- heads up -- this project may be of interest to you both
>and we would welcome you guys as additional mentors. We currently have 3
>mentors committed to the project, but would love to have more. People
>interested in contributing should declare their interest here on the
>general@incubator thread, and those potential contributors will be
>discussed by the incoming Spark community.
>
>Questions -- let's hear 'em! :)
>
>Cheers,
>Chris
>("Champion", incoming Apache Spark)
>
>=== Abstract ===
>Spark is an open source system for large-scale data analysis on clusters.
>
>=== Proposal ===
>Spark is an open source system for fast and flexible large-scale data
>analysis. Spark provides a general purpose runtime that supports
>low-latency execution in several forms. These include interactive
>exploration of very large datasets, near real-time stream processing, and
>ad-hoc SQL analytics (through higher layer extensions). Spark interfaces
>with HDFS, HBase, Cassandra and several other storage layers, and
>exposes APIs in Scala, Java and Python.
>
>=== Background ===
>Spark started as a U.C. Berkeley research project, designed to efficiently
>run machine learning algorithms on large datasets. Over time, it has
>evolved into a general computing engine as outlined above. Spark's
>developer community has also grown to include additional institutions,
>such as universities, research labs, and corporations. Funding has been
>provided by various institutions including the U.S. National Science
>Foundation, DARPA, and a number of industry sponsors. See:
>https://amplab.cs.berkeley.edu/sponsors/ for full details.
>
>=== Rationale ===
>As the number of contributors to Spark has grown, we have sought a
>long-term home for the project, and we believe the Apache foundation would
>be a great fit. Spark is a natural fit for the Apache foundation: Spark
>already interoperates with several existing Apache projects (HDFS, HBase,
>Hive, Cassandra, Avro and Flume to name a few). The Spark team is familiar
>with the Apache process and subscribes to the Apache mission - the
>team includes multiple Apache committers already. Finally, joining Apache
>will help coordinate the development effort of the growing number of
>organizations which contribute to Spark.
>
>== Initial Goals ==
>The initial goals will most likely be to move the existing codebase to
>Apache and integrate with the Apache development process. Furthermore, we
>plan for incremental development and releases in line with the Apache
>guidelines.
>
>=== Current Status ===
>== Meritocracy ==
>The Spark project already operates on meritocratic principles. Today,
>Spark has several developers and has accepted multiple major patches from
>outside of U.C. Berkeley. While this process has remained mostly informal
>(we do not have an official committer list), an implicit organization
>exists in which individuals who contribute major components act as
>maintainers for those modules. If accepted, the Spark project would
>include several of these participants as committers from the outset. We
>will work to identify all committers and PPMC members for the project and
>to operate under the ASF meritocratic principles.
>
>=== Community ===
>Acceptance into the Apache foundation would bolster the already strong
>user and developer community around Spark. That community includes dozens
>of contributors from several institutions, a meetup group with several
>hundred members, and an active mailing list composed of hundreds of users.
>
>=== Core Developers ===
>The core developers of our project are listed in our contributors and
>initial PPMC below. Though many are at UC Berkeley, there is a
>representative cross-section of other organizations including Quantifind,
>Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Tagged and Webtrends.
>
>
>=== Alignment ===
>Our proposed effort aligns with several ongoing BIGDATA and U.S. National
>priority funding interests including the NSF and its Expeditions program,
>and the DARPA XDATA project. Our industry partners and collaborators are
>well aligned with our code base.
>
>There are also a number of related Apache projects and dependencies, which
>are covered in the Relationship with Other Apache Products section.
>
>== Known Risks ==
>
>=== Orphaned Products ===
>Given the current level of investment in Spark, the risk of the project
>being abandoned is minimal. There are several constituents who are highly
>incentivized to continue development. The U.C. Berkeley AMPLab relies on
>Spark as a platform for a large number of long-term research projects.
>Several companies have built verticalized products which are tightly
>dependent on Spark. Other companies have devoted significant internal
>infrastructure investment to Spark.
>
>=== Inexperience with Open Source ===
>Spark has existed as a healthy open source project for several years.
>During that time, Matei and others have curated an open-source community
>successfully, attracting developers from a diverse group of companies
>including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, and
>Webtrends. 
>
>=== Homogeneous Developers ===
>The initial list of committers includes developers from several
>institutions, including Quantifind, Microsoft, Yahoo!, ClearStory Data,
>Bizo, Intel, and Webtrends.
>
>=== Reliance on Salaried Developers ===
>Like most open source projects, Spark receives substantial support from
>salaried developers. A large fraction of Spark development is supported by
>graduate students at U.C. Berkeley in the course of research degrees -
>this is more a "volunteer" relationship, since in most cases students
>contribute vastly more than is necessary to immediately support research.
>In addition, those working from within corporations often devote "after
>hours" or spare time to the project - and these come from several
>organizations. We will work to ensure that the project can continue to be
>stewarded and to move forward independent of salaried developers.
>
>
>=== Relationship with Other Apache Products ===
>Spark inter-operates with several existing Apache products by supporting
>them as storage layers: Apache Cassandra, Apache HBase, and Apache Hadoop
>(HDFS). It also uses several Apache components internally including Apache
>Maven and several Apache Commons libraries. Finally, Shark (a higher layer
>framework built on Spark) inter-operates with Apache Hive. We will explore
>the relationship between Spark and Apache Gora, which also provides
>in-memory object storage (Champion Mattmann was the Champion for Apache
>Gora so we expect alignment and cross pollination between our efforts).
>
>Spark offers an alternative computation engine to Apache Hadoop
>(MapReduce). Unlike MapReduce, Spark is designed for lower-latency and
>interactive workloads. This makes the projects complementary: many users
>run MapReduce and Spark side-by-side.
>
>=== An Excessive Fascination with the Apache Brand ===
>Spark is already a healthy and relatively well known open source project.
>This proposal is not for the purpose of generating publicity. Rather, the
>primary benefits to joining Apache are those outlined in the Rationale
>section.
>
>=== Documentation ===
>The reader will find these websites highly relevant:
> * Spark website: http://spark-project.org/
> * Spark documentation: http://spark-project.org/documentation/
> * Issue tracking: https://spark-project.atlassian.net/
> * Codebase: https://github.com/mesos/spark
> * User group: https://groups.google.com/group/spark-users
>
>== Initial Source ==
>The Spark codebase is currently hosted on Github:
>https://github.com/mesos/spark. This is the exact codebase that we would
>migrate to the Apache foundation.
>
>== Source and Intellectual Property Submission Plan ==
>Currently, the Spark codebase is distributed under a BSD license. The vast
>majority of code has copyright held by the University of California. Upon
>entering Apache, Spark will migrate to an Apache License with all
>copyright assigned to the Apache Foundation. The University of California
>will transfer all copyright to the Apache Foundation. In certain cases
>where individuals hold copyright, we will have individuals sign over
>copyright to the Apache foundation as well.
>
>Going forward, all commits would assign copyright directly to the Apache
>foundation through our signed Individual Contributor License Agreements
>for all initial committers on the project.
>
>
>== External Dependencies ==
>To the best of our knowledge, all dependencies of Spark are distributed
>under Apache compatible licenses. Upon acceptance to the incubator, we
>would begin a thorough analysis of all transitive dependencies to verify
>this fact and introduce license checking into the build and release
>process (for instance integrating Apache Rat).
>
>== Required Resources ==
>=== Mailing list ===
>We will migrate the existing Spark mailing lists as follows:
>
> * spark-users@googlegroups --> us...@spark.incubator.apache.org
> * spark-developers@googlegroups --> d...@spark.incubator.apache.org
> * spark-commits are hosted on Github, so we would request
>comm...@spark.incubator.apache.org
>
>The latter is to be consistent with the new PIAO naming scheme for
>podlings.
>
>=== Source control ===
>The Spark team would like to use Git for source control, due to our
>current use of Git.
>We request a writeable Git repo for Spark, and mirroring to be set up to
>Github through INFRA. Champion Mattmann can assist with creating INFRA
>tickets for this.
>
>=== Issue Tracking ===
>Spark currently uses a hosted JIRA deployment for issue tracking. We will
>migrate to the Apache JIRA.
>http://issues.apache.org/jira/browse/SPARK
>
>== Initial Committers ==
> * Matei Zaharia <ma...@apache.org>
> * Ankur Dave <ankurd...@gmail.com>
> * Tathagata Das <t...@eecs.berkeley.edu>
> * Haoyuan Li <haoy...@cs.berkeley.edu>
> * Josh Rosen <joshro...@cs.berkeley.edu>
> * Reynold Xin <r...@cs.berkeley.edu>
> * Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
> * Mosharaf Chowdhury <mosha...@cs.berkeley.edu>
> * Charles Reiss <char...@eecs.berkeley.edu>
> * Andy Konwinski <andykonwin...@gmail.com>
> * Patrick Wendell <pwend...@eecs.berkeley.edu>
> * Imran Rashid <im...@quantifind.com>
> * Ryan LeCompte <lecom...@gmail.com>
> * Ravi Pandya <ra...@exchange.microsoft.com>
> * Ram Sriharsha <harsh...@yahoo-inc.com>
> * Robert Evans <ev...@yahoo-inc.com>
> * Mridul Muralidharan <mrid...@yahoo-inc.com>
> * Thomas Dudziak <to...@clearstorydata.com>
> * Mark Hamstra <m...@clearstorydata.com>
> * Stephen Haberman <stephen.haber...@gmail.com>
> * Shane Huang <shannie.hu...@gmail.com>
> * Andrew Xia <xiajunl...@gmail.com>
> * Nick Pentreath <nick.pentre...@gmail.com>
> * Sean McNamara <sean.mcnam...@webtrends.com>
>
>== Affiliations ==
>The initial committers are from nine organizations: UC Berkeley,
>Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Mxit and
>Webtrends.
>
> * Matei Zaharia (UCB)
> * Ankur Dave (UCB)
> * Tathagata Das (UCB)
> * Haoyuan Li (UCB)
> * Josh Rosen (UCB)
> * Reynold Xin (UCB)
> * Shivaram Venkataraman (UCB)
> * Mosharaf Chowdhury (UCB)
> * Charles Reiss (UCB)
> * Andy Konwinski (UCB)
> * Patrick Wendell (UCB)
> * Imran Rashid (Quantifind)
> * Ryan LeCompte (Quantifind)
> * Ravi Pandya (Microsoft)
> * Ram Sriharsha (Yahoo!)
> * Robert Evans (Yahoo!)
> * Mridul Muralidharan (Yahoo!)
> * Thomas Dudziak (ClearStory)
> * Mark Hamstra (ClearStory)
> * Stephen Haberman (Bizo)
> * Shane Huang (Intel)
> * Andrew Xia (Intel)
> * Nick Pentreath (Mxit)
> * Sean McNamara (Webtrends)
>
>== Sponsors ==
>=== Champion ===
> * Chris Mattmann
>
>=== Nominated Mentors ===
> * Chris Mattmann
> * Paul Ramirez 
> * Andrew Hart 
>
>=== Sponsoring Entity ===
> The Apache Incubator
>
