Re: [PROPOSAL] Apache Spark for the Incubator

Mattmann, Chris A (398J) Mon, 03 Jun 2013 06:03:37 -0700

Thanks for the support, Pei. I hope the questions you had
about frameworks, etc. were answered.


Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++






-----Original Message-----
From: <Chen>, Pei <pei.c...@childrens.harvard.edu>
Reply-To: "general@incubator.apache.org" <general@incubator.apache.org>
Date: Friday, May 31, 2013 11:45 AM
To: "general@incubator.apache.org" <general@incubator.apache.org>
Subject: RE: [PROPOSAL] Apache Spark for the Incubator

>+1 (non-binding)
>This seems like a really interesting project.
>Q- Is Spark just a framework/API or does it also have some tools
>implemented for data analytics?
>--Pei
>
>> -----Original Message-----
>> From: Mattmann, Chris A (398J) [mailto:chris.a.mattm...@jpl.nasa.gov]
>> Sent: Friday, May 31, 2013 2:04 PM
>> To: general@incubator.apache.org
>> Subject: [PROPOSAL] Apache Spark for the Incubator
>> 
>> Hi Folks,
>> 
>> I'm pleased to bring you a proposal to the Apache Incubator for the
>>Apache
>> Spark project: https://wiki.apache.org/incubator/SparkProposal
>> 
>> The work originates from the Berkeley AMPLab, together with a number of
>> industry participants and other institutions. Spark is a framework for
>> large-scale data analysis on clusters, with a particular focus on
>> low-latency operations. The source code is written in Scala, and provides
>> a number of APIs and bindings in various programming languages.
>> 
>> The proposal text is copied to the bottom of this email. I'm going to
>>leave this
>> thread open for the next week for discussion. Once it's died down, I'll
>>call an
>> official VOTE.
>> 
>> Suresh, Ross G. -- heads up -- this project may be of interest to you
>>both and
>> would welcome you guys as additional mentors. We currently have 3
>> mentors committed to the project, but would love to have more. People
>> interested in contributing should declare their interest here on the
>> general@incubator thread and those potential contributors will be
>>discussed
>> by the incoming Spark community.
>> 
>> Questions -- let's hear 'em! :)
>> 
>> Cheers,
>> Chris
>> ("Champion", incoming Apache Spark)
>> 
>> === Abstract ===
>> Spark is an open source system for large-scale data analysis on
>>clusters.
>> 
>> === Proposal ===
>> Spark is an open source system for fast and flexible large-scale data
>>analysis.
>> Spark provides a general purpose runtime that supports low-latency
>> execution in several forms. These include interactive exploration of
>>very
>> large datasets, near real-time stream processing, and ad-hoc SQL
>>analytics
>> (through higher layer extensions). Spark interfaces with HDFS, HBase,
>> Cassandra and several other storage layers, and exposes APIs in
>> Scala, Java and Python.
>> 
>> === Background ===
>> Spark started as a U.C. Berkeley research project, designed to
>>efficiently run
>> machine learning algorithms on large datasets. Over time, it has
>>evolved into
>> a general computing engine as outlined above. Spark's developer
>>community
>> has also grown to include additional institutions, such as universities,
>> research labs, and corporations. Funding has been provided by various
>> institutions including the U.S. National Science Foundation, DARPA, and
>>a
>> number of industry sponsors. See:
>> https://amplab.cs.berkeley.edu/sponsors/ for full details.
>> 
>> === Rationale ===
>> As the number of contributors to Spark has grown, we have sought a
>> long-term home for the project, and we believe the Apache foundation
>> would be a great fit. Spark is a natural fit for the Apache foundation:
>>Spark
>> already interoperates with several existing Apache projects (HDFS,
>>HBase,
>> Hive, Cassandra, Avro and Flume to name a few). The Spark team is
>>familiar
>> with the Apache process and subscribes to the Apache mission - the
>> team includes multiple Apache committers already. Finally, joining
>>Apache
>> will help coordinate the development effort of the growing number of
>> organizations which contribute to Spark.
>> 
>> == Initial Goals ==
>> The initial goals will most likely be to move the existing codebase to
>>Apache
>> and integrate with the Apache development process. Furthermore, we plan
>> for incremental development and releases in line with the Apache
>> guidelines.
>> 
>> === Current Status ===
>> == Meritocracy ==
>> The Spark project already operates on meritocratic principles. Today,
>>Spark
>> has several developers and has accepted multiple major patches from
>> outside of U.C. Berkeley. While this process has remained mostly
>>informal
>> (we do not have an official committer list), an implicit organization
>>exists in
>> which individuals who contribute major components act as maintainers for
>> those modules. If accepted, the Spark project would include several of
>>these
>> participants as committers from the outset. We will work to identify all
>> committers and PPMC members for the project and to operate under the
>> ASF meritocratic principles.
>> 
>> === Community ===
>> Acceptance into the Apache foundation would bolster the already strong
>> user and developer community around Spark. That community includes
>> dozens of contributors from several institutions, a meetup group with
>> several hundred members, and an active mailing list composed of hundreds
>> of users.
>> 
>> === Core Developers ===
>> The core developers of our project are listed in our contributors and
>>initial
>> PPMC below. Though many are at UC Berkeley, there is a representative
>> cross sampling of other organizations including Quantifind, Microsoft,
>>Yahoo!,
>> ClearStory Data, Bizo, Intel, Tagged and Webtrends.
>> 
>> 
>> === Alignment ===
>> Our proposed effort aligns with several ongoing BIGDATA and U.S.
>>National
>> priority funding interests including the NSF and its Expeditions
>>program, and
>> the DARPA XDATA project. Our industry partners and collaborators are
>>well
>> aligned with our code base.
>> 
>> There are also a number of related Apache projects and dependencies, which
>> are covered in the Relationship with Other Apache Products section.
>> 
>> == Known Risks ==
>> 
>> === Orphaned Products ===
>> Given the current level of investment in Spark, the risk of the
>>project being
>> abandoned is minimal. There are several constituents who are highly
>> incentivized to continue development. The U.C. Berkeley AMPLab relies on
>> Spark as a platform for a large number of long-term research projects.
>> Several companies have built verticalized products which are tightly
>> dependent on Spark. Other companies have devoted significant internal
>> infrastructure investment in Spark.
>> 
>> === Inexperience with Open Source ===
>> Spark has existed as a healthy open source project for several years.
>> During that time, Matei and others have curated an open-source community
>> successfully, attracting developers from a diverse group of companies
>> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel,
>>and
>> Webtrends.
>> 
>> === Homogeneous Developers ===
>> The initial list of committers includes developers from several
>>institutions,
>> including Quantifind, Microsoft, Yahoo!, ClearStory Data, Bizo, Intel,
>>and
>> Webtrends.
>> 
>> === Reliance on Salaried Developers ===
>> Like most open source projects, Spark receives substantial support
>>from
>> salaried developers. A large fraction of Spark development is supported
>>by
>> graduate students at U.C. Berkeley in the course of research degrees -
>>this is
>> more a "volunteer" relationship, since in most cases students contribute
>> vastly more than is necessary to immediately support research.
>> In addition, those working from within corporations often devote "after
>> hours" or spare time to the project, and these come from several
>> organizations. We will work to ensure that the project can continue to be
>> stewarded and to move forward independent of salaried developers.
>> 
>> 
>> === Relationship with Other Apache Products ===
>> Spark inter-operates with several existing Apache products by supporting
>> them as storage layers:
>> Apache Cassandra, Apache HBase, and Apache Hadoop (HDFS). It also uses
>> several Apache components internally including Apache Maven and several
>> Apache Commons libraries. Finally, Shark (a higher layer framework
>>built on
>> Spark) inter-operates with Apache Hive. We will explore the relationship
>> between Spark and Apache Gora, which also provides in-memory object
>> storage (Champion Mattmann was the Champion for Apache Gora so we
>> expect alignment and cross pollination between our efforts).
>> 
>> Spark offers an alternative computation engine to Apache Hadoop
>> (MapReduce). Unlike MapReduce, Spark is designed for lower-latency and
>> interactive workloads. This makes the projects complementary: many users
>> run MapReduce and Spark side-by-side.
>> 
>> === An Excessive Fascination with the Apache Brand ===
>> Spark is already a healthy and relatively well known open source project.
>> This proposal is not for the purpose of generating publicity. Rather,
>>the
>> primary benefits to joining Apache are those outlined in the Rationale
>> section.
>> 
>> === Documentation ===
>> The reader will find these websites highly relevant:
>>  * Spark website: http://spark-project.org/
>>  * Spark documentation: http://spark-project.org/documentation/
>>  * Issue tracking: https://spark-project.atlassian.net/
>>  * Codebase: https://github.com/mesos/spark
>>  * User group: https://groups.google.com/group/spark-users
>> 
>> == Initial Source ==
>> The Spark codebase is currently hosted on Github:
>> https://github.com/mesos/spark. This is the exact codebase that we would
>> migrate to the Apache foundation.
>> 
>> == Source and Intellectual Property Submission Plan ==
>> Currently, the Spark codebase is distributed under a BSD license. The
>> vast majority of code has
>> copyright held by the University of California. Upon entering Apache,
>>Spark
>> will migrate to an Apache License with all copyright assigned to the
>>Apache
>> Foundation. The University of California will transfer all copyright to
>>the
>> Apache Foundation. In certain cases where individuals hold copyright,
>>we will
>> have individuals sign over copyright to the Apache foundation as well.
>> 
>> Going forward, all commits would assign copyright directly to the Apache
>> foundation through our signed Individual Contributor License Agreements
>> for all initial committers on the project.
>> 
>> 
>> == External Dependencies ==
>> To the best of our knowledge, all dependencies of Spark are distributed
>> under Apache compatible licenses. Upon acceptance to the incubator, we
>> would begin a thorough analysis of all transitive dependencies to
>>verify this
>> fact and introduce license checking into the build and release process
>>(for
>> instance integrating Apache Rat).
>> 
>> == Required Resources ==
>> === Mailing list ===
>> We will migrate the existing Spark mailing lists as follows:
>> 
>>  * spark-users@googlegroups --> us...@spark.incubator.apache.org
>>  * spark-developers@googlegroups --> d...@spark.incubator.apache.org
>>  * spark-commits are hosted on Github, so we would request
>> comm...@spark.incubator.apache.org
>> 
>> The latter is to be consistent with the new PIAO naming scheme for
>>podlings.
>> 
>> === Source control ===
>> The Spark team would like to use Git for source control, due to our
>>current
>> use of Git.
>> We request a writeable Git repo for Spark, and mirroring to be set up to
>> Github through INFRA. Champion Mattmann can assist with creating INFRA
>> tickets for this.
>> 
>> === Issue Tracking ===
>> Spark currently uses a hosted JIRA deployment for issue tracking. We
>>will
>> migrate to the Apache JIRA.
>> http://issues.apache.org/jira/browse/SPARK
>> 
>> == Initial Committers ==
>>  * Matei Zaharia <ma...@apache.org>
>>  * Ankur Dave <ankurd...@gmail.com>
>>  * Tathagata Das <t...@eecs.berkeley.edu>
>>  * Haoyuan Li <haoy...@cs.berkeley.edu>
>>  * Josh Rosen <joshro...@cs.berkeley.edu>
>>  * Reynold Xin <r...@cs.berkeley.edu>
>>  * Shivaram Venkataraman <shiva...@eecs.berkeley.edu>
>>  * Mosharaf Chowdhury <mosha...@cs.berkeley.edu>
>>  * Charles Reiss <char...@eecs.berkeley.edu>
>>  * Andy Konwinski <andykonwin...@gmail.com>
>>  * Patrick Wendell <pwend...@eecs.berkeley.edu>
>>  * Imran Rashid <im...@quantifind.com>
>>  * Ryan LeCompte <lecom...@gmail.com>
>>  * Ravi Pandya <ra...@exchange.microsoft.com>
>>  * Ram Sriharsha <harsh...@yahoo-inc.com>
>>  * Robert Evans <ev...@yahoo-inc.com>
>>  * Mridul Muralidharan <mrid...@yahoo-inc.com>
>>  * Thomas Dudziak <to...@clearstorydata.com>
>>  * Mark Hamstra <m...@clearstorydata.com>
>>  * Stephen Haberman <stephen.haber...@gmail.com>
>>  * Shane Huang <shannie.hu...@gmail.com>
>>  * Andrew Xia <xiajunl...@gmail.com>
>>  * Nick Pentreath <nick.pentre...@gmail.com>
>>  * Sean McNamara <sean.mcnam...@webtrends.com>
>> 
>> == Affiliations ==
>> The initial committers are from nine organizations: UC Berkeley,
>>Quantifind,
>> Microsoft, Yahoo!, ClearStory Data, Bizo, Intel, Mxit and Webtrends.
>> 
>>  * Matei Zaharia (UCB)
>>  * Ankur Dave (UCB)
>>  * Tathagata Das (UCB)
>>  * Haoyuan Li (UCB)
>>  * Josh Rosen (UCB)
>>  * Reynold Xin (UCB)
>>  * Shivaram Venkataraman (UCB)
>>  * Mosharaf Chowdhury (UCB)
>>  * Charles Reiss (UCB)
>>  * Andy Konwinski (UCB)
>>  * Patrick Wendell (UCB)
>>  * Imran Rashid (Quantifind)
>>  * Ryan LeCompte (Quantifind)
>>  * Ravi Pandya (Microsoft)
>>  * Ram Sriharsha (Yahoo!)
>>  * Robert Evans (Yahoo!)
>>  * Mridul Muralidharan (Yahoo!)
>>  * Thomas Dudziak (ClearStory)
>>  * Mark Hamstra (ClearStory)
>>  * Stephen Haberman (Bizo)
>>  * Shane Huang (Intel)
>>  * Andrew Xia (Intel)
>>  * Nick Pentreath (Mxit)
>>  * Sean McNamara (Webtrends)
>> 
>> == Sponsors ==
>> === Champion ===
>>  * Chris Mattmann
>> 
>> === Nominated Mentors ===
>>  * Chris Mattmann
>>  * Paul Ramirez
>>  * Andrew Hart
>> 
>> === Sponsoring Entity ===
>>  The Apache Incubator
>> 
>> 
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>
>
