[DISCUSS] MADlib Incubation Proposal

Roman Shaposhnik Wed, 02 Sep 2015 13:38:57 -0700

Hi!

On the heels of the HAWQ proposal, I'd like
to follow with a discussion of accepting MADlib's
community into the ASF Incubator:
     https://wiki.apache.org/incubator/MADlibProposal


There was an extensive discussion within the existing
open source community and the overall consensus
is extremely supportive of this proposal:
    http://madlib.net/pipermail/user/2015-August/
    http://madlib.net/pipermail/devel/2015-August/

We've done quite a bit of outreach in order to identify
all the folks who may be interested in joining the initial
list of committers. The current proposal reflects that.
Additionally, we hope that the ASF DISCUSS thread
will help us in reaching out even further.

Finally, while the 3 experienced mentors currently listed
on the proposal seem like a reasonable number, we would
love it if other folks from the IPMC could volunteer to help us on
this journey.

Thanks,
Roman.

== Abstract ==
MADlib is an open-source library (licensed under the 2-clause BSD license)
for scalable in-database analytics. It provides data-parallel
implementations of mathematical, statistical and machine learning
methods for structured and unstructured data. The MADlib mission is to
foster widespread development of scalable analytic skills, by
harnessing efforts from commercial practice, academic research, and
open source development.

MADlib occupies a unique niche in the realm of data science and
machine learning libraries since its SQL APIs can allow it to work on
a wide range of data stores and SQL engines.

== Proposal ==
The current open source community behind MADlib feels that aligning
itself with HAWQ's community, governance model, infrastructure and
roadmap will allow the project to accelerate adoption and community
growth. Given HAWQ's trajectory of entering Apache Software Foundation
family as an Incubating project, we feel that the best course of
action for MADlib is to follow a similar route.

MADlib and HAWQ are complementary technologies in that MADlib
in-database analytical functions can run within the HAWQ execution
engine. (MADlib also runs on Greenplum Database and PostgreSQL today.)
It is expected that contributors to MADlib will be cognizant of the
HAWQ ASF project and may contribute to it as well.  In short,
collaboration between the two communities will make both projects more
vibrant and advance the respective technologies in potentially novel
directions.

Contributors may also look at the HAWQ project as a starting point for
ports to other parallel database engines. This proposal highly
encourages this type of work as it would help to further realize the
original cross-platform goal of MADlib as envisioned by its
originators.

Thus, the goal of this proposal is to bring the existing MADlib open
source community into ASF, change the project's governance model to
the "Apache Way" and transition the project's codebase and
infrastructure into ASF INFRA. The community has agreed to transfer
the brand name "MADlib" to Apache Software Foundation as well.

Pivotal Inc. on behalf of the MADlib open source community is
submitting this proposal to transition source code and associated
artifacts (documentation, web site content, wiki, etc.) to the Apache
Software Foundation Incubator under the Apache License, Version 2.0
and is asking the Incubator PMC to establish a MADlib incubating
project.

Currently MADlib uses a few category X licensed software tools during
its build (mostly for generating documentation):
   * doxypy 0.4.2 (GPL)
   * doxygen 1.8.4 (GPL)
   * TikZ-UML
   * bison 2.4 (GPL, with an exception for generated output)
We feel that this usage is compatible with an overall project licensed
under the ALv2 and don't anticipate any changes.
Our usage of the LGPL library cern_root-5.34 is expected to go away since
the 2 cern modules used are being entirely re-written in MADlib.

Finally, MADlib's inclusion of an MPL-licensed library (eigen 3.2.2) in
its binary artifact seems to be consistent with the
ASF recommendation for managing "weak copyleft" dependencies.


== Background ==
MADlib grew out of discussions between database engine developers,
data scientists, IT architects and academics interested in new
approaches to scalable, sophisticated in-database analytics. These
discussions were written up in a paper in VLDB 2009 that coined the
term “MAD Skills” for data analysis
(http://dl.acm.org/citation.cfm?id=1687576). The MADlib software
project began the following year as a collaboration between
researchers at UC Berkeley and engineers and data scientists at
Pivotal (former EMC/Greenplum).

The initial MADlib codebase came from EMC/Greenplum, UC Berkeley, the
University of Wisconsin, and the University of Florida.  The project
was publicly documented in a paper at VLDB 2012
(http://vldb.org/pvldb/vol5/p1700_joehellerstein_vldb2012.pdf).  Today
MADlib has contributors from around the world including both
individuals and institutions.  For example, recent contributions have
come from Pivotal, Stanford University, and the University of Illinois
at Chicago.

MADlib was conceived from the outset as a free, open source library
for all to use and contribute to.  Since its inception, the community
has steadily added new methods in the areas of mathematics,
statistics, machine learning, and data transformation.  The current
library includes over 30 principal algorithms as well as many
additional operators and utility functions.

The methods in MADlib are designed both for in- or out-of-core
execution, and for the shared-nothing, scale-out parallelism offered
by modern parallel database engines, ensuring that computation is done
close to the data. The core functionality is written in declarative
SQL statements, which orchestrate data movement to and from disk, and
across networked machines. Single-node inner loops take advantage of
SQL extensibility to call out to high performance math libraries in
user-defined scalar and aggregate functions. At the highest level,
tasks that require iteration and/or structure definition are coded in
Python driver routines, which are used only to kick off the data-rich
computations that happen within the database engine.
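
The layering described above can be illustrated with a small sketch. This is
not MADlib code: it uses Python's built-in sqlite3 module as a stand-in for a
parallel database engine, and the names slope_gradient, SlopeGradient,
fit_slope and the points table are hypothetical, chosen only for this
example. The user-defined aggregate plays the role of the inner loop that
runs once per row inside the engine, while the Python driver only iterates
and holds the (tiny) model state.

```python
import sqlite3


class SlopeGradient:
    """Hypothetical user-defined aggregate: mean squared-error gradient for y ~ w * x."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, x, y, w):
        # Invoked by the SQL engine once per row of the scanned table.
        self.total += 2.0 * (w * x - y) * x
        self.count += 1

    def finalize(self):
        # Invoked once per query; returns the averaged gradient.
        return self.total / self.count if self.count else 0.0


def fit_slope(conn, learning_rate=0.05, iterations=200):
    """Driver routine: the loop lives in Python, each data pass is one SQL query."""
    conn.create_aggregate("slope_gradient", 3, SlopeGradient)
    w = 0.0
    for _ in range(iterations):
        # Only the scalar gradient crosses the database boundary.
        (grad,) = conn.execute(
            "SELECT slope_gradient(x, y, ?) FROM points", (w,)
        ).fetchone()
        w -= learning_rate * grad
    return w


# Toy data generated from y = 3 * x (an assumption made for this example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE points (x REAL, y REAL)")
conn.executemany(
    "INSERT INTO points VALUES (?, ?)",
    [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)],
)
slope = fit_slope(conn)
print(round(slope, 3))
```

In this sketch the driver never reads the rows themselves; it only receives
one number per iteration, which is the property the paragraph above
attributes to the Python driver routines.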

The first platforms supported by MADlib were Greenplum Database and
PostgreSQL.  With the development of HAWQ SQL-on-Hadoop technology by
Pivotal, MADlib offers a way to perform predictive analytics on very
large data sets stored on a Hadoop cluster.

Today, MADlib is in active development and is deployed on a wide
variety of industry and academic projects across many different
verticals.

== Rationale ==
Enterprises today are seeing the value of landing very large
quantities of data in Hadoop clusters with the goal of improving their
products and processes.  With the proliferation of increasingly
sophisticated SQL-on-Hadoop technologies such as HAWQ, analysts can
use the familiar SQL language to query this data at scale.  This
effectively opens the door to Hadoop in the enterprise.

Adding SQL-based predictive analytics like MADlib to the equation
enables organizations to reason across large data sets without
resorting to sampling, which has been a traditional approach when
confronted with scale problems.  Operating on all of the data with
MADlib results in more robust and accurate models.

Because MADlib exposes a SQL-based interface, organizations do not need to
re-train their teams on an unfamiliar programming language: SQL
skills are ubiquitous in today's enterprises.

Given the high velocity of innovation happening in the underlying
Hadoop ecosystem, any SQL-based predictive analytics technology that
plays in this ecosystem must be commensurately agile to keep up with
the community. We strongly believe that in the Big Data space, this
can be optimally achieved through a vibrant, diverse, self-governed
community collectively innovating around a single codebase while at
the same time cross-pollinating with various other data management
communities. Apache Software Foundation is the ideal place to meet
those ambitious goals.

== Initial Goals ==
Our initial goals are to bring MADlib into the ASF, transition the
engineering and governance processes to be in accordance with the
"Apache Way" and foster a collaborative development model closely
aligned with that of HAWQ.

Another important goal is encouraging efforts to port to other
execution engines.

The MADlib project will continue developing new functionality in an
open, community-driven way. We envision accelerating innovation under
ASF governance, in order to meet the requirements of a wide variety of
predictive analytics use cases.

We will also require transitioning of existing project infrastructure
(source code, JIRA, mailing list) to the ASF infrastructure.

== Current Status ==
Currently, the project is available at http://madlib.net/. The
codebase is licensed under a 2-clause BSD license. Our current
governance model could be described as a "benevolent dictator" one. As
stated above, the existing MADlib community feels that closer
alignment with the HAWQ community, its infrastructure and the governance
model being proposed to the ASF will allow the MADlib project to thrive
far more than it would in relative isolation from HAWQ.

=== Meritocracy ===
Our proposed list of initial committers includes the current MADlib R&D
team at Pivotal and existing active members of the open source
project. This group will form a base for the broader community we will
invite to collaborate on the codebase. We intend to radically expand
the initial developer and user community by running the project in
accordance with the "Apache Way". Users and new contributors will be
treated with respect and welcomed. By participating in the community
and providing quality patches/support that move the project forward,
they will earn merit. They also will be encouraged to provide non-code
contributions (documentation, events, community management, etc.) and
will gain merit for doing so. Those with a proven support and quality
track record will be encouraged to become committers.

=== Community ===
If MADlib is accepted for incubation, the primary initial goal will be
transitioning the core community towards embracing the Apache Way of
project governance. We would solicit major existing contributors to
become committers on the project from the start.

=== Core Developers ===
MADlib core developers are skilled in working as part of openly
governed communities. That said, most of the core developers are
currently NOT affiliated with the ASF and would require new ICLAs
before committing to the project.

=== Alignment ===
The following existing ASF projects can be considered when reviewing
the MADlib proposal:

The Apache Mahout project's goal is to build an environment for quickly
creating scalable, performant machine learning applications. Apache
Mahout is, perhaps, the oldest machine learning library in the Hadoop
ecosystem. The three major components of Mahout are an environment for
building scalable algorithms, many new Scala + Spark (H2O in progress)
algorithms, and Mahout's mature Hadoop MapReduce algorithms. We see
the two projects benefiting from each other's experience of
implementing similar classes of algorithms and look forward to a
fruitful exchange of ideas between the two communities.

Apache Spark is a fast engine for processing large datasets, typically
from a Hadoop cluster, and performing batch, streaming, interactive,
or machine learning workloads.  Recently, Apache Spark has embraced
SQL-like APIs around DataFrames at its core. Because of that we would
expect a level of collaboration between the two projects. The Spark
project also contains a library (MLlib) that is the closest cousin to
MADlib. MLlib is Apache Spark's scalable machine learning library. We
see the two projects benefiting from each other's experience of
implementing similar classes of algorithms and look forward to a
fruitful exchange of ideas between the two communities.

Apache Hive is data warehouse software that facilitates querying and
managing large datasets residing in distributed storage. Hive provides
a mechanism to project structure onto this data and query the data
using a SQL-like language called HiveQL. We see a potential for MADlib
to leverage Hive as a backend the same way it currently leverages
PostgreSQL-derived SQL backends. This could be especially useful for
longer running algorithms.

Apache Drill is a schema-free SQL query engine for Hadoop, NoSQL and
Cloud Storage. We see a potential for MADlib to leverage Drill as a
backend the same way it currently leverages PostgreSQL-derived SQL
backends. This could be especially useful for analyzing data coming
from heterogeneous sources and federated by the Drill engine.

== Known Risks ==
Development has been sponsored mostly by a single company (or its
predecessors) thus far and coordinated mainly by the core Pivotal R&D
team.

So far, the project's governance model has explicitly been a
"benevolent dictator" one. For the project to fully transition to the
"Apache Way", development must shift towards the meritocracy-centric
model of growing a community of contributors balanced with the needs
for extreme stability and core implementation coherency.

=== Orphaned products ===
The community proposing MADlib for incubation is an independent open
source community. Even though Pivotal happens to be the biggest
corporate sponsor of the project (by means of employing the core team)
the community goes beyond those affiliated with Pivotal. On top of
that, Pivotal is fully committed to maintaining its position as one of
the leading providers of SQL-based analytics aimed squarely at data
scientists. MADlib is the only game in town that can leverage SQL APIs
ranging from traditional RDBMS technology all the way to data
warehousing (Pivotal Greenplum Database) and into SQL-on-Hadoop
(HAWQ). Moreover, Pivotal has a vested interest in making MADlib
succeed by driving its close integration with sister ASF projects. We
expect this to further reduce the risk of orphaning the product.

Even in the absence of support by a particular vendor such as Pivotal,
and in a worst-case scenario where HAWQ and Greenplum Database fail to
gain traction in OSS, the existence of an established PostgreSQL OSS
project means there will still be a working stack for MADlib.

=== Inexperience with Open Source ===
MADlib has been an open source project from the outset. All developers
working on the project (regardless of their employment affiliation)
did so completely in the open. While the governance model of MADlib
has been more of a benevolent dictator model, the project has always
been receptive to accepting contributions from all sources and
including them in future releases based on thorough code review,
testing, and compliance with the project’s coding best practices.

=== Homogeneous Developers ===
While most of the initial committers are employed by Pivotal, there's
still a healthy level of interest coming from academia. On top of that
we expect to spark curiosity in sister ASF projects and attract
developers unaffiliated with Pivotal. Finally, MADlib is being used
extensively whenever Pivotal engages with customers on data science
projects. This typically means that the skills remain within a
customer organization which further increases the chance of turning
customer data scientists into MADlib contributors.

=== Reliance on Salaried Developers ===
A large percentage of the contributors are paid to work in the Big
Data space. While they might wander from their current employers, they
are unlikely to venture far from their core expertise and thus will
continue to be engaged with the project regardless of their current
employers. In addition, the project is still enjoying popularity in
academic circles and we hope that will help mitigate reliance on
salaried developers as well.

=== Relationships with Other Apache Products ===
As mentioned in the Alignment section, MADlib may consider various
degrees of integration and code exchange with Apache Spark (MLlib),
Apache Mahout, Apache Hive and Apache Drill projects. We expect
integration points to be inside and outside the project. We look
forward to collaborating with these communities as well as other
communities under the Apache umbrella.

=== An Excessive Fascination with the Apache Brand ===
While we intend to leverage the Apache "brand" when talking to other
projects as a testament to our project’s neutrality, we have no plans
for making use of the Apache brand in press releases nor posting
billboards advertising acceptance of MADlib into Apache Incubator.

== Documentation ==
The documentation is currently available at: https://github.com/madlib/frontpage

The documentation is currently licensed under the 2-clause BSD license.

== Initial Source ==
Initial source code is available at:
   * MADlib: https://github.com/madlib/madlib
   * Testsuite: https://github.com/madlib/testsuite
   * Contributors: https://github.com/madlib/contrib

The code is currently licensed under the 2-clause BSD license.

== Source and Intellectual Property Submission Plan ==
As soon as MADlib is approved to join the Incubator, the source code
will be transitioned via the Software Grant Agreement onto ASF
infrastructure and in turn made available under the Apache License,
version 2.0.  We know of no legal encumbrances that would inhibit the
transfer of source code to the ASF.

== External Dependencies ==

Runtime dependencies:
   * boost-1.47.0 (Boost Software License)
   * _m_widen_init (MIT for this subcomponent of GCC)
   * python-argparse-1.2.1 (PSF LICENSE AGREEMENT FOR PYTHON 2.7.1)
   * pyyaml-3.10 (MIT license)
   * cern_root-5.34 (LGPL, however this dependency will be removed
since the 2 cern modules used are being entirely re-written in MADlib)
   * eigen-3.2.2 (Mozilla Public License)
   * pyxb-1.2.4 (Apache license version 2)
   * python (Python Software Foundation License Version 2)
   * mathjax-2.5 (Apache license version 2)

Build only dependencies:
   * doxypy-0.4.2 (GPL)
   * cmake-2.8.4 (BSD 3-clause License)
   * doxygen >= 1.8.4 (GPL)
   * flex >= 2.5.33 (BSD)
   * bison >= 2.4 (GPL)
   * latex (LaTeX Project Public License)
   * TikZ-UML (no license information)

Cryptography
   * N/A

== Required Resources ==

=== Mailing lists ===
  * priv...@madlib.incubator.apache.org (moderated subscriptions)
  * comm...@madlib.incubator.apache.org
  * d...@madlib.incubator.apache.org
  * iss...@madlib.incubator.apache.org
  * u...@madlib.incubator.apache.org

=== Git Repository ===
https://git-wip-us.apache.org/repos/asf/incubator-madlib.git

=== Issue Tracking ===
JIRA Project MADlib (MADLIB)

We will also request migration of our current JIRA available at
http://jira.madlib.net/

=== Other Resources ===

Setting up regular builds for MADlib on builds.apache.org
will require Docker support.

== Initial Committers ==
  * Anirudh Kondaveeti
  * Caleb Welton
  * Frank McQuillan
  * Gang Xiong
  * Gautam Muralidhar
  * Hitoshi Harada
  * Hulya Emir-farinas
  * Ian Huston
  * KeeSiong Ng
  * Noel Sio
  * Rahul Iyer
  * Rashmi Raghu
  * Regunathan Radhakrishnan
  * Ronert Obst
  * Samuel Ziegler
  * Sarah Aerni
  * Srivatsan Ramanujam
  * Woo Jae Jung
  * Xixuan Feng
  * Yu Yang
  * Atri Sharma
  * Greg Chase
  * Chloe Jackson
  * Roman Shaposhnik
  * Vaibhav Gumashta
  * Ted Dunning
  * Konstantin Boudnik

== Affiliations ==
  * Hortonworks: Vaibhav Gumashta
  * MapR: Ted Dunning
  * WANDisco: Konstantin Boudnik
  * Barclays:  Atri Sharma
  * Pivotal: everyone else on this proposal

== Sponsors ==

=== Champion ===
Roman Shaposhnik

=== Nominated Mentors ===

The initial mentors are listed below:
  * Ted Dunning - Apache Member, MapR
  * Konstantin Boudnik - Apache Member, WANDisco
  * Roman Shaposhnik - Apache Member, Pivotal

=== Sponsoring Entity ===
We would like to propose that the Apache Incubator sponsor this project.

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
