Re: [PROPOSAL] Apache AsterixDB Incubator

Henry Saputra Mon, 19 Jan 2015 10:28:08 -0800

+1 This is GREAT News!

Was watching and trying AsterixDB last year and it looked in awesome shape.


I have my plate full but would love to help mentor this project to get
it going to ASF if needed!

- Henry

On Wed, Jan 14, 2015 at 6:21 PM, Mattmann, Chris A (3980)
<chris.a.mattm...@jpl.nasa.gov> wrote:
> Hi Folks,
>
> I am pleased to bring forth the Apache AsterixDB proposal to the
> Apache Incubator as Champion, working in collaboration with the
> team. Please find the wiki proposal here:
>
> https://wiki.apache.org/incubator/AsterixDBProposal
>
>
> Full text of the proposal is below. Please discuss and enjoy. I’ll
> leave the discussion open for a week, and then look to call a VOTE
> hopefully end of next week if all is well.
>
> Cheers!
> Chris Mattmann
>
> =============================================================
> Apache AsterixDB Proposal
>
> Abstract
>
> Apache AsterixDB is a scalable big data management system (BDMS) that
> provides storage, management, and query capabilities for large
> collections of semi-structured data.
>
> Proposal
>
> AsterixDB is a big data management system (BDMS) with a feature set
> that makes it well-suited to needs such as web data warehousing and
> social data storage and analysis. Feature-wise, AsterixDB has:
>
> * A NoSQL style data model (ADM) based on extending JSON with object
>   database concepts.
> * An expressive and declarative query language (AQL) for querying
>   semi-structured data.
> * A runtime query execution engine, Hyracks, for partitioned-parallel
>   execution of query plans.
> * Partitioned LSM-based data storage and indexing for efficient
>   ingestion of newly arriving data.
> * Support for querying and indexing external data (e.g., in HDFS) as
>   well as data stored within AsterixDB.
> * A rich set of primitive data types, including support for spatial,
>   temporal, and textual data.
> * Indexing options that include B+ trees, R trees, and inverted
>   keyword index support.
> * Basic transactional (concurrency and recovery) capabilities akin to
>   those of a NoSQL store.
>
>
> Background and Rationale
>
> In the world of relational databases, the need to tackle data volumes
> that exceed the capabilities of a single server led to the
> development of “shared-nothing” parallel database systems several
> decades ago. These systems spread data over a cluster based on a
> partitioning strategy, such as hash partitioning, and queries are
> processed by employing partitioned-parallel divide-and-conquer
> techniques. Since these systems are fronted by a high-level,
> declarative language (SQL), their users are shielded from the
> complexities of parallel programming. Parallel database systems have
> been an extremely successful application of parallel computing, and
> quite a number of commercial products exist today.
>
> In the distributed systems world, the Web brought a need to index and
> query its huge content. SQL and relational databases were not the
> answer, though shared-nothing clusters again emerged as the hardware
> platform of choice. Google developed the Google File System (GFS) and
> MapReduce programming model to allow programmers to store and process
> Big Data by writing a few user-defined functions. The MapReduce
> framework applies these functions in parallel to data instances in
> distributed files (map) and to sorted groups of instances sharing a
> common key (reduce) -- not unlike the partitioned parallelism in
> parallel database systems. Apache's Hadoop MapReduce platform is the
> most prominent implementation of this paradigm for the rest of the
> Big Data community. On top of Hadoop and HDFS sit declarative
> languages like Pig and Hive that each compile down to Hadoop
> MapReduce jobs.
>
> The big Web companies were also challenged by extreme user bases
> (100s of millions of users) and needed fast simple lookups and
> updates to very large keyed data sets like user profiles. SQL
> databases were deemed either too expensive or not scalable, so the
> “NoSQL movement” was born. The ASF now has HBase and Cassandra, two
> popular key-value stores, in this space. MongoDB and Couchbase are
> other open source alternatives (document stores).
>
> It is evident from the rapidly growing popularity of "NoSQL" stores,
> as well as the strong demand for Big Data analytics engines today,
> that there is a strong (and growing!) need to store, process, *and*
> query large volumes of semi-structured data in many application
> areas. Until very recently, developers have had to "choose" between
> using big data analytics engines like Apache Hive or Apache Spark,
> which can do complex query processing and analysis over HDFS-resident
> files, and flexible but low-function data stores like MongoDB or
> Apache HBase. (The Apache Phoenix project,
> http://phoenix.apache.org/, is a recent SQL-over-HBase effort that
> aims to bridge between these choices.)
>
> AsterixDB is a highly scalable data management system that can store,
> index, and manage semi-structured data, e.g., much like MongoDB, but
> it also supports a full-power query language with the expressiveness
> of SQL (and more). Unlike analytics engines like Hive or Spark, it
> stores and manages data, so AsterixDB can exploit its knowledge of
> data partitioning and the availability of indexes to avoid always
> scanning data set(s) to process queries. Somewhat surprisingly, there
> is no open source parallel database system (relational or otherwise)
> available to developers today -- AsterixDB aims to fill this need.
> Since Apache is where the majority of today's most important Big
> Data technologies live, the ASF seems like the obvious home for a
> system like AsterixDB.
>
> Current Status
>
> The current version of AsterixDB was co-developed by a team of
> faculty, staff, and students at UC Irvine and UC Riverside. The
> project was initiated as a large NSF-sponsored project in 2009, the
> goal of which was to combine the best ideas from the parallel
> database world, the then new Hadoop world, and the semi-structured
> (e.g., XML/JSON) data world in order to create a next-generation
> BDMS. A first informal open source release was made four years later,
> in June of 2013, under the Apache Software License 2.0.
>
>
> Meritocracy
>
> The current developers are familiar with meritocratic open source
> development at Apache. Apache was chosen specifically because we want
> to encourage this style of development for the project.
>
>
> Community
>
> While AsterixDB started as a university project, it has developed into
> a community. A number of the initial committers started contributing
> in academia and continue to actively participate and contribute after
> graduation. We also seek to further develop the developer and user
> communities. One ongoing way to broaden the community is
> through academic collaborations (currently with IIT Mumbai in India
> and TU Berlin in Germany). During incubation we will also explicitly
> seek increased industrial participation.
>
> Some indicators of the effort's development community and history can
> be found at:
> https://www.openhub.net/p/asterixdb/contributors?query=&sort=commits_12_mo
> https://www.openhub.net/p/hyracks/contributors?query=&sort=commits_12_mo
>
>
> Core Developers
>
> The core developers of the project are diverse, although initially UC
> Irvine heavy (roughly 50%) due to the project's origins at UCI. The
> other 50% are from other academic institutions (UC Riverside and the
> Hebrew University in Jerusalem) and companies (Couchbase, Facebook,
> IBM, KACST Saudi Arabia, Oracle, Saudi Aramco, X15 Software).
>
>
> Alignment
>
> Apache is, by far, the most natural home for taking the AsterixDB
> project forward. A large fraction of today's top Big Data
> technologies have their homes in Apache, including Hadoop, YARN, Pig,
> Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
> significant gap -- the parallel data management system gap -- that
> exists in the Big Data open source world. It is well-aligned with a
> number of the Apache projects, e.g., it has strong support for
> accessing and indexing external data in HDFS, and it uses YARN as an
> answer to basic cluster resource management. AsterixDB also seeks to
> achieve an Apache-style development model; it is seeking a broader
> community of contributors and users in order to achieve its full
> potential and value to the Big Data community.
>
> There are also a number of related Apache projects and dependencies
> that will be mentioned below in the Relationships with Other Apache
> Products section.
>
>
> Known Risks
>
> Orphaned products
>
> Given the current level of intellectual investment in AsterixDB, the
> risk of the project being abandoned is very small. The UCI/UCR
> faculty team leads are highly incentivized to continue development
> since the database groups at UC Irvine and UC Riverside are both
> reliant on AsterixDB as a platform for long-term graduate research
> projects. UC San Diego is also beginning to contribute to the code
> base, and a collaboration involving public health applications is
> forming with UCLA. The work on AsterixDB is managed via a mix of
> mailing list discussions supplemented by weekly project status
> meetings which are summarized on the mailing list. Typical (local
> plus Skype-in) attendance at the weekly status meetings runs at about
> 20 active contributors.
>
>
> Inexperience with Open Source
>
> AsterixDB and Hyracks were completely developed in Open Source under
> the ASL 2.0. The source code repositories, issue tracker, and mailing
> lists are available on Google Code and discussions and decisions
> happen on the mailing lists (which is necessary due to the geographic
> distribution of the current developers).
>
> Also, a few of the initial committers have contributed to Apache
> projects. Vinayak Borkar is a committer on the Apache Helix and
> Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
> and an IPMC member. Preston Carman and Steven Jacobs are committers
> on the Apache VXQuery project.
>
>
> Relationships with Other Apache Products
>
> Apache VXQuery is based on the Hyracks data-parallel runtime, which
> is also included in the AsterixDB code base.
>
> AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
> is support for accessing external data in HDFS (and Hive formats),
> and resource management and system administration features are in the
> process of being migrated to YARN.
>
> AsterixDB's AQL query facilities offer comparable query power to
> Apache's Pig and Hive systems for big data analytics. AsterixDB
> differs in storing and indexing data and thus being able to quickly
> answer small and medium queries without large HDFS data scans -
> thereby targeting a different class of use cases.
>
> AsterixDB's data storage and indexing facilities are similar to those
> of HBase, but AsterixDB differs in being a much more complete and
> queryable BDMS (not just a key-value style store).
>
> AsterixDB's target use cases are not in-memory processing or
> iterative algorithm support, making AsterixDB complementary to the
> Apache Spark platform. (Spark interoperability is on our longer-term
> to-do wishlist.)
>
>
> Homogeneous Developers
>
> As mentioned before, the current community is already organizationally
> and geographically distributed - and we would like to increase the
> heterogeneity.
>
>
> Reliance on Salaried Developers
>
> Of the initial committers, only 3 are full-time UCI staff. The other
> committers are a mix of students, alumni who continue to contribute
> to the effort, and individuals working with permission part-time (or
> in spare time) on this project.
>
>
> An Excessive Fascination with the Apache Brand
>
> We believe in the processes, systems, and framework Apache has put in
> place. Apache is also known to foster a great community around their
> projects and provide exposure. While brand is important, our
> fascination with it is not excessive. We believe that the ASF is the
> right home for AsterixDB and that having AsterixDB inside of the ASF
> will lead to a better long-term outcome for the Big Data community.
>
>
> Documentation
>
> Documentation and publications related to AsterixDB can be found at
> http://asterixdb.ics.uci.edu/.
>
>
> Initial Source
>
> Current source resides on Google Code:
> https://code.google.com/p/asterixdb/ (query language and upper system
> layers) and https://code.google.com/p/hyracks/ (dataflow runtime
> system and storage management libraries).
>
>
> External Dependencies
>
> AsterixDB depends on a number of Apache projects:
>
> - Ant
> - Avro
> - ApacheDB JDO
> - Commons
> - Derby
> - Hadoop
> - Hive
> - HTTPComponents
> - Jakarta ORO
> - Maven
> - Tomcat
> - Thrift
> - Velocity
> - Wicket
> - Xerces
>
> and other open source projects (organized by license):
>
> -- ASL 2.0
>  - Jackson
>  - Google Guava
>  - Google Guice
>  - JSON-simple
>  - BoneCP
>  - Microsoft Azure SDK
>  - Netty
>  - Rome
>  - JetS3t
>  - Groovy
>  - Jettison
>  - Plexus
>  - Datanucleus (JDO)
>  - Jetty
>  - Twitter4J
>  - Snappy-java
>
> -- BSD
>  - Antlr
>  - ObjectWeb ASM
>  - Protobuf
>  - JSCH
>  - JavaCC
>  - Paranamer
>  - JLine
>  - Stax
>  - StringTemplate
>  - xmlEnc
>
> -- MIT
>  - AppAssembler
>  - SimpleLog4J
>
> -- CDDL 1.0
>  - Java Activation Framework
>  - Java Transactions
>  - Java Servlet API
>  - Grizzly
>  - gmbal
>  - Glassfish
>
> -- CDDL 1.1
>  - Jersey
>  - JAXB Reference Implementation
>
> -- JSON License
>  - JSON
>
> -- EPL 1.0
>  - JUnit
>
> -- JDOM License
>  - JDOM
>
> -- Public Domain
>  - xz
>  - AOPAlliance
>
> As all dependencies are managed using Apache Maven, none of the
> external libraries need to be packaged in a source distribution.
>
>
> Required Resources
>
> Developer and user mailing lists
>
> priv...@asterixdb.incubator.apache.org (with moderated subscriptions)
> comm...@asterixdb.incubator.apache.org
> d...@asterixdb.incubator.apache.org
> us...@asterixdb.incubator.apache.org
>
>
> A git repository
>
> https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git
>
>
> A JIRA issue tracker
>
> https://issues.apache.org/jira/browse/ASTERIXDB
>
>
> Initial Committers
>
> The following is a list of the planned initial Apache committers (the
> active subset of the committers for the current repository at Google
> Code).
>
> Abdullah Alamoudi (bamou...@gmail.com)
> Cameron Samak (euf...@gmail.com)
> Chen Li (che...@gmail.com)
> Ian Maxon (ima...@uci.edu)
> Ildar Absalyamov (ildar.absalya...@gmail.com)
> Jianfeng Jia (jianfeng....@gmail.com)
> Keren Ouaknine (ker...@gmail.com)
> Markus Dreseler (apa...@dreseler.de)
> Mike Carey (dtab...@apache.org)
> Murtadha Hubail (hubail...@gmail.com)
> Pouria Pirzadeh (pouria.pirza...@gmail.com)
> Preston Carman (prest...@apache.org)
> Raman Grover (ramangrove...@gmail.com)
> Sattam Alsubaiee (salsuba...@gmail.com)
> Steven Jacobs (sjaco...@apache.org)
> Taewoo Kim (wangs...@gmail.com)
> Till Westmann (ti...@apache.org)
> Vinayak Borkar (vinay...@apache.org)
> Yingyi Bu (buyin...@gmail.com)
> Young-Seok Kim (kiss...@gmail.com)
> Zach Heilbron (zheilb...@gmail.com)
>
>
> Affiliations
>
> UC Irvine
> - Mike Carey
> - Chen Li
> - Ian Maxon
> - Yingyi Bu
> - Raman Grover
> - Pouria Pirzadeh
> - Young-Seok Kim
> - Cameron Samak
> - Taewoo Kim
> - Jianfeng Jia
> - Murtadha Hubail
> - Markus Dreseler
>
> UC Riverside
> - Ildar Absalyamov
> - Preston Carman
> - Steven Jacobs
>
> Hebrew University
> - Keren Ouaknine
>
> Oracle
> - Till Westmann
>
> X15 Software
> - Vinayak Borkar
> - Zach Heilbron
>
> KACST Saudi Arabia
> - Sattam Alsubaiee
>
> Saudi Aramco
> - Abdullah Alamoudi
>
> Carey, Li, and Maxon are full-time UCI staff, with the remaining UCI
> (UC Irvine) and UCR (UC Riverside) affiliates being students. The
> non-UC committers are a mix of alumni who continue to contribute to
> the effort and individuals working with permission part-time (or in
> spare time) on this project.
>
>
> Sponsors
>
> Champion
>
> Chris Mattmann (NASA/JPL)
>
> Nominated Mentors
>
> TBD
>
> Sponsoring Entity
>
> The Apache Incubator
>
>
>
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: chris.a.mattm...@nasa.gov
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
