Re: [PROPOSAL] MRQL for the Apache Incubator

Mattmann, Chris A (388J) Sat, 02 Mar 2013 17:12:41 -0800

Sounds awesome guys look forward to the VOTE.

Cheers,
Chris


On 3/2/13 7:12 AM, "Leonidas Fegaras" <fega...@cse.uta.edu> wrote:

>Dear ASF members,
>
>We would like to propose a new project to the incubator, called MRQL.
>Edward J. Yoon has volunteered to be the champion for this project.
>The proposal draft is available at:
>
>http://wiki.apache.org/incubator/MRQLProposal
>
>We are very excited about having this opportunity to work with ASF to
>create an incubator project. We are looking forward to your feedback
>and suggestions.
>Best regards
>Leonidas Fegaras
>
>
>= Abstract =
>
>MRQL is a query processing and optimization system for large-scale,
>distributed data analysis, built on top of Apache Hadoop and Hama.
>
>= Proposal =
>
>MRQL (pronounced ''miracle'') is a query processing and optimization
>system for large-scale, distributed data analysis. MRQL (the MapReduce
>Query Language) is an SQL-like query language for large-scale data
>analysis on a cluster of computers. The MRQL query processing system
>can evaluate MRQL queries in two modes: in MapReduce mode on top of
>Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of
>Apache Hama. The MRQL query language is powerful enough to express
>most common data analysis tasks over many forms of raw ''in-situ''
>data, such as XML and JSON documents, binary files, and CSV
>documents. MRQL is more powerful than other current high-level
>MapReduce languages, such as Hive and PigLatin, since it can operate
>on more complex data and supports more powerful query constructs, thus
>eliminating the need for using explicit MapReduce code. With MRQL,
>users will be able to express complex data analysis tasks, such as
>PageRank, k-means clustering, matrix factorization, etc, using
>SQL-like queries exclusively, while the MRQL query processing system
>will be able to compile these queries to efficient Java code.
>
>= Background =
>
>The initial code was developed at the University of Texas of Arlington
>(UTA) by a research team, led by Leonidas Fegaras. The software was
>first released in May 2011. The original goal of this project was to
>build a query processing system that translates SQL-like data analysis
>queries to efficient workflows of MapReduce jobs. A design goal was to
>use HDFS as the physical storage layer, without any indexing, data
>partitioning, or data normalization, and to use Hadoop (without
>extensions) as the run-time engine. The motivation behind this work
>was to built a platform to test new ideas on query processing and
>optimization techniques applicable to the MapReduce framework.
>
>A year ago, MRQL was extended to run on Hama. The motivation for this
>extension was that Hadoop MapReduce jobs were required to read their
>input and write their output on HDFS. This simplifies reliability and
>fault tolerance but it imposes a high overhead to complex MapReduce
>workflows and graph algorithms, such as PageRank, which require
>repetitive jobs. In addition, Hadoop does not preserve data in memory
>across consecutive MapReduce jobs. This restriction requires to read
>data at every step, even when the data is constant. BSP, on the other
>hand, does not suffer from this restriction, and, under certain
>circumstances, allows complex repetitive algorithms to run entirely in
>the collective memory of a cluster. Thus, the goal was to be able to
>run the same MRQL queries in both modes, MapReduce and BSP, without
>modifying the queries: If there are enough resources available, and
>low latency and speed are more important than resilience, queries may
>run in BSP mode; otherwise, the same queries may run in MapReduce
>mode. BSP evaluation was found to be a good choice when fault
>tolerance is not critical, data (both input and intermediate) can fit
>in the cluster memory, and data processing requires complex/repetitive
>steps.
>
>The research results of this ongoing work have already been published
>in conferences (WebDB'11, EDBT'12, and DataCloud'12) and the authors
>have already received positive feedback from researchers in academia
>and industry who were attending these conferences.
>
>= Rationale =
>
>* MRQL will be the first general-purpose, SQL-like query language for
>data analysis based on BSP.
>Currently, many programmers prefer to code their MapReduce
>applications in a higher-level query language, rather than an
>algorithmic language. For instance, Pig is used for 60% of Yahoo
>MapReduce jobs, while Hive is used for 90% of Facebook MapReduce
>jobs. This, we believe, will also be the trend for BSP applications,
>because, even though, in principle, the BSP model is very simple to
>understand, it is hard to develop, optimize, and maintain non-trivial
>BSP applications coded in a general-purpose programming
>language. Currently, there is no widely acceptable declarative BSP
>query language, although there are a few special-purpose BSP systems
>for graph analysis, such as Google Pregel and Apache Giraph, for
>machine learning, such as BSML, and for scientific data analysis.
>
>* MRQL can capture many complex data analysis algorithms in
>declarative form.
>Existing MapReduce query languages, such as HiveQL and PigLatin,
>provide a limited syntax for operating on data collections, in the
>form of relational joins and group-bys. Because of these limitations,
>these languages enable users to plug-in custom MapReduce scripts into
>their queries for those jobs that cannot be declaratively coded in
>their query language. This nulliﬁes the beneﬁts of using a
>declarative query language and may result to suboptimal, error-prone,
>and hard-to-maintain code. More importantly, these languages are
>inappropriate for complex scientiﬁc applications and graph analysis,
>because they do not directly support iteration or recursion in
>declarative form and are not able to handle complex, nested scientiﬁc
>data, which are often semi-structured. Furthermore, current MapReduce
>query processors apply traditional query optimization techniques that
>may be suboptimal in a MapReduce or BSP environment.
>
>* The MRQL design is modular, with pluggable distributed processing
>back-ends, query languages, and data formats.
>MRQL aims to be both powerful and adaptable. Although Hadoop is
>currently the most popular framework for large-scale data analysis,
>there are a few alternatives that are currently shaping form,
>including frameworks based on BSP (eg, Giraph, Pregel, Hama), MPI
>(eg, OpenMPI), etc. MRQL was designed in such a way so that it will
>be easy to support other distributed processing frameworks in the
>future. As an evidence of this claim, the MRQL processor required
>only 2K extra lines of Java code to support BSP evaluation.
>
>= Initial Goals =
>
>Some current goals include:
>
>* apply MRQL to graph analysis problems, such as k-means clustering
>and PageRank
>
>* apply MRQL to large-scale scientific analysis (develop general
>optimization techniques that can apply to matrix multiplication,
>matrix factorization, etc)
>
>* process additional data formats, such as Avro, and column-based
>stores, such as HBase
>
>* map MRQL to additional distributed processing frameworks, such as
>Spark and OpenMPI
>
>* extend the front-end to process more query languages, such as
>standard SQL, SPARQL, XQuery, and PigLatin
>
>= Current Status =
>
>The current MRQL release (version 0.8.10) is a beta release. It is
>built on top of Hadoop and Hama (no extensions are needed). It
>currently works on Hadoop up to 1.0.4 (but not on Yarn yet) and Hama
>0.5.0. It has only been tested on a small cluster of 20 nodes (80
>cores).
>
>== Meritocracy ==
>
>The initial MRQL code base was developed by Leonidas Fegaras in May
>2011, and was continuously improved throughout the years. We will
>reach out other potential contributors through open forums. We plan
>to do everything possible to encourage an environment that supports a
>meritocracy, where contributors will extend their privileges based on
>their contribution. MRQL's modular design will facilitate the
>strategic extensions to various modules, such as adding a standard-SQL
>interface, introducing new optimization techniques, etc.
>
>== Community ==
>
>The interest in open-source query processing systems for analyzing
>large datasets has been steadily increased in the last few years.
>Related Apache projects have already attracted a very large community
>from both academia and industry. We expect that MRQL will also
>establish an active community. Several researchers from both academia
>and industry who are interested in using our code have already
>contacted us.
>
>== Core Developers ==
>
>The initial core developer was Leonidas Fegaras, who wrote the
>majority of the code. He is an associate professor at UTA, with
>interests in cloud computing, databases, web technologies, and
>functional programming. He has an extensive knowledge and working
>experience in building complex query processing systems for databases,
>and compilers for functional and algorithmic programming languages.
>
>== Alignment ==
>
>MRQL is built on top of two Apache projects: Hadoop and Hama. We have
>plans to incorporate other products from the Hadoop ecosystem, such as
>Avro and HBase. MRQL can serve as a testbed for fine-tuning and
>evaluating the performance of the Apache Hama system. Finally, the
>MRQL query language and processor can be used by Apache Drill as a
>pluggable query language.
>
>= Known Risks =
>
>== Orphaned Products ==
>
>The initial committer is from academia, which may be a risk, since
>research in academia is publication-driven, rather than
>product-driven. It happens very often in academic research, when a
>project becomes outdated and doesn't produce publishable results, to
>be abandoned in favor of new cutting-edge projects. We do not believe
>that this will be the case for MRQL for the years to come, because it
>can be adapted to support new query languages, new optimization
>techniques, and new distributed back-ends, thus sustaining enough
>research interest. Another risk is that, when graduate students who
>write code graduate, they may leave their work undocumented and
>unfinished. We will strive to get enough momentum to recruit
>additional committers from industry in order to eliminate these risks.
>
>== Inexperience with Open Source ==
>
>The initial developer has been involved with various projects whose
>source code has been released under open source license, but he has no
>prior experience on contributing to open-source projects. With the
>guidance from other more experienced committers and participants, we
>expect that the meritocracy rules will have a positive influence on
>this project.
>
>== Homogeneous Developers ==
>
>The initial committer comes from academia. However, given the interest
>we have seen in the project, we expect the diversity to improve in the
>near future.
>
>== Reliance on Salaried Developers ==
>
>Currently, the MRQL code was developed on the committer's volunteer
>time. In the future, UTA graduate students who will do some of the
>coding may be supported by UTA and funding agencies, such as NSF.
>
>== Relationships with Other Apache Products ==
>
>MRQL has some overlapping functionality with Hive and Tajo, which are
>Data Warehouse systems for Hadoop, and with Drill, which is an
>interactive data analysis system that can process nested data. MRQL
>has a more powerful data model, in which any form of nested data, such
>as XML and JSON, can be defined as a user-defined datatype. More
>importantly, complex data analysis tasks, such as PageRank, k-means
>clustering, and matrix multiplication and factorization, can be
>expressed as short SQL-like queries, while the MRQL system is able to
>evaluate these queries efficiently. Furthermore, the MRQL system can
>run these queries in BSP mode, in addition to MapReduce mode, thus
>achieving low latency and speed, which are also Drill's goals.
>Nevertheless, we will welcome and encourage any help from these
>projects and we will be eager to make contributions to these projects
>too.
>
>== An Excessive Fascination with the Apache Brand ==
>
>The Apache brand is likely to help us find contributors and reach out
>to the open-source community. Nevertheless, since MRQL depends on
>Apache projects (Hadoop and Hama), it makes sense to have our software
>available as part of this ecosystem.
>
>= Documentation =
>
>Information about MRQL can be found at http://lambda.uta.edu/mrql/
>
>= Initial Source =
>
>The initial MRQL code has been released as part of a research project
>developed at the University of Texas at Arlington under the Apache 2.0
>license for the past two years. The source code is currently hosted
>on GitHub at: https://github.com/fegaras/mrql MRQL’s release artifact
>would consist of a single tarball of packaging and test code.
>
>= External Dependencies =
>
>The MRQL source code is already licensed under the Apache License,
>Version 2.0. MRQL uses JLine which is distributed under the BSD
>license.
>
>= Cryptography =
>
>Not applicable.
>
>= Required Resources =
>
>== Mailing Lists ==
>
>* mrql-private
>* mrql-dev
>* mrql-user
>
>== Subversion Directory ==
>
>* Git is the preferred source control system:
>git://git.apache.org/mrql
>
>== Issue Tracking ==
>
>* A JIRA issue tracker, MRQL
>
>= Initial Committers =
>
>* Leonidas Fegaras <fegaras AT cse DOT uta DOT edu>
>* Upa Gupta <upa.gupta AT mavs DOT uta DOT edu>
>* Edward J. Yoon <edwardyoon AT apache DOT org>
>* Maqsood Alam <maqsoodalam AT hotmail DOT com>
>* John Hope <john.hope AT oracle DOT com>
>* Mark Wall <mark.wall AT oracle DOT com>
>* Kuassi Mensah <kuassi.mensah AT oracle DOT com>
>* Ambreesh Khanna <ambreesh.khanna AT oracle DOT com>
>
>= Affiliations =
>
>* Leonidas Fegaras (University of Texas at Arlington)
>* Upa Gupta (University of Texas at Arlington)
>* Edward J. Yoon (Oracle corp)
>* Maqsood Alam (Oracle corp)
>* John Hope (Oracle corp)
>* Mark Wall (Oracle corp)
>* Kuassi Mensah (Oracle corp)
>* Ambreesh Khanna (Oracle corp)
>
>= Sponsors =
>
>== Champion ==
>
>* Edward J. Yoon <edwardyoon AT apache DOT org>
>
>== Nominated Mentors ==
>
>* Alex Karasulu <akarasulu AT apache DOT org>
>
>== Sponsoring Entity ==
>
>Incubator PMC
>
>
>---------------------------------------------------------------------
>To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>For additional commands, e-mail: general-h...@incubator.apache.org
>


---------------------------------------------------------------------
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org

Re: [PROPOSAL] MRQL for the Apache Incubator

Reply via email to