Re: [PROPOSAL] Tajo (A Data Warehouse System for Hadoop) to join Apache Incubator

Hyunsik Choi Mon, 25 Feb 2013 02:54:52 -0800

There is a typo: "I are" should be "we are".

Thanks,
Hyunsik Choi



On Mon, Feb 25, 2013 at 6:52 PM, Hyunsik Choi <hyun...@apache.org> wrote:

> Hi folks,
>
> I would like to propose Tajo to the Apache Incubator.
>
> http://wiki.apache.org/incubator/TajoProposal
>
> Tajo is a distributed data warehouse system for Hadoop. Tajo is designed
> for low-latency and scalable ad-hoc queries, online aggregation, and ETL on
> large data sets by leveraging advanced database techniques. It supports SQL
> standards. Tajo uses HDFS as its primary storage layer, uses Apache Hadoop
> YARN as its resource management platform, and has its own query engine
> which enables direct control of distributed execution and data flow. Tajo
> is in the alpha stage, and its initial code contains about 100,000 lines.
>
> I are looking forward to your feedback and suggestions.
>
> Thanks,
> Hyunsik Choi
>
> ------------------------------------------------------------------------
>
> = Abstract =
>
> Tajo is a distributed data warehouse system for Hadoop.
>
> = Proposal =
> Tajo is a relational and distributed data warehouse system for Hadoop.
> Tajo is designed for low-latency and scalable ad-hoc queries, online
> aggregation, and ETL on large data sets by leveraging advanced database
> techniques. It supports SQL standards. Tajo is inspired by Dryad,
> MapReduce, Dremel, Scope, and parallel databases. Tajo uses HDFS as its
> primary storage layer, and it has its own query engine, which allows direct
> control of distributed execution and data flow. As a result, Tajo has a
> variety of query evaluation strategies and more optimization opportunities.
> In addition, Tajo will have a native columnar execution engine and its own
> optimizer. Tajo will be an alternative to Hive/Pig on top of
> MapReduce.
>
> = Background =
> Big data analysis has gained much attention in industry. Open source
> communities have proposed scalable and distributed solutions for ad-hoc
> queries on big data. However, there is still room for improvement. Markets
> need faster and more efficient solutions. Recently, some alternatives
> (e.g., Cloudera's Impala and Amazon Redshift) have come out.
>
> = Rationale =
> There are a variety of open source distributed execution engines (e.g.,
> Hive and Pig) running on top of MapReduce. They are limited by the MR
> framework: because they simply run on it, they cannot directly control
> distributed execution and data flow. As a result, they have limited query
> evaluation strategies and optimization opportunities, and it is hard to
> optimize them for a particular type of data processing.
>
> = Initial Goals =
>
> The initial goal is to write more documents describing Tajo's internals.
> This will help us recruit more committers and build a solid
> community. Then, we will set milestones for short- and long-term plans.
>
> = Current Status =
>
> Tajo is in the alpha stage. Users can execute common SQL queries (e.g.,
> selection, projection, group-by, join, union, and sort), except for nested
> queries. Tajo provides various row/column storage formats, such as CSV,
> RowFile (a row-store file we have implemented), RCFile, and Trevni, and it
> also has a rudimentary ETL feature to transform data from one format to
> another. In addition, Tajo provides hash and range repartitioning. By
> using both repartition methods, Tajo processes aggregation, join, and sort
> queries over a number of cluster nodes. To evaluate the performance, we
> have carried out a benchmark test using TPC-H 1TB on 32 cluster nodes.
>
> == Meritocracy ==
>
> We will discuss milestones and future plans in an open forum. We
> plan to encourage an environment that supports a meritocracy.
> Contributors will have different privileges according to their
> contributions.
>
> == Community ==
> Big data analysis has gained attention from open source communities,
> industry, and academia. Some projects related to Hadoop already have
> very large and active communities. We expect that Tajo will also establish
> an active community. Since some of Tajo's features already work and it is in
> the alpha stage, we expect it to attract a large community soon.
>
> == Core Developers ==
> The core developers are very experienced in the Apache Hadoop ecosystem. To
> achieve more diversity of developers, we are eager to recruit
> developers from diverse companies.
> * Eli Reisman <ereisman AT apache DOT org>
> * Hyunsik Choi <hyunsik AT apache DOT org>
> * Jihoon Son <ghoonson AT gmail DOT com>
> * Jin Ho Kim <jhkim AT gruter DOT com>
> * Sangwook Kim <swkim AT inervit DOT com>
>
> == Alignment ==
> Tajo employs Apache Hadoop YARN as a resource management platform for
> large clusters. It uses HDFS as its primary storage layer. It already
> supports Hadoop-related data formats (RCFile, Trevni) and will support ORC
> files. In addition, we plan to integrate Tajo with other products of the
> Hadoop ecosystem. Tajo's modules are well organized, and these modules can
> also be used for other projects.
>
> = Known Risks =
>
> == Orphaned Products ==
> Most of the code has been developed by only two core developers, Hyunsik
> Choi and Jihoon Son, so there is a risk of the project being orphaned.
> However, they are guaranteed to have enough time to develop Tajo for
> years. As the commit history shows, they have participated in this project
> for about two years. Recently, Tajo has also gained the support of two IT
> companies in Korea, so the risk of being orphaned is relatively low. In
> addition, we are eager to recruit additional committers in order to
> eliminate this risk.
>
> == Inexperience with Open Source ==
> Most of the initial committers have experience working on open source
> projects. Eli Reisman and Hyunsik Choi have experience as committers and
> PMC members on other Apache projects.
>
> == Homogeneous Developers ==
> Although the core developers have four affiliations, the fact that most of
> them are in South Korea is a risk, because their location limits their
> offline activities. Since we recognize this risk, we will write
> more complete documents and presentation materials in order to disseminate
> Tajo's internals and user guide. To further mitigate this risk, we
> are eager to recruit additional committers around the world.
>
> == Reliance on Salaried Developers ==
> It is expected that Tajo development will occur on both salaried time and
> volunteer time. Hyunsik Choi and Jihoon Son belong to the Database Lab.,
> Korea Univ., and will be paid by the lab to contribute to Tajo for years. Jin
> Ho Kim and Sangwook Kim are paid by their employers to contribute to this
> project. Eli Reisman will contribute to this project on volunteer time. In
> addition, we are eager to recruit additional committers, including
> both salaried and non-salaried developers.
>
> == Relationships with Other Apache Products ==
> Tajo has some overlapping functionality with Apache Incubator Drill. However,
> Tajo is more mature than Drill. In addition, there are some significant
> differences. Drill is a distributed system specialized for low-latency
> query processing by using column operations and intermediate data
> streaming. Drill has a very simple query optimizer. However, some queries,
> including big-big table joins and sorts, cannot be processed in that manner,
> so Drill will support only some query types.
>
> In contrast, Tajo has an advanced query optimization system. Tajo mainly aims
> at scalable and efficient processing of all query types. Using its query
> optimizer, Tajo will pursue low-latency query processing only for the query
> types that can be executed in an online aggregation manner.
>
> Besides, Tez has some overlapping functionality with Tajo. However, Tez is in
> the pre-alpha stage and may still be a prototype. When Tez becomes feasible,
> Tajo could use Tez as an underlying framework where applicable.
> However, Tajo will still use its row/native columnar execution engine and
> its optimizer. Tajo could potentially be the first application of Tez.
>
> == An Excessive Fascination with the Apache Brand ==
> We believe that the Apache brand will help us to find contributors and to
> grow the community. The community and development process will make this
> project more stable and help establish ubiquitous APIs. In addition, Tajo
> depends on other projects in the Apache Hadoop ecosystem. We expect
> cooperative work with those projects to occur in the same place.
>
> = Documentation =
> Tajo's demonstration paper was accepted to IEEE ICDE 2013. Since this
> conference will be held in April 2013, we cannot publicly show the paper yet.
> Instead, we have attached some presentation material. See these slides:
> http://www.slideshare.net/hyunsikchoi/tajo-intro
>
> In addition, some documents (e.g., getting started) are available at
> http://tajo-project.github.com/tajo/
>
> = Initial Source =
> The initial source code has been developed in the Database Lab., Korea
> Univ. It is implemented in Java and has almost 100,000 lines, excluding
> the parser and protobuf-generated code. The initial source code is
> already available on GitHub at https://github.com/tajo-project/tajo.
>
> = Source and Intellectual Property Submission Plan =
>
> We intend the entire code base to be licensed under the Apache License,
> Version 2.0.
>
> = External Dependencies =
> All required dependencies have Apache-compatible licenses. The
> components with non-Apache licenses are listed below:
>
> * Google Guava
> * Google Protocol Buffers
> * ANTLR
> * Mockito
> * JLine2
>
> = Cryptography =
>
> Tajo will depend on secure Hadoop that can optionally use Kerberos.
>
> = Required Resources =
> == Mailing Lists ==
> * tajo-private (with moderated subscriptions)
> * tajo-dev
> * tajo-commits
> * tajo-user
>
> == Subversion Directory ==
> https://git-wip-us.apache.org/repos/asf/tajo.git
>
> == Issue Tracking ==
> Jira Tajo (TAJO)
>
> == Other Resources ==
> * Continuous Integration
>   * Jenkins
> * Wiki
>   * http://wiki.apache.org/tajo
>
> = Initial Committers =
> * Eli Reisman <ereisman AT apache DOT org>
> * Hyunsik Choi <hyunsik AT apache DOT org>
> * Jihoon Son <ghoonson AT gmail DOT com>
> * Jin Ho Kim <jhkim AT gruter DOT com>
> * Sangwook Kim <swkim AT inervit DOT com>
>
> = Affiliations =
> * Eli Reisman (Hortonworks)
> * Hyunsik Choi (Database Lab., Korea University)
> * Jihoon Son (Database Lab., Korea University)
> * Jin Ho Kim (Gruter)
> * Sangwook Kim (Inervit)
>
> The nominated mentors are employees of Hortonworks, LinkedIn, and NASA JPL.
>
> * Chris Mattmann - NASA JPL
> * Jakob Homan - LinkedIn
> * Owen O'Malley - Hortonworks
>
> = Sponsors =
>
> == Champion ==
>
> * Jakob Homan <jghoman AT apache DOT org>
>
> == Nominated Mentors ==
>
> * Chris Mattmann <chris DOT a DOT mattmann AT jpl DOT nasa DOT gov>
> * Jakob Homan <jghoman AT apache DOT org>
> * Owen O'Malley <omalley AT apache DOT org>
>
> == Sponsoring Entity ==
> Apache Incubator
>
