---------- Forwarded message ----------
From: Owen O'Malley <omal...@apache.org>
Date: Fri, Mar 20, 2015 at 11:36 AM
Subject: Create ORC project
To: bo...@apache.org

Board,

We'd like to create a separate ORC project from the code base that is currently in Hive.

*Apache ORC for Apache Top Level Project*

*Abstract*

ORC is a fast columnar file format for Apache Hadoop style workloads that supports columnar projection and pushing filters down in the reader. Both features can dramatically reduce the number of bytes that need to be read, decompressed, and deserialized to answer a query. ORC has been developed and released by Apache Hive, but other projects wanting to use ORC do not want to depend on Hive’s large jar and dependency tree. Additionally, a C++ ORC reader and writer are being developed and would benefit from being released as the same project. The Hive community believes that further development on ORC can be done better as a separate project as discussed on the Hive email lists (here <http://search-hadoop.com/m/8er9LOj9O1/separate+orc&subj=ORC+separate+project> and here <http://search-hadoop.com/m/8er9WxfWw&subj=Re+Native+ORC>).

*Proposal*

Although ORC (Optimized Row Columnar) file was originally developed in Apache Hive, there are several forces that are encouraging it to move to a separate project. First is that projects both inside and outside of Apache wish to support it, but do not want to depend on Hive and its large list of dependencies. Additionally, the Hive community, as a Java project, is not interested in incorporating the new C++ implementation of ORC into their code base. Developing and releasing the Java and C++ ORC readers and writers in the same project will allow them to stay synchronized with each other and give users a single place to direct questions and file issues. Moving out of Hive will also allow ORC to support other languages in the future (Go, etc.), release on a faster release cycle than Hive, and develop an independent community.

The traditional path at Apache would have been to create an incubator project, but the code is already being released by Apache and most of the developers are familiar with Apache rules and guidelines. In particular, the proposed PMC has 3 Apache members and incubator PMC members from three companies. They will provide oversight and guidance for the developers that are less experienced in the Apache Way. Therefore, the ORC project would like to propose becoming a Top Level Project at Apache.

*Overview of ORC*

Although Hive's RCFile was the standard format for storing tabular data in Hadoop for several years, it has limitations because it treats each column as a binary blob without semantics. In 2013, Hive added a new file format named Optimized Row Columnar (ORC) file that uses and retains the type information from the table definition. ORC uses type-specific readers and writers that provide lightweight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding, resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include lightweight indexes that record the minimum and maximum values for each column in each set of 10,000 rows and in the entire file. ORC files also have optional bloom filters that provide fine-grained details of the values in each set of 10,000 rows. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren't relevant to the query.

*Current Status*

*Meritocracy*

ORC has been developed as part of Apache Hive and thus has been operating as a meritocracy. Many of the developers of ORC are Hive PMC members or committers.
The ORC project plans to continue adding new PMC members and committers as the project continues to develop.

*Community*

ORC’s development team seeks to foster the development and user communities. We feel that becoming a separate project will improve both communities by being smaller and more focused than Hive and bring tighter integration with various Apache projects that either don’t want to or can’t accept the large list of dependencies from Hive.

*Core Developers*

ORC is being primarily developed by HP, Hortonworks, and Microsoft. Facebook was instrumental in the early development and is an active user.

*Alignment*

The ASF is a natural host for ORC given that it is already the home of Hadoop, Pig, Hive, and other emerging distributed computing software projects. ORC was designed to offer improved storage capability for Hadoop clusters and query speed on Hive and Pig.

*Known Risks*

*Orphaned Products*

The core developers of the ORC team are actively working on the project and plan to continue. There is very little risk of ORC getting orphaned since many large companies are storing their production data in ORC format. For example, Facebook is using ORC to store tens of petabytes of their production data (blog <https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/>).

*Inexperience with Open Source*

The proposed PMC has extensive experience with Apache projects and includes 3 Apache members and Incubator PMC members. The ORC PMC and the more experienced committers will be responsible for training the committers that are less familiar with the Apache Way.

*Homogeneous Developers*

The developers include employees from Facebook, HP, Hortonworks, Microsoft, and an independent contributor. Apache projects encourage an open and diverse meritocratic community and the ORC team is very motivated to increase the size and diversity of the development team.

*Reliance on Salaried Developers*

Most of the work on ORC has been done by salaried developers, but the hope is that by making ORC a separate project, it will be more approachable for new developers including non-salaried developers.

*Relationships with Other Apache Products*

ORC has a strong relationship and integration with Apache Hadoop, Hive, and Pig. Being independent of Hive will allow other projects to depend on ORC directly without incurring the cost of depending on the large list of Hive dependencies. ORC would like to encourage integration with additional Apache projects:

- Apache Bigtop
- Apache Drill
- Apache Flink
- Apache Flume
- Apache Spark

ORC does compete with Parquet, which is also a columnar format that was released after ORC, and to a lesser extent Avro and Thrift, which are row-major serialization formats. Apache, as a foundation, does not pick particular projects among competitors, but rather acts as a support system for each project’s community.

*An Excessive Fascination with the Apache Brand*

ORC wants to become an Apache project in order to help efforts to diversify the committer base, and not to capitalize on the Apache brand. The ORC project is in production use already inside many large companies and is already being released by Apache Hive. As such, the ORC project is not seeking to use the Apache brand as a marketing tool.

*Documentation*

The primary documentation about ORC is located on the Apache Hive wiki <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC>. There have also been presentations on ORC:

● Introduction to ORC files 2012 <http://www.slideshare.net/oom65/orc-fileintro>
● Berlin Buzzwords 2013 <http://www.slideshare.net/oom65/orc-files>
● Hadoop Summit 2013 <http://www.slideshare.net/Hadoop_Summit/hanson-o-malleypandeyjune27425pmroom212?related=1>

*Initial Source*

ORC has been under development as part of Hive since late 2012.
The original inclusion into Hive was via HIVE-3874 <https://issues.apache.org/jira/browse/HIVE-3874>. There are several implementations that read or write the ORC format:

● Hive reader and writer in Hive subversion <http://svn.apache.org/repos/asf/hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/io/orc/>
● Presto integrated reader in Presto github <https://github.com/facebook/presto/tree/master/presto-orc>
● C++ reader and writer in github <https://github.com/hortonworks/orc>

The intent is to pull both the Hive Java reader and writer and the C++ reader and writer into the Apache ORC project.

*External Dependencies*

ORC has the following external dependencies.

● Build tools
  - Apache Maven
  - Gmock
  - JUnit
● Apache
  - Log4j
  - Hadoop
● Non-Apache
  - JDK 1.6+
  - Protobuf
  - Snappy
  - zlib

*Cryptography*

ORC does not currently support encryption, but will eventually support column encryption.

*Required Resources*

*Mailing Lists*

● private@orc for private PMC discussions (with moderated subscriptions)
● dev@orc
● user@orc
● commits@orc

*Version Control*

Git is the preferred source control system.

*Issue Tracking*

ORC will need a JIRA instance.

*Other Resources*

The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.

*Initial PMC*

● Chris Douglas <cdouglas at apache.org> (Apache member, Incubator & Hadoop PMC)
● Alan Gates <gates at apache.org> (Apache member, Incubator, Hive & Pig PMC)
● Prasanth Jayachandran <prasanthj at apache dot org> (Hive PMC)
● Lefty Leverenz <leftyleverenz at gmail dot com> (Hive PMC)
● Owen O’Malley <omalley at apache dot org> (Apache member, Incubator, Hadoop, & Hive PMC)

We’d like to propose Owen O’Malley as the initial VP for the ORC project.

*Initial Committers*

● Thanh Do <thdo at microsoft dot com>
● Gunther Hagleitner <gunther at apache dot org> (Hive PMC)
● Pavan Lanka <pavibhai at gmail dot com>
● Aliaksei Sandryhaila <aliaksei.sandryhaila at hp dot com>
● Sergey Shelukhin <sershe at apache dot org> (Hive PMC)
● Dain Sundstrom <dain at fb dot com>
● Gopal Vijayaraghavan <gopalv at apache dot org> (Hive committer)
● Stephen Walkauskas <stephen.walkauskas at hp dot com>
● Kevin Wilfong <kevinwilfong at fb dot com> (Hive PMC)
● Jing Xu <jing.xu2 at hp dot com>
● Xuefu Zhang <xzhang at cloudera dot com>

*Affiliations*

The initial PMC members are employed at Doc of the Bay, Hortonworks and Microsoft. The initial committers are employed by Cloudera, Facebook, HP, Hortonworks, and Microsoft, and include one independent contributor.