Oops, apologies - thanks for the reminder. I uploaded the slides as an attachment on the wiki page.
Thanks,
Tomer

On Wed, Aug 8, 2012 at 9:14 PM, Jakob Homan <jgho...@gmail.com> wrote:

> So, no response to my request above about the design docs and
> not-TO-DOne MapR presentation?
>
> On Wed, Aug 8, 2012 at 3:25 PM, Chris Douglas <cdoug...@apache.org> wrote:
> > +1 -C
> >
> > On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning <ted.dunn...@gmail.com> wrote:
> >> This is a duplicated attempt at sending this message, please ignore the
> >> previous message if it eventually arrives. There appears to be a hangup
> >> sending email from my apache email address via gmail.
> >>
> >> Abstract
> >> ========
> >> Drill is a distributed system for interactive analysis of large-scale
> >> datasets, inspired by Google’s Dremel
> >> (http://research.google.com/pubs/pub36632.html).
> >>
> >> Proposal
> >> ========
> >> Drill is a distributed system for interactive analysis of large-scale
> >> datasets. Drill is similar to Google’s Dremel, with the additional
> >> flexibility needed to support a broader range of query languages, data
> >> formats and data sources. It is designed to efficiently process nested
> >> data. It is a design goal to scale to 10,000 servers or more and to be able
> >> to process petabytes of data and trillions of records in seconds.
> >>
> >> Background
> >> ==========
> >> Many organizations have the need to run data-intensive applications,
> >> including batch processing, stream processing and interactive analysis. In
> >> recent years open source systems have emerged to address the need for
> >> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> >> Apache S4). In 2010 Google published a paper called “Dremel: Interactive
> >> Analysis of Web-Scale Datasets,” describing a scalable system used
> >> internally for interactive analysis of nested data. No open source project
> >> has successfully replicated the capabilities of Dremel.
> >>
> >> Rationale
> >> =========
> >> There is a strong need in the market for low-latency interactive analysis
> >> of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
> >> Buffers). This need was identified by Google and addressed internally with
> >> a system called Dremel.
> >>
> >> In recent years open source systems have emerged to address the need for
> >> scalable batch processing (Apache Hadoop) and stream processing (Storm,
> >> Apache S4). Apache Hadoop, originally inspired by Google’s internal
> >> MapReduce system, is used by thousands of organizations processing
> >> large-scale datasets. Apache Hadoop is designed to achieve very high
> >> throughput, but is not designed to achieve the sub-second latency needed
> >> for interactive data analysis and exploration. Drill, inspired by Google’s
> >> internal Dremel system, is intended to address this need.
> >>
> >> It is worth noting that, as explained by Google in the original paper,
> >> Dremel complements MapReduce-based computing. Dremel is not intended as a
> >> replacement for MapReduce and is often used in conjunction with it to
> >> analyze outputs of MapReduce pipelines or rapidly prototype larger
> >> computations. Indeed, Dremel and MapReduce are both used by thousands of
> >> Google employees.
> >>
> >> Like Dremel, Drill supports a nested data model with data encoded in a
> >> number of formats such as JSON, Avro or Protocol Buffers. In many
> >> organizations nested data is the standard, so supporting a nested data
> >> model eliminates the need to normalize the data. With that said, flat data
> >> formats, such as CSV files, are naturally supported as a special case of
> >> nested data.
> >>
> >> The Drill architecture consists of four key components/layers:
> >> * Query languages: This layer is responsible for parsing the user’s query
> >> and constructing an execution plan.
> >> The initial goal is to support the
> >> SQL-like language used by Dremel and Google BigQuery
> >> (https://developers.google.com/bigquery/docs/query-reference), which we call
> >> DrQL. However, Drill is designed to support other languages and programming
> >> models, such as the Mongo Query Language
> >> (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading
> >> (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
> >> * Low-latency distributed execution engine: This layer is responsible for
> >> executing the physical plan. It provides the scalability and fault
> >> tolerance needed to efficiently query petabytes of data on 10,000 servers.
> >> Drill’s execution engine is based on research in distributed execution
> >> engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
> >> storage, and can be extended with additional operators and connectors.
> >> * Nested data formats: This layer is responsible for supporting various
> >> data formats. The initial goal is to support the column-based format used
> >> by Dremel. Drill is designed to support schema-based formats such as
> >> Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
> >> formats such as JSON, BSON or YAML. In addition, it is designed to support
> >> column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
> >> row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
> >> particular distinction with Drill is that the execution engine is flexible
> >> enough to support column-based processing as well as row-based processing.
> >> This is important because column-based processing can be much more
> >> efficient when the data is stored in a column-based format, but many large
> >> data assets are stored in a row-based format that would require conversion
> >> before use.
> >> * Scalable data sources: This layer is responsible for supporting various
> >> data sources.
> >> The initial focus is to leverage Hadoop as a data source.
> >>
> >> It is worth noting that no open source project has successfully replicated
> >> the capabilities of Dremel, nor have any taken on the broader goals of
> >> flexibility (eg, pluggable query languages, data formats, data sources and
> >> execution engine operators/connectors) that are part of Drill.
> >>
> >> Initial Goals
> >> =============
> >> The initial goals for this project are to specify the detailed requirements
> >> and architecture, and then develop the initial implementation including the
> >> execution engine and DrQL.
> >> Like Apache Hadoop, which was built to support multiple storage systems
> >> (through the FileSystem API) and file formats (through the
> >> InputFormat/OutputFormat APIs), Drill will be built to support multiple
> >> query languages, data formats and data sources. The initial implementation
> >> of Drill will support DrQL and a column-based format similar to Dremel’s.
> >>
> >> Current Status
> >> ==============
> >> Significant work has been completed to identify the initial requirements
> >> and define the overall system architecture. The next step is to implement
> >> the four components described in the Rationale section, and we intend to do
> >> that development as an Apache project.
> >>
> >> Meritocracy
> >> ===========
> >> We plan to invest in supporting a meritocracy. We will discuss the
> >> requirements in an open forum. Several companies have already expressed
> >> interest in this project, and we intend to invite additional developers to
> >> participate. We will encourage and monitor community participation so that
> >> privileges can be extended to those that contribute. Also, Drill has an
> >> extensible/pluggable architecture that encourages developers to contribute
> >> various extensions, such as query languages, data formats, data sources and
> >> execution engine operators and connectors.
> >> While some companies will surely
> >> develop commercial extensions, we also anticipate that some companies and
> >> individuals will want to contribute such extensions back to the project,
> >> and we look forward to fostering a rich ecosystem of extensions.
> >>
> >> Community
> >> =========
> >> The need for an open source system for interactive analysis of large
> >> datasets is tremendous, so there is a potential for a very large
> >> community. We believe that Drill’s extensible architecture will further
> >> encourage community participation. Also, related Apache projects (eg,
> >> Hadoop) have very large and active communities, and we expect that over
> >> time Drill will also attract a large community.
> >>
> >> Core Developers
> >> ===============
> >> The developers on the initial committers list include experienced
> >> distributed systems engineers:
> >> * Tomer Shiran has experience developing distributed execution engines. He
> >> developed Parallel DataSeries, a data-parallel version of the open source
> >> DataSeries system (http://tesla.hpl.hp.com/opensource/). He is also the
> >> author of Applying Idealized Lower-bound Runtime Models to Understand
> >> Inefficiencies in Data-intensive Computing (SIGMETRICS 2011). Tomer worked
> >> as a software developer and researcher at IBM Research, Microsoft and HP
> >> Labs, and is now at MapR Technologies. He has been active in the Hadoop
> >> community since 2009.
> >> * Jason Frantz was at Clustrix, where he designed and developed the first
> >> scale-out SQL database based on MySQL. Jason developed the distributed
> >> query optimizer that powered Clustrix. He is now a software engineer and
> >> architect at MapR Technologies.
> >> * Ted Dunning is a PMC member for Apache ZooKeeper and Apache Mahout, and
> >> has a history of over 30 years of contributions to open source. He is now
> >> at MapR Technologies.
> >> Ted has been very active in the Hadoop community
> >> since the project’s early days.
> >> * MC Srivas is the co-founder and CTO of MapR Technologies. While at Google
> >> he worked on Google’s scalable search infrastructure. MC Srivas has been
> >> active in the Hadoop community since 2009.
> >> * Chris Wensel is the founder and CEO of Concurrent. Prior to founding
> >> Concurrent, he developed Cascading, an Apache-licensed open source
> >> application framework enabling Java developers to quickly and easily
> >> develop robust Data Analytics and Data Management applications on Apache
> >> Hadoop. Chris has been involved in the Hadoop community since the project's
> >> early days.
> >> * Keys Botzum was at IBM, where he worked on security and distributed
> >> systems, and is currently at MapR Technologies.
> >> * Gera Shegalov was at Oracle, where he worked on networking, storage and
> >> database kernels, and is currently at MapR Technologies.
> >> * Ryan Rawson is the VP Engineering of Drawn to Scale where he developed
> >> Spire, a real-time operational database for Hadoop. He is also a committer
> >> and PMC member for Apache HBase, and has a long history of contributions to
> >> open source. Ryan has been involved in the Hadoop community since the
> >> project's early days.
> >>
> >> We realize that additional employer diversity is needed, and we will work
> >> aggressively to recruit developers from additional companies.
> >>
> >> Alignment
> >> =========
> >> The initial committers strongly believe that a system for interactive
> >> analysis of large-scale datasets will gain broader adoption as an open
> >> source, community driven project, where the community can contribute not
> >> only to the core components, but also to a growing collection of query
> >> languages and optimizers, data formats, data sources, and execution engine
> >> operators and connectors. Drill will integrate closely with Apache Hadoop.
> >> First, the data will live in Hadoop. That is, Drill will support Hadoop
> >> FileSystem implementations and HBase. Second, Hadoop-related data formats
> >> will be supported (eg, Apache Avro, RCFile). Third, MapReduce-based tools
> >> will be provided to produce column-based formats. Fourth, Drill tables can
> >> be registered in HCatalog. Finally, Hive is being considered as the basis
> >> of the DrQL implementation.
> >>
> >> Known Risks
> >> ===========
> >>
> >> Orphaned Products
> >> =================
> >> The contributors are leading vendors in this space, with significant open
> >> source experience, so the risk of being orphaned is relatively low. The
> >> project could be at risk if vendors decided to change their strategies in
> >> the market. In such an event, the current committers plan to continue
> >> working on the project on their own time, though the progress will likely
> >> be slower. We plan to mitigate this risk by recruiting additional
> >> committers.
> >>
> >> Inexperience with Open Source
> >> =============================
> >> The initial committers include veteran Apache members (committers and PMC
> >> members) and other developers who have varying degrees of experience with
> >> open source projects. All have been involved with source code that has been
> >> released under an open source license, and several also have experience
> >> developing code with an open source development process.
> >>
> >> Homogenous Developers
> >> =====================
> >> The initial committers are employed by a number of companies, including
> >> MapR Technologies, Concurrent and Drawn to Scale. We are committed to
> >> recruiting additional committers from other companies.
> >>
> >> Reliance on Salaried Developers
> >> ===============================
> >> It is expected that Drill development will occur on both salaried time and
> >> on volunteer time, after hours.
> >> The majority of initial committers are paid
> >> by their employer to contribute to this project. However, they are all
> >> passionate about the project, and we are confident that the project will
> >> continue even if no salaried developers contribute to the project. We are
> >> committed to recruiting additional committers including non-salaried
> >> developers.
> >>
> >> Relationships with Other Apache Products
> >> ========================================
> >> As mentioned in the Alignment section, Drill is closely integrated with
> >> Hadoop, Avro, Hive and HBase in numerous ways. For example, Drill data
> >> lives inside a Hadoop environment (Drill operates on in situ data). We look
> >> forward to collaborating with those communities, as well as other Apache
> >> communities.
> >>
> >> An Excessive Fascination with the Apache Brand
> >> ==============================================
> >> Drill solves a real problem that many organizations struggle with, and has
> >> been proven within Google to be of significant value. The architecture is
> >> based on academic and industry research. Our rationale for developing Drill
> >> as an Apache project is detailed in the Rationale section. We believe that
> >> the Apache brand and community process will help us attract more
> >> contributors to this project, and help establish ubiquitous APIs. In
> >> addition, establishing consensus among users and developers of a
> >> Dremel-like tool is a key requirement for success of the project.
> >>
> >> Documentation
> >> =============
> >> Drill is inspired by Google’s Dremel. Google has published a paper
> >> highlighting Dremel’s innovative nested column-based data format and
> >> execution engine: http://research.google.com/pubs/pub36632.html
> >>
> >> High-level slides have been published by MapR: TODO
> >>
> >> Initial Source
> >> ==============
> >> There is no initial source code. All source code will be developed within
> >> the Apache Incubator.
> >>
> >> Cryptography
> >> ============
> >> Drill will eventually support encryption on the wire. This is not one of
> >> the initial goals, and we do not expect Drill to be a controlled export
> >> item due to the use of encryption.
> >>
> >> Required Resources
> >> ==================
> >>
> >> Mailing List
> >> ============
> >> * drill-private
> >> * drill-dev
> >> * drill-user
> >>
> >> Subversion Directory
> >> ====================
> >> Git is the preferred source control system: git://git.apache.org/drill
> >>
> >> Issue Tracking
> >> ==============
> >> JIRA Drill (DRILL)
> >>
> >> Initial Committers
> >> ==================
> >> * Tomer Shiran (tshiran at maprtech dot com)
> >> * Ted Dunning (tdunning at apache dot org)
> >> * Jason Frantz (jfrantz at maprtech dot com)
> >> * MC Srivas (mcsrivas at maprtech dot com)
> >> * Chris Wensel (chris at concurrentinc dot com)
> >> * Keys Botzum (kbotzum at maprtech dot com)
> >> * Gera Shegalov (gshegalov at maprtech dot com)
> >> * Ryan Rawson (ryan at drawntoscale dot com)
> >>
> >> Affiliations
> >> ============
> >> The initial committers are employees of MapR Technologies, Drawn to Scale
> >> and Concurrent. The nominated mentors are employees of MapR Technologies,
> >> Lucid Imagination and Nokia.
> >>
> >> Sponsors
> >> ========
> >>
> >> Champion
> >> ========
> >> Ted Dunning (tdunning at apache dot org)
> >>
> >> Nominated Mentors
> >> =================
> >> * Ted Dunning (tdunning at apache dot org) – Chief Application Architect at
> >> MapR Technologies, Committer for Lucene, Mahout and ZooKeeper.
> >> * Grant Ingersoll (grant at lucidimagination dot com) – Chief Scientist at
> >> Lucid Imagination, Committer for Lucene, Mahout and other projects.
> >> * Isabel Drost (Isabel at apache dot org) – Software Developer at Nokia
> >> Gate 5 GmbH, Committer for Lucene, Mahout and other projects.
> >>
> >> Sponsoring Entity
> >> =================
> >> Incubator
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > For additional commands, e-mail: general-h...@incubator.apache.org