Re: [PROPOSAL] Drill for the Apache Incubator
Yes, we plan to support joins. We are in the process of setting up the mailing lists.

On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga karthik.tu...@gmail.com wrote:

The proposal looks great. I was wondering what operations Drill will support? For example, the Dremel paper doesn't talk about joins; will Drill support joins? Sorry if I missed it, is there a dev mailing list I could subscribe to?

Cheers, Karthik

On 13 August 2012 23:55, Bernd Fondermann bernd.fonderm...@gmail.com wrote:

great proposal and a very promising mentor lineup. Have fun, Bernd

On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning tdunn...@apache.org wrote:

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background ==

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale =

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers).
Re: [PROPOSAL] Drill for the Apache Incubator
The mailing list request is in infra's hands.

One of the better sources of information about Dremel is the BigQuery documentation. That says that the right side of a join must be 8 MB or smaller and that the only outer join available is a left outer join. What Drill does is somewhat of a different question.

On Thu, Aug 16, 2012 at 12:18 AM, Tomer Shiran tshi...@maprtech.com wrote:

Yes, we plan to support joins. We are in the process of setting up the mailing lists.

On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga karthik.tu...@gmail.com wrote:

The proposal looks great. I was wondering what operations Drill will support? For example, the Dremel paper doesn't talk about joins; will Drill support joins? Sorry if I missed it, is there a dev mailing list I could subscribe to?

Cheers, Karthik

On 13 August 2012 23:55, Bernd Fondermann bernd.fonderm...@gmail.com wrote:

great proposal and a very promising mentor lineup. Have fun, Bernd

On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning tdunn...@apache.org wrote:

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background ==

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4).
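
The join restriction mentioned in the reply above (a small right side, and left outer join as the only outer join) matches the shape of a hash join in which the small side is held in memory. The following is a minimal sketch of that general technique, not Drill's or Dremel's implementation; the function name `left_outer_hash_join` and the dictionary row representation are assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def left_outer_hash_join(left: Iterable[Row], right: List[Row], key: str) -> Iterator[Row]:
    """Left outer join of a streamed left input against a small right input.

    Hypothetical helper: the right side is assumed small enough to fit in
    memory, which is why a size limit on it is a natural restriction.
    """
    # Build a hash table over the small right side, keyed by the join column.
    table: Dict[Any, List[Row]] = {}
    for r in right:
        table.setdefault(r[key], []).append(r)

    # Right-side columns other than the key; these are filled with None
    # when a left row has no match. A right column with the same name as
    # a left column would overwrite it in this simple sketch.
    right_columns = sorted({c for r in right for c in r if c != key})

    # Stream the left side; every left row is emitted at least once.
    for l in left:
        matches = table.get(l[key])
        if matches:
            for r in matches:
                out = dict(l)
                out.update({c: r.get(c) for c in right_columns})
                yield out
        else:
            out = dict(l)
            out.update({c: None for c in right_columns})
            yield out
```

Because only the right side is materialized, the left input can be arbitrarily large and is read once. A right or full outer join would additionally need to track which right rows were never matched, which is harder when the left input is spread over many workers.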
Re: [PROPOSAL] Drill for the Apache Incubator
great proposal and a very promising mentor lineup. Have fun, Bernd

On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning tdunn...@apache.org wrote:

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background ==

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale =

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.

The Drill architecture consists of four key components/layers:

* Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery (https://developers.google.com/bigquery/docs/query-reference), which we call DrQL. However, Drill is designed to support other languages and programming models, such as the Mongo Query Language (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).

* Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.

* Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.

* Scalable data sources: This layer is responsible for supporting various data sources. The initial
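
The distinction the proposal draws between row-based and column-based formats for nested data can be illustrated with a small conversion. This is a simplified sketch only: it flattens nested records into dotted column paths and does not handle repeated fields, so it is not the Dremel column encoding (which uses repetition and definition levels). The function name `to_columns` and the sample records are hypothetical.

```python
from typing import Any, Dict, List


def to_columns(records: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Turn row-oriented nested records into one value list per leaf field.

    Hypothetical helper: nested dictionaries become dotted column names,
    and a field missing from a record is stored as None.
    """

    def walk(prefix: str, value: Any, out: Dict[str, Any]) -> None:
        # Recurse into nested records; store leaf values under their path.
        if isinstance(value, dict):
            for name, child in value.items():
                walk(f"{prefix}.{name}" if prefix else name, child, out)
        else:
            out[prefix] = value

    flat_rows: List[Dict[str, Any]] = []
    for record in records:
        flat: Dict[str, Any] = {}
        walk("", record, flat)
        flat_rows.append(flat)

    # One list per column, aligned by record position.
    names = sorted({name for flat in flat_rows for name in flat})
    return {name: [flat.get(name) for flat in flat_rows] for name in names}


records = [
    {"user": {"id": 1, "geo": {"country": "US"}}, "clicks": 3},
    {"user": {"id": 2}, "clicks": 5},
]
columns = to_columns(records)
total_clicks = sum(columns["clicks"])  # touches only the "clicks" column
```

An aggregate such as `total_clicks` reads a single column and ignores the rest of each record, which is the efficiency argument for column-based processing. A flat CSV row is the special case in which no value is a nested record, so every field name is already a column name.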
Re: [PROPOSAL] Drill for the Apache Incubator
On Aug 8, 2012, at 2:13 PM, Ted Dunning wrote:

It is clear that there are gobs of people with the credentials and track record to be potential contributors, but it is also clear that many of these people have huge demands on their time. That leaves doubt about how much contribution they can or should be making to a new project.

Wow! It's your project, and you can choose how to run this. However, when I do contribute I hope my contributions aren't discouraged because I should not be contributing to a new project because of the demands on my time after I volunteered to.

I don't wish to belabor this or stand in your way, good luck. Hopefully, the project will be encouraging to new contributors.

Arun
Re: [PROPOSAL] Drill for the Apache Incubator
On Thu, Aug 9, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote:

On Aug 8, 2012, at 2:13 PM, Ted Dunning wrote:

It is clear that there are gobs of people with the credentials and track record to be potential contributors, but it is also clear that many of these people have huge demands on their time. That leaves doubt about how much contribution they can or should be making to a new project.

Wow! It's your project, and you can choose how to run this. However, when I do contribute I hope my contributions aren't discouraged because I should not be contributing to a new project because of the demands on my time after I volunteered to.

All contributions will be heartily welcomed. The stance I intend to encourage (and the other mentors and early committers share this intent) is similar to the Mahout project personality ... contributions and contributors are highly welcomed. The only thing that I am pushing for here is a timing detail. Since a vote is ongoing right now, I would like to finish the vote before changing anything. Assuming the vote succeeds (and there is a strong trend that way) then we will do the necessary plumbing to get the project started and be ready for contributions.

I don't wish to belabor this or stand in your way, good luck. Hopefully, the project will be encouraging to new contributors.

It absolutely will be.

- To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org
Re: [PROPOSAL] Drill for the Apache Incubator
On Mon, Aug 6, 2012 at 2:23 PM, Ted Dunning ted.dunn...@gmail.com wrote: No reason at all. Sorry. I may have been unclear. I was requesting that the design docs which are being referenced in the proposal: The requirement and design documents are currently stored in MapR Technologies' source code repository. They will be checked in as part of the initial code dump. be made available for review as part of the proposal, much as an initial source code base would be. There is also a reference to a presentation to-be-made available: High-level slides have been published by MapR: TODO Can those be made public?
Re: [PROPOSAL] Drill for the Apache Incubator
The consensus in the group of committers listed in the proposal is that we would like to discourage piling on of pre-formation committers and encourage adding committers after formation based on contributions. It is clear that there are gobs of people with the credentials and track record to be potential contributors, but it is also clear that many of these people have huge demands on their time. That leaves doubt about how much contribution they can or should be making to a new project. It is also clear that there are gobs of people that are not already part of Apache who may have time and expertise to contribute. In any case, the vote is already started and will be done before long. Let's go with what we are already voting on without changing it in mid-stream and then adjust later. Progress, not perfection, as they say. On Wed, Aug 8, 2012 at 3:31 AM, Bertrand Delacretaz bdelacre...@apache.org wrote: On Wed, Aug 8, 2012 at 7:20 AM, Marvin Humphrey mar...@rectangular.com wrote: On Tue, Aug 7, 2012 at 10:09 PM, Arun C Murthy a...@hortonworks.com wrote: Wasn't clear, can I add myself now? Didn't the Incubator go back to discouraging open enrollment?... AFAIK, no. What was discussed is that incoming podlings should clearly state their requirements for people that want to be added as initial committers, to keep it fair. -Bertrand
Re: [PROPOSAL] Drill for the Apache Incubator
+1 -C

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This is a duplicated attempt at sending this message, please ignore the previous message if it eventually arrives. There appears to be a hangup sending email from my apache email address via gmail.

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background ==

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale =

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4).
Re: [PROPOSAL] Drill for the Apache Incubator
Oops, apologies - thanks for the reminder. I uploaded the slides as an attachment on the wiki page.

Thanks, Tomer

On Wed, Aug 8, 2012 at 9:14 PM, Jakob Homan jgho...@gmail.com wrote:

So, no response to my request above about the design docs and not-TO-DOne MapR presentation?

On Wed, Aug 8, 2012 at 3:25 PM, Chris Douglas cdoug...@apache.org wrote:

+1 -C

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

This is a duplicated attempt at sending this message, please ignore the previous message if it eventually arrives. There appears to be a hangup sending email from my apache email address via gmail.

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background ==

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale =

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers).
Re: [PROPOSAL] Drill for the Apache Incubator
FYI: I have posted the proposal to the wiki and updated it based on the feedback from Marvin and Jakob: http://wiki.apache.org/incubator/DrillProposal On Mon, Aug 6, 2012 at 2:29 PM, Ted Dunning ted.dunn...@gmail.com wrote: In fact, a big part of the motivation for proposing incubation before code is ready is exactly to foster the discussions needed to form community. It is true that many projects that start without the fundamentals face challenges that more mature projects face but that is really just a fact of life with young projects. My own experience includes a project that also started without an initial code drop. Mahout has gone on to have a vibrant welcoming community that has fostered the donation and development of some very valuable software. I expect Drill will be able to say the same thing before long. Sent from my iPhone On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote: Any reason the design docs can't be put up in place of where the source would normally go? On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote: Marvin, thanks for commenting on the proposal! The initial committers have been working on the design for several months, and will commit the design once the project is approved, so we do not expect much friction during the design phase. With that said, we certainly do want to engage others early on, and our goal in incubating earlier is to encourage feedback and contributions when it is still easy to change the APIs and extensibility points. This is important because Drill (unlike, say, Google's Dremel) must be really flexible in order to be relevant to a broad user base, allowing multiple data sources, data formats and query languages. While many projects enter incubation with a complete implementation, others don't, and due to the nature of this project we think that in this case it is better to start earlier. 
Thanks, Tomer On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
RE: [PROPOSAL] Drill for the Apache Incubator
-----Original Message----- From: Marvin Humphrey [mailto:mar...@rectangular.com] Sent: Monday, August 06, 2012 12:25 PM To: general@incubator.apache.org Cc: Grant Ingersoll; Isabel Drost Subject: Re: [PROPOSAL] Drill for the Apache Incubator On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. It seems like there could be flexibility in this requirement, based on a few factors. In this case, a design discussion has been ongoing; but I would also think that any community coming in with enough people who know the Apache way may also not need as much of a solid starting point code wise. That last line in particular seems like something to watch out for. Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
On 07/08/2012 21:14, Franklin, Matthew B. wrote: -----Original Message----- From: Marvin Humphrey [mailto:mar...@rectangular.com] Sent: Monday, August 06, 2012 12:25 PM To: general@incubator.apache.org Cc: Grant Ingersoll; Isabel Drost Subject: Re: [PROPOSAL] Drill for the Apache Incubator On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. It seems like there could be flexibility in this requirement, based on a few factors. In this case, a design discussion has been ongoing; but I would also think that any community coming in with enough people who know the Apache way may also not need as much of a solid starting point code wise. +1. Given the credentials and the experience of proposed committers and mentors, and the fact that the initial design is already done, I don't think this is a serious risk. And it's an exciting proposal with a potentially big impact. -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog Information Retrieval, System Integration. Contact: info at sigram dot com
Re: [PROPOSAL] Drill for the Apache Incubator
I concur with Andrzej. Let's see that VOTE Ted! Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: Andrzej Bialecki a...@getopt.org To: general@incubator.apache.org Sent: Tuesday, August 7, 2012 5:51 PM Subject: Re: [PROPOSAL] Drill for the Apache Incubator On 07/08/2012 21:14, Franklin, Matthew B. wrote: -----Original Message----- From: Marvin Humphrey [mailto:mar...@rectangular.com] Sent: Monday, August 06, 2012 12:25 PM To: general@incubator.apache.org Cc: Grant Ingersoll; Isabel Drost Subject: Re: [PROPOSAL] Drill for the Apache Incubator On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. It seems like there could be flexibility in this requirement, based on a few factors. In this case, a design discussion has been ongoing; but I would also think that any community coming in with enough people who know the Apache way may also not need as much of a solid starting point code wise. +1. Given the credentials and the experience of proposed committers and mentors, and the fact that the initial design is already done, I don't think this is a serious risk. And it's an exciting proposal with a potentially big impact. -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog Information Retrieval, System Integration. Contact: info at sigram dot com
Re: [PROPOSAL] Drill for the Apache Incubator
Just sent that out. Thanks for the encouragement! On Tue, Aug 7, 2012 at 6:02 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote: I concur with Andrzej. Let's see that VOTE Ted!
Re: [PROPOSAL] Drill for the Apache Incubator
Ted, Wasn't clear, can I add myself now? thanks, Arun On Aug 6, 2012, at 9:08 AM, Ted Dunning wrote: Sounds like some good pull. I will call a vote tomorrow. On Mon, Aug 6, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote: Agreed, likewise. I'd love to get involved and would like to add myself whenever you are ready. thanks, Arun On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel ( http://research.google.com/pubs/pub36632.html). This sounds really interesting Ted and I would love to help you. Would it be ok to add myself as one of the initial committers? Thanks, Owen -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/ -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: [PROPOSAL] Drill for the Apache Incubator
On Tue, Aug 7, 2012 at 12:14 PM, Franklin, Matthew B. mfrank...@mitre.org wrote: The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. It seems like there could be flexibility in this requirement, based on a few factors. In this case, a design discussion has been ongoing; but I would also think that any community coming in with enough people who know the Apache way may also not need as much of a solid starting point code wise. In the abstract, I'm a little skeptical about your last point. The inclusive, collaborative emphasis of the Apache Way is effective for evolutionary development of an existing code base, but IMO it's less well suited to the revolutionary act of starting a project. Choosing what *not* to do is really important when you start out, and that's not necessarily our strength. In Drill's case, I think the focus problem is mitigated by the fact that the podling will start with design documents and the Dremel whitepaper rather than a blank slate empty repository. In addition, the other classic problem which afflicts podlings which start with no code -- difficulty refreshing the community with no releases -- seems unlikely to manifest. The proposal looks good to me now. :) Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
On Tue, Aug 7, 2012 at 10:09 PM, Arun C Murthy a...@hortonworks.com wrote: Wasn't clear, can I add myself now? Didn't the Incubator go back to discouraging open enrollment? Is it OK to be invited in based on merit later, or do you feel that due to the nature of this project, it's essential to be in on the ground floor? Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
Agreed, likewise. I'd love to get involved and would like to add myself whenever you are ready. thanks, Arun On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel ( http://research.google.com/pubs/pub36632.html). This sounds really interesting Ted and I would love to help you. Would it be ok to add myself as one of the initial committers? Thanks, Owen -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: [PROPOSAL] Drill for the Apache Incubator
Sounds like some good pull. I will call a vote tomorrow. On Mon, Aug 6, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote: Agreed, likewise. I'd love to get involved and would like to add myself whenever you are ready. thanks, Arun On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel ( http://research.google.com/pubs/pub36632.html). This sounds really interesting Ted and I would love to help you. Would it be ok to add myself as one of the initial committers? Thanks, Owen -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: [PROPOSAL] Drill for the Apache Incubator
On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
Marvin, thanks for commenting on the proposal! The initial committers have been working on the design for several months, and will commit the design once the project is approved, so we do not expect much friction during the design phase. With that said, we certainly do want to engage others early on, and our goal in incubating earlier is to encourage feedback and contributions when it is still easy to change the APIs and extensibility points. This is important because Drill (unlike, say, Google's Dremel) must be really flexible in order to be relevant to a broad user base, allowing multiple data sources, data formats and query languages. While many projects enter incubation with a complete implementation, others don't, and due to the nature of this project we think that in this case it is better to start earlier. Thanks, Tomer On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
Any reason the design docs can't be put up in place of where the source would normally go? On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote: Marvin, thanks for commenting on the proposal! The initial committers have been working on the design for several months, and will commit the design once the project is approved, so we do not expect much friction during the design phase. With that said, we certainly do want to engage others early on, and our goal in incubating earlier is to encourage feedback and contributions when it is still easy to change the APIs and extensibility points. This is important because Drill (unlike, say, Google's Dremel) must be really flexible in order to be relevant to a broad user base, allowing multiple data sources, data formats and query languages. While many projects enter incubation with a complete implementation, others don't, and due to the nature of this project we think that in this case it is better to start earlier. Thanks, Tomer On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
No reason at all. Sent from my iPhone On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote: Any reason the design docs can't be put up in place of where the source would normally go? On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote: Marvin, thanks for commenting on the proposal! The initial committers have been working on the design for several months, and will commit the design once the project is approved, so we do not expect much friction during the design phase. With that said, we certainly do want to engage others early on, and our goal in incubating earlier is to encourage feedback and contributions when it is still easy to change the APIs and extensibility points. This is important because Drill (unlike, say, Google's Dremel) must be really flexible in order to be relevant to a broad user base, allowing multiple data sources, data formats and query languages. While many projects enter incubation with a complete implementation, others don't, and due to the nature of this project we think that in this case it is better to start earlier. Thanks, Tomer On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
Re: [PROPOSAL] Drill for the Apache Incubator
In fact, a big part of the motivation for proposing incubation before code is ready is exactly to foster the discussions needed to form community. It is true that many projects that start without the fundamentals face challenges that more mature projects face but that is really just a fact of life with young projects. My own experience includes a project that also started without an initial code drop. Mahout has gone on to have a vibrant welcoming community that has fostered the donation and development of some very valuable software. I expect Drill will be able to say the same thing before long. Sent from my iPhone On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote: Any reason the design docs can't be put up in place of where the source would normally go? On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote: Marvin, thanks for commenting on the proposal! The initial committers have been working on the design for several months, and will commit the design once the project is approved, so we do not expect much friction during the design phase. With that said, we certainly do want to engage others early on, and our goal in incubating earlier is to encourage feedback and contributions when it is still easy to change the APIs and extensibility points. This is important because Drill (unlike, say, Google's Dremel) must be really flexible in order to be relevant to a broad user base, allowing multiple data sources, data formats and query languages. While many projects enter incubation with a complete implementation, others don't, and due to the nature of this project we think that in this case it is better to start earlier. Thanks, Tomer On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Initial Source == There is no initial source code. All source code will be developed within the Apache Incubator. Coming in without any source code is going to pose a challenge to this podling. http://www.apache.org/foundation/how-it-works.html#incubator The incubator filters projects on the basis of the likeliness of them becoming successful meritocratic communities. The basic requirements for incubation are: * a working codebase -- over the years and after several failures, the foundation came to understand that without an initial working codebase, it is generally hard to bootstrap a community. This is because merit is not well recognized by developers without a working codebase. Also, the friction that is developed during the initial design stage is likely to fragment the community. That last line in particular seems like something to watch out for. Marvin Humphrey
[PROPOSAL] Drill for the Apache Incubator
Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background

Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.

The Drill architecture consists of four key components/layers:

* Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery (https://developers.google.com/bigquery/docs/query-reference), which we call DrQL. However, Drill is designed to support other languages and programming models, such as the Mongo Query Language (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).

* Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.

* Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.

* Scalable data sources: This layer is responsible for supporting various data sources. The initial focus is to leverage Hadoop as a data source.

It is worth noting that no open source project has successfully replicated the capabilities of Dremel, nor have any taken on the broader goals of flexibility (eg, pluggable
Re: [PROPOSAL] Drill for the Apache Incubator
Owen, Sounds great to have additional contributors, but let's get a project approved and rolling and then we can start adding committers. On Fri, Aug 3, 2012 at 11:40 AM, Owen O'Malley omal...@apache.org wrote: On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote: Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel ( http://research.google.com/pubs/pub36632.html). This sounds really interesting Ted and I would love to help you. Would it be ok to add myself as one of the initial committers? Thanks, Owen
[PROPOSAL] Drill for the Apache Incubator
This is a duplicate attempt at sending this message; please ignore the previous message if it eventually arrives. There appears to be a hangup sending email from my Apache email address via Gmail.

Abstract

Drill is a distributed system for interactive analysis of large-scale datasets, inspired by Google’s Dremel (http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google’s Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabytes of data and trillions of records in seconds.

Background

Many organizations need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called “Dremel: Interactive Analysis of Web-Scale Datasets,” describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel.

Rationale

There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google’s internal MapReduce system, is used by thousands of organizations processing large-scale datasets.
Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google’s internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data.

The Drill architecture consists of four key components/layers:

* Query languages: This layer is responsible for parsing the user’s query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and Google BigQuery (https://developers.google.com/bigquery/docs/query-reference), which we call DrQL. However, Drill is designed to support other languages and programming models, such as the Mongo Query Language (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
* Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill’s execution engine is based on research in distributed execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors.
* Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction of Drill is that the execution engine is flexible enough to support column-based processing as well as row-based processing. This is important because column-based processing can be much more efficient when the data is stored in a column-based format, but many large data assets are stored in a row-based format that would require conversion before use.
* Scalable data sources: This layer is responsible for supporting various data sources. The initial focus is to leverage
Re: [PROPOSAL] Drill for the Apache Incubator
Sounds cool! Cheers, Chris