Re: [VOTE] Accept Drill into the Apache Incubator
Guys Any updates on this ? On Sun, Aug 12, 2012 at 12:31 AM, Ted Dunning ted.dunn...@gmail.com wrote: Yes. Saw that. Responded to him privately at the time. Good humor or good typo. On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting cutt...@gmail.com wrote: Otis said his vote was 'blinding', not 'binding'. Doug On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote: This vote is now closed. In the responses to this thread, I count 15 binding positive votes and 4 non-binding votes. The number of positive votes increases to 17 if you count myself (the champion) and Isabel (a mentor) but neither of us actually sent the key email to record a vote (oops). One of the non-binding votes was by Otis Gospadnetic who said that his vote was binding, but I didn't find his name on the list of incubator PMC members, so I counted it as non-binding. The list I used is at http://people.apache.org/committers-by-project.html#incubator-pmc By any count, this vote to admit Drill to incubator therefore passes. This proposal includes mentors so this vote also constitutes acceptance of the mentors by the Incubator PMC. All three of the mentors (Grant, myself, and Isabel) are Apache members. This proposal as approved also includes an initial list of committers, all of whom have ICLA's on file. I will coordinate with the other mentors and the committers to commit the status file and perform other establishment activities necessary to establish Drill as a project under incubation. I expect that this will take several days. I will announce progress on this mailing list to allow people to subscribe to the mailing lists. On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote: +1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used
Re: [VOTE] Accept Drill into the Apache Incubator
Yes. There are updates. SVN is up (we will be switching to git as soon as possible) Mailing lists are up. Send email to drill-dev-subscr...@incubator.apache.org or drill-use-subscr...@incubator.apache.org as desired. The issue tracker is up. See https://issues.apache.org/jira/browse/DRILL We will be sponsoring a hackathon soon in the SF bay area shortly to get a lot of f2f participation for building consensus. Several commercial companies have volunteered paid developers as well. It is an open question how to broaden physical involvement beyond that first meeting, but ad hoc meetings in various cities seem like a nice way to make that happen. Obviously, meat-space interactions will only be a small part of the total project, but it is a good way to build enthusiasm. The project web site is not up yet. It will be shortly. On Wed, Aug 22, 2012 at 7:47 PM, Akash Ashok thehellma...@gmail.com wrote: Guys Any updates on this ?
Re: [VOTE] Accept Drill into the Apache Incubator
This vote is now closed. In the responses to this thread, I count 15 binding positive votes and 4 non-binding votes. The number of positive votes increases to 17 if you count myself (the champion) and Isabel (a mentor) but neither of us actually sent the key email to record a vote (oops). One of the non-binding votes was by Otis Gospadnetic who said that his vote was binding, but I didn't find his name on the list of incubator PMC members, so I counted it as non-binding. The list I used is at http://people.apache.org/committers-by-project.html#incubator-pmc By any count, this vote to admit Drill to incubator therefore passes. This proposal includes mentors so this vote also constitutes acceptance of the mentors by the Incubator PMC. All three of the mentors (Grant, myself, and Isabel) are Apache members. This proposal as approved also includes an initial list of committers, all of whom have ICLA's on file. I will coordinate with the other mentors and the committers to commit the status file and perform other establishment activities necessary to establish Drill as a project under incubation. I expect that this will take several days. I will announce progress on this mailing list to allow people to subscribe to the mailing lists. On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote: +1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the
Re: [VOTE] Accept Drill into the Apache Incubator
Otis said his vote was 'blinding', not 'binding'. Doug On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote: This vote is now closed. In the responses to this thread, I count 15 binding positive votes and 4 non-binding votes. The number of positive votes increases to 17 if you count myself (the champion) and Isabel (a mentor) but neither of us actually sent the key email to record a vote (oops). One of the non-binding votes was by Otis Gospadnetic who said that his vote was binding, but I didn't find his name on the list of incubator PMC members, so I counted it as non-binding. The list I used is at http://people.apache.org/committers-by-project.html#incubator-pmc By any count, this vote to admit Drill to incubator therefore passes. This proposal includes mentors so this vote also constitutes acceptance of the mentors by the Incubator PMC. All three of the mentors (Grant, myself, and Isabel) are Apache members. This proposal as approved also includes an initial list of committers, all of whom have ICLA's on file. I will coordinate with the other mentors and the committers to commit the status file and perform other establishment activities necessary to establish Drill as a project under incubation. I expect that this will take several days. I will announce progress on this mailing list to allow people to subscribe to the mailing lists. On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote: +1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat
Re: [VOTE] Accept Drill into the Apache Incubator
Yes. Saw that. Responded to him privately at the time. Good humor or good typo. On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting cutt...@gmail.com wrote: Otis said his vote was 'blinding', not 'binding'. Doug On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote: This vote is now closed. In the responses to this thread, I count 15 binding positive votes and 4 non-binding votes. The number of positive votes increases to 17 if you count myself (the champion) and Isabel (a mentor) but neither of us actually sent the key email to record a vote (oops). One of the non-binding votes was by Otis Gospadnetic who said that his vote was binding, but I didn't find his name on the list of incubator PMC members, so I counted it as non-binding. The list I used is at http://people.apache.org/committers-by-project.html#incubator-pmc By any count, this vote to admit Drill to incubator therefore passes. This proposal includes mentors so this vote also constitutes acceptance of the mentors by the Incubator PMC. All three of the mentors (Grant, myself, and Isabel) are Apache members. This proposal as approved also includes an initial list of committers, all of whom have ICLA's on file. I will coordinate with the other mentors and the committers to commit the status file and perform other establishment activities necessary to establish Drill as a project under incubation. I expect that this will take several days. I will announce progress on this mailing list to allow people to subscribe to the mailing lists. On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote: +1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill
Re: [VOTE] Accept Drill into the Apache Incubator
+1 Tommaso 2012/8/8 Ted Dunning ted.dunn...@gmail.com I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel,
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (binding) On Wed, Aug 8, 2012 at 8:33 AM, Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov wrote: +1 (binding). Good luck and sounds cool! Cheers, Chris On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support
Re: [VOTE] Accept Drill into the Apache Incubator
On 08/08/2012 04:41, Ted Dunning wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. +1 (binding) - this is an exciting proposal! -- Best regards, Andrzej Bialecki http://www.sigram.com, blog http://www.sigram.com/blog ___.,___,___,___,_._. __ [___||.__|__/|__||\/|: Information Retrieval, System Integration ___|||__||..\|..||..|: Contact: info at sigram dot com - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [VOTE] Accept Drill into the Apache Incubator
On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator... +1 -Bertrand - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [VOTE] Accept Drill into the Apache Incubator
On Wed, Aug 8, 2012 at 11:39 AM, Bertrand Delacretaz bdelacre...@apache.org wrote: On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator... +1 +1 cheers, Torsten - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [VOTE] Accept Drill into the Apache Incubator
On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator +1 (binding) - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (binding) On Wed, Aug 8, 2012 at 3:55 PM, Grant Ingersoll gsing...@apache.org wrote: On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator +1 (binding) - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org -- Thanks - Mohammad Nour Life is like riding a bicycle. To keep your balance you must keep moving - Albert Einstein
Re: [VOTE] Accept Drill into the Apache Incubator
On Tue, Aug 7, 2012 at 9:41 PM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... +1 Phil - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
RE: [VOTE] Accept Drill into the Apache Incubator
+1 (binding) -Original Message- From: Ted Dunning [mailto:ted.dunn...@gmail.com] Sent: Tuesday, August 07, 2012 10:41 PM To: general@incubator.apache.org Subject: [VOTE] Accept Drill into the Apache Incubator I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (blinding) Otis Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm From: Ted Dunning ted.dunn...@gmail.com To: general@incubator.apache.org Sent: Tuesday, August 7, 2012 10:41 PM Subject: [VOTE] Accept Drill into the Apache Incubator I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV
Re: [VOTE] Accept Drill into the Apache Incubator
+1 -C (sorry, wrong thread) On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as
Re: [VOTE] Accept Drill into the Apache Incubator
Hi, On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. [x] +1, bring Drill into Incubator BR, Jukka Zitting - To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org For additional commands, e-mail: general-h...@incubator.apache.org
[VOTE] Accept Drill into the Apache Incubator
I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill is that the execution engine is flexible enough
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (binding) On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongohttp://www.mongodb.org/display/DOCS/Mongo+Query+Language%7CMongoQuery Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (non-binding) On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel,
Re: [VOTE] Accept Drill into the Apache Incubator
+1 (binding) On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote: I would like to call a vote for accepting Drill for incubation in the Apache Incubator. The full proposal is available below. Discussion over the last few days has been quite positive. Please cast your vote: [ ] +1, bring Drill into Incubator [ ] +0, I don't care either way, [ ] -1, do not bring Drill into Incubator, because... This vote will be open for 72 hours and only votes from the Incubator PMC are binding. The start of the vote is just before 3AM UTC on 8 August so the closing time will be 3AM UTC on 11 August. Thank you for your consideration! Ted http://wiki.apache.org/incubator/DrillProposal = Drill = == Abstract == Drill is a distributed system for interactive analysis of large-scale datasets, inspired by [[http://research.google.com/pubs/pub36632.html|Google's Dremel]]. == Proposal == Drill is a distributed system for interactive analysis of large-scale datasets. Drill is similar to Google's Dremel, with the additional flexibility needed to support a broader range of query languages, data formats and data sources. It is designed to efficiently process nested data. It is a design goal to scale to 10,000 servers or more and to be able to process petabyes of data and trillions of records in seconds. == Background == Many organizations have the need to run data-intensive applications, including batch processing, stream processing and interactive analysis. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). In 2010 Google published a paper called Dremel: Interactive Analysis of Web-Scale Datasets, describing a scalable system used internally for interactive analysis of nested data. No open source project has successfully replicated the capabilities of Dremel. == Rationale == There is a strong need in the market for low-latency interactive analysis of large-scale datasets, including nested data (eg, JSON, Avro, Protocol Buffers). This need was identified by Google and addressed internally with a system called Dremel. In recent years open source systems have emerged to address the need for scalable batch processing (Apache Hadoop) and stream processing (Storm, Apache S4). Apache Hadoop, originally inspired by Google's internal MapReduce system, is used by thousands of organizations processing large-scale datasets. Apache Hadoop is designed to achieve very high throughput, but is not designed to achieve the sub-second latency needed for interactive data analysis and exploration. Drill, inspired by Google's internal Dremel system, is intended to address this need. It is worth noting that, as explained by Google in the original paper, Dremel complements MapReduce-based computing. Dremel is not intended as a replacement for MapReduce and is often used in conjunction with it to analyze outputs of MapReduce pipelines or rapidly prototype larger computations. Indeed, Dremel and MapReduce are both used by thousands of Google employees. Like Dremel, Drill supports a nested data model with data encoded in a number of formats such as JSON, Avro or Protocol Buffers. In many organizations nested data is the standard, so supporting a nested data model eliminates the need to normalize the data. With that said, flat data formats, such as CSV files, are naturally supported as a special case of nested data. The Drill architecture consists of four key components/layers: * Query languages: This layer is responsible for parsing the user's query and constructing an execution plan. The initial goal is to support the SQL-like language used by Dremel and [[https://developers.google.com/bigquery/docs/query-reference|Google BigQuery]], which we call DrQL. However, Drill is designed to support other languages and programming models, such as the [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query Language]], [[http://www.cascading.org/|Cascading]] or [[https://github.com/tdunning/Plume|Plume]]. * Low-latency distributed execution engine: This layer is responsible for executing the physical plan. It provides the scalability and fault tolerance needed to efficiently query petabytes of data on 10,000 servers. Drill's execution engine is based on research in distributed execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar storage, and can be extended with additional operators and connectors. * Nested data formats: This layer is responsible for supporting various data formats. The initial goal is to support the column-based format used by Dremel. Drill is designed to support schema-based formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less formats such as JSON, BSON or YAML. In addition, it is designed to support column-based formats such as Dremel, AVRO-806/Trevni and