Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-22 Thread Akash Ashok
Guys, any updates on this?

On Sun, Aug 12, 2012 at 12:31 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 Yes.  Saw that.  Responded to him privately at the time.

 Good humor or good typo.

 On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting cutt...@gmail.com wrote:

  Otis said his vote was 'blinding', not 'binding'.
 
  Doug
  On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 
   This vote is now closed.
  
   In the responses to this thread, I count 15 binding positive votes and
   4 non-binding votes.  The number of positive votes increases to 17 if
   you count myself (the champion) and Isabel (a mentor) but neither of
   us actually sent the key email to record a vote (oops).
  
   One of the non-binding votes was by Otis Gospadnetic who said that his
   vote was binding, but I didn't find his name on the list of incubator
   PMC members, so I counted it as non-binding.  The list I used is at
   http://people.apache.org/committers-by-project.html#incubator-pmc
  
   By any count, this vote to admit Drill to incubator therefore passes.
  
   This proposal includes mentors so this vote also constitutes
   acceptance of the mentors by the Incubator PMC.  All three of the
   mentors (Grant, myself, and Isabel) are Apache members.
  
   This proposal as approved also includes an initial list of committers,
   all of whom have ICLA's on file.
  
   I will coordinate with the other mentors and the committers to commit
   the status file and perform other establishment activities necessary
   to establish Drill as a project under incubation.  I expect that this
   will take several days.  I will announce progress on this mailing list
   to allow people to subscribe to the mailing lists.
  
  
   On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote:
+1 (non-binding)
   
On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote:
I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.
   
Please cast your vote:
   
[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...
   
This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.
   
Thank you for your consideration!
   
Ted
   
http://wiki.apache.org/incubator/DrillProposal
   
= Drill =
   
== Abstract ==
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by
[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
   
== Proposal ==
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google's Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be
able to process petabytes of data and trillions of records in seconds.
   
== Background ==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive
analysis. In recent years open source systems have emerged to address
the need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). In 2010 Google published a paper called
Dremel: Interactive Analysis of Web-Scale Datasets, describing a
scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the
capabilities of Dremel.
   
== Rationale ==
There is a strong need in the market for low-latency interactive
analysis of large-scale datasets, including nested data (eg, JSON,
Avro, Protocol Buffers). This need was identified by Google and
addressed internally with a system called Dremel.
   
In recent years open source systems have emerged to address the need
for scalable batch processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve
very high throughput, but is not designed to achieve the sub-second
latency needed for interactive data analysis and exploration. Drill,
inspired by Google's internal Dremel system, is intended to address
this need.
   
It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended
as a replacement for MapReduce and is often used 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-22 Thread Ted Dunning
Yes.  There are updates.

SVN is up (we will be switching to git as soon as possible)

Mailing lists are up.  Send email to
drill-dev-subscr...@incubator.apache.org or
drill-use-subscr...@incubator.apache.org as desired.

The issue tracker is up.  See https://issues.apache.org/jira/browse/DRILL

We will be sponsoring a hackathon in the SF Bay Area shortly to get a
lot of f2f participation for building consensus.  Several commercial
companies have volunteered paid developers as well.  It is an open question
how to broaden physical involvement beyond that first meeting, but ad hoc
meetings in various cities seem like a nice way to make that happen.
Obviously, meat-space interactions will only be a small part of the total
project, but they are a good way to build enthusiasm.

The project web site is not up yet.  It will be shortly.

On Wed, Aug 22, 2012 at 7:47 PM, Akash Ashok thehellma...@gmail.com wrote:

 Guys, any updates on this?




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Ted Dunning
This vote is now closed.

In the responses to this thread, I count 15 binding positive votes and
4 non-binding votes.  The number of positive votes increases to 17 if
you count myself (the champion) and Isabel (a mentor) but neither of
us actually sent the key email to record a vote (oops).

One of the non-binding votes was by Otis Gospadnetic who said that his
vote was binding, but I didn't find his name on the list of incubator
PMC members, so I counted it as non-binding.  The list I used is at
http://people.apache.org/committers-by-project.html#incubator-pmc

By any count, this vote to admit Drill to incubator therefore passes.

This proposal includes mentors so this vote also constitutes
acceptance of the mentors by the Incubator PMC.  All three of the
mentors (Grant, myself, and Isabel) are Apache members.

This proposal as approved also includes an initial list of committers,
all of whom have ICLA's on file.

I will coordinate with the other mentors and the committers to commit
the status file and perform other establishment activities necessary
to establish Drill as a project under incubation.  I expect that this
will take several days.  I will announce progress on this mailing list
to allow people to subscribe to the mailing lists.
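
For readers who want to see the counting rule in this message spelled out, here is a minimal sketch in Python. It assumes a vote is treated as binding only when the voter appears on the PMC roster, regardless of what the voter claimed, which is how the Otis case above was handled. The `Vote` type, the `tally` function, and the voter names are hypothetical placeholders, not real data from this thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vote:
    voter: str             # placeholder identifier, not a real voter
    value: int             # +1, 0, or -1
    claimed_binding: bool  # what the voter wrote in their email


def tally(votes, pmc_roster):
    """Count +1 votes, deciding "binding" from the roster, not the claim.

    Returns (binding_count, non_binding_count, reclassified_voters), where
    reclassified_voters lists people who claimed a binding vote but are not
    on the roster.
    """
    positive = [v for v in votes if v.value == 1]
    binding = [v for v in positive if v.voter in pmc_roster]
    non_binding = [v for v in positive if v.voter not in pmc_roster]
    reclassified = [v.voter for v in non_binding if v.claimed_binding]
    return len(binding), len(non_binding), reclassified


# Hypothetical example: "carol" claims a binding vote but is not on the roster.
sample_votes = [
    Vote("alice", 1, True),
    Vote("bob", 1, False),
    Vote("carol", 1, True),
]
sample_roster = {"alice"}
print(tally(sample_votes, sample_roster))
```

The sketch only counts; whether a given count is enough for a vote to pass depends on the applicable voting rules, which it does not model.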


On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org wrote:
 +1 (non-binding)

 On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (eg, JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Doug Cutting
Otis said his vote was 'blinding', not 'binding'.

Doug
On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote:

 This vote is now closed.

 In the responses to this thread, I count 15 binding positive votes and
 4 non-binding votes.  The number of positive votes increases to 17 if
 you count myself (the champion) and Isabel (a mentor) but neither of
 us actually sent the key email to record a vote (oops).

 One of the non-binding votes was by Otis Gospadnetic who said that his
 vote was binding, but I didn't find his name on the list of incubator
 PMC members, so I counted it as non-binding.  The list I used is at
 http://people.apache.org/committers-by-project.html#incubator-pmc

 By any count, this vote to admit Drill to incubator therefore passes.

 This proposal includes mentors so this vote also constitutes
 acceptance of the mentors by the Incubator PMC.  All three of the
 mentors (Grant, myself, and Isabel) are Apache members.

 This proposal as approved also includes an initial list of committers,
 all of whom have ICLA's on file.

 I will coordinate with the other mentors and the committers to commit
 the status file and perform other establishment activities necessary
 to establish Drill as a project under incubation.  I expect that this
 will take several days.  I will announce progress on this mailing list
 to allow people to subscribe to the mailing lists.


 On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org
 wrote:
  +1 (non-binding)
 
  On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com
 wrote:
  I would like to call a vote for accepting Drill for incubation in the
  Apache Incubator. The full proposal is available below.  Discussion
  over the last few days has been quite positive.
 
  Please cast your vote:
 
  [ ] +1, bring Drill into Incubator
  [ ] +0, I don't care either way,
  [ ] -1, do not bring Drill into Incubator, because...
 
  This vote will be open for 72 hours and only votes from the Incubator
  PMC are binding.  The start of the vote is just before 3AM UTC on 8
  August so the closing time will be 3AM UTC on 11 August.
 
  Thank you for your consideration!
 
  Ted
 
  http://wiki.apache.org/incubator/DrillProposal
 
  = Drill =
 
  == Abstract ==
  Drill is a distributed system for interactive analysis of large-scale
  datasets, inspired by
  [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
 
  == Proposal ==
  Drill is a distributed system for interactive analysis of large-scale
  datasets. Drill is similar to Google's Dremel, with the additional
  flexibility needed to support a broader range of query languages, data
  formats and data sources. It is designed to efficiently process nested
  data. It is a design goal to scale to 10,000 servers or more and to be
  able to process petabytes of data and trillions of records in seconds.
 
  == Background ==
  Many organizations have the need to run data-intensive applications,
  including batch processing, stream processing and interactive
  analysis. In recent years open source systems have emerged to address
  the need for scalable batch processing (Apache Hadoop) and stream
  processing (Storm, Apache S4). In 2010 Google published a paper called
  Dremel: Interactive Analysis of Web-Scale Datasets, describing a
  scalable system used internally for interactive analysis of nested
  data. No open source project has successfully replicated the
  capabilities of Dremel.
 
  == Rationale ==
  There is a strong need in the market for low-latency interactive
  analysis of large-scale datasets, including nested data (eg, JSON,
  Avro, Protocol Buffers). This need was identified by Google and
  addressed internally with a system called Dremel.
 
  In recent years open source systems have emerged to address the need
  for scalable batch processing (Apache Hadoop) and stream processing
  (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
  internal MapReduce system, is used by thousands of organizations
  processing large-scale datasets. Apache Hadoop is designed to achieve
  very high throughput, but is not designed to achieve the sub-second
  latency needed for interactive data analysis and exploration. Drill,
  inspired by Google's internal Dremel system, is intended to address
  this need.
 
  It is worth noting that, as explained by Google in the original paper,
  Dremel complements MapReduce-based computing. Dremel is not intended
  as a replacement for MapReduce and is often used in conjunction with
  it to analyze outputs of MapReduce pipelines or rapidly prototype
  larger computations. Indeed, Dremel and MapReduce are both used by
  thousands of Google employees.
 
  Like Dremel, Drill supports a nested data model with data encoded in a
  number of formats such as JSON, Avro or Protocol Buffers. In many
  organizations nested data is the standard, so supporting a nested data
  model eliminates the need to normalize the data. With that said, flat
  

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Ted Dunning
Yes.  Saw that.  Responded to him privately at the time.

Good humor or good typo.

On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting cutt...@gmail.com wrote:

 Otis said his vote was 'blinding', not 'binding'.

 Doug
 On Aug 11, 2012 12:28 AM, Ted Dunning ted.dunn...@gmail.com wrote:

  This vote is now closed.
 
  In the responses to this thread, I count 15 binding positive votes and
  4 non-binding votes.  The number of positive votes increases to 17 if
  you count myself (the champion) and Isabel (a mentor) but neither of
  us actually sent the key email to record a vote (oops).
 
  One of the non-binding votes was by Otis Gospadnetic who said that his
  vote was binding, but I didn't find his name on the list of incubator
  PMC members, so I counted it as non-binding.  The list I used is at
  http://people.apache.org/committers-by-project.html#incubator-pmc
 
  By any count, this vote to admit Drill to incubator therefore passes.
 
  This proposal includes mentors so this vote also constitutes
  acceptance of the mentors by the Incubator PMC.  All three of the
  mentors (Grant, myself, and Isabel) are Apache members.
 
  This proposal as approved also includes an initial list of committers,
  all of whom have ICLA's on file.
 
  I will coordinate with the other mentors and the committers to commit
  the status file and perform other establishment activities necessary
  to establish Drill as a project under incubation.  I expect that this
  will take several days.  I will announce progress on this mailing list
  to allow people to subscribe to the mailing lists.
 
 
  On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell apurt...@apache.org
  wrote:
   +1 (non-binding)
  
   On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com
  wrote:
   I would like to call a vote for accepting Drill for incubation in the
   Apache Incubator. The full proposal is available below.  Discussion
   over the last few days has been quite positive.
  
   Please cast your vote:
  
   [ ] +1, bring Drill into Incubator
   [ ] +0, I don't care either way,
   [ ] -1, do not bring Drill into Incubator, because...
  
   This vote will be open for 72 hours and only votes from the Incubator
   PMC are binding.  The start of the vote is just before 3AM UTC on 8
   August so the closing time will be 3AM UTC on 11 August.
  
   Thank you for your consideration!
  
   Ted
  
   http://wiki.apache.org/incubator/DrillProposal
  
   = Drill =
  
   == Abstract ==
   Drill is a distributed system for interactive analysis of large-scale
   datasets, inspired by
   [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
  
   == Proposal ==
   Drill is a distributed system for interactive analysis of large-scale
   datasets. Drill is similar to Google's Dremel, with the additional
   flexibility needed to support a broader range of query languages, data
   formats and data sources. It is designed to efficiently process nested
   data. It is a design goal to scale to 10,000 servers or more and to be
   able to process petabytes of data and trillions of records in seconds.
  
   == Background ==
   Many organizations have the need to run data-intensive applications,
   including batch processing, stream processing and interactive
   analysis. In recent years open source systems have emerged to address
   the need for scalable batch processing (Apache Hadoop) and stream
   processing (Storm, Apache S4). In 2010 Google published a paper called
   Dremel: Interactive Analysis of Web-Scale Datasets, describing a
   scalable system used internally for interactive analysis of nested
   data. No open source project has successfully replicated the
   capabilities of Dremel.
  
   == Rationale ==
   There is a strong need in the market for low-latency interactive
   analysis of large-scale datasets, including nested data (eg, JSON,
   Avro, Protocol Buffers). This need was identified by Google and
   addressed internally with a system called Dremel.
  
   In recent years open source systems have emerged to address the need
   for scalable batch processing (Apache Hadoop) and stream processing
   (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
   internal MapReduce system, is used by thousands of organizations
   processing large-scale datasets. Apache Hadoop is designed to achieve
   very high throughput, but is not designed to achieve the sub-second
   latency needed for interactive data analysis and exploration. Drill,
   inspired by Google's internal Dremel system, is intended to address
   this need.
  
   It is worth noting that, as explained by Google in the original paper,
   Dremel complements MapReduce-based computing. Dremel is not intended
   as a replacement for MapReduce and is often used in conjunction with
   it to analyze outputs of MapReduce pipelines or rapidly prototype
   larger computations. Indeed, Dremel and MapReduce are both used by
   thousands of Google employees.
  
   Like Dremel, Drill 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-09 Thread Tommaso Teofili
+1

Tommaso

2012/8/8 Ted Dunning ted.dunn...@gmail.com

 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.
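
The closing time quoted above follows directly from the 72-hour window. A small Python sketch of that arithmetic is below; it assumes the vote opened at exactly 03:00 UTC on 8 August 2012 (the message says "just before"), and the names `VOTE_OPENS`, `closing_time`, and `is_within_window` are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: rounded to 03:00 UTC; the call for votes says "just before 3AM UTC".
VOTE_OPENS = datetime(2012, 8, 8, 3, 0, tzinfo=timezone.utc)
VOTE_DURATION = timedelta(hours=72)


def closing_time(opens=VOTE_OPENS, duration=VOTE_DURATION):
    """Return the moment the voting window closes."""
    return opens + duration


def is_within_window(cast_at, opens=VOTE_OPENS, duration=VOTE_DURATION):
    """True if a vote timestamp falls inside the half-open window [opens, close)."""
    return opens <= cast_at < closing_time(opens, duration)


print(closing_time().isoformat())
```

Using timezone-aware datetimes avoids ambiguity when voters' mail clients report local times, as the timestamps in this thread do.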

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (eg, JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.
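
To illustrate the point that flat data is a special case of nested data, here is a rough Python sketch. It is not Drill code: it simply shows a CSV row becoming a one-level record, next to a genuinely nested record of the kind JSON would carry. The helper names and the sample values are made up for illustration.

```python
import csv
import io


def flat_row_to_record(header, row):
    """Turn one CSV row into a record: a mapping with no nested fields."""
    return dict(zip(header, row))


def nesting_depth(value):
    """Depth of nested mappings; lists are looked through but add no depth."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return max((nesting_depth(v) for v in value), default=0)
    return 0


# Illustrative CSV input (not from the proposal).
rows = list(csv.reader(io.StringIO("id,name\n1,drill\n")))
flat_record = flat_row_to_record(rows[0], rows[1])

# Illustrative nested record, similar in shape to a JSON document.
nested_record = {"id": 1, "links": {"forward": [2, 3], "backward": []}}

print(flat_record, nesting_depth(flat_record))
print(nested_record, nesting_depth(nested_record))
```

Under this view a query layer that handles records of arbitrary depth handles depth-one records with no extra machinery, which is the sense in which CSV comes "for free".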

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
  * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 designed to support column-based formats such as Dremel,
 AVRO-806/Trevni and RCFile, and row-based 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-09 Thread Andrew Purtell
+1 (non-binding)

On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (eg, JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
  * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 designed to support column-based formats such as Dremel,
 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Alex Karasulu
+1 (binding)

On Wed, Aug 8, 2012 at 8:33 AM, Mattmann, Chris A (388J) 
chris.a.mattm...@jpl.nasa.gov wrote:

 +1 (binding). Good luck and sounds cool!

 Cheers,
 Chris

 On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:

  I would like to call a vote for accepting Drill for incubation in the
  Apache Incubator. The full proposal is available below.  Discussion
  over the last few days has been quite positive.
 
  Please cast your vote:
 
  [ ] +1, bring Drill into Incubator
  [ ] +0, I don't care either way,
  [ ] -1, do not bring Drill into Incubator, because...
 
  This vote will be open for 72 hours and only votes from the Incubator
  PMC are binding.  The start of the vote is just before 3AM UTC on 8
  August so the closing time will be 3AM UTC on 11 August.
 
  Thank you for your consideration!
 
  Ted
 
  http://wiki.apache.org/incubator/DrillProposal
 
  = Drill =
 
  == Abstract ==
  Drill is a distributed system for interactive analysis of large-scale
  datasets, inspired by
  [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
 
  == Proposal ==
  Drill is a distributed system for interactive analysis of large-scale
  datasets. Drill is similar to Google's Dremel, with the additional
  flexibility needed to support a broader range of query languages, data
  formats and data sources. It is designed to efficiently process nested
  data. It is a design goal to scale to 10,000 servers or more and to be
  able to process petabytes of data and trillions of records in seconds.
 
  == Background ==
  Many organizations have the need to run data-intensive applications,
  including batch processing, stream processing and interactive
  analysis. In recent years open source systems have emerged to address
  the need for scalable batch processing (Apache Hadoop) and stream
  processing (Storm, Apache S4). In 2010 Google published a paper called
  Dremel: Interactive Analysis of Web-Scale Datasets, describing a
  scalable system used internally for interactive analysis of nested
  data. No open source project has successfully replicated the
  capabilities of Dremel.
 
  == Rationale ==
  There is a strong need in the market for low-latency interactive
  analysis of large-scale datasets, including nested data (eg, JSON,
  Avro, Protocol Buffers). This need was identified by Google and
  addressed internally with a system called Dremel.
 
  In recent years open source systems have emerged to address the need
  for scalable batch processing (Apache Hadoop) and stream processing
  (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
  internal MapReduce system, is used by thousands of organizations
  processing large-scale datasets. Apache Hadoop is designed to achieve
  very high throughput, but is not designed to achieve the sub-second
  latency needed for interactive data analysis and exploration. Drill,
  inspired by Google's internal Dremel system, is intended to address
  this need.
 
  It is worth noting that, as explained by Google in the original paper,
  Dremel complements MapReduce-based computing. Dremel is not intended
  as a replacement for MapReduce and is often used in conjunction with
  it to analyze outputs of MapReduce pipelines or rapidly prototype
  larger computations. Indeed, Dremel and MapReduce are both used by
  thousands of Google employees.
 
  Like Dremel, Drill supports a nested data model with data encoded in a
  number of formats such as JSON, Avro or Protocol Buffers. In many
  organizations nested data is the standard, so supporting a nested data
  model eliminates the need to normalize the data. With that said, flat
  data formats, such as CSV files, are naturally supported as a special
  case of nested data.
 
  The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
  query and constructing an execution plan.  The initial goal is to
  support the SQL-like language used by Dremel and
  [[https://developers.google.com/bigquery/docs/query-reference|Google
  BigQuery]], which we call DrQL. However, Drill is designed to support
  other languages and programming models, such as the
  [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
  Language]], [[http://www.cascading.org/|Cascading]] or
  [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
  for executing the physical plan. It provides the scalability and fault
  tolerance needed to efficiently query petabytes of data on 10,000
  servers. Drill's execution engine is based on research in distributed
  execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
  columnar storage, and can be extended with additional operators and
  connectors.
  * Nested data formats: This layer is responsible for supporting
  various data formats. The initial goal is to support the column-based
  format used by Dremel. Drill is designed to support 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Andrzej Bialecki

On 08/08/2012 04:41, Ted Dunning wrote:

I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.


+1 (binding) - this is an exciting proposal!

--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
 ___.,___,___,___,_._. __
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com


-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Bertrand Delacretaz
On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator...

+1

-Bertrand




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Torsten Curdt
On Wed, Aug 8, 2012 at 11:39 AM, Bertrand Delacretaz
bdelacre...@apache.org wrote:
 On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator...

 +1

+1

cheers,
Torsten




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Grant Ingersoll

On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote:

 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.
 
 Please cast your vote:
 
 [ ] +1, bring Drill into Incubator

+1 (binding)




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Mohammad Nour El-Din
+1 (binding)

On Wed, Aug 8, 2012 at 3:55 PM, Grant Ingersoll gsing...@apache.org wrote:


 On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote:

  I would like to call a vote for accepting Drill for incubation in the
  Apache Incubator. The full proposal is available below.  Discussion
  over the last few days has been quite positive.
 
  Please cast your vote:
 
  [ ] +1, bring Drill into Incubator

 +1 (binding)





-- 
Thanks
- Mohammad Nour

Life is like riding a bicycle. To keep your balance you must keep moving
- Albert Einstein


Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Phillip Rhodes
On Tue, Aug 7, 2012 at 9:41 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

+1


Phil




RE: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Franklin, Matthew B.
+1 (binding)

-----Original Message-----
From: Ted Dunning [mailto:ted.dunn...@gmail.com]
Sent: Tuesday, August 07, 2012 10:41 PM
To: general@incubator.apache.org
Subject: [VOTE] Accept Drill into the Apache Incubator

I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.

Thank you for your consideration!

Ted

http://wiki.apache.org/incubator/DrillProposal

= Drill =

== Abstract ==
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by
[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

== Proposal ==
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google's Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be
able to process petabytes of data and trillions of records in seconds.

== Background ==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive
analysis. In recent years open source systems have emerged to address
the need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). In 2010 Google published a paper called
Dremel: Interactive Analysis of Web-Scale Datasets, describing a
scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the
capabilities of Dremel.

== Rationale ==
There is a strong need in the market for low-latency interactive
analysis of large-scale datasets, including nested data (eg, JSON,
Avro, Protocol Buffers). This need was identified by Google and
addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need
for scalable batch processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve
very high throughput, but is not designed to achieve the sub-second
latency needed for interactive data analysis and exploration. Drill,
inspired by Google's internal Dremel system, is intended to address
this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended
as a replacement for MapReduce and is often used in conjunction with
it to analyze outputs of MapReduce pipelines or rapidly prototype
larger computations. Indeed, Dremel and MapReduce are both used by
thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat
data formats, such as CSV files, are naturally supported as a special
case of nested data.

The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user's
query and constructing an execution plan.  The initial goal is to
support the SQL-like language used by Dremel and
[[https://developers.google.com/bigquery/docs/query-reference|Google
BigQuery]], which we call DrQL. However, Drill is designed to support
other languages and programming models, such as the
[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
Language]], [[http://www.cascading.org/|Cascading]] or
[[https://github.com/tdunning/Plume|Plume]].
 * Low-latency distributed execution engine: This layer is responsible
for executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000
servers. Drill's execution engine is based on research in distributed
execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
columnar storage, and can be extended with additional operators and
connectors.
 * Nested data formats: This layer is responsible for supporting
various data formats. The initial goal is to support the column-based
format used by Dremel. Drill is designed to support schema-based
formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
and schema-less formats such as JSON, BSON or YAML. In addition, it is
designed to support

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Otis Gospodnetic
+1 (blinding)

Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 From: Ted Dunning ted.dunn...@gmail.com
To: general@incubator.apache.org 
Sent: Tuesday, August 7, 2012 10:41 PM
Subject: [VOTE] Accept Drill into the Apache Incubator
 
I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.

Thank you for your consideration!

Ted

http://wiki.apache.org/incubator/DrillProposal

= Drill =

== Abstract ==
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by
[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

== Proposal ==
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google's Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be
able to process petabytes of data and trillions of records in seconds.

== Background ==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive
analysis. In recent years open source systems have emerged to address
the need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). In 2010 Google published a paper called
Dremel: Interactive Analysis of Web-Scale Datasets, describing a
scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the
capabilities of Dremel.

== Rationale ==
There is a strong need in the market for low-latency interactive
analysis of large-scale datasets, including nested data (eg, JSON,
Avro, Protocol Buffers). This need was identified by Google and
addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need
for scalable batch processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve
very high throughput, but is not designed to achieve the sub-second
latency needed for interactive data analysis and exploration. Drill,
inspired by Google's internal Dremel system, is intended to address
this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended
as a replacement for MapReduce and is often used in conjunction with
it to analyze outputs of MapReduce pipelines or rapidly prototype
larger computations. Indeed, Dremel and MapReduce are both used by
thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat
data formats, such as CSV files, are naturally supported as a special
case of nested data.

The Drill architecture consists of four key components/layers:
* Query languages: This layer is responsible for parsing the user's
query and constructing an execution plan.  The initial goal is to
support the SQL-like language used by Dremel and
[[https://developers.google.com/bigquery/docs/query-reference|Google
BigQuery]], which we call DrQL. However, Drill is designed to support
other languages and programming models, such as the
[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
Language]], [[http://www.cascading.org/|Cascading]] or
[[https://github.com/tdunning/Plume|Plume]].
* Low-latency distributed execution engine: This layer is responsible
for executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000
servers. Drill's execution engine is based on research in distributed
execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
columnar storage, and can be extended with additional operators and
connectors.
* Nested data formats: This layer is responsible for supporting
various data formats. The initial goal is to support the column-based
format used by Dremel. Drill is designed to support schema-based
formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Chris Douglas
+1 -C

(sorry, wrong thread)

On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (eg, JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
  * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 designed to support column-based formats such as 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Jukka Zitting
Hi,

On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

  [x] +1, bring Drill into Incubator

BR,

Jukka Zitting




[VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Ted Dunning
I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.

Thank you for your consideration!

Ted

http://wiki.apache.org/incubator/DrillProposal

= Drill =

== Abstract ==
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by
[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

== Proposal ==
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google's Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be
able to process petabytes of data and trillions of records in seconds.

== Background ==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive
analysis. In recent years open source systems have emerged to address
the need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). In 2010 Google published a paper called
Dremel: Interactive Analysis of Web-Scale Datasets, describing a
scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the
capabilities of Dremel.

== Rationale ==
There is a strong need in the market for low-latency interactive
analysis of large-scale datasets, including nested data (eg, JSON,
Avro, Protocol Buffers). This need was identified by Google and
addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need
for scalable batch processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve
very high throughput, but is not designed to achieve the sub-second
latency needed for interactive data analysis and exploration. Drill,
inspired by Google's internal Dremel system, is intended to address
this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended
as a replacement for MapReduce and is often used in conjunction with
it to analyze outputs of MapReduce pipelines or rapidly prototype
larger computations. Indeed, Dremel and MapReduce are both used by
thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat
data formats, such as CSV files, are naturally supported as a special
case of nested data.
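
As a rough illustration of that last point (an editorial sketch, not Drill
code; the function names and sample data below are hypothetical), a CSV row
can be read as a nested record that simply has depth one, so the same
record-handling logic can treat flat and nested inputs alike:

```python
# Hypothetical illustration only; none of these names come from Drill.
import csv
import io


def flat_rows_as_nested(csv_text):
    """Read CSV rows as records: each row becomes a one-level dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


def max_depth(value):
    """Nesting depth: scalars are 0, each dict or list adds one level."""
    if isinstance(value, dict):
        return 1 + max((max_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((max_depth(v) for v in value), default=0)
    return 0


# A flat CSV input and a genuinely nested record, both handled by max_depth.
flat = flat_rows_as_nested("name,country\nalice,FI\nbob,US\n")
nested = {"name": "carol", "address": {"country": "DE", "phones": ["1", "2"]}}
```

Here every record in flat has depth 1, while nested has depth 3; a reader
written against the nested model needs no special case for the CSV input.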

The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user's
query and constructing an execution plan.  The initial goal is to
support the SQL-like language used by Dremel and
[[https://developers.google.com/bigquery/docs/query-reference|Google
BigQuery]], which we call DrQL. However, Drill is designed to support
other languages and programming models, such as the
[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
Language]], [[http://www.cascading.org/|Cascading]] or
[[https://github.com/tdunning/Plume|Plume]].
 * Low-latency distributed execution engine: This layer is responsible
for executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000
servers. Drill's execution engine is based on research in distributed
execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
columnar storage, and can be extended with additional operators and
connectors.
 * Nested data formats: This layer is responsible for supporting
various data formats. The initial goal is to support the column-based
format used by Dremel. Drill is designed to support schema-based
formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
and schema-less formats such as JSON, BSON or YAML. In addition, it is
designed to support column-based formats such as Dremel,
AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
is that the execution engine is flexible enough 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Scott Deboy
+1 (binding)

On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (eg, JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.
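
 As a rough illustration of that last point (not taken from the
 proposal; the helper name leaf_paths and the sample records are
 made up for this sketch), a flat CSV row can be treated as a nested
 record in which every field path has depth one:

```python
import csv
import io
import json


def leaf_paths(value, prefix=()):
    """Yield (path, leaf) pairs for every scalar in a nested record.

    Hypothetical helper for illustration only; this is not Drill code.
    Repeated fields (lists) share the path of their parent field.
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, prefix + (key,))
    elif isinstance(value, list):
        for child in value:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix, value


# A nested record, as it might arrive in JSON.
nested = json.loads('{"name": "a", "links": {"forward": [20, 40]}}')

# A flat record, as it might arrive in a CSV file.
flat = next(csv.DictReader(io.StringIO("name,url\na,http://x\n")))

nested_paths = list(leaf_paths(nested))
flat_paths = list(leaf_paths(flat))

# The same traversal handles both; the flat row just never nests.
print(nested_paths)
print(flat_paths)
```

 The same traversal covers both inputs, which is the sense in which
 flat data needs no separate handling under a nested data model.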

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
  * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Ashish
+1 (non-binding)

On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning ted.dunn...@gmail.com wrote:
 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.

 Please cast your vote:

 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...

 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.

 Thank you for your consideration!

 Ted

 http://wiki.apache.org/incubator/DrillProposal

 = Drill =

 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.

 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.

 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (e.g., JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.

 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.

 The Drill architecture consists of four key components/layers:
  * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
  * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
  * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 designed to support column-based formats such as Dremel,
 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Devaraj Das
+1 (binding)

On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:

 I would like to call a vote for accepting Drill for incubation in the
 Apache Incubator. The full proposal is available below.  Discussion
 over the last few days has been quite positive.
 
 Please cast your vote:
 
 [ ] +1, bring Drill into Incubator
 [ ] +0, I don't care either way,
 [ ] -1, do not bring Drill into Incubator, because...
 
 This vote will be open for 72 hours and only votes from the Incubator
 PMC are binding.  The start of the vote is just before 3AM UTC on 8
 August so the closing time will be 3AM UTC on 11 August.
 
 Thank you for your consideration!
 
 Ted
 
 http://wiki.apache.org/incubator/DrillProposal
 
 = Drill =
 
 == Abstract ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by
 [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
 
 == Proposal ==
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google's Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be
 able to process petabytes of data and trillions of records in seconds.
 
 == Background ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive
 analysis. In recent years open source systems have emerged to address
 the need for scalable batch processing (Apache Hadoop) and stream
 processing (Storm, Apache S4). In 2010 Google published a paper called
 Dremel: Interactive Analysis of Web-Scale Datasets, describing a
 scalable system used internally for interactive analysis of nested
 data. No open source project has successfully replicated the
 capabilities of Dremel.
 
 == Rationale ==
 There is a strong need in the market for low-latency interactive
 analysis of large-scale datasets, including nested data (e.g., JSON,
 Avro, Protocol Buffers). This need was identified by Google and
 addressed internally with a system called Dremel.
 
 In recent years open source systems have emerged to address the need
 for scalable batch processing (Apache Hadoop) and stream processing
 (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
 internal MapReduce system, is used by thousands of organizations
 processing large-scale datasets. Apache Hadoop is designed to achieve
 very high throughput, but is not designed to achieve the sub-second
 latency needed for interactive data analysis and exploration. Drill,
 inspired by Google's internal Dremel system, is intended to address
 this need.
 
 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended
 as a replacement for MapReduce and is often used in conjunction with
 it to analyze outputs of MapReduce pipelines or rapidly prototype
 larger computations. Indeed, Dremel and MapReduce are both used by
 thousands of Google employees.
 
 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat
 data formats, such as CSV files, are naturally supported as a special
 case of nested data.
 
 The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user's
 query and constructing an execution plan.  The initial goal is to
 support the SQL-like language used by Dremel and
 [[https://developers.google.com/bigquery/docs/query-reference|Google
 BigQuery]], which we call DrQL. However, Drill is designed to support
 other languages and programming models, such as the
 [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
 Language]], [[http://www.cascading.org/|Cascading]] or
 [[https://github.com/tdunning/Plume|Plume]].
 * Low-latency distributed execution engine: This layer is responsible
 for executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000
 servers. Drill's execution engine is based on research in distributed
 execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
 columnar storage, and can be extended with additional operators and
 connectors.
 * Nested data formats: This layer is responsible for supporting
 various data formats. The initial goal is to support the column-based
 format used by Dremel. Drill is designed to support schema-based
 formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
 and schema-less formats such as JSON, BSON or YAML. In addition, it is
 designed to support column-based formats such as Dremel,
 AVRO-806/Trevni and