Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-22 Thread Ted Dunning
Yes.  There are updates.

SVN is up (we will be switching to git as soon as possible).

Mailing lists are up.  Send email to
drill-dev-subscr...@incubator.apache.org or
drill-use-subscr...@incubator.apache.org as desired.

The issue tracker is up.  See https://issues.apache.org/jira/browse/DRILL

We will be sponsoring a hackathon in the SF bay area shortly to get a
lot of f2f participation for building consensus.  Several commercial
companies have volunteered paid developers as well.  It is an open question
how to broaden physical involvement beyond that first meeting, but ad hoc
meetings in various cities seem like a nice way to make that happen.
Obviously, meat-space interactions will only be a small part of the total
project, but they are a good way to build enthusiasm.

The project web site is not up yet.  It will be shortly.

On Wed, Aug 22, 2012 at 7:47 PM, Akash Ashok  wrote:

> Guys, any updates on this?
>
>


Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-22 Thread Akash Ashok
Guys, any updates on this?

On Sun, Aug 12, 2012 at 12:31 AM, Ted Dunning  wrote:

> Yes.  Saw that.  Responded to him privately at the time.
>
> Good humor or good typo.
>
> On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting  wrote:
>
> > Otis said his vote was 'blinding', not 'binding'.
> >
> > Doug
> > On Aug 11, 2012 12:28 AM, "Ted Dunning"  wrote:
> >
> > > This vote is now closed.
> > >
> > > In the responses to this thread, I count 15 binding positive votes and
> > > 4 non-binding votes.  The number of positive votes increases to 17 if
> > > you count myself (the champion) and Isabel (a mentor) but neither of
> > > us actually sent the key email to record a vote (oops).
> > >
> > > > One of the non-binding votes was by Otis Gospodnetic who said that his
> > > vote was binding, but I didn't find his name on the list of incubator
> > > PMC members, so I counted it as non-binding.  The list I used is at
> > > http://people.apache.org/committers-by-project.html#incubator-pmc
> > >
> > > By any count, this vote to admit Drill to incubator therefore passes.
> > >
> > > This proposal includes mentors so this vote also constitutes
> > > acceptance of the mentors by the Incubator PMC.  All three of the
> > > mentors (Grant, myself, and Isabel) are Apache members.
> > >
> > > This proposal as approved also includes an initial list of committers,
> > > > all of whom have ICLAs on file.
> > >
> > > I will coordinate with the other mentors and the committers to commit
> > > the status file and perform other establishment activities necessary
> > > to establish Drill as a project under incubation.  I expect that this
> > > will take several days.  I will announce progress on this mailing list
> > > to allow people to subscribe to the mailing lists.
> > >
> > >
> > > > On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell wrote:
> > > > +1 (non-binding)
> > > >
> > > > > On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning wrote:
> > > > >> I would like to call a vote for accepting Drill for incubation in the
> > > >> Apache Incubator. The full proposal is available below.  Discussion
> > > >> over the last few days has been quite positive.
> > > >>
> > > >> Please cast your vote:
> > > >>
> > > >> [ ] +1, bring Drill into Incubator
> > > >> [ ] +0, I don't care either way,
> > > >> [ ] -1, do not bring Drill into Incubator, because...
> > > >>
> > > > >> This vote will be open for 72 hours and only votes from the Incubator
> > > >> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> > > >> August so the closing time will be 3AM UTC on 11 August.
> > > >>
> > > >> Thank you for your consideration!
> > > >>
> > > >> Ted
> > > >>
> > > >> http://wiki.apache.org/incubator/DrillProposal
> > > >>
> > > >> = Drill =
> > > >>
> > > >> == Abstract ==
> > > > >> Drill is a distributed system for interactive analysis of large-scale
> > > >> datasets, inspired by
> > > >> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> > > >>
> > > >> == Proposal ==
> > > > >> Drill is a distributed system for interactive analysis of large-scale
> > > >> datasets. Drill is similar to Google's Dremel, with the additional
> > > > >> flexibility needed to support a broader range of query languages, data
> > > > >> formats and data sources. It is designed to efficiently process nested
> > > > >> data. It is a design goal to scale to 10,000 servers or more and to be
> > > > >> able to process petabytes of data and trillions of records in seconds.
> > > >>
> > > >> == Background ==
> > > >> Many organizations have the need to run data-intensive applications,
> > > >> including batch processing, stream processing and interactive
> > > > >> analysis. In recent years open source systems have emerged to address
> > > >> the need for scalable batch processing (Apache Hadoop) and stream
> > > > >> processing (Storm, Apache S4). In 2010 Google published a paper called
> > > >> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > > >> scalable system used internally for interactive analysis of nested
> > > >> data. No open source project has successfully replicated the
> > > >> capabilities of Dremel.
> > > >>
> > > >> == Rationale ==
> > > >> There is a strong need in the market for low-latency interactive
> > > > >> analysis of large-scale datasets, including nested data (e.g., JSON,
> > > >> Avro, Protocol Buffers). This need was identified by Google and
> > > >> addressed internally with a system called Dremel.
> > > >>
> > > >> In recent years open source systems have emerged to address the need
> > > >> for scalable batch processing (Apache Hadoop) and stream processing
> > > >> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> > > >> internal MapReduce system, is used by thousands of organizations
> > > > >> processing large-scale datasets. Apache Hadoop is designed to achieve
> > > >> very high throughput, but is not designed to achieve the sub-second
> > > >> latency needed for interactive

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Ted Dunning
Yes.  Saw that.  Responded to him privately at the time.

Good humor or good typo.

On Sat, Aug 11, 2012 at 7:29 AM, Doug Cutting  wrote:

> Otis said his vote was 'blinding', not 'binding'.
>
> Doug
> On Aug 11, 2012 12:28 AM, "Ted Dunning"  wrote:
>
> > This vote is now closed.
> >
> > In the responses to this thread, I count 15 binding positive votes and
> > 4 non-binding votes.  The number of positive votes increases to 17 if
> > you count myself (the champion) and Isabel (a mentor) but neither of
> > us actually sent the key email to record a vote (oops).
> >
> > One of the non-binding votes was by Otis Gospodnetic who said that his
> > vote was binding, but I didn't find his name on the list of incubator
> > PMC members, so I counted it as non-binding.  The list I used is at
> > http://people.apache.org/committers-by-project.html#incubator-pmc
> >
> > By any count, this vote to admit Drill to incubator therefore passes.
> >
> > This proposal includes mentors so this vote also constitutes
> > acceptance of the mentors by the Incubator PMC.  All three of the
> > mentors (Grant, myself, and Isabel) are Apache members.
> >
> > This proposal as approved also includes an initial list of committers,
> > all of whom have ICLAs on file.
> >
> > I will coordinate with the other mentors and the committers to commit
> > the status file and perform other establishment activities necessary
> > to establish Drill as a project under incubation.  I expect that this
> > will take several days.  I will announce progress on this mailing list
> > to allow people to subscribe to the mailing lists.
> >
> >
> > On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell wrote:
> > > +1 (non-binding)
> > >
> > > On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning wrote:
> > >> I would like to call a vote for accepting Drill for incubation in the
> > >> Apache Incubator. The full proposal is available below.  Discussion
> > >> over the last few days has been quite positive.
> > >>
> > >> Please cast your vote:
> > >>
> > >> [ ] +1, bring Drill into Incubator
> > >> [ ] +0, I don't care either way,
> > >> [ ] -1, do not bring Drill into Incubator, because...
> > >>
> > >> This vote will be open for 72 hours and only votes from the Incubator
> > >> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> > >> August so the closing time will be 3AM UTC on 11 August.
> > >>
> > >> Thank you for your consideration!
> > >>
> > >> Ted
> > >>
> > >> http://wiki.apache.org/incubator/DrillProposal
> > >>
> > >> = Drill =
> > >>
> > >> == Abstract ==
> > >> Drill is a distributed system for interactive analysis of large-scale
> > >> datasets, inspired by
> > >> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> > >>
> > >> == Proposal ==
> > >> Drill is a distributed system for interactive analysis of large-scale
> > >> datasets. Drill is similar to Google's Dremel, with the additional
> > >> flexibility needed to support a broader range of query languages, data
> > >> formats and data sources. It is designed to efficiently process nested
> > >> data. It is a design goal to scale to 10,000 servers or more and to be
> > >> able to process petabytes of data and trillions of records in seconds.
> > >>
> > >> == Background ==
> > >> Many organizations have the need to run data-intensive applications,
> > >> including batch processing, stream processing and interactive
> > >> analysis. In recent years open source systems have emerged to address
> > >> the need for scalable batch processing (Apache Hadoop) and stream
> > >> processing (Storm, Apache S4). In 2010 Google published a paper called
> > >> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > >> scalable system used internally for interactive analysis of nested
> > >> data. No open source project has successfully replicated the
> > >> capabilities of Dremel.
> > >>
> > >> == Rationale ==
> > >> There is a strong need in the market for low-latency interactive
> > >> analysis of large-scale datasets, including nested data (e.g., JSON,
> > >> Avro, Protocol Buffers). This need was identified by Google and
> > >> addressed internally with a system called Dremel.
> > >>
> > >> In recent years open source systems have emerged to address the need
> > >> for scalable batch processing (Apache Hadoop) and stream processing
> > >> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> > >> internal MapReduce system, is used by thousands of organizations
> > >> processing large-scale datasets. Apache Hadoop is designed to achieve
> > >> very high throughput, but is not designed to achieve the sub-second
> > >> latency needed for interactive data analysis and exploration. Drill,
> > >> inspired by Google's internal Dremel system, is intended to address
> > >> this need.
> > >>
> > >> It is worth noting that, as explained by Google in the original paper,
> > >> Dremel complements MapReduce-based computing. Dremel is not intended
> > >> as a repla

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Doug Cutting
Otis said his vote was 'blinding', not 'binding'.

Doug
On Aug 11, 2012 12:28 AM, "Ted Dunning"  wrote:

> This vote is now closed.
>
> In the responses to this thread, I count 15 binding positive votes and
> 4 non-binding votes.  The number of positive votes increases to 17 if
> you count myself (the champion) and Isabel (a mentor) but neither of
> us actually sent the key email to record a vote (oops).
>
> One of the non-binding votes was by Otis Gospodnetic who said that his
> vote was binding, but I didn't find his name on the list of incubator
> PMC members, so I counted it as non-binding.  The list I used is at
> http://people.apache.org/committers-by-project.html#incubator-pmc
>
> By any count, this vote to admit Drill to incubator therefore passes.
>
> This proposal includes mentors so this vote also constitutes
> acceptance of the mentors by the Incubator PMC.  All three of the
> mentors (Grant, myself, and Isabel) are Apache members.
>
> This proposal as approved also includes an initial list of committers,
> all of whom have ICLAs on file.
>
> I will coordinate with the other mentors and the committers to commit
> the status file and perform other establishment activities necessary
> to establish Drill as a project under incubation.  I expect that this
> will take several days.  I will announce progress on this mailing list
> to allow people to subscribe to the mailing lists.
>
>
> On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell wrote:
> > +1 (non-binding)
> >
> > On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning wrote:
> >> I would like to call a vote for accepting Drill for incubation in the
> >> Apache Incubator. The full proposal is available below.  Discussion
> >> over the last few days has been quite positive.
> >>
> >> Please cast your vote:
> >>
> >> [ ] +1, bring Drill into Incubator
> >> [ ] +0, I don't care either way,
> >> [ ] -1, do not bring Drill into Incubator, because...
> >>
> >> This vote will be open for 72 hours and only votes from the Incubator
> >> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> >> August so the closing time will be 3AM UTC on 11 August.
> >>
> >> Thank you for your consideration!
> >>
> >> Ted
> >>
> >> http://wiki.apache.org/incubator/DrillProposal
> >>
> >> = Drill =
> >>
> >> == Abstract ==
> >> Drill is a distributed system for interactive analysis of large-scale
> >> datasets, inspired by
> >> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> >>
> >> == Proposal ==
> >> Drill is a distributed system for interactive analysis of large-scale
> >> datasets. Drill is similar to Google's Dremel, with the additional
> >> flexibility needed to support a broader range of query languages, data
> >> formats and data sources. It is designed to efficiently process nested
> >> data. It is a design goal to scale to 10,000 servers or more and to be
> >> able to process petabytes of data and trillions of records in seconds.
> >>
> >> == Background ==
> >> Many organizations have the need to run data-intensive applications,
> >> including batch processing, stream processing and interactive
> >> analysis. In recent years open source systems have emerged to address
> >> the need for scalable batch processing (Apache Hadoop) and stream
> >> processing (Storm, Apache S4). In 2010 Google published a paper called
> >> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> >> scalable system used internally for interactive analysis of nested
> >> data. No open source project has successfully replicated the
> >> capabilities of Dremel.
> >>
> >> == Rationale ==
> >> There is a strong need in the market for low-latency interactive
> >> analysis of large-scale datasets, including nested data (e.g., JSON,
> >> Avro, Protocol Buffers). This need was identified by Google and
> >> addressed internally with a system called Dremel.
> >>
> >> In recent years open source systems have emerged to address the need
> >> for scalable batch processing (Apache Hadoop) and stream processing
> >> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> >> internal MapReduce system, is used by thousands of organizations
> >> processing large-scale datasets. Apache Hadoop is designed to achieve
> >> very high throughput, but is not designed to achieve the sub-second
> >> latency needed for interactive data analysis and exploration. Drill,
> >> inspired by Google's internal Dremel system, is intended to address
> >> this need.
> >>
> >> It is worth noting that, as explained by Google in the original paper,
> >> Dremel complements MapReduce-based computing. Dremel is not intended
> >> as a replacement for MapReduce and is often used in conjunction with
> >> it to analyze outputs of MapReduce pipelines or rapidly prototype
> >> larger computations. Indeed, Dremel and MapReduce are both used by
> >> thousands of Google employees.
> >>
> >> Like Dremel, Drill supports a nested data model with data encoded in a
> >> number of formats suc

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-11 Thread Ted Dunning
This vote is now closed.

In the responses to this thread, I count 15 binding positive votes and
4 non-binding votes.  The number of positive votes increases to 17 if
you count myself (the champion) and Isabel (a mentor) but neither of
us actually sent the key email to record a vote (oops).

One of the non-binding votes was by Otis Gospodnetic who said that his
vote was binding, but I didn't find his name on the list of incubator
PMC members, so I counted it as non-binding.  The list I used is at
http://people.apache.org/committers-by-project.html#incubator-pmc

By any count, this vote to admit Drill to incubator therefore passes.

This proposal includes mentors so this vote also constitutes
acceptance of the mentors by the Incubator PMC.  All three of the
mentors (Grant, myself, and Isabel) are Apache members.

This proposal as approved also includes an initial list of committers,
all of whom have ICLAs on file.

I will coordinate with the other mentors and the committers to commit
the status file and perform other establishment activities necessary
to establish Drill as a project under incubation.  I expect that this
will take several days.  I will announce progress on this mailing list
to allow people to subscribe to the mailing lists.


On Thu, Aug 9, 2012 at 11:27 AM, Andrew Purtell  wrote:
> +1 (non-binding)
>
> On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning  wrote:
>> I would like to call a vote for accepting Drill for incubation in the
>> Apache Incubator. The full proposal is available below.  Discussion
>> over the last few days has been quite positive.
>>
>> Please cast your vote:
>>
>> [ ] +1, bring Drill into Incubator
>> [ ] +0, I don't care either way,
>> [ ] -1, do not bring Drill into Incubator, because...
>>
>> This vote will be open for 72 hours and only votes from the Incubator
>> PMC are binding.  The start of the vote is just before 3AM UTC on 8
>> August so the closing time will be 3AM UTC on 11 August.
>>
>> Thank you for your consideration!
>>
>> Ted
>>
>> http://wiki.apache.org/incubator/DrillProposal
>>
>> = Drill =
>>
>> == Abstract ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets, inspired by
>> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>
>> == Proposal ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets. Drill is similar to Google's Dremel, with the additional
>> flexibility needed to support a broader range of query languages, data
>> formats and data sources. It is designed to efficiently process nested
>> data. It is a design goal to scale to 10,000 servers or more and to be
>> able to process petabytes of data and trillions of records in seconds.
>>
>> == Background ==
>> Many organizations have the need to run data-intensive applications,
>> including batch processing, stream processing and interactive
>> analysis. In recent years open source systems have emerged to address
>> the need for scalable batch processing (Apache Hadoop) and stream
>> processing (Storm, Apache S4). In 2010 Google published a paper called
>> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
>> scalable system used internally for interactive analysis of nested
>> data. No open source project has successfully replicated the
>> capabilities of Dremel.
>>
>> == Rationale ==
>> There is a strong need in the market for low-latency interactive
>> analysis of large-scale datasets, including nested data (e.g., JSON,
>> Avro, Protocol Buffers). This need was identified by Google and
>> addressed internally with a system called Dremel.
>>
>> In recent years open source systems have emerged to address the need
>> for scalable batch processing (Apache Hadoop) and stream processing
>> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
>> internal MapReduce system, is used by thousands of organizations
>> processing large-scale datasets. Apache Hadoop is designed to achieve
>> very high throughput, but is not designed to achieve the sub-second
>> latency needed for interactive data analysis and exploration. Drill,
>> inspired by Google's internal Dremel system, is intended to address
>> this need.
>>
>> It is worth noting that, as explained by Google in the original paper,
>> Dremel complements MapReduce-based computing. Dremel is not intended
>> as a replacement for MapReduce and is often used in conjunction with
>> it to analyze outputs of MapReduce pipelines or rapidly prototype
>> larger computations. Indeed, Dremel and MapReduce are both used by
>> thousands of Google employees.
>>
>> Like Dremel, Drill supports a nested data model with data encoded in a
>> number of formats such as JSON, Avro or Protocol Buffers. In many
>> organizations nested data is the standard, so supporting a nested data
>> model eliminates the need to normalize the data. With that said, flat
>> data formats, such as CSV files, are naturally supported as a special
>> case of nested data.
>>
>> The Drill ar

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-09 Thread Andrew Purtell
+1 (non-binding)

On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
>
> Thank you for your consideration!
>
> Ted
>
> http://wiki.apache.org/incubator/DrillProposal
>
> = Drill =
>
> == Abstract ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by
> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>
> == Proposal ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google's Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be
> able to process petabytes of data and trillions of records in seconds.
>
> == Background ==
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive
> analysis. In recent years open source systems have emerged to address
> the need for scalable batch processing (Apache Hadoop) and stream
> processing (Storm, Apache S4). In 2010 Google published a paper called
> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> scalable system used internally for interactive analysis of nested
> data. No open source project has successfully replicated the
> capabilities of Dremel.
>
> == Rationale ==
> There is a strong need in the market for low-latency interactive
> analysis of large-scale datasets, including nested data (e.g., JSON,
> Avro, Protocol Buffers). This need was identified by Google and
> addressed internally with a system called Dremel.
>
> In recent years open source systems have emerged to address the need
> for scalable batch processing (Apache Hadoop) and stream processing
> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> internal MapReduce system, is used by thousands of organizations
> processing large-scale datasets. Apache Hadoop is designed to achieve
> very high throughput, but is not designed to achieve the sub-second
> latency needed for interactive data analysis and exploration. Drill,
> inspired by Google's internal Dremel system, is intended to address
> this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended
> as a replacement for MapReduce and is often used in conjunction with
> it to analyze outputs of MapReduce pipelines or rapidly prototype
> larger computations. Indeed, Dremel and MapReduce are both used by
> thousands of Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat
> data formats, such as CSV files, are naturally supported as a special
> case of nested data.
>
> The Drill architecture consists of four key components/layers:
>  * Query languages: This layer is responsible for parsing the user's
> query and constructing an execution plan.  The initial goal is to
> support the SQL-like language used by Dremel and
> [[https://developers.google.com/bigquery/docs/query-reference|Google
> BigQuery]], which we call DrQL. However, Drill is designed to support
> other languages and programming models, such as the
> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> Language]], [[http://www.cascading.org/|Cascading]] or
> [[https://github.com/tdunning/Plume|Plume]].
>  * Low-latency distributed execution engine: This layer is responsible
> for executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000
> servers. Drill's execution engine is based on research in distributed
> execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> columnar storage, and can be extended with additional operators and
> connectors.
>  * Nested data formats: This layer is responsible for supporting
> various data formats. The initial goal is to support the column-based
> format used by Dremel. Drill is designed to support schema-based
> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> and schema-less formats such as JSON, BSON or YAML. In additi

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-09 Thread Jakob Homan
+1 (binding)

On Thu, Aug 9, 2012 at 1:05 AM, Tommaso Teofili wrote:
> +1
>
> Tommaso
>
> 2012/8/8 Ted Dunning 
>
>> I would like to call a vote for accepting Drill for incubation in the
>> Apache Incubator. The full proposal is available below.  Discussion
>> over the last few days has been quite positive.
>>
>> Please cast your vote:
>>
>> [ ] +1, bring Drill into Incubator
>> [ ] +0, I don't care either way,
>> [ ] -1, do not bring Drill into Incubator, because...
>>
>> This vote will be open for 72 hours and only votes from the Incubator
>> PMC are binding.  The start of the vote is just before 3AM UTC on 8
>> August so the closing time will be 3AM UTC on 11 August.
>>
>> Thank you for your consideration!
>>
>> Ted
>>
>> http://wiki.apache.org/incubator/DrillProposal
>>
>> = Drill =
>>
>> == Abstract ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets, inspired by
>> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>>
>> == Proposal ==
>> Drill is a distributed system for interactive analysis of large-scale
>> datasets. Drill is similar to Google's Dremel, with the additional
>> flexibility needed to support a broader range of query languages, data
>> formats and data sources. It is designed to efficiently process nested
>> data. It is a design goal to scale to 10,000 servers or more and to be
>> able to process petabytes of data and trillions of records in seconds.
>>
>> == Background ==
>> Many organizations have the need to run data-intensive applications,
>> including batch processing, stream processing and interactive
>> analysis. In recent years open source systems have emerged to address
>> the need for scalable batch processing (Apache Hadoop) and stream
>> processing (Storm, Apache S4). In 2010 Google published a paper called
>> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
>> scalable system used internally for interactive analysis of nested
>> data. No open source project has successfully replicated the
>> capabilities of Dremel.
>>
>> == Rationale ==
>> There is a strong need in the market for low-latency interactive
>> analysis of large-scale datasets, including nested data (e.g., JSON,
>> Avro, Protocol Buffers). This need was identified by Google and
>> addressed internally with a system called Dremel.
>>
>> In recent years open source systems have emerged to address the need
>> for scalable batch processing (Apache Hadoop) and stream processing
>> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
>> internal MapReduce system, is used by thousands of organizations
>> processing large-scale datasets. Apache Hadoop is designed to achieve
>> very high throughput, but is not designed to achieve the sub-second
>> latency needed for interactive data analysis and exploration. Drill,
>> inspired by Google's internal Dremel system, is intended to address
>> this need.
>>
>> It is worth noting that, as explained by Google in the original paper,
>> Dremel complements MapReduce-based computing. Dremel is not intended
>> as a replacement for MapReduce and is often used in conjunction with
>> it to analyze outputs of MapReduce pipelines or rapidly prototype
>> larger computations. Indeed, Dremel and MapReduce are both used by
>> thousands of Google employees.
>>
>> Like Dremel, Drill supports a nested data model with data encoded in a
>> number of formats such as JSON, Avro or Protocol Buffers. In many
>> organizations nested data is the standard, so supporting a nested data
>> model eliminates the need to normalize the data. With that said, flat
>> data formats, such as CSV files, are naturally supported as a special
>> case of nested data.
>>
>> The Drill architecture consists of four key components/layers:
>>  * Query languages: This layer is responsible for parsing the user's
>> query and constructing an execution plan.  The initial goal is to
>> support the SQL-like language used by Dremel and
>> [[https://developers.google.com/bigquery/docs/query-reference|Google
>> BigQuery]], which we call DrQL. However, Drill is designed to support
>> other languages and programming models, such as the
>> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
>> Language]], [[http://www.cascading.org/|Cascading]] or
>> [[https://github.com/tdunning/Plume|Plume]].
>>  * Low-latency distributed execution engine: This layer is responsible
>> for executing the physical plan. It provides the scalability and fault
>> tolerance needed to efficiently query petabytes of data on 10,000
>> servers. Drill's execution engine is based on research in distributed
>> execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
>> columnar storage, and can be extended with additional operators and
>> connectors.
>>  * Nested data formats: This layer is responsible for supporting
>> various data formats. The initial goal is to support the column-based
>> format used by Dremel. Drill is designed to support schema-b

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-09 Thread Tommaso Teofili
+1

Tommaso

2012/8/8 Ted Dunning 

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
>
> Thank you for your consideration!
>
> Ted
>
> http://wiki.apache.org/incubator/DrillProposal
>
> = Drill =
>
> == Abstract ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by
> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>
> == Proposal ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google's Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be
> able to process petabytes of data and trillions of records in seconds.
>
> == Background ==
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive
> analysis. In recent years open source systems have emerged to address
> the need for scalable batch processing (Apache Hadoop) and stream
> processing (Storm, Apache S4). In 2010 Google published a paper called
> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> scalable system used internally for interactive analysis of nested
> data. No open source project has successfully replicated the
> capabilities of Dremel.
>
> == Rationale ==
> There is a strong need in the market for low-latency interactive
> analysis of large-scale datasets, including nested data (eg, JSON,
> Avro, Protocol Buffers). This need was identified by Google and
> addressed internally with a system called Dremel.
>
> In recent years open source systems have emerged to address the need
> for scalable batch processing (Apache Hadoop) and stream processing
> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> internal MapReduce system, is used by thousands of organizations
> processing large-scale datasets. Apache Hadoop is designed to achieve
> very high throughput, but is not designed to achieve the sub-second
> latency needed for interactive data analysis and exploration. Drill,
> inspired by Google's internal Dremel system, is intended to address
> this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended
> as a replacement for MapReduce and is often used in conjunction with
> it to analyze outputs of MapReduce pipelines or rapidly prototype
> larger computations. Indeed, Dremel and MapReduce are both used by
> thousands of Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat
> data formats, such as CSV files, are naturally supported as a special
> case of nested data.
>
> The Drill architecture consists of four key components/layers:
>  * Query languages: This layer is responsible for parsing the user's
> query and constructing an execution plan.  The initial goal is to
> support the SQL-like language used by Dremel and
> [[https://developers.google.com/bigquery/docs/query-reference|Google
> BigQuery]], which we call DrQL. However, Drill is designed to support
> other languages and programming models, such as the
> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> Language]], [[http://www.cascading.org/|Cascading]] or
> [[https://github.com/tdunning/Plume|Plume]].
>  * Low-latency distributed execution engine: This layer is responsible
> for executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000
> servers. Drill's execution engine is based on research in distributed
> execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> columnar storage, and can be extended with additional operators and
> connectors.
>  * Nested data formats: This layer is responsible for supporting
> various data formats. The initial goal is to support the column-based
> format used by Dremel. Drill is designed to support schema-based
> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> and schema-less formats such as JSON, BSON or YAML. In addition, it is
> designed to support co

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Jukka Zitting
Hi,

On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.

  [x] +1, bring Drill into Incubator

BR,

Jukka Zitting

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Chris Douglas
+1 -C

(sorry, wrong thread)

On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
>
> Thank you for your consideration!
>
> Ted
>
> http://wiki.apache.org/incubator/DrillProposal
>
> = Drill =
>
> == Abstract ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by
> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>
> == Proposal ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google's Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be
> able to process petabytes of data and trillions of records in seconds.
>
> == Background ==
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive
> analysis. In recent years open source systems have emerged to address
> the need for scalable batch processing (Apache Hadoop) and stream
> processing (Storm, Apache S4). In 2010 Google published a paper called
> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> scalable system used internally for interactive analysis of nested
> data. No open source project has successfully replicated the
> capabilities of Dremel.
>
> == Rationale ==
> There is a strong need in the market for low-latency interactive
> analysis of large-scale datasets, including nested data (eg, JSON,
> Avro, Protocol Buffers). This need was identified by Google and
> addressed internally with a system called Dremel.
>
> In recent years open source systems have emerged to address the need
> for scalable batch processing (Apache Hadoop) and stream processing
> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> internal MapReduce system, is used by thousands of organizations
> processing large-scale datasets. Apache Hadoop is designed to achieve
> very high throughput, but is not designed to achieve the sub-second
> latency needed for interactive data analysis and exploration. Drill,
> inspired by Google's internal Dremel system, is intended to address
> this need.
>
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended
> as a replacement for MapReduce and is often used in conjunction with
> it to analyze outputs of MapReduce pipelines or rapidly prototype
> larger computations. Indeed, Dremel and MapReduce are both used by
> thousands of Google employees.
>
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat
> data formats, such as CSV files, are naturally supported as a special
> case of nested data.
>
> The Drill architecture consists of four key components/layers:
>  * Query languages: This layer is responsible for parsing the user's
> query and constructing an execution plan.  The initial goal is to
> support the SQL-like language used by Dremel and
> [[https://developers.google.com/bigquery/docs/query-reference|Google
> BigQuery]], which we call DrQL. However, Drill is designed to support
> other languages and programming models, such as the
> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> Language]], [[http://www.cascading.org/|Cascading]] or
> [[https://github.com/tdunning/Plume|Plume]].
>  * Low-latency distributed execution engine: This layer is responsible
> for executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000
> servers. Drill's execution engine is based on research in distributed
> execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> columnar storage, and can be extended with additional operators and
> connectors.
>  * Nested data formats: This layer is responsible for supporting
> various data formats. The initial goal is to support the column-based
> format used by Dremel. Drill is designed to support schema-based
> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> and schema-less formats such as JSON, BSON or YAM

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Otis Gospodnetic
+1 (blinding)

Otis

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 



>
> From: Ted Dunning 
>To: general@incubator.apache.org 
>Sent: Tuesday, August 7, 2012 10:41 PM
>Subject: [VOTE] Accept Drill into the Apache Incubator
> 
>I would like to call a vote for accepting Drill for incubation in the
>Apache Incubator. The full proposal is available below.  Discussion
>over the last few days has been quite positive.
>
>Please cast your vote:
>
>[ ] +1, bring Drill into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Drill into Incubator, because...
>
>This vote will be open for 72 hours and only votes from the Incubator
>PMC are binding.  The start of the vote is just before 3AM UTC on 8
>August so the closing time will be 3AM UTC on 11 August.
>
>Thank you for your consideration!
>
>Ted
>
>http://wiki.apache.org/incubator/DrillProposal
>
>= Drill =
>
>== Abstract ==
>Drill is a distributed system for interactive analysis of large-scale
>datasets, inspired by
>[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>
>== Proposal ==
>Drill is a distributed system for interactive analysis of large-scale
>datasets. Drill is similar to Google's Dremel, with the additional
>flexibility needed to support a broader range of query languages, data
>formats and data sources. It is designed to efficiently process nested
>data. It is a design goal to scale to 10,000 servers or more and to be
>able to process petabytes of data and trillions of records in seconds.
>
>== Background ==
>Many organizations have the need to run data-intensive applications,
>including batch processing, stream processing and interactive
>analysis. In recent years open source systems have emerged to address
>the need for scalable batch processing (Apache Hadoop) and stream
>processing (Storm, Apache S4). In 2010 Google published a paper called
>"Dremel: Interactive Analysis of Web-Scale Datasets," describing a
>scalable system used internally for interactive analysis of nested
>data. No open source project has successfully replicated the
>capabilities of Dremel.
>
>== Rationale ==
>There is a strong need in the market for low-latency interactive
>analysis of large-scale datasets, including nested data (eg, JSON,
>Avro, Protocol Buffers). This need was identified by Google and
>addressed internally with a system called Dremel.
>
>In recent years open source systems have emerged to address the need
>for scalable batch processing (Apache Hadoop) and stream processing
>(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
>internal MapReduce system, is used by thousands of organizations
>processing large-scale datasets. Apache Hadoop is designed to achieve
>very high throughput, but is not designed to achieve the sub-second
>latency needed for interactive data analysis and exploration. Drill,
>inspired by Google's internal Dremel system, is intended to address
>this need.
>
>It is worth noting that, as explained by Google in the original paper,
>Dremel complements MapReduce-based computing. Dremel is not intended
>as a replacement for MapReduce and is often used in conjunction with
>it to analyze outputs of MapReduce pipelines or rapidly prototype
>larger computations. Indeed, Dremel and MapReduce are both used by
>thousands of Google employees.
>
>Like Dremel, Drill supports a nested data model with data encoded in a
>number of formats such as JSON, Avro or Protocol Buffers. In many
>organizations nested data is the standard, so supporting a nested data
>model eliminates the need to normalize the data. With that said, flat
>data formats, such as CSV files, are naturally supported as a special
>case of nested data.
>
>The Drill architecture consists of four key components/layers:
>* Query languages: This layer is responsible for parsing the user's
>query and constructing an execution plan.  The initial goal is to
>support the SQL-like language used by Dremel and
>[[https://developers.google.com/bigquery/docs/query-reference|Google
>BigQuery]], which we call DrQL. However, Drill is designed to support
>other languages and programming models, such as the
>[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
>Language]], [[http://www.cascading.org/|Cascading]] or
>[[https://github.com/tdunning/Plume|Plume]].
>* Low-latency distributed execution engine: This layer is responsible
>for executing the physical plan. It provides the scalability and fault
>tolerance needed to efficiently query petabytes of data on 10,000
>servers. Drill's execution engine is based on research in distributed
>execu

RE: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Franklin, Matthew B.
+1 (binding)

>-----Original Message-----
>From: Ted Dunning [mailto:ted.dunn...@gmail.com]
>Sent: Tuesday, August 07, 2012 10:41 PM
>To: general@incubator.apache.org
>Subject: [VOTE] Accept Drill into the Apache Incubator
>
>I would like to call a vote for accepting Drill for incubation in the
>Apache Incubator. The full proposal is available below.  Discussion
>over the last few days has been quite positive.
>
>Please cast your vote:
>
>[ ] +1, bring Drill into Incubator
>[ ] +0, I don't care either way,
>[ ] -1, do not bring Drill into Incubator, because...
>
>This vote will be open for 72 hours and only votes from the Incubator
>PMC are binding.  The start of the vote is just before 3AM UTC on 8
>August so the closing time will be 3AM UTC on 11 August.
>
>Thank you for your consideration!
>
>Ted
>
>http://wiki.apache.org/incubator/DrillProposal
>
>= Drill =
>
>== Abstract ==
>Drill is a distributed system for interactive analysis of large-scale
>datasets, inspired by
>[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
>
>== Proposal ==
>Drill is a distributed system for interactive analysis of large-scale
>datasets. Drill is similar to Google's Dremel, with the additional
>flexibility needed to support a broader range of query languages, data
>formats and data sources. It is designed to efficiently process nested
>data. It is a design goal to scale to 10,000 servers or more and to be
>able to process petabytes of data and trillions of records in seconds.
>
>== Background ==
>Many organizations have the need to run data-intensive applications,
>including batch processing, stream processing and interactive
>analysis. In recent years open source systems have emerged to address
>the need for scalable batch processing (Apache Hadoop) and stream
>processing (Storm, Apache S4). In 2010 Google published a paper called
>"Dremel: Interactive Analysis of Web-Scale Datasets," describing a
>scalable system used internally for interactive analysis of nested
>data. No open source project has successfully replicated the
>capabilities of Dremel.
>
>== Rationale ==
>There is a strong need in the market for low-latency interactive
>analysis of large-scale datasets, including nested data (eg, JSON,
>Avro, Protocol Buffers). This need was identified by Google and
>addressed internally with a system called Dremel.
>
>In recent years open source systems have emerged to address the need
>for scalable batch processing (Apache Hadoop) and stream processing
>(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
>internal MapReduce system, is used by thousands of organizations
>processing large-scale datasets. Apache Hadoop is designed to achieve
>very high throughput, but is not designed to achieve the sub-second
>latency needed for interactive data analysis and exploration. Drill,
>inspired by Google's internal Dremel system, is intended to address
>this need.
>
>It is worth noting that, as explained by Google in the original paper,
>Dremel complements MapReduce-based computing. Dremel is not intended
>as a replacement for MapReduce and is often used in conjunction with
>it to analyze outputs of MapReduce pipelines or rapidly prototype
>larger computations. Indeed, Dremel and MapReduce are both used by
>thousands of Google employees.
>
>Like Dremel, Drill supports a nested data model with data encoded in a
>number of formats such as JSON, Avro or Protocol Buffers. In many
>organizations nested data is the standard, so supporting a nested data
>model eliminates the need to normalize the data. With that said, flat
>data formats, such as CSV files, are naturally supported as a special
>case of nested data.
>
>The Drill architecture consists of four key components/layers:
> * Query languages: This layer is responsible for parsing the user's
>query and constructing an execution plan.  The initial goal is to
>support the SQL-like language used by Dremel and
>[[https://developers.google.com/bigquery/docs/query-reference|Google
>BigQuery]], which we call DrQL. However, Drill is designed to support
>other languages and programming models, such as the
>[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
>Language]], [[http://www.cascading.org/|Cascading]] or
>[[https://github.com/tdunning/Plume|Plume]].
> * Low-latency distributed execution engine: This layer is responsible
>for executing the physical plan. It provides the scalability and fault
>tolerance needed to efficiently query petabytes of data on 10,000
>servers. Drill's execution engine is based on research in distributed
>execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Phillip Rhodes
On Tue, Aug 7, 2012 at 9:41 PM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...

+1


Phil




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Mohammad Nour El-Din
+1 (binding)

On Wed, Aug 8, 2012 at 3:55 PM, Grant Ingersoll  wrote:

>
> On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote:
>
> > I would like to call a vote for accepting Drill for incubation in the
> > Apache Incubator. The full proposal is available below.  Discussion
> > over the last few days has been quite positive.
> >
> > Please cast your vote:
> >
> > [ ] +1, bring Drill into Incubator
>
> +1 (binding)
>


-- 
Thanks
- Mohammad Nour

"Life is like riding a bicycle. To keep your balance you must keep moving"
- Albert Einstein


Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Grant Ingersoll

On Aug 7, 2012, at 10:41 PM, Ted Dunning wrote:

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
> 
> Please cast your vote:
> 
> [ ] +1, bring Drill into Incubator

+1 (binding)




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Torsten Curdt
On Wed, Aug 8, 2012 at 11:39 AM, Bertrand Delacretaz
 wrote:
> On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning  wrote:
>> I would like to call a vote for accepting Drill for incubation in the
>> Apache Incubator...
>
> +1

+1

cheers,
Torsten




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Bertrand Delacretaz
On Wed, Aug 8, 2012 at 4:41 AM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator...

+1

-Bertrand




Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-08 Thread Andrzej Bialecki

On 08/08/2012 04:41, Ted Dunning wrote:

I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.


+1 (binding) - this is an exciting proposal!

--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
 ___.,___,___,___,_._. __<><
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com





Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Alex Karasulu
+1 (binding)

On Wed, Aug 8, 2012 at 8:33 AM, Mattmann, Chris A (388J) <
chris.a.mattm...@jpl.nasa.gov> wrote:

> +1 (binding). Good luck and sounds cool!
>
> Cheers,
> Chris
>
> On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:
>
> > I would like to call a vote for accepting Drill for incubation in the
> > Apache Incubator. The full proposal is available below.  Discussion
> > over the last few days has been quite positive.
> >
> > Please cast your vote:
> >
> > [ ] +1, bring Drill into Incubator
> > [ ] +0, I don't care either way,
> > [ ] -1, do not bring Drill into Incubator, because...
> >
> > This vote will be open for 72 hours and only votes from the Incubator
> > PMC are binding.  The start of the vote is just before 3AM UTC on 8
> > August so the closing time will be 3AM UTC on 11 August.
> >
> > Thank you for your consideration!
> >
> > Ted
> >
> > http://wiki.apache.org/incubator/DrillProposal
> >
> > = Drill =
> >
> > == Abstract ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets, inspired by
> > [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> >
> > == Proposal ==
> > Drill is a distributed system for interactive analysis of large-scale
> > datasets. Drill is similar to Google's Dremel, with the additional
> > flexibility needed to support a broader range of query languages, data
> > formats and data sources. It is designed to efficiently process nested
> > data. It is a design goal to scale to 10,000 servers or more and to be
> > able to process petabytes of data and trillions of records in seconds.
> >
> > == Background ==
> > Many organizations have the need to run data-intensive applications,
> > including batch processing, stream processing and interactive
> > analysis. In recent years open source systems have emerged to address
> > the need for scalable batch processing (Apache Hadoop) and stream
> > processing (Storm, Apache S4). In 2010 Google published a paper called
> > "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> > scalable system used internally for interactive analysis of nested
> > data. No open source project has successfully replicated the
> > capabilities of Dremel.
> >
> > == Rationale ==
> > There is a strong need in the market for low-latency interactive
> > analysis of large-scale datasets, including nested data (eg, JSON,
> > Avro, Protocol Buffers). This need was identified by Google and
> > addressed internally with a system called Dremel.
> >
> > In recent years open source systems have emerged to address the need
> > for scalable batch processing (Apache Hadoop) and stream processing
> > (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> > internal MapReduce system, is used by thousands of organizations
> > processing large-scale datasets. Apache Hadoop is designed to achieve
> > very high throughput, but is not designed to achieve the sub-second
> > latency needed for interactive data analysis and exploration. Drill,
> > inspired by Google's internal Dremel system, is intended to address
> > this need.
> >
> > It is worth noting that, as explained by Google in the original paper,
> > Dremel complements MapReduce-based computing. Dremel is not intended
> > as a replacement for MapReduce and is often used in conjunction with
> > it to analyze outputs of MapReduce pipelines or rapidly prototype
> > larger computations. Indeed, Dremel and MapReduce are both used by
> > thousands of Google employees.
> >
> > Like Dremel, Drill supports a nested data model with data encoded in a
> > number of formats such as JSON, Avro or Protocol Buffers. In many
> > organizations nested data is the standard, so supporting a nested data
> > model eliminates the need to normalize the data. With that said, flat
> > data formats, such as CSV files, are naturally supported as a special
> > case of nested data.
> >
> > The Drill architecture consists of four key components/layers:
> > * Query languages: This layer is responsible for parsing the user's
> > query and constructing an execution plan.  The initial goal is to
> > support the SQL-like language used by Dremel and
> > [[https://developers.google.com/bigquery/docs/query-reference|Google
> > BigQuery]], which we call DrQL. However, Drill is designed to support
> > other languages and programming models, such as the
> > [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> > Language]], [[http://www.cascading.org/|Cascading]] or
> > [[https://github.com/tdunning/Plume|Plume]].
> > * Low-latency distributed execution engine: This layer is responsible
> > for executing the physical plan. It provides the scalability and fault
> > tolerance needed to efficiently query petabytes of data on 10,000
> > servers. Drill's execution engine is based on research in distributed
> > execution engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> > columnar storage, and can be extended with additional operators and
> > connectors.

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Devaraj Das
+1 (binding)

On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
> 
> Please cast your vote:
> 
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
> 
> Thank you for your consideration!
> 
> Ted
> 
> http://wiki.apache.org/incubator/DrillProposal
> 
> = Drill =
> 
> == Abstract ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets, inspired by
> [[http://research.google.com/pubs/pub36632.html|Google's Dremel]].
> 
> == Proposal ==
> Drill is a distributed system for interactive analysis of large-scale
> datasets. Drill is similar to Google's Dremel, with the additional
> flexibility needed to support a broader range of query languages, data
> formats and data sources. It is designed to efficiently process nested
> data. It is a design goal to scale to 10,000 servers or more and to be
> able to process petabytes of data and trillions of records in seconds.
> 
> == Background ==
> Many organizations have the need to run data-intensive applications,
> including batch processing, stream processing and interactive
> analysis. In recent years open source systems have emerged to address
> the need for scalable batch processing (Apache Hadoop) and stream
> processing (Storm, Apache S4). In 2010 Google published a paper called
> "Dremel: Interactive Analysis of Web-Scale Datasets," describing a
> scalable system used internally for interactive analysis of nested
> data. No open source project has successfully replicated the
> capabilities of Dremel.
> 
> == Rationale ==
> There is a strong need in the market for low-latency interactive
> analysis of large-scale datasets, including nested data (eg, JSON,
> Avro, Protocol Buffers). This need was identified by Google and
> addressed internally with a system called Dremel.
> 
> In recent years open source systems have emerged to address the need
> for scalable batch processing (Apache Hadoop) and stream processing
> (Storm, Apache S4). Apache Hadoop, originally inspired by Google's
> internal MapReduce system, is used by thousands of organizations
> processing large-scale datasets. Apache Hadoop is designed to achieve
> very high throughput, but is not designed to achieve the sub-second
> latency needed for interactive data analysis and exploration. Drill,
> inspired by Google's internal Dremel system, is intended to address
> this need.
> 
> It is worth noting that, as explained by Google in the original paper,
> Dremel complements MapReduce-based computing. Dremel is not intended
> as a replacement for MapReduce and is often used in conjunction with
> it to analyze outputs of MapReduce pipelines or rapidly prototype
> larger computations. Indeed, Dremel and MapReduce are both used by
> thousands of Google employees.
> 
> Like Dremel, Drill supports a nested data model with data encoded in a
> number of formats such as JSON, Avro or Protocol Buffers. In many
> organizations nested data is the standard, so supporting a nested data
> model eliminates the need to normalize the data. With that said, flat
> data formats, such as CSV files, are naturally supported as a special
> case of nested data.
> 
> The Drill architecture consists of four key components/layers:
> * Query languages: This layer is responsible for parsing the user's
> query and constructing an execution plan.  The initial goal is to
> support the SQL-like language used by Dremel and
> [[https://developers.google.com/bigquery/docs/query-reference|Google
> BigQuery]], which we call DrQL. However, Drill is designed to support
> other languages and programming models, such as the
> [[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
> Language]], [[http://www.cascading.org/|Cascading]] or
> [[https://github.com/tdunning/Plume|Plume]].
> * Low-latency distributed execution engine: This layer is responsible
> for executing the physical plan. It provides the scalability and fault
> tolerance needed to efficiently query petabytes of data on 10,000
> servers. Drill's execution engine is based on research in distributed
> execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
> columnar storage, and can be extended with additional operators and
> connectors.
> * Nested data formats: This layer is responsible for supporting
> various data formats. The initial goal is to support the column-based
> format used by Dremel. Drill is designed to support schema-based
> formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
> and schema-less formats such as JSON, BSON or YAML. In ad

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Arun C Murthy
+1 (binding)

On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
> 
> Please cast your vote:
> 
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
> 
> Thank you for your consideration!
> 
> Ted
> 
> http://wiki.apache.org/incubator/DrillProposal
> 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Mattmann, Chris A (388J)
+1 (binding). Good luck and sounds cool!

Cheers,
Chris

On Aug 7, 2012, at 7:41 PM, Ted Dunning wrote:

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
> 
> Please cast your vote:
> 
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
> 
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
> 
> Thank you for your consideration!
> 
> Ted
> 
> http://wiki.apache.org/incubator/DrillProposal
> 

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Ashish
+1 (non-binding)

On Wed, Aug 8, 2012 at 8:11 AM, Ted Dunning  wrote:
> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
>
> Thank you for your consideration!
>
> Ted
>
> http://wiki.apache.org/incubator/DrillProposal
>

Re: [VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Scott Deboy
+1 (binding)

On Tue, Aug 7, 2012 at 7:41 PM, Ted Dunning  wrote:

> I would like to call a vote for accepting Drill for incubation in the
> Apache Incubator. The full proposal is available below.  Discussion
> over the last few days has been quite positive.
>
> Please cast your vote:
>
> [ ] +1, bring Drill into Incubator
> [ ] +0, I don't care either way,
> [ ] -1, do not bring Drill into Incubator, because...
>
> This vote will be open for 72 hours and only votes from the Incubator
> PMC are binding.  The start of the vote is just before 3AM UTC on 8
> August so the closing time will be 3AM UTC on 11 August.
>
> Thank you for your consideration!
>
> Ted
>
> http://wiki.apache.org/incubator/DrillProposal
>

[VOTE] Accept Drill into the Apache Incubator

2012-08-07 Thread Ted Dunning
I would like to call a vote for accepting Drill for incubation in the
Apache Incubator. The full proposal is available below.  Discussion
over the last few days has been quite positive.

Please cast your vote:

[ ] +1, bring Drill into Incubator
[ ] +0, I don't care either way,
[ ] -1, do not bring Drill into Incubator, because...

This vote will be open for 72 hours and only votes from the Incubator
PMC are binding.  The start of the vote is just before 3AM UTC on 8
August so the closing time will be 3AM UTC on 11 August.

Thank you for your consideration!

Ted

http://wiki.apache.org/incubator/DrillProposal

= Drill =

== Abstract ==
Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by
[[http://research.google.com/pubs/pub36632.html|Google's Dremel]].

== Proposal ==
Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google's Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be
able to process petabytes of data and trillions of records in seconds.

== Background ==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive
analysis. In recent years open source systems have emerged to address
the need for scalable batch processing (Apache Hadoop) and stream
processing (Storm, Apache S4). In 2010 Google published a paper called
"Dremel: Interactive Analysis of Web-Scale Datasets," describing a
scalable system used internally for interactive analysis of nested
data. No open source project has successfully replicated the
capabilities of Dremel.

== Rationale ==
There is a strong need in the market for low-latency interactive
analysis of large-scale datasets, including nested data (e.g., JSON,
Avro, Protocol Buffers). This need was identified by Google and
addressed internally with a system called Dremel.

In recent years open source systems have emerged to address the need
for scalable batch processing (Apache Hadoop) and stream processing
(Storm, Apache S4). Apache Hadoop, originally inspired by Google's
internal MapReduce system, is used by thousands of organizations
processing large-scale datasets. Apache Hadoop is designed to achieve
very high throughput, but is not designed to achieve the sub-second
latency needed for interactive data analysis and exploration. Drill,
inspired by Google's internal Dremel system, is intended to address
this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended
as a replacement for MapReduce and is often used in conjunction with
it to analyze outputs of MapReduce pipelines or rapidly prototype
larger computations. Indeed, Dremel and MapReduce are both used by
thousands of Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat
data formats, such as CSV files, are naturally supported as a special
case of nested data.
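
To illustrate the last point, here is a minimal sketch in Python (not
Drill code) of how a flat CSV row can be treated as a nested record of
depth one. The helper names csv_rows_as_records and field_paths, and the
sample data, are hypothetical and chosen only for this example.

```python
import csv
import io
import json


def csv_rows_as_records(text):
    """Read CSV text into dicts: each row is a nested record of depth one."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text))]


def field_paths(record, prefix=""):
    """Yield (dotted path, leaf value) pairs for a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested group: descend and extend the path.
            yield from field_paths(value, path)
        elif isinstance(value, list):
            # Repeated field: every element shares the same path.
            for item in value:
                if isinstance(item, dict):
                    yield from field_paths(item, path)
                else:
                    yield path, item
        else:
            yield path, value


# Hypothetical sample data: one nested JSON record and one flat CSV row.
nested = json.loads('{"name": "a", "links": {"forward": [20, 40]}}')
flat = csv_rows_as_records("name,clicks\na,3\n")[0]

nested_paths = list(field_paths(nested))
flat_paths = list(field_paths(flat))
```

The same traversal handles both inputs. The CSV row simply yields paths
with a single component, which is the sense in which flat data is a
special case of nested data.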

The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user's
query and constructing an execution plan.  The initial goal is to
support the SQL-like language used by Dremel and
[[https://developers.google.com/bigquery/docs/query-reference|Google
BigQuery]], which we call DrQL. However, Drill is designed to support
other languages and programming models, such as the
[[http://www.mongodb.org/display/DOCS/Mongo+Query+Language|Mongo Query
Language]], [[http://www.cascading.org/|Cascading]] or
[[https://github.com/tdunning/Plume|Plume]].
 * Low-latency distributed execution engine: This layer is responsible
for executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000
servers. Drill's execution engine is based on research in distributed
execution engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and
columnar storage, and can be extended with additional operators and
connectors.
 * Nested data formats: This layer is responsible for supporting
various data formats. The initial goal is to support the column-based
format used by Dremel. Drill is designed to support schema-based
formats such as Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV,
and schema-less formats such as JSON, BSON or YAML. In addition, it is
designed to support column-based formats such as Dremel,
AVRO-806/Trevni and RCFile, and row-based formats such as Protocol
Buffers, Avro, JSON, BSON and CSV. A particular distinction with Drill
is that the execution engine is flexible enoug