Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-16 Thread Tomer Shiran
Yes, we plan to support joins.

We are in the process of setting up the mailing lists.

On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga karthik.tu...@gmail.com wrote:

 The proposal looks great. I was wondering what operations Drill will
 support.
 For example, the Dremel paper doesn't talk about joins; will Drill support
 joins?

 Sorry if I missed it, but is there a dev mailing list I could subscribe to?

 Cheers,
 Karthik

 On 13 August 2012 23:55, Bernd Fondermann bernd.fonderm...@gmail.com wrote:

  great proposal and a very promising mentor lineup.
 
  Have fun,
 
Bernd
 
  On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning tdunn...@apache.org wrote:
   Abstract

   Drill is a distributed system for interactive analysis of large-scale
   datasets, inspired by Google’s Dremel
   (http://research.google.com/pubs/pub36632.html).

   Proposal

   Drill is a distributed system for interactive analysis of large-scale
   datasets. Drill is similar to Google’s Dremel, with the additional
   flexibility needed to support a broader range of query languages, data
   formats and data sources. It is designed to efficiently process nested
   data. It is a design goal to scale to 10,000 servers or more and to be
   able to process petabytes of data and trillions of records in seconds.

   Background
   ==
   Many organizations have the need to run data-intensive applications,
   including batch processing, stream processing and interactive analysis. In
   recent years open source systems have emerged to address the need for
   scalable batch processing (Apache Hadoop) and stream processing (Storm,
   Apache S4). In 2010 Google published a paper called “Dremel: Interactive
   Analysis of Web-Scale Datasets,” describing a scalable system used
   internally for interactive analysis of nested data. No open source project
   has successfully replicated the capabilities of Dremel.

   Rationale
   =
   There is a strong need in the market for low-latency interactive analysis
   of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
   Buffers). This need was identified by Google and addressed internally with
   a system called Dremel.

   In recent years open source systems have emerged to address the need for
   scalable batch processing (Apache Hadoop) and stream processing (Storm,
   Apache S4). Apache Hadoop, originally inspired by Google’s internal
   MapReduce system, is used by thousands of organizations processing
   large-scale datasets. Apache Hadoop is designed to achieve very high
   throughput, but is not designed to achieve the sub-second latency needed
   for interactive data analysis and exploration. Drill, inspired by Google’s
   internal Dremel system, is intended to address this need.

   It is worth noting that, as explained by Google in the original paper,
   Dremel complements MapReduce-based computing. Dremel is not intended as a
   replacement for MapReduce and is often used in conjunction with it to
   analyze outputs of MapReduce pipelines or rapidly prototype larger
   computations. Indeed, Dremel and MapReduce are both used by thousands of
   Google employees.

   Like Dremel, Drill supports a nested data model with data encoded in a
   number of formats such as JSON, Avro or Protocol Buffers. In many
   organizations nested data is the standard, so supporting a nested data
   model eliminates the need to normalize the data. With that said, flat data
   formats, such as CSV files, are naturally supported as a special case of
   nested data.

   The Drill architecture consists of four key components/layers:
   * Query languages: This layer is responsible for parsing the user’s query
   and constructing an execution plan.  The initial goal is to support the
   SQL-like language used by Dremel and Google BigQuery
   (https://developers.google.com/bigquery/docs/query-reference), which we
   call DrQL. However, Drill is designed to support other languages and
   programming models, such as the Mongo Query Language
   (http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading
   (http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
   * Low-latency distributed execution engine: This layer is responsible for
   executing the physical plan. It provides the scalability and fault
   tolerance needed to efficiently query petabytes of data on 10,000 servers.
   Drill’s execution engine is based on research in distributed execution
   engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
   storage, and can be extended with additional operators and connectors.
   * Nested data formats: This layer is responsible for supporting various
   data formats. The initial goal is to support the column-based format used
   by Dremel. Drill is designed to support schema-based formats such as
   Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-16 Thread Ted Dunning
The mailing list request is in infra's hands.

One of the better sources of information about Dremel is the BigQuery
documentation.  That says that the right side of a join must be < 8MB and
that the only outer join available is a left outer join.

What Drill does is somewhat of a different question.
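
A restriction like the one described above (a small right-hand table, left outer join only) is what one would expect from a broadcast hash join, where the small side is copied to every worker and held in memory. Whether Dremel or BigQuery actually works this way is an assumption here, not something stated in this thread, and the function name below is hypothetical. A minimal single-process sketch of the idea in Python:

```python
from typing import Any, Dict, Iterable, Iterator, List

Row = Dict[str, Any]


def broadcast_left_outer_join(
    left: Iterable[Row], right_small: Iterable[Row], key: str
) -> Iterator[Row]:
    """Left outer join that holds the whole right side in memory.

    Hypothetical illustration only: a real engine would ship the
    in-memory table to each worker and stream its share of the left side.
    """
    index: Dict[Any, List[Row]] = {}
    right_columns: List[str] = []
    for row in right_small:
        index.setdefault(row[key], []).append(row)
        for column in row:
            if column != key and column not in right_columns:
                right_columns.append(column)

    for left_row in left:
        matches = index.get(left_row[key])
        if matches:
            for right_row in matches:
                # Left values win if both sides share a column name.
                yield {**right_row, **left_row}
        else:
            # Outer join: keep the left row and pad right columns with None.
            yield {**{c: None for c in right_columns}, **left_row}
```

The memory cost is bounded by the size of the right side only, which is one reason an engine might cap that side at a few megabytes.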

On Thu, Aug 16, 2012 at 12:18 AM, Tomer Shiran tshi...@maprtech.com wrote:

 Yes, we plan to support joins.

 We are in the process of setting up the mailing lists.

 On Thu, Aug 16, 2012 at 12:09 AM, karthik tunga karthik.tu...@gmail.com wrote:

  The proposal looks great. I was wondering what operations Drill will
  support.
  For example, the Dremel paper doesn't talk about joins; will Drill support
  joins?

  Sorry if I missed it, but is there a dev mailing list I could subscribe to?
 
  Cheers,
  Karthik
 
  On 13 August 2012 23:55, Bernd Fondermann bernd.fonderm...@gmail.com wrote:
 
   great proposal and a very promising mentor lineup.
  
   Have fun,
  
 Bernd
  

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-14 Thread Bernd Fondermann
great proposal and a very promising mentor lineup.

Have fun,

  Bernd

On Thu, Aug 2, 2012 at 11:40 PM, Ted Dunning tdunn...@apache.org wrote:
 Abstract
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by Google’s Dremel (
 http://research.google.com/pubs/pub36632.html).

 Proposal
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google’s Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be able
 to process petabytes of data and trillions of records in seconds.

 Background
 ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive analysis. In
 recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). In 2010 Google published a paper called “Dremel: Interactive
 Analysis of Web-Scale Datasets,” describing a scalable system used
 internally for interactive analysis of nested data. No open source project
 has successfully replicated the capabilities of Dremel.

 Rationale
 =
 There is a strong need in the market for low-latency interactive analysis
 of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
 Buffers). This need was identified by Google and addressed internally with
 a system called Dremel.

 In recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). Apache Hadoop, originally inspired by Google’s internal
 MapReduce system, is used by thousands of organizations processing
 large-scale datasets. Apache Hadoop is designed to achieve very high
 throughput, but is not designed to achieve the sub-second latency needed
 for interactive data analysis and exploration. Drill, inspired by Google’s
 internal Dremel system, is intended to address this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended as a
 replacement for MapReduce and is often used in conjunction with it to
 analyze outputs of MapReduce pipelines or rapidly prototype larger
 computations. Indeed, Dremel and MapReduce are both used by thousands of
 Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat data
 formats, such as CSV files, are naturally supported as a special case of
 nested data.
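
The claim that flat formats are a special case of nested data can be illustrated with a small sketch. The helper below is hypothetical and is not part of the proposal; it maps a nested record to dotted column paths, and a flat CSV-style row passes through unchanged because it is simply a nested record of depth one. Dremel's actual column encoding also tracks repetition and definition levels, which this sketch does not model.

```python
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Map a nested record to {dotted.path: value} columns.

    Illustrative only: lists (repeated fields) are kept as opaque values.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


nested = {"user": {"name": "ann", "geo": {"country": "NL"}}, "clicks": 3}
csv_row = {"name": "ann", "country": "NL", "clicks": 3}

print(flatten(nested))
print(flatten(csv_row) == csv_row)
```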

 The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user’s query
 and constructing an execution plan.  The initial goal is to support the
 SQL-like language used by Dremel and Google BigQuery (
 https://developers.google.com/bigquery/docs/query-reference), which we call
 DrQL. However, Drill is designed to support other languages and programming
 models, such as the Mongo Query Language (
 http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
 http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
 * Low-latency distributed execution engine: This layer is responsible for
 executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000 servers.
 Drill’s execution engine is based on research in distributed execution
 engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
 storage, and can be extended with additional operators and connectors.
 * Nested data formats: This layer is responsible for supporting various
 data formats. The initial goal is to support the column-based format used
 by Dremel. Drill is designed to support schema-based formats such as
 Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
 formats such as JSON, BSON or YAML. In addition, it is designed to support
 column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
 row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
 particular distinction with Drill is that the execution engine is flexible
 enough to support column-based processing as well as row-based processing.
 This is important because column-based processing can be much more
 efficient when the data is stored in a column-based format, but many large
 data assets are stored in a row-based format that would require conversion
 before use.
 * Scalable data sources: This layer is responsible for supporting various
 data sources. The initial 

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-09 Thread Arun C Murthy

On Aug 8, 2012, at 2:13 PM, Ted Dunning wrote:
 
  It is clear that there are gobs of people with the
 credentials and track record to be potential contributors, but it is
 also clear that many of these people have huge demands on their time.
 That leaves doubt about how much contribution they can or should be
 making to a new project.

Wow! It's your project, and you can choose how to run this. However, when I do 
contribute I hope my contributions aren't discouraged because I should not be 
contributing to a new project because of the demands on my time after I 
volunteered to. 

I don't wish to belabor this or stand in your way, good luck. Hopefully, the 
project will be encouraging to new contributors.

Arun

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-09 Thread Ted Dunning
On Thu, Aug 9, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote:

 On Aug 8, 2012, at 2:13 PM, Ted Dunning wrote:

  It is clear that there are gobs of people with the
 credentials and track record to be potential contributors, but it is
 also clear that many of these people have huge demands on their time.
 That leaves doubt about how much contribution they can or should be
 making to a new project.

 Wow! It's your project, and you can choose how to run this. However, when I
 do contribute I hope my contributions aren't discouraged because I should
 not be contributing to a new project because of the demands on my time
 after I volunteered to.

All contributions will be heartily welcomed.  The stance I intend to
encourage (and the other mentors and early committers share this
intent) is similar to the Mahout project personality ... contributions
and contributors are highly welcomed.

The only thing that I am pushing for here is a timing detail.  Since a
vote is ongoing right now, I would like to finish the vote before
changing anything.  Assuming the vote succeeds (and there is a strong
trend that way) then we will do the necessary plumbing to get the
project started and be ready for contributions.


 I don't wish to belabor this or stand in your way, good luck. Hopefully, the 
 project will be encouraging to new contributors.

It absolutely will be.

-
To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
For additional commands, e-mail: general-h...@incubator.apache.org



Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-08 Thread Jakob Homan
On Mon, Aug 6, 2012 at 2:23 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 No reason at all.

Sorry.  I may have been unclear.  I was requesting that the design
docs which are being referenced in the proposal:

  "The requirement and design documents are currently stored in MapR
  Technologies' source code repository. They will be checked in as part
  of the initial code dump."

be made available for review as part of the proposal, much as an
initial source code base would be.  There is also a reference to a
presentation to-be-made available:

  "High-level slides have been published by MapR: TODO"

Can those be made public?




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-08 Thread Ted Dunning
The consensus in the group of committers listed in the proposal is
that we would like to discourage piling on of pre-formation committers
and encourage adding committers after formation based on
contributions.  It is clear that there are gobs of people with the
credentials and track record to be potential contributors, but it is
also clear that many of these people have huge demands on their time.
That leaves doubt about how much contribution they can or should be
making to a new project.

It is also clear that there are gobs of people that are not already
part of Apache who may have time and expertise to contribute.

In any case, the vote is already started and will be done before long.
 Let's go with what we are already voting on without changing it in
mid-stream and then adjust later.  Progress, not perfection, as they
say.

On Wed, Aug 8, 2012 at 3:31 AM, Bertrand Delacretaz
bdelacre...@apache.org wrote:
 On Wed, Aug 8, 2012 at 7:20 AM, Marvin Humphrey mar...@rectangular.com wrote:
 On Tue, Aug 7, 2012 at 10:09 PM, Arun C Murthy a...@hortonworks.com wrote:
 Wasn't clear, can I add myself now?

 Didn't the Incubator go back to discouraging open enrollment?...

 AFAIK, no. What was discussed is that incoming podlings should clearly
 state their requirements for people that want to be added as initial
 committers, to keep it fair.

 -Bertrand




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-08 Thread Chris Douglas
+1 -C

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 This is a duplicated attempt at sending this message, please ignore the
 previous message if it eventually arrives.  There appears to be a hangup
 sending email from my apache email address via gmail.

 Abstract
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by Google’s Dremel (
 http://research.google.com/pubs/pub36632.html).

 Proposal
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google’s Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be able
 to process petabytes of data and trillions of records in seconds.

 Background
 ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive analysis. In
 recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). In 2010 Google published a paper called “Dremel: Interactive
 Analysis of Web-Scale Datasets,” describing a scalable system used
 internally for interactive analysis of nested data. No open source project
 has successfully replicated the capabilities of Dremel.

 Rationale
 =
 There is a strong need in the market for low-latency interactive analysis
 of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
 Buffers). This need was identified by Google and addressed internally with
 a system called Dremel.

 In recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). Apache Hadoop, originally inspired by Google’s internal
 MapReduce system, is used by thousands of organizations processing
 large-scale datasets. Apache Hadoop is designed to achieve very high
 throughput, but is not designed to achieve the sub-second latency needed
 for interactive data analysis and exploration. Drill, inspired by Google’s
 internal Dremel system, is intended to address this need.

 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended as a
 replacement for MapReduce and is often used in conjunction with it to
 analyze outputs of MapReduce pipelines or rapidly prototype larger
 computations. Indeed, Dremel and MapReduce are both used by thousands of
 Google employees.

 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat data
 formats, such as CSV files, are naturally supported as a special case of
 nested data.

 The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user’s query
 and constructing an execution plan.  The initial goal is to support the
 SQL-like language used by Dremel and Google BigQuery (
 https://developers.google.com/bigquery/docs/query-reference), which we call
 DrQL. However, Drill is designed to support other languages and programming
 models, such as the Mongo Query Language (
 http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
 http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
 * Low-latency distributed execution engine: This layer is responsible for
 executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000 servers.
 Drill’s execution engine is based on research in distributed execution
 engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
 storage, and can be extended with additional operators and connectors.
 * Nested data formats: This layer is responsible for supporting various
 data formats. The initial goal is to support the column-based format used
 by Dremel. Drill is designed to support schema-based formats such as
 Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
 formats such as JSON, BSON or YAML. In addition, it is designed to support
 column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
 row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
 particular distinction with Drill is that the execution engine is flexible
 enough to support column-based processing as well as row-based processing.
 This is important because column-based processing can be much more
 efficient when the data is stored in a column-based format, but many large
 data assets are stored in a row-based format that 

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-08 Thread Tomer Shiran
Oops, apologies - thanks for the reminder. I uploaded the slides as an
attachment on the wiki page.

Thanks,
Tomer

On Wed, Aug 8, 2012 at 9:14 PM, Jakob Homan jgho...@gmail.com wrote:

 So, no response to my request above about the design docs and
 not-TO-DOne MapR presentation?

 On Wed, Aug 8, 2012 at 3:25 PM, Chris Douglas cdoug...@apache.org wrote:
  +1 -C
 

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Tomer Shiran
FYI: I have posted the proposal to the wiki and updated it based on the
feedback from Marvin and Jakob:
http://wiki.apache.org/incubator/DrillProposal

On Mon, Aug 6, 2012 at 2:29 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 In fact, a big part of the motivation for proposing incubation before code
 is ready is exactly to foster the discussions needed to form community.

 It is true that many projects that start without the fundamentals face
 challenges that more mature projects face but that is really just a fact of
 life with young projects.

 My own experience includes a project that also started without an initial
 code drop.  Mahout has gone on to have a vibrant welcoming community that
 has fostered the donation and development of some very valuable software.
  I expect Drill will be able to say the same thing before long.

 On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote:

  Any reason the design docs can't be put up in place of where the
  source would normally go?
 
  On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote:
  Marvin, thanks for commenting on the proposal! The initial committers have
  been working on the design for several months, and will commit the design
  once the project is approved, so we do not expect much friction during the
  design phase. With that said, we certainly do want to engage others early
  on, and our goal in incubating earlier is to encourage feedback and
  contributions when it is still easy to change the APIs and extensibility
  points. This is important because Drill (unlike, say, Google's Dremel) must
  be really flexible in order to be relevant to a broad user base, allowing
  multiple data sources, data formats and query languages. While many
  projects enter incubation with a complete implementation, others don't, and
  due to the nature of this project we think that in this case it is better
  to start earlier.

  Thanks,
  Tomer

  On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote:

  On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Initial Source
  ==
  There is no initial source code. All source code will be developed within
  the Apache Incubator.

  Coming in without any source code is going to pose a challenge to this
  podling.

  http://www.apache.org/foundation/how-it-works.html#incubator

  The incubator filters projects on the basis of the likeliness of them
  becoming successful meritocratic communities. The basic requirements for
  incubation are:

  * a working codebase -- over the years and after several failures, the
    foundation came to understand that without an initial working
    codebase, it is generally hard to bootstrap a community. This is
    because merit is not well recognized by developers without a working
    codebase. Also, the friction that is developed during the initial
    design stage is likely to fragment the community.

  That last line in particular seems like something to watch out for.

  Marvin Humphrey
 
  -
  To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
  For additional commands, e-mail: general-h...@incubator.apache.org
 
 
 




RE: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Franklin, Matthew B.
-----Original Message-----
From: Marvin Humphrey [mailto:mar...@rectangular.com]
Sent: Monday, August 06, 2012 12:25 PM
To: general@incubator.apache.org
Cc: Grant Ingersoll; Isabel Drost
Subject: Re: [PROPOSAL] Drill for the Apache Incubator

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com
wrote:

 Initial Source
 ==
 There is no initial source code. All source code will be developed within
 the Apache Incubator.

Coming in without any source code is going to pose a challenge to this
podling.

http://www.apache.org/foundation/how-it-works.html#incubator

The incubator filters projects on the basis of the likeliness of
them becoming
successful meritocratic communities. The basic requirements for incubation
are:

* a working codebase -- over the years and after several failures, the
  foundation came to understand that without an initial working
  codebase, it is generally hard to bootstrap a community. This is
  because merit is not well recognized by developers without a working
  codebase. Also, the friction that is developed during the initial
  design stage is likely to fragment the community.

It seems like there could be flexibility in this requirement, based on a few 
factors.  In this case, a design discussion has been ongoing; but I would also 
think that any community coming in with enough people who know the Apache way 
may also not need as much of a solid starting point code wise.


That last line in particular seems like something to watch out for.

Marvin Humphrey




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Andrzej Bialecki

On 07/08/2012 21:14, Franklin, Matthew B. wrote:

-----Original Message-----
From: Marvin Humphrey [mailto:mar...@rectangular.com]
Sent: Monday, August 06, 2012 12:25 PM
To: general@incubator.apache.org
Cc: Grant Ingersoll; Isabel Drost
Subject: Re: [PROPOSAL] Drill for the Apache Incubator

On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com
wrote:


Initial Source
==
There is no initial source code. All source code will be developed within
the Apache Incubator.


Coming in without any source code is going to pose a challenge to this
podling.

http://www.apache.org/foundation/how-it-works.html#incubator

The incubator filters projects on the basis of the likeliness of
them becoming
successful meritocratic communities. The basic requirements for incubation
are:

* a working codebase -- over the years and after several failures, the
  foundation came to understand that without an initial working
  codebase, it is generally hard to bootstrap a community. This is
  because merit is not well recognized by developers without a working
  codebase. Also, the friction that is developed during the initial
  design stage is likely to fragment the community.


It seems like there could be flexibility in this requirement, based on a few 
factors.  In this case, a design discussion has been ongoing; but I would also 
think that any community coming in with enough people who know the Apache way 
may also not need as much of a solid starting point code wise.


+1. Given the credentials and the experience of proposed committers and 
mentors, and the fact that the initial design is already done, I don't 
think this is a serious risk. And it's an exciting proposal with a 
potentially big impact.


--
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
 ___.,___,___,___,_._. __
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com





Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Otis Gospodnetic
I concur with Andrzej.  Let's see that VOTE Ted!

Otis 

Performance Monitoring for Solr / ElasticSearch / HBase - 
http://sematext.com/spm 




 From: Andrzej Bialecki a...@getopt.org
To: general@incubator.apache.org 
Sent: Tuesday, August 7, 2012 5:51 PM
Subject: Re: [PROPOSAL] Drill for the Apache Incubator
 
On 07/08/2012 21:14, Franklin, Matthew B. wrote:
 -----Original Message-----
 From: Marvin Humphrey [mailto:mar...@rectangular.com]
 Sent: Monday, August 06, 2012 12:25 PM
 To: general@incubator.apache.org
 Cc: Grant Ingersoll; Isabel Drost
 Subject: Re: [PROPOSAL] Drill for the Apache Incubator

 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:

 Initial Source
 ==
 There is no initial source code. All source code will be developed within
 the Apache Incubator.

 Coming in without any source code is going to pose a challenge to this
 podling.

     http://www.apache.org/foundation/how-it-works.html#incubator

     The incubator filters projects on the basis of the likeliness of
 them becoming
     successful meritocratic communities. The basic requirements for 
incubation
     are:

         * a working codebase -- over the years and after several failures, 
the
           foundation came to understand that without an initial working
           codebase, it is generally hard to bootstrap a community. This is
           because merit is not well recognized by developers without a 
working
           codebase. Also, the friction that is developed during the initial
           design stage is likely to fragment the community.

 It seems like there could be flexibility in this requirement, based on a few 
 factors.  In this case, a design discussion has been ongoing; but I would 
 also think that any community coming in with enough people who know the 
 Apache way may also not need as much of a solid starting point code wise.

+1. Given the credentials and the experience of proposed committers and 
mentors, and the fact that the initial design is already done, I don't 
think this is a serious risk. And it's an exciting proposal with a 
potentially big impact.

-- 
Best regards,
Andrzej Bialecki
http://www.sigram.com, blog http://www.sigram.com/blog
  ___.,___,___,___,_._. __
[___||.__|__/|__||\/|: Information Retrieval, System Integration
___|||__||..\|..||..|: Contact: info at sigram dot com







Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Ted Dunning
Just sent that out.

Thanks for the encouragement!

On Tue, Aug 7, 2012 at 6:02 PM, Otis Gospodnetic
otis_gospodne...@yahoo.com wrote:
 I concur with Andrzej.  Let's see that VOTE Ted!




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Arun C Murthy
Ted,

Wasn't clear, can I add myself now?

thanks,
Arun

On Aug 6, 2012, at 9:08 AM, Ted Dunning wrote:

 Sounds like some good pull.  I will call a vote tomorrow.
 
 On Mon, Aug 6, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote:
 
 Agreed, likewise.
 
 I'd love to get involved and would like to add myself whenever you are
 ready.
 
 thanks,
 Arun
 
 On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote:
 
 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by Google’s Dremel (
 http://research.google.com/pubs/pub36632.html).
 
 
 This sounds really interesting Ted and I would love to help you. Would it
 be ok to add myself as one of the initial committers?
 
 Thanks,
  Owen
 
 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/
 
 
 

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Marvin Humphrey
On Tue, Aug 7, 2012 at 12:14 PM, Franklin, Matthew B.
mfrank...@mitre.org wrote:
The incubator filters projects on the basis of the likeliness of them
becoming successful meritocratic communities. The basic requirements for
incubation are:

  * a working codebase -- over the years and after several failures,
the foundation came to understand that without an initial working
codebase, it is generally hard to bootstrap a community. This is
because merit is not well recognized by developers without a working
codebase. Also, the friction that is developed during the initial
design stage is likely to fragment the community.

 It seems like there could be flexibility in this requirement, based on a few
 factors.  In this case, a design discussion has been ongoing; but I would
 also think that any community coming in with enough people who know the
 Apache way may also not need as much of a solid starting point code wise.

In the abstract, I'm a little skeptical about your last point. The inclusive,
collaborative emphasis of the Apache Way is effective for evolutionary
development of an existing code base, but IMO it's less well suited to the
revolutionary act of starting a project.  Choosing what *not* to do is really
important when you start out, and that's not necessarily our strength.

In Drill's case, I think the focus problem is mitigated by the fact that the
podling will start with design documents and the Dremel whitepaper rather than
a blank slate empty repository.  In addition, the other classic problem which
afflicts podlings which start with no code -- difficulty refreshing the
community with no releases -- seems unlikely to manifest.

The proposal looks good to me now. :)

Marvin Humphrey




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-07 Thread Marvin Humphrey
On Tue, Aug 7, 2012 at 10:09 PM, Arun C Murthy a...@hortonworks.com wrote:

 Wasn't clear, can I add myself now?

Didn't the Incubator go back to discouraging open enrollment?

Is it OK to be invited in based on merit later, or do you feel that due to the
nature of this project, it's essential to be in on the ground floor?

Marvin Humphrey




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Arun C Murthy
Agreed, likewise.

I'd love to get involved and would like to add myself whenever you are ready.

thanks,
Arun

On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote:

 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by Google’s Dremel (
 http://research.google.com/pubs/pub36632.html).
 
 
 This sounds really interesting Ted and I would love to help you. Would it
 be ok to add myself as one of the initial committers?
 
 Thanks,
   Owen

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Ted Dunning
Sounds like some good pull.  I will call a vote tomorrow.

On Mon, Aug 6, 2012 at 9:45 AM, Arun C Murthy a...@hortonworks.com wrote:

 Agreed, likewise.

 I'd love to get involved and would like to add myself whenever you are
 ready.

 thanks,
 Arun

 On Aug 3, 2012, at 10:40 AM, Owen O'Malley wrote:

  On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com
 wrote:
 
  Drill is a distributed system for interactive analysis of large-scale
  datasets, inspired by Google’s Dremel (
  http://research.google.com/pubs/pub36632.html).
 
 
  This sounds really interesting Ted and I would love to help you. Would it
  be ok to add myself as one of the initial committers?
 
  Thanks,
Owen

 --
 Arun C. Murthy
 Hortonworks Inc.
 http://hortonworks.com/





Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Marvin Humphrey
On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Initial Source
 ==
 There is no initial source code. All source code will be developed within
 the Apache Incubator.

Coming in without any source code is going to pose a challenge to this
podling.

http://www.apache.org/foundation/how-it-works.html#incubator

The incubator filters projects on the basis of the likeliness of
them becoming
successful meritocratic communities. The basic requirements for incubation
are:

* a working codebase -- over the years and after several failures, the
  foundation came to understand that without an initial working
  codebase, it is generally hard to bootstrap a community. This is
  because merit is not well recognized by developers without a working
  codebase. Also, the friction that is developed during the initial
  design stage is likely to fragment the community.

That last line in particular seems like something to watch out for.

Marvin Humphrey




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Tomer Shiran
Marvin, thanks for commenting on the proposal! The initial committers have
been working on the design for several months, and will commit the design
once the project is approved, so we do not expect much friction during the
design phase. With that said, we certainly do want to engage others early
on, and our goal in incubating earlier is to encourage feedback and
contributions when it is still easy to change the APIs and extensibility
points. This is important because Drill (unlike, say, Google's Dremel) must
be really flexible in order to be relevant to a broad user base, allowing
multiple data sources, data formats and query languages. While many
projects enter incubation with a complete implementation, others don't, and
due to the nature of this project we think that in this case it is better
to start earlier.

Thanks,
Tomer

On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote:

 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Initial Source
  ==
  There is no initial source code. All source code will be developed within
  the Apache Incubator.

 Coming in without any source code is going to pose a challenge to this
 podling.

 http://www.apache.org/foundation/how-it-works.html#incubator

 The incubator filters projects on the basis of the likeliness of
 them becoming
 successful meritocratic communities. The basic requirements for
 incubation
 are:

 * a working codebase -- over the years and after several failures,
 the
   foundation came to understand that without an initial working
   codebase, it is generally hard to bootstrap a community. This is
   because merit is not well recognized by developers without a
 working
   codebase. Also, the friction that is developed during the initial
   design stage is likely to fragment the community.

 That last line in particular seems like something to watch out for.

 Marvin Humphrey





Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Jakob Homan
Any reason the design docs can't be put up in place of where the
source would normally go?

On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote:
 Marvin, thanks for commenting on the proposal! The initial committers have
 been working on the design for several months, and will commit the design
 once the project is approved, so we do not expect much friction during the
 design phase. With that said, we certainly do want to engage others early
 on, and our goal in incubating earlier is to encourage feedback and
 contributions when it is still easy to change the APIs and extensibility
 points. This is important because Drill (unlike, say, Google's Dremel) must
 be really flexible in order to be relevant to a broad user base, allowing
 multiple data sources, data formats and query languages. While many
 projects enter incubation with a complete implementation, others don't, and
 due to the nature of this project we think that in this case it is better
 to start earlier.

 Thanks,
 Tomer

 On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey mar...@rectangular.com wrote:

 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

  Initial Source
  ==
  There is no initial source code. All source code will be developed within
  the Apache Incubator.

 Coming in without any source code is going to pose a challenge to this
 podling.

 http://www.apache.org/foundation/how-it-works.html#incubator

 The incubator filters projects on the basis of the likeliness of
 them becoming
 successful meritocratic communities. The basic requirements for
 incubation
 are:

 * a working codebase -- over the years and after several failures,
 the
   foundation came to understand that without an initial working
   codebase, it is generally hard to bootstrap a community. This is
   because merit is not well recognized by developers without a
 working
   codebase. Also, the friction that is developed during the initial
   design stage is likely to fragment the community.

 That last line in particular seems like something to watch out for.

 Marvin Humphrey




Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Ted Dunning
No reason at all. 

Sent from my iPhone

On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote:

 Any reason the design docs can't be put up in place of where the
 source would normally go?
 
 On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote:
 Marvin, thanks for commenting on the proposal! The initial committers have
 been working on the design for several months, and will commit the design
 once the project is approved, so we do not expect much friction during the
 design phase. With that said, we certainly do want to engage others early
 on, and our goal in incubating earlier is to encourage feedback and
 contributions when it is still easy to change the APIs and extensibility
 points. This is important because Drill (unlike, say, Google's Dremel) must
 be really flexible in order to be relevant to a broad user base, allowing
 multiple data sources, data formats and query languages. While many
 projects enter incubation with a complete implementation, others don't, and
 due to the nature of this project we think that in this case it is better
 to start earlier.
 
 Thanks,
 Tomer
 
 On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey 
 mar...@rectangular.com wrote:
 
 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Initial Source
 ==
 There is no initial source code. All source code will be developed within
 the Apache Incubator.
 
 Coming in without any source code is going to pose a challenge to this
 podling.
 
http://www.apache.org/foundation/how-it-works.html#incubator
 
The incubator filters projects on the basis of the likeliness of
 them becoming
successful meritocratic communities. The basic requirements for
 incubation
are:
 
* a working codebase -- over the years and after several failures,
 the
  foundation came to understand that without an initial working
  codebase, it is generally hard to bootstrap a community. This is
  because merit is not well recognized by developers without a
 working
  codebase. Also, the friction that is developed during the initial
  design stage is likely to fragment the community.
 
 That last line in particular seems like something to watch out for.
 
 Marvin Humphrey
 



Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-06 Thread Ted Dunning
In fact, a big part of the motivation for proposing incubation before code is 
ready is exactly to foster the discussions needed to form community. 

It is true that many projects that start without the fundamentals face
challenges that more mature projects do not face, but that is really just a
fact of life with young projects.

My own experience includes a project that also started without an initial code 
drop.  Mahout has gone on to have a vibrant welcoming community that has 
fostered the donation and development of some very valuable software.  I expect 
Drill will be able to say the same thing before long. 

Sent from my iPhone

On Aug 6, 2012, at 2:55 PM, Jakob Homan jgho...@gmail.com wrote:

 Any reason the design docs can't be put up in place of where the
 source would normally go?
 
 On Mon, Aug 6, 2012 at 11:23 AM, Tomer Shiran tshi...@maprtech.com wrote:
 Marvin, thanks for commenting on the proposal! The initial committers have
 been working on the design for several months, and will commit the design
 once the project is approved, so we do not expect much friction during the
 design phase. With that said, we certainly do want to engage others early
 on, and our goal in incubating earlier is to encourage feedback and
 contributions when it is still easy to change the APIs and extensibility
 points. This is important because Drill (unlike, say, Google's Dremel) must
 be really flexible in order to be relevant to a broad user base, allowing
 multiple data sources, data formats and query languages. While many
 projects enter incubation with a complete implementation, others don't, and
 due to the nature of this project we think that in this case it is better
 to start earlier.
 
 Thanks,
 Tomer
 
 On Mon, Aug 6, 2012 at 9:25 AM, Marvin Humphrey 
 mar...@rectangular.com wrote:
 
 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 
 Initial Source
 ==
 There is no initial source code. All source code will be developed within
 the Apache Incubator.
 
 Coming in without any source code is going to pose a challenge to this
 podling.
 
http://www.apache.org/foundation/how-it-works.html#incubator
 
The incubator filters projects on the basis of the likeliness of
 them becoming
successful meritocratic communities. The basic requirements for
 incubation
are:
 
* a working codebase -- over the years and after several failures,
 the
  foundation came to understand that without an initial working
  codebase, it is generally hard to bootstrap a community. This is
  because merit is not well recognized by developers without a
 working
  codebase. Also, the friction that is developed during the initial
  design stage is likely to fragment the community.
 
 That last line in particular seems like something to watch out for.
 
 Marvin Humphrey
 



[PROPOSAL] Drill for the Apache Incubator

2012-08-05 Thread Ted Dunning
Abstract

Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by Google’s Dremel (
http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google’s Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be able
to process petabytes of data and trillions of records in seconds.

Background
==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive analysis. In
recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). In 2010 Google published a paper called “Dremel: Interactive
Analysis of Web-Scale Datasets,” describing a scalable system used
internally for interactive analysis of nested data. No open source project
has successfully replicated the capabilities of Dremel.

Rationale
=
There is a strong need in the market for low-latency interactive analysis
of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
Buffers). This need was identified by Google and addressed internally with
a system called Dremel.

In recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). Apache Hadoop, originally inspired by Google’s internal
MapReduce system, is used by thousands of organizations processing
large-scale datasets. Apache Hadoop is designed to achieve very high
throughput, but is not designed to achieve the sub-second latency needed
for interactive data analysis and exploration. Drill, inspired by Google’s
internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended as a
replacement for MapReduce and is often used in conjunction with it to
analyze outputs of MapReduce pipelines or rapidly prototype larger
computations. Indeed, Dremel and MapReduce are both used by thousands of
Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat data
formats, such as CSV files, are naturally supported as a special case of
nested data.
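
As a rough illustration of why flat data can be treated as a special case of
nested data, the sketch below walks a record and lists its leaf values by
dotted path. It is only an assumption-laden example, not Drill code: the
function name leaf_paths and the sample records are hypothetical. A row read
from CSV is simply a record whose paths are all one level deep, so the same
traversal handles both.

```python
import csv
import io
import json


def leaf_paths(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested record."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from leaf_paths(value, f"{prefix}.{key}" if prefix else key)
    elif isinstance(record, list):
        # Repeated fields: every element shares the path of the list itself.
        for item in record:
            yield from leaf_paths(item, prefix)
    else:
        yield prefix, record


# Hypothetical sample data: one nested JSON record and one flat CSV row.
nested = json.loads('{"name": "a", "links": {"forward": [20, 40]}}')
flat = next(csv.DictReader(io.StringIO("name,count\nb,7\n")))

nested_leaves = list(leaf_paths(nested))
flat_leaves = list(leaf_paths(flat))
```

Note that csv.DictReader yields every value as a string, so a real reader
would still need a schema or type inference step for flat files.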

The Drill architecture consists of four key components/layers:
* Query languages: This layer is responsible for parsing the user’s query
and constructing an execution plan.  The initial goal is to support the
SQL-like language used by Dremel and Google BigQuery (
https://developers.google.com/bigquery/docs/query-reference), which we call
DrQL. However, Drill is designed to support other languages and programming
models, such as the Mongo Query Language (
http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
* Low-latency distributed execution engine: This layer is responsible for
executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000 servers.
Drill’s execution engine is based on research in distributed execution
engines (eg, Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
storage, and can be extended with additional operators and connectors.
* Nested data formats: This layer is responsible for supporting various
data formats. The initial goal is to support the column-based format used
by Dremel. Drill is designed to support schema-based formats such as
Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
formats such as JSON, BSON or YAML. In addition, it is designed to support
column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
particular distinction with Drill is that the execution engine is flexible
enough to support column-based processing as well as row-based processing.
This is important because column-based processing can be much more
efficient when the data is stored in a column-based format, but many large
data assets are stored in a row-based format that would require conversion
before use.
* Scalable data sources: This layer is responsible for supporting various
data sources. The initial focus is to leverage Hadoop as a data source.
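
The distinction between row-based and column-based processing in the list
above can be sketched in a few lines. This is a minimal illustration under
stated assumptions, not the proposed execution engine: the sample records and
the helper names (to_columns, total_bytes_row_based, total_bytes_column_based)
are hypothetical. Both functions compute the same aggregate; the column-based
one reads only the single field it needs, which is where the efficiency of
columnar storage comes from, while the pivot step stands in for the conversion
cost that row-based data would otherwise incur.

```python
# Hypothetical sample records in a row-based layout (one dict per record).
rows = [
    {"url": "a", "bytes": 10, "status": 200},
    {"url": "b", "bytes": 25, "status": 404},
    {"url": "c", "bytes": 5, "status": 200},
]


def to_columns(records):
    """Pivot row-based records into a column-based layout (one list per field)."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def total_bytes_row_based(records):
    """Aggregate by visiting each whole record and picking out one field."""
    return sum(record["bytes"] for record in records)


def total_bytes_column_based(columns):
    """Aggregate by scanning only the one column the query needs."""
    return sum(columns["bytes"])


columns = to_columns(rows)
row_total = total_bytes_row_based(rows)
column_total = total_bytes_column_based(columns)
```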

It is worth noting that no open source project has successfully replicated
the capabilities of Dremel, nor have any taken on the broader goals of
flexibility (eg, pluggable 

[PROPOSAL] Drill for the Apache Incubator

2012-08-03 Thread Ted Dunning
This is a duplicated attempt at sending this message, please ignore the
previous message if it eventually arrives.  There appears to be a hangup
sending email from my apache email address via gmail.

Abstract

Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by Google’s Dremel (
http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google’s Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be able
to process petabytes of data and trillions of records in seconds.

Background
==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive analysis. In
recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). In 2010 Google published a paper called “Dremel: Interactive
Analysis of Web-Scale Datasets,” describing a scalable system used
internally for interactive analysis of nested data. No open source project
has successfully replicated the capabilities of Dremel.

Rationale
=
There is a strong need in the market for low-latency interactive analysis
of large-scale datasets, including nested data (eg, JSON, Avro, Protocol
Buffers). This need was identified by Google and addressed internally with
a system called Dremel.

In recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). Apache Hadoop, originally inspired by Google’s internal
MapReduce system, is used by thousands of organizations processing
large-scale datasets. Apache Hadoop is designed to achieve very high
throughput, but is not designed to achieve the sub-second latency needed
for interactive data analysis and exploration. Drill, inspired by Google’s
internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended as a
replacement for MapReduce and is often used in conjunction with it to
analyze outputs of MapReduce pipelines or rapidly prototype larger
computations. Indeed, Dremel and MapReduce are both used by thousands of
Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat data
formats, such as CSV files, are naturally supported as a special case of
nested data.
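
As an illustration only (this sketch is not part of the proposal and does not
reflect any Drill API), the following Python snippet shows one way to read the
claim that flat data is a special case of nested data: a nested record can be
walked as a set of dotted field paths, and a CSV row is simply a record whose
paths all have depth one. The function name flatten and the sample records are
assumptions made for this example.

```python
import csv
import io


def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf of a nested record."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Nested sub-record: descend and extend the path.
            yield from flatten(value, path)
        elif isinstance(value, list):
            # Repeated field: emit one entry per element under the same path.
            for item in value:
                if isinstance(item, dict):
                    yield from flatten(item, path)
                else:
                    yield (path, item)
        else:
            yield (path, value)


# A nested record, as it might arrive from JSON (hypothetical sample data).
nested = {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}, "tags": ["a", "b"]}

# A flat record read from CSV (hypothetical sample data).
csv_text = "id,name\n2,Grace\n"
flat = dict(next(csv.DictReader(io.StringIO(csv_text))))

nested_leaves = list(flatten(nested))
flat_leaves = list(flatten(flat))

print(nested_leaves)
print(flat_leaves)
```

The same traversal handles both records; the CSV row just never triggers the
nested or repeated branches.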

The Drill architecture consists of four key components/layers:
* Query languages: This layer is responsible for parsing the user’s query
and constructing an execution plan.  The initial goal is to support the
SQL-like language used by Dremel and Google BigQuery (
https://developers.google.com/bigquery/docs/query-reference), which we call
DrQL. However, Drill is designed to support other languages and programming
models, such as the Mongo Query Language (
http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
* Low-latency distributed execution engine: This layer is responsible for
executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000 servers.
Drill’s execution engine is based on research in distributed execution
engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
storage, and can be extended with additional operators and connectors.
* Nested data formats: This layer is responsible for supporting various
data formats. The initial goal is to support the column-based format used
by Dremel. Drill is designed to support schema-based formats such as
Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
formats such as JSON, BSON or YAML. In addition, it is designed to support
column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
particular distinction with Drill is that the execution engine is flexible
enough to support column-based processing as well as row-based processing.
This is important because column-based processing can be much more
efficient when the data is stored in a column-based format, but many large
data assets are stored in a row-based format that would require conversion
before use.
* Scalable data sources: This layer is responsible for supporting various
data sources. The initial focus is to leverage 

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-03 Thread Ted Dunning
Owen,

Sounds great to have additional contributors, but let's get a project
approved and rolling and then we can start adding committers.

On Fri, Aug 3, 2012 at 11:40 AM, Owen O'Malley omal...@apache.org wrote:

 On Thu, Aug 2, 2012 at 3:12 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 Drill is a distributed system for interactive analysis of large-scale
  datasets, inspired by Google’s Dremel (
  http://research.google.com/pubs/pub36632.html).
 

 This sounds really interesting Ted and I would love to help you. Would it
 be ok to add myself as one of the initial committers?

 Thanks,
Owen



[PROPOSAL] Drill for the Apache Incubator

2012-08-02 Thread Ted Dunning
This is a duplicate attempt at sending this message; please ignore the
previous message if it eventually arrives.  There appears to be a hangup
sending email from my apache email address via gmail.

Abstract

Drill is a distributed system for interactive analysis of large-scale
datasets, inspired by Google’s Dremel (
http://research.google.com/pubs/pub36632.html).

Proposal

Drill is a distributed system for interactive analysis of large-scale
datasets. Drill is similar to Google’s Dremel, with the additional
flexibility needed to support a broader range of query languages, data
formats and data sources. It is designed to efficiently process nested
data. It is a design goal to scale to 10,000 servers or more and to be able
to process petabytes of data and trillions of records in seconds.

Background
==
Many organizations have the need to run data-intensive applications,
including batch processing, stream processing and interactive analysis. In
recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). In 2010 Google published a paper called “Dremel: Interactive
Analysis of Web-Scale Datasets,” describing a scalable system used
internally for interactive analysis of nested data. No open source project
has successfully replicated the capabilities of Dremel.

Rationale
=
There is a strong need in the market for low-latency interactive analysis
of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol
Buffers). This need was identified by Google and addressed internally with
a system called Dremel.

In recent years open source systems have emerged to address the need for
scalable batch processing (Apache Hadoop) and stream processing (Storm,
Apache S4). Apache Hadoop, originally inspired by Google’s internal
MapReduce system, is used by thousands of organizations processing
large-scale datasets. Apache Hadoop is designed to achieve very high
throughput, but is not designed to achieve the sub-second latency needed
for interactive data analysis and exploration. Drill, inspired by Google’s
internal Dremel system, is intended to address this need.

It is worth noting that, as explained by Google in the original paper,
Dremel complements MapReduce-based computing. Dremel is not intended as a
replacement for MapReduce and is often used in conjunction with it to
analyze outputs of MapReduce pipelines or rapidly prototype larger
computations. Indeed, Dremel and MapReduce are both used by thousands of
Google employees.

Like Dremel, Drill supports a nested data model with data encoded in a
number of formats such as JSON, Avro or Protocol Buffers. In many
organizations nested data is the standard, so supporting a nested data
model eliminates the need to normalize the data. With that said, flat data
formats, such as CSV files, are naturally supported as a special case of
nested data.

The Drill architecture consists of four key components/layers:
* Query languages: This layer is responsible for parsing the user’s query
and constructing an execution plan.  The initial goal is to support the
SQL-like language used by Dremel and Google BigQuery (
https://developers.google.com/bigquery/docs/query-reference), which we call
DrQL. However, Drill is designed to support other languages and programming
models, such as the Mongo Query Language (
http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
* Low-latency distributed execution engine: This layer is responsible for
executing the physical plan. It provides the scalability and fault
tolerance needed to efficiently query petabytes of data on 10,000 servers.
Drill’s execution engine is based on research in distributed execution
engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
storage, and can be extended with additional operators and connectors.
* Nested data formats: This layer is responsible for supporting various
data formats. The initial goal is to support the column-based format used
by Dremel. Drill is designed to support schema-based formats such as
Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
formats such as JSON, BSON or YAML. In addition, it is designed to support
column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
particular distinction with Drill is that the execution engine is flexible
enough to support column-based processing as well as row-based processing.
This is important because column-based processing can be much more
efficient when the data is stored in a column-based format, but many large
data assets are stored in a row-based format that would require conversion
before use.
* Scalable data sources: This layer is responsible for supporting various
data sources. The initial focus is to leverage 

Re: [PROPOSAL] Drill for the Apache Incubator

2012-08-02 Thread Mattmann, Chris A (388J)
Sounds cool!

Cheers,
Chris

On Aug 2, 2012, at 3:12 PM, Ted Dunning wrote:

 This is a duplicate attempt at sending this message; please ignore the
 previous message if it eventually arrives.  There appears to be a hangup
 sending email from my apache email address via gmail.
 
 Abstract
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets, inspired by Google’s Dremel (
 http://research.google.com/pubs/pub36632.html).
 
 Proposal
 
 Drill is a distributed system for interactive analysis of large-scale
 datasets. Drill is similar to Google’s Dremel, with the additional
 flexibility needed to support a broader range of query languages, data
 formats and data sources. It is designed to efficiently process nested
 data. It is a design goal to scale to 10,000 servers or more and to be able
 to process petabytes of data and trillions of records in seconds.
 
 Background
 ==
 Many organizations have the need to run data-intensive applications,
 including batch processing, stream processing and interactive analysis. In
 recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). In 2010 Google published a paper called “Dremel: Interactive
 Analysis of Web-Scale Datasets,” describing a scalable system used
 internally for interactive analysis of nested data. No open source project
 has successfully replicated the capabilities of Dremel.
 
 Rationale
 =
 There is a strong need in the market for low-latency interactive analysis
 of large-scale datasets, including nested data (e.g., JSON, Avro, Protocol
 Buffers). This need was identified by Google and addressed internally with
 a system called Dremel.
 
 In recent years open source systems have emerged to address the need for
 scalable batch processing (Apache Hadoop) and stream processing (Storm,
 Apache S4). Apache Hadoop, originally inspired by Google’s internal
 MapReduce system, is used by thousands of organizations processing
 large-scale datasets. Apache Hadoop is designed to achieve very high
 throughput, but is not designed to achieve the sub-second latency needed
 for interactive data analysis and exploration. Drill, inspired by Google’s
 internal Dremel system, is intended to address this need.
 
 It is worth noting that, as explained by Google in the original paper,
 Dremel complements MapReduce-based computing. Dremel is not intended as a
 replacement for MapReduce and is often used in conjunction with it to
 analyze outputs of MapReduce pipelines or rapidly prototype larger
 computations. Indeed, Dremel and MapReduce are both used by thousands of
 Google employees.
 
 Like Dremel, Drill supports a nested data model with data encoded in a
 number of formats such as JSON, Avro or Protocol Buffers. In many
 organizations nested data is the standard, so supporting a nested data
 model eliminates the need to normalize the data. With that said, flat data
 formats, such as CSV files, are naturally supported as a special case of
 nested data.
 
 The Drill architecture consists of four key components/layers:
 * Query languages: This layer is responsible for parsing the user’s query
 and constructing an execution plan.  The initial goal is to support the
 SQL-like language used by Dremel and Google BigQuery (
 https://developers.google.com/bigquery/docs/query-reference), which we call
 DrQL. However, Drill is designed to support other languages and programming
 models, such as the Mongo Query Language (
 http://www.mongodb.org/display/DOCS/Mongo+Query+Language), Cascading (
 http://www.cascading.org/) or Plume (https://github.com/tdunning/Plume).
 * Low-latency distributed execution engine: This layer is responsible for
 executing the physical plan. It provides the scalability and fault
 tolerance needed to efficiently query petabytes of data on 10,000 servers.
 Drill’s execution engine is based on research in distributed execution
 engines (e.g., Dremel, Dryad, Hyracks, CIEL, Stratosphere) and columnar
 storage, and can be extended with additional operators and connectors.
 * Nested data formats: This layer is responsible for supporting various
 data formats. The initial goal is to support the column-based format used
 by Dremel. Drill is designed to support schema-based formats such as
 Protocol Buffers/Dremel, Avro/AVRO-806/Trevni and CSV, and schema-less
 formats such as JSON, BSON or YAML. In addition, it is designed to support
 column-based formats such as Dremel, AVRO-806/Trevni and RCFile, and
 row-based formats such as Protocol Buffers, Avro, JSON, BSON and CSV. A
 particular distinction with Drill is that the execution engine is flexible
 enough to support column-based processing as well as row-based processing.
 This is important because column-based processing can be much more
 efficient when the data is stored in a column-based format, but many large
 data assets are stored in a row-based format