Drill 2.0 (design) hackathon

Aman Sinha Thu, 24 Aug 2017 08:39:51 -0700

Drill Developers,

In order to kick-start the Drill 2.0 release discussions, I would like to
propose a Drill 2.0 (design) hackathon (a.k.a. Drill Developer Day™ :-) ).


As I mentioned in the hangout on Tuesday, MapR has offered to host it on
Sept 18th at their offices at 350 Holger Way, San Jose. Hope that works
for most of you!

The goal is to get the community together for a day-long technical
discussion on key topics in preparation for a Drill 2.0 release as well as
potential improvements in upcoming 1.xx releases.  Depending on the
interest areas, we could form groups and have a volunteer lead each group.

Based on prior discussions on the dev list, hangouts and existing JIRAs,
there is already a substantial set of topics, and I have summarized a few of
them below. What other topics do folks want to talk about? Feel free to
respond to this thread and I will create a Google doc to consolidate them.
Understandably, the list will be long, but we will use the hackathon to get
a sense of a reasonable feature set for the 1.xx and 2.0 releases.


1. Metadata management.

  1a: Defining an abstraction layer for various types of metadata: views,
schema, statistics, security

  1b: Underlying storage for metadata: what are the options and their
trade-offs?

      - Hive metastore

      - Parquet metadata cache (parquet specific)

      - An embedded DBMS

      - A distributed key-value store

      - Others



2. Drill integration with Apache Arrow

  2a: Evaluate the choices and tradeoffs



3. Resource management

  3a: Memory limits per query

  3b: Spilling

  3c: Resource management with Drill on YARN/Mesos/Kubernetes

  3d: Local vs. global resource management

  3e: Aligning with admission control/queueing



4. TPC-DS coverage and related planner/operator enhancements

  4a: Additional set operations: INTERSECT, EXCEPT

  4b: GROUPING SETS, ROLLUP, CUBE support

  4c: Handling inequality joins and cartesian joins of non-scalar inputs
(via Nested Loop Join)

  4d: Remaining gaps in correlated subquery support

  4e: Statistics: Number of Distinct Values, Histograms



5. Schema handling

  5a: Creation, management of schema

  5b: Handling schema changes in certain common cases

  5c: Schema-awareness

  5d: Others TBD



6. Concurrency

  6a: What are the bottlenecks to achieving higher concurrency?

  6b: Ideas to address these, e.g. async execution?



7. Storage plugin and REST API related enhancements

    <Topics TBD>



8. Performance improvements

  8a: Filter pushdown

  8b: Vectorized Parquet reader

  8c: Code-gen improvements

  8d: Others TBD
