Re: Drill 2.0 (design) hackathon

Paul Rogers Wed, 30 Aug 2017 09:45:29 -0700

A partial list of Drill’s public APIs:

IMHO, highest priority for Drill 2.0:



  *   JDBC/ODBC drivers
  *   Client (for JDBC/ODBC) + ODBC & JDBC
  *   Client (for full Drill async, columnar)
  *   Storage plugin
  *   Format plugin
  *   System/session options
  *   Queueing (e.g. ZK-based queues)
  *   REST API
  *   Resource Planning (e.g. max query memory per node)
  *   Metadata access, storage (e.g. file system locations vs. a metastore)
  *   Metadata files formats (Parquet, views, etc.)

Lower priority for future releases:


  *   Query Planning (e.g. Calcite rules)
  *   Config options
  *   SQL syntax, especially Drill extensions
  *   UDF
  *   Management (e.g. JMX, REST API calls, etc.)
  *   Drill File System (HDFS)
  *   Web UI
  *   Shell scripts

There are certainly more. Please suggest those that are missing. I’ve taken a 
rough cut at which APIs need forward/backward compatibility first, in part 
based on those that are the “most public” and most likely to change. Others are 
important, but we can’t do them all at once.

Thanks,

- Paul

On Aug 29, 2017, at 6:00 PM, Aman Sinha <amansi...@apache.org> wrote:

Hi Paul,
It certainly makes sense to have the API compatibility discussions during this
hackathon.  The 2.0 release may be a good checkpoint to introduce breaking
changes necessitating changes to the ODBC/JDBC drivers and other external
applications. As part of this exercise (not during the hackathon but as a
follow-up action), we also should clearly identify the "public" interfaces.


I will add this to the agenda.

thanks,
-Aman

On Tue, Aug 29, 2017 at 2:08 PM, Paul Rogers <prog...@mapr.com> wrote:

Thanks Aman for organizing the Hackathon!

The list included many good ideas for Drill 2.0. Some of those require
changes to Drill’s “public” interfaces (file format, client protocol, SQL
behavior, etc.)

At present, Drill has no good mechanism to handle backward/forward
compatibility at the API level. Protobuf versioning certainly helps, but
can’t completely solve semantic changes (where a field changes meaning, or
a non-Protobuf data chunk changes format.) As just one concrete example,
changing to Arrow will break pre-Arrow ODBC/JDBC drivers because class
names and data formats will change.

Perhaps we can prioritize, for the proposed 2.0 release, a one-time set of
breaking changes that introduce a versioning mechanism into our public
APIs. Once these are in place, we can evolve the APIs in the future by
following the newly-created versioning protocol.
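
To make the idea concrete, here is a minimal sketch of one possible
versioning handshake: each side advertises the inclusive range of API
versions it can speak, and the connection uses the highest version both
support. This is only an illustration under stated assumptions; the names
ApiVersionNegotiation, VersionRange and negotiate are hypothetical and are
not part of Drill's client protocol.

```java
import java.util.OptionalInt;

// Hypothetical sketch: not Drill's actual handshake or class names.
class ApiVersionNegotiation {

    /** Inclusive range of API versions that one side can speak. */
    static final class VersionRange {
        final int min;
        final int max;

        VersionRange(int min, int max) {
            if (min > max) {
                throw new IllegalArgumentException("min must not exceed max");
            }
            this.min = min;
            this.max = max;
        }
    }

    /**
     * Returns the highest version supported by both sides, or an empty
     * result when the two ranges do not overlap (connection is refused).
     */
    static OptionalInt negotiate(VersionRange client, VersionRange server) {
        int low = Math.max(client.min, server.min);
        int high = Math.min(client.max, server.max);
        return low <= high ? OptionalInt.of(high) : OptionalInt.empty();
    }

    public static void main(String[] args) {
        // Assumed scenario: the server speaks versions 1 and 2.
        VersionRange server = new VersionRange(1, 2);
        // An old client that only knows version 1 still connects.
        System.out.println(negotiate(new VersionRange(1, 1), server));
        // A client that requires version 3 or later is rejected.
        System.out.println(negotiate(new VersionRange(3, 4), server));
    }
}
```

With something along these lines, an old client and a new client can share
one cluster as long as the server keeps the old version inside its range.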

Without such a mechanism, we cannot support old & new clients in the same
cluster. Nor can we support rolling upgrades. Of course, another solution
is to get it right the second time, then freeze all APIs and agree to never
again change them. Not sure we have sufficient access to a crystal ball to
predict everything we’d ever need in our APIs, however...

Thanks,

- Paul

On Aug 24, 2017, at 8:39 AM, Aman Sinha <amansi...@apache.org> wrote:

Drill Developers,

In order to kick-start the Drill 2.0 release discussions, I would like to
propose a Drill 2.0 (design) hackathon (a.k.a. Drill Developer Day ™ :-) ).

As I mentioned in the hangout on Tuesday,  MapR has offered to host it on
Sept 18th at their offices at 350 Holger Way, San Jose.   Hope that works
for most of you!

The goal is to get the community together for a day-long technical
discussion on key topics in preparation for a Drill 2.0 release as well as
potential improvements in upcoming 1.xx releases.  Depending on the
interest areas, we could form groups and have a volunteer lead each group.

Based on prior discussions on the dev list, hangouts and existing JIRAs,
there is already a substantial set of topics and I have summarized a few of
them below.  What other topics do folks want to talk about?  Feel free to
respond to this thread and I will create a Google doc to consolidate.
Understandably, the list would be long but we will use the hackathon to get
a sense of a reasonable feature set for 1.xx and 2.0 releases.


1. Metadata management.

1a: Defining an abstraction layer for various types of metadata: views,
schema, statistics, security

1b: Underlying storage for metadata: what are the options and their
trade-offs?

    - Hive metastore

    - Parquet metadata cache (parquet specific)

    - An embedded DBMS

    - A distributed key-value store

    - Others..



2. Drill integration with Apache Arrow

2a: Evaluate the choices and tradeoffs



3. Resource management

3a: Memory limits per query

3b: Spilling

3c: Resource management with Drill on Yarn/Mesos/Kubernetes

3d: Local vs. global resource management

3e: Aligning with admission control/queueing



4. TPC-DS coverage and related planner/operator enhancements

4a: Additional set operations: INTERSECT, EXCEPT

4b: GROUPING SETS, ROLLUP, CUBE support

4c: Handling inequality joins and cartesian joins of non-scalar inputs
(via Nested Loop Join)

4d: Remaining gaps in correlated subquery

4e: Statistics: Number of Distinct Values, Histograms



5. Schema handling

5a: Creation, management of schema

5b: Handling schema changes in certain common cases

5c: Schema-awareness

5d: Others TBD



6. Concurrency

6a: What are the bottlenecks to achieving higher concurrency?

6b: Ideas to address these, e.g. async execution?



7. Storage plugin and REST API related enhancements

  <Topics TBD>



8. Performance improvements

8a: Filter pushdown

8b: Vectorized Parquet reader

8c: Code-gen improvements

8d: Others TBD
