Adding to my list of things to consider for Drill 2.0, I would think that getting Drill off our forks of Calcite and Parquet should also be a goal, though a tactical one.
On Mon, Jun 5, 2017 at 1:51 PM, Parth Chandra <par...@apache.org> wrote:

> Nice suggestion Paul, to start a discussion on 2.0 (it's about time). I would like to make this a broader discussion than just APIs, though APIs are a good place to start. In particular, we usually get the opportunity to break backward compatibility only for a major release, and that is when we have to finalize the APIs.
>
> In the broader discussion I feel we also need to consider some other aspects:
> 1) Formalize Drill's support for schema-free operations.
> 2) Drill's execution engine architecture and its 'optimistic' use of resources.
>
> Re the APIs: one more public API is the UDFs. This and the storage plugin APIs are tied at the hip with vectors and memory management. I'm not sure we can cleanly separate the underlying representation of vectors from the interfaces to these APIs, but I agree we need to clarify this part. For instance, some of the performance benefits in the Parquet scan come from vectorizing writes to the vector, especially for null or repeated values. We could provide interfaces that offer the same benefit; without them, the scans would have to be vector-internals aware. The same goes for UDFs. Assuming that a 2.0 goal would be to provide vectorized interfaces for users to write table (or aggregate) UDFs, one now needs a standardized data set representation. If you choose this representation to be columnar (for better vectorization), will you end up with ValueVector/Arrow-based RecordBatches? I mention Arrow here because the project is formalizing exactly this requirement.
>
> For the client APIs, I believe the ODBC and JDBC drivers were initially written using record-based APIs provided by vendors but, to get better performance, started working with the raw streams coming over the wire (e.g. TDS with Sybase/MS SQL Server [1]). So what Drill does is in fact similar to that approach.
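Parth's idea of a vectorized UDF interface over a standardized columnar data set could look something like the sketch below. All names here (IntColumn, VectorizedAggUdf) are hypothetical, invented for illustration; they are not part of Drill's actual UDF API. The point is only that the UDF sees a whole column (values plus a null mask) per call, rather than one row at a time, and never touches vector internals.

```java
// Illustrative sketch only: IntColumn and VectorizedAggUdf are hypothetical
// names, not part of Drill's real UDF API.
public class ColumnarUdfSketch {
    // A minimal columnar batch: one INT column with a validity (null) mask.
    static final class IntColumn {
        final int[] values;
        final boolean[] isNull;
        IntColumn(int[] values, boolean[] isNull) {
            this.values = values;
            this.isNull = isNull;
        }
    }

    // A vectorized aggregate UDF operates on a whole column per call,
    // instead of being invoked once per row.
    interface VectorizedAggUdf {
        long apply(IntColumn col);
    }

    public static void main(String[] args) {
        IntColumn col = new IntColumn(new int[]{1, 2, 3, 4},
                                      new boolean[]{false, true, false, false});
        VectorizedAggUdf sum = c -> {
            long total = 0;
            // One tight loop over the column; nulls are skipped via the mask.
            for (int i = 0; i < c.values.length; i++) {
                if (!c.isNull[i]) total += c.values[i];
            }
            return total;
        };
        System.out.println(sum.apply(col)); // prints 8 (1 + 3 + 4)
    }
}
```

Because the UDF only ever sees this abstract column shape, the engine would be free to back it with ValueVectors, Arrow buffers, or anything else.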
> The client APIs are really thin layers on top of the vector data stream and provide row-based, read-only access to the vectors.
>
> Lest I begin to sound too contrary, thank you for starting this discussion. It is really needed!
>
> Parth
>
> On Mon, Jun 5, 2017 at 11:59 AM, Paul Rogers <prog...@mapr.com> wrote:
>
>> Hi All,
>>
>> A while back there was a discussion about the scope of Drill 2.0. Got me thinking about possible topics. My two cents:
>>
>> Drill 2.0 should focus on making Drill's external APIs production ready. This means five things:
>>
>> * Clearly identify and define each API.
>> * (Re)design each API to ensure it fully isolates the client from Drill internals.
>> * Ensure the API allows full version compatibility: allow mixing of old/new clients and servers, with some limits.
>> * Fully test each API.
>> * Fully document each API.
>>
>> Once client code is isolated from Drill internals, we are free to evolve the internals in either Drill 2.0 or a later release.
>>
>> In my mind, the top APIs to revisit are:
>>
>> * The Drill client API.
>> * The storage plugin API.
>>
>> (Explanation below.)
>>
>> What other APIs should we consider? Here are some examples; please suggest items you know about:
>>
>> * Command-line scripts and arguments
>> * REST API
>> * Names and contents of system tables
>> * Structure of the storage plugin configuration JSON
>> * Structure of the query profile
>> * Structure of the EXPLAIN PLAN output
>> * Semantics of Drill functions, such as the date functions recently partially fixed by adding "ANSI" alternatives
>> * Naming of config and system/session options
>> * (Your suggestions here…)
>>
>> I've taken the liberty of moving some API-breaking tickets in the Apache Drill JIRA to 2.0. Perhaps we can add others so that we have a good inventory of 2.0 candidates.
>>
>> Here are the reasons for my two suggestions.
>>
>> Today, we expose Drill value vectors to the client.
>> This means that if we want to change anything about Drill's internal memory format (i.e. value vectors, such as a possible move to Arrow), we break compatibility with old clients. Using value vectors also means we need a very large percentage of Drill's internal code on the client, in Java or C++. We are learning that doing so is a challenge.
>>
>> A new client API should follow established SQL database tradition: a synchronous, row-based API designed for versioning, for forward and backward compatibility, and to support ODBC and JDBC users.
>>
>> We can certainly maintain the existing full, async, heavy-weight client for our tests and for applications that would benefit from it.
>>
>> Once we define a new API, we are free to alter Drill's value vectors to, say, add the null states needed to fully support JSON, to change offset vectors so they do not need n+1 values (which doubles vector size in 64K batches), and so on. Since vectors become private to Drill (or Arrow) behind the new client API, we are free to innovate to improve them.
>>
>> Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated API will allow storage plugins to be built faster, and will also isolate them from Drill internals changes. Without isolation, each change to Drill internals would require plugin authors to update their plugins before Drill can be released.
>>
>> Thoughts? Suggestions?
>>
>> Thanks,
>>
>> - Paul
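The "thin row-based layer over columnar data" idea that both Parth and Paul describe can be sketched as follows. This is a toy illustration under assumed names (Batch, RowCursor); it is not Drill's client API, just the shape of a JDBC-style read-only cursor that hides the columnar layout from the caller.

```java
// Hypothetical sketch of a row-oriented, read-only cursor over columnar
// storage, in the spirit of a JDBC-style layer above a vector stream.
// Batch and RowCursor are invented names for illustration.
public class RowCursorSketch {
    // Columnar storage: one array per column.
    static final class Batch {
        final String[] names;  // column "name"
        final int[] ages;      // column "age"
        Batch(String[] names, int[] ages) {
            this.names = names;
            this.ages = ages;
        }
    }

    // Thin row view: holds only an index into the underlying columns,
    // so changing the column representation never affects callers.
    static final class RowCursor {
        private final Batch batch;
        private int row = -1;
        RowCursor(Batch batch) { this.batch = batch; }
        boolean next() { return ++row < batch.names.length; }
        String getName() { return batch.names[row]; }
        int getAge() { return batch.ages[row]; }
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new String[]{"a", "b"}, new int[]{10, 20});
        RowCursor cursor = new RowCursor(batch);
        while (cursor.next()) {
            System.out.println(cursor.getName() + "," + cursor.getAge());
        }
    }
}
```

Because clients program against the cursor rather than the vectors, the server side stays free to switch to Arrow or any other internal format without breaking old clients.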
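Paul's aside about offset vectors doubling in size at 64K batches comes down to simple arithmetic, sketched below under the common assumption that the allocator rounds buffers up to a power of two: 65,536 values need 65,537 four-byte offsets, and that one extra entry pushes a 256 KiB allocation to 512 KiB.

```java
// Worked arithmetic for the "n+1 offsets doubles vector size" remark,
// assuming power-of-two buffer allocation (typical for slab allocators).
public class OffsetVectorMath {
    // Round n up to the next power of two.
    static long roundUpPow2(long n) {
        long p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    public static void main(String[] args) {
        int rows = 65_536;                                  // a 64K batch
        long exactBytes   = (long) (rows + 1) * 4;          // n+1 offsets: 262,148 bytes
        long allocBytes   = roundUpPow2(exactBytes);        // rounds up to 524,288 (512 KiB)
        long withoutExtra = roundUpPow2((long) rows * 4);   // n offsets fit in 262,144 (256 KiB)
        System.out.println(exactBytes + " " + allocBytes + " " + withoutExtra);
        // prints 262148 524288 262144
    }
}
```

So the single extra offset entry costs a full extra 256 KiB per VarChar-style vector at this batch size, which is why dropping the n+1 convention is attractive once vectors are private to the engine.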