Adding to my list of things to consider for Drill 2.0, I would think that getting Drill off our forks of Calcite and Parquet should also be a goal, though a tactical one.
On Mon, Jun 5, 2017 at 1:51 PM, Parth Chandra <par...@apache.org> wrote:

> Nice suggestion Paul, to start a discussion on 2.0 (it's about time). I would like to make this a broader discussion than just APIs, though APIs are a good place to start. In particular, we usually get the opportunity to break backward compatibility only for a major release, and that is when we have to finalize the APIs.
>
> In the broader discussion I feel we also need to consider some other aspects:
> 1) Formalize Drill's support for schema-free operations.
> 2) Drill's execution engine architecture and its 'optimistic' use of resources.
>
> Re the APIs: one more public API is the UDFs. This and the storage plugin APIs are tied at the hip with vectors and memory management. I'm not sure we can cleanly separate the underlying representation of vectors from the interfaces to these APIs, but I agree we need to clarify this part. For instance, some of the performance benefits in the Parquet scan come from vectorizing writes to the vector, especially for null or repeated values. We could provide interfaces that offer the same benefit; without them, the scans would have to be vector-internals aware. The same goes for UDFs. Assuming that a 2.0 goal would be to provide vectorized interfaces for users to write table (or aggregate) UDFs, one now needs a standardized data set representation. If you choose this representation to be columnar (for better vectorization), will you end up with ValueVector/Arrow-based RecordBatches? I mention Arrow here because the project is formalizing exactly this requirement.
>
> For the client APIs, I believe the ODBC and JDBC drivers were initially written using record-based APIs provided by vendors but, to get better performance, started working with the raw streams coming over the wire (e.g. TDS with Sybase/MS SQL Server [1]). So what Drill does is in fact similar to that approach.
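Parth's idea of a vectorized UDF interface over a standardized columnar data set could look something like the sketch below. All names here (IntColumn, VectorizedAggUdf) are hypothetical, invented for illustration; they are not part of Drill's actual UDF API. The point is only that the UDF sees a whole column (values plus a null mask) per call, rather than one row at a time, and never touches vector internals.

```java
// Illustrative sketch only: IntColumn and VectorizedAggUdf are hypothetical
// names, not part of Drill's real UDF API.
public class ColumnarUdfSketch {
    // A minimal columnar batch: one INT column with a validity (null) mask.
    static final class IntColumn {
        final int[] values;
        final boolean[] isNull;
        IntColumn(int[] values, boolean[] isNull) {
            this.values = values;
            this.isNull = isNull;
        }
    }

    // A vectorized aggregate UDF operates on a whole column per call,
    // instead of being invoked once per row.
    interface VectorizedAggUdf {
        long apply(IntColumn col);
    }

    public static void main(String[] args) {
        IntColumn col = new IntColumn(new int[]{1, 2, 3, 4},
                                      new boolean[]{false, true, false, false});
        VectorizedAggUdf sum = c -> {
            long total = 0;
            // One tight loop over the column; nulls are skipped via the mask.
            for (int i = 0; i < c.values.length; i++) {
                if (!c.isNull[i]) total += c.values[i];
            }
            return total;
        };
        System.out.println(sum.apply(col)); // prints 8 (1 + 3 + 4)
    }
}
```

Because the UDF only ever sees this abstract column shape, the engine would be free to back it with ValueVectors, Arrow buffers, or anything else.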
> The client APIs are really thin layers on top of the vector data stream and provide row-based, read-only access to the vectors.
>
> Lest I begin to sound too contrary, thank you for starting this discussion. It is really needed!
>
> Parth
>
> On Mon, Jun 5, 2017 at 11:59 AM, Paul Rogers <prog...@mapr.com> wrote:
>
>> Hi All,
>>
>> A while back there was a discussion about the scope of Drill 2.0. Got me thinking about possible topics. My two cents:
>>
>> Drill 2.0 should focus on making Drill's external APIs production ready. This means five things:
>>
>> * Clearly identify and define each API.
>> * (Re)design each API to ensure it fully isolates the client from Drill internals.
>> * Ensure the API allows full version compatibility: allow mixing of old/new clients and servers, with some limits.
>> * Fully test each API.
>> * Fully document each API.
>>
>> Once client code is isolated from Drill internals, we are free to evolve the internals in either Drill 2.0 or a later release.
>>
>> In my mind, the top APIs to revisit are:
>>
>> * The Drill client API.
>> * The storage plugin API.
>>
>> (Explanation below.)
>>
>> What other APIs should we consider? Here are some examples; please suggest items you know about:
>>
>> * Command-line scripts and arguments
>> * REST API
>> * Names and contents of system tables
>> * Structure of the storage plugin configuration JSON
>> * Structure of the query profile
>> * Structure of the EXPLAIN PLAN output
>> * Semantics of Drill functions, such as the date functions recently partially fixed by adding "ANSI" alternatives
>> * Naming of config and system/session options
>> * (Your suggestions here…)
>>
>> I've taken the liberty of moving some API-breaking tickets in the Apache Drill JIRA to 2.0. Perhaps we can add others so that we have a good inventory of 2.0 candidates.
>>
>> Here are the reasons for my two suggestions.
>>
>> Today, we expose Drill value vectors to the client.
>> This means that if we want to change anything about Drill's internal memory format (i.e. value vectors, such as a possible move to Arrow), we break compatibility with old clients. Using value vectors also means we need a very large percentage of Drill's internal code on the client, in Java or C++. We are learning that doing so is a challenge.
>>
>> A new client API should follow established SQL database tradition: a synchronous, row-based API designed for versioning, for forward and backward compatibility, and to support ODBC and JDBC users.
>>
>> We can certainly maintain the existing full, async, heavy-weight client for our tests and for applications that would benefit from it.
>>
>> Once we define a new API, we are free to alter Drill's value vectors to, say, add the null states needed to fully support JSON, to change offset vectors so they do not need n+1 values (which doubles vector size in 64K batches), and so on. Since vectors become private to Drill (or Arrow) behind the new client API, we are free to innovate to improve them.
>>
>> Similarly, the storage plugin API exposes details of Calcite (which seems to evolve with each new version), exposes value vector implementations, and so on. A cleaner, simpler, more isolated API will allow storage plugins to be built faster, and will also isolate them from Drill internals changes. Without isolation, each change to Drill internals would require plugin authors to update their plugins before Drill can be released.
>>
>> Thoughts? Suggestions?
>>
>> Thanks,
>>
>> - Paul
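The "thin row-based layer over columnar data" idea that both Parth and Paul describe can be sketched as follows. This is a toy illustration under assumed names (Batch, RowCursor); it is not Drill's client API, just the shape of a JDBC-style read-only cursor that hides the columnar layout from the caller.

```java
// Hypothetical sketch of a row-oriented, read-only cursor over columnar
// storage, in the spirit of a JDBC-style layer above a vector stream.
// Batch and RowCursor are invented names for illustration.
public class RowCursorSketch {
    // Columnar storage: one array per column.
    static final class Batch {
        final String[] names;  // column "name"
        final int[] ages;      // column "age"
        Batch(String[] names, int[] ages) {
            this.names = names;
            this.ages = ages;
        }
    }

    // Thin row view: holds only an index into the underlying columns,
    // so changing the column representation never affects callers.
    static final class RowCursor {
        private final Batch batch;
        private int row = -1;
        RowCursor(Batch batch) { this.batch = batch; }
        boolean next() { return ++row < batch.names.length; }
        String getName() { return batch.names[row]; }
        int getAge() { return batch.ages[row]; }
    }

    public static void main(String[] args) {
        Batch batch = new Batch(new String[]{"a", "b"}, new int[]{10, 20});
        RowCursor cursor = new RowCursor(batch);
        while (cursor.next()) {
            System.out.println(cursor.getName() + "," + cursor.getAge());
        }
    }
}
```

Because clients program against the cursor rather than the vectors, the server side stays free to switch to Arrow or any other internal format without breaking old clients.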
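Paul's aside about offset vectors doubling in size at 64K batches comes down to simple arithmetic, sketched below under the common assumption that the allocator rounds buffers up to a power of two: 65,536 values need 65,537 four-byte offsets, and that one extra entry pushes a 256 KiB allocation to 512 KiB.

```java
// Worked arithmetic for the "n+1 offsets doubles vector size" remark,
// assuming power-of-two buffer allocation (typical for slab allocators).
public class OffsetVectorMath {
    // Round n up to the next power of two.
    static long roundUpPow2(long n) {
        long p = 1;
        while (p < n) p <<= 1;
        return p;
    }

    public static void main(String[] args) {
        int rows = 65_536;                                  // a 64K batch
        long exactBytes   = (long) (rows + 1) * 4;          // n+1 offsets: 262,148 bytes
        long allocBytes   = roundUpPow2(exactBytes);        // rounds up to 524,288 (512 KiB)
        long withoutExtra = roundUpPow2((long) rows * 4);   // n offsets fit in 262,144 (256 KiB)
        System.out.println(exactBytes + " " + allocBytes + " " + withoutExtra);
        // prints 262148 524288 262144
    }
}
```

So the single extra offset entry costs a full extra 256 KiB per VarChar-style vector at this batch size, which is why dropping the n+1 convention is attractive once vectors are private to the engine.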