Hi All,

As I’ve been playing with and learning about Drill, it struck me that Drill is a wonderful “industrial strength” query engine, but that the client API is a bit complex if all an app wants to do is execute a few queries. I wondered if we need an adapter between the full-blown columnar, asynchronous RPC that Drill uses internally and the row-based, synchronous API that most apps know and love.

In thinking about a simpler client API, a few items came to mind:

- We have the JDBC API for Java apps, but the internals of the current JDBC driver use the Drill client, and so the JDBC jar is quite big (20MB).
- The current client API is not versioned, requiring clients to be upgraded in lock-step with servers. Many admins, however, find it necessary to upgrade clients on a schedule different from that of the server. (Imagine upgrading dozens of desktop users at the same time as the Drill cluster.) Many of the traditional DB products version their interfaces to simplify this task.
- A cool feature of Drill is schema-on-read, which means Drill may encounter different schemas as data is read. At present, it is a bit hard for clients to consume different schemas. It turns out, however, that stored procedures provide something similar (multiple result sets), and we could leverage that idea to make schema changes a first-class feature of the API.

Playing around a bit in my spare time, I found that we can grab lots of ideas from “traditional” DB APIs to solve the above problems (and more):

- A simplified client API provides a row-based view of results, with schema changes as a first-class API concept.
- A “direct” version of the client can sit directly on top of the Drill Client, much like the current JDBC driver.
- Because the client API is simple, it is easy to create a new wire protocol to carry the required row-based client messages.
- That wire protocol enables a very light-weight remote version of the client API.
- A new server implements the server side of the new wire protocol. The server is an adapter: it converts the “retail” row-based API into the “wholesale” columnar API of Drill.
- A new JDBC implementation uses the remote API instead of directly using the Drill Client API. Because the remote client has no dependencies on Drill (or, indeed, anything other than the JDK), it is very small.
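
To make the “retail” versus “wholesale” idea concrete, here is a minimal sketch of a synchronous row cursor layered over columnar batches, with a schema change surfaced as its own event. Every name in it (RowAdapterSketch, ColumnBatch, RowCursor, Event, sampleBatches, drain) is hypothetical and invented for illustration; none of it is taken from the Drill client or from the Jig code, whose actual API may look quite different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch: these types are illustrative and are not Drill or Jig classes.
public class RowAdapterSketch {

    /** One columnar batch: a value array per column, all of equal length. */
    static final class ColumnBatch {
        final List<String> columnNames;
        final List<Object[]> columns;

        ColumnBatch(List<String> columnNames, List<Object[]> columns) {
            this.columnNames = columnNames;
            this.columns = columns;
        }

        int rowCount() {
            return columns.isEmpty() ? 0 : columns.get(0).length;
        }
    }

    /** What the cursor is positioned on after a call to next(). */
    enum Event { SCHEMA, ROW, END }

    /** Synchronous, row-at-a-time view over a sequence of columnar batches. */
    static final class RowCursor {
        private final Iterator<ColumnBatch> batches;
        private ColumnBatch current;
        private List<String> schema;
        private int rowsConsumed; // rows already returned from the current batch

        RowCursor(Iterable<ColumnBatch> source) {
            this.batches = source.iterator();
        }

        Event next() {
            // Move to the next batch once the current one is used up.
            while (current == null || rowsConsumed >= current.rowCount()) {
                if (!batches.hasNext()) {
                    return Event.END;
                }
                current = batches.next();
                rowsConsumed = 0;
                // Report a schema change before any row of the new batch.
                if (!current.columnNames.equals(schema)) {
                    schema = current.columnNames;
                    return Event.SCHEMA;
                }
            }
            rowsConsumed++;
            return Event.ROW;
        }

        List<String> schema() { return schema; }

        /** Value of a column in the row most recently returned as Event.ROW. */
        Object value(int column) {
            return current.columns.get(column)[rowsConsumed - 1];
        }
    }

    /** Two made-up batches; the second one gains a "city" column. */
    static List<ColumnBatch> sampleBatches() {
        return Arrays.asList(
            new ColumnBatch(Arrays.asList("id", "name"),
                Arrays.asList(new Object[] {1, 2}, new Object[] {"a", "b"})),
            new ColumnBatch(Arrays.asList("id", "name", "city"),
                Arrays.asList(new Object[] {3}, new Object[] {"c"}, new Object[] {"Paris"})));
    }

    /** Reads the cursor to the end and describes each event as one line of text. */
    static List<String> drain(RowCursor cursor) {
        List<String> out = new ArrayList<>();
        for (Event e = cursor.next(); e != Event.END; e = cursor.next()) {
            if (e == Event.SCHEMA) {
                out.add("SCHEMA " + cursor.schema());
            } else {
                List<Object> row = new ArrayList<>();
                for (int i = 0; i < cursor.schema().size(); i++) {
                    row.add(cursor.value(i));
                }
                out.add("ROW " + row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        drain(new RowCursor(sampleBatches())).forEach(System.out::println);
    }
}
```

With the sample data, the app sees a SCHEMA event, two rows, a second SCHEMA event when the “city” column appears, and then one more row. The point of the design is that the app reads in a simple loop and treats a schema change much like the start of a new result set, rather than handling batches and vectors itself.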

The revised JDBC jar is about 1% of the size of the existing JDBC driver (200KB instead of 20MB).

The result is a little prototype project called “Jig”. I’d like to toss it out to the community to see if this is something of interest to others. The code works just well enough to prove the concept, though I’ve left off the more “advanced” data types, multiple cursors per connection, and other details.

The advantage for Java users is a simpler API, a smaller JDBC driver, fewer dependencies and cross-version compatibility. If we add clients in other languages, then just about any language can easily query Drill without a Java or ODBC bridge. This would be handy for the Caravel integration project discussed here a month or so back, and also for data scientists who prefer Python or R.

In case there is interest in this idea, a more detailed proposal is available:
https://docs.google.com/document/d/1TpJOEUO-DBDGIidOML2_InpJ-fK4yHmsbV5ncqXT6pM

The code is in a GitHub repo:
https://github.com/paul-rogers/drill-jig

The JIRA for this enhancement is DRILL-4791:
https://issues.apache.org/jira/browse/DRILL-4791

This has been a great little learning exercise. Is this something that we might want to take further? Thoughts on the approach taken?

Thanks,

- Paul
