Avatica?

> On Jun 15, 2017, at 10:39 AM, Paul Rogers <prog...@mapr.com> wrote:
> 
> Hi Uwe,
> 
> This is incredibly helpful information! Your explanation makes perfect sense.
> 
> We work quite a bit with ODBC and JDBC: two interfaces that are very much 
> synchronous and row-based. There are three key challenges in working with 
> Drill:
> 
> * Drill results are columnar, requiring a column-to-row translation for xDBC 
> (a toy sketch of that pivot follows this list)
> * Drill uses an asynchronous API, while JDBC and ODBC are synchronous, 
> resulting in an async-to-sync API translation.
> * The JDBC API is based on the Drill client, which requires quite a bit 
> (almost all, really) of Drill’s code.
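> 
> (For illustration only: a toy Python sketch of the column-to-row pivot the 
> first bullet refers to. The batch layout and column names here are made up, 
> not Drill’s actual value vectors.)
> 
>     # Hypothetical columnar batch: one list per column, roughly as a
>     # server might hand it to the client.
>     batch = {
>         "id":   [1, 2, 3],
>         "name": ["a", "b", "c"],
>     }
> 
>     # xDBC consumers want rows, so pivot the column-major data into tuples.
>     rows = list(zip(batch["id"], batch["name"]))
>     # -> [(1, "a"), (2, "b"), (3, "c")]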
> 
> The thought is to create a new API that serves the needs of ODBC and JDBC, 
> but without the complexity (while, of course, preserving the existing client 
> for other uses). Said another way: find a way to keep the xDBC interfaces 
> simple, so that they don’t take quite so much space in the client and don’t 
> require quite so much work to maintain.
> 
> The first issue (row vs. columnar) turns out not to be a huge issue: the 
> columnar-to-row translation code exists and works. The real issue is allowing 
> the client to control the size of the data sent from the server. (At present, 
> the server decides the “batch” size, and sometimes the size is huge.) So, we 
> can focus on controlling batch size (and thus client buffer allocations), but 
> retain the columnar form, even for ODBC and JDBC.
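> 
> (A hedged sketch, in Python, of what “the client controls the batch size” 
> could look like; fetch_batches and its arguments are hypothetical, not an 
> existing Drill API:)
> 
>     def fetch_batches(all_rows, max_rows):
>         """Yield column-major batches no larger than the client-chosen cap."""
>         for i in range(0, len(all_rows), max_rows):
>             chunk = all_rows[i:i + max_rows]
>             yield {"id":   [r[0] for r in chunk],
>                    "name": [r[1] for r in chunk]}
> 
>     # The client, not the server, decides how big a batch may get, so
>     # client-side buffer allocations stay bounded:
>     for batch in fetch_batches([(1, "a"), (2, "b"), (3, "c")], max_rows=2):
>         print(batch)   # {'id': [1, 2], ...} then {'id': [3], ...}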
> 
> So, for the Pandas use case, does your code allow (or benefit from) multiple 
> simultaneous queries over the same connection? Or, since Python seems to be 
> only approximately multi-threaded, would a synchronous, columnar API work 
> better? Here I just mean, in a single connection, is there a need to run 
> multiple concurrent queries, or is the classic 
> one-concurrent-query-per-connection model easier for Python to consume?
> 
> Another point you raise is that our client-side column format should be 
> Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same 
> data format as Arrow.) That way users of your work can easily leverage Drill.
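> 
> (Assuming pyarrow is installed, here is a minimal sketch of why an 
> Arrow-compatible client format would pay off on the Python side; the column 
> data is invented:)
> 
>     import pyarrow as pa
> 
>     # If client-side batches are already Arrow, handing them to Pandas is
>     # a single call, with no per-value conversion layer in between.
>     batch = pa.RecordBatch.from_arrays(
>         [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
>         names=["id", "name"],
>     )
>     df = pa.Table.from_batches([batch]).to_pandas()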
> 
> This last question raises an interesting issue that I (at least) need to 
> understand more clearly. Is Arrow a data format + code? Or, is the data 
> format one aspect of Arrow, and the implementation another? It would be great 
> to have a common data format, but as we squeeze ever more performance from 
> Drill, we find we have to very carefully tune our data manipulation code for 
> the specific needs of Drill queries. I wonder how we’d do that if we switched 
> to using Arrow’s generic vector implementation code? Has anyone else wrestled 
> with this question for your project?
> 
> Thanks,
> 
> - Paul
> 
> 
>> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>> 
>> Hello Paul,
>> 
>> Bringing in a bit of perspective, partly as an Arrow developer but mostly as 
>> someone who works quite a lot in Python with the respective data libraries: 
>> in Python, all (performant) data crunching is done on columnar 
>> representations. This is partly because columnar is more CPU-efficient for 
>> these tasks, but also because columnar data can be abstracted in a form 
>> where all computational work is implemented in C/C++ or an LLVM-based JIT 
>> while still keeping clear and understandable interfaces in Python. In the 
>> end, to support Python efficiently, we will always have to convert into a 
>> columnar representation, which makes row-wise APIs to an internally columnar 
>> system quite annoying, as a lot of work is wasted in the conversion layer. 
>> If one wanted to support Python UDFs, this would lead to a situation where 
>> the UDF calls are in most cases dominated by the conversion logic.
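>> 
>> (A small, self-contained illustration of the conversion wastage I mean, 
>> assuming only pandas; the data is invented:)
>> 
>>     import pandas as pd
>> 
>>     rows = [(i, float(i)) for i in range(1000)]
>> 
>>     # Row-wise source: every single value gets touched while pivoting into
>>     # the columnar layout Pandas needs internally.
>>     df_from_rows = pd.DataFrame.from_records(rows, columns=["id", "x"])
>> 
>>     # Columnar source: each column arrives as one contiguous block, so the
>>     # conversion layer mostly disappears.
>>     cols = {"id": list(range(1000)), "x": [float(i) for i in range(1000)]}
>>     df_from_cols = pd.DataFrame(cols)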
>> 
>> For the actual performance difference this makes, have a look at the work 
>> currently happening in Apache Spark, where Arrow is used to convert results 
>> from Spark's internal JVM data structures into typical Python ones ("Pandas 
>> DataFrames"). Compared to the existing conversion, this currently yields a 
>> 40x speedup, and it will be even higher once further steps are implemented. 
>> Julien should be able to provide a link to slides that outline the work 
>> better.
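>> 
>> (For reference, the Arrow path in PySpark is switched on via a conf flag; 
>> the flag name below is from memory and may differ between Spark versions, 
>> and `spark` / `spark_df` stand for an existing session and DataFrame:)
>> 
>>     # Enable Arrow-based conversion for DataFrame.toPandas()
>>     spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>     pdf = spark_df.toPandas()  # now served via Arrow record batches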
>> 
>> As I'm quite new to Drill, I cannot go into much further detail w.r.t. 
>> Drill, but be aware that for languages like Python, having a columnar API 
>> really matters. While Drill does not integrate with Python as a first-class 
>> citizen at the moment, moving to row-wise APIs probably won't make a 
>> difference to the current situation, but good columnar APIs would help us 
>> keep the path open for the future.
>> 
>> Uwe
> 
