Avatica?
> On Jun 15, 2017, at 10:39 AM, Paul Rogers <prog...@mapr.com> wrote:
>
> Hi Uwe,
>
> This is incredibly helpful information! Your explanation makes perfect sense.
>
> We work quite a bit with ODBC and JDBC: two interfaces that are very much
> synchronous and row-based. There are three key challenges in working with
> Drill:
>
> * Drill results are columnar, requiring a column-to-row translation for xDBC
> * Drill uses an asynchronous API, while JDBC and ODBC are synchronous,
> resulting in an async-to-sync API translation.
> * The JDBC driver is built on the Drill client, which pulls in quite a bit
> (almost all, really) of the Drill code.
>
> The thought is to create a new API that serves the needs of ODBC and JDBC,
> but without the complexity (while, of course, preserving the existing client
> for other uses). Said another way, find a way to keep the xDBC interfaces
> simple, so that they don’t take quite so much space in the client, and don’t
> require quite so much work to maintain.
>
> The first issue (row vs. columnar) turns out not to be a huge one: the
> columnar-to-row translation code exists and works. The real issue is allowing
> the client to control the size of the data sent from the server. (At present,
> the server decides the “batch” size, and sometimes the size is huge.) So, we
> can just focus on controlling batch size (and thus client buffer
> allocations), but retain the columnar form, even for ODBC and JDBC.
>
> So, for the Pandas use case, does your code allow (or benefit from) multiple
> simultaneous queries over the same connection? Or, since Python seems to be
> only approximately multi-threaded, would a synchronous, columnar API work
> better? Here I just mean, in a single connection, is there a need to run
> multiple concurrent queries, or is the classic
> one-concurrent-query-per-connection model easier for Python to consume?
>
> Another point you raise is that our client-side column format should be
> Arrow, or Arrow-compatible. (That is, either using Arrow code, or the same
> data format as Arrow.) That way users of your work can easily leverage Drill.
>
> This last question raises an interesting issue that I (at least) need to
> understand more clearly. Is Arrow a data format + code? Or, is the data
> format one aspect of Arrow, and the implementation another? It would be great
> to have a common data format, but as we squeeze ever more performance from
> Drill, we find we have to very carefully tune our data manipulation code for
> the specific needs of Drill queries. I wonder how we’d do that if we switched
> to using Arrow’s generic vector implementation code? Has anyone else wrestled
> with this question for your project?
>
> Thanks,
>
> - Paul
>
>
>> On Jun 15, 2017, at 12:23 AM, Uwe L. Korn <uw...@xhochy.com> wrote:
>>
>> Hello Paul,
>>
>> Bringing in a bit of the perspective partly of an Arrow developer, but
>> mostly of someone who works quite a lot in Python with the respective data
>> libraries there: in Python, all (performant) data crunching is done on
>> columnar representations. This is partly because columnar is more
>> CPU-efficient for these tasks, but also because columnar data can be
>> abstracted in a form where all computational work is implemented in C/C++
>> or an LLVM-based JIT, while still keeping clear and understandable
>> interfaces in Python. In the end, to support Python efficiently, we will
>> always have to convert into a columnar representation, which makes row-wise
>> APIs to an internally columnar system quite annoying, since a lot of work
>> is wasted in the conversion layer. If one wanted to support Python UDFs,
>> this would mean that in most cases the UDF calls are greatly dominated by
>> the conversion logic.
>>
>> For the actual performance difference this makes, have a look at the work
>> recently happening in Apache Spark, where Arrow is used to convert results
>> from Spark's internal JVM data structures into typical Python ones ("Pandas
>> DataFrames"). Compared to the existing conversion, this currently yields a
>> 40x speedup, and it will be even higher once further steps are implemented.
>> Julien should be able to provide a link to slides that outline the work
>> better.
>>
>> As I'm quite new to Drill, I cannot go into much more detail w.r.t. Drill,
>> but be aware that for languages like Python, having a columnar API really
>> matters. Drill does not currently integrate with Python as a first-class
>> citizen, so moving to row-wise APIs probably wouldn't change the current
>> situation, but good columnar APIs would help us keep the path open for the
>> future.
>>
>> Uwe
>