"Crude-but-effective" Arrow integration

Paul Rogers Sat, 18 Aug 2018 16:45:02 -0700

Hi All,

Charles recently suggested why Arrow integration could be helpful. (See quote 
below.)  When we've looked at reworking Drill's internals to use Arrow, we 
found the project to be costly with little direct benefit in terms of 
performance or stability. But, Charles points out that the real value is in 
data exchange, not in changing Drill's internals.


It might be fairly simple to integrate with Arrow for input or output. Why? As 
it turns out (last time I checked) the memory layout of Arrow vectors is 
identical to Drill's, so it is simply a matter of reinterpreting Drill's 
vectors as Arrow vectors (or vice versa), possibly passing memory ownership 
somehow. (I suspect the memory ownership issue will be the fussiest part of the 
whole exercise.)
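
To make the reinterpretation idea concrete, here is a minimal Java sketch. It 
uses only java.nio, not the real Drill or Arrow classes: DrillStyleIntVector, 
ArrowStyleIntVector and ReinterpretSketch are hypothetical stand-ins, and the 
sketch assumes a required (non-nullable) fixed-width INT column stored as 
little-endian 4-byte values in both systems. Memory ownership is deliberately 
not modeled.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical stand-in for a Drill-style required INT vector:
// a value count plus one buffer of little-endian 4-byte values.
final class DrillStyleIntVector {
    final ByteBuffer data;
    final int valueCount;

    DrillStyleIntVector(ByteBuffer data, int valueCount) {
        this.data = data;
        this.valueCount = valueCount;
    }

    // Allocates off-heap memory sized for valueCount ints.
    static DrillStyleIntVector allocate(int valueCount) {
        ByteBuffer buf = ByteBuffer.allocateDirect(valueCount * Integer.BYTES)
                                   .order(ByteOrder.LITTLE_ENDIAN);
        return new DrillStyleIntVector(buf, valueCount);
    }

    void set(int index, int value) { data.putInt(index * Integer.BYTES, value); }
    int get(int index) { return data.getInt(index * Integer.BYTES); }
}

// Hypothetical stand-in for an Arrow-style view over the same layout.
final class ArrowStyleIntVector {
    final ByteBuffer data;
    final int valueCount;

    ArrowStyleIntVector(ByteBuffer data, int valueCount) {
        this.data = data;
        this.valueCount = valueCount;
    }

    int get(int index) { return data.getInt(index * Integer.BYTES); }
}

final class ReinterpretSketch {
    // Wraps the same memory in the other view; no values are copied.
    static ArrowStyleIntVector asArrow(DrillStyleIntVector v) {
        // duplicate() shares the underlying bytes but resets the byte order
        // to big-endian, so the order has to be set again on the new view.
        ByteBuffer shared = v.data.duplicate().order(ByteOrder.LITTLE_ENDIAN);
        return new ArrowStyleIntVector(shared, v.valueCount);
    }

    public static void main(String[] args) {
        DrillStyleIntVector drill = DrillStyleIntVector.allocate(3);
        drill.set(0, 10);
        drill.set(1, 20);
        drill.set(2, 30);

        ArrowStyleIntVector arrow = asArrow(drill);
        drill.set(1, 99); // a later write is visible through the other view
        for (int i = 0; i < arrow.valueCount; i++) {
            System.out.println(arrow.get(i));
        }
    }
}
```

Both views read the same bytes, which is the whole point. It is also why 
ownership needs care: in this sketch nothing decides which side is responsible 
for releasing the buffer.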


Drill and Arrow use different metadata formats. But, since they both describe 
the same in-memory layout, we can probably translate from one to the other with 
some straightforward code. Since metadata is a small part of a typical result 
set, the overhead of the metadata translation is likely negligible.
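
A rough sketch of what such a translation could look like, again in plain Java 
with hypothetical stand-in types rather than the real Drill or Arrow metadata 
classes. The enum names loosely mirror Drill's minor types and data modes, and 
the Arrow side is reduced to a type description string plus a nullable flag. 
The mapping itself is an assumption for illustration; a real translator would 
need to cover repeated and nested types as well.

```java
final class MetadataTranslationSketch {
    // Hypothetical subset of Drill-style type information.
    enum MinorType { INT, BIGINT, FLOAT8, VARCHAR }
    enum DataMode { REQUIRED, OPTIONAL }

    // Hypothetical Drill-style column description.
    static final class DrillStyleField {
        final String name;
        final MinorType type;
        final DataMode mode;

        DrillStyleField(String name, MinorType type, DataMode mode) {
            this.name = name;
            this.type = type;
            this.mode = mode;
        }
    }

    // Hypothetical Arrow-style column description.
    static final class ArrowStyleField {
        final String name;
        final String type;
        final boolean nullable;

        ArrowStyleField(String name, String type, boolean nullable) {
            this.name = name;
            this.type = type;
            this.nullable = nullable;
        }

        @Override
        public String toString() {
            return name + ": " + type + (nullable ? " (nullable)" : " (not null)");
        }
    }

    // Translates one column description; touches no vector data at all.
    static ArrowStyleField toArrow(DrillStyleField f) {
        String type;
        switch (f.type) {
            case INT:     type = "Int(32, signed)"; break;
            case BIGINT:  type = "Int(64, signed)"; break;
            case FLOAT8:  type = "FloatingPoint(DOUBLE)"; break;
            case VARCHAR: type = "Utf8"; break;
            default:
                throw new IllegalArgumentException("Unmapped type: " + f.type);
        }
        // An OPTIONAL column corresponds to a nullable field.
        return new ArrowStyleField(f.name, type, f.mode == DataMode.OPTIONAL);
    }

    public static void main(String[] args) {
        DrillStyleField[] schema = {
            new DrillStyleField("id", MinorType.BIGINT, DataMode.REQUIRED),
            new DrillStyleField("name", MinorType.VARCHAR, DataMode.OPTIONAL),
        };
        for (DrillStyleField f : schema) {
            System.out.println(toArrow(f));
        }
    }
}
```

The work here is proportional to the number of columns, not the number of 
rows, which is why the translation cost should stay small next to the data.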


If an Arrow client wants to consume Drill output, someone could wrap the native 
Drill client API, which speaks Drill value vectors. The wrapper could 
reinterpret Drill vectors as Arrow vectors and convert the metadata.


If we want Drill to consume Arrow data, then we'd have to play the same trick 
in reverse: reinterpret Arrow vectors as Drill vectors, then convert Arrow 
metadata to Drill format.

The community could build this kind of integration. Granted, the approach is a 
bit on the "crude-but-effective" side. But, if the integration proves valuable, 
then there is justification for a next round of deeper integration.


Charles' original comment from the discussion about project state:

(quote)
The first [suggested improvement] is the Arrow integration.  I’m not enough of 
a software engineer to understand all the internal details here, but as I 
understand it, the promise of Arrow is that many tools will share a common 
memory model and that it will be possible to transfer data from one tool to 
the other without having to serialize/deserialize the data.  In the data 
science community many of the major platforms, Python-pandas, R, and Spark are 
moving to or have adopted Arrow.

Drill’s strength is the ease that it can query many different data sources and 
if Drill were to adopt Arrow, I suspect that many people would adopt it as a 
part of a machine learning pipeline.  Just recently, I attempted to do some 
data manipulation using Spark, and couldn’t help but notice how difficult it 
was in contrast with Drill. I’m sure this is a very complex task, but I do 
think that it could be worth it in the end.

(unquote)

Thanks,
- Paul
