Thanks Andy!
Very helpful. You have hit on one of the questions that we've been wrestling
with: which tools would consume Drill data as Arrow? More generally, what are
the use cases for Arrow data interchange?
Flight makes sense for transferring large data sets, such as in exchanges
within a distributed engine, or from a "data service" such as a hypothetical
Flight-based S3 Select. Flight (and Arrow in general) seems less useful as a
client API for things like BI tools, dashboards and the like; xDBC seems like a
better fit since such tools will consume "human-sized" result sets.
The article in your link notes that there is a Spark consumer for Flight.
Drill's use case would likely be similar -- both tools could consume large data
sets from Flight-enabled sources.
As for Drill as a producer, one could conjure an example in which Spark reads
data from Drill. Maybe Drill runs a number of complex SQL queries to produce
data sets upon which Spark runs some ML tasks. Drill is probably a better tool
to run the kind of monster SQL statements that business analysts like to
create, but Spark is better for the kind of algorithmic processing typical of
ML. (One could argue that, with Flight, you get the best of both worlds. Charles, we
need your insight here.) Perhaps Flight's creators have similar scenarios in
mind.
More practically, between the example Flight server you mentioned (as a
producer) and Spark (as a consumer), we have what we need if someone wants to
create the prototypes we mentioned.
Or, if someone wants to get very meta, we can have Drill using Flight to read
from another Drill. Not sure it's useful, but it would be a cool demo.
Thanks,
- Paul
On Monday, January 13, 2020, 04:21:29 PM PST, Andy Grove
<[email protected]> wrote:
Hi Paul,
There is a test Flight server in the Arrow Java project [1] that might be a
good starting point, although I haven't used it myself. I was looking at
Arrow Flight for my Ballista PoC [2], although I don't really have time to
spend on that right now.
I'm less sure of the value of having an Arrow consumer for Drill since any
vectorized processing would already have been performed by Drill? I may be
missing something though.
Thanks,
Andy.
[1]
https://github.com/apache/arrow/tree/master/java/flight/flight-core#example-usage
[2] https://github.com/andygrove/ballista