Hi Paul,

Though I have very limited knowledge about Arrow at the moment, I can
highlight a few advantages of trying it:
1. It would allow fixing the long-standing nullability issues and provide
better integration for storage plugins like Hive.
               https://jira.apache.org/jira/browse/DRILL-1344
               https://jira.apache.org/jira/browse/DRILL-3831
               https://jira.apache.org/jira/browse/DRILL-4824
               https://jira.apache.org/jira/browse/DRILL-7255
               https://jira.apache.org/jira/browse/DRILL-7366
2. The community has done some work to implement optimized Arrow readers for
Parquet and other formats and tools. We could try to adopt them and check
whether we benefit from them.
3. Since Arrow is under active development, we could try its newest
features, like Flight, which promises improved data transfer over the
network.
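To make point 1 concrete: the Arrow columnar format tracks nullability with a
separate validity bitmap alongside the value buffer. Below is a minimal,
hypothetical sketch of that idea (the class name and use of java.util.BitSet
are my own illustration, not the actual Arrow or Drill API, which manages
these buffers in direct memory):

```java
import java.util.BitSet;

// Hypothetical sketch of an Arrow-style nullable int column: values and
// nullability live in two parallel structures -- a value array plus a
// validity bitmap -- rather than being interleaved with the data.
class NullableIntColumn {
    private final int[] values;
    private final BitSet validity; // bit set at index => value is non-null

    NullableIntColumn(int capacity) {
        values = new int[capacity];
        validity = new BitSet(capacity);
    }

    void set(int index, int value) {
        values[index] = value;
        validity.set(index);
    }

    void setNull(int index) {
        validity.clear(index); // value slot is simply ignored when null
    }

    boolean isNull(int index) {
        return !validity.get(index);
    }

    int get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("null at index " + index);
        }
        return values[index];
    }

    public static void main(String[] args) {
        NullableIntColumn col = new NullableIntColumn(3);
        col.set(0, 42);
        col.setNull(1);
        col.set(2, 7);
        System.out.println(col.isNull(1));          // true
        System.out.println(col.get(0) + col.get(2)); // 49
    }
}
```

A shared, well-specified layout like this is what would let Drill exchange
nullable columns with Hive and other Arrow-aware readers without per-plugin
conversion logic.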

Thanks,
Igor
On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <[email protected]>
wrote:

> Hi Igor,
>
> Before diving into design issues, it may be worthwhile to think about the
> premise: should Drill adopt Arrow as its internal memory layout? This is
> the question that the team has wrestled with since Arrow was launched.
> Arrow has three parts. Let's think about each.
>
> First is a direct memory layout. The approach you suggest will let us work
> with the Arrow memory format. Use EVF to access vectors, then the
> underlying vectors can be swapped from Drill to Arrow. But, what is the
> advantage of using Arrow? The Arrow layout isn't better than Drill's; it is
> just different. Adopting the Arrow memory layout by itself provides little
> benefit, but at significant cost. This is one reason the team has been so
> reluctant to adopt Arrow.
>
> The only advantage of using the Arrow memory layout is if Drill could
> benefit from code written for Arrow. The second part of Arrow is a set of
> modules to manipulate vectors. Gandiva is the most prominent example.
> However, there are major challenges. Most SQL operations are defined to
> work on rows; some clever thinking will be needed to convert those
> operations into a series of column operations. (Drill's codegen is NOT
> columnar: it works row-by-row.) So, if we want to benefit from Gandiva, we
> must completely rethink how we process batches.
>
> Is it worth doing all that work? The primary benefit would be performance.
> But, it is not clear that our current implementation is the bottleneck. The
> current implementation is row-based, code generated in Java. It would be
> great for someone to run benchmarks showing the benefit of adopting Gandiva,
> to see whether the potential gain justifies the likely large development cost.
>
> The third advantage of using Arrow is to allow exchange of vectors between
> Drill and Arrow-based clients or readers. As it turns out, this is not the
> big win it seems. As we've discussed, we could easily create an Arrow-based
> client for Drill -- there will be an RPC between the client and Drill and
> we can use that to do format conversion.
>
> For readers, Drill will want control over batch sizes; Drill cannot
> blindly accept whatever size vectors a reader chooses to produce. (More on
> that later.) Incoming data will be subject to projection and selection, so
> it will quickly move out of the incoming Arrow vectors into vectors which
> Drill creates.
>
> Arrow gets (or got) a lot of press. However, our job is to focus on what's
> best for Drill. There actually might be a memory layout for Drill that is
> better than Arrow (and better than our current vectors.) A couple of us did
> a prototype some time ago that seemed to show promise. So, it is not clear
> that adopting Arrow is necessarily a huge win: maybe it is, maybe not. We
> need to figure it out.
>
> What IS clearly a huge win is the idea you outlined: creating a layer
> between memory layout and the rest of Drill so that we can try out
> different memory layouts to see what works best.
>
> Thanks,
> - Paul
>
>
>
>     On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <
> [email protected]> wrote:
>
>  Hello Paul,
>
> I totally agree that integrating Arrow by simply replacing Vectors usage
> everywhere will cause a disaster.
> After the first look at the new *E*nhanced*V*ector*F*ramework and based on
> your suggestions I think I have an idea to share.
> In my opinion, the integration can be done in the two major stages:
>
> *1. Preparation Stage*
>       1.1 Extract all EVF and related components into a separate module, so
> that the new module depends only upon the Vectors module.
>       1.2 Rewrite all operators step by step to use the higher-level EVF
> module, and remove the Vectors module from the dependencies of exec and
> other modules.
>       1.3 Ensure that the only module which depends on Vectors is the new
> EVF one.
> *2. Integration Stage*
>         2.1 Add dependency on Arrow Vectors module into EVF module.
>         2.2 Replace all usages of Drill Vectors & Protobuf Meta with Arrow
> Vectors & Flatbuffers Meta in EVF module.
>         2.3 Finalize integration by removing Drill Vectors module
> completely.
>
>
> *NOTE:* I think that in any case we won't preserve backward compatibility
> for drivers and custom UDFs. The proposed changes are a major step forward
> and should be included in the Drill 2.0 release.
>
>
> Below is an initial list of packages that may in future be moved into the
> EVF module:
> *Module:* exec/Vectors
> *Packages:*
> org.apache.drill.exec.record.metadata - (An enhanced set of classes to
> describe a Drill schema.)
> org.apache.drill.exec.record.metadata.schema.parser
>
> org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for
> each kind of Drill vector.)
> org.apache.drill.exec.vector.accessor.convert
> org.apache.drill.exec.vector.accessor.impl
> org.apache.drill.exec.vector.accessor.reader
> org.apache.drill.exec.vector.accessor.writer
> org.apache.drill.exec.vector.accessor.writer.dummy
>
> *Module:* exec/Java Execution Engine
> *Packages:*
> org.apache.drill.exec.physical.rowSet - (Record batches management)
> org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
> mgmt)
> org.apache.drill.exec.physical.impl.scan - (Row set based scan)
>
> Thanks,
> Igor Guzenko
>
>
