Re: About integration of drill and arrow

Paul Rogers Thu, 09 Jan 2020 12:20:34 -0800

Hi Volodymyr,

All good points. The Arrow/Drill conversion is a good option, especially for 
readers and clients. Between operators, such conversion is likely to introduce 
performance hits. As you know, the main feature that differentiates one query 
engine from another is performance, so adding conversions is unlikely to help 
Drill in the performance battle.

Flatten should actually be pretty simple with EVF. Creating repeated values is 
much like filling in implicit columns: set a value, then "copy down" n times.

Still, you raise good issues. Operators that fit your description are things 
like exchanges: these operators want to work at the low level of buffers. Not 
much the column readers/writers can do to help. And, as you point out, 
commercial components will be a challenge as Apache Drill does not maintain 
that code.

Your larger point is valid: no matter how we approach it, moving to Arrow is a 
large project that will break compatibility.

We've discussed a simple first step: Support an Arrow client to see if there is 
any interest. Support Arrow readers to see if that gives us any benefit. These 
are the visible tip of the iceberg. If we see advantages, we can then think 
about changing the internals; the vast bulk of the iceberg which is below water 
and unseen.

I think I disagree that we'd want to swap code that works directly with 
ValueVectors to code that works directly with ArrowVectors. Doing so locks us 
into someone else's memory format and forces Drill to change every time Arrow 
changes. Give the small size of the Drill team, and the frantic pace of Arrow 
change, This was the team's concern early on and I'm still not convinced this 
is a good strategy.

So, my original point remains: what is the benefit of all this cost? Is there 
another path that gives us greater benefit for the same or lesser cost? In 
short, what is our goal? Where do we want Drill to go?

In fact, another radical suggestion is to embrace the wonderful work done on 
Presto. Maybe Drill 2.0 is simply Presto. We focus on adding support for local 
files (Drill's unique strength), and embrace Presto's great support for data 
types, connectors, UDFs, clients and so on.

As a team, we should ask the fundamental question: What benefits can Drill 
offer that are not already offered by, say Presto or commercial Drill 
derivatives? If new can answer that question, we'll have a better idea about 
whether investment in Arrow will get us there.

Or, are we better off to just leave well enough alone as we have done for 
several years?

Thanks,
- Paul

    On Thursday, January 9, 2020, 05:57:52 AM PST, Volodymyr Vysotskyi 
<volody...@apache.org> wrote:  

 Hi all,

Glad to see that this discussion became active again!

I have some comments regarding the steps for moving from Drill Vectors to
Arrow Vectors.

No doubt that using EVF for all operators and readers instead of value
vectors will simplify things a lot.
But considering the target goal - integration with Arrow, it may be the
main show-stopper for it.
There may be some operators which would be hard to adapt to use EVF, for
example, I think Flatten operator will be among them since its
implementation deeply connected with value vectors.
Also, it requires moving all storage and format plugins to EVF, which also
may be problematic, for example, some plugins like MaprDB have specific
features, and it should be considered when moving to EVF.
Some other plugins are so obsolete, that I'm not sure that they still work
and that someone still uses it, so except moving to EVF, they should be
resurrected to verify that they weren't broken more than before.

This is a huge piece of work, and only after that, we will proceed with the
next step - integrating Arrow to EVF and then handling new Arrow-related
issues for all the operators and readers at the same time.

I propose to update these steps a little bit.
1. I agree that at first, we should extract EVF-related classes into a
separate module.
2. But as the next step, I propose to extract EVF API which doesn't depend
on the vector implementation (Drill vectors, or Arrow ones).
3. After that, introduce module with Arrow which also implements this EVF
API.
4. Introduce transformers that will be able to convert from Drill vectors
into Arrow vectors and vice versa. These transformers may be implemented to
work using EVF abstractions instead of operating with specific vector
implementations.

5.1. At this point, we can introduce Arrow connectors to fetch the data in
Arrow format or return it in such a format using transformers from step 4.

5.2. Also, at this point, we may start rewriting operators to EVF and
switching EVF implementation from the EVF based on Drill Vectors to the
implementation which uses Arrow Vectors. Or switching implementations for
existing EVF-based format plugins and fix newly discovered issues in Arrow.
Since at this point we will have operators which use Arrow format and
operators which use Drill Vectors format, we should insert operators that
transform one vector format to another introduced in step 4 between every
pair of operators which returns batches in a different format.

I know, that such an approach requires some additional work, like
introducing transformers from step 4 and may cause some performance
degradations for the case when format transformation is complex for some
types and when we still have sequences of operators with different formats.

But with this approach, transitioning to Arrow wouldn't be blocked until
everything is moved to EVF and it would be possible to transmit
step-by-step, and Drill still will be able to switch between formats if it
would be required.

Kind regards,
Volodymyr Vysotskyi

On Thu, Jan 9, 2020 at 2:45 PM Igor Guzenko <ihor.huzenko....@gmail.com>
wrote:

> Hi Paul,
>
> Though I have very limited knowledge about Arrow at the moment, I can
> highlight a few advantages of trying it:
>  1. Allows fixing all the long-standing nullability issues and provide
> better integration for storage plugins like Hive.
>                https://jira.apache.org/jira/browse/DRILL-1344
>                https://jira.apache.org/jira/browse/DRILL-3831
>                https://jira.apache.org/jira/browse/DRILL-4824
>                https://jira.apache.org/jira/browse/DRILL-7255
>                https://jira.apache.org/jira/browse/DRILL-7366
> 2. Some work was done by community to implement optimized Arrow readers for
> Parquet and other formats&tools. We could try to adopt and check whether we
> can benefit from them.
> 3. Since Arrow is under active development we could try their newest
> features, like Flight which promises improved data transfers over the
> network.
>
> Thanks,
> Igor
> On Wed, Jan 8, 2020 at 11:55 PM Paul Rogers <par0...@yahoo.com.invalid>
> wrote:
>
> > Hi Igor,
> >
> > Before diving into design issues, it may be worthwhile to think about the
> > premise: should Drill adopt Arrow as its internal memory layout? This is
> > the question that the team has wrestled with since Arrow was launched.
> > Arrow has three parts. Let's think about each.
> >
> > First is a direct memory layout. The approach you suggest will let us
> work
> > with the Arrow memory format. Use EVF to access vectors, then the
> > underlying vectors can be swapped from Drill to Arrow. But, what is the
> > advantage of using Arrow? The arrow layout isn't better than Drill's; it
> is
> > just different. Adopting the Arrow memory layout by itself provides
> little
> > benefit, but bit cost. This is one reason the team has been so reluctant
> to
> > atop Arrow.
> >
> > The only advantage of using the Arrow memory layout is if Drill could
> > benefit from code written for Arrow. The second part of Arrow is a set of
> > modules to manipulate vectors. Gandiva is the most prominent example.
> > However, there are major challenges. Most SQL operations are defined to
> > work on rows; some clever thinking will be needed to convert those
> > operations into a series of column operations. (Drill's codegen is NOT
> > columnar: it works row-by-row.) So, if we want to benefit from Gandiva,
> we
> > must completely rethink how we process batches.
> >
> > Is it worth doing all that work? The primary benefit would be
> performance.
> > But, it is not clear that our current implementation is the bottleneck.
> The
> > current implementation is row-based, code generated in Java. Would be
> great
> > for someone to do some benchmarks to show the benefit from adopting
> Gandiva
> > to see if the potential gain justifies the likely large development cost.
> >
> > The third advantage of using Arrow is to allow exchange of vectors
> between
> > Drill and Arrow-based clients or readers. As it turns out, this is not
> the
> > big win it seems. As we've discussed, we could easily create an
> Arrow-based
> > client for Drill -- there will be an RPC between the client and Drill and
> > we can use that to do format conversion.
> >
> > For readers, Drill will want control over batch sizes; Drill cannot
> > blindly accept whatever size vectors a reader chooses to produce. (More
> on
> > that later.) Incoming data will be subject to projection and selection,
> so
> > it will quickly move out of the incoming Arrow vectors into vector which
> > Drill creates.
> >
> > Arrow gets (or got) a lot of press. However, our job is to focus on
> what's
> > best for Drill. There actually might be a memory layout for Drill that is
> > better than Arrow (and better than our current vectors.) A couple of us
> did
> > a prototype some time ago that seemed to show promise. So, it is not
> clear
> > that adopting Arrow is necessarily a huge win: maybe it is, maybe not. We
> > need to figure it out.
> >
> > What IS clearly a huge win is the idea you outlined: creating a layer
> > between memory layout and the rest of Drill so that we can try out
> > different memory layouts to see what works best.
> >
> > Thanks,
> > - Paul
> >
> >
> >
> >    On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <
> > ihor.huzenko....@gmail.com> wrote:
> >
> >  Hello Paul,
> >
> > I totally agree that integrating Arrow by simply replacing Vectors usage
> > everywhere will cause a disaster.
> > After the first look at the new *E*nhanced*V*ector*F*ramework and based
> on
> > your suggestions I think I have an idea to share.
> > In my opinion, the integration can be done in the two major stages:
> >
> > *1. Preparation Stage*
> >      1.1 Extract all EVF and related components to a separate module. So
> > the new separate module will depend only upon Vectors module.
> >      1.2 Step-by-step rewriting of all operators to use a higher-level
> > EVF module and remove Vectors module from exec and modules dependencies.
> >      1.3 Ensure that only module which depends on Vectors is the new EVF
> > one.
> > *2. Integration Stage*
> >        2.1 Add dependency on Arrow Vectors module into EVF module.
> >        2.2 Replace all usages of Drill Vectors & Protobuf Meta with
> Arrow
> > Vectors & Flatbuffers Meta in EVF module.
> >        2.3 Finalize integration by removing Drill Vectors module
> > completely.
> >
> >
> > *NOTE:* I think that any way we won't preserve any backward compatibility
> > for drivers and custom UDFs.
> > And proposed changes are a major step forward to be included in Drill 2.0
> > version.
> >
> >
> > Below is the very first list of packages that in future may be
> transformed
> > into EVF module:
> > *Module:* exec/Vectors
> > *Packages:*
> > org.apache.drill.exec.record.metadata - (An enhanced set of classes to
> > describe a Drill schema.)
> > org.apache.drill.exec.record.metadata.schema.parser
> >
> > org.apache.drill.exec.vector.accessor - (JSON-like readers and writers
> for
> > each kind of Drill vector.)
> > org.apache.drill.exec.vector.accessor.convert
> > org.apache.drill.exec.vector.accessor.impl
> > org.apache.drill.exec.vector.accessor.reader
> > org.apache.drill.exec.vector.accessor.writer
> > org.apache.drill.exec.vector.accessor.writer.dummy
> >
> > *Module:* exec/Java Execution Engine
> > *Packages:*
> > org.apache.drill.exec.physical.rowSet - (Record batches management)
> > org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory
> > mgmt)
> > org.apache.drill.exec.physical.impl.scan - (Row set based scan)
> >
> > Thanks,
> > Igor Guzenko
> >
> >
>

Re: About integration of drill and arrow

Reply via email to