Hi Igor,

With the background out of the way, we can now discuss the gist of your idea: inserting an API layer between the memory layout (the vector implementation) and the rest of Drill. As noted, this is a good idea for many reasons. One of the most compelling is that, with this approach, we have one implementation we can make high quality, rather than zillions of partial implementations spread across different operators and readers, each with its own unique bugs and limitations.
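To make the idea concrete, here is a toy sketch of such an API layer. This is not Drill's actual EVF interface; all names here are invented for illustration. The point is that operator code compiles only against the writer interface, so the memory layout behind it can be replaced (say, with Arrow buffers) without touching the operators.

```java
// Toy illustration of an API layer between operators and the memory
// layout. Names are invented; real EVF writers handle nullability,
// arrays, maps, and buffer management, which are omitted here.

import java.util.ArrayList;
import java.util.List;

public class WriterLayerSketch {

    /** What operator code programs against: no knowledge of buffers. */
    interface IntColumnWriter {
        void setInt(int value);
    }

    /** One concrete implementation owns the layout. Swapping this class
        (e.g. for an Arrow-backed one) leaves operator code untouched. */
    static class ListBackedWriter implements IntColumnWriter {
        final List<Integer> storage = new ArrayList<>();
        @Override public void setInt(int value) { storage.add(value); }
    }

    /** "Operator" logic written only against the interface. */
    static void copyDoubled(int[] input, IntColumnWriter writer) {
        for (int v : input) {
            writer.setInt(v * 2);
        }
    }

    public static void main(String[] args) {
        ListBackedWriter writer = new ListBackedWriter();
        copyDoubled(new int[] {5, 10, 15}, writer);
        System.out.println(writer.storage);  // [10, 20, 30]
    }
}
```

Because every operator funnels through the one writer implementation, bugs get fixed once, in one place, rather than per operator.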
You identified three areas to think about when considering how to use the column readers and writers beyond the scan operator.

1. Readers. As already discussed, we are making good progress migrating readers to use EVF. More to go, of course, but we just need to do the work. (Parquet is probably the biggest open question, since that reader did its own thing to gain peak performance.)

2. Operators. Each operator either uses a generic way of working with vectors (e.g. the Selection Vector Remover, AKA SVR) or uses code generation (e.g. Project). The SVR is the simplest operator to convert after the Scan; I have been chipping away to see how it can migrate to column readers/writers. The recent Project refactoring PR and the Union typeof() fix are both ways of exploring what it will take to migrate codegen. It seems doing so will be a big effort.

3. Clients. We have multiple clients, all of which work with vectors at a low level: the community JDBC driver, the community C++ client, the Simba JDBC client, the ODBC client, the MapRDB client, and probably more. This is probably the largest risk area. Some of this was meant to be tackled in the oft-discussed "Drill 2.0", since changing the clients, the vector format, or the wire protocol will break compatibility.

The Readers and Operators are internal to Drill. It turns out we can change these incrementally in the "Drill 1" series of releases as long as we don't muck with the value vectors themselves. Thank you for reviewing the various enabling PRs done so far. I hope to review some that you can contribute.

The clients present a complex challenge. Today, the clients use the same code as Drill internally, especially around value vectors. (This is one reason the community JDBC driver keeps growing: it contains much of Drill's engine code.) We can modify the community JDBC driver to use EVF at the cost of breaking backward compatibility. We could probably create a C++ version of EVF. But, again, we should step back and ask whether this is the best approach.
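For the client side, one alternative to shipping vectors is worth picturing concretely: a small, versioned, row-oriented cursor that drivers could wrap. The sketch below is hypothetical; none of these interfaces exist in Drill, and all names are invented. It just shows how little surface area a row-based client API needs, and that no vector or engine code crosses the boundary.

```java
// Hypothetical row-based client API (all names invented; not an
// existing or proposed Drill interface). The client pulls rows one at
// a time; the server side is free to change its internal layout.

import java.util.Iterator;
import java.util.List;

public class RowClientSketch {

    /** A single row, addressed by column index. */
    interface Row {
        Object value(int colIndex);
    }

    /** A versioned, row-oriented result cursor a driver could wrap. */
    interface RowCursor {
        boolean next();   // advance to the next row; false at end
        Row row();        // the current row
    }

    /** Toy server-side implementation over an in-memory list. */
    static class ListCursor implements RowCursor {
        private final Iterator<Object[]> it;
        private Object[] current;
        ListCursor(List<Object[]> rows) { this.it = rows.iterator(); }
        @Override public boolean next() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }
        @Override public Row row() {
            return col -> current[col];
        }
    }

    public static void main(String[] args) {
        RowCursor cursor = new ListCursor(List.of(
            new Object[] {"a", 1},
            new Object[] {"b", 2}));
        while (cursor.next()) {
            System.out.println(
                cursor.row().value(0) + "=" + cursor.row().value(1));
        }
    }
}
```

A JDBC or ODBC driver built on something like this would contain no Drill engine code at all, which is the decoupling discussed below.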
In an ideal world, most clients would use a row-based API. JDBC, ODBC, REST, and most other clients consume data row by row. They want to control the number of rows delivered at a time. These clients neither need, nor benefit from, being sent vectors and having to do the complex vector-to-row "rotation." And, with a client-specific row-based API, clients would be isolated from Drill's internals, helping our backward compatibility story going forward. As it turns out, I did a very early prototype of a row-based client a few years back, called "Jig". [1] In fact, EVF grew out of many of the ideas that started in Jig.

The only client that WOULD benefit from the current vector format is one we do not have: an Arrow-based client for consumers that use Arrow internally. We'd have to create that client (without creating a new dependency on Drill internals). So, if we have to change clients anyway to adopt Arrow (or whatever), it might be a good idea to provide a new, simple, versioned, row-based API so that the client code can be made far simpler, without the dependency on Drill internal code. This lets us improve Drill further without again worrying about breaking clients.

The clients are challenging. Two especially challenging bits will be the C++ and commercial clients. At present, we have no C++ developers on Drill (AFAIK), and the C++ client has traditionally been very difficult indeed to maintain. Apache Drill has no visibility into the commercial clients: requiring changes to those might encounter push-back.

Lots to work out! I really appreciate your interest. It would be really great if you can help us come up with a plan to solve these challenges!

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill-jig

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster.
After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*

1.1 Extract EVF and all related components into a separate module, so that the new module depends only upon the Vectors module.
1.2 Rewrite all operators, step by step, to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules.
1.3 Ensure that the only module which depends on Vectors is the new EVF one.

*2. Integration Stage*

2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf metadata with Arrow Vectors & FlatBuffers metadata in the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that either way we won't preserve backward compatibility for drivers and custom UDFs, and the proposed changes are a major step forward to be included in the Drill 2.0 version.

Below is a very first list of packages that may in future be moved into an EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batch management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory management)
org.apache.drill.exec.physical.impl.scan - (Row-set-based scan)

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned). Last time I checked (more than a year ago), the memory layout
> of Arrow is close to that of Drill; so conversion is about "packaging" and
> metadata, which can be encapsulated in an API.
>
> Converting the internals is a major undertaking. We have large amounts of
> complex, critical code that works directly with the details of value
> vectors. My thought was to first convert code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable by
> itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul
>
>
> On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <
> cgi...@gmail.com> wrote:
>
> Hi Igor,
> That would be really great if you could see that through to completion.
> IMHO, the value from this is not so much performance related, but rather
> the ability to use Drill to gather and prep data and seamlessly "hand it
> off" to other platforms for machine learning.
> -- C
>
> > On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com>
> > wrote:
> >
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stage. Work has been
> >> progressing on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate.
> >> These include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source, so that
> >> you could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill
> >> query, so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of Drill's value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone
> >> to contribute an implementation. The third is a major long-term
> >> project because of the way Drill value vectors and Arrow vectors have
> >> diverged.
> >>
> >> I wonder, which of these use cases is of interest to you? How might
> >> you use that integration in your project?
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >> On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
> >> zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >>     As mentioned at Drill Developer Day 2018, there is a plan for
> >> Drill to integrate Arrow (Gandiva from Dremio). I was wondering how it
> >> is going.
> >>
> >> Thanks in advance.
> >>
> >> Nai Yan