Hi Igor,

With the background out of the way, we can now discuss the gist of your idea: inserting an API layer between the memory layout (the vector implementation) and the rest of Drill. As noted, this is a good idea for many reasons. One of the most compelling is that, with this approach, we have one implementation we can make high quality, rather than zillions of partial implementations spread across different operators and readers, each with its own unique bugs and limitations.
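To make the idea concrete, here is a toy sketch of such an API layer. This is not Drill's actual EVF interface; all names here are invented for illustration. The point is that operator code compiles only against the writer interface, so the memory layout behind it can be replaced (say, with Arrow buffers) without touching the operators.

```java
// Toy illustration of an API layer between operators and the memory
// layout. Names are invented; real EVF writers handle nullability,
// arrays, maps, and buffer management, which are omitted here.

import java.util.ArrayList;
import java.util.List;

public class WriterLayerSketch {

    /** What operator code programs against: no knowledge of buffers. */
    interface IntColumnWriter {
        void setInt(int value);
    }

    /** One concrete implementation owns the layout. Swapping this class
        (e.g. for an Arrow-backed one) leaves operator code untouched. */
    static class ListBackedWriter implements IntColumnWriter {
        final List<Integer> storage = new ArrayList<>();
        @Override public void setInt(int value) { storage.add(value); }
    }

    /** "Operator" logic written only against the interface. */
    static void copyDoubled(int[] input, IntColumnWriter writer) {
        for (int v : input) {
            writer.setInt(v * 2);
        }
    }

    public static void main(String[] args) {
        ListBackedWriter writer = new ListBackedWriter();
        copyDoubled(new int[] {5, 10, 15}, writer);
        System.out.println(writer.storage);  // [10, 20, 30]
    }
}
```

Because every operator funnels through the one writer implementation, bugs get fixed once, in one place, rather than per operator.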
You identified three areas to think about when considering how to use the column readers and writers beyond the scan operator.

1. Readers. As already discussed, we are making good progress migrating readers to use EVF. More to go, of course, but we just need to do the work. (Parquet is probably the biggest open question, since that reader did its own thing to gain peak performance.)

2. Operators. Each operator either uses a generic way of working with vectors (e.g. the Selection Vector Remover, AKA SVR) or uses code generation (e.g. Project). The SVR is the simplest operator to convert after the Scan; I have been chipping away to see how it can migrate to column readers/writers. The recent Project refactoring PR and the Union typeof() fix are both ways of exploring what it will take to migrate codegen. It seems doing so will be a big effort.

3. Clients. We have multiple clients, all of which work with vectors at a low level: the community JDBC driver, the community C++ client, the Simba JDBC client, the ODBC client, the MapRDB client, and probably more. This is probably the largest risk area. Some of this was meant to be tackled in the oft-discussed "Drill 2.0", since changing the clients, the vector format, or the wire protocol will break compatibility.

The Readers and Operators are internal to Drill. It turns out we can change these incrementally in the "Drill 1" series of releases as long as we don't muck with the value vectors themselves. Thank you for reviewing the various enabling PRs done so far. I hope to review some that you can contribute.

The clients present a complex challenge. Today, the clients use the same code as Drill internally, especially around value vectors. (This is one reason the community JDBC driver keeps growing: it contains much of Drill's engine code.) We can modify the community JDBC driver to use EVF at the cost of breaking backward compatibility. We could probably create a C++ version of EVF. But, again, we should step back and ask whether this is the best approach.
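For the client side, one alternative to shipping vectors is worth picturing concretely: a small, versioned, row-oriented cursor that drivers could wrap. The sketch below is hypothetical; none of these interfaces exist in Drill, and all names are invented. It just shows how little surface area a row-based client API needs, and that no vector or engine code crosses the boundary.

```java
// Hypothetical row-based client API (all names invented; not an
// existing or proposed Drill interface). The client pulls rows one at
// a time; the server side is free to change its internal layout.

import java.util.Iterator;
import java.util.List;

public class RowClientSketch {

    /** A single row, addressed by column index. */
    interface Row {
        Object value(int colIndex);
    }

    /** A versioned, row-oriented result cursor a driver could wrap. */
    interface RowCursor {
        boolean next();   // advance to the next row; false at end
        Row row();        // the current row
    }

    /** Toy server-side implementation over an in-memory list. */
    static class ListCursor implements RowCursor {
        private final Iterator<Object[]> it;
        private Object[] current;
        ListCursor(List<Object[]> rows) { this.it = rows.iterator(); }
        @Override public boolean next() {
            if (!it.hasNext()) return false;
            current = it.next();
            return true;
        }
        @Override public Row row() {
            return col -> current[col];
        }
    }

    public static void main(String[] args) {
        RowCursor cursor = new ListCursor(List.of(
            new Object[] {"a", 1},
            new Object[] {"b", 2}));
        while (cursor.next()) {
            System.out.println(
                cursor.row().value(0) + "=" + cursor.row().value(1));
        }
    }
}
```

A JDBC or ODBC driver built on something like this would contain no Drill engine code at all, which is the decoupling discussed below.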
In an ideal world, most clients would use a row-based API. JDBC, ODBC, REST, and most other clients consume data row by row. They want to control the number of rows delivered at a time. These clients neither need, nor benefit from, being sent vectors and having to do the complex vector-to-row "rotation." And, with a client-specific row-based API, clients would be isolated from Drill's internals, helping our backward compatibility story going forward. As it turns out, I did a very early prototype of a row-based client a few years back, called "Jig". [1] In fact, EVF grew out of many of the ideas that started in Jig.

The only client that WOULD benefit from the current vector format is one we do not have: an Arrow-based client for consumers that use Arrow internally. We'd have to create that client (without creating a new dependency on Drill internals). So, if we have to change clients anyway to adopt Arrow (or whatever), it might be a good idea to provide a new, simple, versioned, row-based API so that the client code can be made far simpler, without the dependency on Drill internal code. This lets us improve Drill further without again worrying about breaking clients.

The clients are challenging. Two especially challenging bits will be the C++ and commercial clients. At present, we have no C++ developers on Drill (AFAIK), and the C++ client has traditionally been very difficult indeed to maintain. Apache Drill has no visibility into the commercial clients: requiring changes to those might encounter push-back.

Lots to work out! I really appreciate your interest. It would be really great if you can help us come up with a plan to solve these challenges!

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill-jig

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster.
After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*

1.1 Extract EVF and all related components into a separate module, so that the new module depends only upon the Vectors module.
1.2 Rewrite all operators, step by step, to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules.
1.3 Ensure that the only module which depends on Vectors is the new EVF one.

*2. Integration Stage*

2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf metadata with Arrow Vectors & FlatBuffers metadata in the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that either way we won't preserve backward compatibility for drivers and custom UDFs, and the proposed changes are a major step forward to be included in the Drill 2.0 version.

Below is a very first list of packages that may in future be moved into an EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batch management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory management)
org.apache.drill.exec.physical.impl.scan - (Row-set-based scan)

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned). Last time I checked (more than a year ago), the memory layout
> of Arrow is close to that of Drill; so conversion is about "packaging" and
> metadata, which can be encapsulated in an API.
>
> Converting the internals is a major undertaking. We have large amounts of
> complex, critical code that works directly with the details of value
> vectors. My thought was to first convert code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable by
> itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul
>
>
> On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <
> cgi...@gmail.com> wrote:
>
> Hi Igor,
> That would be really great if you could see that through to completion.
> IMHO, the value from this is not so much performance related, but rather
> the ability to use Drill to gather and prep data and seamlessly "hand it
> off" to other platforms for machine learning.
> -- C
>
> > On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com>
> > wrote:
> >
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stage. Work has been
> >> progressing on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate.
> >> These include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source, so that
> >> you could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill
> >> query, so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of Drill's value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone
> >> to contribute an implementation. The third is a major long-term
> >> project because of the way Drill value vectors and Arrow vectors have
> >> diverged.
> >>
> >> I wonder, which of these use cases is of interest to you? How might
> >> you use that integration in your project?
> >>
> >> Thanks,
> >> - Paul
> >>
> >>
> >> On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
> >> zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >>     As mentioned at Drill Developer Day 2018, there is a plan for
> >> Drill to integrate Arrow (Gandiva from Dremio). I was wondering how it
> >> is going.
> >>
> >> Thanks in advance.
> >>
> >> Nai Yan