Hi Aman,
Thanks for sending. I looked through the slides and really liked the presentation.
@Paul, how would a Drill-to-Arrow bridge work exactly? Would it require serialization/deserialization of Drill objects?
—C
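One way to picture the client-side bridge discussed below, without touching Drill internals: a shim runs the query through the Drill client (or the REST API) and pivots the row-oriented result into the columnar arrays an Arrow batch is built from. This is a hedged, illustrative sketch only — `drill_rows_to_columns` is a hypothetical helper, not a Drill or Arrow API, and a real bridge would hand the columns to the Arrow library (e.g. Arrow Java or pyarrow) rather than keep plain Python lists. The point it makes: the only serialization is the one the Drill client already performs; the bridge itself is just a pivot.

```python
# Hypothetical sketch of the bridge's core step: pivot row-oriented
# Drill results (as the REST API returns them) into column-oriented
# arrays. None is preserved so a validity bitmap could be derived.

def drill_rows_to_columns(columns, rows):
    """Pivot rows (list of dicts keyed by column name) into a dict
    mapping column name -> list of values, in column order."""
    out = {name: [] for name in columns}
    for row in rows:
        for name in columns:
            out[name].append(row.get(name))  # missing key -> None (null)
    return out

rows = [
    {"id": 1, "name": "alice"},
    {"id": 2},                     # name is null in this row
    {"id": 3, "name": "carol"},
]
cols = drill_rows_to_columns(["id", "name"], rows)
# cols["id"] == [1, 2, 3]; cols["name"] == ["alice", None, "carol"]
```

From here, a real bridge would wrap each list in an Arrow vector and expose the batches over Arrow's RPC layer, as Paul suggests.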
> On Jan 30, 2019, at 02:16, Paul Rogers <[email protected]> wrote:
>
> Hi Aman,
>
> Thanks for sharing the update. Glad to hear things are still percolating.
>
> I think Drill is an under-appreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine like Drill can.) Adding Arrow support for input and output would build on this advantage.
>
> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code together. It would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, own filters, own aggregators and, in the end, their own query engine.
>
> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer. Add that, and any Arrow tool in any language could talk to Drill via the bridge.
>
> Thanks,
> - Paul
>
> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>
> Hi Charles,
> You may have seen the talk that was given on the Drill Developer Day [1] by Karthik and me ... look for the slides on 'Drill-Arrow Integration', which describe 2 high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (these were also discussed in the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>
> Aman
>
> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Charles,
>> I didn't see anything on this on the public mailing list. I haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
>> I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena. (I think most of that might be done by Dremio.) Most activity is in other languages. The code itself has drifted far away from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" stuff I did for Drill.
>> This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
>> Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. Harder now than it once was, since I think Arrow defines all vectors to be nullable, and uses a different scheme than Drill for representing nulls.
>> Thanks,
>> - Paul
>>
>> On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
>>
>> Hey Paul,
>> I'm curious as to what, if anything, ever came of this thread? IMHO, you're on to something here. We could get the benefit of Arrow—specifically the interoperability with other big data tools—without the pain of having to completely re-work Drill. This seems like a real win-win to me.
>> — C
>>
>>> On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
>>>
>>> Hi Ted,
>>>
>>> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion.
>>> On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or convert Drill output to Arrow.
>>>
>>> The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
>>>
>>> When changing Drill internals, code must change. There is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
>>>
>>> Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
>>>
>>> It would be good to get second opinions. The PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
>>>
>>> When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
>>>
>>> Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals? That's really the question the group will want to answer.
>>>
>>> More details below.
>>>
>>> Thanks,
>>> - Paul
>>>
>>> On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
>>>
>>> Inline.
>>> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
>>>
>>>> ...
>>>> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
>>>>
>>>> * Reworking Drill's direct memory model to work with Arrow's.
>>>
>>> Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
>>>
>>> Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same. (Probably did, since such integration is key to avoiding copies on send/receive.) That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
>>>
>>>> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
>>>
>>> Ted: Why? You already said that most code doesn't have to change, since the format is the same.
>>>
>>> Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
>>>
>>> Paul: But in the Drill runtime engine, we don't work with the memory directly; we use the vector APIs, mutator APIs, and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
>>>
>>>> * Change all Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
>>>
>>> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple.
>>> Why not just continue with that?
>>>
>>> Paul: In an API, we can convert one data structure to the other by writing code to copy data. But if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
>>>
>>>> * Since generated code works directly with vectors, change all the code generation.
>>>
>>> Ted: Why? You said the UDFs would just work.
>>>
>>> Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
>>>
>>> Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled-memory flaw that they inherited from Drill. So, if we replace the mutators and writers, we might as well use the "result set loader" model, which (a) hides the details, and (b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
>>>
>>>> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
>>>
>>> Ted: How much, given the high level of compatibility?
>>>
>>> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
>>>> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>>>
>>> Ted: Doesn't this have to happen fairly often anyway?
>>>
>>> Ted: Perhaps this would be a good excuse for a 2.0 step.
>>>
>>> Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
>>>
>>> Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).
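On Paul's point that Arrow "uses a different scheme than Drill for representing nulls": Drill's nullable vectors carry a separate "bits" buffer with one byte per value (1 = value set), while Arrow packs validity into a bitmap with one bit per value, least-significant bit first (1 = valid). A periphery adapter would re-encode that buffer for each batch. A minimal sketch, with illustrative function names (neither project's API):

```python
# Hedged sketch of the null-representation mismatch: re-encode between
# an Arrow-style packed validity bitmap (1 bit/value, LSB first) and a
# Drill-style byte-per-value "bits" buffer. Function names are
# illustrative only.

def arrow_validity_to_drill_bits(bitmap: bytes, length: int) -> list:
    """Unpack a packed validity bitmap into byte-per-value flags."""
    return [(bitmap[i // 8] >> (i % 8)) & 1 for i in range(length)]

def drill_bits_to_arrow_validity(bits: list) -> bytes:
    """Pack byte-per-value flags into an Arrow-style validity bitmap."""
    out = bytearray((len(bits) + 7) // 8)
    for i, flag in enumerate(bits):
        if flag:
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)

# values [10, null, 30, 40, null] -> validity bitmap 0b01101 = 0x0d
bits = arrow_validity_to_drill_bits(b"\x0d", 5)
# bits == [1, 0, 1, 1, 0]
assert drill_bits_to_arrow_validity(bits) == b"\x0d"
```

This re-encoding is cheap per batch, which is part of why the periphery adapter looks so much less costly than swapping the vector classes throughout Drill's runtime.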
