Jim, I really like this use case. As a data scientist myself, I see Drill's big value as being able to rapidly get raw data ready for machine learning. It would be great if we could make this work! (A few rough sketches of what it might look like follow the quoted thread below.)
> On Jan 30, 2019, at 08:43, Jim Scott <[email protected]> wrote:
>
> Paul,
>
> Your example is exactly the same as one I discussed with some people on the RAPIDS.ai project: using Drill as a tool to gather (query) all the data to get a representative data set for an ML/AI workload, then feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow, which created a GPU data frame. The whole point of that project was to reduce the total number of memcopy operations and produce an end-to-end speedup.
>
> That model, letting Drill plug into other tools, would be a GREAT use case for Drill.
>
> Jim
>
> On Wed, Jan 30, 2019 at 2:17 AM Paul Rogers <[email protected]> wrote:
>
>> Hi Aman,
>>
>> Thanks for sharing the update. Glad to hear things are still percolating.
>>
>> I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as well. (And Spark can't act as a general-purpose query engine like Drill can.) Adding Arrow support for input and output would build on this advantage.
>>
>> I wonder if the output (client) side might be a great first step. It could be built as a separate app just by combining Arrow and the Drill client code. That would let lots of Arrow-aware apps query data with Drill rather than having to write their own readers, their own filters, their own aggregators and, in the end, their own query engine.
>>
>> Charles was asking about Summer of Code ideas. This might be one: a stand-alone Drill-to-Arrow bridge. I think Arrow has an RPC layer; add that, and any Arrow tool in any language could talk to Drill via the bridge.
>>
>> Thanks,
>> - Paul
>>
>> On Tuesday, January 29, 2019, 1:54:30 PM PST, Aman Sinha <[email protected]> wrote:
>>
>> Hi Charles,
>> You may have seen the talk that Karthik and I gave at the Drill Developer Day [1]; look for the slides on 'Drill-Arrow Integration', which describe two high-level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this thread. Option 2 is the deeper integration. We do plan to work on one of them (not finalized yet), but it will likely be after 1.16.0, since Statistics support and Resource Manager related tasks (also discussed at the Developer Day) are consuming our time. If you are interested in contributing/collaborating, let me know.
>>
>> [1] https://drive.google.com/drive/folders/17I2jZq2HdDwUDXFOIg1Vecry8yGTDWhn
>>
>> Aman
>>
>> On Tue, Jan 29, 2019 at 12:08 AM Paul Rogers <[email protected]> wrote:
>>
>>> Hi Charles,
>>> I didn't see anything on this on the public mailing list, and haven't seen any commits related to it either.
>>> My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture.
>>> I followed the Arrow mailing lists for much of last year. Not much activity in the Java arena (I think most of that might be done by Dremio); most activity is in other languages. The code itself has drifted far from the original Drill structure. I found that even the metadata had vastly changed; it turned out to be far too much work to port the "Row Set" work I did for Drill.
>>> This does mean, BTW, that the Drill folks did the right thing by not following Arrow. They'd have spent a huge amount of time tracking the massive changes.
>>> Still, converting Arrow vectors to Drill vectors might be an exercise in bit twirling and memory ownership. It is harder now than it once was, since I think Arrow defines all vectors to be nullable and uses a different scheme than Drill for representing nulls.
>>> Thanks,
>>> - Paul
>>>
>>> On Monday, January 28, 2019, 5:54:12 PM PST, Charles Givre <[email protected]> wrote:
>>>
>>> Hey Paul,
>>> I'm curious as to what, if anything, ever came of this thread. IMHO, you're on to something here. We could get the benefit of Arrow (specifically the interoperability with other big data tools) without the pain of having to completely rework Drill. This seems like a real win-win to me.
>>> -- C
>>>
>>>> On Aug 20, 2018, at 13:51, Paul Rogers <[email protected]> wrote:
>>>>
>>>> Hi Ted,
>>>>
>>>> We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code, we're just building an adapter to read Arrow data into Drill, or to convert Drill output to Arrow.
>>>>
>>>> The other idea, being discussed in a parallel thread, is to convert Drill's runtime engine to use Arrow. That is a whole other beast.
>>>>
>>>> When changing Drill internals, code must change, and there is a cost associated with that. Whether the Arrow code is better or not is not the key question. Rather, the key question is simply the volume of changes.
>>>>
>>>> Drill divides into roughly two main layers: plan-time and run-time. Plan-time is not much affected by Arrow. But run-time code is all about manipulating vectors and their metadata, often in quite detailed ways with APIs unique to Drill. While swapping Arrow vectors for Drill vectors is conceptually simple, those of us who've looked at the details have noted that the sheer volume of the lines of code that must change is daunting.
>>>>
>>>> It would be good to get second opinions. That PR I mentioned will show the volume of code that changed at that time (but Drill has grown since then). Parth is another good resource, as he reviewed the original PR and has kept a close eye on Arrow.
>>>>
>>>> When considering Arrow in the Drill execution engine, we must realistically understand the cost, then ask: do the benefits we gain justify those costs? Would Arrow be the highest-priority investment? Frankly, would Arrow integration increase Drill adoption more than the many other topics discussed recently on these mailing lists?
>>>>
>>>> Charles and others make a strong case for Arrow for integration. What is the strong case for Drill's internals?
>>>> That's really the question the group will want to answer.
>>>>
>>>> More details below.
>>>>
>>>> Thanks,
>>>> - Paul
>>>>
>>>> On Monday, August 20, 2018, 9:41:49 AM PDT, Ted Dunning <[email protected]> wrote:
>>>>
>>>> Inline.
>>>>
>>>> On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers <[email protected]> wrote:
>>>>
>>>>> ...
>>>>> By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of the full-integration costs include:
>>>>>
>>>>> * Reworking Drill's direct memory model to work with Arrow's.
>>>>>
>>>> Ted: This should be relatively isolated to the allocation/deallocation code. The deallocation should become a no-op. The allocation becomes simpler and safer.
>>>>
>>>> Paul: If only that were true. Drill has an ingenious integration of vector allocation and Netty. Arrow may have done the same (probably did, since such integration is key to avoiding copies on send/receive). That code is highly complex. Clearly, the swap can be done; it will simply take some work to get right.
>>>>
>>>>> * Changing all low-level runtime code that works with vectors to instead work with Arrow vectors.
>>>>>
>>>> Ted: Why? You already said that most code doesn't have to change since the format is the same.
>>>>
>>>> Paul: My comment about the format being the same was that the direct memory layout is the same, allowing conversion of a Drill vector to an Arrow vector by relabeling the direct memory that holds the data.
>>>>
>>>> Paul: But in the Drill runtime engine we don't work with the memory directly; we use the vector APIs, mutator APIs and so on. These all changed in Arrow. Granted, the Arrow versions are cleaner. But that does mean that every vector reference (of which there are thousands) must be revised to use the Arrow APIs. That is the cost that has put us off a bit.
>>>>
>>>>> * Changing all of Drill's vector metadata, and code that uses that metadata, to use Arrow's metadata instead.
>>>>>
>>>> Ted: Why? You said that converting Arrow metadata to Drill's metadata would be simple. Why not just continue with that?
>>>>
>>>> Paul: In an API, we can convert one data structure to the other by writing code to copy data. But if we change Drill's internals, we must rewrite code in every operator that uses Drill's metadata to instead use Arrow's. That is a much more extensive undertaking than simply converting metadata on input or output.
>>>>
>>>>> * Since generated code works directly with vectors, changing all the code generation.
>>>>>
>>>> Ted: Why? You said the UDFs would just work.
>>>>
>>>> Paul: Again, I fear we are confusing two issues. If we don't change Drill's internals, then UDFs will work as they do today. If we do change Drill to Arrow, then, since UDFs are part of the code-gen system, they must change to adapt to the Arrow APIs. Specifically, Drill "holders" must be converted to Arrow holders, and Drill complex writers must convert to Arrow complex writers.
>>>>
>>>> Paul: Here I'll point out that the Arrow vector code and writers have the same uncontrolled memory flaw that they inherited from Drill.
>>>> So, if we replace the mutators and writers, we might as well use the "result set loader" model, which a) hides the details, and b) manages memory to a given budget. Either way, UDFs must change if we move to Arrow for Drill internals.
>>>>
>>>>> * Since Drill vectors and metadata are exposed via the Drill client to JDBC and ODBC, those must be revised as well.
>>>>>
>>>> Ted: How much, given the high level of compatibility?
>>>>
>>>> Paul: As with Drill internals, all JDBC/ODBC code that uses Drill vector and metadata classes must be revised to use Arrow vectors and metadata, adapting the code to the changed APIs. This is not a huge technical challenge; it is just a pile of work. Perhaps this was done in that Arrow conversion PR.
>>>>
>>>>> * Since the wire format will change, clients of Drill must upgrade their JDBC/ODBC drivers when migrating to an Arrow-based Drill.
>>>>>
>>>> Ted: Doesn't this have to happen fairly often anyway?
>>>>
>>>> Ted: Perhaps this would be a good excuse for a 2.0 step.
>>>>
>>>> Paul: As Drill matures, users would appreciate the ability to use JDBC and ODBC drivers with multiple Drill versions. If a shop has 1000 desktops using the drivers against five Drill clusters, it is impractical to upgrade everything in one go.
>>>>
>>>> Paul: You hit the nail on the head: conversion to Arrow would justify a jump to "Drill 2.0" to explain the required big-bang upgrade (and to highlight the cool new capabilities that come with Arrow).
>
> --
> *Jim Scott* | Mobile/Text: +1 (989) 450-0212
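
To make some of the ideas above concrete, a few rough sketches follow; none of this is tested or authoritative. First, Paul's stand-alone Drill-to-Arrow bridge. This is a minimal sketch that assumes only the Drill JDBC driver and Arrow's arrow-jdbc adapter on the classpath; the connection URL and query are placeholders, and the exact JdbcToArrow overload varies by Arrow version.

// Hypothetical bridge sketch: run a query through the Drill JDBC driver and
// convert the ResultSet into Arrow vectors with the arrow-jdbc adapter.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Calendar;

import org.apache.arrow.adapter.jdbc.JdbcToArrow;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;

public class DrillArrowBridge {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");  // placeholder URL
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT * FROM cp.`employee.json` LIMIT 100");
         RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         // Materialize the JDBC rows as a single Arrow record batch.
         VectorSchemaRoot root = JdbcToArrow.sqlToArrow(rs, allocator, Calendar.getInstance())) {
      System.out.println(root.getSchema());
      System.out.println(root.getRowCount() + " rows now sit in Arrow memory");
      // From here the batch could be handed to any Arrow-aware tool, or served
      // over Arrow's RPC layer as Paul suggests.
    }
  }
}

An adapter like this copies data once at the query boundary, which is exactly the trade-off of the "crude-but-effective" approach discussed above: Drill's internals stay untouched.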
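
Second, on Paul's point that Arrow "uses a different scheme than Drill for representing nulls": Drill's nullable vectors carry a separate "bits" vector with one byte per value, while Arrow packs validity into a bitmap with one bit per value. Relabeling the data buffer therefore isn't enough; the null information has to be repacked, roughly as below (ignoring buffer ownership and reference counting, which are the hard parts Paul alludes to).

// Illustration only: repack Drill-style one-byte-per-value null flags into an
// Arrow-style validity bitmap (least-significant bit first, 1 = value present).
public class NullRepacking {

  static byte[] bytesToValidityBitmap(byte[] drillBits) {
    byte[] bitmap = new byte[(drillBits.length + 7) / 8];
    for (int i = 0; i < drillBits.length; i++) {
      if (drillBits[i] != 0) {                    // non-zero byte means "set" in Drill
        bitmap[i / 8] |= (byte) (1 << (i % 8));   // set the matching Arrow validity bit
      }
    }
    return bitmap;
  }

  public static void main(String[] args) {
    // Values 0 and 2 present, value 1 null.
    byte[] bitmap = bytesToValidityBitmap(new byte[] {1, 0, 1});
    System.out.println(Integer.toBinaryString(bitmap[0] & 0xFF)); // prints 101
  }
}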
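
Third, the metadata conversion Paul describes as feasible at the boundary might look roughly like the sketch below, which maps a handful of Drill scalar types onto Arrow field types. The Drill and Arrow class names are real as far as I know, but the mapping itself is purely illustrative; a real bridge would cover the full type system, cardinality modes, and nested types.

// Hedged sketch: translate a Drill MaterializedField into an Arrow Field for a
// few scalar types. Real code would handle the full MinorType enum,
// REQUIRED/REPEATED modes, decimals, and nested maps/lists.
import java.util.Collections;

import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.drill.exec.record.MaterializedField;

public class DrillToArrowSchema {

  static Field toArrowField(MaterializedField drillField) {
    ArrowType arrowType;
    switch (drillField.getType().getMinorType()) {
      case INT:     arrowType = new ArrowType.Int(32, true); break;
      case BIGINT:  arrowType = new ArrowType.Int(64, true); break;
      case FLOAT8:  arrowType = new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE); break;
      case VARCHAR: arrowType = new ArrowType.Utf8(); break;
      default:
        throw new UnsupportedOperationException(
            "No mapping sketched for " + drillField.getType().getMinorType());
    }
    // Arrow fields are nullable by default, matching Paul's note that Arrow
    // treats all vectors as nullable.
    return new Field(drillField.getName(), FieldType.nullable(arrowType),
        Collections.emptyList());
  }
}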
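
Finally, for a flavor of why "every vector reference (of which there are thousands) must be revised": the same trivial write looks different through Drill's mutator API and Arrow's vector API. The Drill half is sketched only in comments, from memory; the Arrow half should compile against arrow-vector.

// Toy comparison: write "42, then a null" through Arrow's IntVector. The
// Drill-side equivalent is sketched in comments for contrast; this is the
// kind of call site that exists by the thousands in Drill's operators and
// generated code.
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;

public class MutatorVsVector {
  public static void main(String[] args) {
    // Drill today (sketch, from memory):
    //   NullableIntVector v = ...;
    //   v.allocateNew(2);
    //   v.getMutator().setSafe(0, 42);     // value present
    //   // index 1 left unset => null
    //   v.getMutator().setValueCount(2);

    // Arrow equivalent:
    try (RootAllocator allocator = new RootAllocator(Long.MAX_VALUE);
         IntVector v = new IntVector("a", allocator)) {
      v.allocateNew(2);
      v.setSafe(0, 42);      // value present
      v.setNull(1);          // explicit null
      v.setValueCount(2);
      System.out.println(v.getObject(0) + ", isNull(1)=" + v.isNull(1));
    }
  }
}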
