Re: "Crude-but-effective" Arrow integration

2019-01-30 Thread Paul Rogers
Hi Jim, Thanks for the description of the real-world use case. I like your idea of letting Drill do the grunt work, then letting the ML/AI workload focus on that aspect of the problem. Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server

Re: "Crude-but-effective" Arrow integration

2019-01-30 Thread Charles Givre
Jim, I really like this use case. As a data scientist myself, I see the big value of Drill as being able to rapidly get raw data ready for machine learning. This would be great if we could do this! > On Jan 30, 2019, at 08:43, Jim Scott wrote: > > Paul, > > Your example is exactly the same

Re: "Crude-but-effective" Arrow integration

2019-01-30 Thread Jim Scott
Paul, Your example is exactly the same as one I discussed with some people on the RAPIDS.ai project: using Drill as a tool to gather (query) all the data to get a representative data set for an ML/AI workload, then feeding the result set directly into GPU memory. RAPIDS.ai is based on Arrow

Re: "Crude-but-effective" Arrow integration

2019-01-30 Thread Charles Givre
Hi Aman, Thanks for sending. I looked through the slides and really liked the presentation. @Paul, how would a Drill-to-Arrow bridge work exactly? Would it require serialization/deserialization of Drill objects? —C > On Jan 30, 2019, at 02:16, Paul Rogers wrote: > > Hi Aman, > > Thanks

Re: "Crude-but-effective" Arrow integration

2019-01-29 Thread Paul Rogers
Hi Aman, Thanks for sharing the update. Glad to hear things are still percolating. I think Drill is an underappreciated treasure for doing queries in the complex systems that folks seem to be building today. The ability to read multiple data sources is something that maybe only Spark can do as

Re: "Crude-but-effective" Arrow integration

2019-01-29 Thread Aman Sinha
Hi Charles, You may have seen the talk that was given on the Drill Developer Day [1] by Karthik and me ... look for the slides on 'Drill-Arrow Integration' which describes 2 high level options and what the integration might entail. Option 1 corresponds to what you and Paul are discussing in this th

Re: "Crude-but-effective" Arrow integration

2019-01-29 Thread Paul Rogers
Hi Charles, I didn't see anything on this on the public mailing list. Haven't seen any commits related to it either. My guess is that this kind of interface is not important for the kind of data warehouse use cases that MapR is probably still trying to capture. I followed the Arrow mailing lists

Re: "Crude-but-effective" Arrow integration

2019-01-28 Thread Charles Givre
Hey Paul, I’m curious as to what, if anything, ever came of this thread? IMHO, you’re on to something here. We could get the benefit of Arrow, specifically the interoperability with other big data tools, without the pain of having to completely re-work Drill. This seems like a real win-win to me

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
Hi Ted, We may be confusing two very different ideas. One is a Drill-to-Arrow adapter on Drill's periphery; this is the "crude-but-effective" integration suggestion. On the periphery we are not changing existing code; we're just building an adapter to read Arrow data into Drill, or convert

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Ted Dunning
Inline. On Mon, Aug 20, 2018 at 9:20 AM Paul Rogers wrote: > ... > By contrast, migrating Drill internals to Arrow has always been seen as > the bulk of the cost; costs which the "crude-but-effective" suggestion > seeks to avoid. Some of the full-integration costs include: > > * Reworking Drill

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
Hi Ted, The "crude but effective" integration suggestion allows Drill to participate in an Arrow pipeline with minimal work. By contrast, migrating Drill internals to Arrow has always been seen as the bulk of the cost; costs which the "crude-but-effective" suggestion seeks to avoid. Some of th

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Paul Rogers
Hi Charles, Regarding UDFs and Arrow: if Arrow is used just as an interface format (as outlined in the original post), then Drill's internals continue to use Drill value vectors and UDFs are unchanged. If Arrow is adopted internally in Drill, then vast amounts of runtime code must change (see

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Ted Dunning
This makes it sound like allocation is the important difference. As such, that might mean that converting Drill would be easier than was thought. On Sat, Aug 18, 2018, 16:44 Paul Rogers wrote: > Hi All, > > Charles recently suggested why Arrow integration could be helpful. (See > quote below.) W

Re: "Crude-but-effective" Arrow integration

2018-08-20 Thread Charles Givre
Hi Paul, This is a very interesting approach. I really like the concept in that it sounds like we could prove the value of the Arrow integration without “major surgery” to Drill. If it proves to be valuable we could proceed with deeper integration, or if we determine that it is not necessary,

"Crude-but-effective" Arrow integration

2018-08-18 Thread Paul Rogers
Hi All, Charles recently suggested why Arrow integration could be helpful. (See quote below.) When we've looked at reworking Drill's internals to use Arrow, we found the project to be costly with little direct benefit in terms of performance or stability. But, Charles points out that the real