Re: [DISCUSS] Some ideas for Drill 1.21

Charles Givre Sun, 06 Feb 2022 08:11:59 -0800

Hi Luoc, 
Thanks for your concern.  Apache projects are often backed unofficially by a 
company.  Drill was, for years, backed by MapR, as is evident from all the 
MapR-specific code that is still in the Drill codebase.  However, since MapR's 
acquisition, I think it is safe to say that Drill really has become a 
community-driven project.  While some of the committers are colleagues of mine 
at DataDistillr, and Drill is a core part of DataDistillr, from our 
perspective, we've really just been focusing on making Drill better for 
everyone as well as building the community of Drill users, regardless of 
whether they use DataDistillr or not.  We haven't rejected any PRs because they 
go against our business model or tried to steer Drill against the community or 
anything like that.


Just for your awareness, there are other OSS projects, including some Apache 
projects, where one company controls everything.  Outside contributions are only 
accepted if they fit the company's roadmap, and there is no real 
community-building that happens.  From my perspective, that is not what I want 
from Drill.  My personal goal is to build an active community of users and 
developers around an awesome tool. 

I hope this answers your concerns.
Best,
-- C


> On Feb 6, 2022, at 9:42 AM, luoc <[email protected]> wrote:
> 
> 
> Before we discuss the next release, I would like to explain that an Apache 
> project should not be directly linked to a commercial company; otherwise, this 
> will affect the motivation of the community to contribute.
> 
> Thanks.
> 
>> On Feb 6, 2022, at 21:29, Charles Givre <[email protected]> wrote:
>> 
>> Hello all, 
>> Firstly, I wanted to thank everyone for all the work that has gone into 
>> Drill 1.20 as well as the ongoing discussion around Drill 2.0.   I wanted to 
>> start a discussion around a topic for Drill 1.21, and that is INFO_SCHEMA 
>> improvements.  As my company wades further and further into Drill, it has 
>> become apparent that the INFO_SCHEMA could use some attention.  James Turton 
>> submitted a PR which was merged into Drill 1.20, but in so doing he 
>> uncovered an entire Pandora's box of other issues which might be worth 
>> addressing.  In a nutshell, the issues with the INFO_SCHEMA are all 
>> performance related: it can be very slow and also can consume significant 
>> resources when executing even basic queries.  
>> 
>> My understanding of how the info schema (IS) works is that when a user 
>> executes a query, Drill will attempt to instantiate every enabled storage 
>> plugin to discover schemata and other information. As you might imagine, 
>> this can be costly. 
>> 
>> So, (and again, this is only meant as a conversation starter), I was 
>> thinking there are some general ideas as to how we might improve the IS:
>> 1.  Implement a limit pushdown:  As far as I can tell, there is no limit 
>> pushdown in the IS and this could be a relatively quick win for improving IS 
>> query performance.
>> 2.  Caching:  I understand that caching is tricky, but perhaps we could add 
>> some sort of schema caching for IS queries, or make better use of the Drill 
>> metastore to reduce the number of connections during IS queries.  Perhaps in 
>> combination with the metastore, we could implement some sort of "metastore 
>> first" plan, whereby Drill first hits the metastore for query results and if 
>> the limit is reached, we're done.  If not, query the storage plugins...
>> 3.  Parallelization:  It did not appear to me that Drill parallelizes IS 
>> queries.   We may be able to add some parallelization which would improve 
>> overall speed, but not necessarily reduce overall compute cost.
>> 4.  Convert to EVF2:  Not sure that there's a performance benefit here, but 
>> at least we could get rid of cruft.
>> 5.  Reduce SerDe:   I imagine there was a good reason for doing this, but the 
>> IS seems to obtain a POJO from the storage plugin then write these results 
>> to old-school Drill vectors.  I'm sure there was a reason it was done this 
>> way, (or maybe not) but I have to wonder if there is a more efficient way of 
>> obtaining the information from the storage plugin, ideally w/o all the 
>> object creation. 
>> 
>> These are just some thoughts, and I'm curious as to what the community 
>> thinks about this.  Thanks everyone!
>> -- C
>
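
To make ideas 1 and 2 from the quoted list a bit more concrete, here is a
minimal, language-neutral sketch (written in Python for brevity) of a
"metastore first" lookup with a limit pushed down into it.  Everything in it
is hypothetical: list_tables, the cached iterable, and the plugin callables
are illustrative stand-ins, not Drill or metastore APIs.  The only point is
that rows are produced lazily, so a storage plugin is opened only if the
cached entries did not already satisfy the limit.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List

# Hypothetical stand-in: a "plugin" is a callable that connects to a storage
# system when invoked and returns the table names it finds.
Plugin = Callable[[], Iterable[str]]


def list_tables(cached: Iterable[str], plugins: List[Plugin], limit: int) -> List[str]:
    """Return up to `limit` table names, consulting cached entries first."""

    def rows() -> Iterator[str]:
        seen = set()
        # Step 1: serve whatever the (hypothetical) metastore/cache already knows.
        for name in cached:
            seen.add(name)
            yield name
        # Step 2: fall back to the plugins, one at a time, skipping duplicates.
        for open_plugin in plugins:
            for name in open_plugin():  # the plugin is opened only here
                if name not in seen:
                    seen.add(name)
                    yield name

    # islice stops pulling from the generator once `limit` rows were produced,
    # so later plugins are never opened if the limit is already met.
    return list(islice(rows(), limit))
```

With a limit of 2 and two cached entries, no plugin is opened at all; with
one cached entry, only the first plugin is opened.  Parallelizing step 2
(idea 3) would trade this early exit for lower latency, which matches the
observation that parallelization may not reduce overall compute cost.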
