Re: Next Version

James Turton Mon, 01 Jan 2024 04:57:13 -0800

P.P.S. since I'm spamming this thread today. With

> this suggests to me that we should keep putting effort into: embedded Drill, Windows support, rapid installation and setup, low "time to insight".

I'm not going so far as to suggest that Drill be thought of as desktop software, rather that ad hoc Drill deployments working on small (GB) to big (TB) data may be as important as, or more important than, long-lived, heavily integrated, professionally managed deployments working on really Big data (PB). Perhaps the last category belongs almost entirely to BigQuery, Athena, Snowflake and the like nowadays anyway.

I still think a cluster is often the most effective way to deploy Drill, so the question contemplated is really "Can we make it faster and easier to spin up a cluster (and embedded Drill), connect to data sources and start running (successful) queries?"


On 2024/01/01 07:33, James Turton wrote:

P.S. I also have an admittedly vague idea about deprecating the UNION data type, which still breaks things in many operators, in favour of a different approach where we kick any invalid data encountered while loading column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though binary data formats tend not to be malformed?) column. This would let a query over dirty data complete without invisible data swallowing, and would mean we could cut further effort on UNION support.
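
[Editor's sketch] The exceptions-column idea above could look roughly like the following. This is only an illustration in plain Python, not Drill code: the function name load_column and its parameters are hypothetical, and a real implementation would live in Drill's readers rather than in a helper like this. It assumes a value is "invalid" when the supplied parser raises TypeError or ValueError.

```python
from typing import Any, Callable, Iterable, Optional


def load_column(
    raw_values: Iterable[Any],
    parse: Callable[[Any], Any],
    name: str = "FOO",
) -> dict:
    """Hypothetical helper: split raw values into a typed column plus a
    companion _<NAME>_EXCEPTIONS column holding the rejected raw text."""
    exceptions_name = f"_{name}_EXCEPTIONS"
    typed: list[Optional[Any]] = []
    rejects: list[Optional[str]] = []
    for raw in raw_values:
        try:
            typed.append(parse(raw))
            rejects.append(None)  # nothing to report for a clean value
        except (TypeError, ValueError):
            typed.append(None)  # typed column gets a NULL
            rejects.append(str(raw))  # raw value is kept, not swallowed
    return {name: typed, exceptions_name: rejects}


columns = load_column(["1", "2", "n/a", "4"], int)
print(columns)
```

Every row keeps its position in both columns, so a query over FOO completes while the malformed input stays visible in the companion column.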
On 2024/01/01 07:11, James Turton wrote:
Happy New Year!
Here's another two cents. Make that five now that I scan this email again!
Excluding our Docker Hub images (which are popular), Drill is downloaded ~1000 times a month [1] (order of magnitude, it's hard to count genuinely new installations from web server downloads).
What roles are these folks in? I'm a data engineer by day and I don't think that we count for a large share of those downloads. The DEs I work with are risk averse sorts that tend to favour setups with rigid schemas early on and no surprises for their users at query time. Add to that a second stat from the download data: the biggest single download user OS is Windows, at about 50% [1]. Some of these users may go on to copy that download to a server environment but I have a theory that many of them go on to run embedded Drill right there on beefy Windows laptops.
I conjecture that most of the people reaching for Drill are analysts or developers working _away_ from an established, shared data infrastructure. There may not be any shared data engineering where they are, or they may find themselves in a fashionable "Data Mesh" environment [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it mainly proposes a federation of distinct data _teams_, rather than of data _systems_ but, if you entertain my cynical formulation of "Data Mesh guys! Silos aren't uncool anymore!" just a bit, then you can well imagine why a user in a Data Mesh might look for something like Drill to combine data from different silos on their own machine. Tangentially this suggests to me that we should keep putting effort into: embedded Drill, Windows support, rapid installation and setup, low "time to insight".
MongoDB questions still come up frequently, giving a reason beyond the JSON files questions to think that the JSON data model is still very important. Wherever we decide to bound the current EVF v2 data model implementation, maybe we can sketch out a design of whatever is unimplemented in some updates to the Drill wiki pages? This would give other devs a head start if we decide that some unsupported complex data type is worth implementing down the road?
1. https://infra-reports.apache.org/#downloads&project=drill
2. https://martinfowler.com/articles/data-mesh-principles.html

Regards
James

On 2024/01/01 03:16, Charles Givre wrote:
I'll throw my .02 here... As a user of Drill, I've only had the occasion to use the Union once. However, when I used it, it consumed so much memory, we ended up finding a workaround anyway and stopped using it. Honestly, since we improved the implicit casting rules, I think Drill is a lot smarter about how it reads data anyway. Bottom line, I do think we could drop the union and repeated union.
The repeated lists and maps however are unfortunately something that does come up a bit. Honestly, I'm not sure what work is remaining here but TBH Drill works pretty well at the moment with most of the data I'm using it for. This would include some really nasty nested JSON objects.
-- C
On Dec 31, 2023, at 01:38, Paul Rogers <[email protected]> wrote:

Hi Luoc,
Thanks for reminding me about the EVF V2 work. I got mostly done adding
projection for complex types, then got busy on other projects. I've yet to
tackle the hard cases: unions, repeated unions and repeated lists (which
are, in fact, repeated repeated unions).
The code to handle unprojected fields in these areas is getting awfully
complicated. In doing that work, and then seeing a trick that Druid uses,
I'm tempted to rework the projection bits of the code to use a cleaner
approach. However, it might be better to commit the work done thus far so
folks can use it before I wander off to take another approach.
Then, I wondered if anyone actually still uses this stuff. Do you still
need the code to handle non-projection of complex types?
Of course, perhaps no one will ever need the hard cases: I've never been
convinced that unions, repeated lists, or arrays of repeated lists are
things that any sane data engineer will want to use -- or use more than
once.

Thanks,

- Paul
On Sat, Dec 30, 2023 at 10:26 PM James Turton <[email protected]> wrote:
Hi Luoc and Drill devs!

It's best to email Paul directly since he doesn't follow these lists
closely. In the meantime I've prepared a PR of backported fixes for
1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
upgrade that Maksym is working on, and which looks close to done,
included? There's at least one CVE applicable to our current version of
Netty...

Regards
James


1. https://github.com/apache/drill/pull/2860

On 2023/12/11 04:41, luoc wrote:
Hello all,
   1.22 will be a more stable version. This is a digression: Is Paul
still interested in participating in the EVF V2 refactoring in the
framework? I would like to offer time to assist him.
luoc
On Dec 9, 2023, at 01:01, Charles Givre <[email protected]> wrote:

Hello all,
Happy Friday everyone! I wanted to raise the topic of getting a Drill
minor release out the door before the end of the year. My opinion is that I'd really like to release Drill 1.22 once the integration with Apache Daffodil is complete, but it sounds like that is still a few weeks away.
What does everyone think about issuing a maintenance release before the
end of the year? There are a number of significant fixes, including some security updates and a major bug in the ES plugin that basically makes it
unusable.
Best,
-- C
