Re: Next Version

Paul Rogers Mon, 01 Jan 2024 12:33:52 -0800

Hi All,

My two cents on Charles' other points about Drill's use with Mongo or
Druid. If this is common, we might want to put more effort into the
integrations above the level of the reader. I'm most familiar with Druid,
so let's use that as an example.


Druid provides a SQL interface, so it is convenient to forward Drill
queries to Druid as SQL. But Druid has a very limited distribution
architecture with just two levels: the coordinator and the data nodes.
This means we've got, say, 10 Drill nodes that pick one Drill node to be
the reader, which talks to the one Druid coordinator, which then talks to,
say, 20 data nodes. This is clearly a bottleneck, and will never perform
anywhere near what Druid's native UI can do.

So, a better approach is to bypass Druid SQL and use Druid native queries.
Bypass the coordinator and talk directly to the data nodes. Now, we have
our 10 Drill nodes each talking to two Druid data nodes, providing a
parallelism far better than Druid itself provides. Drill's distributed
sort, join and windowing functionality is far more scalable than Druid's
single-node-only equivalents.
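
A minimal sketch of that fan-out, using the 10 Drill nodes and 20 data
nodes from the example above. The node names and the round-robin policy
are assumptions for illustration only; this is not how Drill's planner
assigns work today.

```python
from collections import defaultdict


def assign_data_nodes(drill_nodes, data_nodes):
    """Spread data nodes across Drill nodes round-robin (hypothetical policy)."""
    assignment = defaultdict(list)
    for i, data_node in enumerate(data_nodes):
        # Data node i goes to Drill node i mod (number of Drill nodes).
        assignment[drill_nodes[i % len(drill_nodes)]].append(data_node)
    return dict(assignment)


# Hypothetical host names, not real endpoints.
drill_nodes = [f"drill-{i}" for i in range(10)]
data_nodes = [f"druid-data-{i}" for i in range(20)]

plan = assign_data_nodes(drill_nodes, data_nodes)
for drill_node, targets in plan.items():
    print(drill_node, "->", targets)
```

With these numbers every Drill node ends up reading from exactly two data
nodes, and no request has to pass through the coordinator.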

Druid is optimized for small, simple queries that power dashboards. Druid
frowns on "BI" use cases that touch large chunks of data. In Druid, the
coordinator is the bottleneck: BI queries against the coordinator kill
dashboard SLAs. With the above setup, Drill would provide a wonderful,
scalable BI solution for Druid that does not degrade the system because
Drill would no longer put load on Druid's weak link: the coordinator node.

Mongo is also distributed. Does it have the same potential to use Drill to
distribute work to avoid a similar bottleneck?

To give MapR some credit, MapR-DB had a client that allowed distributed
queries. The Drill integration with MapR-DB was supposed to use an approach
similar to the one outlined above for Druid.

Alas, the above trick won't work for a traditional DBMS using JDBC.
However, if the DB is sharded, then, with the right metadata, Drill could
distribute queries to the shards so the DB's own query system doesn't have
to.
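
As a rough sketch of what "the right metadata" could look like, assume
each shard owns a half-open range of an integer key. The Shard class,
host names, table and column below are all hypothetical; a real plugin
would get this information from the DB's own catalog.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    host: str  # hypothetical shard host name
    lo: int    # inclusive lower bound of the shard key
    hi: int    # exclusive upper bound of the shard key


def shards_for_range(shards, lo, hi):
    """Keep only the shards whose key range overlaps the query range [lo, hi)."""
    return [s for s in shards if s.lo < hi and lo < s.hi]


def per_shard_sql(table, key, shard, lo, hi):
    """Build the query for one shard, clipped to that shard's key range."""
    start, end = max(lo, shard.lo), min(hi, shard.hi)
    return f"SELECT * FROM {table} WHERE {key} >= {start} AND {key} < {end}"


SHARDS = [Shard("db-0", 0, 100), Shard("db-1", 100, 200), Shard("db-2", 200, 300)]

# A query over keys [150, 250) only needs to touch two of the three shards.
targets = shards_for_range(SHARDS, 150, 250)
queries = {s.host: per_shard_sql("orders", "id", s, 150, 250) for s in targets}
for host, sql in queries.items():
    print(host, ":", sql)
```

Each per-shard query could then be handed to a different Drill node, with
Drill doing the merge.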

So there you have it, a fun weekend project for someone familiar with the
details of a particular distributed DB.

Thanks,

- Paul


On Mon, Jan 1, 2024 at 7:17 AM Charles Givre <cgi...@gmail.com> wrote:

> To continue the thread hijacking....
>
> I'd agree with what James is saying.  What if we were to create a docker
> container (or some sort of package) that included Drill, Superset and all
> associated configuration stuff so that a user could just run a docker
> command and have a fully functional Drill instance set up with Superset?
>
> Regarding the JSON, for a while we were working on updating all the
> plugins to use EVF2.  From my recollection, we got all the formats
> converted except for parquet (major project) and HDF5 (PR pending:
> https://github.com/apache/drill/pull/2515).  We had also started working
> on removing the old JSON reader, however, there were a few places it reared
> its head:
> 1.  The Druid plugin.  I wrote a draft PR that is pending to swap it out
> for the EVF JSON reader but haven't touched it in a really long time. (
> https://github.com/apache/drill/pull/2657)
> 2.  The Mongo plugin:  No work there...
> 3.  The conversion UDFs.   Work started.  (
> https://github.com/apache/drill/pull/2567)
>
> In any event, given the interest in Mongo/Drill, it might be worthwhile to
> take a look at the Mongo plugin to see what it would take to swap out the
> old JSON reader for the EVF one.
> Regarding unprojected columns, if that's the holdup, I'd say scrap that
> feature for complex data types.
>
> What do you think?
>
>
> > On Jan 1, 2024, at 07:57, James Turton <dz...@apache.org> wrote:
> >
> > P.P.S. since I'm spamming this thread today. With
> >
> > > this suggests to me that we should keep putting effort into: embedded
> Drill, Windows support, rapid installation and setup, low "time to insight".
> >
> > I'm not going so far as to suggest that Drill be thought of as desktop
> software, rather that ad hoc Drill deployments working on small (GB) to big
> (TB) data may be as, or more, important than long lived, heavily
> integrated, professionally managed deployments working on really Big data
> (PB). Perhaps the last category belongs almost entirely to BigQuery,
> Athena, Snowflake and the like nowadays anyway.
> >
> > I still think a cluster is often the most effective way to deploy
> Drill so the question contemplated is really "Can we make it faster and
> easier to spin up a cluster (and embedded Drill), connect to data sources
> and start running (successful) queries"?
> >
> > On 2024/01/01 07:33, James Turton wrote:
> >> P.S. I also have an admittedly vague idea about deprecating the UNION
> data type, which still breaks things in many operators, in favour of a
> different approach where we kick any invalid data encountered while loading
> column FOO out to a generated _FOO_EXCEPTIONS VARCHAR (or VARBINARY, though
> binary data formats tend not to be malformed?) column. This would let a
> query over dirty data complete without invisible data swallowing, and would
> mean we could cut further effort on UNION support.
> >>
> >> On 2024/01/01 07:11, James Turton wrote:
> >>> Happy New Year!
> >>>
> >>> Here's another two cents. Make that five now that I scan this email
> again!
> >>>
> >>> Excluding our Docker Hub images (which are popular), Drill is
> downloaded ~1000 times a month [1] (order of magnitude, it's hard to count
> genuinely new installations from web server downloads).
> >>>
> >>> What roles are these folks in? I'm a data engineer by day and I don't
> think that we count for a large share of those downloads. The DEs I work
> with are risk averse sorts that tend to favour setups with rigid schemas
> early on and no surprises for their users at query time. Add to that a
> second stat from the download data: the biggest single download user OS is
> Windows, at about 50% [1]. Some of these users may go on to copy that
> download to a server environment but I have a theory that many of them go
> on to run embedded Drill right there on beefy Windows laptops.
> >>>
> >>> I conjecture that most of the people reaching for Drill are analysts
> or developers working _away_ from an established, shared data
> infrastructure. There may not be any shared data engineering where they
> are, or they may find themselves in a fashionable "Data Mesh" environment
> [2]. I'm probably abusing Data Mesh a bit here in that I'm told that it
> mainly proposes a federation of distinct data _teams_, rather than of data
> _systems_ but, if you entertain my cynical formulation of "Data Mesh guys!
> Silos aren't uncool any more!" just a bit, then you can well imagine why a
> user in a Data Mesh might look for something like Drill to combine data
> from different silos on their own machine. Tangentially this suggests to me
> that we should keep putting effort into: embedded Drill, Windows support,
> rapid installation and setup, low "time to insight".
> >>>
> >>> MongoDB questions still come up frequently giving a reason beyond the
> JSON files questions to think that the JSON data model is still very
> important. Wherever we decide to bound the current EVF v2 data model
> implementation, maybe we can sketch out a design of whatever is
> unimplemented in some updates to the Drill wiki pages? This would give
> other devs a head start if we decide that some unsupported complex data
> type is worth implementing down the road?
> >>>
> >>> 1. https://infra-reports.apache.org/#downloads&project=drill
> >>> 2. https://martinfowler.com/articles/data-mesh-principles.html
> >>>
> >>> Regards
> >>> James
> >>>
> >>> On 2024/01/01 03:16, Charles Givre wrote:
> >>>> I'll throw my .02 here...  As a user of Drill, I've only had the
> occasion to use the Union once. However, when I used it, it consumed so
> much memory, we ended up finding a workaround anyway and stopped using it.
> Honestly, since we improved the implicit casting rules, I think Drill is a
> lot smarter about how it reads data anyway. Bottom line, I do think we
> could drop the union and repeated union.
> >>>>
> >>>> The repeated lists and maps however are unfortunately something that
> does come up a bit.   Honestly, I'm not sure what work is remaining here
> but TBH Drill works pretty well at the moment with most of the data I'm
> using it for.  This would include some really nasty nested JSON objects.
> >>>>
> >>>> -- C
> >>>>
> >>>>
> >>>>> On Dec 31, 2023, at 01:38, Paul Rogers <par0...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Luoc,
> >>>>>
> >>>>> Thanks for reminding me about the EVF V2 work. I got mostly done
> adding
> >>>>> projection for complex types, then got busy on other projects. I've
> yet to
> >>>>> tackle the hard cases: unions, repeated unions and repeated lists
> (which
> >>>>> are, in fact, repeated repeated unions).
> >>>>>
> >>>>> The code to handle unprojected fields in these areas is getting
> awfully
> >>>>> complicated. In doing that work, and then seeing a trick that Druid
> uses,
> >>>>> I'm tempted to rework the projection bits of the code to use a
> cleaner
> >>>>> approach. However, it might be better to commit the work done thus
> far so
> >>>>> folks can use it before I wander off to take another approach.
> >>>>>
> >>>>> Then, I wondered if anyone actually still uses this stuff. Do you
> still
> >>>>> need the code to handle non-projection of complex types?
> >>>>>
> >>>>> Of course, perhaps no one will ever need the hard cases: I've never
> been
> >>>>> convinced that unions, repeated lists, or arrays of repeated lists
> are
> >>>>> things that any sane data engineer will want to use -- or use more
> than
> >>>>> once.
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> - Paul
> >>>>>
> >>>>>
> >>>>> On Sat, Dec 30, 2023 at 10:26 PM James Turton <dz...@apache.org>
> wrote:
> >>>>>
> >>>>>> Hi Luoc and Drill devs!
> >>>>>>
> >>>>>> It's best to email Paul directly since he doesn't follow these lists
> >>>>>> closely. In the meantime I've prepared a PR of backported fixes for
> >>>>>> 1.21.2 to the 1.21 branch [1]. I think we can try to get the Netty
> >>>>>> upgrade that Maksym is working on, and which looks close to done,
> >>>>>> included? There's at least one CVE  applicable to our current
> version of
> >>>>>> Netty...
> >>>>>>
> >>>>>> Regards
> >>>>>> James
> >>>>>>
> >>>>>>
> >>>>>> 1. https://github.com/apache/drill/pull/2860
> >>>>>>
> >>>>>> On 2023/12/11 04:41, luoc wrote:
> >>>>>>> Hello all,
> >>>>>>>    1.22 will be a more stable version. This is a digression: Is
> Paul
> >>>>>> still interested in participating in the EVF V2 refactoring in the
> >>>>>> framework? I would like to offer time to assist him.
> >>>>>>> luoc
> >>>>>>>
> >>>>>>>> On Dec 9, 2023, at 01:01, Charles Givre <cgi...@gmail.com> wrote:
> >>>>>>>>
> >>>>>>>> Hello all,
> >>>>>>>> Happy Friday everyone!   I wanted to raise the topic of getting a
> Drill
> >>>>>> minor release out the door before the end of the year. My opinion
> is that
> >>>>>> I'd really like to release Drill 1.22 once the integration with
> Apache
> >>>>>> Daffodil is complete, but it sounds like that is still a few weeks
> away.
> >>>>>>>> What does everyone think about issuing a maintenance release
> before the
> >>>>>> end of the year?  There are a number of significant fixes including
> some
> >>>>>> security updates and a major bug in the ES plugin that basically
> makes it
> >>>>>> unusable.
> >>>>>>>> Best,
> >>>>>>>> -- C
> >>>>>>
> >>>
> >>
> >
>
>
