Hi All,
Drill, like all open source projects, exists to serve those that use it. To
that end, the best contributions come when some company needs a feature
badly enough that it is worth the effort to develop and contribute a
solution. That's pretty standard, as long as the contribution is general
purpose. In fact, I hope everyone using Drill in support of their company
will contribute enhancements back to Drill. If you maintain your own
private fork, you're not helping the community that provided you with the
bulk of the code.
For the info schema, I'm at a loss to guess why this would be slow, unless
every plugin is going off and scanning some external source. Knowing that
we have a dozen plugins is not slow. Looking at plugin configs is not slow.
What could be slow is if you want to know about every possible file in HDFS
or S3, every database and table in an external DB, etc. In this case, the
bottleneck is either the external system, or the act of querying a dozen
different external systems. Perhaps, Charles, you can elaborate on the
specific scenario you have in mind.
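To make the cost model concrete, here is a toy sketch (not Drill's actual code; the plugin names and latencies are invented) showing why an info-schema query is dominated by one metadata round trip per enabled plugin, not by the plugin configs themselves:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model: an INFORMATION_SCHEMA scan pays one metadata call per enabled
// plugin, and external systems dominate. All figures are invented.
public class InfoSchemaCost {
  // Pretend per-plugin latency, in milliseconds, for one metadata fetch.
  static final Map<String, Long> PLUGIN_LATENCY_MS = new LinkedHashMap<>();
  static {
    PLUGIN_LATENCY_MS.put("cp", 1L);      // classpath: local, cheap
    PLUGIN_LATENCY_MS.put("dfs", 250L);   // listing HDFS/S3 directories
    PLUGIN_LATENCY_MS.put("rdbms", 400L); // catalog query on an external DB
  }

  // Sequential scan: total time is the sum of every plugin's latency.
  static long sequentialCostMs(List<String> enabledPlugins) {
    return enabledPlugins.stream()
        .mapToLong(PLUGIN_LATENCY_MS::get)
        .sum();
  }
}
```

The point of the sketch is only that the two slow entries swamp the cheap one: fixing anything other than the external calls buys little.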
Depending on the core issues, there are various solutions. One solution is
to cache all the external metadata in Drill. That's what Impala did with
the Hive Metastore, and it was a mess. I don't expect Drill would do any
better a job. One reason it was a mess is that, in a production system,
there is a vast amount of metadata. You end up playing all manner of tricks
to try to compress it. Since Drill (and Impala) are fully symmetric, each
node has to hold the entire cache. That is memory that can't be used to run
queries. So, to gain performance (for metadata) you give up performance (at
run time).
One solution is to create a separate metadata cache node. The query goes to
some Drillbit that acts as Foreman. The Foreman plans the query and
retrieves the needed metadata from the metadata node. The challenge here is
that there will be a large amount of metadata transferred, and the next
thing we know we'll want to cache it in each Drillbit, putting us back
where we started.
So, one can go another step: shift all query planning to the metadata node
and have a single planner node. The user connects to any Drillbit as
Foreman, but that Foreman first talks to the "planner/metadata" node to
give it SQL and get back a plan. The Foreman then runs the plan as usual.
(The Foreman runs the root fragment of the plan, which can be compute
intensive, so we don't want the planner node to also act as the Foreman.)
The notion here is that the SQL in/plan out is much smaller than the
metadata that is needed to compute the plan.
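The service boundary being proposed can be sketched as a tiny interface (hypothetical names, not Drill's actual RPC layer): the Foreman sends SQL text to the planner/metadata node and gets back a serialized plan, so the bulky metadata never crosses the wire.

```java
// Hedged sketch of the "planner node" idea: SQL in, plan out.
public class RemotePlannerSketch {
  // Hypothetical boundary between the Foreman and the planner/metadata node.
  public interface RemotePlanner {
    String plan(String sql);
  }

  // Trivial stub standing in for the planner node. A real implementation
  // would consult its local metadata cache and run the Calcite planner; here
  // we just wrap the SQL to show the shape of the exchange.
  public static class StubPlanner implements RemotePlanner {
    @Override
    public String plan(String sql) {
      return "{\"plan-for\": \"" + sql + "\"}";
    }
  }
}
```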
The idea about metadata has long been that Drill should provide a metadata
API. The Drill metastore should be seen as just one of many metadata
implementations. The Drill metastore is a "starter solution" for those who
have not already invested in another solution. (Many shops have HMS or
AWS Glue, which is Amazon's version of HMS, or one of the newer
metadata/catalog solutions.)
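A pluggable metadata API might look something like the following sketch. The interface and class names are invented for illustration, not existing Drill classes; the in-memory implementation stands in for "the Drill metastore as just one of many implementations", with HMS or Glue adapters sitting behind the same interface.

```java
import java.util.List;
import java.util.Optional;

// Sketch of a pluggable metadata API (all names hypothetical).
public class MetadataApiSketch {
  public interface MetadataProvider {
    List<String> listSchemas();
    Optional<String> tableLocation(String schema, String table);
  }

  // In-memory stand-in for the "starter solution" metastore.
  public static class InMemoryProvider implements MetadataProvider {
    @Override
    public List<String> listSchemas() {
      return List.of("dfs.tmp", "cp.default");
    }

    @Override
    public Optional<String> tableLocation(String schema, String table) {
      // Only the dfs.tmp schema knows physical locations in this toy.
      return "dfs.tmp".equals(schema)
          ? Optional.of("/tmp/" + table)
          : Optional.empty();
    }
  }
}
```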
One can go even further. Consider file and directory pruning in HMS. Every
tool has to do the exact same thing: given a set of predicates, find the
directories and files that match. Impala does it. Spark must do it.
Presto/Trino probably does it. Drill, when operating in Hive/HMS mode, must
do it. Maybe someone has come up with the One True Metadata Pruner and Drill
can just delegate the task to that external tool, and get back the list of
directories and files to scan. Far better than building yet another pruner.
(I think Drill currently has two Parquet metadata pruners, duplicating what
many other tools have done.)
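The delegated-pruner contract is simple: predicates in, list of files to scan out. A minimal sketch, with an invented partition model (one directory level encoding a year, in the spirit of Drill's dir0):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch of the "One True Metadata Pruner" contract: every engine hands the
// pruner a predicate and gets back only the files worth scanning. The
// partition model here is invented for illustration.
public class PrunerSketch {
  // A file plus the partition value its directory encodes, e.g. dir0=2022.
  public record PartitionedFile(String path, int year) {}

  // Keep only files whose partition value satisfies the predicate.
  static List<String> prune(List<PartitionedFile> files,
                            Predicate<Integer> yearPredicate) {
    return files.stream()
        .filter(f -> yearPredicate.test(f.year()))
        .map(PartitionedFile::path)
        .collect(Collectors.toList());
  }
}
```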
If we see the source of metadata as pluggable, then a shop such as DDR that
has specific needs (maybe caching those external schemas), can build a
metadata plugin for that use case. If the solution is general, it can be
contributed to Drill as another metadata option.
In any case, if we can better understand the specific problem you are
encountering, we can perhaps offer more specific suggestions.
Thanks,
- Paul
On Sun, Feb 6, 2022 at 8:11 AM Charles Givre<cgi...@gmail.com> wrote:
Hi Luoc,
Thanks for your concern. Apache projects are often backed unofficially by
a company. Drill was, for years, backed by MapR, as evidenced by all the
MapR-unique code that is still in the Drill codebase. However, since MapR's
acquisition, I think it is safe to say that Drill really has become a
community-driven project. While some of the committers are colleagues of
mine at DataDistillr, and Drill is a core part of DataDistillr, from our
perspective, we've really just been focusing on making Drill better for
everyone as well as building the community of Drill users, regardless of
whether they use DataDistillr or not. We haven't rejected any PRs because
they go against our business model, or tried to steer Drill against the
community, or anything like that.
Just for your awareness, there are other OSS projects, including some
Apache projects where one company controls everything. Outside
contributions are only accepted if they fit the company's roadmap, and
there is no real community-building that happens. From my perspective,
that is not what I want from Drill. My personal goal is to build an active
community of users and developers around an awesome tool.
I hope this answers your concerns.
Best,
-- C
On Feb 6, 2022, at 9:42 AM, luoc<l...@apache.org> wrote:
Before we discuss the next release, I would like to explain that an Apache
project should not be directly linked to a commercial company; otherwise,
this will affect the motivation of the community to contribute.
Thanks.
On Feb 6, 2022, at 21:29, Charles Givre<cgi...@gmail.com> wrote:
Hello all,
Firstly, I wanted to thank everyone for all the work that has gone into
Drill 1.20, as well as the ongoing discussion around Drill 2.0. I wanted
to start a discussion around a topic for Drill 1.21, namely INFO_SCHEMA
improvements. As my company wades further and further into Drill, it has
become apparent that the INFO_SCHEMA could use some attention. James
Turton submitted a PR which was merged into Drill 1.20, but in so doing he
uncovered an entire Pandora's box of other issues which might be worth
addressing. In a nutshell, the issues with the INFO_SCHEMA are all
performance-related: it can be very slow and can consume significant
resources when executing even basic queries.
My understanding of how the info schema (IS) works is that when a user
executes a query, Drill will attempt to instantiate every enabled storage
plugin to discover schemata and other information. As you might imagine,
this can be costly.
So, (and again, this is only meant as a conversation starter), I was
thinking there are some general ideas as to how we might improve the IS:
1. Implement a limit pushdown: As far as I can tell, there is no
limit pushdown in the IS and this could be a relatively quick win for
improving IS query performance.
2. Caching: I understand that caching is tricky, but perhaps we could
add some sort of schema caching for IS queries, or make better use of the
Drill metastore to reduce the number of connections during IS queries.
Perhaps in combination with the metastore, we could implement some sort of
"metastore first" plan, whereby Drill first hits the metastore for query
results and, if the limit is reached, we're done. If not, query the
storage plugins...
3. Parallelization: It did not appear to me that Drill parallelizes
IS queries. We may be able to add some parallelization, which would
improve overall speed, but not necessarily reduce overall compute cost.
4. Convert to EVF2: Not sure that there's a performance benefit here,
but at least we could get rid of cruft.
5. Reduce SerDe: I imagine there was a good reason for doing this,
but the IS seems to obtain a POJO from the storage plugin and then write
these results to old-school Drill vectors. I'm sure there was a reason it
was done this way (or maybe not), but I have to wonder if there is a more
efficient way of obtaining the information from the storage plugin,
ideally without all the object creation.
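Ideas 1 and 3 above compose naturally: fan the IS scan out across plugins in parallel and stop collecting as soon as a pushed-down LIMIT is satisfied. A hedged sketch, where a "plugin" is just a callable returning table names (none of this is Drill's actual scan code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: parallel info-schema scan with a pushed-down limit. Each "plugin"
// is modeled as a Callable that returns table names.
public class ParallelInfoSchemaSketch {
  static List<String> scanWithLimit(List<Callable<List<String>>> plugins, int limit) {
    ExecutorService pool = Executors.newFixedThreadPool(plugins.size());
    try {
      CompletionService<List<String>> cs = new ExecutorCompletionService<>(pool);
      plugins.forEach(cs::submit);
      List<String> out = new ArrayList<>();
      // Collect results as plugins finish; stop as soon as the limit is
      // reached rather than waiting on every plugin.
      for (int i = 0; i < plugins.size() && out.size() < limit; i++) {
        out.addAll(cs.take().get());
      }
      return out.size() > limit ? out.subList(0, limit) : out;
    } catch (InterruptedException | ExecutionException e) {
      throw new RuntimeException(e);
    } finally {
      // Cancel any plugins still running once the limit is satisfied.
      pool.shutdownNow();
    }
  }
}
```

As noted in item 3, this improves wall-clock time but not total compute: slow plugins still burn their cost unless the early shutdown cancels them in time.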
These are just some thoughts, and I'm curious as to what the community
thinks about this. Thanks everyone!
-- C