My 2c while I wait for 1.20.0 RC1 to upload.

I think it's good that we continue to bring every design decision out here to the community, like Charles did with this one.  Here are some relevant excerpts I turned up while zipping around a few ASF docs just now.

   "Mailing lists are the virtual rooms where ASF communities live,
   form and grow. All formal decisions the project's PMC makes need to
   have an email thread (possibly with a recorded vote) as an audit
   trail that this was an official decision." [1]

   "We firmly believe in hats
   <https://www.apache.org/foundation/how-it-works.html#hats>. Your
   role at the ASF is one assigned to you personally, and is bestowed
   on you by your peers. It is not tied to your job or current employer
   or company." [2]

   "Unless they specifically state otherwise, whatever an ASF
   participant posts on any mailing list is done /as themselves/. It is
   the individual point-of-view, wearing their personal hat and not as
   a mouthpiece for whatever company happens to be signing their
   paychecks right now, and not even as a director of the ASF." [2]


Info schema.  The info schema is slow when enabled storage plugins are slow to register their schemas; flaky plugins can take so long that the info schema appears broken.  The info schema recently had its filter push down improved so that unneeded schema registration is avoidable [3], and I tested that it works in the case of an enabled but unreachable PostgreSQL plugin (provided my WHERE clause excluded said pg).
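
For concreteness, here is a minimal Java sketch of the kind of test I ran, over Drill's JDBC driver.  The connection URL and the plugin name "pg" are illustrative assumptions, not my actual setup, and whether a given predicate form gets pushed down is of course up to the planner.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class InfoSchemaFilterDemo {
      public static void main(String[] args) throws Exception {
        // Assumed connection URL; a real deployment would point at its own
        // drillbit or ZooKeeper quorum.
        try (Connection conn =
                 DriverManager.getConnection("jdbc:drill:drillbit=localhost");
             Statement stmt = conn.createStatement();
             // With the improved filter push down [3], schemas excluded by
             // the WHERE clause (here a flaky plugin named "pg") need never
             // be registered, so this returns quickly even if pg is down.
             ResultSet rs = stmt.executeQuery(
                 "SELECT TABLE_SCHEMA, TABLE_NAME "
                     + "FROM INFORMATION_SCHEMA.`TABLES` "
                     + "WHERE TABLE_SCHEMA NOT LIKE 'pg%'")) {
          while (rs.next()) {
            System.out.println(rs.getString(1) + "." + rs.getString(2));
          }
        }
      }
    }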

In my opinion, making today's "on-demand" info schema, which re-fetches schema metadata from sources whenever a query requests it, more efficient is the right place to start.  Rewriting it on EVF2 would, I understand, gain it limit push down support for free, though filter push down seems more likely to be helpful on this kind of data.  There is also no reason I can see for the info schema not to fetch schema metadata from plugins concurrently.  I don't know whether this would best be achieved by explicit programming of the concurrency (see the sketch below), or by making the info schema look "splittable" to Drill so that multiple fragments get created.
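
To make the explicit-concurrency option concrete, a rough Java sketch follows.  The SchemaPlugin interface here is hypothetical shorthand, not Drill's real SchemaFactory machinery, and the timeout is just one possible policy for coping with flaky plugins.

    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical stand-in for a storage plugin's schema discovery.
    interface SchemaPlugin {
      String name();
      Map<String, Object> fetchSchemas();  // potentially slow remote call
    }

    class ConcurrentSchemaRegistrar {
      // Fetch schema metadata from all enabled plugins in parallel, bounding
      // the wait so one flaky plugin cannot stall the whole info schema query.
      static Map<String, Map<String, Object>> registerAll(
          List<SchemaPlugin> plugins, long timeoutSeconds)
          throws InterruptedException {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, plugins.size()));
        Map<String, Map<String, Object>> result = new ConcurrentHashMap<>();
        CountDownLatch done = new CountDownLatch(plugins.size());
        try {
          for (SchemaPlugin plugin : plugins) {
            pool.submit(() -> {
              try {
                result.put(plugin.name(), plugin.fetchSchemas());
              } finally {
                done.countDown();
              }
            });
          }
          // Plugins that miss the deadline are simply absent from the result.
          done.await(timeoutSeconds, TimeUnit.SECONDS);
        } finally {
          pool.shutdownNow();
        }
        return result;
      }
    }

The "splittable" alternative would instead let Drill's existing parallelizer do this scheduling for us by creating one fragment per plugin.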

Lastly, I'm generally against introducing any sort of results caching, data or metadata, except in special circumstances such as when the planner can be certain that the underlying data has not changed (seldom or never the case for Drill because it doesn't control its own storage layer).  I think that databases, reliable ones anyway, tend to shun results caching and push it to the application layer, since only that layer can decide what kind of staleness is acceptable, but correct me if I'm wrong.  My conclusion here is that I'd rather do this last, and only after careful consideration.

[1] https://infra.apache.org/mailing-list-moderation.html
[2] https://www.apache.org/foundation/how-it-works.html#management
[3] https://github.com/apache/drill/pull/2388

On 2022/02/07 21:05, Ted Dunning wrote:
Another option is to store metadata as data in a distributed data store.
For static resources, that can scale very well. For highly dynamic
resources like conventional databases behind JDBC connections, you can
generally delegate metadata to that layer. Performance for delegated
metadata won't necessarily be great, but those systems are usually either
small (like Postgres or MySQL) or fading away (like Hive).

Concentrating metadata and planning on a single node will make query concurrency
much worse (and it's already not good).


On Sun, Feb 6, 2022 at 6:28 PM Paul Rogers <par0...@gmail.com> wrote:

Hi All,

Drill, like all open source projects, exists to serve those that use it. To
that end, the best contributions come when some company needs a feature
badly enough that it is worth the effort to develop and contribute a
solution. That's pretty standard, as long as the contribution is general
purpose. In fact, I hope everyone using Drill in support of their company
will contribute enhancements back to Drill. If you maintain your own
private fork, you're not helping the community that provided you with the
bulk of the code.

For the info schema, I'm at a loss to guess why this would be slow, unless
every plugin is going off and scanning some external source. Knowing that
we have a dozen plugins is not slow. Looking at plugin configs is not slow.
What could be slow is if you want to know about every possible file in HDFS
or S3, every database and table in an external DB, etc. In this case, the
bottleneck is either the external system, or the act of querying a dozen
different external systems. Perhaps, Charles, you can elaborate on the
specific scenario you have in mind.

Depending on the core issues, there are various solutions. One solution is
to cache all the external metadata in Drill. That's what Impala did with
the Hive Metastore, and it was a mess. I don't expect Drill would do any
better a job. One reason it was a mess is that, in a production system,
there is a vast amount of metadata. You end up playing all manner of tricks
to try to compress it. Since Drill (and Impala) are fully symmetric, each
node has to hold the entire cache. That is memory that can't be used to run
queries. So, to gain performance (for metadata) you give up performance (at
run time).

One solution is to create a separate metadata cache node. The query goes to
some Drillbit that acts as Foreman. The Foreman plans the query and
retrieves the needed metadata from the metadata node. The challenge here is
that there will be a large amount of metadata transferred, and the next
thing we know we'll want to cache it in each Drillbit, putting us back
where we started.

So, one can go another step: shift all query planning to the metadata node
and have a single planner node. The user connects to any Drillbit as
Foreman, but that Foreman first talks to the "planner/metadata" node to
give it SQL and get back a plan. The Foreman then runs the plan as usual.
(The Foreman runs the root fragment of the plan, which can be compute
intensive, so we don't want the planner node to also act as the Foreman.)
The notion here is that the SQL in/plan out is much smaller than the
metadata that is needed to compute the plan.

The idea about metadata has long been that Drill should provide a metadata
API. The Drill metastore should be seen as just one of many metadata
implementations. The Drill metastore is a "starter solution" for those who
have not already invested in another solution. (Many shops have HMS or
AWS Glue, which is Amazon's version of HMS, or one of the newer
metadata/catalog solutions.)

One can go even further. Consider file and directory pruning in HMS. Every
tool has to do the exact same thing: given a set of predicates, find the
directories and files that match. Impala does it. Spark must do it.
Presto/Trino probably does it. Drill, when operating in Hive/HMS mode, must
do it. Maybe someone has come up with the One True Metadata Pruner and Drill
can just delegate the task to that external tool, and get back the list of
directories and files to scan. Far better than building yet another pruner.
(I think Drill currently has two Parquet metadata pruners, duplicating what
many other tools have done.)

If we see the source of metadata as pluggable, then a shop such as DDR that
has specific needs (maybe caching those external schemas), can build a
metadata plugin for that use case. If the solution is general, it can be
contributed to Drill as another metadata option.

In any case, if we can better understand the specific problem you are
encountering, we can perhaps offer more specific suggestions.

Thanks,

- Paul

On Sun, Feb 6, 2022 at 8:11 AM Charles Givre <cgi...@gmail.com> wrote:

Hi Luoc,
Thanks for your concern.  Apache projects are often backed unofficially by
a company.  Drill was, for years, backed by MapR, as evidenced by all the
MapR-unique code that is still in the Drill codebase. However, since MapR's
acquisition, I think it is safe to say that Drill really has become a
community-driven project.  While some of the committers are colleagues of
mine at DataDistillr, and Drill is a core part of DataDistillr, from our
perspective we've really just been focusing on making Drill better for
everyone as well as building the community of Drill users, regardless of
whether they use DataDistillr or not.  We haven't rejected any PRs because
they go against our business model, or tried to steer Drill against the
community, or anything like that.

Just for your awareness, there are other OSS projects, including some
Apache projects where one company controls everything.  Outside
contributions are only accepted if they fit the company's roadmap, and
there is no real community-building that happens.  From my perspective,
that is not what I want from Drill.  My personal goal is to build an
active
community of users and developers around an awesome tool.

I hope this answers your concerns.
Best,
-- C


On Feb 6, 2022, at 9:42 AM, luoc <l...@apache.org> wrote:


Before we discuss the next release, I would like to note that an Apache
project should not be directly tied to a commercial company; otherwise it
will dampen the community's motivation to contribute.
Thanks.

On Feb 6, 2022, at 21:29, Charles Givre <cgi...@gmail.com> wrote:

Hello all,
Firstly, I wanted to thank everyone for all the work that has gone into
Drill 1.20, as well as the ongoing discussion around Drill 2.0.  I wanted
to start a discussion around a topic for Drill 1.21, namely INFO_SCHEMA
improvements.  As my company wades further and further into Drill, it has
become apparent that the INFO_SCHEMA could use some attention.  James
Turton submitted a PR which was merged into Drill 1.20, but in so doing he
uncovered an entire Pandora's box of other issues which might be worth
addressing.  In a nutshell, the issues with the INFO_SCHEMA are all
performance related: it can be very slow and can consume significant
resources when executing even basic queries.
My understanding of how the info schema (IS) works is that when a user
executes an IS query, Drill will attempt to instantiate every enabled storage
plugin to discover schemata and other information. As you might imagine,
this can be costly.
So (and again, this is only meant as a conversation starter), here are
some general ideas as to how we might improve the IS:
1.  Implement a limit pushdown:  As far as I can tell, there is no
limit pushdown in the IS and this could be a relatively quick win for
improving IS query performance.
2.  Caching:  I understand that caching is tricky, but perhaps we could
add some sort of schema caching for IS queries, or make better use of the
Drill metastore to reduce the number of connections during IS queries.
Perhaps in combination with the metastore, we could implement some sort of
"metastore first" plan, whereby Drill first hits the metastore for query
results and, if the limit is reached, we're done.  If not, query the
storage plugins...
3.  Parallelization:  It did not appear to me that Drill parallelizes
IS queries.  We may be able to add some parallelization, which would
improve overall speed but not necessarily reduce overall compute cost.
4.  Convert to EVF2:  Not sure that there's a performance benefit here,
but at least we could get rid of cruft.
5.  Reduce SerDe:  I imagine there was a good reason for doing this,
but the IS seems to obtain a POJO from the storage plugin and then write
these results to old-school Drill vectors.  I'm sure there was a reason it
was done this way (or maybe not), but I have to wonder if there is a more
efficient way of obtaining the information from the storage plugin, ideally
without all the object creation.
These are just some thoughts, and I'm curious as to what the community
thinks about this.  Thanks everyone!
-- C
