Hi Paul,

Thanks so much for the feedback. Please see my responses below.

> To take a step back, recall that Arina is working on a metadata proposal. A
> key aspect of that proposal is that it provides an API so that Drill can
> connect (via an adapter) to any metadata system. The gist of my comments is
> that it would be wonderful if stats could work the same way. Provide a
> generic way to compute stats (as a query). That data can then be written to
> whatever metadata system the user wants to use, and served back via Arina's
> API.
> Here is a little ASCII diagram which, I hope, survives e-mail:
> Stats query --> custom metastore --> Metadata API --> Drill planner
> That first arrow represents a query someone runs that gathers stats, along
> with code to write the results into the custom metastore. Very powerful
> concept.

I think we are on the same page as far as the high-level idea goes! Here is
a relevant excerpt from the design doc: "As mentioned earlier in the Design
Overview, the storage API will abstract away the underlying mechanism used
to store statistics e.g. a distributed file system, database or a key-value
store. .... As a first cut, the storage will be backed by the file system
as outlined above. However, this can be backed by a persistent store in the
future. The interface will allow us to use different storage mechanism for
statistics."


> The point here was a bit more subtle. First, users (and QA) want to know
> the values of stats. Stats-as-query lets people easily play with the stats.
> I may request SELECT ndv(id) FROM ... because I want to know the NDV. The
> stats gathering mechanism may want NDV for all columns, along with other
> material.
> Second, the stats gathering logic will know what stats are wanted for that
> specific purpose. Maybe in the Drill native version, Drill guesses which
> columns should have stats.
> But, savvy users will know ahead of time (or from experience) which
> columns justify the cost of stats. If I roll my own metadata system, I may
> know that it is worth gathering stats on a small subset of columns, so I'll
> issue the query to do so.
> The key thought is, separate out the mechanism to compute stats from the
> consumption of stats. (BTW, Impala uses a plain query to compute its stats.)

The existing mechanism already allows users to specify the columns on which
they want to collect statistics. For non-trivial user workloads, though, the
choice would quickly narrow down to collecting statistics on all columns. The
proposed query format for collecting statistics is very similar to
Hive/Postgres/Oracle, so I would say it is a good starting point. There may
be merit in letting users experiment their way around as you suggested.
However, my immediate concern is to graduate the feature from `experimental`
to, say, `usable`! For that to happen we need to fix the benchmark
regressions and improve Drill query performance/stability, which I think is
a MUST for the feature to be successful. We already have the pieces lined up
(e.g. the UDAFs), so someone from the Drill community could easily take this
up, or I can do it later once we have addressed the performance issues. I
have created a JIRA for it (https://issues.apache.org/jira/browse/DRILL-6841).
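
For concreteness, here is roughly how the two styles would compare (just a
sketch; the table name is made up and the exact ANALYZE grammar may differ
from what lands in the final patch):

    -- Built-in mechanism: gather stats on a chosen subset of columns
    ANALYZE TABLE dfs.`/data/orders` COMPUTE STATISTICS (o_custkey, o_orderdate);

    -- Ad-hoc exploration: ask for a single statistic directly
    SELECT ndv(o_custkey) FROM dfs.`/data/orders`;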


> Stats (histograms, NDV) are complex to compute. Perfect: they force us to
> extend our aggregate UDF functionality so it is powerful enough that
> community members can compute aggregations just as complex as stats. We
> kill two birds with one stone.....

> True. But, imagine how much easier it will be to explain to savvy users or
> Drill developers if you can say: to see the difference between different
> histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10
> buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is,
> if we encapsulate these different types of histograms in a (complex)
> aggregate function, Drill can keep adding new ones as easily as adding an
> aggregate UDF.

> If things are tunable (the user gets to choose the histogram style and
> size, say), then being able to visualize the results will help the user
> choose wisely.

> Further, in training (or the next edition of the Drill book), we can
> explain how stats work by having people play with queries that provide the
> information.

Yes, at a high level, we are already relying on the UDAF mechanism:
statistics are computed using the existing aggregation machinery. To support
a query form like you suggested, all we need is some (non-trivial)
boilerplate code connecting these building blocks. We can also enhance
ANALYZE itself to provide exactly the functionality enhancements you
mentioned. These could be taken up as part of
(https://issues.apache.org/jira/browse/DRILL-6841).
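
For example, once that boilerplate is in place, the histogram
experimentation you describe could look something like this (histogram1,
histogram2 and the bucket-count argument are hypothetical, borrowed from
your example):

    SELECT histogram1(foo, 10) FROM dfs.`/data/t`;  -- style 1, 10 buckets
    SELECT histogram1(foo, 20) FROM dfs.`/data/t`;  -- style 1, 20 buckets
    SELECT histogram2(foo, 10) FROM dfs.`/data/t`;  -- style 2, 10 buckets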


> > e)Having said all that, the current implementation does do what you are
> > asking for (albeit not for the parallel version) which may be useful for
> > experimentation.
> Very cool. Just need the parallel version and we'll be in great shape.

Yes, this is exactly what we could address as part of
(https://issues.apache.org/jira/browse/DRILL-6841).


> This is my very point. We are basing the stats implementation on a broken
> storage format. The suggestion here is to split the problem.
> Discussion about metadata pointed out the benefit of separating the use of
> schemas, say, from their storage. Arina proposes an API that can be
> implemented in a variety of ways. Stats are similar, though inverted.
> Provide an API (a query) that computes stats; community users can store
> those stats in a variety of ways. That extension then feeds the data back
> to Drill via the metadata API.
> Again, the key point is: separate the stats computation mechanism (which
> is just a query with sophisticated aggregates) from storage (which needs to
> be pluggable).
> Then, the first out-of-the-box implementation might be based on the
> Parquet file mechanism, but it is just one implementation of possibly many.

Yes, this (separating the computation of stats from their storage) is
already taken care of in the existing implementation.
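
As a concrete illustration of that separation (a sketch; the table and
column names are made up): the computation side is just a query, and the
storage side is whatever the user points it at, e.g. a plain Drill table via
CTAS, or a custom metastore adapter consuming the same SELECT:

    CREATE TABLE dfs.tmp.`orders_stats` AS
    SELECT ndv(o_custkey) AS ndv_custkey,  -- distinct-value estimate
           count(*)       AS row_count     -- total row count
    FROM dfs.`/data/orders`;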


> Your work includes a new SQL statement to gather stats: COMPUTE STATS ON
> <table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My
> suggestion is not about that statement itself, rather about how that
> statement is implemented. The current code, if I recall, introduces a new
> execution model which gathers stats and writes a file.
> The specific suggestion is that the new ANALYZE TABLE ... syntax is
> provided. But, it is implemented as an extension to a CTAS statement: run a
> query that computes stats, and write that to some file structure. Said
> another way, ANALYZE TABLE is simply an alias for (or is rewritten to) a
> stats query plus output to Drill's preferred internal format.
> This model allows others to use queries to compute stats for the reasons
> discussed previously.

I don't think the implementation introduces a new execution model!
Essentially, all we have are UDAFs whose results are combined using some
extensions of the StreamingAgg operator. All of this is then exposed as the
ANALYZE statement.
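
Conceptually, the ANALYZE statement boils down to something close to the
following (a rough sketch, not the literal plan; the table and function
names are illustrative), with the aggregated results then serialized to the
stats file:

    SELECT ndv(col1), ndv(col2), count(*)
    FROM dfs.`/data/orders`;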

Thanks,
Gautam

On Fri, Nov 9, 2018 at 1:03 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Gautam,
>
> One follow-up clarification: I realize one point was a bit unclear.
>
> I suggested that stats be gathered by a query. By this, I simply mean that
> stats use the existing query mechanism with, perhaps, an enhanced UDAF
> (user-defined aggregate function) API.
>
> Your work includes a new SQL statement to gather stats: COMPUTE STATS ON
> <table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My
> suggestion is not about that statement itself, rather about how that
> statement is implemented. The current code, if I recall, introduces a new
> execution model which gathers stats and writes a file.
>
> The specific suggestion is that the new ANALYZE TABLE ... syntax is
> provided. But, it is implemented as an extension to a CTAS statement: run a
> query that computes stats, and write that to some file structure. Said
> another way, ANALYZE TABLE is simply an alias for (or is rewritten to) a
> stats query plus output to Drill's preferred internal format.
>
> This model allows others to use queries to compute stats for the reasons
> discussed previously.
>
> I hope this clarifies things a bit...
>
> Also, to be clear, I'm just tossing out ideas to make Drill as useful as
> possible. Whatever we do, would be good to get the existing version into
> the code base so folks can play with it.
>
> Thanks,
> - Paul
>
>
>
>     On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers <
> par0...@yahoo.com> wrote:
>
>  Hi Gautam,
>
> Thanks much for the explanations. You raise some interesting points. I
> noticed that Boaz has just filed a JIRA ticket to tackle the inefficient
> count distinct case.
>
> To take a step back, recall that Arina is working on a metadata proposal.
> A key aspect of that proposal is that it provides an API so that Drill can
> connect (via an adapter) to any metadata system. The gist of my comments is
> that it would be wonderful if stats could work the same way. Provide a
> generic way to compute stats (as a query). That data can then be written to
> whatever metadata system the user wants to use, and served back via Arina's
> API.
>
> Here is a little ASCII diagram which, I hope, survives e-mail:
>
> Stats query --> custom metastore --> Metadata API --> Drill planner
>
> That first arrow represents a query someone runs that gathers stats, along
> with code to write the results into the custom metastore. Very powerful
> concept.
>
> Detailed comments below.
>
> > a)This approach offloads the hard task of figuring out which statistics
> > are needed for which columns based on the user workload and then adapting
> > to changes in the workload! This may be useful for experimenting, but not
> > in practice.
>
> The point here was a bit more subtle. First, users (and QA) want to know
> the values of stats. Stats-as-query lets people easily play with the stats.
> I may request SELECT ndv(id) FROM ... because I want to know the NDV. The
> stats gathering mechanism may want NDV for all columns, along with other
> material.
>
> Second, the stats gathering logic will know what stats are wanted for that
> specific purpose. Maybe in the Drill native version, Drill guesses which
> columns should have stats.
>
> But, savvy users will know ahead of time (or from experience) which
> columns justify the cost of stats. If I roll my own metadata system, I may
> know that it is worth gathering stats on a small subset of columns, so I'll
> issue the query to do so.
>
> The key thought is, separate out the mechanism to compute stats from the
> consumption of stats. (BTW, Impala uses a plain query to compute its stats.)
>
> > b)Please note this approach would also require making additional
> > Calcite/Drill code changes to generate the correct plans (serial and
> > parallel). Currently, we bypass these changes and directly generate the
> > physical plan. ...
>
> Excellent. The result would be the ability for Drill to generate efficient
> plans for complex stats: whither those stats are requested by the user or
> by Drill itself. Stats are easier to add: just add the necessary (probably
> complex) functions and the planner takes care of the rest. There are not
> two distinct code paths to reason about and maintain.
>
> > c)Exposing all such statistics may not be useful for the users e.g. for
> > NDV we save the HLL structure which allows us to parallelize the plans
> > and compute NDV efficiently and in the future will allow us to compute
> > partition-wise statistics.
>
> Stats (histograms, NDV) are complex to compute. Perfect: they force us to
> extend our aggregate UDF functionality so it is powerful enough that
> community members can compute aggregations just as complex as stats. We
> kill two birds with one stone.
>
> > d)Consider, in the future, we add Histograms. There are several different
> > kinds of histograms with trade-offs and we may decide to generate one or
> > the other based on `ndv` values etc. We cannot expect the user to figure
> > all this out on their own.
>
> True. But, imagine how much easier it will be to explain to savvy users or
> Drill developers if you can say: to see the difference between different
> histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10
> buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is,
> if we encapsulate these different types of histograms in a (complex)
> aggregate function, Drill can keep adding new ones as easily as adding an
> aggregate UDF.
>
> If things are tunable (the user gets to choose the histogram style and
> size, say), then being able to visualize the results will help the user
> choose wisely.
>
> Further, in training (or the next edition of the Drill book), we can
> explain how stats work by having people play with queries that provide the
> information.
>
> > e)Having said all that, the current implementation does do what you are
> > asking for (albeit not for the parallel version) which may be useful for
> > experimentation.
>
> Very cool. Just need the parallel version and we'll be in great shape.
>
>
> > Yes, the current approach of storing statistics is emulated from Parquet
> > metadata. Even though it is riddled with concurrency issues, it does not
> > do any worse. Hence, I would contend that it is a good starting point.
> > Once the meta-store work is complete I plan to integrate statistics with
> > it and leverage all the great benefits that come with this approach.
>
> This is my very point. We are basing the stats implementation on a broken
> storage format. The suggestion here is to split the problem.
>
> Discussion about metadata pointed out the benefit of separating the use of
> schemas, say, from their storage. Arina proposes an API that can be
> implemented in a variety of ways. Stats are similar, though inverted.
> Provide an API (a query) that computes stats; community users can store
> those stats in a variety of ways. That extension then feeds the data back
> to Drill via the metadata API.
>
> Again, the key point is: separate the stats computation mechanism (which
> is just a query with sophisticated aggregates) from storage (which needs to
> be pluggable).
>
>
> Then, the first out-of-the-box implementation might be based on the
> Parquet file mechanism, but it is just one implementation of possibly many.
>
> You may point out a very reasonable fact: that the code path you describe
> already exists, and so it is less costly, today, to just use that path. You
> might argue that you are not in a position to start over. Very useful to
> play with what we have. But, let's also keep in mind where Drill wants to
> go: that it is unique in its modularity and its ability to integrate with a
> wide variety of systems.
>
> Thanks,
> - Paul
>
