Hi Gautam, One follow-up clarification: I realize one point was a bit unclear.
I suggested that stats be gathered by a query. By this, I simply mean that stats use the existing query mechanism with, perhaps, an enhanced UDAF (user-defined aggregate function) API. Your work includes a new SQL statement to gather stats: COMPUTE STATS ON <table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My suggestion is not about that statement itself, rather about how that statement is implemented. The current code, if I recall, introduces a new execution model which gathers stats and writes a file. The specific suggestion is that the new ANALYZE TABLE ... syntax is provided. But, it is implemented as an extension to a CTAS statement: run a query that computes stats, and write that to some file structure. Said another way, ANALYZE TABLE is simply an alias for (or is rewritten to) a stats query plus output to Drill's preferred internal format. This model allows others to use queries to compute stats for the reasons discussed previously. I hope this clarifies things a bit... Also, to be clear, I'm just tossing out ideas to make Drill as useful as possible. Whatever we do, would be good to get the existing version into the code base so folks can play with it. Thanks, - Paul On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers <par0...@yahoo.com> wrote: Hi Gautam, Thanks much for the explanations. You raise some interesting points. I noticed that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct case. To take a step back, recall that Arina is working on a metadata proposal. A key aspect of that proposal is that it provides an API so that Drill can connect (via an adapter) to any metadata system. The gist of my comments is that it would be wonderful if stats could work the same way. Provide a generic way to compute stats (as a query). That data can then be written to whatever metadata system the user wants to use, and served back via Arina's API. Here is a little ASCII diagram which, I hope, survives e-mail: Stats query --> custom metastore --> Metadata API --> Drill planner That first arrow represents a query someone runs that gathers stats, along with code to write the results into the custom metastore. Very powerful concept. Detailed comments below. > a)This approach offloads the hard task of figuring out which statistics are needed for which columns based on the user workload and then adapting to changes in the workload! This may be useful for experimenting, but not in practice. The point here was a bit more subtle. First, users (and QA) want to know the values of stats. Stats-as-query lets people easily play with the stats. I may request SELECT ndv(id) FROM ... because I want to know the NDV. The stats gathering mechanism may want NDV for all columns, along with other material. Second, the stats gathering logic will know what stats are wanted for that specific purpose. Maybe in the Drill native version, Drill guesses which columns should have stats. But, savvy users will know ahead of time (or from experience) which columns justify the cost of stats. If I roll my own metadata system, I may know that it is worth gathering stats on a small subset of columns, so I'll issue the query to do so. The key thought is, separate out the mechanism to compute stats from the consumption of stats. (BTW, Impala uses a plain query to compute its stats.) > b)Please note this approach would also require making additional Calcite/Drill code changes to generate the correct plans (serial and parallel). Currently, we bypass these changes and directly generate the physical plan. ... Excellent. The result would be the ability for Drill to generate efficient plans for complex stats: whither those stats are requested by the user or by Drill itself. Stats are easier to add: just add the necessary (probably complex) functions and the planner takes care of the rest. There are not two distinct code paths to reason about and maintain. > c)Exposing all such statistics may not be useful for the users e.g. for NDV we save the HLL structure which allows us to parallelize the plans and compute NDV efficiently and in the future will allow us to compute partition-wise statistics. Stats (histograms, NDV) are complex to compute. Perfect: they force us to extend our aggregate UDF functionality so it is powerful enough that community members can compute aggregations just as complex as stats. We kill two birds with one stone. > d)Consider, in the future, we add Histograms. There are several different kinds of histograms with trade-offs and we may decide to generate one or the other based on `ndv` values etc. We cannot expect the user to figure all this out on their own. True. But, imagine how much easier it will be to explain to savvy users or Drill developers if you can say: to see the difference between different histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10 buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is, if we encapsulate these different types of histograms in a (complex) aggregate function, Drill can keep adding new ones as easily as adding an aggregate UDF. If things are tunable (the user gets to choose the histogram style and size, say), then being able to visualize the results will help the user choose wisely. Further, in training (or the next edition of the Drill book), we can explain how stats work by having people play with queries that provide the information. > e)Having said all that, the current implementation does do what you are asking for (albeit not for the parallel version) which may be useful for experimentation. Very cool. Just need the parallel version and we'll be in great shape. > Yes, the current approach of storing statistics is emulated from Parquet metadata. Even though it is riddled with concurrency issues, it does not do any worse. Hence, I would contend that it is a good starting point. Once, the meta-store work is complete I plan to integrate statistics with it and leverage all the great benefits that come with this approach. This is my very point. We are basing the stats implementation on a broken storage format. The suggestion here is to split the problem. Discussion about metadata pointed out the benefit of separating the use of schemas, say, from their storage. Arina proposes an API that can be implemented in a variety of ways. Stats are similar, though inverted. Provide an API (a query) that computes stats; community users can store those stats in a variety of ways. That extension then feeds the data back to Drill via the metadata API. Again, they key point is: separate the stats computation mechanism (which is just a query with sophisticated aggregates) from storage (which needs to be pluggable.) Then, the first out-of-the-box implementation might be based on the Parquet file mechanism, but it is just one implementation of possibly many. You may point out a very reasonable fact: that the code path you describe already exists, and so it is less costly, today, to just use that path. You should argue that you are not in a position to start over. Very useful to play with what we have. But, let's also keep in mind where Drill wants to go: that it is unique in its modularity and its ability to integrate with a wide variety of systems. Thanks, - Paul