Re: [DISCUSS] Resurrect support for Table Statistics in Drill

Paul Rogers Tue, 06 Nov 2018 12:05:00 -0800

Hi All,

Stats would be a great addition. Here are a couple of issues that came up in 
the earlier code review, revisited in light of recent proposed work.

First, the code to gather the stats is rather complex; it is the evolution of 
some work an intern did way back when. We'd be advised to find a simpler 
implementation, ideally one that uses mechanisms we already have.

Second, at present, we have no good story for storing the stats. The file-based 
approach is similar to that used for Parquet metadata, and there are many known 
concurrency issues with that approach -- it is not something to emulate.

One possible approach is to convert metadata gathering to a plain old query. 
That is, rather than having a special mechanism to gather stats, just add 
functions in Drill. Maybe we want NDV and a histogram. (Can't recall all the 
stats that Gautam implemented.) Just implement them as new functions:

SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM myTable;

The above would simply display the stats (with the histogram presented as a 
Drill array with 10 buckets.)
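
As a rough illustration of what such a query would return for one column, here
is a minimal Python sketch. It is not Drill code: the function names are
hypothetical, NDV is computed exactly with a set (a real implementation would
more likely use an approximate structure), and the histogram is assumed to be
equi-width, which the proposal above does not specify.

```python
from typing import List, Optional, Sequence


def ndv(values: Sequence[Optional[float]]) -> int:
    """Exact number of distinct non-null values."""
    return len({v for v in values if v is not None})


def histogram(values: Sequence[Optional[float]], buckets: int) -> List[int]:
    """Equi-width bucket counts over [min, max] (an assumption, not Drill's definition)."""
    data = [v for v in values if v is not None]
    if not data or buckets <= 0:
        return [0] * max(buckets, 0)
    lo, hi = min(data), max(data)
    width = (hi - lo) / buckets
    counts = [0] * buckets
    for v in data:
        # The maximum value is clamped into the last bucket;
        # a zero width means every value is equal, so use bucket 0.
        idx = 0 if width == 0 else min(int((v - lo) / width), buckets - 1)
        counts[idx] += 1
    return counts
```

The histogram comes back as a plain list of counts, one per bucket, which is
the closest Python analogue to the Drill array mentioned above.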

Such an approach could build on the aggregation mechanism that already exists, 
and would avoid the use of the complex map structure in the current PR. It 
would also give QA and users an easy way to check the stats values.
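
To make the aggregation idea a bit more concrete, the following Python sketch
models NDV as an ordinary aggregate with an add / merge / output life cycle, so
that partial results from separate batches can be combined. The class and
method names are hypothetical and do not correspond to Drill's actual aggregate
function interface; an exact set stands in for whatever structure a real
implementation would use.

```python
from typing import Hashable, Iterable, Optional, Set


class NdvAccumulator:
    """Hypothetical aggregate state for a number-of-distinct-values function."""

    def __init__(self) -> None:
        self._seen: Set[Hashable] = set()

    def add(self, value: Optional[Hashable]) -> None:
        # Nulls are skipped, as most SQL aggregates do.
        if value is not None:
            self._seen.add(value)

    def merge(self, other: "NdvAccumulator") -> None:
        # Combine a partial result produced elsewhere.
        self._seen |= other._seen

    def output(self) -> int:
        return len(self._seen)


def aggregate(batches: Iterable[Iterable[Optional[Hashable]]]) -> int:
    """Build one partial accumulator per batch, then merge them into a total."""
    total = NdvAccumulator()
    for batch in batches:
        partial = NdvAccumulator()
        for value in batch:
            partial.add(value)
        total.merge(partial)
    return total.output()
```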

Later, when the file problem is solved, or the metastore is available, some 
process can kick off a query of the appropriate form and write the results to 
the metastore in a concurrency-safe way. And, a COMPUTE STATS command would 
just be a wrapper around the above query along with writing the stats to some 
location.
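
A wrapper of this kind might look roughly like the following Python sketch.
Everything in it is hypothetical: build_stats_query only generates a query of
the form shown earlier in this message, while run_query and write_stats are
caller-supplied placeholders for a Drill client and for whatever
concurrency-safe store eventually exists. The ndv and histogram functions in
the generated SQL are the proposed ones, not functions assumed to exist today.

```python
from typing import Any, Callable, Dict, List, Sequence


def build_stats_query(table: str, columns: Sequence[str], buckets: int = 10) -> str:
    """Generate the stats query; identifiers are used as-is (no quoting or escaping)."""
    if not columns:
        raise ValueError("at least one column is required")
    exprs: List[str] = []
    for col in columns:
        exprs.append(f"ndv({col})")
        exprs.append(f"histogram({col}, {buckets})")
    return f"SELECT {', '.join(exprs)} FROM {table}"


def compute_stats(
    table: str,
    columns: Sequence[str],
    run_query: Callable[[str], Sequence[Any]],
    write_stats: Callable[[str, Dict[str, Dict[str, Any]]], None],
    buckets: int = 10,
) -> Dict[str, Dict[str, Any]]:
    """Run the stats query and hand the single result row to the store."""
    row = run_query(build_stats_query(table, columns, buckets))
    stats: Dict[str, Dict[str, Any]] = {}
    for i, col in enumerate(columns):
        # Each column contributes two adjacent values: its NDV, then its histogram.
        stats[col] = {"ndv": row[2 * i], "histogram": row[2 * i + 1]}
    write_stats(table, stats)
    return stats
```

Keeping the query builder separate from the storage callback means the same
query can be run by hand for checking, and the storage side can change later
without touching how the stats are computed.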

Just my two cents...

Thanks,
- Paul

    On Tuesday, November 6, 2018, 2:51:35 AM PST, Vitalii Diravka 
<vita...@apache.org> wrote:  

 +1
It will help to rely on that code in the process of implementing Drill
Metastore, DRILL-6552.

@Gautam Please address all current commits and rebase onto latest master,
then Vova and I will do an additional review of it.
Just for clarification, am I right that the state of the changes is the same
as in the last comment in DRILL-1328 [1]
(it will not include histograms and will cause some regressions for the TPC-H
and TPC-DS benchmarks)?

[1]
https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374

Kind regards
Vitalii

On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra <par...@apache.org> wrote:

> +1
> I'd say go for it.
> If the option to use enhanced stats can be turned on per session, then users
> can experiment and choose to turn it on for queries where they do not
> experience performance degradation.
>
>
> On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai <gpa...@mapr.com> wrote:
>
> > Hi all,
> >
> > I had an initial implementation for statistics support for Drill
> > [DRILL-1328] <https://issues.apache.org/jira/browse/DRILL-1328>. This
> > JIRA has links to the design spec as well as the PR. Unfortunately,
> > because of some regressions on performance benchmarks (TPCH/TPCDS) we
> > decided to temporarily shelve the implementation. I would like to
> > resolve the pending issues and get the changes in.
> >
> > Hopefully, it will be okay to merge it in as an experimental feature
> > since in order to resolve these issues we may need to change the
> > existing join ordering algorithm in Drill, add support for Histograms
> > and address a few other planning-related issues. Moreover, the
> > community is adding a meta-store for Drill [DRILL-6552]
> > <https://issues.apache.org/jira/browse/DRILL-6552>. Statistics should
> > also be able to leverage the brand new meta-store instead of/in
> > addition to having a custom store implementation.
> >
> > My plan is to address the most critical review comments and get the
> > initial version in as an experimental feature. Some other good-to-have
> > aspects like handling schema changes during the statistics collection
> > process may be deferred to the next iteration. Subsequently, I will
> > improve these good-to-have features and make additional performance
> > improvements. It would be great to get the initial implementation in to
> > avoid the rebase issues and allow other community members to use and
> > contribute to the feature.
> >
> > Please take a look at the design doc and the PR and provide suggestions
> > and feedback on the JIRA. Also I will try to present the current state
> > of statistics and the feature in one of the bi-weekly Drill Community
> > Hangouts.
> >
> > Thanks,
> > Gautam
> >
>
