Hi Paul,

Thank you so much for the feedback. Please see my responses below.

> First, the code to gather the stats is rather complex; it is the evolution
> of some work an intern did way back when. We'd be advised to find a simpler
> implementation, ideally one that uses mechanisms we already have.
> One possible approach is to convert metadata gathering to a plain old
> query. That is, rather than having a special mechanism to gather stats,
> just add functions in Drill. Maybe we want NDV and a histogram. (Can't
> recall all the stats that Gautam implemented.) Just implement them as new
> functions:
> SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM
> myTable;
> The above would simply display the stats (with the histogram presented as
> a Drill array with 10 buckets.)
> Such an approach could build on the aggregation mechanism that already
> exists, and would avoid the use of the complex map structure in the current
> PR. It would also give QA and users an easy way to check the stats values.


a) This approach offloads onto the user the hard task of figuring out which
statistics are needed for which columns based on the workload, and then
adapting as the workload changes! This may be useful for experimenting, but
not in practice.
b) Please note that this approach would also require additional
Calcite/Drill code changes to generate correct plans (serial and parallel).
Currently, we bypass these changes and directly generate the physical plan.
For example, if we said "let us write the simplest query we can for getting
the number of distinct values and be done with it", we would end up with
the plan below.

select count(distinct employee_id), count(distinct full_name) from
cp.`employee.json`;

00-00    Screen
00-01      Project(EXPR$0=[$0], EXPR$1=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
00-06            HashAgg(group=[{0}])
00-08              Scan(table=[[cp, employee.json]],
                     groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json,
                     numFiles=1, columns=[`employee_id`], files=[classpath:/employee.json]]])
00-03          StreamAgg(group=[{}], EXPR$1=[COUNT($0)])
00-05            HashAgg(group=[{0}])
00-07              Scan(table=[[cp, employee.json]],
                     groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json,
                     numFiles=1, columns=[`full_name`], files=[classpath:/employee.json]]])


It is evident that such a plan will perform poorly in practice!
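To make the contrast concrete, here is a toy sketch (plain Python, purely illustrative, not Drill code; the rows and column names are made up) of what a dedicated stats mechanism does instead: it keeps per-column state and updates all of it in a single scan, rather than scanning the table once per column and joining the results as in the plan above.

```python
# Illustrative sketch: computing NDV for several columns in ONE pass
# over the data, instead of one scan per column joined together.
rows = [
    {"employee_id": 1, "full_name": "Sheri Nowmer"},
    {"employee_id": 2, "full_name": "Derrick Whelply"},
    {"employee_id": 2, "full_name": "Derrick Whelply"},  # duplicate row
]

def ndv_single_pass(rows, columns):
    # One piece of per-column state, updated row by row: a single scan.
    seen = {c: set() for c in columns}
    for row in rows:  # the only pass over the table
        for c in columns:
            seen[c].add(row[c])
    return {c: len(s) for c, s in seen.items()}

print(ndv_single_pass(rows, ["employee_id", "full_name"]))
# {'employee_id': 2, 'full_name': 2}
```

A real implementation would of course use bounded-memory sketches (like HLL) rather than exact sets, but the plan shape is the point: one scan feeding all the per-column aggregates.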

c) Exposing all such statistics may not be useful to users. For example,
for NDV we save the underlying HLL structure, which allows us to
parallelize the plans and compute NDV efficiently, and will in the future
allow us to compute partition-wise statistics.
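To illustrate why keeping the HLL structure matters (a toy, hypothetical HyperLogLog-style sketch; Drill uses a real HLL library, and the register count and hashing below are made up for illustration): sketches built independently on different fragments or partitions merge losslessly, which is what makes the parallel and partition-wise plans possible.

```python
import hashlib

# Toy HyperLogLog-style sketch (illustrative only). The key property:
# sketches are mergeable, so each fragment/partition can build its own
# and the results combine without rescanning the data.
NUM_REGISTERS = 64  # real HLLs tune this for accuracy

def empty_sketch():
    return [0] * NUM_REGISTERS

def add(sketch, value):
    h = int(hashlib.sha256(str(value).encode()).hexdigest(), 16)
    bucket = h % NUM_REGISTERS
    rest = h // NUM_REGISTERS
    # rank = number of trailing zero bits + 1
    rank = 1
    while rest % 2 == 0 and rank < 64:
        rest //= 2
        rank += 1
    sketch[bucket] = max(sketch[bucket], rank)

def merge(a, b):
    # Register-wise max: this is what lets stats be computed per
    # partition and combined later.
    return [max(x, y) for x, y in zip(a, b)]

# Two "partitions" build sketches independently (with overlap)...
s1, s2 = empty_sketch(), empty_sketch()
for v in range(0, 500):
    add(s1, v)
for v in range(250, 750):
    add(s2, v)
# ...and the merged sketch is identical to one built from all the data.
combined = merge(s1, s2)
whole = empty_sketch()
for v in range(0, 750):
    add(whole, v)
assert combined == whole
```

Exposing only the final `ndv` number would throw this mergeability away.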
d) Consider that, in the future, we add histograms. There are several
different kinds of histograms with different trade-offs, and we may decide
to generate one kind or another based on the `ndv` values, etc. We cannot
expect users to figure all this out on their own.
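As a purely hypothetical illustration of that kind of decision (the threshold, function names, and histogram types below are made up, not a Drill design), the planner-side logic might look like:

```python
# Hypothetical sketch: choose a histogram type based on NDV.
def build_histogram(values, num_buckets=10, ndv_threshold=32):
    ndv = len(set(values))
    if ndv <= ndv_threshold:
        # Low NDV: an exact frequency histogram is cheap and precise.
        counts = {}
        for v in values:
            counts[v] = counts.get(v, 0) + 1
        return ("frequency", counts)
    # High NDV: equi-depth buckets (each holds roughly the same row
    # count), which handle skewed data better than equi-width buckets.
    ordered = sorted(values)
    step = len(ordered) / num_buckets
    bounds = [ordered[int(i * step)] for i in range(num_buckets)]
    bounds.append(ordered[-1])
    return ("equi-depth", bounds)

kind, _ = build_histogram([1, 2, 2, 3] * 5)
print(kind)  # frequency
kind, _ = build_histogram(list(range(1000)))
print(kind)  # equi-depth
```

The point is that this choice depends on other statistics, so it belongs in the system, not in a query the user writes by hand.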
e) Having said all that, the current implementation does do what you are
asking for (albeit not for the parallel version), which may be useful for
experimentation.

select ndv(gender), ndv(full_name) from cp.`employee.json`;

+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 2       | 1155    |
+---------+---------+

select hll(gender), hll(full_name) from cp.`employee.json`;

+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
\x00\x00\x00\x14\x00\x0A\xAA\xAC\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\
....


However, the issues mentioned above are the reason why we would not want
to do so.




Second, at present, we have no good story for storing the stats. The
> file-based approach is similar to that used for Parquet metadata, and there
> are many known concurrency issues with that approach -- it is not something
> to emulate.
> Later, when the file problem is solved, or the metastore is available,
> some process can kick off a query of the appropriate form and write the
> results to the metastore in a concurrency-safe way. And, a COMPUTE STATS
> command would just be a wrapper around the above query along with writing
> the stats to some location.


Yes, the current approach to storing statistics is modeled on the Parquet
metadata mechanism. Even though that mechanism is riddled with concurrency
issues, this approach does no worse, so I would contend it is a reasonable
starting point. Once the metastore work is complete, I plan to integrate
statistics with it and leverage the benefits that come with that approach.
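As an aside, one common mitigation in the interim (a sketch, assuming a POSIX-style filesystem; the path and JSON layout below are made up, not Drill's actual stats format) is the write-temp-then-rename pattern, which at least keeps readers from ever seeing a partially written stats file:

```python
import json
import os
import tempfile

def write_stats_atomically(stats, dest_path):
    # Write to a temp file in the same directory, then atomically
    # rename it over the destination. Readers see either the old file
    # or the new one, never a partial write. (This does not solve every
    # concurrency problem -- two concurrent writers still race on who
    # wins -- but last-writer-wins is at least a consistent outcome.)
    dest_dir = os.path.dirname(dest_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(stats, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, dest_path)  # atomic rename on POSIX
    except BaseException:
        os.unlink(tmp_path)
        raise

stats = {"employee_id": {"ndv": 1155}}
write_stats_atomically(stats, "/tmp/.drill.stats.json")
with open("/tmp/.drill.stats.json") as f:
    print(json.load(f))
```

The metastore will subsume this, but it shows the kind of concurrency-safe write Paul describes.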

Thanks,
Gautam

On Tue, Nov 6, 2018 at 12:04 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi All,
>
> Stats would be a great addition. Here are a couple of issues that came up
> in the earlier code review, revisited in light of recent proposed work.
>
> First, the code to gather the stats is rather complex; it is the evolution
> of some work an intern did way back when. We'd be advised to find a simpler
> implementation, ideally one that uses mechanisms we already have.
>
> Second, at present, we have no good story for storing the stats. The
> file-based approach is similar to that used for Parquet metadata, and there
> are many known concurrency issues with that approach -- it is not something
> to emulate.
>
> One possible approach is to convert metadata gathering to a plain old
> query. That is, rather than having a special mechanism to gather stats,
> just add functions in Drill. Maybe we want NDV and a histogram. (Can't
> recall all the stats that Gautam implemented.) Just implement them as new
> functions:
>
> SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM
> myTable;
>
> The above would simply display the stats (with the histogram presented as
> a Drill array with 10 buckets.)
>
> Such an approach could build on the aggregation mechanism that already
> exists, and would avoid the use of the complex map structure in the current
> PR. It would also give QA and users an easy way to check the stats values.
>
> Later, when the file problem is solved, or the metastore is available,
> some process can kick off a query of the appropriate form and write the
> results to the metastore in a concurrency-safe way. And, a COMPUTE STATS
> command would just be a wrapper around the above query along with writing
> the stats to some location.
>
> Just my two cents...
>
> Thanks,
> - Paul
>
>
>
>     On Tuesday, November 6, 2018, 2:51:35 AM PST, Vitalii Diravka <
> vita...@apache.org> wrote:
>
>  +1
> It will help to rely on that code in the process of implementing Drill
> Metastore, DRILL-6552.
>
> @Gautam Please address all current comments and rebase onto the latest
> master, then Vova and I will do an additional review.
> Just for clarification, am I right, the changes state is the same as in
> last comment in DRILL-1328 [1]
> (will not include histograms and will cause some regressions for TPC-H and
> TPC-DS benchmarks)?
>
> [1]
>
> https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374
>
>
> Kind regards
> Vitalii
>
>
> On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra <par...@apache.org> wrote:
>
> > +1
> > I'd say go for it.
> > If the option to use enhanced stats can be turned on per session, then
> users
> > can experiment and choose to turn it on for queries where they do not
> > experience performance degradation.
> >
> >
> > On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai <gpa...@mapr.com> wrote:
> >
> > > Hi all,
> > >
> > > I had an initial implementation for statistics support for Drill
> > > [DRILL-1328] <
> https://issues.apache.org/jira/browse/DRILL-1328>.
> This
> > JIRA
> > > has links to the design spec as well as the PR. Unfortunately, because
> of
> > > some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> > > temporarily shelve the implementation. I would like to resolve the
> > pending
> > > issues and get the changes in.
> > >
> > > Hopefully, it will be okay to merge it in as an experimental feature
> > since
> > > in order to resolve these issues we may need to change the existing
> join
> > > ordering algorithm in Drill, add support for Histograms and a few other
> > > planning related issues. Moreover, the community is adding a meta-store
> > for
> > > Drill [DRILL-6552] <
> https://issues.apache.org/jira/browse/DRILL-6552
> >.
> > > Statistics should also be able to leverage the brand new meta-store
> > instead
> > > of/in addition to having a custom store implementation.
> > >
> > > My plan is to address the most critical review comments and get the
> > initial
> > > version in as an experimental feature. Some other good-to-have aspects
> > like
> > > handling schema changes during the statistics collection process may be
> > > deferred to the next iteration. Subsequently, I will add these
> > > good-to-have features and make additional performance improvements. It would
> > be
> > > great to get the initial implementation in to avoid the rebase issues
> and
> > > allow other community members to use and contribute to the feature.
> > >
> > > Please take a look at the design doc and the PR and provide suggestions
> > and
> > > feedback on the JIRA. Also I will try to present the current state of
> > > statistics and the feature in one of the bi-weekly Drill Community
> > > Hangouts.
> > >
> > > Thanks,
> > > Gautam
> > >
> >
>
