Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-14 Thread Vitalii Diravka
You have described a design similar to the one the Drill Metastore is going
to provide. I'm talking about an API with put and get operations for
statistics, and different storages for metadata, aka Metastore plugins.
Since that project is still in progress, it is a good time to determine the
API for storing and obtaining stats, and also the way the stats can be
stored in metadata cache files.


Kind regards
Vitalii


On Sat, Nov 10, 2018 at 1:03 PM Paul Rogers 
wrote:

> Hi Gautam,
>
> You touched on the key issue: storage. You mention that the Drill stats
> implementation learned from Oracle. Very wise: Oracle is the clear expert
> in this space.
>
> There is a very important difference, however, between Drill and Oracle.
> Oracle is a complete database including both query engine and storage.
> Drill is a query engine only. This is the issue at the heart of our
> discussion.
>
> Oracle has a tabular storage engine for relational data. Oracle uses that
> storage engine for metadata and stats. This ensures that metadata and stats
> benefit from concurrency control, transactions, crash recovery (i.e. roll
> forward/roll back), backup, and so on.
>
> Drill's equivalents are. . . (crickets.)
>
> Drill is a query engine that sits atop the storage engine of your choice.
> That is what sets Drill apart from Impala and Hive which are tightly
> coupled to HDFS, HMS, Ranger/Sentry, etc. (Spark takes a similar position
> to Drill: Spark runs on anything and has no storage, other than shuffle
> files.)
>
> As a query engine, Drill should compute stats, as you suggested. But, when
> it comes to STORING stats, Drill has nothing to say, nor should it.
>
> We currently use a broken implementation for Parquet metadata. We write
> files into the data directory (destroying directory update timestamps),
> across multiple files, with no concurrency control, no versioning, no crash
> recovery, no nothing. Run a query concurrently with Parquet metadata
> collection: things get corrupted. Run two Parquet metadata updates, things
> get really corrupted. Why? Storage is hard to get right when doing
> concurrent access and update.
>
> This is not a foundation on which to build! Oracle would not survive a day
> if it corrupted system tables when two or more users did operations at the
> same time.
>
> OK, Drill has a problem. The first step is to acknowledge it. The next is
> to look for solutions.
>
> Either Drill adds a storage engine, or it stays agnostic, leaves storage
> to an external system, and makes stats storage a plugin. Drill already
> accesses data via a plugin. This is why Drill can read HDFS, S3, Alluxio,
> Kafka, JDBC, and on and on. This is a valuable, differentiating feature. It
> is, in fact, why Drill has a place in a world dominated by Hive, Spark and
> Impala.
>
> For stats, this means that Drill does the query engine part (gather stats
> on the one hand, and consume stats for planning on the other.) But, it
> means that Drill DOES NOT attempt to store the stats. Drill relies on an
> external system for that role.
>
> Here is where the stats discussion aligns with the metadata (table schema)
> discussion. There are many ways to store metadata (including stats). In a
> RDBMS, in HMS, in files (done with MVCC or other concurrency control), in a
> key/value store and so on. All of these are more robust than the broken
> Parquet metadata file implementation.
>
> So, if stats are to be stored by an external storage system, that means
> that Drill's focus should be on APIs: how to obtain the stats from Drill to
> store them, and how to return them to Drill when requested when planning a
> query. This is exactly the same model we take with data (Drill gives data
> to HDFS to store, asks HDFS for the location of the data during planning.)
>
> This is the reason I suggested gathering stats as a query: no new API is
> needed; just issue a query using the existing Drill client. As you point
> out, perhaps Drill is in a better position to decide what stats should be
> gathered. Point taken. So, instead of using a query, define a stats API
> with both "put" and "get" interfaces.
>
> Then, of course, you can certainly create a POC implementation of the
> storage engine based on the broken Parquet metadata file format. Since it
> is just a reference implementation, the fragility of the solution can be
> forgiven.
>
> This is a very complex topic, and touches on Drill's place in the open
> source query engine world. Thanks much for having the patience to discuss
> the issues here on the dev list.
>
> What do other people think about the storage question? Is the plugin
> approach the right one? Is there some other alternative the project should
> consider? Should Drill build its own?
>
> Thanks,
> - Paul
>
>
>
> On Friday, November 9, 2018, 3:11:11 PM PST, Gautam Parai <
> gpa...@mapr.com> wrote:
>
>  Hi Paul,
>
> ...


Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-10 Thread Paul Rogers
Hi Gautam,

You touched on the key issue: storage. You mention that the Drill stats 
implementation learned from Oracle. Very wise: Oracle is the clear expert in 
this space.

There is a very important difference, however, between Drill and Oracle. Oracle 
is a complete database including both query engine and storage. Drill is a 
query engine only. This is the issue at the heart of our discussion.

Oracle has a tabular storage engine for relational data. Oracle uses that 
storage engine for metadata and stats. This ensures that metadata and stats 
benefit from concurrency control, transactions, crash recovery (i.e. roll 
forward/roll back), backup, and so on.

Drill's equivalents are. . . (crickets.)

Drill is a query engine that sits atop the storage engine of your choice. That 
is what sets Drill apart from Impala and Hive which are tightly coupled to 
HDFS, HMS, Ranger/Sentry, etc. (Spark takes a similar position to Drill: Spark 
runs on anything and has no storage, other than shuffle files.)

As a query engine, Drill should compute stats, as you suggested. But, when it 
comes to STORING stats, Drill has nothing to say, nor should it.

We currently use a broken implementation for Parquet metadata. We write files 
into the data directory (destroying directory update timestamps), across 
multiple files, with no concurrency control, no versioning, no crash recovery, 
no nothing. Run a query concurrently with Parquet metadata collection: things 
get corrupted. Run two Parquet metadata updates, things get really corrupted. 
Why? Storage is hard to get right when doing concurrent access and update.

This is not a foundation on which to build! Oracle would not survive a day if 
it corrupted system tables when two or more users did operations at the same 
time.

OK, Drill has a problem. The first step is to acknowledge it. The next is to 
look for solutions.

Either Drill adds a storage engine, or it stays agnostic, leaves storage to an 
external system, and makes stats storage a plugin. Drill already accesses data 
via a plugin. This is why Drill can read HDFS, S3, Alluxio, Kafka, JDBC, and on 
and on. This is a valuable, differentiating feature. It is, in fact, why Drill 
has a place in a world dominated by Hive, Spark and Impala.

For stats, this means that Drill does the query engine part (gather stats on 
the one hand, and consume stats for planning on the other.) But, it means that 
Drill DOES NOT attempt to store the stats. Drill relies on an external system 
for that role.

Here is where the stats discussion aligns with the metadata (table schema) 
discussion. There are many ways to store metadata (including stats). In a 
RDBMS, in HMS, in files (done with MVCC or other concurrency control), in a 
key/value store and so on. All of these are more robust than the broken Parquet 
metadata file implementation.

So, if stats are to be stored by an external storage system, that means that 
Drill's focus should be on APIs: how to obtain the stats from Drill to store 
them, and how to return them to Drill when requested during query planning. 
This is exactly the same model we take with data (Drill gives data to HDFS to 
store, asks HDFS for the location of the data during planning.)

This is the reason I suggested gathering stats as a query: no new API is 
needed; just issue a query using the existing Drill client. As you point out, 
perhaps Drill is in a better position to decide what stats should be gathered. 
Point taken. So, instead of using a query, define a stats API with both "put" 
and "get" interfaces.

Then, of course, you can certainly create a POC implementation of the storage 
engine based on the broken Parquet metadata file format. Since it is just a 
reference implementation, the fragility of the solution can be forgiven.

This is a very complex topic, and touches on Drill's place in the open source 
query engine world. Thanks much for having the patience to discuss the issues 
here on the dev list.

What do other people think about the storage question? Is the plugin approach 
the right one? Is there some other alternative the project should consider? 
Should Drill build its own?

Thanks,
- Paul

 

On Friday, November 9, 2018, 3:11:11 PM PST, Gautam Parai  
wrote:  
 
 Hi Paul,

...  

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-09 Thread Gautam Parai
Hi Paul,

Thanks so much for the feedback. Please see my responses below.

> To take a step back, recall that Arina is working on a metadata proposal. A
> key aspect of that proposal is that it provides an API so that Drill can
> connect (via an adapter) to any metadata system. The gist of my comments is
> that it would be wonderful if stats could work the same way. Provide a
> generic way to compute stats (as a query). That data can then be written to
> whatever metadata system the user wants to use, and served back via Arina's
> API.
> Here is a little ASCII diagram which, I hope, survives e-mail:
> Stats query --> custom metastore --> Metadata API --> Drill planner
> That first arrow represents a query someone runs that gathers stats, along
> with code to write the results into the custom metastore. Very powerful
> concept.

I think we are on the same page as far as the high-level idea goes! Here is
a relevant excerpt from the design doc "As mentioned earlier in the Design
Overview, the storage API will abstract away the underlying mechanism used
to store statistics e.g. a distributed file system, database or a key-value
store.  As a first cut, the storage will be backed by the file system
as outlined above. However, this can be backed by a persistent store in the
future. The interface will allow us to use different storage mechanism for
statistics."


> The point here was a bit more subtle. First, users (and QA) want to know
> the values of stats. Stats-as-query lets people easily play with the stats.
> I may request SELECT ndv(id) FROM ... because I want to know the NDV. The
> stats gathering mechanism may want NDV for all columns, along with other
> material.
> Second, the stats gathering logic will know what stats are wanted for that
> specific purpose. Maybe in the Drill native version, Drill guesses which
> columns should have stats.
> But, savvy users will know ahead of time (or from experience) which
> columns justify the cost of stats. If I roll my own metadata system, I may
> know that it is worth gathering stats on a small subset of columns, so I'll
> issue the query to do so.
> The key thought is, separate out the mechanism to compute stats from the
> consumption of stats. (BTW, Impala uses a plain query to compute its stats.)

The existing mechanism allows users to specify the columns on which they
want to collect statistics. For non-trivial user workloads, the choice
would quickly be narrowed down to collecting all statistics. The proposed
query format for collecting statistics is very similar to
Hive/Postgres/Oracle, so I would say it is a good starting point. There
may be merits to users experimenting their way around, as you suggested.
However, my immediate concern is to make the feature graduate from
`experimental` to say `usable`! For that to happen we need to fix benchmark
regressions and improve Drill query performance/stability, which I think is
a MUST to make the feature successful. We already have the pieces lined up,
e.g. UDAFs, so someone from the Drill community could easily take it up, or I
can do it later once we are able to address the performance issues. I have
created the JIRA (https://issues.apache.org/jira/browse/DRILL-6841).


Stats (histograms, NDV) are complex to compute. Perfect: they force us to
> extend our aggregate UDF functionality so it is powerful enough that
> community members can compute aggregations just as complex as stats. We
> kill two birds with one stone.

> True. But, imagine how much easier it will be to explain to savvy users or
> Drill developers if you can say: to see the difference between different
> histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10
> buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is,
> if we encapsulate these different types of histograms in a (complex)
> aggregate function, Drill can keep adding new ones as easily as adding an
> aggregate UDF.

> If things are tunable (the user gets to choose the histogram style and
> size, say), then being able to visualize the results will help the user
> choose wisely.

> Further, in training (or the next edition of the Drill book), we can
> explain how stats work by having people play with queries that provide the
> information.

Yes, at a high-level, we are already relying on the UDAF mechanism for
computing statistics using the existing aggregation mechanism. To take a
query form like you suggested, all we need is some (non-trivial)
boilerplate code connecting these building blocks. We can also enhance
ANALYZE itself to provide the exact functionality enhancements you
mentioned. These could be taken up as part of (
https://issues.apache.org/jira/browse/DRILL-6841).


> e)Having said all that, the current implementation does do what you are
> asking for (albeit not for the parallel version) which may be useful for
> experimentation.
> Very cool. Just need the parallel version and we'll be in great shape.

Yes, this is exactly what we could address 

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-09 Thread Paul Rogers
Hi Gautam,

One follow-up clarification: I realize one point was a bit unclear.

I suggested that stats be gathered by a query. By this, I simply mean that 
stats use the existing query mechanism with, perhaps, an enhanced UDAF 
(user-defined aggregate function) API.

Your work includes a new SQL statement to gather stats: COMPUTE STATS ON 
<table> or ANALYZE TABLE <table> ... (I don't recall the specifics.) My 
suggestion is not about that statement itself, rather about how that statement 
is implemented. The current code, if I recall, introduces a new execution model 
which gathers stats and writes a file.

The specific suggestion is that the new ANALYZE TABLE ... syntax is provided. 
But, it is implemented as an extension to a CTAS statement: run a query that 
computes stats, and write that to some file structure. Said another way, 
ANALYZE TABLE is simply an alias for (or is rewritten to) a stats query plus 
output to Drill's preferred internal format.
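
In pseudo-form (the exact statement syntax is not the point, and the rewrite
target below is purely illustrative):

public class AnalyzeRewriteSketch {
  // What the user types (per the proposed syntax; details may differ):
  static final String USER_STATEMENT =
      "ANALYZE TABLE dfs.`/data/orders` COMPUTE STATISTICS";

  // What the planner would actually run: an ordinary stats query whose
  // output is written, CTAS-style, to Drill's preferred stats format.
  static final String REWRITTEN =
      "CREATE TABLE dfs.tmp.`orders_stats` AS "
      + "SELECT ndv(`o_custkey`), hll(`o_custkey`) "
      + "FROM dfs.`/data/orders`";
}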

This model allows others to use queries to compute stats for the reasons 
discussed previously.

I hope this clarifies things a bit...

Also, to be clear, I'm just tossing out ideas to make Drill as useful as 
possible. Whatever we do, it would be good to get the existing version into the 
code base so folks can play with it.

Thanks,
- Paul

 

On Thursday, November 8, 2018, 3:57:35 PM PST, Paul Rogers 
 wrote:  
 
 Hi Gautam,

Thanks much for the explanations. You raise some interesting points. I noticed 
that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct 
case.

To take a step back, recall that Arina is working on a metadata proposal. A key 
aspect of that proposal is that it provides an API so that Drill can connect 
(via an adapter) to any metadata system. The gist of my comments is that it 
would be wonderful if stats could work the same way. Provide a generic way to 
compute stats (as a query). That data can then be written to whatever metadata 
system the user wants to use, and served back via Arina's API.

Here is a little ASCII diagram which, I hope, survives e-mail:

Stats query --> custom metastore --> Metadata API --> Drill planner

That first arrow represents a query someone runs that gathers stats, along with 
code to write the results into the custom metastore. Very powerful concept.

Detailed comments below.

> a)This approach offloads the hard task of figuring out which statistics are
needed for which columns based on the user workload and then adapting to
changes in the workload! This may be useful for experimenting, but not in
practice.

The point here was a bit more subtle. First, users (and QA) want to know the 
values of stats. Stats-as-query lets people easily play with the stats. I may 
request SELECT ndv(id) FROM ... because I want to know the NDV. The stats 
gathering mechanism may want NDV for all columns, along with other material.

Second, the stats gathering logic will know what stats are wanted for that 
specific purpose. Maybe in the Drill native version, Drill guesses which 
columns should have stats.

But, savvy users will know ahead of time (or from experience) which columns 
justify the cost of stats. If I roll my own metadata system, I may know that it 
is worth gathering stats on a small subset of columns, so I'll issue the query 
to do so.

The key thought is, separate out the mechanism to compute stats from the 
consumption of stats. (BTW, Impala uses a plain query to compute its stats.)

> b)Please note this approach would also require making additional
Calcite/Drill code changes to generate the correct plans (serial and
parallel). Currently, we bypass these changes and directly generate the
physical plan. ...

Excellent. The result would be the ability for Drill to generate efficient 
plans for complex stats: whether those stats are requested by the user or by 
Drill itself. Stats are easier to add: just add the necessary (probably 
complex) functions and the planner takes care of the rest. There are not two 
distinct code paths to reason about and maintain.

> c)Exposing all such statistics may not be useful for the users e.g. for NDV
we save the HLL structure which allows us to parallelize the plans and
compute NDV efficiently and in the future will allow us to compute
partition-wise statistics.

Stats (histograms, NDV) are complex to compute. Perfect: they force us to 
extend our aggregate UDF functionality so it is powerful enough that community 
members can compute aggregations just as complex as stats. We kill two birds 
with one stone.

> d)Consider, in the future, we add Histograms. There are several different
kinds of histograms with trade-offs and we may decide to generate one or
the other based on `ndv` values etc. We cannot expect the user to figure
all this out on their own.

True. But, imagine how much easier it will be to explain to savvy users or 
Drill developers if you can say: to see the difference between different 
histograms, simply try out histogram1(foo) vs. 

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-08 Thread Paul Rogers
Hi Gautam,

Thanks much for the explanations. You raise some interesting points. I noticed 
that Boaz has just filed a JIRA ticket to tackle the inefficient count distinct 
case.

To take a step back, recall that Arina is working on a metadata proposal. A key 
aspect of that proposal is that it provides an API so that Drill can connect 
(via an adapter) to any metadata system. The gist of my comments is that it 
would be wonderful if stats could work the same way. Provide a generic way to 
compute stats (as a query). That data can then be written to whatever metadata 
system the user wants to use, and served back via Arina's API.

Here is a little ASCII diagram which, I hope, survives e-mail:

Stats query --> custom metastore --> Metadata API --> Drill planner

That first arrow represents a query someone runs that gathers stats, along with 
code to write the results into the custom metastore. Very powerful concept.

Detailed comments below.

> a)This approach offloads the hard task of figuring out which statistics are
needed for which columns based on the user workload and then adapting to
changes in the workload! This may be useful for experimenting, but not in
practice.

The point here was a bit more subtle. First, users (and QA) want to know the 
values of stats. Stats-as-query lets people easily play with the stats. I may 
request SELECT ndv(id) FROM ... because I want to know the NDV. The stats 
gathering mechanism may want NDV for all columns, along with other material.

Second, the stats gathering logic will know what stats are wanted for that 
specific purpose. Maybe in the Drill native version, Drill guesses which 
columns should have stats.

But, savvy users will know ahead of time (or from experience) which columns 
justify the cost of stats. If I roll my own metadata system, I may know that it 
is worth gathering stats on a small subset of columns, so I'll issue the query 
to do so.

The key thought is, separate out the mechanism to compute stats from the 
consumption of stats. (BTW, Impala uses a plain query to compute its stats.)

> b)Please note this approach would also require making additional
Calcite/Drill code changes to generate the correct plans (serial and
parallel). Currently, we bypass these changes and directly generate the
physical plan. ...

Excellent. The result would be the ability for Drill to generate efficient 
plans for complex stats: whether those stats are requested by the user or by 
Drill itself. Stats are easier to add: just add the necessary (probably 
complex) functions and the planner takes care of the rest. There are not two 
distinct code paths to reason about and maintain.

> c)Exposing all such statistics may not be useful for the users e.g. for NDV
we save the HLL structure which allows us to parallelize the plans and
compute NDV efficiently and in the future will allow us to compute
partition-wise statistics.

Stats (histograms, NDV) are complex to compute. Perfect: they force us to 
extend our aggregate UDF functionality so it is powerful enough that community 
members can compute aggregations just as complex as stats. We kill two birds 
with one stone.

> d)Consider, in the future, we add Histograms. There are several different
kinds of histograms with trade-offs and we may decide to generate one or
the other based on `ndv` values etc. We cannot expect the user to figure
all this out on their own.

True. But, imagine how much easier it will be to explain to savvy users or 
Drill developers if you can say: to see the difference between different 
histograms, simply try out histogram1(foo) vs. histogram2(foo). Try out 10 
buckets: histogram1(foo, 10) vs. 20 buckets: histogram1(foo, 20). That is, if 
we encapsulate these different types of histograms in a (complex) aggregate 
function, Drill can keep adding new ones as easily as adding an aggregate UDF.
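
For the shape of such a function, here is a minimal structural sketch of a
Drill aggregate UDF (a skeleton only, not code from the PR: a real histogram
would need an ObjectHolder-backed bucket array plus merge support, and the
fixed split below is just a stand-in for real bucket logic):

import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.FunctionTemplate.FunctionScope;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.BigIntHolder;
import org.apache.drill.exec.expr.holders.Float8Holder;

@FunctionTemplate(name = "histogram1", scope = FunctionScope.POINT_AGGREGATE)
public class Histogram1Function implements DrillAggFunc {
  @Param Float8Holder in;
  @Workspace BigIntHolder below;   // values below the split point
  @Workspace BigIntHolder above;   // values at or above the split point
  @Output BigIntHolder out;        // a real UDF would emit the bucket array

  public void setup() {
    below.value = 0;
    above.value = 0;
  }

  public void add() {
    if (in.value < 0.0) { below.value++; } else { above.value++; }
  }

  public void output() {
    out.value = below.value;       // placeholder: return one bucket count
  }

  public void reset() {
    below.value = 0;
    above.value = 0;
  }
}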

If things are tunable (the user gets to choose the histogram style and size, 
say), then being able to visualize the results will help the user choose wisely.

Further, in training (or the next edition of the Drill book), we can explain 
how stats work by having people play with queries that provide the information.

> e)Having said all that, the current implementation does do what you are
asking for (albeit not for the parallel version) which may be useful for
experimentation.

Very cool. Just need the parallel version and we'll be in great shape.


> Yes, the current approach of storing statistics is emulated from Parquet
metadata. Even though it is riddled with concurrency issues, it does not do
any worse. Hence, I would contend that it is a good starting point. Once 
the meta-store work is complete, I plan to integrate statistics with it and 
leverage all the great benefits that come with this approach.

This is my very point. We are basing the stats implementation on a broken 
storage format. The suggestion here is to split the problem.

Discussion about metadata pointed out the benefit of separating the 

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-08 Thread Gautam Parai
Hi Paul,

Thank you so much for the feedback. Please see my responses below.

> First, the code to gather the stats is rather complex; it is the evolution
> of some work an intern did way back when. We'd be advised to find a simpler
> implementation, ideally one that uses mechanisms we already have.
> One possible approach is to convert metadata gathering to a plain old
> query. That is, rather than having a special mechanism to gather stats,
> just add functions in Drill. Maybe we want NDV and a histogram. (Can't
> recall all the stats that Gautam implemented.) Just implement them as new
> functions:
> SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM
> myTable;
> The above would simply display the stats (with the histogram presented as
> a Drill array with 10 buckets.)
> Such an approach could build on the aggregation mechanism that already
> exists, and would avoid the use of the complex map structure in the current
> PR. It would also give QA and users an easy way to check the stats values.


a)This approach offloads the hard task of figuring out which statistics are
needed for which columns based on the user workload and then adapting to
changes in the workload! This may be useful for experimenting, but not in
practice.
b)Please note this approach would also require making additional
Calcite/Drill code changes to generate the correct plans (serial and
parallel). Currently, we bypass these changes and directly generate the
physical plan. E.g., if we said `let us write the simplest query we can for
getting the number of distinct values and be done with it`, we would end up
with the plan below.

select count(distinct employee_id), count(distinct full_name) from
cp.`employee.json`;

00-00    Screen
00-01      Project(EXPR$0=[$0], EXPR$1=[$1])
00-02        NestedLoopJoin(condition=[true], joinType=[inner])
00-04          StreamAgg(group=[{}], EXPR$0=[COUNT($0)])
00-06            HashAgg(group=[{0}])
00-08              Scan(table=[[cp, employee.json]],
                     groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json,
                     numFiles=1, columns=[`employee_id`], files=[classpath:/employee.json]]])
00-03          StreamAgg(group=[{}], EXPR$1=[COUNT($0)])
00-05            HashAgg(group=[{0}])
00-07              Scan(table=[[cp, employee.json]],
                     groupscan=[EasyGroupScan [selectionRoot=classpath:/employee.json,
                     numFiles=1, columns=[`full_name`], files=[classpath:/employee.json]]])


It is evident that such a plan will perform poorly in practice!

c)Exposing all such statistics may not be useful for the users e.g. for NDV
we save the HLL structure which allows us to parallelize the plans and
compute NDV efficiently and in the future will allow us to compute
partition-wise statistics.
d)Consider, in the future, we add Histograms. There are several different
kinds of histograms with trade-offs and we may decide to generate one or
the other based on `ndv` values etc. We cannot expect the user to figure
all this out on their own.
e)Having said all that, the current implementation does do what you are
asking for (albeit not for the parallel version) which may be useful for
experimentation.

select ndv(gender), ndv(full_name) from cp.`employee.json`;

+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 2       | 1155    |
+---------+---------+


select hll(gender), hll(full_name) from cp.`employee.json`;

+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
\x00\x00\x00\x14\x00\x0A\xAA\xAC\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\




However, the issues I mentioned earlier are the reason why we would not want
to do so.




> Second, at present, we have no good story for storing the stats. The
> file-based approach is similar to that used for Parquet metadata, and there
> are many known concurrency issues with that approach -- it is not something
> to emulate.
> Later, when the file problem is solved, or the metastore is available,
> some process can kick off a query of the appropriate form and write the
> results to the metastore in a concurrency-safe way. And, a COMPUTE STATS
> command would just be a wrapper around the above query along with writing
> the stats to some location.


Yes, the current approach of storing statistics is emulated from Parquet
metadata. Even though it is riddled with concurrency issues, it does not do
any worse. Hence, I would contend that it is a good starting point. Once
the meta-store work is complete, I plan to integrate statistics with it and
leverage all the great benefits that come with this approach.

Thanks,
Gautam

On Tue, Nov 6, 2018 at 12:04 PM Paul Rogers 
wrote:

> Hi All,
>
> Stats would be a great addition. Here are a couple of issues that came up
> in the earlier code review, revisited in light of recent proposed work.
>
> First, the code to gather the stats is rather complex; it is the evolution
> of some work an intern did way back when. We'd be advised to find a simpler
> implementation, 

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-06 Thread Paul Rogers
Hi All,

Stats would be a great addition. Here are a couple of issues that came up in 
the earlier code review, revisited in light of recent proposed work.

First, the code to gather the stats is rather complex; it is the evolution of 
some work an intern did way back when. We'd be advised to find a simpler 
implementation, ideally one that uses mechanisms we already have.

Second, at present, we have no good story for storing the stats. The file-based 
approach is similar to that used for Parquet metadata, and there are many known 
concurrency issues with that approach -- it is not something to emulate.

One possible approach is to convert metadata gathering to a plain old query. 
That is, rather than having a special mechanism to gather stats, just add 
functions in Drill. Maybe we want NDV and a histogram. (Can't recall all the 
stats that Gautam implemented.) Just implement them as new functions:

SELECT ndv(foo), histogram(foo, 10), ndv(bar), histogram(bar, 10) FROM myTable;

The above would simply display the stats (with the histogram presented as a 
Drill array with 10 buckets.)

Such an approach could build on the aggregation mechanism that already exists, 
and would avoid the use of the complex map structure in the current PR. It 
would also give QA and users an easy way to check the stats values.

Later, when the file problem is solved, or the metastore is available, some 
process can kick off a query of the appropriate form and write the results to 
the metastore in a concurrency-safe way. And, a COMPUTE STATS command would 
just be a wrapper around the above query along with writing the stats to some 
location.

Just my two cents...

Thanks,
- Paul

 

On Tuesday, November 6, 2018, 2:51:35 AM PST, Vitalii Diravka 
 wrote:  
 
 +1
It will help to rely on that code when implementing the Drill
Metastore, DRILL-6552.

@Gautam Please address all current comments and rebase onto latest master,
then Vova and I will do an additional review of it.
Just for clarification, am I right that the state of the changes is the same
as in the last comment in DRILL-1328 [1]
(they will not include histograms and will cause some regressions on the
TPC-H and TPC-DS benchmarks)?

[1]
https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374


Kind regards
Vitalii


On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra  wrote:

> +1
> I'd say go for it.
> If the option to use enhanced stats can be turned on per session, then users
> can experiment and choose to turn it on for queries where they do not
> experience performance degradation.
>
>
> On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai  wrote:
>
> > Hi all,
> >
> > I had an initial implementation for statistics support for Drill
> > [DRILL-1328]. This JIRA
> > has links to the design spec as well as the PR. Unfortunately, because of
> > some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> > temporarily shelve the implementation. I would like to resolve the
> pending
> > issues and get the changes in.
> >
> > Hopefully, it will be okay to merge it in as an experimental feature
> since
> > in order to resolve these issues we may need to change the existing join
> > ordering algorithm in Drill, add support for Histograms and a few other
> > planning related issues. Moreover, the community is adding a meta-store
> for
> > Drill [DRILL-6552].
> > Statistics should also be able to leverage the brand new meta-store
> instead
> > of/in addition to having a custom store implementation.
> >
> > My plan is to address the most critical review comments and get the
> initial
> > version in as an experimental feature. Some other good-to-have aspects
> like
> > handling schema changes during the statistics collection process may be
> > deferred to the next iteration. Subsequently, I will improve these
> > good-to-have features and additional performance improvements. It would
> be
> > great to get the initial implementation in to avoid the rebase issues and
> > allow other community members to use and contribute to the feature.
> >
> > Please take a look at the design doc and the PR and provide suggestions
> and
> > feedback on the JIRA. Also I will try to present the current state of
> > statistics and the feature in one of the bi-weekly Drill Community
> > Hangouts.
> >
> > Thanks,
> > Gautam
> >
>
  

Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-06 Thread Vitalii Diravka
+1
It will help to rely on that code when implementing the Drill
Metastore, DRILL-6552.

@Gautam Please address all current comments and rebase onto latest master,
then Vova and I will do an additional review of it.
Just for clarification, am I right that the state of the changes is the same
as in the last comment in DRILL-1328 [1]
(they will not include histograms and will cause some regressions on the
TPC-H and TPC-DS benchmarks)?

[1]
https://issues.apache.org/jira/browse/DRILL-1328?focusedCommentId=16061374=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16061374


Kind regards
Vitalii


On Tue, Nov 6, 2018 at 1:47 AM Parth Chandra  wrote:

> +1
> I'd say go for it.
> If the option to use enhanced stats can be turned on per session, then users
> can experiment and choose to turn it on for queries where they do not
> experience performance degradation.
>
>
> On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai  wrote:
>
> > Hi all,
> >
> > I had an initial implementation for statistics support for Drill
> > [DRILL-1328]. This JIRA
> > has links to the design spec as well as the PR. Unfortunately, because of
> > some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> > temporarily shelve the implementation. I would like to resolve the
> pending
> > issues and get the changes in.
> >
> > Hopefully, it will be okay to merge it in as an experimental feature
> since
> > in order to resolve these issues we may need to change the existing join
> > ordering algorithm in Drill, add support for Histograms and a few other
> > planning related issues. Moreover, the community is adding a meta-store
> for
> > Drill [DRILL-6552].
> > Statistics should also be able to leverage the brand new meta-store
> instead
> > of/in addition to having a custom store implementation.
> >
> > My plan is to address the most critical review comments and get the
> initial
> > version in as an experimental feature. Some other good-to-have aspects
> like
> handling schema changes during the statistics collection process may be
> > deferred to the next iteration. Subsequently, I will improve these
> > good-to-have features and additional performance improvements. It would
> be
> > great to get the initial implementation in to avoid the rebase issues and
> > allow other community members to use and contribute to the feature.
> >
> > Please take a look at the design doc and the PR and provide suggestions
> and
> > feedback on the JIRA. Also I will try to present the current state of
> > statistics and the feature in one of the bi-weekly Drill Community
> > Hangouts.
> >
> > Thanks,
> > Gautam
> >
>


Re: [DISCUSS] Resurrect support for Table Statistics in Drill

2018-11-05 Thread Parth Chandra
+1
I'd say go for it.
If the option to use enhanced stats can be turned on per session, then users
can experiment and choose to turn it on for queries where they do not
experience performance degradation.
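
Something like this, say (the option name here is invented just to show the
idea):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class EnableStatsPerSession {
  public static void main(String[] args) throws Exception {
    try (Connection conn =
             DriverManager.getConnection("jdbc:drill:drillbit=localhost");
         Statement stmt = conn.createStatement()) {
      // Opt in for this session only; other sessions keep the old planning.
      stmt.execute("ALTER SESSION SET `planner.enhanced_stats` = true");
      // ... run the workload and compare plans/latencies ...
    }
  }
}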


On Fri, Nov 2, 2018 at 3:25 PM Gautam Parai  wrote:

> Hi all,
>
> I had an initial implementation for statistics support for Drill
> [DRILL-1328]. This JIRA
> has links to the design spec as well as the PR. Unfortunately, because of
> some regressions on performance benchmarks (TPCH/TPCDS) we decided to
> temporarily shelve the implementation. I would like to resolve the pending
> issues and get the changes in.
>
> Hopefully, it will be okay to merge it in as an experimental feature since
> in order to resolve these issues we may need to change the existing join
> ordering algorithm in Drill, add support for Histograms and a few other
> planning related issues. Moreover, the community is adding a meta-store for
> Drill [DRILL-6552].
> Statistics should also be able to leverage the brand new meta-store instead
> of/in addition to having a custom store implementation.
>
> My plan is to address the most critical review comments and get the initial
> version in as an experimental feature. Some other good-to-have aspects like
> handling schema changes during the statistics collection process may be
> deferred to the next iteration. Subsequently, I will improve these
> good-to-have features and additional performance improvements. It would be
> great to get the initial implementation in to avoid the rebase issues and
> allow other community members to use and contribute to the feature.
>
> Please take a look at the design doc and the PR and provide suggestions and
> feedback on the JIRA. Also I will try to present the current state of
> statistics and the feature in one of the bi-weekly Drill Community
> Hangouts.
>
> Thanks,
> Gautam
>