[kudu-CR] KUDU-2921: Exposing the table statistics to spark relation.

Hao Hao (Code Review) Fri, 23 Aug 2019 12:27:26 -0700

Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/14107 )


Change subject: KUDU-2921: Exposing the table statistics to spark relation.
......................................................................


Patch Set 2:

(1 comment)

> Patch Set 2:
>
> > Patch Set 2:
> > 
> > > Actually, I have considered putting those statistics into an existing 
> > > message such as GetSchema, which is used when opening a table. But since 
> > > the table statistics may contain more and more entries, it may be better 
> > > to make it an independent RPC. What's more, because these statistics may 
> > > change while a client is writing, I think it would be useful to support an 
> > > RPC for clients to pick up the changed statistics. For query planning the 
> > > statistics are used only once, and putting them into an existing RPC would 
> > > save one round trip, but that RPC would become heavy if we want to fetch 
> > > statistics changes in other use cases.
> >
> > How are the stats useful for a writing client (i.e. an Impala or Spark 
> > executor) vs. the query planner?
> >
> > What are the other use cases you're thinking about? If they're 
> > hypothetical, we could add the stats to GetTableLocations now, and also 
> > expose them via new RPC later, when those other use cases materialize.
> >
> > >  > If doing this, let's make sure to make it optional, since the
> > >  > statistics could become larger later (eg with things like
> > >  > histograms, NDVs per column, etc)
> > >
> > > > Will make it optional if we do that :)
> >
> > Agreed.
>
> Here are some stats that systems use:
> - size in bytes
> - number of rows
> - per-column stats
>   - distinct count
>   - min/max/histogram
>
> The per-column stats seem like they wouldn't be a good fit for the GetSchema 
> endpoint, since they may end up returning user data.
>
> So I'm hesitant about adding stats in with GetSchema; from an authorization 
> point of view, getting the number of rows in a table, for instance, currently 
> requires some scan privileges. If we tie it into the GetSchema RPC, this is 
> no longer the case.
>
> Some info about Spark:
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-CatalogStatistics.html
> https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-ColumnStat.html

Agree.

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc
File src/kudu/master/catalog_manager.cc:

http://gerrit.cloudera.org:8080/#/c/14107/1/src/kudu/master/catalog_manager.cc@2906
PS1, Line 2906:       table->set_name(table_name);
> Actually, this metrics doesn't contained in tablemetadata and  I think it m
AFAIU, Impala requires "ALTER and SELECT on Table" because COMPUTE STATS 
actually triggers SELECT queries and alters the table. Here we are only using 
pre-computed stats associated with the table, so arguably "METADATA on Table" 
may be sufficient. However, since the stats contain data (the number of rows) 
that currently requires scan privileges, it is better to be more restrictive 
and require at least "SELECT on Table" here.
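
To make the suggestion concrete, here is a minimal, self-contained sketch of
the gating I have in mind. It is not Kudu code: Privilege, TableStats,
TableResponse and BuildResponse are hypothetical names made up for
illustration, and the real check would go through whatever authorization
hooks the master uses. The idea is only that the stats field is optional and
is filled in only when the caller holds SELECT on the table.

```cpp
#include <optional>
#include <set>
#include <string>

// Hypothetical privilege levels, for illustration only (not Kudu's types).
enum class Privilege { kMetadata, kSelect, kAlter };

// Hypothetical pre-computed table-level statistics.
struct TableStats {
  long long on_disk_size_bytes;
  long long row_count;
};

// Hypothetical response: the stats field is optional so it can be omitted.
struct TableResponse {
  std::string table_name;
  std::optional<TableStats> stats;
};

// Always returns the table name, but attaches the pre-computed stats only
// when the caller has been granted SELECT on the table.
TableResponse BuildResponse(const std::string& table_name,
                            const std::set<Privilege>& granted,
                            const TableStats& precomputed) {
  TableResponse resp;
  resp.table_name = table_name;
  if (granted.count(Privilege::kSelect) > 0) {
    resp.stats = precomputed;
  }
  return resp;
}
```

With this shape, a caller that only has METADATA still gets a usable
response, just without the row count or size.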



--
To view, visit http://gerrit.cloudera.org:8080/14107
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I7742a76708f989b0ccc8ba417f3390013e260175
Gerrit-Change-Number: 14107
Gerrit-PatchSet: 2
Gerrit-Owner: ZhangYao <triplesheep0...@gmail.com>
Gerrit-Reviewer: Adar Dembo <a...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <granthe...@apache.org>
Gerrit-Reviewer: Hao Hao <hao....@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Tidy Bot (241)
Gerrit-Reviewer: Todd Lipcon <t...@apache.org>
Gerrit-Reviewer: ZhangYao <triplesheep0...@gmail.com>
Gerrit-Comment-Date: Fri, 23 Aug 2019 19:27:06 +0000
Gerrit-HasComments: Yes
