Recently, I have been working on collecting statistics for Hive partitioned
tables [1], and I would like to share my experience as a developer.
I have to admit that ndv really confused me at first glance, but I could
easily find what it means in a web search engine with a keyword like "ndv
statistic".
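For anyone meeting the term for the first time, here is a minimal sketch of what ndv counts for a single column. The class and method names are hypothetical and are not part of Flink or Hive, and skipping nulls is an assumption of this sketch rather than a statement about how any particular engine computes the value.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not Flink's or Hive's implementation.
public class NdvExample {

    // Returns the number of distinct non-null values in a column.
    // Assumption: nulls are excluded here and would be reported separately.
    public static long ndv(List<String> column) {
        Set<String> distinct = new HashSet<>();
        for (String value : column) {
            if (value != null) {
                distinct.add(value);
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> city = Arrays.asList("Berlin", "Hangzhou", "Berlin", null, "Beijing");
        // prints rows=5, ndv=3
        System.out.println("rows=" + city.size() + ", ndv=" + ndv(city));
    }
}
```

An exact count like this needs memory proportional to the number of distinct values, which is why engines often store an estimate instead.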
Hi Jing,
Hmm, granularity and ndv still don't seem to mean the same thing to me.
Granularity basically means how detailed the data is, in other words,
whether a field / column can be further divided. For example, a field like
"age" cannot be further divided so it is quite granular. In contrast, an
Hi Jing,
I agree with you that "NDV is more SQL-oriented (implementation)
and granularity is more data analytics-oriented". As you said,
"granularity" may be commonly used in data modeling and business-related
contexts.
However, TableStats is not used for data modeling but is an implementation
detail for
Thanks all for your feedback! It is very informative.
To Becket:
At the beginning, I chose the same word because we used it in daily work.
Before I started this discussion, to make sure it is the right one, I did
some checking and it turns out that *cardinality* has a very different
(also very
Hi,
+1, NDV (number of distinct values) is a widely used term in
table statistics.
I've also seen it called `distinctCount`.
This name can be found in databases like Oracle too. [1]
So it would not be good to change it to a completely different name.
[1]
rg/jira/browse/FLINK-27597
From: Jing Ge
Date: 2022-06-02 00:21
Subject: [DISCUSS] suggest using granularityNumber in ColumnStats
To: dev
Hi Dev,
I am not really sure if it is feasible to start this discussion. According
to the contribution guidelines, dev ml is the right place to reach
consensus.
In Co
Hi Jing,
I can see there might be developers who don't understand the meaning at
first glance.
However, NDV is a widely used term in table statistics, see
[1][2][3].
If we use another name, it may confuse developers who are familiar with
stats and optimization.
I think at least, the
Hi Jing,
While I do agree that NDV is a little confusing at first sight, it seems
quite concise once I got the meaning. So personally I am OK with keeping it
as is, but proper documentation would be helpful. If we really want to
replace it with a more professional name, *cardinality* might be a
Hi Dev,
I am not really sure if it is feasible to start this discussion. According
to the contribution guidelines, dev ml is the right place to reach
consensus.
In ColumnStats, ndv, which stands for "number of distinct
values", is currently used. First of all, it is difficult to understand the