I have recently been working on collecting statistics for Hive partitioned
tables [1], and I would like to share my experience as a developer.

I have to admit that ndv confused me at first glance, but I could easily find
out what it means with a web search for something like "ndv statistic".

To be honest, the name granularityNumber is not intuitive to me either, and it
is even harder to find out what it means with a web search for something like
"granularityNumber statistic".

Personally, I would prefer to keep using ndv in ColumnStats.
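
To make the term concrete: ndv is simply the count of distinct values in a
column. Below is a minimal sketch in plain Java. The class NdvSketch and its
method are hypothetical names used only for illustration and are not part of
Flink or Hive, and skipping nulls is an assumption of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Flink: illustrates what "ndv" measures.
public class NdvSketch {

    // Counts the distinct non-null values of a column.
    // Ignoring nulls is an assumption made for this sketch.
    public static long ndv(List<String> column) {
        Set<String> distinct = new HashSet<>();
        for (String value : column) {
            if (value != null) {
                distinct.add(value);
            }
        }
        return distinct.size();
    }

    public static void main(String[] args) {
        List<String> gender = Arrays.asList("F", "M", "F", null, "M");
        System.out.println(ndv(gender)); // prints 2
    }
}
```

A month column holding all twelve months would give 12 in the same way,
however many rows the table has.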

[1] https://issues.apache.org/jira/browse/FLINK-27597


Best regards,
Yuxia



> On June 2, 2022, at 12:44 AM, Jing Ge <j...@ververica.com> wrote:
> 
> Hi Dev,
> 
> I am not really sure if it is feasible to start this discussion. According
> to the contribution guidelines, dev ml is the right place to reach
> consensus.
> 
> ColumnStats currently uses ndv, which stands for "number of distinct
> values". First of all, the abbreviation makes the meaning difficult to
> understand. Second, it might be good to use a more professional name
> instead.
> 
> 
> 
> Suggestion:
> 
> replace ndv with granularityNumber:
> 
> 
> 
> The good news, afaik, is that the method getNdv() hasn't been used within
> Flink which means the renaming will have very limited impact.
> 
> 
> 
> ColumnStats {
>
>     /** number of distinct values. */
>     @Deprecated
>     private final Long ndv;
>
>     /**
>      * Granularity refers to the level of details used to sort and separate
>      * data at column level. Highly granular data is categorized or separated
>      * very precisely. For example, the granularity number of gender columns
>      * should normally be 2. The granularity number of the month column will
>      * be 12. In the SQL world, it means the number of distinct values.
>      */
>     private final Long granularityNumber;
>
>     @Deprecated
>     public Long getNdv() {
>         return ndv;
>     }
>
>     public Long getGranularityNumber() {
>         return granularityNumber;
>     }
> }
> 
> Best regards,
> -- 
> 
> Jing
