Hi Jing,

I can see that some developers might not understand the meaning at first glance. However, NDV is a widely used term in table statistics, see [1][2][3]. If we use another name, it may confuse developers who are familiar with stats and optimization. At a minimum, I think Javadoc is needed to explain the meaning and spell out the full name. If we do want to change the name, we can use the full name "numberOfDistinctValues()".
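To illustrate the Javadoc option, here is a minimal sketch of how the abbreviation could be documented. ColumnStatsSketch is a hypothetical, simplified stand-in and not the actual Flink class; treating null as "statistic unknown" is an assumption made only for this example.

```java
/** Hypothetical, simplified stand-in for ColumnStats; not the actual Flink class. */
final class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column. Assumption: null means unknown. */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the number of distinct values (NDV) of the column, for example
     * 12 for a month column.
     *
     * @return the NDV, or null if the statistic is unknown (assumption)
     */
    public Long getNdv() {
        return ndv;
    }

    /** Full-name accessor, shown only as the possible rename mentioned above. */
    public Long numberOfDistinctValues() {
        return ndv;
    }
}
```

Either accessor returns the same value; the point is only that the Javadoc spells out the abbreviation once, so the short name stays usable.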

Best,
Jark

[1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
[2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
[3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md

On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:

> Hi Jing,
>
> While I do agree that NDV is a little confusing at first sight, it seems
> quite concise once I got the meaning. So personally I am OK with keeping it
> as is, but proper documentation would be helpful. If we really want to
> replace it with a more professional name, *cardinality* might be a good
> alternative.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
>
> > Hi Dev,
> >
> > I am not really sure if it is feasible to start this discussion. According
> > to the contribution guidelines, dev ml is the right place to reach
> > consensus.
> >
> > In ColumnStats, Currently ndv, which stands for "number of distinct
> > values", is used. First of all, it is difficult to understand the meaning
> > with the abbreviation. Second, it might be good to use a professional
> > naming instead.
> >
> > Suggestion:
> >
> > replace ndv with granularityNumber:
> >
> > The good news, afaik, is that the method getNdv() hasn't been used within
> > Flink which means the renaming will have very limited impact.
> >
> > ColumnStats {
> >
> >     /** number of distinct values. */
> >     @Deprecated
> >     private final Long ndv;
> >
> >     /**
> >      * Granularity refers to the level of details used to sort and separate
> >      * data at column level. Highly granular data is categorized or separated
> >      * very precisely. For example, the granularity number of gender columns
> >      * should normally be 2. The granularity number of the month column will
> >      * be 12. In the SQL world, it means the number of distinct values.
> >      */
> >     private final Long granularityNumber;
> >
> >     @Deprecated
> >     public Long getNdv() { return ndv; }
> >
> >     public Long getGranularityNumber() { return granularityNumber; }
> > }
> >
> > Best regards,
> > --
> >
> > Jing