Thanks all for your feedback! It is very informative.

To Becket:
At the beginning, I chose the same word because we use it in our daily work. Before starting this discussion, to make sure it was the right one, I did some checking, and it turns out that *cardinality* has a very different (and also very common) meaning within data modeling[1]. *Granularity*, on the other hand, is actually the right word for what we mean when we use cardinality in the context of NDV[2].

To Jark, Jingsong:
NDV seems to me more like a function than a field defined in a class. Briefly speaking, NDV is more SQL-oriented (implementation), while *granularity* is more data analytics-oriented (abstraction/concept)[3][4].

Best regards,
Jing

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
[2] https://www.talon.one/glossary/granularity
[3] https://www.quora.com/What-is-granularity-in-database
[4] https://www.statisticshowto.com/data-granularity/

On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi,
>
> +1 for NDV (number of distinct values) is a widely used terminology in
> table statistics.
>
> I've also seen the one called `distinctCount`.
>
> This name can be found in databases like oracle too. [1]
>
> So it is not good to change a completely different name.
>
> [1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
>
> Best,
> Jingsong
>
> On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <imj...@gmail.com> wrote:
>
> > Hi Jing,
> >
> > I can see there might be developers who don't understand the meaning
> > at the first glance.
> > However, NDV is a widely used terminology in table statistics, see
> > [1][2][3].
> > If we use another name, it may confuse developers who are familiar
> > with stats and optimization.
> > I think at least, the Javadoc is needed to explain the meaning and
> > full name.
> > If we want to change the name, we can use the full name
> > "numberOfDistinctValues()".
> >
> > Best,
> > Jark
> >
> > [1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > [2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > [3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> >
> > On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:
> >
> > > Hi Jing,
> > >
> > > While I do agree that NDV is a little confusing at first sight, it
> > > seems quite concise once I got the meaning. So personally I am OK
> > > with keeping it as is, but proper documentation would be helpful.
> > > If we really want to replace it with a more professional name,
> > > *cardinality* might be a good alternative.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
> > >
> > > > Hi Dev,
> > > >
> > > > I am not really sure if it is feasible to start this discussion.
> > > > According to the contribution guidelines, dev ml is the right
> > > > place to reach consensus.
> > > >
> > > > In ColumnStats, Currently ndv, which stands for "number of
> > > > distinct values", is used. First of all, it is difficult to
> > > > understand the meaning with the abbreviation. Second, it might be
> > > > good to use a professional naming instead.
> > > >
> > > > Suggestion:
> > > >
> > > > replace ndv with granularityNumber:
> > > >
> > > > The good news, afaik, is that the method getNdv() hasn't been
> > > > used within Flink which means the renaming will have very limited
> > > > impact.
> > > >
> > > > ColumnStats {
> > > >
> > > >     /** number of distinct values. */
> > > >     @Deprecated
> > > >     private final Long ndv;
> > > >
> > > >     /** Granularity refers to the level of details used to sort
> > > >     and separate data at column level. Highly granular data is
> > > >     categorized or separated very precisely. For example, the
> > > >     granularity number of gender columns should normally be 2.
> > > >     The granularity number of the month column will be 12. In the
> > > >     SQL world, it means the number of distinct values. */
> > > >     private final Long granularityNumber;
> > > >
> > > >     @Deprecated
> > > >     public Long getNdv()
> > > >     { return ndv; }
> > > >
> > > >     public Long getGranularityNumber()
> > > >     { return granularityNumber; }
> > > > }
> > > >
> > > > Best regards,
> > > > --
> > > >
> > > > Jing
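
For reference, the rename suggested in the original proposal could be sketched roughly as below. This is only an illustration under a few assumptions that are not part of the thread: the class name `ColumnStatsSketch` is hypothetical (it is not Flink's actual `ColumnStats`), a single backing field is used instead of the two fields in the proposal, and the deprecated `getNdv()` simply delegates to the new accessor. The same shape would work for any of the other names mentioned in the replies, for example `getNumberOfDistinctValues()`.

```java
import java.util.Objects;

/** Hypothetical stand-in for illustration only; not Flink's actual ColumnStats. */
final class ColumnStatsSketch {

    /** Number of distinct values in the column. Null is assumed to mean "unknown". */
    private final Long granularityNumber;

    ColumnStatsSketch(Long granularityNumber) {
        this.granularityNumber = granularityNumber;
    }

    /**
     * Returns the number of distinct values (NDV) of the column,
     * e.g. 12 for a month column.
     */
    public Long getGranularityNumber() {
        return granularityNumber;
    }

    /**
     * Old accessor, kept so existing callers keep compiling.
     *
     * @deprecated use {@link #getGranularityNumber()} instead.
     */
    @Deprecated
    public Long getNdv() {
        // Delegate so the old and new accessors can never disagree.
        return getGranularityNumber();
    }

    public static void main(String[] args) {
        ColumnStatsSketch month = new ColumnStatsSketch(12L);
        System.out.println(month.getGranularityNumber());
        System.out.println(Objects.equals(month.getNdv(), month.getGranularityNumber()));
    }
}
```

Keeping one field behind both getters avoids having two values that must be kept in sync during the deprecation period; whether that fits the real class depends on how its constructors and builders are used, which this sketch does not cover.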