Hi Jing,

I agree with you that "NDV is more SQL-oriented(implementation) and granularity is more data analytics-oriented". As you said, "granularity" may be commonly used in data modeling and business contexts. However, TableStats is not used for data modeling; it is an implementation detail for SQL optimization. NDV is the established terminology in the optimizer field, and Calcite also uses it [1]. I haven't noticed any vendors using "granularity" for this purpose. If I missed any, please correct me.

If NDV sounds like a function to you, I'm OK to use "numDistinctVals" as Calcite does.

Best,
Jark

[1]: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)

On Fri, 3 Jun 2022 at 00:14, Jing Ge <[email protected]> wrote:

> Thanks all for your feedback! It is very informative.
>
> to Becket:
>
> At the beginning, I chose the same word because we used it in daily work.
> Before I started this discussion, to make sure it is the right one, I did
> some checking and it turns out that *cardinality* has a very different
> (also very common) meaning within data modeling[1]. And on the other side
> *granularity* is actually the right word for the meaning when we use
> cardinality in the context of NDV[2].
>
> to Jark, Jingsong,
>
> NDV seems to me more like a function than a field defined in a class.
> Briefly speaking, NDV is more SQL-oriented(implementation) and
> *granularity* is more data analytics-oriented(abstraction/concept)[3][4].
>
> Best regards,
> Jing
>
> [1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> [2] https://www.talon.one/glossary/granularity
> [3] https://www.quora.com/What-is-granularity-in-database
> [4] https://www.statisticshowto.com/data-granularity/
>
> On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li <[email protected]> wrote:
>
> > Hi,
> >
> > +1 for NDV (number of distinct values) is a widely used terminology in
> > table statistics.
> >
> > I've also seen the one called `distinctCount`.
> >
> > This name can be found in databases like oracle too. [1]
> >
> > So it is not good to change a completely different name.
> >
> > [1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
> >
> > Best,
> > Jingsong
> >
> > On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <[email protected]> wrote:
> >
> > > Hi Jing,
> > >
> > > I can see there might be developers who don't understand the meaning at
> > > the first glance.
> > > However, NDV is a widely used terminology in table statistics, see
> > > [1][2][3].
> > > If we use another name, it may confuse developers who are familiar with
> > > stats and optimization.
> > > I think at least, the Javadoc is needed to explain the meaning and full
> > > name.
> > > If we want to change the name, we can use the full name
> > > "numberOfDistinctValues()".
> > >
> > > Best,
> > > Jark
> > >
> > > [1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > > [2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > > [3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > >
> > > On Thu, 2 Jun 2022 at 14:44, Becket Qin <[email protected]> wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > While I do agree that NDV is a little confusing at first sight, it
> > > > seems quite concise once I got the meaning. So personally I am OK
> > > > with keeping it as is, but proper documentation would be helpful.
> > > > If we really want to replace it with a more professional name,
> > > > *cardinality* might be a good alternative.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <[email protected]> wrote:
> > > >
> > > > > Hi Dev,
> > > > >
> > > > > I am not really sure if it is feasible to start this discussion.
> > > > > According to the contribution guidelines, dev ml is the right
> > > > > place to reach consensus.
> > > > >
> > > > > In ColumnStats, Currently ndv, which stands for "number of distinct
> > > > > values", is used. First of all, it is difficult to understand the
> > > > > meaning with the abbreviation. Second, it might be good to use a
> > > > > professional naming instead.
> > > > >
> > > > > Suggestion:
> > > > >
> > > > > replace ndv with granularityNumber:
> > > > >
> > > > > The good news, afaik, is that the method getNdv() hasn't been used
> > > > > within Flink which means the renaming will have very limited impact.
> > > > >
> > > > > ColumnStats {
> > > > >
> > > > >     /** number of distinct values. */
> > > > >     @Deprecated
> > > > >     private final Long ndv;
> > > > >
> > > > >     /**
> > > > >      * Granularity refers to the level of details used to sort and
> > > > >      * separate data at column level. Highly granular data is
> > > > >      * categorized or separated very precisely. For example, the
> > > > >      * granularity number of gender columns should normally be 2.
> > > > >      * The granularity number of the month column will be 12. In
> > > > >      * the SQL world, it means the number of distinct values.
> > > > >      */
> > > > >     private final Long granularityNumber;
> > > > >
> > > > >     @Deprecated
> > > > >     public Long getNdv()
> > > > >     { return ndv; }
> > > > >
> > > > >     public Long getGranularityNumber()
> > > > >     { return granularityNumber; }
> > > > > }
> > > > >
> > > > > Best regards,
> > > > > --
> > > > >
> > > > > Jing
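
For illustration only, here is a minimal sketch of how a rename could look if the Calcite-style name were adopted and the short getter kept as a deprecated alias. The class name ColumnStatsSketch is hypothetical, and this is not Flink's actual ColumnStats; it only assumes the Long-typed ndv value and the getNdv() getter shown in the quoted proposal.

```java
/**
 * Hypothetical sketch, not Flink's actual ColumnStats: spells out the
 * abbreviation in the Javadoc and keeps getNdv() for existing callers.
 */
final class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column; null if unknown. */
    private final Long numDistinctVals;

    ColumnStatsSketch(Long numDistinctVals) {
        this.numDistinctVals = numDistinctVals;
    }

    /** Returns the number of distinct values (NDV), or null if unknown. */
    public Long getNumDistinctVals() {
        return numDistinctVals;
    }

    /**
     * Short alias kept for source compatibility.
     *
     * @deprecated use {@link #getNumDistinctVals()} instead.
     */
    @Deprecated
    public Long getNdv() {
        return getNumDistinctVals();
    }
}
```

The sketch uses a single backing field with two getters, so the old and new names cannot drift apart, which could happen with two separate fields.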
