Hi Jing,

I agree with you that "NDV is more SQL-oriented(implementation)
and granularity is more data analytics-oriented". As you said,
"granularity" may be commonly used in data modeling and business contexts.
However, TableStats is not used for data modeling; it is an implementation
detail of SQL optimization. NDV is the established term in the optimizer
field, and Calcite also uses it[1]. I haven't seen any vendor use
"granularity" for this purpose. If I missed any, please correct me.

If NDV sounds like a function to you, I'm OK with using "numDistinctVals",
as Calcite does.
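
To illustrate why this statistic belongs to the optimizer vocabulary, here
is a minimal sketch. It is not Flink or Calcite code: the class NdvSketch and
its methods are hypothetical names, and it assumes the textbook
uniform-distribution estimate, where an equality predicate on a column is
expected to keep about rowCount / NDV rows.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Flink or Calcite.
class NdvSketch {

    /** Counts the number of distinct values (NDV) in an in-memory column. */
    static long numDistinctVals(List<String> column) {
        return new HashSet<>(column).size();
    }

    /**
     * Estimates how many rows survive "col = constant", assuming values are
     * uniformly distributed across the distinct values (an assumption).
     */
    static double estimateEqualityRows(long rowCount, long ndv) {
        if (ndv <= 0) {
            // No usable statistic: fall back to assuming every row matches.
            return rowCount;
        }
        return (double) rowCount / ndv;
    }

    public static void main(String[] args) {
        List<String> month = Arrays.asList("Jan", "Feb", "Jan", "Mar", "Feb", "Jan");
        long ndv = numDistinctVals(month);
        System.out.println("ndv = " + ndv);
        System.out.println("estimated rows = " + estimateEqualityRows(month.size(), ndv));
    }
}
```

An optimizer would read the NDV from collected column statistics rather than
recompute it, but the way the value feeds a row-count estimate is the same.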

Best,
Jark


[1]:
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)

On Fri, 3 Jun 2022 at 00:14, Jing Ge <j...@ververica.com> wrote:

> Thanks all for your feedback! It is very informative.
>
> to Becket:
>
> At the beginning, I chose the same word because we used it in daily work.
> Before I started this discussion, to make sure it is the right one, I did
> some checking and it turns out that *cardinality* has a very different
> (also very common) meaning within data modeling[1]. And on the other side
> *granularity* is actually the right word for the meaning when we use
> cardinality in the context of NDV[2].
>
> to Jark, Jingsong,
>
> NDV seems to me more like a function than a field defined in a class.
> Briefly speaking, NDV is more SQL-oriented(implementation) and
> *granularity* is more data analytics-oriented(abstraction/concept)[3][4].
>
> Best regards,
> Jing
>
> [1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> [2] https://www.talon.one/glossary/granularity
> [3] https://www.quora.com/What-is-granularity-in-database
> [4] https://www.statisticshowto.com/data-granularity/
>
> On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li <jingsongl...@gmail.com>
> wrote:
>
> > Hi,
> >
> > +1 for NDV (number of distinct values). It is a widely used term in
> > table statistics.
> >
> > I've also seen the one called `distinctCount`.
> >
> > This name can be found in databases like Oracle too. [1]
> >
> > So it is not good to change to a completely different name.
> >
> > [1]
> > https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
> >
> > Best,
> > Jingsong
> >
> > On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <imj...@gmail.com> wrote:
> >
> > > Hi Jing,
> > >
> > > I can see there might be developers who don't understand the meaning at
> > > first glance.
> > > However, NDV is a widely used term in table statistics, see
> > > [1][2][3].
> > > If we use another name, it may confuse developers who are familiar with
> > > stats and optimization.
> > > I think that, at the very least, Javadoc is needed to explain the meaning
> > > and the full name.
> > > If we want to change the name, we can use the full name
> > > "numberOfDistinctValues()".
> > >
> > > Best,
> > > Jark
> > >
> > > [1]:
> > > https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > > [2]:
> > > https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > > [3]:
> > > https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> > >
> > > On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:
> > >
> > > > Hi Jing,
> > > >
> > > > While I do agree that NDV is a little confusing at first sight, it
> > > > seems quite concise once I got the meaning. So personally I am OK with
> > > > keeping it as is, but proper documentation would be helpful. If we
> > > > really want to replace it with a more professional name, *cardinality*
> > > > might be a good alternative.
> > > >
> > > > Thanks,
> > > >
> > > > Jiangjie (Becket) Qin
> > > >
> > > > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
> > > >
> > > > > Hi Dev,
> > > > >
> > > > > I am not really sure if it is feasible to start this discussion.
> > > > > According to the contribution guidelines, dev ml is the right place to
> > > > > reach consensus.
> > > > >
> > > > > ColumnStats currently uses ndv, which stands for "number of distinct
> > > > > values". First of all, it is difficult to understand the meaning from
> > > > > the abbreviation. Second, it might be good to use a more professional
> > > > > name instead.
> > > > >
> > > > >
> > > > >
> > > > > Suggestion:
> > > > >
> > > > > replace ndv with granularityNumber:
> > > > >
> > > > >
> > > > >
> > > > > The good news, afaik, is that the method getNdv() hasn't been used
> > > > > within Flink, which means the renaming will have very limited impact.
> > > > >
> > > > >
> > > > >
> > > > > ColumnStats {
> > > > >
> > > > > /** number of distinct values. */
> > > > >
> > > > > @Deprecated
> > > > > private final Long ndv;
> > > > >
> > > > >
> > > > >
> > > > > /**
> > > > >  * Granularity refers to the level of detail used to sort and separate
> > > > >  * data at column level. Highly granular data is categorized or
> > > > >  * separated very precisely. For example, the granularity number of
> > > > >  * gender columns should normally be 2. The granularity number of the
> > > > >  * month column will be 12. In the SQL world, it means the number of
> > > > >  * distinct values.
> > > > >  */
> > > > >
> > > > > private final Long granularityNumber;
> > > > >
> > > > >
> > > > >
> > > > > @Deprecated
> > > > > public Long getNdv()
> > > > > { return ndv; }
> > > > >
> > > > >
> > > > >
> > > > > public Long getGranularityNumber()
> > > > > { return granularityNumber; }
> > > > > }
> > > > >
> > > > > Best regards,
> > > > > --
> > > > >
> > > > > Jing
> > > > >
> > > >
> > >
> >
>
