Thanks all for your feedback! It is very informative.

To Becket:

Initially, I chose the same word because we use it in our daily work.
Before starting this discussion, I checked whether it was the right one,
and it turns out that *cardinality* has a very different (and also very
common) meaning in data modeling[1]. *Granularity*, on the other hand, is
actually the right word for what we mean when we use cardinality in the
context of NDV[2].

To Jark, Jingsong:

To me, NDV seems more like a function than a field defined in a class.
Briefly speaking, NDV is more SQL-oriented (implementation), while
*granularity* is more data-analytics-oriented (abstraction/concept)[3][4].
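
To illustrate the "function" view, here is a minimal sketch in which NDV is
computed over a column's values, much like COUNT(DISTINCT col) in SQL,
instead of being read from a stored field. The class `NdvSketch` and the
method `ndv` are hypothetical names used only for illustration; they are not
part of Flink.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; not part of Flink's ColumnStats.
public class NdvSketch {

    /** Counts the distinct values in a column, analogous to SQL COUNT(DISTINCT col). */
    public static <T> long ndv(List<T> columnValues) {
        // A HashSet keeps one copy of each value, so its size is the distinct count.
        return new HashSet<>(columnValues).size();
    }

    public static void main(String[] args) {
        // Six rows of a "month" column holding three different months.
        List<Integer> months = Arrays.asList(1, 2, 2, 12, 12, 12);
        System.out.println(ndv(months)); // prints 3
    }
}
```

A statistics class such as ColumnStats would only hold the result of such a
computation, which is where a concept-level name would apply.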

Best regards,
Jing

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
[2] https://www.talon.one/glossary/granularity
[3] https://www.quora.com/What-is-granularity-in-database
[4] https://www.statisticshowto.com/data-granularity/

On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li <jingsongl...@gmail.com> wrote:

> Hi,
>
> +1. NDV (number of distinct values) is widely used terminology in
> table statistics.
>
> I've also seen the one called `distinctCount`.
>
> This name can be found in databases like Oracle too. [1]
>
> So it is not a good idea to change it to a completely different name.
>
> [1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922
>
> Best,
> Jingsong
>
> On Thu, Jun 2, 2022 at 4:46 PM Jark Wu <imj...@gmail.com> wrote:
>
> > Hi Jing,
> >
> > I can see there might be developers who don't understand the meaning
> > at first glance.
> > However, NDV is widely used terminology in table statistics, see
> > [1][2][3].
> > If we use another name, it may confuse developers who are familiar with
> > stats and optimization.
> > I think that, at the very least, Javadoc is needed to explain the
> > meaning and the full name.
> > If we want to change the name, we can use the full name
> > "numberOfDistinctValues()".
> >
> > Best,
> > Jark
> >
> > [1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
> > [2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
> > [3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md
> >
> > On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:
> >
> > > Hi Jing,
> > >
> > > While I do agree that NDV is a little confusing at first sight, it
> > > seems quite concise once I got the meaning. So personally I am OK with
> > > keeping it as is, but proper documentation would be helpful. If we
> > > really want to replace it with a more professional name, *cardinality*
> > > might be a good alternative.
> > >
> > > Thanks,
> > >
> > > Jiangjie (Becket) Qin
> > >
> > > On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
> > >
> > > > Hi Dev,
> > > >
> > > > I am not really sure whether it is appropriate to start this
> > > > discussion. According to the contribution guidelines, the dev ML is
> > > > the right place to reach consensus.
> > > >
> > > > In ColumnStats, ndv, which stands for "number of distinct values", is
> > > > currently used. First of all, it is difficult to understand the
> > > > meaning from the abbreviation. Second, it might be good to use a more
> > > > professional name instead.
> > > >
> > > >
> > > >
> > > > Suggestion:
> > > >
> > > > replace ndv with granularityNumber:
> > > >
> > > >
> > > >
> > > > The good news, afaik, is that the method getNdv() hasn't been used
> > > > within Flink, which means the renaming will have very limited impact.
> > > >
> > > >
> > > >
> > > > public class ColumnStats {
> > > >
> > > >     // Excerpt: constructor and other fields omitted.
> > > >
> > > >     /** Number of distinct values. */
> > > >     @Deprecated
> > > >     private final Long ndv;
> > > >
> > > >     /**
> > > >      * Granularity refers to the level of detail used to sort and
> > > >      * separate data at column level. Highly granular data is
> > > >      * categorized or separated very precisely. For example, the
> > > >      * granularity number of a gender column should normally be 2.
> > > >      * The granularity number of a month column will be 12. In the
> > > >      * SQL world, it means the number of distinct values.
> > > >      */
> > > >     private final Long granularityNumber;
> > > >
> > > >     @Deprecated
> > > >     public Long getNdv() {
> > > >         return ndv;
> > > >     }
> > > >
> > > >     public Long getGranularityNumber() {
> > > >         return granularityNumber;
> > > >     }
> > > > }
> > > >
> > > > Best regards,
> > > > --
> > > >
> > > > Jing
> > > >
> > >
> >
>
