Re: [DISCUSS] suggest using granularityNumber in ColumnStats

Jark Wu Thu, 02 Jun 2022 01:39:47 -0700

Hi Jing,

I can see that some developers might not understand the meaning at first
glance.
However, NDV is a widely used term in table statistics; see [1][2][3].
If we use another name, it may confuse developers who are familiar with
statistics and query optimization.
At the very least, I think Javadoc is needed to explain the meaning and the
full name.
If we want to change the name, we can use the full name
"numberOfDistinctValues()".

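To illustrate the Javadoc option, here is a minimal sketch. It is not the
actual Flink class: ColumnStatsSketch, fromValues(), and the choice to ignore
nulls when counting are assumptions made purely for illustration. It keeps the
short name, documents it, and adds a full-name alias.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for ColumnStats; only the NDV field is modeled. */
final class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column, or null if unknown. */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Illustrative helper: derives NDV by counting distinct non-null values.
     * Ignoring nulls is an assumption of this sketch.
     */
    static ColumnStatsSketch fromValues(Collection<?> values) {
        Set<Object> distinct = new HashSet<>();
        for (Object value : values) {
            if (value != null) {
                distinct.add(value);
            }
        }
        return new ColumnStatsSketch((long) distinct.size());
    }

    /**
     * Returns the NDV, i.e. the number of distinct values in the column.
     * For example, a month column normally has an NDV of 12.
     */
    Long getNdv() {
        return ndv;
    }

    /** Full-name alias for {@link #getNdv()}. */
    Long numberOfDistinctValues() {
        return getNdv();
    }
}
```

With this shape, ColumnStatsSketch.fromValues(Arrays.asList("F", "M", "F"))
reports an NDV of 2 through either accessor, so existing callers of getNdv()
would not need to change.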

Best,
Jark

[1]:
https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
[2]:
https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
[3]:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md

On Thu, 2 Jun 2022 at 14:44, Becket Qin <becket....@gmail.com> wrote:

> Hi Jing,
>
> While I do agree that NDV is a little confusing at first sight, it seems
> quite concise once I got the meaning. So personally I am OK with keeping it
> as is, but proper documentation would be helpful. If we really want to
> replace it with a more professional name, *cardinality* might be a good
> alternative.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Thu, Jun 2, 2022 at 12:51 AM Jing Ge <j...@ververica.com> wrote:
>
> > Hi Dev,
> >
> > I am not really sure if it is feasible to start this discussion, but
> > according to the contribution guidelines, the dev ML is the right place
> > to reach consensus.
> >
> > In ColumnStats, ndv, which stands for "number of distinct values", is
> > currently used. First of all, it is difficult to understand the meaning
> > from the abbreviation. Second, it might be good to use a more professional
> > name instead.
> >
> >
> >
> > Suggestion:
> >
> > replace ndv with granularityNumber:
> >
> >
> >
> > The good news, afaik, is that the method getNdv() hasn't been used within
> > Flink which means the renaming will have very limited impact.
> >
> >
> >
> > ColumnStats {
> >
> >     /** number of distinct values. */
> >     @Deprecated
> >     private final Long ndv;
> >
> >     /**
> >      * Granularity refers to the level of details used to sort and separate
> >      * data at column level. Highly granular data is categorized or separated
> >      * very precisely. For example, the granularity number of gender columns
> >      * should normally be 2. The granularity number of the month column will
> >      * be 12. In the SQL world, it means the number of distinct values.
> >      */
> >     private final Long granularityNumber;
> >
> >     @Deprecated
> >     public Long getNdv() {
> >         return ndv;
> >     }
> >
> >     public Long getGranularityNumber() {
> >         return granularityNumber;
> >     }
> > }
> >
> > Best regards,
> > --
> >
> > Jing
> >
>
