Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-04 Thread Yuxia Luo

Recently I have been working on collecting statistics for Hive partitioned
tables [1], and I would like to share my experience as a developer.

I have to admit that ndv really confused me at first glance, but I could
easily find out what it means with a web search for keywords like "ndv
statistic".

To be honest, the name granularityNumber is not intuitive to me either, and
it is even harder to find out what it means with a web search for keywords
like "granularityNumber statistic".

Personally, I prefer to use ndv in ColumnStats.

[1] https://issues.apache.org/jira/browse/FLINK-27597


Best regards,
Yuxia




Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-04 Thread Becket Qin

Hi Jing,

Hmm, granularity and NDV still don't seem to mean the same thing to me.
Granularity basically means how detailed the data is, in other words,
whether a field / column can be further divided. For example, a field like
"age" cannot be further divided, so it is quite granular. In contrast, an
"address" field can be further divided into "street", "city", "country",
etc. Therefore "address" is less granular. NDV, on the other hand,
means how many distinct values there are in the field / column,
which is orthogonal to the granularity.
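
A minimal Java sketch of this distinction. The class NdvExample, its ndv
helper, and the sample data are hypothetical and not part of Flink. It only
shows that the number of distinct values depends on the data in the column,
not on whether the field could be split further:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not Flink code. */
class NdvExample {

    /** Counts the distinct non-null values in a column. */
    static long ndv(List<String> column) {
        return column.stream().filter(v -> v != null).distinct().count();
    }

    public static void main(String[] args) {
        // "age" cannot be divided further (quite granular), yet this sample
        // holds only two distinct values.
        List<String> age = Arrays.asList("31", "31", "42", "42");

        // "address" can be divided into street, city, country (less granular),
        // yet this sample holds three distinct values.
        List<String> address = Arrays.asList(
                "1 Main St, Springfield, US",
                "2 Oak Ave, Shelbyville, US",
                "1 Main St, Springfield, US",
                "9 Elm Rd, Ogdenville, US");

        System.out.println("ndv(age) = " + ndv(age));
        System.out.println("ndv(address) = " + ndv(address));
    }
}
```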

Anyways, it looks like most people think NDV or its full phrase is a better
name. It probably makes sense to just use either of them.

Thanks,

Jiangjie (Becket) Qin



Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-03 Thread Jark Wu

Hi Jing,

I agree with you that "NDV is more SQL-oriented (implementation)
and granularity is more data analytics-oriented". As you said, "granularity"
may be commonly used in data modeling and business-related contexts.
However, TableStats is not used for data modeling but is an implementation
detail for SQL optimization. NDV is the terminology in the optimizer field,
and Calcite also uses this word [1]. I'm not aware of any vendors
using "granularity" for this purpose. If I missed any, please correct me.

If NDV sounds like a function to you, I'm OK with using "numDistinctVals" as
Calcite does.
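
To make the optimizer use case concrete, here is a minimal Java sketch of the
textbook heuristic that estimates the rows matching `col = constant` as
rowCount / NDV, assuming uniformly distributed values. The class
NdvSelectivity and its method are hypothetical; this is not how Flink's or
Calcite's planner is actually implemented:

```java
/** Hypothetical illustration only; not Flink or Calcite code. */
class NdvSelectivity {

    /**
     * Estimates how many rows match an equality predicate on a column,
     * assuming every distinct value is equally frequent.
     */
    static double estimateEqualityRows(long rowCount, Long ndv) {
        if (ndv == null || ndv <= 0) {
            // Unknown statistic: fall back to the full row count (assumption
            // made for this sketch; a real planner may use a default ratio).
            return rowCount;
        }
        return (double) rowCount / ndv;
    }

    public static void main(String[] args) {
        // A table with 1,200,000 rows and a month column with 12 distinct values.
        System.out.println(estimateEqualityRows(1_200_000L, 12L));
    }
}
```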

Best,
Jark


[1]:
https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)


Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jing Ge

Thanks all for your feedback! It is very informative.

To Becket:

At the beginning, I chose the same word because we use it in daily work.
Before I started this discussion, to make sure it was the right one, I did
some checking, and it turns out that *cardinality* has a very different
(and also very common) meaning within data modeling [1]. On the other hand,
*granularity* is actually the right word for what we mean when we use
cardinality in the context of NDV [2].

To Jark and Jingsong:

NDV seems to me more like a function than a field defined in a class.
Briefly speaking, NDV is more SQL-oriented (implementation) and
*granularity* is more data analytics-oriented (abstraction/concept) [3][4].

Best regards,
Jing

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
[2] https://www.talon.one/glossary/granularity
[3] https://www.quora.com/What-is-granularity-in-database
[4] https://www.statisticshowto.com/data-granularity/


Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jingsong Li

Hi,

+1 for NDV (number of distinct values); it is a widely used term in
table statistics.

I've also seen it called `distinctCount`.

The name NDV can be found in databases like Oracle too. [1]

So it is not good to change to a completely different name.

[1]
https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922

Best,
Jingsong


Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread luoyu...@alumni.sjtu.edu.cn

As I am currently working on collecting statistics for Hive partitioned
tables [1], I would like to share my experience as a developer.

I have to admit that ndv really confused me at first glance, but I could
easily find out what it means with a web search for keywords like "ndv
statistic".

To be honest, the name granularityNumber is not intuitive to me either, and
it is even harder to find out what it means with a web search for keywords
like "granularityNumber statistic".

So I prefer to keep using ndv in ColumnStats.

[1] https://issues.apache.org/jira/browse/FLINK-27597 

yuxia Luo
luoyu...@alumni.sjtu.edu.cn
Best, yuxia

Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-02 Thread Jark Wu

Hi Jing,

I can see there might be developers who don't understand the meaning at
first glance.
However, NDV is a widely used term in table statistics, see
[1][2][3].
If we use another name, it may confuse developers who are familiar with
stats and optimization.
I think that, at the least, Javadoc is needed to explain the meaning and
the full name.
If we want to change the name, we can use the full name
"numberOfDistinctValues()".
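
A minimal Java sketch of what that could look like. The class name
ColumnStatsSketch and its constructor are assumptions made for illustration.
The real ColumnStats class has more fields, and nothing here reflects a
decision made in this thread:

```java
/**
 * Hypothetical sketch, not the actual Flink class: only the NDV-related part
 * is shown, and the constructor is an assumption made for illustration.
 */
class ColumnStatsSketch {

    /** Number of distinct values (NDV) in the column; null when unknown. */
    private final Long ndv;

    ColumnStatsSketch(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the number of distinct values (NDV) of the column, a statistic
     * commonly used by query optimizers, or null if it is unknown.
     */
    Long getNdv() {
        return ndv;
    }

    /** Spelled-out alias for {@link #getNdv()}, for readers new to the abbreviation. */
    Long numberOfDistinctValues() {
        return getNdv();
    }

    public static void main(String[] args) {
        ColumnStatsSketch month = new ColumnStatsSketch(12L);
        System.out.println(month.getNdv() + " " + month.numberOfDistinctValues());
    }
}
```

Keeping a single backing field means the abbreviated and the spelled-out
accessor can never disagree.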

Best,
Jark

[1]:
https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
[2]:
https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
[3]:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md


Re: [DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-01 Thread Becket Qin

Hi Jing,

While I do agree that NDV is a little confusing at first sight, it seems
quite concise once I got the meaning. So personally I am OK with keeping it
as is, but proper documentation would be helpful. If we really want to
replace it with a more professional name, *cardinality* might be a good
alternative.

Thanks,

Jiangjie (Becket) Qin


[DISCUSS] suggest using granularityNumber in ColumnStats

2022-06-01 Thread Jing Ge

Hi Dev,

I am not really sure if it is feasible to start this discussion. According
to the contribution guidelines, dev ml is the right place to reach
consensus.

In ColumnStats, ndv, which stands for "number of distinct
values", is currently used. First, the abbreviation makes the meaning
difficult to understand. Second, it might be good to use a more
professional name instead.

Suggestion:

replace ndv with granularityNumber:

The good news, afaik, is that the method getNdv() isn't used within
Flink, which means the renaming will have very limited impact.



public class ColumnStats {

    /** Number of distinct values. */
    @Deprecated
    private final Long ndv;

    /**
     * Granularity refers to the level of detail used to sort and separate
     * data at column level. Highly granular data is categorized or separated
     * very precisely. For example, the granularity number of a gender column
     * should normally be 2. The granularity number of a month column will be
     * 12. In the SQL world, it means the number of distinct values.
     */
    private final Long granularityNumber;

    // Constructor and other fields omitted.

    @Deprecated
    public Long getNdv() {
        return ndv;
    }

    public Long getGranularityNumber() {
        return granularityNumber;
    }
}

Best regards,
-- 

Jing
