Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Recently I have been working on collecting statistics for Hive partitioned tables [1], and I would like to share my experience as a developer.

I have to admit that ndv really confused me at first glance, but I could easily find out what it means with a web search for keywords like "ndv statistic". To be honest, the name granularityNumber is not intuitive to me either, and it is even harder to look up with keywords like "granularityNumber statistic". Personally, I prefer to use ndv in ColumnStats.

[1] https://issues.apache.org/jira/browse/FLINK-27597

Best regards,
Yuxia

> On Jun 2, 2022, at 12:44 AM, Jing Ge wrote:
>
> [...]
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Hi Jing,

Hmm, granularity and NDV still don't seem to mean the same thing to me. Granularity basically describes how detailed the data is, in other words, whether a field / column can be further divided. For example, a field like "age" cannot be further divided, so it is quite granular. In contrast, an "address" field can be further divided into "street", "city", "country", etc., so "address" is less granular. NDV, on the other hand, means how many distinct values there are in the field / column, which is orthogonal to granularity.

Anyway, it looks like most people think NDV or its full phrase is a better name. It probably makes sense to just use either of them.

Thanks,

Jiangjie (Becket) Qin

On Fri, Jun 3, 2022 at 9:45 PM Jark Wu wrote:

> [...]
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Hi Jing,

I agree with you that "NDV is more SQL-oriented(implementation) and granularity is more data analytics-oriented". As you said, "granularity" may be commonly used in data modeling and business contexts. However, TableStats is not used for data modeling; it is an implementation detail of SQL optimization. NDV is the established term in the optimizer field, and Calcite also uses it [1]. I haven't seen any vendor use "granularity" for this purpose. If I missed any, please correct me.

If NDV sounds like a function to you, I'm OK with using "numDistinctVals" as Calcite does.

Best,
Jark

[1]: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/RelMdUtil.html#numDistinctVals(java.lang.Double,java.lang.Double)

On Fri, 3 Jun 2022 at 00:14, Jing Ge wrote:

> [...]
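To illustrate why NDV counts as an optimizer statistic, here is a minimal sketch of the textbook selectivity estimate for an equality predicate, which assumes values are uniformly distributed across the distinct values. The class and method names (NdvSelectivity, equalitySelectivity, estimatedMatchingRows) are hypothetical and made up for this example; this is not how Flink or Calcite implement their estimates.

```java
import java.util.OptionalDouble;

/** Hypothetical helper showing how a planner might consume an NDV statistic. */
public final class NdvSelectivity {

    private NdvSelectivity() {}

    /**
     * Estimated fraction of rows matching "col = constant", assuming a uniform
     * distribution over the distinct values. Empty if the NDV is unknown.
     */
    public static OptionalDouble equalitySelectivity(Long ndv) {
        if (ndv == null || ndv <= 0) {
            return OptionalDouble.empty(); // statistic missing or unusable
        }
        return OptionalDouble.of(1.0 / ndv);
    }

    /** Estimated number of rows matching "col = constant" for a table of rowCount rows. */
    public static OptionalDouble estimatedMatchingRows(long rowCount, Long ndv) {
        if (ndv == null || ndv <= 0) {
            return OptionalDouble.empty();
        }
        return OptionalDouble.of((double) rowCount / ndv);
    }

    public static void main(String[] args) {
        // A month column with 12 distinct values in a table of 1,200,000 rows.
        System.out.println(equalitySelectivity(12L));
        System.out.println(estimatedMatchingRows(1_200_000L, 12L));
    }
}
```

The larger the NDV, the fewer rows an equality filter is expected to keep, which is the kind of decision input a cost-based optimizer takes from column statistics.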
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Thanks all for your feedback! It is very informative.

to Becket:

At the beginning, I chose the same word because we used it in daily work. Before I started this discussion, to make sure it was the right one, I did some checking, and it turns out that *cardinality* has a very different (and also very common) meaning in data modeling [1]. On the other hand, *granularity* is actually the right word for what we mean when we use cardinality in the context of NDV [2].

to Jark, Jingsong:

NDV seems to me more like a function than a field defined in a class. Briefly speaking, NDV is more SQL-oriented(implementation) and *granularity* is more data analytics-oriented(abstraction/concept) [3][4].

Best regards,
Jing

[1] https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
[2] https://www.talon.one/glossary/granularity
[3] https://www.quora.com/What-is-granularity-in-database
[4] https://www.statisticshowto.com/data-granularity/

On Thu, Jun 2, 2022 at 11:16 AM Jingsong Li wrote:

> [...]
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Hi,

+1. NDV (number of distinct values) is a widely used term in table statistics.

I've also seen it called `distinctCount`.

This name can be found in databases like Oracle too. [1]

So it is not a good idea to change it to a completely different name.

[1] https://docs.oracle.com/database/121/TGSQL/glossary.htm#GUID-34DC46FD-32CE-4242-8ED9-945AE7A9F922

Best,
Jingsong

On Thu, Jun 2, 2022 at 4:46 PM Jark Wu wrote:

> [...]
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
As I am currently working on collecting statistics for Hive partitioned tables [1], I would like to share my experience as a developer.

I have to admit that ndv really confused me at first glance, but I could easily find out what it means with a web search for keywords like "ndv statistic". To be honest, the name granularityNumber is not intuitive to me either, and it is even harder to look up with keywords like "granularityNumber statistic". So I prefer to keep using ndv in ColumnStats.

[1] https://issues.apache.org/jira/browse/FLINK-27597

From: Jing Ge
Date: June 2, 2022, 00:21
Subject: [DISCUSS] suggest using granularityNumber in ColumnStats
To: dev

> [...]

yuxia Luo
luoyu...@alumni.sjtu.edu.cn

Best,
yuxia
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Hi Jing,

I can see that some developers might not understand the meaning at first glance. However, NDV is a widely used term in table statistics, see [1][2][3]. If we use another name, it may confuse developers who are familiar with stats and optimization. I think that, at the very least, the Javadoc needs to explain the meaning and the full name. If we want to change the name, we can use the full name "numberOfDistinctValues()".

Best,
Jark

[1]: https://www.alibabacloud.com/help/en/maxcompute/latest/collect-information-for-the-optimizer-of-maxcompute
[2]: https://docs.dremio.com/software/sql-reference/sql-functions/functions/ndv/
[3]: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/README.md

On Thu, 2 Jun 2022 at 14:44, Becket Qin wrote:

> [...]
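As a rough illustration of the two options mentioned in this message (documenting the abbreviation in Javadoc, or exposing the full name), a sketch could look like the following. The class name DocumentedColumnStats and the alias getNumberOfDistinctValues() are hypothetical, chosen only for this example; this is not the actual Flink ColumnStats class.

```java
/** Hypothetical, trimmed-down stand-in for a column statistics holder. */
public final class DocumentedColumnStats {

    /**
     * NDV: the number of distinct values in the column.
     * May be null when the statistic is unknown.
     */
    private final Long ndv;

    public DocumentedColumnStats(Long ndv) {
        this.ndv = ndv;
    }

    /**
     * Returns the NDV (number of distinct values) of the column,
     * or null if the statistic is unknown.
     */
    public Long getNdv() {
        return ndv;
    }

    /** Full-name variant of {@link #getNdv()}, shown as the alternative naming. */
    public Long getNumberOfDistinctValues() {
        return ndv;
    }

    public static void main(String[] args) {
        DocumentedColumnStats monthStats = new DocumentedColumnStats(12L);
        System.out.println(monthStats.getNdv());
    }
}
```

Either option keeps the established term searchable while spelling out its meaning for readers who meet it for the first time.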
Re: [DISCUSS] suggest using granularityNumber in ColumnStats
Hi Jing,

While I do agree that NDV is a little confusing at first sight, it seems quite concise once you know the meaning. So personally I am OK with keeping it as is, but proper documentation would be helpful. If we really want to replace it with a more professional name, *cardinality* might be a good alternative.

Thanks,

Jiangjie (Becket) Qin

On Thu, Jun 2, 2022 at 12:51 AM Jing Ge wrote:

> [...]
[DISCUSS] suggest using granularityNumber in ColumnStats
Hi Dev,

I am not really sure whether it is appropriate to start this discussion. According to the contribution guidelines, the dev ML is the right place to reach consensus.

In ColumnStats, ndv, which stands for "number of distinct values", is currently used. First of all, the meaning is difficult to understand from the abbreviation. Second, it might be good to use a more professional name instead.

Suggestion:

replace ndv with granularityNumber:

The good news, afaik, is that the method getNdv() hasn't been used within Flink, which means the renaming will have very limited impact.

    public class ColumnStats {

        /** Number of distinct values. */
        @Deprecated
        private final Long ndv;

        /**
         * Granularity refers to the level of detail used to sort and separate
         * data at column level. Highly granular data is categorized or separated
         * very precisely. For example, the granularity number of a gender column
         * should normally be 2. The granularity number of a month column will be
         * 12. In the SQL world, it means the number of distinct values.
         */
        private final Long granularityNumber;

        public ColumnStats(Long ndv, Long granularityNumber) {
            this.ndv = ndv;
            this.granularityNumber = granularityNumber;
        }

        @Deprecated
        public Long getNdv() {
            return ndv;
        }

        public Long getGranularityNumber() {
            return granularityNumber;
        }
    }

Best regards,
--

Jing
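The gender and month examples in the Javadoc above can be made concrete with a small sketch that counts distinct non-null values in an in-memory column. The class NdvExample and the method countDistinct are hypothetical names for illustration; real engines typically derive this statistic through ANALYZE-style commands or approximate sketches rather than exact counting like this.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical illustration of what "number of distinct values" measures. */
public final class NdvExample {

    private NdvExample() {}

    /** Counts distinct non-null values, mirroring SQL's COUNT(DISTINCT col). */
    public static <T> long countDistinct(List<T> column) {
        return column.stream()
                .filter(Objects::nonNull) // COUNT(DISTINCT ...) ignores NULLs
                .distinct()
                .count();
    }

    public static void main(String[] args) {
        List<String> gender = Arrays.asList("F", "M", "M", "F", null, "M");
        List<Integer> month = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 6, 12);

        System.out.println("gender ndv = " + countDistinct(gender));
        System.out.println("month ndv = " + countDistinct(month));
    }
}
```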