
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-14 Thread ASF GitHub Bot (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HUDI-7579:
-
Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Functional index (on col stats) creation fails to process all files/partitions
> --
>
> Key: HUDI-7579
> URL: https://issues.apache.org/jira/browse/HUDI-7579
> Project: Apache Hudi
> Issue Type: Bug
> Components: index
> Reporter: Vinaykumar Bhat
> Priority: Major
> Labels: hudi-1.0.0-beta2, pull-request-available
> Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files
> and partitions of the table. The col-stats MDT partition ends up with entries
> for only a subset of the table's files. An example follows.
>
> The following create-table and inserts should create a table with 3
> partitions (with each partition having one slice):
> {code:java}
> spark.sql(
>  s"""
> |create table test_table(
> |  id int,
> |  name string,
> |  ts long,
> |  price int
> |) using hudi
> | options (
> |  primaryKey ='id',
> |  type = 'cow',
> |  preCombineField = 'ts',
> |  hoodie.metadata.record.index.enable = 'true',
> |  hoodie.datasource.write.recordkey.field = 'id'
> | )
> | partitioned by(price)
> | location '$basePath'
> """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
> {code}
> Now create a functional index (using col stats) on this table. The col-stats
> partition in the MDT should have three entries (column-level stats for each
> of the 3 files). However, it has only a single entry (for one of the files).
>  
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
> {code}
> As seen below, col-stats has only one entry, for one of the files, and is
> missing statistics for the other two files:
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet,
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
> {null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
> 1, 0, 434874, 869748, false}*
>  
> {noformat}
> +------------------------------------------------+----+-------------------+
> |key                                             |type|ColumnStatsMetadata|
> +------------------------------------------------+----+-------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------+
> {noformat}
>  
>  
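
The mismatch described above (three data files expected, one col-stats entry found) can be expressed as a simple set difference. The following is only a minimal sketch in plain Scala: `ColStatsCoverage` and `missingFiles` are hypothetical names that are not part of Hudi, and the file names are placeholders rather than the real base files from the report.

```scala
// Hypothetical helper, not part of Hudi: compares the data files a table
// contains with the file names that appear in col-stats records.
object ColStatsCoverage {

  /** Returns the data files that have no col-stats entry, sorted by name. */
  def missingFiles(tableFiles: Set[String], colStatsFiles: Set[String]): Seq[String] =
    (tableFiles -- colStatsFiles).toSeq.sorted

  def main(args: Array[String]): Unit = {
    // Placeholder names standing in for the three base files (one per partition).
    val tableFiles = Set("file-a.parquet", "file-b.parquet", "file-c.parquet")
    // Only one file is indexed, mirroring the single row in the report.
    val colStatsFiles = Set("file-a.parquet")

    val missing = missingFiles(tableFiles, colStatsFiles)
    println(s"indexed ${colStatsFiles.size} of ${tableFiles.size} files; missing: ${missing.mkString(", ")}")
  }
}
```

In a real reproduction, the two sets would presumably come from a listing of the table's base files and from the file-name field of the `ColumnStatsMetadata` rows returned by the `hudi_metadata` query; an empty result would mean the index covers every file.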



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Labels: hudi-1.0.0-beta2  (was: )





[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
  Component/s: index
Epic Link: HUDI-512
Fix Version/s: 1.0.0





[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
Creating a functional index on an existing table fails to process all files and
partitions of the table. The col-stats MDT partition ends up with entries for
only a subset of the table's files. An example follows.

The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry, for one of the files, and is missing
statistics for the other two files:
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts,
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
1, 0, 434874, 869748, false}*

 
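{noformat}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{noformat}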
 
 

  was:
Creating a functional index on an existing table fails to process all files and
partitions of the table. The col-stats MDT partition ends up with entries for
only a subset of the table's files. An example follows.

The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
Creating a functional index on an existing table fails to process all files and
partitions of the table. The col-stats MDT partition ends up with entries for
only a subset of the table's files. An example follows.

The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry, for one of the files, and is missing
statistics for the other two files:
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts,
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
1, 0, 434874, 869748, false}*
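{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}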
 
 

  was:
The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stat in 
the MDT should have three entries (representing column level stats for 3 
files). However, col stats only has one single entry

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = 'cow',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry, for one of the files, and is missing
statistics for the other two files:
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts,
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
1, 0, 434874, 869748, false}*
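{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}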
 
 

  was:
The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = '$tableType',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}

Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key,

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

2024-04-09 Thread Vinaykumar Bhat (Jira)



 [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
--
Description: 
The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):
{code:java}
spark.sql(
 s"""
|create table test_table(
|  id int,
|  name string,
|  ts long,
|  price int
|) using hudi
| options (
|  primaryKey ='id',
|  type = '$tableType',
|  preCombineField = 'ts',
|  hoodie.metadata.record.index.enable = 'true',
|  hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}

Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('$tableName') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry, for one of the files, and is missing
statistics for the other two files:
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts,
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null},
1, 0, 434874, 869748, false}*
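{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}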

 
 

  was:
The following create-table and inserts should create a table with 3 partitions
(with each partition having one slice):

```
spark.sql(s"""create table test_table (id int, name string, ts long, price int) 
using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)

spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 20, 1000)")
```
 
Now create a functional index (using col stats) on this table. The col-stats
partition in the MDT should have three entries (column-level stats for each of
the 3 files). However, it has only a single entry (for one of the files).
 
```
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)

spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
```
 
As seen below, col-stats has only one entry for one of the file (and is
