[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

Vinaykumar Bhat (Jira) Tue, 09 Apr 2024 03:06:11 -0700


     [ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice){{{{}}{}}}
{code:java}
    spark.sql(
     s"""
        |create table test_table(
        |  id int,
        |  name string,
        |  ts long,
        |  price int
        |) using hudi
        | options (
        |  primaryKey ='id',
        |  type = '$tableType',
        |  preCombineField = 'ts',
        |  hoodie.metadata.record.index.enable = 'true',
        |  hoodie.datasource.write.recordkey.field = 'id'
        | )
        | partitioned by(price)
        | location '$basePath'
""".stripMargin)
   spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
1000, 10)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
200000, 100)")
   spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
2000000000, 1000)"){code}

Now create a functional index (using col stats) on this table. The col-stat in 
the MDT should have three entries (representing column level stats for 3 
files). However, col stats only has one single entry (for one of the file).
 
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('$tableName') where type = 3").show(false) {code}
As seen below, col-stats has only one entry for one of the file (and is missing 
statistics for two other files): 
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}*
{code:java}
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|key                                             |type|ColumnStatsMetadata      
                                                                                
                                                                                
                                                                          |
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
 {code}

 
 

  was:
The following create-table and inserts should create a table with 3 partitions 
(with each partition having one slice){{{}{}}}

```
spark.sql(s"""create table test_table (id int, name string, ts long, price int) 
using hudi
| options (
| primaryKey ='id',
| type = '$tableType',
| preCombineField = 'ts',
| hoodie.metadata.record.index.enable = 'true',
| hoodie.datasource.write.recordkey.field = 'id'
| )
| partitioned by(price)
| location '$basePath'
""".stripMargin)

spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 
10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 
200000, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 
2000000000, 1000)")
```
 
Now create a functional index (using col stats) on this table. The col-stat in 
the MDT should have three entries (representing column level stats for 3 
files). However, col stats only has one single entry (for one of the file).
 
```
var createIndexSql = s"create index idx_datestr on test_table using 
column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"

spark.sql(createIndexSql)

spark.sql(s"select key, type, ColumnStatsMetadata from 
hudi_metadata('test_table') where type = 3").show(false)
```
 
As seen below, col-stats has only one entry for one of the file (and is missing 
statistics for two other files): 
`\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}`
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
{{|key                                             |type|ColumnStatsMetadata    
                                                                                
                                                                                
                                                                            |}}
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
{{|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
|\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 
\{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 
0, 434874, 869748, false}|}}
{{+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+}}
 
 


> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
> The following create-table and inserts should create a table with 3 
> partitions (with each partition having one slice){{{{}}{}}}
> {code:java}
>     spark.sql(
>      s"""
>         |create table test_table(
>         |  id int,
>         |  name string,
>         |  ts long,
>         |  price int
>         |) using hudi
>         | options (
>         |  primaryKey ='id',
>         |  type = '$tableType',
>         |  preCombineField = 'ts',
>         |  hoodie.metadata.record.index.enable = 'true',
>         |  hoodie.datasource.write.recordkey.field = 'id'
>         | )
>         | partitioned by(price)
>         | location '$basePath'
> """.stripMargin)
>    spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
> 1000, 10)")
>    spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
> 200000, 100)")
>    spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
> 2000000000, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stat 
> in the MDT should have three entries (representing column level stats for 3 
> files). However, col stats only has one single entry (for one of the file).
>  
> {code:java}
> var createIndexSql = s"create index idx_datestr on $tableName using 
> column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from 
> hudi_metadata('$tableName') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry for one of the file (and is 
> missing statistics for two other files): 
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, 1, 0, 434874, 869748, false}*
> {code:java}
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |key                                             |type|ColumnStatsMetadata    
>                                                                               
>                                                                               
>                                                                               
>   |
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
> |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>  {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions

Reply via email to