[ 
https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
      Component/s: index
        Epic Link: HUDI-512
    Fix Version/s: 1.0.0

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>             Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files 
> and partitions of the table. The col-stats MDT partition ends up having an 
> entry only for subset of files that belong to the table. An example follows.
>  
> The following create-table and inserts should create a table with 3 
> partitions (with each partition having one slice){{{{}}{}}}
> {code:java}
>     spark.sql(
>      s"""
>         |create table test_table(
>         |  id int,
>         |  name string,
>         |  ts long,
>         |  price int
>         |) using hudi
>         | options (
>         |  primaryKey ='id',
>         |  type = 'cow',
>         |  preCombineField = 'ts',
>         |  hoodie.metadata.record.index.enable = 'true',
>         |  hoodie.datasource.write.recordkey.field = 'id'
>         | )
>         | partitioned by(price)
>         | location '$basePath'
> """.stripMargin)
>    spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 
> 1000, 10)")
>    spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 
> 200000, 100)")
>    spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 
> 2000000000, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stat 
> in the MDT should have three entries (representing column level stats for 3 
> files). However, col stats only has one single entry (for one of the file).
>  
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using 
> column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from 
> hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry for one of the file (and is 
> missing statistics for two other files): 
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, {null, null, null, null, null, null,
> {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, 
> {1970-01-01}
> , null, null, null, null}, 1, 0, 434874, 869748, false}*
>  
> {noformat}
>  
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |key                                             |type|ColumnStatsMetadata    
>                                                                               
>                                                                               
>                                                                               
>   |
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   
> |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, 
> ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, 
> null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> {noformat}
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to