[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated HUDI-7579:
---------------------------------
    Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2, pull-request-available
>             Fix For: 1.0.0
>
>
> Creating a functional index on an existing table fails to process all files
> and partitions of the table. The col-stats partition of the metadata table
> (MDT) ends up with entries for only a subset of the table's files. An
> example follows.
>
> The following create-table and insert statements should create a table with
> 3 partitions (with each partition having one slice):
> {code:java}
> // Create a COW table partitioned by price.
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>    """.stripMargin)
> // Each insert uses a different price, so each row lands in its own partition.
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 200000, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 2000000000, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stats
> partition in the MDT should have three entries (representing column-level
> stats for the 3 files). However, col stats has only a single entry (for one
> of the files).
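
For reference, here is a rough sketch of the values that the three expected col-stats entries would presumably carry once the index covers every file. It applies the same `yyyy-MM-dd` formatting to the three `ts` values from the inserts above. Two assumptions are not stated in the report: that `ts` is treated as epoch seconds (as Spark's `from_unixtime` does) and that the session time zone is UTC. The object name `ExpectedFunctionalIndexValues` is hypothetical and exists only for this illustration; it does not call Hudi or Spark.

```scala
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

// Hypothetical helper, not part of Hudi: mimics from_unixtime(ts, 'yyyy-MM-dd')
// with plain java.time, ASSUMING a UTC session time zone.
object ExpectedFunctionalIndexValues {
  private val fmt: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC)

  // Formats each epoch-second value as a date string.
  def expectedDates(tsValues: Seq[Long]): Seq[String] =
    tsValues.map(ts => fmt.format(Instant.ofEpochSecond(ts)))

  def main(args: Array[String]): Unit = {
    // ts values taken from the three insert statements in the report.
    val dates = expectedDates(Seq(1000L, 200000L, 2000000000L))
    dates.foreach(println) // prints 1970-01-01, 1970-01-03, 2033-05-18
  }
}
```

Under these assumptions, each file holds a single record, so its min and max would both equal the one formatted date. Only the `1970-01-01` value appears in the reported output, which matches the file written for `ts = 1000`.
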
>
> {code:java}
> val createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')"
> spark.sql(createIndexSql)
> // type = 3 selects the col-stats records of the metadata table.
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has an entry for only one of the files (and is
> missing statistics for the other two files):
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
>
> {noformat}
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |key                                             |type|ColumnStatsMetadata                                                                                                                                                                                                                                                |
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> {noformat}
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)