[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated HUDI-7579:
---------------------------------
    Labels: hudi-1.0.0-beta2 pull-request-available  (was: hudi-1.0.0-beta2)

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2, pull-request-available
>             Fix For: 1.0.0
>
> Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up with entries for only a subset of the files that belong to the table. An example follows.
>
> The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
> {code:java}
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>    """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
> {code}
> Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
>
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
> {code}
> As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
>
> {noformat}
> +------------------------------------------------+----+-------------------+
> |key                                             |type|ColumnStatsMetadata|
> +------------------------------------------------+----+-------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+-------------------+
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Labels: hudi-1.0.0-beta2  (was: )

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
      Component/s: index
        Epic Link: HUDI-512
    Fix Version/s: 1.0.0

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>             Fix For: 1.0.0
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up with entries for only a subset of the files that belong to the table. An example follows.

The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
{noformat}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{noformat}

  was:
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up with entries for only a subset of the files that belong to the table. An example follows.

The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up with entries for only a subset of the files that belong to the table. An example follows.

The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}

  was:
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}

  was:
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = '$tableType',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key,
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = '$tableType',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)")
{code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('$tableName') where type = 3").show(false)
{code}
As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
{code:java}
+------------------------------------------------+----+-------------------+
|key                                             |type|ColumnStatsMetadata|
+------------------------------------------------+----+-------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+-------------------+
{code}

  was:
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)
```
spark.sql(s"""create table test_table (id int, name string, ts long, price int) using hudi
     | options (
     |  primaryKey ='id',
     |  type = '$tableType',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
   """.stripMargin)
spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 20, 1000)")
```
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files).
```
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
```
As seen below, col-stats has only one entry for one of the files (and is