[jira] [Resolved] (HUDI-7474) Functional index creation fails for an existing table as reported by community user
[ https://issues.apache.org/jira/browse/HUDI-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat resolved HUDI-7474.
-----------------------------------

> Functional index creation fails for an existing table as reported by
> community user
> ---------------------------------------------------------------------
>
>                 Key: HUDI-7474
>                 URL: https://issues.apache.org/jira/browse/HUDI-7474
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Vinaykumar Bhat
>            Assignee: Vinaykumar Bhat
>            Priority: Major
>
> Investigate issue reported with functional index here -
> https://github.com/apache/hudi/issues/10110

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues
[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835624#comment-17835624 ]

Vinaykumar Bhat edited comment on HUDI-7580 at 4/10/24 7:29 AM:
----------------------------------------------------------------

I think the problem is that spark [rewrites the schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet] (in the catalog) for partitioned tables - the partitioning column is moved to the end of the schema. But {{[InsertIntoHoodieTableCommand::run(...)|https://github.com/apache/hudi/blob/984a248de4c783fb0d3728dff28f472fe863c9f2/hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/InsertIntoHoodieTableCommand.scala#L57]}} has no special logic to understand this - it reads the schema from the catalog and maps the array of {{GenericInternalRow}} to the read schema.

was (Author: JIRAUSER303569):
I think the problem is that spark [rewrites the schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet] (in the catalog) for partitioned tables - the partitioning column is moved to the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no special logic to understand this - it reads the schema from the catalog and maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some
> other issue with functional-index. It seems that the column ordering gets
> messed up while inserting records into a hudi table. Hence, a subsequent
> query returns wrong results. An example follows:
>
> The following is a scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>            """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>
> The query returns the following result (note how *price* and *ts* columns are
> mixed up).
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
> {code}
>
> Having the partition column as the last column in the schema does not cause
> this problem. If the mixed-up columns are of incompatible datatypes, then the
> insert fails with an error.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
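The positional-mapping failure described in the comment above can be illustrated with a small language-agnostic sketch (Python here; the schema and row values mirror the bug report, everything else is illustrative, not Hudi code). Rows are produced in the user's declared column order, but the catalog schema has the partition column moved to the end; mapping values by position against the catalog schema swaps `price` and `ts`.

```python
# Sketch (not Hudi internals): how positional mapping against a reordered
# catalog schema misassigns values.

declared_schema = ["id", "name", "price", "ts"]   # order in CREATE TABLE
catalog_schema  = ["id", "name", "ts", "price"]   # partition col 'price' moved to the end

# A row produced in declared order: insert (1, 'a1', 10, 1000)
row = [1, "a1", 10, 1000]

# Buggy positional mapping: assumes row order == catalog order
buggy = dict(zip(catalog_schema, row))
print(buggy)    # price and ts end up swapped

# Correct mapping: align by column *name* before applying the catalog schema
correct = {col: row[declared_schema.index(col)] for col in catalog_schema}
print(correct)  # price=10, ts=1000 as inserted
```

This matches the symptom in the report: values survive intact but land in the wrong columns, and the insert only fails outright when the swapped columns have incompatible datatypes.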
[jira] [Comment Edited] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues
[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835624#comment-17835624 ]

Vinaykumar Bhat edited comment on HUDI-7580 at 4/10/24 7:27 AM:
----------------------------------------------------------------

I think the problem is that spark [rewrites the schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet] (in the catalog) for partitioned tables - the partitioning column is moved to the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no special logic to understand this - it reads the schema from the catalog and maps the array of {{GenericInternalRow}} to the read schema.

was (Author: JIRAUSER303569):
I think the problem is that spark [rewrites the schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet] (in the catalog) for partitioned tables - the partitioning column is moved to the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no special logic to understand this - it reads the schema from the catalog and maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some
> other issue with functional-index. It seems that the column ordering gets
> messed up while inserting records into a hudi table. Hence, a subsequent
> query returns wrong results. An example follows:
>
> The following is a scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>            """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>
> The query returns the following result (note how *price* and *ts* columns are
> mixed up).
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
> {code}
>
> Having the partition column as the last column in the schema does not cause
> this problem. If the mixed-up columns are of incompatible datatypes, then the
> insert fails with an error.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues
[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835624#comment-17835624 ]

Vinaykumar Bhat commented on HUDI-7580:
---------------------------------------

I think the problem is that spark [rewrites the schema|https://stackoverflow.com/questions/50962934/partition-column-is-moved-to-end-of-row-when-saving-a-file-to-parquet] (in the catalog) for partitioned tables - the partitioning column is moved to the end of the schema. But {{InsertIntoHoodieTableCommand::run(...)}} has no special logic to understand this - it reads the schema from the catalog and maps the array of {{GenericInternalRow}} to the read schema.

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some
> other issue with functional-index. It seems that the column ordering gets
> messed up while inserting records into a hudi table. Hence, a subsequent
> query returns wrong results. An example follows:
>
> The following is a scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            | options (
>            |  primaryKey ='id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            | )
>            | partitioned by(price)
>            | location '$basePath'
>            """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>
> The query returns the following result (note how *price* and *ts* columns are
> mixed up).
> {code:java}
> +---+----+-----+----+
> |id |name|price|ts  |
> +---+----+-----+----+
> |3  |a3  |20   |1000|
> |2  |a2  |20   |100 |
> |1  |a1  |1000 |10  |
> +---+----+-----+----+
> {code}
>
> Having the partition column as the last column in the schema does not cause
> this problem. If the mixed-up columns are of incompatible datatypes, then the
> insert fails with an error.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
[ https://issues.apache.org/jira/browse/HUDI-7582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7582:
----------------------------------
    Labels: hudi-1.0.0-beta2  (was: )

> Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
> -----------------------------------------------------------------
>
>                 Key: HUDI-7582
>                 URL: https://issues.apache.org/jira/browse/HUDI-7582
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>
> lookupCandidateFilesInMetadataTable(...) calls
> FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string
> for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame()
> tries to look up and dereference the index definition using this empty string.
>
> This part of the code should never have worked - hence it looks like the
> functional index (based on col-stats) is not tested on the query path. Trying
> to get the index-partition to use on the query side seems more involved - the
> incoming query predicates need to be parsed to get the (column-name,
> function-name) pair for each predicate, and then the corresponding
> index-partition fetched by walking through the index-defs maintained in the
> index-metadata.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Labels: hudi-1.0.0-beta2  (was: )

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
> Creating a functional index on an existing table fails to process all files
> and partitions of the table. The col-stats MDT partition ends up having an
> entry only for a subset of the files that belong to the table. An example follows.
>
> The following create-table and inserts should create a table with 3
> partitions (with each partition having one slice):
> {code:java}
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stats
> partition in the MDT should have three entries (representing column-level
> stats for 3 files). However, col-stats has only a single entry (for one of
> the files).
>
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry for one of the files (and is
> missing statistics for the two other files):
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
>
> {noformat}
> +------------------------------------------------+----+--------------------+
> |key                                             |type|ColumnStatsMetadata |
> +------------------------------------------------+----+--------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+--------------------+
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
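The expected behaviour the ticket describes - one col-stats entry per (file, column) across every partition - can be sketched as follows (Python, with hypothetical file names and a deliberately simplified stats shape; this is not Hudi's index-building code):

```python
# Sketch (illustrative only): building column-stats index entries must cover
# every base file in every partition, so the three-partition table above
# should yield three entries for column 'ts'.

partitions = {            # partition path -> base files (hypothetical names)
    "price=10":   ["f1.parquet"],
    "price=100":  ["f2.parquet"],
    "price=1000": ["f3.parquet"],
}
file_column_values = {    # per-file values of the indexed column 'ts'
    "f1.parquet": [1000],
    "f2.parquet": [20],
    "f3.parquet": [20],
}

def build_col_stats(partitions, values, column="ts"):
    entries = []
    for partition, files in partitions.items():  # must walk *all* partitions
        for f in files:                          # and *all* files in each
            vals = values[f]
            entries.append({"file": f, "column": column,
                            "min": min(vals), "max": max(vals),
                            "value_count": len(vals)})
    return entries

stats = build_col_stats(partitions, file_column_values)
print(len(stats))  # expected 3; the bug report observed only 1 entry in the MDT
```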
[jira] [Created] (HUDI-7582) Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
Vinaykumar Bhat created HUDI-7582:
-------------------------------------

             Summary: Fix NPE in FunctionalIndexSupport::loadFunctionalIndexDataFrame()
                 Key: HUDI-7582
                 URL: https://issues.apache.org/jira/browse/HUDI-7582
             Project: Apache Hudi
          Issue Type: Bug
          Components: index
            Reporter: Vinaykumar Bhat

lookupCandidateFilesInMetadataTable(...) calls FunctionalIndexSupport::loadFunctionalIndexDataFrame() with an empty string for indexPartition, which results in an NPE as loadFunctionalIndexDataFrame() tries to look up and dereference the index definition using this empty string.

This part of the code should never have worked - hence it looks like the functional index (based on col-stats) is not tested on the query path. Trying to get the index-partition to use on the query side seems more involved - the incoming query predicates need to be parsed to get the (column-name, function-name) pair for each predicate, and then the corresponding index-partition fetched by walking through the index-defs maintained in the index-metadata.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
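The query-side resolution the ticket calls for can be sketched roughly as below (Python; the index-definition shape, partition names, and helper are all hypothetical, not the Hudi API). The point is that the index partition must be derived from the predicate's (column, function) pair, and that a blind lookup with an empty key finds nothing - dereferencing that result unconditionally is the Java-side NPE in miniature:

```python
# Sketch (hypothetical names): resolving a functional-index partition from a
# query predicate instead of passing an empty string to the loader.

# Index definitions as maintained in index metadata (illustrative shape).
index_defs = {
    "func_index_idx_datestr": {"column": "ts", "function": "from_unixtime"},
}

def find_index_partition(column, function):
    """Walk the index-defs and return the partition whose definition matches
    the (column, function) pair parsed out of the query predicate."""
    for partition, d in index_defs.items():
        if d["column"] == column and d["function"] == function:
            return partition
    return None  # caller must handle 'no matching index', not dereference blindly

# Query side: a predicate like from_unixtime(ts, ...) = '...' parses to:
print(find_index_partition("ts", "from_unixtime"))

# The reported bug, in miniature: an empty-string key matches no definition.
print(find_index_partition("", ""))
```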
[jira] [Comment Edited] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835348#comment-17835348 ]

Vinaykumar Bhat edited comment on HUDI-7007 at 4/9/24 11:45 AM:
----------------------------------------------------------------

I am not sure if creating functional indexes (using col stats) works correctly. For example, creating a functional index on a table with three existing files fails to process all the files. The col-stats in the MDT is created for only one of the files. HUDI-7579 has more details.

was (Author: JIRAUSER303569):
I am not sure if creating functional indexes (using col stats) works correctly. For example, creating a functional index on a table with three existing files fails to process all the files. The col-stats in the MDT is created for only one of the files. HUDI-7579 has more details.

> Integrate functional index using bloom filter on reader side
> ------------------------------------------------------------
>
>                 Key: HUDI-7007
>                 URL: https://issues.apache.org/jira/browse/HUDI-7007
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
> Currently, one can create a functional index on a column using bloom filters.
> However, only the one created using column stats is supported on the reader
> side (check `FunctionalIndexSupport`). This ticket tracks the support for
> using bloom filters on functional index in the reader path.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17835348#comment-17835348 ]

Vinaykumar Bhat commented on HUDI-7007:
---------------------------------------

I am not sure if creating functional indexes (using col stats) works correctly. For example, creating a functional index on a table with three existing files fails to process all the files. The col-stats in the MDT is created for only one of the files. HUDI-7579 has more details.

> Integrate functional index using bloom filter on reader side
> ------------------------------------------------------------
>
>                 Key: HUDI-7007
>                 URL: https://issues.apache.org/jira/browse/HUDI-7007
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Sagar Sumit
>            Assignee: Vinaykumar Bhat
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 1.0.0
>
> Currently, one can create a functional index on a column using bloom filters.
> However, only the one created using column stats is supported on the reader
> side (check `FunctionalIndexSupport`). This ticket tracks the support for
> using bloom filters on functional index in the reader path.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues
[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7580:
----------------------------------
    Description: 
Came across this behaviour of partitioned tables when trying to debug some other issue with functional-index. It seems that the column ordering gets messed up while inserting records into a hudi table. Hence, a subsequent query returns wrong results. An example follows:

The following is a scala test:
{code:java}
test("Test Create Functional Index") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      val tableType = "cow"
      val tableName = "rides"
      val basePath = s"${tmp.getCanonicalPath}/$tableName"
      spark.sql("set hoodie.metadata.enable=true")
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price int,
           |  ts long
           |) using hudi
           | options (
           |  primaryKey ='id',
           |  type = '$tableType',
           |  preCombineField = 'ts',
           |  hoodie.metadata.record.index.enable = 'true',
           |  hoodie.datasource.write.recordkey.field = 'id'
           | )
           | partitioned by(price)
           | location '$basePath'
           """.stripMargin)
      spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
      spark.sql(s"select id, name, price, ts from $tableName").show(false)
    }
  }
} {code}
The query returns the following result (note how *price* and *ts* columns are mixed up).
{code:java}
+---+----+-----+----+
|id |name|price|ts  |
+---+----+-----+----+
|3  |a3  |20   |1000|
|2  |a2  |20   |100 |
|1  |a1  |1000 |10  |
+---+----+-----+----+
{code}
Having the partition column as the last column in the schema does not cause this problem. If the mixed-up columns are of incompatible datatypes, then the insert fails with an error.

  was:
Came across this behaviour of partitioned tables when trying to debug some other issue with functional-index. It seems that the column ordering gets messed up while inserting records into a hudi table. Hence, a subsequent query returns wrong results. An example follows:

The following is a scala test:
{code:java}
test("Test Create Functional Index") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      val tableType = "cow"
      val tableName = "rides"
      val basePath = s"${tmp.getCanonicalPath}/$tableName"
      spark.sql("set hoodie.metadata.enable=true")
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price int,
           |  ts long
           |) using hudi
           | options (
           |  primaryKey ='id',
           |  type = '$tableType',
           |  preCombineField = 'ts',
           |  hoodie.metadata.record.index.enable = 'true',
           |  hoodie.datasource.write.recordkey.field = 'id'
           | )
           | partitioned by(price)
           | location '$basePath'
           """.stripMargin)
      spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
      spark.sql(s"select id, name, price, ts from $tableName").show(false)
    }
  }
} {code}
The query returns the following result (note how price and ts columns are mixed up).
{code:java}
+---+----+-----+----+
|id |name|price|ts  |
+---+----+-----+----+
|3  |a3  |20   |1000|
|2  |a2  |20   |100 |
|1  |a1  |1000 |10  |
+---+----+-----+----+
{code}
Having the partition column as the last column in the schema does not cause this problem. If the mixed-up columns are of incompatible datatypes, then the insert fails with an error.

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
> Came across this behaviour of partitioned tables when trying to debug some
> other issue with functional-index. It seems that
[jira] [Created] (HUDI-7580) Inserting rows into partitioned table leads to data sanity issues
Vinaykumar Bhat created HUDI-7580:
-------------------------------------

             Summary: Inserting rows into partitioned table leads to data sanity issues
                 Key: HUDI-7580
                 URL: https://issues.apache.org/jira/browse/HUDI-7580
             Project: Apache Hudi
          Issue Type: Bug
    Affects Versions: 1.0.0-beta1
            Reporter: Vinaykumar Bhat

Came across this behaviour of partitioned tables when trying to debug some other issue with functional-index. It seems that the column ordering gets messed up while inserting records into a hudi table. Hence, a subsequent query returns wrong results. An example follows:

The following is a scala test:
{code:java}
test("Test Create Functional Index") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      val tableType = "cow"
      val tableName = "rides"
      val basePath = s"${tmp.getCanonicalPath}/$tableName"
      spark.sql("set hoodie.metadata.enable=true")
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price int,
           |  ts long
           |) using hudi
           | options (
           |  primaryKey ='id',
           |  type = '$tableType',
           |  preCombineField = 'ts',
           |  hoodie.metadata.record.index.enable = 'true',
           |  hoodie.datasource.write.recordkey.field = 'id'
           | )
           | partitioned by(price)
           | location '$basePath'
           """.stripMargin)
      spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 20)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 20)")
      spark.sql(s"select id, name, price, ts from $tableName").show(false)
    }
  }
} {code}
The query returns the following result (note how price and ts columns are mixed up).
{code:java}
+---+----+-----+----+
|id |name|price|ts  |
+---+----+-----+----+
|3  |a3  |20   |1000|
|2  |a2  |20   |100 |
|1  |a1  |1000 |10  |
+---+----+-----+----+
{code}
Having the partition column as the last column in the schema does not cause this problem. If the mixed-up columns are of incompatible datatypes, then the insert fails with an error.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Component/s: index
      Epic Link: HUDI-512
  Fix Version/s: 1.0.0

> Functional index (on col stats) creation fails to process all files/partitions
> ------------------------------------------------------------------------------
>
>                 Key: HUDI-7579
>                 URL: https://issues.apache.org/jira/browse/HUDI-7579
>             Project: Apache Hudi
>          Issue Type: Bug
>          Components: index
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>             Fix For: 1.0.0
>
> Creating a functional index on an existing table fails to process all files
> and partitions of the table. The col-stats MDT partition ends up having an
> entry only for a subset of the files that belong to the table. An example follows.
>
> The following create-table and inserts should create a table with 3
> partitions (with each partition having one slice):
> {code:java}
> spark.sql(
>   s"""
>      |create table test_table(
>      |  id int,
>      |  name string,
>      |  ts long,
>      |  price int
>      |) using hudi
>      | options (
>      |  primaryKey ='id',
>      |  type = 'cow',
>      |  preCombineField = 'ts',
>      |  hoodie.metadata.record.index.enable = 'true',
>      |  hoodie.datasource.write.recordkey.field = 'id'
>      | )
>      | partitioned by(price)
>      | location '$basePath'
>      """.stripMargin)
> spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
> spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
> Now create a functional index (using col stats) on this table. The col-stats
> partition in the MDT should have three entries (representing column-level
> stats for 3 files). However, col-stats has only a single entry (for one of
> the files).
>
> {code:java}
> var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
> spark.sql(createIndexSql)
> spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code}
> As seen below, col-stats has only one entry for one of the files (and is
> missing statistics for the two other files):
> *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
>
> {noformat}
> +------------------------------------------------+----+--------------------+
> |key                                             |type|ColumnStatsMetadata |
> +------------------------------------------------+----+--------------------+
> |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
> +------------------------------------------------+----+--------------------+
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinaykumar Bhat updated HUDI-7579:
----------------------------------
    Description: 
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up having an entry only for a subset of the files that belong to the table. An example follows.

The following create-table and inserts should create a table with 3 partitions (with each partition having one slice):
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
     """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code}
Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col-stats has only a single entry (for one of the files).
{code:java}
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='-MM-dd')"
spark.sql(createIndexSql)
spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code}
As seen below, col-stats has only one entry for one of the files (and is missing statistics for the two other files):
*{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
{noformat}
+------------------------------------------------+----+--------------------+
|key                                             |type|ColumnStatsMetadata |
+------------------------------------------------+----+--------------------+
|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3   |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|
+------------------------------------------------+----+--------------------+
{noformat}

  was:
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up having an entry only for a subset of the files that belong to the table. An example follows.

The following create-table and inserts should create a table with 3 partitions (with each partition having one slice):
{code:java}
spark.sql(
  s"""
     |create table test_table(
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey ='id',
     |  type = 'cow',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
     """.stripMargin)
spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)")
spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)")
spark.sql(s"insert into test_table (id, name, ts, price)
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7579: -- Description: Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up having an entry only for subset of files that belong to the table. An example follows. The following create-table and inserts should create a table with 3 partitions (with each partition having one slice)}}{}}} {code:java} spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = 'cow', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code} Now create a functional index (using col stats) on this table. The col-stat in the MDT should have three entries (representing column level stats for 3 files). However, col stats only has one single entry (for one of the file). 
{code:java} var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code} As seen below, col-stats has only one entry, for one of the files (and is missing statistics for the two other files): *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}* {code:java} +---+----+---+ |key |type|ColumnStatsMetadata | +---+----+---+ |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3 |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}| +---+----+---+ {code} was: The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): {code:java} spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = 'cow', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code} Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7579: -- Description: The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): {code:java} spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = 'cow', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code} Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files). 
{code:java} var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) {code} As seen below, col-stats has only one entry, for one of the files (and is missing statistics for the two other files): *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}* {code:java} +---+----+---+ |key |type|ColumnStatsMetadata | +---+----+---+ |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3 |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}| +---+----+---+ {code} was: The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): {code:java} spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = '$tableType', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code} Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). 
However, col stats has only a single entry (for one of the files). {code:java} var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key,
[jira] [Updated] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
[ https://issues.apache.org/jira/browse/HUDI-7579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7579: -- Description: The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): {code:java} spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = '$tableType', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 20, 1000)"){code} Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files). 
{code:java} var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('$tableName') where type = 3").show(false) {code} As seen below, col-stats has only one entry, for one of the files (and is missing statistics for the two other files): *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}* {code:java} +---+----+---+ |key |type|ColumnStatsMetadata | +---+----+---+ |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3 |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}| +---+----+---+ {code} was: The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): ``` spark.sql(s"""create table test_table (id int, name string, ts long, price int) using hudi | options ( | primaryKey ='id', | type = '$tableType', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 20, 1000)") ``` Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). 
However, col stats has only a single entry (for one of the files). ``` var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) ``` As seen below, col-stats has only one entry, for one of the files (and
[jira] [Created] (HUDI-7579) Functional index (on col stats) creation fails to process all files/partitions
Vinaykumar Bhat created HUDI-7579: - Summary: Functional index (on col stats) creation fails to process all files/partitions Key: HUDI-7579 URL: https://issues.apache.org/jira/browse/HUDI-7579 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat The following create-table and inserts should create a table with 3 partitions (with each partition having one slice): ``` spark.sql(s"""create table test_table (id int, name string, ts long, price int) using hudi | options ( | primaryKey ='id', | type = '$tableType', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into $tableName (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into $tableName (id, name, ts, price) values(2, 'a2', 20, 100)") spark.sql(s"insert into $tableName (id, name, ts, price) values(3, 'a3', 20, 1000)") ``` Now create a functional index (using col stats) on this table. The col-stats partition in the MDT should have three entries (representing column-level stats for 3 files). However, col stats has only a single entry (for one of the files). 
``` var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false) ``` As seen below, col-stats has only one entry, for one of the files (and is missing statistics for the two other files): `\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}` {{+---+----+---+}} {{|key |type|ColumnStatsMetadata |}} {{+---+----+---+}} {{|oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3 |\{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, \{null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}|}} {{+---+----+---+}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7569: -- Sprint: Sprint 2024-03-25 > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record-key > has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record-key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
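The eligibility check that HUDI-7569 calls for can be sketched with a simplified model. This is a hedged illustration only: the names (`InFilter`, `can_prune_with_rli`) are hypothetical, and Hudi's actual implementation lives in Scala inside the file-index/data-skipping path.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InFilter:
    """Simplified stand-in for an `In` predicate from a query plan."""
    attribute: str      # column the predicate references
    values: List[str]   # literal values in the IN list


def can_prune_with_rli(filt: InFilter, record_key_column: str) -> bool:
    # The bug described in the ticket: treating every IN predicate as if it
    # referenced the record key. The fix modeled here compares the predicate's
    # attribute against the record-key column explicitly.
    return filt.attribute == record_key_column


# `select * from trips_table where driver in ('abc', 'xyz')` must NOT drive
# RLI pruning when the record key is a different column (here assumed `uuid`).
assert not can_prune_with_rli(InFilter("driver", ["abc", "xyz"]), "uuid")
assert can_prune_with_rli(InFilter("uuid", ["abc", "xyz"]), "uuid")
```

With the unguarded version, the `driver in (...)` values would be looked up in the record-level index and could spuriously prune files, producing the wrong results the ticket describes.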
[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Story Points: 4 > Fix functional index (on column stats): Handle NPE in > filterQueriesWithRecordKey(...) > - > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE, which is > then silently swallowed by `lookupCandidateFilesInMetadataTable()`, preventing > every other index (like FunctionalIndex, ColStat Index) from being used for > data skipping (i.e., pruning files) -- This message was sent by Atlassian Jira (v8.20.10#820010)
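The hardening implied by HUDI-7559 can be sketched as follows. This is a simplified model with hypothetical names, not Hudi's actual Scala code: the idea is to return "not applicable" for a missing or non-matching attribute instead of raising, so the caller never swallows an NPE and silently disables the other indexes.

```python
from typing import List, Optional, Tuple


def filter_query_with_record_key(attribute: Optional[str],
                                 record_key: str,
                                 values: List[str]) -> Optional[Tuple[str, List[str]]]:
    # Guard the NPE case (no attribute reference at all) and the case where the
    # predicate targets some other column; both mean "RLI is not applicable",
    # which is an expected outcome, not an error to be thrown and swallowed.
    if attribute is None or attribute != record_key:
        return None
    return (attribute, values)


assert filter_query_with_record_key(None, "uuid", ["a"]) is None       # former NPE path
assert filter_query_with_record_key("driver", "uuid", ["a"]) is None   # other column
assert filter_query_with_record_key("uuid", "uuid", ["a"]) == ("uuid", ["a"])
```

Returning an explicit "no match" lets the lookup continue to the functional and column-stats indexes rather than aborting the whole data-skipping pass.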
[jira] [Assigned] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7569: - Story Points: 4 Assignee: Vinaykumar Bhat > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2, pull-request-available > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record-key > has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record-key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7569: -- Epic Link: HUDI-512 > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record-key > has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record-key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7569: -- Fix Version/s: 1.0.0 > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Priority: Major > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record-key > has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record-key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7569) Fix wrong result while using RLI for pruning files
[ https://issues.apache.org/jira/browse/HUDI-7569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7569: -- Labels: hudi-1.0.0-beta2 (was: ) > Fix wrong result while using RLI for pruning files > -- > > Key: HUDI-7569 > URL: https://issues.apache.org/jira/browse/HUDI-7569 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Data skipping (pruning files) for RLI is supported only when the query > predicate has `EqualTo` or `In` expressions/filters on the record-key column. > However, the logic for detecting a valid `In` expression/filter on the record-key > has bugs. It tries to prune files assuming that an `In` expression/filter can > reference only the record-key column, even when the `In` query is based on other > columns. > > For example, a query of the form `select * from trips_table where driver in > ('abc', 'xyz')` has the potential to return wrong results if the record-key > for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7569) Fix wrong result while using RLI for pruning files
Vinaykumar Bhat created HUDI-7569: - Summary: Fix wrong result while using RLI for pruning files Key: HUDI-7569 URL: https://issues.apache.org/jira/browse/HUDI-7569 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat Data skipping (pruning files) for RLI is supported only when the query predicate has `EqualTo` or `In` expressions/filters on the record-key column. However, the logic for detecting a valid `In` expression/filter on the record-key has bugs. It tries to prune files assuming that an `In` expression/filter can reference only the record-key column, even when the `In` query is based on other columns. For example, a query of the form `select * from trips_table where driver in ('abc', 'xyz')` has the potential to return wrong results if the record-key for this table also has values 'abc' or 'xyz' for some rows of the table. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7563) Implement DROP INDEX support
[ https://issues.apache.org/jira/browse/HUDI-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7563: -- Fix Version/s: 1.0.0 > Implement DROP INDEX support > > > Key: HUDI-7563 > URL: https://issues.apache.org/jira/browse/HUDI-7563 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > DROP INDEX is not supported for functional index and secondary index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7563) Implement DROP INDEX support
Vinaykumar Bhat created HUDI-7563: - Summary: Implement DROP INDEX support Key: HUDI-7563 URL: https://issues.apache.org/jira/browse/HUDI-7563 Project: Apache Hudi Issue Type: New Feature Reporter: Vinaykumar Bhat DROP INDEX is not supported for functional index and secondary index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7563) Implement DROP INDEX support
[ https://issues.apache.org/jira/browse/HUDI-7563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7563: -- Labels: hudi-1.0.0-beta2 (was: ) > Implement DROP INDEX support > > > Key: HUDI-7563 > URL: https://issues.apache.org/jira/browse/HUDI-7563 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > > DROP INDEX is not supported for functional index and secondary index -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Description: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE, which is then ignored by `lookupCandidateFilesInMetadataTable()`, preventing every other index (like FunctionalIndex, ColStat Index) from being used for data skipping (i.e., pruning files) (was: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws NPE which is then subsequently `lookupCandidateFilesInMetadataTable()` rendering every other index (like FunctionalIndex, ColStat Index) to not be used for data skipping (i.e pruning files)) > Fix functional index (on column stats): Handle NPE in > filterQueriesWithRecordKey(...) > - > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE, which is > then ignored by `lookupCandidateFilesInMetadataTable()`, preventing > every other index (like FunctionalIndex, ColStat Index) from being used for > data skipping (i.e., pruning files) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7559) Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...)
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Description: `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE, which is then swallowed by `lookupCandidateFilesInMetadataTable()`, preventing every other index (like FunctionalIndex, ColStat Index) from being used for data skipping (i.e., pruning files) Summary: Fix functional index (on column stats): Handle NPE in filterQueriesWithRecordKey(...) (was: Fix issues with functional index (on column stats) based pruning) > Fix functional index (on column stats): Handle NPE in > filterQueriesWithRecordKey(...) > - > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > `RecordLevelIndexSupport::filterQueryWithRecordKey(...)` throws an NPE, which is > then swallowed by `lookupCandidateFilesInMetadataTable()`, preventing every > other index (like FunctionalIndex, ColStat Index) from being used for data > skipping (i.e., pruning files) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Status: In Progress (was: Open) > Fix issues with functional index (on column stats) based pruning > > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Epic Link: HUDI-512 > Fix issues with functional index (on column stats) based pruning > > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7559) Fix issues with functional index (on column stats) based pruning
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7559: -- Fix Version/s: 1.0.0 > Fix issues with functional index (on column stats) based pruning > > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7559) Fix issues with functional index (on column stats) based pruning
Vinaykumar Bhat created HUDI-7559: - Summary: Fix issues with functional index (on column stats) based pruning Key: HUDI-7559 URL: https://issues.apache.org/jira/browse/HUDI-7559 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7559) Fix issues with functional index (on column stats) based pruning
[ https://issues.apache.org/jira/browse/HUDI-7559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7559: - Assignee: Vinaykumar Bhat > Fix issues with functional index (on column stats) based pruning > > > Key: HUDI-7559 > URL: https://issues.apache.org/jira/browse/HUDI-7559 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831645#comment-17831645 ] Vinaykumar Bhat edited comment on HUDI-7007 at 3/28/24 6:46 AM: Seems like `{_}FunctionalIndexSupport::loadFunctionalIndexDataFrame(...){_}` is always called (from `{_}HoodieFileIndex::lookupCandidateFilesInMetadataTable(...){_}`) with an empty `{_}indexPartition{_}` string. So, it is likely that file pruning based on a functional index is not supported. was (Author: JIRAUSER303569): Seems like `FunctionalIndexSupport::loadFunctionalIndexDataFrame(...)` is always called (from `HoodieFileIndex::lookupCandidateFilesInMetadataTable(...)`) with an empty `indexPartition` string. So, it is likely that file pruning based on a functional index is not supported. > Integrate functional index using bloom filter on reader side > > > Key: HUDI-7007 > URL: https://issues.apache.org/jira/browse/HUDI-7007 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Currently, one can create a functional index on a column using bloom filters. > However, only the one created using column stats is supported on the reader > side (check `FunctionalIndexSupport`). This ticket tracks the support for > using bloom filters on functional index in the reader path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831645#comment-17831645 ] Vinaykumar Bhat commented on HUDI-7007: --- Seems like `FunctionalIndexSupport::loadFunctionalIndexDataFrame(...)` is always called (from `HoodieFileIndex::lookupCandidateFilesInMetadataTable(...)`) with an empty `indexPartition` string. So, it is likely that file pruning based on a functional index is not supported. > Integrate functional index using bloom filter on reader side > > > Key: HUDI-7007 > URL: https://issues.apache.org/jira/browse/HUDI-7007 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Currently, one can create a functional index on a column using bloom filters. > However, only the one created using column stats is supported on the reader > side (check `FunctionalIndexSupport`). This ticket tracks the support for > using bloom filters on functional index in the reader path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7007) Integrate functional index using bloom filter on reader side
[ https://issues.apache.org/jira/browse/HUDI-7007?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831243#comment-17831243 ] Vinaykumar Bhat commented on HUDI-7007: --- [~codope] - need some pointers on this. Are there any tests that execute a query resulting in pruning files based on a functional index? I saw `TestFunctionalIndex.scala`, but none of the tests there seem to have such a query. > Integrate functional index using bloom filter on reader side > > > Key: HUDI-7007 > URL: https://issues.apache.org/jira/browse/HUDI-7007 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vinaykumar Bhat >Priority: Major > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Currently, one can create a functional index on a column using bloom filters. > However, only the one created using column stats is supported on the reader > side (check `FunctionalIndexSupport`). This ticket tracks the support for > using bloom filters on functional index in the reader path. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7484) Fix partitioning style when partition is inferred from partitionBy
[ https://issues.apache.org/jira/browse/HUDI-7484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17831167#comment-17831167 ] Vinaykumar Bhat commented on HUDI-7484: --- [~codope] Do you have some pointers or a test case? I want to understand what this is about and how to proceed. > Fix partitioning style when partition is inferred from partitionBy > -- > > Key: HUDI-7484 > URL: https://issues.apache.org/jira/browse/HUDI-7484 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Vinaykumar Bhat >Priority: Major > Fix For: 1.0.0 > > > When inferring the partition from partitionBy() arguments and hive-style > partitioning is enabled, we observe that the partitioning style is not > uniform for multi-level partitions. Directory structure is as follows: > partition=2015 > |- 03 > |- 15 > |- 16 -- This message was sent by Atlassian Jira (v8.20.10#820010)
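For contrast with the mixed layout reported above (`partition=2015/03/15`, where only the first level gets the hive-style `name=value` form), a uniform hive-style multi-level path applies the `name=value` form at every level. A minimal sketch, with hypothetical field names for illustration:

```python
from typing import List


def hive_style_path(fields: List[str], values: List[str]) -> str:
    # Uniform hive-style partitioning: every level is rendered as name=value,
    # not just the first one.
    return "/".join(f"{f}={v}" for f, v in zip(fields, values))


# Every level carries its field name, unlike the reported partition=2015/03/15.
assert hive_style_path(["year", "month", "day"], ["2015", "03", "15"]) \
    == "year=2015/month=03/day=15"
```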
[jira] [Comment Edited] (HUDI-7117) Functional index creation not working when table is created using datasource writer
[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830973#comment-17830973 ] Vinaykumar Bhat edited comment on HUDI-7117 at 3/26/24 2:54 PM: This is likely not an issue, but a gap in understanding the feature. The issue is that {{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} creates a temporary view (similar to the one that is created using {{CREATE TEMPORARY VIEW ...}}) and it is neither a table nor a hudi managed table. Hence the following {{CREATE INDEX ...}} statement to create a functional index fails, as the object on which the index is being created is not a hudi managed table. Instead of creating a temporary view, one can use the {{saveAsTable(...)}} method on the DataFrameWriter object to create a hudi managed table and then create a functional index on that table. An example follows:
{code:java}
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")
val data = Seq(
  (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"));
var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")
{code}
> Functional index creation not working when table is created using datasource > writer > ---
[jira] [Commented] (HUDI-7117) Functional index creation not working when table is created using datasource writer
[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17830973#comment-17830973 ] Vinaykumar Bhat commented on HUDI-7117: --- This is likely not an issue, but a gap in understanding the feature. The issue is that {{spark.read.format("hudi").load(PATH).createOrReplaceTempView(TABLE_NAME)}} creates a temporary view (similar to the one that is created using {{CREATE TEMPORARY VIEW ...}}) and it is neither a table nor a hudi managed table. Hence the following {{CREATE INDEX ...}} statement to create a functional index fails, as the object on which the index is being created is not a hudi managed table. Instead of creating a temporary view, one can use the {{saveAsTable(...)}} method on the DataFrameWriter object to create a hudi managed table and then create a functional index on that table. An example follows:
{code:java}
val columns = Seq("ts", "transaction_id", "rider", "driver", "price", "location")
val data = Seq(
  (1695159649087L, "334e26e9-8355-45cc-97c6-c31daf0df330", "rider-A", "driver-K", 19.10, "san_francisco"),
  (1695091554788L, "e96c4396-3fad-413a-a942-4cb36106d721", "rider-C", "driver-M", 27.70, "san_francisco"),
  (1695046462179L, "9909a8b1-2d15-4d3d-8ec9-efc48c536a00", "rider-D", "driver-L", 33.90, "san_francisco"),
  (1695516137016L, "e3cf430c-889d-4015-bc98-59bdce1e530c", "rider-F", "driver-P", 34.15, "sao_paulo"),
  (169511511L, "c8abbe79-8d89-47ea-b4ce-4d224bae5bfa", "rider-J", "driver-T", 17.85, "chennai"));
var inserts = spark.createDataFrame(data).toDF(columns: _*)
inserts.write.format("hudi").
  option(DataSourceWriteOptions.PARTITIONPATH_FIELD.key(), "location").
  option(HoodieWriteConfig.TABLE_NAME, tableName).
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.recordkey.field", "transaction_id").
  option("hoodie.datasource.write.precombine.field", "ts").
  option("hoodie.datasource.write.table.type", HoodieTableType.COPY_ON_WRITE.name()).
  option("hoodie.table.metadata.enable", "true").
  option("hoodie.parquet.small.file.limit", "0").
  option("path", "/tmp/temp_table_path/").
  mode(SaveMode.Append).
  saveAsTable("temp_table")
spark.catalog.listTables().show(false)
spark.sql(s"select from_unixtime(ts, 'yyyy-MM-dd') as datestr FROM temp_table").show()
spark.sql(s"CREATE INDEX hudi_table_func_index_datestr ON temp_table USING column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')")
{code}
> Functional index creation not working when table is created using datasource > writer > --- > > Key: HUDI-7117 > URL: https://issues.apache.org/jira/browse/HUDI-7117 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Aditya Goenka >Assignee: Vinaykumar Bhat >Priority: Blocker > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Details and Reproducible code under Github Issue - > [https://github.com/apache/hudi/issues/10110] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7117) Functional index creation not working when table is created using datasource writer
[ https://issues.apache.org/jira/browse/HUDI-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7117: -- Status: In Progress (was: Open) > Functional index creation not working when table is created using datasource > writer > --- > > Key: HUDI-7117 > URL: https://issues.apache.org/jira/browse/HUDI-7117 > Project: Apache Hudi > Issue Type: Bug > Components: index >Reporter: Aditya Goenka >Assignee: Vinaykumar Bhat >Priority: Blocker > Labels: hudi-1.0.0-beta2 > Fix For: 1.0.0 > > > Details and Reproducible code under Github Issue - > [https://github.com/apache/hudi/issues/10110] > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
[ https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7480: -- Status: Patch Available (was: In Progress) > initializeFunctionalIndexPartition is called multiple times > --- > > Key: HUDI-7480 > URL: https://issues.apache.org/jira/browse/HUDI-7480 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This is due to a issue in > initializeFromFilesystem(), which tries to check if MDT partition needs to be > initialized based on the absence of partition-type. But for functional index, > partition-type actually store the prefix (func_index_)- hence the check > always fails and we try to reinit the same functional index partition again. > > Simple test: > {quote}spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | options ( > | primaryKey ='id', > | type = '$tableType', > | preCombineField = 'ts', > | hoodie.metadata.record.index.enable = 'true', > | hoodie.datasource.write.recordkey.field = 'id' > | ) > | partitioned by(ts) > | location '$basePath' > """.stripMargin) > spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") > spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") > spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") > > var createIndexSql = s"create index idx_datestr on $tableName using > column_stats(ts) options(func='from_unixtime', format='-MM-dd')" > spark.sql(createIndexSql) > > -- This insert throws null-pointer exception > spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
[ https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reopened HUDI-7480: --- > initializeFunctionalIndexPartition is called multiple times > --- > > Key: HUDI-7480 > URL: https://issues.apache.org/jira/browse/HUDI-7480 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This is due to a issue in > initializeFromFilesystem(), which tries to check if MDT partition needs to be > initialized based on the absence of partition-type. But for functional index, > partition-type actually store the prefix (func_index_)- hence the check > always fails and we try to reinit the same functional index partition again. > > Simple test: > {quote}spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | options ( > | primaryKey ='id', > | type = '$tableType', > | preCombineField = 'ts', > | hoodie.metadata.record.index.enable = 'true', > | hoodie.datasource.write.recordkey.field = 'id' > | ) > | partitioned by(ts) > | location '$basePath' > """.stripMargin) > spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") > spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") > spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") > > var createIndexSql = s"create index idx_datestr on $tableName using > column_stats(ts) options(func='from_unixtime', format='-MM-dd')" > spark.sql(createIndexSql) > > -- This insert throws null-pointer exception > spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
[ https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7480: -- Status: In Progress (was: Reopened) > initializeFunctionalIndexPartition is called multiple times > --- > > Key: HUDI-7480 > URL: https://issues.apache.org/jira/browse/HUDI-7480 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This is due to a issue in > initializeFromFilesystem(), which tries to check if MDT partition needs to be > initialized based on the absence of partition-type. But for functional index, > partition-type actually store the prefix (func_index_)- hence the check > always fails and we try to reinit the same functional index partition again. > > Simple test: > {quote}spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | options ( > | primaryKey ='id', > | type = '$tableType', > | preCombineField = 'ts', > | hoodie.metadata.record.index.enable = 'true', > | hoodie.datasource.write.recordkey.field = 'id' > | ) > | partitioned by(ts) > | location '$basePath' > """.stripMargin) > spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") > spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") > spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") > > var createIndexSql = s"create index idx_datestr on $tableName using > column_stats(ts) options(func='from_unixtime', format='-MM-dd')" > spark.sql(createIndexSql) > > -- This insert throws null-pointer exception > spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
[ https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat resolved HUDI-7480. --- > initializeFunctionalIndexPartition is called multiple times > --- > > Key: HUDI-7480 > URL: https://issues.apache.org/jira/browse/HUDI-7480 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > This is due to a issue in > initializeFromFilesystem(), which tries to check if MDT partition needs to be > initialized based on the absence of partition-type. But for functional index, > partition-type actually store the prefix (func_index_)- hence the check > always fails and we try to reinit the same functional index partition again. > > Simple test: > {quote}spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | options ( > | primaryKey ='id', > | type = '$tableType', > | preCombineField = 'ts', > | hoodie.metadata.record.index.enable = 'true', > | hoodie.datasource.write.recordkey.field = 'id' > | ) > | partitioned by(ts) > | location '$basePath' > """.stripMargin) > spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") > spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") > spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") > > var createIndexSql = s"create index idx_datestr on $tableName using > column_stats(ts) options(func='from_unixtime', format='-MM-dd')" > spark.sql(createIndexSql) > > -- This insert throws null-pointer exception > spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7504) Replace expensive file existence check (in object store) with spark options
[ https://issues.apache.org/jira/browse/HUDI-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17829355#comment-17829355 ] Vinaykumar Bhat commented on HUDI-7504: --- Support for these configs (as part of DataSourceOptions) in Spark was added only in 3.4.0. Will hold on to the PR for now. > Replace expensive file existence check (in object store) with spark options > --- > > Key: HUDI-7504 > URL: https://issues.apache.org/jira/browse/HUDI-7504 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > > The incremental loading from S3 and GCS performs an existence check for an > object. This is expensive. This happens > [here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161] > > This can be replaced with Spark-provided options: > spark.sql.files.ignoreMissingFiles > spark.sql.files.ignoreCorruptFiles > > Ref for these options: > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
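The replacement described in the ticket would amount to passing Spark's generic file-source options on the read, instead of issuing a per-object existence check first. A minimal sketch, assuming Spark 3.4+ (where these options can be set per read) and a hypothetical {{cloudFilePaths}} list - not the actual Hudi change:

{code:java}
// Sketch only: let Spark skip objects that disappeared (or are corrupt)
// between listing and reading; both options default to false.
val df = spark.read
  .format("parquet")
  .option("ignoreMissingFiles", "true")  // per-read form of spark.sql.files.ignoreMissingFiles
  .option("ignoreCorruptFiles", "true")  // per-read form of spark.sql.files.ignoreCorruptFiles
  .load(cloudFilePaths: _*)
{code}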
[jira] [Assigned] (HUDI-7512) Support sorting of input records in insert operation
[ https://issues.apache.org/jira/browse/HUDI-7512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7512: - Assignee: Vinaykumar Bhat > Support sorting of input records in insert operation > > > Key: HUDI-7512 > URL: https://issues.apache.org/jira/browse/HUDI-7512 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7512) Support sorting of input records in insert operation
Vinaykumar Bhat created HUDI-7512: - Summary: Support sorting of input records in insert operation Key: HUDI-7512 URL: https://issues.apache.org/jira/browse/HUDI-7512 Project: Apache Hudi Issue Type: Improvement Reporter: Vinaykumar Bhat -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7504) Replace expensive file existence check (in object store) with spark options
Vinaykumar Bhat created HUDI-7504: - Summary: Replace expensive file existence check (in object store) with spark options Key: HUDI-7504 URL: https://issues.apache.org/jira/browse/HUDI-7504 Project: Apache Hudi Issue Type: Improvement Reporter: Vinaykumar Bhat The incremental loading from S3 and GCS performs an existence check for an object. This is expensive. This happens [here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161] This can be replaced with Spark-provided options: spark.sql.files.ignoreMissingFiles spark.sql.files.ignoreCorruptFiles Ref for these options: [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7504) Replace expensive file existence check (in object store) with spark options
[ https://issues.apache.org/jira/browse/HUDI-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7504: - Assignee: Vinaykumar Bhat > Replace expensive file existence check (in object store) with spark options > --- > > Key: HUDI-7504 > URL: https://issues.apache.org/jira/browse/HUDI-7504 > Project: Apache Hudi > Issue Type: Improvement >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > The incremental loading from S3 and GCS performs an existence check for an > object. This is expensive. This happens > [here|https://github.com/apache/hudi/blob/130498708bb1cd5da1d0e725971b3d721eeef231/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/helpers/CloudObjectsSelectorCommon.java#L161] > > This can be replaced with Spark-provided options: > spark.sql.files.ignoreMissingFiles > spark.sql.files.ignoreCorruptFiles > > Ref for these options: > [https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html#ignore-missing-files] > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
[ https://issues.apache.org/jira/browse/HUDI-7480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7480: - Assignee: Sagar Sumit > initializeFunctionalIndexPartition is called multiple times > --- > > Key: HUDI-7480 > URL: https://issues.apache.org/jira/browse/HUDI-7480 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Sagar Sumit >Priority: Major > > This is due to a issue in > initializeFromFilesystem(), which tries to check if MDT partition needs to be > initialized based on the absence of partition-type. But for functional index, > partition-type actually store the prefix (func_index_)- hence the check > always fails and we try to reinit the same functional index partition again. > > Simple test: > {quote}spark.sql( > s""" > |create table $tableName ( > | id int, > | name string, > | price double, > | ts long > |) using hudi > | options ( > | primaryKey ='id', > | type = '$tableType', > | preCombineField = 'ts', > | hoodie.metadata.record.index.enable = 'true', > | hoodie.datasource.write.recordkey.field = 'id' > | ) > | partitioned by(ts) > | location '$basePath' > """.stripMargin) > spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") > spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") > spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") > > var createIndexSql = s"create index idx_datestr on $tableName using > column_stats(ts) options(func='from_unixtime', format='-MM-dd')" > spark.sql(createIndexSql) > > -- This insert throws null-pointer exception > spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7480) initializeFunctionalIndexPartition is called multiple times
Vinaykumar Bhat created HUDI-7480: - Summary: initializeFunctionalIndexPartition is called multiple times Key: HUDI-7480 URL: https://issues.apache.org/jira/browse/HUDI-7480 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat This is due to an issue in initializeFromFilesystem(), which tries to check whether an MDT partition needs to be initialized based on the absence of the partition-type. But for a functional index, the partition-type actually stores the prefix (func_index_) - hence the check always fails and we try to re-initialize the same functional index partition again. Simple test: {quote}spark.sql( s""" |create table $tableName ( | id int, | name string, | price double, | ts long |) using hudi | options ( | primaryKey ='id', | type = '$tableType', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(ts) | location '$basePath' """.stripMargin) spark.sql(s"insert into $tableName values(1, 'a1', 10, 1000)") spark.sql(s"insert into $tableName values(2, 'a2', 10, 1001)") spark.sql(s"insert into $tableName values(3, 'a3', 10, 1002)") var createIndexSql = s"create index idx_datestr on $tableName using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) -- This insert throws null-pointer exception spark.sql(s"insert into $tableName values(4, 'a4', 10, 1004)"){quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7474) Functional index creation fails for an existing table as reported by community user
[ https://issues.apache.org/jira/browse/HUDI-7474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7474: - Assignee: Vinaykumar Bhat > Functional index creation fails for an existing table as reported by > community user > --- > > Key: HUDI-7474 > URL: https://issues.apache.org/jira/browse/HUDI-7474 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > Investigate issue reported with functional index here - > https://github.com/apache/hudi/issues/10110 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7474) Functional index creation fails for an existing table as reported by community user
Vinaykumar Bhat created HUDI-7474: - Summary: Functional index creation fails for an existing table as reported by community user Key: HUDI-7474 URL: https://issues.apache.org/jira/browse/HUDI-7474 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat Investigate issue reported with functional index here - https://github.com/apache/hudi/issues/10110 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition
[ https://issues.apache.org/jira/browse/HUDI-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7472: -- Affects Version/s: 1.0.0 > Creating a functional index implicitly drops metadata RLI partition > --- > > Key: HUDI-7472 > URL: https://issues.apache.org/jira/browse/HUDI-7472 > Project: Apache Hudi > Issue Type: Bug >Affects Versions: 1.0.0 >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > This is because of a bug in generating write-config for the index creation > which does not set the relevent fields for enabling RLI. The metadata writer > creating code path in `HudiTable` ends up dropping the metadata partitions > for RLI, bloom and col-stats because it assumes the current 'write-config' > has disabled it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (HUDI-7146) Implement secondary index
[ https://issues.apache.org/jira/browse/HUDI-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17822959#comment-17822959 ] Vinaykumar Bhat commented on HUDI-7146: --- Issue found while testing functional-index based configs > Implement secondary index > - > > Key: HUDI-7146 > URL: https://issues.apache.org/jira/browse/HUDI-7146 > Project: Apache Hudi > Issue Type: Task >Reporter: Sagar Sumit >Assignee: Sagar Sumit >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > # Secondary index schema should be flexible enough to accommodate various > kinds of secondary indexes. > # Reuse the existing framework for indexing as much as possible. > # Merge with the existing index config and introduce as few configs as possible. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition
Vinaykumar Bhat created HUDI-7472: - Summary: Creating a functional index implicitly drops metadata RLI partition Key: HUDI-7472 URL: https://issues.apache.org/jira/browse/HUDI-7472 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat This is because of a bug in generating the write-config for index creation, which does not set the relevant fields for enabling RLI. The metadata-writer creation code path in `HudiTable` ends up dropping the metadata partitions for RLI, bloom and col-stats because it assumes the current 'write-config' has disabled them. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7472) Creating a functional index implicitly drops metadata RLI partition
[ https://issues.apache.org/jira/browse/HUDI-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7472: - Assignee: Vinaykumar Bhat > Creating a functional index implicitly drops metadata RLI partition > --- > > Key: HUDI-7472 > URL: https://issues.apache.org/jira/browse/HUDI-7472 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > This is because of a bug in generating write-config for the index creation > which does not set the relevent fields for enabling RLI. The metadata writer > creating code path in `HudiTable` ends up dropping the metadata partitions > for RLI, bloom and col-stats because it assumes the current 'write-config' > has disabled it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (HUDI-7458) Creating multiple functional index fails
[ https://issues.apache.org/jira/browse/HUDI-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat updated HUDI-7458: -- Reviewers: Sagar Sumit > Creating multiple functional index fails > > > Key: HUDI-7458 > URL: https://issues.apache.org/jira/browse/HUDI-7458 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > > Looks like an issue in ` > HoodieSparkFunctionalIndexClient::create(...)` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7458) Creating multiple functional index fails
Vinaykumar Bhat created HUDI-7458: - Summary: Creating multiple functional index fails Key: HUDI-7458 URL: https://issues.apache.org/jira/browse/HUDI-7458 Project: Apache Hudi Issue Type: Bug Reporter: Vinaykumar Bhat Looks like an issue in `HoodieSparkFunctionalIndexClient::create(...)` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7458) Creating multiple functional index fails
[ https://issues.apache.org/jira/browse/HUDI-7458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7458: - Assignee: Vinaykumar Bhat > Creating multiple functional index fails > > > Key: HUDI-7458 > URL: https://issues.apache.org/jira/browse/HUDI-7458 > Project: Apache Hudi > Issue Type: Bug >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > Looks like an issue in ` > HoodieSparkFunctionalIndexClient::create(...)` -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7405) Implement reader path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7405: - Assignee: Vinaykumar Bhat > Implement reader path support for secondary index > - > > Key: HUDI-7405 > URL: https://issues.apache.org/jira/browse/HUDI-7405 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7405) Implement reader path support for secondary index
Vinaykumar Bhat created HUDI-7405: - Summary: Implement reader path support for secondary index Key: HUDI-7405 URL: https://issues.apache.org/jira/browse/HUDI-7405 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinaykumar Bhat -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7384) Implement writer path support for secondary index
Vinaykumar Bhat created HUDI-7384: - Summary: Implement writer path support for secondary index Key: HUDI-7384 URL: https://issues.apache.org/jira/browse/HUDI-7384 Project: Apache Hudi Issue Type: Sub-task Reporter: Vinaykumar Bhat # Basic initialization on an existing table # Handle inserts/upserts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7384) Implement writer path support for secondary index
[ https://issues.apache.org/jira/browse/HUDI-7384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7384: - Assignee: Vinaykumar Bhat > Implement writer path support for secondary index > - > > Key: HUDI-7384 > URL: https://issues.apache.org/jira/browse/HUDI-7384 > Project: Apache Hudi > Issue Type: Sub-task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > # Basic initialization on an existing table > # Handle inserts/upserts -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Assigned] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF
[ https://issues.apache.org/jira/browse/HUDI-7301?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7301: - Assignee: Vinaykumar Bhat > Update hudi docs/websites with documentation for the new spark TVF > -- > > Key: HUDI-7301 > URL: https://issues.apache.org/jira/browse/HUDI-7301 > Project: Apache Hudi > Issue Type: Task >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > > Hudi documentation and website need to be updated to reflect the support for > the new spark-sql table-valued functions -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7301) Update hudi docs/websites with documentation for the new spark TVF
Vinaykumar Bhat created HUDI-7301: - Summary: Update hudi docs/websites with documentation for the new spark TVF Key: HUDI-7301 URL: https://issues.apache.org/jira/browse/HUDI-7301 Project: Apache Hudi Issue Type: Task Reporter: Vinaykumar Bhat Hudi documentation and website need to be updated to reflect the support for the new spark-sql table-valued functions -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7294) Add TVF to query hudi metadata
Vinaykumar Bhat created HUDI-7294: - Summary: Add TVF to query hudi metadata Key: HUDI-7294 URL: https://issues.apache.org/jira/browse/HUDI-7294 Project: Apache Hudi Issue Type: New Feature Reporter: Vinaykumar Bhat Assignee: Vinaykumar Bhat Having a table valued function to query hudi metadata for a given table through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Closed] (HUDI-7243) Add TVF to query hudi timeline through spark-sql
[ https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat closed HUDI-7243. - Resolution: Fixed > Add TVF to query hudi timeline through spark-sql > > > Key: HUDI-7243 > URL: https://issues.apache.org/jira/browse/HUDI-7243 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Having a table valued function to query hudi timeline for a given table > through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HUDI-7243) Add TVF to query hudi timeline through spark-sql
[ https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat resolved HUDI-7243. --- > Add TVF to query hudi timeline through spark-sql > > > Key: HUDI-7243 > URL: https://issues.apache.org/jira/browse/HUDI-7243 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Labels: pull-request-available > Fix For: 1.0.0 > > > Having a table valued function to query hudi timeline for a given table > through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7261) Add TVF to query hudi file system view through spark-sql
Vinaykumar Bhat created HUDI-7261: - Summary: Add TVF to query hudi file system view through spark-sql Key: HUDI-7261 URL: https://issues.apache.org/jira/browse/HUDI-7261 Project: Apache Hudi Issue Type: New Feature Reporter: Vinaykumar Bhat Assignee: Vinaykumar Bhat Having a table valued function to query hudi table's file system view through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
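As a sketch of the debugging workflow HUDI-7261 describes, a file-system-view query from spark-sql might look like the following. The function name `hudi_filesystem_view`, its arguments, and the column names are illustrative assumptions based only on the issue description, not a confirmed API.

```sql
-- Hypothetical usage of the proposed file-system-view TVF (names are
-- assumptions, not the actual Hudi API): inspect file groups and log-file
-- counts for one partition of a Hudi table to debug compaction behaviour.
SELECT file_id, base_file_path, num_log_files
FROM hudi_filesystem_view('my_db.my_hudi_table', 'partition_col=2024')
LIMIT 10;
```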
[jira] [Assigned] (HUDI-7243) Add TVF to query hudi timeline through spark-sql
[ https://issues.apache.org/jira/browse/HUDI-7243?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinaykumar Bhat reassigned HUDI-7243: - Assignee: Vinaykumar Bhat > Add TVF to query hudi timeline through spark-sql > > > Key: HUDI-7243 > URL: https://issues.apache.org/jira/browse/HUDI-7243 > Project: Apache Hudi > Issue Type: New Feature >Reporter: Vinaykumar Bhat >Assignee: Vinaykumar Bhat >Priority: Major > Fix For: 1.0.0 > > > Having a table valued function to query hudi timeline for a given table > through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HUDI-7243) Add TVF to query hudi timeline through spark-sql
Vinaykumar Bhat created HUDI-7243: - Summary: Add TVF to query hudi timeline through spark-sql Key: HUDI-7243 URL: https://issues.apache.org/jira/browse/HUDI-7243 Project: Apache Hudi Issue Type: New Feature Reporter: Vinaykumar Bhat Fix For: 1.0.0 Having a table valued function to query hudi timeline for a given table through spark-sql will help in debugging -- This message was sent by Atlassian Jira (v8.20.10#820010)
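A debugging session with the timeline TVF from HUDI-7243 might look like the sketch below. The function name `hudi_query_timeline` and the projected columns are assumptions drawn from the issue summary; the shipped signature may differ.

```sql
-- Hypothetical: list the most recent actions on a table's active timeline
-- (function and column names are illustrative assumptions only).
SELECT timestamp, action, state
FROM hudi_query_timeline('my_db.my_hudi_table')
ORDER BY timestamp DESC
LIMIT 5;
```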