[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinaykumar Bhat updated HUDI-7580:
----------------------------------
Description:
Came across this behaviour of partitioned tables while trying to debug another issue with functional indexes. The column ordering gets mixed up when inserting records into a Hudi table, so a subsequent query returns wrong results. An example follows.

The following is a Scala test:
{code:java}
test("Test Create Functional Index") {
  if (HoodieSparkUtils.gteqSpark3_2) {
    withTempDir { tmp =>
      val tableType = "cow"
      val tableName = "rides"
      val basePath = s"${tmp.getCanonicalPath}/$tableName"
      spark.sql("set hoodie.metadata.enable=true")
      spark.sql(
        s"""
           |create table $tableName (
           |  id int,
           |  name string,
           |  price int,
           |  ts long
           |) using hudi
           | options (
           |  primaryKey = 'id',
           |  type = '$tableType',
           |  preCombineField = 'ts',
           |  hoodie.metadata.record.index.enable = 'true',
           |  hoodie.datasource.write.recordkey.field = 'id'
           | )
           | partitioned by (price)
           | location '$basePath'
       """.stripMargin)
      spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
      spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")
      spark.sql(s"select id, name, price, ts from $tableName").show(false)
    }
  }
}
{code}
The query returns the following result (note how the *price* and *ts* columns are swapped):
{code:java}
+---+----+----------+----+
|id |name|price     |ts  |
+---+----+----------+----+
|3  |a3  |2000000000|1000|
|2  |a2  |200000    |100 |
|1  |a1  |1000      |10  |
+---+----+----------+----+
{code}
Having the partition column as the last column in the schema does not cause this problem (see the sketch below). If the mixed-up columns have incompatible datatypes, the insert fails with an error instead of silently writing wrong data.
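For illustration, here is a minimal sketch of the workaround mentioned above. It reuses the tableName, tableType and basePath values from the test; the _v2 suffixes are hypothetical, added only to avoid colliding with the original table. The only change is that the partition column (price) is declared last in the schema, the layout the report says behaves correctly:
{code:java}
// Minimal sketch (assumed _v2 names): same table as in the test above, but
// with the partition column 'price' declared last in the schema. Per the
// report, this column ordering round-trips inserts correctly.
spark.sql(
  s"""
     |create table ${tableName}_v2 (
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey = 'id',
     |  type = '$tableType',
     |  preCombineField = 'ts'
     | )
     | partitioned by (price)
     | location '${basePath}_v2'
 """.stripMargin)
spark.sql(s"insert into ${tableName}_v2 (id, name, price, ts) values(1, 'a1', 10, 1000)")
spark.sql(s"select id, name, price, ts from ${tableName}_v2").show(false)
{code}
With this layout the query is expected to return (1, a1, 10, 1000), i.e. the values as inserted.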
> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)