[ https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sagar Sumit updated HUDI-7580:
------------------------------
    Fix Version/s: 0.15.0
                   1.0.0

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Assignee: Sagar Sumit
>            Priority: Major
>             Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables while trying to debug another issue with the functional index. It seems that the column ordering gets mixed up while inserting records into a Hudi table, so a subsequent query returns wrong results. An example follows:
>
> The following is a Scala test:
> {code:java}
> test("Test Create Functional Index") {
>   if (HoodieSparkUtils.gteqSpark3_2) {
>     withTempDir { tmp =>
>       val tableType = "cow"
>       val tableName = "rides"
>       val basePath = s"${tmp.getCanonicalPath}/$tableName"
>       spark.sql("set hoodie.metadata.enable=true")
>       spark.sql(
>         s"""
>            |create table $tableName (
>            |  id int,
>            |  name string,
>            |  price int,
>            |  ts long
>            |) using hudi
>            |options (
>            |  primaryKey = 'id',
>            |  type = '$tableType',
>            |  preCombineField = 'ts',
>            |  hoodie.metadata.record.index.enable = 'true',
>            |  hoodie.datasource.write.recordkey.field = 'id'
>            |)
>            |partitioned by (price)
>            |location '$basePath'
>            """.stripMargin)
>       spark.sql(s"insert into $tableName (id, name, price, ts) values (1, 'a1', 10, 1000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values (2, 'a2', 100, 200000)")
>       spark.sql(s"insert into $tableName (id, name, price, ts) values (3, 'a3', 1000, 2000000000)")
>       spark.sql(s"select id, name, price, ts from $tableName").show(false)
>     }
>   }
> } {code}
>
> The query returns the following result (note how *price* and *ts* columns are mixed up).
> {code:java}
> +---+----+----------+----+
> |id |name|price     |ts  |
> +---+----+----------+----+
> |3  |a3  |2000000000|1000|
> |2  |a2  |200000    |100 |
> |1  |a1  |1000      |10  |
> +---+----+----------+----+
> {code}
>
> Having the partition column as the last column in the schema does not cause this problem. If the mixed-up columns are of incompatible datatypes, the insert fails with an error.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
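As the description notes, declaring the partition column last in the schema sidesteps the bug. A minimal sketch of such a DDL, adapted from the repro above (the location path and the reduced option set here are illustrative, not from the ticket):

{code:sql}
-- Same repro table, but with the partition column (price) declared last,
-- so the physical column order matches Spark's convention of placing
-- partition columns at the end of the schema.
create table rides (
  id int,
  name string,
  ts long,
  price int
) using hudi
options (
  primaryKey = 'id',
  type = 'cow',
  preCombineField = 'ts'
)
partitioned by (price)
location '/tmp/rides';
{code}

With this ordering, the three inserts from the test return price and ts in their correct columns.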