[ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sagar Sumit updated HUDI-7580:
------------------------------
    Labels: hudi-1.0.0-beta2  (was: )

> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1, 0.14.1
>            Reporter: Vinaykumar Bhat
>            Assignee: Sagar Sumit
>            Priority: Major
>              Labels: hudi-1.0.0-beta2
>             Fix For: 0.15.0, 1.0.0
>
>   Original Estimate: 4m
>  Remaining Estimate: 4m
>
> Came across this behaviour of partitioned tables while debugging an unrelated 
> functional-index issue. The column ordering gets mixed up when inserting 
> records into a partitioned Hudi table, so a subsequent query returns wrong 
> results. An example follows:
>  
> The following is a scala test:
> {code:java}
>   test("Test Create Functional Index") {
>     if (HoodieSparkUtils.gteqSpark3_2) {
>       withTempDir { tmp =>
>         val tableType = "cow"
>         val tableName = "rides"
>         val basePath = s"${tmp.getCanonicalPath}/$tableName"
>         spark.sql("set hoodie.metadata.enable=true")
>         spark.sql(
>           s"""
>              |create table $tableName (
>              |  id int,
>              |  name string,
>              |  price int,
>              |  ts long
>              |) using hudi
>              | options (
>              |  primaryKey = 'id',
>              |  type = '$tableType',
>              |  preCombineField = 'ts',
>              |  hoodie.metadata.record.index.enable = 'true',
>              |  hoodie.datasource.write.recordkey.field = 'id'
>              | )
>              | partitioned by (price)
>              | location '$basePath'
>        """.stripMargin)
>         spark.sql(s"insert into $tableName (id, name, price, ts) values (1, 'a1', 10, 1000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values (2, 'a2', 100, 200000)")
>         spark.sql(s"insert into $tableName (id, name, price, ts) values (3, 'a3', 1000, 2000000000)")
>         spark.sql(s"select id, name, price, ts from $tableName").show(false)
>       }
>     }
>   }
> {code}
>  
> The query returns the following result. Note how the *price* and *ts* columns 
> are swapped: for example, the row inserted with price=10 and ts=1000 comes back 
> with price=1000 and ts=10.
> {code:java}
> +---+----+----------+----+
> |id |name|price     |ts  |
> +---+----+----------+----+
> |3  |a3  |2000000000|1000|
> |2  |a2  |200000    |100 |
> |1  |a1  |1000      |10  |
> +---+----+----------+----+
>  {code}
>  
> Declaring the partition column as the last column in the schema does not cause 
> this problem. If the mixed-up columns have incompatible data types, the insert 
> fails with an error instead of silently writing wrong data.
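> As a workaround until this is fixed, the table in the test above can be 
> declared with the partition column last, which per the observation above avoids 
> the mix-up. A sketch of the reordered DDL (same table and options as the test; 
> the location path is illustrative):
> {code:sql}
> -- Same schema as the failing test, but with the partition column (price)
> -- moved to the end of the declared column list.
> create table rides (
>   id int,
>   name string,
>   ts long,
>   price int
> ) using hudi
> options (
>   primaryKey = 'id',
>   type = 'cow',
>   preCombineField = 'ts',
>   hoodie.metadata.record.index.enable = 'true',
>   hoodie.datasource.write.recordkey.field = 'id'
> )
> partitioned by (price)
> location '/tmp/rides';
> {code}
> With this ordering, the same three inserts return id, name, price, and ts in 
> the expected columns.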



--
This message was sent by Atlassian Jira
(v8.20.10#820010)