[ 
https://issues.apache.org/jira/browse/HUDI-7580?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinaykumar Bhat updated HUDI-7580:
----------------------------------
    Description: 
I came across this behaviour of partitioned tables while debugging another 
issue with functional indexes. The column ordering gets mixed up when records 
are inserted into a Hudi table whose partition column is not the last column 
in the schema, so subsequent queries return wrong results. An example follows:

 

The following is a Scala test that reproduces the issue:
{code:java}
  test("Test Create Functional Index") {
    if (HoodieSparkUtils.gteqSpark3_2) {
      withTempDir { tmp =>
        val tableType = "cow"
        val tableName = "rides"
        val basePath = s"${tmp.getCanonicalPath}/$tableName"
        spark.sql("set hoodie.metadata.enable=true")
        // Note: the partition column 'price' is not the last column in the schema.
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey = 'id',
             |  type = '$tableType',
             |  preCombineField = 'ts',
             |  hoodie.metadata.record.index.enable = 'true',
             |  hoodie.datasource.write.recordkey.field = 'id'
             | )
             | partitioned by(price)
             | location '$basePath'
             """.stripMargin)
        spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")

        spark.sql(s"select id, name, price, ts from $tableName").show(false)
      }
    }
  } {code}
 

The query returns the following result (note how the *price* and *ts* column 
values are swapped).
{code:java}
+---+----+----------+----+
|id |name|price     |ts  |
+---+----+----------+----+
|3  |a3  |2000000000|1000|
|2  |a2  |200000    |100 |
|1  |a1  |1000      |10  |
+---+----+----------+----+
 {code}
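For reference, the expected result, given the values supplied to the insert 
statements above (keeping the same row order as the actual output for easy 
comparison), would be:
{code:java}
+---+----+-----+----------+
|id |name|price|ts        |
+---+----+-----+----------+
|3  |a3  |1000 |2000000000|
|2  |a2  |100  |200000    |
|1  |a1  |10   |1000      |
+---+----+-----+----------+
 {code}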
 

Declaring the partition column as the last column in the schema avoids the 
problem. If the mixed-up columns have incompatible datatypes, the insert fails 
with an error instead.
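
For comparison, here is a minimal sketch of the variant that does not exhibit 
the problem, reusing the same $tableName, $tableType and $basePath bindings 
from the test above; the only change is that the partition column *price* is 
declared last:
{code:java}
// Same table definition as in the test, but with the partition column 'price'
// declared as the last column in the schema; with this ordering the query
// returns correct results.
spark.sql(
  s"""
     |create table $tableName (
     |  id int,
     |  name string,
     |  ts long,
     |  price int
     |) using hudi
     | options (
     |  primaryKey = 'id',
     |  type = '$tableType',
     |  preCombineField = 'ts',
     |  hoodie.metadata.record.index.enable = 'true',
     |  hoodie.datasource.write.recordkey.field = 'id'
     | )
     | partitioned by(price)
     | location '$basePath'
     """.stripMargin)
{code}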



> Inserting rows into partitioned table leads to data sanity issues
> -----------------------------------------------------------------
>
>                 Key: HUDI-7580
>                 URL: https://issues.apache.org/jira/browse/HUDI-7580
>             Project: Apache Hudi
>          Issue Type: Bug
>    Affects Versions: 1.0.0-beta1
>            Reporter: Vinaykumar Bhat
>            Priority: Major


