[jira] [Updated] (SPARK-43112) Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables

Asif (Jira) Wed, 12 Apr 2023 15:13:06 -0700


     [ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Asif updated SPARK-43112:
-------------------------
    Description: 
The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation implements 
its output method as:
  // The partition column should always appear after data columns.
  override def output: Seq[AttributeReference] = dataCols ++ partitionCols

However, Spark's data-writing commands, such as InsertIntoHiveDirCommand, 
expect the output of HiveTableRelation to be in the order in which the columns 
are actually defined in the DDL.

As a result, several mismatch scenarios can occur, for example:
1) A data type casting exception is thrown, even though the DataFrame being 
inserted has a schema identical to the one used in the DDL that created the 
table.
              OR
2) The wrong column is used for partitioning, if the data types are the same 
or castable, such as date and long.
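
The positional mismatch described above can be sketched in plain Scala, with 
no Spark dependency. This is only an illustration under stated assumptions: 
Col is a simplified stand-in for AttributeReference, and the table (columns 
id, dt, ts, partitioned by dt) is hypothetical and not taken from the issue. 
It shows how a schema supplied in DDL order lines up, position by position, 
against an output built as dataCols ++ partitionCols.

```scala
// Simplified stand-in for a column; this is NOT Spark's AttributeReference.
final case class Col(name: String, dataType: String)

object OutputOrderSketch {
  // Hypothetical table: columns in the order the DDL declares them,
  // with "dt" (declared second) as the partition column.
  val ddlOrder: Seq[Col] =
    Seq(Col("id", "long"), Col("dt", "date"), Col("ts", "long"))
  val partitionNames: Set[String] = Set("dt")

  val dataCols: Seq[Col] = ddlOrder.filterNot(c => partitionNames(c.name))
  val partitionCols: Seq[Col] = ddlOrder.filter(c => partitionNames(c.name))

  // Same shape as HiveTableRelation.output: partition columns always last.
  val output: Seq[Col] = dataCols ++ partitionCols

  // Pair a DDL-ordered incoming schema with `output` purely by position
  // and keep only the pairs whose column names differ.
  val mismatches: Seq[(Col, Col)] =
    ddlOrder.zip(output).filter { case (incoming, target) =>
      incoming.name != target.name
    }

  def main(args: Array[String]): Unit = {
    println("DDL order:    " + ddlOrder.map(_.name).mkString(", "))
    println("output order: " + output.map(_.name).mkString(", "))
    mismatches.foreach { case (incoming, target) =>
      println(s"${incoming.name} (${incoming.dataType}) lines up with " +
        s"${target.name} (${target.dataType})")
    }
  }
}
```

In this sketch the incoming long column ts lands in the position of the date 
partition column dt, which corresponds to scenario 2; if the two types were 
not castable, the same pairing would instead correspond to scenario 1.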

I will be creating a PR with a test that reproduces the bug.

  was:
The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation has its 
output method implemented as 
  // The partition column should always appear after data columns.
  override def output: Seq[AttributeReference] = dataCols ++ partitionCols

But the DataWriting commands of spark like InsertIntoHiveDirCommand, expect 
that the out from HiveTableRelation is in the order in which the columns are 
actually defined in the DDL.

As a result, multiple mistmatch scenarios can happen like:
1) data type casting exception being thrown , even though the data frame being 
inserted has schema which is identical to what is used for creating ddl.
              OR
2) Wrong column being used for partitioning , if the datatypes are same or 
castable, like datetype and long

will be creating a PR with the bug test


> Spark may use a column other than the actual specified partitioning column 
> for partitioning, for Hive format tables
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-43112
>                 URL: https://issues.apache.org/jira/browse/SPARK-43112
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: Asif
>            Priority: Critical



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

