[ https://issues.apache.org/jira/browse/SPARK-43112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17711607#comment-17711607 ]
Asif commented on SPARK-43112:
------------------------------

Opened a WIP PR for [SPARK-43112|https://github.com/apache/spark/pull/40765/], which currently contains only the bug tests.

> Spark may use a column other than the actual specified partitioning column for partitioning, for Hive format tables
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-43112
>                 URL: https://issues.apache.org/jira/browse/SPARK-43112
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.1
>            Reporter: Asif
>            Priority: Critical
>
> The class org.apache.spark.sql.catalyst.catalog.HiveTableRelation implements its output method as:
>
> // The partition column should always appear after data columns.
> override def output: Seq[AttributeReference] = dataCols ++ partitionCols
>
> However, Spark's data-writing commands, such as InsertIntoHiveDirCommand, expect the output of HiveTableRelation to be in the order in which the columns are actually defined in the DDL.
> As a result, several mismatch scenarios can occur:
> 1) A data type cast exception is thrown, even though the DataFrame being inserted has a schema identical to the one used in the CREATE TABLE DDL.
> OR
> 2) The wrong column is used for partitioning, if the data types are identical or castable, e.g. date and long.
>
> Will be creating a PR with the bug test.
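The positional mismatch the report describes can be sketched with a small, self-contained model (this is not Spark code; the column names and types below are invented for illustration, assuming a partition column declared in the middle of the DDL):

```python
# Model of the SPARK-43112 mismatch: HiveTableRelation.output returns
# dataCols ++ partitionCols, while a writer that assumes DDL order maps
# incoming DataFrame columns onto the relation output positionally.

# Hypothetical DDL: CREATE TABLE t (id long, event_date date, value long)
# PARTITIONED BY (event_date) -- partition column is NOT last in the DDL.
ddl_columns = [("id", "long"), ("event_date", "date"), ("value", "long")]
partition_names = {"event_date"}

data_cols = [c for c in ddl_columns if c[0] not in partition_names]
part_cols = [c for c in ddl_columns if c[0] in partition_names]

# HiveTableRelation.output: data columns first, partition columns last.
relation_output = data_cols + part_cols
# -> [("id", "long"), ("value", "long"), ("event_date", "date")]

# A writer assuming DDL order matches DataFrame columns positionally,
# so "event_date" data lands in the "value" slot and vice versa:
mapping = [(src, dst) for (src, _), (dst, _) in zip(ddl_columns, relation_output)]
# -> [("id", "id"), ("event_date", "value"), ("value", "event_date")]
```

If the swapped types are incompatible this surfaces as a cast exception; if they are identical or castable (as with date and long here), the insert silently partitions on the wrong column's data.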