[
https://issues.apache.org/jira/browse/SPARK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andrew Otto updated SPARK-23890:
Description:
As part of SPARK-14118, Spark SQL removed support for sending ALTER TABLE
CHANGE COLUMN commands to Hive. This restriction was loosened in
[https://github.com/apache/spark/pull/12714] to allow those commands when
they only change the column comment.
Wikimedia has been evolving Parquet-backed Hive tables, with data originally
from JSON events, by adding newly found columns to the Hive table schema via a
Spark job we call 'Refine'. We do this by recursively merging an input
DataFrame schema with the Hive table DataFrame schema, finding new fields, and
then issuing an ALTER TABLE statement to add the columns. However, because we
allow nested data types in the incoming JSON data, we make extensive use of
struct type fields. To add newly detected fields in a nested data type, we
must alter the struct column and append the nested struct field. This requires
a CHANGE COLUMN statement that alters the column type. In reality, the 'type'
of the column is not changing; a new field is just being added to the struct,
but to SQL this looks like a type change.
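To illustrate the merge step, here is a minimal sketch of the recursive schema merge described above. The {{DType}}/{{Struct}} model is a simplified, hypothetical stand-in for Spark's {{StructType}}/{{StructField}}, not the actual Refine implementation:

```scala
// Simplified stand-in for Spark's StructType/StructField (hypothetical model).
sealed trait DType
case object Atomic extends DType // stand-in for string/long/etc.
case class Struct(fields: Map[String, DType]) extends DType

// Merge an incoming schema into a table schema, appending newly found fields.
// Shared struct fields are merged recursively; existing non-struct fields
// keep the table's type.
def merge(table: DType, incoming: DType): DType = (table, incoming) match {
  case (Struct(t), Struct(i)) =>
    Struct(t ++ i.map { case (name, dt) =>
      name -> (t.get(name) match {
        case Some(existing) => merge(existing, dt) // recurse into shared fields
        case None           => dt                  // new field: append as-is
      })
    })
  case _ => table // non-struct types are left unchanged
}

val table    = Struct(Map("nested1" -> Struct(Map("field1" -> Atomic))))
val incoming = Struct(Map("nested1" -> Struct(Map("field1" -> Atomic,
                                                  "new_field1" -> Atomic))))
val merged   = merge(table, incoming)
```

Any new field found this way inside {{nested1}} forces a CHANGE COLUMN on the whole struct column, which is what trips the type-change check.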
-We were about to upgrade to Spark 2, but this new restriction on the SQL DDL
that can be sent to Hive will block us. I believe this is fixable by adding an
exception in
[command/ddl.scala|https://github.com/apache/spark/blob/v2.3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/ddl.scala#L294-L325]
to allow ALTER TABLE CHANGE COLUMN with a new type, if the original and
destination types are both struct types and the destination type only adds new
fields.-
In this [PR|https://github.com/apache/spark/pull/21012], I was told that the
Spark 3 datasource v2 API would support this.
However, it is clear that it does not. There is an [explicit
check|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L1441]
and a
[test|https://github.com/apache/spark/blob/e3f46ed57dc063566cdb9425b4d5e02c65332df1/sql/core/src/test/scala/org/apache/spark/sql/connector/AlterTableTests.scala#L583]
that prevent this from happening.
This can be done via {{ALTER TABLE ADD COLUMN nested1.new_field1}}, but this
is not supported for any datasource v1 sources.
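To make the two DDL shapes concrete, here is a hedged sketch of the statements involved; the table name ({{events}}) and field types are hypothetical:

```scala
// The statement Refine would need to issue against Hive: the struct column's
// type only gains a field, but to SQL this reads as a type change and is
// rejected by the CHANGE COLUMN type check.
val changeColumn =
  """ALTER TABLE events
    |CHANGE COLUMN nested1 nested1 STRUCT<field1: STRING, new_field1: STRING>""".stripMargin

// The datasource v2 alternative (Spark's syntax uses ADD COLUMNS with a
// dotted field path), which v1 sources such as Hive tables do not support.
val addNested =
  "ALTER TABLE events ADD COLUMNS (nested1.new_field1 STRING)"
```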
> Support DDL for adding nested columns to struct types
> -
>
> Key: SPARK-23890
> URL: https://issues.apache.org/jira/browse/SPARK-23890
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.0, 3.0.0
> Reporter: Andrew Otto
> Priority: